Struct AnnotatedText

Struct Documentation

struct AnnotatedText

AnnotatedText is effectively a std::string text plus an Annotation, with the following additional conveniences:

  1. Access to processed string_views for convenience, rather than ByteRanges (which only provide index information).

  2. Transparently convert string_views into ByteRanges for the Annotation referring to the text bound by this structure.

  3. Bind the text and annotations together, to move around as a meaningful unit.
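The conversion between string_views and ByteRanges mentioned in the list above can be sketched as follows. This is only an illustration, not the library's implementation: ByteRangeSketch, toByteRange and toStringView are hypothetical names, and the sketch assumes a ByteRange is a half-open [begin, end) pair of byte offsets into text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for ByteRange: a half-open [begin, end) byte interval.
struct ByteRangeSketch {
  std::size_t begin;
  std::size_t end;
};

// Convert a view that points into `text` into offsets relative to text.data().
inline ByteRangeSketch toByteRange(const std::string &text, std::string_view view) {
  // The view has to lie inside the buffer owned by `text`.
  assert(view.data() >= text.data());
  assert(view.data() + view.size() <= text.data() + text.size());
  std::size_t begin = static_cast<std::size_t>(view.data() - text.data());
  return {begin, begin + view.size()};
}

// Convert offsets back into a view over `text`.
inline std::string_view toStringView(const std::string &text, ByteRangeSketch range) {
  assert(range.begin <= range.end && range.end <= text.size());
  return std::string_view(text).substr(range.begin, range.end - range.begin);
}
```

Offsets stay meaningful when the string is moved or reallocated, whereas a string_view is only valid while the buffer it points into stays in place. That is one plausible reason to store ranges and hand out views on demand.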

Public Functions

AnnotatedText()

Construct an empty AnnotatedText.

This is useful when the target string or ByteRanges are not known yet, but the public members can be used to populate it. One use case is when translated text is created by decoding from histories, and the ByteRanges are only known after the string has been constructed.

AnnotatedText(std::string &&text)

Construct by moving in a string (for efficiency, a copying string constructor is disallowed).

void appendSentence(string_view prefix, std::vector<string_view>::iterator tokens_begin, std::vector<string_view>::iterator tokens_end)

Appends a sentence to the existing text and transparently rebases string_views.

Since this only tracks the prefix that precedes each sentence, remember to call appendEndingWhitespace for any text that follows the last sentence. The string_views must not already point into text.

void appendEndingWhitespace(string_view whitespace)

Append the whitespace at the end of input.

The string_view must not already point into text.
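A minimal sketch of how appending and rebasing could work is shown here. It assumes token positions are stored as byte offsets into the growing text. TextBuilderSketch and Range are hypothetical names, and the sketch takes a vector of tokens instead of an iterator pair for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

struct Range {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical stand-in that owns the text and one list of token ranges per sentence.
struct TextBuilderSketch {
  std::string text;
  std::vector<std::vector<Range>> sentences;

  // Copy the prefix and then each token into `text`, recording where each token landed.
  void appendSentence(std::string_view prefix, const std::vector<std::string_view> &tokens) {
    text.append(prefix);
    std::vector<Range> ranges;
    for (std::string_view token : tokens) {
      std::size_t begin = text.size();
      text.append(token);
      ranges.push_back({begin, text.size()});
    }
    sentences.push_back(std::move(ranges));
  }

  // Only prefixes are tracked per sentence, so trailing text is appended separately.
  void appendEndingWhitespace(std::string_view whitespace) { text.append(whitespace); }

  // Rebased view: built from stored offsets against the current buffer.
  std::string_view word(std::size_t sentenceIdx, std::size_t wordIdx) const {
    assert(sentenceIdx < sentences.size() && wordIdx < sentences[sentenceIdx].size());
    const Range &r = sentences[sentenceIdx][wordIdx];
    return std::string_view(text).substr(r.begin, r.end - r.begin);
  }
};
```

Appending to a std::string can reallocate its buffer and invalidate any view into it, which is presumably why the incoming string_views are required to point at storage outside text.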

void recordExistingSentence(std::vector<string_view>::iterator tokens_begin, std::vector<string_view>::iterator tokens_end, const char *sentence_begin)

Record the existence of a sentence that is already in text.

The iterators range over the string_views of the sentence's tokens, each of which must already point into text. This function must be called to record sentences in order. Normally the beginning of the sentence could be inferred from tokens_begin->data(), but the token range may be empty, so sentence_begin is required to know where the sentence is.
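The need for sentence_begin can be illustrated with a small sketch. It assumes each sentence is stored as a pair of byte offsets plus a token count. RecorderSketch and SentenceSpan are hypothetical names, and a vector stands in for the iterator pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

struct SentenceSpan {
  std::size_t begin;  // offset of the first byte of the sentence
  std::size_t end;    // offset one past the last byte of the sentence
  std::size_t words;  // number of tokens recorded for the sentence
};

// Hypothetical stand-in: owns the text and records sentences that already live in it.
struct RecorderSketch {
  std::string text;
  std::vector<SentenceSpan> sentences;

  explicit RecorderSketch(std::string &&t) : text(std::move(t)) {}

  void recordExistingSentence(const std::vector<std::string_view> &tokens, const char *sentence_begin) {
    std::size_t begin = static_cast<std::size_t>(sentence_begin - text.data());
    // Sentences are expected in order, so each one starts at or after the previous end.
    assert(sentences.empty() || sentences.back().end <= begin);
    std::size_t end = begin;  // an empty sentence occupies no bytes
    if (!tokens.empty()) {
      std::string_view last = tokens.back();
      end = static_cast<std::size_t>(last.data() + last.size() - text.data());
    }
    sentences.push_back({begin, end, tokens.size()});
  }
};
```

With an empty token list there is no token whose data() could locate the sentence, so the explicit pointer is the only positional information available.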

const size_t numSentences() const

Returns the number of sentences in the annotation structure.

const size_t numWords(size_t sentenceIdx) const

Returns the number of words in the sentence identified by sentenceIdx.

string_view word(size_t sentenceIdx, size_t wordIdx) const

Returns a string_view representing wordIdx in sentenceIdx.

string_view sentence(size_t sentenceIdx) const

Returns a string_view representing the sentence corresponding to sentenceIdx.

string_view gap(size_t sentenceIdx) const

Returns the string_view of the gap between two sentences in the container.

More precisely, where i = sentenceIdx and N = numSentences() for brevity:

  • For i = 0: the gap between the start of text and the 0th sentence.

  • For i = 1...N-1: the gap between the (i-1)-th and i-th sentences.

  • For i = N: the gap between the last ((N-1)-th) sentence and the end of text.

    Parameters

      sentenceIdx: index of the gap, in the range [0, numSentences()].
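The three cases can be folded into one computation, sketched here. The sketch assumes that sentences are stored as ordered, non-overlapping byte ranges, and that gap i is the text immediately before the sentence with 0-based index i, with gap N being whatever follows the last sentence. gapSketch and Span are hypothetical names, not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct Span {
  std::size_t begin;
  std::size_t end;
};

// Gap i starts where sentence i-1 ends (or at offset 0) and stops where
// sentence i begins (or at the end of the text).
inline std::string_view gapSketch(const std::string &text, const std::vector<Span> &sentences, std::size_t i) {
  const std::size_t n = sentences.size();
  assert(i <= n);  // valid gap indices are 0..N inclusive
  std::size_t begin = (i == 0) ? 0 : sentences[i - 1].end;
  std::size_t end = (i == n) ? text.size() : sentences[i].begin;
  return std::string_view(text).substr(begin, end - begin);
}
```

Under these assumptions there are N + 1 gaps for N sentences, and interleaving gap(0), sentence(0), gap(1), and so on up to gap(N) reproduces the whole text.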

ByteRange wordAsByteRange(size_t sentenceIdx, size_t wordIdx) const

Returns a ByteRange representing wordIdx in sentenceIdx.

ByteRange sentenceAsByteRange(size_t sentenceIdx) const

Returns a ByteRange representing the sentence corresponding to sentenceIdx.

template<typename Fun>
AnnotatedText apply(Fun fun) const

Utility function to call fun on each word (subword token effectively) in an AnnotatedText.

fun is called with the word's ByteRange, a string_view of the word, and a bool indicating whether it is the last word in the AnnotatedText, which is also the ending-whitespace slot of the AnnotatedText.
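The calling convention can be sketched as a loop over word slots. What fun returns and how the resulting AnnotatedText is assembled are not described here, so this sketch only shows the callback invocation. forEachWordSketch and Slot are hypothetical names, and the final slot is assumed to hold the ending whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct Slot {
  std::size_t begin;
  std::size_t end;
};

// Call fun(range, view, isLast) once per slot, in order.
template <typename Fun>
void forEachWordSketch(const std::string &text, const std::vector<Slot> &slots, Fun fun) {
  for (std::size_t i = 0; i < slots.size(); ++i) {
    const Slot &slot = slots[i];
    assert(slot.begin <= slot.end && slot.end <= text.size());
    std::string_view view = std::string_view(text).substr(slot.begin, slot.end - slot.begin);
    bool isLast = (i + 1 == slots.size());  // true only for the final slot
    fun(slot, view, isLast);
  }
}
```

A callback might, for example, use the flag to treat the ending whitespace differently from ordinary words.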

Public Members

std::string text

Blob of string that the elements in annotation refer to.

Annotation annotation

Sentence and (sub-)word annotations.