Class Annotation

Class Documentation

class Annotation

Annotation expresses sentence and token boundary information as ranges of bytes in a string, but does not itself own the string.

See also AnnotatedText, which owns Annotation and the string. AnnotatedText wraps these ByteRange functions to provide a string_view interface.
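Because Annotation does not own the string, a byte range only becomes text once it is resolved against a string that lives elsewhere. The sketch below illustrates that idea. The `ByteRange` struct here is an assumed stand-in (half-open `[begin, end)` byte offsets), and `slice` is a hypothetical helper, not a function of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Assumed shape of a byte range: half-open [begin, end) offsets into a string.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical helper: resolve a range against text that is owned elsewhere.
// The caller must keep `text` alive for as long as the returned view is used.
inline std::string_view slice(std::string_view text, ByteRange range) {
  assert(range.begin <= range.end && range.end <= text.size());
  return text.substr(range.begin, range.end - range.begin);
}
```

With the text `"Hi. Bye."`, the range `[0, 3)` resolves to `"Hi."` and the empty range `[8, 8)` resolves to an empty view.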

Text is divided into gaps (whitespace between sentences) and sentences like so:

gap sentence gap sentence gap

Because gaps appear at the beginning and end of the text, there’s always one more gap than there are sentences.

The entire text is an unbroken sequence of tokens (i.e. the end of a token is the beginning of the next token). A gap is exactly one token containing whatever whitespace is between the sentences. A sentence is a sequence of tokens.

Since we are using SentencePiece, a token can include whitespace. The term “word” is used, somewhat incorrectly, as a synonym of token.

A gap can be empty (for example, there may not have been whitespace at the beginning). A sentence can also be empty (typically because the translation system produced empty output). That’s fine: these are just empty ranges, as you would expect.
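The layout rules above (alternating gaps and sentences, one more gap than sentences, no bytes left uncovered, empty ranges allowed) can be checked mechanically. The following is a minimal sketch under stated assumptions: `ByteRange` is modelled as a half-open `[begin, end)` pair, and `Layout`, `isContiguous` and `exampleLayout` are hypothetical names invented for illustration. It does not reflect how the class stores its data internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed shape of a byte range: half-open [begin, end) offsets.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
  std::size_t size() const { return end - begin; }
};

// Illustrative container: gaps[i] precedes sentences[i]; the last gap is trailing.
struct Layout {
  std::vector<ByteRange> gaps;
  std::vector<ByteRange> sentences;
};

// Returns true when gap, sentence, gap, ..., gap covers [0, textSize)
// with no holes or overlaps. Empty ranges are accepted.
inline bool isContiguous(const Layout &layout, std::size_t textSize) {
  if (layout.gaps.size() != layout.sentences.size() + 1) return false;
  std::size_t cursor = 0;
  for (std::size_t i = 0; i < layout.sentences.size(); ++i) {
    const ByteRange &gap = layout.gaps[i];
    const ByteRange &sentence = layout.sentences[i];
    if (gap.begin != cursor || gap.end < gap.begin) return false;
    if (sentence.begin != gap.end || sentence.end < sentence.begin) return false;
    cursor = sentence.end;
  }
  const ByteRange &last = layout.gaps.back();
  return last.begin == cursor && last.end == textSize;
}

// Hand-built layout for the 8-byte text "Hi. Bye.":
// empty leading gap, "Hi.", " ", "Bye.", empty trailing gap.
inline Layout exampleLayout() {
  Layout layout;
  layout.gaps = {{0, 0}, {3, 4}, {8, 8}};
  layout.sentences = {{0, 3}, {4, 8}};
  assert(isContiguous(layout, 8));
  return layout;
}
```

In this example the leading and trailing gaps are both empty, which is the case described above where the text has no whitespace at its edges.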

Public Functions

Annotation()

Initially an empty string. Populated by AnnotatedText.

size_t numSentences() const
size_t numWords(size_t sentenceIdx) const

Returns number of words in the sentence identified by sentenceIdx.

ByteRange word(size_t sentenceIdx, size_t wordIdx) const

Returns a ByteRange representing the word at wordIdx in the sentence indexed by sentenceIdx.

wordIdx follows 0-based indexing. Behaviour is defined only when wordIdx is less than .numWords(sentenceIdx).
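Since out-of-range indices are not defined behaviour, a caller may want to validate both indices before asking for a word. The sketch below shows one way to do that. `MockAnnotation` is a hypothetical stand-in that only mirrors the accessor signatures documented here, its per-sentence boundary storage is an assumption made for the example, and `checkedWord` is an illustrative helper rather than part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Assumed shape of a byte range: half-open [begin, end) offsets.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical stand-in with the same accessor names as the documented class.
// Storage (an assumption): per sentence, n tokens are described by n + 1 offsets,
// so every sentence needs at least one offset, even when it has no tokens.
class MockAnnotation {
 public:
  explicit MockAnnotation(std::vector<std::vector<std::size_t>> boundaries)
      : boundaries_(std::move(boundaries)) {}

  std::size_t numSentences() const { return boundaries_.size(); }

  std::size_t numWords(std::size_t sentenceIdx) const {
    return boundaries_[sentenceIdx].size() - 1;
  }

  ByteRange word(std::size_t sentenceIdx, std::size_t wordIdx) const {
    const std::vector<std::size_t> &b = boundaries_[sentenceIdx];
    return ByteRange{b[wordIdx], b[wordIdx + 1]};
  }

 private:
  std::vector<std::vector<std::size_t>> boundaries_;
};

// Illustrative helper: returns std::nullopt instead of touching an invalid index.
template <typename AnnotationLike>
std::optional<ByteRange> checkedWord(const AnnotationLike &annotation,
                                     std::size_t sentenceIdx, std::size_t wordIdx) {
  if (sentenceIdx >= annotation.numSentences()) return std::nullopt;
  if (wordIdx >= annotation.numWords(sentenceIdx)) return std::nullopt;
  return annotation.word(sentenceIdx, wordIdx);
}
```

Checking sentenceIdx before calling numWords matters here, because numWords itself is only meaningful for a valid sentence index.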

ByteRange sentence(size_t sentenceIdx) const

Returns a ByteRange representing the sentence corresponding to sentenceIdx.

sentenceIdx follows 0-based indexing, and behaviour is defined only when less than .numSentences().

ByteRange gap(size_t gapIdx) const