Class Annotation

Defined in File annotation.h

Class Documentation

class Annotation

Annotation expresses sentence and token boundary information as ranges of bytes in a string, but does not itself own the string.

See also AnnotatedText, which owns Annotation and the string. AnnotatedText wraps these ByteRange functions to provide a string_view interface.
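To illustrate how a byte range maps onto a string_view, here is a minimal sketch. It assumes ByteRange is a half-open pair of byte offsets, which matches the statement that the end of one token is the beginning of the next. The field names `begin` and `end` and the helper `slice` are assumptions made for illustration and are not taken from annotation.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the library's ByteRange:
// a half-open [begin, end) span of byte offsets into some text.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
  std::size_t size() const { return end - begin; }
};

// Resolve a range against the text it indexes.
// The returned view borrows from `text`, so the underlying
// string must outlive the view.
std::string_view slice(std::string_view text, ByteRange range) {
  // A range is only meaningful if it lies inside the text.
  assert(range.begin <= range.end && range.end <= text.size());
  return text.substr(range.begin, range.size());
}
```

For the text "Hi there. Bye now.\n", `slice(text, ByteRange{0, 9})` yields "Hi there.", and an empty range such as `ByteRange{9, 9}` yields an empty view.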

Text is divided into gaps (whitespace between sentences) and sentences like so:

gap sentence gap sentence gap

Because gaps appear at the beginning and end of the text, there’s always one more gap than there are sentences.

The entire text is an unbroken sequence of tokens (i.e. the end of a token is the beginning of the next token). A gap is exactly one token containing whatever whitespace is between the sentences. A sentence is a sequence of tokens.

Since we are using SentencePiece, a token can include whitespace. The term “word” is used, somewhat incorrectly, as a synonym of token.

A gap can be empty (for example, the text may not begin with whitespace). A sentence can also be empty (typically because the translation system produced empty output). Both cases are fine: they are simply empty ranges, as you would expect.
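The layout described above can be modelled with two arrays: the start offset of every token, and the index of each gap token. The sketch below is a toy model written for illustration, not the library's class. The names `ToyAnnotation`, `tokenBegin`, `gapIndex` and `exampleAnnotation`, and the `begin`/`end` fields of `ByteRange`, are assumptions. The accessor names mirror the public functions of this class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for ByteRange: half-open [begin, end) byte offsets.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Toy model of the gap/sentence layout (illustrative only).
class ToyAnnotation {
 public:
  // tokenBegin: start offset of every token, plus one final entry
  //             holding the end of the text.
  // gapIndex:   for each gap, the index of its token in tokenBegin.
  ToyAnnotation(std::vector<std::size_t> tokenBegin,
                std::vector<std::size_t> gapIndex)
      : tokenBegin_(std::move(tokenBegin)), gapIndex_(std::move(gapIndex)) {}

  // There is always one more gap than there are sentences.
  std::size_t numSentences() const { return gapIndex_.size() - 1; }

  // Words are the tokens strictly between two consecutive gaps.
  std::size_t numWords(std::size_t sentenceIdx) const {
    assert(sentenceIdx < numSentences());
    return gapIndex_[sentenceIdx + 1] - gapIndex_[sentenceIdx] - 1;
  }

  ByteRange word(std::size_t sentenceIdx, std::size_t wordIdx) const {
    assert(wordIdx < numWords(sentenceIdx));
    return token(gapIndex_[sentenceIdx] + 1 + wordIdx);
  }

  // A sentence spans from the end of its leading gap
  // to the start of its trailing gap.
  ByteRange sentence(std::size_t sentenceIdx) const {
    assert(sentenceIdx < numSentences());
    return {tokenBegin_[gapIndex_[sentenceIdx] + 1],
            tokenBegin_[gapIndex_[sentenceIdx + 1]]};
  }

  ByteRange gap(std::size_t gapIdx) const {
    assert(gapIdx < gapIndex_.size());
    return token(gapIndex_[gapIdx]);
  }

 private:
  // Each token ends where the next one begins, so only the
  // start offsets need to be stored.
  ByteRange token(std::size_t i) const {
    return {tokenBegin_[i], tokenBegin_[i + 1]};
  }

  std::vector<std::size_t> tokenBegin_;
  std::vector<std::size_t> gapIndex_;
};

// Layout for the 19-byte text "Hi there. Bye now.\n":
//   gap0 ""  | "Hi" " there" "." | gap1 " " | "Bye" " now" "." | gap2 "\n"
ToyAnnotation exampleAnnotation() {
  return ToyAnnotation({0, 0, 2, 8, 9, 10, 13, 17, 18, 19}, {0, 4, 8});
}
```

In this example the first gap is the empty range [0, 0) because the text does not start with whitespace, and the second word of the second sentence is the range [13, 17), which covers " now" including its leading space. Storing only start offsets makes the "unbroken sequence of tokens" property hold by construction.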

Public Functions

Annotation()

Constructs an annotation describing an empty string. It is populated by AnnotatedText.

size_t numSentences() const

size_t numWords(size_t sentenceIdx) const

Returns the number of words in the sentence identified by sentenceIdx.

ByteRange word(size_t sentenceIdx, size_t wordIdx) const

Returns a ByteRange representing the word at wordIdx in the sentence indexed by sentenceIdx.

wordIdx is 0-based, and behaviour is defined only when it is less than numWords(sentenceIdx).

ByteRange sentence(size_t sentenceIdx) const

Returns a ByteRange representing the sentence at sentenceIdx.

sentenceIdx is 0-based, and behaviour is defined only when it is less than numSentences().

ByteRange gap(size_t gapIdx) const