Class Annotation
Defined in File annotation.h
Class Documentation
- class Annotation
Annotation expresses sentence and token boundary information as ranges of bytes in a string, but does not itself own the string.
See also AnnotatedText, which owns Annotation and the string. AnnotatedText wraps these ByteRange functions to provide a string_view interface.
Text is divided into gaps (whitespace between sentences) and sentences like so:
gap sentence gap sentence gap
Because gaps appear at the beginning and end of the text, there’s always one more gap than there are sentences.
The entire text is an unbroken sequence of tokens (i.e. the end of a token is the beginning of the next token). A gap is exactly one token containing whatever whitespace is between the sentences. A sentence is a sequence of tokens.
Since we are using SentencePiece, a token can include whitespace. The term “word” is used, somewhat incorrectly, as a synonym of token.
A gap can be empty (for example, when there is no whitespace at the beginning of the text). A sentence can also be empty (typically because the translation system produced empty output). That’s fine; these are just empty ranges, as you would expect.
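The layout described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `ByteRange` here is a stand-in struct treated as a half-open interval `[begin, end)`, and `Layout`, `exampleLayout`, `isContiguous`, `numSentences` and `slice` are hypothetical names invented for this sketch, not part of the library. The token boundaries are hand-written rather than produced by SentencePiece.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for ByteRange, assumed here to be a half-open interval [begin, end).
struct ByteRange {
  std::size_t begin;
  std::size_t end;
  std::size_t size() const { return end - begin; }
};

// Hypothetical container used only for this illustration.
struct Layout {
  std::string text;               // kept here for convenience; Annotation owns no text
  std::vector<ByteRange> tokens;  // every token in text order, gaps included
  std::vector<std::size_t> gaps;  // indices into `tokens` that are gap tokens
};

Layout exampleLayout() {
  Layout l;
  l.text = " Hi there. Bye.";
  l.tokens = {
      {0, 1},    // gap 0: leading " "
      {1, 3},    // sentence 0: "Hi"
      {3, 9},    // sentence 0: " there" (the token carries its whitespace)
      {9, 10},   // sentence 0: "."
      {10, 11},  // gap 1: " " between the sentences
      {11, 14},  // sentence 1: "Bye"
      {14, 15},  // sentence 1: "."
      {15, 15},  // gap 2: empty, the text has no trailing whitespace
  };
  l.gaps = {0, 4, 7};
  return l;
}

// True if the tokens cover the whole text with no holes and no overlap.
bool isContiguous(const Layout &l) {
  if (l.tokens.empty() || l.tokens.front().begin != 0) return false;
  for (std::size_t i = 0; i + 1 < l.tokens.size(); ++i) {
    if (l.tokens[i].end != l.tokens[i + 1].begin) return false;
  }
  return l.tokens.back().end == l.text.size();
}

// There is always one more gap than there are sentences.
std::size_t numSentences(const Layout &l) { return l.gaps.size() - 1; }

// Copy out the bytes covered by one token.
std::string slice(const Layout &l, std::size_t tokenIdx) {
  assert(tokenIdx < l.tokens.size());
  const ByteRange &r = l.tokens[tokenIdx];
  return l.text.substr(r.begin, r.size());
}
```

In this example the final gap is an empty range at the end of the text, and the second token of the first sentence includes its leading space, which matches the remarks about empty gaps and whitespace inside tokens.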
Public Functions
- Annotation()
Initially an empty string. Populated by AnnotatedText.
- size_t numSentences() const
- size_t numWords(size_t sentenceIdx) const
Returns the number of words in the sentence identified by sentenceIdx.
- ByteRange word(size_t sentenceIdx, size_t wordIdx) const
Returns a ByteRange representing the word at wordIdx in the sentence indexed by sentenceIdx. wordIdx follows 0-based indexing, and should be less than numWords(sentenceIdx) for defined behaviour.
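A typical way to use these three accessors together is a nested loop over sentences and words. The sketch below uses a toy stand-in, `ToyAnnotation`, rather than the library class, so that it compiles on its own. Its storage scheme (token begin offsets plus the token indices of the gaps) is an assumption made for illustration, as are the stand-in `ByteRange` struct and the helper `allWords`. It requires C++17 for `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Stand-in for ByteRange, assumed here to be a half-open interval [begin, end).
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Toy class exposing the three documented accessors. Not the library's class.
class ToyAnnotation {
 public:
  // tokenBegin: begin offset of every token, followed by the end of the text.
  // gap: indices of the tokens that are gaps (first and last token included).
  ToyAnnotation(std::vector<std::size_t> tokenBegin, std::vector<std::size_t> gap)
      : tokenBegin_(std::move(tokenBegin)), gap_(std::move(gap)) {}

  std::size_t numSentences() const { return gap_.size() - 1; }

  // Tokens strictly between two consecutive gaps belong to one sentence.
  std::size_t numWords(std::size_t sentenceIdx) const {
    return gap_[sentenceIdx + 1] - gap_[sentenceIdx] - 1;
  }

  ByteRange word(std::size_t sentenceIdx, std::size_t wordIdx) const {
    // The documented precondition, checked here only in debug builds.
    assert(wordIdx < numWords(sentenceIdx));
    std::size_t tokenIdx = gap_[sentenceIdx] + 1 + wordIdx;
    return ByteRange{tokenBegin_[tokenIdx], tokenBegin_[tokenIdx + 1]};
  }

 private:
  std::vector<std::size_t> tokenBegin_;
  std::vector<std::size_t> gap_;
};

// Collect a view of every word of every sentence. The views point into `text`,
// so `text` must outlive the returned vector.
std::vector<std::string_view> allWords(const ToyAnnotation &a, const std::string &text) {
  std::vector<std::string_view> out;
  std::string_view whole(text);
  for (std::size_t s = 0; s < a.numSentences(); ++s) {
    for (std::size_t w = 0; w < a.numWords(s); ++w) {
      ByteRange r = a.word(s, w);
      out.push_back(whole.substr(r.begin, r.end - r.begin));
    }
  }
  return out;
}
```

For the text `" Hi there. Bye."`, constructing `ToyAnnotation({0, 1, 3, 9, 10, 11, 14, 15, 15}, {0, 4, 7})` describes two sentences with three and two words respectively. Bounding the inner loop by `numWords(s)` keeps every call to `word` within its documented precondition.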
-