Class TextProcessor

Defined in File text_processor.h

Class Documentation

class TextProcessor

Public Functions

TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const std::string &ssplit_prefix_file)

TextProcessor handles loading the SentencePiece vocabulary and also holds an instance of a sentence splitter based on ssplit.

Used in Service to convert an incoming blob of text into a vector of sentences, each of which is a vector of words. In addition, the ByteRanges of the source tokens in the unnormalized text are provided as string_views. Constructs a TextProcessor from options, vocabs and a prefix file.

Parameters

[in] options: Expected to contain max-length-break and ssplit-mode.
[in] vocabs: Vocabularies used to process text into sentences, producing marian::Words and the corresponding ByteRange information in AnnotatedText.
[in] ssplit_prefix_file: Path to ssplit-prefix file compatible with moses-tokenizer.
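
The description above says that the positions of source tokens in the unnormalized text are exposed as string_views. The following is a minimal sketch of that idea only, not the library's implementation: `ByteRangeSketch`, `tokenRanges` and `viewOf` are hypothetical names, and splitting on ASCII spaces stands in for the real SentencePiece and ssplit processing.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a byte range: half-open [begin, end) offsets
// into the original, unmodified text.
struct ByteRangeSketch {
  std::size_t begin;
  std::size_t end;
};

// Assumption: tokens are separated by ASCII spaces. The real class relies on
// a SentencePiece vocabulary instead; this only shows the bookkeeping.
inline std::vector<ByteRangeSketch> tokenRanges(std::string_view text) {
  std::vector<ByteRangeSketch> ranges;
  std::size_t i = 0;
  while (i < text.size()) {
    // Skip any run of separators before the next token.
    while (i < text.size() && text[i] == ' ') ++i;
    const std::size_t start = i;
    // Advance to the end of the token.
    while (i < text.size() && text[i] != ' ') ++i;
    if (i > start) ranges.push_back({start, i});
  }
  return ranges;
}

// Turn a recorded range back into a view over the original text, without
// copying any bytes.
inline std::string_view viewOf(std::string_view text, ByteRangeSketch r) {
  assert(r.begin <= r.end && r.end <= text.size());
  return text.substr(r.begin, r.end - r.begin);
}
```

Because the ranges index into the original buffer, the views stay valid only as long as that buffer is alive and unmodified, which is presumably why the text is bound to the annotation object rather than copied.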

TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const AlignedMemory &memory)

Construct TextProcessor from options, vocabs and prefix-file supplied as a bytearray.

For the other parameters, see the path-based constructor. Note: This falls back to string-based loading if memory is null; this behaviour will be deprecated in the future.

Parameters

[in] memory: ssplit-prefix-file contents in memory, passed as a bytearray.

void process(std::string &&blob, AnnotatedText &source, Segments &segments) const

Wraps the input into sentences of at most maxLengthBreak_ tokens each and adds them to source.

Parameters

[in] blob: Input blob. It is bound to source, and the annotations on it are stored there.
[out] source: AnnotatedText instance holding the input and the annotations of sentences and pieces.
[out] segments: marian::Word equivalents of the sentences processed and stored in AnnotatedText, for consumption by the marian translation pipeline.
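
The wrapping step described for process can be pictured with the sketch below. It is an illustration under assumptions, not the library's code: `WordSketch`, `SegmentSketch` and `wrapSentence` are hypothetical names, word ids are plain integers, and any end-of-sentence handling the real pipeline may perform is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for marian::Word and a single segment.
using WordSketch = unsigned int;
using SegmentSketch = std::vector<WordSketch>;

// Cut one tokenized sentence into consecutive chunks of at most
// maxLengthBreak tokens. The last chunk may be shorter.
inline std::vector<SegmentSketch> wrapSentence(const SegmentSketch &sentence,
                                               std::size_t maxLengthBreak) {
  assert(maxLengthBreak > 0);  // a zero limit would never make progress
  std::vector<SegmentSketch> segments;
  for (std::size_t offset = 0; offset < sentence.size(); offset += maxLengthBreak) {
    const std::size_t end = std::min(offset + maxLengthBreak, sentence.size());
    segments.emplace_back(sentence.begin() + static_cast<std::ptrdiff_t>(offset),
                          sentence.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return segments;
}
```

For a five-token sentence and a limit of two, this yields chunks of two, two and one tokens. An empty sentence yields no chunks.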

void processFromAnnotation(AnnotatedText &source, Segments &segments) const