Class TextProcessor

Class Documentation

class TextProcessor

Public Functions

TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const std::string &ssplit_prefix_file)

TextProcessor handles loading the SentencePiece vocabulary and also contains an instance of a sentence splitter based on ssplit.

Used in Service to convert an incoming blob of text into a vector of sentences, each a vector of words. In addition, the ByteRanges of the source tokens in the unnormalized text are provided as string_views.

Construct TextProcessor from options, vocabs and prefix-file.

Parameters
  • [in] options: expected to contain max-length-break and ssplit-mode.

  • [in] vocabs: Vocabularies used to convert text into sentences of marian::Words, with the corresponding ByteRange information stored in AnnotatedText.

  • [in] ssplit_prefix_file: Path to ssplit-prefix file compatible with moses-tokenizer.
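
The relationship between ByteRanges and string_views described above can be illustrated with a minimal sketch. It is not the library's implementation: the ByteRange struct, tokenRanges and viewOf are hypothetical stand-ins, and a toy whitespace tokenizer takes the place of SentencePiece. The point is only that each token is recorded as byte offsets into the original, unnormalized text, and a string_view over those offsets recovers the exact source bytes without copying.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a ByteRange: a half-open [begin, end) span of
// byte offsets into the original text.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
  std::size_t size() const { return end - begin; }
};

// Toy tokenizer (assumption: splits on ASCII spaces only, not SentencePiece).
// Records where each token sits in `text` instead of copying it.
std::vector<ByteRange> tokenRanges(const std::string &text) {
  std::vector<ByteRange> ranges;
  std::size_t i = 0;
  while (i < text.size()) {
    while (i < text.size() && text[i] == ' ') ++i;  // skip separators
    std::size_t start = i;
    while (i < text.size() && text[i] != ' ') ++i;  // consume one token
    if (i > start) ranges.push_back({start, i});
  }
  return ranges;
}

// View of the bytes a range refers to; valid only while `text` is alive.
std::string_view viewOf(const std::string &text, const ByteRange &range) {
  assert(range.begin <= range.end && range.end <= text.size());
  return std::string_view(text).substr(range.begin, range.size());
}
```

Because the views borrow from the text rather than own it, the text must outlive every view taken from it, which is one reason to keep the text and its annotations together in a single object.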

TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const AlignedMemory &memory)

Construct TextProcessor from options, vocabs and prefix-file supplied as a bytearray.

For other parameters, see the path-based constructor. Note: This falls back to string-based loads if memory is null; this behaviour will be deprecated in the future.

Parameters
  • [in] memory: ssplit-prefix-file contents in memory, passed as a bytearray.

void process(std::string &&blob, AnnotatedText &source, Segments &segments) const

Wrap blob into sentences of at most maxLengthBreak_ tokens and add them to source.

Parameters
  • [in] blob: Input blob; it is bound to source, and the annotations on it are stored there.

  • [out] source: AnnotatedText instance holding the input and the annotations of sentences and pieces.

  • [out] segments: marian::Word equivalents of the sentences processed and stored in AnnotatedText, for consumption by the marian translation pipeline.
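
As a rough illustration of wrapping to at most maxLengthBreak_ tokens, the sketch below breaks one tokenized sentence into consecutive chunks no longer than a given limit. It is written under assumptions: Word, Segment and Segments are hypothetical aliases standing in for the marian types, wrap is an illustrative free function rather than the library's method, and any special tokens or annotation bookkeeping the real pipeline performs are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins: in the library a segment holds marian::Words.
using Word = unsigned int;
using Segment = std::vector<Word>;
using Segments = std::vector<Segment>;

// Break one tokenized sentence into consecutive chunks of at most
// maxLengthBreak tokens, appending each chunk to `segments`.
void wrap(const Segment &sentence, std::size_t maxLengthBreak, Segments &segments) {
  assert(maxLengthBreak > 0);  // a zero limit would never make progress
  for (std::size_t offset = 0; offset < sentence.size(); offset += maxLengthBreak) {
    std::size_t end = std::min(offset + maxLengthBreak, sentence.size());
    auto first = sentence.begin() + static_cast<std::ptrdiff_t>(offset);
    auto last = sentence.begin() + static_cast<std::ptrdiff_t>(end);
    segments.emplace_back(first, last);  // copy tokens [offset, end)
  }
}
```

With a limit of 2, a five-token sentence yields three segments of sizes 2, 2 and 1; an empty sentence adds nothing.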

void processFromAnnotation(AnnotatedText &source, Segments &segments) const