Class TextProcessor
Defined in File text_processor.h
Class Documentation
-
class TextProcessor
Public Functions
-
TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const std::string &ssplit_prefix_file)
TextProcessor handles loading the sentencepiece vocabulary and also contains an instance of a sentence splitter based on ssplit.
Used in Service to convert an incoming blob of text into a vector of sentences (each a vector of words). In addition, the ByteRanges of the source tokens in the unnormalized text are provided as string_views.
Constructs a TextProcessor from options, vocabs and a prefix file.
- Parameters
[in] options: expected to contain max-length-break, ssplit-mode.
[in] vocabs: Vocabularies used to process text into sentences of marian::Words and the corresponding ByteRange information in AnnotatedText.
[in] ssplit_prefix_file: Path to an ssplit-prefix file compatible with moses-tokenizer.
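The idea of reporting source tokens as byte ranges into the unnormalized text can be illustrated with a small standalone sketch. It does not use the library: `ByteRangeSketch`, `tokenRanges` and `viewOf` are hypothetical names introduced here, and plain ASCII whitespace splitting stands in for the sentencepiece and ssplit processing that TextProcessor actually performs.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a byte range: half-open interval [begin, end)
// of byte offsets into the original text.
struct ByteRangeSketch {
  std::size_t begin;
  std::size_t end;
};

// Assumption: tokens are separated by ASCII whitespace. The real class
// relies on sentencepiece and ssplit instead.
std::vector<ByteRangeSketch> tokenRanges(std::string_view text) {
  std::vector<ByteRangeSketch> ranges;
  std::size_t i = 0;
  while (i < text.size()) {
    // Skip the whitespace run before the next token.
    while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    const std::size_t start = i;
    // Consume the token itself.
    while (i < text.size() && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    if (i > start) ranges.push_back({start, i});
  }
  return ranges;
}

// Returns a view into `text` for the given range. No bytes are copied, so
// the view is only valid while the storage behind `text` is alive.
std::string_view viewOf(std::string_view text, ByteRangeSketch range) {
  assert(range.begin <= range.end && range.end <= text.size());
  return text.substr(range.begin, range.end - range.begin);
}
```

Because the ranges index the original bytes, the untouched input can be recovered for any token even if later stages normalize the text.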
-
TextProcessor(Ptr<Options> options, const Vocabs &vocabs, const AlignedMemory &memory)
Constructs a TextProcessor from options, vocabs and a prefix file supplied as a bytearray.
For other parameters, see the path-based constructor. Note: This falls back to string-based loads if memory is null. This behaviour will be deprecated in the future.
- Parameters
[in] memory: ssplit-prefix-file contents in memory, passed as a bytearray.
-
void process(std::string &&blob, AnnotatedText &source, Segments &segments) const
Wraps the text in blob into sentences of at most maxLengthBreak_ tokens and adds them to source.
- Parameters
[in] blob: Input blob; it will be bound to source, and the annotations on it stored.
[out] source: AnnotatedText instance holding the input and the annotations of sentences and pieces.
[out] segments: marian::Word equivalents of the sentences processed and stored in AnnotatedText, for consumption by the marian translation pipeline.
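As a rough illustration of wrapping at a maximum length, the sketch below cuts one tokenized sentence into consecutive chunks of at most `maxLengthBreak` tokens. It is not the library's implementation: `WordSketch`, `SegmentSketch` and `wrapTokens` are hypothetical names, integers stand in for marian::Word, and special tokens and annotation bookkeeping are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using WordSketch = int;                         // hypothetical stand-in for marian::Word
using SegmentSketch = std::vector<WordSketch>;  // one sentence, or one wrapped chunk of it

// Cuts `sentence` into consecutive chunks of at most `maxLengthBreak` tokens.
// Assumption: a plain fixed-size cut; the real process() may treat special
// tokens and annotations differently.
std::vector<SegmentSketch> wrapTokens(const SegmentSketch &sentence,
                                      std::size_t maxLengthBreak) {
  assert(maxLengthBreak > 0);
  std::vector<SegmentSketch> segments;
  for (std::size_t offset = 0; offset < sentence.size(); offset += maxLengthBreak) {
    const std::size_t end = std::min(offset + maxLengthBreak, sentence.size());
    // Copy the tokens in [offset, end) into a new chunk.
    segments.emplace_back(sentence.begin() + static_cast<std::ptrdiff_t>(offset),
                          sentence.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return segments;
}
```

Under this scheme every chunk except possibly the last has exactly `maxLengthBreak` tokens, and concatenating the chunks in order reproduces the input sentence.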
-
void processFromAnnotation(AnnotatedText &source, Segments &segments) const
-