Function marian::bergamot::mapWords

Function Documentation

std::vector<SubwordRange> marian::bergamot::mapWords(const std::vector<float> &logProbs, const AnnotatedText &target, const size_t sentenceIdx)

A word is composed of one or more subword tokens.

Whole words are tokens separated by whitespace. This method takes a sequence of subword tokens (given by AnnotatedText), aligned with their log probabilities, and groups them into their respective words. It returns a vector of SubwordRange (an alias of ByteRange) in which each index corresponds to a word id and each element holds the range of subwords that compose that word.

If a translated sentence does not contain any alphanumeric character (that is, it consists essentially of the EOS token alone), this method ignores it and returns an empty SubwordRange vector.

Example: Suppose that you have the following source sentence (A): marian is a good translation service and the translation service gives you the following target sentence (B):

ma(0.15) ri(0.15) an(0.2) es(0.3) un(0.1) bu(0.3) en(0.2) ser(0.1) vi(0.2) cio(0.4) de(0.1) tra(0.4) du(0.2) cción(0.1)

The number in parentheses after each subword is the logProb of that BPE token.

The result would then be a vector in which each position holds the SubwordRange of one of the following words: marian es un buen servicio de traducción. Hence, its length is 7. The first element would be [0,3), since the first three subwords (ma, ri, an) form the word marian.
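The grouping in this example can be illustrated with a minimal, self-contained sketch. It is not the bergamot implementation: Range and groupSubwords are hypothetical names, subwords are passed as plain strings rather than through AnnotatedText, and the sketch assumes that a subword beginning with a space starts a new word and that the EOS token has already been removed from both inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for SubwordRange: a half-open range [begin, end)
// of subword indices that together form one word.
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Groups subwords into words.
// Assumption: a subword whose first character is a space starts a new word.
// Assumption: EOS is not present in either input vector.
std::vector<Range> groupSubwords(const std::vector<std::string>& subwords,
                                 const std::vector<float>& logProbs) {
  // Each subword must be aligned with exactly one log probability.
  assert(subwords.size() == logProbs.size());

  std::vector<Range> words;
  if (subwords.empty()) {
    return words;  // nothing to group
  }

  // The first subword always opens the first word.
  words.push_back({0, 0});
  for (std::size_t i = 1; i < subwords.size(); ++i) {
    if (!subwords[i].empty() && subwords[i][0] == ' ') {
      words.back().end = i;     // close the current word
      words.push_back({i, i});  // open the next word at this subword
    }
  }
  // Close the last word at the end of the sequence.
  words.back().end = subwords.size();
  return words;
}
```

Called with the 14 subwords of sentence (B), written as "ma", "ri", "an", " es", " un", " bu", "en", " ser", "vi", "cio", " de", " tra", "du", "cción", this sketch produces 7 ranges, the first being [0,3) and the last [11,14).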

Parameters
  • [in] logProbs: the log probabilities of the byte pair encoding (BPE) tokens, which come from the tracebackWordScores method (declared in hypothesis.h in Marian)

  • [in] target: AnnotatedText target value

  • [in] sentenceIdx: the id of a candidate sentence