Function marian::bergamot::mapWords
Defined in File quality_estimator.cpp
Function Documentation
std::vector<SubwordRange> marian::bergamot::mapWords(const std::vector<float> &logProbs, const AnnotatedText &target, const size_t sentenceIdx)
A word is composed of one or more subword tokens; entire words are the tokens obtained by splitting on whitespace. This method takes a sequence of subword tokens (given by AnnotatedText), aligned with their log probabilities, and conflates them into their respective words. It returns a vector of SubwordRange (an alias of ByteRange) in which each index corresponds to a word id and each value is the range of subwords that compose that word.
If a translated sentence does not contain any alphanumeric character (that is, it consists essentially of the EOS token), this method ignores it and returns an empty ByteRange vector.
Example: suppose you have the following source sentence (A): marian is a good translation service, and the translation service gives you the following target sentence (B):
ma(0.15) ri(0.15) an(0.2) es(0.3) un(0.1) bu(0.3) en(0.2) ser(0.1) vi(0.2) cio(0.4) de(0.1) tra(0.4) du(0.2) cción(0.1)
The number following each subword is the logProb of that BPE token.
Then the result would be a vector in which each position holds the SubwordRange of one of the following words: marian es un buen servicio de traducción. Hence its length is 7, and the value of the first element would be [0,3).
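The grouping in this example can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the code in quality_estimator.cpp: `Range`, `groupSubwords`, and `exampleSubwords` are hypothetical names, and the sketch assumes that a subword opening a new word is marked by a leading space, whereas the real function reads word boundaries from AnnotatedText.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for SubwordRange: a half-open range [begin, end)
// of subword indices.
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Groups subword tokens into words. Assumption for this sketch: a token that
// starts with a space opens a new word; the first token always opens one.
std::vector<Range> groupSubwords(const std::vector<std::string> &subwords) {
  std::vector<Range> words;
  for (std::size_t i = 0; i < subwords.size(); ++i) {
    const bool startsWord =
        words.empty() || (!subwords[i].empty() && subwords[i][0] == ' ');
    if (startsWord) {
      words.push_back({i, i + 1});  // open a new one-token range
    } else {
      words.back().end = i + 1;     // extend the current word
    }
  }
  return words;
}

// The subwords of sentence (B), with a leading space marking each word start.
std::vector<std::string> exampleSubwords() {
  return {"ma", "ri", "an", " es", " un", " bu", "en",
          " ser", "vi", "cio", " de", " tra", "du", "cción"};
}
```

Calling `groupSubwords(exampleSubwords())` yields seven ranges, the first being [0,3) for "marian" and the last [11,14) for "traducción".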
- Parameters
[in] logProbs: the log probabilities of the byte pair encoding (BPE) tokens, which come from the tracebackWordScores method (which belongs to hypothesis.h in Marian)
[in] target: AnnotatedText target value
[in] sentenceIdx: the id of a candidate sentence