Class HTML
Defined in File html.h
Nested Relationships
Nested Types
Class Documentation
-
class
HTML
The HTML class parses and removes HTML from the input text, and places it back into the translated output text.
When parsing the HTML, it treats tags as markup: a list of nested tags is seen as a list of markups that apply to all the text that follows. This list is stored as a TagStack. Whenever an HTML tag opens or closes, a new TagStack is created to reflect that. TagStack used to be called Taint, because it tainted the text it was associated with, using those tags as markup. The text between the tags is stored in the input variable. spans_ stores, for each substring of that text, the TagStack associated with it.
When transferring the HTML from the source text to the translated target text, the TagStacks are first associated with each of the subwords of the source text. Using hard alignment, each subword in the source text is linked to a subword in the target text. The TagStacks are then copied over these links. Finally, the HTML is inserted back into the target text: for each subword, the TagStack of the previous subword is compared to that of the current one, and elements are closed and opened to make up for the difference.
There are a couple of complexities though:
Not all tags can be treated as markup applied to text. For example, an <img> does not contain text itself, and neither does <i></i>. We do want those tags to remain in the output though. We do this by associating them with an empty Span. When inserting HTML back into the translation input or output, we keep track of where we are in the spans_ vector, and insert any elements from empty spans that we might have skipped over, because empty spans are never linked to tokens/subwords. These are called stragglers in some parts of the code, and void or empty elements in other parts.
Some tags should be treated as paragraph indicators that break up sentences. These are the usual suspects like <p>, but also <li> and <td>, to make sure we don’t translate two table cells into a single word. This is the addSentenceBreak flag in the HTML parsing code. We mark these breaks with \n\n in the input text and with a special WHITESPACE tag that we treat like any other void tag. Hopefully this tag moves with the added \n\n, so that it is easy for us to remove it again. (In practice it is, since these breaks only occur at the end of sentences, and the ends of sentences are always aligned between source and target.)
We treat most tags as word-breaking. We do this by adding spaces just after the place where the open or close tag occurred. If there is already some whitespace in that place, we do not add extra spaces.
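The last step of the transfer described above, comparing the TagStack of the previous subword with that of the current one and emitting the closing and opening tags that account for the difference, can be sketched as follows. This is a minimal illustration and not the library's implementation: TagStackSketch and diffTagStacks are hypothetical names, and a tag stack is simplified to a list of element names without attributes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a TagStack: the names of the elements that are
// open around a subword, outermost first. Attributes are left out.
using TagStackSketch = std::vector<std::string>;

// Returns the markup needed to get from the elements open around the previous
// subword (`prev`) to the elements open around the current one (`curr`).
std::string diffTagStacks(const TagStackSketch &prev, const TagStackSketch &curr) {
  // Elements in the shared prefix stay open and produce no markup.
  std::size_t common = 0;
  while (common < prev.size() && common < curr.size() && prev[common] == curr[common]) {
    ++common;
  }

  std::string markup;
  // Close the elements that are no longer open, innermost first.
  for (std::size_t i = prev.size(); i > common; --i) {
    markup += "</" + prev[i - 1] + ">";
  }
  // Open the elements that are new, outermost first.
  for (std::size_t i = common; i < curr.size(); ++i) {
    markup += "<" + curr[i] + ">";
  }
  return markup;
}
```

Under these assumptions, moving from a subword inside p and b to a subword inside p and i yields a closing tag for b followed by an opening tag for i, while two subwords with identical stacks yield no markup at all.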
TODO
Public Types
-
using
TagNameSet
= std::set<std::string, std::less<>>
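The std::less<> comparator makes the set transparent, which allows lookups with a std::string_view (for instance a tag name that is still a slice of the input buffer) without constructing a temporary std::string. A minimal sketch, assuming C++17; containsTag is a hypothetical helper and not part of the documented API:

```cpp
#include <functional>
#include <set>
#include <string>
#include <string_view>

using TagNameSet = std::set<std::string, std::less<>>;

// Hypothetical helper: checks membership without copying `name` into a
// std::string. This works because std::less<> is a transparent comparator,
// which enables the templated overload of std::set::find.
bool containsTag(const TagNameSet &tags, std::string_view name) {
  return tags.find(name) != tags.end();
}
```

With the default comparator std::less<std::string>, the same call would not compile as written, because std::string_view does not convert implicitly to std::string.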
Public Functions
-
HTML
(std::string &&source, bool processMarkup)
Parses HTML in source (if processMarkup is true). source is updated to contain only the plain text extracted from the HTML. The HTML instance retains information about which tags were extracted from where, so that the HTML can later be reconstructed in a Response object (both source and target).
-
struct
Options
Options struct that controls how HTML is interpreted.
Public Members
-
TagNameSet
voidTags
{"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame", "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr"}
List of elements for which we do not expect a closing tag, or self-closing elements in XHTML.
We do not need to see a closing tag for these elements, and they cannot contain text or tags themselves. See also: https://developer.mozilla.org/en-US/docs/Glossary/Empty_element. A more relevant source for this list: https://searchfox.org/mozilla-central/rev/7d17fd1fe9f0005a2fb19e5d53da4741b06a98ba/dom/base/FragmentOrElement.cpp#1791
-
TagNameSet
inlineTags
{"abbr", "a", "b", "em", "i", "kbd", "mark", "math", "output", "q", "ruby", "small", "span", "strong", "sub", "sup", "time", "u", "var", "wbr", "ins", "del", "img"}
List of elements that are treated as inline, meaning they do not break up sentences.
Any element not in this list will cause the text that follows its open or close tag to be treated as a separate sentence.
-
TagNameSet
inWordTags
= {"wbr"}
List of elements that are, regardless of substituteInlineTagsWithSpaces, not substituted with spaces.
Technically almost all inline elements should be treated like this, except maybe <br>. In practice, however, it seems to be more effective to limit this set to the one tag that can only really be used inside words: <wbr>. See also: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr
-
TagNameSet
ignoredTags
= {"code", "kbd", "samp", "var", "dir", "acronym", "math"}
List of elements we copy as is, but do parse as if they’re HTML because they could be nested.
For <script> we just scan for </script> because the script tag may not be nested, but that is not necessarily the case for these elements. Some tags, like <script>, are ignored at the Scanner level. See xh_scanner.cpp/Scanner::scanAttribute().
-
std::string
continuationDelimiters
= "n ,.(){}[]"
List of characters that, when they occur at the start of a token, indicate that this token is probably not a continuation of a word.
This is also used to determine whether there should be a space after a closing tag or not. For example, a . after a </strong> does not need to be separated by an extra space.
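A check of this kind could be sketched as below. This is an illustration under stated assumptions and not the library's code: isContinuation is a hypothetical free function, the delimiter string is passed in explicitly instead of being read from Options, and treating an empty token as "not a continuation" is a choice made for the sketch.

```cpp
#include <string_view>

// Hypothetical helper: returns true if `token` looks like the continuation of
// the word before it, meaning it is non-empty and its first character is not
// one of `delimiters`.
bool isContinuation(std::string_view token, std::string_view delimiters) {
  if (token.empty()) {
    return false;  // Assumption of this sketch: nothing to continue with.
  }
  return delimiters.find(token.front()) == std::string_view::npos;
}
```

With an illustrative delimiter string such as " ,.(){}[]", a token starting with . is not a continuation, so no extra space is needed between a closing </strong> and the full stop that follows it.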
-
bool
substituteInlineTagsWithSpaces
= true
Should we always add spaces to the places where tags used to be? I.e. should un<u>der</u>line become un der line? This does help with retaining tags inside words, or with odd pages that use CSS to add spacing between a lot of tags. Cases like <td> and <li> are already covered by treating them as sentence splitting.
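The effect of this flag, together with inWordTags, can be illustrated with a deliberately naive sketch. It is not the library's parser: stripTags and tagName are hypothetical helpers, a tag is simply taken to be everything between < and >, and comments, entities and quoted attribute values containing > are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <string_view>

using TagNameSet = std::set<std::string, std::less<>>;

// Extracts the element name from the inside of a tag, e.g. "/u" -> "u" and
// "img src=x" -> "img".
std::string_view tagName(std::string_view inside) {
  if (!inside.empty() && inside.front() == '/') inside.remove_prefix(1);
  std::size_t end = 0;
  while (end < inside.size() && std::isalnum(static_cast<unsigned char>(inside[end]))) ++end;
  return inside.substr(0, end);
}

// Removes tags from `html`. If `substituteWithSpaces` is true, a single space
// is put where a tag used to be, unless the tag is listed in `inWordTags` or
// there is already whitespace on either side of it.
std::string stripTags(std::string_view html, bool substituteWithSpaces,
                      const TagNameSet &inWordTags) {
  std::string text;
  bool pendingSpace = false;
  std::size_t pos = 0;
  while (pos < html.size()) {
    if (html[pos] == '<') {
      std::size_t close = html.find('>', pos);
      if (close == std::string_view::npos) break;  // Unterminated tag: the sketch stops here.
      std::string_view name = tagName(html.substr(pos + 1, close - pos - 1));
      if (substituteWithSpaces && inWordTags.find(name) == inWordTags.end()) {
        pendingSpace = true;
      }
      pos = close + 1;
      continue;
    }
    char c = html[pos++];
    bool isSpace = std::isspace(static_cast<unsigned char>(c)) != 0;
    // Only add a space if neither side of the removed tag already has one.
    if (pendingSpace && !isSpace && !text.empty() &&
        !std::isspace(static_cast<unsigned char>(text.back()))) {
      text += ' ';
    }
    pendingSpace = false;
    text += c;
  }
  return text;
}
```

Under these assumptions, un<u>der</u>line becomes un der line when substitution is enabled and underline when it is disabled, while un<wbr>der stays under in both cases if wbr is listed in inWordTags.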
-
TagNameSet