Class HTML

Nested Relationships

Nested Types

Class Documentation

class HTML

The HTML class parses and removes HTML from the input text, and places it back into the translated output text.

When parsing the HTML, it treats tags as markup: a list of nested tags is seen as a list of markups that apply to all the text that follows. This list is stored as a TagStack. Whenever an HTML tag opens or closes, a new TagStack is created to reflect that. (TagStack used to be called Taint, because it tainted the text it was associated with, using those tags as markup.) The text between the tags is stored in the input variable, and spans_ stores the TagStack associated with each substring of that text.

When transferring the HTML from the source text to the translated target text, the TagStacks are first associated with each of the subwords of the source text. Using hard alignment, each subword in the source text is linked to a subword in the target text, and the TagStacks are copied over these links. Finally, the HTML is inserted back into the target text: for each subword, the TagStack of the previous subword is compared to that of the current one, and elements are closed and opened to make up for the difference.
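The final step, closing and opening elements to bridge two TagStacks, can be sketched as follows. This is a minimal illustration and not the class's actual implementation: the function name renderTransition is hypothetical, the Tag struct is reduced to a name, and tags are assumed to be compared by pointer identity because each Tag object lives in a shared pool.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the documented Tag struct (name only).
struct Tag {
  std::string name;
};

using TagStack = std::vector<Tag *>;

// Hypothetical helper: emit the closing and opening tags needed to move from
// the markup of the previous subword (prev) to that of the current one (curr).
std::string renderTransition(const TagStack &prev, const TagStack &curr) {
  // Length of the shared prefix: tags that stay open across both subwords.
  std::size_t common = 0;
  while (common < prev.size() && common < curr.size() &&
         prev[common] == curr[common]) {
    ++common;
  }
  assert(common <= prev.size() && common <= curr.size());

  std::string html;
  // Close what no longer applies, innermost element first.
  for (std::size_t i = prev.size(); i > common; --i) {
    html += "</" + prev[i - 1]->name + ">";
  }
  // Open what newly applies, outermost element first.
  for (std::size_t i = common; i < curr.size(); ++i) {
    html += "<" + curr[i]->name + ">";
  }
  return html;
}
```

Under these assumptions, moving from a stack of p, b to a stack of p, i yields `</b><i>`, while two identical stacks yield an empty string.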

There are a couple of complexities though:

  1. Not all tags can be treated as markup applied to text. For example, an <img> does not contain text itself, and neither does <i></i>. We do want those tags to remain in the output, though. We do this by associating them with an empty Span. When inserting HTML back into the translation input or output, we keep track of where we are in the spans_ vector, and insert any elements from empty spans that we might have skipped over, because empty spans are never linked to tokens/subwords. These are called stragglers in some parts of the code, and void or empty elements in other parts.

  2. Some tags should be treated as paragraph indicators, and break up sentences. These are the usual suspects like <p>, but also <li> and <td>, to make sure we don’t translate two table cells into a single word. This is the addSentenceBreak flag in the HTML parsing bit. We mark these breaks with \n\n in the input text and with a special WHITESPACE tag that we treat as any other void tag. Hopefully this tag moves with the added \n\n, so that it is easy for us to remove it again. (In practice it does, since these breaks only occur at the end of sentences, and the ends of sentences are always aligned between source and target.)

  3. We treat most tags as word-breaking. We do this by adding spaces just after where we saw the open or close tag occur. If there is already some whitespace in that place, we do not add extra spaces.

  4. TODO
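The word-breaking rule from item 3 can be illustrated with a small sketch. The helper name appendAfterTag and the idea of appending to a single plain-text buffer are assumptions made for this example only; the real parser may structure this differently.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper: append the text that follows a tag to the plain-text
// buffer, inserting one separating space unless whitespace is already
// present on either side of the position where the tag used to be.
void appendAfterTag(std::string &plain, std::string_view text) {
  auto isSpace = [](char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
  };
  const bool seamHasSpace = plain.empty() || isSpace(plain.back()) ||
                            text.empty() || isSpace(text.front());
  if (!seamHasSpace) {
    plain += ' ';
  }
  plain.append(text);
}
```

Appending "world" to a buffer holding "Hello" gives "Hello world", whereas appending " again" adds no extra space because the incoming text already starts with whitespace.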

Public Types

using TagNameSet = std::set<std::string, std::less<>>
using TagStack = std::vector<Tag *>

Representation of markup that is being applied to a string of text.

Order matters as this represents how the tags are nested. The Tag objects themselves are owned by pool_.
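A rough sketch of this ownership model is shown below. The TagPool class, its make function, and the opened/closed helpers are hypothetical names for illustration; the sketch also assumes a std::deque as backing store because it keeps element addresses stable when growing at the end, which may differ from the container the class actually uses for pool_.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the documented Tag struct (name only).
struct Tag {
  std::string name;
};

using TagStack = std::vector<Tag *>;

// Hypothetical owner of all Tag objects. TagStacks only hold raw pointers
// into this pool, so the pool must outlive every TagStack.
class TagPool {
 public:
  Tag *make(std::string name) {
    // std::deque::push_back does not invalidate references to existing
    // elements, so pointers handed out earlier remain valid.
    pool_.push_back(Tag{std::move(name)});
    return &pool_.back();
  }

 private:
  std::deque<Tag> pool_;
};

// Opening a tag produces a new stack: the old one plus the new tag on top.
TagStack opened(const TagStack &stack, Tag *tag) {
  TagStack next = stack;
  next.push_back(tag);
  return next;
}

// Closing a tag produces a new stack without the innermost tag.
TagStack closed(const TagStack &stack) {
  assert(!stack.empty());
  TagStack next = stack;
  next.pop_back();
  return next;
}
```

Because each stack is a cheap vector of pointers, copying a stack on every open or close does not copy tag names or attributes.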

Public Functions

HTML(std::string &&source, bool processMarkup)

Parses HTML in source (if processMarkup is true).

source is updated to contain only the plain text extracted from the HTML. The HTML instance retains information about which tags were extracted from where, so that the HTML can later be reconstructed in a Response object (both source and target).

HTML(std::string &&source, bool processMarkup, Options &&options)
HTML(const HTML&)

It is not safe to copy an HTML instance.

HTML(HTML&&)

Moving is fine.

void restore(Response &response)

Reconstructs (not perfectly) the HTML as it was parsed from source, and uses alignment information to also reconstruct the same markup in response.target.

struct Options

Options struct that controls how HTML is interpreted.

Public Members

TagNameSet voidTags{"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame", "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr"}

List of elements for which we do not expect a closing tag, or self-closing elements in XHTML.

We do not need to see a closing tag for these elements, and they cannot contain text or tags themselves. See also: https://developer.mozilla.org/en-US/docs/Glossary/Empty_element. More relevant source of this list: https://searchfox.org/mozilla-central/rev/7d17fd1fe9f0005a2fb19e5d53da4741b06a98ba/dom/base/FragmentOrElement.cpp#1791

TagNameSet inlineTags{"abbr", "a", "b", "em", "i", "kbd", "mark", "math", "output", "q", "ruby", "small", "span", "strong", "sub", "sup", "time", "u", "var", "wbr", "ins", "del", "img"}

List of elements that are treated as inline, meaning they do not break up sentences.

Any element not in this list will cause the text that follows its open or close tag to be treated as a separate sentence.
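The rule above amounts to a set lookup. The sketch below shows one way to express it; the helper names contains and breaksSentence are hypothetical. It also shows why TagNameSet is declared with std::less<>: a transparent comparator lets find() accept a std::string_view directly, without building a temporary std::string for each lookup.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <string_view>

using TagNameSet = std::set<std::string, std::less<>>;

// std::less<> is a transparent comparator, so find() can compare the stored
// std::string keys against a std::string_view without a temporary string.
bool contains(const TagNameSet &set, std::string_view name) {
  return set.find(name) != set.end();
}

// Hypothetical helper: an element that is not listed as inline causes the
// text after its open or close tag to be treated as a separate sentence.
bool breaksSentence(const TagNameSet &inlineTags, std::string_view name) {
  return !contains(inlineTags, name);
}
```

With an inline set containing "em", the call breaksSentence(inlineTags, "em") returns false, while "p" or "td" return true.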

TagNameSet inWordTags = {"wbr"}

List of elements that are, regardless of substituteInlineTagsWithSpaces, not substituted with spaces.

Technically, almost all inline elements should be treated like this, except maybe <br>. But in practice it seems to be more effective to limit this set to the one tag that can only really be used inside words: <wbr>. See also: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr

TagNameSet ignoredTags = {"code", "kbd", "samp", "var", "dir", "acronym", "math"}

List of elements we copy as is, but do parse as if they’re HTML because they could be nested.

For <script> we just scan for </script> because the script tag may not be nested, but that is not the case for these elements per se. Some tags, like <script>, are ignored at the Scanner level. See xh_scanner.cpp/Scanner::scanAttribute().

std::string continuationDelimiters = "\n ,.(){}[]"

List of characters that, when they occur at the start of a token, indicate that this token is probably not a continuation of a word.

This is also used to determine whether there should be a space after a closing tag or not. I.e. a . after a </strong> does not need to be separated by an extra space.
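A check of this kind might look like the sketch below. The function name startsWithDelimiter is hypothetical, and the delimiter string is passed in as a parameter rather than read from Options, so the example stays self-contained.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: returns true if the token starts with one of the
// delimiter characters, meaning it is probably not a continuation of the
// preceding word and needs no extra separating space.
bool startsWithDelimiter(std::string_view token, std::string_view delimiters) {
  return !token.empty() &&
         delimiters.find(token.front()) != std::string_view::npos;
}
```

Under this sketch, a token such as "." following </strong> is recognised as a delimiter and gets no extra space, while a token such as "der" is treated as a possible word continuation.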

bool substituteInlineTagsWithSpaces = true

Should we always add spaces to the places where tags used to be? I.e. should un<u>der</u>line become un der line? This does help with retaining tags inside words, or with odd pages that use CSS to add spacing between a lot of tags. Cases like <td> and <li> are already covered by treating them as sentence splitting.

struct Span

Span of text, with which a TagStack is associated.

A span may be empty, for example to represent the presence of an empty or VOID element.

Public Functions

size_t size() const

Public Members

size_t begin
size_t end
TagStack tags
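
As an illustration of how empty spans can represent void elements, here is a minimal sketch of the Span layout. It assumes that begin and end are offsets into the plain text with end exclusive, and that size() is simply end minus begin; neither detail is stated in this reference, and the Tag struct is reduced to a name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the documented Tag struct (name only).
struct Tag {
  std::string name;
};

using TagStack = std::vector<Tag *>;

struct Span {
  std::size_t begin;  // assumed: offset of the first character of the span
  std::size_t end;    // assumed: offset one past the last character
  TagStack tags;      // markup that applies to this stretch of text

  // Assumed definition: an empty span has size 0.
  std::size_t size() const { return end - begin; }
};
```

For input like `<p>Hi <img> there</p>`, the text "Hi " could be covered by a Span{0, 3, {p}} and the image by an empty Span{3, 3, {p, img}} whose size() is 0.
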
struct Tag

Represents a tag, or markup that is being applied to a string of text.

We treat all node types except ELEMENT as void elements or empty elements.

Public Types

enum NodeType

Values:

ELEMENT
VOID_ELEMENT
COMMENT
PROCESSING_INSTRUCTION
WHITESPACE

Public Members

NodeType type
std::string name
std::string attributes
std::string data