Struct HTML::Options

Nested Relationships

This struct is a nested type of Class HTML.

Struct Documentation

struct Options

Options struct that controls how HTML is interpreted.

Public Members

TagNameSet voidTags{"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame", "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr"}

List of elements for which we do not expect a closing tag, or self-closing elements in XHTML.

We do not need to see a closing tag for these elements, and they cannot contain text or tags themselves. See also: https://developer.mozilla.org/en-US/docs/Glossary/Empty_element. A more directly relevant source for this list: https://searchfox.org/mozilla-central/rev/7d17fd1fe9f0005a2fb19e5d53da4741b06a98ba/dom/base/FragmentOrElement.cpp#1791
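As a rough illustration of how such a set might be consulted, the sketch below uses a plain std::unordered_set<std::string> as a stand-in for TagNameSet (whose real definition is not shown in this section) and a hypothetical helper, expectsClosingTag. Neither name is part of the documented API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Assumption: stand-in for TagNameSet, whose real definition is not shown here.
using TagNameSetSketch = std::unordered_set<std::string>;

// Hypothetical helper: returns true if a closing tag should be expected
// for the element called `name`, i.e. if it is not listed as a void tag.
bool expectsClosingTag(const TagNameSetSketch &voidTags, const std::string &name) {
  return voidTags.count(name) == 0;
}

// Small usage example with a shortened void tag list.
void demoVoidTags() {
  TagNameSetSketch voidTags{"br", "img", "wbr"};
  assert(!expectsClosingTag(voidTags, "br"));  // <br> never gets a </br>
  assert(expectsClosingTag(voidTags, "p"));    // <p> is a regular element
}
```

Note that this stand-in compares names exactly, so "BR" and "br" would be treated as different tags.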

TagNameSet inlineTags{"abbr", "a", "b", "em", "i", "kbd", "mark", "math", "output", "q", "ruby", "small", "span", "strong", "sub", "sup", "time", "u", "var", "wbr", "ins", "del", "img"}

List of elements that are treated as inline, meaning they do not break up sentences.

Any element not in this list will cause the text that follows its open or close tag to be treated as a separate sentence.
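The rule above can be sketched on a simplified token stream. Everything in this example is an assumption made for illustration: the Piece struct, the splitSentences helper, and the use of std::unordered_set<std::string> in place of TagNameSet. It is not the parser's actual implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Assumption: stand-in for TagNameSet, whose real definition is not shown here.
using TagNameSetSketch = std::unordered_set<std::string>;

// Hypothetical, simplified view of parsed HTML: either a tag or a run of text.
struct Piece {
  bool isTag;         // true for an open or close tag, false for text
  std::string value;  // tag name (for tags) or text content (for text)
};

// Concatenates text across inline tags and starts a new sentence at every
// tag that is not listed in `inlineTags`. Empty sentences are skipped.
std::vector<std::string> splitSentences(const std::vector<Piece> &pieces,
                                        const TagNameSetSketch &inlineTags) {
  std::vector<std::string> sentences;
  std::string current;
  for (const Piece &piece : pieces) {
    if (!piece.isTag) {
      current += piece.value;
    } else if (inlineTags.count(piece.value) == 0 && !current.empty()) {
      sentences.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) sentences.push_back(current);
  return sentences;
}

// Usage example: <p>Hello <b>world</b></p><p>Bye</p>
void demoSplitSentences() {
  std::vector<Piece> pieces{{true, "p"}, {false, "Hello "}, {true, "b"},
                            {false, "world"}, {true, "b"}, {true, "p"},
                            {true, "p"}, {false, "Bye"}, {true, "p"}};
  std::vector<std::string> sentences = splitSentences(pieces, {"b"});
  assert(sentences.size() == 2);
  assert(sentences[0] == "Hello world");  // <b> does not break the sentence
  assert(sentences[1] == "Bye");          // <p> does
}
```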

TagNameSet inWordTags = {"wbr"}

List of elements that are, regardless of substituteInlineTagsWithSpaces, not substituted with spaces.

Technically, almost all inline elements should be treated like this, except perhaps <br>. In practice, however, it seems to be more effective to limit this set to the one tag that can only really be used inside words: <wbr>. See also: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr

TagNameSet ignoredTags = {"code", "kbd", "samp", "var", "dir", "acronym", "math"}

List of elements we copy as is, but do parse as if they’re HTML because they could be nested.

For <script> we just scan for </script> because script tags cannot be nested, but that is not necessarily the case for these elements. Some tags, like <script>, are ignored at the Scanner level. See xh_scanner.cpp/Scanner::scanAttribute().

std::string continuationDelimiters = "\n ,.(){}[]"

List of characters that, when they occur at the start of a token, indicate that this token is probably not a continuation of a word.

This is also used to determine whether there should be a space after a closing tag or not. For example, a . after a </strong> does not need to be separated by an extra space.
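A minimal sketch of how such a delimiter string might be checked. The helper name startsWithDelimiter is hypothetical, and the delimiter string is passed in as a parameter rather than read from Options, so the example makes no claim about how the library itself performs this check.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `token` begins with one of the characters in
// `delimiters`, meaning it probably does not continue the preceding word
// and needs no extra space in front of it.
bool startsWithDelimiter(const std::string &delimiters, const std::string &token) {
  return !token.empty() && delimiters.find(token.front()) != std::string::npos;
}

// Usage example with a shortened, illustrative delimiter list.
void demoStartsWithDelimiter() {
  const std::string delimiters = " ,.";
  assert(startsWithDelimiter(delimiters, "."));      // e.g. the . after </strong>
  assert(!startsWithDelimiter(delimiters, "line"));  // looks like a word continuation
  assert(!startsWithDelimiter(delimiters, ""));      // empty tokens never match
}
```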

bool substituteInlineTagsWithSpaces = true

Should we always add spaces to the places where tags used to be?

I.e. should un<u>der</u>line become un der line? This does help with retaining tags inside words, or with odd pages that use CSS to add spacing between a lot of tags. Cases like <td> and <li> are already covered by treating them as sentence splitting.
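The interaction between this flag and inWordTags can be sketched as follows. OptionsSketch, Piece, and stripInlineTags are hypothetical names used only for illustration, std::unordered_set<std::string> stands in for TagNameSet, and the choice to avoid doubled spaces is an assumption of this sketch, not documented behaviour.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Assumption: stand-in for TagNameSet, whose real definition is not shown here.
using TagNameSetSketch = std::unordered_set<std::string>;

// Hypothetical cut-down copy of the two options this sketch needs.
struct OptionsSketch {
  TagNameSetSketch inWordTags{"wbr"};
  bool substituteInlineTagsWithSpaces = true;
};

// Hypothetical, simplified view of parsed HTML: either a tag or a run of text.
struct Piece {
  bool isTag;         // true for an open or close tag, false for text
  std::string value;  // tag name (for tags) or text content (for text)
};

// Removes all tags from `pieces`. When the flag is set, each removed tag
// leaves a single space behind, unless the tag is listed in inWordTags.
std::string stripInlineTags(const std::vector<Piece> &pieces,
                            const OptionsSketch &options) {
  std::string out;
  for (const Piece &piece : pieces) {
    if (!piece.isTag) {
      out += piece.value;
      continue;
    }
    const bool inWord = options.inWordTags.count(piece.value) != 0;
    // Sketch-only simplification: never emit a leading or doubled space.
    if (options.substituteInlineTagsWithSpaces && !inWord && !out.empty() &&
        out.back() != ' ') {
      out += ' ';
    }
  }
  return out;
}

// Usage example: un<u>der</u>line and un<wbr>derline
void demoStripInlineTags() {
  std::vector<Piece> underline{{false, "un"}, {true, "u"}, {false, "der"},
                               {true, "u"}, {false, "line"}};
  std::vector<Piece> wordBreak{{false, "un"}, {true, "wbr"}, {false, "derline"}};

  OptionsSketch options;  // substituteInlineTagsWithSpaces defaults to true
  assert(stripInlineTags(underline, options) == "un der line");
  assert(stripInlineTags(wordBreak, options) == "underline");  // wbr is exempt

  options.substituteInlineTagsWithSpaces = false;
  assert(stripInlineTags(underline, options) == "underline");
}
```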