Struct HTML::Options
Defined in File html.h
Nested Relationships
This struct is a nested type of Class HTML.
Struct Documentation
-
struct Options
Options struct that controls how HTML is interpreted.
Public Members
-
TagNameSet voidTags {"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame", "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr"}
List of elements for which we do not expect a closing tag, or which are self-closing in XHTML.
We do not need to see a closing tag for these elements, and they cannot contain text or tags themselves. See also: https://developer.mozilla.org/en-US/docs/Glossary/Empty_element. A more relevant source for this list: https://searchfox.org/mozilla-central/rev/7d17fd1fe9f0005a2fb19e5d53da4741b06a98ba/dom/base/FragmentOrElement.cpp#1791
-
TagNameSet inlineTags {"abbr", "a", "b", "em", "i", "kbd", "mark", "math", "output", "q", "ruby", "small", "span", "strong", "sub", "sup", "time", "u", "var", "wbr", "ins", "del", "img"}
List of elements that are treated as inline, meaning they do not break up sentences.
Any element not in this list will cause the text that follows its open or close tag to be treated as a separate sentence.
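The rule above can be sketched as a simple set lookup. This is only an illustration: `TagNameSetSketch` and `breaksSentence` are hypothetical names, and the real `TagNameSet` type in html.h may differ (for example in how it handles letter case).

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for TagNameSet; the real type is defined in html.h.
using TagNameSetSketch = std::unordered_set<std::string>;

// Hypothetical helper: returns true when text following an open or close tag
// with this name should be treated as a separate sentence, i.e. when the tag
// is NOT listed as inline.
bool breaksSentence(const TagNameSetSketch &inlineTags, const std::string &tagName) {
  return inlineTags.count(tagName) == 0;
}
```

With an inline set containing `em`, a tag such as `<em>` would leave the sentence intact, while `<td>` would start a new one.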
-
TagNameSet inWordTags = {"wbr"}
List of elements that are not substituted with spaces, regardless of substituteInlineTagsWithSpaces.
Technically almost all inline elements should be treated like this, except maybe <br>. In practice, however, it seems to be more effective to limit this set to the one tag that can only really be used inside words: <wbr>. See also: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr
-
TagNameSet ignoredTags = {"code", "kbd", "samp", "var", "dir", "acronym", "math"}
List of elements we copy as is, but do parse as if they’re HTML because they could be nested.
For <script> we just scan for </script> because the script tag may not be nested, but that is not necessarily the case for these elements. Some tags, like <script>, are ignored at the Scanner level. See xh_scanner.cpp/Scanner::scanAttribute().
-
std::string continuationDelimiters = "\n ,.(){}[]"
List of characters that, when they occur at the start of a token, indicate that the token is probably not a continuation of a word.
This is also used to determine whether there should be a space after a closing tag or not. For example, a . after a </strong> does not need to be separated by an extra space.
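As a rough illustration of the check described above, the sketch below tests the first character of a token against the delimiter list. The helper name `isContinuation` and the constant `kContinuationDelimiters` are hypothetical, and the delimiter string assumes the documented default begins with a newline character. The actual logic in html.h may be more involved.

```cpp
#include <cassert>
#include <string>

// Assumed default value, mirroring the documented continuationDelimiters
// (with the first character taken to be a newline).
const std::string kContinuationDelimiters = "\n ,.(){}[]";

// Hypothetical helper: returns true when `token` probably continues the
// previous word, i.e. it is non-empty and does not start with a delimiter.
bool isContinuation(const std::string &delimiters, const std::string &token) {
  return !token.empty() && delimiters.find(token.front()) == std::string::npos;
}
```

Under these assumptions, a token starting with `.` is not a continuation, so no extra space would be needed between it and a preceding `</strong>`.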
-
bool substituteInlineTagsWithSpaces = true
Should we always add spaces in the places where tags used to be? For example, should un<u>der</u>line become un der line? This does help with retaining tags inside words, or with odd pages that use CSS to add spacing between a lot of tags. Cases like <td> and <li> are already covered by treating them as sentence splitting.
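To make the effect of this flag concrete, here is a deliberately simplified sketch that strips tags from a string and optionally leaves a space where each tag was, skipping tags listed in an `inWordTags`-style set. The function `stripTagsSketch` is hypothetical and is not how the library processes HTML: it ignores attributes containing `>`, letter case, sentence-splitting tags, and the continuationDelimiters check.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical illustration only: removes tags from `html`, inserting a space
// in place of each tag when `substituteWithSpaces` is true, unless the tag
// name appears in `inWordTags`.
std::string stripTagsSketch(const std::string &html, bool substituteWithSpaces,
                            const std::unordered_set<std::string> &inWordTags) {
  std::string out;
  std::size_t pos = 0;
  while (pos < html.size()) {
    if (html[pos] != '<') {
      out += html[pos++];  // ordinary text is copied unchanged
      continue;
    }
    std::size_t end = html.find('>', pos);
    if (end == std::string::npos) {
      out.append(html, pos, std::string::npos);  // unterminated tag: keep as text
      break;
    }
    // Read the tag name, skipping '<' and an optional '/' of a closing tag.
    std::size_t nameStart = pos + 1;
    if (nameStart < end && html[nameStart] == '/') ++nameStart;
    std::size_t nameEnd = nameStart;
    while (nameEnd < end &&
           std::isalnum(static_cast<unsigned char>(html[nameEnd]))) {
      ++nameEnd;
    }
    const std::string name = html.substr(nameStart, nameEnd - nameStart);
    if (substituteWithSpaces && inWordTags.count(name) == 0) out += ' ';
    pos = end + 1;  // continue after the closing '>'
  }
  return out;
}
```

With the flag enabled and `wbr` as the only in-word tag, `un<u>der</u>line` yields `un der line`, while `in<wbr>side` stays joined as `inside`.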
-
TagNameSet