For XHTML documents, resolve named character entities (e.g., )
using the HTML entity table via a getEntity SAX callback. This avoids
parsing a large embedded DTD on every document and matches the approach
used by Blink and WebKit.
This also removes the now-unused DTD infrastructure:
- Remove resolve_external_resource callback from Parser::Options
- Remove resolve_xml_resource() function and its ~60KB embedded DTD
- Remove all call sites passing the unused callback
This change adds a limit of 5000 on the count for how deeply elements
can be nested in documents parsed with our XML parser. Blink and WebKit
both have such a limit, and both set it at 5000.
This prevents bad actors from performing attacks by giving us XML docs
with pathological levels of nesting, and causing stack exhaustion.
This change replaces our LibXML parser with a new implementation that
wraps libxml2's SAX2 API.
The new Parser class uses libxml2's SAX2 callbacks to drive the existing
XML::Listener interface. That preserves backward compatibility with all
existing consumers (XMLDocumentBuilder, DOMParser, etc.).
`set_source` takes a ByteString but the implementation might require a
specific encoding. Make it fallible so that we don't need to crash in
the case of invalid UTF-8 or similar.
The test includes a sequence of invalid UTF-8 bytes that crash the
browser without this change.
* Remove completely unused methods.
* Deduplicate methods that were overloaded with both StringView and
char const* parameters.
A future commit will templatize GenericLexer by char type. This patch
serves to make that a tiny bit easier.
This allows listeners to be notified when a CDATASection or
ProcessingInstruction is encountered during parsing. The non-listener
path still has the incorrect behavior of silently treating CDATASection
as Text nodes, but this allows listeners to handle them correctly.
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float
However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.
We can also use fast_float for integer parsing.