13 commits

Author SHA1 Message Date
sideshowbarker
1b41659efd LibXML+LibWeb: Use existing HTML entities table for XML parsing too
For XHTML documents, resolve named character entities (e.g., &nbsp;)
using the HTML entity table via a getEntity SAX callback. This avoids
parsing a large embedded DTD on every document and matches the approach
used by Blink and WebKit.

This also removes the now-unused DTD infrastructure:

- Remove resolve_external_resource callback from Parser::Options
- Remove resolve_xml_resource() function and its ~60KB embedded DTD
- Remove all call sites passing the unused callback
2026-01-09 19:13:41 +00:00
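
To illustrate the table-lookup idea in the commit above, here is a minimal sketch of resolving a named entity from an in-memory table instead of parsing a DTD. The function name `resolve_named_entity` and the five-entry table are hypothetical stand-ins; the actual HTML entity table and the libxml2 callback wiring are not shown.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in for the HTML named-entity table. The real table
// is far larger; only a few well-known entries are listed here.
inline std::optional<std::string> resolve_named_entity(std::string_view name)
{
    static const std::unordered_map<std::string_view, std::string> table = {
        { "amp", "&" },
        { "lt", "<" },
        { "gt", ">" },
        { "nbsp", "\xC2\xA0" }, // U+00A0 encoded as UTF-8
        { "copy", "\xC2\xA9" }, // U+00A9 encoded as UTF-8
    };

    auto it = table.find(name);
    if (it == table.end())
        return std::nullopt; // Unknown name: let the caller report an error.
    return it->second;
}
```

A parser callback that receives an entity name could call a function like this and hand the replacement text back to the parser, so no DTD needs to be loaded per document.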
sideshowbarker
cfe5ef32e1 LibXML: Add element-nesting depth limit for XML-parsed documents
This change adds a limit of 5000 on the count for how deeply elements
can be nested in documents parsed with our XML parser. Blink and WebKit
both have such a limit, and both set it at 5000.

This prevents bad actors from performing attacks by giving us XML docs
with pathological levels of nesting, and causing stack exhaustion.
2026-01-08 14:49:12 +01:00
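
A nesting limit like the one described above can be sketched as a counter that is incremented on each start tag and decremented on each end tag. The class name `NestingDepthGuard` is hypothetical, and whether the real limit admits exactly 5000 levels or 4999 is an assumption here.

```cpp
#include <cstddef>

// Hypothetical helper, not the actual LibXML code: tracks how deeply
// elements are nested and refuses to go past a fixed limit.
class NestingDepthGuard {
public:
    // 5000 is the limit named in the commit message.
    static constexpr std::size_t max_depth = 5000;

    // Call when a start tag is seen. Returns false if opening this element
    // would exceed the limit; the caller should then stop parsing.
    bool enter_element()
    {
        if (m_depth >= max_depth)
            return false;
        ++m_depth;
        return true;
    }

    // Call when the matching end tag is seen.
    void leave_element()
    {
        if (m_depth > 0)
            --m_depth;
    }

    std::size_t depth() const { return m_depth; }

private:
    std::size_t m_depth { 0 };
};
```

Rejecting the document at the limit keeps any recursive processing of the resulting tree bounded, which is the stack-exhaustion concern the commit mentions.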
Tim Ledbetter
a48aa62b7a LibXML: Prevent auto-detection of UTF-32 encoding by libxml2 2026-01-08 10:06:40 +01:00
sideshowbarker
fac81e84ba LibXML: Replace the existing XML parser with libxml2 parsing
This change replaces our LibXML parser with a new implementation that
wraps libxml2's SAX2 API.

The new Parser class uses libxml2's SAX2 callbacks to drive the existing
XML::Listener interface. That preserves backward compatibility with all
existing consumers (XMLDocumentBuilder, DOMParser, etc.).
2026-01-07 14:38:52 +01:00
rmg-x
b9554038ff LibWeb+LibXML: Make Listener::set_source(ByteString) fallible
`set_source` takes a ByteString but the implementation might require a
specific encoding. Make it fallible so that we don't need to crash in
the case of invalid UTF-8 or similar.

The test includes a sequence of invalid UTF-8 bytes that crash the
browser without this change.
2025-10-02 02:25:28 +02:00
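
The idea of a fallible `set_source` can be sketched as follows: validate the bytes first and return an error instead of crashing. This is an illustration only. `FallibleSourceListener` is a hypothetical class, and `std::optional<std::string>` stands in for whatever error type the project actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Returns true if `bytes` is well-formed UTF-8. Rejects overlong forms,
// surrogate code points, and code points above U+10FFFF.
inline bool is_valid_utf8(std::string_view bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        auto lead = static_cast<std::uint8_t>(bytes[i]);
        std::size_t length = 0;
        std::uint32_t code_point = 0;
        std::uint32_t minimum = 0; // Smallest code point allowed for this length.

        if (lead < 0x80) {
            length = 1; code_point = lead; minimum = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2; code_point = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3; code_point = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4; code_point = lead & 0x07; minimum = 0x10000;
        } else {
            return false; // Stray continuation byte or invalid lead byte.
        }

        if (i + length > bytes.size())
            return false; // Truncated sequence.

        for (std::size_t k = 1; k < length; ++k) {
            auto next = static_cast<std::uint8_t>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                return false;
            code_point = (code_point << 6) | (next & 0x3F);
        }

        if (code_point < minimum || code_point > 0x10FFFF)
            return false;
        if (code_point >= 0xD800 && code_point <= 0xDFFF)
            return false;

        i += length;
    }
    return true;
}

// Hypothetical listener: set_source reports failure instead of crashing.
class FallibleSourceListener {
public:
    // Returns an error message on failure, std::nullopt on success.
    std::optional<std::string> set_source(std::string source)
    {
        if (!is_valid_utf8(source))
            return std::string("source is not valid UTF-8");
        m_source = std::move(source);
        return std::nullopt;
    }

    std::string const& source() const { return m_source; }

private:
    std::string m_source;
};
```

With this shape, the caller decides how to handle bad input, for example by aborting the parse and surfacing an error to the page.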
Andreas Kling
b7595013c1 LibWeb+LibXML: Preserve element attribute order in XML documents
We now use OrderedHashMap instead of HashMap to ensure that attributes
on XML elements retain their original order.
2025-08-22 11:35:59 +02:00
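
For readers unfamiliar with the distinction, the following sketch shows what an insertion-ordered map provides over a plain hash map. `OrderedAttributes` is a hypothetical standard-library approximation, not the project's `OrderedHashMap`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered attribute store: a vector keeps the
// order of first insertion, and a hash index gives fast lookup by name.
class OrderedAttributes {
public:
    // Inserts a new attribute at the end, or overwrites the value of an
    // existing one while keeping its original position.
    void set(std::string const& name, std::string value)
    {
        auto it = m_index.find(name);
        if (it != m_index.end()) {
            m_entries[it->second].second = std::move(value);
            return;
        }
        m_index.emplace(name, m_entries.size());
        m_entries.emplace_back(name, std::move(value));
    }

    // Attributes in the order they were first set.
    std::vector<std::pair<std::string, std::string>> const& entries() const
    {
        return m_entries;
    }

private:
    std::vector<std::pair<std::string, std::string>> m_entries;
    std::unordered_map<std::string, std::size_t> m_index;
};
```

Iterating `entries()` yields attributes in source order, which is what serialization and DOM attribute enumeration need; a plain hash map would yield an unspecified order.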
Timothy Flynn
28d9d3a2c7 AK+Libraries: Reduce API surface of GenericLexer a bit
* Remove completely unused methods.
* Deduplicate methods that were overloaded with both StringView and
  char const* parameters.

A future commit will templatize GenericLexer by char type. This patch
serves to make that a tiny bit easier.
2025-08-13 09:56:13 -04:00
Andrew Kaster
d9976b98b9 LibXML: Add parser hooks for CDATASection and ProcessingInstructions
This allows listeners to be notified when a CDATASection or
ProcessingInstruction is encountered during parsing. The non-listener
path still has the incorrect behavior of silently treating CDATASection
as Text nodes, but this allows listeners to handle them correctly.
2025-07-19 14:56:20 +02:00
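
The listener hooks described above might look roughly like the sketch below. The type and method names (`NodeListener`, `cdata_section`, `processing_instruction`) are hypothetical and chosen for illustration; the real `XML::Listener` interface may differ.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical listener interface with separate hooks per node kind.
// The default implementations do nothing, so existing listeners that
// only care about text keep working unchanged.
struct NodeListener {
    virtual ~NodeListener() = default;
    virtual void text(std::string_view) {}
    virtual void cdata_section(std::string_view) {}
    virtual void processing_instruction(std::string_view, std::string_view) {}
};

// Example listener that records which hook fired for each node.
struct RecordingListener final : NodeListener {
    std::vector<std::string> events;

    void text(std::string_view data) override
    {
        events.push_back("text:" + std::string(data));
    }

    void cdata_section(std::string_view data) override
    {
        events.push_back("cdata:" + std::string(data));
    }

    void processing_instruction(std::string_view target, std::string_view data) override
    {
        events.push_back("pi:" + std::string(target) + " " + std::string(data));
    }
};
```

Because CDATA sections arrive through their own hook, a listener can create a distinct node type for them rather than folding them into text.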
Timothy Flynn
62d9a84b8d AK+Everywhere: Replace custom number parsers with fast_float
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float

However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.

We can also use fast_float for integer parsing.
2025-07-03 09:51:56 -04:00
mikiubo
cd576e594d LibXml: Notify listener when doctype is parsed 2025-01-20 14:48:19 +01:00
Timothy Flynn
488034477a Revert "LibWeb: Set doctype node immediately while parsing XML document"
This reverts commit cd446e5e9c.

This broke about 20k WPT subtests, all related to XML parsing. See:
https://wpt.fyi/results/html/the-xhtml-syntax/parsing-xhtml-documents?diff=&filter=ADC&run_id=5154815472828416&run_id=5090731742199808
2024-11-20 19:11:56 -05:00
Andreas Kling
cd446e5e9c LibWeb: Set doctype node immediately while parsing XML document
Instead of deferring it to the end of parsing, where scripts that
were expecting to look at the doctype may have already run.
2024-11-20 16:10:57 +01:00
Timothy Flynn
93712b24bf Everywhere: Hoist the Libraries folder to the top-level 2024-11-10 12:50:45 +01:00
Renamed from Userland/Libraries/LibXML/Parser/Parser.cpp