/*
 * Copyright (c) 2020, Andreas Kling <andreas@ladybird.org>
 * Copyright (c) 2022, Linus Groh <linusg@serenityos.org>
 *
 * SPDX-License-Identifier: BSD-2-Clause
 */

#include <AK/Debug.h>
#include <AK/FFIHelpers.h>
LibWeb: Replace the HTML tokenizer with Rust
Replace the C++ HTML tokenizer with a Rust implementation behind the
existing HTMLTokenizer API.
Keep the parser-facing integration points for streaming input,
insertion points, document.write(), EOF insertion, parser aborts,
speculative parser input, and last start tag tracking. The generated
FFI handle stays an implementation detail of HTMLTokenizer, so callers
keep a single tokenizer class.
Preserve duplicate attributes through FFI so C++ token normalization can
record the duplicate-attribute signal used by CSP nonce checks. Keep
bulk tag-name and attribute scans capped at the active insertion point
so streamed parser input is spliced at the right offset.
Use generated DAFSA tables for named character references and intern
common tag and attribute names to reduce FFI marshalling overhead. This
also fixes attribute name source positions, nested old insertion points,
and aborted fast-path handling.
TestHTMLTokenizer covers duplicate attributes and insertion points in
fast tag-name, attribute-name, and quoted-value scans. A CSP text test
covers duplicate nonce attributes on parser-created script elements.
The tokenizer dump fixtures still match, TestHTMLTokenizer passes, and
the full release test-web run passes with 6981 tests and 226 skipped.
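
The interning idea mentioned above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the tokenizer's actual code: it uses only the C++ standard library, and the helpers `build_name_table` and `lookup_name` are hypothetical names. The idea is that the producer sends a small integer id instead of the name's bytes, with id 0 reserved to mean "not interned".

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: builds a lookup table in which slot 0 is a reserved
// empty placeholder and ids 1..N map to the given names in order.
std::vector<std::string> build_name_table(std::vector<std::string> const& names)
{
    // Ids travel as 16-bit values, and 0 is reserved as the sentinel.
    assert(names.size() < UINT16_MAX);

    std::vector<std::string> table;
    table.reserve(names.size() + 1);
    table.emplace_back(); // Slot 0: "not interned".
    for (auto const& name : names)
        table.push_back(name);
    return table;
}

// Hypothetical helper: resolves an id to its name. Id 0 and any out-of-range
// id fall back to the empty placeholder in slot 0. The table must come from
// build_name_table(), so slot 0 always exists.
std::string const& lookup_name(std::vector<std::string> const& table, uint16_t id)
{
    if (id == 0 || id >= table.size())
        return table[0];
    return table[id];
}
```

With a table built from `"div"` and `"span"`, `lookup_name(table, 1)` yields `"div"`, while id 0 or an unknown id yields the empty placeholder, which a caller could treat as a signal to read the name from the token payload instead.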
#include <AK/FlyString.h>
#include <AK/NeverDestroyed.h>
#include <AK/StringBuilder.h>
#include <AK/Utf8View.h>
#include <AK/Vector.h>
#include <LibTextCodec/Decoder.h>
#include <LibWeb/HTML/Parser/HTMLToken.h>
#include <LibWeb/HTML/Parser/HTMLTokenizer.h>
#include <LibWeb/HTMLTokenizerRustFFI.h>

namespace Web::HTML {

static Vector<u32> code_points_from_string(String const& string)
{
    Vector<u32> code_points;
    code_points.ensure_capacity(string.bytes().size());
    for (auto code_point : string.code_points())
        code_points.append(code_point);
    return code_points;
}

static RustFfiTokenizerHandle* create_tokenizer_from_utf8(StringView utf8_bytes)
{
    auto* bytes = reinterpret_cast<u8 const*>(utf8_bytes.characters_without_null_termination());
    if (bytes == nullptr)
        bytes = reinterpret_cast<u8 const*>("");
    return rust_html_tokenizer_create_from_utf8(bytes, utf8_bytes.length());
}

static String decoded_string_for_utf8_tokenizer(StringView input)
{
    Utf8View utf8_view { input };
    if (utf8_view.validate(AllowLonelySurrogates::No))
        return String::from_utf8_without_validation(input.bytes());

    // Decoded strings may come from WTF-16 JS strings. Rust's UTF-8 path
    // requires scalar-value UTF-8, so replace lone surrogates but keep BOMs.
    VERIFY(utf8_view.validate());

    StringBuilder builder(input.length());
    for (auto code_point : utf8_view)
        builder.append_code_point(is_unicode_surrogate(code_point) ? AK::UnicodeUtils::REPLACEMENT_CODE_POINT : code_point);
    return builder.to_string_without_validation();
}

static Vector<FlyString> build_interned_name_table(size_t count, void (*fetch)(uint16_t, uint8_t const**, size_t*))
{
    Vector<FlyString> table;
    // Slot 0 is unused (id 0 means "not interned"); store an empty FlyString there.
    table.append(FlyString {});
    table.ensure_capacity(count + 1);
    for (size_t i = 0; i < count; ++i) {
        uint8_t const* ptr = nullptr;
        size_t len = 0;
        fetch(static_cast<uint16_t>(i + 1), &ptr, &len);
        if (ptr == nullptr || len == 0) {
            table.append(FlyString {});
            continue;
        }
        table.append(MUST(FlyString::from_utf8(StringView { ptr, len })));
    }
    return table;
}

static FlyString const& interned_rust_tag_name(uint16_t id)
{
    static NeverDestroyed<Vector<FlyString>> table { build_interned_name_table(
        rust_html_tokenizer_interned_tag_name_count(),
        rust_html_tokenizer_interned_tag_name) };
    if (id == 0 || id >= table->size())
        return (*table)[0];
    return (*table)[id];
}

static FlyString const& interned_rust_attr_name(uint16_t id)
{
    static NeverDestroyed<Vector<FlyString>> table { build_interned_name_table(
        rust_html_tokenizer_interned_attr_name_count(),
        rust_html_tokenizer_interned_attr_name) };
    if (id == 0 || id >= table->size())
        return (*table)[0];
    return (*table)[id];
}

HTMLTokenizer::HTMLTokenizer()
{
    m_tokenizer = create_tokenizer_from_utf8({});
    rust_html_tokenizer_set_input_stream_closed(m_tokenizer, false);
}

HTMLTokenizer::~HTMLTokenizer()
{
    if (m_tokenizer)
        rust_html_tokenizer_destroy(m_tokenizer);
}

HTMLTokenizer::HTMLTokenizer(StringView input, ByteString const& encoding, InputType input_type)
{
    if (input_type == InputType::EncodedBytes) {
        auto decoder = TextCodec::decoder_for(encoding);
        VERIFY(decoder.has_value());
        m_source = MUST(decoder->to_utf8(input));
    } else {
        m_source = decoded_string_for_utf8_tokenizer(input);
    }
    m_input_stream_closed = true;
    m_tokenizer = create_tokenizer_from_utf8(m_source.bytes_as_string_view());
}

Optional<HTMLToken> HTMLTokenizer::next_token(StopAtInsertionPoint stop_at_insertion_point)
{
    RustFfiToken ffi;
    bool stop = stop_at_insertion_point == StopAtInsertionPoint::Yes;
    bool cdata_allowed = false;
    if (!rust_html_tokenizer_next_token(m_tokenizer, &ffi, stop, cdata_allowed))
        return {};
the full release test-web run passes with 6981 tests and 226 skipped.
2026-05-15 15:13:43 +02:00
|
|
|
    // Map the numeric token type reported by the Rust tokenizer onto HTMLToken::Type.
    HTMLToken::Type type;
    switch (ffi.token_type) {
    case 1:
        type = HTMLToken::Type::DOCTYPE;
        break;
    case 2:
        type = HTMLToken::Type::StartTag;
        break;
    case 3:
        type = HTMLToken::Type::EndTag;
        break;
    case 4:
        type = HTMLToken::Type::Comment;
        break;
    case 5:
        type = HTMLToken::Type::Character;
        break;
    case 6:
        type = HTMLToken::Type::EndOfFile;
        break;
    default:
        VERIFY_NOT_REACHED();
    }

    HTMLToken token { type };
    token.set_start_position({}, { ffi.start_line, ffi.start_column });
    token.set_end_position({}, { ffi.end_line, ffi.end_column });

    switch (type) {
    case HTMLToken::Type::Character:
        token.set_code_point(ffi.code_point);
        break;
    case HTMLToken::Type::StartTag:
    case HTMLToken::Type::EndTag: {
        // A non-zero ID refers to an interned name; otherwise the name is copied from the FFI buffer.
        if (ffi.tag_name_id != 0)
            token.set_tag_name(interned_rust_tag_name(ffi.tag_name_id));
        else
            token.set_tag_name(MUST(FlyString::from_utf8(ffi_string_view(ffi.tag_name_ptr, ffi.tag_name_len))));

        token.set_self_closing(ffi.self_closing);
        for (size_t i = 0; i < ffi.attributes_len; ++i) {
            auto const& ffi_attribute = ffi.attributes_ptr[i];
            HTMLToken::Attribute attribute;
            if (ffi_attribute.name_id != 0)
                attribute.local_name = interned_rust_attr_name(ffi_attribute.name_id);
            else
                attribute.local_name = MUST(FlyString::from_utf8(ffi_string_view(ffi_attribute.name_ptr, ffi_attribute.name_len)));
            attribute.value = MUST(String::from_utf8(ffi_string_view(ffi_attribute.value_ptr, ffi_attribute.value_len)));
            attribute.name_start_position = { ffi_attribute.name_start_line, ffi_attribute.name_start_column };
            attribute.name_end_position = { ffi_attribute.name_end_line, ffi_attribute.name_end_column };
            attribute.value_start_position = { ffi_attribute.value_start_line, ffi_attribute.value_start_column };
            attribute.value_end_position = { ffi_attribute.value_end_line, ffi_attribute.value_end_column };
            token.add_attribute(move(attribute));
        }
        token.normalize_attributes();
        if (ffi.had_duplicate_attribute)
            token.set_had_duplicate_attribute({});
        break;
    }
    case HTMLToken::Type::Comment:
        token.set_comment(MUST(String::from_utf8(ffi_string_view(ffi.comment_ptr, ffi.comment_len))));
        break;
    case HTMLToken::Type::DOCTYPE: {
        auto& doctype = token.ensure_doctype_data();
        if (!ffi.missing_name) {
            doctype.name = MUST(String::from_utf8(ffi_string_view(ffi.doctype_name_ptr, ffi.doctype_name_len)));
            doctype.missing_name = false;
        }
        if (!ffi.missing_public_id) {
            doctype.public_identifier = MUST(String::from_utf8(ffi_string_view(ffi.public_id_ptr, ffi.public_id_len)));
            doctype.missing_public_identifier = false;
        }
        if (!ffi.missing_system_id) {
            doctype.system_identifier = MUST(String::from_utf8(ffi_string_view(ffi.system_id_ptr, ffi.system_id_len)));
            doctype.missing_system_identifier = false;
        }
        doctype.force_quirks = ffi.force_quirks;
        break;
    }
    case HTMLToken::Type::EndOfFile:
        break;
    case HTMLToken::Type::Invalid:
        VERIFY_NOT_REACHED();
    }

    return token;
}

void HTMLTokenizer::parser_did_run(Badge<HTMLParser>)
{
    rust_html_tokenizer_parser_did_run(m_tokenizer);
}

String HTMLTokenizer::unparsed_input() const
{
    uint8_t const* ptr = nullptr;
    size_t len = 0;
    rust_html_tokenizer_unparsed_input(m_tokenizer, &ptr, &len);
    return MUST(String::from_utf8(ffi_string_view(ptr, len)));
}

void HTMLTokenizer::append_to_input_stream(StringView input)
{
    if (input.is_empty())
        return;

    auto utf8_input = MUST(String::from_utf8(input));
    auto code_points = code_points_from_string(utf8_input);
    rust_html_tokenizer_append_input(m_tokenizer, code_points.data(), code_points.size());
}

void HTMLTokenizer::close_input_stream()
{
    m_input_stream_closed = true;
    rust_html_tokenizer_set_input_stream_closed(m_tokenizer, true);
}

void HTMLTokenizer::insert_input_at_insertion_point(StringView input)
{
    auto utf8_input = MUST(String::from_utf8(input));
    auto code_points = code_points_from_string(utf8_input);
    rust_html_tokenizer_insert_input(m_tokenizer, code_points.data(), code_points.size());
}

void HTMLTokenizer::insert_eof()
{
    close_input_stream();
    rust_html_tokenizer_insert_eof(m_tokenizer);
}

bool HTMLTokenizer::is_insertion_point_defined() const
{
    return rust_html_tokenizer_is_insertion_point_defined(m_tokenizer);
}

bool HTMLTokenizer::is_insertion_point_reached()
{
    return rust_html_tokenizer_is_insertion_point_reached(m_tokenizer);
}
|
2023-08-24 17:01:19 -04:00
|
|
|
|
LibWeb: Replace the HTML tokenizer with Rust
Replace the C++ HTML tokenizer with a Rust implementation behind the
existing HTMLTokenizer API.
Keep the parser-facing integration points for streaming input,
insertion points, document.write(), EOF insertion, parser aborts,
speculative parser input, and last start tag tracking. The generated
FFI handle stays an implementation detail of HTMLTokenizer, so callers
keep a single tokenizer class.
Preserve duplicate attributes through FFI so C++ token normalization can
record the duplicate-attribute signal used by CSP nonce checks. Keep
bulk tag-name and attribute scans capped at the active insertion point
so streamed parser input is spliced at the right offset.
Use generated DAFSA tables for named character references and intern
common tag and attribute names to reduce FFI marshalling overhead. This
also fixes attribute name source positions, nested old insertion points,
and aborted fast-path handling.
TestHTMLTokenizer covers duplicate attributes and insertion points in
fast tag-name, attribute-name, and quoted-value scans. A CSP text test
covers duplicate nonce attributes on parser-created script elements.
The tokenizer dump fixtures still match, TestHTMLTokenizer passes, and
the full release test-web run passes with 6981 tests and 226 skipped.
2026-05-15 15:13:43 +02:00
|
|
|
void HTMLTokenizer::undefine_insertion_point()
{
    rust_html_tokenizer_undefine_insertion_point(m_tokenizer);
}

void HTMLTokenizer::store_insertion_point()
{
    rust_html_tokenizer_store_insertion_point(m_tokenizer);
}

void HTMLTokenizer::restore_insertion_point()
{
    rust_html_tokenizer_restore_insertion_point(m_tokenizer);
}

void HTMLTokenizer::update_insertion_point()
{
    rust_html_tokenizer_update_insertion_point(m_tokenizer);
}

void HTMLTokenizer::abort()
{
    rust_html_tokenizer_abort(m_tokenizer);
}

void HTMLTokenizer::switch_to(State new_state)
{
    dbgln_if(TOKENIZER_TRACE_DEBUG, "[{}] Switch to {}", state_name(m_state), state_name(new_state));
    m_state = new_state;
    rust_html_tokenizer_switch_state(m_tokenizer, static_cast<uint8_t>(new_state));
}

}