ladybird

Stowage/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2026-06-19 08:11:58 +00:00

Commit graph

Author	SHA1	Message	Date

Andreas Kling

f4960d9d7d

LibWeb: Honor requested CSS tokenizer encoding

Keep decoded CSS text separate from tokenizer byte input. CSSOM and
already-decoded stylesheet text preserve code point preprocessing, so a
lone surrogate maps to one replacement character instead of being
re-decoded as malformed UTF-8 bytes.

Decode tokenizer byte input with the requested encoding unless that
encoding is UTF-8 and the byte stream is strictly valid UTF-8. Keep the
fast path by constructing the decoded string without validating twice
after strict validation succeeds.

Preserve UTF-8 decoder behavior on the byte fast path by stripping an
initial UTF-8 BOM and rejecting encoded surrogate bytes. Invalid UTF-8
still goes through the decoder. Add tokenizer coverage for both the C++
and Rust backends across decoded text, UTF-8 aliases, BOM-prefixed
input, invalid UTF-8, and non-UTF requested encodings.

2026-05-18 14:08:22 +02:00
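
The UTF-8 fast path described in this commit can be illustrated with a minimal sketch. This is not the LibWeb code: `decode_utf8_fast_path` is a hypothetical name, and `String::from_utf8_lossy` stands in for the real decoder as an assumption. It only covers the case where the requested encoding is UTF-8; any other requested encoding would go to the decoder unconditionally.

```rust
use std::borrow::Cow;

/// UTF-8 byte order mark that the fast path strips before validating.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper (not a LibWeb API): decode bytes whose requested
/// encoding is UTF-8, borrowing the input when it is strictly valid.
fn decode_utf8_fast_path(bytes: &[u8]) -> Cow<'_, str> {
    // Strip one initial BOM, as a UTF-8 decoder would.
    let bytes = bytes.strip_prefix(UTF8_BOM).unwrap_or(bytes);

    match std::str::from_utf8(bytes) {
        // Strict validation succeeded (encoded surrogates are rejected by
        // `from_utf8`), so the text is reused without a second validation.
        Ok(text) => Cow::Borrowed(text),
        // Assumption: a lossy conversion stands in for the slow decoder path,
        // replacing malformed sequences with U+FFFD.
        Err(_) => String::from_utf8_lossy(bytes),
    }
}
```

Valid input comes back borrowed with no copy, while malformed input (including the encoded surrogate bytes `ED A0 80`) takes the slow path and yields an owned string containing replacement characters.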

Andreas Kling

355fb6b825

LibWeb: Stream Rust CSS tokenizer tokens over FFI

Avoid building a temporary Rust token vector before calling back into
C++. The tokenizer now invokes the callback as each token is produced,
while borrowing the already-filtered input for source slices.

Reserve an initial C++ token capacity from the input size so the common
path avoids repeated growth while appending the converted tokens.

With this change, the Rust CSS tokenizer is now ~1.3x faster than the
C++ CSS tokenizer at churning through all the https://vercel.com/ CSS.

2026-05-03 17:22:17 +02:00
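
The streaming approach in this commit can be sketched as follows. The token type and tokenizer here are toy stand-ins, and a Rust closure takes the place of the real FFI callback, so every name below is an assumption for illustration. The point is the shape: the tokenizer hands each token to the callback as it is produced, token slices borrow from the input, and the receiving side reserves capacity up front.

```rust
/// Toy token type (hypothetical); identifier slices borrow from the input.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Ident(&'a str),
    Whitespace,
    Delim(char),
}

/// Emit each token through `emit` as soon as it is recognized, instead of
/// collecting an intermediate vector on the tokenizer side.
fn tokenize_streaming<'a>(input: &'a str, mut emit: impl FnMut(Token<'a>)) {
    let mut rest = input;
    while let Some(c) = rest.chars().next() {
        let consumed = if c.is_ascii_alphabetic() {
            let end = rest
                .find(|ch: char| !ch.is_ascii_alphanumeric() && ch != '-')
                .unwrap_or(rest.len());
            emit(Token::Ident(&rest[..end]));
            end
        } else if c.is_whitespace() {
            let end = rest
                .find(|ch: char| !ch.is_whitespace())
                .unwrap_or(rest.len());
            emit(Token::Whitespace);
            end
        } else {
            emit(Token::Delim(c));
            c.len_utf8()
        };
        rest = &rest[consumed..];
    }
}

/// Receiving side: reserve capacity from the input size so that appending
/// rarely reallocates. The divisor is an arbitrary guess for this sketch.
fn collect_tokens(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::with_capacity(input.len() / 2 + 1);
    tokenize_streaming(input, |token| tokens.push(token));
    tokens
}
```

In this sketch the only vector that is ever built is the one on the receiving side, which mirrors the commit's goal of avoiding a temporary token vector on the producing side.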

Sam Atkins

4278194d96

LibWeb/CSS: Port the CSS Tokenizer to Rust

test-css-tokenizer is updated to run both the C++ and Rust tokenizers
and compare their output, to ensure they behave identically. The Parser
still uses the C++ Tokenizer.

The LibWeb crate, FFI layer etc are all based on the existing ones for
other libraries.

This is a direct AI translation to get us started, and not idiomatic
Rust. Future work can be done to make it more sensible.

2026-05-03 09:49:00 +02:00
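
Running two implementations over the same input and comparing their output, as this commit does for the C++ and Rust tokenizers, is a form of differential testing. The sketch below is an assumption-laden illustration rather than the actual test-css-tokenizer: two toy word splitters stand in for the two backends, and `first_mismatch` and `backends_agree` are hypothetical helpers.

```rust
/// Index of the first position where the two outputs differ, or `None` if
/// they are identical. A length difference counts as a mismatch at the end
/// of the shorter output.
fn first_mismatch<T: PartialEq>(reference: &[T], candidate: &[T]) -> Option<usize> {
    let common = reference.len().min(candidate.len());
    (0..common)
        .find(|&i| reference[i] != candidate[i])
        .or_else(|| (reference.len() != candidate.len()).then_some(common))
}

/// Stand-in for the reference backend.
fn reference_words(input: &str) -> Vec<&str> {
    input.split_whitespace().collect()
}

/// Stand-in for the ported backend: same behavior, written by hand.
fn candidate_words(input: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = None;
    for (i, c) in input.char_indices() {
        match (c.is_whitespace(), start) {
            (true, Some(s)) => {
                words.push(&input[s..i]);
                start = None;
            }
            (false, None) => start = Some(i),
            _ => {}
        }
    }
    if let Some(s) = start {
        words.push(&input[s..]);
    }
    words
}

/// Run both backends on one input; on disagreement, report where.
fn backends_agree(input: &str) -> Result<(), usize> {
    match first_mismatch(&reference_words(input), &candidate_words(input)) {
        None => Ok(()),
        Some(index) => Err(index),
    }
}
```

Reporting the index of the first divergence, rather than a bare pass or fail, makes it easier to locate the input construct that the two backends handle differently.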

3 commits