Store CSS token payloads in a variant so each token only carries the
state needed by its type. Keep delimiter, number, hash, string, and
dimension data separate instead of storing every possible payload on
every token.
Use a smaller component-value token for function and block boundary
metadata. These component values only need token type, original source
text, and source positions, so avoid embedding full token payload
storage inside every Function and SimpleBlock.
Shrink CSS source positions to explicit 32-bit counters. Guard the C++
and Rust tokenizer paths against overflow. Add size assertions for the
hot Token and ComponentValue types so future growth is intentional.
Keep decoded CSS text separate from tokenizer byte input. CSSOM and
already-decoded stylesheet text preserve code point preprocessing, so a
lone surrogate maps to one replacement character instead of being
re-decoded as malformed UTF-8 bytes.
Decode tokenizer byte input with the requested encoding unless that
encoding is UTF-8 and the byte stream is strictly valid UTF-8. Keep the
fast path by constructing the decoded string without validating twice
after strict validation succeeds.
Preserve UTF-8 decoder behavior on the byte fast path by stripping an
initial UTF-8 BOM and rejecting encoded surrogate bytes. Invalid UTF-8
still goes through the decoder. Add tokenizer coverage for both the C++
and Rust backends across decoded text, UTF-8 aliases, BOM-prefixed
input, invalid UTF-8, and non-UTF requested encodings.
Avoid building a temporary Rust token vector before calling back into
C++. The tokenizer now invokes the callback as each token is produced,
while borrowing the already-filtered input for source slices.
Reserve an initial C++ token capacity from the input size so the common
path avoids repeated growth while appending the converted tokens.
With this change, the Rust CSS tokenizer is now ~1.3x faster than the
C++ CSS tokenizer at churning through all the https://vercel.com/ CSS.
test-css-tokenizer is updated to run both the C++ and Rust tokenizers
and compare their output, to ensure they behave identically. The Parser
still uses the C++ Tokenizer.
The LibWeb crate, FFI layer etc are all based on the existing ones for
other libraries.
This is a direct AI translation to get us started, and not idiomatic
Rust. Future work can be done to make it more sensible.