Keep decoded CSS text separate from tokenizer byte input. CSSOM and
already-decoded stylesheet text preserve code point preprocessing, so a
lone surrogate maps to one replacement character instead of being
re-decoded as malformed UTF-8 bytes.
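As a rough illustration of the decoded-text path, the sketch below
assumes the decoded text arrives as UTF-16 code units (as a CSSOM string
would carry them). The helper name `preprocess_decoded_text` is
hypothetical and not taken from the patch. It covers only the surrogate
rule, not the rest of CSS input preprocessing.

```rust
/// Hypothetical helper: turn already-decoded text (UTF-16 code units)
/// into tokenizer input without ever going through a byte decoder.
fn preprocess_decoded_text(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        // A lone surrogate is a single code point, so it maps to exactly
        // one U+FFFD. Well-formed surrogate pairs decode normally.
        .map(|unit| unit.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

With this shape, the units `[0x61, 0xD800, 0x62]` become `a`, one
U+FFFD, `b`. Re-encoding the same lone surrogate as the bytes
`ED A0 80` and decoding those as UTF-8 would instead treat each byte as
its own error.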
Decode tokenizer byte input with the requested encoding unless that
encoding is UTF-8 and the byte stream is strictly valid UTF-8. Keep the
fast path: once strict validation succeeds, construct the decoded string
directly instead of validating a second time.
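The dispatch described above could look roughly like this. It is a
sketch under assumptions: `RequestedEncoding`, `decode_with_decoder`,
and `decode_tokenizer_input` are hypothetical names, the single-byte
variant is only a placeholder for "some non-UTF-8 encoding", and BOM
handling is left out to keep the example short.

```rust
/// Hypothetical stand-in for the encodings a caller might request.
#[derive(Clone, Copy, PartialEq, Debug)]
enum RequestedEncoding {
    Utf8,
    /// Placeholder single-byte encoding (each byte maps to U+0000..U+00FF).
    SingleByte,
}

/// Hypothetical slow path standing in for a full text decoder.
fn decode_with_decoder(bytes: &[u8], encoding: RequestedEncoding) -> String {
    match encoding {
        // Invalid sequences are replaced with U+FFFD.
        RequestedEncoding::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        RequestedEncoding::SingleByte => bytes.iter().map(|&b| char::from(b)).collect(),
    }
}

fn decode_tokenizer_input(bytes: &[u8], encoding: RequestedEncoding) -> String {
    if encoding == RequestedEncoding::Utf8 {
        // from_utf8 validates exactly once and returns a &str view of the
        // same bytes; copying it into a String does not validate again.
        if let Ok(text) = std::str::from_utf8(bytes) {
            return text.to_owned();
        }
    }
    // Non-UTF-8 encodings and invalid UTF-8 both take the decoder path.
    decode_with_decoder(bytes, encoding)
}
```

Note that valid UTF-8 bytes are not enough on their own to take the fast
path: if the caller requested another encoding, the bytes are still
decoded with that encoding.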
Preserve UTF-8 decoder behavior on the byte fast path by stripping an
initial UTF-8 BOM and rejecting encoded surrogate bytes. Invalid UTF-8
still goes through the decoder. Add tokenizer coverage for both the C++
and Rust backends across decoded text, UTF-8 aliases, BOM-prefixed
input, invalid UTF-8, and non-UTF requested encodings.
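A minimal sketch of the fast-path checks and the alias matching the new
tests exercise. The function names are hypothetical. The label list is
the set the WHATWG Encoding Standard maps to UTF-8; whether the patch
matches labels this way is an assumption.

```rust
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Labels the WHATWG Encoding Standard maps to UTF-8.
const UTF8_LABELS: [&str; 6] = [
    "unicode-1-1-utf-8",
    "unicode11utf8",
    "unicode20utf8",
    "utf-8",
    "utf8",
    "x-unicode20utf8",
];

/// Hypothetical helper: labels match ASCII case-insensitively after
/// stripping leading and trailing ASCII whitespace.
fn is_utf8_label(label: &str) -> bool {
    let trimmed = label.trim_matches(|c: char| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
    UTF8_LABELS.iter().any(|known| known.eq_ignore_ascii_case(trimmed))
}

/// Hypothetical helper: returns the text when the fast path applies, or
/// None when the caller must fall back to the full decoder.
fn utf8_fast_path(bytes: &[u8]) -> Option<&str> {
    // Strip at most one leading BOM, as a UTF-8 decoder would.
    let body = bytes.strip_prefix(UTF8_BOM).unwrap_or(bytes);
    // Strict validation: std's from_utf8 rejects encoded surrogates
    // (ED A0..BF xx) along with every other malformed sequence.
    std::str::from_utf8(body).ok()
}
```

Only the first BOM is removed, so input that starts with two BOMs keeps
one U+FEFF in the decoded text.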
test-css-tokenizer is updated to run both the C++ and Rust tokenizers
and compare their output, to ensure they behave identically. The Parser
still uses the C++ Tokenizer.
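The comparison step can be pictured as a small differential check. This
is a generic sketch, not the code in test-css-tokenizer: `first_mismatch`
is a hypothetical helper, and the token streams are assumed to have been
dumped into comparable values (for example, one string per token).

```rust
/// Hypothetical helper: index of the first position where the two token
/// streams differ, or None when they are identical.
fn first_mismatch<T: PartialEq>(reference: &[T], candidate: &[T]) -> Option<usize> {
    let common = reference.len().min(candidate.len());
    (0..common)
        // First index where both streams have a token but the tokens differ.
        .find(|&i| reference[i] != candidate[i])
        // Otherwise a length difference means one stream ended early.
        .or_else(|| (reference.len() != candidate.len()).then_some(common))
}
```

Reporting the index rather than a plain boolean makes it easier to see
which token the two backends disagree on.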
The LibWeb crate, FFI layer, etc. are all modeled on the existing ones
for other libraries.
This is a direct AI translation to get us started, not idiomatic Rust.
Follow-up work can make it more idiomatic.