Switch bytecode cache source identity from decoded UTF-16 source text to
the original encoded response bytes plus the effective source encoding.
Store the decoded source length in the cache blob header so warm loads
can build lazy SourceCode objects without decoding the source before
checking the sidecar.
This removes the main-thread decoded_source_text_info pass from valid
warm-cache script and module loads. The source is only decoded on cache
miss, or when a rejected sidecar falls back to source compilation.
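A minimal sketch of the idea, not the actual cache format: every name here (SourceIdentity, CacheBlobHeader, fnv1a_64, header_matches) is hypothetical, and FNV-1a is only a stand-in for whatever hash the cache really uses. The point is that a warm load can validate the sidecar from the encoded bytes and the encoding name alone, and read the decoded length from the header.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical source identity: encoded bytes (hashed) plus effective encoding.
struct SourceIdentity {
    uint64_t encoded_bytes_hash;
    std::string encoding;
};

bool operator==(SourceIdentity const& a, SourceIdentity const& b)
{
    return a.encoded_bytes_hash == b.encoded_bytes_hash && a.encoding == b.encoding;
}

// Hypothetical blob header: the decoded length travels with the cache entry,
// so a lazy source object can be sized without decoding anything.
struct CacheBlobHeader {
    SourceIdentity identity;
    uint64_t decoded_source_length;
};

// Stand-in hash (64-bit FNV-1a) over the undecoded response bytes.
uint64_t fnv1a_64(std::vector<uint8_t> const& bytes)
{
    uint64_t hash = 14695981039346656037ull;
    for (uint8_t byte : bytes) {
        hash ^= byte;
        hash *= 1099511628211ull;
    }
    return hash;
}

SourceIdentity identity_for(std::vector<uint8_t> const& bytes, std::string const& encoding)
{
    return SourceIdentity { fnv1a_64(bytes), encoding };
}

// True when the sidecar can be used without decoding the source.
bool header_matches(CacheBlobHeader const& header, std::vector<uint8_t> const& bytes, std::string const& encoding)
{
    return header.identity == identity_for(bytes, encoding);
}
```

In this sketch the same bytes under a different effective encoding count as a different source, which is why the encoding is part of the identity.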
Treat trailing UTF-8 prefixes with an invalid second byte as complete
input for streaming decode, so replacement characters are emitted in the
current chunk instead of being held until later input or finish. Keep
valid incomplete prefixes buffered across chunk boundaries.
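The buffering rule above can be sketched as a standalone function. This is an illustration under assumptions, not the LibTextCodec code: the function names are hypothetical, and the second-byte bounds are the per-lead-byte ranges from the Encoding Standard's UTF-8 decoder.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-lead-byte bounds for the byte that follows a UTF-8 lead byte.
bool is_valid_second_byte(uint8_t lead, uint8_t second)
{
    uint8_t lower = 0x80;
    uint8_t upper = 0xBF;
    if (lead == 0xE0)
        lower = 0xA0;
    else if (lead == 0xED)
        upper = 0x9F;
    else if (lead == 0xF0)
        lower = 0x90;
    else if (lead == 0xF4)
        upper = 0x8F;
    return second >= lower && second <= upper;
}

// Number of trailing bytes that form a valid but incomplete UTF-8 sequence.
// Returns 0 when the tail is complete or already known to be malformed, so
// the decoder can emit its replacement characters in the current chunk.
size_t incomplete_utf8_tail_length(std::vector<uint8_t> const& bytes)
{
    size_t max_tail = std::min<size_t>(3, bytes.size());
    for (size_t tail = 1; tail <= max_tail; ++tail) {
        size_t lead_index = bytes.size() - tail;
        uint8_t byte = bytes[lead_index];
        if (byte >= 0x80 && byte <= 0xBF)
            continue; // Continuation byte: keep looking backwards for a lead.

        size_t needed = 0;
        if (byte >= 0xC2 && byte <= 0xDF)
            needed = 1;
        else if (byte >= 0xE0 && byte <= 0xEF)
            needed = 2;
        else if (byte >= 0xF0 && byte <= 0xF4)
            needed = 3;

        size_t have = tail - 1;
        if (needed == 0 || have >= needed)
            return 0; // ASCII, invalid lead, or a complete sequence.
        if (have >= 1 && !is_valid_second_byte(byte, bytes[lead_index + 1]))
            return 0; // Invalid second byte: nothing worth buffering.
        return tail;
    }
    return 0; // Only stray continuation bytes at the end.
}
```

With this rule, a trailing `E2 82` is held back for the next chunk, while a trailing `E0 80` is decoded immediately because no later byte can make it valid.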
Keep TextDecoderStream from holding continuation bytes after an invalid
lead byte at a chunk boundary. Add LibTextCodec and TextDecoderStream
coverage for invalid tails, valid split sequences, EOF partials,
surrogate sequences, and malformed continuation tails.
Reject UTF-8 second bytes outside the Encoding Standard's per-lead-byte
bounds before consuming the rest of each sequence. This keeps surrogate
and out-of-range sequences from collapsing multiple malformed bytes into
one replacement character.
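To illustrate why the bounds check has to happen before the rest of the sequence is consumed, here is a compact sketch of the Encoding Standard's UTF-8 decoder state machine. The function name decode_utf8 and the use of std::u32string are assumptions made for the example; this is not the UTF8Decoder implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 into code points, emitting U+FFFD per malformed unit.
std::u32string decode_utf8(std::vector<uint8_t> const& bytes)
{
    std::u32string out;
    char32_t code_point = 0;
    int needed = 0;
    uint8_t lower = 0x80;
    uint8_t upper = 0xBF;

    size_t i = 0;
    while (i < bytes.size()) {
        uint8_t byte = bytes[i];
        if (needed == 0) {
            ++i;
            if (byte <= 0x7F) {
                out.push_back(byte);
            } else if (byte >= 0xC2 && byte <= 0xDF) {
                needed = 1;
                code_point = byte & 0x1F;
            } else if (byte >= 0xE0 && byte <= 0xEF) {
                if (byte == 0xE0)
                    lower = 0xA0; // Rejects overlong three-byte forms.
                if (byte == 0xED)
                    upper = 0x9F; // Rejects encoded surrogates.
                needed = 2;
                code_point = byte & 0x0F;
            } else if (byte >= 0xF0 && byte <= 0xF4) {
                if (byte == 0xF0)
                    lower = 0x90; // Rejects overlong four-byte forms.
                if (byte == 0xF4)
                    upper = 0x8F; // Rejects values above U+10FFFF.
                needed = 3;
                code_point = byte & 0x07;
            } else {
                out.push_back(0xFFFD);
            }
            continue;
        }

        if (byte < lower || byte > upper) {
            // Out of bounds: report the lead alone and reprocess this byte.
            code_point = 0;
            needed = 0;
            lower = 0x80;
            upper = 0xBF;
            out.push_back(0xFFFD);
            continue;
        }

        ++i;
        lower = 0x80;
        upper = 0xBF;
        code_point = (code_point << 6) | (byte & 0x3F);
        if (--needed == 0) {
            out.push_back(code_point);
            code_point = 0;
        }
    }

    if (needed != 0)
        out.push_back(0xFFFD); // Truncated tail counts as one malformed sequence.
    return out;
}
```

Because the rejected byte is not consumed, `ED A0 80` yields three replacement characters here (one for the lead, one for each stray byte) rather than one.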
Also report an odd trailing UTF-16 byte as U+FFFD through the streaming
code point path and route UTF-16 to_utf8() through the same logic. This
keeps lazy and eager script decoding aligned for bytecode cache source
hashes.
Cover the malformed UTF-8 and UTF-16 cases in LibTextCodec, TextDecoder,
and bytecode-cache source decoding tests.
Consume truncated UTF-8 tails as one malformed sequence while preserving
existing behavior for encoded surrogate code points. This keeps lazy
SourceCode decoding and bytecode cache source hashing aligned with eager
source text decoding for invalid cached script source bytes.
Continue stripping an initial UTF-8 byte order mark in
UTF8Decoder::to_utf8() so HTML parsing keeps matching the previous
String-based implementation. Cover the shared decoder behavior and the
TextDecoder API surface.
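As a small illustration of the BOM handling (the helper name strip_utf8_bom is hypothetical, and the real code lives inside UTF8Decoder::to_utf8()), only a byte order mark at the very start of the input is removed:

```cpp
#include <string_view>

// Drops one leading UTF-8 byte order mark (EF BB BF), if present.
std::string_view strip_utf8_bom(std::string_view bytes)
{
    if (bytes.substr(0, 3) == "\xEF\xBB\xBF")
        bytes.remove_prefix(3);
    return bytes;
}
```

A BOM that appears later in the input is left alone in this sketch and would decode as U+FEFF.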
Decode malformed UTF-8 consistently in UTF8Decoder::process() and
UTF8Decoder::to_utf8(). This keeps lazy SourceCode decoding and bytecode
cache source hashing in step with eager source text decoding when cached
script source bytes contain invalid UTF-8.
Cover UTF-8 encoded surrogate code points and overlong byte sequences in
LibTextCodec, and add lazy SourceCode coverage for both cases.
Add Decoder::process_code_points() so callers can stream decoded code
points without first materializing a UTF-8 string. Implement the UTF-16
variants by walking code units directly and emitting replacement code
points for malformed surrogate pairs.
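A rough sketch of what walking UTF-16 code units directly can look like. The free function, its callback signature, and the collect() helper are assumptions for illustration, not the Decoder::process_code_points() interface. It handles UTF-16LE only and follows the Encoding Standard's end-of-input rule that a dangling byte or lead surrogate yields a single error.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Streams code points from UTF-16LE bytes without building a UTF-8 string.
template<typename Callback>
void process_utf16le_code_points(std::vector<uint8_t> const& bytes, Callback on_code_point)
{
    auto read_unit = [&](size_t at) {
        return static_cast<uint16_t>(bytes[at] | (bytes[at + 1] << 8));
    };

    size_t i = 0;
    while (i + 1 < bytes.size()) {
        uint16_t unit = read_unit(i);
        i += 2;

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            if (i + 1 >= bytes.size()) {
                // Lead surrogate at end of input (plus at most one stray byte).
                on_code_point(char32_t(0xFFFD));
                return;
            }
            uint16_t next = read_unit(i);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i += 2;
                on_code_point(char32_t(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00)));
            } else {
                // Unpaired lead surrogate; the next unit is processed on its own.
                on_code_point(char32_t(0xFFFD));
            }
            continue;
        }

        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            on_code_point(char32_t(0xFFFD)); // Trail surrogate without a lead.
            continue;
        }

        on_code_point(char32_t(unit));
    }

    if (i < bytes.size())
        on_code_point(char32_t(0xFFFD)); // Odd trailing byte.
}

// Convenience wrapper used for testing the sketch.
std::u32string collect(std::vector<uint8_t> const& bytes)
{
    std::u32string out;
    process_utf16le_code_points(bytes, [&](char32_t code_point) { out.push_back(code_point); });
    return out;
}
```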
Update the GB18030 encoder to handle old PUA code points as the spec
requires, via a direct byte lookup table (spec step 5). Bake the 18
GB18030-2022 code point updates into indexes.json and remove the
now-unnecessary patching logic from the code generator. Drop the
redundant hardcoded switch in the decoder's range function, as the
range formula already produces correct values.
Import WPT tests for gb18030 decoder, gb18030 encoder, and gbk
encoder, and register the worker variant in TestConfig.ini.
When the encoder encounters an unencodable code point while in jis0208
state, the spec says to emit ESC ( B (0x1B 0x28 0x42) to switch to
ASCII mode before returning an error. The encoder was incorrectly
emitting ESC ( J (0x1B 0x28 0x4A) which selects Roman mode instead.
This caused form submission using ISO-2022-JP to produce incorrect
escape sequences when replacing unencodable characters with numeric
character references.
Also import the WPT iso2022jp-encode-form-errors-stateful test.
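The fix can be pictured with a small sketch. The enum and function names are hypothetical and the real encoder has more states and steps; only the escape bytes and the state transition are taken from the description above.

```cpp
#include <cstdint>
#include <vector>

enum class Iso2022JpEncoderState {
    ASCII,
    Roman,
    Jis0208,
};

// Bytes to emit before reporting an unencodable code point. In jis0208
// state the encoder must return to ASCII with ESC ( B; ESC ( J would
// select Roman instead, which was the bug.
std::vector<uint8_t> escape_before_error(Iso2022JpEncoderState& state)
{
    if (state != Iso2022JpEncoderState::Jis0208)
        return {};
    state = Iso2022JpEncoderState::ASCII;
    return { 0x1B, 0x28, 0x42 };
}
```

After the escape, a numeric character reference such as `&#128512;` is written as plain ASCII, so the form submission stays decodable.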
Introduce a StreamingDecoder wrapper that lets callers feed bytes to a
Decoder one chunk at a time. It buffers any incomplete trailing byte
sequence at the end of a chunk and prepends it to the next chunk, so a
multi-byte code point split across a chunk boundary is decoded correctly
once the next chunk arrives.
To support that, add an incomplete_tail_length() virtual on Decoder
returning the number of trailing bytes that form an incomplete sequence
per the Encoding Standard's decoder handler byte ranges, with overrides
for UTF-8, UTF-16BE, UTF-16LE, GB18030, Big5, EUC-JP, ISO-2022-JP,
Shift_JIS, and EUC-KR. The default implementation returns 0, which keeps
single-byte legacy decoders correct.
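The buffering scheme can be sketched as follows. This is a simplified model: the Decoder interface shown is a hypothetical stand-in for the real one, and ToyTwoByteDecoder is an invented fixed-width encoding used only to exercise the wrapper.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical minimal decoder interface, modeled on the description above.
class Decoder {
public:
    virtual ~Decoder() = default;
    virtual std::string decode(std::string_view bytes) = 0;
    // Default of 0 is right for single-byte encodings.
    virtual size_t incomplete_tail_length(std::string_view) const { return 0; }
};

// Toy encoding: each character is two bytes, 0x00 followed by an ASCII byte.
class ToyTwoByteDecoder final : public Decoder {
public:
    std::string decode(std::string_view bytes) override
    {
        std::string out;
        for (size_t i = 0; i < bytes.size(); i += 2) {
            bool complete = i + 1 < bytes.size();
            bool valid = complete && bytes[i] == 0 && static_cast<unsigned char>(bytes[i + 1]) < 0x80;
            out.push_back(valid ? bytes[i + 1] : '?');
        }
        return out;
    }

    size_t incomplete_tail_length(std::string_view bytes) const override
    {
        return bytes.size() % 2;
    }
};

class StreamingDecoder {
public:
    explicit StreamingDecoder(Decoder& decoder)
        : m_decoder(decoder)
    {
    }

    // Decodes everything except an incomplete tail, which is kept for later.
    std::string feed(std::string_view chunk)
    {
        m_pending.append(chunk);
        size_t tail = m_decoder.incomplete_tail_length(m_pending);
        size_t ready = m_pending.size() - tail;
        std::string out = m_decoder.decode(std::string_view(m_pending).substr(0, ready));
        m_pending.erase(0, ready);
        return out;
    }

    // At end of input the leftover bytes are decoded as-is (likely an error).
    std::string finish()
    {
        std::string out = m_decoder.decode(m_pending);
        m_pending.clear();
        return out;
    }

private:
    Decoder& m_decoder;
    std::string m_pending;
};
```

Keeping the tail computation on the decoder means the wrapper itself needs no knowledge of any particular encoding.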
This is the foundation for the upcoming incremental HTML parser, which
needs to decode network response bodies as they arrive.
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.
While here, add helpers and make use of simdutf for fast conversion.
Instead of asserting when TextCodec::decoder_for() is called with a
non-standard encoding name, let's be nice and try to find a decoder for
the standardized version of the encoding name.
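A sketch of what such a name lookup might do, assuming a hypothetical standardized_encoding_name() helper. The labels shown are a small subset of the Encoding Standard's label table, where `latin1` and `iso-8859-1` both resolve to windows-1252; the lookup used by decoder_for() may differ.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Maps an encoding label to its standardized name, or nullopt if unknown.
std::optional<std::string_view> standardized_encoding_name(std::string_view label)
{
    // Strip leading and trailing ASCII whitespace.
    constexpr std::string_view whitespace = "\t\n\f\r ";
    size_t first = label.find_first_not_of(whitespace);
    if (first == std::string_view::npos)
        return std::nullopt;
    size_t last = label.find_last_not_of(whitespace);
    label = label.substr(first, last - first + 1);

    // Labels are matched ASCII case-insensitively.
    std::string lowered(label);
    for (char& c : lowered) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }

    if (lowered == "utf-8" || lowered == "utf8" || lowered == "unicode-1-1-utf-8")
        return "UTF-8";
    if (lowered == "windows-1252" || lowered == "latin1" || lowered == "iso-8859-1" || lowered == "l1")
        return "windows-1252";
    return std::nullopt;
}
```

A caller can then look up the decoder by the returned name and fall back to "no decoder" when the label is unknown, instead of asserting.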
We're starting with a very basic decoding API and only ISO-8859-1 and
UTF-8 decoding (and UTF-8 decoding is really a no-op, since String is
expected to be UTF-8).
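For illustration, ISO-8859-1 decoding into a UTF-8 string can be sketched like this. The function name is hypothetical and std::string stands in for String. Every ISO-8859-1 byte maps to the code point of the same value, so bytes at 0x80 and above become two-byte UTF-8 sequences.

```cpp
#include <string>
#include <string_view>

// Converts ISO-8859-1 bytes to UTF-8.
std::string latin1_to_utf8(std::string_view bytes)
{
    std::string out;
    out.reserve(bytes.size());
    for (unsigned char byte : bytes) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF encode as 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```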