ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2026-06-17 15:25:35 +00:00

Author	SHA1	Message	Date
Andreas Kling	afb0fa2413	LibJS: Hash encoded source identity in bytecode cache Switch bytecode cache source identity from decoded UTF-16 source text to the original encoded response bytes plus the effective source encoding. Store the decoded source length in the cache blob header so warm loads can build lazy SourceCode objects without decoding the source before checking the sidecar. This removes the main-thread decoded_source_text_info pass from valid warm-cache script and module loads. The source is only decoded on cache miss, or when a rejected sidecar falls back to source compilation.	2026-06-03 14:11:23 +02:00
Andreas Kling	01cec162c8	LibTextCodec: Stop buffering invalid UTF-8 tails Treat trailing UTF-8 prefixes with an invalid second byte as complete input for streaming decode, so replacement characters are emitted in the current chunk instead of being held until later input or finish. Keep valid incomplete prefixes buffered across chunk boundaries. Keep TextDecoderStream from holding continuation bytes after an invalid lead byte at a chunk boundary. Add LibTextCodec and TextDecoderStream coverage for invalid tails, valid split sequences, EOF partials, surrogate sequences, and malformed continuation tails.	2026-05-18 14:08:22 +02:00
Andreas Kling	45da0e4a0e	LibTextCodec: Preserve malformed decoder replacements Reject UTF-8 second bytes outside the Encoding Standard's per-lead-byte bounds before consuming the rest of each sequence. This keeps surrogate and out-of-range sequences from collapsing multiple malformed bytes into one replacement character. Also report an odd trailing UTF-16 byte as U+FFFD through the streaming code point path and route UTF-16 to_utf8() through the same logic. This keeps lazy and eager script decoding aligned for bytecode cache source hashes. Cover the malformed UTF-8 and UTF-16 cases in LibTextCodec, TextDecoder, and bytecode-cache source decoding tests.	2026-05-18 14:08:22 +02:00
Andreas Kling	c92079893a	LibTextCodec: Match UTF-8 replacement handling Consume truncated UTF-8 tails as one malformed sequence while preserving existing behavior for encoded surrogate code points. This keeps lazy SourceCode decoding and bytecode cache source hashing aligned with eager source text decoding for invalid cached script source bytes. Continue stripping an initial UTF-8 byte order mark in UTF8Decoder::to_utf8() so HTML parsing keeps matching the previous String-based implementation. Cover the shared decoder behavior and the TextDecoder API surface.	2026-05-18 09:18:35 +02:00
Andreas Kling	5627e89956	LibTextCodec: Preserve UTF-8 replacement decoding Decode malformed UTF-8 consistently in UTF8Decoder::process() and UTF8Decoder::to_utf8(). This keeps lazy SourceCode decoding and bytecode cache source hashing in step with eager source text decoding when cached script source bytes contain invalid UTF-8. Cover UTF-8 encoded surrogate code points and overlong byte sequences in LibTextCodec, and add lazy SourceCode coverage for both cases.	2026-05-18 09:18:35 +02:00
Andreas Kling	fc02f6267a	LibTextCodec: Allow processing decoded UTF-16 code points Add Decoder::process_code_points() so callers can stream decoded code points without first materializing a UTF-8 string. Implement the UTF-16 variants by walking code units directly and emitting replacement code points for malformed surrogate pairs.	2026-05-18 09:18:35 +02:00
Martin Chrástek	c382e5d254	LibTextCodec: Update GB18030 for GB18030-2022 and import WPT tests Update the GB18030 encoder to spec-compliantly handle old PUA code points via a direct byte lookup table (spec step 5). Bake the 18 GB18030-2022 code point updates into indexes.json and remove the now-unnecessary patching logic from the code generator. Drop the redundant hardcoded switch in the decoder's range function, as the range formula already produces correct values. Import WPT tests for gb18030 decoder, gb18030 encoder, and gbk encoder, and register the worker variant in TestConfig.ini.	2026-05-09 11:44:42 +02:00
Martin Chrástek	9267d2d408	LibTextCodec: Fix ISO-2022-JP encoder escape seq on unencodable error When the encoder encounters an unencodable code point while in jis0208 state, the spec says to emit ESC ( B (0x1B 0x28 0x42) to switch to ASCII mode before returning an error. The encoder was incorrectly emitting ESC ( J (0x1B 0x28 0x4A) which selects Roman mode instead. This caused form submission using ISO-2022-JP to produce incorrect escape sequences when replacing unencodable characters with numeric character references. Also imports the WPT iso2022jp-encode-form-errors-stateful test.	2026-05-07 17:46:31 +02:00
Aliaksandr Kalenik	9375499e52	LibTextCodec: Add streaming decoder Introduce a StreamingDecoder wrapper that lets callers feed bytes to a Decoder one chunk at a time. It buffers any incomplete trailing byte sequence at the end of a chunk and prepends it to the next chunk, so a multi-byte code point split across a chunk boundary is decoded correctly once the next chunk arrives. To support that, add an incomplete_tail_length() virtual on Decoder returning the number of trailing bytes that form an incomplete sequence per the Encoding Standard's decoder handler byte ranges, with overrides for UTF-8, UTF-16BE, UTF-16LE, GB18030, Big5, EUC-JP, ISO-2022-JP, Shift_JIS, and EUC-KR. The default implementation returns 0, which keeps single-byte legacy decoders correct. This is the foundation for the upcoming incremental HTML parser, which needs to decode network response bodies as they arrive.	2026-04-29 04:12:44 +02:00
R-Goc	ae5f28fb40	LibTextEncoder/LibURL: Cleanup includes Cleans up LibURL/Parser.h to use the forwarding header from LibTextEncoder.	2026-02-26 18:31:57 +01:00
Timothy Flynn	0fd80a8f99	LibTextCodec+LibWeb: Move isomorphic coders to LibTextCodec This will be used outside of LibWeb.	2025-11-27 14:57:29 +01:00
ayeteadoe	e497303e94	LibTextCodec: Enable EXPLICIT_SYMBOL_EXPORT	2025-08-23 16:04:36 -06:00
Gingeh	f098bd029c	LibTextCodec: Replace unmatched utf16 surrogates	2025-07-05 09:58:57 -04:00
ayeteadoe	25f5936dee	CMake: Rename serenity_* helper functions/macros to ladybird_*	2025-07-03 23:19:41 +02:00
Timothy Flynn	7280ed6312	Meta: Enforce newlines around namespaces This has come up several times during code review, so let's just enforce it using a new clang-format 20 option.	2025-05-14 02:01:59 -06:00
Andreas Kling	0e9480b944	AK+LibTextCodec: Stop using Utf16View endianness override This is preparation for removing the endianness override, since it was only used by a single client: LibTextCodec. While here, add helpers and make use of simdutf for fast conversion.	2025-04-16 10:04:50 +02:00
Timothy Flynn	93712b24bf	Everywhere: Hoist the Libraries folder to the top-level	2024-11-10 12:50:45 +01:00
Andreas Kling	13d7c09125	Libraries: Move to Userland/Libraries/	2021-01-12 12:17:46 +01:00
Lukasz Maciejewski	7e5199a394	LibTextCodec: Fix minor errors in Latin2 decoder	2020-12-28 23:31:12 +01:00
Łukasz Maciejewski	518ba73dcb	LibTextCodec: Add Latin2 text decoder (#4579)	2020-12-27 22:44:38 +01:00
Andreas Kling	024059b49b	LibTextCodec: Normalize incoming encodings in decoder_for() Instead of asserting when you call TextCoded::decoder_for() with a non-standard encoding name, let's be nice and see if we can't find a decoder for the standardized version of the encoding name.	2020-12-13 18:20:50 +01:00
Luke	f3d2053bff	LibTextCodec: Add a function to convert encodings to standardized names https://encoding.spec.whatwg.org/#names-and-labels	2020-11-14 10:14:03 +01:00
Ben Wiederhake	69a0502f80	LibTextCodec: Mark compilation-unit-only functions as static This enables a nice warning in case a function becomes dead code.	2020-08-12 20:40:59 +02:00
Nico Weber	ce95628b7f	Unicode: Try s/codepoint/code_point/g again This time, without trailing 's'. Ran: git grep -l 'codepoint' | xargs sed -ie 's/codepoint/code_point/g	2020-08-05 22:33:42 +02:00
Nico Weber	19ac1f6368	Revert "Unicode: s/codepoint/code_point/g" This reverts commit `ea9ac3155d`. It replaced "codepoint" with "code_points", not "code_point".	2020-08-05 22:33:42 +02:00
Andreas Kling	ea9ac3155d	Unicode: s/codepoint/code_point/g Unicode calls them "code points" so let's follow their style.	2020-08-03 19:06:41 +02:00
Nico Weber	01522b8d71	LibTextCodec: Simplify Latin1Decoder::to_utf8 No intended behavior change.	2020-07-22 19:16:00 +02:00
Andreas Kling	893a9ff5b0	LibTextCodec: Improve Latin-1 decoder so it decodes everything I can now see Swedish letters when opening Google in the browser. :^)	2020-05-27 19:52:18 +02:00
Sergey Bugaev	450a2a0f9c	Build: Switch to CMake :^) Closes https://github.com/SerenityOS/serenity/issues/2080	2020-05-14 20:15:18 +02:00
Andreas Kling	e09b83c60c	LibTextCodec: Start fleshing out a simple text codec library We're starting with a very basic decoding API and only ISO-8859-1 and UTF-8 decoding (and UTF-8 decoding is really a no-op since String is expected to be UTF-8.)	2020-05-03 23:01:58 +02:00

30 commits
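
The chunk-boundary behaviour described in commits 9375499e52 and 01cec162c8 (buffer a valid incomplete UTF-8 prefix, but do not hold back a tail whose second byte is already invalid) can be illustrated with a minimal sketch. This is not the LibTextCodec implementation: it uses only the C++ standard library, and the names `incomplete_utf8_tail_length`, `split_chunk`, and `ChunkSplit` are hypothetical stand-ins chosen for this example. The second-byte bounds follow the Encoding Standard's UTF-8 decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not the LibTextCodec API): returns how many trailing
// bytes of `bytes` form a still-valid but incomplete UTF-8 sequence.
// Returns 0 when the tail is complete or already known to be malformed.
inline std::size_t incomplete_utf8_tail_length(std::string_view bytes)
{
    // A UTF-8 sequence is at most 4 bytes, so an incomplete prefix is at most 3.
    std::size_t const max_prefix = bytes.size() < 3 ? bytes.size() : 3;
    for (std::size_t length = 1; length <= max_prefix; ++length) {
        auto const lead = static_cast<std::uint8_t>(bytes[bytes.size() - length]);
        if ((lead & 0xC0) == 0x80)
            continue; // Continuation byte: keep walking back to find the lead byte.

        std::size_t needed = 0;
        std::uint8_t lower = 0x80; // Allowed range for the second byte,
        std::uint8_t upper = 0xBF; // narrowed for some lead bytes below.
        if (lead >= 0xC2 && lead <= 0xDF) {
            needed = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            needed = 3;
            if (lead == 0xE0) lower = 0xA0; // Rejects overlong encodings.
            if (lead == 0xED) upper = 0x9F; // Rejects encoded surrogates.
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            needed = 4;
            if (lead == 0xF0) lower = 0x90; // Rejects overlong encodings.
            if (lead == 0xF4) upper = 0x8F; // Rejects values above U+10FFFF.
        } else {
            return 0; // ASCII or an invalid lead byte: nothing worth buffering.
        }

        if (length >= needed)
            return 0; // The sequence is already complete.
        if (length >= 2) {
            auto const second = static_cast<std::uint8_t>(bytes[bytes.size() - length + 1]);
            if (second < lower || second > upper)
                return 0; // Invalid second byte: treat the tail as complete input.
        }
        return length;
    }
    return 0; // Only stray continuation bytes at the end.
}

// Hypothetical result type: bytes that can be decoded now, and bytes to
// prepend to the next chunk.
struct ChunkSplit {
    std::string ready;
    std::string carry;
};

// Joins the carried-over tail with the new chunk and splits off any
// incomplete trailing sequence.
inline ChunkSplit split_chunk(std::string_view pending, std::string_view chunk)
{
    std::string joined(pending);
    joined.append(chunk);
    std::size_t const tail = incomplete_utf8_tail_length(joined);
    assert(tail <= joined.size());
    return { joined.substr(0, joined.size() - tail), joined.substr(joined.size() - tail) };
}
```

A caller would pass `ready` to the decoder immediately and feed `carry` back in as `pending` with the next chunk. At end of input, any remaining `carry` would be decoded as-is so that it yields a replacement character.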