70 commits

Author SHA1 Message Date
Simon Wanner
7f3b457e62 LibTextCodec: Add EUC-KR decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
ded6512ca8 LibTextCodec: Add Shift_JIS decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
06f7c393b2 LibTextCodec: Add ISO-2022-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
45f0ae52be LibTextCodec: Add EUC-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
9943bb1d8e LibTextCodec: Add Big5 decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
2ce61fe6ea LibTextCodec: Add GBK/GB18030 decoder
Includes changes from GB-18030-2022, which are not yet included in the
Encoding Specification, but WebKit, Blink and WPT are already updated.
2024-05-31 07:56:26 +02:00
Simon Wanner
9ed52504ab LibTextCodec: Delegate to process() in default validate() implementation 2024-05-31 07:56:26 +02:00
Simon Wanner
b79815c5a5 LibTextCodec: Add x-mac-cyrillic decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
07a9435da5 LibTextCodec: Add windows-1258 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
275b89720b LibTextCodec: Add windows-1257 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c76308c7e6 LibTextCodec: Add windows-1256 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
eb9ed10573 LibTextCodec: Add windows-1253 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
2d35687db0 LibTextCodec: Add windows-874 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1b6878b6ca LibTextCodec: Add KOI8-U decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1fd3a6f48c LibTextCodec: Add ISO-8859-16 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
3e882f26db LibTextCodec: Sort checks in decoder_for mostly alphabetically
Keeps checks for common encodings (Latin1 & UTF-*) at the top.
2024-05-27 20:50:50 +02:00
Simon Wanner
56241df604 LibTextCodec: Add ISO-8859-14 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
4188e328ac LibTextCodec: Add ISO-8859-13 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
cc640f4363 LibTextCodec: Add ISO-8859-10 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
d73220837e LibTextCodec: Add ISO-8859-8(-I) decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
24028e353e LibTextCodec: Add ISO-8859-7 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
01c3b8091a LibTextCodec: Add ISO-8859-6 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
763d904ad5 LibTextCodec: Add ISO-8859-5 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c6b17320db LibTextCodec: Add ISO-8859-4 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
6c84edaaa2 LibTextCodec: Add ISO-8859-3 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
fc783199f1 LibTextCodec: Add IBM866 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
96b3c35358 LibTextCodec: Implement table based decoders as SingleByteDecoder
Instead of copy-pasting the implementation, let's use a single class.
This "Single Byte Decoder" concept even exists in the Encoding Spec :^)
2024-05-27 20:50:50 +02:00
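The idea behind a shared table-based decoder can be sketched as follows. This is a minimal illustration, not the LibTextCodec class: the `Table` alias, the `decode()` signature and `make_toy_table()` are hypothetical, and unmapped entries are simply stored as U+FFFD. Only the lookup rule (ASCII passes through, other bytes index a 128-entry table at `byte - 0x80`) follows the Encoding Standard's single-byte decoder.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical table type: one code point for each byte in 0x80-0xFF.
using Table = std::array<char32_t, 128>;

// One class serves every table-based encoding; only the table differs.
class SingleByteDecoder {
public:
    explicit SingleByteDecoder(Table const& table)
        : m_table(table)
    {
    }

    std::u32string decode(std::string_view input) const
    {
        std::u32string out;
        out.reserve(input.size());
        for (char ch : input) {
            auto byte = static_cast<std::uint8_t>(ch);
            // ASCII bytes map to themselves; the rest are looked up.
            out.push_back(byte < 0x80 ? static_cast<char32_t>(byte) : m_table[byte - 0x80]);
        }
        return out;
    }

private:
    Table m_table;
};

// Toy table for illustration only: everything unmapped except 0x80 -> U+20AC.
Table make_toy_table()
{
    Table table;
    table.fill(0xFFFD);
    table[0] = 0x20AC;
    return table;
}
```

Adding another single-byte encoding then only requires supplying a new table rather than a new class.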
Michal Grich
7a6d84d036 LibTextCodec: Add Windows-1250 text decoder
This commit adds Windows-1250 decoding based on the unicode.org
mapping table.
2024-04-23 16:26:16 +02:00
Andreas Kling
3c039903fb LibTextCodec+AK: Don't validate UTF-8 strings twice
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.

This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
2023-12-30 13:49:50 +01:00
Nico Weber
8f47acee6a LibTextCodec: Add PDFDocEncoding decoder 2023-11-22 09:08:06 -07:00
Idan Horowitz
079c96376c LibTextCodec: Support validating encoded inputs 2023-11-17 16:02:36 +01:00
Luke Wilde
eaa4048870 LibTextCodec: Add "get output encoding" from the Encoding specification 2023-06-19 06:12:26 +02:00
Timothy Flynn
00fa23237a LibTextCodec: Change UTF-8's decoder to replace invalid code points
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
2023-05-12 05:47:36 +02:00
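A minimal sketch of replacing invalid input with U+FFFD instead of failing, modelled on the UTF-8 decoder state machine in the Encoding Standard. The function name `decode_utf8_lossy` and the use of `std::u32string` are assumptions for illustration; this is not the UTF8Decoder code from the commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8, emitting U+FFFD for each invalid sequence.
std::u32string decode_utf8_lossy(std::string_view input)
{
    std::u32string out;
    char32_t code_point = 0;
    int bytes_needed = 0;
    std::uint8_t lower = 0x80, upper = 0xBF; // allowed range for the next continuation byte
    std::size_t i = 0;
    while (i < input.size()) {
        auto byte = static_cast<std::uint8_t>(input[i]);
        if (bytes_needed == 0) {
            ++i;
            if (byte <= 0x7F) {
                out.push_back(byte);
            } else if (byte >= 0xC2 && byte <= 0xDF) {
                bytes_needed = 1;
                code_point = byte & 0x1F;
            } else if (byte >= 0xE0 && byte <= 0xEF) {
                if (byte == 0xE0) lower = 0xA0; // reject overlong forms
                if (byte == 0xED) upper = 0x9F; // reject surrogates
                bytes_needed = 2;
                code_point = byte & 0x0F;
            } else if (byte >= 0xF0 && byte <= 0xF4) {
                if (byte == 0xF0) lower = 0x90; // reject overlong forms
                if (byte == 0xF4) upper = 0x8F; // reject values above U+10FFFF
                bytes_needed = 3;
                code_point = byte & 0x07;
            } else {
                out.push_back(0xFFFD);
            }
            continue;
        }
        if (byte < lower || byte > upper) {
            // Bad continuation: emit U+FFFD and reprocess this byte as a lead byte.
            bytes_needed = 0;
            lower = 0x80;
            upper = 0xBF;
            out.push_back(0xFFFD);
            continue;
        }
        ++i;
        lower = 0x80;
        upper = 0xBF;
        code_point = (code_point << 6) | (byte & 0x3F);
        if (--bytes_needed == 0)
            out.push_back(code_point);
    }
    if (bytes_needed != 0)
        out.push_back(0xFFFD); // input ended in the middle of a sequence
    return out;
}
```

Because every error path appends U+FFFD and continues, the output is always a valid sequence of code points regardless of the input bytes.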
Andreas Kling
a504ac3e2a Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case
Let's make it clear that these functions deal with ASCII case only.
2023-03-10 13:15:44 +01:00
Luke Wilde
e864444fe3 LibTextCodec/Latin1: Iterate over input string with u8 instead of char
Using char causes bytes equal to or over 0x80 to be treated as a
negative value and produce incorrect results when implicitly casting to
u32.

For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80 and thus produce
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].

Required by Cloudflare's IUAM challenges.
2023-02-28 08:46:06 +00:00
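The sign-extension problem described in this commit can be reproduced in isolation. The sketch below is illustrative only: `widen_via_signed` and `decode_latin1` are hypothetical names, and the first helper assumes a two's-complement platform where `signed char` is 8 bits wide.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Mimics the bug: a byte >= 0x80 held in a signed char is negative,
// so widening it to 32 bits sign-extends (0xBB becomes 0xFFFFFFBB).
constexpr std::uint32_t widen_via_signed(std::uint8_t byte)
{
    return static_cast<std::uint32_t>(static_cast<signed char>(byte));
}

// Latin-1 maps each byte to the code point with the same value,
// so the fix is to read every byte as unsigned before widening.
std::u32string decode_latin1(std::string_view input)
{
    std::u32string out;
    out.reserve(input.size());
    for (char ch : input)
        out.push_back(static_cast<std::uint8_t>(ch));
    return out;
}
```

With the unsigned cast, the bytes 0xBB, 0xAE, 0xB4 from the `atob("u660")` example decode to the code points 187, 174 and 180.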
Sam Atkins
2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins
3c5090e172 LibTextCodec: Return Optional<Decoder&> from bom_sniff_to_decoder() 2023-02-19 17:15:47 +01:00
Sam Atkins
f2a9426885 LibTextCodec+Everywhere: Return Optional<Decoder&> from decoder_for() 2023-02-19 17:15:47 +01:00
Sam Atkins
d6075ef5b5 LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView
We don't need a full String/DeprecatedString inside this function, so we
might as well not force users to create one.
2023-02-15 12:48:26 -05:00
Nico Weber
eac2b2382c LibTextCodec: Add a MacRoman decoder
Allows displaying `<meta charset="x-mac-roman">` html files.
(`:set fenc=macroman`, `:w` in vim to save in that encoding.)
2023-01-24 14:37:20 +00:00
Nico Weber
b14b5a4d06 LibTextCodec: Simplify Latin1Decoder::process() a tiny bit 2023-01-24 14:37:20 +00:00
Nico Weber
3423b54eb9 LibTextCodec: Make utf-16be and utf-16le codecs actually work
There were two problems:

1. They didn't handle surrogates
2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4

Also add a basic test that catches both issues.
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
2023-01-22 21:30:44 +00:00
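Both problems listed in this commit can be illustrated with a small UTF-16BE decoder. This is a sketch under assumptions, not the code from the commit: the name `decode_utf16be`, the `std::u32string` output, and the choice to replace unpaired surrogates and a trailing odd byte with U+FFFD are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: decode big-endian UTF-16 bytes into code points.
std::u32string decode_utf16be(std::string_view bytes)
{
    // Read one code unit; going through uint8_t avoids the sign extension
    // that turned 0x00e4 into 0xffe4.
    auto unit_at = [&](std::size_t i) -> char32_t {
        auto hi = static_cast<std::uint8_t>(bytes[i]);
        auto lo = static_cast<std::uint8_t>(bytes[i + 1]);
        return static_cast<char32_t>((hi << 8) | lo);
    };

    std::u32string out;
    std::size_t i = 0;
    while (i + 1 < bytes.size()) {
        char32_t unit = unit_at(i);
        i += 2;
        bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < bytes.size()) {
            char32_t low = unit_at(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine a high/low surrogate pair into one code point.
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                i += 2;
                continue;
            }
        }
        bool is_surrogate = unit >= 0xD800 && unit <= 0xDFFF;
        out.push_back(is_surrogate ? static_cast<char32_t>(0xFFFD) : unit);
    }
    if (i < bytes.size())
        out.push_back(0xFFFD); // trailing odd byte
    return out;
}
```

A little-endian variant would differ only in swapping `hi` and `lo` inside `unit_at`.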
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Idan Horowitz
086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00
Karol Kosek
b006a60366 LibTextCodec: Pass code points instead of bytes on UTF-8 string process
Previously we were passing raw UTF-8 bytes as code points, which caused
CSS content properties to display incorrect characters.

This makes bullet separators in Wikipedia templates display correctly.
2022-03-29 01:01:32 +02:00
Hendiadyoin1
6a95df2526 LibTextCodec: Don't allocate Strings on encoding normalisation
This ripples down to LibWeb's HTML and XHR decoders, which therefore
become less allocation heavy.
2022-03-21 10:48:17 +01:00
Jelle Raaijmakers
9c2a7c0e03 LibTextCodec: Add support for the UTF16-LE encoding 2022-03-08 14:51:06 +01:00
Luke Wilde
0e0f98a45e LibTextCodec: Add x-user-defined decoder
It's a pretty simple charset: the bottom 128 bytes (0x00-0x7F) are
standard ASCII, while the top 128 bytes (0x80-0xFF) are mapped to a
portion of the Unicode Private Use Area, specifically 0xF780-0xF7FF.

This is used by Google Maps for certain blobs.
2022-02-12 12:53:28 +01:00
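The mapping described in this commit fits in a few lines. The sketch below is illustrative only; the function names and the `std::u32string` output are assumptions, not the LibTextCodec interface.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: map one x-user-defined byte to its code point.
constexpr char32_t x_user_defined_to_code_point(std::uint8_t byte)
{
    // 0x00-0x7F is plain ASCII; 0x80-0xFF lands on U+F780-U+F7FF.
    return byte < 0x80 ? static_cast<char32_t>(byte)
                       : static_cast<char32_t>(0xF780 + (byte - 0x80));
}

// Hypothetical helper: decode a whole byte string.
std::u32string decode_x_user_defined(std::string_view input)
{
    std::u32string out;
    out.reserve(input.size());
    for (char ch : input)
        out.push_back(x_user_defined_to_code_point(static_cast<std::uint8_t>(ch)));
    return out;
}
```

Since the mapping is a fixed offset, the original byte can be recovered from any code point in the U+F780-U+F7FF range by subtracting 0xF700.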