Commit graph

11 commits

Author SHA1 Message Date
Timothy Flynn
1869399fd1 AK: Specialize Optional for Utf16String and Utf16FlyString
We added this for String some time ago, so let's give Utf16String the
same optimization. Note that Utf16String was already handling its data
member potentially being null as of 5af99f4dd0.
2025-08-19 06:24:09 -04:00
Timothy Flynn
298ec6a12a AK: Ensure StringBuilder encodes U+10000 as 2 UTF-16 code units 2025-08-07 02:05:50 +02:00
Timothy Flynn
1b611fba67 AK: Ensure Utf16FlyString is hash-compatible with Utf16View/Utf16String 2025-08-07 02:05:50 +02:00
Timothy Flynn
2dc0a3b3ce AK: Add trim methods to Utf16String that skip allocation when not needed
If the string does not begin with any of the provided code units, we do
not need to create a new string.
2025-08-05 15:13:36 +02:00
Timothy Flynn
0bf565b97f AK: Allow comparing UTF-16 strings to UTF-8 strings
Before now, you could compare a Utf16View to a StringView, but it would
only be valid if the StringView were ASCII. When porting code to UTF-16,
it will be handy to have a code point-aware implementation for non-ASCII
StringViews.
2025-08-05 07:07:15 -04:00
Timothy Flynn
13ed6aba71 AK+LibIPC: Implement an encoder/decoder for UTF-16 strings 2025-08-02 10:10:14 -07:00
Timothy Flynn
a740bfd8ff AK+LibUnicode: Implement Unicode-aware UTF-16 case transformations 2025-07-25 18:16:22 +02:00
Timothy Flynn
df77ae1920 AK: Implement creating a UTF-16 string from a repeated code point 2025-07-25 18:16:22 +02:00
Timothy Flynn
f53389bab1 AK: Add a couple of Utf16String factories
* Utf16String::from_utf8_with_replacement_character
* Utf16String::from_code_point
2025-07-24 19:00:20 +02:00
Timothy Flynn
2803d66d87 AK: Support UTF-16 string formatting
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.

In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.

Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.

In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00