Commit graph

605 commits

Author SHA1 Message Date
Timothy Flynn
ad7ac679fd AK: Compute Utf16View::code_point_offset_of correctly
There were a couple of issues here, including the following computation
could actually overflow to NumericLimits<size_t>::max():

    code_unit_offset -= it.length_in_code_units();
2025-07-22 17:17:33 +02:00
Timothy Flynn
f595e47c1f AK: Add unit tests for Utf16View::code_unit_offset_of 2025-07-22 17:17:33 +02:00
Jelle Raaijmakers
265e278275 AK: Allow indexing at length in Utf8View::byte_offset_of()
And do the same for Utf8View::code_point_offset_of(). Some of these
`VERIFY`s of the view's length were introduced recently, but they caused
the parsing of named capture groups in RegexParser to crash in some
situations.

Instead, allow indexing at the view's length: the byte offset of code
point `length()` is known, even though that code point does not exist in
the view. Similarly, we know the code point offset at byte offset
`byte_length()`. Beyond those offsets, we still crash.

Fixes 13 failures in test262's `language/literals/regexp/named-groups`.
2025-07-22 09:10:32 -04:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Timothy Flynn
d40e3af697 AK: Implement UTF-16 string-to-number conversions 2025-07-18 12:45:38 -04:00
Timothy Flynn
6e0290ecaa AK: Define some UTF-16 helper methods
* contains
* escape_html_entities
* replace
* to_ascii_lowercase
* to_ascii_uppercase
* to_ascii_titlecase
* trim
* trim_whitespace
2025-07-18 12:45:38 -04:00
Timothy Flynn
7f069efbc4 AK: Implement a flyweight string for Utf16String
Utf16FlyString more or less works exactly the same as FlyString. It will
store the raw encoded data of the string instance. If the string is a
short ASCII string, Utf16FlyString holds the ShortString bytes; else,
Utf16FlyString holds a pointer to the Utf16StringData.
2025-07-18 12:45:38 -04:00
Timothy Flynn
2803d66d87 AK: Support UTF-16 string formatting
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.

In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.

Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.

In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
Timothy Flynn
8fbb80fffc AK: Do not fall back to simdutf for UTF-16 ASCII validation
This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This
code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these
bytes are ASCII if interpreted as UTF-8. But the string itself is most
certainly not ASCII.
2025-07-18 12:45:38 -04:00
Timothy Flynn
01ebf1eb07 AK: Replace surrogates in String::from_utf8_with_replacement_character
We are expected to replace lonely surrogates with U+FFFD when decoding
UTF-8 text.
2025-07-06 04:30:17 +12:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
Timothy Flynn
62d9a84b8d AK+Everywhere: Replace custom number parsers with fast_float
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float

However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.

We can also use fast_float for integer parsing.
2025-07-03 09:51:56 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
2abc955ca9 AK: Allow treating UTF-16 views with lonely surrogates as valid
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
d978a582a0 AK: Add a Utf16View ASCII validator 2025-07-03 09:51:56 -04:00
Timothy Flynn
35a1832d08 Tests/AK: Rename TestUtf16 / TestUtf8 to TestUtf16View / TestUtf8View
These are the files they actually test, so let's rename them to avoid
confusion with an upcoming Utf16String test.
2025-07-03 09:51:56 -04:00
Luke Wilde
31a8004ddb AK: Add the ability to consume specifically by a predicate
This will be used by Content Security Policy to consume the next
character, if it matches a whole range of characters, such as
is_ascii_alpha.
2025-07-01 10:24:24 +12:00
Tomasz Strejczek
8f8e51b1fc AK: Implement AK::UnixDateTime::to_string()
Copy implementation of LibCore::DateTime::to_string()
to AK.
Rename TestDuration.cpp to TestTime.cpp and add
there tests for to_string().
2025-06-19 18:42:45 -06:00
Tomasz Strejczek
e03c558a0a AK: Implement demangle() for MSVC ABI
This implements demangle() using Windows API. Also some rudimentary
test is provided.
2025-06-17 18:39:18 -06:00
Sam Atkins
26105b8b11 AK: Add a Formatter for Checked
This goes in Format.h instead of Checked.h, to avoid an include cycle.
2025-06-17 20:44:01 +02:00
Jelle Raaijmakers
6f926e6977 AK: Add Utf8View::code_point_offset_of() 2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
cc0a28ee7d AK: Add Utf16View::find_code_unit_offset(_ignoring_case) 2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
7d7f6fa494 AK: Remove superfluous check from Utf16View::equals_ignoring_case()
Returning true if both lengths are 0 is already handled by the default
case.
2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
b558b4dba6 AK: Add Span<T>::index_of(ReadonlySpan)
This will be used for case-sensitive substring index matches in a later
commit.
2025-06-13 15:08:26 +02:00
ayeteadoe
8cf01a25c2 AK: Add initial support for AK testsuite on Windows
We now explicitly enabling support for the minimum libraries needed
to build and run the AK testsuite. 81/82 tests are running and
passing. The exception is LexicalPath, as some path behaviour on
Windows is different than Unix, so the current tests will have lots of
platform specific failures. The implementer of LexicalPathWindows
recommended windows-specific tests here, so I will do that in a
follow up.
2025-05-20 10:58:43 -06:00
Ashton
5f5ae6bf8b AK: Replace wchar_t formatting with char32_t
This makes TestFormat fully cross-platform as we no longer have to
work around the 16 vs 32-bit wide strings
2025-05-18 19:18:13 -06:00
Ashton
4b3a3b0856 AK: Remove redundant TestPrint test
This test was only useful when AK/PrintfImplementation.h existed. But
that was removed 11 months ago, so since then this has just been
testing std library functions not implemented by us.
2025-05-18 19:18:13 -06:00
Andreas Kling
734bc2a0ea AK: Strip trailing zero decimals in default formatting of float numbers
This gives us a more human-looking serialization of numbers by default,
and in case a fixed number of decimal digits is actually wanted, we
still have the 'f' specifier.
2025-05-18 17:23:34 +02:00
ayeteadoe
744fd91d0b LibTest: Support death tests without child process cloning
A challenge for getting LibTest working on Windows has always
been CrashTest. It implements death tests similar to Google Test
where a child process is cloned to invoke the expression that
should abort/terminate the program. Then the exit code of the
child is used by the parent test process to verify if the
application correctly aborted/terminated due to invoking
the expression.

The problem was that finding an equivalent way to port Crash::run()
to Windows was not looking very likely as publicly exposed Win32/
Native APIs have no equivalent to fork(); however, Windows actually
does have native support for process cloning via undocumented NT
APIs that clever people reverse engineered and published, see
`NtCreateUserProcess()`.

All that being said, this `EXPECT_DEATH()` implementation avoids
needing to use a child process in general, allowing us to remove
CrashTest in favour of a single cross-platform solution for death
tests.
2025-05-16 13:23:32 -06:00
Andreas Kling
cf6e2531d9 AK: Make String::number() much faster for integer types
Instead of going through String::formatted(), we now have a specialized
code path for base-10 serialization directly to UTF-8.

This is roughly 5-10x faster than the previous implementation, depending
on how many digits we end up outputting.

1.07x speedup on MicroBench/for-in-indexed-properties.js
2025-05-02 19:13:03 +02:00
Tim Ledbetter
31dea89fe0 AK: Add lowest common multiple and greatest common divisor functions 2025-04-23 09:13:45 +01:00
Jonne Ransijn
bb20a0d8f8 AK: Allow the Optional<T> move assignment operator to be trivial
This will change behaviour for moved-from `Optional<T>`s, since they
will now no longer clear their value if `T` is trivial. However, a
moved-from value should be considered to be in an unspecified state.
Use `Optional<T>::clear` or `Optional<T>::release_value` instead.
2025-04-22 21:19:31 -06:00
Jonne Ransijn
a059ab4677 AK: Allow Optional<T> to be used in constant expressions 2025-04-22 21:19:31 -06:00
Andreas Kling
7628ddfaf7 AK: Remove endianness override from Utf16View
Utf16View is now always in "host" endian mode. This makes it smaller
and less branchy for everyone!
2025-04-16 10:04:50 +02:00
Timothy Flynn
917537b449 AK: Enable compile-time check of a format test string
Our implementation seems to parse this string just fine now.
2025-04-08 20:00:18 -04:00
Timothy Flynn
1d9e226206 AK: Remove unused UTF-8 / other factory methods from ByteString 2025-04-07 17:44:38 +02:00
Timothy Flynn
ee6b2db009 AK+LibURL+LibWeb: Use simdutf to validate ASCII strings
simdutf provides a vectorized ASCII validator, so let's use that instead
of looping over strings manually.
2025-04-06 11:05:58 -04:00
Timothy Flynn
7f37a8f60f AK: Add an AK::find helper to return a reference to the found value
This is often more convenient than dealing with iterators.

This commit includes a couple conversions to find_value as examples.
2025-04-06 13:45:10 +02:00
rmg-x
37998895d8 AK+Meta+LibCore+Tests: Remove unused SipHash implementation
This is a homegrown implementation that wasn't actually used in
dependent classes. If this is needed in the future, using OpenSSL would
probably be a better option.
2025-04-06 01:47:50 +02:00
rmg-x
6480e1a3fe AK+Tests: Add support for URI syntax in IPv6Address::from_string
This supports IPv6 strings that start with `[` and end with `]` in
accordance with RFC3986 which states:

A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]").  This is the only place where
square bracket characters are allowed in the URI syntax.
2025-04-05 14:26:09 -04:00
Kenneth Myhra
82a2ae99c8 Everywhere: Remove DeprecatedFlyString + any remaining references to it
This reverts commit 7c32d1e8a5.
2025-04-02 11:43:13 +02:00
Andreas Kling
7c32d1e8a5 Revert "Everywhere: Remove DeprecatedFlyString + any remaining references to it"
This reverts commit 3131e6369f.

Greatly regressed JavaScript benchmark performance.
2025-04-01 15:40:27 +02:00
Kenneth Myhra
3131e6369f Everywhere: Remove DeprecatedFlyString + any remaining references to it 2025-04-01 12:50:00 +02:00
Andrew Kaster
01ac48b36f AK: Support storing blocks in AK::Function
This has two slightly different implementations for ARC and non-ARC
compiler modes. The main idea is to store a block pointer as our
closure and use either ARC magic or BlockRuntime methods to manage
the memory for the block. Things are complicated by the fact that
we don't yet force-enable swift, so we can't count on the swift.org
llvm fork being our compiler toolchain. The patch adds some CMake
checks and ifdefs to still support environments without support
for blocks or ARC.
2025-03-18 17:15:08 -06:00
Tim Ledbetter
040dca0223 AK: Add first_is_equal_to_all_of()
This method returns true if all arguments are equal.
2025-03-18 21:55:06 +01:00
Jess
88c4f71114 AK/Checked: Dont verify overflow bit in lvalue operations
Before, adding an overflow'n `Checked<T>` to another `Checked<T>` would
cause a verification faliure when instead it should propogate m_overflow
and allow the user to handle the overflow.
2025-02-25 11:20:13 +00:00
Timothy Flynn
1e841cd453 Tests: Add a test for moving an object out of a JSON value
I recently questioned whether this would work as expected:

    JsonValue value { JsonObject {} };
    auto object = move(value.as_object());

So this just adds a unit test to ensure that it does.
2025-02-24 12:05:29 -05:00
Timothy Flynn
fe2dff4944 AK+Everywhere: Convert JSON value serialization to String
This removes the use of StringBuilder::OutputType (which was ByteString,
and only used by the JSON classes). And it removes the StringBuilder
template parameter from the serialization methods; this was only ever
used with StringBuilder, so a template is pretty overkill here.
2025-02-20 19:27:51 -05:00