AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
  to use in allocation-sensitive contexts, and is the reason we had to
  ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
  from the empty state, although null strings are considered empty.
  All code is immediately nicer when using Optional<DeprecatedString>
  but DeprecatedString came before Optional, which is how we ended up
  like this.
- The encoding of the underlying data is ambiguous. For the most part,
  we use it as if it's always UTF-8, but there have been cases where
  we pass around strings in other encodings (e.g. ISO-8859-1).
- operator[] and length() are used to iterate over DeprecatedString one
  byte at a time. This is done all over the codebase, and will *not*
  give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
  errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
  We may need to add a bypass for this in the future, for cases where
  you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
  with bytes(), but for iterating over code points, you should be using
  a UTF-8 iterator.
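The points above (fallible construction, validation up front, and code point iteration instead of byte indexing) can be sketched in standalone standard C++. This is only an illustration and does not use AK: `Utf8String`, `decode_one()` and `code_points()` are hypothetical names, and `std::optional` merely stands in for `ErrorOr<String>` (it cannot carry an ENOMEM error).

```cpp
#include <cassert>
#include <optional>
#include <stddef.h>
#include <stdint.h>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not AK's Utf8View): decodes the code point starting at
// byte offset `i` and advances `i` past it. Requires i < s.size().
// Returns nullopt if the bytes at `i` are not well-formed UTF-8.
inline std::optional<uint32_t> decode_one(std::string_view s, size_t& i)
{
    auto byte = [&](size_t k) { return static_cast<uint8_t>(s[k]); };
    uint8_t b0 = byte(i);
    size_t len = 0;
    uint32_t cp = 0;
    if (b0 < 0x80) { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return std::nullopt; // stray continuation byte or invalid lead byte

    if (i + len > s.size())
        return std::nullopt; // truncated sequence
    for (size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80)
            return std::nullopt; // expected a continuation byte
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }

    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    static constexpr uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return std::nullopt;

    i += len;
    return cp;
}

// Stand-in for a string type that is validated once, at construction.
class Utf8String {
public:
    // Fallible factory: there is no null state, failure is reported to the caller.
    static std::optional<Utf8String> from_utf8(std::string_view input)
    {
        for (size_t i = 0; i < input.size();) {
            if (!decode_one(input, i))
                return std::nullopt;
        }
        return Utf8String(std::string(input));
    }

    // Raw access is explicit; there is deliberately no operator[] or length().
    std::string_view bytes() const { return m_bytes; }

    // Iterates by code point rather than by byte.
    std::vector<uint32_t> code_points() const
    {
        std::vector<uint32_t> result;
        for (size_t i = 0; i < m_bytes.size();) {
            auto code_point = decode_one(m_bytes, i);
            assert(code_point.has_value()); // guaranteed by from_utf8()
            result.push_back(*code_point);
        }
        return result;
    }

private:
    explicit Utf8String(std::string bytes)
        : m_bytes(std::move(bytes))
    {
    }

    std::string m_bytes;
};
```

For the input "h\xC3\xA9llo", `bytes()` is 6 bytes long while `code_points()` yields 5 entries, which is exactly the mismatch that byte-wise operator[] and length() loops get wrong.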
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
  can fit entirely within a pointer. This means up to 3 bytes on 32-bit
  platforms, and 7 bytes on 64-bit platforms. Such small strings will
  not be heap-allocated.
- String can create substrings without making a deep copy of the
  substring. Instead, the superstring gets +1 refcount from the
  substring, and it acts like a view into the superstring. To make
  substrings like this, use the substring_with_shared_superstring() API.
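Both features can be sketched in standalone standard C++. The layout below is an assumption made for illustration, not a description of the actual AK implementation: `ShortString`, its tag encoding, and `SharedSubstring` are hypothetical names, and `std::shared_ptr` stands in for AK's reference counting.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <stddef.h>
#include <stdint.h>
#include <string>
#include <string_view>
#include <utility>

// One byte of the pointer-sized storage holds a tag and the length, leaving
// sizeof(void*) - 1 bytes for data: 7 on 64-bit platforms, 3 on 32-bit.
constexpr size_t max_short_string_size = sizeof(void*) - 1;

struct ShortString {
    uint8_t tag_and_length; // assumed encoding: (length << 1) | 1
    char storage[max_short_string_size];
};
static_assert(sizeof(ShortString) == sizeof(void*), "must fit in a pointer");

inline bool fits_in_short_string(std::string_view s)
{
    return s.size() <= max_short_string_size;
}

// Packs a small string inline; no heap allocation takes place.
inline ShortString make_short_string(std::string_view s)
{
    assert(fits_in_short_string(s));
    ShortString result {};
    result.tag_and_length = static_cast<uint8_t>((s.size() << 1) | 1);
    if (!s.empty())
        std::memcpy(result.storage, s.data(), s.size());
    return result;
}

inline std::string_view view_of(ShortString const& s)
{
    return { s.storage, static_cast<size_t>(s.tag_and_length >> 1) };
}

// A substring that keeps its superstring alive instead of copying from it.
class SharedSubstring {
public:
    SharedSubstring(std::shared_ptr<std::string const> superstring, size_t start, size_t length)
        : m_superstring(std::move(superstring))
        , m_start(start)
        , m_length(length)
    {
        assert(m_start + m_length <= m_superstring->size());
    }

    // A view into the superstring's bytes; note it is not null-terminated.
    std::string_view view() const
    {
        return std::string_view(*m_superstring).substr(m_start, m_length);
    }

private:
    std::shared_ptr<std::string const> m_superstring; // the "+1 refcount"
    size_t m_start { 0 };
    size_t m_length { 0 };
};
```

Creating a `SharedSubstring` from a `std::shared_ptr` raises the superstring's use count by one and copies no character data, which mirrors the refcount behaviour described above.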
One caveat:
- String does not guarantee that the underlying data is null-terminated
  like DeprecatedString does today. While this was nifty in a handful of
  places where we were calling C functions, it did stand in the way of
  shared-superstring substrings.
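In practice this caveat means that a caller who needs a C string has to make a null-terminated copy first. A minimal sketch in standard C++, where `std::string_view` stands in for a shared-superstring substring and `parse_long()` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <string_view>

// Hypothetical helper: parses a decimal number from a view that may point
// into the middle of a larger buffer.
inline long parse_long(std::string_view digits)
{
    // std::string owns a copy, and c_str() guarantees a trailing '\0'.
    std::string terminated(digits);
    return std::strtol(terminated.c_str(), nullptr, 10);
}

inline void demonstrate()
{
    std::string_view superstring = "12345";
    // "23", followed in memory by '4' rather than by '\0'.
    std::string_view substring = superstring.substr(1, 2);
    assert(parse_long(substring) == 23);
}
```

Passing `substring.data()` straight to `strtol` would keep reading into the rest of the superstring and parse 2345 instead of 23.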
											
										 
											2022-12-01 13:27:43 +01:00
/*
 * Copyright (c) 2018-2025, Andreas Kling <andreas@ladybird.org>
 * Copyright (c) 2025, Sam Atkins <sam@ladybird.org>
 *
 * SPDX-License-Identifier: BSD-2-Clause
 */

#include <AK/Array.h>
#include <AK/Checked.h>
#include <AK/Endian.h>
#include <AK/Enumerate.h>
#include <AK/FlyString.h>
#include <AK/Format.h>
#include <AK/MemMem.h>
#include <AK/Stream.h>
#include <AK/String.h>
#include <AK/StringNumber.h>
#include <AK/Utf16View.h>
#include <AK/Vector.h>
#include <stdlib.h>

#include <simdutf.h>

namespace AK {

String String::from_utf8_with_replacement_character(StringView view, WithBOMHandling with_bom_handling)
{
    if (auto bytes = view.bytes(); with_bom_handling == WithBOMHandling::Yes && bytes.starts_with({ { 0xEF, 0xBB, 0xBF } }))
        view = view.substring_view(3);

    Utf8View utf8_view { view };

    if (utf8_view.validate(AllowLonelySurrogates::No))
        return String::from_utf8_without_validation(view.bytes());

    StringBuilder builder(view.length());

    for (auto code_point : utf8_view) {
        if (is_unicode_surrogate(code_point))
            builder.append_code_point(UnicodeUtils::REPLACEMENT_CODE_POINT);
        else
            builder.append_code_point(code_point);
    }

    return builder.to_string_without_validation();
}

String String::from_utf8_without_validation(ReadonlyBytes bytes)
{
    String result;
    MUST(result.replace_with_new_string(bytes.size(), [&](Bytes buffer) {
        bytes.copy_to(buffer);
        return ErrorOr<void> {};
    }));
    return result;
}

String String::from_ascii_without_validation(ReadonlyBytes bytes)
{
    return from_utf8_without_validation(bytes);
}

ErrorOr<String> String::from_utf8(StringView view)
{
    if (!Utf8View { view }.validate())
        return Error::from_string_literal("String::from_utf8: Input was not valid UTF-8");

    String result;
    TRY(result.replace_with_new_string(view.length(), [&](Bytes buffer) {
        view.bytes().copy_to(buffer);
        return ErrorOr<void> {};
    }));
    return result;
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | ErrorOr<String> String::from_utf16_le_with_replacement_character(ReadonlyBytes bytes) | 
					
						
							| 
									
										
										
										
											2025-04-15 17:49:09 +02:00
										 |  |  | { | 
					
						
							|  |  |  |     if (bytes.is_empty()) | 
					
						
							|  |  |  |         return String {}; | 
					
						
							| 
									
										
										
										
											2025-07-03 10:12:07 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  |     auto const* utf16_data = reinterpret_cast<char16_t const*>(bytes.data()); | 
					
						
							|  |  |  |     auto utf16_length = bytes.size() / 2; | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-09 14:13:38 -04:00
										 |  |  |     Vector<char16_t> well_formed_utf16; | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | 
 | 
					
						
							|  |  |  |     if (!validate_utf16_le(bytes)) { | 
					
						
							|  |  |  |         well_formed_utf16.resize(bytes.size()); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         simdutf::to_well_formed_utf16le(utf16_data, utf16_length, well_formed_utf16.data()); | 
					
						
							|  |  |  |         utf16_data = well_formed_utf16.data(); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-03 10:12:07 -04:00
										 |  |  |     auto utf8_length = simdutf::utf8_length_from_utf16le(utf16_data, utf16_length); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     String result; | 
					
						
							|  |  |  |     TRY(result.replace_with_new_string(utf8_length, [&](Bytes buffer) -> ErrorOr<void> { | 
					
						
							|  |  |  |         [[maybe_unused]] auto result = simdutf::convert_utf16le_to_utf8(utf16_data, utf16_length, reinterpret_cast<char*>(buffer.data())); | 
					
						
							|  |  |  |         ASSERT(result == buffer.size()); | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  |     })); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2025-04-15 17:49:09 +02:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | ErrorOr<String> String::from_utf16_be_with_replacement_character(ReadonlyBytes bytes) | 
					
						
							| 
									
										
										
										
											2025-04-15 17:49:09 +02:00
										 |  |  | { | 
					
						
							|  |  |  |     if (bytes.is_empty()) | 
					
						
							|  |  |  |         return String {}; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-03 10:12:07 -04:00
										 |  |  |     auto const* utf16_data = reinterpret_cast<char16_t const*>(bytes.data()); | 
					
						
							|  |  |  |     auto utf16_length = bytes.size() / 2; | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-09 14:13:38 -04:00
										 |  |  |     Vector<char16_t> well_formed_utf16; | 
					
						
							| 
									
										
										
										
											2025-07-01 23:20:28 +10:00
										 |  |  | 
 | 
					
						
							|  |  |  |     if (!validate_utf16_be(bytes)) { | 
					
						
							|  |  |  |         well_formed_utf16.resize(bytes.size()); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         simdutf::to_well_formed_utf16be(utf16_data, utf16_length, well_formed_utf16.data()); | 
					
						
							|  |  |  |         utf16_data = well_formed_utf16.data(); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-03 10:12:07 -04:00
										 |  |  |     auto utf8_length = simdutf::utf8_length_from_utf16be(utf16_data, utf16_length); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     String result; | 
					
						
							|  |  |  |     TRY(result.replace_with_new_string(utf8_length, [&](Bytes buffer) -> ErrorOr<void> { | 
					
						
							|  |  |  |         [[maybe_unused]] auto result = simdutf::convert_utf16be_to_utf8(utf16_data, utf16_length, reinterpret_cast<char*>(buffer.data())); | 
					
						
							|  |  |  |         ASSERT(result == buffer.size()); | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  |     })); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2024-07-16 16:05:46 -04:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-02-19 18:34:29 -07:00
										 |  |  | ErrorOr<String> String::from_stream(Stream& stream, size_t byte_count) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2023-10-28 17:11:55 -04:00
										 |  |  |     String result; | 
					
						
							|  |  |  |     TRY(result.replace_with_new_string(byte_count, [&](Bytes buffer) -> ErrorOr<void> { | 
					
						
							|  |  |  |         TRY(stream.read_until_filled(buffer)); | 
					
						
							|  |  |  |         if (!Utf8View { StringView { buffer } }.validate()) | 
					
						
							|  |  |  |             return Error::from_string_literal("String::from_stream: Input was not valid UTF-8"); | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  |     })); | 
					
						
							|  |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2023-02-19 18:34:29 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2024-07-19 15:38:41 -04:00
										 |  |  | ErrorOr<String> String::from_string_builder(Badge<StringBuilder>, StringBuilder& builder) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     if (!Utf8View { builder.string_view() }.validate()) | 
					
						
							|  |  |  |         return Error::from_string_literal("String::from_string_builder: Input was not valid UTF-8"); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     String result; | 
					
						
							|  |  |  |     result.replace_with_string_builder(builder); | 
					
						
							|  |  |  |     return result; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | String String::from_string_builder_without_validation(Badge<StringBuilder>, StringBuilder& builder) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     String result; | 
					
						
							|  |  |  |     result.replace_with_string_builder(builder); | 
					
						
							|  |  |  |     return result; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-22 10:17:48 -05:00
										 |  |  | ErrorOr<String> String::repeated(u32 code_point, size_t count) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     VERIFY(is_unicode(code_point)); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     Array<u8, 4> code_point_as_utf8; | 
					
						
							|  |  |  |     size_t i = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     size_t code_point_byte_length = UnicodeUtils::code_point_to_utf8(code_point, [&](auto byte) { | 
					
						
							|  |  |  |         code_point_as_utf8[i++] = static_cast<u8>(byte); | 
					
						
							|  |  |  |     }); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     auto total_byte_count = code_point_byte_length * count; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-10-28 17:15:40 -04:00
										 |  |  |     String result; | 
					
						
							|  |  |  |     TRY(result.replace_with_new_string(total_byte_count, [&](Bytes buffer) { | 
					
						
							|  |  |  |         if (code_point_byte_length == 1) { | 
					
						
							|  |  |  |             buffer.fill(code_point_as_utf8[0]); | 
					
						
							|  |  |  |         } else { | 
					
						
							|  |  |  |             for (i = 0; i < count; ++i) | 
					
						
							|  |  |  |                 memcpy(buffer.data() + (i * code_point_byte_length), code_point_as_utf8.data(), code_point_byte_length); | 
					
						
							|  |  |  |         } | 
					
						
							|  |  |  |         return ErrorOr<void> {}; | 
					
						
							|  |  |  |     })); | 
					
						
							|  |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2023-01-22 10:17:48 -05:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | ErrorOr<String> String::vformatted(StringView fmtstr, TypeErasedFormatParams& params) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     StringBuilder builder; | 
					
						
							|  |  |  |     TRY(vformat(builder, fmtstr, params)); | 
					
						
							|  |  |  |     return builder.to_string(); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-16 17:12:53 +01:00
										 |  |  | ErrorOr<Vector<String>> String::split(u32 separator, SplitBehavior split_behavior) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return split_limit(separator, 0, split_behavior); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ErrorOr<Vector<String>> String::split_limit(u32 separator, size_t limit, SplitBehavior split_behavior) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     Vector<String> result; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     if (is_empty()) | 
					
						
							|  |  |  |         return result; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     bool keep_empty = has_flag(split_behavior, SplitBehavior::KeepEmpty); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     size_t substring_start = 0; | 
					
						
							|  |  |  |     for (auto it = code_points().begin(); it != code_points().end() && (result.size() + 1) != limit; ++it) { | 
					
						
							|  |  |  |         u32 code_point = *it; | 
					
						
							|  |  |  |         if (code_point == separator) { | 
					
						
							|  |  |  |             size_t substring_length = code_points().iterator_offset(it) - substring_start; | 
					
						
							|  |  |  |             if (substring_length != 0 || keep_empty) | 
					
						
							|  |  |  |                 TRY(result.try_append(TRY(substring_from_byte_offset_with_shared_superstring(substring_start, substring_length)))); | 
					
						
							|  |  |  |             substring_start = code_points().iterator_offset(it) + it.underlying_code_point_length_in_bytes(); | 
					
						
							|  |  |  |         } | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  |     size_t tail_length = code_points().byte_length() - substring_start; | 
					
						
							|  |  |  |     if (tail_length != 0 || keep_empty) | 
					
						
							|  |  |  |         TRY(result.try_append(TRY(substring_from_byte_offset_with_shared_superstring(substring_start, tail_length)))); | 
					
						
							|  |  |  |     return result; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-22 09:24:12 -05:00
										 |  |  | Optional<size_t> String::find_byte_offset(u32 code_point, size_t from_byte_offset) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     auto code_points = this->code_points(); | 
					
						
							|  |  |  |     if (from_byte_offset >= code_points.byte_length()) | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     for (auto it = code_points.iterator_at_byte_offset(from_byte_offset); it != code_points.end(); ++it) { | 
					
						
							|  |  |  |         if (*it == code_point) | 
					
						
							|  |  |  |             return code_points.byte_offset_of(it); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return {}; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-27 10:17:34 -05:00
										 |  |  | Optional<size_t> String::find_byte_offset(StringView substring, size_t from_byte_offset) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     auto view = bytes_as_string_view(); | 
					
						
							|  |  |  |     if (from_byte_offset >= view.length()) | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     auto index = memmem_optional( | 
					
						
							|  |  |  |         view.characters_without_null_termination() + from_byte_offset, view.length() - from_byte_offset, | 
					
						
							|  |  |  |         substring.characters_without_null_termination(), substring.length()); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     if (index.has_value()) | 
					
						
							|  |  |  |         return *index + from_byte_offset; | 
					
						
							|  |  |  |     return {}; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-11 08:26:49 -05:00
										 |  |  | bool String::operator==(FlyString const& other) const | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2023-10-28 18:58:29 -04:00
										 |  |  |     return static_cast<StringBase const&>(*this) == other.data({}); | 
					
						
							| 
									
										
										
										
											2023-01-11 08:26:49 -05:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | bool String::operator==(StringView other) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return bytes_as_string_view() == other; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ErrorOr<String> String::substring_from_byte_offset(size_t start, size_t byte_count) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     if (!byte_count) | 
					
						
							|  |  |  |         return String {}; | 
					
						
							|  |  |  |     return String::from_utf8(bytes_as_string_view().substring_view(start, byte_count)); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-22 11:40:57 -05:00
										 |  |  | ErrorOr<String> String::substring_from_byte_offset(size_t start) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     VERIFY(start <= bytes_as_string_view().length()); | 
					
						
							|  |  |  |     return substring_from_byte_offset(start, bytes_as_string_view().length() - start); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | ErrorOr<String> String::substring_from_byte_offset_with_shared_superstring(size_t start, size_t byte_count) const | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2023-10-28 17:50:24 -04:00
										 |  |  |     return String { TRY(StringBase::substring_from_byte_offset_with_shared_superstring(start, byte_count)) }; | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-22 11:40:57 -05:00
										 |  |  | ErrorOr<String> String::substring_from_byte_offset_with_shared_superstring(size_t start) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     VERIFY(start <= bytes_as_string_view().length()); | 
					
						
							|  |  |  |     return substring_from_byte_offset_with_shared_superstring(start, bytes_as_string_view().length() - start); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | bool String::operator==(char const* c_string) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return bytes_as_string_view() == c_string; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-09-05 19:55:21 +02:00
										 |  |  | u32 String::ascii_case_insensitive_hash() const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return case_insensitive_string_hash(reinterpret_cast<char const*>(bytes().data()), bytes().size()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2024-04-03 22:00:23 -04:00
										 |  |  | Utf8View String::code_points() const& | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | { | 
					
						
							|  |  |  |     return Utf8View(bytes_as_string_view()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ErrorOr<void> Formatter<String>::format(FormatBuilder& builder, String const& utf8_string) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return Formatter<StringView>::format(builder, utf8_string.bytes_as_string_view()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ErrorOr<String> String::replace(StringView needle, StringView replacement, ReplaceMode replace_mode) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return StringUtils::replace(*this, needle, replacement, replace_mode); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-13 11:34:00 -05:00
										 |  |  | ErrorOr<String> String::reverse() const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     // FIXME: This handles multi-byte code points, but not e.g. grapheme clusters.
 | 
					
						
							|  |  |  |     // FIXME: We could avoid allocating a temporary vector if Utf8View supports reverse iteration.
 | 
					
						
							|  |  |  |     auto code_point_length = code_points().length(); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     Vector<u32> code_points; | 
					
						
							|  |  |  |     TRY(code_points.try_ensure_capacity(code_point_length)); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     for (auto code_point : this->code_points()) | 
					
						
							|  |  |  |         code_points.unchecked_append(code_point); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-10-04 10:34:06 +02:00
										 |  |  |     StringBuilder builder(code_point_length * sizeof(u32)); | 
					
						
							| 
									
										
										
										
											2023-01-13 11:34:00 -05:00
										 |  |  |     while (!code_points.is_empty()) | 
					
						
							|  |  |  |         TRY(builder.try_append_code_point(code_points.take_last())); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return builder.to_string(); | 
					
						
							|  |  |  | } | 
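The code-point-wise reversal above can be sketched outside of AK as follows. This is a minimal illustration that uses only the C++ standard library, not the AK types; `decode_utf8`, `append_utf8`, and `reverse_code_points` are hypothetical names, and the decoder assumes its input is already well-formed UTF-8 (which `String` guarantees by validating on construction). Like the original, it reverses code points and does not keep grapheme clusters together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into code points. No validation is performed:
// the caller is assumed to pass valid UTF-8.
static std::vector<uint32_t> decode_utf8(std::string const& s)
{
    std::vector<uint32_t> out;
    for (size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // The lead byte determines the sequence length (1 to 4 bytes).
        size_t len = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2 : (b >> 4) == 0xE ? 3 : 4;
        uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Append one code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reverse a UTF-8 string one code point at a time (not one byte at a time).
std::string reverse_code_points(std::string const& s)
{
    auto code_points = decode_utf8(s);
    std::string out;
    out.reserve(s.size()); // The reversed string has the same byte length.
    for (auto it = code_points.rbegin(); it != code_points.rend(); ++it)
        append_utf8(out, *it);
    return out;
}
```

Reversing byte by byte would instead scramble every multi-byte sequence, which is why the temporary vector of code points is needed when the view cannot be iterated backwards.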
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-27 14:37:40 -05:00
										 |  |  | ErrorOr<String> String::trim(Utf8View const& code_points_to_trim, TrimMode mode) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     auto trimmed = code_points().trim(code_points_to_trim, mode); | 
					
						
							|  |  |  |     return String::from_utf8(trimmed.as_string()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | ErrorOr<String> String::trim(StringView code_points_to_trim, TrimMode mode) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return trim(Utf8View { code_points_to_trim }, mode); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-07-07 13:22:36 +05:30
										 |  |  | ErrorOr<String> String::trim_ascii_whitespace(TrimMode mode) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return trim(" \n\t\v\f\r"sv, mode); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-01-14 10:17:32 -05:00
										 |  |  | bool String::contains(StringView needle, CaseSensitivity case_sensitivity) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return StringUtils::contains(bytes_as_string_view(), needle, case_sensitivity); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-03-08 09:06:59 -05:00
										 |  |  | bool String::contains(u32 needle, CaseSensitivity case_sensitivity) const | 
					
						
							| 
									
										
										
										
											2023-01-14 10:17:32 -05:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2023-03-08 09:06:59 -05:00
										 |  |  |     auto needle_as_string = String::from_code_point(needle); | 
					
						
							|  |  |  |     return contains(needle_as_string.bytes_as_string_view(), case_sensitivity); | 
					
						
							| 
									
										
										
										
											2023-01-14 10:17:32 -05:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-03-03 09:27:50 +00:00
										 |  |  | bool String::starts_with(u32 code_point) const | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2023-03-08 08:56:02 -05:00
										 |  |  |     if (is_empty()) | 
					
						
							|  |  |  |         return false; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return *code_points().begin() == code_point; | 
					
						
							| 
									
										
										
										
											2023-03-03 09:27:50 +00:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-11-04 10:07:01 +01:00
										 |  |  | bool String::starts_with_bytes(StringView bytes, CaseSensitivity case_sensitivity) const | 
					
						
							| 
									
										
										
										
											2023-02-18 10:04:37 +03:30
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2023-11-04 10:07:01 +01:00
										 |  |  |     return bytes_as_string_view().starts_with(bytes, case_sensitivity); | 
					
						
							| 
									
										
										
										
											2023-02-18 10:04:37 +03:30
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-03-03 09:27:50 +00:00
										 |  |  | bool String::ends_with(u32 code_point) const | 
					
						
							| 
									
										
										
										
											2023-02-18 10:04:37 +03:30
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2025-04-29 01:45:40 +12:00
										 |  |  |     ASSERT(is_unicode(code_point)); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-03-08 08:56:02 -05:00
										 |  |  |     if (is_empty()) | 
					
						
							|  |  |  |         return false; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-29 01:45:40 +12:00
										 |  |  |     Array<u8, 4> code_point_as_utf8; | 
					
						
							|  |  |  |     size_t i = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     size_t code_point_byte_length = UnicodeUtils::code_point_to_utf8(code_point, [&](auto byte) { | 
					
						
							|  |  |  |         code_point_as_utf8[i++] = static_cast<u8>(byte); | 
					
						
							|  |  |  |     }); | 
					
						
							| 
									
										
										
										
											2023-03-08 08:56:02 -05:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-29 01:45:40 +12:00
										 |  |  |     return ends_with_bytes(StringView { code_point_as_utf8.data(), code_point_byte_length }); | 
					
						
							| 
									
										
										
										
											2023-03-03 09:27:50 +00:00
										 |  |  | } | 
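The approach in `ends_with(u32)` above (encode the code point into a small stack buffer, then compare bytes against the tail of the string) can be sketched with the standard library alone. This is an illustrative stand-in rather than the AK implementation: `encode_code_point` and `ends_with_code_point` are hypothetical names, and `std::string_view` takes the place of `StringView`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Encode one code point as UTF-8 into `out` and return the number of
// bytes written (1 to 4).
static size_t encode_code_point(uint32_t cp, std::array<char, 4>& out)
{
    assert(cp <= 0x10FFFF);
    size_t n = 0;
    if (cp < 0x80) {
        out[n++] = static_cast<char>(cp);
    } else if (cp < 0x800) {
        out[n++] = static_cast<char>(0xC0 | (cp >> 6));
        out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out[n++] = static_cast<char>(0xE0 | (cp >> 12));
        out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out[n++] = static_cast<char>(0xF0 | (cp >> 18));
        out[n++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return n;
}

// True if the UTF-8 string `haystack` ends with the code point `cp`.
bool ends_with_code_point(std::string_view haystack, uint32_t cp)
{
    if (haystack.empty())
        return false;

    std::array<char, 4> buffer {};
    size_t length = encode_code_point(cp, buffer);
    std::string_view needle(buffer.data(), length);

    // A byte-wise suffix comparison is enough: in valid UTF-8, the encoding
    // of one code point never appears as the tail of a different code point.
    return haystack.size() >= length
        && haystack.substr(haystack.size() - length) == needle;
}
```

The fixed four-byte buffer avoids any heap allocation, so the check cannot fail with an out-of-memory error.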
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-11-04 10:07:01 +01:00
										 |  |  | bool String::ends_with_bytes(StringView bytes, CaseSensitivity case_sensitivity) const | 
					
						
							| 
									
										
										
										
											2023-03-03 09:27:50 +00:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2023-11-04 10:07:01 +01:00
										 |  |  |     return bytes_as_string_view().ends_with(bytes, case_sensitivity); | 
					
						
							| 
									
										
										
										
											2023-02-18 10:04:37 +03:30
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | unsigned Traits<String>::hash(String const& string) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return string.hash(); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-12-16 17:49:34 +03:30
										 |  |  | ByteString String::to_byte_string() const | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2023-12-16 17:49:34 +03:30
										 |  |  |     return ByteString(bytes_as_string_view()); | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-12-16 17:49:34 +03:30
										 |  |  | ErrorOr<String> String::from_byte_string(ByteString const& byte_string) | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2023-12-16 17:49:34 +03:30
										 |  |  |     return String::from_utf8(byte_string.view()); | 
					
						
							| 
									
										
											  
											
											
										 
											2022-12-01 13:27:43 +01:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2024-10-14 10:51:15 +02:00
										 |  |  | String String::to_ascii_lowercase() const | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  |     if (!any_of(bytes(), is_ascii_upper_alpha)) | 
					
						
							| 
									
										
										
										
											2024-10-14 10:51:15 +02:00
										 |  |  |         return *this; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     String result; | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     MUST(result.replace_with_new_string(byte_count(), [&](Bytes buffer) -> ErrorOr<void> { | 
					
						
							|  |  |  |         for (auto [i, byte] : enumerate(bytes())) | 
					
						
							|  |  |  |             buffer[i] = static_cast<u8>(AK::to_ascii_lowercase(byte)); | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  |     })); | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2024-10-14 10:51:15 +02:00
										 |  |  | } | 
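The pattern in `to_ascii_lowercase()` above (return the input unchanged when there is nothing to convert, otherwise fill a buffer of the same byte length) can be sketched with the standard library as follows. The names here are hypothetical and `std::string` stands in for `String`; the point of the sketch is that an ASCII-only case mapping can work directly on UTF-8 bytes, since every byte of a multi-byte sequence is 0x80 or above and is therefore never mistaken for a letter.

```cpp
#include <algorithm>
#include <string>

// True for the bytes 'A' through 'Z' only.
static bool is_ascii_upper(unsigned char byte)
{
    return byte >= 'A' && byte <= 'Z';
}

// Lowercase ASCII letters in a UTF-8 string, leaving all other bytes as-is.
std::string ascii_lowercase(std::string const& input)
{
    // Fast path: nothing to convert, so return a plain copy of the input.
    bool has_upper = std::any_of(input.begin(), input.end(), [](char c) {
        return is_ascii_upper(static_cast<unsigned char>(c));
    });
    if (!has_upper)
        return input;

    // The result has exactly the same byte length as the input.
    std::string result(input.size(), '\0');
    std::transform(input.begin(), input.end(), result.begin(), [](char c) {
        auto byte = static_cast<unsigned char>(c);
        return static_cast<char>(is_ascii_upper(byte) ? byte + ('a' - 'A') : byte);
    });
    return result;
}
```

In the original, the fast path is cheaper than in this sketch, presumably because returning `*this` for a heap-allocated `String` shares the reference-counted data instead of copying it.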
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | String String::to_ascii_uppercase() const | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  |     if (!any_of(bytes(), is_ascii_lower_alpha)) | 
					
						
							| 
									
										
										
										
											2024-10-14 10:51:15 +02:00
										 |  |  |         return *this; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     String result; | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     MUST(result.replace_with_new_string(byte_count(), [&](Bytes buffer) -> ErrorOr<void> { | 
					
						
							|  |  |  |         for (auto [i, byte] : enumerate(bytes())) | 
					
						
							|  |  |  |             buffer[i] = static_cast<u8>(AK::to_ascii_uppercase(byte)); | 
					
						
							|  |  |  |         return {}; | 
					
						
							|  |  |  |     })); | 
					
						
							| 
									
										
										
										
											2025-04-06 09:45:05 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-04-07 08:51:36 -04:00
										 |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2024-10-14 10:51:15 +02:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | bool String::equals_ignoring_ascii_case(String const& other) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return StringUtils::equals_ignoring_ascii_case(bytes_as_string_view(), other.bytes_as_string_view()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-11-04 10:07:01 +01:00
										 |  |  | bool String::equals_ignoring_ascii_case(StringView other) const | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return StringUtils::equals_ignoring_ascii_case(bytes_as_string_view(), other); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2024-04-18 21:35:35 +12:00
										 |  |  | ErrorOr<String> String::repeated(String const& input, size_t count) | 
					
						
							| 
									
										
										
										
											2023-12-29 13:20:11 +01:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2024-05-06 21:52:11 +02:00
										 |  |  |     if (Checked<u32>::multiplication_would_overflow(count, input.bytes().size())) | 
					
						
							| 
									
										
										
										
											2024-04-18 21:35:35 +12:00
										 |  |  |         return Error::from_errno(EOVERFLOW); | 
					
						
							| 
									
										
										
										
											2023-12-29 13:20:11 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-10-28 17:15:40 -04:00
										 |  |  |     String result; | 
					
						
							|  |  |  |     size_t input_size = input.bytes().size(); | 
					
						
							| 
									
										
										
										
											2024-04-18 21:35:35 +12:00
										 |  |  |     TRY(result.replace_with_new_string(count * input_size, [&](Bytes buffer) { | 
					
						
							| 
									
										
										
										
											2023-10-28 17:15:40 -04:00
										 |  |  |         if (input_size == 1) { | 
					
						
							|  |  |  |             buffer.fill(input.bytes().first()); | 
					
						
							|  |  |  |         } else { | 
					
						
							|  |  |  |             for (size_t i = 0; i < count; ++i) | 
					
						
							|  |  |  |                 input.bytes().copy_to(buffer.slice(i * input_size, input_size)); | 
					
						
							|  |  |  |         } | 
					
						
							|  |  |  |         return ErrorOr<void> {}; | 
					
						
							|  |  |  |     })); | 
					
						
							| 
									
										
										
										
											2024-04-18 21:35:35 +12:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-10-28 17:15:40 -04:00
										 |  |  |     return result; | 
					
						
							| 
									
										
										
										
											2023-12-29 13:20:11 +01:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-02-10 11:47:51 +00:00
										 |  |  | String String::bijective_base_from(size_t value, Case target_case, unsigned base, StringView map) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     value++; | 
					
						
							|  |  |  |     if (map.is_null()) | 
					
						
							|  |  |  |         map = target_case == Case::Upper ? "ABCDEFGHIJKLMNOPQRSTUVWXYZ"sv : "abcdefghijklmnopqrstuvwxyz"sv; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     VERIFY(base >= 2 && base <= map.length()); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     // The '8 bits per byte' assumption may need to go?
 | 
					
						
							|  |  |  |     Array<char, round_up_to_power_of_two(sizeof(size_t) * 8 + 1, 2)> buffer; | 
					
						
							|  |  |  |     size_t i = 0; | 
					
						
							|  |  |  |     do { | 
					
						
							|  |  |  |         auto remainder = value % base; | 
					
						
							|  |  |  |         auto new_value = value / base; | 
					
						
							|  |  |  |         if (remainder == 0) { | 
					
						
							|  |  |  |             new_value--; | 
					
						
							|  |  |  |             remainder = base; // the largest digit is `base`, which may be smaller than map.length() | 
					
						
							|  |  |  |         } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         buffer[i++] = map[remainder - 1]; | 
					
						
							|  |  |  |         value = new_value; | 
					
						
							|  |  |  |     } while (value > 0); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     for (size_t j = 0; j < i / 2; ++j) | 
					
						
							|  |  |  |         swap(buffer[j], buffer[i - j - 1]); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return MUST(from_utf8(ReadonlyBytes(buffer.data(), i))); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-07-18 17:08:27 -04:00
										 |  |  | String String::greek_letter_from(size_t value) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     static StringView const map = "αβγδεζηθικλμνξοπρστυφχψω"sv; | 
					
						
							|  |  |  |     static unsigned const base = 24; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     StringBuilder builder; | 
					
						
							|  |  |  |     while (value > 0) { | 
					
						
							|  |  |  |         value--; | 
					
						
							|  |  |  |         builder.append(map.substring_view((value % base) * 2, 2)); | 
					
						
							|  |  |  |         value /= base; | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return MUST(builder.to_string_without_validation().reverse()); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-02-10 11:47:51 +00:00
										 |  |  | String String::roman_number_from(size_t value, Case target_case) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     if (value > 3999) | 
					
						
							|  |  |  |         return number(value); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     StringBuilder builder; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     while (value > 0) { | 
					
						
							|  |  |  |         if (value >= 1000) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'M' : 'm'); | 
					
						
							|  |  |  |             value -= 1000; | 
					
						
							|  |  |  |         } else if (value >= 900) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "CM"sv : "cm"sv); | 
					
						
							|  |  |  |             value -= 900; | 
					
						
							|  |  |  |         } else if (value >= 500) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'D' : 'd'); | 
					
						
							|  |  |  |             value -= 500; | 
					
						
							|  |  |  |         } else if (value >= 400) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "CD"sv : "cd"sv); | 
					
						
							|  |  |  |             value -= 400; | 
					
						
							|  |  |  |         } else if (value >= 100) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'C' : 'c'); | 
					
						
							|  |  |  |             value -= 100; | 
					
						
							|  |  |  |         } else if (value >= 90) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "XC"sv : "xc"sv); | 
					
						
							|  |  |  |             value -= 90; | 
					
						
							|  |  |  |         } else if (value >= 50) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'L' : 'l'); | 
					
						
							|  |  |  |             value -= 50; | 
					
						
							|  |  |  |         } else if (value >= 40) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "XL"sv : "xl"sv); | 
					
						
							|  |  |  |             value -= 40; | 
					
						
							|  |  |  |         } else if (value >= 10) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'X' : 'x'); | 
					
						
							|  |  |  |             value -= 10; | 
					
						
							|  |  |  |         } else if (value == 9) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "IX"sv : "ix"sv); | 
					
						
							|  |  |  |             value -= 9; | 
					
						
							|  |  |  |         } else if (value >= 5 && value <= 8) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'V' : 'v'); | 
					
						
							|  |  |  |             value -= 5; | 
					
						
							|  |  |  |         } else if (value == 4) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? "IV"sv : "iv"sv); | 
					
						
							|  |  |  |             value -= 4; | 
					
						
							|  |  |  |         } else if (value <= 3) { | 
					
						
							|  |  |  |             builder.append(target_case == Case::Upper ? 'I' : 'i'); | 
					
						
							|  |  |  |             value -= 1; | 
					
						
							|  |  |  |         } | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     return builder.to_string_without_validation(); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2025-05-02 13:12:14 +02:00
										 |  |  | template<Integral T> | 
					
						
							|  |  |  | String String::number(T value) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2025-10-04 00:06:38 +02:00
										 |  |  |     return create_string_from_number<String, T>(value); | 
					
						
							| 
									
										
										
										
											2025-05-02 13:12:14 +02:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | template String String::number(char); | 
					
						
							|  |  |  | template String String::number(signed char); | 
					
						
							|  |  |  | template String String::number(unsigned char); | 
					
						
							|  |  |  | template String String::number(signed short); | 
					
						
							|  |  |  | template String String::number(unsigned short); | 
					
						
							|  |  |  | template String String::number(int); | 
					
						
							|  |  |  | template String String::number(unsigned int); | 
					
						
							|  |  |  | template String String::number(long); | 
					
						
							|  |  |  | template String String::number(unsigned long); | 
					
						
							|  |  |  | template String String::number(long long); | 
					
						
							|  |  |  | template String String::number(unsigned long long); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											2022-12-01 13:27:43 +01:00
										 |  |  | } |
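
The digit loop in `String::bijective_base_from` can be easier to follow outside of AK. What follows is a minimal standalone sketch using only the C++ standard library. The name `bijective_label`, its parameters, and the use of `std::string` are assumptions made for illustration and are not part of AK. It takes a zero-based index, shifts it by one so that the digits run from 1 to `base` with no zero digit, and treats a zero remainder as the largest digit while borrowing one from the next place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of AK: converts a zero-based index to
// bijective base-`base` notation, using the first `base` characters of
// `digits` as the symbols for the digits 1..base.
std::string bijective_label(std::size_t index, unsigned base = 26,
    std::string_view digits = "abcdefghijklmnopqrstuvwxyz")
{
    // Mirrors the VERIFY() in the original: every digit needs a symbol.
    assert(base >= 2 && base <= digits.size());

    std::string out;
    std::size_t value = index + 1; // bijective numeration has no zero digit
    while (value > 0) {
        std::size_t remainder = value % base;
        value /= base;
        if (remainder == 0) {
            // A zero remainder stands for the largest digit (`base`),
            // borrowed from the next, more significant place.
            remainder = base;
            --value;
        }
        out.push_back(digits[remainder - 1]);
    }

    // Digits were produced least significant first.
    std::reverse(out.begin(), out.end());
    return out;
}
```

With the default alphabet this yields spreadsheet-style column names: index 0 maps to `a`, index 25 to `z`, and index 26 to `aa`. The sketch builds the result in a growable `std::string` rather than a fixed-size `Array`, which trades the original's stack-only buffer for simplicity.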