Commit graph

39 commits

Author SHA1 Message Date
aplefull
aeec2c804c LibRegex: Implement Unicode case-insensitive matching
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters.

Now we implement the Canonicalize abstract operation, so we can case-fold
Unicode characters properly during case-insensitive matching.
2026-02-16 07:51:00 -05:00
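
To illustrate the difference described above, here is a minimal sketch and not LibRegex's actual code. It assumes a tiny hand-written subset of Unicode simple case folding; the names `fold_table`, `canonicalize`, and `equal_ignoring_case` are hypothetical. It follows only the Unicode-mode idea of Canonicalize (simple case folding) and ignores the non-Unicode branch.

```cpp
#include <cassert>

// Hypothetical subset of simple case-folding pairs (taken from Unicode's
// CaseFolding.txt). A real implementation would consult the full data set.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

static constexpr FoldEntry fold_table[] = {
    { U'\u017F', U's' },      // LATIN SMALL LETTER LONG S -> s
    { U'\u212A', U'k' },      // KELVIN SIGN -> k
    { U'\u00C4', U'\u00E4' }, // A WITH DIAERESIS -> a with diaeresis
    { U'\u03A3', U'\u03C3' }, // GREEK CAPITAL SIGMA -> small sigma
    { U'\u03C2', U'\u03C3' }, // GREEK SMALL FINAL SIGMA -> small sigma
};

// The ASCII-only behaviour: only A-Z is mapped, everything else is untouched.
char32_t ascii_lowercase(char32_t ch)
{
    return (ch >= U'A' && ch <= U'Z') ? ch + 32 : ch;
}

// Sketch of a canonicalize step: fold via the table, fall back to ASCII.
char32_t canonicalize(char32_t ch, bool ignore_case)
{
    if (!ignore_case)
        return ch;
    for (auto const& entry : fold_table) {
        if (entry.from == ch)
            return entry.to;
    }
    return ascii_lowercase(ch);
}

// Two code points match case-insensitively if they canonicalize identically.
bool equal_ignoring_case(char32_t a, char32_t b)
{
    return canonicalize(a, true) == canonicalize(b, true);
}
```

With this sketch, the Kelvin sign compares equal to `k` through `equal_ignoring_case`, whereas `ascii_lowercase` alone leaves it unchanged.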
Ali Mohammad Pur
01be1ed583 LibRegex: Mark OpCode_classes with REGEX_API 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
6aba31ba13 LibRegex: Add some FileCheck-like tests to ensure opts don't break 2026-02-07 14:09:56 +01:00
aplefull
e4572aa9d7 LibRegex: Add support for regex modifiers
This commit implements the regexp-modifiers proposal. It lets us
modify the i, m, and s flags within groups using the
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
2026-01-16 15:00:00 +01:00
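
The flag part of that syntax could be parsed roughly as follows. This is a hedged sketch and not the parser added by the commit: `ModifierSet`, `Modifiers`, and `parse_modifiers` are hypothetical names, and the validation rules (no repeated flag, at most one `-`, a lone `-` rejected) reflect one reading of the proposal that should be checked against it.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

struct ModifierSet {
    bool i = false;
    bool m = false;
    bool s = false;
};

struct Modifiers {
    ModifierSet add;    // flags enabled inside the group
    ModifierSet remove; // flags disabled inside the group
};

// Parses the text between "(?" and ":", e.g. "i", "im-s" or "-m".
// Returns std::nullopt when the text is rejected.
std::optional<Modifiers> parse_modifiers(std::string_view text)
{
    Modifiers result;
    bool removing = false;
    std::string seen; // every flag seen so far, on either side of '-'
    for (char c : text) {
        if (c == '-') {
            if (removing)
                return std::nullopt; // more than one '-'
            removing = true;
            continue;
        }
        if (c != 'i' && c != 'm' && c != 's')
            return std::nullopt; // only i, m and s can be modified
        if (seen.find(c) != std::string::npos)
            return std::nullopt; // a flag may appear only once overall
        seen.push_back(c);
        ModifierSet& target = removing ? result.remove : result.add;
        if (c == 'i')
            target.i = true;
        else if (c == 'm')
            target.m = true;
        else
            target.s = true;
    }
    if (removing && seen.empty())
        return std::nullopt; // "(?-:" names no flags at all
    return result;
}
```

For example, `parse_modifiers("i-ms")` yields a set that adds `i` and removes `m` and `s`, while `"ii"` and `"i-i"` are rejected.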
aplefull
6ce312e22f LibRegex: Prevent empty matches in optional quantifiers
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
2026-01-16 01:11:24 +01:00
mikiubo
535d2476a7 LibRegex: Implement proper lookbehind via new StepBack opcodes
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.

Partially fix issue: #3459
2026-01-11 23:24:49 +01:00
Ali Mohammad Pur
c1535ef65b LibRegex: Skip multi-op compare overhead when not necessary 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
637d47ba30 LibRegex: Add an optimisation for replacing /.*x/ with a seek op
This will avoid some catastrophic backtracking by just skipping to 'x'.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
e2c6918cdb LibRegex: Fuse consecutive single-char Compares into a String Compare
This avoids huge instruction decoding and dispatch overhead, ~40x
performance improvement for /(^|x)ppp/.
2026-01-05 18:22:11 +01:00
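
The idea of fusing adjacent single-character compares can be sketched on a toy instruction list. This is an illustration only: the `Instruction` type and `fuse_char_compares` are hypothetical and do not reflect LibRegex's real bytecode layout or optimiser.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy instruction set (an assumption for this sketch).
struct Instruction {
    enum class Kind {
        CompareChar,   // text holds exactly one character
        CompareString, // text holds several characters
        Other,         // anything that must not be merged
    } kind;
    std::string text;
};

// Merges each CompareChar into a directly preceding compare, so a run of
// single-character compares becomes one string compare.
std::vector<Instruction> fuse_char_compares(std::vector<Instruction> const& input)
{
    std::vector<Instruction> output;
    for (auto const& instruction : input) {
        bool is_char = instruction.kind == Instruction::Kind::CompareChar;
        if (is_char && !output.empty() && output.back().kind != Instruction::Kind::Other) {
            output.back().kind = Instruction::Kind::CompareString;
            output.back().text += instruction.text;
            continue;
        }
        output.push_back(instruction);
    }
    return output;
}
```

Applied to one `Other` instruction followed by three `CompareChar` instructions for `p`, the pass leaves two instructions, the second being a single string compare for `ppp`, so only one instruction is decoded and dispatched instead of three.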
Ali Mohammad Pur
9d49fafdbf LibRegex: Add an optimisation to skip forks that cannot produce a match
...and implement it for 'start of line' checks.
This makes patterns like /(^|x)ppp/ fork-free at runtime, ~30% perf
improvement for that pattern.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
3f35d84785 LibRegex+LibJS: Flatten the bytecode buffer before regex execution
This makes it so we don't have to unnecessarily check for having a
flattened buffer; significant performance increase.
2026-01-05 18:22:11 +01:00
aplefull
a49c39de32 LibRegex: Support matching unicode multi-character sequences 2025-11-26 11:34:38 +01:00
aplefull
c4eef822de LibRegex: Fix backreferences to undefined capture groups
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
2025-10-16 16:37:54 +02:00
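
The two lookup strategies described above might look like this in isolation. It is a sketch under assumptions: `CaptureGroup`, `resolve_numbered`, and `resolve_named` are hypothetical, and a group that did not participate is modelled as an empty `std::optional` whose backreference matches the empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct CaptureGroup {
    std::string name;                 // empty for unnamed groups
    std::optional<std::string> value; // std::nullopt if the group did not participate
};

// Numbered backreference (\1, \2, ...): exact lookup by 1-based index.
std::string resolve_numbered(std::vector<CaptureGroup> const& groups, std::size_t index)
{
    if (index == 0 || index > groups.size())
        return "";
    return groups[index - 1].value.value_or("");
}

// Named backreference (\k<name>): several groups may share a name, so pick
// the first one that actually participated in the match.
std::string resolve_named(std::vector<CaptureGroup> const& groups, std::string const& name)
{
    for (auto const& group : groups) {
        if (group.name == name && group.value.has_value())
            return *group.value;
    }
    return "";
}
```

With two groups both named `year`, where only the second one captured text, `resolve_named` returns that text while `resolve_numbered(groups, 1)` returns the empty string.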
Rocco Corsi
3d1d055e27 LibRegex: Export OpCode/OpCode_Compare for REGEX_DEBUG builds
When building with REGEX_DEBUG or ENABLE_ALL_THE_DEBUG_MACROS there are
two issues with linking of bin/TestRegex

 - Libraries/LibRegex/RegexDebug.h:76 with undefined reference
       regex::OpCode_Compare::variable_arguments_to_byte_string(
           AK::Optional<regex::MatchInput const&>) const

 - Libraries/LibRegex/RegexByteCode.h:672 with undefined reference
       regex::OpCode::name(regex::OpCodeId)

Add REGEX_API on regex::OpCode and regex::OpCode_Compare to allow
access to the classes in bin/TestRegex
2025-09-18 11:02:13 +02:00
Jelle Raaijmakers
73967ee90c Everywhere: Use HashMap::update() where applicable 2025-07-25 16:22:06 +02:00
Ali Mohammad Pur
5b45223d5f LibRegex: Account for uppercase characters in insensitive patterns 2025-07-12 11:26:23 +02:00
Ali Mohammad Pur
b0e471228d LibRegex: Avoid use-after-return of MatchState in 'is_an_eligible_jump'
The opcode may have last been accessed by
block_satisfies_atomic_rewrite_precondition, which would set it to a
state that no longer exists.
Set the state to the correct one unconditionally to ensure we're looking
at the right value.
Fixes #5145.
2025-06-24 18:43:01 +02:00
ayeteadoe
a3754a7bf1 LibRegex: Annotate classes with export macro for hidden visibility
This fix demos the gradual opt-in migration process libraries can
take to switch to explicit symbol exports via the FOO_API macros.
2025-05-12 03:22:23 -06:00
Andrew Kaster
3dd2fbd041 LibRegex: Move StringTable ctor/dtor out of line
This also moves the next_serial class static into a file scope static.
The public class static was causing visibility issues with certain Linux
builds when hidden visibility was enabled. However, the current design
makes more sense anyway :^).
2025-05-12 03:22:23 -06:00
Ali Mohammad Pur
4b9abdb963 LibRegex: Remove useless jumps (Jump* +0) before running opts
This leads to some more significant performance increases on the simple
/<script|<style|<link/ regex in speedometer (~2x)
2025-04-23 22:57:49 +02:00
Andreas Kling
54edf29f1b LibRegex: Make Match::capture_group_name an index into the string table
This removes another Match member that required destruction. The "API"
for accessing the strings is definitely a bit awkward. We'll think of
something nicer eventually.
2025-04-14 17:40:13 +02:00
Ali Mohammad Pur
69050da929 LibRegex: Merge inverse string table mappings separately 2025-04-06 20:21:16 +02:00
Ali Mohammad Pur
4136d8d13e LibRegex: Use an interned string table for capture group names
This avoids messing around with unsafe string pointers and removes the
only non-FlyString-able user of DeprecatedFlyString.
2025-04-02 11:43:13 +02:00
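
An interned string table in general can be sketched as below. `NameTable`, `intern`, and `get` are hypothetical names for illustration; LibRegex's actual StringTable may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Each distinct string is stored once and referred to by a small index,
// so users hold indices instead of raw string pointers.
class NameTable {
public:
    // Returns the index for the string, adding it on first use.
    std::size_t intern(std::string const& string)
    {
        auto it = m_indices.find(string);
        if (it != m_indices.end())
            return it->second;
        std::size_t index = m_strings.size();
        m_strings.push_back(string);
        m_indices.emplace(string, index);
        return index;
    }

    // Looks up the string for an index; throws std::out_of_range if unknown.
    std::string const& get(std::size_t index) const
    {
        return m_strings.at(index);
    }

private:
    std::vector<std::string> m_strings;
    std::unordered_map<std::string, std::size_t> m_indices;
};
```

Interning the same name twice yields the same index, which is what lets a match refer to its capture group name by index alone.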
mikiubo
c85df78c4c LibRegex: Remove orphaned save points in nested LookAhead 2025-03-17 16:11:02 +01:00
Tim Ledbetter
b9ac99d2eb Revert "LibRegex: Remove orphaned save points in nested LookAhead"
This reverts commit f2678bfcb8.
2025-03-14 19:57:33 +00:00
mikiubo
f2678bfcb8 LibRegex: Remove orphaned save points in nested LookAhead 2025-03-14 09:41:41 +01:00
Timothy Flynn
85b424464a AK+Everywhere: Rename verify_cast to as
Follow-up to fc20e61e72.
2025-01-21 11:34:06 -05:00
Ali Mohammad Pur
f8092455e2 LibRegex: Print OpCode_Repeat's offset as ssize_t 2024-12-13 10:00:16 +01:00
Pavel Shliak
cdb54fe504 LibRegex: Clean up #include directives
This change aims to improve the speed of incremental builds.
2024-11-21 14:08:33 +01:00
Timothy Flynn
93712b24bf Everywhere: Hoist the Libraries folder to the top-level 2024-11-10 12:50:45 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/ 2021-01-12 12:17:46 +01:00
Sahan Fernando
fe2b8906d4 Everywhere: Fix incorrect uses of String::format and StringBuilder::appendf
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
2021-01-11 21:06:32 +01:00
Nathan Lanza
d1891f67ac AK: Use direct-list-initialization for Vector::empend() (#4564)
clang trunk with -std=c++20 doesn't seem to properly look for an
aggregate initializer here when the type being constructed is a simple
aggregate (e.g. `struct Thing { int a; int b; };`). This template fails
to compile in a usage added 12/16/2020 in `AK/Trie.h`.

Both forms of initialization are supposed to call the
aggregate-initializers but direct-list-initialization delegating to
aggregate initializers is a new addition in c++20 that might not be
implemented yet.
2020-12-27 23:06:37 +01:00
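
The brace form of construction can be sketched with a stand-in helper. This is not AK's Vector: the free function `empend` over `std::vector` is a hypothetical substitute that only shows why `T { args... }` works for a simple aggregate.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A simple aggregate: no user-declared constructors.
struct Thing {
    int a;
    int b;
};

// Constructs the element with direct-list-initialization. For an aggregate
// such as Thing, `T { ... }` performs aggregate initialization, so no
// matching constructor is required.
template<typename T, typename... Args>
void empend(std::vector<T>& vector, Args&&... args)
{
    vector.push_back(T { std::forward<Args>(args)... });
}
```

Calling `empend(things, 1, 2)` on a `std::vector<Thing>` appends an element with `a == 1` and `b == 2`.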
AnotherTest
19bf7734a4 LibRegex: Store 'String' matches inside the bytecode
Also removes an unnecessary 'length' argument (StringView has a length!)
2020-12-06 15:38:40 +01:00
AnotherTest
dbef2b1ee9 LibRegex: Implement an ECMA262-compatible parser
This also adds support for lookarounds and individually-negated
comparisons.
The only unimplemented part of the parser spec is the unicode stuff.
2020-11-27 21:32:41 +01:00
AnotherTest
3db8ced4c7 LibRegex: Change bytecode value type to a 64-bit value
To allow storing unicode ranges compactly; this is not utilised at the
moment, but changing this later would've been significantly more
difficult.
Also fixes a few debug logs.
2020-11-27 21:32:41 +01:00
AnotherTest
92ea9ed4a5 LibRegex: Fix greedy/reluctant modifiers in PosixExtendedParser
Also fixes the issue with assertions causing early termination when
they fail.
2020-11-27 21:32:41 +01:00
Emanuel Sprung
4a630d4b63 LibRegex: Add RegexStringView wrapper to support utf8 and utf32 views 2020-11-27 21:32:41 +01:00
Emanuel Sprung
55450055d8 LibRegex: Add a regular expression library
This commit is a mix of several commits, squashed into one because the
commits before 'Move regex to own Library and fix all the broken stuff'
were not fixable in any elegant way.
The commits are listed below for "historical" purposes:

- AK: Add options/flags and Errors for regular expressions

Flags can be provided for any possible flavour by adding a new scoped enum.
Handling of flags is done by templated Options class and the overloaded
'|' and '&' operators.

- AK: Add Lexer for regular expressions

The lexer parses the input and extracts tokens needed to parse a regular
expression.

- AK: Add regex Parser and PosixExtendedParser

This patchset adds an abstract parser class that can be derived to implement
different parsers. A parser produces bytecode to be executed within the
regex matcher.

- AK: Add regex matcher

This patchset adds a regex matcher based on the principles of the T-REX VM.
The bytecode produced by the respective Parser is put into the matcher and
the VM will recursively execute the bytecode according to the available OpCodes.
Possible improvement: the recursion could be replaced by multi threading capabilities.

To match a Regular expression, e.g. for the Posix standard regular expression matcher
use the following API:

```
Pattern<PosixExtendedParser> pattern("^.*$");
auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle

EXPECT(result.count == 1);
EXPECT(result.matches.at(0).view.starts_with("Well"));
EXPECT(result.matches.at(0).view.end() == "!");

result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line

EXPECT(result.count == 2);
EXPECT(result.matches.at(0).view == "Well, hello friends!");
EXPECT(result.matches.at(1).view == "Hello World!");

EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources.
```

- AK: Rework regex to work with opcodes objects

This patchset reworks the matcher to work on a more structured base.
For that an abstract OpCode class and derived classes for the specific
OpCodes have been added. The respective opcode logic is contained in
each respective execute() method.

- AK: Add benchmark for regex

- AK: Some optimization in regex for runtime and memory

- LibRegex: Move regex to own Library and fix all the broken stuff

Now regex works again and grep utility is also in place for testing.
This commit also fixes the use of regex.h in C by making `regex_t`
an opaque (-ish) type, which makes its behaviour consistent between
C and C++ compilers.
Previously, <regex.h> would've blown C compilers up, and even if it
didn't, would've caused a leak in C code, and not in C++ code (due to
the existence of `OwnPtr` inside the struct).

To make this whole ordeal easier to deal with (for now), this pulls the
definitions of `reg*()` into LibRegex.

pros:
- The circular dependency between LibC and LibRegex is broken
- Easier to test (without accidentally pulling in the host's libc!)

cons:
- Using any of the regex.h functions will require the user to link -lregex
- The symbols will be missing from libc, which will be a big surprise
  down the line (especially with shared libs).

Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-11-27 21:32:41 +01:00
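
The flag handling mentioned in the first squashed commit (a scoped enum per flavour, a templated Options class, and overloaded '|' and '&' operators) could look roughly like this. It is a sketch under assumptions: `DemoFlags`, its enumerators, and the members of `Options` are hypothetical and are not taken from the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag set for one regex flavour.
enum class DemoFlags : std::uint32_t {
    None = 0,
    Insensitive = 1u << 0,
    Multiline = 1u << 1,
    Global = 1u << 2,
};

constexpr DemoFlags operator|(DemoFlags lhs, DemoFlags rhs)
{
    return static_cast<DemoFlags>(static_cast<std::uint32_t>(lhs) | static_cast<std::uint32_t>(rhs));
}

constexpr DemoFlags operator&(DemoFlags lhs, DemoFlags rhs)
{
    return static_cast<DemoFlags>(static_cast<std::uint32_t>(lhs) & static_cast<std::uint32_t>(rhs));
}

// Templated wrapper so the same option handling works for any flavour's enum.
template<typename T>
class Options {
public:
    constexpr Options() = default;
    constexpr Options(T flags)
        : m_flags(flags)
    {
    }

    // True if every bit of `flag` is set.
    constexpr bool has_flag_set(T flag) const { return (m_flags & flag) == flag; }

    // Returns a copy with `flag` added.
    constexpr Options operator|(T flag) const { return Options(m_flags | flag); }

private:
    T m_flags {};
};
```

Adding a new flavour then only requires a new scoped enum with its own `|` and `&` overloads; the wrapper itself stays unchanged.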