Commit graph

21 commits

Author SHA1 Message Date
Serhiy Storchaka
bd4bd3e76a
gh-152100: Support set operations in character classes (GH-152153)
Implement set difference [A--B], intersection [A&&B] and union [A||B] in
regular expression character classes (Unicode Technical Standard #18),
including nested, complemented and compound set operands.  Symmetric
difference [A~~B] remains reserved.

Also use the new syntax in the standard library (_strptime, textwrap,
doctest, pkgutil).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 10:09:41 +03:00
Pieter Eendebak
21c4b7359d
gh-152056: Compile single-category character sets to a bare CATEGORY opcode (GH-152057)
A character set containing exactly one category, e.g. [\d] or [^\s], now
compiles to a single CATEGORY opcode (like \d or \S) instead of an IN
block.  The negated form maps to the complementary category.  This speeds
up matching and reduces the size of the compiled byte code.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 11:09:50 +00:00
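
A minimal sketch of the equivalence this change relies on, not of the compiler change itself: a set holding exactly one category should match what the bare escape matches, and the negated set should match what the complementary escape matches. The sample string and variable names are invented for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "room 42, floor 7\tbuilding 19"

# [\d] contains exactly one category, so it should behave like \d.
digits_set = re.findall(r"[\d]+", sample)
digits_bare = re.findall(r"\d+", sample)

# [^\s] is the negated form; it should behave like the complementary \S.
nonspace_set = re.findall(r"[^\s]+", sample)
nonspace_bare = re.findall(r"\S+", sample)

print(digits_set == digits_bare)
print(nonspace_set == nonspace_bare)
```

Both comparisons should hold on any Python version, because only the compiled form changes, not the matching semantics.
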
Serhiy Storchaka
fde4cf862c
gh-152033: Optimize category escapes outside character sets (GH-152035)
Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and
``\W``) that occur outside a character set are now compiled directly to a
single CATEGORY opcode instead of being wrapped in an IN block.  This
removes the IN wrapper (three code words) and an indirect charset() call,
and makes such an escape a simple repeatable unit so that, for example,
``\d+`` uses the REPEAT_ONE fast path; a CATEGORY case is added to
SRE(count).

The transformation preserves behaviour exactly.  For category-heavy
patterns the compiled byte code is about 20% smaller and matching is up
to ~2x faster, with no effect on patterns that do not use bare category
escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:14 +03:00
Victor Stinner
0b8c348f27
Fix pyflakes warnings: variable is assigned to but never used (#142294)
Example of fixed warning:

    Lib/netrc.py:98:13: local variable 'toplevel'
    is assigned to but never used
2025-12-08 14:00:31 +01:00
Serhiy Storchaka
ac56f8cc8d
gh-133306: Support \z as a synonym for \Z in regular expressions (GH-133314)
\Z was an error inherited from PCRE 0.95. It was fixed in PCRE 2.0.
In other engines, \Z does not mean “anchor at string end” but
“anchor before an optional newline at string end”.

\z means “anchor at string end” in most RE engines.
2025-05-03 07:54:33 +00:00
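
To make the distinction concrete, here is a small sketch of how Python's existing \Z differs from $ when the string ends with a newline. It uses only \Z so that it also runs on interpreters that predate this commit; \z is described above as a synonym and is assumed to behave identically where available.

```python
import re

text = "abc\n"

# \Z anchors at the very end of the string, so the trailing newline
# prevents a match.
strict = re.search(r"abc\Z", text)

# Without re.MULTILINE, $ also matches just before a newline that ends
# the string.
lenient = re.search(r"abc$", text)

print(strict)
print(lenient is not None)
```
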
Serhiy Storchaka
f9637b4ba3
Remove dead code in the RE parser (GH-122796)
2024-08-07 19:44:18 +00:00
Serhiy Storchaka
e2b3d831fd
gh-109747: Improve errors for unsupported look-behind patterns (GH-109859)
re.error is now raised instead of OverflowError or RuntimeError when
the width of a look-behind pattern is too large.

The limit is increased to 2**32-1 (was 2**31-1).
2023-10-14 09:13:02 +03:00
Serhiy Storchaka
ed64204716
gh-106566: Optimize (?!) in regular expressions (GH-106567)
2023-08-07 18:09:56 +03:00
Serhiy Storchaka
74ec02e949
gh-106510: Fix DEBUG output for atomic group (GH-106511)
2023-07-08 14:31:25 +03:00
Nikita Sobolev
67f69dba0a
gh-105687: Remove deprecated objects from re module (#105688)
2023-06-14 12:26:20 +02:00
Serhiy Storchaka
75a6fadf36
gh-91524: Speed up the regular expression substitution (#91525)
Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.

Closes #91524

Primarily authored by serhiy-storchaka Serhiy Storchaka
Minor-cleanups-by: Gregory P. Smith [Google] <greg@krypto.org>
2022-10-23 15:57:30 -07:00
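
For reference, these are the kinds of replacement strings the speed-up applies to: templates that contain group references. The pattern and input strings below are made up for illustration, and the sketch shows behaviour only, not timing.

```python
import re

# Hypothetical pattern with two named groups.
pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# Numeric group references in the replacement string.
swapped = pattern.sub(r"\2 \1", "Ada Lovelace")

# Named group references via \g<name>.
formatted = re.sub(pattern, r"\g<last>, \g<first>", "Ada Lovelace")

# subn() also reports how many substitutions were made.
result, count = pattern.subn(r"\2 \1", "Ada Lovelace; Alan Turing")

print(swapped)
print(formatted)
print(result, count)
```
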
Miro Hrončok
16a7e4a0b7
gh-92728: Restore re.template, but deprecate it (GH-93161)
Revert "bpo-47211: Remove function re.template() and flag re.TEMPLATE (GH-32300)"

This reverts commit b09184bf05.
2022-05-25 09:05:35 +03:00
Serhiy Storchaka
a84a56d80f
gh-91760: More strict rules for numerical group references and group names in RE (GH-91792)
Only a sequence of ASCII digits is now accepted as a numerical reference.
The group name in bytes patterns and replacement strings can now only
contain ASCII letters, digits and underscores.
2022-05-08 19:19:29 +03:00
Serhiy Storchaka
19dca04121
gh-91760: Deprecate group names and numbers which will be invalid in future (GH-91794)
Only a sequence of ASCII digits will be accepted as a numerical reference.
The group name in bytes patterns and replacement strings will only be
allowed to contain ASCII letters, digits and underscores.
2022-04-30 13:13:46 +03:00
Serhiy Storchaka
f703c96cf0
gh-91870: Remove unsupported SRE opcode CALL (GH-91872)
It was initially added to support atomic groups, but that
support was never fully implemented, and CALL was left only
in the compiler, not in the interpreter or the parser.

ATOMIC_GROUP is now used to support atomic groups.
2022-04-26 21:07:25 +03:00
Serhiy Storchaka
130a8c386b
gh-91308: Simplify parsing inline flag "x" (verbose) (GH-91855)
2022-04-23 12:50:42 +03:00
Serhiy Storchaka
48ec61a89a
gh-91700: Validate the group number in conditional expression in RE (GH-91702)
In the expression (?(group)...) an appropriate re.error is now
raised if the group number refers to an undefined group.

Previously it raised RuntimeError: invalid SRE code.
2022-04-22 19:53:10 +03:00
Serhiy Storchaka
6ccfa31421
gh-90568: Fix exception type for \N with a named sequence in RE (GH-91665)
re.error is now raised instead of TypeError.
2022-04-22 18:35:28 +03:00
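
A small sketch of the case this fix covers. KEYCAP NUMBER SIGN is used here as an example of a Unicode named sequence, that is, a name that resolves to more than one code point, so it cannot stand for a single character in a pattern. The TypeError branch is included only so the sketch also runs on interpreters that predate the fix.

```python
import re
import unicodedata

name = "KEYCAP NUMBER SIGN"

# unicodedata.lookup() resolves named sequences to a multi-character string.
sequence = unicodedata.lookup(name)

try:
    re.compile(r"\N{%s}" % name)
except re.error:
    outcome = "re.error"      # behaviour with this commit applied
except TypeError:
    outcome = "TypeError"     # behaviour before this commit
else:
    outcome = "compiled"

print(len(sequence), outcome)
```
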
Serhiy Storchaka
50872dbadc
bpo-47227: Suppress exception chaining for more RE parsing errors (GH-32333)
2022-04-06 19:54:44 +03:00
Serhiy Storchaka
b09184bf05
bpo-47211: Remove function re.template() and flag re.TEMPLATE (GH-32300)
They were undocumented and never worked.
2022-04-06 19:53:50 +03:00
Serhiy Storchaka
1be3260a90
bpo-47152: Convert the re module into a package (GH-32177)
The sre_* modules are now deprecated.
2022-04-02 11:35:13 +03:00