Implement set difference [A--B], intersection [A&&B] and union [A||B] in
regular expression character classes (Unicode Technical Standard #18),
including nested, complemented and compound set operands. Symmetric
difference [A~~B] remains reserved.
Also use the new syntax in the standard library (_strptime, textwrap,
doctest, pkgutil).
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
A character set containing exactly one category, e.g. [\d] or [^\s], now
compiles to a single CATEGORY opcode (like \d or \S) instead of an IN
block. The negated form maps to the complementary category. This speeds
up matching and reduces the size of the compiled byte code.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and
``\W``) that occur outside a character set are now compiled directly to a
single CATEGORY opcode instead of being wrapped in an IN block. This
removes the IN wrapper (three code words) and an indirect charset() call,
and makes such an escape a simple repeatable unit so that, for example,
``\d+`` uses the REPEAT_ONE fast path; a CATEGORY case is added to
SRE(count).
The transformation preserves behaviour exactly. For category-heavy
patterns the compiled byte code is about 20% smaller and matching is up
to ~2x faster, with no effect on patterns that do not use bare category
escapes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
\Z was an error inherited from PCRE 0.95. It was fixed in PCRE 2.0.
In other engines, \Z means not “anchor at string end”, but
“anchor before optional newline at string end”.
\z means “anchor at string end” in most RE engines.
Now re.error is raised instead of OverflowError or RuntimeError for
too large width of look-behind pattern.
The limit is increased to 2**32-1 (was 2**31-1).
Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.
Closes#91524
Primarily authored by serhiy-storchaka Serhiy Storchaka
Minor-cleanups-by: Gregory P. Smith [Google] <greg@krypto.org>
Only sequence of ASCII digits is now accepted as a numerical reference.
The group name in bytes patterns and replacement strings can now only
contain ASCII letters and digits and underscore.
Only sequence of ASCII digits will be accepted as a numerical reference.
The group name in bytes patterns and replacement strings could only
contain ASCII letters and digits and underscore.
It was initially added to support atomic groups, but that
support was never fully implemented, and CALL was only left
in the compiler, but not interpreter and parser.
ATOMIC_GROUP is now used to support atomic groups.
In expression (?(group)...) an appropriate re.error is now
raised if the group number refers to not defined group.
Previously it raised RuntimeError: invalid SRE code.