\documentclass{howto}

% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)

\title{Regular Expression HOWTO}

\release{0.05}

\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document is an introductory tutorial to using regular expressions
in Python with the \module{re} module.  It provides a gentler
introduction than the corresponding section in the Library Reference.

This document is available from
\url{http://www.amk.ca/python/howto}.

\end{abstract}

\tableofcontents

\section{Introduction}

The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns.  Earlier versions of Python
came with the \module{regex} module, which provided Emacs-style
patterns.  The \module{regex} module was removed in Python 2.5.

Regular expressions (or REs) are essentially a tiny, highly
specialized programming language embedded inside Python and made
available through the \module{re} module.  Using this little language,
you specify the rules for the set of possible strings that you want to
match; this set might contain English sentences, or e-mail addresses,
or TeX commands, or anything you like.  You can then ask questions
such as ``Does this string match the pattern?'', or ``Is there a match
for the pattern anywhere in this string?''.  You can also use REs to
modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C.  For
advanced use, it may be necessary to pay careful attention to how the
engine will execute a given RE, and write the RE in a certain way in
order to produce bytecode that runs faster.  Optimization isn't
covered in this document, because it requires that you have a good
understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so
not all possible string processing tasks can be done using regular
expressions.  There are also tasks that \emph{can} be done with
regular expressions, but the expressions turn out to be very
complicated.  In these cases, you may be better off writing Python
code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.

\section{Simple Patterns}

We'll start by learning about the simplest possible regular
expressions.  Since regular expressions are used to operate on
strings, we'll begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you
can refer to almost any textbook on writing compilers.

\subsection{Matching Characters}

Most letters and characters will simply match themselves.  For
example, the regular expression \regexp{test} will match the string
\samp{test} exactly.  (You can enable a case-insensitive mode that
would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.)

There are exceptions to this rule; some characters are
special \dfn{metacharacters}, and don't match themselves.  Instead,
they signal that some out-of-the-ordinary thing should be matched, or
they affect other portions of the RE by repeating them.  Much of this
document is devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO.

\begin{verbatim}
. ^ $ * + ? { [ ] \ | ( )
\end{verbatim}
% $

The first metacharacters we'll look at are \samp{[} and \samp{]}.
They're used for specifying a character class, which is a set of
characters that you wish to match.  Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a \character{-}.  For example,
\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
\samp{c}; this is the same as
\regexp{[a-c]}, which uses a range to express the same set of
characters.  If you wanted to match only lowercase letters, your
RE would be \regexp{[a-z]}.

Metacharacters are not active inside classes.  For example,
\regexp{[akm\$]} will match any of the characters \character{a},
\character{k}, \character{m}, or \character{\$}; \character{\$} is
usually a metacharacter, but inside a character class it's stripped of
its special nature.

You can match the characters not within a range by \dfn{complementing}
the set.  This is indicated by including a \character{\^} as the first
character of the class; \character{\^} elsewhere will simply match the
\character{\^} character.  For example, \verb|[^5]| will match any
character except \character{5}.
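
As a minimal sketch of the character classes described above, the
following snippet checks a few of them against short strings.  It
relies on \module{re} functions that are only introduced later in this
HOWTO, and the variable names are made up for illustration.

```python
import re

# [abc] and [a-c] describe the same set of characters.
assert re.match(r'[abc]', 'b') is not None
assert re.match(r'[a-c]', 'b') is not None
assert re.match(r'[a-c]', 'd') is None

# Inside a class, '$' is an ordinary character.
dollar_class = re.compile(r'[akm$]')
found = dollar_class.findall('a$kz')      # ['a', '$', 'k']

# A leading '^' complements the class: every character except '5'.
not_five = re.compile(r'[^5]')
kept = not_five.findall('1525')           # ['1', '2']
```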

Perhaps the most important metacharacter is the backslash, \samp{\e}.
As in Python string literals, the backslash can be followed by various
characters to signal various special sequences.  It's also used to escape
all the metacharacters so you can still match them in patterns; for
example, if you need to match a \samp{[} or
\samp{\e}, you can precede them with a backslash to remove their
special meaning: \regexp{\e[} or \regexp{\e\e}.

Some of the special sequences beginning with \character{\e} represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace.  The following predefined special sequences are available:

\begin{itemize}
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.

\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.

\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.

\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.

\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
\end{itemize}

These sequences can be included inside a character class.  For
example, \regexp{[\e s,.]} is a character class that will match any
whitespace character, or \character{,} or \character{.}.
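
The predefined sequences can be tried out on a short string.  This is
only an illustrative sketch: the sample text and variable names are
invented, and \function{re.findall()} is a module-level function
covered later in this document.

```python
import re

text = 'Room 42, floor 7.'

# \d picks out single digits.
digits = re.findall(r'\d', text)          # ['4', '2', '7']

# \w+ picks out runs of alphanumeric characters.
words = re.findall(r'\w+', text)          # ['Room', '42', 'floor', '7']

# A class may mix a special sequence with literal characters.
separators = re.findall(r'[\s,.]', text)  # [' ', ',', ' ', ' ', '.']
```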

The final metacharacter in this section is \regexp{.}.  It matches
anything except a newline character, and there's an alternate mode
(\code{re.DOTALL}) where it will match even a newline.  \character{.}
is often used where you want to match ``any character''.

\subsection{Repeating Things}

Being able to match varying sets of characters is the first thing
regular expressions can do that isn't already possible with the
methods available on strings.  However, if that was the only
additional capability of regexes, they wouldn't be much of an advance.
Another capability is that you can specify that portions of the RE
must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is
\regexp{*}.  \regexp{*} doesn't match the literal character \samp{*};
instead, it specifies that the previous character can be matched zero
or more times, instead of exactly once.

For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth.  The RE engine has various internal
limitations, stemming from the size of C's \code{int} type, that will
prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit.

Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
the matching engine will try to repeat it as many times as possible.
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious.  Let's consider
the expression \regexp{a[bcd]*b}.  This matches the letter
\character{a}, zero or more letters from the class \code{[bcd]}, and
finally ends with a \character{b}.  Now imagine matching this RE
against the string \samp{abcbd}.

\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
it can, which is to the end of the string.}
\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
current position is at the end of the string, so it fails.}
\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
one less character.}
\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
current position is at the last character, which is a \character{d}.}
\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
only matching \samp{bc}.}
\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time
the character at the current position is \character{b}, so it succeeds.}
\end{tableiii}

The end of the RE has now been reached, and it has matched
\samp{abcb}.  This demonstrates how the matching engine goes as far as
it can at first, and if no match is found it will then progressively
back up and retry the rest of the RE again and again.  It will back up
until it has tried zero matches for \regexp{[bcd]*}, and if that
subsequently fails, the engine will conclude that the string doesn't
match the RE at all.
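
The outcome of the walk-through above can be checked directly.  The
snippet below is a small sketch using \function{re.match()}, which is
described later; it only shows the final result, not the engine's
intermediate steps, and the variable names are illustrative.

```python
import re

# Greedy [bcd]* first grabs 'bcbd', then backs up until the final 'b' fits.
m = re.match(r'a[bcd]*b', 'abcbd')
matched = m.group()      # 'abcb'
span = m.span()          # (0, 4)

# If no 'b' can follow the class, backing up all the way still fails.
no_match = re.match(r'a[bcd]*b', 'acd')    # None
```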

Another repeating metacharacter is \regexp{+}, which matches one or
more times.  Pay careful attention to the difference between
\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
times, so whatever's being repeated may not be present at all, while
\regexp{+} requires at least \emph{one} occurrence.  To use a similar
example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.

There are two more repeating qualifiers.  The question mark character,
\regexp{?}, matches either once or zero times; you can think of it as
marking something as being optional.  For example, \regexp{home-?brew}
matches either \samp{homebrew} or \samp{home-brew}.
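
To compare \regexp{*}, \regexp{+}, and \regexp{?} side by side, the
sketch below runs the example patterns from the text over a few
candidate strings.  The list names are hypothetical, and
\function{re.match()} is explained later in this HOWTO.

```python
import re

candidates = ['ct', 'cat', 'caaat']

# '*' accepts zero or more 'a' characters, so all three strings match.
star = [s for s in candidates if re.match(r'ca*t', s)]

# '+' needs at least one 'a', so 'ct' is rejected.
plus = [s for s in candidates if re.match(r'ca+t', s)]

# '?' makes the hyphen optional, but allows at most one.
optional = [s for s in ['homebrew', 'home-brew', 'home--brew']
            if re.match(r'home-?brew', s)]
```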

The most complicated repeating qualifier is
\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
integers.  This qualifier means there must be at least \var{m}
repetitions, and at most \var{n}.  For example, \regexp{a/\{1,3\}b}
will match \samp{a/b}, \samp{a//b}, and \samp{a///b}.  It won't match
\samp{ab}, which has no slashes, or \samp{a////b}, which has four.

You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value.  Omitting \var{m} is
interpreted as a lower limit of 0, while omitting \var{n} results in an
upper bound of infinity (actually, the 2 billion limit mentioned
earlier, but that might as well be infinity).
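
A short sketch of the brace qualifier, including the forms with an
omitted bound, might look like this.  The sample list and variable
names are invented for illustration.

```python
import re

samples = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# Between one and three slashes.
one_to_three = [s for s in samples if re.match(r'a/{1,3}b', s)]

# Upper bound omitted: two or more slashes.
at_least_two = [s for s in samples if re.match(r'a/{2,}b', s)]

# Lower bound omitted: zero, one, or two slashes.
at_most_two = [s for s in samples if re.match(r'a/{,2}b', s)]
```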

Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation.  \regexp{\{0,\}} is the same
as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
\regexp{\{0,1\}} is the same as \regexp{?}.  It's better to use
\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
they're shorter and easier to read.
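
The equivalences can be spot-checked on a handful of strings.  This is
a sanity check rather than a proof, and the helper \code{accepted()}
and the sample list are hypothetical names introduced only for this
sketch.

```python
import re

# Each short qualifier paired with its brace-notation spelling.
pairs = [(r'ca*t', r'ca{0,}t'),
         (r'ca+t', r'ca{1,}t'),
         (r'ca?t', r'ca{0,1}t')]
samples = ['ct', 'cat', 'caat', 'dog']

def accepted(pattern):
    # Return the sample strings that the pattern matches from the start.
    return [s for s in samples if re.match(pattern, s)]

# True when both spellings accept exactly the same samples.
equivalent = all(accepted(short) == accepted(braces)
                 for short, braces in pairs)
```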

\section{Using Regular Expressions}

Now that we've looked at some simple regular expressions, how do we
actually use them in Python?  The \module{re} module provides an
interface to the regular expression engine, allowing you to compile
REs into objects and then perform matches with them.

\subsection{Compiling Regular Expressions}

Regular expressions are compiled into \class{RegexObject} instances,
which have methods for various operations such as searching for
pattern matches or performing string substitutions.

\begin{verbatim}
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>
\end{verbatim}

\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
variations.  We'll go over the available settings later, but for now a
single example will do:

\begin{verbatim}
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}

The RE is passed to \function{re.compile()} as a string.  REs are
handled as strings because regular expressions aren't part of the core
Python language, and no special syntax was created for expressing
them.  (There are applications that don't need REs at all, so there's
no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib}
module.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.

\subsection{The Backslash Plague}

As stated earlier, regular expressions use the backslash
character (\character{\e}) to indicate special forms or to allow
special characters to be used without invoking their special meaning.
This conflicts with Python's usage of the same character for the same
purpose in string literals.

Let's say you want to write a RE that matches the string
\samp{{\e}section}, which might be found in a \LaTeX\ file.  To figure
out what to write in the program code, start with the desired string
to be matched.  Next, you must escape any backslashes and other
metacharacters by preceding them with a backslash, resulting in the
string \samp{\e\e section}.  The string passed
to \function{re.compile()} must therefore be \verb|\\section|.  However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.

\begin{tableii}{c|l}{code}{Characters}{Stage}
  \lineii{\e section}{Text string to be matched}
  \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
  \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
\end{tableii}

In short, to match a literal backslash, one has to write
\code{'\e\e\e\e'} as the RE string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal.  In REs that
feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular
expressions; backslashes are not handled in any special way in
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Frequently regular expressions will be expressed in Python
code using this raw string notation.

\begin{tableii}{c|c}{code}{Regular string}{Raw string}
  \lineii{"ab*"}{\code{r"ab*"}}
  \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
  \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
\end{tableii}
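
The difference between the two notations is easy to confirm in code.
The following sketch compares the literals from the tables above; the
sample line of \LaTeX\ text and the variable names are made up for the
example.

```python
import re

# A raw string keeps the backslash; a regular literal turns \n into a newline.
raw_length = len(r"\n")        # 2
cooked_length = len("\n")      # 1

# Both literals spell the same pattern text: two backslashes, then 'section'.
same_pattern = ("\\\\section" == r"\\section")

# The pattern matches one literal backslash followed by 'section'.
latex_line = r"\section{Introduction}"
m = re.match(r"\\section", latex_line)
matched = m.group()
```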

\subsection{Performing Matches}

Once you have an object representing a compiled regular expression,
what do you do with it?  \class{RegexObject} instances have several
methods and attributes.  Only the most significant ones will be
covered here; consult \ulink{the Library
Reference}{http://www.python.org/doc/lib/module-re.html} for a
complete listing.

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
  \lineii{match()}{Determine if the RE matches at the beginning of
  the string.}
  \lineii{search()}{Scan through a string, looking for any location
  where this RE matches.}
  \lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
  \lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
\end{tableii}

\method{match()} and \method{search()} return \code{None} if no match
can be found.  If they're successful, a \code{MatchObject} instance is
returned, containing information about the match: where it starts and
ends, the substring it matched, and more.

You can learn about this by interactively experimenting with the
\module{re} module.  If you have Tkinter available, you may also want
to look at \file{Tools/scripts/redemo.py}, a demonstration program
included with the Python distribution.  It allows you to enter REs and
strings, and displays whether the RE matches or fails.
\file{redemo.py} can be quite useful when trying to debug a
complicated RE.  Phil Schwartz's
\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
tool for developing and testing RE patterns.  This HOWTO will use the
standard Python interpreter for its examples.

First, run the Python interpreter, import the \module{re} module, and
compile a RE:

\begin{verbatim}
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
\end{verbatim}

Now, you can try matching various strings against the RE
\regexp{[a-z]+}.  An empty string shouldn't match at all, since
\regexp{+} means ``one or more repetitions''.  \method{match()} should
return \code{None} in this case, which will cause the interpreter to
print no output.  You can explicitly print the result of
\method{match()} to make this clear.

\begin{verbatim}
>>> p.match("")
>>> print p.match("")
None
\end{verbatim}

Now, let's try it on a string that it should match, such as
\samp{tempo}.  In this case, \method{match()} will return a
\class{MatchObject}, so you should store the result in a variable for
later use.

\begin{verbatim}
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}

Now you can query the \class{MatchObject} for information about the
matching string.  \class{MatchObject} instances also have several
methods and attributes; the most important ones are:

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
  \lineii{group()}{Return the string matched by the RE}
  \lineii{start()}{Return the starting position of the match}
  \lineii{end()}{Return the ending position of the match}
  \lineii{span()}{Return a tuple containing the (start, end) positions
                  of the match}
\end{tableii}

Trying these methods will soon clarify their meaning:

\begin{verbatim}
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
\end{verbatim}

\method{group()} returns the substring that was matched by the
RE.  \method{start()} and \method{end()} return the starting and
ending index of the match.  \method{span()} returns both start and end
indexes in a single tuple.  Since the \method{match()} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero.  However, the \method{search()}
method of \class{RegexObject} instances scans through the string, so
the match may not start at zero in that case.

\begin{verbatim}
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
\end{verbatim}

In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
\code{None}.  This usually looks like:

\begin{verbatim}
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
\end{verbatim}

Two \class{RegexObject} methods return all of the matches for a pattern.
\method{findall()} returns a list of matching strings:

\begin{verbatim}
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
\end{verbatim}

\method{findall()} has to create the entire list before it can be
returned as the result.  As of Python 2.2, the \method{finditer()} method
is also available, returning a sequence of \class{MatchObject} instances
as an iterator.

\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
\end{verbatim}


\subsection{Module-Level Functions}

You don't have to produce a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so
forth.  These functions take the same arguments as the corresponding
\class{RegexObject} method, with the RE string added as the first
argument, and still return either \code{None} or a \class{MatchObject}
instance.

\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>
\end{verbatim}

Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it.  They also store the
compiled object in a cache, so future calls using the same
RE are faster.
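
The two calling styles can be placed side by side.  The sketch below
reuses the mail-header example from above and only shows that both
styles report the same match; the names \code{from_re}, \code{m1}, and
\code{m2} are illustrative, and the cache itself is not inspected.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level call: the pattern string is the first argument.
m1 = re.match(r'From\s+', line)

# The same match performed on an explicitly compiled object.
from_re = re.compile(r'From\s+')
m2 = from_re.match(line)

# Both styles report the same matched region, 'From ' at (0, 5).
same_result = (m1.span() == m2.span())

# 'Fromage' has no whitespace after 'From', so nothing matches.
no_match = re.match(r'From\s+', 'Fromage amk')
```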

Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself?  That choice
depends on how frequently the RE will be used, and on your personal
coding style.  If a RE is being used at only one point in the code,
then the module functions are probably more convenient.  If a program
contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the
definitions in one place, in a section of code that compiles all the
REs ahead of time.  To take an example from the standard library,
here's an extract from \file{xmllib.py}:

\begin{verbatim}
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
\end{verbatim}

I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
as I am.

\subsection{Compilation Flags}

Compilation flags let you modify some aspects of how regular
expressions work.  Flags are available in the \module{re} module under
two names, a long name such as \constant{IGNORECASE}, and a short,
one-letter form such as \constant{I}.  (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
re.M} sets both the \constant{I} and \constant{M} flags, for example.
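
As a rough sketch of combining flags, the snippet below searches for a
word at the start of a line with and without \code{re.I | re.M}.  The
sample text and variable names are invented, and \regexp{\^} is
explained later in this document.

```python
import re

text = 'first line\nSECOND line'

# No flags: the case differs, and '^' only matches at the start of the string.
plain = re.search(r'^second', text)

# re.I | re.M turns on case-insensitive and multi-line matching together.
both = re.search(r'^second', text, re.I | re.M)
found = both.group()
```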

Here's a table of the available flags, followed by
a more detailed explanation of each one.

\begin{tableii}{c|l}{}{Flag}{Meaning}
  \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
  character, including newlines}
  \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
  \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
  \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
  affecting \regexp{\^} and \regexp{\$}}
  \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
  which can be organized more cleanly and understandably.}
\end{tableii}

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; character class and literal strings
will match
letters by ignoring case.  For example, \regexp{[A-Z]} will match
lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
\samp{spam}, or \samp{spAM}.
This lowercasing doesn't take the current locale into account; it will
if you also set the \constant{LOCALE} flag.
\end{datadesc}
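
A minimal sketch of \constant{IGNORECASE}, using the patterns named in
the description above; the variable names and the extra sample string
\samp{ham} are illustrative.

```python
import re

# A literal pattern matches regardless of case.
spam = re.compile(r'Spam', re.IGNORECASE)
hits = [s for s in ['Spam', 'spam', 'spAM', 'ham'] if spam.match(s)]

# With the flag set, an uppercase range also matches lowercase letters.
upper = re.compile(r'[A-Z]+', re.I)
word = upper.match('lower').group()
```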
 | 
						|
 | 
						|
\begin{datadesc}{L}
 | 
						|
\dataline{LOCALE}
 | 
						|
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
 | 
						|
and \regexp{\e B}, dependent on the current locale.  
 | 
						|
 | 
						|
Locales are a feature of the C library intended to help in writing
 | 
						|
programs that take account of language differences.  For example, if
 | 
						|
you're processing French text, you'd want to be able to write
 | 
						|
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
 | 
						|
character class \regexp{[A-Za-z]}; it won't match \character{\'e} or
 | 
						|
\character{\c c}.  If your system is configured properly and a French
 | 
						|
locale is selected, certain C functions will tell the program that
 | 
						|
\character{\'e} should also be considered a letter.  Setting the
 | 
						|
\constant{LOCALE} flag when compiling a regular expression will cause the
 | 
						|
resulting compiled object to use these C functions for \regexp{\e w};
 | 
						|
this is slower, but also enables \regexp{\e w+} to match French words as
 | 
						|
you'd expect.
 | 
						|
\end{datadesc}

\begin{datadesc}{M}
\dataline{MULTILINE}
(\regexp{\^} and \regexp{\$} haven't been explained yet;
they'll be introduced in section~\ref{more-metacharacters}.)

Usually \regexp{\^} matches only at the beginning of the string, and
\regexp{\$} matches only at the end of the string and immediately before the
newline (if any) at the end of the string.  When this flag is
specified, \regexp{\^} matches at the beginning of the string and at
the beginning of each line within the string, immediately following
each newline.  Similarly, the \regexp{\$} metacharacter matches both at
the end of the string and at the end of each line (immediately
preceding each newline).

\end{datadesc}
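
The difference is easiest to see by running the same patterns with and
without the flag.  This is only a sketch; the two-line sample text is an
assumption chosen for illustration.

```python
import re

# Made-up sample: two lines, with a newline at the very end.
text = 'first line\nsecond line\n'

# Without the flag, ^ anchors only at the start of the string and
# $ only at the end (or just before a final newline).
starts_default = re.findall(r'^\w+', text)
ends_default = re.findall(r'\w+$', text)

# With the flag, both anchors also apply at every line boundary.
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
ends_multiline = re.findall(r'\w+$', text, re.MULTILINE)
```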

\begin{datadesc}{S}
\dataline{DOTALL}
Makes the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
\end{datadesc}
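
A short sketch of the effect, using a made-up two-line string:

```python
import re

text = 'one\ntwo'   # hypothetical sample containing a newline

# By default, .* stops at the newline.
default_match = re.match(r'.*', text).group()

# With DOTALL (short name re.S), the dot crosses the newline as well.
dotall_match = re.match(r'.*', text, re.DOTALL).group()
```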

\begin{datadesc}{X}
\dataline{VERBOSE} This flag allows you to write regular expressions
that are more readable by granting you more flexibility in how you can
format them.  When this flag has been specified, whitespace within the
RE string is ignored, except when the whitespace is in a character
class or preceded by an unescaped backslash; this lets you organize
and indent the RE more clearly.  It also enables you to put comments
within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash.

For example, here's a RE that uses \constant{re.VERBOSE}; see how
much easier it is to read?

\begin{verbatim}
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
   [0-9]+[^0-9]      # Decimal form
   | 0[0-7]+[^0-7]   # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)
\end{verbatim}

Without the verbose setting, the RE would look like this:
\begin{verbatim}
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")
\end{verbatim}

In the above example, Python's automatic concatenation of string
literals has been used to break up the RE into smaller pieces, but
it's still more difficult to understand than the version using
\constant{re.VERBOSE}.

\end{datadesc}
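
If you want to convince yourself that the verbose and compact spellings
describe the same RE, a quick check along the following lines may help.
The sample strings and the \function{first_group()} helper are hypothetical,
added only for this sketch.

```python
import re

verbose = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)

compact = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

# Made-up inputs: a decimal reference, a hexadecimal one, and no reference.
samples = ['caf&#233; au lait', '&#x1F; control', 'no entity here']

def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

verbose_results = [first_group(verbose, s) for s in samples]
compact_results = [first_group(compact, s) for s in samples]
```

Both lists come out identical, which is the point: \constant{re.VERBOSE}
changes how the pattern is written, not what it matches.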

\section{More Pattern Power}

So far we've only covered a part of the features of regular
expressions.  In this section, we'll cover some new metacharacters,
and how to use groups to retrieve portions of the text that was matched.

\subsection{More Metacharacters\label{more-metacharacters}}

There are some metacharacters that we haven't covered yet.  Most of
them will be covered in this section.

Some of the remaining metacharacters to be discussed are
\dfn{zero-width assertions}.  They don't cause the engine to advance
through the string; instead, they consume no characters at all,
and simply succeed or fail.  For example, \regexp{\e b} is an
assertion that the current position is located at a word boundary; the
position isn't changed by the \regexp{\e b} at all.  This means that
zero-width assertions should never be repeated, because if they match
once at a given location, they can obviously be matched an infinite
number of times.

\begin{list}{}{}

\item[\regexp{|}]
Alternation, or the ``or'' operator.
If A and B are regular expressions,
\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
\regexp{|} has very low precedence in order to make it work reasonably when
you're alternating multi-character strings.
\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.

To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.

\item[\regexp{\^}] Matches at the beginning of lines.  Unless the
\constant{MULTILINE} flag has been set, this will only match at the
beginning of the string.  In \constant{MULTILINE} mode, this also
matches immediately after each newline within the string.

For example, if you wish to match the word \samp{From} only at the
beginning of a line, the RE to use is \verb|^From|.

\begin{verbatim}
>>> print re.search('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.search('^From', 'Reciting From Memory')
None
\end{verbatim}

%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
%inside a character class, as in \regexp{[{\e}\^]}.

\item[\regexp{\$}] Matches at the end of a line, which is defined as
either the end of the string, or any location followed by a newline
character.

\begin{verbatim}
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>
\end{verbatim}
% $

To match a literal \character{\$}, use \regexp{\e\$} or enclose it
inside a character class, as in  \regexp{[\$]}.

\item[\regexp{\e A}] Matches only at the start of the string.  When
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
effectively the same.  In \constant{MULTILINE} mode, however, they're
different; \regexp{\e A} still matches only at the beginning of the
string, but \regexp{\^} may match at any location inside the string
that follows a newline character.

\item[\regexp{\e Z}] Matches only at the end of the string.

\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.

The following example matches \samp{class} only when it's a complete
word; it won't match when it's contained inside another word.

\begin{verbatim}
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None
\end{verbatim}

There are two subtleties you should remember when using this special
sequence.  First, this is the worst collision between Python's string
literals and regular expression sequences.  In Python's string
literals, \samp{\e b} is the backspace character, ASCII value 8.  If
you're not using raw strings, then Python will convert the \samp{\e b} to
a backspace, and your RE won't match as you expect it to.  The
following example looks the same as our previous RE, but omits
the \character{r} in front of the RE string.

\begin{verbatim}
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>
\end{verbatim}

Second, inside a character class, where there's no use for this
assertion, \regexp{\e b} represents the backspace character, for
compatibility with Python's string literals.

\item[\regexp{\e B}] Another zero-width assertion, this is the
opposite of \regexp{\e b}, only matching when the current
position is not at a word boundary.

\end{list}
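
The anchors described in the list above have no worked example for
\regexp{\e A}, \regexp{\e Z}, or \regexp{\e B}, so here's a small sketch
that contrasts them with \regexp{\^}, \regexp{\$}, and \regexp{\e b}.
The sample strings are made up for illustration.

```python
import re

# ^ versus \A in MULTILINE mode: ^ matches at each line start,
# \A only at the very start of the string.
mbox = 'From: alice\nFrom: bob'
caret_hits = re.findall(r'^From', mbox, re.MULTILINE)
anchor_hits = re.findall(r'\AFrom', mbox, re.MULTILINE)

# $ tolerates a single trailing newline; \Z does not.
dollar_match = re.search(r'}$', '{block}\n')
z_match = re.search(r'}\Z', '{block}\n')

# \B is the opposite of \b: it matches only inside a word.
inside = re.search(r'\Bclass\B', 'the declassified algorithm')
standalone = re.search(r'\Bclass\B', 'no class at all')
```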

\subsection{Grouping}

Frequently you need to obtain more information than just whether the
RE matched or not.  Regular expressions are often used to dissect
strings by writing a RE divided into several subgroups which
match different components of interest.  For example, an RFC-822
header line is divided into a header name and a value, separated by a
\character{:}.  This can be handled by writing a regular expression
which matches an entire header line, and has one group which matches the
header name, and another group which matches the header's value.

Groups are marked by the \character{(}, \character{)} metacharacters.
\character{(} and \character{)} have much the same meaning as they do
in mathematical expressions; they group together the expressions
contained inside them.  For example, you can repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}.  Thus
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.

\begin{verbatim}
>>> p = re.compile('(ab)*')
>>> print p.match('ababababab').span()
(0, 10)
\end{verbatim}

Groups indicated with \character{(}, \character{)} also capture the
starting and ending index of the text that they match; this can be
retrieved by passing an argument to \method{group()},
\method{start()}, \method{end()}, and \method{span()}.  Groups are
numbered starting with 0.  Group 0 is always present; it's the whole
RE, so \class{MatchObject} methods all have group 0 as their default
argument.  Later we'll see how to express groups that don't capture
the span of text that they match.

\begin{verbatim}
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
\end{verbatim}

Subgroups are numbered from left to right, from 1 upward.  Groups can
be nested; to determine the number, just count the opening parenthesis
characters, going from left to right.

\begin{verbatim}
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
\end{verbatim}
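
The same group numbers work with \method{start()}, \method{end()}, and
\method{span()}, which report where each group matched rather than what it
matched.  A brief sketch, reusing the nested pattern from above (the
variable names are illustrative):

```python
import re

p = re.compile('(a(b)c)d')
m = p.match('abcd')

whole_span = m.span()      # no argument means group 0, the whole match
outer_span = m.span(1)     # the group opened by the first parenthesis
inner_start = m.start(2)   # the nested group opened by the second one
inner_end = m.end(2)
```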

\method{group()} can be passed multiple group numbers at a time, in
which case it will return a tuple containing the corresponding values
for those groups.

\begin{verbatim}
>>> m.group(2,1,2)
('b', 'abc', 'b')
\end{verbatim}

The \method{groups()} method returns a tuple containing the strings
for all the subgroups, from 1 up to however many there are.

\begin{verbatim}
>>> m.groups()
('abc', 'b')
\end{verbatim}
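
Returning to the RFC-822 header line mentioned at the start of this
section, here's a rough sketch of a two-group RE for it.  The pattern is a
deliberate simplification (real headers can be folded across several lines,
for instance), and the sample header is made up.

```python
import re

# Group 1: the header name (anything up to the colon, no whitespace).
# Group 2: the value, with leading whitespace skipped.
# Simplified on purpose; this is not a full RFC-822 parser.
header = re.compile(r'([^:\s]+):\s*(.*)$')

m = header.match('Content-Type: text/plain; charset=us-ascii')
name, value = m.groups()
```

Unpacking the result of \method{groups()} directly into names is often the
most readable way to use a pattern with a small, fixed number of groups.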

Backreferences in a pattern allow you to specify that the contents of
an earlier capturing group must also be found at the current location
in the string.  For example, \regexp{\e 1} will succeed if the exact
contents of group 1 can be found at the current position, and fails
otherwise.  Remember that Python's string literals also use a
backslash followed by numbers to allow including arbitrary characters
in a string, so be sure to use a raw string when incorporating
backreferences in a RE.

For example, the following RE detects doubled words in a string.

\begin{verbatim}
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}

Backreferences like this aren't often useful for just searching
through a string (there are few text formats which repeat data in
this way), but you'll soon find out that they're \emph{very} useful
when performing string substitutions.

\subsection{Non-capturing and Named Groups}

Elaborate REs may use many groups, both to capture substrings of
interest, and to group and structure the RE itself.  In complex REs,
it becomes difficult to keep track of the group numbers.  There are
two features which help with this problem.  Both of them use a common
syntax for regular expression extensions, so we'll look at that first.

Perl 5 added several additional features to standard regular
expressions, and the Python \module{re} module supports most of them.
It would have been difficult to choose new single-keystroke
metacharacters or new special sequences beginning with \samp{\e} to
represent the new features without making Perl's regular expressions
confusingly different from standard REs.  If you chose \samp{\&} as a
new metacharacter, for example, old expressions would be assuming that
\samp{\&} was a regular character and wouldn't have escaped it by
writing \regexp{\e \&} or \regexp{[\&]}.

The solution chosen by the Perl developers was to use \regexp{(?...)}
as the extension syntax.  \samp{?} immediately after a parenthesis was
a syntax error because the \samp{?} would have nothing to repeat, so
this didn't introduce any compatibility problems.  The characters
immediately after the \samp{?}  indicate what extension is being used,
so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
\regexp{(?:foo)} is something else (a non-capturing group containing
the subexpression \regexp{foo}).

Python adds an extension syntax to Perl's extension syntax.  If the
first character after the question mark is a \samp{P}, you know that
it's an extension that's specific to Python.  Currently there are two
such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
and \regexp{(?P=\var{name})} is a backreference to a named group.  If
future versions of Perl 5 add similar features using a different
syntax, the \module{re} module will be changed to support the new
syntax, while preserving the Python-specific syntax for
compatibility's sake.

Now that we've looked at the general extension syntax, we can return
to the features that simplify working with groups in complex REs.
Since groups are numbered from left to right and a complex expression
may use many groups, it can become difficult to keep track of the
correct numbering, and modifying such a complex RE is annoying.
Insert a new group near the beginning, and you change the numbers of
everything that follows it.

First, sometimes you'll want to use a group to collect a part of a
regular expression, but aren't interested in retrieving the group's
contents.  You can make this fact explicit by using a non-capturing
group: \regexp{(?:...)}, where you can put any other regular
expression inside the parentheses.

\begin{verbatim}
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
\end{verbatim}

Except for the fact that you can't retrieve the contents of what the
group matched, a non-capturing group behaves exactly the same as a
capturing group; you can put anything inside it, repeat it with a
repetition metacharacter such as \samp{*}, and nest it within other
groups (capturing or non-capturing).  \regexp{(?:...)} is particularly
useful when modifying an existing pattern, since you can add new groups
without changing how all the other groups are numbered.  It should be
mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than
the other.
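
To make the renumbering problem concrete, here's a small sketch.  The
patterns and the \samp{ID:} prefix are hypothetical, invented for this
illustration.

```python
import re

# An existing pattern: group 1 and group 2 are the two numbers.
original = re.compile(r'(\d+)-(\d+)')

# Adding an optional prefix with a capturing group shifts every number:
# the first number is now group 2.
shifted = re.compile(r'(ID:)?(\d+)-(\d+)')

# Using a non-capturing group for the prefix leaves the numbering alone.
stable = re.compile(r'(?:ID:)?(\d+)-(\d+)')

original_first = original.match('10-20').group(1)
shifted_first = shifted.match('ID:10-20').group(1)
stable_first = stable.match('ID:10-20').group(1)
```

Any code that already calls \code{m.group(1)} keeps working with the
non-capturing version, but silently receives the prefix with the capturing
one.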

The second, and more significant, feature is named groups; instead of
referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions:
\regexp{(?P<\var{name}>...)}.  \var{name} is, obviously, the name of
the group.  Except for associating a name with a group, named groups
also behave identically to capturing groups.  The \class{MatchObject}
methods that deal with capturing groups all accept either integers, to
refer to groups by number, or a string containing the group name.
Named groups are still given numbers, so you can retrieve information
about a group in two ways:

\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
\end{verbatim}

Named groups are handy because they let you use easily-remembered
names, instead of having to remember numbers.  Here's an example RE
from the \module{imaplib} module:

\begin{verbatim}
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
\end{verbatim}

It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
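
As a sketch of what that looks like in practice, the following applies the
RE to a made-up \samp{INTERNALDATE} string (the sample value is an
assumption, not taken from a real server response) and also uses
\method{groupdict()}, which returns all the named groups as a dictionary.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input in the shape the pattern expects.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

by_name = m.group('zonem')   # readable
by_number = m.group(9)       # same group, but you have to count
fields = m.groupdict()       # every named group at once
```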

Since the syntax for backreferences, in an expression like
\regexp{(...)\e 1}, refers to the number of the group, there's
naturally a variant that uses the group name instead of the number.
This is also a Python extension: \regexp{(?P=\var{name})} indicates
that the contents of the group called \var{name} should again be found
at the current point.  The regular expression for finding doubled
words, \regexp{(\e b\e w+)\e s+\e 1}, can also be written as
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:

\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}

\subsection{Lookahead Assertions}

Another zero-width assertion is the lookahead assertion.  Lookahead
assertions are available in both positive and negative form, and
look like this:

\begin{itemize}
\item[\regexp{(?=...)}] Positive lookahead assertion.  This succeeds
if the contained regular expression, represented here by \code{...},
successfully matches at the current location, and fails otherwise.
But, once the contained expression has been tried, the matching engine
doesn't advance at all; the rest of the pattern is tried right where
the assertion started.

\item[\regexp{(?!...)}] Negative lookahead assertion.  This is the
opposite of the positive assertion; it succeeds if the contained expression
\emph{doesn't} match at the current position in the string.
\end{itemize}

An example will help make this concrete by demonstrating a case
where a lookahead is useful.  Consider a simple pattern to match a
filename and split it apart into a base name and an extension,
separated by a \samp{.}.  For example, in \samp{news.rc}, \samp{news}
is the base name, and \samp{rc} is the filename's extension.

The pattern to match this is quite simple:

\regexp{.*[.].*\$}

Notice that the \samp{.} needs to be treated specially because it's a
metacharacter; I've put it inside a character class.  Also notice the
trailing \regexp{\$}; this is added to ensure that all the rest of the
string must be included in the extension.  This regular expression
matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
\samp{printers.conf}.

Now, consider complicating the problem a bit; what if you want to
match filenames where the extension is not \samp{bat}?
Some incorrect attempts:

\verb|.*[.][^b].*$|
% $

The first attempt above tries to exclude \samp{bat} by requiring that
the first character of the extension is not a \samp{b}.  This is
wrong, because the pattern also doesn't match \samp{foo.bar}.

% Messes up the HTML without the curly braces around \^
\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}

The expression gets messier when you try to patch up the first
solution by requiring one of the following cases to match: the first
character of the extension isn't \samp{b}; the second character isn't
\samp{a}; or the third character isn't \samp{t}.  This accepts
\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
three-letter extension and won't accept a filename with a two-letter
extension such as \samp{sendmail.cf}.  We'll complicate the pattern
again in an effort to fix it.

\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}

In the third attempt, the second and third letters are both made
optional in order to allow matching extensions shorter than three
characters, such as \samp{sendmail.cf}.

The pattern's getting really complicated now, which makes it hard to
read and understand.  Worse, if the problem changes and you want to
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
would get even more complicated and confusing.

A negative lookahead cuts through all this:

\regexp{.*[.](?!bat\$).*\$}
% $

The lookahead means: if the expression \regexp{bat} doesn't match at
this point, try the rest of the pattern; if \regexp{bat\$} does match,
the whole pattern will fail.  The trailing \regexp{\$} is required to
ensure that something like \samp{sample.batch}, where the extension
only starts with \samp{bat}, will be allowed.

Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion.  The following pattern excludes
filenames that end in either \samp{bat} or \samp{exe}:

\regexp{.*[.](?!bat\$|exe\$).*\$}
% $
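
Here's a sketch that tries the two lookahead patterns on a handful of
filenames.  The list of names, including \samp{setup.exe} and
\samp{archive.tar.bat}, is made up for illustration, and the
\code{strict} variant at the end is an addition in this sketch rather than
one of the patterns discussed above.

```python
import re

no_bat = re.compile(r'.*[.](?!bat$).*$')
no_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

# Hypothetical filenames to classify.
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'printers.conf', 'setup.exe']

accepted_no_bat = [n for n in names if no_bat.match(n)]
accepted_no_bat_or_exe = [n for n in names if no_bat_or_exe.match(n)]

# Caveat: with more than one dot, the leading .* can stop at an earlier
# dot, so the lookahead is tested in front of 'tar.bat' and succeeds.
multi_dot = no_bat.match('archive.tar.bat')

# One possible tightening (an assumption of this sketch): forbid further
# dots after the one that introduces the extension.
strict = re.compile(r'.*[.](?!bat$)[^.]*$')
strict_multi_dot = strict.match('archive.tar.bat')
```

For filenames with a single dot the patterns behave as described.  If your
filenames may contain several dots, it's worth testing a variant like
\code{strict} against your own data before relying on it.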


\section{Modifying Strings}

Up to this point, we've simply performed searches against a static
string.  Regular expressions are also commonly used to modify a string
in various ways, using the following \class{RegexObject} methods:

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
  \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
  \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
  \lineii{subn()}{Does the same thing as \method{sub()},
   but returns the new string and the number of replacements}
\end{tableii}


\subsection{Splitting Strings}

The \method{split()} method of a \class{RegexObject} splits a string
apart wherever the RE matches, returning a list of the pieces.
It's similar to the \method{split()} method of strings but
provides much more generality in the delimiters that you can split by;
the string \method{split()} only supports splitting by whitespace or by
a fixed string.  As you'd expect, there's a module-level
\function{re.split()} function, too.

\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
  Split \var{string} by the matches of the regular expression.  If
  capturing parentheses are used in the RE, then their contents will
  also be returned as part of the resulting list.  If \var{maxsplit}
  is nonzero, at most \var{maxsplit} splits are performed.
\end{methoddesc}

You can limit the number of splits made, by passing a value for
\var{maxsplit}.  When \var{maxsplit} is nonzero, at most
\var{maxsplit} splits will be made, and the remainder of the string is
returned as the final element of the list.  In the following example,
the delimiter is any sequence of non-alphanumeric characters.

\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
\end{verbatim}

Sometimes you're not only interested in what the text between
delimiters is, but also need to know what the delimiter was.  If
capturing parentheses are used in the RE, then their values are also
returned as part of the list.  Compare the following calls:

\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
\end{verbatim}
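
One consequence of capturing the delimiters is that nothing from the
original string is lost, so the pieces can be separated into words and
delimiters or joined back together.  A small sketch (the variable names are
illustrative):

```python
import re

p2 = re.compile(r'(\W+)')
text = 'This... is a test.'

pieces = p2.split(text)

# With a single capturing group, text and delimiters alternate:
# even positions hold the text, odd positions hold the delimiters.
words = pieces[0::2]
delimiters = pieces[1::2]

# Joining everything reproduces the input exactly.
rebuilt = ''.join(pieces)
```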

The module-level function \function{re.split()} adds the RE to be
used as the first argument, but is otherwise the same.

\begin{verbatim}
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}

\subsection{Search and Replace}

Another common task is to find all the matches for a pattern, and
replace them with a different string.  The \method{sub()} method takes
a replacement value, which can be either a string or a function, and
the string to be processed.

\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
Returns the string obtained by replacing the leftmost non-overlapping
occurrences of the RE in \var{string} by the replacement
\var{replacement}.  If the pattern isn't found, \var{string} is returned
unchanged.

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative
integer.  The default value of 0 means to replace all occurrences.
\end{methoddesc}

Here's a simple example of using the \method{sub()} method.  It
replaces colour names with the word \samp{colour}:

\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.sub( 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
\end{verbatim}

The \method{subn()} method does the same work, but returns a 2-tuple
containing the new string value and the number of replacements
that were performed:

\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.subn( 'colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn( 'colour', 'no colours at all')
('no colours at all', 0)
\end{verbatim}

Empty matches are replaced only when they're not
adjacent to a previous match.

\begin{verbatim}
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'
\end{verbatim}
 | 
						|
If \var{replacement} is a string, any backslash escapes in it are
 | 
						|
processed.  That is, \samp{\e n} is converted to a single newline
 | 
						|
character, \samp{\e r} is converted to a carriage return, and so forth.
 | 
						|
Unknown escapes such as \samp{\e j} are left alone.  Backreferences,
 | 
						|
such as \samp{\e 6}, are replaced with the substring matched by the
 | 
						|
corresponding group in the RE.  This lets you incorporate
 | 
						|
portions of the original text in the resulting
 | 
						|
replacement string.
 | 
						|
 | 
						|
This example matches the word \samp{section} followed by a string
 | 
						|
enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
 | 
						|
\samp{subsection}:
 | 
						|
 | 
						|
\begin{verbatim}
 | 
						|
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
 | 
						|
>>> p.sub(r'subsection{\1}','section{First} section{second}')
 | 
						|
'subsection{First} subsection{second}'
 | 
						|
\end{verbatim}
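
As a further sketch of the two behaviours described above, the
following snippet combines backreferences with an escape in the
replacement string.  The pattern, the sample string, and the variable
names are illustrative assumptions, not taken from the \module{re}
documentation.  It swaps the two halves of a \samp{key=value} pair and
appends a newline by way of \samp{\e n}:

```python
import re

# Two groups: the text before and after the '=' sign.
p = re.compile(r'(\w+)=(\w+)')

# \2 and \1 are backreferences to the groups above.  The \n escape is
# processed by sub() itself, so it becomes a real newline character
# even though the replacement is written as a raw string.
result = p.sub(r'\2=\1\n', 'key=value')
# result is now 'value=key' followed by a newline character
```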

There's also a syntax for referring to named groups as defined by the
\regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the
substring matched by the group named \samp{name}, and
\samp{\e g<\var{number}>} uses the corresponding group number.
\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
but isn't ambiguous in a
replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.)  The following
substitutions are all equivalent, but use all three variations of the
replacement string.

\begin{verbatim}
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
\end{verbatim}
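
The ambiguity that \samp{\e g<2>0} avoids can be seen in a small
sketch.  The two-digit pattern and the input \samp{47} are made up for
illustration; the point is only that \samp{\e 20} is read as a
reference to group 20, which this pattern doesn't define:

```python
import re

# Two groups, one digit each.
p = re.compile(r'(\d)(\d)')

# \g<2> is group 2; the '0' after it is a literal character.
result = p.sub(r'\g<2>0', '47')
# result == '70'

# r'\20' is instead taken as a reference to group 20.  The pattern
# has only two groups, so the re module rejects the replacement.
try:
    p.sub(r'\20', '47')
    rejected = False
except re.error:
    rejected = True
# rejected == True
```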

\var{replacement} can also be a function, which gives you even more
control.  If \var{replacement} is a function, the function is
called for every non-overlapping occurrence of \var{pattern}.  On each
call, the function is passed a \class{MatchObject} argument for the
match and can use this information to compute the desired replacement
string and return it.

In the following example, the replacement function translates
decimal numbers into hexadecimal:

\begin{verbatim}
>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
\end{verbatim}
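
A replacement function can also inspect individual groups and decide,
match by match, whether to change anything.  The following is a
minimal sketch under assumed names: the \code{values} table, the
\code{lookup()} function, and the \samp{\$name} placeholder syntax are
all hypothetical and exist only for this illustration.

```python
import re

# Hypothetical lookup table of values to substitute.
values = {'user': 'guido', 'lang': 'Python'}

def lookup(match):
    "Return the table entry for a $name, or the original text if unknown"
    name = match.group(1)
    # match.group(0) is the whole match, including the '$'.
    return values.get(name, match.group(0))

p = re.compile(r'\$(\w+)')
result = p.sub(lookup, '$user writes $lang, not $other')
# result == 'guido writes Python, not $other'
```

Returning \code{match.group(0)} for unknown names leaves that part of
the string exactly as it was.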

When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument.  The pattern may be a string or a
\class{RegexObject}; if you need to specify regular expression flags,
you must either use a \class{RegexObject} as the first parameter, or use
embedded modifiers in the pattern, e.g.
\code{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.
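
Both ways of supplying a flag to the module-level function can be
sketched side by side.  The variable names below are illustrative
only:

```python
import re

# Embedded modifier: (?i) turns on case-insensitive matching.
embedded = re.sub('(?i)b+', 'x', 'bbbb BBBB')

# Equivalent: compile with the flag, then pass the compiled object
# as the first argument of re.sub().
pattern = re.compile('b+', re.IGNORECASE)
compiled = re.sub(pattern, 'x', 'bbbb BBBB')

# Both results are 'x x'.
```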

\section{Common Problems}

Regular expressions are a powerful tool for some applications, but
their behaviour isn't always intuitive, and at times they don't
behave the way you might expect.  This section will point out
some of the most common pitfalls.

\subsection{Use String Methods}

Sometimes using the \module{re} module is a mistake.  If you're
matching a fixed string, or a single character class, and you're not
using any \module{re} features such as the \constant{IGNORECASE} flag,
then the full power of regular expressions may not be required.
Strings have several methods for performing operations with fixed
strings and they're usually much faster, because the implementation is
a single small C loop that's been optimized for the purpose, instead
of the large, more generalized regular expression engine.

One example is replacing a single fixed string with another
one; you might replace \samp{word}
with \samp{deed}.  \code{re.sub()} seems like the function to use for
this, but consider the \method{replace()} method.  Note that
\method{replace()} will also replace \samp{word} inside
words, turning \samp{swordfish} into \samp{sdeedfish}, but the
na{\"\i}ve RE \regexp{word} would have done that, too.  (To avoid performing
the substitution on parts of words, the pattern would have to be
\regexp{\e bword\e b}, in order to require that \samp{word} have a
word boundary on either side.  This takes the job beyond
\method{replace()}'s abilities.)
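
The difference between the three approaches can be checked directly.
This is a small sketch; the sample sentence and variable names are
made up for the illustration:

```python
import re

text = 'swordfish is a word'

# The string method and the naive RE behave identically: both
# rewrite the 'word' inside 'swordfish'.
by_method = text.replace('word', 'deed')
by_naive_re = re.sub('word', 'deed', text)
# both are 'sdeedfish is a deed'

# Only the RE with \b word boundaries leaves 'swordfish' alone.
by_boundary_re = re.sub(r'\bword\b', 'deed', text)
# by_boundary_re == 'swordfish is a deed'
```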

Another common task is deleting every occurrence of a single character
from a string or replacing it with another single character.  You
might do this with something like \code{re.sub('\e n', ' ', S)}, but
\method{translate()} is capable of doing both tasks
and will be faster than any regular expression operation can be.
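
As a rough sketch of both tasks, the following assumes a Python 3
interpreter, where translation tables are built with
\function{str.maketrans()}; in Python 2 the table would instead come
from \function{string.maketrans()}.  The sample string is invented for
the example.

```python
import re

S = 'one\ntwo\nthree'

# Replace every newline with a space.  The two-argument form of
# str.maketrans() maps each character of the first string to the
# character at the same position in the second.
replaced = S.translate(str.maketrans('\n', ' '))
# replaced == 'one two three', the same as re.sub('\n', ' ', S)
same_as_re = (replaced == re.sub('\n', ' ', S))

# Delete every newline.  The third argument lists characters to drop.
deleted = S.translate(str.maketrans('', '', '\n'))
# deleted == 'onetwothree'
```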

In short, before turning to the \module{re} module, consider whether
your problem can be solved with a faster and simpler string method.

\subsection{match() versus search()}

The \function{match()} function only checks whether the RE matches at
the beginning of the string, while \function{search()} scans
forward through the string for a match.
It's important to keep this distinction in mind.  Remember,
\function{match()} will only report a successful match that
starts at position 0; if the match wouldn't start at position 0,
\function{match()} will \emph{not} report it.

\begin{verbatim}
>>> print re.match('super', 'superstition').span()
(0, 5)
>>> print re.match('super', 'insuperable')
None
\end{verbatim}

On the other hand, \function{search()} will scan forward through the
string, reporting the first match it finds.

\begin{verbatim}
>>> print re.search('super', 'superstition').span()
(0, 5)
>>> print re.search('super', 'insuperable').span()
(2, 7)
\end{verbatim}

Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE.  Resist this temptation
and use \function{re.search()} instead.  The regular expression
compiler does some analysis of REs in order to speed up the process of
looking for a match.  One such analysis figures out what the first
character of a match must be; for example, a pattern starting with
\regexp{Crow} must match starting with a \character{C}.  The analysis
lets the engine quickly scan through the string looking for the
starting character, only trying the full match if a \character{C} is found.

Adding \regexp{.*} defeats this optimization, requiring scanning to
the end of the string and then backtracking to find a match for the
rest of the RE.  Use \function{re.search()} instead.
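
Apart from the cost, the two approaches also report different spans.
The following sketch uses an invented sample string to show this; it
doesn't measure the speed difference described above.

```python
import re

s = 'The Crow flies at midnight'

# Prefixing '.*' makes match() succeed, but the reported span starts
# at 0 and includes everything in front of 'Crow'.
with_prefix = re.match('.*Crow', s).span()
# with_prefix == (0, 8)

# search() finds the same text and reports where it actually is.
with_search = re.search('Crow', s).span()
# with_search == (4, 8)
```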

\subsection{Greedy versus Non-Greedy}

When repeating a regular expression, as in \regexp{a*}, the resulting
action is to consume as much of the string as possible.  This
fact often bites you when you're trying to match a pair of
balanced delimiters, such as the angle brackets surrounding an HTML
tag.  The na{\"\i}ve pattern for matching a single HTML tag doesn't
work because of the greedy nature of \regexp{.*}.

\begin{verbatim}
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
\end{verbatim}

The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string.  There's still more left
in the RE, though, and the \regexp{>} can't match at the end of
the string, so the regular expression engine has to backtrack
character by character until it finds a match for the \regexp{>}.
The final match extends from the \character{<} in \samp{<html>}
to the \character{>} in \samp{</title>}, which isn't what you want.

In this case, the solution is to use the non-greedy qualifiers
\regexp{*?}, \regexp{+?}, \regexp{??}, or
\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
possible.  In the above example, the \character{>} is tried
immediately after the first \character{<} matches, and when it fails,
the engine advances a character at a time, retrying the \character{>}
at every step.  This produces just the right result:

\begin{verbatim}
>>> print re.match('<.*?>', s).group()
<html>
\end{verbatim}
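
The same contrast shows up when collecting every match in the string.
This sketch reuses the string \code{s} from the examples above with
\function{re.findall()}; the variable names are illustrative.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: one match that runs from the first '<' to the last '>'.
greedy = re.findall('<.*>', s)
# greedy == ['<html><head><title>Title</title>']

# Non-greedy: each match stops at the nearest '>'.
lazy = re.findall('<.*?>', s)
# lazy == ['<html>', '<head>', '<title>', '</title>']
```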

(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML
have special cases that will break the obvious regular expression; by
the time you've written a regular expression that handles all of the
possible cases, the patterns will be \emph{very} complicated.  Use an
HTML or XML parser module for such tasks.)
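
As one possible illustration of the parser approach, the sketch below
assumes Python 3's standard \module{html.parser} module (named
\module{HTMLParser} in Python 2).  The \class{TagCollector} class is a
hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser  # Python 3 module name

class TagCollector(HTMLParser):
    "Hypothetical helper that records the name of every start tag"

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag it sees.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
# collector.tags == ['html', 'head', 'title']
```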

\subsection{Not Using re.VERBOSE}

By now you've probably noticed that regular expressions are a very
compact notation, but they're not terribly readable.  REs of
moderate complexity can become lengthy collections of backslashes,
parentheses, and metacharacters, making them difficult to read and
understand.

For such REs, specifying the \code{re.VERBOSE} flag when
compiling the regular expression can be helpful, because it allows
you to format the regular expression more clearly.

The \code{re.VERBOSE} flag has several effects.  Whitespace in the
regular expression that \emph{isn't} inside a character class is
ignored.  This means that an expression such as \regexp{dog | cat} is
equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
will still match the characters \character{a}, \character{b}, or a
space.  You can also put comments inside a RE; comments
extend from a \samp{\#} character to the next newline.  When used with
triple-quoted strings, this enables REs to be formatted more neatly:

\begin{verbatim}
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
\end{verbatim}
% $

This is far more readable than:

\begin{verbatim}
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
% $
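
A quick way to convince yourself that the two forms are equivalent is
to run both against the same input.  In this sketch the names
\code{verbose}, \code{compact}, and \code{line}, and the sample header
text, are assumptions made for the illustration.

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  Subject: Regular Expression HOWTO   '
m1 = verbose.match(line)
m2 = compact.match(line)

# Both patterns capture the same groups.  The value keeps the space
# that follows the colon but loses the trailing whitespace.
header = m1.group('header')   # 'Subject'
value = m1.group('value')     # ' Regular Expression HOWTO'
same = (m1.groupdict() == m2.groupdict())
```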

\section{Feedback}

Regular expressions are a complicated topic.  Did this document help
you understand them?  Were there parts that were unclear, or problems
you encountered that weren't covered here?  If so, please send
suggestions for improvements to the author.

The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and
Java's flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python.  (The first edition covered Python's now-removed
\module{regex} module, which won't help you much.)  Consider checking
it out from your library.

\end{document}