| \documentclass{howto}
 | |
| 
 | |
| % TODO:
 | |
| % Document lookbehind assertions
 | |
| % Better way of displaying a RE, a string, and what it matches
 | |
| % Mention optional argument to match.groups()
 | |
| % Unicode (at least a reference)
 | |
| 
 | |
| \title{Regular Expression HOWTO}
 | |
| 
 | |
| \release{0.05}
 | |
| 
 | |
| \author{A.M. Kuchling}
 | |
| \authoraddress{\email{amk@amk.ca}}
 | |
| 
 | |
| \begin{document}
 | |
| \maketitle
 | |
| 
 | |
| \begin{abstract}
 | |
| \noindent
 | |
| This document is an introductory tutorial to using regular expressions
 | |
| in Python with the \module{re} module.  It provides a gentler
 | |
| introduction than the corresponding section in the Library Reference.
 | |
| 
 | |
| This document is available from 
 | |
| \url{http://www.amk.ca/python/howto}.
 | |
| 
 | |
| \end{abstract}
 | |
| 
 | |
| \tableofcontents
 | |
| 
 | |
| \section{Introduction}
 | |
| 
 | |
| The \module{re} module was added in Python 1.5, and provides
 | |
| Perl-style regular expression patterns.  Earlier versions of Python
 | |
| came with the \module{regex} module, which provides Emacs-style
 | |
| patterns.  Emacs-style patterns are slightly less readable and
 | |
| don't provide as many features, so there's not much reason to use
 | |
| the \module{regex} module when writing new code, though you might
 | |
| encounter old code that uses it.
 | |
| 
 | |
| Regular expressions (or REs) are essentially a tiny, highly
 | |
| specialized programming language embedded inside Python and made
 | |
| available through the \module{re} module.  Using this little language,
 | |
| you specify the rules for the set of possible strings that you want to
 | |
| match; this set might contain English sentences, or e-mail addresses,
 | |
| or TeX commands, or anything you like.  You can then ask questions
 | |
| such as ``Does this string match the pattern?'', or ``Is there a match
 | |
| for the pattern anywhere in this string?''.  You can also use REs to
 | |
| modify a string or to split it apart in various ways.
 | |
| 
 | |
| Regular expression patterns are compiled into a series of bytecodes
 | |
| which are then executed by a matching engine written in C.  For
 | |
| advanced use, it may be necessary to pay careful attention to how the
 | |
| engine will execute a given RE, and write the RE in a certain way in
 | |
| order to produce bytecode that runs faster.  Optimization isn't
 | |
| covered in this document, because it requires that you have a good
 | |
| understanding of the matching engine's internals.
 | |
| 
 | |
| The regular expression language is relatively small and restricted, so
 | |
| not all possible string processing tasks can be done using regular
 | |
| expressions.  There are also tasks that \emph{can} be done with
 | |
| regular expressions, but the expressions turn out to be very
 | |
| complicated.  In these cases, you may be better off writing Python
 | |
| code to do the processing; while Python code will be slower than an
 | |
| elaborate regular expression, it will also probably be more understandable.
 | |
| 
 | |
| \section{Simple Patterns}
 | |
| 
 | |
| We'll start by learning about the simplest possible regular
 | |
| expressions.  Since regular expressions are used to operate on
 | |
| strings, we'll begin with the most common task: matching characters.
 | |
| 
 | |
| For a detailed explanation of the computer science underlying regular
 | |
| expressions (deterministic and non-deterministic finite automata), you
 | |
| can refer to almost any textbook on writing compilers.
 | |
| 
 | |
| \subsection{Matching Characters}
 | |
| 
 | |
| Most letters and characters will simply match themselves.  For
 | |
| example, the regular expression \regexp{test} will match the string
 | |
| \samp{test} exactly.  (You can enable a case-insensitive mode that
 | |
| would let this RE match \samp{Test} or \samp{TEST} as well; more
 | |
| about this later.)  
 | |
| 
 | |
| There are exceptions to this rule; some characters are
 | |
| special, and don't match themselves.  Instead, they signal that some
 | |
| out-of-the-ordinary thing should be matched, or they affect other
 | |
| portions of the RE by repeating them.  Much of this document is
 | |
| devoted to discussing various metacharacters and what they do.
 | |
| 
 | |
| Here's a complete list of the metacharacters; their meanings will be
 | |
| discussed in the rest of this HOWTO.
 | |
| 
 | |
| \begin{verbatim}
 | |
| . ^ $ * + ? { [ ] \ | ( )
 | |
| \end{verbatim}
 | |
| % $
 | |
| 
 | |
| The first metacharacters we'll look at are \samp{[} and \samp{]}.
 | |
| They're used for specifying a character class, which is a set of
 | |
| characters that you wish to match.  Characters can be listed
 | |
| individually, or a range of characters can be indicated by giving two
 | |
| characters and separating them by a \character{-}.  For example,
 | |
| \regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
 | |
| \samp{c}; this is the same as
 | |
| \regexp{[a-c]}, which uses a range to express the same set of
 | |
| characters.  If you wanted to match only lowercase letters, your
 | |
| RE would be \regexp{[a-z]}.
 | |
| 
 | |
| Most metacharacters are not active inside classes.  For example,
 | |
| \regexp{[akm\$]} will match any of the characters \character{a},
 | |
| \character{k}, \character{m}, or \character{\$}; \character{\$} is
 | |
| usually a metacharacter, but inside a character class it's stripped of
 | |
| its special nature.
 | |
| 
 | |
| You can match the characters not listed within a class by \dfn{complementing}
 | |
| the set.  This is indicated by including a \character{\^} as the first
 | |
| character of the class; a \character{\^} elsewhere in the class will simply match the
 | |
| \character{\^} character.  For example, \verb|[^5]| will match any
 | |
| character except \character{5}.
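The character-class rules above can be tried out with a short sketch
like the following.  It is only an illustration: the sample strings
are made up, and it relies on \function{re.findall()}, which this
HOWTO does not introduce until later.

```python
import re

# Sample text invented for this illustration.
text = 'back $5 cab'

# A listed set and the equivalent range pick out the same characters.
listed = re.findall(r'[abc]', text)
ranged = re.findall(r'[a-c]', text)

# Inside a class, '$' is an ordinary character.
dollars = re.findall(r'[akm$]', text)     # ['a', 'k', '$', 'a']

# A leading '^' complements the class: everything except '5'.
not_five = re.findall(r'[^5]', '1525')    # ['1', '2']
```

The first two results are identical because \regexp{[abc]} and
\regexp{[a-c]} describe the same set of characters.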
 | |
| 
 | |
| Perhaps the most important metacharacter is the backslash, \samp{\e}.  
 | |
| As in Python string literals, the backslash can be followed by various
 | |
| characters to signal various special sequences.  It's also used to escape
 | |
| all the metacharacters so you can still match them in patterns; for
 | |
| example, if you need to match a \samp{[} or 
 | |
| \samp{\e}, you can precede them with a backslash to remove their
 | |
| special meaning: \regexp{\e[} or \regexp{\e\e}.
 | |
| 
 | |
| Some of the special sequences beginning with \character{\e} represent
 | |
| predefined sets of characters that are often useful, such as the set
 | |
| of digits, the set of letters, or the set of anything that isn't
 | |
| whitespace.  The following predefined special sequences are available:
 | |
| 
 | |
| \begin{itemize}
 | |
| \item[\code{\e d}]Matches any decimal digit; this is
 | |
| equivalent to the class \regexp{[0-9]}.
 | |
| 
 | |
| \item[\code{\e D}]Matches any non-digit character; this is
 | |
| equivalent to the class \verb|[^0-9]|.
 | |
| 
 | |
| \item[\code{\e s}]Matches any whitespace character; this is
 | |
| equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
 | |
| 
 | |
| \item[\code{\e S}]Matches any non-whitespace character; this is
 | |
| equivalent to the class \verb|[^ \t\n\r\f\v]|.
 | |
| 
 | |
| \item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
 | |
| \regexp{[a-zA-Z0-9_]}.  
 | |
| 
 | |
| \item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
 | |
| \verb|[^a-zA-Z0-9_]|.   
 | |
| \end{itemize}
 | |
| 
 | |
| These sequences can be included inside a character class.  For
 | |
| example, \regexp{[\e s,.]} is a character class that will match any
 | |
| whitespace character, or \character{,} or \character{.}.
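As a hedged illustration of the predefined sequences, the sketch below
applies a few of them to a made-up ASCII string, again using
\function{re.findall()} from later in this HOWTO.  Note that on
Python 3, string patterns also accept non-ASCII digits and letters
for \regexp{\e d} and \regexp{\e w} unless \code{re.ASCII} is given;
the sample sticks to ASCII so the equivalences listed above hold.

```python
import re

# Sample text invented for this illustration; "\t" is a real tab.
text = 'Room 42, floor_3.\tOpen 9am'

digits = re.findall(r'\d', text)         # same as [0-9] here
words = re.findall(r'\w+', text)         # runs of [a-zA-Z0-9_]

# A special sequence used inside a class: whitespace, ',' or '.'.
separators = re.findall(r'[\s,.]', text)
```

Here \code{digits} comes out as \code{['4', '2', '3', '9']} and
\code{words} as \code{['Room', '42', 'floor_3', 'Open', '9am']}.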
 | |
| 
 | |
| The final metacharacter in this section is \regexp{.}.  It matches
 | |
| anything except a newline character, and there's an alternate mode
 | |
| (\code{re.DOTALL}) where it will match even a newline.  \character{.}
 | |
| is often used where you want to match ``any character''.  
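The difference between the two modes of \regexp{.} can be seen in a
small sketch such as this one; the two-line sample string is an
assumption made for the example.

```python
import re

# Two lines separated by a newline (sample data for illustration).
text = 'ab\ncd'

# By default '.' stops at the newline, so each line matches separately.
default = re.findall(r'.+', text)             # ['ab', 'cd']

# With re.DOTALL the newline is matched like any other character.
dotall = re.findall(r'.+', text, re.DOTALL)   # ['ab\ncd']
```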
 | |
| 
 | |
| \subsection{Repeating Things}
 | |
| 
 | |
| Being able to match varying sets of characters is the first thing
 | |
| regular expressions can do that isn't already possible with the
 | |
| methods available on strings.  However, if that were the only
 | |
| additional capability of regexes, they wouldn't be much of an advance.
 | |
| Another capability is that you can specify that portions of the RE
 | |
| must be repeated a certain number of times.
 | |
| 
 | |
| The first metacharacter for repeating things that we'll look at is
 | |
| \regexp{*}.  \regexp{*} doesn't match the literal character \samp{*};
 | |
| instead, it specifies that the previous character can be matched zero
 | |
| or more times, instead of exactly once.
 | |
| 
 | |
| For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
 | |
| characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
 | |
| characters), and so forth.  The RE engine has various internal
 | |
| limitations stemming from the size of C's \code{int} type, that will
 | |
| prevent it from matching over 2 billion \samp{a} characters; you
 | |
| probably don't have enough memory to construct a string that large, so
 | |
| you shouldn't run into that limit.
 | |
| 
 | |
| Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
 | |
| the matching engine will try to repeat it as many times as possible.
 | |
| If later portions of the pattern don't match, the matching engine will
 | |
| then back up and try again with fewer repetitions.
 | |
| 
 | |
| A step-by-step example will make this more obvious.  Let's consider
 | |
| the expression \regexp{a[bcd]*b}.  This matches the letter
 | |
| \character{a}, zero or more letters from the class \code{[bcd]}, and
 | |
| finally ends with a \character{b}.  Now imagine matching this RE
 | |
| against the string \samp{abcbd}.  
 | |
| 
 | |
| \begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
 | |
| \lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
 | |
| \lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
 | |
| it can, which is to the end of the string.}
 | |
| \lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
 | |
| current position is at the end of the string, so it fails.}
 | |
| \lineiii{4}{\code{abcb}}{Back up, so that  \regexp{[bcd]*} matches
 | |
| one less character.}
 | |
| \lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
 | |
| current position is at the last character, which is a \character{d}.}
 | |
| \lineiii{6}{\code{abc}}{Back up again, so that  \regexp{[bcd]*} is
 | |
| only matching \samp{bc}.}
 | |
| \lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time 
 | |
| the character at the current position is \character{b}, so it succeeds.}
 | |
| \end{tableiii}
 | |
| 
 | |
| The end of the RE has now been reached, and it has matched
 | |
| \samp{abcb}.  This demonstrates how the matching engine goes as far as
 | |
| it can at first, and if no match is found it will then progressively
 | |
| back up and retry the rest of the RE again and again.  It will back up
 | |
| until it has tried zero matches for \regexp{[bcd]*}, and if that
 | |
| subsequently fails, the engine will conclude that the string doesn't
 | |
| match the RE at all.
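The outcome of the walkthrough can be checked from Python.  This is a
minimal sketch that assumes only the module-level
\function{re.match()} function covered later in this HOWTO; the
extra input strings \samp{ab} and \samp{acd} are additions made for
the example.

```python
import re

pattern = r'a[bcd]*b'

# The greedy class first swallows 'bcbd', then backs up to leave a 'b'.
m = re.match(pattern, 'abcbd')
matched = m.group()     # 'abcb'
span = m.span()         # (0, 4)

# Backing up all the way to zero repetitions still yields a match here.
shortest = re.match(pattern, 'ab').group()   # 'ab'

# With no 'b' after the 'a', every retry fails and match() returns None.
no_match = re.match(pattern, 'acd')
```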
 | |
| 
 | |
| Another repeating metacharacter is \regexp{+}, which matches one or
 | |
| more times.  Pay careful attention to the difference between
 | |
| \regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
 | |
| times, so whatever's being repeated may not be present at all, while
 | |
| \regexp{+} requires at least \emph{one} occurrence.  To use a similar
 | |
| example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
 | |
| \samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
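A quick way to see the difference between \regexp{*} and \regexp{+}
is to run both against the same candidates.  The sketch below is
illustrative only and uses \function{re.match()}, described later.

```python
import re

candidates = ['ct', 'cat', 'caaat']

# '*' allows zero 'a' characters, so 'ct' is accepted.
star = [s for s in candidates if re.match(r'ca*t', s)]

# '+' needs at least one 'a', so 'ct' is rejected.
plus = [s for s in candidates if re.match(r'ca+t', s)]
```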
 | |
| 
 | |
| There are two more repeating qualifiers.  The question mark character,
 | |
| \regexp{?}, matches either once or zero times; you can think of it as
 | |
| marking something as being optional.  For example, \regexp{home-?brew}
 | |
| matches either \samp{homebrew} or \samp{home-brew}.  
 | |
| 
 | |
| The most complicated repeating qualifier is
 | |
| \regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
 | |
| integers.  This qualifier means there must be at least \var{m}
 | |
| repetitions, and at most \var{n}.  For example, \regexp{a/\{1,3\}b}
 | |
| will match \samp{a/b}, \samp{a//b}, and \samp{a///b}.  It won't match
 | |
| \samp{ab}, which has no slashes, or \samp{a////b}, which has four.
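The bounds can be verified with a sketch like the one below, which
simply runs the example pattern over the strings mentioned above.
The variable names are arbitrary, and \function{re.compile()} and
\method{match()} are covered in the next section.

```python
import re

slashes = re.compile(r'a/{1,3}b')

tests = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# Only the strings with one to three slashes are accepted.
accepted = [s for s in tests if slashes.match(s)]
```

The result is \code{['a/b', 'a//b', 'a///b']}.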
 | |
| 
 | |
| You can omit either \var{m} or \var{n}; in that case, a reasonable
 | |
| value is assumed for the missing value.  Omitting \var{m} is
 | |
| interpreted as a lower limit of 0, while omitting \var{n} results in  an
 | |
| upper bound of infinity --- actually, the 2 billion limit mentioned
 | |
| earlier, but that might as well be infinity.  
 | |
| 
 | |
| Readers of a reductionist bent may notice that the three other qualifiers
 | |
| can all be expressed using this notation.  \regexp{\{0,\}} is the same
 | |
| as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
 | |
| \regexp{\{0,1\}} is the same as \regexp{?}.  It's better to use
 | |
| \regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
 | |
| they're shorter and easier to read.
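The equivalences can be checked empirically.  The following sketch
compares each short qualifier with its brace form on a handful of
made-up samples; the helper \code{matches()} is hypothetical and
exists only for this illustration.  Agreement on a few samples does
not prove equivalence, of course, but it shows the notation at work.

```python
import re

# Sample strings invented for this illustration.
samples = ['ct', 'cat', 'caat', 'dog']

def matches(pattern):
    """Return the samples that the pattern matches from the start."""
    return [s for s in samples if re.match(pattern, s)]

star_same = matches(r'ca*t') == matches(r'ca{0,}t')
plus_same = matches(r'ca+t') == matches(r'ca{1,}t')
opt_same = matches(r'ca?t') == matches(r'ca{0,1}t')

# Omitting the lower bound means zero, so {,1} also behaves like '?'.
omitted_same = matches(r'ca{,1}t') == matches(r'ca?t')
```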
 | |
| 
 | |
| \section{Using Regular Expressions}
 | |
| 
 | |
| Now that we've looked at some simple regular expressions, how do we
 | |
| actually use them in Python?  The \module{re} module provides an
 | |
| interface to the regular expression engine, allowing you to compile
 | |
| REs into objects and then perform matches with them.
 | |
| 
 | |
| \subsection{Compiling Regular Expressions}
 | |
| 
 | |
| Regular expressions are compiled into \class{RegexObject} instances,
 | |
| which have methods for various operations such as searching for
 | |
| pattern matches or performing string substitutions.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> import re
 | |
| >>> p = re.compile('ab*')
 | |
| >>> print p
 | |
| <re.RegexObject instance at 80b4150>
 | |
| \end{verbatim}
 | |
| 
 | |
| \function{re.compile()} also accepts an optional \var{flags}
 | |
| argument, used to enable various special features and syntax
 | |
| variations.  We'll go over the available settings later, but for now a
 | |
| single example will do:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('ab*', re.IGNORECASE)
 | |
| \end{verbatim}
 | |
| 
 | |
| The RE is passed to \function{re.compile()} as a string.  REs are
 | |
| handled as strings because regular expressions aren't part of the core
 | |
| Python language, and no special syntax was created for expressing
 | |
| them.  (There are applications that don't need REs at all, so there's
 | |
| no need to bloat the language specification by including them.)
 | |
| Instead, the \module{re} module is simply a C extension module
 | |
| included with Python, just like the \module{socket} or \module{zlib}
 | |
| module.
 | |
| 
 | |
| Putting REs in strings keeps the Python language simpler, but has one
 | |
| disadvantage which is the topic of the next section.
 | |
| 
 | |
| \subsection{The Backslash Plague}
 | |
| 
 | |
| As stated earlier, regular expressions use the backslash
 | |
| character (\character{\e}) to indicate special forms or to allow
 | |
| special characters to be used without invoking their special meaning.
 | |
| This conflicts with Python's usage of the same character for the same
 | |
| purpose in string literals.
 | |
| 
 | |
| Let's say you want to write a RE that matches the string
 | |
| \samp{{\e}section}, which might be found in a \LaTeX\ file.  To figure
 | |
| out what to write in the program code, start with the desired string
 | |
| to be matched.  Next, you must escape any backslashes and other
 | |
| metacharacters by preceding them with a backslash, resulting in the
 | |
| string \samp{\e\e section}.  The string passed
 | |
| to \function{re.compile()} must therefore be \verb|\\section|.  However, to
 | |
| express this as a Python string literal, both backslashes must be
 | |
| escaped \emph{again}.
 | |
| 
 | |
| \begin{tableii}{c|l}{code}{Characters}{Stage}
 | |
|   \lineii{\e section}{Text string to be matched}
 | |
|   \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
 | |
|   \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
 | |
| \end{tableii}
 | |
| 
 | |
| In short, to match a literal backslash, one has to write
 | |
| \code{'\e\e\e\e'} as the RE string, because the regular expression
 | |
| must be \samp{\e\e}, and each backslash must be expressed as
 | |
| \samp{\e\e} inside a regular Python string literal.  In REs that
 | |
| feature backslashes repeatedly, this leads to lots of repeated
 | |
| backslashes and makes the resulting strings difficult to understand.
 | |
| 
 | |
| The solution is to use Python's raw string notation for regular
 | |
| expressions; backslashes are not handled in any special way in
 | |
| a string literal prefixed with \character{r}, so \code{r"\e n"} is a
 | |
| two-character string containing \character{\e} and \character{n},
 | |
| while \code{"\e n"} is a one-character string containing a newline.
 | |
| Frequently regular expressions will be expressed in Python
 | |
| code using this raw string notation.  
 | |
| 
 | |
| \begin{tableii}{c|c}{code}{Regular String}{Raw string}
 | |
|   \lineii{"ab*"}{\code{r"ab*"}}
 | |
|   \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
 | |
|   \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
 | |
| \end{tableii}
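The two notations in the table can be compared directly.  This is a
small sketch under the assumption that the text to be matched is a
\LaTeX-style line; the sample line itself is made up.

```python
import re

# A raw string keeps the backslash; a regular one turns \n into a newline.
raw_len = len(r"\n")       # 2
plain_len = len("\n")      # 1

# Both spellings produce the same pattern: two backslashes plus 'section'.
same_pattern = "\\\\section" == r"\\section"

# Sample input (an assumption for the example): one literal backslash.
text = r"\section{Introduction}"
m = re.match(r"\\section", text)
matched = m.group()        # a backslash followed by 'section'
```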
 | |
| 
 | |
| \subsection{Performing Matches}
 | |
| 
 | |
| Once you have an object representing a compiled regular expression,
 | |
| what do you do with it?  \class{RegexObject} instances have several
 | |
| methods and attributes.  Only the most significant ones will be
 | |
| covered here; consult \ulink{the Library
 | |
| Reference}{http://www.python.org/doc/lib/module-re.html} for a
 | |
| complete listing.
 | |
| 
 | |
| \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
 | |
|   \lineii{match()}{Determine if the RE matches at the beginning of
 | |
|   the string.}
 | |
|   \lineii{search()}{Scan through a string, looking for any location
 | |
|   where this RE matches.}
 | |
|   \lineii{findall()}{Find all substrings where the RE matches,
 | |
| and return them as a list.}
 | |
|   \lineii{finditer()}{Find all substrings where the RE matches,
 | |
| and return them as an iterator.}
 | |
| \end{tableii}
 | |
| 
 | |
| \method{match()} and \method{search()} return \code{None} if no match
 | |
| can be found.  If they're successful, a \code{MatchObject} instance is
 | |
| returned, containing information about the match: where it starts and
 | |
| ends, the substring it matched, and more.
 | |
| 
 | |
| You can learn about this by interactively experimenting with the
 | |
| \module{re} module.  If you have Tkinter available, you may also want
 | |
| to look at \file{Tools/scripts/redemo.py}, a demonstration program
 | |
| included with the Python distribution.  It allows you to enter REs and
 | |
| strings, and displays whether the RE matches or fails.
 | |
| \file{redemo.py} can be quite useful when trying to debug a
 | |
| complicated RE.  Phil Schwartz's
 | |
| \ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
 | |
| tool for developing and testing RE patterns.  This HOWTO will use the
 | |
| standard Python interpreter for its examples.
 | |
| 
 | |
| First, run the Python interpreter, import the \module{re} module, and
 | |
| compile a RE:
 | |
| 
 | |
| \begin{verbatim}
 | |
| Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
 | |
| >>> import re
 | |
| >>> p = re.compile('[a-z]+')
 | |
| >>> p
 | |
| <_sre.SRE_Pattern object at 80c3c28>
 | |
| \end{verbatim}
 | |
| 
 | |
| Now, you can try matching various strings against the RE
 | |
| \regexp{[a-z]+}.  An empty string shouldn't match at all, since
 | |
| \regexp{+} means ``one or more repetitions''.  \method{match()} should
 | |
| return \code{None} in this case, which will cause the interpreter to
 | |
| print no output.  You can explicitly print the result of
 | |
| \method{match()} to make this clear.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p.match("")
 | |
| >>> print p.match("")
 | |
| None
 | |
| \end{verbatim}
 | |
| 
 | |
| Now, let's try it on a string that it should match, such as
 | |
| \samp{tempo}.  In this case, \method{match()} will return a
 | |
| \class{MatchObject}, so you should store the result in a variable for
 | |
| later use.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> m = p.match( 'tempo')
 | |
| >>> print m
 | |
| <_sre.SRE_Match object at 80c4f68>
 | |
| \end{verbatim}
 | |
| 
 | |
| Now you can query the \class{MatchObject} for information about the
 | |
| matching string.   \class{MatchObject} instances also have several
 | |
| methods and attributes; the most important ones are:
 | |
| 
 | |
| \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
 | |
|   \lineii{group()}{Return the string matched by the RE}
 | |
|   \lineii{start()}{Return the starting position of the match}
 | |
|   \lineii{end()}{Return the ending position of the match}
 | |
|   \lineii{span()}{Return a tuple containing the (start, end) positions 
 | |
|                   of the match}
 | |
| \end{tableii}
 | |
| 
 | |
| Trying these methods will soon clarify their meaning:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> m.group()
 | |
| 'tempo'
 | |
| >>> m.start(), m.end()
 | |
| (0, 5)
 | |
| >>> m.span()
 | |
| (0, 5)
 | |
| \end{verbatim}
 | |
| 
 | |
| \method{group()} returns the substring that was matched by the
 | |
| RE.  \method{start()} and \method{end()} return the starting and
 | |
| ending index of the match. \method{span()} returns both start and end
 | |
| indexes in a single tuple.  Since the \method{match()} method only
 | |
| checks if the RE matches at the start of a string,
 | |
| \method{start()} will always be zero.  However, the \method{search()}
 | |
| method of \class{RegexObject} instances scans through the string, so 
 | |
| the match may not start at zero in that case.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print p.match('::: message')
 | |
| None
 | |
| >>> m = p.search('::: message') ; print m
 | |
| <re.MatchObject instance at 80c9650>
 | |
| >>> m.group()
 | |
| 'message'
 | |
| >>> m.span()
 | |
| (4, 11)
 | |
| \end{verbatim}
 | |
| 
 | |
| In actual programs, the most common style is to store the
 | |
| \class{MatchObject} in a variable, and then check if it was
 | |
| \code{None}.  This usually looks like:
 | |
| 
 | |
| \begin{verbatim}
 | |
| p = re.compile( ... )
 | |
| m = p.match( 'string goes here' )
 | |
| if m:
 | |
|     print 'Match found: ', m.group()
 | |
| else:
 | |
|     print 'No match'
 | |
| \end{verbatim}
 | |
| 
 | |
| Two \class{RegexObject} methods return all of the matches for a pattern.
 | |
| \method{findall()} returns a list of matching strings:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'\d+')
 | |
| >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
 | |
| ['12', '11', '10']
 | |
| \end{verbatim}
 | |
| 
 | |
| \method{findall()} has to create the entire list before it can be
 | |
| returned as the result.  In Python 2.2, the \method{finditer()} method
 | |
| is also available, returning a sequence of \class{MatchObject} instances 
 | |
| as an iterator.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
 | |
| >>> iterator
 | |
| <callable-iterator object at 0x401833ac>
 | |
| >>> for match in iterator:
 | |
| ...     print match.span()
 | |
| ...
 | |
| (0, 2)
 | |
| (22, 24)
 | |
| (29, 31)
 | |
| \end{verbatim}
 | |
| 
 | |
| 
 | |
| \subsection{Module-Level Functions}
 | |
| 
 | |
| You don't have to produce a \class{RegexObject} and call its methods;
 | |
| the \module{re} module also provides top-level functions called
 | |
| \function{match()}, \function{search()}, \function{sub()}, and so
 | |
| forth.  These functions take the same arguments as the corresponding
 | |
| \class{RegexObject} method, with the RE string added as the first
 | |
| argument, and still return either \code{None} or a \class{MatchObject}
 | |
| instance.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.match(r'From\s+', 'Fromage amk')
 | |
| None
 | |
| >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
 | |
| <re.MatchObject instance at 80c5978>
 | |
| \end{verbatim}
 | |
| 
 | |
| Under the hood, these functions simply produce a \class{RegexObject}
 | |
| for you and call the appropriate method on it.  They also store the
 | |
| compiled object in a cache, so future calls using the same
 | |
| RE are faster.  
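To make the relationship concrete, the sketch below performs the same
match twice, once through the module-level function and once through
an explicitly compiled object.  It is an illustration only; the
variable names are arbitrary.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is the first argument.
m1 = re.match(r'From\s+', line)

# The same match through an explicitly compiled object.
p = re.compile(r'From\s+')
m2 = p.match(line)

# Both routes report the same matched text and position.
same = (m1.group() == m2.group()) and (m1.span() == m2.span())
```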
 | |
| 
 | |
| Should you use these module-level functions, or should you get the
 | |
| \class{RegexObject} and call its methods yourself?  That choice
 | |
| depends on how frequently the RE will be used, and on your personal
 | |
| coding style.  If a RE is being used at only one point in the code,
 | |
| then the module functions are probably more convenient.  If a program
 | |
| contains a lot of regular expressions, or re-uses the same ones in
 | |
| several locations, then it might be worthwhile to collect all the
 | |
| definitions in one place, in a section of code that compiles all the
 | |
| REs ahead of time.  To take an example from the standard library,
 | |
| here's an extract from \file{xmllib.py}:
 | |
| 
 | |
| \begin{verbatim}
 | |
| ref = re.compile( ... )
 | |
| entityref = re.compile( ... )
 | |
| charref = re.compile( ... )
 | |
| starttagopen = re.compile( ... )
 | |
| \end{verbatim}
 | |
| 
 | |
| I generally prefer to work with the compiled object, even for
 | |
| one-time uses, but few people will be as much of a purist about this
 | |
| as I am.
 | |
| 
 | |
| \subsection{Compilation Flags}
 | |
| 
 | |
| Compilation flags let you modify some aspects of how regular
 | |
| expressions work.  Flags are available in the \module{re} module under
 | |
| two names, a long name such as \constant{IGNORECASE}, and a short,
 | |
| one-letter form such as \constant{I}.  (If you're familiar with Perl's
 | |
| pattern modifiers, the one-letter forms use the same letters; the
 | |
| short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
 | |
| Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
 | |
| re.M} sets both the \constant{I} and \constant{M} flags, for example.
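A hedged sketch of combining flags follows.  The sample text is
invented, and the pattern uses \regexp{\^}, which is only explained
further on in this HOWTO; the point here is just that the short and
long flag names combine in the same way.

```python
import re

# Two-line sample text, made up for the example.
text = 'first line\nSECOND line'

# Short and long names are interchangeable when OR-ing flags together.
p = re.compile(r'^second', re.I | re.M)
q = re.compile(r'^second', re.IGNORECASE | re.MULTILINE)
same_flags = p.flags == q.flags

# Both flags are needed: case is ignored and '^' matches after the newline.
found = p.search(text).group()              # 'SECOND'
without_flags = re.search(r'^second', text) # None
```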
 | |
| 
 | |
| Here's a table of the available flags, followed by
 | |
| a more detailed explanation of each one.
 | |
| 
 | |
| \begin{tableii}{c|l}{}{Flag}{Meaning}
 | |
|   \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
 | |
|   character, including newlines}
 | |
|   \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
 | |
|   \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
 | |
|   \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
 | |
|   affecting \regexp{\^} and \regexp{\$}}
 | |
|   \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
 | |
|   which can be organized more cleanly and understandably.}
 | |
| \end{tableii}
 | |
| 
 | |
| \begin{datadesc}{I}
 | |
| \dataline{IGNORECASE}
 | |
| Perform case-insensitive matching; character classes and literal strings
 | |
| will match
 | |
| letters by ignoring case.  For example, \regexp{[A-Z]} will match
 | |
| lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
 | |
| \samp{spam}, or \samp{spAM}.
 | |
| This lowercasing doesn't take the current locale into account; it will
 | |
| if you also set the \constant{LOCALE} flag.
 | |
| \end{datadesc}
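The effect of \constant{IGNORECASE} on both a literal string and a
character class can be seen in a short sketch such as this one; the
extra sample \samp{ham} is an addition for the example.

```python
import re

p = re.compile(r'Spam', re.IGNORECASE)

# The literal matches regardless of case; 'ham' is a non-matching control.
hits = [s for s in ['Spam', 'spam', 'spAM', 'ham'] if p.match(s)]

# An uppercase-only class also picks up lowercase letters.
letters = re.findall(r'[A-Z]', 'aBc', re.IGNORECASE)   # ['a', 'B', 'c']
```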
 | |
| 
 | |
| \begin{datadesc}{L}
 | |
| \dataline{LOCALE}
 | |
| Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
 | |
| and \regexp{\e B}, dependent on the current locale.  
 | |
| 
 | |
| Locales are a feature of the C library intended to help in writing
 | |
| programs that take account of language differences.  For example, if
 | |
| you're processing French text, you'd want to be able to write
 | |
| \regexp{\e w+} to match words, but \regexp{\e w} only matches the
 | |
| character class \regexp{[a-zA-Z0-9_]}; it won't match \character{\'e} or
 | |
| \character{\c c}.  If your system is configured properly and a French
 | |
| locale is selected, certain C functions will tell the program that
 | |
| \character{\'e} should also be considered a letter.  Setting the
 | |
| \constant{LOCALE} flag when compiling a regular expression will cause the
 | |
| resulting compiled object to use these C functions for \regexp{\e w};
 | |
| this is slower, but also enables \regexp{\e w+} to match French words as
 | |
| you'd expect.
 | |
| \end{datadesc}
 | |
| 
 | |
| \begin{datadesc}{M}
 | |
| \dataline{MULTILINE}
 | |
| (\regexp{\^} and \regexp{\$} haven't been explained yet; 
 | |
| they'll be introduced in section~\ref{more-metacharacters}.)
 | |
| 
 | |
| Usually \regexp{\^} matches only at the beginning of the string, and
 | |
| \regexp{\$} matches only at the end of the string and immediately before the
 | |
| newline (if any) at the end of the string. When this flag is
 | |
| specified, \regexp{\^} matches at the beginning of the string and at
 | |
| the beginning of each line within the string, immediately following
 | |
| each newline.  Similarly, the \regexp{\$} metacharacter matches both at
 | |
| the end of the string and at the end of each line (immediately
 | |
| preceding each newline).
 | |
| 
 | |
| \end{datadesc}
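The behaviour of \constant{MULTILINE} described above can be sketched
as follows, using a made-up three-line string and
\function{re.findall()}.

```python
import re

# Three lines of sample text, invented for this illustration.
text = 'From alice\nTo bob\nFrom carol'

# Without the flag, '^' matches only at the start of the string.
single = re.findall(r'^From', text)                  # ['From']

# With the flag, '^' also matches right after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)     # ['From', 'From']

# Likewise '$' matches before each newline as well as at the very end.
line_ends = re.findall(r'\w+$', text, re.MULTILINE)
```

With the flag set, \code{line_ends} is
\code{['alice', 'bob', 'carol']}; without it, only \samp{carol}
would be found.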
 | |
| 
 | |
| \begin{datadesc}{S}
 | |
| \dataline{DOTALL}
 | |
| Makes the \character{.} special character match any character at all,
 | |
| including a newline; without this flag, \character{.} will match
 | |
| anything \emph{except} a newline.
 | |
| \end{datadesc}
 | |
| 
 | |
| \begin{datadesc}{X}
 | |
| \dataline{VERBOSE} This flag allows you to write regular expressions
 | |
| that are more readable by granting you more flexibility in how you can
 | |
| format them.  When this flag has been specified, whitespace within the
 | |
| RE string is ignored, except when the whitespace is in a character
 | |
| class or preceded by an unescaped backslash; this lets you organize
 | |
| and indent the RE more clearly.  It also enables you to put comments
 | |
| within a RE that will be ignored by the engine; comments are marked by
 | |
| a \character{\#} that's neither in a character class nor preceded by an
 | |
| unescaped backslash.
 | |
| 
 | |
| For example, here's a RE that uses \constant{re.VERBOSE}; see how
 | |
| much easier it is to read?
 | |
| 
 | |
| \begin{verbatim}
 | |
| charref = re.compile(r"""
 | |
|  &[#]		     # Start of a numeric entity reference
 | |
|  (
 | |
|    [0-9]+[^0-9]      # Decimal form
 | |
|    | 0[0-7]+[^0-7]   # Octal form
 | |
|    | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 | |
|  )
 | |
| """, re.VERBOSE)
 | |
| \end{verbatim}
 | |
| 
 | |
| Without the verbose setting, the RE would look like this:
 | |
| \begin{verbatim}
 | |
| charref = re.compile("&#([0-9]+[^0-9]"
 | |
|                      "|0[0-7]+[^0-7]"
 | |
|                      "|x[0-9a-fA-F]+[^0-9a-fA-F])")
 | |
| \end{verbatim}
 | |
| 
 | |
| In the above example, Python's automatic concatenation of string
 | |
| literals has been used to break up the RE into smaller pieces, but
 | |
| it's still more difficult to understand than the version using
 | |
| \constant{re.VERBOSE}.
 | |
| 
 | |
| \end{datadesc}
 | |
| 
 | |
| \section{More Pattern Power}
 | |
| 
 | |
| So far we've only covered a part of the features of regular
 | |
| expressions.  In this section, we'll cover some new metacharacters,
 | |
| and how to use groups to retrieve portions of the text that was matched.
 | |
| 
 | |
| \subsection{More Metacharacters\label{more-metacharacters}}
 | |
| 
 | |
| There are some metacharacters that we haven't covered yet.  Most of
 | |
| them will be covered in this section.
 | |
| 
 | |
| Some of the remaining metacharacters to be discussed are
 | |
| \dfn{zero-width assertions}.  They don't cause the engine to advance
 | |
| through the string; instead, they consume no characters at all,
 | |
| and simply succeed or fail.  For example, \regexp{\e b} is an
 | |
| assertion that the current position is located at a word boundary; the
 | |
| position isn't changed by the \regexp{\e b} at all.  This means that
 | |
| zero-width assertions should never be repeated, because if they match
 | |
| once at a given location, they can obviously be matched an infinite
 | |
| number of times.
 | |
| 
 | |
| \begin{list}{}{}
 | |
| 
 | |
| \item[\regexp{|}] 
 | |
| Alternation, or the ``or'' operator.  
 | |
| If A and B are regular expressions, 
 | |
| \regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
 | |
| \regexp{|} has very low precedence in order to make it work reasonably when
 | |
| you're alternating multi-character strings.
 | |
| \regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
 | |
| \samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
 | |
| 
 | |
| To match a literal \character{|},
 | |
| use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
 | |
| 
 | |
| \item[\regexp{\^}] Matches at the beginning of lines.  Unless the
 | |
| \constant{MULTILINE} flag has been set, this will only match at the
 | |
| beginning of the string.  In \constant{MULTILINE} mode, this also
 | |
| matches immediately after each newline within the string.  
 | |
| 
 | |
| For example, if you wish to match the word \samp{From} only at the
 | |
| beginning of a line, the RE to use is \verb|^From|.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.search('^From', 'From Here to Eternity')
 | |
| <re.MatchObject instance at 80c1520>
 | |
| >>> print re.search('^From', 'Reciting From Memory')
 | |
| None
 | |
| \end{verbatim}
 | |
| 
 | |
| %To match a literal \character{\^}, use \regexp{\e\^} or enclose it
 | |
| %inside a character class, as in \regexp{[{\e}\^]}.
 | |
| 
 | |
| \item[\regexp{\$}] Matches at the end of a line, which is defined as
 | |
| either the end of the string, or any location followed by a newline
 | |
| character.    
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.search('}$', '{block}')
 | |
| <re.MatchObject instance at 80adfa8>
 | |
| >>> print re.search('}$', '{block} ')
 | |
| None
 | |
| >>> print re.search('}$', '{block}\n')
 | |
| <re.MatchObject instance at 80adfa8>
 | |
| \end{verbatim}
 | |
| % $
 | |
| 
 | |
| To match a literal \character{\$}, use \regexp{\e\$} or enclose it
 | |
| inside a character class, as in  \regexp{[\$]}.
 | |
| 
 | |
| \item[\regexp{\e A}] Matches only at the start of the string.  When
 | |
| not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
 | |
| effectively the same.  In \constant{MULTILINE} mode, however, they're
 | |
| different; \regexp{\e A} still matches only at the beginning of the
 | |
| string, but \regexp{\^} may match at any location inside the string
 | |
| that follows a newline character.
 | |
| 
 | |
| \item[\regexp{\e Z}] Matches only at the end of the string.  
 | |
| 
 | |
| \item[\regexp{\e b}] Word boundary.  
 | |
| This is a zero-width assertion that matches only at the
 | |
| beginning or end of a word.  A word is defined as a sequence of
 | |
| alphanumeric characters, so the end of a word is indicated by
 | |
| whitespace or a non-alphanumeric character.  
 | |
| 
 | |
| The following example matches \samp{class} only when it's a complete
 | |
| word; it won't match when it's contained inside another word.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'\bclass\b')
 | |
| >>> print p.search('no class at all')
 | |
| <re.MatchObject instance at 80c8f28>
 | |
| >>> print p.search('the declassified algorithm')
 | |
| None
 | |
| >>> print p.search('one subclass is')
 | |
| None
 | |
| \end{verbatim}
 | |
| 
 | |
| There are two subtleties you should remember when using this special
 | |
| sequence.  First, this is the worst collision between Python's string
 | |
| literals and regular expression sequences.  In Python's string
 | |
| literals, \samp{\e b} is the backspace character, ASCII value 8.  If
 | |
| you're not using raw strings, then Python will convert the \samp{\e b} to
 | |
| a backspace, and your RE won't match as you expect it to.  The
 | |
| following example looks the same as our previous RE, but omits
 | |
| the \character{r} in front of the RE string.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('\bclass\b')
 | |
| >>> print p.search('no class at all')
 | |
| None
 | |
| >>> print p.search('\b' + 'class' + '\b')  
 | |
| <re.MatchObject instance at 80c3ee0>
 | |
| \end{verbatim}
 | |
| 
 | |
| Second, inside a character class, where there's no use for this
 | |
| assertion, \regexp{\e b} represents the backspace character, for
 | |
| compatibility with Python's string literals.
 | |
| 
 | |
| \item[\regexp{\e B}] Another zero-width assertion, this is the
 | |
| opposite of \regexp{\e b}, only matching when the current
 | |
| position is not at a word boundary.
 | |
| 
 | |
| \end{list}
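The metacharacters in the list above can be compared side by side.  The following is a small sketch for Python 3; the sample strings are invented for illustration, and the variable names are not part of the \module{re} module.

```python
import re

# '|' has low precedence: the alternatives are the whole words.
words = re.findall(r'Crow|Servo', 'Crow, Servo and Crowley')

text = 'From: amk\nFrom: guido\n'

# '^' matches only at the start of the string unless MULTILINE is set.
plain = re.findall(r'^From', text)
multi = re.findall(r'^From', text, re.MULTILINE)

# '\A' ignores MULTILINE and still matches only at the very start.
anchored = re.findall(r'\AFrom', text, re.MULTILINE)

# '$' also matches just before a trailing newline; '\Z' does not.
dollar = re.search(r'\}$', '{block}\n')
strict = re.search(r'\}\Z', '{block}\n')

# '\B' succeeds only where there is no word boundary.
inside = re.search(r'\Bclass', 'one subclass is')
whole = re.search(r'\Bclass', 'no class at all')

# Inside a character class, '\b' stands for the backspace character.
backspace = re.match(r'[\b]', '\x08')
```

Here \code{plain} and \code{anchored} each contain one \samp{From} while \code{multi} contains two, \code{strict} and \code{whole} are \code{None}, and the remaining searches succeed.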
 | |
| 
 | |
| \subsection{Grouping}
 | |
| 
 | |
| Frequently you need to obtain more information than just whether the
 | |
| RE matched or not.  Regular expressions are often used to dissect
 | |
| strings by writing a RE divided into several subgroups which
 | |
| match different components of interest.  For example, an RFC-822
 | |
| header line is divided into a header name and a value, separated by a
 | |
| \character{:}.  This can be handled by writing a regular expression
 | |
| which matches an entire header line, and has one group which matches the
 | |
| header name, and another group which matches the header's value.
 | |
| 
 | |
| Groups are marked by the \character{(}, \character{)} metacharacters.
 | |
| \character{(} and \character{)} have much the same meaning as they do
 | |
| in mathematical expressions; they group together the expressions
 | |
| contained inside them. For example, you can repeat the contents of a
 | |
| group with a repeating qualifier, such as \regexp{*}, \regexp{+},
 | |
| \regexp{?}, or \regexp{\{\var{m},\var{n}\}}.  For example,
 | |
| \regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('(ab)*')
 | |
| >>> print p.match('ababababab').span()
 | |
| (0, 10)
 | |
| \end{verbatim}
 | |
| 
 | |
| Groups indicated with \character{(}, \character{)} also capture the
 | |
| starting and ending index of the text that they match; this can be
 | |
| retrieved by passing an argument to \method{group()},
 | |
| \method{start()}, \method{end()}, and \method{span()}.  Groups are
 | |
| numbered starting with 0.  Group 0 is always present; it's the whole
 | |
| RE, so \class{MatchObject} methods all have group 0 as their default
 | |
| argument.  Later we'll see how to express groups that don't capture
 | |
| the span of text that they match.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('(a)b')
 | |
| >>> m = p.match('ab')
 | |
| >>> m.group()
 | |
| 'ab'
 | |
| >>> m.group(0)
 | |
| 'ab'
 | |
| \end{verbatim}
 | |
| 
 | |
| Subgroups are numbered from left to right, from 1 upward.  Groups can
 | |
| be nested; to determine the number, just count the opening parenthesis
 | |
| characters, going from left to right.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('(a(b)c)d')
 | |
| >>> m = p.match('abcd')
 | |
| >>> m.group(0)
 | |
| 'abcd'
 | |
| >>> m.group(1)
 | |
| 'abc'
 | |
| >>> m.group(2)
 | |
| 'b'
 | |
| \end{verbatim}
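The same group numbers can be passed to \method{start()}, \method{end()}, and \method{span()}.  A minimal sketch for Python 3, reusing the nested pattern from the example above:

```python
import re

m = re.match(r'(a(b)c)d', 'abcd')

whole_span = m.span()     # group 0, the entire match
outer_span = m.span(1)    # '(a(b)c)' opens first, so it is group 1
inner_span = m.span(2)    # '(b)' opens second, so it is group 2

# start() and end() give the same indices individually.
inner_text = 'abcd'[m.start(2):m.end(2)]
```

Slicing the original string with a group's start and end indices gives back the text that \method{group()} returns for that group.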
 | |
| 
 | |
| \method{group()} can be passed multiple group numbers at a time, in
 | |
| which case it will return a tuple containing the corresponding values
 | |
| for those groups.
 | |
| 
 | |
| \begin{verbatim}  
 | |
| >>> m.group(2,1,2)
 | |
| ('b', 'abc', 'b')
 | |
| \end{verbatim}  
 | |
| 
 | |
| The \method{groups()} method returns a tuple containing the strings
 | |
| for all the subgroups, from 1 up to however many there are.
 | |
| 
 | |
| \begin{verbatim}  
 | |
| >>> m.groups()
 | |
| ('abc', 'b')
 | |
| \end{verbatim}  
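Returning to the header-line case mentioned at the start of this section, one possible pattern uses one group for the header name and one for the value.  This is only a sketch for Python 3: the pattern is a simplified assumption that ignores continuation lines and other details of the real header syntax, and the sample line is made up.

```python
import re

# Assumed, simplified pattern: group 1 is the name, group 2 the value.
header = re.compile(r'([^:\s]+):\s*(.*)')

m = header.match('Subject: Regular Expression HOWTO')

# groups() returns the subgroups in order, so they unpack directly.
name, value = m.groups()
```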
 | |
| 
 | |
| Backreferences in a pattern allow you to specify that the contents of
 | |
| an earlier capturing group must also be found at the current location
 | |
| in the string.  For example, \regexp{\e 1} will succeed if the exact
 | |
| contents of group 1 can be found at the current position, and fails
 | |
| otherwise.  Remember that Python's string literals also use a
 | |
| backslash followed by numbers to allow including arbitrary characters
 | |
| in a string, so be sure to use a raw string when incorporating
 | |
| backreferences in a RE.
 | |
| 
 | |
| For example, the following RE detects doubled words in a string.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'(\b\w+)\s+\1')
 | |
| >>> p.search('Paris in the the spring').group()
 | |
| 'the the'
 | |
| \end{verbatim}
 | |
| 
 | |
| Backreferences like this aren't often useful for just searching
 | |
| through a string --- there are few text formats which repeat data in
 | |
| this way --- but you'll soon find out that they're \emph{very} useful
 | |
| when performing string substitutions.
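The warning about raw strings can be checked directly.  The sketch below, written for Python 3 with an invented sample sentence, shows what the string literal \code{'\e 1'} contains with and without the \character{r} prefix, and then collects every doubled word with \method{finditer()}.

```python
import re

# Without the r prefix, '\1' is the single character with code 1.
cooked = '\1'
# With the prefix, the backslash and the digit both reach the RE engine.
raw = r'\1'

doubled = re.compile(r'(\b\w+)\s+\1')
hits = [m.group()
        for m in doubled.finditer('Paris in the the spring is is nice')]
```

\code{hits} ends up holding \code{'the the'} and \code{'is is'}.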
 | |
| 
 | |
| \subsection{Non-capturing and Named Groups}
 | |
| 
 | |
| Elaborate REs may use many groups, both to capture substrings of
 | |
| interest, and to group and structure the RE itself.  In complex REs,
 | |
| it becomes difficult to keep track of the group numbers.  There are
 | |
| two features which help with this problem.  Both of them use a common
 | |
| syntax for regular expression extensions, so we'll look at that first.
 | |
| 
 | |
| Perl 5 added several additional features to standard regular
 | |
| expressions, and the Python \module{re} module supports most of them.
 | |
| It would have been difficult to choose new single-keystroke
 | |
| metacharacters or new special sequences beginning with \samp{\e} to
 | |
| represent the new features without making Perl's regular expressions
 | |
| confusingly different from standard REs.  If you chose \samp{\&} as a
 | |
| new metacharacter, for example, old expressions would be assuming that
 | |
| \samp{\&} was a regular character and wouldn't have escaped it by
 | |
| writing \regexp{\e \&} or \regexp{[\&]}.  
 | |
| 
 | |
| The solution chosen by the Perl developers was to use \regexp{(?...)}
 | |
| as the extension syntax.  \samp{?} immediately after a parenthesis was
 | |
| a syntax error because the \samp{?} would have nothing to repeat, so
 | |
| this didn't introduce any compatibility problems.  The characters
 | |
| immediately after the \samp{?}  indicate what extension is being used,
 | |
| so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
 | |
| \regexp{(?:foo)} is something else (a non-capturing group containing
 | |
| the subexpression \regexp{foo}).
 | |
| 
 | |
| Python adds an extension syntax to Perl's extension syntax.  If the
 | |
| first character after the question mark is a \samp{P}, you know that
 | |
| it's an extension that's specific to Python.  Currently there are two
 | |
| such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
 | |
| and \regexp{(?P=\var{name})} is a backreference to a named group.  If
 | |
| future versions of Perl 5 add similar features using a different
 | |
| syntax, the \module{re} module will be changed to support the new
 | |
| syntax, while preserving the Python-specific syntax for
 | |
| compatibility's sake.
 | |
| 
 | |
| Now that we've looked at the general extension syntax, we can return
 | |
| to the features that simplify working with groups in complex REs.
 | |
| Since groups are numbered from left to right and a complex expression
 | |
| may use many groups, it can become difficult to keep track of the
 | |
| correct numbering, and modifying such a complex RE is annoying.
 | |
| Insert a new group near the beginning, and you change the numbers of
 | |
| everything that follows it.
 | |
| 
 | |
| First, sometimes you'll want to use a group to collect a part of a
 | |
| regular expression, but aren't interested in retrieving the group's
 | |
| contents.  You can make this fact explicit by using a non-capturing
 | |
| group: \regexp{(?:...)}, where you can put any other regular
 | |
| expression inside the parentheses.  
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> m = re.match("([abc])+", "abc")
 | |
| >>> m.groups()
 | |
| ('c',)
 | |
| >>> m = re.match("(?:[abc])+", "abc")
 | |
| >>> m.groups()
 | |
| ()
 | |
| \end{verbatim}
 | |
| 
 | |
| Except for the fact that you can't retrieve the contents of what the
 | |
| group matched, a non-capturing group behaves exactly the same as a
 | |
| capturing group; you can put anything inside it, repeat it with a
 | |
| repetition metacharacter such as \samp{*}, and nest it within other
 | |
| groups (capturing or non-capturing).  \regexp{(?:...)} is particularly
 | |
| useful when modifying an existing pattern, since you can add new groups
 | |
| without changing how all the other groups are numbered.  It should be
 | |
| mentioned that there's no performance difference in searching between
 | |
| capturing and non-capturing groups; neither form is any faster than
 | |
| the other.
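To illustrate the numbering point, here is a sketch for Python 3 in which an existing two-group pattern gains an optional prefix.  The patterns and the \samp{mailto:} prefix are hypothetical and chosen only for this example.

```python
import re

original = re.compile(r'(\w+)@(\w+)')

# Assumed change: allow an optional prefix without capturing it.
extended = re.compile(r'(?:mailto:)?(\w+)@(\w+)')

before = original.match('amk@example').groups()
after = extended.match('mailto:amk@example').groups()

# With a capturing group instead, every later group number shifts by one.
shifted = re.match(r'(mailto:)?(\w+)@(\w+)', 'mailto:amk@example').groups()
```

Code that reads groups 1 and 2 keeps working with \code{extended}, but would need to be changed for the capturing variant.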
 | |
| 
 | |
| The second, and more significant, feature is named groups; instead of
 | |
| referring to them by numbers, groups can be referenced by a name.
 | |
| 
 | |
| The syntax for a named group is one of the Python-specific extensions:
 | |
| \regexp{(?P<\var{name}>...)}.  \var{name} is, obviously, the name of
 | |
| the group.  Except for associating a name with a group, named groups
 | |
| also behave identically to capturing groups.  The \class{MatchObject}
 | |
| methods that deal with capturing groups all accept either integers, to
 | |
| refer to groups by number, or a string containing the group name.
 | |
| Named groups are still given numbers, so you can retrieve information
 | |
| about a group in two ways:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'(?P<word>\b\w+\b)')
 | |
| >>> m = p.search( '(((( Lots of punctuation )))' )
 | |
| >>> m.group('word')
 | |
| 'Lots'
 | |
| >>> m.group(1)
 | |
| 'Lots'
 | |
| \end{verbatim}
 | |
| 
 | |
| Named groups are handy because they let you use easily-remembered
 | |
| names, instead of having to remember numbers.  Here's an example RE
 | |
| from the \module{imaplib} module:
 | |
| 
 | |
| \begin{verbatim}
 | |
| InternalDate = re.compile(r'INTERNALDATE "'
 | |
|         r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
 | |
|         r'(?P<year>[0-9][0-9][0-9][0-9])'
 | |
|         r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
 | |
|         r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
 | |
|         r'"')
 | |
| \end{verbatim}
 | |
| 
 | |
| It's obviously much easier to retrieve \code{m.group('zonem')},
 | |
| instead of having to remember to retrieve group 9.
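The pattern above can be tried on a sample string.  In this Python 3 sketch the input is a made-up fragment, not real server output, and \method{groupdict()} is a match-object method not otherwise covered in this section; it returns all named groups as a dictionary.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Invented sample input, for illustration only.
m = InternalDate.search('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

by_name = m.group('zonem')     # readable
by_number = m.group(9)         # same group, but you have to count
fields = m.groupdict()         # maps each group name to its text
```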
 | |
| 
 | |
| Since the syntax for backreferences, in an expression like
 | |
| \regexp{(...)\e 1}, refers to the number of the group, there's
 | |
| naturally a variant that uses the group name instead of the number.
 | |
| This is also a Python extension: \regexp{(?P=\var{name})} indicates
 | |
| that the contents of the group called \var{name} should again be found
 | |
| at the current point.  The regular expression for finding doubled
 | |
| words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
 | |
| \regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
 | |
| >>> p.search('Paris in the the spring').group()
 | |
| 'the the'
 | |
| \end{verbatim}
 | |
| 
 | |
| \subsection{Lookahead Assertions}
 | |
| 
 | |
| Another zero-width assertion is the lookahead assertion.  Lookahead
 | |
| assertions are available in both positive and negative form, and 
 | |
| look like this:
 | |
| 
 | |
| \begin{itemize}
 | |
| \item[\regexp{(?=...)}] Positive lookahead assertion.  This succeeds
 | |
| if the contained regular expression, represented here by \code{...},
 | |
| successfully matches at the current location, and fails otherwise.
 | |
| But, once the contained expression has been tried, the matching engine
 | |
| doesn't advance at all; the rest of the pattern is tried right where
 | |
| the assertion started.
 | |
| 
 | |
| \item[\regexp{(?!...)}] Negative lookahead assertion.  This is the
 | |
| opposite of the positive assertion; it succeeds if the contained expression
 | |
| \emph{doesn't} match at the current position in the string.
 | |
| \end{itemize}
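The zero-width behaviour of both assertions can be seen from the spans of a few matches.  A minimal sketch for Python 3, using \samp{foo} and \samp{bar} as placeholder text:

```python
import re

# Positive lookahead: 'bar' must follow, but it is not consumed.
pos = re.match(r'foo(?=bar)', 'foobar')

# The rest of the pattern resumes right where the assertion started.
resumed = re.match(r'foo(?=bar)\w+', 'foobar')

# Negative lookahead: succeed only if 'bar' does not follow.
neg_fail = re.match(r'foo(?!bar)', 'foobar')
neg_ok = re.match(r'foo(?!bar)', 'foobaz')
```

\code{pos} covers only \samp{foo}, while \code{resumed} goes on to match the whole of \samp{foobar} because \regexp{\e w+} starts at the position the assertion examined.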
 | |
| 
 | |
| An example will help make this concrete by demonstrating a case
 | |
| where a lookahead is useful.  Consider a simple pattern to match a
 | |
| filename and split it apart into a base name and an extension,
 | |
| separated by a \samp{.}.  For example, in \samp{news.rc}, \samp{news}
 | |
| is the base name, and \samp{rc} is the filename's extension.  
 | |
| 
 | |
| The pattern to match this is quite simple: 
 | |
| 
 | |
| \regexp{.*[.].*\$}
 | |
| 
 | |
| Notice that the \samp{.} needs to be treated specially because it's a
 | |
| metacharacter; I've put it inside a character class.  Also notice the
 | |
| trailing \regexp{\$}; this is added to ensure that all the rest of the
 | |
| string must be included in the extension.  This regular expression
 | |
| matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
 | |
| \samp{printers.conf}.
 | |
| 
 | |
| Now, consider complicating the problem a bit; what if you want to
 | |
| match filenames where the extension is not \samp{bat}?
 | |
| Some incorrect attempts:
 | |
| 
 | |
| \verb|.*[.][^b].*$|
 | |
| % $
 | |
| 
 | |
| The first attempt above tries to exclude \samp{bat} by requiring that
 | |
| the first character of the extension is not a \samp{b}.  This is
 | |
| wrong, because the pattern also doesn't match \samp{foo.bar}.
 | |
| 
 | |
| % Messes up the HTML without the curly braces around \^
 | |
| \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
 | |
| 
 | |
| The expression gets messier when you try to patch up the first
 | |
| solution by requiring one of the following cases to match: the first
 | |
| character of the extension isn't \samp{b}; the second character isn't
 | |
| \samp{a}; or the third character isn't \samp{t}.  This accepts
 | |
| \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
 | |
| three-letter extension and won't accept a filename with a two-letter
 | |
| extension such as \samp{sendmail.cf}.  We'll complicate the pattern
 | |
| again in an effort to fix it.
 | |
| 
 | |
| \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
 | |
| 
 | |
| In the third attempt, the second and third letters are both made
 | |
| optional in order to allow matching extensions shorter than three
 | |
| characters, such as \samp{sendmail.cf}.
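The three attempts can be run against the sample filenames used in this section.  This is a sketch for Python 3; the dictionary labels are just names chosen for the example.

```python
import re

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']

attempts = {
    'first':  r'.*[.][^b].*$',
    'second': r'.*[.]([^b]..|.[^a].|..[^t])$',
    'third':  r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
}

# For each attempt, list the filenames it accepts.
accepted = {
    label: [name for name in names if re.match(pattern, name)]
    for label, pattern in attempts.items()
}
```

Besides the failures described in the text, the results show that the second and third attempts also reject \samp{printers.conf}: no alternative can consume more than three characters, so a four-letter extension never reaches the trailing \regexp{\$}.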
 | |
| 
 | |
| The pattern's getting really complicated now, which makes it hard to
 | |
| read and understand.  Worse, if the problem changes and you want to
 | |
| exclude both \samp{bat} and \samp{exe} as extensions, the pattern
 | |
| would get even more complicated and confusing.
 | |
| 
 | |
| A negative lookahead cuts through all this:
 | |
| 
 | |
| \regexp{.*[.](?!bat\$).*\$}
 | |
| % $
 | |
| 
 | |
| The lookahead means: if the expression \regexp{bat\$} doesn't match at
 | |
| this point, try the rest of the pattern; if \regexp{bat\$} does match,
 | |
| the whole pattern will fail.  The trailing \regexp{\$} is required to
 | |
| ensure that something like \samp{sample.batch}, where the extension
 | |
| only starts with \samp{bat}, will be allowed.
 | |
| 
 | |
| Excluding another filename extension is now easy; simply add it as an
 | |
| alternative inside the assertion.  The following pattern excludes
 | |
| filenames that end in either \samp{bat} or \samp{exe}:
 | |
| 
 | |
| \regexp{.*[.](?!bat\$|exe\$).*\$}
 | |
| % $
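A quick way to see the lookahead versions at work is to filter a list of names with each of them.  This sketch is for Python 3, and the filenames are examples only.

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sample.batch',
         'setup.exe', 'sendmail.cf']

# Keep only the names each pattern accepts.
kept_one = [n for n in names if not_bat.match(n)]
kept_two = [n for n in names if not_bat_or_exe.match(n)]
```

\samp{sample.batch} survives both filters because the \regexp{\$} inside the assertion requires the extension to be exactly \samp{bat}.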
 | |
| 
 | |
| 
 | |
| \section{Modifying Strings}
 | |
| 
 | |
| Up to this point, we've simply performed searches against a static
 | |
| string.  Regular expressions are also commonly used to modify a string
 | |
| in various ways, using the following \class{RegexObject} methods:
 | |
| 
 | |
| \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
 | |
|   \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
 | |
|   \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
 | |
|   \lineii{subn()}{Does the same thing as \method{sub()}, 
 | |
|    but returns the new string and the number of replacements}
 | |
| \end{tableii}
 | |
| 
 | |
| 
 | |
| \subsection{Splitting Strings}
 | |
| 
 | |
| The \method{split()} method of a \class{RegexObject} splits a string
 | |
| apart wherever the RE matches, returning a list of the pieces.
 | |
| It's similar to the \method{split()} method of strings but
 | |
| provides much more
 | |
| generality in the delimiters that you can split by;
 | |
| the string method only supports splitting by whitespace or by
 | |
| a fixed string.  As you'd expect, there's a module-level
 | |
| \function{re.split()} function, too.
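The difference in generality is easy to see on a line with inconsistent delimiters.  A small sketch for Python 3, with a made-up input line:

```python
import re

line = 'apples, pears;plums ,  figs'

# The string method takes one fixed delimiter.
fixed = line.split(',')

# A RE can describe a whole family of delimiters, including the
# optional whitespace around them.
flexible = re.split(r'\s*[,;]\s*', line)
```

The string method leaves the semicolon and the stray spaces in its pieces, while the RE version returns the four bare words.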
 | |
| 
 | |
| \begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
 | |
|   Split \var{string} by the matches of the regular expression.  If
 | |
|   capturing parentheses are used in the RE, then their contents will
 | |
|   also be returned as part of the resulting list.  If \var{maxsplit}
 | |
|   is nonzero, at most \var{maxsplit} splits are performed.
 | |
| \end{methoddesc}
 | |
| 
 | |
| You can limit the number of splits made, by passing a value for
 | |
| \var{maxsplit}.  When \var{maxsplit} is nonzero, at most
 | |
| \var{maxsplit} splits will be made, and the remainder of the string is
 | |
| returned as the final element of the list.  In the following example,
 | |
| the delimiter is any sequence of non-alphanumeric characters.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'\W+')
 | |
| >>> p.split('This is a test, short and sweet, of split().')
 | |
| ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
 | |
| >>> p.split('This is a test, short and sweet, of split().', 3)
 | |
| ['This', 'is', 'a', 'test, short and sweet, of split().']
 | |
| \end{verbatim}
 | |
| 
 | |
| Sometimes you're not only interested in what the text between
 | |
| delimiters is, but also need to know what the delimiter was.  If
 | |
| capturing parentheses are used in the RE, then their values are also
 | |
| returned as part of the list.  Compare the following calls:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile(r'\W+')
 | |
| >>> p2 = re.compile(r'(\W+)')
 | |
| >>> p.split('This... is a test.')
 | |
| ['This', 'is', 'a', 'test', '']
 | |
| >>> p2.split('This... is a test.')
 | |
| ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
 | |
| \end{verbatim}
 | |
| 
 | |
| The module-level function \function{re.split()} adds the RE to be
 | |
| used as the first argument, but is otherwise the same.  
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> re.split(r'[\W]+', 'Words, words, words.')
 | |
| ['Words', 'words', 'words', '']
 | |
| >>> re.split(r'([\W]+)', 'Words, words, words.')
 | |
| ['Words', ', ', 'words', ', ', 'words', '.', '']
 | |
| >>> re.split(r'[\W]+', 'Words, words, words.', 1)
 | |
| ['Words', 'words, words.']
 | |
| \end{verbatim}
 | |
| 
 | |
| \subsection{Search and Replace}
 | |
| 
 | |
| Another common task is to find all the matches for a pattern, and
 | |
| replace them with a different string.  The \method{sub()} method takes
 | |
| a replacement value, which can be either a string or a function, and
 | |
| the string to be processed.
 | |
| 
 | |
| \begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
 | |
| Returns the string obtained by replacing the leftmost non-overlapping
 | |
| occurrences of the RE in \var{string} by the replacement
 | |
| \var{replacement}.  If the pattern isn't found, \var{string} is returned
 | |
| unchanged.  
 | |
| 
 | |
| The optional argument \var{count} is the maximum number of pattern
 | |
| occurrences to be replaced; \var{count} must be a non-negative
 | |
| integer.  The default value of 0 means to replace all occurrences.
 | |
| \end{methoddesc}
 | |
| 
 | |
| Here's a simple example of using the \method{sub()} method.  It
 | |
| replaces colour names with the word \samp{colour}:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile( '(blue|white|red)')
 | |
| >>> p.sub( 'colour', 'blue socks and red shoes')
 | |
| 'colour socks and colour shoes'
 | |
| >>> p.sub( 'colour', 'blue socks and red shoes', count=1)
 | |
| 'colour socks and red shoes'
 | |
| \end{verbatim}
 | |
| 
 | |
| The \method{subn()} method does the same work, but returns a 2-tuple
 | |
| containing the new string value and the number of replacements 
 | |
| that were performed:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile( '(blue|white|red)')
 | |
| >>> p.subn( 'colour', 'blue socks and red shoes')
 | |
| ('colour socks and colour shoes', 2)
 | |
| >>> p.subn( 'colour', 'no colours at all')
 | |
| ('no colours at all', 0)
 | |
| \end{verbatim}
 | |
| 
 | |
| Empty matches are replaced only when they're not
 | |
| adjacent to a previous match.  
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('x*')
 | |
| >>> p.sub('-', 'abxd')
 | |
| '-a-b-d-'
 | |
| \end{verbatim}
 | |
| 
 | |
| If \var{replacement} is a string, any backslash escapes in it are
 | |
| processed.  That is, \samp{\e n} is converted to a single newline
 | |
| character, \samp{\e r} is converted to a carriage return, and so forth.
 | |
| Unknown escapes such as \samp{\e j} are left alone.  Backreferences,
 | |
| such as \samp{\e 6}, are replaced with the substring matched by the
 | |
| corresponding group in the RE.  This lets you incorporate
 | |
| portions of the original text in the resulting
 | |
| replacement string.
 | |
| 
 | |
| This example matches the word \samp{section} followed by a string
 | |
| enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
 | |
| \samp{subsection}:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
 | |
| >>> p.sub(r'subsection{\1}','section{First} section{second}')
 | |
| 'subsection{First} subsection{second}'
 | |
| \end{verbatim}
 | |
| 
 | |
| There's also a syntax for referring to named groups as defined by the
 | |
| \regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the
 | |
| substring matched by the group named \samp{name}, and 
 | |
| \samp{\e g<\var{number}>} 
 | |
| uses the corresponding group number.  
 | |
| \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, 
 | |
| but isn't ambiguous in a
 | |
| replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be
 | |
| interpreted as a reference to group 20, not a reference to group 2
 | |
| followed by the literal character \character{0}.)  The following
 | |
| substitutions are all equivalent, but use all three variations of the
 | |
| replacement string.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
 | |
| >>> p.sub(r'subsection{\1}','section{First}')
 | |
| 'subsection{First}'
 | |
| >>> p.sub(r'subsection{\g<1>}','section{First}')
 | |
| 'subsection{First}'
 | |
| >>> p.sub(r'subsection{\g<name>}','section{First}')
 | |
| 'subsection{First}'
 | |
| \end{verbatim}
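The ambiguity that \samp{\e g<2>0} avoids can be demonstrated with a two-group pattern.  This sketch assumes Python 3, where a reference to a group that doesn't exist raises \code{re.error}; the pattern and input are chosen only for the demonstration.

```python
import re

p = re.compile(r'(\w)(\w)')

# \g<2>0 means "group 2, then a literal 0".
unambiguous = p.sub(r'\g<2>0', 'ab')

# \20 is read as group 20, which this pattern does not have.
try:
    p.sub(r'\20', 'ab')
    outcome = 'no error'
except re.error:
    outcome = 'error'
```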
 | |
| 
 | |
| \var{replacement} can also be a function, which gives you even more
 | |
| control.  If \var{replacement} is a function, the function is
 | |
| called for every non-overlapping occurrence of \var{pattern}.  On each
 | |
| call, the function is 
 | |
| passed a \class{MatchObject} argument for the match
 | |
| and can use this information to compute the desired replacement string and return it.
 | |
| 
 | |
| In the following example, the replacement function translates 
 | |
| decimals into hexadecimal:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> def hexrepl( match ):
 | |
| ...     "Return the hex string for a decimal number"
 | |
| ...     value = int( match.group() )
 | |
| ...     return hex(value)
 | |
| ...
 | |
| >>> p = re.compile(r'\d+')
 | |
| >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
 | |
| 'Call 0xffd2 for printing, 0xc000 for user code.'
 | |
| \end{verbatim}
 | |
| 
 | |
| When using the module-level \function{re.sub()} function, the pattern
 | |
| is passed as the first argument.  The pattern may be a string or a
 | |
| \class{RegexObject}; if you need to specify regular expression flags,
 | |
| you must either use a \class{RegexObject} as the first parameter, or use
 | |
| embedded modifiers in the pattern, e.g.  \code{sub("(?i)b+", "x", "bbbb
 | |
| BBBB")} returns \code{'x x'}.
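Both options can be written out as follows.  The sketch is for Python 3; the third call relies on the \code{flags} keyword argument that later Python versions added to \function{re.sub()}, which postdates the advice in the paragraph above.

```python
import re

text = 'bbbb BBBB'

# Option 1: an embedded modifier inside the pattern string.
embedded = re.sub('(?i)b+', 'x', text)

# Option 2: a compiled pattern object that carries the flag.
compiled = re.compile('b+', re.IGNORECASE)
via_object = re.sub(compiled, 'x', text)

# In later Python versions, re.sub() also accepts a flags keyword.
via_keyword = re.sub('b+', 'x', text, flags=re.IGNORECASE)
```

All three calls return \code{'x x'}.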
 | |
| 
 | |
| \section{Common Problems}
 | |
| 
 | |
| Regular expressions are a powerful tool for some applications, but in
 | |
| some ways their behaviour isn't intuitive and at times they don't
 | |
| behave the way you may expect them to.  This section will point out
 | |
| some of the most common pitfalls.
 | |
| 
 | |
| \subsection{Use String Methods}
 | |
| 
 | |
| Sometimes using the \module{re} module is a mistake.  If you're
 | |
| matching a fixed string, or a single character class, and you're not
 | |
| using any \module{re} features such as the \constant{IGNORECASE} flag,
 | |
| then the full power of regular expressions may not be required.
 | |
| Strings have several methods for performing operations with fixed
 | |
| strings and they're usually much faster, because the implementation is
 | |
| a single small C loop that's been optimized for the purpose, instead
 | |
| of the large, more generalized regular expression engine.
 | |
| 
 | |
| One example might be replacing a single fixed string with another
 | |
| one; for example, you might replace \samp{word}
 | |
| with \samp{deed}.  \code{re.sub()} seems like the function to use for
 | |
| this, but consider the \method{replace()} method.  Note that 
 | |
| \method{replace()} will also replace \samp{word} inside
 | |
| words, turning \samp{swordfish} into \samp{sdeedfish}, but the 
 | |
| na{\"\i}ve RE \regexp{word} would have done that, too.  (To avoid performing
 | |
| the substitution on parts of words, the pattern would have to be
 | |
| \regexp{\e bword\e b}, in order to require that \samp{word} have a
 | |
| word boundary on either side.  This takes the job beyond 
 | |
| \method{replace()}'s abilities.)
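The three approaches just described can be compared on one string.  A minimal sketch for Python 3, with an invented sample sentence:

```python
import re

text = 'swordfish: a word about swords'

# Fixed-string replacement with the string method.
plain = text.replace('word', 'deed')

# The naive RE behaves the same way, only with more machinery.
naive = re.sub('word', 'deed', text)

# Word boundaries restrict the change to the standalone word.
bounded = re.sub(r'\bword\b', 'deed', text)
```

Only the last call leaves \samp{swordfish} and \samp{swords} untouched, which is the point at which a regular expression is worth its cost.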
 | |
| 
 | |
| Another common task is deleting every occurrence of a single character
 | |
| from a string or replacing it with another single character.  You
 | |
| might do this with something like \code{re.sub('\e n', ' ', S)}, but
 | |
| \method{translate()} is capable of doing both tasks
 | |
| and will be faster than any regular expression operation can be.
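For reference, here is how both tasks might look with \method{translate()}.  The sketch assumes Python 3, where the translation table is built with \code{str.maketrans()}; older versions of Python built the table differently.

```python
text = 'one\ntwo\nthree'

# Replace each newline with a space: a one-to-one character mapping.
spaced = text.translate(str.maketrans('\n', ' '))

# Delete each newline: the third argument lists characters to drop.
joined = text.translate(str.maketrans('', '', '\n'))
```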
 | |
| 
 | |
| In short, before turning to the \module{re} module, consider whether
 | |
| your problem can be solved with a faster and simpler string method.
 | |
| 
 | |
| \subsection{match() versus search()}
 | |
| 
 | |
| The \function{match()} function only checks if the RE matches at
 | |
| the beginning of the string while \function{search()} will scan
 | |
| forward through the string for a match.
 | |
| It's important to keep this distinction in mind.  Remember, 
 | |
| \function{match()} will only report a successful match which
 | |
| starts at position 0; if the match wouldn't start at position 0,
 | |
| \function{match()} will \emph{not} report it.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.match('super', 'superstition').span()  
 | |
| (0, 5)
 | |
| >>> print re.match('super', 'insuperable')    
 | |
| None
 | |
| \end{verbatim}
 | |
| 
 | |
| On the other hand, \function{search()} will scan forward through the
 | |
| string, reporting the first match it finds.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.search('super', 'superstition').span()
 | |
| (0, 5)
 | |
| >>> print re.search('super', 'insuperable').span()
 | |
| (2, 7)
 | |
| \end{verbatim}
 | |
| 
 | |
| Sometimes you'll be tempted to keep using \function{re.match()}, and
 | |
| just add \regexp{.*} to the front of your RE.  Resist this temptation
 | |
| and use \function{re.search()} instead.  The regular expression
 | |
| compiler does some analysis of REs in order to speed up the process of
 | |
| looking for a match.  One such analysis figures out what the first
 | |
| character of a match must be; for example, a pattern starting with
 | |
| \regexp{Crow} must match starting with a \character{C}.  The analysis
 | |
| lets the engine quickly scan through the string looking for the
 | |
| starting character, only trying the full match if a \character{C} is found.
 | |
| 
 | |
| Adding \regexp{.*} defeats this optimization, requiring scanning to
 | |
| the end of the string and then backtracking to find a match for the
 | |
| rest of the RE.  Use \function{re.search()} instead.
 | |
| 
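Apart from the cost, the two approaches also report different spans.
The following sketch, which assumes a current Python 3 interpreter and
uses made-up variable names, shows that the \regexp{.*} version
swallows the text in front of the part you were actually looking for:

\begin{verbatim}
import re

# Both calls succeed, but the '.*' version reports a match that
# starts at 0 and includes the text before 'super'.
padded = re.match('.*super', 'insuperable')
print(padded.span())      # (0, 7)

found = re.search('super', 'insuperable')
print(found.span())       # (2, 7)
\end{verbatim}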
 | |
| \subsection{Greedy versus Non-Greedy}
 | |
| 
 | |
| When repeating a regular expression, as in \regexp{a*}, the resulting
 | |
| action is to consume as much of the string as possible.  This
 | |
| fact often bites you when you're trying to match a pair of
 | |
| balanced delimiters, such as the angle brackets surrounding an HTML
 | |
| tag.  The na{\"\i}ve pattern for matching a single HTML tag doesn't
 | |
| work because of the greedy nature of \regexp{.*}.
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> s = '<html><head><title>Title</title>'
 | |
| >>> len(s)
 | |
| 32
 | |
| >>> print re.match('<.*>', s).span()
 | |
| (0, 32)
 | |
| >>> print re.match('<.*>', s).group()
 | |
| <html><head><title>Title</title>
 | |
| \end{verbatim}
 | |
| 
 | |
| The RE matches the \character{<} in \samp{<html>}, and the
 | |
| \regexp{.*} consumes the rest of the string.  There's still more left
 | |
| in the RE, though, and the \regexp{>} can't match at the end of
 | |
| the string, so the regular expression engine has to backtrack
 | |
| character by character until it finds a match for the \regexp{>}.  
 | |
| The final match extends from the \character{<} in \samp{<html>}
 | |
| to the \character{>} in \samp{</title>}, which isn't what you want.
 | |
| 
 | |
| In this case, the solution is to use the non-greedy qualifiers
 | |
| \regexp{*?}, \regexp{+?}, \regexp{??}, or
 | |
| \regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
 | |
| possible.  In the above example, the \character{>} is tried
 | |
| immediately after the first \character{<} matches, and when it fails,
 | |
| the engine advances a character at a time, retrying the \character{>}
 | |
| at every step.  This produces just the right result:
 | |
| 
 | |
| \begin{verbatim}
 | |
| >>> print re.match('<.*?>', s).group()
 | |
| <html>
 | |
| \end{verbatim}
 | |
| 
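The same contrast shows up when collecting every match with
\function{re.findall()}.  This is an illustrative sketch for a
current Python 3 interpreter, reusing the string \code{s} from the
example above:

\begin{verbatim}
import re

s = '<html><head><title>Title</title>'

# Greedy: a single match that runs to the last '>'.
greedy = re.findall('<.*>', s)
print(greedy)     # ['<html><head><title>Title</title>']

# Non-greedy: each tag is matched separately.
lazy = re.findall('<.*?>', s)
print(lazy)       # ['<html>', '<head>', '<title>', '</title>']
\end{verbatim}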
 | |
| (Note that parsing HTML or XML with regular expressions is painful.
 | |
| Quick-and-dirty patterns will handle common cases, but HTML and XML
 | |
| have special cases that will break the obvious regular expression; by
 | |
| the time you've written a regular expression that handles all of the
 | |
| possible cases, the patterns will be \emph{very} complicated.  Use an
 | |
| HTML or XML parser module for such tasks.)
 | |
| 
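As a rough sketch of the parser-based alternative, assuming Python 3
and its standard \module{html.parser} module, the start tags in the
same string can be collected without any regular expression.  The
\class{TagCollector} class is a hypothetical helper written for this
illustration, not something the library provides:

\begin{verbatim}
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Hypothetical helper that records the names of start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for every opening tag it sees.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
print(collector.tags)     # ['html', 'head', 'title']
\end{verbatim}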
 | |
| \subsection{Not Using re.VERBOSE}
 | |
| 
 | |
| By now you've probably noticed that regular expressions are a very
 | |
| compact notation, but they're not terribly readable.  REs of
 | |
| moderate complexity can become lengthy collections of backslashes,
 | |
| parentheses, and metacharacters, making them difficult to read and
 | |
| understand.  
 | |
| 
 | |
| For such REs, specifying the \code{re.VERBOSE} flag when
 | |
| compiling the regular expression can be helpful, because it allows
 | |
| you to format the regular expression more clearly.
 | |
| 
 | |
| The \code{re.VERBOSE} flag has several effects.  Whitespace in the
 | |
| regular expression that \emph{isn't} inside a character class is
 | |
| ignored.  This means that an expression such as \regexp{dog | cat} is
 | |
| equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
 | |
| will still match the characters \character{a}, \character{b}, or a
 | |
| space.  In addition, you can also put comments inside a RE; comments
 | |
| extend from a \samp{\#} character to the next newline.  When used with
 | |
| triple-quoted strings, this enables REs to be formatted more neatly:
 | |
| 
 | |
| \begin{verbatim}
 | |
| pat = re.compile(r"""
 | |
|  \s*                 # Skip leading whitespace
 | |
|  (?P<header>[^:]+)   # Header name
 | |
|  \s* :               # Whitespace, and a colon
 | |
|  (?P<value>.*?)      # The header's value -- *? used to
 | |
|                      # lose the following trailing whitespace
 | |
|  \s*$                # Trailing whitespace to end-of-line
 | |
| """, re.VERBOSE)
 | |
| \end{verbatim}
 | |
| % $
 | |
| 
 | |
| This is far more readable than:
 | |
| 
 | |
| \begin{verbatim}
 | |
| pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
 | |
| \end{verbatim}
 | |
| % $
 | |
| 
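To check that the two spellings really are the same pattern, both can
be run against one input.  This is a minimal sketch for a current
Python 3 interpreter; the names \code{verbose_pat} and
\code{compact_pat} and the sample header line are made up for the
illustration:

\begin{verbatim}
import re

verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = "  Subject:hello world   "
for pat in (verbose_pat, compact_pat):
    m = pat.match(line)
    print(m.group('header'))    # Subject
    print(m.group('value'))     # hello world
\end{verbatim}

Both compiled patterns produce the same groups; only the source
formatting differs.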
 | |
| \section{Feedback}
 | |
| 
 | |
| Regular expressions are a complicated topic.  Did this document help
 | |
| you understand them?  Were there parts that were unclear, or problems
 | |
| you encountered that weren't covered here?  If so, please send
 | |
| suggestions for improvements to the author.
 | |
| 
 | |
| The most complete book on regular expressions is almost certainly
 | |
| Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
 | |
| by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and
 | |
| Java's flavours of regular expressions, and doesn't contain any Python
 | |
| material at all, so it won't be useful as a reference for programming
 | |
| in Python.  (The first edition covered Python's now-obsolete
 | |
| \module{regex} module, which won't help you much.)  Consider checking
 | |
| it out from your library.
 | |
| 
 | |
| \end{document}
 | |
| 
 | 
