|   | \documentclass{howto} | ||
|  | 
 | ||
|  | % TODO:
 | ||
|  | % Document lookbehind assertions
 | ||
|  | % Better way of displaying a RE, a string, and what it matches
 | ||
|  | % Mention optional argument to match.groups()
 | ||
|  | % Unicode (at least a reference)
 | ||
|  | 
 | ||
|  | \title{Regular Expression HOWTO} | ||
|  | 
 | ||
|  | \release{0.05} | ||
|  | 
 | ||
|  | \author{A.M. Kuchling} | ||
|  | \authoraddress{\email{amk@amk.ca}} | ||
|  | 
 | ||
|  | \begin{document} | ||
|  | \maketitle | ||
|  | 
 | ||
|  | \begin{abstract} | ||
|  | \noindent | ||
|  | This document is an introductory tutorial to using regular expressions | ||
|  | in Python with the \module{re} module.  It provides a gentler | ||
|  | introduction than the corresponding section in the Library Reference. | ||
|  | 
 | ||
|  | This document is available from  | ||
|  | \url{http://www.amk.ca/python/howto}. | ||
|  | 
 | ||
|  | \end{abstract} | ||
|  | 
 | ||
|  | \tableofcontents | ||
|  | 
 | ||
|  | \section{Introduction} | ||
|  | 
 | ||
|  | The \module{re} module was added in Python 1.5, and provides | ||
|  | Perl-style regular expression patterns.  Earlier versions of Python | ||
|  | came with the \module{regex} module, which provides Emacs-style | ||
|  | patterns.  Emacs-style patterns are slightly less readable and | ||
|  | don't provide as many features, so there's not much reason to use | ||
|  | the \module{regex} module when writing new code, though you might | ||
|  | encounter old code that uses it. | ||
|  | 
 | ||
|  | Regular expressions (or REs) are essentially a tiny, highly | ||
|  | specialized programming language embedded inside Python and made | ||
|  | available through the \module{re} module.  Using this little language, | ||
|  | you specify the rules for the set of possible strings that you want to | ||
|  | match; this set might contain English sentences, or e-mail addresses, | ||
|  | or TeX commands, or anything you like.  You can then ask questions | ||
|  | such as ``Does this string match the pattern?'', or ``Is there a match | ||
|  | for the pattern anywhere in this string?''.  You can also use REs to | ||
|  | modify a string or to split it apart in various ways. | ||
|  | 
 | ||
|  | Regular expression patterns are compiled into a series of bytecodes | ||
|  | which are then executed by a matching engine written in C.  For | ||
|  | advanced use, it may be necessary to pay careful attention to how the | ||
|  | engine will execute a given RE, and write the RE in a certain way in | ||
|  | order to produce bytecode that runs faster.  Optimization isn't | ||
|  | covered in this document, because it requires that you have a good | ||
|  | understanding of the matching engine's internals. | ||
|  | 
 | ||
|  | The regular expression language is relatively small and restricted, so | ||
|  | not all possible string processing tasks can be done using regular | ||
|  | expressions.  There are also tasks that \emph{can} be done with | ||
|  | regular expressions, but the expressions turn out to be very | ||
|  | complicated.  In these cases, you may be better off writing Python | ||
|  | code to do the processing; while Python code will be slower than an | ||
|  | elaborate regular expression, it will also probably be more understandable. | ||
|  | 
 | ||
|  | \section{Simple Patterns} | ||
|  | 
 | ||
|  | We'll start by learning about the simplest possible regular | ||
|  | expressions.  Since regular expressions are used to operate on | ||
|  | strings, we'll begin with the most common task: matching characters. | ||
|  | 
 | ||
|  | For a detailed explanation of the computer science underlying regular | ||
|  | expressions (deterministic and non-deterministic finite automata), you | ||
|  | can refer to almost any textbook on writing compilers. | ||
|  | 
 | ||
|  | \subsection{Matching Characters} | ||
|  | 
 | ||
|  | Most letters and characters will simply match themselves.  For | ||
|  | example, the regular expression \regexp{test} will match the string | ||
|  | \samp{test} exactly.  (You can enable a case-insensitive mode that | ||
|  | would let this RE match \samp{Test} or \samp{TEST} as well; more | ||
|  | about this later.)   | ||
|  | 
 | ||
There are exceptions to this rule; some characters are special
\dfn{metacharacters}, and don't match themselves.  Instead, they signal
that some out-of-the-ordinary thing should be matched, or they affect
other portions of the RE by repeating them.  Much of this document is
devoted to discussing various metacharacters and what they do.
|  | 
 | ||
|  | Here's a complete list of the metacharacters; their meanings will be | ||
|  | discussed in the rest of this HOWTO. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
. ^ $ * + ? { } [ ] \ | ( )
|  | \end{verbatim} | ||
|  | % $
 | ||
|  | 
 | ||
|  | The first metacharacters we'll look at are \samp{[} and \samp{]}. | ||
|  | They're used for specifying a character class, which is a set of | ||
|  | characters that you wish to match.  Characters can be listed | ||
|  | individually, or a range of characters can be indicated by giving two | ||
|  | characters and separating them by a \character{-}.  For example, | ||
|  | \regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or | ||
|  | \samp{c}; this is the same as | ||
|  | \regexp{[a-c]}, which uses a range to express the same set of | ||
|  | characters.  If you wanted to match only lowercase letters, your | ||
|  | RE would be \regexp{[a-z]}. | ||
|  | 
 | ||
|  | Metacharacters are not active inside classes.  For example, | ||
|  | \regexp{[akm\$]} will match any of the characters \character{a}, | ||
|  | \character{k}, \character{m}, or \character{\$}; \character{\$} is | ||
|  | usually a metacharacter, but inside a character class it's stripped of | ||
|  | its special nature. | ||
|  | 
 | ||
|  | You can match the characters not within a range by \dfn{complementing} | ||
|  | the set.  This is indicated by including a \character{\^} as the first | ||
|  | character of the class; \character{\^} elsewhere will simply match the | ||
|  | \character{\^} character.  For example, \verb|[^5]| will match any | ||
|  | character except \character{5}. | ||
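
As a quick check of the character-class rules above, here is a minimal
sketch.  It assumes Python 3 syntax and uses \function{re.findall()},
which is introduced later in this HOWTO, simply to list every character
that a class matches; the sample strings are invented for illustration.

```python
import re

text = "a1b2c3$k5"

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", text)
ranged = re.findall(r"[a-c]", text)

# Inside a class, $ is an ordinary character.
dollar_class = re.findall(r"[akm$]", text)

# A leading ^ complements the set: any character except 5.
not_five = re.findall(r"[^5]", "1525")

print(listed, ranged, dollar_class, not_five)
```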
|  | 
 | ||
|  | Perhaps the most important metacharacter is the backslash, \samp{\e}.   | ||
|  | As in Python string literals, the backslash can be followed by various | ||
|  | characters to signal various special sequences.  It's also used to escape | ||
|  | all the metacharacters so you can still match them in patterns; for | ||
|  | example, if you need to match a \samp{[} or  | ||
|  | \samp{\e}, you can precede them with a backslash to remove their | ||
|  | special meaning: \regexp{\e[} or \regexp{\e\e}. | ||
|  | 
 | ||
|  | Some of the special sequences beginning with \character{\e} represent | ||
|  | predefined sets of characters that are often useful, such as the set | ||
|  | of digits, the set of letters, or the set of anything that isn't | ||
|  | whitespace.  The following predefined special sequences are available: | ||
|  | 
 | ||
|  | \begin{itemize} | ||
|  | \item[\code{\e d}]Matches any decimal digit; this is | ||
|  | equivalent to the class \regexp{[0-9]}. | ||
|  | 
 | ||
|  | \item[\code{\e D}]Matches any non-digit character; this is | ||
|  | equivalent to the class \verb|[^0-9]|. | ||
|  | 
 | ||
|  | \item[\code{\e s}]Matches any whitespace character; this is | ||
|  | equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. | ||
|  | 
 | ||
|  | \item[\code{\e S}]Matches any non-whitespace character; this is | ||
|  | equivalent to the class \verb|[^ \t\n\r\f\v]|. | ||
|  | 
 | ||
|  | \item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class | ||
|  | \regexp{[a-zA-Z0-9_]}.   | ||
|  | 
 | ||
|  | \item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class | ||
|  | \verb|[^a-zA-Z0-9_]|.    | ||
|  | \end{itemize} | ||
|  | 
 | ||
|  | These sequences can be included inside a character class.  For | ||
|  | example, \regexp{[\e s,.]} is a character class that will match any | ||
|  | whitespace character, or \character{,} or \character{.}. | ||
|  | 
 | ||
|  | The final metacharacter in this section is \regexp{.}.  It matches | ||
|  | anything except a newline character, and there's an alternate mode | ||
|  | (\code{re.DOTALL}) where it will match even a newline.  \character{.} | ||
|  | is often used where you want to match ``any character''.   | ||
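
The predefined sequences and \character{.} can be tried out in the same
way.  This is only a sketch, written for Python 3 with an invented
sample string.  One caveat: in Python 3, \regexp{\e d}, \regexp{\e s}
and \regexp{\e w} also match non-ASCII characters in text patterns
unless \code{re.ASCII} is given, so the class equivalences listed above
hold exactly only for ASCII input such as this sample.

```python
import re

sample = "Room 42,\tfloor 3."

digits = re.findall(r"\d", sample)          # like [0-9] for ASCII text
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscore
separators = re.findall(r"[\s,.]", sample)  # whitespace, comma, or period

# "." stops at a newline unless re.DOTALL is given.
default_dot = re.match(r".*", "one\ntwo").group()
dotall_dot = re.match(r".*", "one\ntwo", re.DOTALL).group()

print(digits, words, separators)
print(repr(default_dot), repr(dotall_dot))
```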
|  | 
 | ||
|  | \subsection{Repeating Things} | ||
|  | 
 | ||
|  | Being able to match varying sets of characters is the first thing | ||
|  | regular expressions can do that isn't already possible with the | ||
|  | methods available on strings.  However, if that was the only | ||
|  | additional capability of regexes, they wouldn't be much of an advance. | ||
|  | Another capability is that you can specify that portions of the RE | ||
|  | must be repeated a certain number of times. | ||
|  | 
 | ||
|  | The first metacharacter for repeating things that we'll look at is | ||
|  | \regexp{*}.  \regexp{*} doesn't match the literal character \samp{*}; | ||
|  | instead, it specifies that the previous character can be matched zero | ||
|  | or more times, instead of exactly once. | ||
|  | 
 | ||
|  | For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} | ||
|  | characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} | ||
|  | characters), and so forth.  The RE engine has various internal | ||
|  | limitations stemming from the size of C's \code{int} type, that will | ||
|  | prevent it from matching over 2 billion \samp{a} characters; you | ||
|  | probably don't have enough memory to construct a string that large, so | ||
|  | you shouldn't run into that limit. | ||
|  | 
 | ||
|  | Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, | ||
|  | the matching engine will try to repeat it as many times as possible. | ||
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.
|  | 
 | ||
|  | A step-by-step example will make this more obvious.  Let's consider | ||
|  | the expression \regexp{a[bcd]*b}.  This matches the letter | ||
|  | \character{a}, zero or more letters from the class \code{[bcd]}, and | ||
|  | finally ends with a \character{b}.  Now imagine matching this RE | ||
|  | against the string \samp{abcbd}.   | ||
|  | 
 | ||
|  | \begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation} | ||
|  | \lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.} | ||
|  | \lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as | ||
|  | it can, which is to the end of the string.} | ||
|  | \lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the | ||
|  | current position is at the end of the string, so it fails.} | ||
|  | \lineiii{4}{\code{abcb}}{Back up, so that  \regexp{[bcd]*} matches | ||
|  | one less character.} | ||
|  | \lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the | ||
|  | current position is at the last character, which is a \character{d}.} | ||
|  | \lineiii{6}{\code{abc}}{Back up again, so that  \regexp{[bcd]*} is | ||
|  | only matching \samp{bc}.} | ||
\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time
the character at the current position is \character{b}, so it succeeds.}
|  | \end{tableiii} | ||
|  | 
 | ||
|  | The end of the RE has now been reached, and it has matched | ||
|  | \samp{abcb}.  This demonstrates how the matching engine goes as far as | ||
|  | it can at first, and if no match is found it will then progressively | ||
|  | back up and retry the rest of the RE again and again.  It will back up | ||
|  | until it has tried zero matches for \regexp{[bcd]*}, and if that | ||
|  | subsequently fails, the engine will conclude that the string doesn't | ||
|  | match the RE at all. | ||
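
You can confirm the outcome of the walkthrough above in the
interpreter.  The sketch below assumes Python 3; the second test
string, \samp{acd}, is an invented example of a case where the engine
runs out of positions to back up to.

```python
import re

pattern = re.compile(r"a[bcd]*b")

# The greedy [bcd]* first swallows 'bcbd', then gives characters back
# until the final 'b' of the RE can match.
m = pattern.match("abcbd")
matched = m.group()   # 'abcb'
span = m.span()       # (0, 4)

# No 'b' is left for the end of the RE, so every retry fails.
no_match = pattern.match("acd")

print(matched, span, no_match)
```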
|  | 
 | ||
|  | Another repeating metacharacter is \regexp{+}, which matches one or | ||
|  | more times.  Pay careful attention to the difference between | ||
|  | \regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more | ||
|  | times, so whatever's being repeated may not be present at all, while | ||
|  | \regexp{+} requires at least \emph{one} occurrence.  To use a similar | ||
|  | example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}), | ||
|  | \samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}. | ||
|  | 
 | ||
|  | There are two more repeating qualifiers.  The question mark character, | ||
|  | \regexp{?}, matches either once or zero times; you can think of it as | ||
|  | marking something as being optional.  For example, \regexp{home-?brew} | ||
|  | matches either \samp{homebrew} or \samp{home-brew}.   | ||
|  | 
 | ||
The most complicated repeating qualifier is
|  | \regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal | ||
|  | integers.  This qualifier means there must be at least \var{m} | ||
|  | repetitions, and at most \var{n}.  For example, \regexp{a/\{1,3\}b} | ||
|  | will match \samp{a/b}, \samp{a//b}, and \samp{a///b}.  It won't match | ||
|  | \samp{ab}, which has no slashes, or \samp{a////b}, which has four. | ||
|  | 
 | ||
|  | You can omit either \var{m} or \var{n}; in that case, a reasonable | ||
|  | value is assumed for the missing value.  Omitting \var{m} is | ||
|  | interpreted as a lower limit of 0, while omitting \var{n} results in  an | ||
|  | upper bound of infinity --- actually, the 2 billion limit mentioned | ||
|  | earlier, but that might as well be infinity.   | ||
|  | 
 | ||
|  | Readers of a reductionist bent may notice that the three other qualifiers | ||
|  | can all be expressed using this notation.  \regexp{\{0,\}} is the same | ||
|  | as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and | ||
|  | \regexp{\{0,1\}} is the same as \regexp{?}.  It's better to use | ||
|  | \regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because | ||
|  | they're shorter and easier to read. | ||
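
The four qualifiers can be compared side by side.  This sketch assumes
Python 3.4 or later, because it relies on \method{fullmatch()} so that
a candidate only counts when the whole string matches; the helper name
\code{full\_matches} is hypothetical and exists only for this example.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    compiled = re.compile(pattern)
    return [s for s in candidates if compiled.fullmatch(s)]

cats = ["ct", "cat", "caaat"]
star = full_matches(r"ca*t", cats)        # zero or more 'a'
plus = full_matches(r"ca+t", cats)        # one or more 'a'
braces_plus = full_matches(r"ca{1,}t", cats)  # {1,} behaves like +

optional = full_matches(r"home-?brew",
                        ["homebrew", "home-brew", "home--brew"])
bounded = full_matches(r"a/{1,3}b",
                       ["ab", "a/b", "a//b", "a///b", "a////b"])

print(star, plus, optional, bounded)
```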
|  | 
 | ||
|  | \section{Using Regular Expressions} | ||
|  | 
 | ||
|  | Now that we've looked at some simple regular expressions, how do we | ||
|  | actually use them in Python?  The \module{re} module provides an | ||
|  | interface to the regular expression engine, allowing you to compile | ||
|  | REs into objects and then perform matches with them. | ||
|  | 
 | ||
|  | \subsection{Compiling Regular Expressions} | ||
|  | 
 | ||
|  | Regular expressions are compiled into \class{RegexObject} instances, | ||
|  | which have methods for various operations such as searching for | ||
|  | pattern matches or performing string substitutions. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> import re | ||
|  | >>> p = re.compile('ab*') | ||
|  | >>> print p | ||
|  | <re.RegexObject instance at 80b4150> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | \function{re.compile()} also accepts an optional \var{flags} | ||
|  | argument, used to enable various special features and syntax | ||
|  | variations.  We'll go over the available settings later, but for now a | ||
|  | single example will do: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('ab*', re.IGNORECASE) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | The RE is passed to \function{re.compile()} as a string.  REs are | ||
|  | handled as strings because regular expressions aren't part of the core | ||
|  | Python language, and no special syntax was created for expressing | ||
|  | them.  (There are applications that don't need REs at all, so there's | ||
|  | no need to bloat the language specification by including them.) | ||
|  | Instead, the \module{re} module is simply a C extension module | ||
|  | included with Python, just like the \module{socket} or \module{zlib} | ||
|  | module. | ||
|  | 
 | ||
|  | Putting REs in strings keeps the Python language simpler, but has one | ||
|  | disadvantage which is the topic of the next section. | ||
|  | 
 | ||
|  | \subsection{The Backslash Plague} | ||
|  | 
 | ||
|  | As stated earlier, regular expressions use the backslash | ||
|  | character (\character{\e}) to indicate special forms or to allow | ||
|  | special characters to be used without invoking their special meaning. | ||
|  | This conflicts with Python's usage of the same character for the same | ||
|  | purpose in string literals. | ||
|  | 
 | ||
|  | Let's say you want to write a RE that matches the string | ||
|  | \samp{{\e}section}, which might be found in a \LaTeX\ file.  To figure | ||
|  | out what to write in the program code, start with the desired string | ||
|  | to be matched.  Next, you must escape any backslashes and other | ||
|  | metacharacters by preceding them with a backslash, resulting in the | ||
string \samp{\e\e section}.  The string passed
to \function{re.compile()} must therefore be \verb|\\section|.  However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.
|  | 
 | ||
|  | \begin{tableii}{c|l}{code}{Characters}{Stage} | ||
|  |   \lineii{\e section}{Text string to be matched} | ||
|  |   \lineii{\e\e section}{Escaped backslash for \function{re.compile}} | ||
|  |   \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} | ||
|  | \end{tableii} | ||
|  | 
 | ||
|  | In short, to match a literal backslash, one has to write | ||
|  | \code{'\e\e\e\e'} as the RE string, because the regular expression | ||
|  | must be \samp{\e\e}, and each backslash must be expressed as | ||
|  | \samp{\e\e} inside a regular Python string literal.  In REs that | ||
|  | feature backslashes repeatedly, this leads to lots of repeated | ||
|  | backslashes and makes the resulting strings difficult to understand. | ||
|  | 
 | ||
|  | The solution is to use Python's raw string notation for regular | ||
|  | expressions; backslashes are not handled in any special way in | ||
|  | a string literal prefixed with \character{r}, so \code{r"\e n"} is a | ||
|  | two-character string containing \character{\e} and \character{n}, | ||
|  | while \code{"\e n"} is a one-character string containing a newline. | ||
|  | Frequently regular expressions will be expressed in Python | ||
|  | code using this raw string notation.   | ||
|  | 
 | ||
|  | \begin{tableii}{c|c}{code}{Regular String}{Raw string} | ||
|  |   \lineii{"ab*"}{\code{r"ab*"}} | ||
|  |   \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} | ||
|  |   \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} | ||
|  | \end{tableii} | ||
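
The equivalence between the two columns is easy to verify.  The sketch
below assumes Python 3; the sample text being searched is invented.

```python
import re

regular = "\\\\section"   # four backslashes in the source, two in the string
raw = r"\\section"        # raw literal holding the same two backslashes

same_pattern = (regular == raw)

# Both spellings compile to an RE that matches a single literal backslash
# followed by 'section'.
found = re.search(raw, r"see \section{Intro}")

newline_len = len("\n")   # one character: a newline
raw_len = len(r"\n")      # two characters: a backslash and 'n'

print(same_pattern, found.group(), newline_len, raw_len)
```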
|  | 
 | ||
|  | \subsection{Performing Matches} | ||
|  | 
 | ||
|  | Once you have an object representing a compiled regular expression, | ||
|  | what do you do with it?  \class{RegexObject} instances have several | ||
|  | methods and attributes.  Only the most significant ones will be | ||
|  | covered here; consult \ulink{the Library | ||
|  | Reference}{http://www.python.org/doc/lib/module-re.html} for a | ||
|  | complete listing. | ||
|  | 
 | ||
|  | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} | ||
|  |   \lineii{match()}{Determine if the RE matches at the beginning of | ||
|  |   the string.} | ||
|  |   \lineii{search()}{Scan through a string, looking for any location | ||
|  |   where this RE matches.} | ||
  \lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
  \lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
|  | \end{tableii} | ||
|  | 
 | ||
|  | \method{match()} and \method{search()} return \code{None} if no match | ||
|  | can be found.  If they're successful, a \code{MatchObject} instance is | ||
|  | returned, containing information about the match: where it starts and | ||
|  | ends, the substring it matched, and more. | ||
|  | 
 | ||
|  | You can learn about this by interactively experimenting with the | ||
|  | \module{re} module.  If you have Tkinter available, you may also want | ||
|  | to look at \file{Tools/scripts/redemo.py}, a demonstration program | ||
|  | included with the Python distribution.  It allows you to enter REs and | ||
|  | strings, and displays whether the RE matches or fails. | ||
|  | \file{redemo.py} can be quite useful when trying to debug a | ||
|  | complicated RE.  Phil Schwartz's | ||
|  | \ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive | ||
|  | tool for developing and testing RE patterns.  This HOWTO will use the | ||
|  | standard Python interpreter for its examples. | ||
|  | 
 | ||
|  | First, run the Python interpreter, import the \module{re} module, and | ||
|  | compile a RE: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | Python 2.2.2 (#1, Feb 10 2003, 12:57:01) | ||
|  | >>> import re | ||
|  | >>> p = re.compile('[a-z]+') | ||
|  | >>> p | ||
|  | <_sre.SRE_Pattern object at 80c3c28> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Now, you can try matching various strings against the RE | ||
|  | \regexp{[a-z]+}.  An empty string shouldn't match at all, since | ||
\regexp{+} means ``one or more repetitions''.  \method{match()} should
|  | return \code{None} in this case, which will cause the interpreter to | ||
|  | print no output.  You can explicitly print the result of | ||
|  | \method{match()} to make this clear. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p.match("") | ||
|  | >>> print p.match("") | ||
|  | None | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Now, let's try it on a string that it should match, such as | ||
|  | \samp{tempo}.  In this case, \method{match()} will return a | ||
|  | \class{MatchObject}, so you should store the result in a variable for | ||
|  | later use. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
>>> m = p.match('tempo')
|  | >>> print m | ||
|  | <_sre.SRE_Match object at 80c4f68> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Now you can query the \class{MatchObject} for information about the | ||
|  | matching string.   \class{MatchObject} instances also have several | ||
|  | methods and attributes; the most important ones are: | ||
|  | 
 | ||
|  | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} | ||
|  |   \lineii{group()}{Return the string matched by the RE} | ||
|  |   \lineii{start()}{Return the starting position of the match} | ||
|  |   \lineii{end()}{Return the ending position of the match} | ||
|  |   \lineii{span()}{Return a tuple containing the (start, end) positions  | ||
|  |                   of the match} | ||
|  | \end{tableii} | ||
|  | 
 | ||
|  | Trying these methods will soon clarify their meaning: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> m.group() | ||
|  | 'tempo' | ||
|  | >>> m.start(), m.end() | ||
|  | (0, 5) | ||
|  | >>> m.span() | ||
|  | (0, 5) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | \method{group()} returns the substring that was matched by the | ||
|  | RE.  \method{start()} and \method{end()} return the starting and | ||
|  | ending index of the match. \method{span()} returns both start and end | ||
|  | indexes in a single tuple.  Since the \method{match} method only | ||
|  | checks if the RE matches at the start of a string, | ||
|  | \method{start()} will always be zero.  However, the \method{search} | ||
|  | method of \class{RegexObject} instances scans through the string, so  | ||
|  | the match may not start at zero in that case. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print p.match('::: message') | ||
|  | None | ||
|  | >>> m = p.search('::: message') ; print m | ||
|  | <re.MatchObject instance at 80c9650> | ||
|  | >>> m.group() | ||
|  | 'message' | ||
|  | >>> m.span() | ||
|  | (4, 11) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | In actual programs, the most common style is to store the | ||
|  | \class{MatchObject} in a variable, and then check if it was | ||
|  | \code{None}.  This usually looks like: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | p = re.compile( ... ) | ||
|  | m = p.match( 'string goes here' ) | ||
|  | if m: | ||
|  |     print 'Match found: ', m.group() | ||
|  | else: | ||
|  |     print 'No match' | ||
|  | \end{verbatim} | ||
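
A concrete version of this idiom might look like the following.  It is
a sketch in Python 3 syntax; the function name \code{describe} and the
choice of \regexp{[a-z]+} as the pattern are assumptions made for the
example, reusing the RE from the interactive session above.

```python
import re

word = re.compile(r"[a-z]+")

def describe(text):
    """Report whether text begins with a run of lowercase letters."""
    m = word.match(text)
    if m:                      # a match object is always true
        return "Match found: " + m.group()
    return "No match"          # match() returned None

print(describe("tempo"))
print(describe("::: message"))
```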
|  | 
 | ||
|  | Two \class{RegexObject} methods return all of the matches for a pattern. | ||
|  | \method{findall()} returns a list of matching strings: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
>>> p = re.compile(r'\d+')
|  | >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') | ||
|  | ['12', '11', '10'] | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | \method{findall()} has to create the entire list before it can be | ||
|  | returned as the result.  In Python 2.2, the \method{finditer()} method | ||
|  | is also available, returning a sequence of \class{MatchObject} instances  | ||
|  | as an iterator. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') | ||
|  | >>> iterator | ||
|  | <callable-iterator object at 0x401833ac> | ||
|  | >>> for match in iterator: | ||
|  | ...     print match.span() | ||
|  | ... | ||
|  | (0, 2) | ||
|  | (22, 24) | ||
|  | (29, 31) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | 
 | ||
|  | \subsection{Module-Level Functions} | ||
|  | 
 | ||
|  | You don't have to produce a \class{RegexObject} and call its methods; | ||
|  | the \module{re} module also provides top-level functions called | ||
|  | \function{match()}, \function{search()}, \function{sub()}, and so | ||
|  | forth.  These functions take the same arguments as the corresponding | ||
|  | \class{RegexObject} method, with the RE string added as the first | ||
|  | argument, and still return either \code{None} or a \class{MatchObject} | ||
|  | instance. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print re.match(r'From\s+', 'Fromage amk') | ||
|  | None | ||
|  | >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') | ||
|  | <re.MatchObject instance at 80c5978> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Under the hood, these functions simply produce a \class{RegexObject} | ||
|  | for you and call the appropriate method on it.  They also store the | ||
|  | compiled object in a cache, so future calls using the same | ||
|  | RE are faster.   | ||
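
The two styles can be put side by side.  This sketch assumes Python 3
and reuses the sample strings from the session above; the variable
names are invented.

```python
import re

line = "From amk Thu May 14 19:12:10 1998"

# Module-level function: the RE string is the first argument.
via_module = re.match(r"From\s+", line)

# The same match through an explicitly compiled object.
from_re = re.compile(r"From\s+")
via_object = from_re.match(line)

same_span = via_module.span() == via_object.span()
no_match = re.match(r"From\s+", "Fromage amk")

print(via_module.span(), same_span, no_match)
```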
|  | 
 | ||
|  | Should you use these module-level functions, or should you get the | ||
|  | \class{RegexObject} and call its methods yourself?  That choice | ||
|  | depends on how frequently the RE will be used, and on your personal | ||
|  | coding style.  If a RE is being used at only one point in the code, | ||
|  | then the module functions are probably more convenient.  If a program | ||
|  | contains a lot of regular expressions, or re-uses the same ones in | ||
|  | several locations, then it might be worthwhile to collect all the | ||
|  | definitions in one place, in a section of code that compiles all the | ||
|  | REs ahead of time.  To take an example from the standard library, | ||
|  | here's an extract from \file{xmllib.py}: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | ref = re.compile( ... ) | ||
|  | entityref = re.compile( ... ) | ||
|  | charref = re.compile( ... ) | ||
|  | starttagopen = re.compile( ... ) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | I generally prefer to work with the compiled object, even for | ||
|  | one-time uses, but few people will be as much of a purist about this | ||
|  | as I am. | ||
|  | 
 | ||
|  | \subsection{Compilation Flags} | ||
|  | 
 | ||
|  | Compilation flags let you modify some aspects of how regular | ||
|  | expressions work.  Flags are available in the \module{re} module under | ||
|  | two names, a long name such as \constant{IGNORECASE}, and a short, | ||
|  | one-letter form such as \constant{I}.  (If you're familiar with Perl's | ||
|  | pattern modifiers, the one-letter forms use the same letters; the | ||
|  | short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) | ||
|  | Multiple flags can be specified by bitwise OR-ing them; \code{re.I | | ||
|  | re.M} sets both the \constant{I} and \constant{M} flags, for example. | ||
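
To see that OR-ing really enables both flags at once, consider this
sketch.  It assumes Python 3, and it combines \constant{I} with
\constant{S} on an invented two-line string so that the match succeeds
only when both flags are active.

```python
import re

text = "A\nB"

neither = re.findall(r"a.b", text)              # case and newline both block it
only_i = re.findall(r"a.b", text, re.I)         # '.' still refuses the newline
only_s = re.findall(r"a.b", text, re.S)         # 'a' still refuses 'A'
both = re.findall(r"a.b", text, re.I | re.S)    # both obstacles removed

# The long names are interchangeable with the one-letter forms.
same_flags = (re.IGNORECASE | re.DOTALL) == (re.I | re.S)

print(neither, only_i, only_s, both, same_flags)
```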
|  | 
 | ||
|  | Here's a table of the available flags, followed by | ||
|  | a more detailed explanation of each one. | ||
|  | 
 | ||
|  | \begin{tableii}{c|l}{}{Flag}{Meaning} | ||
|  |   \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any | ||
|  |   character, including newlines} | ||
|  |   \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} | ||
|  |   \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} | ||
|  |   \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, | ||
|  |   affecting \regexp{\^} and \regexp{\$}} | ||
|  |   \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, | ||
|  |   which can be organized more cleanly and understandably.} | ||
|  | \end{tableii} | ||
|  | 
 | ||
|  | \begin{datadesc}{I} | ||
|  | \dataline{IGNORECASE} | ||
Perform case-insensitive matching; character classes and literal strings
will match letters by ignoring case.  For example, \regexp{[A-Z]} will match
|  | lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, | ||
|  | \samp{spam}, or \samp{spAM}. | ||
|  | This lowercasing doesn't take the current locale into account; it will | ||
|  | if you also set the \constant{LOCALE} flag. | ||
|  | \end{datadesc} | ||
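
The \constant{IGNORECASE} behaviour described above can be checked with
a short sketch.  It assumes Python 3; the test strings other than the
three spellings of \samp{Spam} are invented.

```python
import re

spam = re.compile(r"Spam", re.IGNORECASE)
hits = [s for s in ["Spam", "spam", "spAM", "ham"] if spam.match(s)]

# With the flag, [A-Z] also accepts lowercase letters.
with_flag = re.match(r"[A-Z]+", "mixedCASE", re.I).group()
without_flag = re.match(r"[A-Z]+", "mixedCASE")

print(hits, with_flag, without_flag)
```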
|  | 
 | ||
|  | \begin{datadesc}{L} | ||
|  | \dataline{LOCALE} | ||
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
and \regexp{\e B} dependent on the current locale.
|  | 
 | ||
|  | Locales are a feature of the C library intended to help in writing | ||
|  | programs that take account of language differences.  For example, if | ||
|  | you're processing French text, you'd want to be able to write | ||
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
character class \regexp{[a-zA-Z0-9_]}; it won't match \character{\'e} or
\character{\c c}.  If your system is configured properly and a French
|  | locale is selected, certain C functions will tell the program that | ||
|  | \character{\'e} should also be considered a letter.  Setting the | ||
|  | \constant{LOCALE} flag when compiling a regular expression will cause the | ||
|  | resulting compiled object to use these C functions for \regexp{\e w}; | ||
|  | this is slower, but also enables \regexp{\e w+} to match French words as | ||
|  | you'd expect. | ||
|  | \end{datadesc} | ||
|  | 
 | ||
|  | \begin{datadesc}{M} | ||
|  | \dataline{MULTILINE} | ||
|  | (\regexp{\^} and \regexp{\$} haven't been explained yet;  | ||
|  | they'll be introduced in section~\ref{more-metacharacters}.) | ||
|  | 
 | ||
|  | Usually \regexp{\^} matches only at the beginning of the string, and | ||
|  | \regexp{\$} matches only at the end of the string and immediately before the | ||
|  | newline (if any) at the end of the string. When this flag is | ||
|  | specified, \regexp{\^} matches at the beginning of the string and at | ||
|  | the beginning of each line within the string, immediately following | ||
each newline.  Similarly, the \regexp{\$} metacharacter matches at
the end of the string and at the end of each line (immediately
preceding each newline).
|  | 
 | ||
|  | \end{datadesc} | ||
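
The effect of \constant{MULTILINE} on the two anchors can be observed
directly.  This is a sketch in Python 3 with an invented two-line
string that ends in a newline.

```python
import re

text = "first alpha\nsecond beta\n"

# Without the flag, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)

# Without the flag, $ matches only at the end of the string and just
# before the newline that ends the string.
ends_default = re.findall(r"\w+$", text)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multi)
print(ends_default, ends_multi)
```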
|  | 
 | ||
|  | \begin{datadesc}{S} | ||
|  | \dataline{DOTALL} | ||
|  | Makes the \character{.} special character match any character at all, | ||
|  | including a newline; without this flag, \character{.} will match | ||
|  | anything \emph{except} a newline. | ||
|  | \end{datadesc} | ||
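
A minimal sketch of the effect of \constant{DOTALL}, assuming Python 3 and a
made-up two-line string:

```python
import re

text = "line one\nline two"

# By default '.' stops at the newline, so '.*' covers only the first line.
first_line = re.match(r'.*', text).group()

# With DOTALL, '.' also matches '\n', so '.*' runs to the end of the string.
everything = re.match(r'.*', text, re.DOTALL).group()

print(repr(first_line))   # 'line one'
print(repr(everything))   # 'line one\nline two'
```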
|  | 
 | ||
|  | \begin{datadesc}{X} | ||
|  | \dataline{VERBOSE} This flag allows you to write regular expressions | ||
|  | that are more readable by granting you more flexibility in how you can | ||
|  | format them.  When this flag has been specified, whitespace within the | ||
|  | RE string is ignored, except when the whitespace is in a character | ||
|  | class or preceded by an unescaped backslash; this lets you organize | ||
|  | and indent the RE more clearly.  It also enables you to put comments | ||
within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash.
|  | 
 | ||
|  | For example, here's a RE that uses \constant{re.VERBOSE}; see how | ||
|  | much easier it is to read? | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | charref = re.compile(r""" | ||
|  |  &[#]		     # Start of a numeric entity reference | ||
|  |  ( | ||
|  |    [0-9]+[^0-9]      # Decimal form | ||
|  |    | 0[0-7]+[^0-7]   # Octal form | ||
|  |    | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form | ||
|  |  ) | ||
|  | """, re.VERBOSE) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Without the verbose setting, the RE would look like this: | ||
|  | \begin{verbatim} | ||
|  | charref = re.compile("&#([0-9]+[^0-9]" | ||
|  |                      "|0[0-7]+[^0-7]" | ||
|  |                      "|x[0-9a-fA-F]+[^0-9a-fA-F])") | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | In the above example, Python's automatic concatenation of string | ||
|  | literals has been used to break up the RE into smaller pieces, but | ||
|  | it's still more difficult to understand than the version using | ||
|  | \constant{re.VERBOSE}. | ||
|  | 
 | ||
|  | \end{datadesc} | ||
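
One way to convince yourself that the two spellings describe the same RE is to
run both against a few inputs.  The sketch below assumes Python 3; the names
\code{charref_verbose} and \code{charref_compact} and the sample strings are
hypothetical and are not part of the original example.

```python
import re

charref_verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     [0-9]+[^0-9]    # Decimal form
   | 0[0-7]+[^0-7]   # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)

charref_compact = re.compile("&#([0-9]+[^0-9]"
                             "|0[0-7]+[^0-7]"
                             "|x[0-9a-fA-F]+[^0-9a-fA-F])")

samples = ['&#65;', '&#x41;', '&#0101;', '&amp;', 'plain text']
for sample in samples:
    verbose_match = charref_verbose.match(sample)
    compact_match = charref_compact.match(sample)
    # Both patterns should agree on whether the sample matches
    # and on what group 1 contains.
    verbose_group = verbose_match.group(1) if verbose_match else None
    compact_group = compact_match.group(1) if compact_match else None
    print(repr(sample), verbose_group, compact_group)
```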
|  | 
 | ||
|  | \section{More Pattern Power} | ||
|  | 
 | ||
So far we've covered only some of the features of regular
|  | expressions.  In this section, we'll cover some new metacharacters, | ||
|  | and how to use groups to retrieve portions of the text that was matched. | ||
|  | 
 | ||
|  | \subsection{More Metacharacters\label{more-metacharacters}} | ||
|  | 
 | ||
|  | There are some metacharacters that we haven't covered yet.  Most of | ||
|  | them will be covered in this section. | ||
|  | 
 | ||
|  | Some of the remaining metacharacters to be discussed are | ||
|  | \dfn{zero-width assertions}.  They don't cause the engine to advance | ||
|  | through the string; instead, they consume no characters at all, | ||
|  | and simply succeed or fail.  For example, \regexp{\e b} is an | ||
|  | assertion that the current position is located at a word boundary; the | ||
|  | position isn't changed by the \regexp{\e b} at all.  This means that | ||
|  | zero-width assertions should never be repeated, because if they match | ||
|  | once at a given location, they can obviously be matched an infinite | ||
|  | number of times. | ||
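
A small sketch makes the ``zero-width'' part visible: the span reported for a
match does not include any characters for the assertion.  The sample string is
arbitrary and Python 3 syntax is assumed.

```python
import re

m = re.search(r'\bclass', 'no class at all')

# The match covers only the literal text 'class'; the \b assertion
# contributed no characters of its own.
print(m.group())  # class
print(m.span())   # (3, 8)

# An assertion on its own produces an empty match: start and end coincide.
boundary = re.search(r'\b', 'no class at all')
print(boundary.span())  # (0, 0)
```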
|  | 
 | ||
|  | \begin{list}{}{} | ||
|  | 
 | ||
|  | \item[\regexp{|}]  | ||
|  | Alternation, or the ``or'' operator.   | ||
|  | If A and B are regular expressions,  | ||
|  | \regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. | ||
|  | \regexp{|} has very low precedence in order to make it work reasonably when | ||
|  | you're alternating multi-character strings. | ||
|  | \regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not | ||
|  | \samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. | ||
|  | 
 | ||
|  | To match a literal \character{|}, | ||
|  | use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. | ||
|  | 
 | ||
|  | \item[\regexp{\^}] Matches at the beginning of lines.  Unless the | ||
|  | \constant{MULTILINE} flag has been set, this will only match at the | ||
|  | beginning of the string.  In \constant{MULTILINE} mode, this also | ||
|  | matches immediately after each newline within the string.   | ||
|  | 
 | ||
|  | For example, if you wish to match the word \samp{From} only at the | ||
|  | beginning of a line, the RE to use is \verb|^From|. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print re.search('^From', 'From Here to Eternity') | ||
|  | <re.MatchObject instance at 80c1520> | ||
|  | >>> print re.search('^From', 'Reciting From Memory') | ||
|  | None | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | %To match a literal \character{\^}, use \regexp{\e\^} or enclose it
 | ||
|  | %inside a character class, as in \regexp{[{\e}\^]}.
 | ||
|  | 
 | ||
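\item[\regexp{\$}] Matches at the end of a line.  Unless the
\constant{MULTILINE} flag has been set, this means either the end of the
string or the location just before a newline at the end of the string.
In \constant{MULTILINE} mode, this also matches at any location followed
by a newline character.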
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print re.search('}$', '{block}')
 | ||
|  | <re.MatchObject instance at 80adfa8> | ||
|  | >>> print re.search('}$', '{block} ')
 | ||
|  | None | ||
|  | >>> print re.search('}$', '{block}\n')
 | ||
|  | <re.MatchObject instance at 80adfa8> | ||
|  | \end{verbatim} | ||
|  | % $
 | ||
|  | 
 | ||
|  | To match a literal \character{\$}, use \regexp{\e\$} or enclose it | ||
|  | inside a character class, as in  \regexp{[\$]}. | ||
|  | 
 | ||
|  | \item[\regexp{\e A}] Matches only at the start of the string.  When | ||
|  | not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are | ||
|  | effectively the same.  In \constant{MULTILINE} mode, however, they're | ||
|  | different; \regexp{\e A} still matches only at the beginning of the | ||
|  | string, but \regexp{\^} may match at any location inside the string | ||
|  | that follows a newline character. | ||
|  | 
 | ||
|  | \item[\regexp{\e Z}]Matches only at the end of the string.   | ||
|  | 
 | ||
|  | \item[\regexp{\e b}] Word boundary.   | ||
|  | This is a zero-width assertion that matches only at the | ||
|  | beginning or end of a word.  A word is defined as a sequence of | ||
|  | alphanumeric characters, so the end of a word is indicated by | ||
|  | whitespace or a non-alphanumeric character.   | ||
|  | 
 | ||
|  | The following example matches \samp{class} only when it's a complete | ||
|  | word; it won't match when it's contained inside another word. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'\bclass\b') | ||
|  | >>> print p.search('no class at all') | ||
|  | <re.MatchObject instance at 80c8f28> | ||
|  | >>> print p.search('the declassified algorithm') | ||
|  | None | ||
|  | >>> print p.search('one subclass is') | ||
|  | None | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | There are two subtleties you should remember when using this special | ||
|  | sequence.  First, this is the worst collision between Python's string | ||
|  | literals and regular expression sequences.  In Python's string | ||
|  | literals, \samp{\e b} is the backspace character, ASCII value 8.  If | ||
|  | you're not using raw strings, then Python will convert the \samp{\e b} to | ||
|  | a backspace, and your RE won't match as you expect it to.  The | ||
|  | following example looks the same as our previous RE, but omits | ||
|  | the \character{r} in front of the RE string. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('\bclass\b') | ||
|  | >>> print p.search('no class at all') | ||
|  | None | ||
|  | >>> print p.search('\b' + 'class' + '\b')   | ||
|  | <re.MatchObject instance at 80c3ee0> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Second, inside a character class, where there's no use for this | ||
|  | assertion, \regexp{\e b} represents the backspace character, for | ||
|  | compatibility with Python's string literals. | ||
|  | 
 | ||
|  | \item[\regexp{\e B}] Another zero-width assertion, this is the | ||
|  | opposite of \regexp{\e b}, only matching when the current | ||
|  | position is not at a word boundary. | ||
|  | 
 | ||
|  | \end{list} | ||
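
The metacharacters in the list above can be tried out in a short script.  This
is only a sketch: the sample strings are invented, \function{re.finditer()} is
used simply to collect match positions, and Python 3 syntax is assumed.

```python
import re

# Alternation has low precedence: the alternatives are the whole words.
names = re.findall(r'Crow|Servo', 'Crow and Servo')
print(names)  # ['Crow', 'Servo']

text = 'From here\nFrom there'

# In MULTILINE mode ^ matches at the start of every line...
caret_hits = [m.start() for m in re.finditer(r'^From', text, re.MULTILINE)]
# ...whereas \A still matches only at the start of the string.
a_hits = [m.start() for m in re.finditer(r'\AFrom', text, re.MULTILINE)]
print(caret_hits)  # [0, 10]
print(a_hits)      # [0]

# \B succeeds only where there is no word boundary, so this finds
# 'class' inside 'subclass' but not the standalone word.
inner = re.search(r'\Bclass', 'one subclass is')
standalone = re.search(r'\Bclass', 'no class at all')
print(inner.span())  # (7, 12)
print(standalone)    # None
```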
|  | 
 | ||
|  | \subsection{Grouping} | ||
|  | 
 | ||
|  | Frequently you need to obtain more information than just whether the | ||
|  | RE matched or not.  Regular expressions are often used to dissect | ||
|  | strings by writing a RE divided into several subgroups which | ||
|  | match different components of interest.  For example, an RFC-822 | ||
|  | header line is divided into a header name and a value, separated by a | ||
|  | \character{:}.  This can be handled by writing a regular expression | ||
|  | which matches an entire header line, and has one group which matches the | ||
|  | header name, and another group which matches the header's value. | ||
|  | 
 | ||
|  | Groups are marked by the \character{(}, \character{)} metacharacters. | ||
|  | \character{(} and \character{)} have much the same meaning as they do | ||
|  | in mathematical expressions; they group together the expressions | ||
contained inside them.  You can also repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}.  For example,
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('(ab)*') | ||
|  | >>> print p.match('ababababab').span() | ||
|  | (0, 10) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Groups indicated with \character{(}, \character{)} also capture the | ||
|  | starting and ending index of the text that they match; this can be | ||
|  | retrieved by passing an argument to \method{group()}, | ||
|  | \method{start()}, \method{end()}, and \method{span()}.  Groups are | ||
|  | numbered starting with 0.  Group 0 is always present; it's the whole | ||
|  | RE, so \class{MatchObject} methods all have group 0 as their default | ||
|  | argument.  Later we'll see how to express groups that don't capture | ||
|  | the span of text that they match. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('(a)b') | ||
|  | >>> m = p.match('ab') | ||
|  | >>> m.group() | ||
|  | 'ab' | ||
|  | >>> m.group(0) | ||
|  | 'ab' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Subgroups are numbered from left to right, from 1 upward.  Groups can | ||
|  | be nested; to determine the number, just count the opening parenthesis | ||
|  | characters, going from left to right. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('(a(b)c)d') | ||
|  | >>> m = p.match('abcd') | ||
|  | >>> m.group(0) | ||
|  | 'abcd' | ||
|  | >>> m.group(1) | ||
|  | 'abc' | ||
|  | >>> m.group(2) | ||
|  | 'b' | ||
|  | \end{verbatim} | ||
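
The same group numbers can be passed to \method{start()}, \method{end()}, and
\method{span()}.  A brief sketch, reusing the nested pattern from the example
above and assuming Python 3:

```python
import re

p = re.compile('(a(b)c)d')
m = p.match('abcd')

whole_span = m.span()     # group 0 by default: (0, 4)
outer_span = m.span(1)    # group 1 matched 'abc': (0, 3)
inner_start = m.start(2)  # group 2 matched 'b', which starts at index 1
inner_end = m.end(2)      # and ends at index 2

print(whole_span, outer_span, inner_start, inner_end)
```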
|  | 
 | ||
|  | \method{group()} can be passed multiple group numbers at a time, in | ||
|  | which case it will return a tuple containing the corresponding values | ||
|  | for those groups. | ||
|  | 
 | ||
|  | \begin{verbatim}   | ||
|  | >>> m.group(2,1,2) | ||
|  | ('b', 'abc', 'b') | ||
|  | \end{verbatim}   | ||
|  | 
 | ||
|  | The \method{groups()} method returns a tuple containing the strings | ||
|  | for all the subgroups, from 1 up to however many there are. | ||
|  | 
 | ||
|  | \begin{verbatim}   | ||
|  | >>> m.groups() | ||
|  | ('abc', 'b') | ||
|  | \end{verbatim}   | ||
|  | 
 | ||
|  | Backreferences in a pattern allow you to specify that the contents of | ||
|  | an earlier capturing group must also be found at the current location | ||
|  | in the string.  For example, \regexp{\e 1} will succeed if the exact | ||
|  | contents of group 1 can be found at the current position, and fails | ||
|  | otherwise.  Remember that Python's string literals also use a | ||
|  | backslash followed by numbers to allow including arbitrary characters | ||
|  | in a string, so be sure to use a raw string when incorporating | ||
|  | backreferences in a RE. | ||
|  | 
 | ||
|  | For example, the following RE detects doubled words in a string. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'(\b\w+)\s+\1') | ||
|  | >>> p.search('Paris in the the spring').group() | ||
|  | 'the the' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Backreferences like this aren't often useful for just searching | ||
|  | through a string --- there are few text formats which repeat data in | ||
|  | this way --- but you'll soon find out that they're \emph{very} useful | ||
|  | when performing string substitutions. | ||
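
To see why the raw string matters for backreferences, it may help to compare
what Python makes of \samp{\e 1} in each kind of literal.  The following is a
sketch in Python 3 syntax; the variable names are made up for this
illustration.

```python
import re

# In an ordinary string literal, '\1' is the single character chr(1);
# only the raw string keeps the backslash and the digit.
plain_is_control_char = ('\1' == chr(1))
raw_keeps_backslash = (r'\1' == '\\' + '1')
print(plain_is_control_char, raw_keeps_backslash)  # True True

doubled = re.compile(r'(\b\w+)\s+\1')
found = doubled.search('Paris in the the spring').group()
print(found)  # the the

# Here every backslash is doubled except the last one, so the pattern
# ends with chr(1) instead of a backreference and finds nothing.
not_raw = re.compile('(\\b\\w+)\\s+\1')
missed = not_raw.search('Paris in the the spring')
print(missed)  # None
```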
|  | 
 | ||
|  | \subsection{Non-capturing and Named Groups} | ||
|  | 
 | ||
|  | Elaborate REs may use many groups, both to capture substrings of | ||
|  | interest, and to group and structure the RE itself.  In complex REs, | ||
|  | it becomes difficult to keep track of the group numbers.  There are | ||
|  | two features which help with this problem.  Both of them use a common | ||
|  | syntax for regular expression extensions, so we'll look at that first. | ||
|  | 
 | ||
|  | Perl 5 added several additional features to standard regular | ||
|  | expressions, and the Python \module{re} module supports most of them. | ||
|  | It would have been difficult to choose new single-keystroke | ||
|  | metacharacters or new special sequences beginning with \samp{\e} to | ||
|  | represent the new features without making Perl's regular expressions | ||
|  | confusingly different from standard REs.  If you chose \samp{\&} as a | ||
|  | new metacharacter, for example, old expressions would be assuming that | ||
|  | \samp{\&} was a regular character and wouldn't have escaped it by | ||
|  | writing \regexp{\e \&} or \regexp{[\&]}.   | ||
|  | 
 | ||
|  | The solution chosen by the Perl developers was to use \regexp{(?...)} | ||
|  | as the extension syntax.  \samp{?} immediately after a parenthesis was | ||
|  | a syntax error because the \samp{?} would have nothing to repeat, so | ||
|  | this didn't introduce any compatibility problems.  The characters | ||
|  | immediately after the \samp{?}  indicate what extension is being used, | ||
|  | so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and | ||
|  | \regexp{(?:foo)} is something else (a non-capturing group containing | ||
|  | the subexpression \regexp{foo}). | ||
|  | 
 | ||
|  | Python adds an extension syntax to Perl's extension syntax.  If the | ||
|  | first character after the question mark is a \samp{P}, you know that | ||
|  | it's an extension that's specific to Python.  Currently there are two | ||
|  | such extensions: \regexp{(?P<\var{name}>...)} defines a named group, | ||
|  | and \regexp{(?P=\var{name})} is a backreference to a named group.  If | ||
|  | future versions of Perl 5 add similar features using a different | ||
|  | syntax, the \module{re} module will be changed to support the new | ||
|  | syntax, while preserving the Python-specific syntax for | ||
|  | compatibility's sake. | ||
|  | 
 | ||
|  | Now that we've looked at the general extension syntax, we can return | ||
|  | to the features that simplify working with groups in complex REs. | ||
|  | Since groups are numbered from left to right and a complex expression | ||
|  | may use many groups, it can become difficult to keep track of the | ||
|  | correct numbering, and modifying such a complex RE is annoying. | ||
|  | Insert a new group near the beginning, and you change the numbers of | ||
|  | everything that follows it. | ||
|  | 
 | ||
|  | First, sometimes you'll want to use a group to collect a part of a | ||
|  | regular expression, but aren't interested in retrieving the group's | ||
|  | contents.  You can make this fact explicit by using a non-capturing | ||
|  | group: \regexp{(?:...)}, where you can put any other regular | ||
|  | expression inside the parentheses.   | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> m = re.match("([abc])+", "abc") | ||
|  | >>> m.groups() | ||
|  | ('c',) | ||
|  | >>> m = re.match("(?:[abc])+", "abc") | ||
|  | >>> m.groups() | ||
|  | () | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Except for the fact that you can't retrieve the contents of what the | ||
|  | group matched, a non-capturing group behaves exactly the same as a | ||
|  | capturing group; you can put anything inside it, repeat it with a | ||
|  | repetition metacharacter such as \samp{*}, and nest it within other | ||
groups (capturing or non-capturing).  \regexp{(?:...)} is particularly
useful when modifying an existing pattern, since you can add new groups
without changing how all the other groups are numbered.  It should be
|  | mentioned that there's no performance difference in searching between | ||
|  | capturing and non-capturing groups; neither form is any faster than | ||
|  | the other. | ||
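
The numbering point can be sketched as follows.  The patterns and the sample
address are hypothetical, chosen only to show how adding a capturing group
shifts the existing numbers while a non-capturing group does not; Python 3
syntax is assumed.

```python
import re

original = re.compile(r'(\w+)@(\w+)')
# Adding an optional prefix as a capturing group shifts the old numbers.
capturing = re.compile(r'(mailto:)?(\w+)@(\w+)')
# Adding the same prefix as a non-capturing group leaves them alone.
non_capturing = re.compile(r'(?:mailto:)?(\w+)@(\w+)')

text = 'mailto:guido@python'
print(original.match('guido@python').group(1))  # guido
print(capturing.match(text).group(1))           # mailto:
print(non_capturing.match(text).group(1))       # guido
print(non_capturing.match(text).groups())       # ('guido', 'python')
```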
|  | 
 | ||
|  | The second, and more significant, feature is named groups; instead of | ||
|  | referring to them by numbers, groups can be referenced by a name. | ||
|  | 
 | ||
|  | The syntax for a named group is one of the Python-specific extensions: | ||
|  | \regexp{(?P<\var{name}>...)}.  \var{name} is, obviously, the name of | ||
|  | the group.  Except for associating a name with a group, named groups | ||
|  | also behave identically to capturing groups.  The \class{MatchObject} | ||
|  | methods that deal with capturing groups all accept either integers, to | ||
|  | refer to groups by number, or a string containing the group name. | ||
|  | Named groups are still given numbers, so you can retrieve information | ||
|  | about a group in two ways: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'(?P<word>\b\w+\b)') | ||
|  | >>> m = p.search( '(((( Lots of punctuation )))' ) | ||
|  | >>> m.group('word') | ||
|  | 'Lots' | ||
|  | >>> m.group(1) | ||
|  | 'Lots' | ||
|  | \end{verbatim} | ||
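
Because the other match methods accept names as well, the position of a named
group can be looked up either way.  A short sketch, reusing the pattern above
and assuming Python 3:

```python
import re

p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search('(((( Lots of punctuation )))')

span_by_name = m.span('word')
span_by_number = m.span(1)

print(span_by_name)                    # (5, 9)
print(span_by_number)                  # (5, 9)
print(m.start('word'), m.end('word'))  # 5 9
```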
|  | 
 | ||
|  | Named groups are handy because they let you use easily-remembered | ||
|  | names, instead of having to remember numbers.  Here's an example RE | ||
|  | from the \module{imaplib} module: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
|  | \end{verbatim} | ||
|  | 
 | ||
|  | It's obviously much easier to retrieve \code{m.group('zonem')}, | ||
|  | instead of having to remember to retrieve group 9. | ||
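
Here is a sketch of the pattern in use.  The sample string is made up to fit
the RE and is not taken from a real server response; Python 3 syntax is
assumed.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# An invented fragment, used only for illustration.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

print(m.group('zonem'))               # 00
print(m.group(9))                     # the same group, by number
print(m.group('day', 'mon', 'year'))  # ('17', 'Jul', '1996')
```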
|  | 
 | ||
Since the syntax for backreferences, in an expression like
\regexp{(...)\e 1}, refers to the number of the group, there's
naturally a variant that uses the group name instead of the number.
This is also a Python extension: \regexp{(?P=\var{name})} indicates
that the contents of the group called \var{name} should again be found
at the current point.  The regular expression for finding doubled
words, \regexp{(\e b\e w+)\e s+\e 1}, can also be written as
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') | ||
|  | >>> p.search('Paris in the the spring').group() | ||
|  | 'the the' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | \subsection{Lookahead Assertions} | ||
|  | 
 | ||
|  | Another zero-width assertion is the lookahead assertion.  Lookahead | ||
|  | assertions are available in both positive and negative form, and  | ||
|  | look like this: | ||
|  | 
 | ||
|  | \begin{itemize} | ||
|  | \item[\regexp{(?=...)}] Positive lookahead assertion.  This succeeds | ||
|  | if the contained regular expression, represented here by \code{...}, | ||
|  | successfully matches at the current location, and fails otherwise. | ||
|  | But, once the contained expression has been tried, the matching engine | ||
|  | doesn't advance at all; the rest of the pattern is tried right where | ||
|  | the assertion started. | ||
|  | 
 | ||
|  | \item[\regexp{(?!...)}] Negative lookahead assertion.  This is the | ||
|  | opposite of the positive assertion; it succeeds if the contained expression | ||
|  | \emph{doesn't} match at the current position in the string. | ||
|  | \end{itemize} | ||
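
The two assertions can be sketched with a pair of toy patterns; the strings
\samp{foobar} and \samp{foobaz} are arbitrary, and Python 3 syntax is assumed.
Note how the span of the first match stops before \samp{bar}, since the
lookahead consumes nothing.

```python
import re

# The lookahead checks that 'bar' follows but does not consume it:
# the match ends right after 'foo'.
positive = re.search(r'foo(?=bar)', 'foobar')
print(positive.group(), positive.span())  # foo (0, 3)

# Negative lookahead: match 'foo' only when 'bar' does not follow.
rejected = re.search(r'foo(?!bar)', 'foobar')
accepted = re.search(r'foo(?!bar)', 'foobaz')
print(rejected)          # None
print(accepted.group())  # foo
```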
|  | 
 | ||
|  | An example will help make this concrete by demonstrating a case | ||
|  | where a lookahead is useful.  Consider a simple pattern to match a | ||
|  | filename and split it apart into a base name and an extension, | ||
|  | separated by a \samp{.}.  For example, in \samp{news.rc}, \samp{news} | ||
|  | is the base name, and \samp{rc} is the filename's extension.   | ||
|  | 
 | ||
|  | The pattern to match this is quite simple:  | ||
|  | 
 | ||
|  | \regexp{.*[.].*\$} | ||
|  | 
 | ||
|  | Notice that the \samp{.} needs to be treated specially because it's a | ||
|  | metacharacter; I've put it inside a character class.  Also notice the | ||
|  | trailing \regexp{\$}; this is added to ensure that all the rest of the | ||
string must be included in the extension.  This regular expression
matches \samp{foo.bar}, \samp{autoexec.bat}, \samp{sendmail.cf}, and
\samp{printers.conf}.
|  | 
 | ||
|  | Now, consider complicating the problem a bit; what if you want to | ||
|  | match filenames where the extension is not \samp{bat}? | ||
|  | Some incorrect attempts: | ||
|  | 
 | ||
|  | \verb|.*[.][^b].*$|
 | ||
|  | % $
 | ||
|  | 
 | ||
|  | The first attempt above tries to exclude \samp{bat} by requiring that | ||
|  | the first character of the extension is not a \samp{b}.  This is | ||
|  | wrong, because the pattern also doesn't match \samp{foo.bar}. | ||
|  | 
 | ||
|  | % Messes up the HTML without the curly braces around \^
 | ||
|  | \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} | ||
|  | 
 | ||
|  | The expression gets messier when you try to patch up the first | ||
|  | solution by requiring one of the following cases to match: the first | ||
|  | character of the extension isn't \samp{b}; the second character isn't | ||
|  | \samp{a}; or the third character isn't \samp{t}.  This accepts | ||
|  | \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a | ||
|  | three-letter extension and won't accept a filename with a two-letter | ||
|  | extension such as \samp{sendmail.cf}.  We'll complicate the pattern | ||
|  | again in an effort to fix it. | ||
|  | 
 | ||
|  | \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} | ||
|  | 
 | ||
In the third attempt, the second and third letters are both made
|  | optional in order to allow matching extensions shorter than three | ||
|  | characters, such as \samp{sendmail.cf}. | ||
|  | 
 | ||
|  | The pattern's getting really complicated now, which makes it hard to | ||
|  | read and understand.  Worse, if the problem changes and you want to | ||
|  | exclude both \samp{bat} and \samp{exe} as extensions, the pattern | ||
|  | would get even more complicated and confusing. | ||
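
Running the three attempts against the example filenames shows where each one
goes wrong.  This is a sketch in Python 3 syntax; the list names are invented
for the illustration.  Note that even the third attempt still turns away
\samp{printers.conf}, because none of its alternatives allows an extension
longer than three characters.

```python
import re

filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']

attempts = [
    r'.*[.][^b].*$',
    r'.*[.]([^b]..|.[^a].|..[^t])$',
    r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
]

results = []
for pattern in attempts:
    # Collect the filenames that each attempt accepts.
    accepted = [name for name in filenames if re.match(pattern, name)]
    results.append(accepted)
    print(pattern, accepted)

# First attempt:  ['sendmail.cf', 'printers.conf']
# Second attempt: ['foo.bar']
# Third attempt:  ['foo.bar', 'sendmail.cf']
```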
|  | 
 | ||
|  | A negative lookahead cuts through all this: | ||
|  | 
 | ||
|  | \regexp{.*[.](?!bat\$).*\$} | ||
|  | % $
 | ||
|  | 
 | ||
The lookahead means: if the expression \regexp{bat\$} doesn't match at
this point, try the rest of the pattern; if \regexp{bat\$} does match,
the whole pattern will fail.  The trailing \regexp{\$} is required to
|  | ensure that something like \samp{sample.batch}, where the extension | ||
|  | only starts with \samp{bat}, will be allowed. | ||
|  | 
 | ||
|  | Excluding another filename extension is now easy; simply add it as an | ||
|  | alternative inside the assertion.  The following pattern excludes | ||
|  | filenames that end in either \samp{bat} or \samp{exe}: | ||
|  | 
 | ||
|  | \regexp{.*[.](?!bat\$|exe\$).*\$} | ||
|  | % $
 | ||
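
A quick check of both lookahead patterns, again as a sketch in Python 3
syntax.  The filename \samp{setup.exe} and the variable names are additions
made for this illustration.

```python
import re

filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
             'printers.conf', 'sample.batch', 'setup.exe']

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

# Only 'autoexec.bat' is rejected; 'sample.batch' is still allowed.
kept_without_bat = [name for name in filenames if not_bat.match(name)]

# Adding the alternative also rejects 'setup.exe'.
kept_without_bat_exe = [name for name in filenames
                        if not_bat_or_exe.match(name)]

print(kept_without_bat)
print(kept_without_bat_exe)
```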
|  | 
 | ||
|  | 
 | ||
|  | \section{Modifying Strings} | ||
|  | 
 | ||
|  | Up to this point, we've simply performed searches against a static | ||
|  | string.  Regular expressions are also commonly used to modify a string | ||
|  | in various ways, using the following \class{RegexObject} methods: | ||
|  | 
 | ||
|  | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} | ||
|  |   \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} | ||
|  |   \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} | ||
|  |   \lineii{subn()}{Does the same thing as \method{sub()},  | ||
|  |    but returns the new string and the number of replacements} | ||
|  | \end{tableii} | ||
|  | 
 | ||
|  | 
 | ||
|  | \subsection{Splitting Strings} | ||
|  | 
 | ||
The \method{split()} method of a \class{RegexObject} splits a string
apart wherever the RE matches, returning a list of the pieces.
It's similar to the \method{split()} method of strings but
provides much more generality in the delimiters that you can split by;
the string method only supports splitting by whitespace or by
a fixed string.  As you'd expect, there's a module-level
\function{re.split()} function, too.
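
The difference in generality can be sketched with a line that mixes several
delimiters.  The sample text is invented and Python 3 syntax is assumed.

```python
import re

line = 'apples, oranges;pears   plums'

# The string method splits on one fixed delimiter (or on whitespace).
by_comma = line.split(', ')

# A regular expression can treat several delimiters as equivalent.
by_any = re.split(r'[,;\s]+', line)

print(by_comma)  # ['apples', 'oranges;pears   plums']
print(by_any)    # ['apples', 'oranges', 'pears', 'plums']
```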
|  | 
 | ||
|  | \begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} | ||
|  |   Split \var{string} by the matches of the regular expression.  If | ||
|  |   capturing parentheses are used in the RE, then their contents will | ||
|  |   also be returned as part of the resulting list.  If \var{maxsplit} | ||
|  |   is nonzero, at most \var{maxsplit} splits are performed. | ||
|  | \end{methoddesc} | ||
|  | 
 | ||
You can limit the number of splits made by passing a value for
|  | \var{maxsplit}.  When \var{maxsplit} is nonzero, at most | ||
|  | \var{maxsplit} splits will be made, and the remainder of the string is | ||
|  | returned as the final element of the list.  In the following example, | ||
|  | the delimiter is any sequence of non-alphanumeric characters. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'\W+') | ||
|  | >>> p.split('This is a test, short and sweet, of split().') | ||
|  | ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] | ||
|  | >>> p.split('This is a test, short and sweet, of split().', 3) | ||
|  | ['This', 'is', 'a', 'test, short and sweet, of split().'] | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Sometimes you're not only interested in what the text between | ||
|  | delimiters is, but also need to know what the delimiter was.  If | ||
|  | capturing parentheses are used in the RE, then their values are also | ||
|  | returned as part of the list.  Compare the following calls: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile(r'\W+') | ||
|  | >>> p2 = re.compile(r'(\W+)') | ||
|  | >>> p.split('This... is a test.') | ||
|  | ['This', 'is', 'a', 'test', ''] | ||
|  | >>> p2.split('This... is a test.') | ||
|  | ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | The module-level function \function{re.split()} adds the RE to be | ||
|  | used as the first argument, but is otherwise the same.   | ||
|  | 
 | ||
\begin{verbatim}
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
|  | 
 | ||
|  | \subsection{Search and Replace} | ||
|  | 
 | ||
|  | Another common task is to find all the matches for a pattern, and | ||
|  | replace them with a different string.  The \method{sub()} method takes | ||
|  | a replacement value, which can be either a string or a function, and | ||
|  | the string to be processed. | ||
|  | 
 | ||
|  | \begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} | ||
|  | Returns the string obtained by replacing the leftmost non-overlapping | ||
|  | occurrences of the RE in \var{string} by the replacement | ||
|  | \var{replacement}.  If the pattern isn't found, \var{string} is returned | ||
|  | unchanged.   | ||
|  | 
 | ||
|  | The optional argument \var{count} is the maximum number of pattern | ||
|  | occurrences to be replaced; \var{count} must be a non-negative | ||
|  | integer.  The default value of 0 means to replace all occurrences. | ||
|  | \end{methoddesc} | ||
|  | 
 | ||
|  | Here's a simple example of using the \method{sub()} method.  It | ||
|  | replaces colour names with the word \samp{colour}: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile( '(blue|white|red)') | ||
|  | >>> p.sub( 'colour', 'blue socks and red shoes') | ||
|  | 'colour socks and colour shoes' | ||
|  | >>> p.sub( 'colour', 'blue socks and red shoes', count=1) | ||
|  | 'colour socks and red shoes' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | The \method{subn()} method does the same work, but returns a 2-tuple | ||
|  | containing the new string value and the number of replacements  | ||
|  | that were performed: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile( '(blue|white|red)') | ||
|  | >>> p.subn( 'colour', 'blue socks and red shoes') | ||
|  | ('colour socks and colour shoes', 2) | ||
|  | >>> p.subn( 'colour', 'no colours at all') | ||
|  | ('no colours at all', 0) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Empty matches are replaced only when they're not | ||
|  | adjacent to a previous match.   | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('x*') | ||
|  | >>> p.sub('-', 'abxd') | ||
|  | '-a-b-d-' | ||
|  | \end{verbatim} | ||
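
The exact output of this example depends on the interpreter.  To the best of
my knowledge, Python 3.7 changed the rule so that an empty match is also
replaced when it directly follows a non-empty match, which yields one more
\samp{-} after the \samp{x} has been replaced.  The sketch below, in Python 3
syntax, therefore prints the result without assuming which of the two forms
it will be.

```python
import re

p = re.compile('x*')
result = p.sub('-', 'abxd')

# Expected to be '-a-b-d-' on older interpreters and
# '-a-b--d-' on Python 3.7 and later.
print(result)
```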
|  | 
 | ||
|  | If \var{replacement} is a string, any backslash escapes in it are | ||
|  | processed.  That is, \samp{\e n} is converted to a single newline | ||
|  | character, \samp{\e r} is converted to a carriage return, and so forth. | ||
|  | Unknown escapes such as \samp{\e j} are left alone.  Backreferences, | ||
|  | such as \samp{\e 6}, are replaced with the substring matched by the | ||
|  | corresponding group in the RE.  This lets you incorporate | ||
|  | portions of the original text in the resulting | ||
|  | replacement string. | ||
|  | 
 | ||
|  | This example matches the word \samp{section} followed by a string | ||
|  | enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to | ||
|  | \samp{subsection}: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) | ||
|  | >>> p.sub(r'subsection{\1}','section{First} section{second}') | ||
|  | 'subsection{First} subsection{second}' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | There's also a syntax for referring to named groups as defined by the | ||
|  | \regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the | ||
|  | substring matched by the group named \samp{name}, and  | ||
|  | \samp{\e g<\var{number}>}  | ||
|  | uses the corresponding group number.   | ||
|  | \samp{\e g<2>} is therefore equivalent to \samp{\e 2},  | ||
|  | but isn't ambiguous in a | ||
|  | replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be | ||
|  | interpreted as a reference to group 20, not a reference to group 2 | ||
|  | followed by the literal character \character{0}.)  The following | ||
|  | substitutions are all equivalent, but use all three variations of the | ||
|  | replacement string. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) | ||
|  | >>> p.sub(r'subsection{\1}','section{First}') | ||
|  | 'subsection{First}' | ||
|  | >>> p.sub(r'subsection{\g<1>}','section{First}') | ||
|  | 'subsection{First}' | ||
|  | >>> p.sub(r'subsection{\g<name>}','section{First}') | ||
|  | 'subsection{First}' | ||
|  | \end{verbatim} | ||
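
The ambiguity that \samp{\e g<2>0} avoids can be sketched with a pattern that
has only two groups.  The pattern and input are made up for the illustration,
and Python 3 syntax is assumed; \code{re.error} is the exception the
\module{re} module raises for an invalid group reference.

```python
import re

p = re.compile(r'(\d)(\d)')

# \g<2>0 means "group 2, then a literal 0".
unambiguous = p.sub(r'\g<2>0', '47')
print(unambiguous)  # 70

# \20 is read as a reference to group 20, which this pattern lacks,
# so the substitution is rejected instead of producing '70'.
try:
    p.sub(r'\20', '47')
    ambiguous_rejected = False
except re.error as exc:
    ambiguous_rejected = True
    print('error:', exc)
```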
|  | 
 | ||
|  | \var{replacement} can also be a function, which gives you even more | ||
|  | control.  If \var{replacement} is a function, the function is | ||
|  | called for every non-overlapping occurrence of \var{pattern}.  On each | ||
|  | call, the function is  | ||
|  | passed a \class{MatchObject} argument for the match | ||
|  | and can use this information to compute the desired replacement string and return it. | ||
|  | 
 | ||
|  | In the following example, the replacement function translates  | ||
|  | decimal numbers into hexadecimal: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> def hexrepl( match ): | ||
|  | ...     "Return the hex string for a decimal number" | ||
|  | ...     value = int( match.group() ) | ||
|  | ...     return hex(value) | ||
|  | ... | ||
|  | >>> p = re.compile(r'\d+') | ||
|  | >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') | ||
|  | 'Call 0xffd2 for printing, 0xc000 for user code.' | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | When using the module-level \function{re.sub()} function, the pattern | ||
|  | is passed as the first argument.  The pattern may be a string or a | ||
|  | \class{RegexObject}; if you need to specify regular expression flags, | ||
|  | you must either use a \class{RegexObject} as the first parameter, or use | ||
|  | embedded modifiers in the pattern, e.g.  \code{sub("(?i)b+", "x", "bbbb | ||
|  | BBBB")} returns \code{'x x'}. | ||
|  | 
 | ||
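As a minimal sketch of the two options just described, the following
compares a compiled object and an embedded modifier, both passed to the
module-level function.  The variable names are illustrative only:

```python
import re

text = 'bbbb BBBB'

# Option 1: compile with the flag, then pass the compiled object
# as the first argument.
compiled = re.compile('b+', re.IGNORECASE)
via_object = re.sub(compiled, 'x', text)

# Option 2: embed the (?i) modifier at the start of the pattern string.
via_modifier = re.sub('(?i)b+', 'x', text)

# Without any flag, only the lowercase run is replaced.
case_sensitive = re.sub('b+', 'x', text)
```

Both flagged forms give \code{'x x'}; the unflagged call gives
\code{'x BBBB'}.
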
|  | \section{Common Problems} | ||
|  | 
 | ||
|  | Regular expressions are a powerful tool for some applications, but | ||
|  | their behaviour isn't always intuitive, and at times they don't | ||
|  | behave the way you might expect.  This section points out | ||
|  | some of the most common pitfalls. | ||
|  | 
 | ||
|  | \subsection{Use String Methods} | ||
|  | 
 | ||
|  | Sometimes using the \module{re} module is a mistake.  If you're | ||
|  | matching a fixed string, or a single character class, and you're not | ||
|  | using any \module{re} features such as the \constant{IGNORECASE} flag, | ||
|  | then the full power of regular expressions may not be required. | ||
|  | Strings have several methods for performing operations with fixed | ||
|  | strings and they're usually much faster, because the implementation is | ||
|  | a single small C loop that's been optimized for the purpose, instead | ||
|  | of the large, more generalized regular expression engine. | ||
|  | 
 | ||
|  | One example is replacing a single fixed string with another; | ||
|  | you might, say, replace \samp{word} | ||
|  | with \samp{deed}.  \code{re.sub()} seems like the function to use for | ||
|  | this, but consider the \method{replace()} method.  Note that  | ||
|  | \method{replace()} will also replace \samp{word} inside | ||
|  | words, turning \samp{swordfish} into \samp{sdeedfish}, but the  | ||
|  | na{\"\i}ve RE \regexp{word} would have done that, too.  (To avoid performing | ||
|  | the substitution on parts of words, the pattern would have to be | ||
|  | \regexp{\e bword\e b}, in order to require that \samp{word} have a | ||
|  | word boundary on either side.  This takes the job beyond  | ||
|  | \method{replace()}'s abilities.) | ||
|  | 
 | ||
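The difference between the three approaches can be sketched on a
made-up sample string:

```python
import re

text = 'word swordfish'

# str.replace() substitutes every occurrence, even inside longer words.
with_replace = text.replace('word', 'deed')

# The naive RE behaves exactly the same way.
with_naive_re = re.sub('word', 'deed', text)

# \b requires a word boundary on both sides, so 'swordfish' is untouched.
with_boundaries = re.sub(r'\bword\b', 'deed', text)
```

The first two results are both \code{'deed sdeedfish'}; only the
\regexp{\e bword\e b} version leaves \samp{swordfish} alone.
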
|  | Another common task is deleting every occurrence of a single character | ||
|  | from a string or replacing it with another single character.  You | ||
|  | might do this with something like \code{re.sub('\e n', ' ', S)}, but | ||
|  | \method{translate()} is capable of doing both tasks | ||
|  | and will be faster than any regular expression operation can be. | ||
|  | 
 | ||
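A minimal sketch of both tasks with \method{translate()} follows.  It
assumes Python 3, where the translation table is built with
\code{str.maketrans()}; older releases construct the table differently.

```python
text = 'one\ntwo\nthree'

# Replace each newline with a space (one-to-one character mapping).
to_spaces = text.translate(str.maketrans('\n', ' '))

# Delete every newline: the third argument to maketrans() lists
# the characters to remove.
deleted = text.translate(str.maketrans('', '', '\n'))
```
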
|  | In short, before turning to the \module{re} module, consider whether | ||
|  | your problem can be solved with a faster and simpler string method. | ||
|  | 
 | ||
|  | \subsection{match() versus search()} | ||
|  | 
 | ||
|  | The \function{match()} function only checks whether the RE matches at | ||
|  | the beginning of the string, while \function{search()} scans | ||
|  | forward through the string for a match. | ||
|  | It's important to keep this distinction in mind.  Remember,  | ||
|  | \function{match()} only reports a match that | ||
|  | starts at position 0; if the match wouldn't start at zero,  | ||
|  | \function{match()} will \emph{not} report it. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print(re.match('super', 'superstition').span()) | ||
|  | (0, 5) | ||
|  | >>> print(re.match('super', 'insuperable')) | ||
|  | None | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | On the other hand, \function{search()} will scan forward through the | ||
|  | string, reporting the first match it finds. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print(re.search('super', 'superstition').span()) | ||
|  | (0, 5) | ||
|  | >>> print(re.search('super', 'insuperable').span()) | ||
|  | (2, 7) | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | Sometimes you'll be tempted to keep using \function{re.match()}, and | ||
|  | just add \regexp{.*} to the front of your RE.  Resist this temptation | ||
|  | and use \function{re.search()} instead.  The regular expression | ||
|  | compiler does some analysis of REs in order to speed up the process of | ||
|  | looking for a match.  One such analysis figures out what the first | ||
|  | character of a match must be; for example, a pattern starting with | ||
|  | \regexp{Crow} must match starting with a \character{C}.  The analysis | ||
|  | lets the engine quickly scan through the string looking for the | ||
|  | starting character, only trying the full match if a \character{C} is found. | ||
|  | 
 | ||
|  | Adding \regexp{.*} defeats this optimization: the engine has to scan to | ||
|  | the end of the string and then backtrack to find a match for the | ||
|  | rest of the RE. | ||
|  | 
 | ||
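Apart from speed, the two approaches also differ in what they report.
The sketch below, using made-up sample strings, shows that the
\regexp{.*} prefix becomes part of the match, and that it fails
outright when the text of interest sits after a newline, because
\regexp{.} does not match a newline by default:

```python
import re

# On a single line both calls succeed, but the reported span differs:
# the '.*' prefix is included in the match.
prefixed_span = re.match('.*super', 'insuperable').span()
search_span = re.search('super', 'insuperable').span()

# '.' does not match a newline by default, so the prefixed pattern
# cannot reach a match on the second line; search() still finds it.
two_lines = 'line one\ninsuperable'
prefixed_multi = re.match('.*super', two_lines)
search_multi = re.search('super', two_lines)
```

Here \code{prefixed_span} is \code{(0, 7)} and \code{search_span} is
\code{(2, 7)}; \code{prefixed_multi} is \code{None}, while
\code{search_multi} is a match.
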
|  | \subsection{Greedy versus Non-Greedy} | ||
|  | 
 | ||
|  | When repeating a regular expression, as in \regexp{a*}, the resulting | ||
|  | action is to consume as much of the string as possible.  This | ||
|  | fact often bites you when you're trying to match a pair of | ||
|  | balanced delimiters, such as the angle brackets surrounding an HTML | ||
|  | tag.  The na{\"\i}ve pattern for matching a single HTML tag doesn't | ||
|  | work because of the greedy nature of \regexp{.*}. | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> s = '<html><head><title>Title</title>' | ||
|  | >>> len(s) | ||
|  | 32 | ||
|  | >>> print(re.match('<.*>', s).span()) | ||
|  | (0, 32) | ||
|  | >>> print(re.match('<.*>', s).group()) | ||
|  | <html><head><title>Title</title> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
|  | The RE matches the \character{<} in \samp{<html>}, and the | ||
|  | \regexp{.*} consumes the rest of the string.  There's still more left | ||
|  | in the RE, though, and the \regexp{>} can't match at the end of | ||
|  | the string, so the regular expression engine has to backtrack | ||
|  | character by character until it finds a match for the \regexp{>}.   | ||
|  | The final match extends from the \character{<} in \samp{<html>} | ||
|  | to the \character{>} in \samp{</title>}, which isn't what you want. | ||
|  | 
 | ||
|  | In this case, the solution is to use the non-greedy qualifiers | ||
|  | \regexp{*?}, \regexp{+?}, \regexp{??}, or | ||
|  | \regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as | ||
|  | possible.  In the above example, the \character{>} is tried | ||
|  | immediately after the first \character{<} matches, and when it fails, | ||
|  | the engine advances a character at a time, retrying the \character{>} | ||
|  | at every step.  This produces just the right result: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | >>> print(re.match('<.*?>', s).group()) | ||
|  | <html> | ||
|  | \end{verbatim} | ||
|  | 
 | ||
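The same contrast can be seen across the whole string with
\function{findall()}.  The negated character class shown here is an
alternative way of stopping at the first \character{>}; it is offered
as a sketch rather than as a replacement for the non-greedy form:

```python
import re

s = '<html><head><title>Title</title>'

# Non-greedy: each match stops at the first '>' it can reach.
lazy_tags = re.findall('<.*?>', s)

# Alternative: a negated character class can never run past a '>'.
class_tags = re.findall('<[^>]*>', s)

# Greedy, for comparison: a single match spanning the whole string.
greedy_tags = re.findall('<.*>', s)
```

Both \code{lazy_tags} and \code{class_tags} hold the four individual
tags, while \code{greedy_tags} holds one item, the entire string.
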
|  | (Note that parsing HTML or XML with regular expressions is painful. | ||
|  | Quick-and-dirty patterns will handle common cases, but HTML and XML | ||
|  | have special cases that will break the obvious regular expression; by | ||
|  | the time you've written a regular expression that handles all of the | ||
|  | possible cases, the patterns will be \emph{very} complicated.  Use an | ||
|  | HTML or XML parser module for such tasks.) | ||
|  | 
 | ||
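As a rough sketch of the parser-based approach, the following uses
\class{HTMLParser} from Python 3's \module{html.parser} module.  The
\class{TagCollector} class and the sample markup are hypothetical,
written only for this illustration:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records the name of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag it recognizes.
        self.tags.append(tag)


collector = TagCollector()
# A '>' inside an attribute value would end a '<.*?>' match too early.
collector.feed('<html><head><title>Title</title><a title="a > b">x</a>')
collector.close()
```

After the call, \code{collector.tags} lists the start tags in document
order.
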
|  | \subsection{Not Using re.VERBOSE} | ||
|  | 
 | ||
|  | By now you've probably noticed that regular expressions are a very | ||
|  | compact notation, but they're not terribly readable.  REs of | ||
|  | moderate complexity can become lengthy collections of backslashes, | ||
|  | parentheses, and metacharacters, making them difficult to read and | ||
|  | understand.   | ||
|  | 
 | ||
|  | For such REs, specifying the \code{re.VERBOSE} flag when | ||
|  | compiling the regular expression can be helpful, because it allows | ||
|  | you to format the regular expression more clearly. | ||
|  | 
 | ||
|  | The \code{re.VERBOSE} flag has several effects.  Whitespace in the | ||
|  | regular expression that \emph{isn't} inside a character class is | ||
|  | ignored.  This means that an expression such as \regexp{dog | cat} is | ||
|  | equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]} | ||
|  | will still match the characters \character{a}, \character{b}, or a | ||
|  | space.  In addition, you can also put comments inside a RE; comments | ||
|  | extend from a \samp{\#} character to the next newline.  When used with | ||
|  | triple-quoted strings, this enables REs to be formatted more neatly: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | pat = re.compile(r""" | ||
|  |  \s*                 # Skip leading whitespace | ||
|  |  (?P<header>[^:]+)   # Header name | ||
|  |  \s* :               # Whitespace, and a colon | ||
|  |  (?P<value>.*?)      # The header's value -- *? used to | ||
|  |                      # lose the following trailing whitespace | ||
|  |  \s*$                # Trailing whitespace to end-of-line
 | ||
|  | """, re.VERBOSE) | ||
|  | \end{verbatim} | ||
|  | % $
 | ||
|  | 
 | ||
|  | This is far more readable than: | ||
|  | 
 | ||
|  | \begin{verbatim} | ||
|  | pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
 | ||
|  | \end{verbatim} | ||
|  | % $
 | ||
|  | 
 | ||
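To check that the two forms really behave the same, here is a small
sketch that applies both to a made-up header line.  It also exercises
the whitespace rules described above; the variable names are
illustrative only.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = 'Subject: Hello world   '   # made-up sample header line
m_verbose = pat.match(line)
m_compact = compact.match(line)

# Both forms capture the same groups; the value keeps its leading space
# because the pattern consumes no whitespace after the colon.
same_groups = m_verbose.groupdict() == m_compact.groupdict()

# Whitespace outside a character class is ignored; inside one it is kept.
spaced = re.compile('dog | cat', re.VERBOSE)
in_class = re.compile('[a b]', re.VERBOSE)
```

The \code{header} group is \code{'Subject'} and the \code{value} group
is \code{' Hello world'}, with the trailing spaces dropped by the
non-greedy \regexp{.*?}.
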
|  | \section{Feedback} | ||
|  | 
 | ||
|  | Regular expressions are a complicated topic.  Did this document help | ||
|  | you understand them?  Were there parts that were unclear, or problems | ||
|  | you encountered that weren't covered here?  If so, please send | ||
|  | suggestions for improvements to the author. | ||
|  | 
 | ||
|  | The most complete book on regular expressions is almost certainly | ||
|  | Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published | ||
|  | by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and | ||
|  | Java's flavours of regular expressions, and doesn't contain any Python | ||
|  | material at all, so it won't be useful as a reference for programming | ||
|  | in Python.  (The first edition covered Python's now-obsolete | ||
|  | \module{regex} module, which won't help you much.)  Consider checking | ||
|  | it out from your library. | ||
|  | 
 | ||
|  | \end{document} | ||
|  | 
 |