\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs.  It is always available.

By default the patterns are Emacs-style regular expressions
(with one exception).  There is
a way to change the syntax to match that of several well-known
\UNIX{} utilities.  The exception is that Emacs' \samp{\e s}
pattern is not supported, since the original implementation references
the Emacs syntax tables.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} There is a little-known fact about Python string
literals which means that you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions.  This
is because Python doesn't remove backslashes from string literals if
they are followed by an unrecognized escape character.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it.  E.g.\  to extract \LaTeX\ \samp{\e section\{{\rm
\ldots}\}} headers from a document, you can use this pattern:
\code{'\e \e \e \e section\{\e (.*\e )\}'}.  \emph{Another exception:}
the escape sequence \samp{\e b} is significant in string literals
(where it means the ASCII backspace character) as well as in Emacs regular
expressions (where it stands for a word boundary), so in order to
search for a word boundary, you should use the pattern \code{'\e \e b'}.
Similarly, a backslash followed by a digit 0-7 should be doubled to
avoid interpretation as an octal escape.

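The backslash rules in the note above can be checked with plain Python
string literals, without any regular expression module.  The following
is a minimal sketch; the variable names are chosen for illustration
only.

```python
# Plain Python string-literal behaviour; no regex module is involved.

# '\b' is ONE character: the ASCII backspace (code 8).
backspace = '\b'

# Doubling the backslash yields the two characters that a regular
# expression engine needs to see for a word boundary.
word_boundary = '\\b'

# Four backslashes in the literal yield two in the string, which a
# regular expression engine then reads as one literal backslash.
literal_backslash = '\\\\'

# A backslash followed by an octal digit is an octal escape ...
octal_one = '\1'
# ... so a reference to group 1 needs the doubled form.
backref_one = '\\1'

print(len(backspace), ord(backspace))
print(len(word_boundary), len(literal_backslash))
print(ord(octal_one), len(backref_one))
```

Only escapes that Python recognizes are used in the sketch, so it does
not depend on how unrecognized escapes are treated.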
\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB.  Thus, complex expressions can easily be constructed
from simpler ones like the primitives described here.  For details of
the theory and implementation of regular expressions, consult almost
any textbook about compiler construction.

% XXX The reference could be made more specific, say to 
% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho, 
% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.   

A brief explanation of the format of regular expressions follows.

Regular expressions can contain both special and ordinary characters.
Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
the simplest regular expressions; they simply match themselves.  You
can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'.  (In the rest of this section, we'll write REs in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}]{(Dot.)  Matches any character except a newline.}
\item[\code{\^}]{(Caret.)  Matches the start of the string.}
\item[\code{\$}]{Matches the end of the string.  
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.}
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE.  \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \code{ab?} will
match either 'a' or 'ab'.

\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.  Remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.  

\item[\code{[]}] Used to indicate a set of characters.  Characters can
be listed individually, or a range is indicated by giving two
characters and separating them by a '-'.  Special characters are
not active inside sets.  For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
match any lowercase letter.  

If you want to include a \code{]} inside a
set, it must be the first character of the set; to include a \code{-},
place it as the first or last character. 

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.  
\end{itemize}
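The repetition operators, sets and the end-of-string anchor listed
above can be tried out with a short sketch.  It assumes the later
\code{re} module as a stand-in, since the \code{regex} module may not
be available in the Python used to run it; only operators that both
syntaxes share are used.  The helper \code{matches_whole} is
hypothetical and not part of either module.

```python
import re

# ASSUMPTION: "re" stands in for "regex"; the operators . * + ? [] $
# used below are written the same way in both syntaxes.

def matches_whole(pattern, string):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, string) is not None

candidates = ('a', 'ab', 'abbb', 'b')

# ab* : 'a' followed by zero or more 'b's
star = [s for s in candidates if matches_whole('ab*', s)]
# ab+ : 'a' followed by one or more 'b's, so plain 'a' is rejected
plus = [s for s in candidates if matches_whole('ab+', s)]
# ab? : 'a' or 'ab'
optional = [s for s in candidates if matches_whole('ab?', s)]

# '$' is not special inside a set
in_set = [c for c in 'akm$z' if matches_whole('[akm$]', c)]

# foo matches inside 'foobar'; foo$ only matches when 'foo' ends the string
unanchored = [s for s in ('foo', 'foobar') if re.search('foo', s)]
anchored = [s for s in ('foo', 'foobar') if re.search('foo$', s)]

print(star, plus, optional, in_set, unanchored, anchored)
```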

The special sequences consist of '\code{\e}' and a character
from the list below.  If the ordinary character is not on the list,
then the resulting RE will match the second character.  For example,
\code{\e\$} matches the character '\$'.  Ones where the backslash
should be doubled are indicated.

\begin{itemize}
\item[\code{\e|}]\code{A\e|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.  This can
be used inside groups (see below) as well.
%
\item[\code{\e( \e)}]{Indicates the start and end of a group; the
contents of a group can be matched later in the string with the
\code{\e [1-9]} special sequence, described next.}
%
{\fulllineitems\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
{Matches the contents of the group of the same
number.  For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
'55 55', but not 'the end' (note the space after the group).  This
special sequence can only be used to match one of the first 9 groups;
groups with higher numbers can be matched using the \code{\e v}
sequence.  (\code{\e 8} and \code{\e 9} don't need a double backslash
because they are not octal digits.)}}
%
\item[\code{\e \e b}]{Matches the empty string, but only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.}
%
\item[\code{\e B}]{Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.} 
%
\item[\code{\e v}]{Must be followed by a two digit decimal number, and
matches the contents of the group of the same number.  The group
number must be between 1 and 99, inclusive.}
%
\item[\code{\e w}]Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9]}.
%
\item[\code{\e W}]{Matches any non-alphanumeric character; this is
equivalent to the set \code{[\^a-zA-Z0-9]}.} 
\item[\code{\e <}]{Matches the empty string, but only at the beginning of a
word.  A word is defined as a sequence of alphanumeric characters, so
the end of a word is indicated by whitespace or a non-alphanumeric 
character.}
\item[\code{\e >}]{Matches the empty string, but only at the end of a
word.}

\item[\code{\e \e \e \e}]{Matches a literal backslash.}

% In Emacs, the following two are start of buffer/end of buffer.  In
% Python they seem to be synonyms for ^$.
\item[\code{\e `}]{Like \code{\^}, this only matches at the start of the
string.}
\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of the
string.
% end of buffer
\end{itemize}

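Group references and word boundaries from the list above can be
illustrated as follows.  This is a sketch under an assumption: it uses
the later \code{re} module, whose syntax writes groups as plain
parentheses rather than the backslashed form used by the default
\code{regex} syntax.  The test strings are the ones from the text.

```python
import re

# ASSUMPTION: "re" syntax is used, so the group is written ( ) here.
# The pattern says: some text, a space, then the same text again.
repeated = re.compile('(.+) \\1')

# Map each candidate string to whether the whole string matches.
results = {}
for candidate in ('the the', '55 55', 'the end'):
    results[candidate] = repeated.fullmatch(candidate) is not None

# '\\b' in the literal gives backslash + 'b', i.e. a word boundary.
whole_word = re.compile('\\bcat\\b')
found_word = whole_word.search('a cat sat') is not None
found_inside = whole_word.search('concatenate') is not None

print(results, found_word, found_inside)
```

Note that the doubled backslashes in the literals follow the rule given
at the start of this section for \code{\e 1} and \code{\e b}.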
\subsection{Module Contents}

The module defines these functions, and an exception:

\renewcommand{\indexsubitem}{(in module regex)}

\begin{funcdesc}{match}{pattern\, string}
  Return how many characters at the beginning of \var{string} match
  the regular expression \var{pattern}.  Return \code{-1} if the
  string does not match the pattern (this is different from a
  zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
  Return the first position in \var{string} that matches the regular
  expression \var{pattern}.  Return \code{-1} if no position in the string
  matches the pattern (this is different from a zero-length match
  anywhere!).
\end{funcdesc}

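The difference between a failed match (\code{-1}) and a zero-length
match (\code{0}) is easy to overlook.  The sketch below emulates the
two return conventions on top of the later \code{re} module, which is
an assumption made so that the example can be run where \code{regex}
is unavailable; \code{match_length} and \code{search_position} are
hypothetical helper names, not functions of either module.

```python
import re

# ASSUMPTION: these helpers only imitate the return conventions
# described above; they are not the regex module's implementation.

def match_length(pattern, string):
    """Number of characters matched at the start of string, or -1."""
    found = re.match(pattern, string)
    return found.end() if found else -1

def search_position(pattern, string):
    """Index of the first match in string, or -1 if there is none."""
    found = re.search(pattern, string)
    return found.start() if found else -1

matched_four = match_length('ab*', 'abbbc')   # 'abbb' is matched
matched_zero = match_length('b*', 'abc')      # empty match at index 0
no_match = match_length('b+', 'abc')          # nothing matches at index 0
first_b = search_position('b+', 'abc')        # 'b' found at index 1
not_found = search_position('x', 'abc')       # no 'x' anywhere

print(matched_four, matched_zero, no_match, first_b, not_found)
```

Because \code{0} is a legitimate result, callers should compare the
return value against \code{-1} rather than test it for truth.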
\begin{funcdesc}{compile}{pattern\optional{\, translate}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below.  The optional argument
  \var{translate}, if present, must be a 256-character string
  indicating how characters (both of the pattern and of the strings to
  be matched) are translated before comparing them; the \code{i}-th
  element of the string gives the translation for the character with
  \ASCII{} code \code{i}.  This can be used to implement
  case-insensitive matching; see the \code{casefold} data item below.

  The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.  (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
  Set the syntax to be used by future calls to \code{compile},
  \code{match} and \code{search}.  (Already compiled expression objects
  are not affected.)  The argument is an integer which is the OR of
  several flag bits.  The return value is the previous value of
  the syntax flags.  Names for the flags are defined in the standard
  module \code{regex_syntax}; read the file \file{regex_syntax.py} for
  more information.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{\, translate}}
This is like \code{compile}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angular
brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}.  Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}

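The idea of symbolic group names can be sketched with the later
\code{re} module.  This is an assumption made for the sake of a
runnable example: \code{re} has no \code{symcomp} function and spells
a named group differently from the angular-bracket form described
above, but it offers the same lookup by name.

```python
import re

# ASSUMPTION: "re" is used instead of regex.symcomp.  In "re" syntax a
# named group is written (?P<name>...); the name "id" and the inner
# pattern are the ones from the text above.
prog = re.compile('(?P<id>[a-z][a-z0-9]*)')

found = prog.match('x25 = 1')

by_name = found.group('id')       # look the group up by its name
by_number = found.group(1)        # the same group, by its number
name_to_index = dict(prog.groupindex)

print(by_name, by_number, name_to_index)
```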
\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching.  (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as \var{translate} argument to
\code{compile} to map all upper case characters to their lowercase
equivalents.
\end{datadesc}

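To make the shape of a \var{translate} string concrete, here is a
sketch of how a 256-character table that folds upper case to lower
case could be built.  It is an illustration only: the helper names are
hypothetical, the actual contents of \code{casefold} are not reproduced
here, and the sketch assumes that only the \ASCII{} letters are folded.

```python
import string

def make_casefold_table():
    """Return a 256-character string whose i-th character is the
    translation for the character with code i."""
    # Start from the identity mapping: every character maps to itself.
    chars = [chr(code) for code in range(256)]
    # ASSUMPTION: only the ASCII letters A-Z are folded to a-z.
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase):
        chars[ord(upper)] = lower
    return ''.join(chars)

def fold(text, table):
    """Translate text character by character (codes below 256 only)."""
    return ''.join(table[ord(char)] for char in text)

table = make_casefold_table()
folded = fold('Hello World 42', table)

print(len(table), folded)
```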
\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\optional{\, pos}}
  Return how many characters at the beginning of \var{string} match
  the compiled regular expression.  Return \code{-1} if the string
  does not match the pattern (this is different from a zero-length
  match!).

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}.  This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}

\begin{funcdesc}{search}{string\optional{\, pos}}
  Return the first position in \var{string} that matches the compiled
  regular expression.  Return \code{-1} if no position in the
  string matches the pattern (this is different from a zero-length
  match anywhere!).

  The optional second parameter has the same meaning as for the
  \code{match} method.
\end{funcdesc}

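The remark that \var{pos} is not equivalent to slicing can be
demonstrated with a caret-anchored pattern.  The sketch assumes the
later \code{re} module, whose compiled objects also accept a
\var{pos} argument; unlike the methods described above, its
\code{search} returns a match object or \code{None} rather than an
index.

```python
import re

# ASSUMPTION: "re" stands in for "regex"; its search() returns a match
# object or None instead of an index or -1.
anchored = re.compile('^b')

# Starting at index 1 does not move the "real" beginning of the
# string, so the caret cannot match in front of the 'b'.
with_pos = anchored.search('ab', 1)

# Slicing creates a new string that really does begin with 'b'.
with_slice = anchored.search('ab'[1:])

# Without the caret, pos simply skips the earlier characters and the
# reported index still refers to the original string.
plain = re.compile('b')
index_in_original = plain.search('ab', 1).start()

print(with_pos, with_slice is not None, index_in_original)
```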
\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match.  It returns one or more
groups of the match.  If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument.  If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching the
corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e (} and \code{\e )}).  If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp} instead of
\code{compile}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}

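The single-argument and multiple-argument forms of \code{group} can be
sketched as follows.  The example assumes the later \code{re} module,
where \code{group} is a method of the match object rather than of the
compiled expression; the pattern and the input string are made up for
illustration.

```python
import re

# ASSUMPTION: "re" is used, so group() is called on the match object
# and groups are written with plain parentheses.
prog = re.compile('([a-z]+)@([a-z]+)(:[0-9]+)?')
found = prog.match('user@host')

whole = found.group(0)        # index 0: the entire matching string
first = found.group(1)        # a single index gives a single string
pair = found.group(1, 2)      # several indices give a tuple
unused = found.group(3)       # the optional group took no part

print(whole, first, pair, unused)
```

In \code{re}, \code{None} is returned for a group that exists in the
pattern but did not take part in the match, while an index for which
the pattern has no group at all raises \code{IndexError}.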
\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern.  Indices
are relative to the string argument passed to \code{match} or
\code{search}.  The 0-th tuple gives the beginning and end of the
whole pattern.  When the last match or search failed, this is
\code{None}.
\end{datadesc}

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method.  When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object.  If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile} or
\code{symcomp}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp}.  Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indices for regular expressions compiled with \code{symcomp}.
\code{None} otherwise.
\end{datadesc}
