\section{\module{HTMLParser} ---
         Simple HTML and XHTML parser}

\declaremodule{standard}{HTMLParser}
\modulesynopsis{A simple parser that can handle HTML and XHTML.}

\versionadded{2.2}
					
						
This module defines a class \class{HTMLParser} which serves as the
basis for parsing text files formatted in HTML\index{HTML} (HyperText
Markup Language) and XHTML.\index{XHTML}  Unlike the parser in
\refmodule{htmllib}, this parser is not based on the SGML parser in
\refmodule{sgmllib}.
					
						
\begin{classdesc}{HTMLParser}{}
The \class{HTMLParser} class is instantiated without arguments.

An \class{HTMLParser} instance is fed HTML data and calls handler
methods when tags begin and end.  The \class{HTMLParser} class is meant
to be subclassed by the user, overriding the handler methods to provide
the desired behavior.

Unlike the parser in \refmodule{htmllib}, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
\end{classdesc}
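
The absence of implicit end-tag calls can be seen in a small sketch.
The \class{EventLogger} subclass and its \member{events} list are
illustrative names, not part of the module, and the fallback import
assumes the Python 3 location of the class, \module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class EventLogger(HTMLParser):
    """Hypothetical subclass that records tag events in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = EventLogger()
# The <li> elements are never closed explicitly.
parser.feed("<ul><li>one<li>two</ul>")
parser.close()

# No ("end", "li") event is recorded when </ul> closes the list.
print(parser.events)
```

An application that needs balanced events has to track open elements
itself, for example with a stack kept in the subclass.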
					
						
An exception is defined as well:

\begin{excdesc}{HTMLParseError}
Exception raised by the \class{HTMLParser} class when it encounters an
error while parsing.  This exception provides three attributes:
\member{msg} is a brief message explaining the error, \member{lineno}
is the number of the line on which the broken construct was detected,
and \member{offset} is the number of characters into the line at which
the construct starts.
\end{excdesc}
					
						
\class{HTMLParser} instances have the following methods:

\begin{methoddesc}{reset}{}
Reset the instance.  Any unprocessed data is lost.  This method is
called implicitly at instantiation time.
\end{methoddesc}
					
						
\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}
					
						
\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the \class{HTMLParser} base class
method \method{close()}.
\end{methoddesc}
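
The buffering performed by \method{feed()} can be illustrated by
splitting a tag across several calls.  This is a minimal sketch:
\class{TagLogger} and the \code{after_*} variables are hypothetical
names, and the fallback import assumes the Python 3 module
\module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class TagLogger(HTMLParser):
    """Hypothetical subclass that records each tag as it is reported."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


parser = TagLogger()

parser.feed("<di")            # incomplete start tag: buffered, nothing reported
after_partial = list(parser.tags)

parser.feed("v></di")         # completes <div>; the partial end tag is buffered
after_second = list(parser.tags)

parser.feed("v>")             # completes </div>
parser.close()                # process anything still buffered
after_close = list(parser.tags)

print(after_partial)
print(after_second)
print(after_close)
```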
					
						
\begin{methoddesc}{getpos}{}
Return the current line number and offset as a tuple.
\end{methoddesc}
					
						
\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag.  This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}
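
As a rough illustration, \method{getpos()} and
\method{get_starttag_text()} can both be called from inside a handler.
The \class{PositionLogger} class and its \member{records} list are
hypothetical, and the fallback import assumes the Python 3 module
\module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class PositionLogger(HTMLParser):
    """Hypothetical subclass recording where each start tag was found."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset); get_starttag_text() gives
        # the tag exactly as written, including case and spacing.
        self.records.append((tag, self.getpos(), self.get_starttag_text()))


parser = PositionLogger()
parser.feed('<p>\n  <A  HREF="x">')
parser.close()

for record in parser.records:
    print(record)
```

In this sketch line numbers start at 1 and offsets at 0, so the second
tag is reported at line 2, offset 2, with its original capitalization
and double space preserved in the start tag text.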
					
						
\begin{methoddesc}{handle_starttag}{tag, attrs}
This method is called to handle the start of a tag.  It is intended to
be overridden by a derived class; the base class implementation does
nothing.

The \var{tag} argument is the name of the tag converted to lower case.
The \var{attrs} argument is a list of \code{(\var{name}, \var{value})}
pairs containing the attributes found inside the tag's \code{<>}
brackets.  The \var{name} is translated to lower case; quotes are
removed from the \var{value}, and character and entity references in
it are replaced.  For instance, for the tag \code{<A
  HREF="http://www.cwi.nl/">}, this method would be called as
\samp{handle_starttag('a', [('href', 'http://www.cwi.nl/')])}.

\versionchanged[All entity references from htmlentitydefs are now
replaced in the attribute values]{2.6}
\end{methoddesc}
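
A common use of \method{handle_starttag()} is to pick out attribute
values.  The following sketch collects \code{href} values from anchor
tags; \class{LinkCollector} and \member{links} are illustrative names,
and the fallback import assumes the Python 3 module
\module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":                   # tag names arrive in lower case
            return
        for name, value in attrs:        # attrs is a list of (name, value)
            if name == "href":
                self.links.append(value)


parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
# The entity reference in the attribute value is replaced before the call.
parser.feed('<a href="search?a=1&amp;b=2">search</a>')
parser.close()

print(parser.links)
```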
					
						
\begin{methoddesc}{handle_startendtag}{tag, attrs}
Similar to \method{handle_starttag()}, but called when the parser
encounters an XHTML-style empty tag (\code{<a .../>}).  This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
\method{handle_starttag()} and \method{handle_endtag()}.
\end{methoddesc}
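
The difference between the default behavior and an override can be
sketched as follows.  Both classes and the \member{events} attribute
are hypothetical names, and the fallback import assumes the Python 3
module \module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class DefaultParser(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagParser(DefaultParser):
    """Hypothetical subclass that treats empty tags as one event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


source = "<br/><hr />"

default = DefaultParser()
default.feed(source)
default.close()

custom = EmptyTagParser()
custom.feed(source)
custom.close()

print(default.events)   # a start and an end event per empty tag
print(custom.events)    # one combined event per empty tag
```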
					
						
\begin{methoddesc}{handle_endtag}{tag}
This method is called to handle the end tag of an element.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.  The \var{tag} argument is the name of
the tag converted to lower case.
\end{methoddesc}
					
						
\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}
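
One way to use \method{handle_data()} is to gather the text between
tags.  In this sketch the \class{TextExtractor} class and its
\member{chunks} list are hypothetical, and the fallback import assumes
the Python 3 module \module{html.parser}.  Text may be delivered in
more than one call, so the pieces are joined only after
\method{close()}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class TextExtractor(HTMLParser):
    """Hypothetical subclass that keeps only the text between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextExtractor()
parser.feed("<p>Hello, <b>world</b>!</p>")
parser.close()

text = "".join(parser.chunks)
print(text)
```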
					
						
\begin{methoddesc}{handle_charref}{name}
This method is called to process a character reference of the form
\samp{\&\#\var{name};}.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}
					
						
\begin{methoddesc}{handle_entityref}{name}
This method is called to process a general entity reference of the
form \samp{\&\var{name};} where \var{name} is the name of a general
entity.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}
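
The two reference handlers can be compared in a short sketch.
\class{RefLogger} and \member{refs} are hypothetical names.  The code
also assumes that, when run under Python 3, the class lives in
\module{html.parser} and accepts a \code{convert_charrefs} argument
that must be set to false for these handlers to be called on text;
neither assumption applies to the module as described in this section.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class RefLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        try:
            # Assumption: newer Python 3 versions convert references in
            # text automatically unless this keyword argument is false.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # The class described here takes no arguments.
            HTMLParser.__init__(self)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("char", name))      # e.g. "62" for &#62;

    def handle_entityref(self, name):
        self.refs.append(("entity", name))    # e.g. "lt" for &lt;


parser = RefLogger()
parser.feed("a &lt; b &#62; c")
parser.close()

print(parser.refs)
```

In both cases the handler receives only the text between the leading
\samp{\&} or \samp{\&\#} and the closing semicolon.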
					
						
\begin{methoddesc}{handle_comment}{data}
This method is called when a comment is encountered.  The
\var{data} argument is a string containing the text between the
\samp{--} and \samp{--} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}
					
						
\begin{methoddesc}{handle_decl}{decl}
Method called when an SGML declaration is read by the parser.  The
\var{decl} parameter will be the entire contents of the declaration
inside the \code{<!}...\code{>} markup.  It is intended to be overridden
by a derived class; the base class implementation does nothing.
\end{methoddesc}
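
Comments and declarations can be captured in the same way as tags.  In
this sketch \class{MarkupLogger}, \member{comments} and \member{decls}
are hypothetical names, and the fallback import assumes the Python 3
module \module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class MarkupLogger(HTMLParser):
    """Hypothetical subclass that records comments and declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []
        self.decls = []

    def handle_comment(self, data):
        self.comments.append(data)       # text without the delimiters

    def handle_decl(self, decl):
        self.decls.append(decl)          # contents between "<!" and ">"


parser = MarkupLogger()
parser.feed("<!DOCTYPE html><!--text--><p>")
parser.close()

print(parser.decls)
print(parser.comments)
```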
					
						
\begin{methoddesc}{handle_pi}{data}
Method called when a processing instruction is encountered.  The
\var{data} parameter will contain the entire processing instruction.
For example, for the processing instruction \code{<?proc color='red'>},
this method would be called as \code{handle_pi("proc color='red'")}.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.

\note{The \class{HTMLParser} class uses the SGML syntactic rules for
processing instructions.  An XHTML processing instruction using the
trailing \character{?} will cause the \character{?} to be included in
\var{data}.}
\end{methoddesc}
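
The effect described in the note can be checked with a short sketch
that feeds one SGML-style and one XHTML-style processing instruction.
\class{PILogger} and \member{instructions} are hypothetical names, and
the fallback import assumes the Python 3 module \module{html.parser}.

```python
try:
    from HTMLParser import HTMLParser    # module name used in this section
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location


class PILogger(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PILogger()
parser.feed("<?proc color='red'>")      # SGML style, closed by ">"
parser.feed("<?xml version='1.0'?>")    # XHTML style, closed by "?>"
parser.close()

# The second entry keeps its trailing "?".
print(parser.instructions)
```

A subclass that handles XHTML input may want to strip a trailing
\character{?} from \var{data} itself.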
					
						
\subsection{Example HTML Parser Application \label{htmlparser-example}}

As a basic example, below is a simple HTML parser that uses the
\class{HTMLParser} class to print out tags as they are encountered:
					
						
\begin{verbatim}
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    # Called for each start tag; tag is the lower-cased tag name.
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    # Called for each end tag; tag is the lower-cased tag name.
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
\end{verbatim}