mirror of
				https://github.com/python/cpython.git
				synced 2025-10-31 05:31:20 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			159 lines
		
	
	
	
		
			6.4 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
			
		
		
	
	
			159 lines
		
	
	
	
		
			6.4 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
| \section{\module{htmllib} ---
 | |
|          A parser for HTML documents}
 | |
| 
 | |
| \declaremodule{standard}{htmllib}
 | |
| \modulesynopsis{A parser for HTML documents.}
 | |
| 
 | |
| \index{HTML}
 | |
| \index{hypertext}
 | |
| 
 | |
| 
 | |
| This module defines a class which can serve as a base for parsing text
 | |
| files formatted in the HyperText Mark-up Language (HTML).  The class
 | |
| is not directly concerned with I/O --- it must be provided with input
 | |
| in string form via a method, and makes calls to methods of a
 | |
| ``formatter'' object in order to produce output.  The
 | |
| \class{HTMLParser} class is designed to be used as a base class for
 | |
| other classes in order to add functionality, and allows most of its
 | |
| methods to be extended or overridden.  In turn, this class is derived
 | |
| from and extends the \class{SGMLParser} class defined in module
 | |
| \refmodule{sgmllib}\refstmodindex{sgmllib}.  The \class{HTMLParser}
 | |
| implementation supports the HTML 2.0 language as described in
 | |
| \rfc{1866}.  Two implementations of formatter objects are provided in
 | |
| the \refmodule{formatter}\refstmodindex{formatter} module; refer to the
 | |
| documentation for that module for information on the formatter
 | |
| interface.
 | |
| \withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}
 | |
| 
 | |
| The following is a summary of the interface defined by
 | |
| \class{sgmllib.SGMLParser}:
 | |
| 
 | |
| \begin{itemize}
 | |
| 
 | |
| \item
 | |
| The interface to feed data to an instance is through the \method{feed()}
 | |
| method, which takes a string argument.  This can be called with as
 | |
| little or as much text at a time as desired; \samp{p.feed(a);
 | |
| p.feed(b)} has the same effect as \samp{p.feed(a+b)}.  When the data
 | |
| contains complete HTML tags, these are processed immediately;
 | |
| incomplete elements are saved in a buffer.  To force processing of all
 | |
| unprocessed data, call the \method{close()} method.
 | |
| 
 | |
| For example, to parse the entire contents of a file, use:
 | |
| \begin{verbatim}
 | |
| parser.feed(open('myfile.html').read())
 | |
| parser.close()
 | |
| \end{verbatim}
 | |
| 
 | |
| \item
 | |
| The interface to define semantics for HTML tags is very simple: derive
 | |
| a class and define methods called \method{start_\var{tag}()},
 | |
| \method{end_\var{tag}()}, or \method{do_\var{tag}()}.  The parser will
 | |
| call these at appropriate moments: \method{start_\var{tag}} or
 | |
| \method{do_\var{tag}()} is called when an opening tag of the form
 | |
| \code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called
 | |
| when a closing tag of the form \code{<\var{tag}>} is encountered.  If
 | |
| an opening tag requires a corresponding closing tag, like \code{<H1>}
 | |
| ... \code{</H1>}, the class should define the \method{start_\var{tag}()}
 | |
| method; if a tag requires no closing tag, like \code{<P>}, the class
 | |
| should define the \method{do_\var{tag}()} method.
 | |
| 
 | |
| \end{itemize}
 | |
| 
 | |
| The module defines a single class:
 | |
| 
 | |
| \begin{classdesc}{HTMLParser}{formatter}
 | |
| This is the basic HTML parser class.  It supports all entity names
 | |
| required by the HTML 2.0 specification (\rfc{1866}).  It also defines
 | |
| handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
 | |
| \end{classdesc}
 | |
| 
 | |
| 
 | |
| \begin{seealso}
 | |
|   \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly
 | |
|                          lower-level view of the input, but is
 | |
|                          designed to work with XHTML, and does not
 | |
|                          implement some of the SGML syntax not used in
 | |
|                          ``HTML as deployed'' and which isn't legal
 | |
|                          for XHTML.}
 | |
|   \seemodule{htmlentitydefs}{Definition of replacement text for HTML
 | |
|                              2.0 entities.}
 | |
|   \seemodule{sgmllib}{Base class for \class{HTMLParser}.}
 | |
| \end{seealso}
 | |
| 
 | |
| 
 | |
| \subsection{HTMLParser Objects \label{html-parser-objects}}
 | |
| 
 | |
| In addition to tag methods, the \class{HTMLParser} class provides some
 | |
| additional methods and instance variables for use within tag methods.
 | |
| 
 | |
| \begin{memberdesc}{formatter}
 | |
| This is the formatter instance associated with the parser.
 | |
| \end{memberdesc}
 | |
| 
 | |
| \begin{memberdesc}{nofill}
 | |
| Boolean flag which should be true when whitespace should not be
 | |
| collapsed, or false when it should be.  In general, this should only
 | |
| be true when character data is to be treated as ``preformatted'' text,
 | |
| as within a \code{<PRE>} element.  The default value is false.  This
 | |
| affects the operation of \method{handle_data()} and \method{save_end()}.
 | |
| \end{memberdesc}
 | |
| 
 | |
| 
 | |
| \begin{methoddesc}{anchor_bgn}{href, name, type}
 | |
| This method is called at the start of an anchor region.  The arguments
 | |
| correspond to the attributes of the \code{<A>} tag with the same
 | |
| names.  The default implementation maintains a list of hyperlinks
 | |
| (defined by the \code{HREF} attribute for \code{<A>} tags) within the
 | |
| document.  The list of hyperlinks is available as the data attribute
 | |
| \member{anchorlist}.
 | |
| \end{methoddesc}
 | |
| 
 | |
| \begin{methoddesc}{anchor_end}{}
 | |
| This method is called at the end of an anchor region.  The default
 | |
| implementation adds a textual footnote marker using an index into the
 | |
| list of hyperlinks created by \method{anchor_bgn()}.
 | |
| \end{methoddesc}
 | |
| 
 | |
| \begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{, align\optional{, width\optional{, height}}}}}
 | |
| This method is called to handle images.  The default implementation
 | |
| simply passes the \var{alt} value to the \method{handle_data()}
 | |
| method.
 | |
| \end{methoddesc}
 | |
| 
 | |
| \begin{methoddesc}{save_bgn}{}
 | |
| Begins saving character data in a buffer instead of sending it to the
 | |
| formatter object.  Retrieve the stored data via \method{save_end()}.
 | |
| Use of the \method{save_bgn()} / \method{save_end()} pair may not be
 | |
| nested.
 | |
| \end{methoddesc}
 | |
| 
 | |
| \begin{methoddesc}{save_end}{}
 | |
| Ends buffering character data and returns all data saved since the
 | |
| preceding call to \method{save_bgn()}.  If the \member{nofill} flag is
 | |
| false, whitespace is collapsed to single spaces.  A call to this
 | |
| method without a preceding call to \method{save_bgn()} will raise a
 | |
| \exception{TypeError} exception.
 | |
| \end{methoddesc}
 | |
| 
 | |
| 
 | |
| 
 | |
| \section{\module{htmlentitydefs} ---
 | |
|          Definitions of HTML general entities}
 | |
| 
 | |
| \declaremodule{standard}{htmlentitydefs}
 | |
| \modulesynopsis{Definitions of HTML general entities.}
 | |
| \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
 | |
| 
 | |
| This module defines a single dictionary, \code{entitydefs}, which is
 | |
| used by the \refmodule{htmllib} module to provide the
 | |
| \member{entitydefs} member of the \class{HTMLParser} class.  The
 | |
| definition provided here contains all the entities defined by HTML 2.0 
 | |
| that can be handled using simple textual substitution in the Latin-1
 | |
| character set (ISO-8859-1).
 | |
| 
 | |
| 
 | |
| \begin{datadesc}{entitydefs}
 | |
|   A dictionary mapping HTML 2.0 entity definitions to their
 | |
|   replacement text in ISO Latin-1.
 | |
| \end{datadesc}
 | 
