:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser()

   The :class:`HTMLParser` class is instantiated without arguments.

   An :class:`HTMLParser` instance is fed HTML data and calls handler methods
   when tags begin and end.  The :class:`HTMLParser` class is meant to be
   subclassed by the user, who overrides the handler methods to provide the
   desired behavior.

   This parser does not check that end tags match start tags, nor does it call
   the end-tag handler for elements which are closed implicitly by closing an
   outer element.

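The lack of tag matching can be seen in a minimal sketch. The ``TagLogger``
subclass below is hypothetical (it is not part of the module) and simply
records handler calls in order; the unclosed ``<i>`` element never produces an
end-tag call.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records start and end tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
# <i> is never closed explicitly; closing <b> does not call handle_endtag('i').
logger.feed("<b><i>text</b>")
logger.close()
print(logger.events)
```

An application that needs balanced tags has to track the open elements itself,
for example with a stack kept in the subclass.
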
An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on which
   the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.reset()

   Reset the instance.  Any unprocessed data is lost.  This is called implicitly
   at instantiation time.


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


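The buffering done by :meth:`feed` can be sketched by splitting the input in
the middle of a start tag. ``EventCollector`` is a hypothetical helper class
used only for this illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper that records every handler call."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


collector = EventCollector()
collector.feed('<p cla')             # incomplete start tag: nothing is reported yet
events_after_first_chunk = list(collector.events)
collector.feed('ss="x">hi</p>')      # the rest arrives and the buffered tag is completed
collector.close()
print(events_after_first_chunk)
print(collector.events)
```

This is what makes it possible to feed data as it arrives, for example in
fixed-size chunks read from a file or socket, without aligning the chunks to
tag boundaries.
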
.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


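A redefined :meth:`close` might look like the following sketch. The
``CountingParser`` class and its ``tag_count`` and ``finished`` attributes are
hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical subclass that adds end-of-input processing."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def close(self):
        # Let the base class process any buffered data first.
        super().close()
        # Additional end-of-input processing goes after the base call.
        self.finished = True


parser = CountingParser()
parser.feed("<ul><li>one</li><li>two</li></ul>")
parser.close()
print(parser.tag_count, parser.finished)
```

Calling the base class method first ensures that any handler calls triggered by
the remaining buffered data happen before the subclass finalizes its state.
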
.. method:: HTMLParser.getpos()

   Return the current line number and offset.


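As an illustration, :meth:`getpos` can be called from inside a handler to note
where each tag starts. ``PositionRecorder`` is a hypothetical helper; in the
run sketched here, line numbers are counted from 1 and offsets from 0.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed("<html>\n  <body>")
recorder.close()
print(recorder.positions)
```
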
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


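The difference between the normalized arguments and the raw tag text can be
sketched as follows. ``RawTagRecorder`` is a hypothetical helper class.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical helper that keeps the source text of each start tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased, while get_starttag_text() returns the tag
        # exactly as it appeared in the input, including extra whitespace.
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="http://www.cwi.nl/">')
recorder.close()
print(recorder.raw_tags)
```
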
.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case; in
   each *value*, the enclosing quotes are removed and character and entity
   references are replaced.  For instance, for the tag
   ``<A HREF="http://www.cwi.nl/">``, this method would be called as
   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


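A common use of :meth:`handle_starttag` is collecting attribute values. The
sketch below assumes a hypothetical ``LinkCollector`` class and a made-up
second link; it shows the lower-cased names and the replaced entity reference
in the attribute value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; dict() eases lookup.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
               '<a href="/search?a=1&amp;b=2">search</a>')
collector.close()
print(collector.links)
```

Converting *attrs* to a dictionary discards duplicate attribute names; code
that needs every occurrence should iterate over the list instead.
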
.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<a .../>``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


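The following sketch overrides :meth:`handle_startendtag` to note which tags
were written in the empty form, then delegates to the default implementation so
the usual start and end handlers still run. ``EmptyTagLogger`` is a
hypothetical class.

```python
from html.parser import HTMLParser


class EmptyTagLogger(HTMLParser):
    """Hypothetical helper that notes XHTML-style empty tags."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.empty_tags = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # The default implementation calls handle_starttag and handle_endtag.
        super().handle_startendtag(tag, attrs)


logger = EmptyTagLogger()
logger.feed("<p>line one<br />line two</p>")
logger.close()
print(logger.empty_tags)
print(logger.events)
```
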
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.  The
   *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


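A small sketch of :meth:`handle_data` in use: the hypothetical
``TextExtractor`` class below gathers the text between tags. The text may
arrive in several calls, so the pieces are collected in a list and joined at
the end.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text content of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed("<p>Hello, <b>world</b>!</p>")
extractor.close()
text = "".join(extractor.chunks)
print(text)
```
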
.. method:: HTMLParser.handle_charref(name)

   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a general entity reference of the form
   ``&name;`` where *name* is the name of a general entity.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


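The two reference handlers can be combined to decode references found in text.
The sketch below rests on an assumption that goes beyond this page: on Python
3.4 and later the constructor accepts a ``convert_charrefs`` argument, and
references in text reach these handlers only when it is false.
``ReferenceDecoder`` is a hypothetical class.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ReferenceDecoder(HTMLParser):
    """Hypothetical helper that turns references in text into characters."""

    def __init__(self):
        # Assumption: Python 3.4+, where convert_charrefs=False is needed
        # for references in text to be passed to the handlers below.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is e.g. '62' (decimal) or 'x26' (hexadecimal).
        if name[0] in "xX":
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.parts.append(chr(codepoint))

    def handle_entityref(self, name):
        if name in name2codepoint:
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown entity names are passed through unchanged.
            self.parts.append("&" + name + ";")


decoder = ReferenceDecoder()
decoder.feed("1 &lt; 2 &#62; 0 &#x26; done")
decoder.close()
text = "".join(decoder.parts)
print(text)
```
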
.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered.  The *data* argument is
   a string containing the text between the ``--`` and ``--`` delimiters, but not
   the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
   this method to be called with the argument ``'text'``.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


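As a brief sketch, a hypothetical ``CommentCollector`` class can gather comment
text. Whitespace inside the comment is passed through unchanged.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper that records the text of each comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<p>visible</p><!--text--><!-- note to self -->")
collector.close()
print(collector.comments)
```
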
.. method:: HTMLParser.handle_decl(decl)

   Method called when an SGML ``doctype`` declaration is read by the parser.
   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup.  It is intended to be overridden by a derived class;
   the base class implementation does nothing.


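For illustration, the hypothetical ``DoctypeRecorder`` class below stores the
declaration text it receives, which excludes the surrounding ``<!`` and ``>``.

```python
from html.parser import HTMLParser


class DoctypeRecorder(HTMLParser):
    """Hypothetical helper that records doctype declarations."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_decl(self, decl):
        self.declarations.append(decl)


recorder = DoctypeRecorder()
recorder.feed("<!DOCTYPE html><html></html>")
recorder.close()
print(recorder.declarations)
```
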
.. method:: HTMLParser.unknown_decl(data)

   Method called when an unrecognized SGML declaration is read by the parser.
   The *data* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup.  It is sometimes useful to override this method in a
   derived class; the base class implementation raises an :exc:`HTMLParseError`.


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


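Both forms of processing instruction can be compared in a short sketch. The
``PIRecorder`` class is hypothetical; the second input is an XHTML-style
instruction, whose trailing ``?`` ends up in *data*.

```python
from html.parser import HTMLParser


class PIRecorder(HTMLParser):
    """Hypothetical helper that records processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PIRecorder()
recorder.feed("<?proc color='red'>")       # SGML style
recorder.feed('<?xml version="1.0"?>')     # XHTML style, with a trailing '?'
recorder.close()
print(recorder.instructions)
```

A subclass that wants the XHTML form without the final ``?`` has to strip it
itself, for example by checking ``data.endswith('?')``.
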
.. _htmlparser-example:

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out tags as they are encountered::

   >>> from html.parser import HTMLParser
   >>>
   >>> class MyHTMLParser(HTMLParser):
   ...     def handle_starttag(self, tag, attrs):
   ...         print("Encountered a {} start tag".format(tag))
   ...     def handle_endtag(self, tag):
   ...         print("Encountered a {} end tag".format(tag))
   ...
   >>> page = """<html><h1>Title</h1><p>I'm a paragraph!</p></html>"""
   >>>
   >>> myparser = MyHTMLParser()
   >>> myparser.feed(page)
   Encountered a html start tag
   Encountered a h1 start tag
   Encountered a h1 end tag
   Encountered a p start tag
   Encountered a p end tag
   Encountered a html end tag