:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
					
						
.. class:: HTMLParser(strict=False)

   Create a parser instance.  If *strict* is ``False`` (the default), the parser
   will accept and parse invalid markup.  If *strict* is ``True``, the parser
   will instead raise an :exc:`~html.parser.HTMLParseError` exception [#]_ when
   it is not able to parse the markup.
   The use of ``strict=True`` is discouraged and the *strict* argument is
   deprecated.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags, nor does it call
   the end-tag handler for elements which are closed implicitly by closing an
   outer element.

   .. versionchanged:: 3.2
      *strict* keyword added.

   .. deprecated-removed:: 3.3 3.5
      The *strict* argument and the strict mode have been deprecated.
      The parser is now able to accept and parse invalid markup too.
					
						
An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``.  This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.

   .. deprecated-removed:: 3.3 3.5
      This exception has been deprecated because it's never raised by the parser
      (when the default non-strict mode is used).
					
						
Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data  :", data)

   # The deprecated *strict* argument is omitted; non-strict is the default.
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html
					
						
:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.
					
						
.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

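
   A minimal sketch of such an override, assuming a hypothetical
   ``TextCollector`` subclass (the name and its ``chunks`` and ``text``
   attributes are illustrative, not part of this module).  The parser is
   created without arguments, so the default non-strict mode applies::

      from html.parser import HTMLParser

      class TextCollector(HTMLParser):
          """Hypothetical subclass that joins all text once input ends."""

          def __init__(self):
              super().__init__()
              self.chunks = []   # text fragments in document order
              self.text = None   # set by close()

          def handle_data(self, data):
              self.chunks.append(data)

          def close(self):
              # Let the base class process anything still buffered first.
              super().close()
              self.text = ''.join(self.chunks)

      collector = TextCollector()
      collector.feed('<p>Hello, ')
      collector.feed('world')
      collector.close()
      print(collector.text)   # Hello, world

   Calling the base class method first means that any text still held in the
   buffer has reached :meth:`handle_data` before the result is assembled.
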
					
						
.. method:: HTMLParser.reset()

   Reset the instance.  All unprocessed data is lost.  This method is called
   implicitly at instantiation time.

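
   The following sketch illustrates the loss of buffered data, using a
   hypothetical ``TagLogger`` subclass that only records start tag names.
   Attributes defined by the subclass, such as ``tags`` here, are assumed to be
   the caller's responsibility and are left untouched by :meth:`reset`::

      from html.parser import HTMLParser

      class TagLogger(HTMLParser):
          """Hypothetical subclass that records the names of start tags."""

          def __init__(self):
              super().__init__()
              self.tags = []

          def handle_starttag(self, tag, attrs):
              self.tags.append(tag)

      logger = TagLogger()
      logger.feed('<p><di')     # '<di' is incomplete and stays buffered
      logger.reset()            # the buffered '<di' is discarded
      logger.feed('v><span>')   # 'v>' is now plain text, not the end of a tag
      print(logger.tags)        # ['p', 'span']
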
					
						
.. method:: HTMLParser.getpos()

   Return the current line number and offset.

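
   As an illustration, the hypothetical ``PositionLogger`` subclass below
   calls :meth:`getpos` from inside a handler to record where each start tag
   begins.  In this sketch the value is treated as a ``(lineno, offset)`` pair
   with line numbers starting at 1 and offsets starting at 0::

      from html.parser import HTMLParser

      class PositionLogger(HTMLParser):
          """Hypothetical subclass that records where each start tag begins."""

          def __init__(self):
              super().__init__()
              self.positions = []

          def handle_starttag(self, tag, attrs):
              # While the handler runs, getpos() still points at the '<'
              # that opened this tag.
              self.positions.append((tag, self.getpos()))

      logger = PositionLogger()
      logger.feed('<html>\n  <body>\n<p>')
      for tag, (lineno, offset) in logger.positions:
          print(tag, lineno, offset)

   This prints ``html 1 0``, ``body 2 2`` and ``p 3 0``.
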
					
						
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

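
   A short sketch of the difference between the normalized arguments passed to
   a handler and the raw text returned by this method.  The ``RawTagEcho``
   class and its attributes are hypothetical names chosen for the example::

      from html.parser import HTMLParser

      class RawTagEcho(HTMLParser):
          """Hypothetical subclass that keeps both views of each start tag."""

          def __init__(self):
              super().__init__()
              self.names = []   # lower-cased tag names
              self.raw = []     # start tags exactly as they appeared

          def handle_starttag(self, tag, attrs):
              self.names.append(tag)
              self.raw.append(self.get_starttag_text())

      echo = RawTagEcho()
      echo.feed('<A  HREF="x.html"   Class=Nav>')
      print(echo.names)   # ['a']
      print(echo.raw)     # ['<A  HREF="x.html"   Class=Nav>']
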
					
						
The following methods are called when data or markup elements are encountered
and are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
					
						
.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
   quotes in the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

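
   As a sketch of how the *attrs* list is typically consumed, the hypothetical
   ``LinkCollector`` subclass below gathers the ``href`` value of every ``a``
   element.  The class name and the sample URLs other than the one quoted
   above are made up for the example::

      from html.parser import HTMLParser

      class LinkCollector(HTMLParser):
          """Hypothetical subclass that collects the targets of <a> tags."""

          def __init__(self):
              super().__init__()
              self.links = []

          def handle_starttag(self, tag, attrs):
              if tag != 'a':          # tag names arrive in lower case
                  return
              # Attribute names are lower case too; if a name is repeated,
              # dict() keeps the last value.
              href = dict(attrs).get('href')
              if href is not None:
                  self.links.append(href)

      collector = LinkCollector()
      collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
                     '<a href="/search?a=1&amp;b=2">search</a>')
      print(collector.links)

   The second link is reported as ``'/search?a=1&b=2'`` because the entity
   reference in the attribute value has already been replaced.
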
					
						
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.
					
						
.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

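
   The default behavior can be observed with a subclass that overrides only
   the start and end tag handlers.  ``EventLogger`` is a hypothetical name
   used for this sketch::

      from html.parser import HTMLParser

      class EventLogger(HTMLParser):
          """Hypothetical subclass that logs start and end tag events."""

          def __init__(self):
              super().__init__()
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

      logger = EventLogger()
      logger.feed('<br/><img src="x.png" />')
      print(logger.events)

   Each empty tag produces a ``'start'`` event immediately followed by an
   ``'end'`` event, since :meth:`handle_startendtag` was not overridden.
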
					
						
.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).
					
						
.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.
					
						
.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
					
						
.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
					
						
.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

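
   The sketch below shows both forms side by side.  ``PILogger`` is a
   hypothetical subclass, and the ``xml`` instruction is an illustrative
   input that does not come from the text above::

      from html.parser import HTMLParser

      class PILogger(HTMLParser):
          """Hypothetical subclass that records processing instructions."""

          def __init__(self):
              super().__init__()
              self.instructions = []

          def handle_pi(self, data):
              self.instructions.append(data)

      logger = PILogger()
      logger.feed("<?proc color='red'>")      # SGML style, closed by '>'
      logger.feed("<?xml version='1.0'?>")    # XHTML style, closed by '?>'
      print(logger.instructions)

   The second entry is ``"xml version='1.0'?"``; a subclass that expects
   XHTML input may want to strip the trailing ``'?'`` itself.
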
					
						
.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.  The base class implementation raises an
   :exc:`HTMLParseError` when *strict* is ``True``.
					
						
.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)
       def handle_endtag(self, tag):
           print("End tag  :", tag)
       def handle_data(self, data):
           print("Data     :", data)
       def handle_comment(self, data):
           print("Comment  :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)
       def handle_decl(self, data):
           print("Decl     :", data)

   # The deprecated *strict* argument is omitted; non-strict is the default.
   parser = MyHTMLParser()
					
						
Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]
					
						
Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >
					
						
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a

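
Because the parser does not check that end tags match start tags, a subclass
that needs this information has to track open elements itself.  The following
is only a sketch: ``BalanceChecker`` and its attributes are hypothetical
names, and elements that are normally written without an end tag (such as
``<br>``) would be reported as unclosed by this simple approach::

   from html.parser import HTMLParser

   class BalanceChecker(HTMLParser):
       """Hypothetical subclass that reports end tags closing the wrong element."""

       def __init__(self):
           super().__init__()
           self.open_tags = []   # stack of elements that are still open
           self.problems = []    # end tags that did not match the stack top

       def handle_starttag(self, tag, attrs):
           self.open_tags.append(tag)

       def handle_endtag(self, tag):
           if self.open_tags and self.open_tags[-1] == tag:
               self.open_tags.pop()
           else:
               self.problems.append(tag)

   checker = BalanceChecker()
   checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
   checker.close()
   print(checker.problems)    # ['p']
   print(checker.open_tags)   # ['p']

Here ``</p>`` arrives while ``a`` is still the innermost open element, so it
is recorded as a problem and ``p`` is left on the stack.
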
					
						
.. rubric:: Footnotes

.. [#] For backward compatibility reasons *strict* mode does not raise
       exceptions for all non-compliant HTML.  That is, some invalid HTML
       is tolerated even in *strict* mode.