54 commits

Author SHA1 Message Date
Serhiy Storchaka
0243f97cba
gh-135661: Fix parsing start and end tags in HTMLParser according to the HTML5 standard (GH-135930)
* Whitespaces no longer accepted between `</` and the tag name.
  E.g. `</ script>` does not end the script section.

* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized
  as whitespaces. The only whitespaces are `\t\n\r\f `.

* Null character (U+0000) no longer ends the tag name.

* Attributes and slashes after the tag name in end tags are now ignored,
  instead of terminating after the first `>` in quoted attribute value.
  E.g. `</script/foo=">"/>`.

* Multiple slashes and whitespaces between the last attribute and closing `>`
  are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.

* Multiple `=` between attribute name and value are no longer collapsed.
  E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".

* Whitespaces between the `=` separator and attribute name or value are no
  longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and
  "=bar", both with value None; `<a foo= bar>` produces two attributes:
  "foo" with value "" and "bar" with value None.

* Fix Sphinx errors.

* Apply suggestions from code review

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

* Address review comments.

* Move to Security.

---------

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
2025-07-03 23:33:02 +03:00
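The tokenizer rules listed in this commit can be checked against whichever interpreter is installed. The following is a minimal sketch, assuming a hypothetical `EventRecorder` subclass and `parse` helper (neither is part of `html.parser`). It prints the events for the sample inputs quoted in the message; the printed results differ between Python versions with and without this fix, so no expected output is given.

```python
import sys
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records start tags, end tags and text data."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse(markup):
    """Feed one complete document and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Inputs taken from the commit message above; results are version-dependent.
samples = [
    "<a foo==bar>",
    "<a foo =bar>",
    "<a foo= bar>",
    "<a foo=bar/ //>",
    "<script>x</ script>y</script>",
]

print(sys.version)
for sample in samples:
    print(repr(sample), "->", parse(sample))
```

Comparing the printed events with the bullet points above shows whether a given interpreter already includes the change.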
Serhiy Storchaka
6eb6c5dbfb
gh-135462: Fix quadratic complexity in processing special input in HTMLParser (GH-135464)
End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
2025-06-13 19:57:48 +03:00
Waylan Limberg
53383e90e4
gh-86155: Fix data loss after unclosed script or style tag in HTMLParser (GH-22658)
When calling .close() the HTMLParser should flush all remaining content,
even when that content is in an unclosed script or style tag.
2025-05-10 17:36:06 +00:00
Ezio Melotti
76c0b01bc4
gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2025-05-10 17:31:43 +03:00
Sascha Ißbrücker
77b14a6d58
gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215)
According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
2025-05-07 18:49:49 +03:00
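As an illustration of character references in attribute values, the sketch below uses a hypothetical `AttrCollector` subclass and `attributes` helper (not part of the library). The semicolon-terminated reference is unescaped on every Python 3 version; the unterminated `&copy=2` case is the one this commit changes, so its output is printed without a stated expectation.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects (name, value) pairs from start tags."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def attributes(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# Properly terminated reference: "&amp;" becomes "&".
terminated = attributes('<a href="?a=1&amp;b=2">')
print(terminated)

# Unterminated reference followed by "=": the result depends on whether
# the installed interpreter includes this fix.
print(attributes('<a href="?x=1&copy=2">'))
```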
Dong-hee Na
157aef79b0
gh-95813: Improve HTMLParser from the view of inheritance (#95874)
* gh-95813: Improve HTMLParser from the view of inheritance

* gh-95813: Add unittest

* Address code review
2022-08-18 13:16:33 +02:00
Alberto Mardegan
562c0d7398
bpo-45421: Remove dead code from html.parser (GH-28847)
Support for HTMLParseError was removed back in 2014 with commit
73a4359eb0, however this small block was
missed.
2021-10-12 10:12:21 -07:00
Christian Clauss
745c9d9dfc
Fix typos in the Lib directory (GH-28775)
Fix typos in the Lib directory as identified by codespell.

Co-authored-by: Terry Jan Reedy <tjreedy@udel.edu>
2021-10-06 16:13:48 -07:00
Karl Dubost
9eb11a139f
bpo-41748: Handles unquoted attributes with commas (#24072)
* bpo-41748: Adds tests for unquoted attributes with comma

* bpo-41748: Handles unquoted attributes with comma

* bpo-41748: Addresses review comments

* bpo-41748: Addresses review comments

* Adds more test cases
* Simplifies the regex for handling spaces

* bpo-41748: Moves attributes tests under the right class

* bpo-41748: Addresses review about duplicate attributes

* bpo-41748: Adds NEWS.d entry for this patch
2021-02-01 21:32:50 +01:00
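A short sketch of the unquoted-attribute case, assuming an interpreter that includes this patch. `AttrCollector` is a hypothetical helper, and the input is the "comma between attributes" form: the comma is kept as part of a single unquoted value rather than starting a new attribute.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects (name, value) pairs from start tags."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


parser = AttrCollector()
parser.feed("<div class=bar,baz=asd>")
parser.close()

# One attribute whose unquoted value runs up to the closing ">".
print(parser.attrs)
```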
Inada Naoki
fae0ed5099
bpo-37328: remove deprecated HTMLParser.unescape (GH-14186)
It has been deprecated since Python 3.4.
2019-08-27 11:48:06 +09:00
Motoki Naruse
3358d589fb bpo-30629: Remove second call of str.lower() in html.parser.parse_endtag. (#2099)
elem is the result of .lower() 6 lines above the handle_endtag call.
Patch by Motoki Naruse
2017-06-16 21:15:25 -04:00
Serhiy Storchaka
c842efc6ae Revert "Fixed a typo in the HTMLParser.feed docstrings" (#1771)
* Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759)"

The docstring was correct. I read the patch in the opposite direction, as *adding* the "r" prefix.

This reverts commit 5ba185039f.
2017-05-24 07:20:45 +03:00
Jani Šumak
5ba185039f Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759) 2017-05-23 16:40:54 +03:00
R David Murray
44b548dda8 #27364: fix "incorrect" uses of escape character in the stdlib.
And most of the tools.

Patch by Emanual Barry, reviewed by me, Serhiy Storchaka, and
Martin Panter.
2016-09-08 13:59:53 -04:00
Martin Panter
46f50726a0 Issue #27076: Doc, comment and tests spelling fixes
Most fixes to Doc/ and Lib/ directories by Ville Skyttä.
2016-05-26 05:35:26 +00:00
Ezio Melotti
20a2c6482e #23144: merge with 3.4. 2015-09-06 21:44:45 +03:00
Ezio Melotti
6f2bb98966 #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True. 2015-09-06 21:38:06 +03:00
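The buffering behind this fix can be sketched as follows. With `convert_charrefs=True` the parser holds back text that ends in what might be an incomplete character reference, and `close()` delivers it. `DataCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: collects every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = DataCollector()
parser.feed("AT&T")

# "&T" could be the start of a longer reference, so nothing is emitted yet.
after_feed = list(parser.chunks)

# close() signals the end of input and flushes the buffered text.
parser.close()
after_close = "".join(parser.chunks)

print(after_feed, repr(after_close))
```

Callers that feed data incrementally therefore need to call `close()` to be sure they have received all the text.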
Ezio Melotti
6fc16d81af #21047: set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag. 2014-08-02 18:36:12 +03:00
Ezio Melotti
73a4359eb0 #15114: the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParseError exception have been removed. 2014-08-02 14:10:30 +03:00
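One visible consequence of this removal, sketched with a hypothetical `accepts_strict` helper: on Python 3.5 and later the constructor no longer takes a `strict` keyword, so passing it raises `TypeError`.

```python
from html.parser import HTMLParser


def accepts_strict():
    """Return True if HTMLParser still accepts the removed 'strict' keyword."""
    try:
        HTMLParser(strict=False)
    except TypeError:
        return False
    return True


print(accepts_strict())
```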
Ezio Melotti
153d97b24e #20288: merge with 3.3. 2014-02-01 21:22:26 +02:00
Ezio Melotti
f27b9a741a #20288: fix handling of invalid numeric charrefs in HTMLParser. 2014-02-01 21:21:01 +02:00
Ezio Melotti
95401c5f6b #13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references. 2013-11-23 19:52:05 +02:00
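The effect of the `convert_charrefs` keyword can be sketched by parsing the same markup twice. `Recorder` and `events` are hypothetical helpers; the handler names are the documented `HTMLParser` callbacks.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records data and character-reference callbacks."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def events(markup, convert_charrefs):
    parser = Recorder(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>caf&eacute; &#38; bar</p>"

# True: references are converted and arrive as ordinary text.
converted = events(markup, True)
# False: references are reported through handle_entityref / handle_charref.
raw = events(markup, False)

print(converted)
print(raw)
```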
Ezio Melotti
f6de9eb2bb #19688: add back and deprecate the internal HTMLParser.unescape() method. 2013-11-22 05:49:29 +02:00
Ezio Melotti
4a9ee26750 #2927: Added the unescape() function to the html module. 2013-11-19 20:28:45 +02:00
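A brief usage sketch of the module-level function this commit adds; the sample strings are illustrative only.

```python
import html

# Named and numeric character references are both converted.
named = html.unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;")
numeric = html.unescape("&#x41;&#66;")

print(named)
print(numeric)
```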
Ezio Melotti
b7038817fe #19480: merge with 3.3. 2013-11-07 18:35:27 +02:00
Ezio Melotti
7165d8b9ba #19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard. 2013-11-07 18:33:24 +02:00
Ezio Melotti
88ebfb129b #15114: The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method are used. 2013-11-02 17:08:24 +02:00
Ezio Melotti
f6ca26fbff #17802: merge with 3.3. 2013-05-01 16:20:00 +03:00
Ezio Melotti
8e596a765c #17802: Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow. 2013-05-01 16:18:25 +03:00
Ezio Melotti
1698babd1b #14679: add an __all__ (that contains only HTMLParser) to html.parser. 2013-05-01 16:09:34 +03:00
Ezio Melotti
46495182d0 #15156: HTMLParser now uses the new "html.entities.html5" dictionary. 2012-06-24 22:02:56 +02:00
Ezio Melotti
3861d8b271 #15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup. 2012-06-23 15:27:51 +02:00
Ezio Melotti
0780b6bc58 #14538: HTMLParser can now parse correctly start tags that contain a bare /. 2012-04-18 19:18:22 -06:00
Ezio Melotti
29877e8e04 HTMLParser is now able to handle slashes in the start tag. 2012-02-21 09:25:00 +02:00
Ezio Melotti
e31ddedb0e Fix an index and clean up comments. 2012-02-13 20:20:00 +02:00
Ezio Melotti
f4ab491901 Improve handling of declarations in HTMLParser. 2012-02-13 15:50:37 +02:00
Ezio Melotti
5211ffe4df #13993: HTMLParser is now able to handle broken end tags when strict=False. 2012-02-13 11:24:50 +02:00
Ezio Melotti
fa3702dc28 #13960: HTMLParser is now able to handle broken comments when strict=False. 2012-02-10 10:45:44 +02:00
Ezio Melotti
15cb489234 #13358: HTMLParser now calls handle_data only once for each CDATA. 2011-11-18 18:01:49 +02:00
Ezio Melotti
c2fe57762b #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. 2011-11-14 18:53:33 +02:00
Ezio Melotti
7de56f6a04 #670664: Fix HTMLParser to correctly handle the content of `<script>...</script>` and `<style>...</style>`. 2011-11-01 14:12:22 +02:00
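A minimal sketch of what treating script content as raw text means in practice: markup-like text inside `<script>` is passed to `handle_data` and does not produce start-tag events. `ScriptProbe` is a hypothetical helper and the script source is made up for illustration.

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Hypothetical helper: records start-tag names and text data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


source = 'if (a < b) { label = "<b>bold"; }'

probe = ScriptProbe()
probe.feed("<script>" + source + "</script>")
probe.close()

# Join the chunks in case the parser delivers the text in several pieces.
script_text = "".join(probe.data)

print(probe.start_tags)
print(script_text)
```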
Ezio Melotti
f50ffa94ab #13273: fix a bug that prevented HTMLParser from properly detecting some tags when strict=False. 2011-10-28 13:21:09 +03:00
Ezio Melotti
d9e0b068af #12888: Fix a bug in HTMLParser.unescape that prevented it from unescaping more than 128 entities. Patch by Peter Otten. 2011-09-05 17:11:06 +03:00
Éric Araujo
51b7aedadd Merge 3.1 2011-05-25 18:13:49 +02:00
Éric Araujo
39f180bb1f Fix display of html.parser.HTMLParser.feed docstring 2011-05-04 15:55:47 +02:00
Ezio Melotti
2e3607c1e7 #7311: fix html.parser to accept non-ASCII attribute values. 2011-04-07 22:03:31 +03:00
Senthil Kumaran
6c85838489 Merged revisions 87542 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/branches/py3k

........
  r87542 | senthil.kumaran | 2010-12-28 23:55:16 +0800 (Tue, 28 Dec 2010) | 3 lines

  Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax
........
2010-12-28 16:10:56 +00:00
Senthil Kumaran
164540fee1 Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax 2010-12-28 15:55:16 +00:00
R. David Murray
b579dba119 #1486713: Add a tolerant mode to HTMLParser.
The motivation for adding this option is that the functionality it
provides used to be provided by sgmllib in Python2, and was used by,
for example, BeautifulSoup.  Without this option, the Python3 version
of BeautifulSoup and the many programs that use it are crippled.

The original patch was by 'kxroberto'.  I modified it heavily but kept his
heuristics and test.  I also added additional heuristics to fix #975556,
#1046092, and part of #6191.  This patch should be completely backward
compatible:  the behavior with the default strict=True is unchanged.
2010-12-03 04:06:39 +00:00
Victor Stinner
30c223cff5 Merged revisions 81504 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/branches/py3k

................
  r81504 | victor.stinner | 2010-05-24 23:46:25 +0200 (lun., 24 mai 2010) | 13 lines

  Recorded merge of revisions 81500-81501 via svnmerge from
  svn+ssh://pythondev@svn.python.org/python/trunk

  ........
    r81500 | victor.stinner | 2010-05-24 23:33:24 +0200 (lun., 24 mai 2010) | 2 lines

    Issue #6662: Fix parsing of malformatted charref (&#bad;)
  ........
    r81501 | victor.stinner | 2010-05-24 23:37:28 +0200 (lun., 24 mai 2010) | 2 lines

    Add the author of the last fix (Issue #6662)
  ........
................
2010-05-24 21:48:07 +00:00