Serhiy Storchaka
7636a66635
gh-135661: Fix parsing unterminated bogus comments in HTMLParser (GH-137873)
...
Bogus comments that start with "<![CDATA[" should not include the starting "!"
in their value.
2025-08-17 13:37:50 +03:00
Serhiy Storchaka
0cbbfc4621
gh-135661: Fix CDATA section parsing in HTMLParser (GH-135665)
...
"] ]>" and "]] >" no longer end the CDATA section.
Make CDATA section parsing context-dependent.
Add private method HTMLParser._set_support_cdata() to change the context.
If called with True, "<![CDATA[" starts a CDATA section which ends with "]]>".
If called with False, "<![CDATA[" starts a bogus comment which ends with ">".
2025-08-14 18:13:22 +00:00
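A minimal sketch of how the context switch described above might be observed. `Recorder` and `parse` are hypothetical helpers written for this illustration; `_set_support_cdata()` is private and exists only in releases that include this change, so the sketch looks it up defensively. Which callback receives the CDATA text depends on the Python version and on the flag, so no output is claimed for the first event.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records the callbacks relevant to CDATA."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))


def parse(markup, support_cdata=None):
    parser = Recorder()
    # Assumption: the private setter may be missing on older releases,
    # in which case the parser's default behaviour is used.
    setter = getattr(parser, "_set_support_cdata", None)
    if support_cdata is not None and setter is not None:
        setter(support_cdata)
    parser.feed(markup)
    parser.close()
    return parser.events


events = parse("<![CDATA[x < y]]>z", support_cdata=True)
print(events)
```

Calling `parse(..., support_cdata=False)` on a release with this change should report the same markup as a bogus comment instead.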
Timon Viola
4d02f31cdd
gh-118350: Fix support of elements "textarea" and "title" in HTMLParser (#135310)
...
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
2025-07-22 13:27:13 +02:00
Serhiy Storchaka
dee6501894
gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908)
...
This fixes a regression introduced in GH-135930.
2025-07-21 12:07:15 +02:00
Serhiy Storchaka
8ac7613dc8
gh-102555: Fix comment parsing in HTMLParser according to the HTML5 standard (GH-135664)
...
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<!-->" and "<!--->".
---------
Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
2025-07-04 07:00:23 +00:00
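The comment rules above can be probed with a small collector. This is only a sketch: `CommentCollector` and `comments_in` are names made up for the illustration, and the result for the irregular samples depends on whether the running Python includes this change, so only the well-formed case is shown with a definite result.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: collects the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def comments_in(markup):
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments


# A well-formed comment is reported the same way before and after the change.
well_formed = comments_in("<!-- ok -->")
print(well_formed)  # [' ok ']

# Version-dependent samples taken from the bullet points above.
for sample in ("<!--a--!>", "<!--a-- >b-->", "<!-->", "<!--->"):
    print(repr(sample), comments_in(sample))
```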
Serhiy Storchaka
0243f97cba
gh-135661: Fix parsing start and end tags in HTMLParser according to the HTML5 standard (GH-135930)
...
* Whitespaces no longer accepted between `</` and the tag name.
E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized
as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored,
instead of terminating after the first `>` in quoted attribute value.
E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>`
are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed.
E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no
longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and
"=bar", both with value None; `<a foo= bar>` produces two attributes:
"foo" with value "" and "bar" with value None.
* Fix Sphinx errors.
* Apply suggestions from code review
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
* Address review comments.
* Move to Security.
---------
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
2025-07-03 23:33:02 +03:00
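A rough way to check the tag and attribute rules listed above on a given interpreter. `TagRecorder` and `tags_in` are hypothetical helpers; the baseline markup parses identically across versions, while the remaining samples (taken from the bullet points) are printed without expected values because their result depends on whether this change is present.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start and end tag callbacks."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))


def tags_in(markup):
    parser = TagRecorder()
    parser.feed(markup)
    parser.close()
    return parser.tags


# Quoted, single-quoted and valueless attributes: stable behaviour.
baseline = tags_in('<a href="x" title=\'y\' hidden>link</a>')
print(baseline)

# Version-dependent samples from the commit message.
for sample in ("<a foo==bar>", "<a foo =bar>", "<a foo= bar>",
               "<a foo=bar/ //>", "</ script>"):
    print(repr(sample), tags_in(sample))
```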
Serhiy Storchaka
6eb6c5dbfb
gh-135462: Fix quadratic complexity in processing special input in HTMLParser (GH-135464)
...
End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
2025-06-13 19:57:48 +03:00
Waylan Limberg
53383e90e4
gh-86155: Fix data loss after unclosed script or style tag in HTMLParser (GH-22658)
...
When calling .close() the HTMLParser should flush all remaining content,
even when that content is in an unclosed script or style tag.
2025-05-10 17:36:06 +00:00
Ezio Melotti
76c0b01bc4
gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295)
...
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2025-05-10 17:31:43 +03:00
Sascha Ißbrücker
77b14a6d58
gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215)
...
According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
2025-05-07 18:49:49 +03:00
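A short sketch of where this rule shows up. `AttrCollector` and `first_attrs` are hypothetical names; a reference terminated by `;` is unescaped on every version, whereas the unterminated references in the second sample are the case this commit changes, so that result is printed without an expected value.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: keeps the attributes of the first start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def first_attrs(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# Properly terminated reference: always converted.
href = first_attrs('<a href="?a=1&amp;b=2">')["href"]
print(href)  # ?a=1&b=2

# "&para=" and "&copy" are not terminated by ";": version-dependent.
print(first_attrs('<a href="?x=1&para=2&copy">')["href"])
```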
Dong-hee Na
157aef79b0
gh-95813: Improve HTMLParser from the view of inheritance (#95874)
...
* gh-95813: Improve HTMLParser from the view of inheritance
* gh-95813: Add unittest
* Address code review
2022-08-18 13:16:33 +02:00
Alberto Mardegan
562c0d7398
bpo-45421: Remove dead code from html.parser (GH-28847)
...
Support for HTMLParseError was removed back in 2014 with commit 73a4359eb0,
however this small block was missed.
2021-10-12 10:12:21 -07:00
Christian Clauss
745c9d9dfc
Fix typos in the Lib directory (GH-28775)
...
Fix typos in the Lib directory as identified by codespell.
Co-authored-by: Terry Jan Reedy <tjreedy@udel.edu>
2021-10-06 16:13:48 -07:00
Karl Dubost
9eb11a139f
bpo-41748: Handles unquoted attributes with commas (#24072)
...
* bpo-41748: Adds tests for unquoted attributes with comma
* bpo-41748: Handles unquoted attributes with comma
* bpo-41748: Addresses review comments
* bpo-41748: Addresses review comments
* Adds more test cases
* Simplifies the regex for handling spaces
* bpo-41748: Moves attributes tests under the right class
* bpo-41748: Addresses review about duplicate attributes
* bpo-41748: Adds NEWS.d entry for this patch
2021-02-01 21:32:50 +01:00
Inada Naoki
fae0ed5099
bpo-37328: remove deprecated HTMLParser.unescape (GH-14186)
...
It has been deprecated since Python 3.4.
2019-08-27 11:48:06 +09:00
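For code that still called the removed method, the module-level `html.unescape()` function (added in #2927, listed further down this log) is the usual replacement. A minimal example:

```python
import html

# Named and numeric character references are converted to text.
text = html.unescape("&lt;p&gt;caf&eacute; &amp; cr&#232;me&lt;/p&gt;")
print(text)  # <p>café & crème</p>
```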
Motoki Naruse
3358d589fb
bpo-30629: Remove second call of str.lower() in html.parser.parse_endtag. (#2099)
...
elem is the result of .lower() 6 lines above the handle_endtag call.
Patch by Motoki Naruse
2017-06-16 21:15:25 -04:00
Serhiy Storchaka
c842efc6ae
Revert "Fixed a typo in the HTMLParser.feed docstrings" (#1771)
...
* Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759)"
The docstring was correct. I read the patch in the opposite direction, as *adding* the "r" prefix.
This reverts commit 5ba185039f.
2017-05-24 07:20:45 +03:00
Jani Šumak
5ba185039f
Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759)
2017-05-23 16:40:54 +03:00
R David Murray
44b548dda8
#27364 : fix "incorrect" uses of escape character in the stdlib.
...
And most of the tools.
Patch by Emanuel Barry, reviewed by me, Serhiy Storchaka, and
Martin Panter.
2016-09-08 13:59:53 -04:00
Martin Panter
46f50726a0
Issue #27076 : Doc, comment and tests spelling fixes
...
Most fixes to Doc/ and Lib/ directories by Ville Skyttä.
2016-05-26 05:35:26 +00:00
Ezio Melotti
20a2c6482e
#23144 : merge with 3.4.
2015-09-06 21:44:45 +03:00
Ezio Melotti
6f2bb98966
#23144 : Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True.
2015-09-06 21:38:06 +03:00
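A small sketch of the situation this fix concerns: with `convert_charrefs` enabled, text ending in a possibly incomplete character reference is held back until more input arrives or `close()` is called, and none of it should be lost. `DataCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: collects every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = DataCollector()
parser.feed("x &am")            # the reference is cut in half
pending = list(parser.chunks)   # nothing is reported yet
parser.feed("p; y")             # the rest of the reference arrives
parser.close()                  # flushes anything still buffered
text = "".join(parser.chunks)
print(pending, repr(text))      # [] 'x & y'
```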
Ezio Melotti
6fc16d81af
#21047 : set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag.
2014-08-02 18:36:12 +03:00
Ezio Melotti
73a4359eb0
#15114 : the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParseError exception have been removed.
2014-08-02 14:10:30 +03:00
Ezio Melotti
153d97b24e
#20288 : merge with 3.3.
2014-02-01 21:22:26 +02:00
Ezio Melotti
f27b9a741a
#20288 : fix handling of invalid numeric charrefs in HTMLParser.
2014-02-01 21:21:01 +02:00
Ezio Melotti
95401c5f6b
#13633 : Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references.
2013-11-23 19:52:05 +02:00
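To illustrate the difference the keyword argument makes, the sketch below parses the same text with both settings. `EventRecorder` and `events_for` are hypothetical helpers; with `convert_charrefs=False` the references are reported through `handle_entityref()` instead of being folded into the text.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records data and entity reference callbacks."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def events_for(markup, convert_charrefs):
    parser = EventRecorder(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


converted = events_for("a &gt; b", convert_charrefs=True)
raw = events_for("a &gt; b", convert_charrefs=False)
print(converted)  # [('data', 'a > b')]
print(raw)        # [('data', 'a '), ('entityref', 'gt'), ('data', ' b')]
```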
Ezio Melotti
f6de9eb2bb
#19688 : add back and deprecate the internal HTMLParser.unescape() method.
2013-11-22 05:49:29 +02:00
Ezio Melotti
4a9ee26750
#2927 : Added the unescape() function to the html module.
2013-11-19 20:28:45 +02:00
Ezio Melotti
b7038817fe
#19480 : merge with 3.3.
2013-11-07 18:35:27 +02:00
Ezio Melotti
7165d8b9ba
#19480 : HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard.
2013-11-07 18:33:24 +02:00
Ezio Melotti
88ebfb129b
#15114 : The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method are used.
2013-11-02 17:08:24 +02:00
Ezio Melotti
f6ca26fbff
#17802 : merge with 3.3.
2013-05-01 16:20:00 +03:00
Ezio Melotti
8e596a765c
#17802 : Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow.
2013-05-01 16:18:25 +03:00
Ezio Melotti
1698babd1b
#14679 : add an __all__ (that contains only HTMLParser) to html.parser.
2013-05-01 16:09:34 +03:00
Ezio Melotti
46495182d0
#15156 : HTMLParser now uses the new "html.entities.html5" dictionary.
2012-06-24 22:02:56 +02:00
Ezio Melotti
3861d8b271
#15114 : the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup.
2012-06-23 15:27:51 +02:00
Ezio Melotti
0780b6bc58
#14538 : HTMLParser can now parse correctly start tags that contain a bare /.
2012-04-18 19:18:22 -06:00
Ezio Melotti
29877e8e04
HTMLParser is now able to handle slashes in the start tag.
2012-02-21 09:25:00 +02:00
Ezio Melotti
e31ddedb0e
Fix an index and clean up comments.
2012-02-13 20:20:00 +02:00
Ezio Melotti
f4ab491901
Improve handling of declarations in HTMLParser.
2012-02-13 15:50:37 +02:00
Ezio Melotti
5211ffe4df
#13993 : HTMLParser is now able to handle broken end tags when strict=False.
2012-02-13 11:24:50 +02:00
Ezio Melotti
fa3702dc28
#13960 : HTMLParser is now able to handle broken comments when strict=False.
2012-02-10 10:45:44 +02:00
Ezio Melotti
15cb489234
#13358 : HTMLParser now calls handle_data only once for each CDATA.
2011-11-18 18:01:49 +02:00
Ezio Melotti
c2fe57762b
#1745761 , #755670 , #13357 , #12629 , #1200313 : improve attribute handling in HTMLParser.
2011-11-14 18:53:33 +02:00
Ezio Melotti
7de56f6a04
#670664 : Fix HTMLParser to correctly handle the content of `<script>...</script>` and `<style>...</style>`.
2011-11-01 14:12:22 +02:00
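A brief sketch of what correct handling of script content looks like from the caller's side: markup-like text inside `<script>` is reported as data rather than as tags. `ScriptRecorder` is a hypothetical helper, and the data chunks are joined because the number of `handle_data()` calls is not something this sketch relies on.

```python
from html.parser import HTMLParser


class ScriptRecorder(HTMLParser):
    """Hypothetical helper: records start tags and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


script_body = 'if (a < b) { document.write("<p>"); }'
parser = ScriptRecorder()
parser.feed("<script>" + script_body + "</script><p>after</p>")
parser.close()

start_tags = [name for kind, name in parser.events if kind == "start"]
text = "".join(value for kind, value in parser.events if kind == "data")
print(start_tags)  # ['script', 'p']
print(text)
```

The `<p>` inside the script does not appear in `start_tags`; only the real paragraph after the script does.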
Ezio Melotti
f50ffa94ab
#13273 : fix a bug that prevented HTMLParser from properly detecting some tags when strict=False.
2011-10-28 13:21:09 +03:00
Ezio Melotti
d9e0b068af
#12888 : Fix a bug in HTMLParser.unescape that prevented it from unescaping more than 128 entities. Patch by Peter Otten.
2011-09-05 17:11:06 +03:00
Éric Araujo
51b7aedadd
Merge 3.1
2011-05-25 18:13:49 +02:00
Éric Araujo
39f180bb1f
Fix display of html.parser.HTMLParser.feed docstring
2011-05-04 15:55:47 +02:00