cpython

mirror of https://github.com/python/cpython.git synced 2026-06-09 03:12:34 +00:00

Author	SHA1	Message	Date
Hugo van Kemenade	f0daba1652	gh-106693: Revert "Explicitly mark ob_sval as unsigned char to avoid UB (#106826)" (#149514)	2026-05-07 23:39:08 +03:00
Serhiy Storchaka	310fe88994	gh-79638: Treat an unreachable robots.txt as "disallow all" (GH-138555) Disallow all access in urllib.robotparser if the robots.txt file is unreachable due to server or network errors.	2026-05-07 22:06:57 +03:00
Serhiy Storchaka	bc285e5832	gh-138907: Support RFC 9309 in robotparser (GH-138908) * empty lines are always ignored instead of separating groups * the "user-agent" line after a rule starts a new group * groups matching the same user agent are now merged * the rule with the longest match wins instead of the first matching rule * in case of equal matches, the “Allow” rule wins over “Disallow” * special characters “$” and “*” are now supported in rules * prefer full match for user agent	2026-05-04 18:03:11 +00:00
Serhiy Storchaka	cb7ef18d70	gh-88375, gh-111788: Fix parsing errors and normalization in robotparser (GH-138502) * Don't fail trying to parse weird patterns. * Don't fail trying to decode non-UTF-8 "robots.txt" files. * No longer ignore trailing "?" in patterns and URLs. * Distinguish raw special characters "?", "=" and "&" from the percent-encoded ones. * Remove tests that do nothing.	2025-09-05 18:58:42 +03:00
Bénédikt Tran	53e8942e69	Explicitly import `urllib.error` in `urllib.robotparser` (#128737)	2025-01-13 17:14:59 +01:00
Serhiy Storchaka	5e65a1acc0	gh-128731: Fix ResourceWarning in robotparser.RobotFileParser.read() (GH-128733)	2025-01-12 15:14:46 +02:00
Rémi Lapeyre	8047e0e1c6	bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com>	2019-06-16 09:48:57 +03:00
Christopher Beacham	5db5c0669e	bpo-21475: Support the Sitemap extension in robotparser (GH-6883)	2018-05-16 10:52:07 -04:00
Michael Lazar	bd08a0af2d	bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) The urllib.robotparser's __str__ representation now includes wildcard entries and the "Crawl-delay" and "Request-rate" fields. Also removes extra newlines that were being appended to the end of the string.	2018-05-14 17:10:41 +03:00
Berker Peksag	3df02dbc8e	bpo-31325: Fix usage of namedtuple in RobotFileParser.parse() (#4529)	2017-11-23 15:40:26 -08:00
Berker Peksag	9a7bbb2e3f	Issue #25400: RobotFileParser now correctly returns default values for crawl_delay and request_rate Initial patch by Peter Wirtz.	2016-09-18 20:17:58 +03:00
Martin Panter	1ce738e08f	Merge typo fixes from 3.5	2016-05-08 14:02:35 +00:00
Martin Panter	f0564164ba	Fix typos in comments, documentation and test method names	2016-05-08 13:48:10 +00:00
Berker Peksag	960e848f0d	Issue #16099: RobotFileParser now supports Crawl-delay and Request-rate extensions. Patch by Nikolay Bogoychev.	2015-10-08 12:27:06 +03:00
Raymond Hettinger	38acd4c028	Issue 21469: Minor code modernization (convert and/or expression to an if/else expression). Suggested by: Tal Einat	2014-05-12 22:22:46 -07:00
Raymond Hettinger	122541bece	Issue 21469: Mitigate risk of false positives with robotparser. * Repair the broken link to norobots-rfc.txt. * HTTP response codes >= 500 treated as a failed read rather than as a not found. Not found means that we can assume the entire site is allowed. A 5xx server error tells us nothing. * A successful read() or parse() updates the mtime (which is defined to be "the time the robots.txt file was last fetched"). * The can_fetch() method returns False unless we've had a read() with a 2xx or 4xx response. This avoids false positives in the case where a user calls can_fetch() before calling read(). * I don't see any easy way to test this patch without hitting internet resources that might change or without use of mock objects that wouldn't provide must reassurance.	2014-05-12 21:56:33 -07:00
Senthil Kumaran	c70a6ae49b	#17403: urllib.parse.robotparser normalizes the urls before adding to ruleline. This helps in handling certain types invalid urls in a conservative manner.	2013-05-29 05:54:31 -07:00
Georg Brandl	0a0fc07d37	#4108: the first default entry (User-agent: *) wins.	2010-07-29 17:55:01 +00:00
Senthil Kumaran	3f8ab965f7	Fix Issue6325 - robotparse to honor urls with query strings.	2010-07-28 16:27:56 +00:00
Benjamin Peterson	d63137159b	Merged revisions 65209-65216,65225-65226,65233,65239,65246-65247,65255-65256 via svnmerge from svn+ssh://pythondev@svn.python.org/python/trunk ........ r65209 | raymond.hettinger | 2008-07-23 19:08:18 -0500 (Wed, 23 Jul 2008) | 1 line Finish-up the partial conversion from int to Py_ssize_t for deque indices and length. ........ r65210 | raymond.hettinger | 2008-07-23 19:53:49 -0500 (Wed, 23 Jul 2008) | 1 line Parse to the correct datatype. ........ r65211 | benjamin.peterson | 2008-07-23 21:27:46 -0500 (Wed, 23 Jul 2008) | 1 line fix spacing ........ r65212 | benjamin.peterson | 2008-07-23 21:31:28 -0500 (Wed, 23 Jul 2008) | 1 line fix markup ........ r65213 | benjamin.peterson | 2008-07-23 21:45:37 -0500 (Wed, 23 Jul 2008) | 1 line add some documentation for 2to3 ........ r65214 | raymond.hettinger | 2008-07-24 00:38:48 -0500 (Thu, 24 Jul 2008) | 1 line Finish conversion from int to Py_ssize_t. ........ r65215 | raymond.hettinger | 2008-07-24 02:04:55 -0500 (Thu, 24 Jul 2008) | 1 line Convert from long to Py_ssize_t. ........ r65216 | georg.brandl | 2008-07-24 02:09:21 -0500 (Thu, 24 Jul 2008) | 2 lines Fix indentation. ........ r65225 | benjamin.peterson | 2008-07-25 11:55:37 -0500 (Fri, 25 Jul 2008) | 1 line teach .bzrignore about doc tools ........ r65226 | benjamin.peterson | 2008-07-25 12:02:11 -0500 (Fri, 25 Jul 2008) | 1 line document default value for fillvalue ........ r65233 | raymond.hettinger | 2008-07-25 13:43:33 -0500 (Fri, 25 Jul 2008) | 1 line Issue 1592: Better error reporting for operations on closed shelves. ........ r65239 | benjamin.peterson | 2008-07-25 16:59:53 -0500 (Fri, 25 Jul 2008) | 1 line fix indentation ........ r65246 | andrew.kuchling | 2008-07-26 08:08:19 -0500 (Sat, 26 Jul 2008) | 1 line This sentence continues to bug me; rewrite it for the second time ........ r65247 | andrew.kuchling | 2008-07-26 08:09:06 -0500 (Sat, 26 Jul 2008) | 1 line Remove extra words ........ r65255 | skip.montanaro | 2008-07-26 19:49:02 -0500 (Sat, 26 Jul 2008) | 3 lines Close issue 3437 - missing state change when Allow lines are processed. Adds test cases which use Allow: as well. ........ r65256 | skip.montanaro | 2008-07-26 19:50:41 -0500 (Sat, 26 Jul 2008) | 2 lines note robotparser bug fix. ........	2008-07-31 16:23:04 +00:00
Jeremy Hylton	73fd46d24e	Bug 3347: robotparser failed because it didn't convert bytes to string. The solution is to convert bytes to text via utf-8. I'm not entirely sure if this is safe, but it looks like robots.txt is expected to be ascii.	2008-07-18 20:59:44 +00:00
Jeremy Hylton	1afc169616	Make a new urllib package. It consists of code from urllib, urllib2, urlparse, and robotparser. The old modules have all been removed. The new package has five submodules: urllib.parse, urllib.request, urllib.response, urllib.error, and urllib.robotparser. The urllib.request.urlopen() function uses the url opener from urllib2. Note that the unittests have not been renamed for the beta, but they will be renamed in the future. Joint work with Senthil Kumaran.	2008-06-18 20:49:58 +00:00

22 commits