Commit graph

488 commits

Author SHA1 Message Date
Petr Viktorin
f2bce51b98
gh-140691: urllib.request: Close FTP control socket if data socket can't connect (GH-140835)
Co-authored-by: codenamenam <bluetire27@gmail.com>
2025-11-05 11:52:11 +01:00
SarahPythonista
c50d794c7b
Improve the comment in URLError (#139874)
Clarify that it It overrides `__init__` and `__str__`.
2025-10-14 12:31:21 -07:00
Jeff Epler
d3e3b2b0ac
Edit outdated comment (#121152)
A comment about a possible relaxation of how bytes URLs are treated
in Python 3.3 is no longer relevant or useful.
2025-09-28 14:55:44 -07:00
Serhiy Storchaka
cb7ef18d70
gh-88375, gh-111788: Fix parsing errors and normalization in robotparser (GH-138502)
* Don't fail trying to parse weird patterns.
* Don't fail trying to decode non-UTF-8 "robots.txt" files.
* No longer ignore trailing "?" in patterns and URLs.
* Distinguish raw special characters "?", "=" and "&" from the
  percent-encoded ones.
* Remove tests that do nothing.
2025-09-05 18:58:42 +03:00
Barney Gale
10a925c86d
GH-137059: url2pathname(): fix support for drive letter in netloc (#137060)
Support file URLs like `file://c:/foo` in `urllib.request.url2pathname()`
on Windows. This restores behaviour from 3.13.
2025-07-27 11:44:41 +00:00
Barney Gale
80b2d60a51
GH-136874: url2pathname(): discard query and fragment components (#136875)
In `urllib.request.url2pathname()`, ignore any query or fragment components
in the given URL.
2025-07-21 17:33:20 +00:00
Barney Gale
8e08ac9f32
GH-123599: url2pathname(): don't call gethostbyname() by default (#132610)
Follow-up to 66cdb2bd8a.

Add *resolve_host* keyword-only argument to `url2pathname()`, defaulting to
false. When set to true, we call `socket.gethostbyname()` to resolve the
URL hostname.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Steve Dower <steve.dower@microsoft.com>
2025-05-05 17:03:42 +00:00
Serhiy Storchaka
84a08f8629
gh-133306: Use \z instead of \Z in regular expressions in the stdlib (GH-133337) 2025-05-03 17:58:49 +03:00
Barney Gale
0879ebc953
GH-123599: Match file: URL hostname against machine hostname in urllib (#132523)
In `_is_local_authority()`, return early if the authority matches the
machine hostname from `socket.gethostname()`, rather than resolving the
names and matching IP addresses.
2025-04-15 01:05:06 +01:00
Barney Gale
ccad61e35d
GH-125866: Support complete "file:" URLs in urllib (#132378)
Add optional *add_scheme* argument to `urllib.request.pathname2url()`; when
set to true, a complete URL is returned. Likewise add optional
*require_scheme* argument to `url2pathname()`; when set to true, a complete
URL is accepted.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
2025-04-14 01:49:02 +01:00
Barney Gale
66cdb2bd8a
GH-123599: url2pathname(): handle authority section in file URL (#126844)
In `urllib.request.url2pathname()`, if the authority resolves to the
current host, discard it. If an authority is present but resolves somewhere
else, then on Windows we return a UNC path (as before), and on other
platforms we raise `URLError`.

Affects `pathlib.Path.from_uri()` in the same way.

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
2025-04-10 19:58:04 +00:00
Barney Gale
8abfaba5a6
GH-125866: Deprecate nturl2path module (#131432)
Deprecate the `nturl2path` module. Its functionality is merged into
`urllib.request`.

Add `tests.test_nturl2path` to exercise `nturl2path`, as it's no longer
covered by `test_urllib`.
2025-03-19 19:33:01 +00:00
Seth Michael Larson
d89a5f6a6e
gh-105704: Disallow square brackets ([ and ]) in domain names for parsed URLs (#129418)
* gh-105704: Disallow square brackets ( and ) in domain names for parsed URLs

* Use Sphinx references

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>

* Add mismatched bracket test cases, fix news format

* Add more test coverage for ports

---------

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
2025-01-31 09:41:34 -08:00
Bénédikt Tran
53e8942e69
Explicitly import urllib.error in urllib.robotparser (#128737) 2025-01-13 17:14:59 +01:00
Serhiy Storchaka
5e65a1acc0
gh-128731: Fix ResourceWarning in robotparser.RobotFileParser.read() (GH-128733) 2025-01-12 15:14:46 +02:00
Calvin Bui
f9a5a3a3ef
gh-128192: support HTTP sha-256 digest authentication as per RFC-7617 (GH-128193)
support sha-256 digest authentication

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
2024-12-28 21:05:34 +00:00
Stephen Morton
a03efb533a
gh-127734: improve signature of urllib.request.HTTPPasswordMgrWithPriorAuth.__init__ (#127735)
improve signature of urllib.request.HTTPPasswordMgrWithPriorAuth.__init__
2024-12-08 10:46:34 -08:00
Barney Gale
79b7cab50a
GH-127090: Fix urllib.response.addinfourl.url value for opened file: URIs (#127091)
The canonical `file:` URL (as generated by `pathname2url()`) is now used as the `url` attribute of the returned `addinfourl` object. The `addinfourl.url` attribute reflects the resolved URL for both `file:` or `http[s]:` URLs now.
2024-12-07 17:58:42 +00:00
Barney Gale
5bb059fe60
GH-127236: pathname2url(): generate RFC 1738 URL for absolute POSIX path (#127194)
When handed an absolute Windows path such as `C:\foo` or `//server/share`,
the `urllib.request.pathname2url()` function returns a URL with an
authority section, such as `///C:/foo` or `//server/share` (or before
GH-126205, `////server/share`). Only the `file:` prefix is omitted.

But when handed an absolute POSIX path such as `/etc/hosts`, or a Windows
path of the same form (rooted but lacking a drive), the function returns a
URL without an authority section, such as `/etc/hosts`.

This patch corrects the discrepancy by adding a `//` prefix before
drive-less, rooted paths when generating URLs.
2024-11-25 19:59:20 +00:00
Serhiy Storchaka
97b2ceaaaf
gh-127217: Fix pathname2url() for paths starting with multiple slashes on Posix (GH-127218) 2024-11-24 19:30:29 +02:00
Stephen Morton
a4d4c1ede2
gh-126662: harmonize naming for three namedtuple base classes in urllib.parse (GH-126663)
harmonize naming for three namedtuple base classes in urllib.parse
2024-11-23 18:36:48 -08:00
Barney Gale
ebf564a1d3
GH-126766: url2pathname(): handle 'localhost' authority (#127129)
Discard any 'localhost' authority from the beginning of a `file:` URI. As a
result, file URIs like `//localhost/etc/hosts` are correctly decoded as
`/etc/hosts`.
2024-11-22 03:17:06 +00:00
Barney Gale
c9b399fbdb
GH-85168: Use filesystem encoding when converting to/from file URIs (#126852)
Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the
filesystem encoding when quoting and unquoting file URIs, rather than
forcing use of UTF-8.

No changes are needed in the `nturl2path` module because Windows always
uses UTF-8, per PEP 529.
2024-11-19 21:19:30 +00:00
Barney Gale
4d771977b1
GH-84850: Remove urllib.request.URLopener and FancyURLopener (#125739) 2024-11-19 16:01:49 +02:00
Barney Gale
cae9d9d20f
GH-126766: url2pathname(): handle empty authority section. (#126767)
Discard two leading slashes from the beginning of a `file:` URI if they
introduce an empty authority section. As a result, file URIs like
`///etc/hosts` are correctly parsed as `/etc/hosts`.
2024-11-14 20:22:14 +00:00
Serhiy Storchaka
7577307ebd
gh-116897: Deprecate generic false values in urllib.parse.parse_qsl() (GH-116903)
Accepting objects with false values (like 0 and []) except empty strings
and byte-like objects and None in urllib.parse functions parse_qsl() and
parse_qs() is now deprecated.
2024-11-12 21:10:29 +02:00
Serhiy Storchaka
dbb6e22cb1
gh-125926: Fix urllib.parse.urljoin() for base URI with undefined authority (GH-125989)
Although this goes beyond the application of RFC 3986, urljoin()
should support relative base URIs for backward compatibility.
2024-11-07 09:09:59 +02:00
Serhiy Storchaka
fc897fcc01
gh-76960: Fix urljoin() and urldefrag() for URIs with empty components (GH-123273)
* urljoin() with relative reference "?" sets empty query and removes fragment.
* Preserve empty components (authority, params, query, fragment) in urljoin().
* Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and
_coerce_result() calls in urljoin(), urldefrag(), urlparse() and
urlunparse().
2024-08-31 12:42:08 +03:00
Serhiy Storchaka
90c892efea
gh-85110: Preserve relative path in URL without netloc in urllib.parse.urlunsplit() (GH-123179) 2024-08-21 10:17:38 +03:00
Jeremy Hylton
77133f570d
gh-122909: Pass ftp error strings to URLError constructor (#122913)
* pass the original string error message from the ftplib error to URLError()

* Update request.py

Change error string for ftp error to be consistent with other errors reported for ftp

* Add NEWS entry for change to urllib.request for ftp errors.

* Track the change in the ftp error message in the test.
2024-08-20 00:35:05 +00:00
Victor Stinner
6ae254aaa0
gh-120417: Add #noqa to used imports in the stdlib (#120421)
Tools such as ruff can ignore "imported but unused" warnings if a
line ends with "# noqa: F401". It avoids the temptation to remove
an import which is used effectively.
2024-06-13 16:14:50 +02:00
Nikita Sobolev
84c3191954
gh-118827: Remove Quoter from urllib.parse (#118828)
Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
2024-06-03 10:50:29 +03:00
Serhiy Storchaka
e237b25a4f
gh-67693: Fix urlunparse() and urlunsplit() for URIs with path starting with multiple slashes and no authority (GH-113563) 2024-05-14 12:24:37 +03:00
Harmen Stoppels
759e8e7ab8
gh-99730: urllib.request: Keep HEAD method on redirect (GH-99731) 2024-05-01 18:01:47 +02:00
Serhiy Storchaka
1069a462f6
gh-116764: Fix regressions in urllib.parse.parse_qsl() (GH-116801)
* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef42b).
2024-03-16 12:36:05 +02:00
Serhiy Storchaka
bdba8ef42b
gh-74668: Fix support of bytes in urllib.parse.parse_qsl() (GH-115771)
urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
2024-03-05 17:49:50 +02:00
Weii Wang
c43b26d02e
gh-115197: Stop resolving host in urllib.request proxy bypass (GH-115210)
Use of a proxy is intended to defer DNS for the hosts to the proxy itself, rather than a potential for information leak of the host doing DNS resolution itself for any reason.  Proxy bypass lists are strictly name based.  Most implementations of proxy support agree.
2024-02-28 12:15:52 -08:00
Raphaël Marinier
5094690efd
gh-91539: Small performance improvement of urrlib.request.getproxies_environment() (#108771)
Small performance improvement of getproxies_environment() when there are many environment variables. In a benchmark with 5k environment variables not related to proxies, and 5 specifying proxies, we get a 10% walltime improvement.
2024-01-15 15:45:01 -08:00
zentarim
f3266c05b6
GH-104554: Add RTSPS support to urllib/parse.py (#104605)
* GH-104554: Add RTSPS support to `urllib/parse.py`

RTSPS is the permanent scheme defined in
https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml
alongside RTSP and RTSPU schemes.

* 📜🤖 Added by blurb_it.

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
2023-06-13 16:45:47 -07:00
Victor Stinner
2587b9f64e
gh-105382: Remove urllib.request cafile parameter (#105384)
Remove cafile, capath and cadefault parameters of the
urllib.request.urlopen() function, deprecated in Python 3.6.
2023-06-06 21:17:45 +00:00
Illia Volochii
2f630e1ce1
gh-102153: Start stripping C0 control and space chars in urlsplit (#102508)
`urllib.parse.urlsplit` has already been respecting the WHATWG spec a bit #25595.

This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/#url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).

---------

Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
2023-05-17 01:49:20 -07:00
JohnJamesUtley
29f348e232
gh-103848: Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format (#103849)
* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
2023-05-10 00:18:35 +00:00
Gregory P. Smith
82f789be3b
gh-104139: Add itms-services to uses_netloc urllib.parse. (#104312)
Teach unsplit to retain the `"//"` when assembling `itms-services://?action=generate-bugs` style
[Apple Platform Deployment](https://support.apple.com/en-gb/guide/deployment/depce7cefc4d/web) URLs.
2023-05-09 07:04:50 -07:00
Dan Hemberger
e38bebb9ee
gh-81403: Fix for CacheFTPHandler in urllib (#13951)
bpo-37222: Fix for CacheFTPHandler in urllib

A call to FTP.ntransfercmd must be followed by FTP.voidresp to clear
the "end transfer" message. Without this, the client and server get
out of sync, which will result in an error if the FTP instance is
reused to open a second URL. This scenario occurs for even the most
basic usage of CacheFTPHandler.

Reverts the patch merged as a resolution to bpo-16270 and adds a test
case for the CacheFTPHandler in test_urllib2net.py.

Co-authored-by: Senthil Kumaran <senthil@python.org>
2023-04-22 21:41:23 -07:00
Wheeler Law
5c00a6224d
gh-99352: Respect http.client.HTTPConnection.debuglevel in urllib.request.AbstractHTTPHandler (#99353)
* bugfix: let the HTTP- and HTTPSHandlers respect the value of http.client.HTTPConnection.debuglevel

* add tests

* add news

* ReSTify NEWS and reword a bit.

* Address Review Comments.

* Use mock.patch.object instead of settting the module level value.
* Used test values to assert the debuglevel.

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Senthil Kumaran <senthil@python.org>
2023-04-20 19:04:25 -07:00
Vo Hoang Long
0d4c7fcd4f
gh-101936: Update the default value of fp from io.StringIO to io.BytesIO (gh-102100)
Co-authored-by: Long Vo <long.vo@linecorp.com>
2023-02-22 00:14:41 +09:00
Gregory P. Smith
2e279e85fe
gh-88500: Reduce memory use of urllib.unquote (#96763)
`urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result depending on the input provided. As Python objects are relatively large, this could consume a lot of ram.

This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for unquote and unquote_to_bytes and no different for typical inputs that are short or lack much unicode or % escaping. But the functions are already quite fast anyways so not a big deal.  The slowdown scales consistently linear with input size as expected.

Memory usage observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unittesting memory consumption is difficult and does not seem worthwhile.

Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
2022-12-10 16:17:39 -08:00
Dong-hee Na
dc8a86893d
gh-98778: Update HTTPError to initialize properly even if fp is None (gh-99966) 2022-12-08 11:20:34 +09:00
Nick Drozd
024ac542d7
bpo-45975: Simplify some while-loops with walrus operator (GH-29347) 2022-11-26 14:33:25 -08:00
Ben Kallus
439b9cfaf4
gh-99418: Make urllib.parse.urlparse enforce that a scheme must begin with an alphabetical ASCII character. (#99421)
Prevent urllib.parse.urlparse from accepting schemes that don't begin with an alphabetical ASCII character.

RFC 3986 defines a scheme like this: `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`
RFC 2234 defines an ALPHA like this: `ALPHA = %x41-5A / %x61-7A`

The WHATWG URL spec defines a scheme like this:
`"A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.)."`
2022-11-13 10:25:55 -08:00