495 commits

Author SHA1 Message Date
Miss Islington (bot)
9352936b4e
[3.15] gh-149083: Use sentinel for urllib.parse._UNSPECIFIED (GH-149612) (#151017)
This was added in 3.15; let's use a real sentinel instead of an ad-hoc list object.
(cherry picked from commit 884ac3e3ec)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
2026-06-06 13:13:52 +00:00
Serhiy Storchaka
bc285e5832
gh-138907: Support RFC 9309 in robotparser (GH-138908)
* empty lines are always ignored instead of separating groups
* the "user-agent" line after a rule starts a new group
* groups matching the same user agent are now merged
* the rule with the longest match wins instead of the first matching rule
* in case of equal matches, the “Allow” rule wins over “Disallow”
* special characters “$” and “*” are now supported in rules
* prefer full match for user agent
2026-05-04 18:03:11 +00:00
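The matching rules listed in the robotparser commit above (longest match wins, "Allow" wins ties, "$" and "*" are special) can be illustrated with a small standalone sketch. This is not the `urllib.robotparser` implementation; `pattern_to_regex` and `is_allowed` are hypothetical helpers, and pattern length is used as a simple stand-in for match specificity.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(rules, path):
    """Decide access for path from a list of (allow, pattern) pairs.

    The longest matching pattern wins; on equal length, Allow wins.
    With no matching rule, access is allowed.
    """
    best = None  # (pattern length, allow flag) of the best match so far
    for allow, pattern in rules:
        if not pattern:
            continue  # assumption: an empty pattern constrains nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            # Tuple comparison: longer pattern first, then True > False.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [
    (False, "/private/"),
    (True, "/private/public/"),
    (False, "/*.php$"),
    (True, "/tmp"),
    (False, "/tmp"),
]
print(is_allowed(rules, "/private/public/page"))
print(is_allowed(rules, "/index.php"))
```

Comparing `(length, allow)` tuples keeps the tie-break in one place: a longer pattern always wins, and `True` sorts above `False` when lengths are equal.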
Serhiy Storchaka
67ddba9aa9
gh-144148: Update the urllib.parse documentation (GH-144497)
Document urlsplit() as the main parsing function and urlparse() as
an obsolete variant.
2026-02-05 16:32:17 +02:00
Serhiy Storchaka
c5cfcdf16a
gh-67041: Allow to distinguish between empty and not defined URI components (GH-123305)
Changes in the urllib.parse module:

* Add option missing_as_none in urlparse(), urlsplit() and urldefrag(). If
  it is true, represent not defined components as None instead of an
  empty string.
* Add option keep_empty in urlunparse() and urlunsplit(). If it is
  true, keep empty non-None components in the resulting string.
2026-01-22 14:29:13 +02:00
Seth Michael Larson
f25509e78e
gh-143925: Reject control characters in data: URL mediatypes
2026-01-20 20:45:58 +00:00
MonadChains
1c544acaa5
gh-124098: Fix incorrect inclusion of handler methods without protocol prefix in OpenerDirector (GH-136873)
2025-12-18 13:50:05 +01:00
Victor Stinner
0b8c348f27
Fix pyflakes warnings: variable is assigned to but never used (#142294)
Example of fixed warning:

    Lib/netrc.py:98:13: local variable 'toplevel'
    is assigned to but never used
2025-12-08 14:00:31 +01:00
Petr Viktorin
f2bce51b98
gh-140691: urllib.request: Close FTP control socket if data socket can't connect (GH-140835)
Co-authored-by: codenamenam <bluetire27@gmail.com>
2025-11-05 11:52:11 +01:00
SarahPythonista
c50d794c7b
Improve the comment in URLError (#139874)
Clarify that it overrides `__init__` and `__str__`.
2025-10-14 12:31:21 -07:00
Jeff Epler
d3e3b2b0ac
Edit outdated comment (#121152)
A comment about a possible relaxation of how bytes URLs are treated
in Python 3.3 is no longer relevant or useful.
2025-09-28 14:55:44 -07:00
Serhiy Storchaka
cb7ef18d70
gh-88375, gh-111788: Fix parsing errors and normalization in robotparser (GH-138502)
* Don't fail trying to parse weird patterns.
* Don't fail trying to decode non-UTF-8 "robots.txt" files.
* No longer ignore trailing "?" in patterns and URLs.
* Distinguish raw special characters "?", "=" and "&" from the
  percent-encoded ones.
* Remove tests that do nothing.
2025-09-05 18:58:42 +03:00
Barney Gale
10a925c86d
GH-137059: url2pathname(): fix support for drive letter in netloc (#137060)
Support file URLs like `file://c:/foo` in `urllib.request.url2pathname()`
on Windows. This restores behaviour from 3.13.
2025-07-27 11:44:41 +00:00
Barney Gale
80b2d60a51
GH-136874: url2pathname(): discard query and fragment components (#136875)
In `urllib.request.url2pathname()`, ignore any query or fragment components
in the given URL.
2025-07-21 17:33:20 +00:00
Barney Gale
8e08ac9f32
GH-123599: url2pathname(): don't call gethostbyname() by default (#132610)
Follow-up to 66cdb2bd8a.

Add *resolve_host* keyword-only argument to `url2pathname()`, defaulting to
false. When set to true, we call `socket.gethostbyname()` to resolve the
URL hostname.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Steve Dower <steve.dower@microsoft.com>
2025-05-05 17:03:42 +00:00
Serhiy Storchaka
84a08f8629
gh-133306: Use \z instead of \Z in regular expressions in the stdlib (GH-133337)
2025-05-03 17:58:49 +03:00
Barney Gale
0879ebc953
GH-123599: Match file: URL hostname against machine hostname in urllib (#132523)
In `_is_local_authority()`, return early if the authority matches the
machine hostname from `socket.gethostname()`, rather than resolving the
names and matching IP addresses.
2025-04-15 01:05:06 +01:00
Barney Gale
ccad61e35d
GH-125866: Support complete "file:" URLs in urllib (#132378)
Add optional *add_scheme* argument to `urllib.request.pathname2url()`; when
set to true, a complete URL is returned. Likewise add optional
*require_scheme* argument to `url2pathname()`; when set to true, a complete
URL is accepted.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
2025-04-14 01:49:02 +01:00
Barney Gale
66cdb2bd8a
GH-123599: url2pathname(): handle authority section in file URL (#126844)
In `urllib.request.url2pathname()`, if the authority resolves to the
current host, discard it. If an authority is present but resolves somewhere
else, then on Windows we return a UNC path (as before), and on other
platforms we raise `URLError`.

Affects `pathlib.Path.from_uri()` in the same way.

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
2025-04-10 19:58:04 +00:00
Barney Gale
8abfaba5a6
GH-125866: Deprecate nturl2path module (#131432)
Deprecate the `nturl2path` module. Its functionality is merged into
`urllib.request`.

Add `tests.test_nturl2path` to exercise `nturl2path`, as it's no longer
covered by `test_urllib`.
2025-03-19 19:33:01 +00:00
Seth Michael Larson
d89a5f6a6e
gh-105704: Disallow square brackets ([ and ]) in domain names for parsed URLs (#129418)
* gh-105704: Disallow square brackets ([ and ]) in domain names for parsed URLs

* Use Sphinx references

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>

* Add mismatched bracket test cases, fix news format

* Add more test coverage for ports

---------

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
2025-01-31 09:41:34 -08:00
Bénédikt Tran
53e8942e69
Explicitly import urllib.error in urllib.robotparser (#128737)
2025-01-13 17:14:59 +01:00
Serhiy Storchaka
5e65a1acc0
gh-128731: Fix ResourceWarning in robotparser.RobotFileParser.read() (GH-128733)
2025-01-12 15:14:46 +02:00
Calvin Bui
f9a5a3a3ef
gh-128192: support HTTP sha-256 digest authentication as per RFC-7617 (GH-128193)
support sha-256 digest authentication

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
2024-12-28 21:05:34 +00:00
Stephen Morton
a03efb533a
gh-127734: improve signature of urllib.request.HTTPPasswordMgrWithPriorAuth.__init__ (#127735)
2024-12-08 10:46:34 -08:00
Barney Gale
79b7cab50a
GH-127090: Fix urllib.response.addinfourl.url value for opened file: URIs (#127091)
The canonical `file:` URL (as generated by `pathname2url()`) is now used as the `url` attribute of the returned `addinfourl` object. The `addinfourl.url` attribute now reflects the resolved URL for both `file:` and `http[s]:` URLs.
2024-12-07 17:58:42 +00:00
Barney Gale
5bb059fe60
GH-127236: pathname2url(): generate RFC 1738 URL for absolute POSIX path (#127194)
When handed an absolute Windows path such as `C:\foo` or `//server/share`,
the `urllib.request.pathname2url()` function returns a URL with an
authority section, such as `///C:/foo` or `//server/share` (or before
GH-126205, `////server/share`). Only the `file:` prefix is omitted.

But when handed an absolute POSIX path such as `/etc/hosts`, or a Windows
path of the same form (rooted but lacking a drive), the function returns a
URL without an authority section, such as `/etc/hosts`.

This patch corrects the discrepancy by adding a `//` prefix before
drive-less, rooted paths when generating URLs.
2024-11-25 19:59:20 +00:00
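The `//` prefix described in the commit above can be sketched for the POSIX case as follows. This is only an illustration under simple assumptions, not the `pathname2url()` source: `posix_path_to_url_path` is a hypothetical helper, it ignores Windows drives and UNC paths, and it does not special-case paths that begin with several slashes.

```python
from urllib.parse import quote

def posix_path_to_url_path(path):
    """Return the part of a file URL that follows the "file:" prefix."""
    if path.startswith("/"):
        # Rooted path: add "//" so the URL carries an (empty) authority.
        return "//" + quote(path)
    # Relative path: no authority section is added.
    return quote(path)

print(posix_path_to_url_path("/etc/hosts"))
print("file:" + posix_path_to_url_path("/my docs/a.txt"))
```

Prepending `file:` to the result then yields a complete URL of the `file:///...` form.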
Serhiy Storchaka
97b2ceaaaf
gh-127217: Fix pathname2url() for paths starting with multiple slashes on Posix (GH-127218)
2024-11-24 19:30:29 +02:00
Stephen Morton
a4d4c1ede2
gh-126662: harmonize naming for three namedtuple base classes in urllib.parse (GH-126663)
2024-11-23 18:36:48 -08:00
Barney Gale
ebf564a1d3
GH-126766: url2pathname(): handle 'localhost' authority (#127129)
Discard any 'localhost' authority from the beginning of a `file:` URI. As a
result, file URIs like `//localhost/etc/hosts` are correctly decoded as
`/etc/hosts`.
2024-11-22 03:17:06 +00:00
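A rough sketch of the authority handling described in the commit above, written as a standalone function rather than taken from `urllib.request`. `strip_local_authority` is a hypothetical name; the sketch treats both an empty authority and `localhost` as local, and leaves percent-decoding and any other authority untouched.

```python
def strip_local_authority(url_path):
    """Drop an empty or 'localhost' authority from a scheme-less file URL."""
    if url_path.startswith("//"):
        # Everything between "//" and the next "/" is the authority.
        authority, slash, rest = url_path[2:].partition("/")
        if authority in ("", "localhost"):
            return slash + rest
    # No authority, or a non-local one: return the input unchanged.
    return url_path

print(strip_local_authority("//localhost/etc/hosts"))
print(strip_local_authority("///etc/hosts"))
```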
Barney Gale
c9b399fbdb
GH-85168: Use filesystem encoding when converting to/from file URIs (#126852)
Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the
filesystem encoding when quoting and unquoting file URIs, rather than
forcing use of UTF-8.

No changes are needed in the `nturl2path` module because Windows always
uses UTF-8, per PEP 529.
2024-11-19 21:19:30 +00:00
Barney Gale
4d771977b1
GH-84850: Remove urllib.request.URLopener and FancyURLopener (#125739)
2024-11-19 16:01:49 +02:00
Barney Gale
cae9d9d20f
GH-126766: url2pathname(): handle empty authority section. (#126767)
Discard two leading slashes from the beginning of a `file:` URI if they
introduce an empty authority section. As a result, file URIs like
`///etc/hosts` are correctly parsed as `/etc/hosts`.
2024-11-14 20:22:14 +00:00
Serhiy Storchaka
7577307ebd
gh-116897: Deprecate generic false values in urllib.parse.parse_qsl() (GH-116903)
Accepting objects with false values (like 0 and []), other than empty
strings, bytes-like objects and None, in the urllib.parse functions
parse_qsl() and parse_qs() is now deprecated.
2024-11-12 21:10:29 +02:00
Serhiy Storchaka
dbb6e22cb1
gh-125926: Fix urllib.parse.urljoin() for base URI with undefined authority (GH-125989)
Although this goes beyond the application of RFC 3986, urljoin()
should support relative base URIs for backward compatibility.
2024-11-07 09:09:59 +02:00
Serhiy Storchaka
fc897fcc01
gh-76960: Fix urljoin() and urldefrag() for URIs with empty components (GH-123273)
* urljoin() with relative reference "?" sets empty query and removes fragment.
* Preserve empty components (authority, params, query, fragment) in urljoin().
* Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and
_coerce_result() calls in urljoin(), urldefrag(), urlparse() and
urlunparse().
2024-08-31 12:42:08 +03:00
Serhiy Storchaka
90c892efea
gh-85110: Preserve relative path in URL without netloc in urllib.parse.urlunsplit() (GH-123179)
2024-08-21 10:17:38 +03:00
Jeremy Hylton
77133f570d
gh-122909: Pass ftp error strings to URLError constructor (#122913)
* pass the original string error message from the ftplib error to URLError()

* Update request.py

Change error string for ftp error to be consistent with other errors reported for ftp

* Add NEWS entry for change to urllib.request for ftp errors.

* Track the change in the ftp error message in the test.
2024-08-20 00:35:05 +00:00
Victor Stinner
6ae254aaa0
gh-120417: Add #noqa to used imports in the stdlib (#120421)
Tools such as ruff can ignore "imported but unused" warnings if a
line ends with "# noqa: F401". It avoids the temptation to remove
an import which is used effectively.
2024-06-13 16:14:50 +02:00
Nikita Sobolev
84c3191954
gh-118827: Remove Quoter from urllib.parse (#118828)
Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
2024-06-03 10:50:29 +03:00
Serhiy Storchaka
e237b25a4f
gh-67693: Fix urlunparse() and urlunsplit() for URIs with path starting with multiple slashes and no authority (GH-113563)
2024-05-14 12:24:37 +03:00
Harmen Stoppels
759e8e7ab8
gh-99730: urllib.request: Keep HEAD method on redirect (GH-99731)
2024-05-01 18:01:47 +02:00
Serhiy Storchaka
1069a462f6
gh-116764: Fix regressions in urllib.parse.parse_qsl() (GH-116801)
* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef42b).
2024-03-16 12:36:05 +02:00
Serhiy Storchaka
bdba8ef42b
gh-74668: Fix support of bytes in urllib.parse.parse_qsl() (GH-115771)
urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
2024-03-05 17:49:50 +02:00
Weii Wang
c43b26d02e
gh-115197: Stop resolving host in urllib.request proxy bypass (GH-115210)
A proxy is meant to handle DNS resolution for target hosts itself; resolving them locally risks leaking information about which hosts are being contacted. Proxy bypass lists are strictly name-based, and most implementations of proxy support agree.
2024-02-28 12:15:52 -08:00
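To illustrate what a strictly name-based bypass check looks like, here is a minimal sketch that compares host names as strings and never performs a DNS lookup. It is not the code changed by the commit above; `bypass_proxy` is a hypothetical helper, and it assumes `host` is a bare hostname without a port and `no_proxy` is a comma-separated list in the usual environment-variable style.

```python
def bypass_proxy(host, no_proxy):
    """Return True if host matches an entry in no_proxy, by name only."""
    host = host.lower()
    for name in no_proxy.split(","):
        # Normalize each entry: trim spaces, drop a leading dot.
        name = name.strip().lstrip(".").lower()
        if not name:
            continue
        if name == "*":
            return True  # wildcard entry: bypass the proxy for every host
        # Match the exact name or any subdomain of it.
        if host == name or host.endswith("." + name):
            return True
    return False

print(bypass_proxy("api.example.com", "example.com, .internal"))
print(bypass_proxy("badexample.com", "example.com"))
```

Because only strings are compared, the check cannot reveal to a local resolver which hosts are being requested.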
Raphaël Marinier
5094690efd
gh-91539: Small performance improvement of urllib.request.getproxies_environment() (#108771)
Small performance improvement of getproxies_environment() when there are many environment variables. In a benchmark with 5k environment variables not related to proxies, and 5 specifying proxies, we get a 10% walltime improvement.
2024-01-15 15:45:01 -08:00
zentarim
f3266c05b6
GH-104554: Add RTSPS support to urllib/parse.py (#104605)
* GH-104554: Add RTSPS support to `urllib/parse.py`

RTSPS is the permanent scheme defined in
https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml
alongside RTSP and RTSPU schemes.

* 📜🤖 Added by blurb_it.

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
2023-06-13 16:45:47 -07:00
Victor Stinner
2587b9f64e
gh-105382: Remove urllib.request cafile parameter (#105384)
Remove cafile, capath and cadefault parameters of the
urllib.request.urlopen() function, deprecated in Python 3.6.
2023-06-06 21:17:45 +00:00
Illia Volochii
2f630e1ce1
gh-102153: Start stripping C0 control and space chars in urlsplit (#102508)
`urllib.parse.urlsplit` has already been respecting the WHATWG spec a bit #25595.

This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/#url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).

---------

Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
2023-05-17 01:49:20 -07:00
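The sanitizing step quoted in the commit above amounts to removing leading C0 control characters and spaces before parsing. A minimal sketch, assuming the C0 range U+0000 to U+001F plus the space character; `C0_CONTROL_OR_SPACE` and `strip_leading_c0` are illustrative names, not necessarily those used in `urllib.parse`.

```python
# C0 control characters (U+0000 to U+001F) plus the space character.
C0_CONTROL_OR_SPACE = "".join(map(chr, range(0x20))) + " "

def strip_leading_c0(url):
    """Remove leading C0 control and space characters from url."""
    # str.lstrip() treats its argument as a set of characters to drop.
    return url.lstrip(C0_CONTROL_OR_SPACE)

print(repr(strip_leading_c0(" \t\nhttps://example.com/path")))
```

Without this step, a URL with a leading space may not be recognized as having a scheme at all, which is how scheme-based blocklists could be bypassed.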
JohnJamesUtley
29f348e232
gh-103848: Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format (#103849)
* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
2023-05-10 00:18:35 +00:00
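The kind of check described in the commit above can be sketched with the standard `ipaddress` module. This is an approximation for illustration, not the `urlsplit()` source: `check_bracketed_host` is a hypothetical name, it receives the host text without the surrounding brackets, and the IPvFuture pattern is a loose reading of the RFC 3986 grammar.

```python
import ipaddress
import re

def check_bracketed_host(host):
    """Raise ValueError unless host is an IPv6 or IPvFuture literal."""
    if host.startswith(("v", "V")):
        # IPvFuture: "v", one or more hex digits, ".", then at least one character.
        if not re.fullmatch(r"[vV][0-9a-fA-F]+\..+", host):
            raise ValueError("invalid IPvFuture address")
    else:
        # ip_address() raises ValueError for anything that is not an IP address.
        address = ipaddress.ip_address(host)
        if not isinstance(address, ipaddress.IPv6Address):
            raise ValueError("an IPv4 address cannot be in brackets")

check_bracketed_host("::1")      # accepted: IPv6
check_bracketed_host("v1.fe80")  # accepted: IPvFuture form
```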
Gregory P. Smith
82f789be3b
gh-104139: Add itms-services to uses_netloc urllib.parse. (#104312)
Teach unsplit to retain the `"//"` when assembling `itms-services://?action=generate-bugs` style
[Apple Platform Deployment](https://support.apple.com/en-gb/guide/deployment/depce7cefc4d/web) URLs.
2023-05-09 07:04:50 -07:00