[3.10] [3.11] gh-102153: Start stripping C0 control and space chars in urlsplit (GH-102508) (GH-104575) (#104592)

gh-102153: Start stripping C0 control and space chars in `urlsplit` (GH-102508)

`urllib.parse.urlsplit` has already been respecting the WHATWG spec a bit GH-25595.

This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/GH-url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).

I simplified the docs by eliding the state of the world explanatory
paragraph in this security release only backport.  (people will see
that in the mainline /3/ docs)

---------

(cherry picked from commit 2f630e1ce1)
(cherry picked from commit 610cc0ab1b)

Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Co-authored-by: Illia Volochii <illia.volochii@gmail.com>
Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
This commit is contained in:
Miss Islington (bot) 2023-05-17 16:06:06 -07:00 committed by GitHub
parent 425065bb00
commit f48a96a280
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
4 changed files with 111 additions and 3 deletions

View file

@ -159,6 +159,10 @@ or on combining URL components into a URL string.
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
.. warning::
:func:`urlparse` does not perform validation. See :ref:`URL parsing
security <url-parsing-security>` for details.
.. versionchanged:: 3.2
Added IPv6 URL parsing capabilities.
@ -324,8 +328,14 @@ or on combining URL components into a URL string.
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
decomposed before parsing, no error will be raised.
Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
control and space characters are stripped from the URL. ``\n``,
``\r`` and tab ``\t`` characters are removed from the URL at any position.
.. warning::
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
security <url-parsing-security>` for details.
.. versionchanged:: 3.6
Out-of-range port numbers now raise :exc:`ValueError`, instead of
@ -338,6 +348,9 @@ or on combining URL components into a URL string.
.. versionchanged:: 3.10
ASCII newline and tab characters are stripped from the URL.
.. versionchanged:: 3.10.12
Leading WHATWG C0 control and space characters are stripped from the URL.
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
.. function:: urlunsplit(parts)
@ -414,6 +427,27 @@ or on combining URL components into a URL string.
or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
without changes.
.. _url-parsing-security:
URL parsing security
--------------------
The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation** of
inputs. They may not raise errors on inputs that other applications consider
invalid. They may also succeed on some inputs that might not be considered
URLs elsewhere. Their purpose is for practical functionality rather than
purity.
Instead of raising an exception on unusual input, they may instead return some
component parts as empty strings. Or components may contain more than perhaps
they should.
We recommend that users of these APIs where the values may be used anywhere
with security implications code defensively. Do some verification within your
code before trusting a returned component part. Does that ``scheme`` make
sense? Is that a sensible ``path``? Is there anything strange about that
``hostname``? etc.
.. _parsing-ascii-encoded-bytes:
Parsing ASCII Encoded Bytes