[3.13] gh-128571: Document UTF-16/32 native byte order (GH-139974) (#140308)

Closes GH-128571
(cherry picked from commit 920de7ccdc)

Co-authored-by: Parham MohammadAlizadeh <prhmma@gmail.com>
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Miss Islington (bot) 2025-10-18 21:03:38 +02:00 committed by GitHub
parent 762fbdbf8c
commit 7b6fb716a1
GPG key ID: B5690EEEBB952194

@@ -978,17 +978,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
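
Reviewer note: the behaviour this hunk documents can be checked with a short script. The following is a minimal sketch, not part of the commit. The names ``NATIVE`` and ``native_bom`` are hypothetical helpers introduced only for illustration. It assumes CPython's built-in ``utf-16`` and ``utf-32`` codecs and the BOM constants in the ``codecs`` module.

```python
import codecs
import sys

# "le" on little-endian platforms, "be" on big-endian ones (illustrative helper).
NATIVE = "le" if sys.byteorder == "little" else "be"


def native_bom(bits):
    """Return the UTF-16 or UTF-32 BOM bytes in the platform's byte order."""
    return getattr(codecs, f"BOM_UTF{bits}_{NATIVE.upper()}")


text = "A"

# Codecs with an explicit byte order never write a BOM.
assert text.encode("utf-32-be") == b"\x00\x00\x00A"
assert text.encode("utf-32-le") == b"A\x00\x00\x00"

# The generic codecs write a BOM, then the data in native byte order.
assert text.encode("utf-32") == native_bom(32) + text.encode(f"utf-32-{NATIVE}")
assert text.encode("utf-16") == native_bom(16) + text.encode(f"utf-16-{NATIVE}")

# On decoding, a leading BOM selects the byte order on any platform
# and is removed from the result.
assert (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).decode("utf-16") == text

# Without a BOM, the generic codec assumes the platform's native byte order.
assert text.encode(f"utf-16-{NATIVE}").decode("utf-16") == text

# The same two BOM bytes read in the wrong order give 0xFFFE, which is
# how a decoder can tell that the bytes need swapping.
assert int.from_bytes(codecs.BOM_UTF16_LE, "little") == 0xFEFF
assert int.from_bytes(codecs.BOM_UTF16_LE, "big") == 0xFFFE
```

The script derives the native order from ``sys.byteorder`` instead of hard-coding it, so the assertions should hold on both little-endian and big-endian machines.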