mirror of
https://github.com/python/cpython.git
synced 2025-10-19 07:53:46 +00:00
[3.13] gh-128571: Document UTF-16/32 native byte order (GH-139974) (#140308)
Closes GH-128571
(cherry picked from commit 920de7ccdc)
Co-authored-by: Parham MohammadAlizadeh <prhmma@gmail.com>
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
parent 762fbdbf8c
commit 7b6fb716a1
1 changed files with 16 additions and 11 deletions
@@ -978,17 +978,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
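The behaviour described by the new wording can be checked with a short script. This is a minimal sketch and not part of the patch: the variable names (``text``, ``native_codec``, ``foreign_bom`` and so on) are illustrative assumptions, and only the standard ``codecs`` and ``sys`` modules are used.

```python
import codecs
import sys

# Illustrative names only; none of these come from the patch itself.
text = "A"
little = sys.byteorder == "little"
native_codec = "utf-32-le" if little else "utf-32-be"
foreign_codec = "utf-32-be" if little else "utf-32-le"
native_bom = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
foreign_bom = codecs.BOM_UTF32_BE if little else codecs.BOM_UTF32_LE

# Encoding with the generic codec writes a BOM followed by native-order data.
with_bom = text.encode("utf-32")

# Decoding BOM-less data with the generic codec assumes native byte order.
decoded_without_bom = text.encode(native_codec).decode("utf-32")

# A BOM in the other byte order tells the decoder to swap bytes.
decoded_foreign = (foreign_bom + text.encode(foreign_codec)).decode("utf-32")

print(with_bom == native_bom + text.encode(native_codec))
print(decoded_without_bom == text, decoded_foreign == text)
```

The same checks should carry over to ``UTF-16`` by substituting the ``utf-16`` codec names and ``codecs.BOM_UTF16_LE`` / ``codecs.BOM_UTF16_BE``.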