Canonical source repository for PyYAML
Anish Athalye 0716ae21a1 Fix reader for Unicode code points over 0xFFFF (#351)
This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 build without UCS-4 support
(which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:

    [i for i in u'\U0001f3d4']
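
The mapping from a code point to its surrogate pair is fixed arithmetic
defined by UTF-16. As a minimal sketch (written for Python 3, which does not
store strings as UCS-2; the helper name `to_surrogate_pair` is hypothetical
and not part of PyYAML), the pair for the example above can be computed by
hand:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only; not part of PyYAML.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF have surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the surrogate high
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the surrogate low
    return high, low


high, low = to_surrogate_pair(0x1F3D4)
print(hex(high), hex(low))  # 0xd83c 0xdfd4
```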

In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.

It would be fairly natural to write a regular expression that matches
strings containing only *printable* characters, rather than one that
searches for *non-printable* characters (here using the old code's
definition of printable, which does not yet allow surrogate pairs):

    PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Supporting surrogate pairs here would be straightforward: add an
alternative that matches a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):

    PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

    def check_printable(self, data):
        if not self.PRINTABLE.match(data):
            raise ReaderError(...)

However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).
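
The trade-off can be seen in a small sketch. It is written for Python 3 and
uses the old (pre-patch) character set; the names `ALLOWED`, `PRINTABLE`, and
`NON_PRINTABLE` here are local to the example, not necessarily how the reader
module defines them:

```python
import re

# Character set from the old definition of "printable" (no surrogates).
ALLOWED = '\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD'
PRINTABLE = re.compile('^[' + ALLOWED + ']*$')
NON_PRINTABLE = re.compile('[^' + ALLOWED + ']')

data = 'key: va\x07lue'  # BEL (0x07) sits at index 7

# Matching the whole string only tells us that something is wrong.
whole = PRINTABLE.match(data)

# Searching for an offender tells us which character and where.
culprit = NON_PRINTABLE.search(data)
position, character = culprit.start(), culprit.group()
```

Here `whole` is `None`, which carries no location, while `culprit` yields both
the offending character and its index for the error message.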

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by no longer treating surrogate code points as
non-printable in the existing character class, and by adding the following
alternatives that match illegal uses of surrogate code points:

- Surrogate low that doesn't follow a surrogate high (either a surrogate
  low at the start of a string, or a surrogate low that follows a
  character that's not a surrogate high):

    (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

- Surrogate high that isn't followed by a surrogate low (either a
  surrogate high at the end of a string, or a surrogate high that is
  followed by a character that's not a surrogate low):

    [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
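
Putting the pieces above together, a sketch of such a regex might look like
the following. It is assembled from the fragments in this message rather than
copied from the patch, and the widened first alternative (`\xA0-\uFFFD`) is an
assumption about how the surrogate range is folded back into the allowed set.
It is written for Python 3, where a UCS-2 string can be emulated by spelling
out the surrogate halves explicitly; `first_illegal` is a hypothetical helper.

```python
import re

NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]'  # outside the allowed set
    '|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]'    # low not preceded by a high
    '|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'    # high not followed by a low
)


def first_illegal(data):
    """Return (start, matched_text) for the first illegal span, or None."""
    m = NON_PRINTABLE.search(data)
    return None if m is None else (m.start(), m.group())


legal_pair = first_illegal('\ud83c\udfd4')   # well-formed pair
lone_high = first_illegal('\ud83cx')         # high followed by a non-low
lone_low = first_illegal('a\udfd4')          # low preceded by a non-high
control = first_illegal('ok\x00')            # ordinary non-printable
```

Note that the two surrogate alternatives consume a neighbouring character, so
for a stray surrogate the matched text can be two characters long and the
match may start one position before the surrogate itself.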
2019-12-20 20:38:46 +01:00
examples Fix typos 2017-08-08 06:05:28 -05:00
ext Fix typos 2017-08-08 06:05:28 -05:00
lib/yaml Fix reader for Unicode code points over 0xFFFF (#351) 2019-12-20 20:38:46 +01:00
lib3/yaml Allow add_multi_constructor with None (#358) 2019-12-07 22:40:48 +01:00
packaging/build Windows build tweaks 2019-11-27 23:00:21 +01:00
tests Allow add_multi_constructor with None (#358) 2019-12-07 22:40:48 +01:00
.appveyor.yml Fix appveyor.yml to use libyaml tag not branch 2019-12-03 23:36:50 +01:00
.gitignore Windows Appveyor build 2019-03-12 16:22:31 -07:00
.travis.yml Travis CI: Test on Python 3.8 production release 2019-12-03 23:38:13 +01:00
announcement.msg Version 5.2 2019-12-02 21:13:24 +01:00
CHANGES Version 5.2 2019-12-02 21:13:24 +01:00
LICENSE Updates for 5.1 release 2019-03-13 08:45:34 -07:00
Makefile Changes for 4.1 release 2018-06-26 15:08:15 -07:00
MANIFEST.in scanner: use infinitive verb after auxiliary word could 2015-04-04 13:25:24 -03:00
README Add use of safe_load() function in README (#285) 2019-12-07 22:44:29 +01:00
setup.cfg Squash/merge pull request #105 from nnadeau/patch-1 2019-03-08 09:09:48 -08:00
setup.py fixup! setup.py: python_requires='!=3.4.*', 2019-12-03 23:38:13 +01:00
tox.ini tox.ini: Add py38 and remove py34 2019-12-03 23:38:13 +01:00

PyYAML - The next generation YAML parser and emitter for Python.

To install, type 'python setup.py install'.

By default, the setup.py script checks whether LibYAML is installed
and if so, builds and installs LibYAML bindings.  To skip the check
and force installation of LibYAML bindings, use the option '--with-libyaml':
'python setup.py --with-libyaml install'.  To disable the check and
skip building and installing LibYAML bindings, use '--without-libyaml':
'python setup.py --without-libyaml install'.

When LibYAML bindings are installed, you may use the fast LibYAML-based
parser and emitter as follows:

    >>> yaml.load(stream, Loader=yaml.CLoader)
    >>> yaml.dump(data, Dumper=yaml.CDumper)

If you don't trust the input stream, you should use:

    >>> yaml.safe_load(stream)
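
The two points above can be combined. The sketch below assumes PyYAML is
installed and that, when the LibYAML bindings were built, the module also
exposes `CSafeLoader` and `CSafeDumper`; it falls back to the pure-Python
classes otherwise. The sample document is made up for illustration.

```python
import yaml

# Prefer the LibYAML-backed safe classes when the bindings are available;
# otherwise fall back to the pure-Python implementations.
try:
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    from yaml import SafeLoader, SafeDumper

text = "name: PyYAML\ntags: [parser, emitter]\n"

# Safe loading only constructs plain Python objects (dicts, lists, strings...).
data = yaml.load(text, Loader=SafeLoader)

# Dump and reload to check that the data survives a round trip.
round_trip = yaml.load(yaml.dump(data, Dumper=SafeDumper), Loader=SafeLoader)
```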

PyYAML includes a comprehensive test suite.  To run the tests,
type 'python setup.py test'.

For more information, check the PyYAML homepage:
'https://github.com/yaml/pyyaml'.

For the PyYAML tutorial and reference, see:
'http://pyyaml.org/wiki/PyYAMLDocumentation'.

Discuss PyYAML with the maintainers on IRC: #pyyaml on irc.freenode.net.

You may also use the YAML-Core mailing list:
'http://lists.sourceforge.net/lists/listinfo/yaml-core'.

Submit bug reports and feature requests to the PyYAML bug tracker:
'https://github.com/yaml/pyyaml/issues'.

The PyYAML module was written by Kirill Simonov <xi@resolvent.net>.
It is currently maintained by the YAML and Python communities.

PyYAML is released under the MIT license.
See the file LICENSE for more details.