Add SIMD optimization for `bytes.hex()`, `bytearray.hex()`, and `binascii.hexlify()` as well as `hashlib` `.hexdigest()` methods using platform-agnostic GCC/Clang vector extensions that compile to native SIMD instructions on our [PEP-11 Tier 1 Linux and macOS](https://peps.python.org/pep-0011/#tier-1) platforms.
- 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
- Up to 11x faster for large data (1KB+)
- Retains the existing scalar code for short inputs (<16 bytes) or platforms lacking SIMD instructions, no observable performance regressions there.
## Supported platforms:
- x86-64: the compiler generates SSE2 - always available, no flags or CPU feature checks needed
- ARM64: NEON is always available, always available, no flags or CPU feature checks needed
- ARM32: Requires NEON support and that appropriate compiler flags enable that (e.g., `-march=native` on a Raspberry Pi 3+) - while we _could_ use runtime detection to allow neon when compiled without a recent enough `-march=` flag (`cortex-a53` and later IIRC), there are diminishing returns in doing so. Anyone using 32-bit ARM in a situation where performance matters will already be compiling with such flags. (as opposed to 32-bit Raspbian compilation that defaults to aiming primarily for compatibility with rpi1&0 armv6 arch=armhf which lacks neon)
- Windows/MSVC: Not supported. MSVC lacks `__builtin_shufflevector`, so the existing scalar path is used. Leaving it as an opportunity for the future for someone to figure out how to express the intent to that compiler.
This is compile time detection of features that are always available on the target architectures. No need for runtime feature inspection.
This affects string formatting as well as bytes and bytearray formatting.
* For errors in the format string, always include the position of the
start of the format unit.
* For errors related to the formatted arguments, always include the number
or the name of the formatted argument.
* Suggest more probable causes of errors in the format string (stray %,
unsupported format, unexpected character).
* Provide more information when the number of arguments does not match
the number of format units.
* Raise more specific errors when access of arguments by name is mixed with
sequential access and when * is used with a mapping.
* Add tests for some uncovered cases.
Update `bytearray` to contain a `bytes` and provide a zero-copy path to
"extract" the `bytes`. This allows making several code paths more efficient.
This does not move any codepaths to make use of this new API. The documentation
changes include common code patterns which can be made more efficient with
this API.
---
When just changing `bytearray` to contain `bytes` I ran pyperformance on a
`--with-lto --enable-optimizations --with-static-libpython` build and don't see
any major speedups or slowdowns with this; all seems to be in the noise of
my machine (Generally changes under 5% or benchmarks that don't touch
bytes/bytearray).
Co-authored-by: Victor Stinner <vstinner@python.org>
Co-authored-by: Maurycy Pawłowski-Wieroński <5383+maurycy@users.noreply.github.com>
* Merge existing tests test_repr_str and test_to_str.
* Add more tests for non-printable and non-ASCII bytes.
* Add tests for special escape sequences ('\t\n\r').
* Add tests for slashes.
* Add more tests for quotes.
* Add tests for subclasses.
* Add test for non-ASCII class name.
* Only apply @check_bytes_warnings for str() tests.
Split abstract.c and float.c tests of _testcapi into two parts:
limited C API tests in _testlimitedcapi and non-limited C API tests
in _testcapi.
Update test_bytes and test_class.
Cover all the Mapping Protocol, almost all the Sequence Protocol
(except PySequence_Fast) and a part of the Object Protocol.
Move existing tests to Lib/test/test_capi/test_abstract.py and
Modules/_testcapi/abstract.c.
Add also tests for PyDict C API.
Copying and pickling instances of subclasses of builtin types
bytearray, set, frozenset, collections.OrderedDict, collections.deque,
weakref.WeakSet, and datetime.tzinfo now copies and pickles instance attributes
implemented as slots.
Before, using the * operator to repeat a bytearray would copy data from the start of
the internal buffer (ob_bytes) and not from the start of the actual data (ob_start).
Co-authored-by: Lawrence D’Anna <lawrence_danna@apple.com>
* Add support for macOS 11 and Apple Silicon (aka arm64)
As a side effect of this work use the system copy of libffi on macOS, and remove the vendored copy
* Support building on recent versions of macOS while deploying to older versions
This allows building installers on macOS 11 while still supporting macOS 10.9.
Improve multi-threaded performance by dropping the GIL in the fast path
of bytes.join. To avoid increasing overhead for small joins, it is only
done if the output size exceeds a threshold.
In development mode and in debug build, encoding and errors arguments
are now checked on string encoding and decoding operations. Examples:
open(), str.encode() and bytes.decode().
By default, for best performances, the errors argument is only
checked at the first encoding/decoding error, and the encoding
argument is sometimes ignored for empty strings.
* bpo-22385: Support output separators in hex methods.
Also in binascii.hexlify aka b2a_hex.
The underlying implementation behind all hex generation in CPython uses the
same pystrhex.c implementation. This adds support to bytes, bytearray,
and memoryview objects.
The binascii module functions exist rather than being slated for deprecation
because they return bytes rather than requiring an intermediate step through a
str object.
This change was inspired by MicroPython which supports sep in its binascii
implementation (and does not yet support the .hex methods).
https://bugs.python.org/issue22385
The final addition (cur += step) may overflow, so use size_t for "cur".
"cur" is always positive (even for negative steps), so it is safe to use
size_t here.
Co-Authored-By: Martin Panter <vadmium+py@gmail.com>