This undoes a change made as a part of PR 137470, for compatibility with EMSDK
4.0.19. It adds `emscripten_trampoline` field in `pycore_runtime_structs.h`
and initializes it from JS initialization code with the wasm-gc based trampoline
if possible. Otherwise we fall back to the JS trampoline.
Add SIMD optimization for `bytes.hex()`, `bytearray.hex()`, and `binascii.hexlify()` as well as `hashlib` `.hexdigest()` methods using platform-agnostic GCC/Clang vector extensions that compile to native SIMD instructions on our [PEP-11 Tier 1 Linux and macOS](https://peps.python.org/pep-0011/#tier-1) platforms.
- 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
- Up to 11x faster for large data (1KB+)
- Retains the existing scalar code for short inputs (<16 bytes) or platforms lacking SIMD instructions, no observable performance regressions there.
## Supported platforms:
- x86-64: the compiler generates SSE2 - always available, no flags or CPU feature checks needed
- ARM64: NEON is always available, always available, no flags or CPU feature checks needed
- ARM32: Requires NEON support and that appropriate compiler flags enable that (e.g., `-march=native` on a Raspberry Pi 3+) - while we _could_ use runtime detection to allow neon when compiled without a recent enough `-march=` flag (`cortex-a53` and later IIRC), there are diminishing returns in doing so. Anyone using 32-bit ARM in a situation where performance matters will already be compiling with such flags. (as opposed to 32-bit Raspbian compilation that defaults to aiming primarily for compatibility with rpi1&0 armv6 arch=armhf which lacks neon)
- Windows/MSVC: Not supported. MSVC lacks `__builtin_shufflevector`, so the existing scalar path is used. Leaving it as an opportunity for the future for someone to figure out how to express the intent to that compiler.
This is compile time detection of features that are always available on the target architectures. No need for runtime feature inspection.
Deprecate PEP-456 [1] support for providing an external definition
of the string hashing scheme. Removal is scheduled for Python 3.19.
Previously, embedders could define the ``Py_HASH_ALGORITHM`` macro to be
``Py_HASH_EXTERNAL`` [2] to indicate that the hashing scheme was provided
externally but this feature was undocumented, untested and most likely
unused.
[1]: https://peps.python.org/pep-0456/
[2]: https://peps.python.org/pep-0456/#hash-function-selection
Align the QSBR thread state array to a 64-byte cache line boundary
and add padding at the end of _PyThreadStateImpl. Depending on heap
layout, the QSBR array could end up sharing a cache line with a
thread's tlbc_index, causing QSBR quiescent state updates to contend
with reads of tlbc_index in RESUME_CHECK. This is sensitive to
earlier allocations during interpreter init and can appear or
disappear with seemingly unrelated changes.
Either change alone is sufficient to fix the specific issue, but both
are worthwhile to avoid similar problems in the future.
Add TYPE_FROZENDICT to the marshal module.
Add C API functions:
* PyAnyDict_Check()
* PyAnyDict_CheckExact()
* PyFrozenDict_Check()
* PyFrozenDict_CheckExact()
* PyFrozenDict_New()
Add PyFrozenDict_Type C type.
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
Co-authored-by: Adam Johnson <me@adamj.eu>
Co-authored-by: Benedikt Johannes <benedikt.johannes.hofer@gmail.com>
Remove spurious Py_DECREF on borrowed ref in LOAD_GLOBAL specialization
_PyDict_LookupIndexAndValue() returns a borrowed reference via
_Py_dict_lookup(), but specialize_load_global_lock_held() called
Py_DECREF(value) on it when bailing out for lazy imports. Each time
the adaptive counter fired while a lazy import was still in globals,
this stole one reference from the dict's object. With 8+ threads
racing through LOAD_GLOBAL during concurrent lazy import resolution,
enough triggers accumulated to drive the refcount to zero while the
dict and other threads still referenced the object, causing
use-after-free.
Fix a race condition where a thread could receive a partially-initialized
module when another thread's import fails. The race occurs when:
1. Thread 1 starts importing, adds module to sys.modules
2. Thread 2 sees the module in sys.modules via the fast path
3. Thread 1's import fails, removes module from sys.modules
4. Thread 2 returns a stale module reference not in sys.modules
The fix adds verification after the "skip lock" optimization in both Python
and C code paths to check if the module is still in sys.modules. If the
module was removed (due to import failure), we retry the import so the
caller receives the actual exception from the import failure rather than
a stale module reference.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When _ctypes is imported, it may call dlopen on the libpython shared
library, causing the dynamic linker to load a second mapping of the
library into the process address space. The remote debugging code
iterates memory regions from low addresses upward and returns the first
mapping whose filename matches libpython. After _ctypes is imported, it
finds the dlopen'd copy first, but that copy's PyRuntime section was
never initialized, so reading debug offsets from it fails.
Fix this by validating each candidate PyRuntime address before accepting
it. The validation reads the first 8 bytes and checks for the "xdebugpy"
cookie that is only present in an initialized PyRuntime. Uninitialized
duplicate mappings will fail this check and be skipped, allowing the
search to continue to the real, initialized PyRuntime.
Changing the values requires forking and patching, which is intentional. Simply rebuilding from source does not change the implementation enough to justify changing these values - they would still be `cpython` and compatible with existing `.pyc` files. But people who maintain forks are better served by being able to easily override these values in a place that can be forward-ported reliably.
When the interpreter is in a stop-the-world pause, critical sections
don't need to acquire locks since no other threads can be running.
This avoids a potential deadlock where lock fairness hands off ownership
to a thread that has already suspended for stop-the-world.
Fix thread-safety issues when accessing frame attributes while another
thread is executing the frame:
- Add critical section to frame_repr() to prevent races when accessing
the frame's code object and line number
- Add _Py_NO_SANITIZE_THREAD to PyUnstable_InterpreterFrame_GetLasti()
to allow intentional racy reads of instr_ptr.
- Fix take_ownership() to not write to the original frame's f_executable
Add `_Py_type_getattro_stackref`, a variant of type attribute lookup
that returns `_PyStackRef` instead of `PyObject*`. This allows returning
deferred references in the free-threaded build, reducing reference count
contention when accessing type attributes.
This significantly improves scaling of namedtuple instantiation across
multiple threads.
* Add blurb
* Rename PyObject_GetAttrStackRef to _PyObject_GetAttrStackRef
* Apply suggestion from @vstinner
Co-authored-by: Victor Stinner <vstinner@python.org>
* Apply suggestion from @vstinner
Co-authored-by: Victor Stinner <vstinner@python.org>
* format
* Update Include/internal/pycore_function.h
Co-authored-by: Victor Stinner <vstinner@python.org>
---------
Co-authored-by: Victor Stinner <vstinner@python.org>
The positional arguments passed to _PyStack_UnpackDict are already
kept alive by the caller, so we can avoid the extra reference count
operations by using borrowed references instead of creating new ones.
This reduces reference count contention in the free-threaded build
when calling functions with keyword arguments. In particular, this
avoids contention on the type argument to `__new__` when instantiating
namedtuples with keyword arguments.
The pymalloc huge page support had two problems. First, on
architectures where the default huge page size exceeds the arena
size (e.g. 32 MiB on PPC, 512 MiB on ARM64 with 64 KB base
pages), mmap with MAP_HUGETLB silently allocates a full huge page
even when the requested size is smaller. The subsequent munmap
with the original arena size then fails with EINVAL, permanently
leaking the entire huge page. Second, huge pages were always
attempted when compiled in, with no way to disable them at
runtime. On Linux, if the huge page pool is exhausted, page
faults including copy-on-write faults after fork deliver SIGBUS
and kill the process.
The arena allocator now queries the system huge page size from
/proc/meminfo and skips MAP_HUGETLB when the arena size is not a
multiple of it. Huge pages also now require explicit opt-in at
runtime via the PYTHON_PYMALLOC_HUGEPAGES environment variable,
which is read through PyConfig and respects -E and -I flags.
The config field pymalloc_hugepages is propagated to the runtime
allocators struct so the low-level arena allocator can check it
without calling getenv directly.
Add a FRAME_SUSPENDED_YIELD_FROM_LOCKED state that acts as a brief
lock, preventing other threads from transitioning the frame state
while gen_getyieldfrom reads the yield-from object off the stack.