For AArch64 linux, reduces the total bytes in the code bodies from 489kb to 218kb.
Reduces the size of the stencils files from 394k lines to 167k lines.
* SEND specialization. Adds 2 new specialized instructions:
* SEND_VIRTUAL: for sends to virtual iterators e.g lists and tuples
* SEND_ASYNC_GEN: for sends to async generators
Tweak FOR_ITER_VIRTUAL so that SEND_VIRTUAL and FOR_ITER_VIRTUAL use equivalent guards
* Replaces ad-hoc logic for ending traces with a simple inequality: `fitness < exit_quality`
* Fitness starts high and is reduced for branches, backward edges, calls and trace length
* Exit quality reflect how good a spot that instruction is to end a trace. Closing a loop is very, specializable instructions are very low and the others in between.
* Make the `PY_UNWIND` monitoring event available as a code-local
event to allow trapping on function exit events when an exception
bubbles up. This complements the PY_RETURN event by allowing to
catch any function exit event.
* Allow `PY_UNWIND` to be `DISABLE`d; disabling it disables the event for the whole code object.
* Do the above for `PY_THROW`, `RAISE`, `EXCEPTION_HANDLED`, and `RERAISE` events.
* Add FOR_ITER_VIRTUAL to specialize FOR_ITER for virtual iterators
* Add GET_ITER_SELF to specialize GET_ITER for iterators (including generators)
* Add GET_ITER_VIRTUAL to specialize GET_ITER for iterables as virtual iterators
* Add new (internal) _tp_iteritem function slot to PyTypeObject
* Put limited RESUME at start of genexpr for free-threading. Fix up exception handling in genexpr
There are newly documented restrictions on tp_traverse:
The traversal function must not have any side effects.
It must not modify the reference counts of any Python
objects nor create or destroy any Python objects.
* Add several functions that are guaranteed side-effect-free,
with a _DuringGC suffix.
* Use these in ctypes
* Consolidate tp_traverse docs in gcsupport.rst, moving unique
content from typeobj.rst there
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
When compiling for abi3t, define Py_GIL_DISABLED, so that users who
check it to enable additional locking aren't broken.
But also avoid using Py_GIL_DISABLED in Python headers themselves
-- abi3 and abi3t ought to be the same except
the _Py_OPAQUE_PYOBJECT differences.
A check for this is coming in a later PR.
It will require rewriting some preprocessor conditions, some of these
changes are included in this PR.
For _Py_IsOwnedByCurrentThread & supporting functions
I opted to move them to a cpython/ header, as they're rather self-contained.
When Python is built in debug mode:
* long_alloc() now initializes digits with a pattern to detect usage of
uninitialized digits.
* _PyLong_CompactValue() now makes sure that the digit is zero when the
sign is zero.
* PyLongWriter_Finish() now raises SystemError if it detects uninitialized
digits
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Rename from _Py_INTERNAL_ABI_SLOT to _Py_ABI_SLOT
and define the macro using _PyABIInfo_DEFAULT.
Use the ABI slot in stdlib extension modules to enable running
a check of ABI version compatibility.
_tkinter, _tracemalloc and readline don't use the slots, hence they need
explicit handling.
Co-authored-by: Victor Stinner <vstinner@python.org>
Also fix a few related issues in the pyatomic headers:
* Fix _Py_atomic_store_uint_release in pyatomic_msc.h to use __stlr32
on ARM64 instead of a plain volatile store (which is only relaxed on
ARM64).
* Add missing _Py_atomic_store_uint_release to pyatomic_gcc.h.
* Fix pseudo-code comment for _Py_atomic_store_ptr_release in
pyatomic.h.
Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.
---------
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Add TYPE_FROZENDICT to the marshal module.
Add C API functions:
* PyAnyDict_Check()
* PyAnyDict_CheckExact()
* PyFrozenDict_Check()
* PyFrozenDict_CheckExact()
* PyFrozenDict_New()
Add PyFrozenDict_Type C type.
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
Co-authored-by: Adam Johnson <me@adamj.eu>
Co-authored-by: Benedikt Johannes <benedikt.johannes.hofer@gmail.com>
The pymalloc huge page support had two problems. First, on
architectures where the default huge page size exceeds the arena
size (e.g. 32 MiB on PPC, 512 MiB on ARM64 with 64 KB base
pages), mmap with MAP_HUGETLB silently allocates a full huge page
even when the requested size is smaller. The subsequent munmap
with the original arena size then fails with EINVAL, permanently
leaking the entire huge page. Second, huge pages were always
attempted when compiled in, with no way to disable them at
runtime. On Linux, if the huge page pool is exhausted, page
faults including copy-on-write faults after fork deliver SIGBUS
and kill the process.
The arena allocator now queries the system huge page size from
/proc/meminfo and skips MAP_HUGETLB when the arena size is not a
multiple of it. Huge pages also now require explicit opt-in at
runtime via the PYTHON_PYMALLOC_HUGEPAGES environment variable,
which is read through PyConfig and respects -E and -I flags.
The config field pymalloc_hugepages is propagated to the runtime
allocators struct so the low-level arena allocator can check it
without calling getenv directly.
This makes generator frame state transitions atomic in the free
threading build, which avoids segfaults when trying to execute
a generator from multiple threads concurrently.
There are still a few operations that aren't thread-safe and may crash
if performed concurrently on the same generator/coroutine:
* Accessing gi_yieldfrom/cr_await/ag_await
* Accessing gi_frame/cr_frame/ag_frame
* Async generator operations
There are places we use "relaxed" loads where C11 requires "consume" or
stronger. Unfortunately, compilers don't really implement "consume" so
fake it for our use in a way that avoids upsetting TSan.
This PR implements frame caching in the RemoteUnwinder class to significantly reduce memory reads when profiling remote processes with deep call stacks.
When cache_frames=True, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.
The implementation adds a last_profiled_frame field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.
The sampling profiler now enables frame caching by default.
* Promote _PyObject_Dump() as a public function.
* Keep _PyObject_Dump() alias to PyUnstable_Object_Dump()
for backward compatibility.
* Replace _PyObject_Dump() with PyUnstable_Object_Dump().
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Co-authored-by: Kumar Aditya <kumaraditya@python.org>
Co-authored-by: Petr Viktorin <encukou@gmail.com>