Revert "GH-126910: Make `_Py_get_machine_stack_pointer` return the stack pointer (#147945)"
This reverts commit 255026d9ee,
which broke a tier-1 buildbot.
* Make _Py_get_machine_stack_pointer return the stack pointer (or close to it), not the frame pointer
* Make ``_Py_ReachedRecursionLimit`` inline again
* Remove ``_Py_MakeRecCheck`` relacing its use with ``_Py_ReachedRecursionLimit``
* Move stack swtiching check into ``_Py_CheckRecursiveCall``
Add special cases for classmethod and staticmethod descriptors in
_PyObject_GetMethodStackRef() to avoid calling tp_descr_get, which
avoids reference count contention on the bound method and underlying
callable. This improves scaling when calling classmethods and
staticmethods from multiple threads.
Also refactor method_vectorcall in classobject.c into a new _PyObject_VectorcallPrepend() helper so that it can be used by
PyObject_VectorcallMethod as well.
The positional arguments passed to _PyStack_UnpackDict are already
kept alive by the caller, so we can avoid the extra reference count
operations by using borrowed references instead of creating new ones.
This reduces reference count contention in the free-threaded build
when calling functions with keyword arguments. In particular, this
avoids contention on the type argument to `__new__` when instantiating
namedtuples with keyword arguments.
Add a FRAME_SUSPENDED_YIELD_FROM_LOCKED state that acts as a brief
lock, preventing other threads from transitioning the frame state
while gen_getyieldfrom reads the yield-from object off the stack.
Now that the specializing interpreter works with free threading,
replace ENABLE_SPECIALIZATION_FT with ENABLE_SPECIALIZATION and
replace requires_specialization_ft with requires_specialization.
Also limit the uniquely referenced check to FOR_ITER_RANGE. It's not
necessary for FOR_ITER_GEN and would cause test_for_iter_gen to fail.
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Co-authored-by: Brandt Bucher <brandt@python.org>
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
This makes generator frame state transitions atomic in the free
threading build, which avoids segfaults when trying to execute
a generator from multiple threads concurrently.
There are still a few operations that aren't thread-safe and may crash
if performed concurrently on the same generator/coroutine:
* Accessing gi_yieldfrom/cr_await/ag_await
* Accessing gi_frame/cr_frame/ag_frame
* Async generator operations
This PR implements frame caching in the RemoteUnwinder class to significantly reduce memory reads when profiling remote processes with deep call stacks.
When cache_frames=True, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.
The implementation adds a last_profiled_frame field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.
The sampling profiler now enables frame caching by default.
* Factor out bodies of the largest uops, to reduce jit code size.
* Factor out common assert, also reducing jit code size.
* Limit size of jitted code for a single executor to 1MB.
Only raises if the stack pointer is both below the limit *and* above the stack base.
This prevents false positives for user-space threads, as the stack pointer will be outside those bounds
if the stack has been swapped.
Adapted from a patch for Python 3.14 submitted to the Debian BTS by John
https://bugs.debian.org/1105111#20
Co-authored-by: John David Anglin <dave.anglin@bell.net>