cpython

mirror of https://github.com/python/cpython.git synced 2025-12-08 06:10:17 +00:00

Author	SHA1	Message	Date
Ken Jin	4fa80ce74c	gh-139109: A new tracing JIT compiler frontend for CPython (GH-140310) This PR changes the current JIT model from trace projection to trace recording. Benchmarking: better pyperformance (about 1.7% overall) geomean versus current https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251108-3.15.0a1%2B-7e2bc1d-JIT/bm-20251108-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-7e2bc1d-vs-base.svg, 100% faster Richards on the most improved benchmark versus the current JIT. Slowdown of about 10-15% on the worst benchmark versus the current JIT. Note: the fastest version isn't the one merged, as it relies on fixing bugs in the specializing interpreter, which is left to another PR. The speedup in the merged version is about 1.1%. https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251112-3.15.0a1%2B-f8a764a-JIT/bm-20251112-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-f8a764a-vs-base.svg Stats: 50% more uops executed, 30% more traces entered the last time we ran them. It also suggests our trace lengths for a real trace recording JIT are too short, as a lot of trace too long aborts https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251023-3.15.0a1%2B-eb73378-CLANG%2CJIT/bm-20251023-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-eb73378-pystats-vs-base.md . This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this work gh-140277 The optimizer stack space check is disabled, as it's no longer valid to deal with underflow. Pros: * Ignoring the generated tracer code as it's automatically created, this is only additional 1k lines of code. The maintenance burden is handled by the DSL and code generator. * `optimizer.c` is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace. * The new JIT frontend is able to handle a lot more control-flow than the old one. * Tracing is very low overhead. We use the tail calling interpreter/computed goto interpreter to switch between tracing mode and non-tracing mode. I call this mechanism dual dispatch, as we have two dispatch tables dispatching to each other. Specialization is still enabled while tracing. * Better handling of polymorphism. We leverage the specializing interpreter for this. Cons: * (For now) requires tail calling interpreter or computed gotos. This means no Windows JIT for now :(. Not to fret, tail calling is coming soon to Windows though https://github.com/python/cpython/pull/139962 Design: * After each instruction, the `record_previous_inst` function/label is executed. This does as the name suggests. * The tracing interpreter lowers bytecode to uops directly so that it can obtain "fresh" values at the point of lowering. * The tracing version behaves nearly identical to the normal interpreter, in fact it even has specialization! This allows it to run without much of a slowdown when tracing. The actual cost of tracing is only a function call and writes to memory. * The tracing interpreter uses the specializing interpreter's deopt to naturally form the side exit chains. This allows it to side exit chain effectively, without repeating much code. We force a re-specializing when tracing a deopt. * The tracing interpreter can even handle goto errors/exceptions, but I chose to disable them for now as it's not tested. * Because we do not share interpreter dispatch, there is should be no significant slowdown to the original specializing interpreter on tailcall and computed got with JIT disabled. With JIT enabled, there might be a slowdown in the form of the JIT trying to trace. * Things that could have dynamic instruction pointer effects are guarded on. The guard deopts to a new instruction --- `_DYNAMIC_EXIT`.	2025-11-13 18:08:32 +00:00
Victor Stinner	b99db92dde	gh-139653: Add PyUnstable_ThreadState_SetStackProtection() (#139668 ) Add PyUnstable_ThreadState_SetStackProtection() and PyUnstable_ThreadState_ResetStackProtection() functions to set the stack base address and stack size of a Python thread state. Co-authored-by: Petr Viktorin <encukou@gmail.com>	2025-11-13 17:30:50 +01:00
Neil Schemenauer	c98c5b3449	gh-131253: free-threaded build support for pystats (gh-137189) Allow the --enable-pystats build option to be used with free-threading. The stats are now stored on a per-interpreter basis, rather than process global. For free-threaded builds, the stats structure is allocated per-thread and then periodically merged into the per-interpreter stats structure (on thread exit or when the reporting function is called). Most of the pystats related code has be moved into the file Python/pystats.c.	2025-11-03 11:36:37 -08:00
Sam Gross	67fbfb42bd	gh-131586: Avoid refcount contention in some "special" calls (#131588 ) In the free threaded build, the `_PyObject_LookupSpecial()` call can lead to reference count contention on the returned function object becuase it doesn't use stackrefs. Refactor some of the callers to use `_PyObject_MaybeCallSpecialNoArgs`, which uses stackrefs internally. This fixes the scaling bottleneck in the "lookup_special" microbenchmark in `ftscalingbench.py`. However, the are still some uses of `_PyObject_LookupSpecial()` that need to be addressed in future PRs.	2025-03-26 14:38:47 -04:00
Victor Stinner	4b54031323	gh-131238: Remove pycore_runtime.h from pycore_pystate.h (#131356 ) * Remove includes from pycore_pystate.h: * pycore_runtime_structs.h * pycore_runtime.h * pycore_tstate.h * pycore_interp.h * Reorganize internal headers. Move _gc_thread_state from pycore_interp_structs.h to pycore_tstate.h. * Add 3 new header files to PCbuild/pythoncore.vcxproj.	2025-03-19 17:33:24 +01:00
Sam Gross	052cb717f5	gh-124878: Fix race conditions during interpreter finalization (#130649 ) The PyThreadState field gains a reference count field to avoid issues with PyThreadState being a dangling pointer to freed memory. The refcount starts with a value of two: one reference is owned by the interpreter's linked list of thread states and one reference is owned by the OS thread. The reference count is decremented when the thread state is removed from the interpreter's linked list and before the OS thread calls `PyThread_hang_thread()`. The thread that decrements it to zero frees the `PyThreadState` memory. The `holds_gil` field is moved out of the `_status` bit field, to avoid a data race where on thread calls `PyThreadState_Clear()`, modifying the `_status` bit field while the OS thread reads `holds_gil` when attempting to acquire the GIL. The `PyThreadState.state` field now has `_Py_THREAD_SHUTTING_DOWN` as a possible value. This corresponds to the `_PyThreadState_MustExit()` check. This avoids race conditions in the free threading build when checking `_PyThreadState_MustExit()`.	2025-03-06 10:38:34 -05:00
Mark Shannon	014223649c	GH-130396: Use computed stack limits on linux (GH-130398) * Implement C recursion protection with limit pointers for Linux, MacOS and Windows * Remove calls to PyOS_CheckStack * Add stack protection to parser * Make tests more robust to low stacks * Improve error messages for stack overflow	2025-02-25 09:24:48 +00:00
Petr Viktorin	ef29104f7d	GH-91079: Revert "GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)" for now (GH130413) Revert "GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)" for now Unfortunatlely, the change broke some buildbots. This reverts commit `2498c22fa0`.	2025-02-24 11:16:08 +01:00
Mark Shannon	2498c22fa0	GH-91079: Implement C stack limits using addresses, not counters. (GH-130007) * Implement C recursion protection with limit pointers * Remove calls to PyOS_CheckStack * Add stack protection to parser * Make tests more robust to low stacks * Improve error messages for stack overflow	2025-02-19 11:44:57 +00:00
Kumar Aditya	0d68b14a0d	gh-128002: use per threads tasks linked list in asyncio (#128869 ) Co-authored-by: Łukasz Langa <lukasz@langa.pl>	2025-02-06 19:51:07 +01:00
Yury Selivanov	188598851d	GH-91048: Add utils for capturing async call stack for asyncio programs and enable profiling (#124640 ) Signed-off-by: Pablo Galindo <pablogsal@gmail.com> Co-authored-by: Pablo Galindo <pablogsal@gmail.com> Co-authored-by: Kumar Aditya <kumaraditya@python.org> Co-authored-by: Łukasz Langa <lukasz@langa.pl> Co-authored-by: Savannah Ostrowski <savannahostrowski@gmail.com> Co-authored-by: Jacob Coffee <jacob@z7x.org> Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>	2025-01-22 17:25:29 +01:00
mpage	2e95c5ba3b	gh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` (#123926 ) Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads. Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization. Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.	2024-11-04 11:13:32 -08:00
Sam Gross	332356b880	gh-125900: Clean-up logic around immortalization in free-threading (#125901 ) * Remove `@suppress_immortalization` decorator * Make suppression flag per-thread instead of per-interpreter * Suppress immortalization in `eval()` to avoid refleaks in three tests (test_datetime.test_roundtrip, test_logging.test_config8_ok, and test_random.test_after_fork). * frozenset() is constant, but not a singleton. When run multiple times, the test could fail due to constant interning.	2024-10-24 18:09:59 -04:00
Sam Gross	b482538523	gh-124218: Refactor per-thread reference counting (#124844 ) Currently, we only use per-thread reference counting for heap type objects and the naming reflects that. We will extend it to a few additional types in an upcoming change to avoid scaling bottlenecks when creating nested functions. Rename some of the files and functions in preparation for this change.	2024-10-01 17:05:42 +00:00
Sam Gross	dc09301067	gh-122417: Implement per-thread heap type refcounts (#122418 ) The free-threaded build partially stores heap type reference counts in distributed manner in per-thread arrays. This avoids reference count contention when creating or destroying instances. Co-authored-by: Ken Jin <kenjin@python.org>	2024-08-06 14:36:57 -04:00
Sam Gross	5716cc3529	gh-100240: Use a consistent implementation for freelists (#121934 ) This combines and updates our freelist handling to use a consistent implementation. Objects in the freelist are linked together using the first word of memory block. If configured with freelists disabled, these operations are essentially no-ops.	2024-07-22 12:08:27 -04:00
Sam Gross	81fd625b5c	gh-121621: Move asyncio_running_loop to private struct (#121939 ) This avoids changing the ABI and keeps the field in the private struct.	2024-07-17 15:21:24 -07:00
Eric Snow	a905721b9c	gh-120838: Add _PyThreadState_WHENCE_FINI (gh-121010) We also add _PyThreadState_NewBound() and drop _PyThreadState_SetWhence(). This change only affects internal API.	2024-06-25 14:35:12 -06:00
Sam Gross	1a6594f661	gh-117439: Make refleak checking thread-safe without the GIL (#117469 ) This keeps track of the per-thread total reference count operations in PyThreadState in the free-threaded builds. The count is merged into the interpreter's total when the thread exits.	2024-04-08 12:11:36 -04:00
Sam Gross	e3ad6ca56f	gh-115103: Implement delayed free mechanism for free-threaded builds (#115367 ) This adds `_PyMem_FreeDelayed()` and supporting functions. The `_PyMem_FreeDelayed()` function frees memory with the same allocator as `PyMem_Free()`, but after some delay to ensure that concurrent lock-free readers have finished.	2024-02-20 13:04:37 -05:00
Sam Gross	5903190727	gh-115103: Implement delayed memory reclamation (QSBR) (#115180 ) This adds a safe memory reclamation scheme based on FreeBSD's "GUS" and quiescent state based reclamation (QSBR). The API provides a mechanism for callers to detect when it is safe to free memory that may be concurrently accessed by readers.	2024-02-16 15:25:19 -05:00
Sam Gross	b24c9161a6	gh-112529: Make the GC scheduling thread-safe (#114880 ) The GC keeps track of the number of allocations (less deallocations) since the last GC. This buffers the count in thread-local state and uses atomic operations to modify the per-interpreter count. The thread-local buffering avoids contention on shared state. A consequence is that the GC scheduling is not as precise, so "test_sneaky_frame_object" is skipped because it requires that the GC be run exactly after allocating a frame object.	2024-02-16 11:22:27 -05:00
Donghee Na	f15795c9a0	gh-111968: Rename freelist related struct names to Eric's suggestion (gh-115329)	2024-02-14 00:32:51 +00:00
Eric Snow	514b1c91b8	gh-76785: Improved Subinterpreters Compatibility with 3.12 (gh-115424) For the most part, these changes make is substantially easier to backport subinterpreter-related code to 3.12, especially the related modules (e.g. _xxsubinterpreters). The main motivation is to support releasing a PyPI package with the 3.13 capabilities compiled for 3.12. A lot of the changes here involve either hiding details behind macros/functions or splitting up some files.	2024-02-13 14:56:49 -07:00
Sam Gross	a3af3cb4f4	gh-110481: Implement inter-thread queue for biased reference counting (#114824 ) Biased reference counting maintains two refcount fields in each object: `ob_ref_local` and `ob_ref_shared`. The true refcount is the sum of these two fields. In some cases, when refcounting operations are split across threads, the ob_ref_shared field can be negative (although the total refcount must be at least zero). In this case, the thread that decremented the refcount requests that the owning thread give up ownership and merge the refcount fields.	2024-02-09 17:08:32 -05:00
Donghee Na	57bdc6c30d	gh-111968: Introduce _PyFreeListState and _PyFreeListState_GET API (gh-113584)	2024-01-10 08:04:41 +09:00
Sam Gross	acf3bcc886	gh-112532: Use separate mimalloc heaps for GC objects (gh-113263) * gh-112532: Use separate mimalloc heaps for GC objects In `--disable-gil` builds, we now use four separate heaps in anticipation of using mimalloc to find GC objects when the GIL is disabled. To support this, we also make a few changes to mimalloc: * `mi_heap_t` and `mi_tld_t` initialization is split from allocation. This allows us to have a `mi_tld_t` per-`PyThreadState`, which is important to keep interpreter isolation, since the same OS thread may run in multiple interpreters (using different PyThreadStates.) * Heap abandoning (mi_heap_collect_ex) can now be called from a different thread than the one that created the heap. This is necessary because we may clear and delete the containing PyThreadStates from a different thread during finalization and after fork(). * Use enum instead of defines and guard mimalloc includes. * The enum typedef will be convenient for future PRs that use the type. * Guarding the mimalloc includes allows us to unconditionally include pycore_mimalloc.h from other header files that rely on things like `struct _mimalloc_thread_state`. * Only define _mimalloc_thread_state in Py_GIL_DISABLED builds	2023-12-27 01:53:20 +09:00
Sam Gross	db460735af	gh-112538: Add internal-only _PyThreadStateImpl "wrapper" for PyThreadState (gh-112560) Every PyThreadState instance is now actually a _PyThreadStateImpl. It is safe to cast from `PyThreadState` to `_PyThreadStateImpl` and back. The _PyThreadStateImpl will contain fields that we do not want to expose in the public C API.	2023-12-07 12:11:45 -07:00

28 commits