Commit graph

141 commits

Author SHA1 Message Date
Stefano Rivera
f6dd9c12a8
GH-139914: Handle stack growth direction on HPPA (GH-140028)
Adapted from a patch for Python 3.14 submitted to the Debian BTS by John
https://bugs.debian.org/1105111#20

Co-authored-by: John David Anglin <dave.anglin@bell.net>
2025-11-17 14:41:22 +01:00
Ken Jin
4fa80ce74c
gh-139109: A new tracing JIT compiler frontend for CPython (GH-140310)
This PR changes the current JIT model from trace projection to trace recording. Benchmarking: better pyperformance (about 1.7% overall) geomean versus current https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251108-3.15.0a1%2B-7e2bc1d-JIT/bm-20251108-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-7e2bc1d-vs-base.svg, 100% faster Richards on the most improved benchmark versus the current JIT. Slowdown of about 10-15% on the worst benchmark versus the current JIT. **Note: the fastest version isn't the one merged, as it relies on fixing bugs in the specializing interpreter, which is left to another PR**. The speedup in the merged version is about 1.1%. https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251112-3.15.0a1%2B-f8a764a-JIT/bm-20251112-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-f8a764a-vs-base.svg

Stats: 50% more uops executed, 30% more traces entered the last time we ran them. It also suggests our trace lengths for a real trace recording JIT are too short, as a lot of trace too long aborts https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251023-3.15.0a1%2B-eb73378-CLANG%2CJIT/bm-20251023-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-eb73378-pystats-vs-base.md .

This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this work gh-140277

The optimizer stack space check is disabled, as it's no longer valid to deal with underflow.

Pros:
* Ignoring the generated tracer code as it's automatically created, this is only additional 1k lines of code. The maintenance burden is handled by the DSL and code generator.
* `optimizer.c` is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.
* The new JIT frontend is able to handle a lot more control-flow than the old one.
* Tracing is very low overhead. We use the tail calling interpreter/computed goto interpreter to switch between tracing mode and non-tracing mode. I call this mechanism dual dispatch, as we have two dispatch tables dispatching to each other. Specialization is still enabled while tracing.
* Better handling of polymorphism. We leverage the specializing interpreter for this.

Cons:
* (For now) requires tail calling interpreter or computed gotos. This means no Windows JIT for now :(. Not to fret, tail calling is coming soon to Windows though https://github.com/python/cpython/pull/139962

Design:
* After each instruction, the `record_previous_inst` function/label is executed. This does as the name suggests.
* The tracing interpreter lowers bytecode to uops directly so that it can obtain "fresh" values at the point of lowering.
* The tracing version behaves nearly identical to the normal interpreter, in fact it even has specialization! This allows it to run without much of a slowdown when tracing. The actual cost of tracing is only a function call and writes to memory.
* The tracing interpreter uses the specializing interpreter's deopt to naturally form the side exit chains. This allows it to side exit chain effectively, without repeating much code. We force a re-specializing when tracing a deopt.
* The tracing interpreter can even handle goto errors/exceptions, but I chose to disable them for now as it's not tested.
* Because we do not share interpreter dispatch, there is should be no significant slowdown to the original specializing interpreter on tailcall and computed got with JIT disabled. With JIT enabled, there might be a slowdown in the form of the JIT trying to trace.
* Things that could have dynamic instruction pointer effects are guarded on. The guard deopts to a new instruction --- `_DYNAMIC_EXIT`.
2025-11-13 18:08:32 +00:00
Mikhail Efimov
d17f28fed5
gh-140373: Correctly emit PY_UNWIND event when generator is closed (GH-140767) 2025-10-31 10:09:22 +00:00
Dino Viehland
299de38e61
gh-131776: Expose functions called from the interpreter loop via PyAPI_FUNC (#134242) 2025-09-17 08:04:02 -07:00
Mark Shannon
a8d9d94784
GH-137959: Replace shim code in jitted code with a single trampoline function. (GH-137961) 2025-08-21 10:40:53 +01:00
Sam Gross
a10152f8fd
gh-137400: Fix thread-safety issues when profiling all threads (gh-137518)
There were a few thread-safety issues when profiling or tracing all
threads via PyEval_SetProfileAllThreads or PyEval_SetTraceAllThreads:

* The loop over thread states could crash if a thread exits concurrently
  (in both the free threading and default build)
* The modification of `c_profilefunc` and `c_tracefunc` wasn't
  thread-safe on the free threading build.
2025-08-13 14:15:12 -04:00
Petr Viktorin
1b1ae82fab
gh-135755: Move SPECIAL_ constants to a private header (GH-135922)
Macros without a `Py`/`_Py` prefix should not be defined in public headers.
2025-06-25 13:03:05 +02:00
Eric Snow
a450a0ddec
gh-135443: Sometimes Fall Back to __main__.__dict__ For Globals (gh-135491)
For several builtin functions, we now fall back to __main__.__dict__ for the globals
when there is no current frame and _PyInterpreterState_IsRunningMain() returns
true.  This allows those functions to be run with Interpreter.call().

The affected builtins:

* exec()
* eval()
* globals()
* locals()
* vars()
* dir()

We take a similar approach with "stateless" functions, which don't use any
global variables.
2025-06-16 17:34:19 -06:00
Mark Shannon
b90ecea9e6
GH-132554: Fix tier2 FOR_ITER implementation and optimizations (GH-135137) 2025-06-05 18:53:57 +01:00
Mark Shannon
f6f4e8a662
GH-132554: "Virtual" iterators (GH-132555)
* FOR_ITER now pushes either the iterator and NULL or leaves the iterable and pushes tagged zero

* NEXT_ITER uses the tagged int as the index into the sequence or, if TOS is NULL, iterates as before.
2025-05-27 15:59:45 +01:00
Mark Shannon
44e4c479fb
GH-124715: Move trashcan mechanism into Py_Dealloc (GH-132280) 2025-04-30 11:37:53 +01:00
Pablo Galindo Salgado
99b13775da
gh-131591: Check for remote debug in PyErr_CheckSignals (#132853)
For the same reasons as running the GC, this will allow sections that
run in native code for long periods without executing bytecode to also
run the remote debugger protocol without having to wait until bytecode
is executed

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
2025-04-23 20:59:41 +01:00
Bénédikt Tran
8a9c6c4d16
gh-128398: improve error messages when incorrectly using with and async with (#132218)
Improve the error message with a suggestion when an object supporting the synchronous
(resp. asynchronous) context manager protocol is entered using `async with` (resp. `with`)
instead of `with` (resp. `async with`).
2025-04-19 10:44:01 +02:00
Pablo Galindo Salgado
2067378e6d
gh-131591: Handle includes for iOS in remote_debugging.c (#132050) 2025-04-06 21:39:25 +01:00
Chris Eibl
20098719df
GH-131288: Use _AddressOfReturnAddress for MSVC in pycore_ceval.h (gh-131289)
Use `_AddressOfReturnAddress` in `_Py_get_machine_stack_pointer` to silence MSVC warning in pycore_ceval.h for release builds.
2025-04-04 09:03:12 -04:00
Pablo Galindo Salgado
943cc1431e
gh-131591: Implement PEP 768 (#131937)
Co-authored-by: Ivona Stojanovic <stojanovic.i@hotmail.com>
Co-authored-by: Matt Wozniski <godlygeek@gmail.com>
2025-04-03 16:20:01 +01:00
Victor Stinner
20c5f969dd
gh-131238: Remove more includes from pycore_interp.h (#131480) 2025-03-19 23:01:32 +01:00
Victor Stinner
b8367e7cf3
gh-130931: Add pycore_typedefs.h internal header (#131396)
Declare _PyInterpreterFrame and _PyRuntimeState types before
declaring their structure members. Break reference cycles between
header files.
2025-03-19 15:23:32 +01:00
Mark Shannon
2bef8ea8ea
GH-127705: Use _PyStackRefs in the default build. (GH-127875) 2025-03-10 14:06:56 +00:00
Mark Shannon
014223649c
GH-130396: Use computed stack limits on linux (GH-130398)
* Implement C recursion protection with limit pointers for Linux, MacOS and Windows

* Remove calls to PyOS_CheckStack

* Add stack protection to parser

* Make tests more robust to low stacks

* Improve error messages for stack overflow
2025-02-25 09:24:48 +00:00
Petr Viktorin
ef29104f7d
GH-91079: Revert "GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)" for now (GH130413)
Revert "GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)" for now

Unfortunatlely, the change broke some buildbots.

This reverts commit 2498c22fa0.
2025-02-24 11:16:08 +01:00
Mark Shannon
2498c22fa0
GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)
* Implement C recursion protection with limit pointers

* Remove calls to PyOS_CheckStack

* Add stack protection to parser

* Make tests more robust to low stacks

* Improve error messages for stack overflow
2025-02-19 11:44:57 +00:00
Mark Shannon
72f56654d0
GH-128682: Account for escapes in DECREF_INPUTS (GH-129953)
* Handle escapes in DECREF_INPUTS

* Mark a few more functions as escaping

* Replace DECREF_INPUTS with PyStackRef_CLOSE where possible
2025-02-12 17:44:59 +00:00
Irit Katriel
c39ae8922b
gh-128799: Add frame of except* to traceback when wrapping a naked exception (#128971) 2025-01-25 13:00:23 +00:00
mpage
2e95c5ba3b
gh-115999: Implement thread-local bytecode and enable specialization for BINARY_OP (#123926)
Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.

Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.

Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
2024-11-04 11:13:32 -08:00
Sam Gross
3c4a7fa617
gh-124218: Avoid refcount contention on builtins module (GH-125847)
This replaces `_PyEval_BuiltinsFromGlobals` with
`_PyDict_LoadBuiltinsFromGlobals`, which returns a new reference
instead of a borrowed reference. Internally, the new function uses
per-thread reference counting when possible to avoid contention on the
refcount fields on the builtins module.
2024-10-24 12:44:38 -04:00
Mark Shannon
06ca33020e
GH-125323: Convert DECREF_INPUTS_AND_REUSE_FLOAT into a function that takes PyStackRefs. (GH-125439) 2024-10-14 14:18:57 +01:00
Mark Shannon
da071fa3e8
GH-119866: Spill the stack around escaping calls. (GH-124392)
* Spill the evaluation around escaping calls in the generated interpreter and JIT. 

* The code generator tracks live, cached values so they can be saved to memory when needed.

* Spills the stack pointer around escaping calls, so that the exact stack is visible to the cycle GC.
2024-10-07 14:56:39 +01:00
Savannah Ostrowski
65f1237098
GH-123516: Improve JIT memory consumption by invalidating cold executors (GH-124443)
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
2024-09-27 00:35:42 +00:00
Ken Jin
8810e286fa
gh-121459: Deferred LOAD_GLOBAL (GH-123128)
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Sam Gross <655866+colesbury@users.noreply.github.com>
2024-09-14 00:23:51 +08:00
Mark Shannon
7aca84e557
GH-117224: Move the body of a few large-ish micro-ops into helper functions (GH-122601) 2024-08-02 16:31:17 +01:00
Brandt Bucher
15d4cd0967
GH-116090: Fire RAISE events from _FOR_ITER_TIER_TWO (GH-122413) 2024-07-29 12:17:47 -07:00
Brandt Bucher
7b36b67b1e
GH-118093: Add tier two support to several instructions (GH-121884) 2024-07-18 14:24:58 -07:00
Ken Jin
22b0de2755
gh-117139: Convert the evaluation stack to stack refs (#118450)
This PR sets up tagged pointers for CPython.

The general idea is to create a separate struct _PyStackRef for everything on the evaluation stack to store the bits. This forces the C compiler to warn us if we try to cast things or pull things out of the struct directly.

Only for free threading: We tag the low bit if something is deferred - that means we skip incref and decref operations on it. This behavior may change in the future if Mark's plans to defer all objects in the interpreter loop pans out.

This implies a strict stack reference discipline is required. ALL incref and decref operations on stackrefs must use the stackref variants. It is unsafe to untag something then do normal incref/decref ops on it.

The new incref and decref variants are called dup and close. They mimic a "handle" API operating on these stackrefs.

Please read Include/internal/pycore_stackref.h for more information!

---------

Co-authored-by: Mark Shannon <9448417+markshannon@users.noreply.github.com>
2024-06-27 03:10:43 +08:00
Mark Shannon
9cefcc0ee7
GH-120507: Lower the BEFORE_WITH and BEFORE_ASYNC_WITH instructions. (#120640)
* Remove BEFORE_WITH and BEFORE_ASYNC_WITH instructions.

* Add LOAD_SPECIAL instruction

* Reimplement `with` and `async with` statements using LOAD_SPECIAL
2024-06-18 12:17:46 +01:00
Sam Gross
f3b89a63cb
gh-117657: Fix TSAN reported race in _PyEval_IsGILEnabled. (#119921)
The GIL may be disabled concurrently with this call so we need to use a
relaxed atomic load.
2024-06-02 10:19:02 -04:00
Brett Simmers
be1dfccdf2
gh-118727: Don't drop the GIL in drop_gil() unless the current thread holds it (#118745)
`drop_gil()` assumes that its caller is attached, which means that the current
thread holds the GIL if and only if the GIL is enabled, and the enabled-state
of the GIL won't change. This isn't true, though, because `detach_thread()`
calls `_PyEval_ReleaseLock()` after detaching and
`_PyThreadState_DeleteCurrent()` calls it after removing the current thread
from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last
attached, in `PyThreadState._status.holds_gil`, and check this in `drop_gil()`
instead of `gil->enabled`.

This fixes a crash in `test_multiprocessing_pool_circular_import()`, so I've
reenabled it.
2024-05-23 16:59:35 -04:00
Brett Simmers
853163d3b5
gh-116322: Enable the GIL while loading C extension modules (#118560)
Add the ability to enable/disable the GIL at runtime, and use that in
the C module loading code.

We can't know before running a module init function if it supports
free-threading, so the GIL is temporarily enabled before doing so. If
the module declares support for running without the GIL, the GIL is
later disabled. Otherwise, the GIL is permanently enabled, and will
never be disabled again for the life of the current interpreter.
2024-05-06 23:07:23 -04:00
Pablo Galindo Salgado
1b22d801b8
gh-118518: Allow perf to work without frame pointers (#112254) 2024-05-05 03:07:29 +02:00
Eric Snow
09c2947581
gh-110693: Pending Calls Machinery Cleanups (gh-118296)
This does some cleanup in preparation for later changes.
2024-04-26 01:05:51 +00:00
Mark Shannon
f180b31e76
GH-118095: Handle RETURN_GENERATOR in tier 2 (GH-118180) 2024-04-25 11:32:47 +01:00
Mark Shannon
2cf18a4430
GH-116422: Modify a few uops so that they can be supported by tier 2 with hot/cold splitting (GH-116832) 2024-03-15 10:48:00 +00:00
Brandt Bucher
f0df35eeca
GH-115802: JIT "small" code for Windows (GH-115964) 2024-02-29 08:11:28 -08:00
Brett Simmers
0749244d13
gh-112175: Add eval_breaker to PyThreadState (#115194)
This change adds an `eval_breaker` field to `PyThreadState`. The primary
motivation is for performance in free-threaded builds: with thread-local eval
breakers, we can stop a specific thread (e.g., for an async exception) without
interrupting other threads.

The source of truth for the global instrumentation version is stored in the
`instrumentation_version` field in PyInterpreterState. Threads usually read the
version from their local `eval_breaker`, where it continues to be colocated
with the eval breaker bits.
2024-02-20 09:57:48 -05:00
Sam Gross
a3af3cb4f4
gh-110481: Implement inter-thread queue for biased reference counting (#114824)
Biased reference counting maintains two refcount fields in each object:
`ob_ref_local` and `ob_ref_shared`. The true refcount is the sum of these two
fields. In some cases, when refcounting operations are split across threads,
the ob_ref_shared field can be negative (although the total refcount must be
at least zero). In this case, the thread that decremented the refcount
requests that the owning thread give up ownership and merge the refcount
fields.
2024-02-09 17:08:32 -05:00
Sam Gross
441affc9e7
gh-111964: Implement stop-the-world pauses (gh-112471)
The `--disable-gil` builds occasionally need to pause all but one thread.  Some
examples include:

* Cyclic garbage collection, where this is often called a "stop the world event"
* Before calling `fork()`, to ensure a consistent state for internal data structures
* During interpreter shutdown, to ensure that daemon threads aren't accessing Python objects

This adds the following functions to implement global and per-interpreter pauses:

* `_PyEval_StopTheWorldAll()` and `_PyEval_StartTheWorldAll()` (for the global runtime)
* `_PyEval_StopTheWorld()` and `_PyEval_StartTheWorld()` (per-interpreter)

(The function names may change.)

These functions are no-ops outside of the `--disable-gil` build.
2024-01-23 11:08:23 -07:00
Sam Gross
a3c031884d
gh-112723: Call PyThreadState_Clear() from the correct interpreter (#112776)
The `PyThreadState_Clear()` function must only be called with the GIL
held and must be called from the same interpreter as the passed in
thread state. Otherwise, any Python objects on the thread state may be
destroyed using the wrong interpreter, leading to memory corruption.

This is also important for `Py_GIL_DISABLED` builds because free lists
will be associated with PyThreadStates and cleared in
`PyThreadState_Clear()`.

This fixes two places that called `PyThreadState_Clear()` from the wrong
interpreter and adds an assertion to `PyThreadState_Clear()`.
2023-12-12 17:20:21 -07:00
Sam Gross
cf6110ba13
gh-111924: Use PyMutex for Runtime-global Locks. (gh-112207)
This replaces some usages of PyThread_type_lock with PyMutex, which does not require memory allocation to initialize.

This simplifies some of the runtime initialization and is also one step towards avoiding changing the default raw memory allocator during initialize/finalization, which can be non-thread-safe in some circumstances.
2023-12-07 12:33:40 -07:00
Pablo Galindo Salgado
a73aa48e6b
gh-112367: Only free perf trampoline arenas at shutdown (#112368)
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
2023-12-01 13:20:51 +00:00
Tian Gao
e0afed7e27
gh-103615: Use local events for opcode tracing (GH-109472)
* Use local monitoring for opcode trace

* Remove f_opcode_trace_set

* Add test for setting f_trace_opcodes after settrace
2023-11-03 16:39:50 +00:00