This PR changes the current JIT model from trace projection to trace recording. Benchmarking: better pyperformance (about 1.7% overall) geomean versus current https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251108-3.15.0a1%2B-7e2bc1d-JIT/bm-20251108-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-7e2bc1d-vs-base.svg, 100% faster Richards on the most improved benchmark versus the current JIT. Slowdown of about 10-15% on the worst benchmark versus the current JIT. **Note: the fastest version isn't the one merged, as it relies on fixing bugs in the specializing interpreter, which is left to another PR**. The speedup in the merged version is about 1.1%. https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251112-3.15.0a1%2B-f8a764a-JIT/bm-20251112-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-f8a764a-vs-base.svg
Stats: 50% more uops executed, 30% more traces entered the last time we ran them. It also suggests our trace lengths for a real trace recording JIT are too short, as a lot of trace too long aborts https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251023-3.15.0a1%2B-eb73378-CLANG%2CJIT/bm-20251023-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-eb73378-pystats-vs-base.md .
This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this work gh-140277
The optimizer stack space check is disabled, as it's no longer valid to deal with underflow.
Pros:
* Ignoring the generated tracer code as it's automatically created, this is only additional 1k lines of code. The maintenance burden is handled by the DSL and code generator.
* `optimizer.c` is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.
* The new JIT frontend is able to handle a lot more control-flow than the old one.
* Tracing is very low overhead. We use the tail calling interpreter/computed goto interpreter to switch between tracing mode and non-tracing mode. I call this mechanism dual dispatch, as we have two dispatch tables dispatching to each other. Specialization is still enabled while tracing.
* Better handling of polymorphism. We leverage the specializing interpreter for this.
Cons:
* (For now) requires tail calling interpreter or computed gotos. This means no Windows JIT for now :(. Not to fret, tail calling is coming soon to Windows though https://github.com/python/cpython/pull/139962
Design:
* After each instruction, the `record_previous_inst` function/label is executed. This does as the name suggests.
* The tracing interpreter lowers bytecode to uops directly so that it can obtain "fresh" values at the point of lowering.
* The tracing version behaves nearly identical to the normal interpreter, in fact it even has specialization! This allows it to run without much of a slowdown when tracing. The actual cost of tracing is only a function call and writes to memory.
* The tracing interpreter uses the specializing interpreter's deopt to naturally form the side exit chains. This allows it to side exit chain effectively, without repeating much code. We force a re-specializing when tracing a deopt.
* The tracing interpreter can even handle goto errors/exceptions, but I chose to disable them for now as it's not tested.
* Because we do not share interpreter dispatch, there is should be no significant slowdown to the original specializing interpreter on tailcall and computed got with JIT disabled. With JIT enabled, there might be a slowdown in the form of the JIT trying to trace.
* Things that could have dynamic instruction pointer effects are guarded on. The guard deopts to a new instruction --- `_DYNAMIC_EXIT`.
Add PyUnstable_ThreadState_SetStackProtection() and
PyUnstable_ThreadState_ResetStackProtection() functions
to set the stack base address and stack size of a Python
thread state.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
This is useful for implementing proper `input()`. It requires the
JavaScript engine to support the wasm JSPI spec which is now stage 4.
It is supported on Chrome since version 137 and on Firefox and node
behind a flag.
We override the `__wasi_fd_read()` syscall with our own variant that
checks for a readAsync operation. If it has it, we use our own async
variant of `fd_read()`, otherwise we use the original `fd_read()`.
We also add a variant of `FS.createDevice()` called
`FS.createAsyncInputDevice()`.
Finally, if JSPI is available, we wrap the `main()` symbol with
`WebAssembly.promising()` so that we can stack switch from `fd_read()`.
If JSPI is not available, attempting to read from an AsyncInputDevice
will raise an `OSError`.
- Use %T format specifier instead of %s and Py_TYPE(x)->tp_name.
- Remove legacy %.200s format specifier for truncating type names.
Co-authored-by: Victor Stinner <vstinner@python.org>
Move PYOS_LOG2_STACK_MARGIN, PYOS_STACK_MARGIN,
PYOS_STACK_MARGIN_BYTES and PYOS_STACK_MARGIN_SHIFT macros to
pycore_pythonrun.h internal header. Add underscore (_) prefix to the
names to make them private. Rename _PYOS to _PyOS.
PEP-734 has been accepted (for 3.14).
(FTR, I'm opposed to putting this under the concurrent package, but
doing so is the SC condition under which the module can land in 3.14.)
It now supports a "full" fallback to _PyFunction_GetXIData() and then `_PyPickle_GetXIData()`. There's also room for other fallback modes if that later makes sense.
This converts functions, code, str, bytes, bytearray, and memoryview objects to PyCodeObject,
and ensure that the object looks like a script. That means no args, no return, and no closure.
_PyCode_GetPureScriptXIData() takes it a step further and ensures there are no globals.
We also add _PyObject_SupportedAsScript() to the internal C-API.
This reverts commit 3c73cf5 (gh-133497), which itself reverted
the original commit d270bb5 (gh-133221).
We reverted the original change due to failing android tests.
The checks in _PyCode_CheckNoInternalState() were too strict,
so we've relaxed them.
"Stateless" code is a function or code object which does not rely on external state or internal state.
It may rely on arguments and builtins, but not globals or a closure. I've left a comment in
pycore_code.h that provides more detail.
We also add _PyFunction_VerifyStateless(). The new functions will be used in several later changes
that facilitate "sharing" functions and code objects between interpreters.
This reverts commit 811edcf (gh-133232), which itself reverted the original commit 811edcf (gh-133128).
We reverted the original change due to failing s390 builds (a big-endian architecture).
It ended up that I had not accommodated op caches.
There's some extra complexity due to making sure we we get things right when handling functions and classes defined in the __main__ module. This is also reflected in the tests, including the addition of extra functions in test.support.import_helper.
This helper is useful in a variety of ways, including in demonstrating how the different counts relate to one another.
It will be used in a later change to help identify if a function is "stateless", meaning it doesn't have any free vars or globals.
Note that a majority of this change is tests.
The function indicates whether or not the function has a return statement.
This is used by a later change related treating some functions like scripts.
The `test_load_global_module()` test consumes a lot of dict key versions.
Skip the test if we have consumed half of the available versions that can be
used for the "load global" cache.
The following are added to the internal C-API:
* _PyErr_FormatV()
* _PyErr_SetModuleNotFoundError()
* _PyXIData_GetNotShareableErrorType()
* _PyXIData_FormatNotShareableError()
We also drop _PyXIData_lookup_context_t and _PyXIData_GetLookupContext().
The `free_work_item()` function in QSBR may call arbitrary code via
Python object destructors, which may reenter the QSBR code. Reorder
the processing of work items to be robust to reentrancy.
Also fix the TODO for the out of memory situation.
* Implement C recursion protection with limit pointers for Linux, MacOS and Windows
* Remove calls to PyOS_CheckStack
* Add stack protection to parser
* Make tests more robust to low stacks
* Improve error messages for stack overflow
Revert "GH-91079: Implement C stack limits using addresses, not counters. (GH-130007)" for now
Unfortunatlely, the change broke some buildbots.
This reverts commit 2498c22fa0.
* Implement C recursion protection with limit pointers
* Remove calls to PyOS_CheckStack
* Add stack protection to parser
* Make tests more robust to low stacks
* Improve error messages for stack overflow
Remove _PyInterpreterState_GetConfigCopy() and
_PyInterpreterState_SetConfig() private functions. PEP 741 "Python
Configuration C API" added a better public C API: PyConfig_Get() and
PyConfig_Set().
* Add `_PyDictKeys_StringLookupSplit` which does locking on dict keys and
use in place of `_PyDictKeys_StringLookup`.
* Change `_PyObject_TryGetInstanceAttribute` to use that function
in the case of split keys.
* Add `unicodekeys_lookup_split` helper which allows code sharing
between `_Py_dict_lookup` and `_PyDictKeys_StringLookupSplit`.
* Fix locking for `STORE_ATTR_INSTANCE_VALUE`. Create
`_GUARD_TYPE_VERSION_AND_LOCK` uop so that object stays locked and
`tp_version_tag` cannot change.
* Pass `tp_version_tag` to `specialize_dict_access()`, ensuring
the version we store on the cache is the correct one (in case of
it changing during the specalize analysis).
* Split `analyze_descriptor` into `analyze_descriptor_load` and
`analyze_descriptor_store` since those don't share much logic.
Add `descriptor_is_class` helper function.
* In `specialize_dict_access`, double check `_PyObject_GetManagedDict()`
in case we race and dict was materialized before the lock.
* Avoid borrowed references in `_Py_Specialize_StoreAttr()`.
* Use `specialize()` and `unspecialize()` helpers.
* Add unit tests to ensure specializing happens as expected in FT builds.
* Add unit tests to attempt to trigger data races (useful for running under TSAN).
* Add `has_split_table` function to `_testinternalcapi`.