Allow the --enable-pystats build option to be used with free-threading. The
stats are now stored on a per-interpreter basis, rather than process global.
For free-threaded builds, the stats structure is allocated per-thread and
then periodically merged into the per-interpreter stats structure (on thread
exit or when the reporting function is called). Most of the pystats related
code has be moved into the file Python/pystats.c.
* FOR_ITER now pushes either the iterator and NULL or leaves the iterable and pushes tagged zero
* NEXT_ITER uses the tagged int as the index into the sequence or, if TOS is NULL, iterates as before.
Mark a few functions used by the interpreter loop as noinline
These are all the slow path and should not be inlined into the interpreter
loop. Unfortunately, they end up being inlined with LTO and the current PGO
task.
Add free-threaded versions of existing specialization for FOR_ITER (list, tuples, fast range iterators and generators), without significantly affecting their thread-safety. (Iterating over shared lists/tuples/ranges should be fine like before. Reusing iterators between threads is not fine, like before. Sharing generators between threads is a recipe for significant crashes, like before.)
Add free-threaded specialization for COMPARE_OP, and tests for COMPARE_OP specialization in general.
Co-authored-by: Donghee Na <donghee.na92@gmail.com>
* Add `_PyDictKeys_StringLookupSplit` which does locking on dict keys and
use in place of `_PyDictKeys_StringLookup`.
* Change `_PyObject_TryGetInstanceAttribute` to use that function
in the case of split keys.
* Add `unicodekeys_lookup_split` helper which allows code sharing
between `_Py_dict_lookup` and `_PyDictKeys_StringLookupSplit`.
* Fix locking for `STORE_ATTR_INSTANCE_VALUE`. Create
`_GUARD_TYPE_VERSION_AND_LOCK` uop so that object stays locked and
`tp_version_tag` cannot change.
* Pass `tp_version_tag` to `specialize_dict_access()`, ensuring
the version we store on the cache is the correct one (in case of
it changing during the specalize analysis).
* Split `analyze_descriptor` into `analyze_descriptor_load` and
`analyze_descriptor_store` since those don't share much logic.
Add `descriptor_is_class` helper function.
* In `specialize_dict_access`, double check `_PyObject_GetManagedDict()`
in case we race and dict was materialized before the lock.
* Avoid borrowed references in `_Py_Specialize_StoreAttr()`.
* Use `specialize()` and `unspecialize()` helpers.
* Add unit tests to ensure specializing happens as expected in FT builds.
* Add unit tests to attempt to trigger data races (useful for running under TSAN).
* Add `has_split_table` function to `_testinternalcapi`.
We use the same approach that was used for specialization of LOAD_GLOBAL in free-threaded builds:
_CHECK_ATTR_MODULE is renamed to _CHECK_ATTR_MODULE_PUSH_KEYS; it pushes the keys object for the following _LOAD_ATTR_MODULE_FROM_KEYS (nee _LOAD_ATTR_MODULE). This arrangement avoids having to recheck the keys version.
_LOAD_ATTR_MODULE is renamed to _LOAD_ATTR_MODULE_FROM_KEYS; it loads the value from the keys object pushed by the preceding _CHECK_ATTR_MODULE_PUSH_KEYS at the cached index.
* Enable specialization of CALL_KW
* Fix bug pushing frame in _PY_FRAME_KW
`_PY_FRAME_KW` pushes a pointer to the new frame onto the stack for
consumption by the next uop. When pushing the frame fails, we do not
want to push the result, `NULL`, to the stack because it is not
a valid stackref. This works in the default build because `PyStackRef_NULL`
and `NULL` are the same value, so the `PyStackRef_XCLOSE()` in the error
handler ignores it. In the free-threaded build the values are not the same;
`PyStackRef_XCLOSE()` will attempt to decref a null pointer.