Add `gi_state`, `cr_state`, and `ag_state` attributes to generators,
coroutines, and async generators respectively. Each attribute exposes the
current state as a string (e.g., `GEN_RUNNING`, `CORO_SUSPENDED`).
The `inspect.getgeneratorstate()`, `inspect.getcoroutinestate()`, and
`inspect.getasyncgenstate()` functions now return the value of these
attributes directly.
This is in preparation for making `gi_frame` thread-safe, which may involve
stop-the-world synchronization. The new state attributes avoid potential
performance cliffs in `inspect.getgeneratorstate()` and similar functions by
not requiring frame access.
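A minimal sketch of the new attributes in use, assuming a build that
includes this change:

```python
import inspect

def gen():
    yield 1

g = gen()
print(g.gi_state)   # 'GEN_CREATED'
next(g)
print(g.gi_state)   # 'GEN_SUSPENDED'
g.close()
print(g.gi_state)   # 'GEN_CLOSED'

# inspect.getgeneratorstate() now reads the attribute directly,
# without touching gi_frame:
assert inspect.getgeneratorstate(g) == g.gi_state
```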
Also remove the unused `FRAME_COMPLETED` state and renumber the frame state enum
to start at 0 instead of -1.
The positional arguments passed to _PyStack_UnpackDict are already
kept alive by the caller, so we can avoid the extra reference count
operations by using borrowed references instead of creating new ones.
This reduces reference count contention in the free-threaded build
when calling functions with keyword arguments. In particular, this
avoids contention on the type argument to `__new__` when instantiating
namedtuples with keyword arguments.
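A sketch of the call pattern that benefits; the pool-based stress loop is
a hypothetical harness, not part of the change:

```python
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

Point = namedtuple("Point", ["x", "y"])

def make_points(n):
    # Keyword arguments route through _PyStack_UnpackDict; the shared
    # Point type passed to __new__ previously saw refcount contention.
    return [Point(x=i, y=i) for i in range(n)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(make_points, [10_000] * 8))
```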
The pymalloc huge page support had two problems. First, on
architectures where the default huge page size exceeds the arena
size (e.g. 32 MiB on PPC, 512 MiB on ARM64 with 64 KiB base
pages), mmap with MAP_HUGETLB silently allocates a full huge page
even when the requested size is smaller. The subsequent munmap
with the original arena size then fails with EINVAL, permanently
leaking the entire huge page. Second, huge pages were always
attempted when compiled in, with no way to disable them at
runtime. On Linux, if the huge page pool is exhausted, page
faults (including copy-on-write faults after fork) deliver SIGBUS
and kill the process.
The arena allocator now queries the system huge page size from
/proc/meminfo and skips MAP_HUGETLB when the arena size is not a
multiple of it. Huge pages also now require explicit opt-in at
runtime via the PYTHON_PYMALLOC_HUGEPAGES environment variable,
which is read through PyConfig and respects -E and -I flags.
The config field pymalloc_hugepages is propagated to the runtime
allocators struct so the low-level arena allocator can check it
without calling getenv directly.
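A Python mirror of the new checks, for illustration only; the real logic
lives in the C arena allocator, and the arena size and the exact opt-in
value accepted by the environment variable are assumptions here:

```python
import os

ARENA_SIZE = 1 << 20  # illustrative; the real constant lives in Objects/obmalloc.c

def hugepage_size():
    # Parse the "Hugepagesize: NNN kB" line from /proc/meminfo.
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Hugepagesize:"):
                    return int(line.split()[1]) * 1024  # reported in kB
    except OSError:
        pass
    return 0

def use_hugetlb():
    # Explicit opt-in required; assuming "1" enables it.
    if os.environ.get("PYTHON_PYMALLOC_HUGEPAGES") != "1":
        return False
    size = hugepage_size()
    # Skip MAP_HUGETLB unless the arena size is a multiple of the huge
    # page size; otherwise munmap of the original arena size fails with
    # EINVAL and the huge page leaks.
    return size > 0 and ARENA_SIZE % size == 0
```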
Add a FRAME_SUSPENDED_YIELD_FROM_LOCKED state that acts as a brief
lock, preventing other threads from transitioning the frame state
while gen_getyieldfrom reads the yield-from object off the stack.
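A sketch of the window this covers: one thread reads `gi_yieldfrom` while
another advances the generator (free-threaded build; the loop counts are
arbitrary):

```python
import threading

def inner():
    for i in range(1_000_000):
        yield i

def outer():
    yield from inner()

g = outer()
next(g)  # enter the delegation, so a yield-from object is on the stack

def reader():
    for _ in range(100_000):
        g.gi_yieldfrom  # gen_getyieldfrom; briefly locks the frame state

def driver():
    for _ in range(100_000):
        next(g)  # transitions the frame state on each resume

threads = [threading.Thread(target=reader), threading.Thread(target=driver)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```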
In `_PyDict_GetMethodStackRef`, only use the fast-path unicode lookup
when the dict is owned by the current thread or already marked as shared.
This prevents a race between the lookup and concurrent dict resizes,
which may free the PyDictKeysObject (i.e., it ensures that the resize
uses QSBR).
Address a similar issue in `_Py_dict_lookup_threadsafe_stackref` by
calling `ensure_shared_on_read()`.
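A stress sketch of the racing pattern being closed (free-threaded build;
the class and loop counts are arbitrary):

```python
import threading

class C:
    def method(self):
        return 0

obj = C()

def lookups():
    for _ in range(200_000):
        obj.method()  # method lookup consults the instance dict on the fast path

def resizes():
    for i in range(20_000):
        setattr(obj, f"a{i}", i)  # each growth step can resize the dict keys

threads = [threading.Thread(target=lookups), threading.Thread(target=resizes)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```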
This affects string formatting as well as bytes and bytearray formatting.
* For errors in the format string, always include the position of the
start of the format unit.
* For errors related to the formatted arguments, always include the number
or the name of the formatted argument.
* Suggest more probable causes of errors in the format string (stray %,
unsupported format, unexpected character).
* Provide more information when the number of arguments does not match
the number of format units.
* Raise more specific errors when access of arguments by name is mixed with
sequential access and when * is used with a mapping.
* Add tests for some uncovered cases.
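Illustrative inputs for the categories above; the exact messages depend on
the interpreter version, so this simply prints whatever each case produces:

```python
cases = [
    ("%d %d", (1,)),                 # argument count does not match format units
    ("%q", (1,)),                    # unsupported format character
    ("%(name)s %s", {"name": "x"}),  # access by name mixed with sequential access
    ("%*d", {"width": 1}),           # '*' used with a mapping
]
for fmt, args in cases:
    try:
        print(f"{fmt!r} -> {fmt % args!r}")
    except (TypeError, ValueError) as exc:
        print(f"{fmt!r} -> {type(exc).__name__}: {exc}")
```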
Optimize bytes.translate() by deferring change detection
Move the equality check out of the hot loop to allow better compiler
optimization. Instead of checking each byte during translation, perform
a single memcmp at the end to determine if the input can be returned
unchanged.
This allows compilers to unroll and pipeline the loops, resulting in ~2x
throughput improvement for medium-to-large inputs (tested on an AMD Zen 2).
No change observed on small inputs.
It is also faster for bytes subclasses, since for those the input can
never be returned unchanged and change detection is skipped entirely.
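The restructuring, mirrored in Python as a sketch (the real implementation
is C, where the final comparison is a single memcmp):

```python
def translate_sketch(data: bytes, table: bytes) -> bytes:
    # Tight translation loop with no per-byte "did it change?" branch,
    # which the C compiler can unroll and pipeline:
    out = bytes(table[b] for b in data)
    # One comparison at the end decides whether the input (exact bytes
    # only) can be returned unchanged:
    if out == data:
        return data
    return out
```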
When specializing to `LOAD_GLOBAL_MODULE` or `LOAD_ATTR_MODULE`, try
to enable deferred reference counting for the value if the object is owned by
a different thread. This applies to the free-threaded build only and should
improve scaling of multi-threaded programs.
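The shape of code that benefits, as a sketch; the module-level dict and
loop bounds are arbitrary:

```python
import threading

LIMITS = {"max": 10}  # module-level global shared by all threads

def worker():
    hits = 0
    for i in range(1_000_000):
        if i < LIMITS["max"]:  # specializes to LOAD_GLOBAL_MODULE
            hits += 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```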
Speed up conversion from bytes-like objects such as `bytearray`, while
keeping conversion from `bytes` stable.
Co-authored-by: Sergey B Kirpichev <skirpichev@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
When comparing a negative non-integer float with an int that has the same
number of bits in its integer part, an int subclass whose `__neg__()`
returned a non-int caused an assertion error.
Now the integer is no longer negated during the comparison. This also
reduces the number of temporary Python objects created.
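A sketch of the kind of input that triggered it; `BadInt` is a
hypothetical subclass:

```python
class BadInt(int):
    def __neg__(self):
        return "not an int"  # violates the usual __neg__ contract

# A negative non-integer float against a negative int with the same
# number of bits in the integer part: the comparison previously negated
# the int, calling the subclass __neg__(), and asserted on the non-int
# result. It now compares without negating.
print(-1.5 < BadInt(-1))  # True
```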
The PyObject header reference count fields must be initialized using
atomic operations because they may be concurrently read by another
thread (e.g., from `_Py_TryIncref`).
Remove redundant includes of `pycore_optimizer.h` from
Objects/frameobject.c and Python/instrumentation.c.
No functional change.
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Co-authored-by: Brandt Bucher <brandt@python.org>
Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
The uses of memmove and _Py_memory_repeat were not thread-safe in the
free-threaded build in some cases. In theory, memmove and
_Py_memory_repeat can copy byte-by-byte instead of pointer-by-pointer,
so concurrent readers could see uninitialized data or tearing.
Additionally, we should be using "release" (or stronger) ordering to be
compliant with the C11 memory model when copying objects within a list.
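A stress sketch of the hazard (free-threaded build; sizes and iteration
counts are arbitrary):

```python
import threading

data = list(range(1024))
done = threading.Event()

def writer():
    for _ in range(10_000):
        data[:] = data[:512] * 2  # slice copy and repeat hit memmove/_Py_memory_repeat
    done.set()

def reader():
    while not done.is_set():
        for item in data:  # must never observe torn or uninitialized pointers
            pass

threads = [threading.Thread(target=writer)] + [
    threading.Thread(target=reader) for _ in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```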
This makes generator frame state transitions atomic in the free-threaded
build, which avoids segfaults when trying to execute
a generator from multiple threads concurrently.
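A sketch of the previously crashing pattern; with atomic transitions, a
thread that loses the race now gets a ValueError instead of the
interpreter segfaulting:

```python
import threading

def gen():
    while True:
        yield

g = gen()

def spin():
    for _ in range(100_000):
        try:
            next(g)
        except ValueError:
            pass  # "generator already executing" in another thread

threads = [threading.Thread(target=spin) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```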
There are still a few operations that aren't thread-safe and may crash
if performed concurrently on the same generator/coroutine:
* Accessing gi_yieldfrom/cr_await/ag_await
* Accessing gi_frame/cr_frame/ag_frame
* Async generator operations
Now that we specialize range iteration in the interpreter for the common
case where the iterator has only one reference, there is no significant
performance cost to making the iteration thread-safe.
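A sketch of the now-safe pattern on the free-threaded build, with one
range iterator drained by several threads:

```python
import threading

it = iter(range(100_000))
counts = []

def consume():
    n = 0
    for _ in it:  # concurrent next() on a shared range iterator
        n += 1
    counts.append(n)

threads = [threading.Thread(target=consume) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sum(counts) == 100_000  # each value handed out exactly once
```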