Optimize base64 encoding/decoding by eliminating loop-carried dependencies. Key changes:
- Add `base64_encode_trio()` and `base64_decode_quad()` helper functions that process complete groups independently
- Add `base64_encode_fast()` and `base64_decode_fast()` wrappers
- Update `b2a_base64` and `a2b_base64` to use the fast path for complete groups
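
A minimal sketch of the per-group idea, not the code from this PR: each helper reads one complete group and writes one complete group, so no state is carried from one loop iteration to the next. The `sketch_*` names and the lookup helper are hypothetical and only illustrate the structure; leftover bytes and `=` padding are assumed to be handled by a separate slow path.

```c
#include <stddef.h>
#include <stdint.h>

static const char SKETCH_B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode one complete 3-byte group into 4 characters.
   Reads only in[0..2] and writes only out[0..3]. */
void sketch_encode_trio(const uint8_t *in, char *out)
{
    uint32_t v = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    out[0] = SKETCH_B64[(v >> 18) & 0x3f];
    out[1] = SKETCH_B64[(v >> 12) & 0x3f];
    out[2] = SKETCH_B64[(v >> 6) & 0x3f];
    out[3] = SKETCH_B64[v & 0x3f];
}

/* Map one base64 character to its 6-bit value, or -1 if it is not
   in the standard alphabet (this includes '='). */
int sketch_sextet(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode one complete 4-character group into 3 bytes.
   Returns 0 on success, -1 if the caller must fall back to the slow path. */
int sketch_decode_quad(const char *in, uint8_t *out)
{
    int a = sketch_sextet(in[0]), b = sketch_sextet(in[1]);
    int c = sketch_sextet(in[2]), d = sketch_sextet(in[3]);
    if ((a | b | c | d) < 0) {
        return -1;  /* at least one character was invalid or padding */
    }
    uint32_t v = ((uint32_t)a << 18) | ((uint32_t)b << 12)
               | ((uint32_t)c << 6) | (uint32_t)d;
    out[0] = (uint8_t)(v >> 16);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)v;
    return 0;
}

/* Fast path: encode every complete group and return the number of input
   bytes consumed. The caller handles the remaining 0-2 bytes. */
size_t sketch_encode_fast(const uint8_t *in, size_t len, char *out)
{
    size_t groups = len / 3;
    for (size_t i = 0; i < groups; i++) {
        /* Iteration i touches only group i, so iterations are independent. */
        sketch_encode_trio(in + 3 * i, out + 4 * i);
    }
    return groups * 3;
}
```

Because input and output offsets are computed from the loop index rather than from running pointers or a bit accumulator, the compiler and CPU are free to overlap work on neighbouring groups.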
Performance gains (encode/decode speedup vs main, PGO builds):
```
            64 bytes    64K         1M
Zen2:       1.2x/1.8x   1.7x/2.8x   1.5x/2.8x
Zen4:       1.2x/1.7x   1.6x/3.0x   1.5x/3.0x   [old data, likely faster]
M4:         1.3x/1.9x   2.3x/2.8x   2.4x/2.9x   [old data, likely faster]
RPi5-32:    1.2x/1.2x   2.4x/2.4x   2.0x/2.1x
```
Based on my exploratory work done in https://github.com/python/cpython/compare/main...gpshead:cpython:claude/vectorize-base64-c-S7Hku
See the PR and issue for further thoughts on SIMD-vectorized versions of this, which are sometimes MUCH faster.
mmapmodule: remove unreachable code in Windows error path
Remove an unreachable `return NULL` after `PyErr_SetFromWindowsErr()` in
the Windows mmap resize error path.
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
Make the attributes in the _bz2 module thread-safe on the free-threading build.
Attributes (eof, needs_input, unused_data) are now stored atomically or
accessed via mutex-protected getters.
Make the zlib module thread-safe on the free-threading build. Even though operations
are protected by locks, attributes exposed via PyMemberDef (eof, needs_input,
unused_data, unconsumed_tail) should still be stored atomically within locked
sections, since they can be read without acquiring the lock.
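
The pattern described above can be sketched with plain C11 atomics and a POSIX mutex. This is an illustration under stated assumptions, not the modules' actual code: the `sketch_decomp` struct, its fields, and the function names are hypothetical, and the real modules use CPython's own locking and atomic helpers rather than `pthread` and `<stdatomic.h>`.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a decompressor object. */
typedef struct {
    pthread_mutex_t lock;     /* serializes decompress operations */
    atomic_bool eof;          /* may be read without holding the lock */
    atomic_bool needs_input;  /* may be read without holding the lock */
    size_t consumed;          /* only touched while holding the lock */
} sketch_decomp;

void sketch_init(sketch_decomp *d)
{
    pthread_mutex_init(&d->lock, NULL);
    atomic_init(&d->eof, false);
    atomic_init(&d->needs_input, true);
    d->consumed = 0;
}

/* Writer: runs under the lock, but still stores the exposed flags
   atomically because readers do not take the lock. */
void sketch_feed(sketch_decomp *d, size_t nbytes, bool last_chunk)
{
    pthread_mutex_lock(&d->lock);
    d->consumed += nbytes;
    atomic_store_explicit(&d->eof, last_chunk, memory_order_relaxed);
    atomic_store_explicit(&d->needs_input, !last_chunk, memory_order_relaxed);
    pthread_mutex_unlock(&d->lock);
}

/* Lock-free reader, analogous to an attribute exposed as a plain member. */
bool sketch_get_eof(sketch_decomp *d)
{
    return atomic_load_explicit(&d->eof, memory_order_relaxed);
}

/* Mutex-protected getter, for state that cannot be read atomically. */
size_t sketch_get_consumed(sketch_decomp *d)
{
    pthread_mutex_lock(&d->lock);
    size_t n = d->consumed;
    pthread_mutex_unlock(&d->lock);
    return n;
}
```

The lock alone is not enough for `eof` and `needs_input`: a plain store inside the locked section would still race with an unlocked plain load, which is why the stores are atomic as well.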
If there are many untracked tuples, the GC will run too often, resulting
in poor performance. The fix is to include untracked tuples in the
"long lived" object count. The number of frozen objects is also now
included since the free-threaded GC must scan those too.
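
A rough sketch of why the long-lived count matters, assuming a heuristic in which a collection is skipped unless the number of new objects is a sizeable fraction of the long-lived total. The struct, the function names, and the one-quarter ratio are assumptions made for illustration and are not taken from this change.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified GC bookkeeping. */
typedef struct {
    size_t young_count;       /* objects allocated since the last collection */
    size_t threshold;         /* fixed minimum before considering a collection */
    size_t long_lived_total;  /* objects the collector has to scan anyway */
} sketch_gc_state;

/* With the fix, untracked tuples and frozen objects count as long lived. */
size_t sketch_long_lived(size_t tracked_survivors,
                         size_t untracked_tuples,
                         size_t frozen_objects)
{
    return tracked_survivors + untracked_tuples + frozen_objects;
}

/* Collect only if enough new objects exist relative to the long-lived
   total (assumed ratio: one quarter), so collection cost stays amortized. */
bool sketch_should_collect(const sketch_gc_state *gc)
{
    if (gc->young_count <= gc->threshold) {
        return false;
    }
    return gc->young_count >= gc->long_lived_total / 4;
}
```

If a million untracked tuples are left out of `long_lived_total`, the second check passes almost every time the threshold is crossed; counting them raises the bar in proportion to the work each collection has to do.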
This PR implements frame caching in the RemoteUnwinder class to significantly reduce memory reads when profiling remote processes with deep call stacks.
When cache_frames=True, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.
The implementation adds a last_profiled_frame field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.
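
The marker-and-splice step can be sketched as follows. This is a simplified model and not the RemoteUnwinder implementation: the `sketch_*` types, the array-backed "remote process", and the read counter are hypothetical, and the sketch assumes the frames below the marker are unchanged since the previous sample.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DEPTH 64

/* Hypothetical remote process: parent[f] is the caller of frame f,
   and 0 terminates the chain. */
typedef struct {
    const uintptr_t *parent;
    size_t reads;  /* number of simulated remote memory reads */
} sketch_remote;

typedef struct {
    uintptr_t frames[SKETCH_MAX_DEPTH];  /* previous sample, innermost first */
    size_t depth;
    uintptr_t last_profiled_frame;       /* innermost frame of previous sample */
} sketch_cache;

/* Simulated remote read: fetch the caller of `frame` from the target. */
static uintptr_t sketch_read_parent(sketch_remote *r, uintptr_t frame)
{
    r->reads++;
    return r->parent[frame];
}

/* Walk from `top` toward the outermost frame, writing frames into `out`
   (capacity SKETCH_MAX_DEPTH, distinct from the cache). Returns the depth. */
size_t sketch_unwind(sketch_remote *r, sketch_cache *c,
                     uintptr_t top, uintptr_t *out)
{
    size_t n = 0;
    uintptr_t f = top;
    while (f != 0 && n < SKETCH_MAX_DEPTH) {
        if (c->depth > 0 && f == c->last_profiled_frame) {
            /* Reached the marker: reuse the cached tail, no more reads. */
            size_t room = SKETCH_MAX_DEPTH - n;
            size_t take = c->depth < room ? c->depth : room;
            memcpy(out + n, c->frames, take * sizeof(uintptr_t));
            n += take;
            break;
        }
        out[n++] = f;
        f = sketch_read_parent(r, f);
    }
    /* Remember this sample for the next call. */
    memcpy(c->frames, out, n * sizeof(uintptr_t));
    c->depth = n;
    c->last_profiled_frame = top;
    return n;
}
```

In this model, a sample taken one call deeper than the previous one costs a single remote read instead of one read per frame. If the marker frame has already returned, the walk never meets it and falls back to reading the whole chain.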
The sampling profiler now enables frame caching by default.