When _ctypes is imported, it may call dlopen on the libpython shared
library, causing the dynamic linker to load a second mapping of the
library into the process address space. The remote debugging code
iterates memory regions from low addresses upward and returns the first
mapping whose filename matches libpython. After _ctypes is imported, it
finds the dlopen'd copy first, but that copy's PyRuntime section was
never initialized, so reading debug offsets from it fails.
Fix this by validating each candidate PyRuntime address before accepting
it. The validation reads the first 8 bytes and checks for the "xdebugpy"
cookie that is only present in an initialized PyRuntime. Uninitialized
duplicate mappings will fail this check and be skipped, allowing the
search to continue to the real, initialized PyRuntime.
When using blocking mode in the remote debugging profiler, ptrace calls
to seize threads can fail with EPERM if the thread has exited between
listing and attaching, is in a special kernel state, or is already being
traced. Previously this raised a RuntimeError that was caught by the
Python sampling loop,and retried indefinitely since EPERM is
a persistent condition that will not resolve on its own.
Treat EPERM the same as ESRCH by returning 1 (skip this thread) instead
of -1 (fatal error). This allows profiling to continue with the threads
that can be traced rather than entering an endless retry loop printing
the same error message repeatedly.
This PR implements frame caching in the RemoteUnwinder class to significantly reduce memory reads when profiling remote processes with deep call stacks.
When cache_frames=True, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.
The implementation adds a last_profiled_frame field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.
The sampling profiler now enables frame caching by default.