gh-115952: Fix a potential virtual memory allocation denial of service in pickle (GH-119204)

Loading small data that does not even involve arbitrary code execution
could consume an arbitrarily large amount of memory. There were three issues:

* PUT and LONG_BINPUT with a large argument (the C implementation only).
  Since the memo is implemented in C as a contiguous dynamic array, a
  single opcode could force it to be resized to an arbitrary size.  Now
  the sparsity of memo indices is limited.
* BINBYTES, BINBYTES8 and BYTEARRAY8 with a large argument.  They allocated
  a bytes or bytearray object of the specified size before reading into it.
  Now they read very large data in chunks (see the sketch after this list).
* BINSTRING, BINUNICODE, LONG4, BINUNICODE8 and FRAME with a large
  argument.  They read the whole data by calling the read() method of
  the underlying file object, which usually allocates a bytes object of
  the specified size before reading into it.  Now they read very large
  data in chunks.
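
To make the second and third issues concrete, here is a minimal sketch
(not part of this commit) of the truncated-data attack vector. `PROTO`
and `BYTEARRAY8` are real opcode constants exported by the pickle
module; the claimed size is arbitrary:

    # Minimal sketch of a truncated BYTEARRAY8 pickle: the opcode claims
    # a 1 TiB payload, but almost no data follows.  Unfixed versions
    # allocated bytearray(1 << 40) up front; with chunked reading the
    # load fails after allocating about a megabyte.
    import pickle
    import struct

    evil = (pickle.PROTO + bytes([5])                  # protocol 5 header
            + pickle.BYTEARRAY8 + struct.pack('<Q', 1 << 40)
            + b'truncated')                            # far less than claimed

    try:
        pickle.loads(evil)
    except Exception as exc:        # e.g. EOFError; no 1 TiB allocation
        print(type(exc).__name__)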

Also add a comprehensive benchmark suite to measure the performance and
memory impact of the chunked-reading optimization in PR #119204.

Features:
- Normal mode: benchmarks legitimate pickles (time/memory metrics)
- Antagonistic mode: tests malicious pickles (DoS protection)
- Baseline comparison: side-by-side comparison of two Python builds
- Support for truncated-data and sparse-memo attack vectors

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Serhiy Storchaka, 2025-12-05 19:17:01 +02:00, committed by GitHub
commit 59f247e43b (parent 4085ff7b32)
7 changed files with 1767 additions and 177 deletions


@@ -189,6 +189,11 @@ def __init__(self, value):
 
 __all__.extend(x for x in dir() if x.isupper() and not x.startswith('_'))
 
+# Data larger than this will be read in chunks, to prevent extreme
+# overallocation.
+_MIN_READ_BUF_SIZE = (1 << 20)
+
+
 class _Framer:
 
     _FRAME_SIZE_MIN = 4
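
(For scale: `_MIN_READ_BUF_SIZE` is 1 MiB. It serves both as the cap on
the initial allocation and as the starting chunk size for the doubling
reads in the hunks below, so beyond the first chunk the unpickler never
requests more data than it has already received.)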
@@ -287,7 +292,7 @@ def read(self, n):
                     "pickle exhausted before end of frame")
             return data
         else:
-            return self.file_read(n)
+            return self._chunked_file_read(n)
 
     def readline(self):
         if self.current_frame:
@@ -302,11 +307,23 @@ def readline(self):
         else:
             return self.file_readline()
 
+    def _chunked_file_read(self, size):
+        cursize = min(size, _MIN_READ_BUF_SIZE)
+        b = self.file_read(cursize)
+        while cursize < size and len(b) == cursize:
+            delta = min(cursize, size - cursize)
+            b += self.file_read(delta)
+            cursize += delta
+        return b
+
     def load_frame(self, frame_size):
         if self.current_frame and self.current_frame.read() != b'':
             raise UnpicklingError(
                 "beginning of a new frame before end of current frame")
-        self.current_frame = io.BytesIO(self.file_read(frame_size))
+        data = self._chunked_file_read(frame_size)
+        if len(data) < frame_size:
+            raise EOFError
+        self.current_frame = io.BytesIO(data)
 
 
 # Tools used for pickling.
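
The doubling loop is easy to exercise in isolation. Below is a
standalone sketch (with illustrative names, not part of the patch):
each subsequent request is at most as large as the data already
received, so a short stream stops the loop long before the claimed
size is allocated.

    # Standalone sketch of the chunked-read loop; chunked_read is an
    # illustrative name, not part of the patch.
    import io

    _MIN_READ_BUF_SIZE = 1 << 20          # 1 MiB, as in the patch

    def chunked_read(file, size):
        cursize = min(size, _MIN_READ_BUF_SIZE)
        data = file.read(cursize)
        # Grow geometrically while the stream keeps up; a short read
        # means EOF, so the loop stops without allocating `size` bytes.
        while cursize < size and len(data) == cursize:
            delta = min(cursize, size - cursize)
            data += file.read(delta)
            cursize += delta
        return data

    # A 10-byte stream that claims a terabyte yields only 10 bytes.
    print(len(chunked_read(io.BytesIO(b'0123456789'), 1 << 40)))   # 10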
@@ -1496,12 +1513,17 @@ def load_binbytes8(self):
     dispatch[BINBYTES8[0]] = load_binbytes8
 
     def load_bytearray8(self):
-        len, = unpack('<Q', self.read(8))
-        if len > maxsize:
+        size, = unpack('<Q', self.read(8))
+        if size > maxsize:
             raise UnpicklingError("BYTEARRAY8 exceeds system's maximum size "
                                   "of %d bytes" % maxsize)
-        b = bytearray(len)
-        self.readinto(b)
+        cursize = min(size, _MIN_READ_BUF_SIZE)
+        b = bytearray(cursize)
+        if self.readinto(b) == cursize:
+            while cursize < size and len(b) == cursize:
+                delta = min(cursize, size - cursize)
+                b += self.read(delta)
+                cursize += delta
         self.append(b)
     dispatch[BYTEARRAY8[0]] = load_bytearray8
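
A note on the guard in the last hunk: the initial `readinto()` fills
the first `cursize`-byte chunk in place, and growth via
`b += self.read(delta)` continues only while every request comes back
complete. A truncated stream therefore stops the loop early with a
partially filled bytearray, and the exhausted stream typically surfaces
as an error at the next opcode read, rather than forcing a `size`-byte
allocation up front.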