gh-115952: Fix a potential virtual memory allocation denial of service in pickle (GH-119204)

Loading small data that does not even involve arbitrary code execution
could consume an arbitrarily large amount of memory. There were three issues:

* PUT and LONG_BINPUT with a large argument (the C implementation only).
  Since the memo is implemented in C as a contiguous dynamic array, a
  single opcode could force it to be resized to an arbitrary size.  Now
  the sparsity of memo indices is limited.
* BINBYTES, BINBYTES8 and BYTEARRAY8 with a large argument.  They allocated
  a bytes or bytearray object of the specified size before reading into it.
  Now they read very large data in chunks (see the sketch after this list).
* BINSTRING, BINUNICODE, LONG4, BINUNICODE8 and FRAME with a large
  argument.  They read the whole data by calling the read() method of
  the underlying file object, which usually allocates a bytes object of
  the specified size before reading into it.  Now they read very large
  data in chunks.
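
To make the second and third issues concrete, here is a minimal sketch
(not part of this commit) of the truncated-data attack vector. `PROTO`
and `BYTEARRAY8` are real opcode constants exported by the pickle
module; the claimed size is arbitrary:

    # Minimal sketch of a truncated BYTEARRAY8 pickle: the opcode claims
    # a 1 TiB payload, but almost no data follows.  Unfixed versions
    # allocated bytearray(1 << 40) up front; with chunked reading the
    # load fails after allocating about a megabyte.
    import pickle
    import struct

    evil = (pickle.PROTO + bytes([5])                  # protocol 5 header
            + pickle.BYTEARRAY8 + struct.pack('<Q', 1 << 40)
            + b'truncated')                            # far less than claimed

    try:
        pickle.loads(evil)
    except Exception as exc:        # e.g. EOFError; no 1 TiB allocation
        print(type(exc).__name__)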

Also add a comprehensive benchmark suite to measure the performance and
memory impact of the chunked-reading optimization in PR #119204.

Features:
- Normal mode: benchmarks legitimate pickles (time/memory metrics)
- Antagonistic mode: tests malicious pickles (DoS protection)
- Baseline comparison: side-by-side comparison of two Python builds
- Support for truncated-data and sparse-memo attack vectors

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Serhiy Storchaka, 2025-12-05 19:17:01 +02:00, committed by GitHub
commit 59f247e43b (parent 4085ff7b32)
7 changed files with 1767 additions and 177 deletions


@@ -189,6 +189,11 @@ def __init__(self, value):
 
 __all__.extend(x for x in dir() if x.isupper() and not x.startswith('_'))
 
+# Data larger than this will be read in chunks, to prevent extreme
+# overallocation.
+_MIN_READ_BUF_SIZE = (1 << 20)
+
+
 class _Framer:
 
     _FRAME_SIZE_MIN = 4
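
(For scale: `_MIN_READ_BUF_SIZE` is 1 MiB. It serves both as the cap on
the initial allocation and as the starting chunk size for the doubling
reads in the hunks below, so beyond the first chunk the unpickler never
requests more data than it has already received.)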
@@ -287,7 +292,7 @@ def read(self, n):
                     "pickle exhausted before end of frame")
             return data
         else:
-            return self.file_read(n)
+            return self._chunked_file_read(n)
 
     def readline(self):
         if self.current_frame:
@@ -302,11 +307,23 @@ def readline(self):
         else:
             return self.file_readline()
 
+    def _chunked_file_read(self, size):
+        cursize = min(size, _MIN_READ_BUF_SIZE)
+        b = self.file_read(cursize)
+        while cursize < size and len(b) == cursize:
+            delta = min(cursize, size - cursize)
+            b += self.file_read(delta)
+            cursize += delta
+        return b
+
     def load_frame(self, frame_size):
         if self.current_frame and self.current_frame.read() != b'':
             raise UnpicklingError(
                 "beginning of a new frame before end of current frame")
-        self.current_frame = io.BytesIO(self.file_read(frame_size))
+        data = self._chunked_file_read(frame_size)
+        if len(data) < frame_size:
+            raise EOFError
+        self.current_frame = io.BytesIO(data)
 
 
 # Tools used for pickling.
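
The doubling loop is easy to exercise in isolation. Below is a
standalone sketch (with illustrative names, not part of the patch):
each subsequent request is at most as large as the data already
received, so a short stream stops the loop long before the claimed
size is allocated.

    # Standalone sketch of the chunked-read loop; chunked_read is an
    # illustrative name, not part of the patch.
    import io

    _MIN_READ_BUF_SIZE = 1 << 20          # 1 MiB, as in the patch

    def chunked_read(file, size):
        cursize = min(size, _MIN_READ_BUF_SIZE)
        data = file.read(cursize)
        # Grow geometrically while the stream keeps up; a short read
        # means EOF, so the loop stops without allocating `size` bytes.
        while cursize < size and len(data) == cursize:
            delta = min(cursize, size - cursize)
            data += file.read(delta)
            cursize += delta
        return data

    # A 10-byte stream that claims a terabyte yields only 10 bytes.
    print(len(chunked_read(io.BytesIO(b'0123456789'), 1 << 40)))   # 10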
@@ -1496,12 +1513,17 @@ def load_binbytes8(self):
     dispatch[BINBYTES8[0]] = load_binbytes8
 
     def load_bytearray8(self):
-        len, = unpack('<Q', self.read(8))
-        if len > maxsize:
+        size, = unpack('<Q', self.read(8))
+        if size > maxsize:
             raise UnpicklingError("BYTEARRAY8 exceeds system's maximum size "
                                   "of %d bytes" % maxsize)
-        b = bytearray(len)
-        self.readinto(b)
+        cursize = min(size, _MIN_READ_BUF_SIZE)
+        b = bytearray(cursize)
+        if self.readinto(b) == cursize:
+            while cursize < size and len(b) == cursize:
+                delta = min(cursize, size - cursize)
+                b += self.read(delta)
+                cursize += delta
         self.append(b)
     dispatch[BYTEARRAY8[0]] = load_bytearray8
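
A note on the guard in the last hunk: the initial `readinto()` fills
the first `cursize`-byte chunk in place, and growth via
`b += self.read(delta)` continues only while every request comes back
complete. A truncated stream therefore stops the loop early with a
partially filled bytearray, and the exhausted stream typically surfaces
as an error at the next opcode read, rather than forcing a `size`-byte
allocation up front.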