mirror of
https://github.com/yaml/pyyaml.git
synced 2026-06-18 11:51:41 +00:00
* Make local test targets self-contained Most developer machines do not have libyaml headers installed, so the old default test flow attempted to build the optional extension, failed to find yaml.h, and then only exercised the pure Python fallback. The extension target also required out-of-band system setup before it could pass. Split the Makefile test flow into explicit pure-Python and LibYAML targets, and make the default test target run both. The pure-Python target now forces the extension off during build, while the LibYAML target builds a pinned local yaml/libyaml checkout and wires the include, library, and runtime paths into the extension build. Document the new Makefile workflow in the README, including make test, the individual test targets, the pinned LIBYAML-REF default, and the PYTHON-VERSION override for testing against a specific Python. This makes both test modes pass from a clean checkout without requiring developers to install libyaml development packages by hand. * Fix exponential expansion DoS via duplicate merge key aliases * Fix exponential expansion based DoS in merge key processing Resolves #897 Supercedes #916 ### Summary The merge key (`<<`) constructor implementation in `SafeConstructor.flatten_mapping()` was vulnerable to an exponential time and memory complexity Denial of Service (DoS) vulnerability. When mapping/sequence nodes are merged using anchors/aliases, duplicate references to the same alias point to the same MappingNode instance in Python. During merge key processing, the node values are copied and extended in-place. If the same node appears multiple times at different levels, this causes exponential amplification of the elements list: `2^(n+1) - 1`. A small document under 1 KB can trigger millions of element list extensions, exhausting CPU and memory during safe loading. ### Hardened Fix This commit resolves the vulnerability and hardens it against secondary vectors: 1. Tracks node identity using object ID (`id(node)`) in a single `seen` set scoped to the parent mapping's `flatten_mapping()` execution. 2. Checks and skips duplicate node references inside SequenceNode merge keys (resolving PR-916). 3. Checks and skips duplicate node references across separate, independent MappingNode merge keys in the same mapping (e.g., repeating `<<: *anchor` multiple times). 4. Ensures C-based loaders (e.g., `CSafeLoader`, `CLoader`) are also protected since they inherit constructor logic from `SafeConstructor`. ### Performance Impact - Sequence-nested merge duplicates: Loading a 22-level nested document drops from 3.76s to 0.0028s (O(N) linear complexity). - Mapping-level merge duplicates: Loading a 20-level nested document drops from 0.93s to 0.0026s. ### Tests - Added regression tests to `tests/legacy_tests/data/construct-merge.data` and `tests/legacy_tests/data/construct-merge.code` covering both duplicate sequence merges and duplicate direct merges. * Fix CI toolchain drift failures The PR was failing before it reached the PyYAML regression tests in two places: the cp38 wheel jobs installed latest cibuildwheel, whose 4.x series no longer supports Python 3.8, and the Windows arm64 libyaml build used a newer CMake that rejects libyaml 0.2.5's old minimum policy version. Pin the cp38 wheel jobs to cibuildwheel<4, and quote the matrix-provided pip package spec so bash does not parse the '<4' constraint as input redirection. Pass CMAKE_POLICY_VERSION_MINIMUM=3.5 to the Windows libyaml configure step, and disable fail-fast for the Windows libyaml matrix so one architecture does not hide the others. * Add merge key deduplication regression tests * Add time-bounded DoS prevention tests to verify fix * Strengthen merge key DoS timeout tests * Fix merge key fan-out expansion DoS * Use structural merge DoS regression tests Replace the timing-based merge DoS tests with deterministic checks of the flattened mapping shape. Wall-clock thresholds are fragile on slow or emulated CI runners and do not directly prove that the expansion was prevented. The fan-out regression case uses a small levels=4, width=4 payload from the PR #916 shape. With the old behavior, flattening root expands to 1024 repeated key/value pairs and the test fails immediately because the key list contains 1020 extra entries. With the fix, duplicate merge pairs are collapsed during flattening and the same root mapping contains only k0, k1, k2, and k3. This keeps the test fast even if the old bug returns: it observes the structural amplification directly instead of waiting for a large pathological input to time out. * Hybrid merge keys optimization * A hybrid combination of fixes proposed by frenzymadness and akshat-sj to prevent exponential compute while resolving nested anchor aliases. * Further optimization is possible with broader caching of anchor node graph, but would require careful design of invalidation policy. --------- Co-authored-by: Akshat Singh Jaswal <sja.akshat@gmail.com> Co-authored-by: Aaron Bronow <abronow@gmail.com> Co-authored-by: Matt Davis <nitzmahone@redhat.com> |
||
|---|---|---|
| .. | ||
| legacy_tests | ||
| test_dump_load.py | ||
| test_merge.py | ||