mirror of
https://github.com/python/cpython.git
synced 2025-12-31 12:33:28 +00:00
541 lines
24 KiB
Markdown
541 lines
24 KiB
Markdown
# Profiling Binary Format
|
||
|
||
The profiling module includes a binary file format for storing sampling
|
||
profiler data. This document describes the format's structure and the
|
||
design decisions behind it.
|
||
|
||
The implementation is in
|
||
[`Modules/_remote_debugging/binary_io_writer.c`](../Modules/_remote_debugging/binary_io_writer.c)
|
||
and [`Modules/_remote_debugging/binary_io_reader.c`](../Modules/_remote_debugging/binary_io_reader.c),
|
||
with declarations in
|
||
[`Modules/_remote_debugging/binary_io.h`](../Modules/_remote_debugging/binary_io.h).
|
||
|
||
## Overview
|
||
|
||
The sampling profiler can generate enormous amounts of data. A typical
|
||
profiling session sampling at 1000 Hz for 60 seconds produces 60,000 samples.
|
||
Each sample contains a full call stack, often 20-50 frames deep, and each
|
||
frame includes a filename, function name, and line number. In a text-based
|
||
format like collapsed stacks, this would mean repeating the same long file
|
||
paths and function names thousands of times.
|
||
|
||
The binary format addresses this through two key strategies:
|
||
|
||
1. **Deduplication**: Strings and frames are stored once in lookup tables,
|
||
then referenced by small integer indices. A 100-character file path that
|
||
appears in 50,000 samples is stored once, not 50,000 times.
|
||
|
||
2. **Compact encoding**: Variable-length integers (varints) encode small
|
||
values in fewer bytes. Since most indices are small (under 128), they
|
||
typically need only one byte instead of four.
|
||
|
||
Together with optional zstd compression, these techniques reduce file sizes
|
||
by 10-50x compared to text formats while also enabling faster I/O.
|
||
|
||
## File Layout
|
||
|
||
The file consists of five sections:
|
||
|
||
```
|
||
+------------------+ Offset 0
|
||
| Header | 64 bytes (fixed)
|
||
+------------------+ Offset 64
|
||
| |
|
||
| Sample Data | Variable size (optionally compressed)
|
||
| |
|
||
+------------------+ string_table_offset
|
||
| String Table | Variable size
|
||
+------------------+ frame_table_offset
|
||
| Frame Table | Variable size
|
||
+------------------+ file_size - 32
|
||
| Footer | 32 bytes (fixed)
|
||
+------------------+ file_size
|
||
```
|
||
|
||
The layout is designed for streaming writes during profiling. The profiler
|
||
cannot know in advance how many unique strings or frames will be encountered,
|
||
so these tables must be built incrementally and written at the end.
|
||
|
||
The header comes first so readers can quickly validate the file and locate
|
||
the metadata tables. The sample data follows immediately, allowing the writer
|
||
to stream samples directly to disk (or through a compression stream) without
|
||
buffering the entire dataset in memory.
|
||
|
||
The string and frame tables are placed after sample data because they grow
|
||
as new unique entries are discovered during profiling. By deferring their
|
||
output until finalization, the writer avoids the complexity of reserving
|
||
space or rewriting portions of the file.
|
||
|
||
The footer at the end contains counts needed to allocate arrays before
|
||
parsing the tables. Placing it at a fixed offset from the end (rather than
|
||
at a variable offset recorded in the header) means readers can locate it
|
||
with a single seek to `file_size - 32`, without first reading the header.
|
||
|
||
## Header
|
||
|
||
```
|
||
Offset Size Type Description
|
||
+--------+------+---------+----------------------------------------+
|
||
| 0 | 4 | uint32 | Magic number (0x54414348 = "TACH") |
|
||
| 4 | 4 | uint32 | Format version |
|
||
| 8 | 4 | bytes | Python version (major, minor, micro, |
|
||
| | | | reserved) |
|
||
| 12 | 8 | uint64 | Start timestamp (microseconds) |
|
||
| 20 | 8 | uint64 | Sample interval (microseconds) |
|
||
| 28 | 4 | uint32 | Total sample count |
|
||
| 32 | 4 | uint32 | Thread count |
|
||
| 36 | 8 | uint64 | String table offset |
|
||
| 44 | 8 | uint64 | Frame table offset |
|
||
| 52 | 4 | uint32 | Compression type (0=none, 1=zstd) |
|
||
| 56 | 8 | bytes | Reserved (zero-filled) |
|
||
+--------+------+---------+----------------------------------------+
|
||
```
|
||
|
||
The magic number `0x54414348` ("TACH" for Tachyon) identifies the file format
|
||
and also serves as an **endianness marker**. When read on a system with
|
||
different byte order than the writer, it appears as `0x48434154`. The reader
|
||
uses this to detect cross-endian files and automatically byte-swap all
|
||
multi-byte integer fields.
|
||
|
||
The Python version field records the major, minor, and micro version numbers
|
||
of the Python interpreter that generated the file. This allows analysis tools
|
||
to detect version mismatches when replaying data collected on a different
|
||
Python version, which may have different internal structures or behaviors.
|
||
|
||
The header is written as zeros initially, then overwritten with actual values
|
||
during finalization. This requires the output stream to be seekable, which
|
||
is acceptable since the format targets regular files rather than pipes or
|
||
network streams.
|
||
|
||
## Sample Data
|
||
|
||
Sample data begins at offset 64 and extends to `string_table_offset`. Samples
|
||
use delta compression to minimize redundancy when consecutive samples from the
|
||
same thread have identical or similar call stacks.
|
||
|
||
### Stack Encoding Types
|
||
|
||
Each sample record begins with thread identification, then an encoding byte:
|
||
|
||
| Code | Name | Description |
|
||
|------|------|-------------|
|
||
| 0x00 | REPEAT | RLE: identical stack repeated N times |
|
||
| 0x01 | FULL | Complete stack (first sample or no match) |
|
||
| 0x02 | SUFFIX | Shares N frames from bottom of previous stack |
|
||
| 0x03 | POP_PUSH | Remove M frames from top, add N new frames |
|
||
|
||
### Record Formats
|
||
|
||
**REPEAT (0x00) - Run-Length Encoded Identical Stacks:**
|
||
```
|
||
+-----------------+-----------+----------------------------------------+
|
||
| thread_id | 8 bytes | Thread identifier (uint64, fixed) |
|
||
| interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) |
|
||
| encoding | 1 byte | 0x00 (REPEAT) |
|
||
| count | varint | Number of samples in this RLE group |
|
||
| samples | varies | Interleaved: [delta: varint, status: 1]|
|
||
| | | repeated count times |
|
||
+-----------------+-----------+----------------------------------------+
|
||
```
|
||
The stack is inherited from this thread's previous sample. Each sample in the
|
||
group gets its own timestamp delta and status byte, stored as interleaved pairs
|
||
(delta1, status1, delta2, status2, ...) rather than separate arrays.
|
||
|
||
**FULL (0x01) - Complete Stack:**
|
||
```
|
||
+-----------------+-----------+----------------------------------------+
|
||
| thread_id | 8 bytes | Thread identifier (uint64, fixed) |
|
||
| interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) |
|
||
| encoding | 1 byte | 0x01 (FULL) |
|
||
| timestamp_delta | varint | Microseconds since thread's last sample|
|
||
| status | 1 byte | Thread state flags |
|
||
| stack_depth | varint | Number of frames in call stack |
|
||
| frame_indices | varint[] | Array of frame table indices |
|
||
+-----------------+-----------+----------------------------------------+
|
||
```
|
||
Used for the first sample from a thread, or when delta encoding would not
|
||
provide savings.
|
||
|
||
**SUFFIX (0x02) - Shared Suffix Match:**
|
||
```
|
||
+-----------------+-----------+----------------------------------------+
|
||
| thread_id | 8 bytes | Thread identifier (uint64, fixed) |
|
||
| interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) |
|
||
| encoding | 1 byte | 0x02 (SUFFIX) |
|
||
| timestamp_delta | varint | Microseconds since thread's last sample|
|
||
| status | 1 byte | Thread state flags |
|
||
| shared_count | varint | Frames shared from bottom of prev stack|
|
||
| new_count | varint | New frames at top of stack |
|
||
| new_frames | varint[] | Array of new_count frame indices |
|
||
+-----------------+-----------+----------------------------------------+
|
||
```
|
||
Used when a function call added frames to the top of the stack. The shared
|
||
frames from the previous stack are kept, and new frames are prepended.
|
||
|
||
**POP_PUSH (0x03) - Pop and Push:**
|
||
```
|
||
+-----------------+-----------+----------------------------------------+
|
||
| thread_id | 8 bytes | Thread identifier (uint64, fixed) |
|
||
| interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) |
|
||
| encoding | 1 byte | 0x03 (POP_PUSH) |
|
||
| timestamp_delta | varint | Microseconds since thread's last sample|
|
||
| status | 1 byte | Thread state flags |
|
||
| pop_count | varint | Frames to remove from top of prev stack|
|
||
| push_count | varint | New frames to add at top |
|
||
| new_frames | varint[] | Array of push_count frame indices |
|
||
+-----------------+-----------+----------------------------------------+
|
||
```
|
||
Used when the code path changed: some frames were popped (function returns)
|
||
and new frames were pushed (different function calls).
|
||
|
||
### Thread and Interpreter Identification
|
||
|
||
Thread IDs are 64-bit values that can be large (memory addresses on some
|
||
platforms) and vary unpredictably. Using a fixed 8-byte encoding avoids
|
||
the overhead of varint encoding for large values and simplifies parsing
|
||
since the reader knows exactly where each field begins.
|
||
|
||
The interpreter ID identifies which Python sub-interpreter the thread
|
||
belongs to, allowing analysis tools to separate activity across interpreters
|
||
in processes using multiple sub-interpreters.
|
||
|
||
### Status Byte
|
||
|
||
The status byte is a bitfield encoding thread state at sample time:
|
||
|
||
| Bit | Flag | Meaning |
|
||
|-----|-----------------------|--------------------------------------------|
|
||
| 0 | THREAD_STATUS_HAS_GIL | Thread holds the GIL (Global Interpreter Lock) |
|
||
| 1 | THREAD_STATUS_ON_CPU | Thread is actively running on a CPU core |
|
||
| 2 | THREAD_STATUS_UNKNOWN | Thread state could not be determined |
|
||
| 3 | THREAD_STATUS_GIL_REQUESTED | Thread is waiting to acquire the GIL |
|
||
| 4 | THREAD_STATUS_HAS_EXCEPTION | Thread has a pending exception |
|
||
|
||
Multiple flags can be set simultaneously (e.g., a thread can hold the GIL
|
||
while also running on CPU). Analysis tools use these to filter samples or
|
||
visualize thread states over time.
|
||
|
||
### Timestamp Delta Encoding
|
||
|
||
Timestamps use delta encoding rather than absolute values. Absolute
|
||
timestamps in microseconds require 8 bytes each, but consecutive samples
|
||
from the same thread are typically separated by the sampling interval
|
||
(e.g., 1000 microseconds), so the delta between them is small and fits
|
||
in 1-2 varint bytes. The writer tracks the previous timestamp for each
|
||
thread separately. The first sample from a thread encodes its delta from
|
||
the profiling start time; subsequent samples encode the delta from that
|
||
thread's previous sample. This per-thread tracking is necessary because
|
||
samples are interleaved across threads in arrival order, not grouped by
|
||
thread.
|
||
|
||
For REPEAT (RLE) records, timestamp deltas and status bytes are stored as
|
||
interleaved pairs (delta, status, delta, status, ...) - one pair per
|
||
repeated sample - allowing efficient batching while preserving the exact
|
||
timing and state of each sample.
|
||
|
||
### Frame Indexing
|
||
|
||
Each frame in a call stack is represented by an index into the frame table
|
||
rather than inline data. This provides massive space savings because call
|
||
stacks are highly repetitive: the same function appears in many samples
|
||
(hot functions), call stacks often share common prefixes (main -> app ->
|
||
handler -> ...), and recursive functions create repeated frame sequences.
|
||
A frame index is typically 1-2 varint bytes. Inline frame data would be
|
||
20-200+ bytes (two strings plus a line number). For a profile with 100,000
|
||
samples averaging 30 frames each, this reduces frame data from potentially
|
||
gigabytes to tens of megabytes.
|
||
|
||
Frame indices are written innermost-first (the currently executing frame
|
||
has index 0 in the array). This ordering works well with delta compression:
|
||
function calls typically add frames at the top (index 0), while shared
|
||
frames remain at the bottom.
|
||
|
||
## String Table
|
||
|
||
The string table stores deduplicated UTF-8 strings (filenames and function
|
||
names). It begins at `string_table_offset` and contains entries in order of
|
||
their assignment during writing:
|
||
|
||
```
|
||
+----------------+
|
||
| length: varint |
|
||
| data: bytes |
|
||
+----------------+ (repeated for each string)
|
||
```
|
||
|
||
Strings are stored in the order they were first encountered during writing.
|
||
The first unique filename gets index 0, the second gets index 1, and so on.
|
||
Length-prefixing (rather than null-termination) allows strings containing
|
||
null bytes and enables readers to allocate exact-sized buffers. The varint
|
||
length encoding means short strings (under 128 bytes) need only one length
|
||
byte.
|
||
|
||
## Frame Table
|
||
|
||
The frame table stores deduplicated frame entries with full source position
|
||
information and bytecode opcode:
|
||
|
||
```
|
||
+----------------------------+
|
||
| filename_idx: varint |
|
||
| funcname_idx: varint |
|
||
| lineno: svarint |
|
||
| end_lineno_delta: svarint |
|
||
| column: svarint |
|
||
| end_column_delta: svarint |
|
||
| opcode: u8 |
|
||
+----------------------------+ (repeated for each frame)
|
||
```
|
||
|
||
### Field Definitions
|
||
|
||
| Field | Type | Description |
|
||
|------------------|---------------|----------------------------------------------------------|
|
||
| filename_idx | varint | Index into string table for file name |
|
||
| funcname_idx | varint | Index into string table for function name |
|
||
| lineno | zigzag varint | Start line number (-1 for synthetic frames) |
|
||
| end_lineno_delta | zigzag varint | Delta from lineno (end_lineno = lineno + delta) |
|
||
| column | zigzag varint | Start column offset in UTF-8 bytes (-1 if not available) |
|
||
| end_column_delta | zigzag varint | Delta from column (end_column = column + delta) |
|
||
| opcode | u8 | Python bytecode opcode (0-254) or 255 for None |
|
||
|
||
### Delta Encoding
|
||
|
||
Position end values use delta encoding for efficiency:
|
||
|
||
- `end_lineno = lineno + end_lineno_delta`
|
||
- `end_column = column + end_column_delta`
|
||
|
||
Typical values:
|
||
- `end_lineno_delta`: Usually 0 (single-line expressions) → encodes to 1 byte
|
||
- `end_column_delta`: Usually 5-20 (expression width) → encodes to 1 byte
|
||
|
||
This saves ~1-2 bytes per frame compared to absolute encoding. When the base
|
||
value (lineno or column) is -1 (not available), the delta is stored as 0 and
|
||
the reconstructed value is -1.
|
||
|
||
### Sentinel Values
|
||
|
||
- `opcode = 255`: No opcode captured
|
||
- `lineno = -1`: Synthetic frame (no source location)
|
||
- `column = -1`: Column offset not available
|
||
|
||
### Deduplication
|
||
|
||
Each unique (filename, funcname, lineno, end_lineno, column, end_column,
|
||
opcode) combination gets one entry. This enables instruction-level profiling
|
||
where multiple bytecode instructions on the same line can be distinguished.
|
||
|
||
Strings and frames are deduplicated separately because they have different
|
||
cardinalities and reference patterns. A codebase might have hundreds of
|
||
unique source files but thousands of unique functions. Many functions share
|
||
the same filename, so storing the filename index in each frame entry (rather
|
||
than the full string) provides an additional layer of deduplication. A frame
|
||
entry is typically 7-9 bytes rather than two full strings plus location data.
|
||
|
||
### Size Analysis
|
||
|
||
Typical frame size with delta encoding:
|
||
- file_idx: 1-2 bytes
|
||
- func_idx: 1-2 bytes
|
||
- lineno: 1-2 bytes
|
||
- end_lineno_delta: 1 byte (usually 0)
|
||
- column: 1 byte (usually < 64)
|
||
- end_column_delta: 1 byte (usually < 64)
|
||
- opcode: 1 byte
|
||
|
||
**Total: ~7-9 bytes per frame**
|
||
|
||
Line numbers and columns use signed varint (zigzag encoding) to handle
|
||
sentinel values efficiently. Synthetic frames—generated frames that don't
|
||
correspond directly to Python source code, such as C extension boundaries or
|
||
internal interpreter frames—use -1 to indicate the absence of a source
|
||
location. Zigzag encoding ensures these small negative values encode
|
||
efficiently (−1 becomes 1, which is one byte) rather than requiring the
|
||
maximum varint length.
|
||
|
||
## Footer
|
||
|
||
```
|
||
Offset Size Type Description
|
||
+--------+------+---------+----------------------------------------+
|
||
| 0 | 4 | uint32 | String count |
|
||
| 4 | 4 | uint32 | Frame count |
|
||
| 8 | 8 | uint64 | Total file size |
|
||
| 16 | 16 | bytes | Checksum (reserved, currently zeros) |
|
||
+--------+------+---------+----------------------------------------+
|
||
```
|
||
|
||
The string and frame counts allow readers to pre-allocate arrays of the
|
||
correct size before parsing the tables. Without these counts, readers would
|
||
need to either scan the tables twice (once to count, once to parse) or use
|
||
dynamically-growing arrays.
|
||
|
||
The file size field provides a consistency check: if the actual file size
|
||
does not match, the file may be truncated or corrupted.
|
||
|
||
The checksum field is reserved for future use. A checksum would allow
|
||
detection of corruption but adds complexity and computation cost. The
|
||
current implementation leaves this as zeros.
|
||
|
||
## Variable-Length Integer Encoding
|
||
|
||
The format uses LEB128 (Little Endian Base 128) for unsigned integers and
|
||
zigzag + LEB128 for signed integers. These encodings are widely used
|
||
(Protocol Buffers, DWARF debug info, WebAssembly) and well-understood.
|
||
|
||
### Unsigned Varint (LEB128)
|
||
|
||
Each byte stores 7 bits of data. The high bit indicates whether more bytes
|
||
follow:
|
||
|
||
```
|
||
Value Encoded bytes
|
||
0-127 [0xxxxxxx] (1 byte)
|
||
128-16383 [1xxxxxxx] [0xxxxxxx] (2 bytes)
|
||
16384+ [1xxxxxxx] [1xxxxxxx] ... (3+ bytes)
|
||
```
|
||
|
||
Most indices in profiling data are small. A profile with 1000 unique frames
|
||
needs at most 2 bytes per frame index. The common case (indices under 128)
|
||
needs only 1 byte.
|
||
|
||
### Signed Varint (Zigzag)
|
||
|
||
Standard LEB128 encodes −1 as a very large unsigned value, requiring many
|
||
bytes. Zigzag encoding interleaves positive and negative values:
|
||
|
||
```
|
||
0 -> 0 -1 -> 1 1 -> 2 -2 -> 3 2 -> 4
|
||
```
|
||
|
||
This ensures small-magnitude values (whether positive or negative) encode
|
||
in few bytes.
|
||
|
||
## Compression
|
||
|
||
When compression is enabled, the sample data region contains a zstd stream.
|
||
The string table, frame table, and footer remain uncompressed so readers can
|
||
access metadata without decompressing the entire file. A tool that only needs
|
||
to report "this file contains 50,000 samples of 3 threads" can read the header
|
||
and footer without touching the compressed sample data. This also simplifies
|
||
the format: the header's offset fields point directly to the tables rather
|
||
than to positions within a decompressed stream.
|
||
|
||
Zstd provides an excellent balance of compression ratio and speed. Profiling
|
||
data compresses very well (often 5-10x) due to repetitive patterns: the same
|
||
small set of frame indices appears repeatedly, and delta-encoded timestamps
|
||
cluster around the sampling interval. Zstd's streaming API allows compression
|
||
without buffering the entire dataset. The writer feeds sample data through
|
||
the compressor incrementally, flushing compressed chunks to disk as they
|
||
become available.
|
||
|
||
Level 5 compression is used as a default. Lower levels (1-3) are faster but
|
||
compress less; higher levels (6+) compress more but slow down writing. Level
|
||
5 provides good compression with minimal impact on profiling overhead.
|
||
|
||
## Reading and Writing
|
||
|
||
### Writing
|
||
|
||
1. Open the output file and write 64 zero bytes as a placeholder header
|
||
2. Initialize empty string and frame dictionaries for deduplication
|
||
3. For each sample:
|
||
- Intern any new strings, assigning sequential indices
|
||
- Intern any new frames, assigning sequential indices
|
||
- Encode the sample record and write to the buffer
|
||
- Flush the buffer through compression (if enabled) when full
|
||
4. Flush remaining buffered data and finalize compression
|
||
5. Write the string table (length-prefixed strings in index order)
|
||
6. Write the frame table (varint-encoded entries in index order)
|
||
7. Write the footer with final counts
|
||
8. Seek to offset 0 and write the header with actual values
|
||
|
||
The writer maintains two dictionaries: one mapping strings to indices, one
|
||
mapping (filename_idx, funcname_idx, lineno) tuples to frame indices. These
|
||
enable O(1) lookup during interning.
|
||
|
||
### Reading
|
||
|
||
1. Read the header magic number to detect endianness (set `needs_swap` flag
|
||
if the magic appears byte-swapped)
|
||
2. Validate version and read remaining header fields (byte-swapping if needed)
|
||
3. Seek to end − 32 and read the footer (byte-swapping counts if needed)
|
||
4. Allocate string array of `string_count` elements
|
||
5. Parse the string table, populating the array
|
||
6. Allocate frame array of `frame_count * 3` uint32 elements
|
||
7. Parse the frame table, populating the array
|
||
8. If compressed, decompress the sample data region
|
||
9. Iterate through samples, resolving indices to strings/frames
|
||
(byte-swapping thread_id and interpreter_id if needed)
|
||
|
||
The reader builds lookup arrays rather than dictionaries since it only needs
|
||
index-to-value mapping, not value-to-index.
|
||
|
||
## Platform Considerations
|
||
|
||
### Byte Ordering and Cross-Platform Portability
|
||
|
||
The binary format uses **native byte order** for all multi-byte integer
|
||
fields when writing. However, the reader supports **cross-endian reading**:
|
||
files written on a little-endian system (x86, ARM) can be read on a
|
||
big-endian system (s390x, PowerPC), and vice versa.
|
||
|
||
The magic number doubles as an endianness marker. When read on a system with
|
||
different byte order, it appears byte-swapped (`0x48434154` instead of
|
||
`0x54414348`). The reader detects this and automatically byte-swaps all
|
||
fixed-width integer fields during parsing.
|
||
|
||
Writers must use `memcpy()` from properly-sized integer types when writing
|
||
fixed-width integer fields. When the source variable's type differs from the
|
||
field width (e.g., `size_t` written as 4 bytes), explicit casting to the
|
||
correct type (e.g., `uint32_t`) is required before `memcpy()`. On big-endian
|
||
systems, copying from an oversized type would copy the wrong bytes—high-order
|
||
zeros instead of the actual value.
|
||
|
||
The reader tracks whether byte-swapping is needed via a `needs_swap` flag set
|
||
during header parsing. All fixed-width fields in the header, footer, and
|
||
sample data are conditionally byte-swapped using Python's internal byte-swap
|
||
functions (`_Py_bswap32`, `_Py_bswap64` from `pycore_bitutils.h`).
|
||
|
||
Variable-length integers (varints) are byte-order independent since they
|
||
encode values one byte at a time using the LEB128 scheme, so they require
|
||
no special handling for cross-endian reading.
|
||
|
||
### Memory-Mapped I/O
|
||
|
||
On Unix systems (Linux, macOS), the reader uses `mmap()` to map the file
|
||
into the process address space. The kernel handles paging data in and out
|
||
as needed, no explicit read() calls or buffer management are required,
|
||
multiple readers can share the same physical pages, and sequential access
|
||
patterns benefit from kernel read-ahead.
|
||
|
||
The implementation uses `madvise()` to hint the access pattern to the kernel:
|
||
`MADV_SEQUENTIAL` indicates the file will be read linearly, enabling
|
||
aggressive read-ahead. `MADV_WILLNEED` requests pre-faulting of pages.
|
||
On Linux, `MAP_POPULATE` pre-faults all pages at mmap time rather than on
|
||
first access, moving page fault overhead from the parsing loop to the
|
||
initial mapping for more predictable performance. For large files (over
|
||
32 MB), `MADV_HUGEPAGE` requests transparent huge pages (2 MB instead of
|
||
4 KB) to reduce TLB pressure when accessing large amounts of data.
|
||
|
||
On Windows, the implementation falls back to standard file I/O with full
|
||
file buffering. Profiling data files are typically small enough (tens to
|
||
hundreds of megabytes) that this is acceptable.
|
||
|
||
The writer uses a 512 KB buffer to batch small writes. Each sample record
|
||
is typically tens of bytes; writing these individually would incur excessive
|
||
syscall overhead. The buffer accumulates data until full, then flushes in
|
||
one write() call (or feeds through the compression stream).
|
||
|
||
## Future Considerations
|
||
|
||
The format reserves space for future extensions. The 12 reserved bytes in
|
||
the header could hold additional metadata. The 16-byte checksum field in
|
||
the footer is currently unused. The version field allows incompatible
|
||
changes with graceful rejection. New compression types could be added
|
||
(compression_type > 1).
|
||
|
||
Any changes that alter the meaning of existing fields or the parsing logic
|
||
should increment the version number to prevent older readers from
|
||
misinterpreting new files.
|