diff --git a/InternalDocs/README.md b/InternalDocs/README.md index f6aa3db3b38..8cdd06d189f 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -1,4 +1,3 @@ - # CPython Internals Documentation The documentation in this folder is intended for CPython maintainers. diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md index 4ae9e85b387..7cfa8e52310 100644 --- a/InternalDocs/adaptive.md +++ b/InternalDocs/adaptive.md @@ -96,6 +96,7 @@ ### Choice of specializations Specialized instructions must be fast. In order to be fast, specialized instructions should be tailored for a particular set of values that allows them to: + 1. Verify that incoming value is part of that set with low overhead. 2. Perform the operation quickly. @@ -107,9 +108,11 @@ ### Choice of specializations dictionaries that have a keys with the expected version. This can be tested quickly: + * `globals->keys->dk_version == expected_version` and the operation can be performed quickly: + * `value = entries[cache->index].me_value;`. Because it is impossible to measure the performance of an instruction without @@ -122,10 +125,11 @@ ### Choice of specializations ### Implementation of specialized instructions In general, specialized instructions should be implemented in two parts: + 1. A sequence of guards, each of the form - `DEOPT_IF(guard-condition-is-false, BASE_NAME)`. + `DEOPT_IF(guard-condition-is-false, BASE_NAME)`. 2. The operation, which should ideally have no branches and - a minimum number of dependent memory accesses. + a minimum number of dependent memory accesses. In practice, the parts may overlap, as data required for guards can be re-used in the operation. diff --git a/InternalDocs/changing_grammar.md b/InternalDocs/changing_grammar.md index 1a5eebdc141..c6b895135a3 100644 --- a/InternalDocs/changing_grammar.md +++ b/InternalDocs/changing_grammar.md @@ -32,7 +32,7 @@ ## Checklist [`Include/internal/pycore_ast.h`](../Include/internal/pycore_ast.h) and [`Python/Python-ast.c`](../Python/Python-ast.c). -* [`Parser/lexer/`](../Parser/lexer/) contains the tokenization code. +* [`Parser/lexer/`](../Parser/lexer) contains the tokenization code. This is where you would add a new type of comment or string literal, for example. * [`Python/ast.c`](../Python/ast.c) will need changes to validate AST objects @@ -60,4 +60,4 @@ ## Checklist to the tokenizer. * Documentation must be written! Specifically, one or more of the pages in - [`Doc/reference/`](../Doc/reference/) will need to be updated. + [`Doc/reference/`](../Doc/reference) will need to be updated. diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md index ed4cfb23ca5..9e99f348acb 100644 --- a/InternalDocs/compiler.md +++ b/InternalDocs/compiler.md @@ -1,4 +1,3 @@ - Compiler design =============== @@ -7,8 +6,8 @@ In CPython, the compilation from source code to bytecode involves several steps: -1. Tokenize the source code [Parser/lexer/](../Parser/lexer/) - and [Parser/tokenizer/](../Parser/tokenizer/). +1. Tokenize the source code [Parser/lexer/](../Parser/lexer) + and [Parser/tokenizer/](../Parser/tokenizer). 2. Parse the stream of tokens into an Abstract Syntax Tree [Parser/parser.c](../Parser/parser.c). 3. Transform AST into an instruction sequence @@ -134,9 +133,8 @@ `FunctionDef()` constructor function sets 'kind' to `FunctionDef_kind` and initializes the *name*, *args*, *body*, and *attributes* fields. 
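+
+For a quick, concrete look at these fields from Python (a minimal, illustrative
+sketch using the public `ast` module; the function name is a throwaway example):
+
+```pycon
+>>> import ast
+>>> mod = ast.parse("def greet(name): return name")
+>>> fn = mod.body[0]            # the FunctionDef node for `greet`
+>>> type(fn).__name__
+'FunctionDef'
+>>> fn.name
+'greet'
+>>> [a.arg for a in fn.args.args]
+['name']
+>>> type(fn.body[0]).__name__
+'Return'
+```
+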
-See also -[Green Tree Snakes - The missing Python AST docs](https://greentreesnakes.readthedocs.io/en/latest) - by Thomas Kluyver. +See also [Green Tree Snakes - The missing Python AST docs]( +https://greentreesnakes.readthedocs.io/en/latest) by Thomas Kluyver. Memory management ================= @@ -260,12 +258,12 @@ [Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h). Functions and macros for creating `asdl_xx_seq *` types are as follows: -`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)` - Allocate memory for an `asdl_generic_seq` of the specified length -`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)` - Allocate memory for an `asdl_identifier_seq` of the specified length -`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)` - Allocate memory for an `asdl_int_seq` of the specified length +* `_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`: + Allocate memory for an `asdl_generic_seq` of the specified length +* `_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`: + Allocate memory for an `asdl_identifier_seq` of the specified length +* `_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`: + Allocate memory for an `asdl_int_seq` of the specified length In addition to the three types mentioned above, some ASDL sequence types are automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py) and found in @@ -273,20 +271,20 @@ Macros for using both manually defined and automatically generated ASDL sequence types are as follows: -`asdl_seq_GET(asdl_xx_seq *, int)` - Get item held at a specific position in an `asdl_xx_seq` -`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)` - Set a specific index in an `asdl_xx_seq` to the specified value +* `asdl_seq_GET(asdl_xx_seq *, int)`: + Get item held at a specific position in an `asdl_xx_seq` +* `asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`: + Set a specific index in an `asdl_xx_seq` to the specified value -Untyped counterparts exist for some of the typed macros. These are useful +Untyped counterparts exist for some of the typed macros. These are useful when a function needs to manipulate a generic ASDL sequence: -`asdl_seq_GET_UNTYPED(asdl_seq *, int)` - Get item held at a specific position in an `asdl_seq` -`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)` - Set a specific index in an `asdl_seq` to the specified value -`asdl_seq_LEN(asdl_seq *)` - Return the length of an `asdl_seq` or `asdl_xx_seq` +* `asdl_seq_GET_UNTYPED(asdl_seq *, int)`: + Get item held at a specific position in an `asdl_seq` +* `asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`: + Set a specific index in an `asdl_seq` to the specified value +* `asdl_seq_LEN(asdl_seq *)`: + Return the length of an `asdl_seq` or `asdl_xx_seq` Note that typed macros and functions are recommended over their untyped counterparts. 
Typed macros carry out checks in debug mode and aid @@ -379,33 +377,33 @@ Emission of bytecode is handled by the following macros: -* `ADDOP(struct compiler *, location, int)` - add a specified opcode -* `ADDOP_IN_SCOPE(struct compiler *, location, int)` - like `ADDOP`, but also exits current scope; used for adding return value - opcodes in lambdas and closures -* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)` - add an opcode that takes an integer argument -* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)` - add an opcode with the proper argument based on the position of the - specified PyObject in PyObject sequence object, but with no handling of - mangled names; used for when you - need to do named lookups of objects such as globals, consts, or - parameters where name mangling is not possible and the scope of the - name is known; *TYPE* is the name of PyObject sequence - (`names` or `varnames`) -* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)` - just like `ADDOP_O`, but steals a reference to PyObject -* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)` - just like `ADDOP_O`, but name mangling is also handled; used for - attribute loading or importing based on name -* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)` - add the `LOAD_CONST` opcode with the proper argument based on the - position of the specified PyObject in the consts table. -* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)` - just like `ADDOP_LOAD_CONST_NEW`, but steals a reference to PyObject -* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)` - create a jump to a basic block +* `ADDOP(struct compiler *, location, int)`: + add a specified opcode +* `ADDOP_IN_SCOPE(struct compiler *, location, int)`: + like `ADDOP`, but also exits current scope; used for adding return value + opcodes in lambdas and closures +* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`: + add an opcode that takes an integer argument +* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`: + add an opcode with the proper argument based on the position of the + specified PyObject in PyObject sequence object, but with no handling of + mangled names; used for when you + need to do named lookups of objects such as globals, consts, or + parameters where name mangling is not possible and the scope of the + name is known; *TYPE* is the name of PyObject sequence + (`names` or `varnames`) +* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`: + just like `ADDOP_O`, but steals a reference to PyObject +* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`: + just like `ADDOP_O`, but name mangling is also handled; used for + attribute loading or importing based on name +* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`: + add the `LOAD_CONST` opcode with the proper argument based on the + position of the specified PyObject in the consts table. +* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`: + just like `ADDOP_LOAD_CONST_NEW`, but steals a reference to PyObject +* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`: + create a jump to a basic block The `location` argument is a struct with the source location to be associated with this instruction. It is typically extracted from an @@ -433,7 +431,7 @@ bytecode. 
This includes transforming pseudo instructions into actual instructions, converting jump targets from logical labels to relative offsets, and construction of the [exception table](exception_handling.md) and -[locations table](locations.md). +[locations table](code_objects.md#source-code-locations). The bytecode and tables are then wrapped into a `PyCodeObject` along with additional metadata, including the `consts` and `names` arrays, information about function reference to the source code (filename, etc). All of this is implemented by @@ -453,7 +451,7 @@ Important files =============== -* [Parser/](../Parser/) +* [Parser/](../Parser) * [Parser/Python.asdl](../Parser/Python.asdl): ASDL syntax file. @@ -534,7 +532,7 @@ * [Python/instruction_sequence.c](../Python/instruction_sequence.c): A data structure representing a sequence of bytecode-like pseudo-instructions. -* [Include/](../Include/) +* [Include/](../Include) * [Include/cpython/code.h](../Include/cpython/code.h) : Header file for [Objects/codeobject.c](../Objects/codeobject.c); @@ -556,7 +554,7 @@ : Declares `_PyAST_Validate()` external (from [Python/ast.c](../Python/ast.c)). * [Include/internal/pycore_symtable.h](../Include/internal/pycore_symtable.h) - : Header for [Python/symtable.c](../Python/symtable.c). + : Header for [Python/symtable.c](../Python/symtable.c). `struct symtable` and `PySTEntryObject` are defined here. * [Include/internal/pycore_parser.h](../Include/internal/pycore_parser.h) @@ -570,7 +568,7 @@ by [Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py). -* [Objects/](../Objects/) +* [Objects/](../Objects) * [Objects/codeobject.c](../Objects/codeobject.c) : Contains PyCodeObject-related code. @@ -579,7 +577,7 @@ : Contains the `frame_setlineno()` function which should determine whether it is allowed to make a jump between two points in a bytecode. -* [Lib/](../Lib/) +* [Lib/](../Lib) * [Lib/opcode.py](../Lib/opcode.py) : opcode utilities exposed to Python. @@ -591,7 +589,7 @@ Objects ======= -* [Locations](locations.md): Describes the location table +* [Locations](code_objects.md#source-code-locations): Describes the location table * [Frames](frames.md): Describes frames and the frame stack * [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later * [Exception Handling](exception_handling.md): Describes the exception table diff --git a/InternalDocs/exception_handling.md b/InternalDocs/exception_handling.md index 14066a5864b..28589787e1f 100644 --- a/InternalDocs/exception_handling.md +++ b/InternalDocs/exception_handling.md @@ -87,10 +87,10 @@ Handling an exception, once an exception table entry is found, consists of the following steps: - 1. pop values from the stack until it matches the stack depth for the handler. - 2. if `lasti` is true, then push the offset that the exception was raised at. - 3. push the exception to the stack. - 4. jump to the target offset and resume execution. +1. pop values from the stack until it matches the stack depth for the handler. +2. if `lasti` is true, then push the offset that the exception was raised at. +3. push the exception to the stack. +4. jump to the target offset and resume execution. Reraising Exceptions and `lasti` @@ -107,13 +107,12 @@ ----------------------------- Conceptually, the exception table consists of a sequence of 5-tuples: -``` - 1. `start-offset` (inclusive) - 2. `end-offset` (exclusive) - 3. `target` - 4. `stack-depth` - 5. `push-lasti` (boolean) -``` + +1. 
`start-offset` (inclusive) +2. `end-offset` (exclusive) +3. `target` +4. `stack-depth` +5. `push-lasti` (boolean) All offsets and lengths are in code units, not bytes. @@ -123,18 +122,19 @@ Binary search typically assumes fixed size entries, but that is not necessary, as long as we can identify the start of an entry. It is worth noting that the size (end-start) is always smaller than the end, so we encode the entries as: - `start, size, target, depth, push-lasti`. +`start, size, target, depth, push-lasti`. Also, sizes are limited to 2**30 as the code length cannot exceed 2**31 and each code unit takes 2 bytes. It also happens that depth is generally quite small. So, we need to encode: + ``` - `start` (up to 30 bits) - `size` (up to 30 bits) - `target` (up to 30 bits) - `depth` (up to ~8 bits) - `lasti` (1 bit) +start (up to 30 bits) +size (up to 30 bits) +target (up to 30 bits) +depth (up to ~8 bits) +lasti (1 bit) ``` We need a marker for the start of the entry, so the first byte of entry will have the most significant bit set. @@ -145,29 +145,32 @@ In addition, we combine `depth` and `lasti` into a single value, `((depth<<1)+lasti)`, before encoding. For example, the exception entry: + ``` - `start`: 20 - `end`: 28 - `target`: 100 - `depth`: 3 - `lasti`: False +start: 20 +end: 28 +target: 100 +depth: 3 +lasti: False ``` is encoded by first converting to the more compact four value form: + ``` - `start`: 20 - `size`: 8 - `target`: 100 - `depth<<1+lasti`: 6 +start: 20 +size: 8 +target: 100 +depth<<1+lasti: 6 ``` which is then encoded as: + ``` - 148 (MSB + 20 for start) - 8 (size) - 65 (Extend bit + 1) - 36 (Remainder of target, 100 == (1<<6)+36) - 6 +148 (MSB + 20 for start) +8 (size) +65 (Extend bit + 1) +36 (Remainder of target, 100 == (1<<6)+36) +6 ``` for a total of five bytes. diff --git a/InternalDocs/frames.md b/InternalDocs/frames.md index 06dc8f0702c..2598873ca98 100644 --- a/InternalDocs/frames.md +++ b/InternalDocs/frames.md @@ -27,6 +27,7 @@ # Allocation ## Layout Each activation record is laid out as: + * Specials * Locals * Stack diff --git a/InternalDocs/garbage_collector.md b/InternalDocs/garbage_collector.md index 272a0834cbf..9e01a5864e3 100644 --- a/InternalDocs/garbage_collector.md +++ b/InternalDocs/garbage_collector.md @@ -1,4 +1,3 @@ - Garbage collector design ======================== @@ -117,7 +116,7 @@ doubly linked list. Between collections, objects are partitioned into "generations", reflecting how often they've survived collection attempts. During collections, the generation(s) being collected are further partitioned into, for example, sets of reachable and unreachable objects. Doubly linked lists -support moving an object from one partition to another, adding a new object, removing an object +support moving an object from one partition to another, adding a new object, removing an object entirely (objects tracked by GC are most often reclaimed by the refcounting system when GC isn't running at all!), and merging partitions, all with a small constant number of pointer updates. 
With care, they also support iterating over a partition while objects are being added to - and diff --git a/InternalDocs/generators.md b/InternalDocs/generators.md index d53f0f9bdff..afa8b8f4bb8 100644 --- a/InternalDocs/generators.md +++ b/InternalDocs/generators.md @@ -1,4 +1,3 @@ - Generators ========== diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index 4c10cbbed37..ab149e43471 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -1,4 +1,3 @@ - The bytecode interpreter ======================== diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 348988b7c2f..445b866fc0c 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -1,4 +1,3 @@ - Guide to the parser =================== @@ -444,15 +443,15 @@ Once you have made the changes to the grammar files, to regenerate the `C` parser (the one used by the interpreter) just execute: -``` - make regen-pegen +```shell +$ make regen-pegen ``` using the `Makefile` in the main directory. If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: -``` - ./PCbuild/build.bat --regen +```dos +PCbuild/build.bat --regen ``` The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c). @@ -468,15 +467,15 @@ need to regenerate the meta-parser (the parser that parses the grammar files). To do so just execute: -``` - make regen-pegen-metaparser +```shell +$ make regen-pegen-metaparser ``` If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: -``` - ./PCbuild/build.bat --regen +```dos +PCbuild/build.bat --regen ``` @@ -516,15 +515,15 @@ file. If you change this file to add new tokens, make sure to regenerate the files by executing: -``` - make regen-token +```shell +$ make regen-token ``` If you are on Windows you can use the Visual Studio project files to regenerate the tokens or to execute: -``` - ./PCbuild/build.bat --regen +```dos +PCbuild/build.bat --regen ``` How tokens are generated and the rules governing this are completely up to the tokenizer @@ -546,8 +545,8 @@ name (and type, if present): ``` - rule_name[typr] (memo): - ... +rule_name[typr] (memo): + ... ``` By selectively turning on memoization for a handful of rules, the parser becomes @@ -593,25 +592,25 @@ meaning in context. 
Trying to use a hard keyword as a variable will always fail: -``` - >>> class = 3 - File "", line 1 - class = 3 - ^ - SyntaxError: invalid syntax - >>> foo(class=3) - File "", line 1 - foo(class=3) - ^^^^^ - SyntaxError: invalid syntax +```pycon +>>> class = 3 +File "", line 1 + class = 3 + ^ +SyntaxError: invalid syntax +>>> foo(class=3) +File "", line 1 + foo(class=3) + ^^^^^ +SyntaxError: invalid syntax ``` While soft keywords don't have this limitation if used in a context other the one where they are defined as keywords: -``` - >>> match = 45 - >>> foo(match="Yeah!") +```pycon +>>> match = 45 +>>> foo(match="Yeah!") ``` The `match` and `case` keywords are soft keywords, so that they are @@ -621,21 +620,21 @@ You can get a list of all keywords defined in the grammar from Python: -``` - >>> import keyword - >>> keyword.kwlist - ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', - 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', - 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', - 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield'] +```pycon +>>> import keyword +>>> keyword.kwlist +['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', +'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', +'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', +'pass', 'raise', 'return', 'try', 'while', 'with', 'yield'] ``` as well as soft keywords: -``` - >>> import keyword - >>> keyword.softkwlist - ['_', 'case', 'match'] +```pycon +>>> import keyword +>>> keyword.softkwlist +['_', 'case', 'match'] ``` > [!CAUTION] @@ -736,7 +735,7 @@ > rule or not. For example: ``` - $ 42 + $ 42 ``` should trigger the syntax error in the `$` character. If your rule is not correctly defined this @@ -744,7 +743,7 @@ `print` statements in order to create a better error message and you define it as: ``` - invalid_print: "print" expression +invalid_print: "print" expression ``` This will **seem** to work because the parser will correctly parse `print(something)` because it is valid @@ -756,7 +755,7 @@ Generating AST objects ---------------------- -The output of the C parser used by CPython, which is generated from the +The output of the C parser used by CPython, which is generated from the [grammar file](../Grammar/python.gram), is a Python AST object (using C structures). This means that the actions in the grammar file generate AST objects when they succeed. Constructing these objects can be quite cumbersome @@ -798,7 +797,7 @@ tests, depending on the nature of the new feature you are adding. Tests for the parser generator itself can be found in the -[test_peg_generator](../Lib/test_peg_generator) directory. +[test_peg_generator](../Lib/test/test_peg_generator) directory. Debugging generated parsers @@ -816,15 +815,15 @@ parser. 
To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator) directory on the CPython repository and manually call the parser generator by executing: -``` - $ python -m pegen python +```shell +$ python -m pegen python ``` This will generate a file called `parse.py` in the same directory that you can use to parse some input: -``` - $ python parse.py file_with_source_code_to_test.py +```shell +$ python parse.py file_with_source_code_to_test.py ``` As the generated `parse.py` file is just Python code, you can modify it @@ -848,8 +847,8 @@ To activate verbose mode you can add the `-d` flag when executing Python: -``` - $ python -d file_to_test.py +```shell +$ python -d file_to_test.py ``` This will print **a lot** of output to `stderr` so it is probably better to dump @@ -857,7 +856,7 @@ following structure:: ``` - ('>'|'-'|'+'|'!') []: ... + ('>'|'-'|'+'|'!') []: ... ``` Every line is indented by a different amount (``) depending on how diff --git a/InternalDocs/string_interning.md b/InternalDocs/string_interning.md index e0d20632516..26a5197c6e7 100644 --- a/InternalDocs/string_interning.md +++ b/InternalDocs/string_interning.md @@ -2,6 +2,7 @@ # String interning *Interned* strings are conceptually part of an interpreter-global *set* of interned strings, meaning that: + - no two interned strings have the same content (across an interpreter); - two interned strings can be safely compared using pointer equality (Python `is`). @@ -61,6 +62,7 @@ ## Immortality and reference counting The converse is not true: interned strings can be mortal. For mortal interned strings: + - the 2 references from the interned dict (key & value) are excluded from their refcount - the deallocator (`unicode_dealloc`) removes the string from the interned dict @@ -90,6 +92,7 @@ ## Internal API The functions take ownership of (“steal”) the reference to their argument, and update the argument with a *new* reference. This means: + - They're “reference neutral”. - They must not be called with a borrowed reference.
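+
+As a user-level illustration of the pointer-equality property described at the
+top of this file (a minimal sketch using only the public `sys.intern` API; the
+sample strings are arbitrary):
+
+```pycon
+>>> import sys
+>>> a = sys.intern("user text 1234")
+>>> parts = ["user ", "text ", "1234"]
+>>> b = sys.intern("".join(parts))   # equal content, interned again
+>>> b is a                           # interning collapses equal strings to one object
+True
+>>> "".join(parts) is a              # an equal but non-interned copy is a distinct object
+False
+```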