gh-119786: cleanup internal docs and fix internal links (#127485)

Bénédikt Tran 2024-12-01 18:12:22 +01:00 committed by GitHub
parent 1bc4f076d1
commit 04673d2f14
11 changed files with 152 additions and 148 deletions

@ -1,4 +1,3 @@
# CPython Internals Documentation
The documentation in this folder is intended for CPython maintainers.

@ -96,6 +96,7 @@ ### Choice of specializations
Specialized instructions must be fast. To be fast,
specialized instructions should be tailored to a particular
set of values that allows them to:
1. Verify that the incoming value is part of that set with low overhead.
2. Perform the operation quickly.
@ -107,9 +108,11 @@ ### Choice of specializations
dictionaries whose keys object has the expected version.
This can be tested quickly:
* `globals->keys->dk_version == expected_version`
and the operation can be performed quickly:
* `value = entries[cache->index].me_value;`.
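As a rough illustration (not part of the original text), the result of this kind of specialization can be observed from Python with `dis`; the exact specialized instruction names (here `LOAD_GLOBAL_MODULE` is assumed) vary between CPython versions:

```python
import dis

CONSTANT = 42

def read_global():
    return CONSTANT

# Run the function enough times for the adaptive interpreter to specialize it.
for _ in range(1000):
    read_global()

# With adaptive=True, dis shows the quickened bytecode; on recent CPython
# versions the LOAD_GLOBAL typically appears as LOAD_GLOBAL_MODULE once the
# dict-keys version guard described above has been installed.
dis.dis(read_global, adaptive=True)
```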
Because it is impossible to measure the performance of an instruction without
@ -122,6 +125,7 @@ ### Choice of specializations
### Implementation of specialized instructions
In general, specialized instructions should be implemented in two parts:
1. A sequence of guards, each of the form
`DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
2. The operation, which should ideally have no branches and

@ -32,7 +32,7 @@ ## Checklist
[`Include/internal/pycore_ast.h`](../Include/internal/pycore_ast.h) and
[`Python/Python-ast.c`](../Python/Python-ast.c).
* [`Parser/lexer/`](../Parser/lexer/) contains the tokenization code.
* [`Parser/lexer/`](../Parser/lexer) contains the tokenization code.
This is where you would add a new type of comment or string literal, for example.
* [`Python/ast.c`](../Python/ast.c) will need changes to validate AST objects
@ -60,4 +60,4 @@ ## Checklist
to the tokenizer.
* Documentation must be written! Specifically, one or more of the pages in
[`Doc/reference/`](../Doc/reference/) will need to be updated.
[`Doc/reference/`](../Doc/reference) will need to be updated.

@ -1,4 +1,3 @@
Compiler design
===============
@ -7,8 +6,8 @@
In CPython, the compilation from source code to bytecode involves several steps:
1. Tokenize the source code [Parser/lexer/](../Parser/lexer/)
and [Parser/tokenizer/](../Parser/tokenizer/).
1. Tokenize the source code [Parser/lexer/](../Parser/lexer)
and [Parser/tokenizer/](../Parser/tokenizer).
2. Parse the stream of tokens into an Abstract Syntax Tree
[Parser/parser.c](../Parser/parser.c).
3. Transform AST into an instruction sequence
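As a small sketch (using only standard-library modules), the first stages of this pipeline have Python-level counterparts in `tokenize`, `ast`, and `compile()`/`dis`:

```python
import ast
import dis
import io
import tokenize

source = "x = 1 + 2\n"

# 1. Tokenization (the C tokenizer lives in Parser/lexer/ and Parser/tokenizer/).
for token in tokenize.generate_tokens(io.StringIO(source).readline):
    print(token.type, repr(token.string))

# 2. Parsing the token stream into an AST (Parser/parser.c in C).
tree = ast.parse(source)
print(ast.dump(tree))

# 3. Compiling the AST down to an instruction sequence / bytecode,
#    shown here via compile() and dis.
dis.dis(compile(tree, "<example>", "exec"))
```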
@ -134,9 +133,8 @@
`FunctionDef()` constructor function sets 'kind' to `FunctionDef_kind` and
initializes the *name*, *args*, *body*, and *attributes* fields.
See also
[Green Tree Snakes - The missing Python AST docs](https://greentreesnakes.readthedocs.io/en/latest)
by Thomas Kluyver.
See also [Green Tree Snakes - The missing Python AST docs](
https://greentreesnakes.readthedocs.io/en/latest) by Thomas Kluyver.
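For illustration, the same fields can be inspected from Python with the `ast` module (a sketch, not part of the original text):

```python
import ast

module = ast.parse("def greet(name):\n    return 'hi ' + name\n")
func = module.body[0]

# The Python-level node mirrors the C struct: the node's class plays the role
# of 'kind', and name/args/body are the fields the constructor fills in.
print(type(func).__name__)                  # FunctionDef
print(func.name)                            # greet
print([arg.arg for arg in func.args.args])  # ['name']
print(len(func.body))                       # 1
```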
Memory management
=================
@ -260,11 +258,11 @@
[Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h).
Functions and macros for creating `asdl_xx_seq *` types are as follows:
`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_generic_seq` of the specified length
`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_identifier_seq` of the specified length
`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_int_seq` of the specified length
In addition to the three types mentioned above, some ASDL sequence types are
@ -273,19 +271,19 @@
Macros for using both manually defined and automatically generated ASDL
sequence types are as follows:
`asdl_seq_GET(asdl_xx_seq *, int)`
* `asdl_seq_GET(asdl_xx_seq *, int)`:
Get item held at a specific position in an `asdl_xx_seq`
`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`
* `asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`:
Set a specific index in an `asdl_xx_seq` to the specified value
Untyped counterparts exist for some of the typed macros. These are useful
when a function needs to manipulate a generic ASDL sequence:
`asdl_seq_GET_UNTYPED(asdl_seq *, int)`
* `asdl_seq_GET_UNTYPED(asdl_seq *, int)`:
Get item held at a specific position in an `asdl_seq`
`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`
* `asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`:
Set a specific index in an `asdl_seq` to the specified value
`asdl_seq_LEN(asdl_seq *)`
* `asdl_seq_LEN(asdl_seq *)`:
Return the length of an `asdl_seq` or `asdl_xx_seq`
Note that typed macros and functions are recommended over their untyped
@ -379,14 +377,14 @@
Emission of bytecode is handled by the following macros:
* `ADDOP(struct compiler *, location, int)`
* `ADDOP(struct compiler *, location, int)`:
add a specified opcode
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`:
like `ADDOP`, but also exits current scope; used for adding return value
opcodes in lambdas and closures
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`:
add an opcode that takes an integer argument
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`:
add an opcode with the proper argument based on the position of the
specified PyObject in the PyObject sequence object, but with no handling of
mangled names; used for when you
@ -394,17 +392,17 @@
parameters where name mangling is not possible and the scope of the
name is known; *TYPE* is the name of PyObject sequence
(`names` or `varnames`)
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`:
just like `ADDOP_O`, but steals a reference to PyObject
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`:
just like `ADDOP_O`, but name mangling is also handled; used for
attribute loading or importing based on name
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`:
add the `LOAD_CONST` opcode with the proper argument based on the
position of the specified PyObject in the consts table.
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`:
just like `ADDOP_LOAD_CONST`, but steals a reference to PyObject
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`:
create a jump to a basic block
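Although these macros are internal C API, their effect is visible from Python. For example, the argument emitted by `ADDOP_LOAD_CONST` is an index into what ends up as the code object's `co_consts` table (a minimal sketch, not the compiler itself):

```python
import dis

def f():
    return (1, "two", 3.0)

# LOAD_CONST's argument is an index into co_consts, the finished form of the
# compiler's consts table.
print(f.__code__.co_consts)
dis.dis(f)
```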
The `location` argument is a struct with the source location to be
@ -433,7 +431,7 @@
bytecode. This includes transforming pseudo instructions into actual instructions,
converting jump targets from logical labels to relative offsets, and
constructing the [exception table](exception_handling.md) and
[locations table](locations.md).
[locations table](code_objects.md#source-code-locations).
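The finished locations table can be inspected from Python 3.11 onward via `code.co_positions()` (a quick illustrative sketch):

```python
def add_one(x):
    return x + 1

# Each instruction has an entry in the locations table that maps it back to a
# (start_line, end_line, start_col, end_col) span in the source.
for position in add_one.__code__.co_positions():
    print(position)
```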
The bytecode and tables are then wrapped into a `PyCodeObject` along with additional
metadata, including the `consts` and `names` arrays, information about the
function, and a reference to the source code (filename, etc.). All of this is implemented by
@ -453,7 +451,7 @@
Important files
===============
* [Parser/](../Parser/)
* [Parser/](../Parser)
* [Parser/Python.asdl](../Parser/Python.asdl):
ASDL syntax file.
@ -534,7 +532,7 @@
* [Python/instruction_sequence.c](../Python/instruction_sequence.c):
A data structure representing a sequence of bytecode-like pseudo-instructions.
* [Include/](../Include/)
* [Include/](../Include)
* [Include/cpython/code.h](../Include/cpython/code.h)
: Header file for [Objects/codeobject.c](../Objects/codeobject.c);
@ -570,7 +568,7 @@
by
[Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py).
* [Objects/](../Objects/)
* [Objects/](../Objects)
* [Objects/codeobject.c](../Objects/codeobject.c)
: Contains PyCodeObject-related code.
@ -579,7 +577,7 @@
: Contains the `frame_setlineno()` function, which determines whether it is allowed
to make a jump between two points in the bytecode.
* [Lib/](../Lib/)
* [Lib/](../Lib)
* [Lib/opcode.py](../Lib/opcode.py)
: opcode utilities exposed to Python.
@ -591,7 +589,7 @@
Objects
=======
* [Locations](locations.md): Describes the location table
* [Locations](code_objects.md#source-code-locations): Describes the location table
* [Frames](frames.md): Describes frames and the frame stack
* [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later
* [Exception Handling](exception_handling.md): Describes the exception table

@ -107,13 +107,12 @@
-----------------------------
Conceptually, the exception table consists of a sequence of 5-tuples:
```
1. `start-offset` (inclusive)
2. `end-offset` (exclusive)
3. `target`
4. `stack-depth`
5. `push-lasti` (boolean)
```
All offsets and lengths are in code units, not bytes.
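On CPython 3.11 and later the encoded table is exposed as `code.co_exceptiontable`, and `dis` prints a decoded view of these entries; a small sketch:

```python
import dis

def f():
    try:
        return 1 / 0
    except ZeroDivisionError:
        return None

# The raw, encoded table (a bytes object), followed by dis's decoded
# "ExceptionTable:" section listing start/end/target/depth/lasti entries.
print(f.__code__.co_exceptiontable)
dis.dis(f)
```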
@ -129,12 +128,13 @@
It also happens that depth is generally quite small.
So, we need to encode:
```
`start` (up to 30 bits)
`size` (up to 30 bits)
`target` (up to 30 bits)
`depth` (up to ~8 bits)
`lasti` (1 bit)
start (up to 30 bits)
size (up to 30 bits)
target (up to 30 bits)
depth (up to ~8 bits)
lasti (1 bit)
```
We need a marker for the start of the entry, so the first byte of the entry will have the most significant bit set.
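As a sketch (not the actual CPython encoder), the scheme can be written out in Python, assuming the variable-length layout CPython uses: six value bits per byte, bit 6 (`0x40`) as a "more bytes follow" flag, and bit 7 (`0x80`) marking the first byte of an entry. It is consistent with the worked example below:

```python
def write_varint(out: bytearray, value: int) -> None:
    # Emit the value six bits at a time, most significant chunk first,
    # setting the "extend" bit (0x40) on every byte except the last.
    chunks = []
    while True:
        chunks.append(value & 0x3F)
        value >>= 6
        if not value:
            break
    chunks.reverse()
    for i, chunk in enumerate(chunks):
        out.append(chunk | (0x40 if i < len(chunks) - 1 else 0))


def encode_entry(start: int, end: int, target: int, depth: int, lasti: bool) -> bytes:
    out = bytearray()
    write_varint(out, start)
    out[0] |= 0x80                           # entry-start marker (the MSB)
    write_varint(out, end - start)           # size
    write_varint(out, target)
    write_varint(out, (depth << 1) + lasti)  # depth and lasti combined
    return bytes(out)


# The worked example below (start=20, end=28, target=100, depth=3, lasti=False)
# begins with 148 (MSB + 20 for start) and 8 (size):
print(list(encode_entry(20, 28, 100, 3, False)))
```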
@ -145,23 +145,26 @@
In addition, we combine `depth` and `lasti` into a single value, `((depth<<1)+lasti)`, before encoding.
For example, the exception entry:
```
`start`: 20
`end`: 28
`target`: 100
`depth`: 3
`lasti`: False
start: 20
end: 28
target: 100
depth: 3
lasti: False
```
is encoded by first converting to the more compact four-value form:
```
`start`: 20
`size`: 8
`target`: 100
`depth<<1+lasti`: 6
start: 20
size: 8
target: 100
depth<<1+lasti: 6
```
which is then encoded as:
```
148 (MSB + 20 for start)
8 (size)

@ -27,6 +27,7 @@ # Allocation
## Layout
Each activation record is laid out as:
* Specials
* Locals
* Stack

@ -1,4 +1,3 @@
Garbage collector design
========================

@ -1,4 +1,3 @@
Generators
==========

@ -1,4 +1,3 @@
The bytecode interpreter
========================

@ -1,4 +1,3 @@
Guide to the parser
===================
@ -444,15 +443,15 @@
Once you have made the changes to the grammar files, regenerate the `C`
parser (the one used by the interpreter) by executing:
```
make regen-pegen
```shell
$ make regen-pegen
```
using the `Makefile` in the main directory. If you are on Windows, you can
use the Visual Studio project files to regenerate the parser, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).
@ -468,15 +467,15 @@
need to regenerate the meta-parser (the parser that parses the grammar files).
To do so just execute:
```
make regen-pegen-metaparser
```shell
$ make regen-pegen-metaparser
```
If you are on Windows, you can use the Visual Studio project files
to regenerate the parser, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
@ -516,15 +515,15 @@
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:
```
make regen-token
```shell
$ make regen-token
```
If you are on Windows, you can use the Visual Studio project files to regenerate
the tokens, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
How tokens are generated and the rules governing this are completely up to the tokenizer
@ -593,7 +592,7 @@
meaning in context. Trying to use a hard keyword as a variable will always
fail:
```
```pycon
>>> class = 3
File "<stdin>", line 1
class = 3
@ -609,7 +608,7 @@
Soft keywords, by contrast, don't have this limitation when used in a context other than
the one where they are defined as keywords:
```
```pycon
>>> match = 45
>>> foo(match="Yeah!")
```
@ -621,7 +620,7 @@
You can get a list of all keywords defined in the grammar from Python:
```
```pycon
>>> import keyword
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
@ -632,7 +631,7 @@
as well as soft keywords:
```
```pycon
>>> import keyword
>>> keyword.softkwlist
['_', 'case', 'match']
@ -798,7 +797,7 @@
tests, depending on the nature of the new feature you are adding.
Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test_peg_generator) directory.
[test_peg_generator](../Lib/test/test_peg_generator) directory.
Debugging generated parsers
@ -816,14 +815,14 @@
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:
```
```shell
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```
This will generate a file called `parse.py` in the same directory that you
can use to parse some input:
```
```shell
$ python parse.py file_with_source_code_to_test.py
```
@ -848,7 +847,7 @@
To activate verbose mode you can add the `-d` flag when executing Python:
```
```shell
$ python -d file_to_test.py
```

@ -2,6 +2,7 @@ # String interning
*Interned* strings are conceptually part of an interpreter-global
*set* of interned strings, meaning that:
- no two interned strings have the same content (across an interpreter);
- two interned strings can be safely compared using pointer equality
(Python `is`).
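A quick illustration from Python using `sys.intern` (the strings are built at run time so that compile-time constant deduplication doesn't get in the way):

```python
import sys

a = "".join(["interned", " ", "example"])
b = " ".join(["interned", "example"])

print(a == b, a is b)   # True False -- equal, but distinct objects

a = sys.intern(a)
b = sys.intern(b)
print(a is b)           # True -- both now refer to the single interned string
```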
@ -61,6 +62,7 @@ ## Immortality and reference counting
The converse is not true: interned strings can be mortal.
For mortal interned strings:
- the 2 references from the interned dict (key & value) are excluded from
their refcount
- the deallocator (`unicode_dealloc`) removes the string from the interned dict
@ -90,6 +92,7 @@ ## Internal API
The functions take ownership of (“steal”) the reference to their argument,
and update the argument with a *new* reference.
This means:
- They're “reference neutral”.
- They must not be called with a borrowed reference.