cpython/Python/tier2_engine.md

# The tier 2 execution engine

## General idea

When execution in tier 1 becomes "hot", that is the counter for that point in
the code reaches some threshold, we create an executor and execute that
instead of the tier 1 bytecode.

Since each executor must exit, we also track the "hotness" of those
exits and attach new executors to those exits.

As the program executes, and the hot parts of the program get optimized,
a graph of executors forms.

## Superblocks and Executors

Once a point in the code has become hot enough, we want to optimize it.
Starting from that point we project the likely path of execution,
using information gathered by tier 1 to guide that projection to
form a "superblock", a mostly linear sequence of micro-ops.
Although mostly linear, it may include a single loop.

We then optimize this superblock to form an optimized superblock,
which is equivalent but more efficient.

A superblock is a representation of the code we want to execute,
but it is not in executable form.
The executable form is known as an executor.

Executors are semantically equivalent to the superblock they are
created from, but are in a form that can be efficiently executable.

There are two execution engines for executors, and two types of executors:
* The hardware which runs machine code executors created by the JIT compiler.
* The tier 2 interpreter runs bytecode executors.

It would be very wasteful to support both a tier 2 interpreter and
JIT compiler in the same process.
For now, we will make the choice of engine a configuration option,
but we could make it a command line option in the future if that would prove useful.


### Tier 2 Interpreter

For platforms without a JIT and for testing, we need an interpreter
for executors. It is similar in design to the tier 1 interpreter, but has a
different instruction set, and does not adapt.

### JIT compiler

The JIT compiler converts superblocks into machine code executors.
These have identical behavior to interpreted executors, except that
they consume more memory for the generated machine code and are a lot faster.

## Transfering control

There are three types of control transfer that we need to consider:
* Tier 1 to tier 2
* Tier 2 to tier 1
* One executor to another within tier 2

Since we expect the graph of executors to span most of the hot
part of the program, transfers from one executor to another should
be the most common.
Therefore, we want to make those transfers fast.

### Tier 2 to tier 2

#### Cold exits

All side exits start cold and most stay cold, but a few become
hot. We want to keep the memory consumption small for the many
cold exits, but those that become hot need to be fast.
However we cannot know in advance, which will be which.

So that tier 2 to tier 2 transfers are fast for hot exits,
exits must be implemented as executors. In order to patch
executor exits when they get hot, a pointer to the current
executor must be passed to the exit executor.

#### Handling reference counts

There must be an implicit reference to the currently executing
executor, otherwise it might be freed.
Consequently, we must increment the reference count of an
executor just before executing it, and decrement it just after
executing it.

We want to minimize the amount of data that is passed from
one executor to the next. In the JIT, this reduces the number
of arguments in the tailcall, freeing up registers for other uses.
It is less important in the interpreter, but following the same
design as the JIT simplifies debugging and is good for performance.

Provided that we incref the new executor before executing it, we
can jump directly to the code of the executor, without needing
to pass a reference to that executor object.
However, we do need a reference to the previous executor,
so that it can be decref'd and for handling of cold exits.
To avoid messing up the JIT's register allocation, we pass a
reference to the previous executor in the thread state's
`previous_executor` field.

#### The interpreter

The tier 2 interpreter has a variable `current_executor` which
points to the currently live executor. When transfering from executor
`A` to executor `B` we do the following:
(Initially `current_executor` points to `A`, and the refcount of
`A` is elevated by one)

1. Set the instruction pointer to start at the beginning of `B`
2. Increment the reference count of `B`
3. Start executing `B`

We also make the first instruction in `B` do the following:
1. Set `current_executor` to point to `B`
2. Decrement the reference count of `A` (`A` is referenced by `tstate->previous_executor`)

The net effect of the above is to safely decrement the refcount of `A`,
increment the refcount of `B` and set `current_executor` to point to `B`.

#### In the JIT

Transfering control from one executor to another is done via tailcalls.

The compiled executor should do the same, except that there is no local
variable `current_executor`.

### Tier 1 to tier 2

Since the executor doesn't know if the previous code was tier 1 or tier 2,
we need to make a transfer from tier 1 to tier 2 look like a tier 2 to tier 2
transfer to the executor.

We can then perform a tier 1 to tier 2 transfer by setting `current_executor`
to `None`, and then performing a tier 2 to tier 2 transfer as above.

### Tier 2 to tier 1

Each micro-op that might exit to tier 1 contains a `target` value,
which is the offset of the tier 1 instruction to exit to in the
current code object.

## Counters

TO DO.
The implementation will change soon, so there is no point in
documenting it until then.
GH-112354: Initial implementation of warm up on exits and trace-stitching (GH-114142) 2024-02-20 09:39:55 +00:00			`# The tier 2 execution engine`

			`## General idea`

			`When execution in tier 1 becomes "hot", that is the counter for that point in`
			`the code reaches some threshold, we create an executor and execute that`
			`instead of the tier 1 bytecode.`

			`Since each executor must exit, we also track the "hotness" of those`
			`exits and attach new executors to those exits.`

			`As the program executes, and the hot parts of the program get optimized,`
			`a graph of executors forms.`

			`## Superblocks and Executors`

			`Once a point in the code has become hot enough, we want to optimize it.`
			`Starting from that point we project the likely path of execution,`
			`using information gathered by tier 1 to guide that projection to`
			`form a "superblock", a mostly linear sequence of micro-ops.`
			`Although mostly linear, it may include a single loop.`

			`We then optimize this superblock to form an optimized superblock,`
			`which is equivalent but more efficient.`

			`A superblock is a representation of the code we want to execute,`
			`but it is not in executable form.`
			`The executable form is known as an executor.`

			`Executors are semantically equivalent to the superblock they are`
			`created from, but are in a form that can be efficiently executable.`

			`There are two execution engines for executors, and two types of executors:`
			`* The hardware which runs machine code executors created by the JIT compiler.`
			`* The tier 2 interpreter runs bytecode executors.`

			`It would be very wasteful to support both a tier 2 interpreter and`
			`JIT compiler in the same process.`
			`For now, we will make the choice of engine a configuration option,`
			`but we could make it a command line option in the future if that would prove useful.`


			`### Tier 2 Interpreter`

			`For platforms without a JIT and for testing, we need an interpreter`
			`for executors. It is similar in design to the tier 1 interpreter, but has a`
			`different instruction set, and does not adapt.`

			`### JIT compiler`

			`The JIT compiler converts superblocks into machine code executors.`
			`These have identical behavior to interpreted executors, except that`
			`they consume more memory for the generated machine code and are a lot faster.`

			`## Transfering control`

			`There are three types of control transfer that we need to consider:`
			`* Tier 1 to tier 2`
			`* Tier 2 to tier 1`
			`* One executor to another within tier 2`

			`Since we expect the graph of executors to span most of the hot`
			`part of the program, transfers from one executor to another should`
			`be the most common.`
			`Therefore, we want to make those transfers fast.`

			`### Tier 2 to tier 2`

			`#### Cold exits`

			`All side exits start cold and most stay cold, but a few become`
			`hot. We want to keep the memory consumption small for the many`
			`cold exits, but those that become hot need to be fast.`
			`However we cannot know in advance, which will be which.`

			`So that tier 2 to tier 2 transfers are fast for hot exits,`
			`exits must be implemented as executors. In order to patch`
			`executor exits when they get hot, a pointer to the current`
			`executor must be passed to the exit executor.`

			`#### Handling reference counts`

			`There must be an implicit reference to the currently executing`
			`executor, otherwise it might be freed.`
			`Consequently, we must increment the reference count of an`
			`executor just before executing it, and decrement it just after`
			`executing it.`

			`We want to minimize the amount of data that is passed from`
			`one executor to the next. In the JIT, this reduces the number`
			`of arguments in the tailcall, freeing up registers for other uses.`
			`It is less important in the interpreter, but following the same`
			`design as the JIT simplifies debugging and is good for performance.`

			`Provided that we incref the new executor before executing it, we`
			`can jump directly to the code of the executor, without needing`
			`to pass a reference to that executor object.`
			`However, we do need a reference to the previous executor,`
			`so that it can be decref'd and for handling of cold exits.`
			`To avoid messing up the JIT's register allocation, we pass a`
			`reference to the previous executor in the thread state's`
			`previous_executor` field.

			`#### The interpreter`

			The tier 2 interpreter has a variable `current_executor` which
			`points to the currently live executor. When transfering from executor`
			`A` to executor `B` we do the following:
			(Initially `current_executor` points to `A`, and the refcount of
			`A` is elevated by one)

			1. Set the instruction pointer to start at the beginning of `B`
			2. Increment the reference count of `B`
			3. Start executing `B`

			We also make the first instruction in `B` do the following:
			1. Set `current_executor` to point to `B`
			2. Decrement the reference count of `A` (`A` is referenced by `tstate->previous_executor`)

			The net effect of the above is to safely decrement the refcount of `A`,
			increment the refcount of `B` and set `current_executor` to point to `B`.

			`#### In the JIT`

			`Transfering control from one executor to another is done via tailcalls.`

			`The compiled executor should do the same, except that there is no local`
			variable `current_executor`.

			`### Tier 1 to tier 2`

			`Since the executor doesn't know if the previous code was tier 1 or tier 2,`
			`we need to make a transfer from tier 1 to tier 2 look like a tier 2 to tier 2`
			`transfer to the executor.`

			We can then perform a tier 1 to tier 2 transfer by setting `current_executor`
			to `None`, and then performing a tier 2 to tier 2 transfer as above.

			`### Tier 2 to tier 1`

			Each micro-op that might exit to tier 1 contains a `target` value,
			`which is the offset of the tier 1 instruction to exit to in the`
			`current code object.`

			`## Counters`

			`TO DO.`
			`The implementation will change soon, so there is no point in`
			`documenting it until then.`