mirror of
https://github.com/python/cpython.git
synced 2025-12-31 04:23:37 +00:00
[3.13] gh-113993: Allow interned strings to be mortal, and fix related issues (GH-120520) (GH-120945)
* Add an InternalDocs file describing how interning should work and how to use it.
* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`
* Switch uses of `PyUnicode_InternInPlace` to those.
* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you may be
    immortalizing a redundant copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.
* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.
* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.
* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).
* Intern `_Py_STR` singletons at startup.
* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.
* Beef up the tests. Cover internal details (marked with `@cpython_only`).
* Add lots of assertions
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
parent 447e07ab3d
commit 9769b7ae06
42 changed files with 2460 additions and 1140 deletions
122
InternalDocs/string_interning.md
Normal file
@@ -0,0 +1,122 @@
# String interning

*Interned* strings are conceptually part of an interpreter-global
*set* of interned strings, meaning that:

- no two interned strings have the same content (across an interpreter);
- two interned strings can be safely compared using pointer equality
  (Python `is`).

This is used to optimize dict and attribute lookups, among other things.

Python uses three different mechanisms to intern strings:

- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
  These are statically allocated, and collected using `make regen-global-objects`
  (`Tools/build/generate_global_objects.py`), which generates code
  for declaration, initialization and finalization.

  The difference between the two kinds is not important. (A `_Py_ID` string is
  a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
  non-identifier characters, so it needs a separate C-compatible name.)

  The empty string is in this category (as `_Py_STR(empty)`).

  These singletons are interned in a runtime-global lookup table,
  `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
  at runtime initialization.

- The 256 possible one-character latin-1 strings, which can be retrieved
  with `_Py_LATIN1_CHR(c)`, are singletons stored in runtime-global
  arrays, `_PyRuntime.static_objects.strings.ascii` and
  `_PyRuntime.static_objects.strings.latin1`.

  These are NOT interned at startup in the normal build.
  In the free-threaded build, they are; this avoids modifying the
  global lookup table after threads are started.

  Interning a one-char latin-1 string will always intern the corresponding
  singleton.

- All other strings are allocated dynamically, and have their
  `_PyUnicode_STATE(s).statically_allocated` flag set to zero.
  When interned, such strings are added to an interpreter-wide dict,
  `PyInterpreterState.cached_objects.interned_strings`.

  The key and value of each entry in this dict reference the same object.

The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
are disjoint.
If you have such a singleton, it (and no other copy) will be interned.


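The pointer-equality guarantee stated at the start of this section can be
observed from Python through the public `sys.intern` function. The snippet
below is a minimal sketch: the variable names are illustrative only, and it
makes no claim about which of the three mechanisms handles these particular
strings.

```python
import sys

# Build both strings at run time so that they start out as separate objects
# with equal content, rather than one shared compile-time constant.
parts = ["string", "interning"]
first = "_".join(parts)
second = "_".join(parts)

# Interning maps every string with this content to one canonical object.
a = sys.intern(first)
b = sys.intern(second)

equal_content = (first == second)  # compares characters
same_object = (a is b)             # compares pointers
```

After interning, `a is b` holds, so code that knows both operands are interned
can use the identity check instead of comparing contents.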
## Immortality and reference counting

Invariant: Every immortal string is interned, *except* the one-char latin-1
singletons (which may or may not be interned).

In practice, this means that you must not use `_Py_SetImmortal` on
a string. (If you know it's already immortal, don't immortalize it;
if you know it's not interned you might be immortalizing a redundant copy;
if it's interned and mortal it needs extra processing in
`_PyUnicode_InternImmortal`.)

The converse is not true: interned strings can be mortal.
For mortal interned strings:
- the 2 references from the interned dict (key & value) are excluded from
  their refcount
- the deallocator (`unicode_dealloc`) removes the string from the interned dict
- at shutdown, when the interned dict is cleared, the references are added back

As with any type, you should only immortalize strings that will live until
interpreter shutdown.
We currently also immortalize strings contained in code objects and similar,
specifically in the compiler and in `marshal`.
These are “close enough” to immortal: even in use cases like hot reloading
or `eval`-ing user input, the number of distinct identifiers and string
constants is expected to stay low.


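The bookkeeping for mortal interned strings can be pictured with a small
Python analogy. This is only a sketch under stated assumptions: `Text`,
`intern_mortal` and `_interned` are hypothetical names, and the table uses
weak references to imitate the effect of excluding the dict's references
from the refcount. The real implementation adjusts the refcount directly and
removes the entry in `unicode_dealloc`.

```python
import gc
import weakref


class Text:
    """Hypothetical stand-in for a string object (str cannot be weakly referenced)."""

    def __init__(self, value):
        self.value = value


# Toy intern table: values are held weakly, so the table alone
# never keeps an entry alive.
_interned = weakref.WeakValueDictionary()


def intern_mortal(text):
    """Return the canonical Text for text.value, registering `text` if there is none."""
    return _interned.setdefault(text.value, text)


a = intern_mortal(Text("spam"))
b = intern_mortal(Text("spam"))  # the second copy resolves to the first object
same_object = a is b
size_while_alive = len(_interned)

del a, b      # drop the last strong references
gc.collect()  # not needed with CPython's refcounting, harmless elsewhere
size_after_release = len(_interned)
```

While a strong reference exists the table holds one entry; once the last one
is dropped the entry disappears, which mirrors a mortal interned string
leaving the interned dict when it is deallocated.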
## Internal API

We have the following *internal* API for interning:

- `_PyUnicode_InternMortal`: just intern the string
- `_PyUnicode_InternImmortal`: intern, and immortalize the result
- `_PyUnicode_InternStatic`: intern a static singleton (`_Py_STR`, `_Py_ID`
  or one-byte). Not for general use.

All take an interpreter state, and a pointer to a `PyObject*` which they
modify in place.

The functions take ownership of (“steal”) the reference to their argument,
and update the argument with a *new* reference.
This means:
- They're “reference neutral”.
- They must not be called with a borrowed reference.


## State

The intern state (retrieved by `PyUnicode_CHECK_INTERNED(s)`;
stored in `_PyUnicode_STATE(s).interned`) can be:

- `SSTATE_NOT_INTERNED` (defined as 0, which is useful in a boolean context)
- `SSTATE_INTERNED_MORTAL` (1)
- `SSTATE_INTERNED_IMMORTAL` (2)
- `SSTATE_INTERNED_IMMORTAL_STATIC` (3)

The valid transitions between these states are:

- For dynamically allocated strings:

  - 0 -> 1 (`_PyUnicode_InternMortal`)
  - 1 -> 2 or 0 -> 2 (`_PyUnicode_InternImmortal`)

  Using `_PyUnicode_InternStatic` on these is an error; the other cases
  don't change the state.

- One-char latin-1 singletons can be interned (0 -> 3) using any interning
  function; after that the functions don't change the state.

- Other statically allocated strings are interned (0 -> 3) at runtime init;
  after that, no interning function changes the state.
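
The transition rules above can be summarized as a small table-like function.
This is a sketch for illustration, not the C implementation: `InternState`
and `next_state` are hypothetical names, the function labels `"mortal"`,
`"immortal"` and `"static"` stand for the three internal functions, and the
model ignores the interned-dict lookup (which may replace the string with an
existing copy) as well as all reference counting.

```python
from enum import IntEnum


class InternState(IntEnum):
    # Values mirror the SSTATE_* constants listed above.
    NOT_INTERNED = 0
    INTERNED_MORTAL = 1
    INTERNED_IMMORTAL = 2
    INTERNED_IMMORTAL_STATIC = 3


def next_state(state, statically_allocated, func):
    """Return the state after calling the interning function named by `func`."""
    if func not in ("mortal", "immortal", "static"):
        raise ValueError(f"unknown interning function: {func!r}")
    if statically_allocated:
        # Static strings only ever move 0 -> 3; afterwards nothing changes.
        return InternState.INTERNED_IMMORTAL_STATIC
    if func == "static":
        raise ValueError("_PyUnicode_InternStatic is an error for dynamic strings")
    if func == "mortal":
        # 0 -> 1; strings that are already interned keep their state.
        return max(state, InternState.INTERNED_MORTAL)
    # "immortal": 0 -> 2 or 1 -> 2; state 2 stays 2.
    return InternState.INTERNED_IMMORTAL
```

For example, `next_state(InternState.INTERNED_IMMORTAL, False, "mortal")`
leaves the state at `INTERNED_IMMORTAL`: no transition ever lowers the state.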