Remove Bytecode::compile() and the old create() overloads on
ECMAScriptFunctionObject that accepted C++ AST nodes. These
have no remaining callers now that all compilation goes through
the Rust pipeline.
Also remove the if-constexpr Parse Node branch from
async_block_start, since the Statement template instantiation
was already removed.
Fix transitive include dependencies on Generator.h by adding
explicit includes for headers that were previously pulled in
transitively.
When the asmint computes a double result for Add, Sub, Mul,
Math.floor, Math.ceil, or Math.sqrt, try to store it as Int32
if the value is a whole number in [INT32_MIN, INT32_MAX] and
not -0.0. This mirrors the JS::Value(double) constructor and
allows downstream int32 fast paths to fire.
Also add label uniquification to the DSL macro expander so the
same macro can be used multiple times in one handler without
label collisions.
Teach js_to_int32 to leave a clean low 32-bit result on success, then
use box_int32_clean in the ToInt32 fast path and adjacent boolean
coercions. This removes one instruction from the AArch64 fjcvtzs path
and trims the boolean boxing path without changing behavior.
Add the small AsmIntGen float32 load, store, and conversion operations
needed to handle Float32Array directly in the AsmInt typed-array
GetByValue and PutByValue paths.
This covers direct indexed reads plus both int32 and double stores,
and adds regression coverage for Math.fround rounding, negative zero,
and NaN.
Teach the asm typed-array GetByValue and PutByValue paths to handle
Uint8ClampedArray directly. Reads can share the Uint8Array load path,
while int32 stores clamp in asm instead of bailing out to C++.
Add a direct indexed access regression test for clamped int32 stores.
Cache raw data pointers on fixed-length typed array views so asm
GetByValue and PutByValue can use them directly for indexed
element access.
Replace the asm typed-array hot-path
ArrayBuffer/DataBlock/ByteBuffer walk with one cached_data_ptr load.
Remove six unconditional loads, four branches, and the byte_offset
add before the element access, trading them for one
cached_data_ptr null check.
Keep direct C++ typed-array access on IsValidIntegerIndex-based
checks, invalidate cached pointers eagerly when a backing
ArrayBuffer is detached, and add regression coverage for shrink,
regrow, and detach on number and BigInt typed arrays.
Use dedicated Packed branches in GetByValue and PutByValue so
in-bounds indexed accesses can skip hole checks and slot
reloads.
Keep Holey writes on the guarded arm, and keep append writes on
the C++ slow path so PutByValue still respects non-extensible
indexed objects and arrays with a non-writable length.
Add a bytecode regression that exercises both append failure
cases through the real js binary path.
Replace the OwnPtr<IndexedPropertyStorage> indirection with inline
indexed element storage directly on Object. This eliminates virtual
dispatch and reduces indirection for indexed property access.
The new system uses three storage kinds tracked by IndexedStorageKind:
- Packed: Dense array, no holes. Elements stored in a malloced Value*
array with capacity header (same layout as named properties).
- Holey: Dense array with possible holes marked by empty sentinel.
Same physical layout as Packed.
- Dictionary: Sparse storage using GenericIndexedPropertyStorage,
type-punned into the m_indexed_elements pointer.
Transitions: None->Packed->Holey->Dictionary (mostly monotonic).
Dictionary mode triggers on non-default attributes or sparse arrays.
Object keeps the same 48-byte size since m_indexed_elements (8 bytes)
replaces IndexedProperties (8 bytes), and the storage kind + array
size fit in existing padding alongside m_flags.
The asm interpreter benefits from one fewer indirection: it now reads
the element pointer and array size directly from Object fields instead
of chasing through OwnPtr -> IndexedPropertyStorage -> Vector.
Removes: IndexedProperties, SimpleIndexedPropertyStorage,
IndexedPropertyStorage, IndexedPropertyIterator.
Keeps: GenericIndexedPropertyStorage (for Dictionary mode).
Replace the 24-byte Vector<Value> m_storage with an 8-byte raw
Value* m_named_properties pointer, backed by a malloc'd allocation
with an inline capacity header.
Memory layout of the allocation:
[u32 capacity] [u32 padding] [Value 0] [Value 1] ...
m_named_properties points to Value 0.
This shrinks JS::Object from 64 to 48 bytes (on non-Windows
platforms) and removes one level of indirection for property access
in the asm interpreter, since the data pointer is now stored directly
on the object rather than inside a Vector's internal metadata.
Growth policy: max(4, max(needed, old_capacity * 2)).
Add Mov2 and Mov3 bytecode instructions that perform 2 or 3 register
moves in a single dispatch. A peephole optimization pass during
bytecode assembly merges consecutive Mov instructions within each
basic block into these combined instructions.
When merging, identical Movs are deduplicated (e.g. two identical Movs
become a single Mov, not a Mov2). This optimization is implemented in
both the C++ and Rust codegen pipelines.
The goal is to reduce the per-instruction dispatch overhead, which is
significant compared to the actual cost of moving a value.
This isn't fancy or elegant, but provides a real speed-up on many
workloads. As an example, Kraken/imaging-desaturate.js improves by
~1.07x on my laptop.
Replace the 16-byte Variant<Empty, GC::Ref<Script>, GC::Ref<Module>>
with a simple 8-byte GC::Ptr<Cell> that points to either a Script or
Module (or is null for Empty).
A helper function script_or_module_from_cell() converts back to the
full ScriptOrModule variant when needed (e.g. in
VM::get_active_script_or_module).
Remove four fields that are trivially derivable from other fields
already present in the ExecutionContext:
- global_object (from realm)
- global_declarative_environment (from realm)
- identifier_table (from executable)
- property_key_table (from executable)
This shrinks ExecutionContext from 192 to 160 bytes (-17%).
The asmint's GetGlobal/SetGlobal handlers now load through the realm
pointer, taking advantage of the cached declarative environment
pointer added in the previous commit.
These test only the low 32 bits of a register, replacing the previous
pattern of `and reg, 0xFFFFFFFF` followed by `branch_zero` or
`branch_nonzero`.
On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.
On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.
On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.
Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
Instead of storing a u32 index into a cache vector and looking up the
cache at runtime through a chain of dependent loads (load Executable*,
load vector data pointer, multiply index, add), store the actual cache
pointer as a u64 directly in the instruction stream.
A fixup pass (Executable::fixup_cache_pointers()) runs after Executable
construction in both the Rust and C++ pipelines, walking the bytecode
and replacing each index with the corresponding pointer.
The cache pointer type is encoded in Bytecode.def (e.g.
PropertyLookupCache*, GlobalVariableCache*) so the fixup switch is
auto-generated by the Python Op code generator, making it impossible
to forget updating the fixup when adding new cached instructions.
This eliminates 3-4 dependent loads on every inline cache access in
both the C++ interpreter and the assembly interpreter.
Property lookup cache entries previously used GC::Weak<T> for shape,
prototype, and prototype_chain_validity pointers. Each GC::Weak
requires a ref-counted WeakImpl allocation and an extra indirection
on every access.
Replace these with GC::RawPtr<T> and make Executable a WeakContainer
so the GC can clear stale pointers during sweep via remove_dead_cells.
For static PropertyLookupCache instances (used throughout the runtime
for well-known property lookups), introduce StaticPropertyLookupCache
which registers itself in a global list that also gets swept.
Now that inline cache entries use GC::RawPtr instead of GC::Weak,
we can compare shape/prototype pointers directly without going
through the WeakImpl indirection. This removes one dependent load
from each IC check in GetById, PutById, GetLength, GetGlobal, and
SetGlobal handlers.
SimpleIndexedPropertyStorage can only hold default-attributed data
properties. Any attempt to store a property with non-default
attributes (such as accessors) triggers conversion to
GenericIndexedPropertyStorage first. So when we've already verified
is_simple_storage, the accessor check is dead code.
Instead of calling into C++ helpers for global let/const variable
access, inline the binding lookup directly in the asm handlers.
This avoids the overhead of a C++ call for the common case.
Module environments still use the C++ helper since they require
additional lookups that aren't worth inlining.
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.
On aarch64, this produces significant improvements:
- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
(3-operand shifts are free on ARM). Saves 1 instruction at 57
call sites.
- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
`mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
immediate. Saves 2 instructions at 6 call sites.
- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
`mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
mov zero-extends, and movk overwrites just the top 16 bits.
Saves 2 instructions and no longer clobbers t0 (rax).
- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
clobber).
On x86_64, the generated code is equivalent to the old macros.
UnsignedRightShift: after shr on a zero-extended value, upper bits are
already clear.
GetByValue typed array path: load32/load8/load16/load8s/load16s all
write to 32-bit destination registers, zeroing the upper 32 bits.
Both can use box_int32_clean to skip the redundant AND 0xFFFFFFFF.
Add a not32 DSL instruction that operates on the 32-bit sub-register,
zeroing the upper 32 bits (x86_64: not r32, aarch64: mvn w_reg).
Use it in BitwiseNot to avoid the sign-extension (unbox_int32), 64-bit
NOT, and explicit AND 0xFFFFFFFF. The 32-bit NOT produces a clean
upper half, so we can use box_int32_clean directly.
Before: movsxd + not r64 + and 0xFFFFFFFF + and 0xFFFFFFFF + or tag
After: mov + not r32 + or tag
In JumpIf, JumpTrue, JumpFalse, and Not, the int32 zero-test path
copied the value to a temporary before masking: mov t3, t1; and t3,
0xFFFFFFFF; branch_zero t3. Since t1 is dead after the test, operate
on it directly: and t1, 0xFFFFFFFF; branch_zero t1. Saves one mov
instruction per handler on the int32 truthiness path.
Add box_int32_clean for sites where the upper 32 bits are already
known to be zero, skipping the redundant zero-extension. On x86_64,
32-bit register writes (add esi, edi; neg esi; etc.) implicitly
clear the upper 32 bits, making the truncation in box_int32
unnecessary.
Use box_int32_clean at 9 call sites after add32_overflow,
sub32_overflow, mul32_overflow, and neg32_overflow, saving one
instruction per site on the hot int32 arithmetic paths.
Replace the check_is_double pattern that loaded the full 64-bit
CANON_NAN_BITS constant (10-byte movabs on x86_64) and masked the
entire value, with a cheaper approach: extract the upper 16-bit tag
and check if (tag & NAN_BASE_TAG) == NAN_BASE_TAG.
This saves instructions at every double-check site. Additionally,
add a check_tag_is_double macro for call sites where the tag has
already been extracted into a register, avoiding redundant
extract_tag operations. This is used in 11 call sites across
coerce_to_doubles, strict_equality_core, numeric_compare, Div,
UnaryPlus, UnaryMinus, and ToInt32.
Replace the pattern of 64-bit arithmetic + sign-extend + compare
with dedicated 32-bit overflow instructions that use the hardware
overflow flag directly.
Before: add t3, t4 / unbox_int32 t5, t3 / branch_ne t3, t5, .overflow
After: add32_overflow t3, t4, .overflow
On x86_64 this compiles to `add r32, r32; jo label` (the 32-bit
register write implicitly zeros the upper 32 bits). On aarch64,
`adds w, w, w; b.vs label` for add/sub, `smull + sxtw + cmp + b.ne`
for multiply, and `negs + b.vs` for negate.
Nine call sites updated: Add, Sub, Mul, Increment, Decrement,
PostfixIncrement, PostfixDecrement, UnaryMinus, and CallBuiltin(abs).
Add a new interpreter that executes bytecode via generated assembly,
written in a custom DSL (asmint.asm) that AsmIntGen compiles to
native x86_64 or aarch64 code.
The interpreter keeps the bytecode program counter and register file
pointer in machine registers for fast access, dispatching opcodes
through a jump table. Hot paths (arithmetic, comparisons, property
access on simple objects) are handled entirely in assembly, with
cold/complex operations calling into C++ helper functions defined
in AsmInterpreter.cpp.
A small build-time tool (gen_asm_offsets) uses offsetof() to emit
struct field offsets as constants consumed by the DSL, ensuring the
assembly stays in sync with C++ struct layouts.
The interpreter is enabled by default on platforms that support it.
The C++ interpreter can be selected via LIBJS_USE_CPP_INTERPRETER=1.
Currently supported platforms:
- Linux/x86_64
- Linux/aarch64
- macOS/x86_64
- macOS/aarch64