The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.
Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
Place Realm's cached declarative environment next to its global object
so the asm global access fast paths can fetch the two pointers with a
paired load. These handlers never use the intervening GlobalEnvironment
pointer directly.
Mirror Executable's constants size and data pointer in adjacent fields
so the asm Call fast path can pair-load them together. The underlying
Vector layout keeps size and data apart, so a small cached raw span
lets the hot constant-copy loop fetch both pieces of metadata at once.
Load PropertyNameIterator's indexed-property count and next index
together when stepping the fast path. Keeping the paired count live
into the named-property case also avoids reloading it before computing
the flattened index.
Load PropertyNameIterator's cached property cache and shape snapshot
together before validating the receiver shape. The two fields already
sit adjacent in the object layout, so the fast path can fetch both
without any extra reshuffling.
Load EnvironmentCoordinate::hops and ::index together in the asm
environment-walk helper. The pair-load keeps the DSL explicit about
which two fields travel together and removes another scalar metadata
fetch from the fast path.
Load the cached property offset and dictionary generation with paired
loads in the property inline-cache fast paths. AsmIntGen now verifies
these reads against the actual cache layout, so the DSL keeps both
fields named and self-documenting.
Load the cached shape and prototype pointer together in the property
inline-cache fast paths that already read both. This keeps the
cache-entry metadata fetches aligned with the DSL's paired-load model
without changing the surrounding control flow.
Load the inline frame's return pc and destination register at once when
Return or End resumes an asm-managed caller. This keeps the unwind
metadata with the helper that consumes it and removes a separate scalar
load from both handlers.
The asm Call fast path was still reloading the executable pointer while
building the inline callee frame, even though it had already loaded the
same pointer while validating the call target.
Carry that executable pointer through frame setup and reload the passed
argument count from the call bytecode instead of the fresh frame header.
This trims a couple more loads from the hot path.
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.
Keep the executable pointer and packed metadata in separate registers
through `this` binding so the fast path can still use the paired-load
layout after any non-strict `this` adjustment.
Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.
Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
Executable already caches the combined registers, locals, and constants
count that the asm Call fast path needs for inline frame allocation.
Use that precomputed total instead of rebuilding it from the registers
count and constants vector size in the hot path.
The asm Call fast path already checks SharedFunctionInstanceData's
cached can_inline_call bit before touching the executable pointer.
That cache is only true for ordinary functions with compiled bytecode,
so the extra executable null check is redundant work in the hot path.
The asm Call fast path reads InterpreterStack::m_top and m_limit
back-to-back while checking whether the inline callee frame fits.
Those fields are adjacent, so we can load them together with one
paired load and keep the stack-size check otherwise unchanged.
Teach the asm Call fast path to use paired stores for the fixed
ExecutionContext header writes and for the caller linkage fields.
This also initializes the five reserved Value slots directly instead
of looping over them as part of the general register clear path.
That keeps the hot frame setup work closer to the actual data layout:
reserved registers are seeded with a couple of fixed stores, while the
remaining register and local slots are cleared in wider chunks.
On x86_64, keep the new explicit-offset formatting on store_pair*
and load_pair* without changing ordinary [base, index, scale]
operands into base-plus-index-plus-offset addresses. Add unit
tests covering both the paired zero-offset form and the preserved
scaled-index lowering.
Use the new paired-load DSL operations in the inline Call path for the
adjacent environment, ScriptOrModule, caller metadata, and callee-entry
loads. The flow stays the same, but the hot call setup now needs fewer
scalar memory operations on aarch64.
Handle inline-eligible JS-to-JS Call directly in asmint.asm instead
of routing the whole operation through AsmInterpreter.cpp.
The asm handler now validates the callee, binds `this` for the
non-allocating cases, reserves the callee InterpreterStack frame,
populates the ExecutionContext header and Value tail, and enters the
callee bytecode at pc 0.
Keep the cases that need NewFunctionEnvironment() or sloppy `this`
boxing on a narrow helper that still builds an inline frame. This
preserves the existing inline-call semantics for promise-job ordering,
receiver binding, and sloppy global-this handling while keeping the
common path in assembly.
Add regression coverage for closure-capturing callees, sloppy
primitive receivers, and sloppy undefined receivers.
Emit the ExecutionContext, function-object, executable, and realm
offsets that the asm Call path needs to inspect and initialize
directly when building inline frames.
Handle Return and End entirely in AsmInt when leaving an inline frame.
The handlers now restore the caller, update the interpreter stack
bookkeeping directly, and bump the execution generation without
bouncing through AsmInterpreter.cpp.
Add WeakRef tests that exercise both inline Return and inline End
so this path stays covered.
Store whether a function can participate in JS-to-JS inline calls on
SharedFunctionInstanceData instead of recomputing the function kind,
class-constructor bit, and bytecode availability at each fast-path
call site.
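A sketch of the cached predicate, computed once when the shared data is
set up instead of at every fast-path call site. Only
SharedFunctionInstanceData and the eligibility rule come from the change;
the enum and field names below are illustrative:

    enum class FunctionKind { Normal, Generator, Async, AsyncGenerator };

    struct SharedFunctionInstanceDataSketch {
        FunctionKind kind { FunctionKind::Normal };
        bool is_class_constructor { false };
        bool has_bytecode { false };
        bool can_inline_call { false };

        void update_can_inline_call()
        {
            // Only ordinary functions with compiled bytecode are eligible
            // for the JS-to-JS inline call fast path.
            can_inline_call = kind == FunctionKind::Normal
                && !is_class_constructor
                && has_bytecode;
        }
    };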
The bytecode interpreter only needed the running execution context,
but still threaded a separate Interpreter object through both the C++
and asm entry points. Move that state and the bytecode execution
helpers onto VM instead, and teach the asm generator and slow paths to
use VM directly.
Specialize only the fixed unary case in the bytecode generator and let
all other argument counts keep using the generic Call instruction. This
keeps the builtin bytecode simple while still covering the common fast
path.
The asm interpreter handles int32 inputs directly, applies the ToUint16
mask in-place, and reuses the VM's cached ASCII single-character
strings when the result is 7-bit representable. Non-ASCII single code
unit results stay on the dedicated builtin path via a small helper, and
the dedicated slow path still handles the generic cases.
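A standalone sketch of that decision, with a hypothetical per-VM cache of
single-character ASCII strings standing in for the real one:

    #include <cstdint>
    #include <optional>
    #include <string>

    std::optional<std::string> try_ascii_single_char(int32_t code_unit_arg,
        std::string const (&ascii_single_char_cache)[128])
    {
        // ToUint16: keep only the low 16 bits of the int32 argument.
        uint16_t code_unit = static_cast<uint16_t>(code_unit_arg);
        if (code_unit >= 0x80)
            return std::nullopt; // non-ASCII: dedicated builtin helper builds the string
        return ascii_single_char_cache[code_unit];
    }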
Tag String.prototype.charAt as a builtin and emit a dedicated
bytecode instruction for non-computed calls.
The asm interpreter can then stay on the fast path when the
receiver is a primitive string with resident UTF-16 data and the
selected code unit is ASCII. In that case we can return the VM's
cached empty or single-character ASCII string directly.
Teach builtin call specialization to recognize non-computed
member calls to charCodeAt() and emit a dedicated builtin opcode.
Mark String.prototype.charCodeAt with that builtin tag, then add
an asm interpreter fast path for primitive-string receivers whose
UTF-16 data is already resident.
The asm path handles both ASCII-backed and UTF-16-backed resident
strings, returns NaN for out-of-bounds Int32 indices, and falls
back to the generic builtin call path for everything else. This
keeps the optimistic case in asm while preserving the ordinary
method call semantics when charCodeAt has been replaced or when
string resolution would be required.
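A sketch of the in-bounds/out-of-bounds handling, with a plain UTF-16 code
unit vector standing in for a resident primitive string; replaced
charCodeAt, non-string receivers, and non-resident data bail out before
this point:

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    double char_code_at_fast_path(std::vector<uint16_t> const& code_units, int32_t index)
    {
        if (index < 0 || static_cast<std::size_t>(index) >= code_units.size())
            return std::nan(""); // out-of-bounds Int32 index: NaN
        return static_cast<double>(code_units[static_cast<std::size_t>(index)]);
    }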
Replace the generic CallBuiltin instruction with one opcode per
supported builtin call and make those instructions fixed-size by
arity. This removes the builtin dispatch sled in the asm
interpreter, gives each builtin a dedicated slow-path entry point,
and lets bytecode generation encode the callee shape directly.
Keep the existing handwritten asm fast paths for the Math builtins
that already benefit from them, while routing the other builtin
opcodes through their own C++ execute implementations. Build the
new opcodes directly in Rust codegen, and keep the generic call
fallback when the original builtin function has been replaced.
Cache the flattened enumerable key snapshot for each `for..in` site and
reuse a `PropertyNameIterator` when the receiver shape, dictionary
generation, indexed storage kind and length, prototype chain
validity, and magical-length state still match.
Handle packed indexed receivers as well as plain named-property
objects. Teach `ObjectPropertyIteratorNext` in `asmint.asm` to return
cached property values directly and to fall back to the slow iterator
logic when any guard fails.
Treat arrays' hidden non-enumerable `length` property as a visited
name for for-in shadowing, and include the receiver's magical-length
state in the cache key so arrays and plain objects do not share
snapshots.
Add `test-js` and `test-js-bytecode` coverage for mixed numeric and
named keys, packed receiver transitions, re-entry, iterator reuse, GC
retention, array length shadowing, and same-site cache reuse.
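A rough sketch of the per-site cache key comparison; the fields mirror the
guards listed above, but the struct and pointer types are illustrative
stand-ins for the real cache entry:

    #include <cstdint>

    struct ForInCacheKeySketch {
        void const* shape { nullptr };
        uint32_t dictionary_generation { 0 };
        uint8_t indexed_storage_kind { 0 };
        uint32_t indexed_storage_length { 0 };
        void const* prototype_chain_validity { nullptr };
        bool has_magical_length { false }; // arrays and plain objects must not share snapshots

        bool matches(ForInCacheKeySketch const& current) const
        {
            return shape == current.shape
                && dictionary_generation == current.dictionary_generation
                && indexed_storage_kind == current.indexed_storage_kind
                && indexed_storage_length == current.indexed_storage_length
                && prototype_chain_validity == current.prototype_chain_validity
                && has_magical_length == current.has_magical_length;
        }
    };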
Teach the asm PutByValue path to materialize in-bounds holey array
elements directly when the receiver is a normal extensible Array with
the default prototype chain and no indexed interference. This avoids
bouncing through generic property setting while preserving the lazy
holey length model.
Keep the fast path narrow so inherited setters, inherited non-writable
properties, and non-extensible arrays still fall back to the generic
semantics. Add regression coverage for those cases alongside the large
holey array stress tests.
Treat setting a large array length as a logical length change instead of
forcing dictionary indexed storage or materializing every hole up front.
This keeps dense fills on Array(length) on the holey indexed path and
only falls back to sparse storage when later writes actually create a
large realized gap.
The asm indexed get/put fast paths assumed holey arrays always had a
materialized backing store. Guard those paths with a capacity check so
lazy holey arrays fall back safely until an index has been realized.
Add regression coverage for very large holey arrays and for densely
filling a large holey array after pre-sizing it with Array(length).
Remove Bytecode::compile() and the old create() overloads on
ECMAScriptFunctionObject that accepted C++ AST nodes. These
have no remaining callers now that all compilation goes through
the Rust pipeline.
Also remove the if-constexpr Parse Node branch from
async_block_start, since the Statement template instantiation
was already removed.
Fix transitive include dependencies on Generator.h by adding
explicit includes for headers that were previously pulled in
transitively.
When the asmint computes a double result for Add, Sub, Mul,
Math.floor, Math.ceil, or Math.sqrt, try to store it as Int32
if the value is a whole number in [INT32_MIN, INT32_MAX] and
not -0.0. This mirrors the JS::Value(double) constructor and
allows downstream int32 fast paths to fire.
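A standalone sketch of that check (the function name is hypothetical; the
real logic lives in the asm handlers and mirrors JS::Value(double)):

    #include <cmath>
    #include <cstdint>
    #include <limits>
    #include <optional>

    std::optional<int32_t> try_canonicalize_to_int32(double value)
    {
        if (!(value >= std::numeric_limits<int32_t>::min()
              && value <= std::numeric_limits<int32_t>::max()))
            return std::nullopt;           // out of range, or NaN
        if (std::trunc(value) != value)
            return std::nullopt;           // not a whole number
        if (value == 0.0 && std::signbit(value))
            return std::nullopt;           // keep -0.0 boxed as a double
        return static_cast<int32_t>(value);
    }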
Also add label uniquification to the DSL macro expander so the
same macro can be used multiple times in one handler without
label collisions.
Teach js_to_int32 to leave a clean low 32-bit result on success, then
use box_int32_clean in the ToInt32 fast path and adjacent boolean
coercions. This removes one instruction from the AArch64 fjcvtzs path
and trims the boolean boxing path without changing behavior.
Add the small AsmIntGen float32 load, store, and conversion operations
needed to handle Float32Array directly in the AsmInt typed-array
GetByValue and PutByValue paths.
This covers direct indexed reads plus both int32 and double stores,
and adds regression coverage for Math.fround rounding, negative zero,
and NaN.
Teach the asm typed-array GetByValue and PutByValue paths to handle
Uint8ClampedArray directly. Reads can share the Uint8Array load path,
while int32 stores clamp in asm instead of bailing out to C++.
Add a direct indexed access regression test for clamped int32 stores.
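The clamp itself is simple; a C++ equivalent of what the asm store path
now does for int32 values, following Uint8ClampedArray semantics for
integer inputs:

    #include <cstdint>

    uint8_t clamp_int32_for_uint8_clamped(int32_t value)
    {
        if (value < 0)
            return 0;
        if (value > 255)
            return 255;
        return static_cast<uint8_t>(value);
    }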
Cache raw data pointers on fixed-length typed array views so asm
GetByValue and PutByValue can use them directly for indexed
element access.
Replace the asm typed-array hot-path
ArrayBuffer/DataBlock/ByteBuffer walk with one cached_data_ptr load.
Remove six unconditional loads, four branches, and the byte_offset
add before the element access, trading them for one
cached_data_ptr null check.
Keep direct C++ typed-array access on IsValidIntegerIndex-based
checks, invalidate cached pointers eagerly when a backing
ArrayBuffer is detached, and add regression coverage for shrink,
regrow, and detach on number and BigInt typed arrays.
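A sketch of the cached-pointer idea with illustrative names: the view keeps
a raw pointer to buffer data plus byte offset, and clears it eagerly on
detach so the fast path's single null check stays sound:

    #include <cstddef>
    #include <cstdint>

    struct TypedArrayViewSketch {
        uint8_t* cached_data_ptr { nullptr }; // buffer data + byte_offset, or null
        std::size_t length { 0 };

        void attach(uint8_t* buffer_data, std::size_t byte_offset, std::size_t element_count)
        {
            cached_data_ptr = buffer_data + byte_offset;
            length = element_count;
        }

        void on_detach()
        {
            cached_data_ptr = nullptr; // fast paths fall back to the C++ slow path
            length = 0;
        }

        bool fast_load_u8(std::size_t index, uint8_t& out) const
        {
            if (!cached_data_ptr || index >= length)
                return false; // one null/bounds check replaces the old pointer walk
            out = cached_data_ptr[index];
            return true;
        }
    };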
Use dedicated Packed branches in GetByValue and PutByValue so
in-bounds indexed accesses can skip hole checks and slot
reloads.
Keep Holey writes on the guarded arm, and keep append writes on
the C++ slow path so PutByValue still respects non-extensible
indexed objects and arrays with a non-writable length.
Add a bytecode regression that exercises both append failure
cases through the real js binary path.
Replace the OwnPtr<IndexedPropertyStorage> indirection with inline
indexed element storage directly on Object. This eliminates virtual
dispatch and reduces indirection for indexed property access.
The new system uses three storage kinds tracked by IndexedStorageKind:
- Packed: Dense array, no holes. Elements stored in a malloc'd Value*
array with capacity header (same layout as named properties).
- Holey: Dense array with possible holes marked by empty sentinel.
Same physical layout as Packed.
- Dictionary: Sparse storage using GenericIndexedPropertyStorage,
type-punned into the m_indexed_elements pointer.
Transitions: None->Packed->Holey->Dictionary (mostly monotonic).
Dictionary mode triggers on non-default attributes or sparse arrays.
Object keeps the same 48-byte size since m_indexed_elements (8 bytes)
replaces IndexedProperties (8 bytes), and the storage kind + array
size fit in existing padding alongside m_flags.
The asm interpreter benefits from one fewer indirection: it now reads
the element pointer and array size directly from Object fields instead
of chasing through OwnPtr -> IndexedPropertyStorage -> Vector.
Removes: IndexedProperties, SimpleIndexedPropertyStorage,
IndexedPropertyStorage, IndexedPropertyIterator.
Keeps: GenericIndexedPropertyStorage (for Dictionary mode).
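A sketch of the kind tracking; the enum values and m_indexed_elements field
mirror the description above, while the surrounding struct is illustrative:

    #include <cstdint>

    enum class IndexedStorageKind : uint8_t {
        None,       // no indexed elements yet
        Packed,     // dense, no holes; malloc'd Value* with capacity header
        Holey,      // dense with holes marked by the empty sentinel; same layout
        Dictionary, // sparse; GenericIndexedPropertyStorage* type-punned into the pointer
    };

    struct IndexedElementsSketch {
        void* m_indexed_elements { nullptr }; // Value* for Packed/Holey, storage object for Dictionary
        IndexedStorageKind kind { IndexedStorageKind::None };

        // Transitions are mostly monotonic:
        // None -> Packed -> Holey -> Dictionary.
        void transition_to(IndexedStorageKind new_kind)
        {
            if (new_kind > kind)
                kind = new_kind;
        }
    };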
Replace the 24-byte Vector<Value> m_storage with an 8-byte raw
Value* m_named_properties pointer, backed by a malloc'd allocation
with an inline capacity header.
Memory layout of the allocation:
[u32 capacity] [u32 padding] [Value 0] [Value 1] ...
m_named_properties points to Value 0.
This shrinks JS::Object from 64 to 48 bytes (on non-Windows
platforms) and removes one level of indirection for property access
in the asm interpreter, since the data pointer is now stored directly
on the object rather than inside a Vector's internal metadata.
Growth policy: max(4, max(needed, old_capacity * 2)).
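A standalone sketch of the layout and growth policy, using a plain uint64_t
slot in place of JS::Value; names and the helper itself are illustrative:

    #include <algorithm>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>

    struct SlotArraySketch {
        uint64_t* slots { nullptr }; // points at Value 0; capacity header sits just before it

        static uint32_t capacity_of(uint64_t* slots)
        {
            if (!slots)
                return 0;
            return *reinterpret_cast<uint32_t*>(reinterpret_cast<char*>(slots) - 8);
        }

        void ensure_capacity(uint32_t needed)
        {
            uint32_t old_capacity = capacity_of(slots);
            if (needed <= old_capacity)
                return;
            // Growth policy: max(4, max(needed, old_capacity * 2)).
            uint32_t new_capacity = std::max(4u, std::max(needed, old_capacity * 2));
            // Layout: [u32 capacity] [u32 padding] [Value 0] [Value 1] ...
            auto* allocation = static_cast<char*>(malloc(8 + new_capacity * sizeof(uint64_t)));
            *reinterpret_cast<uint32_t*>(allocation) = new_capacity;
            auto* new_slots = reinterpret_cast<uint64_t*>(allocation + 8);
            if (slots) {
                memcpy(new_slots, slots, old_capacity * sizeof(uint64_t));
                free(reinterpret_cast<char*>(slots) - 8);
            }
            slots = new_slots;
        }
    };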
Add Mov2 and Mov3 bytecode instructions that perform 2 or 3 register
moves in a single dispatch. A peephole optimization pass during
bytecode assembly merges consecutive Mov instructions within each
basic block into these combined instructions.
When merging, identical Movs are deduplicated (e.g. two identical Movs
become a single Mov, not a Mov2). This optimization is implemented in
both the C++ and Rust codegen pipelines.
The goal is to reduce the per-instruction dispatch overhead, which is
significant compared to the actual cost of moving a value.
This isn't fancy or elegant, but provides a real speed-up on many
workloads. As an example, Kraken/imaging-desaturate.js improves by
~1.07x on my laptop.
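A simplified sketch of the merge step over one run of consecutive Movs; the
real pass lives in the C++ and Rust assemblers and works on the actual
instruction stream, and the combined instruction is assumed to perform its
moves in source order:

    #include <cstdint>
    #include <vector>

    struct Mov { uint32_t dst; uint32_t src; bool operator==(Mov const&) const = default; };
    struct MovGroup { std::vector<Mov> movs; }; // emitted as Mov, Mov2, or Mov3

    std::vector<MovGroup> merge_consecutive_movs(std::vector<Mov> const& run)
    {
        std::vector<MovGroup> out;
        for (auto const& mov : run) {
            // Deduplicate: an exact repeat of the previous Mov is a no-op.
            if (!out.empty() && out.back().movs.back() == mov)
                continue;
            if (out.empty() || out.back().movs.size() == 3)
                out.push_back({});
            out.back().movs.push_back(mov);
        }
        return out;
    }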
Replace the 16-byte Variant<Empty, GC::Ref<Script>, GC::Ref<Module>>
with a simple 8-byte GC::Ptr<Cell> that points to either a Script or
Module (or is null for Empty).
A helper function script_or_module_from_cell() converts back to the
full ScriptOrModule variant when needed (e.g. in
VM::get_active_script_or_module).
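A sketch of the conversion, with stand-in types; in the engine the pointer
is a GC::Ptr<Cell> and the helper rebuilds the original variant on demand:

    #include <variant>

    struct Cell { virtual ~Cell() = default; };
    struct Script : Cell { };
    struct Module : Cell { };
    struct Empty { };
    using ScriptOrModule = std::variant<Empty, Script*, Module*>;

    ScriptOrModule script_or_module_from_cell(Cell* cell)
    {
        if (!cell)
            return Empty {};
        if (auto* script = dynamic_cast<Script*>(cell))
            return script;
        return static_cast<Module*>(cell); // only Script or Module is ever stored
    }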
Remove four fields that are trivially derivable from other fields
already present in the ExecutionContext:
- global_object (from realm)
- global_declarative_environment (from realm)
- identifier_table (from executable)
- property_key_table (from executable)
This shrinks ExecutionContext from 192 to 160 bytes (-17%).
The asmint's GetGlobal/SetGlobal handlers now load through the realm
pointer, taking advantage of the cached declarative environment
pointer added in the previous commit.
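A sketch of the idea: the removed fields become trivial accessors over the
realm and executable pointers the context already carries (all types below
are simplified stand-ins):

    struct GlobalObject { };
    struct DeclarativeEnvironment { };
    struct Realm {
        GlobalObject* global_object;
        DeclarativeEnvironment* global_declarative_environment; // cached next to global_object
    };
    struct IdentifierTable { };
    struct Executable { IdentifierTable* identifier_table; };

    struct ExecutionContextSketch {
        Realm* realm;
        Executable* executable;

        GlobalObject* global_object() const { return realm->global_object; }
        DeclarativeEnvironment* global_declarative_environment() const
        {
            return realm->global_declarative_environment;
        }
        IdentifierTable* identifier_table() const { return executable->identifier_table; }
    };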
Add branch operations that test only the low 32 bits of a register,
replacing the previous pattern of `and reg, 0xFFFFFFFF` followed by
`branch_zero` or `branch_nonzero`.
On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.
On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.
On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.
Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
Instead of storing a u32 index into a cache vector and looking up the
cache at runtime through a chain of dependent loads (load Executable*,
load vector data pointer, multiply index, add), store the actual cache
pointer as a u64 directly in the instruction stream.
A fixup pass (Executable::fixup_cache_pointers()) runs after Executable
construction in both the Rust and C++ pipelines, walking the bytecode
and replacing each index with the corresponding pointer.
The cache pointer type is encoded in Bytecode.def (e.g.
PropertyLookupCache*, GlobalVariableCache*) so the fixup switch is
auto-generated by the Python Op code generator, making it impossible
to forget updating the fixup when adding new cached instructions.
This eliminates 3-4 dependent loads on every inline cache access in
both the C++ interpreter and the assembly interpreter.
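An illustrative sketch of the pass; in the real pipeline the walk and the
per-opcode switch are generated from Bytecode.def, whereas this stand-in
hard-codes a list of operand offsets:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct PropertyLookupCache { /* ... */ };

    struct ExecutableSketch {
        std::vector<uint8_t> bytecode;
        std::vector<PropertyLookupCache> property_lookup_caches;

        // operand_offsets: byte offsets of cache operands that were emitted
        // as u32 indices into 8-byte slots in the instruction stream.
        void fixup_cache_pointers(std::vector<std::size_t> const& operand_offsets)
        {
            for (std::size_t offset : operand_offsets) {
                uint32_t index;
                memcpy(&index, bytecode.data() + offset, sizeof(index));
                uint64_t pointer = reinterpret_cast<uint64_t>(&property_lookup_caches[index]);
                memcpy(bytecode.data() + offset, &pointer, sizeof(pointer));
            }
        }
    };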
Property lookup cache entries previously used GC::Weak<T> for shape,
prototype, and prototype_chain_validity pointers. Each GC::Weak
requires a ref-counted WeakImpl allocation and an extra indirection
on every access.
Replace these with GC::RawPtr<T> and make Executable a WeakContainer
so the GC can clear stale pointers during sweep via remove_dead_cells.
For static PropertyLookupCache instances (used throughout the runtime
for well-known property lookups), introduce StaticPropertyLookupCache
which registers itself in a global list that also gets swept.
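A very rough sketch of the sweep hook; the real types are GC::RawPtr,
WeakContainer, and remove_dead_cells, while the Cell stand-in and the
live-set representation here are simplified for illustration:

    #include <unordered_set>
    #include <vector>

    struct Cell { };

    struct CacheEntrySketch {
        Cell* shape { nullptr };
        Cell* prototype { nullptr };
        Cell* prototype_chain_validity { nullptr };
    };

    struct WeakContainerSketch {
        std::vector<CacheEntrySketch> caches;

        // Called during GC sweep: clear any raw pointer whose cell died.
        void remove_dead_cells(std::unordered_set<Cell*> const& live_cells)
        {
            for (auto& entry : caches) {
                if (entry.shape && !live_cells.count(entry.shape))
                    entry.shape = nullptr;
                if (entry.prototype && !live_cells.count(entry.prototype))
                    entry.prototype = nullptr;
                if (entry.prototype_chain_validity && !live_cells.count(entry.prototype_chain_validity))
                    entry.prototype_chain_validity = nullptr;
            }
        }
    };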
Now that inline cache entries use GC::RawPtr instead of GC::Weak,
we can compare shape/prototype pointers directly without going
through the WeakImpl indirection. This removes one dependent load
from each IC check in GetById, PutById, GetLength, GetGlobal, and
SetGlobal handlers.
SimpleIndexedPropertyStorage can only hold default-attributed data
properties. Any attempt to store a property with non-default
attributes (such as accessors) triggers conversion to
GenericIndexedPropertyStorage first. So when we've already verified
is_simple_storage, the accessor check is dead code.
Instead of calling into C++ helpers for global let/const variable
access, inline the binding lookup directly in the asm handlers.
This avoids the overhead of a C++ call for the common case.
Module environments still use the C++ helper since they require
additional lookups that aren't worth inlining.
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.
On aarch64, this produces significant improvements:
- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
(3-operand shifts are free on ARM). Saves 1 instruction at 57
call sites.
- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
`mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
immediate. Saves 2 instructions at 6 call sites.
- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
`mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
mov zero-extends, and movk overwrites just the top 16 bits.
Saves 2 instructions and no longer clobbers t0 (rax).
- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
clobber).
On x86_64, the generated code is equivalent to the old macros.
Two more spots where the upper 32 bits are already known to be clear:
- UnsignedRightShift: after shr on a zero-extended value, the upper bits
are already clear.
- GetByValue typed array path: load32/load8/load16/load8s/load16s all
write to 32-bit destination registers, zeroing the upper 32 bits.
Both can use box_int32_clean to skip the redundant AND 0xFFFFFFFF.