This patch adds support for parsing structs in the type section.
It also removes the assumption that all types in the type section are
function types, adding appropriate validation.
Spec tests struct.3 and struct.4 have been disable as this would
require expanding `ValueType` to include more heap-types.
The label stack should be shrunk to the frame's label_index (exclusive),
not label_index + 1. Also add the missing shrink call for
return_call_indirect.
Instead of trying to indirectly load 2x64 bits from *cc, load addresses
directly from their own contiguous allocation.
This allows a future optimisation where we defer loading addresses to
reduce memory port pressure.
As JS::Value is marked "trivial" without actually being trivial, make
the one user that would lead to garbage JS::Value entries provide a
default value instead.
Previously we were reading the arguments in an incorrect order, and
placing the result in the wrong slot.
This also removes the hacky implementation of accumulative relaxed dot,
and just implements it directly as a new operator.
This is supported starting GCC 15.
The warning -Wmaybe-musttail-local-addr complained about &value possibly
escaping (it cannot, but gcc is being pessimistic about
store_to_memory), so a little rearrangement of that function was
necessary.
Instead of doing a naive O(n^2) liveness detection loop, use a bitmap
for values allocated to registers.
This cuts down validating time from 20% to 1.4% of runtime on the same
game as last commit.
This, along with moving the sources and destination out of the config
object, makes it so we don't have to double-deref to get to them on each
instruction, leading to a ~15% perf improvement on dispatch.
Namely, find an upper bound at validation time so we can allocate the
space when entering the frame.
Also drop labels at once instead of popping them off one at a time now
that we're using a Vector.
This still passes the values on the stack, but registers are now allowed
to cross a call boundary.
This is a very significant (>50%) improvement on the small call
microbenchmarks on my machine.
Largely combinations of i32.const and local.get.
This shaves off at most single-digit% number of instructions from
dispatch, which translates to at most ~10% reduced dispatch time.
Across most benchmarks, this gains around ~5% perf increase.
This commit adds a register allocator, with 8 available "register"
slots.
In testing with various random blobs, this moves anywhere from 30% to
74% of value accesses into predefined slots, and is about a ~20% perf
increase end-to-end.
To actually make this usable, a few structural changes were also made:
- we no longer do one instruction per interpret call
- trapping is an (unlikely) exit condition
- the label and frame stacks are replaced with linked lists with a huge
node cache size, as we only need to touch the last element and
push/pop is very frequent.
If the address is already aligned properly, just read a T from it;
otherwise copy it to a local aligned array. This was a bottleneck on
memory-heavy benchmarks.
...instead of specially handling JS::Completion.
This makes it possible for LibWeb/LibJS to have full control over how
these things are made, stored, and visited (whenever).
Fixes an issue where we couldn't roundtrip a JS exception through Wasm.