Namely, find an upper bound at validation time so we can allocate the
space when entering the frame.
Also drop labels at once instead of popping them off one at a time now
that we're using a Vector.
This still passes the values on the stack, but registers are now allowed
to cross a call boundary.
This is a very significant (>50%) improvement on the small call
microbenchmarks on my machine.
Largely combinations of i32.const and local.get.
This shaves off at most single-digit% number of instructions from
dispatch, which translates to at most ~10% reduced dispatch time.
Across most benchmarks, this gains around ~5% perf increase.
This commit adds a register allocator, with 8 available "register"
slots.
In testing with various random blobs, this moves anywhere from 30% to
74% of value accesses into predefined slots, and is about a ~20% perf
increase end-to-end.
To actually make this usable, a few structural changes were also made:
- we no longer do one instruction per interpret call
- trapping is an (unlikely) exit condition
- the label and frame stacks are replaced with linked lists with a huge
node cache size, as we only need to touch the last element and
push/pop is very frequent.
If the address is already aligned properly, just read a T from it;
otherwise copy it to a local aligned array. This was a bottleneck on
memory-heavy benchmarks.
...instead of specially handling JS::Completion.
This makes it possible for LibWeb/LibJS to have full control over how
these things are made, stored, and visited (whenever).
Fixes an issue where we couldn't roundtrip a JS exception through Wasm.