Store source map locations as bytecode offset, line, and column.
Runtime consumers only emit the start line and column, so source end
positions and source text offsets do not need to be carried through
Executable source maps, bytecode cache serialization, or the Rust FFI.
Keep SourceCode's internal position cache able to track source text
offsets so callers can still translate source offsets to line and
column pairs when needed. Hash dump-bytecode IDs from the name, first
source position, and bytecode size instead of from source slices, which
need end offsets.
Bump the bytecode cache format version for the slimmer serialized
source map entry shape.
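
As a rough illustration of the slimmer entry shape and the new ID inputs,
here is a minimal sketch. The struct layout, the function names, and the
use of FNV-1a as the hash are all assumptions made for this example; the
commit does not say which hash or field widths the real code uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical slim entry: bytecode offset plus start line and column only.
// No end position and no source text offsets are stored.
struct SourceMapEntry {
    uint32_t bytecode_offset;
    uint32_t line;
    uint32_t column;
};

// Returns the last entry whose bytecode_offset is <= the queried offset,
// or nullptr if there is none. Assumes entries are sorted by bytecode_offset.
inline SourceMapEntry const* find_entry(std::vector<SourceMapEntry> const& entries, uint32_t bytecode_offset)
{
    assert(std::is_sorted(entries.begin(), entries.end(),
        [](SourceMapEntry const& a, SourceMapEntry const& b) { return a.bytecode_offset < b.bytecode_offset; }));
    auto it = std::upper_bound(entries.begin(), entries.end(), bytecode_offset,
        [](uint32_t offset, SourceMapEntry const& entry) { return offset < entry.bytecode_offset; });
    if (it == entries.begin())
        return nullptr;
    return &*(it - 1);
}

// One FNV-1a step (stand-in hash, chosen only for this sketch).
inline uint64_t mix_byte(uint64_t hash, uint8_t byte)
{
    return (hash ^ byte) * 0x100000001b3ULL;
}

// Mixes a 64-bit value byte by byte, least significant byte first.
inline uint64_t mix_u64(uint64_t hash, uint64_t value)
{
    for (int i = 0; i < 8; ++i)
        hash = mix_byte(hash, static_cast<uint8_t>(value >> (8 * i)));
    return hash;
}

// Derives an ID from the name, the first source position, and the bytecode
// size. No source slice (and therefore no end offset) is needed.
inline uint64_t dump_bytecode_id(std::string_view name, SourceMapEntry const& first, uint64_t bytecode_size)
{
    uint64_t hash = 0xcbf29ce484222325ULL;
    for (unsigned char c : name)
        hash = mix_byte(hash, c);
    hash = mix_u64(hash, first.line);
    hash = mix_u64(hash, first.column);
    hash = mix_u64(hash, bytecode_size);
    return hash;
}

// Example data: three entries sorted by bytecode offset.
inline std::vector<SourceMapEntry> example_entries()
{
    return { { 0, 1, 1 }, { 4, 1, 10 }, { 9, 2, 3 } };
}
```

With the example data, looking up bytecode offset 5 yields the entry at
line 1, column 10, since that is the last entry starting at or before it.
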
The Rust FFI requires UTF-16 source data, so ASCII-stored source code
must be widened to UTF-16. Previously, this conversion was done into a
temporary buffer on every call to compile_function, meaning the entire
source file was converted for each lazily compiled function. For large
modules with many functions, the repeated conversion wasted a lot of
CPU time.
Move the conversion into SourceCode::utf16_data(), which lazily converts
and caches the result once per source file. Subsequent compilations of
functions from the same file reuse the cached data.
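
The convert-once-and-cache pattern described above can be sketched as
follows. SourceCodeSketch, compile_function_sketch, and conversion_count
are hypothetical names invented for this example, and the real
SourceCode class may differ in storage, types, and thread safety; only
the utf16_data() name comes from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a source file that is stored as ASCII.
class SourceCodeSketch {
public:
    explicit SourceCodeSketch(std::string ascii)
        : m_ascii(std::move(ascii))
    {
    }

    // Widens the ASCII source to UTF-16 on first use and caches the result.
    // Later calls return a view of the cached buffer without converting again.
    // Note: this sketch is not thread-safe.
    std::u16string_view utf16_data() const
    {
        if (!m_utf16.has_value()) {
            std::u16string widened;
            widened.reserve(m_ascii.size());
            for (unsigned char c : m_ascii) {
                assert(c < 0x80); // ASCII only, so each byte maps to one code unit.
                widened.push_back(static_cast<char16_t>(c));
            }
            m_utf16 = std::move(widened);
            ++m_conversion_count;
        }
        return *m_utf16;
    }

    // Exposed only so the example can show how often conversion ran.
    std::size_t conversion_count() const { return m_conversion_count; }

private:
    std::string m_ascii;
    mutable std::optional<std::u16string> m_utf16;
    mutable std::size_t m_conversion_count { 0 };
};

// Stand-in for a per-function compile call that needs UTF-16 input.
// It returns the number of code units it was handed.
inline std::size_t compile_function_sketch(SourceCodeSketch const& source)
{
    return source.utf16_data().size();
}
```

Calling compile_function_sketch any number of times on the same object
performs the widening once, so the conversion cost no longer scales with
the number of lazily compiled functions.
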
This ports the lexer to UTF-16 and deals with the immediate fallout up
to the AST. The AST will be dealt with in upcoming commits.
The lexer will still accept UTF-8 strings as input, and will transcode
them to UTF-16 for lexing. This doesn't actually incur a new allocation,
as we were already converting the input StringView to a ByteString for
each lexer.
One immediate benefit is that we no longer need to know offhand how
many UTF-8 bytes certain special code points occupy. Each of them is a
single UTF-16 code unit, so instead of advancing the lexer by 3
positions in some cases, we can always advance by 1.
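
To make the code unit arithmetic concrete, here is a small sketch. The
commit does not list which code points it means; U+2028 (LINE SEPARATOR)
and U+2029 (PARAGRAPH SEPARATOR) are used here as assumed examples, and
all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of bytes a Unicode code point occupies in UTF-8.
constexpr std::size_t utf8_length(char32_t code_point)
{
    return code_point < 0x80 ? 1 : code_point < 0x800 ? 2 : code_point < 0x10000 ? 3 : 4;
}

// Number of code units a Unicode code point occupies in UTF-16.
constexpr std::size_t utf16_length(char32_t code_point)
{
    return code_point < 0x10000 ? 1 : 2;
}

// U+2028 and U+2029 need three bytes in UTF-8 but one code unit in UTF-16.
static_assert(utf8_length(0x2028) == 3 && utf16_length(0x2028) == 1, "U+2028");
static_assert(utf8_length(0x2029) == 3 && utf16_length(0x2029) == 1, "U+2029");

// With UTF-16 input, every line terminator checked here is a single code
// unit, so the check is one comparison per unit.
constexpr bool is_line_terminator(char16_t unit)
{
    return unit == u'\n' || unit == u'\r' || unit == 0x2028 || unit == 0x2029;
}

// Counts line terminator code units, always advancing by exactly 1.
inline std::size_t count_line_terminators(std::u16string_view source)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (is_line_terminator(source[i]))
            ++count;
    }
    return count;
}

// Usage example: three terminators, two of them outside ASCII.
inline void example()
{
    assert(count_line_terminators(u"a\u2028b\u2029c\nd") == 3);
}
```

A UTF-8 lexer doing the same scan would have to match a three-byte
sequence for the two separators and then skip all three bytes.
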
This reverts commit c14173f651.
We should only annotate the minimum number of symbols that external
consumers actually use, so I am starting from scratch to do that.