The Rust FFI requires UTF-16 source data, so ASCII-stored source code
must be widened to UTF-16. Previously, this conversion was done into a
temporary buffer on every call to compile_function, meaning the entire
source file was converted for each lazily-compiled function. For large
modules with many functions, this caused heavy spinning.
Move the conversion into SourceCode::utf16_data() which lazily converts
and caches the result once per source file. Subsequent compilations of
functions from the same file reuse the cached data.
This ports the lexer to UTF-16 and deals with the immediate fallout up
to the AST. The AST will be dealt with in upcoming commits.
The lexer will still accept UTF-8 strings as input, and will transcode
them to UTF-16 for lexing. This doesn't actually incur a new allocation,
as we were already converting the input StringView to a ByteString for
each lexer.
One immediate logical benefit here is that we do not need to know off-
hand how many UTF-8 bytes some special code points occupy. They all
happen to be a single UTF-16 code unit. So instead of advancing the
lexer by 3 positions in some cases, we can just always advance by 1.
This reverts commit c14173f651. We
should only annotate the minimum number of symbols that external
consumers actually use, so I am starting from scratch to do that