The RehighlightState designated initializer used `.position = {}`,
which invokes TextPosition's default constructor, initializing line
and column to 0xFFFFFFFF (the "invalid" sentinel). This overrode
the struct's default member initializer of { 0, 0 }.

When advance_position() processed the first newline, it incremented
the line from 0xFFFFFFFF to 0x100000000, producing line numbers in
the billions. These bogus positions propagated into folding regions,
causing an out-of-bounds crash in Document::set_folding_regions()
when viewing page source on pages with <script> blocks.

Fix by explicitly initializing position to { 0, 0 }.

Fixes #8529.
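The initializer pitfall can be reproduced in isolation. The sketch
below uses simplified stand-ins for TextPosition and RehighlightState;
the member layout, the 64-bit field width, and the body of
advance_position() are assumptions, not the real definitions.

```cpp
#include <cstdint>

// Simplified stand-in for TextPosition: default construction yields the
// "invalid" sentinel. The 64-bit field width is an assumption.
struct TextPosition {
    std::uint64_t line { 0xFFFFFFFF };
    std::uint64_t column { 0xFFFFFFFF };

    TextPosition() = default;
    TextPosition(std::uint64_t l, std::uint64_t c)
        : line(l)
        , column(c)
    {
    }
};

// Simplified stand-in for RehighlightState.
struct RehighlightState {
    TextPosition position { 0, 0 }; // default member initializer
};

// Hypothetical reduction of advance_position(): a newline bumps the line.
inline void advance_position(TextPosition& position, char ch)
{
    if (ch == '\n') {
        ++position.line;
        position.column = 0;
    } else {
        ++position.column;
    }
}

inline RehighlightState make_buggy_state()
{
    // `= {}` value-initializes TextPosition through its default
    // constructor, so the { 0, 0 } member initializer is never used.
    return RehighlightState { .position = {} };
}

inline RehighlightState make_fixed_state()
{
    // Spelling out { 0, 0 } selects the two-argument constructor.
    return RehighlightState { .position = { 0, 0 } };
}
```

With these stand-ins, the first newline moves the buggy state's line
from 0xFFFFFFFF to 0x100000000, while the fixed state moves from 0 to 1.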

Delete Lexer.cpp/h and Token.cpp, replacing all tokenization with a
new rust_tokenize() FFI function that calls back for each token.

Rewrite SyntaxHighlighter.cpp and js.cpp REPL to use the Rust
tokenizer. The token type and category enums in Token.h now mirror
the Rust definitions in token.rs.

Move is_syntax_character/is_whitespace/is_line_terminator helpers
into RegExpConstructor.cpp as static functions, since they were only
used there.
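The actual signature of rust_tokenize() is not given here. As a rough
illustration of the callback-per-token shape only, the sketch below
uses a hypothetical entry point, demo_tokenize(), written in C++ as a
stand-in for the foreign side. The DemoToken layout and the
split-on-spaces logic are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record: an offset and length into the source text.
struct DemoToken {
    std::size_t offset;
    std::size_t length;
};

// C-compatible callback type, invoked once per token with caller context.
using DemoTokenCallback = void (*)(void* context, DemoToken const* token);

// Stand-in for the foreign tokenizer: emits each space-separated run.
inline void demo_tokenize(char16_t const* source, std::size_t length,
    void* context, DemoTokenCallback callback)
{
    std::size_t i = 0;
    while (i < length) {
        while (i < length && source[i] == u' ')
            ++i;
        std::size_t start = i;
        while (i < length && source[i] != u' ')
            ++i;
        if (i > start) {
            DemoToken token { start, i - start };
            callback(context, &token);
        }
    }
}

// Caller side: gather token texts through the opaque context pointer.
inline std::vector<std::u16string> collect_tokens(std::u16string const& source)
{
    std::vector<std::u16string> tokens;
    struct Context {
        std::u16string const& source;
        std::vector<std::u16string>& tokens;
    } context { source, tokens };

    demo_tokenize(source.data(), source.size(), &context,
        [](void* raw, DemoToken const* token) {
            auto& ctx = *static_cast<Context*>(raw);
            ctx.tokens.push_back(ctx.source.substr(token->offset, token->length));
        });
    return tokens;
}
```

The captureless lambda converts to a plain function pointer, which is
what allows it to be handed across a C-style boundary.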

This moves the responsibility of setting up a SourceCode object to the
users of JS::Lexer.

This means Lexer and Parser are free to use string views into the
SourceCode internally while working.

It also means Lexer no longer has to think about anything other than
UTF-16 (or ASCII) inputs. So the unit test for parsing various invalid
UTF-8 sequences is deleted here.
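A rough sketch of this ownership arrangement, using hypothetical
DemoSourceCode and DemoLexer types rather than the real classes: the
caller builds the source object, and the lexer only hands out views
into text that the shared source keeps alive.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for SourceCode: owns the UTF-16 text.
struct DemoSourceCode {
    std::u16string code;
};

// Hypothetical stand-in for the lexer: never copies the text.
class DemoLexer {
public:
    explicit DemoLexer(std::shared_ptr<DemoSourceCode const> source)
        : m_source(std::move(source))
        , m_remaining(m_source->code)
    {
    }

    // Returns the next run of non-space code units as a view into the
    // source, or an empty view once the input is exhausted.
    std::u16string_view next_word()
    {
        constexpr auto npos = std::u16string_view::npos;
        auto start = m_remaining.find_first_not_of(u' ');
        if (start == npos) {
            m_remaining = {};
            return {};
        }
        auto end = m_remaining.find_first_of(u' ', start);
        auto word = m_remaining.substr(start, end == npos ? npos : end - start);
        m_remaining = end == npos ? std::u16string_view {} : m_remaining.substr(end);
        return word;
    }

private:
    std::shared_ptr<DemoSourceCode const> m_source; // keeps the text alive
    std::u16string_view m_remaining;                // view into m_source->code
};
```

Because the caller has already produced UTF-16 text, nothing in this
sketch needs to validate or decode UTF-8.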

This ports the lexer to UTF-16 and deals with the immediate fallout up
to the AST. The AST will be dealt with in upcoming commits.

The lexer will still accept UTF-8 strings as input, and will transcode
them to UTF-16 for lexing. This doesn't actually incur a new allocation,
as we were already converting the input StringView to a ByteString for
each lexer.

One immediate logical benefit here is that we do not need to know
offhand how many UTF-8 bytes some special code points occupy. They all
happen to be a single UTF-16 code unit. So instead of advancing the
lexer by 3 positions in some cases, we can just always advance by 1.
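To illustrate with U+2028 LINE SEPARATOR: it encodes as the three bytes
E2 80 A8 in UTF-8 but as a single UTF-16 code unit. The helper names
below are hypothetical and only sketch how far a lexer has to advance
in each encoding.

```cpp
#include <cstddef>
#include <string_view>

// UTF-8: a match on U+2028 consumes three bytes.
inline std::size_t line_separator_length_utf8(std::string_view input, std::size_t offset)
{
    constexpr std::string_view line_separator = "\xE2\x80\xA8";
    if (offset > input.size())
        return 0;
    return input.substr(offset, 3) == line_separator ? 3 : 0;
}

// UTF-16: the same code point is one code unit, so advance by one.
inline std::size_t line_separator_length_utf16(std::u16string_view input, std::size_t offset)
{
    return offset < input.size() && input[offset] == u'\u2028' ? 1 : 0;
}
```

Both helpers return 0 when the code point is absent, so a caller can
use the result directly as the amount to advance.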

Trivia is whatever whitespace and comments appear before a token.
Previously this was always given a TokenCategory of Invalid, so it
would be displayed as an error in the view-source page, with red wiggly
underlines. Instead, treat it as what it actually is: whitespace and
comments!