Keep track of all expressions encountered while
generating a rewrite result, and re-use them whenever possible.
Named expressions may still be used for clarity when desired.
Change-Id: I640dca108763eb8baeff8f9a4169300af3445b82
Reviewed-on: https://go-review.googlesource.com/c/go/+/229800
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
This is followup of CL 228860, which rewrite shift rules to use typed
aux. That CL introduced nlz* functions, to refactor left shift rules.
While at it, we realize there's a bug in old rules with both right/left
shift rules, but only fix for left shift rules only.
This CL fixes the bug for right shift rules.
Passes toolstash-check.
Change-Id: Id8f2158b1b66c9e87f3fdeaa7ae3e35dc0666f8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/229137
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
This optimization works on any integer with exactly one bit set.
This is identical to being a power of two, except in the
most negative number. Use oneBit instead.
The rule now triggers in a few more places in std+cmd,
in packages encoding/asn1, crypto/elliptic, and
vendor/golang.org/x/crypto/cryptobyte.
This change obviates the need for CL 222479
by doing this optimization consistently in the compiler.
Change-Id: I983c6235290fdc634fda5e11b10f1f8ce041272f
Reviewed-on: https://go-review.googlesource.com/c/go/+/229124
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Surprisingly many rules needed no modification.
Use wrapper functions for aux like we did for auxint.
Simplifies things a bit.
Change-Id: I2e852e77f1585dcb306a976ab9335f1ac5b4a770
Reviewed-on: https://go-review.googlesource.com/c/go/+/227961
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Move a lot of the constant folding rules to use strongly
typed AuxInt fields.
We need more than a cast to convert AuxInt to, e.g., float32.
Make conversion functions for converting back and forth.
Change-Id: Ia3d95ee3583ee2179a10938e20210a7617358c88
Reviewed-on: https://go-review.googlesource.com/c/go/+/227866
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Missed as part of CL 221790. It isn't just * and / that can make NaNs.
Update #36400Fixes#38359
Change-Id: I3fa562f772fe03b510793a6dc0cf6189c0c3e652
Reviewed-on: https://go-review.googlesource.com/c/go/+/227860
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alberto Donizetti <alb.donizetti@gmail.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Will help with strongly typed rewrite rules.
Change-Id: Ifbf316a49f4081322b3b8f13bc962713437d9aba
Reviewed-on: https://go-review.googlesource.com/c/go/+/227785
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Right now the Aux and AuxInt fields of ssa.Values are typed as
interface{} and int64, respectively. Each rule that uses these values
must cast them to the type they actually are (*obj.LSym, or int32, or
ValAndOff, etc.), use them, and then cast them back to interface{} or
int64.
We know for each opcode what the types of the Aux and AuxInt fields
should be. So let's modify the rule generator to declare the types to
be what we know they should be, autoconverting to and from the generic
types for us. That way we can make the rules more type safe.
It's difficult to make a single CL for this, so I've coopted the "=>"
token to indicate a rule that is strongly typed. "->" rules are
processed as before. That will let us migrate a few rules at a time in
separate CLs. Hopefully we can reach a state where all rules are
strongly typed and we can drop the distinction.
This CL changes just a few rules to get a feel for what this
transition would look like.
I've decided not to put explicit types in the rules. I think it
makes the rules somewhat clearer, but definitely more verbose.
In particular, the passthrough rules that don't modify the fields
in question are verbose for no real reason.
Change-Id: I63a1b789ac5702e7caf7934cd49f784235d1d73d
Reviewed-on: https://go-review.googlesource.com/c/go/+/190197
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Extend CL 220417 (which removed the integer Greater and Geq ops) to
floating point comparisons. Greater and Geq can always be
implemented using Less and Leq.
Fixes#37316.
Change-Id: Ieaddb4877dd0ff9037a1dd11d0a9a9e45ced71e7
Reviewed-on: https://go-review.googlesource.com/c/go/+/222397
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Retrying CL 222782, with a fix that will hopefully stop the random crashing.
The issue with the previous CL is that it does pointer arithmetic
in a way that may briefly generate an out-of-bounds pointer. If an
interrupt happens to occur in that state, the referenced object may
be collected incorrectly.
Suppose there was code that did s[x+c]. The previous CL had a rule
to the effect of ptr + (x + c) -> c + (ptr + x). But ptr+x is not
guaranteed to point to the same object as ptr. In contrast,
ptr+(x+c) is guaranteed to point to the same object as ptr, because
we would have already checked that x+c is in bounds.
For example, strconv.trim used to have this code:
MOVZX -0x1(BX)(DX*1), BP
CMPL $0x30, AL
After CL 222782, it had this code:
LEAL 0(BX)(DX*1), BP
CMPB $0x30, -0x1(BP)
An interrupt between those last two instructions could see BP pointing
outside the backing store of the slice involved.
It's really hard to actually demonstrate a bug. First, you need to
have an interrupt occur at exactly the right time. Then, there must
be no other pointers to the object in question. Since the interrupted
frame will be scanned conservatively, there can't even be a dead
pointer in another register or on the stack. (In the example above,
a bug can't happen because BX still holds the original pointer.)
Then, the object in question needs to be collected (or at least
scanned?) before the interrupted code continues.
This CL needs to handle load combining somewhat differently than CL 222782
because of the new restriction on arithmetic. That's the only real
difference (other than removing the bad rules) from that old CL.
This bug is also present in the amd64 rewrite rules, and we haven't
seen any crashing as a result. I will fix up that code similarly to
this one in a separate CL.
Update #37881
Change-Id: I5f0d584d9bef4696bfe89a61ef0a27c8d507329f
Reviewed-on: https://go-review.googlesource.com/c/go/+/225798
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Trying this CL again, with a fixed test that allows platforms
to disagree on the exact behavior of converting NaNs.
We store 32-bit floating point constants in a 64-bit field, by
converting that 32-bit float to 64-bit float to store it, and convert
it back to use it.
That works for *almost* all floating-point constants. The exception is
signaling NaNs. The round trip described above means we can't represent
a 32-bit signaling NaN, because conversions strip the signaling bit.
To fix this issue, just forbid NaNs as floating-point constants in SSA
form. This shouldn't affect any real-world code, as people seldom
constant-propagate NaNs (except in test code).
Additionally, NaNs are somewhat underspecified (which of the many NaNs
do you get when dividing 0/0?), so when cross-compiling there's a
danger of using the compiler machine's NaN regime for some math, and
the target machine's NaN regime for other math. Better to use the
target machine's NaN regime always.
Update #36400
Change-Id: Idf203b688a15abceabbd66ba290d4e9f63619ecb
Reviewed-on: https://go-review.googlesource.com/c/go/+/221790
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
In cases in which we had a named value whose args were all _,
like this rule from ARM.rules:
(MOVBUreg x:(MOVBUload _ _)) -> (MOVWreg x)
We previously inserted
_ = x.Args[1]
even though it is unnecessary.
This change eliminates this pointless bounds check.
And in other cases, we now check bounds just as far as strictly necessary.
No significant movement on any compiler metrics.
Just nicer (and less) code.
Passes toolstash-check -all.
Change-Id: I075dfe9f926cc561cdc705e9ddaab563164bed3a
Reviewed-on: https://go-review.googlesource.com/c/go/+/221781
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This:
* Simplifies and shortens the generated code for rewrite rules.
* Shrinks cmd/compile by 86k (0.4%) and makes it easier to compile.
* Removes the stmt boundary code wrangling from Value.reset,
in favor of doing it in the one place where it actually does some work,
namely the writebarrier pass. (This was ascertained by inspecting the
code for cases in which notStmtBoundary values were generated.)
Passes toolstash-check -all.
Change-Id: I25671d4c4bbd772f235195d11da090878ea2cc07
Reviewed-on: https://go-review.googlesource.com/c/go/+/221421
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
There are often many values to clobber.
Allow passing them all in at once.
The goal is increased rule readability.
As a bonus, it shrinks cmd/compile by ~97k, almost half a percent.
Package SSA requires 1.2% less memory to compile.
The single-line changes were make via regex,
and the remaining multi-line clobbers were manually combined.
Passes toolstash-check -all.
Change-Id: Ib310e9265d3616211f8192c9040b4c8933824d19
Reviewed-on: https://go-review.googlesource.com/c/go/+/220691
Reviewed-by: Michael Munday <mike.munday@ibm.com>
The generic Greater and Geq ops can always be replaced with the Less and
Leq ops. This CL therefore removes them. This simplifies the compiler since
it reduces the number of operations that need handling in both code and in
rewrite rules. This will be especially true when adding control flow
optimizations such as the integer-in-range optimizations in CL 165998.
Change-Id: If0648b2b19998ac1bddccbf251283f3be4ec3040
Reviewed-on: https://go-review.googlesource.com/c/go/+/220417
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This reverts CL 213477.
Reason for revert: tests are failing on linux-mips*-rtrk builders.
Change-Id: I8168f7450890233f1bd7e53930b73693c26d4dc0
Reviewed-on: https://go-review.googlesource.com/c/go/+/220897
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
We store 32-bit floating point constants in a 64-bit field, by
converting that 32-bit float to 64-bit float to store it, and convert
it back to use it.
That works for *almost* all floating-point constants. The exception is
signaling NaNs. The round trip described above means we can't represent
a 32-bit signaling NaN, because conversions strip the signaling bit.
To fix this issue, just forbid NaNs as floating-point constants in SSA
form. This shouldn't affect any real-world code, as people seldom
constant-propagate NaNs (except in test code).
Additionally, NaNs are somewhat underspecified (which of the many NaNs
do you get when dividing 0/0?), so when cross-compiling there's a
danger of using the compiler machine's NaN regime for some math, and
the target machine's NaN regime for other math. Better to use the
target machine's NaN regime always.
This has been a bug since 1.10, and there's an easy workaround
(declare a global varaible containing the signaling NaN pattern, and
use that as the argument to math.Float32frombits) so we'll fix it in
1.15.
Fixes#36400
Update #36399
Change-Id: Icf155e743281560eda2eed953d19a829552ccfda
Reviewed-on: https://go-review.googlesource.com/c/go/+/213477
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
We should panic in this situation. Rewriting to a SSA op just leads
to a compiler panic.
Fixes#36259
Change-Id: I6e0bccbed7dd0fdac7ebae76b98a211947947386
Reviewed-on: https://go-review.googlesource.com/c/go/+/212405
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
CL 213703 converted generated rewrite rules for commutative ops
to use loops instead of duplicated code.
However, it loaded args using expressions like
v.Args[i] and v.Args[i^1], which the compiler could
not eliminate bounds for (including with all outstanding
prove CLs).
Also, given a series of separate rewrite rules for the same op,
we generated bounds checks for every rewrite rule, even though
we were repeatedly loading the same set of args.
This change reduces both sets of bounds checks.
Instead of loading v.Args[i] and v.Args[i^1] for commutative loops,
we now preload v.Args[0] and v.Args[1] into local variables,
and then swap them (as needed) in the commutative loop post statement.
And we now load all top level v.Args into local variables
at the beginning of every rewrite rule function.
The second optimization is the more significant,
but the first helps a little, and they play together
nicely from the perspective of generating the code.
This does increase register pressure, but the reduced bounds
checks more than compensate.
Note that the vast majority of rewrite rules evaluated
are not applied, so the prologue is the most important
part of the rewrite rules.
There is one subtle aspect to the new generated code.
Because the top level v.Args are shared across rewrite rules,
and rule evaluation can swap v_0 and v_1, v_0 and v_1
can end up being swapped from one rule to the next.
That is OK, because any time a rule does not get applied,
they will have been swapped exactly twice.
Passes toolstash-check -all.
name old time/op new time/op delta
Template 213ms ± 2% 211ms ± 2% -0.85% (p=0.000 n=92+96)
Unicode 83.5ms ± 2% 83.2ms ± 2% -0.41% (p=0.004 n=95+90)
GoTypes 737ms ± 2% 733ms ± 2% -0.51% (p=0.000 n=91+94)
Compiler 3.45s ± 2% 3.43s ± 2% -0.44% (p=0.000 n=99+100)
SSA 8.54s ± 1% 8.32s ± 2% -2.56% (p=0.000 n=96+99)
Flate 136ms ± 2% 135ms ± 1% -0.47% (p=0.000 n=96+96)
GoParser 169ms ± 1% 168ms ± 1% -0.33% (p=0.000 n=96+93)
Reflect 456ms ± 3% 455ms ± 3% ~ (p=0.261 n=95+94)
Tar 186ms ± 2% 185ms ± 2% -0.48% (p=0.000 n=94+95)
XML 251ms ± 1% 250ms ± 1% -0.51% (p=0.000 n=91+94)
[Geo mean] 424ms 421ms -0.68%
name old user-time/op new user-time/op delta
Template 275ms ± 1% 274ms ± 2% -0.55% (p=0.000 n=95+98)
Unicode 118ms ± 4% 118ms ± 4% ~ (p=0.642 n=98+90)
GoTypes 983ms ± 1% 980ms ± 1% -0.30% (p=0.000 n=93+93)
Compiler 4.56s ± 6% 4.52s ± 6% -0.72% (p=0.003 n=100+100)
SSA 11.4s ± 1% 11.1s ± 1% -2.50% (p=0.000 n=96+97)
Flate 168ms ± 1% 167ms ± 1% -0.49% (p=0.000 n=92+92)
GoParser 204ms ± 1% 204ms ± 2% -0.27% (p=0.003 n=99+96)
Reflect 599ms ± 2% 598ms ± 2% ~ (p=0.116 n=95+92)
Tar 227ms ± 2% 225ms ± 2% -0.57% (p=0.000 n=95+98)
XML 313ms ± 2% 312ms ± 1% -0.37% (p=0.000 n=89+95)
[Geo mean] 547ms 544ms -0.61%
file before after Δ %
compile 21113112 21109016 -4096 -0.019%
total 131704940 131700844 -4096 -0.003%
Change-Id: Id6c39e0367e597c0c75b8a4b1eb14cc3cbd11956
Reviewed-on: https://go-review.googlesource.com/c/go/+/216218
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Prior to this change, we generated additional rules at rulegen time
for all possible combinations of args to commutative ops.
This is simple and works well, but leads to lots of generated rules.
This in turn has increased the size of the compiler,
made it hard to compile package ssa on small machines,
and provided a disincentive to mark some ops as commutative.
This change reworks how we handle commutative ops.
Instead of generating a rule per argument permutation,
we generate a series of nested loops, one for each commutative op.
Each loop tries both possible argument orderings.
I also considered attempting to canonicalize the inputs to the
rewrite rules. However, because either or both arguments might be
nothing more than an identifier, and because there can be arbitrary
conditions to evaluate during matching, I did not see how to proceed.
The duplicate rule detection now sorts arguments to commutative ops,
so that it can detect commutative-only duplicates.
There may be further optimizations to the new generated code.
In particular, we may not be removing as many bounds checks as before;
I have not investigated deeply. If more work here is needed,
we could do it with more hints or with improvements to the prove pass.
This change has almost no impact on the generated code.
It does not pass toolstash-check, however. In a handful of functions,
for reasons I do not understand, there are minor position changes.
For the entire series ending at this change,
there is negligible compiler performance impact.
The compiler binary shrinks by about 15%,
and package ssa shrinks by about 25%.
Package ssa also compiles ~25% faster with ~25% less memory.
Change-Id: Ia2ee9ceae7be08a17342319d4e31b0bb238a2ee4
Reviewed-on: https://go-review.googlesource.com/c/go/+/213703
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
We absorbed Not into most integer comparisons but not into pointer
and floating point equality checks.
The new cases trigger more than 300 times during make.bash.
Change-Id: I77c6b31fcacde10da5470b73fc001a19521ce78d
Reviewed-on: https://go-review.googlesource.com/c/go/+/200618
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The rules for extracting the interface data word don't leave
the result typed correctly. If I do i.([1]*int)[0], the result
should have type *int, not [1]*int. Using (IData x) for the result
keeps the typing of the original top-level Value.
I don't think this would ever cause a real codegen bug, bug fixing it
at least makes the typing shown in ssa.html more consistent.
Change-Id: I239d821c394e58347639387981b0510d13b2f7b7
Reviewed-on: https://go-review.googlesource.com/c/go/+/204042
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.
Large constants used to mask the high bits of a register with an AND
instruction can not be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.
However that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.
Using a pair of shifts to mask high bits instead of an AND to mask high
bits of a register has a shorter encoding and uses one less general
purpose register but is slower due to taking one clock cycle longer
if there is no register pressure that would make the AND variant need to
generate a spill.
For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f SHRQ $0x3f, DX
48c1e23f SHLQ $0x3f, DX
after this CL the instructions are the same as GCC and LLVM use:
48b80000000000000080 MOVQ $0x8000000000000000, AX
4821d0 ANDQ DX, AX
Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.
Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:
var GlobalU uint
func BenchmarkAndHighBits(b *testing.B) {
x := uint(0)
for i := 0; i < b.N; i++ {
x &= 1 << 63
}
GlobalU = x
}
amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name old time/op new time/op delta
AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25):
'go run run.go -all_codegen -v codegen' passes with following adjustments:
ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
since ORshiftRL generation fusing '>> rc' and '|' interferes
with matching ((x << lc) >> rc) to generate UBFX. Previously
ORshiftLL was created first using the shifts generated for (y & ac).
S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.
Updates #33826
Updates #32781
Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196810
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
First, renove unnecessary "// cond:" lines from the generated files.
This shaves off about ~7k lines.
Second, join "if cond { break }" statements via "||", which allows us to
deduplicate a large number of them. This shaves off another ~25k lines.
This change is not for readability or simplicity; but rather, to avoid
unnecessary verbosity that makes the generated files larger. All in all,
git reports that the generated files overall weigh ~200KiB less, or
about 2.7% less.
While at it, add a -trace flag to rulegen.
Updates #33644.
Change-Id: I3fac0290a6066070cc62400bf970a4ae0929470a
Reviewed-on: https://go-review.googlesource.com/c/go/+/196498
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.
Large constants used to mask the high bits of a register with an AND
instruction can not be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.
However that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.
Using a pair of shifts to mask high bits instead of an AND to mask high
bits of a register has a shorter encoding and uses one less general
purpose register but is slower due to taking one clock cycle longer
if there is no register pressure that would make the AND variant need to
generate a spill.
For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f SHRQ $0x3f, DX
48c1e23f SHLQ $0x3f, DX
after this CL the instructions are the same as GCC and LLVM use:
48b80000000000000080 MOVQ $0x8000000000000000, AX
4821d0 ANDQ DX, AX
Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.
Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:
var GlobalU uint
func BenchmarkAndHighBits(b *testing.B) {
x := uint(0)
for i := 0; i < b.N; i++ {
x &= 1 << 63
}
GlobalU = x
}
amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name old time/op new time/op delta
AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25):
'go run run.go -all_codegen -v codegen' passes with following adjustments:
ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
since ORshiftRL generation fusing '>> rc' and '|' interferes
with matching ((x << lc) >> rc) to generate UBFX. Previously
ORshiftLL was created first using the shifts generated for (y & ac).
S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.
Updates #33826
Updates #32781
Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb
Reviewed-on: https://go-review.googlesource.com/c/go/+/194297
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.
Large constants used to mask the high bits of a register with an AND
instruction can not be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.
However that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.
Using a pair of shifts to mask high bits instead of an AND to mask high
bits of a register has a shorter encoding and uses one less general
purpose register but is slower due to taking one clock cycle longer
if there is no register pressure that would make the AND variant need to
generate a spill.
For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f SHRQ $0x3f, DX
48c1e23f SHLQ $0x3f, DX
after this CL the instructions are the same as GCC and LLVM use:
48b80000000000000080 MOVQ $0x8000000000000000, AX
4821d0 ANDQ DX, AX
Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.
Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:
var GlobalU uint
func BenchmarkAndHighBits(b *testing.B) {
x := uint(0)
for i := 0; i < b.N; i++ {
x &= 1 << 63
}
GlobalU = x
}
amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name old time/op new time/op delta
AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25):
Updates #33826
Updates #32781
Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The existing pointer comparison optimizations
don't include pointer arithmetic. Add them.
These rules trigger a few times in std cmd, while compiling:
time.Duration.String
cmd/go/internal/tlog.NodeHash
crypto/tls.ticketKeyFromBytes (3 times)
crypto/elliptic.(*p256Point).p256ScalarMult (15 times!)
crypto/elliptic.initTable
These weird comparisons occur when using the copy builtin,
which does a pointer comparison between src and dst.
This also happens to fix#32454, by optimizing enough
early on that all values can be eliminated.
Fixes#32454
Change-Id: I799d45743350bddd15a295dc1e12f8d03c11d1c6
Reviewed-on: https://go-review.googlesource.com/c/go/+/180940
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
EqPtr and NeqPtr are marked as commutative,
so the transformations for rules are already
generated by the preceding two lines.
Change-Id: Ibecba5c8e54d9df00c84e1dae7e5d8cb53eeff43
Reviewed-on: https://go-review.googlesource.com/c/go/+/180939
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The assembly output for x & c == c, where c is power of 2:
MOVQ "".set+8(SP), AX
ANDQ $8, AX
CMPQ AX, $8
SETEQ "".~r2+24(SP)
With optimization using bitset:
MOVQ "".set+8(SP), AX
BTL $3, AX
SETCS "".~r2+24(SP)
output less than 1 instruction.
However, there is no speed improvement:
name old time/op new time/op delta
AllBitSet-8 0.35ns ± 0% 0.35ns ± 0% ~ (all equal)
Fixes#31904
Change-Id: I5dca4e410bf45716ed2145e3473979ec997e35d4
Reviewed-on: https://go-review.googlesource.com/c/go/+/175957
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
First, add cpu and memory profiling flags, as these are useful to see
where rulegen is spending its time. It now takes many seconds to run on
a recent laptop, so we have to keep an eye on what it's doing.
Second, stop writing '_ = var' lines to keep imports and variables used
at all times. Now that rulegen removes all such unused names, they're
unnecessary.
To perform the removal, lean on go/types to first detect what names are
unused. We can configure it to give us all the type-checking errors in a
file, so we can collect all "declared but not used" errors in a single
pass.
We then use astutil.Apply to remove the relevant nodes based on the line
information from each unused error. This allows us to apply the changes
without having to do extra parser+printer roundtrips to plaintext, which
are far too expensive.
We need to do multiple such passes, as removing an unused variable
declaration might then make another declaration unused. Two passes are
enough to clean every file at the moment, so add a limit of three passes
for now to avoid eating cpu uncontrollably by accident.
The resulting performance of the changes above is a ~30% loss across the
table, since go/types is fairly expensive. The numbers were obtained
with 'benchcmd Rulegen go run *.go', which involves compiling rulegen
itself, but that seems reflective of how the program is used.
name old time/op new time/op delta
Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4)
name old user-time/op new user-time/op delta
Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4)
name old sys-time/op new sys-time/op delta
Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5)
name old peak-RSS-bytes new peak-RSS-bytes delta
Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5)
We can live with a bit more resource usage, but the time/op getting
close to 10s isn't good. To win that back, introduce concurrency in
main.go. This further increases resource usage a bit, but the real time
on this quad-core laptop is greatly reduced. The final benchstat is as
follows:
name old time/op new time/op delta
Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5)
name old user-time/op new user-time/op delta
Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5)
name old sys-time/op new sys-time/op delta
Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5)
name old peak-RSS-bytes new peak-RSS-bytes delta
Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5)
It might be possible to reduce the cpu or memory usage in the future,
such as configuring go/types to do less work, or taking shortcuts to
avoid having to run it many times. For now, ~2x cpu and ~4x memory usage
seems like a fair trade for a faster and better rulegen.
Finally, we can remove the old code that tried to remove some unused
variables in a hacky and unmaintainable way.
Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6
Reviewed-on: https://go-review.googlesource.com/c/go/+/189798
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
"Division by invariant integers using multiplication" paper
by Granlund and Montgomery contains a method for directly computing
divisibility (x%c == 0 for c constant) by means of the modular inverse.
The method is further elaborated in "Hacker's Delight" by Warren Section 10-17
This general rule can compute divisibilty by one multiplication, and add
and a compare for odd divisors and an additional rotate for even divisors.
To apply the divisibility rule, we must take into account
the rules to rewrite x%c = x-((x/c)*c) and (x/c) for c constant on the first
optimization pass "opt". This complicates the matching as we want to match
only in the cases where the result of (x/c) is not also needed.
So, we must match on the expanded form of (x/c) in the expression x == c*(x/c)
in the "late opt" pass after common subexpresion elimination.
Note, that if there is an intermediate opt pass introduced in the future we
could simplify these rules by delaying the magic division rewrite to "late opt"
and matching directly on (x/c) in the intermediate opt pass.
On amd64, the divisibility check is 30-45% faster.
name old time/op new time/op delta`
DivisiblePow2constI64-4 0.83ns ± 1% 0.82ns ± 0% ~ (p=0.079 n=5+4)
DivisibleconstI64-4 2.68ns ± 1% 1.87ns ± 0% -30.33% (p=0.000 n=5+4)
DivisibleWDivconstI64-4 2.69ns ± 1% 2.71ns ± 3% ~ (p=1.000 n=5+5)
DivisiblePow2constI32-4 1.15ns ± 1% 1.15ns ± 0% ~ (p=0.238 n=5+4)
DivisibleconstI32-4 2.24ns ± 1% 1.20ns ± 0% -46.48% (p=0.016 n=5+4)
DivisibleWDivconstI32-4 2.27ns ± 1% 2.27ns ± 1% ~ (p=0.683 n=5+5)
DivisiblePow2constI16-4 0.81ns ± 1% 0.82ns ± 1% ~ (p=0.135 n=5+5)
DivisibleconstI16-4 2.11ns ± 2% 1.20ns ± 1% -42.99% (p=0.008 n=5+5)
DivisibleWDivconstI16-4 2.23ns ± 0% 2.27ns ± 2% +1.79% (p=0.029 n=4+4)
DivisiblePow2constI8-4 0.81ns ± 1% 0.81ns ± 1% ~ (p=0.286 n=5+5)
DivisibleconstI8-4 2.13ns ± 3% 1.19ns ± 1% -43.84% (p=0.008 n=5+5)
DivisibleWDivconstI8-4 2.23ns ± 1% 2.25ns ± 1% ~ (p=0.183 n=5+5)
Fixes#30282Fixes#15806
Change-Id: Id20d78263a4fdfe0509229ae4dfa2fede83fc1d0
Reviewed-on: https://go-review.googlesource.com/c/go/+/173998
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
"Division by invariant integers using multiplication" paper
by Granlund and Montgomery contains a method for directly computing
divisibility (x%c == 0 for c constant) by means of the modular inverse.
The method is further elaborated in "Hacker's Delight" by Warren Section 10-17
This general rule can compute divisibilty by one multiplication and a compare
for odd divisors and an additional rotate for even divisors.
To apply the divisibility rule, we must take into account
the rules to rewrite x%c = x-((x/c)*c) and (x/c) for c constant on the first
optimization pass "opt". This complicates the matching as we want to match
only in the cases where the result of (x/c) is not also available.
So, we must match on the expanded form of (x/c) in the expression x == c*(x/c)
in the "late opt" pass after common subexpresion elimination.
Note, that if there is an intermediate opt pass introduced in the future we
could simplify these rules by delaying the magic division rewrite to "late opt"
and matching directly on (x/c) in the intermediate opt pass.
Additional rules to lower the generic RotateLeft* ops were also applied.
On amd64, the divisibility check is 25-50% faster.
name old time/op new time/op delta
DivconstI64-4 2.08ns ± 0% 2.08ns ± 1% ~ (p=0.881 n=5+5)
DivisibleconstI64-4 2.67ns ± 0% 2.67ns ± 1% ~ (p=1.000 n=5+5)
DivisibleWDivconstI64-4 2.67ns ± 0% 2.67ns ± 0% ~ (p=0.683 n=5+5)
DivconstU64-4 2.08ns ± 1% 2.08ns ± 1% ~ (p=1.000 n=5+5)
DivisibleconstU64-4 2.77ns ± 1% 1.55ns ± 2% -43.90% (p=0.008 n=5+5)
DivisibleWDivconstU64-4 2.99ns ± 1% 2.99ns ± 1% ~ (p=1.000 n=5+5)
DivconstI32-4 1.53ns ± 2% 1.53ns ± 0% ~ (p=1.000 n=5+5)
DivisibleconstI32-4 2.23ns ± 0% 2.25ns ± 3% ~ (p=0.167 n=5+5)
DivisibleWDivconstI32-4 2.27ns ± 1% 2.27ns ± 1% ~ (p=0.429 n=5+5)
DivconstU32-4 1.78ns ± 0% 1.78ns ± 1% ~ (p=1.000 n=4+5)
DivisibleconstU32-4 2.52ns ± 2% 1.26ns ± 0% -49.96% (p=0.000 n=5+4)
DivisibleWDivconstU32-4 2.63ns ± 0% 2.85ns ±10% +8.29% (p=0.016 n=4+5)
DivconstI16-4 1.54ns ± 0% 1.54ns ± 0% ~ (p=0.333 n=4+5)
DivisibleconstI16-4 2.10ns ± 0% 2.10ns ± 1% ~ (p=0.571 n=4+5)
DivisibleWDivconstI16-4 2.22ns ± 0% 2.23ns ± 1% ~ (p=0.556 n=4+5)
DivconstU16-4 1.09ns ± 0% 1.01ns ± 1% -7.74% (p=0.000 n=4+5)
DivisibleconstU16-4 1.83ns ± 0% 1.26ns ± 0% -31.52% (p=0.008 n=5+5)
DivisibleWDivconstU16-4 1.88ns ± 0% 1.89ns ± 1% ~ (p=0.365 n=5+5)
DivconstI8-4 1.54ns ± 1% 1.54ns ± 1% ~ (p=1.000 n=5+5)
DivisibleconstI8-4 2.10ns ± 0% 2.11ns ± 0% ~ (p=0.238 n=5+4)
DivisibleWDivconstI8-4 2.22ns ± 0% 2.23ns ± 2% ~ (p=0.762 n=5+5)
DivconstU8-4 0.92ns ± 1% 0.94ns ± 1% +2.65% (p=0.008 n=5+5)
DivisibleconstU8-4 1.66ns ± 0% 1.26ns ± 1% -24.28% (p=0.008 n=5+5)
DivisibleWDivconstU8-4 1.79ns ± 0% 1.80ns ± 1% ~ (p=0.079 n=4+5)
A follow-up change will address the signed division case.
Updates #30282
Change-Id: I7e995f167179aa5c76bb10fbcbeb49c520943403
Reviewed-on: https://go-review.googlesource.com/c/go/+/168037
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>