Stowage/go - Remotebranch.eu

Stowage/go

Fork 0

mirror of https://github.com/golang/go.git synced 2025-12-08 06:10:04 +00:00

Commit graph

Author	SHA1	Message	Date
Russ Cox	6e165b4d17	cmd/compile: implement Avg64u, Hmul64, Hmul64u for wasm This lets us remove useAvg and useHmul from the division rules. The compiler is simpler and the generated code is faster. goos: wasip1 goarch: wasm pkg: internal/strconv │ old.txt │ new.txt │ │ sec/op │ sec/op vs base │ AppendFloat/Decimal 192.8n ± 1% 194.6n ± 0% +0.91% (p=0.000 n=10) AppendFloat/Float 328.6n ± 0% 279.6n ± 0% -14.93% (p=0.000 n=10) AppendFloat/Exp 335.6n ± 1% 289.2n ± 1% -13.80% (p=0.000 n=10) AppendFloat/NegExp 336.0n ± 0% 289.1n ± 1% -13.97% (p=0.000 n=10) AppendFloat/LongExp 332.4n ± 0% 285.2n ± 1% -14.20% (p=0.000 n=10) AppendFloat/Big 348.2n ± 0% 300.1n ± 0% -13.83% (p=0.000 n=10) AppendFloat/BinaryExp 137.4n ± 0% 138.2n ± 0% +0.55% (p=0.001 n=10) AppendFloat/32Integer 193.3n ± 1% 196.5n ± 0% +1.66% (p=0.000 n=10) AppendFloat/32ExactFraction 283.3n ± 0% 268.9n ± 1% -5.08% (p=0.000 n=10) AppendFloat/32Point 279.9n ± 0% 266.5n ± 0% -4.80% (p=0.000 n=10) AppendFloat/32Exp 300.1n ± 0% 288.3n ± 1% -3.90% (p=0.000 n=10) AppendFloat/32NegExp 288.2n ± 1% 277.9n ± 1% -3.59% (p=0.000 n=10) AppendFloat/32Shortest 261.7n ± 0% 250.2n ± 0% -4.39% (p=0.000 n=10) AppendFloat/32Fixed8Hard 173.3n ± 1% 158.9n ± 1% -8.31% (p=0.000 n=10) AppendFloat/32Fixed9Hard 180.0n ± 0% 167.9n ± 2% -6.70% (p=0.000 n=10) AppendFloat/64Fixed1 167.1n ± 0% 149.6n ± 1% -10.50% (p=0.000 n=10) AppendFloat/64Fixed2 162.4n ± 1% 146.5n ± 0% -9.73% (p=0.000 n=10) AppendFloat/64Fixed2.5 165.5n ± 0% 149.4n ± 1% -9.70% (p=0.000 n=10) AppendFloat/64Fixed3 166.4n ± 1% 150.2n ± 0% -9.74% (p=0.000 n=10) AppendFloat/64Fixed4 163.7n ± 0% 149.6n ± 1% -8.62% (p=0.000 n=10) AppendFloat/64Fixed5Hard 182.8n ± 1% 167.1n ± 1% -8.61% (p=0.000 n=10) AppendFloat/64Fixed12 222.2n ± 0% 208.8n ± 0% -6.05% (p=0.000 n=10) AppendFloat/64Fixed16 197.6n ± 1% 181.7n ± 0% -8.02% (p=0.000 n=10) AppendFloat/64Fixed12Hard 194.5n ± 0% 181.0n ± 0% -6.99% (p=0.000 n=10) AppendFloat/64Fixed17Hard 205.1n ± 1% 191.9n ± 0% -6.44% (p=0.000 n=10) AppendFloat/64Fixed18Hard 6.269µ ± 0% 6.643µ ± 0% +5.97% (p=0.000 n=10) AppendFloat/64FixedF1 211.7n ± 1% 197.0n ± 0% -6.95% (p=0.000 n=10) AppendFloat/64FixedF2 189.4n ± 0% 174.2n ± 0% -8.08% (p=0.000 n=10) AppendFloat/64FixedF3 169.0n ± 0% 154.9n ± 0% -8.32% (p=0.000 n=10) AppendFloat/Slowpath64 321.2n ± 0% 274.2n ± 1% -14.63% (p=0.000 n=10) AppendFloat/SlowpathDenormal64 307.4n ± 1% 261.2n ± 0% -15.03% (p=0.000 n=10) AppendInt 3.367µ ± 1% 3.376µ ± 0% ~ (p=0.517 n=10) AppendUint 675.5n ± 0% 676.9n ± 0% ~ (p=0.196 n=10) AppendIntSmall 28.13n ± 1% 28.17n ± 0% +0.14% (p=0.015 n=10) AppendUintVarlen/digits=1 20.70n ± 0% 20.51n ± 1% -0.89% (p=0.018 n=10) AppendUintVarlen/digits=2 20.43n ± 0% 20.27n ± 0% -0.81% (p=0.001 n=10) AppendUintVarlen/digits=3 38.48n ± 0% 37.93n ± 0% -1.43% (p=0.000 n=10) AppendUintVarlen/digits=4 41.10n ± 0% 38.78n ± 1% -5.62% (p=0.000 n=10) AppendUintVarlen/digits=5 42.25n ± 1% 42.11n ± 0% -0.32% (p=0.041 n=10) AppendUintVarlen/digits=6 45.40n ± 1% 43.14n ± 0% -4.98% (p=0.000 n=10) AppendUintVarlen/digits=7 46.81n ± 1% 46.03n ± 0% -1.66% (p=0.000 n=10) AppendUintVarlen/digits=8 48.88n ± 1% 46.59n ± 1% -4.68% (p=0.000 n=10) AppendUintVarlen/digits=9 49.94n ± 2% 49.41n ± 1% -1.06% (p=0.000 n=10) AppendUintVarlen/digits=10 57.28n ± 1% 56.92n ± 1% -0.62% (p=0.045 n=10) AppendUintVarlen/digits=11 60.09n ± 1% 58.11n ± 2% -3.30% (p=0.000 n=10) AppendUintVarlen/digits=12 62.22n ± 0% 61.85n ± 0% -0.59% (p=0.000 n=10) AppendUintVarlen/digits=13 64.94n ± 0% 62.92n ± 0% -3.10% (p=0.000 n=10) AppendUintVarlen/digits=14 65.42n ± 1% 65.19n ± 1% -0.34% (p=0.005 n=10) AppendUintVarlen/digits=15 68.17n ± 0% 66.13n ± 0% -2.99% (p=0.000 n=10) AppendUintVarlen/digits=16 70.21n ± 1% 70.09n ± 1% ~ (p=0.517 n=10) AppendUintVarlen/digits=17 72.93n ± 0% 70.49n ± 0% -3.34% (p=0.000 n=10) AppendUintVarlen/digits=18 73.01n ± 0% 72.75n ± 0% -0.35% (p=0.000 n=10) AppendUintVarlen/digits=19 79.27n ± 1% 79.49n ± 1% ~ (p=0.671 n=10) AppendUintVarlen/digits=20 82.18n ± 0% 80.43n ± 1% -2.14% (p=0.000 n=10) geomean 143.4n 136.0n -5.20% Change-Id: I8245814a0259ad13cf9225f57db8e9fe3d2e4267 Reviewed-on: https://go-review.googlesource.com/c/go/+/717407 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>	2025-11-04 11:38:18 -08:00
Russ Cox	1e5bb416d8	cmd/compile: implement bits.Mul64 on 32-bit systems This CL implements Mul64uhilo, Hmul64, Hmul64u, and Avg64u on 32-bit systems, with the effect that constant division of both int64s and uint64s can now be emitted directly in all cases, and also that bits.Mul64 can be intrinsified on 32-bit systems. Previously, constant division of uint64s by values 0 ≤ c ≤ 0xFFFF were implemented as uint32 divisions by c and some fixup. After expanding those smaller constant divisions, the code for i/999 required: (386) 7 mul, 10 add, 2 sub, 3 rotate, 3 shift (104 bytes) (arm) 7 mul, 9 add, 3 sub, 2 shift (104 bytes) (mips) 7 mul, 10 add, 5 sub, 6 shift, 3 sgtu (176 bytes) For that much code, we might as well use a full 64x64->128 multiply that can be used for all divisors, not just small ones. Having done that, the same i/999 now generates: (386) 4 mul, 9 add, 2 sub, 2 or, 6 shift (112 bytes) (arm) 4 mul, 8 add, 2 sub, 2 or, 3 shift (92 bytes) (mips) 4 mul, 11 add, 3 sub, 6 shift, 8 sgtu, 4 or (196 bytes) The size increase on 386 is due to a few extra register spills. The size increase on mips is due to add-with-carry being hard. The new approach is more general, letting us delete the old special case and guarantee that all int64 and uint64 divisions by constants are generated directly on 32-bit systems. This especially speeds up code making heavy use of bits.Mul64 with a constant argument, which happens in strconv and various crypto packages. A few examples are benchmarked below. pkg: cmd/compile/internal/test benchmark \ host local linux-amd64 s7 linux-386 s7:GOARCH=386 vs base vs base vs base vs base vs base DivconstI64 ~ ~ ~ -49.66% -21.02% ModconstI64 ~ ~ ~ -13.45% +14.52% DivisiblePow2constI64 ~ ~ ~ +0.97% -1.32% DivisibleconstI64 ~ ~ ~ -20.01% -48.28% DivisibleWDivconstI64 ~ ~ -1.76% -38.59% -42.74% DivconstU64/3 ~ ~ ~ -13.82% -4.09% DivconstU64/5 ~ ~ ~ -14.10% -3.54% DivconstU64/37 -2.07% -4.45% ~ -19.60% -9.55% DivconstU64/1234567 ~ ~ ~ -61.55% -56.93% ModconstU64 ~ ~ ~ -6.25% ~ DivisibleconstU64 ~ ~ ~ -2.78% -7.82% DivisibleWDivconstU64 ~ ~ ~ +4.23% +2.56% pkg: math/bits benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base Add ~ ~ ~ ~ Add32 +1.59% ~ ~ ~ Add64 ~ ~ ~ ~ Add64multiple ~ ~ ~ ~ Sub ~ ~ ~ ~ Sub32 ~ ~ ~ ~ Sub64 ~ ~ -9.20% ~ Sub64multiple ~ ~ ~ ~ Mul ~ ~ ~ ~ Mul32 ~ ~ ~ ~ Mul64 ~ ~ -41.58% -53.21% Div ~ ~ ~ ~ Div32 ~ ~ ~ ~ Div64 ~ ~ ~ ~ pkg: strconv benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base ParseInt/Pos/7bit ~ ~ -11.08% -6.75% ParseInt/Pos/26bit ~ ~ -13.65% -11.02% ParseInt/Pos/31bit ~ ~ -14.65% -9.71% ParseInt/Pos/56bit -1.80% ~ -17.97% -10.78% ParseInt/Pos/63bit ~ ~ -13.85% -9.63% ParseInt/Neg/7bit ~ ~ -12.14% -7.26% ParseInt/Neg/26bit ~ ~ -14.18% -9.81% ParseInt/Neg/31bit ~ ~ -14.51% -9.02% ParseInt/Neg/56bit ~ ~ -15.79% -9.79% ParseInt/Neg/63bit ~ ~ -15.68% -11.07% AppendFloat/Decimal ~ ~ -7.25% -12.26% AppendFloat/Float ~ ~ -15.96% -19.45% AppendFloat/Exp ~ ~ -13.96% -17.76% AppendFloat/NegExp ~ ~ -14.89% -20.27% AppendFloat/LongExp ~ ~ -12.68% -17.97% AppendFloat/Big ~ ~ -11.10% -16.64% AppendFloat/BinaryExp ~ ~ ~ ~ AppendFloat/32Integer ~ ~ -10.05% -10.91% AppendFloat/32ExactFraction ~ ~ -8.93% -13.00% AppendFloat/32Point ~ ~ -10.36% -14.89% AppendFloat/32Exp ~ ~ -9.88% -13.54% AppendFloat/32NegExp ~ ~ -10.16% -14.26% AppendFloat/32Shortest ~ ~ -11.39% -14.96% AppendFloat/32Fixed8Hard ~ ~ ~ -2.31% AppendFloat/32Fixed9Hard ~ ~ ~ -7.01% AppendFloat/64Fixed1 ~ ~ -2.83% -8.23% AppendFloat/64Fixed2 ~ ~ ~ -7.94% AppendFloat/64Fixed3 ~ ~ -4.07% -7.22% AppendFloat/64Fixed4 ~ ~ -7.24% -7.62% AppendFloat/64Fixed12 ~ ~ -6.57% -4.82% AppendFloat/64Fixed16 ~ ~ -4.00% -5.81% AppendFloat/64Fixed12Hard -2.22% ~ -4.07% -6.35% AppendFloat/64Fixed17Hard -2.12% ~ ~ -3.79% AppendFloat/64Fixed18Hard -1.89% ~ +2.48% ~ AppendFloat/Slowpath64 -1.85% ~ -14.49% -18.21% AppendFloat/SlowpathDenormal64 ~ ~ -13.08% -19.41% pkg: crypto/internal/fips140/nistec/fiat benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base Mul/P224 ~ ~ -29.95% -39.60% Mul/P384 ~ ~ -37.11% -63.33% Mul/P521 ~ ~ -26.62% -12.42% Square/P224 +1.46% ~ -40.62% -49.18% Square/P384 ~ ~ -45.51% -69.68% Square/P521 +90.37% ~ -25.26% -11.23% (The +90% is a separate problem and not real; that much variation can be seen on that system by running the same binary from two different files.) pkg: crypto/internal/fips140/edwards25519 benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base EncodingDecoding ~ ~ -34.67% -35.75% ScalarBaseMult ~ ~ -31.25% -30.29% ScalarMult ~ ~ -33.45% -32.54% VarTimeDoubleScalarBaseMult ~ ~ -33.78% -33.68% Change-Id: Id3c91d42cd01def6731b755e99f8f40c6ad1bb65 Reviewed-on: https://go-review.googlesource.com/c/go/+/716061 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>	2025-10-30 08:04:20 -07:00
Russ Cox	9bbda7c99d	cmd/compile: make prove understand div, mod better This CL introduces new divisible and divmod passes that rewrite divisibility checks and div, mod, and mul. These happen after prove, so that prove can make better sense of the code for deriving bounds, and they must run before decompose, so that 64-bit ops can be lowered to 32-bit ops on 32-bit systems. And then they need another generic pass as well, to optimize the generated code before decomposing. The three opt passes are "opt", "middle opt", and "late opt". (Perhaps instead they should be "generic", "opt", and "late opt"?) The "late opt" pass repeats the "middle opt" work on any new code that has been generated in the interim. There will not be new divs or mods, but there may be new muls. The x%c==0 rewrite rules are much simpler now, since they can match before divs have been rewritten. This has the effect of applying them more consistently and making the rewrite rules independent of the exact div rewrites. Prove is also now charged with marking signed div/mod as unsigned when the arguments call for it, allowing simpler code to be emitted in various cases. For example, t.Seconds()/2 and len(x)/2 are now recognized as unsigned, meaning they compile to a simple shift (unsigned division), avoiding the more complex fixup we need for signed values. https://gist.github.com/rsc/99d9d3bd99cde87b6a1a390e3d85aa32 shows a diff of 'go build -a -gcflags=-d=ssa/prove/debug=1 std' output before and after. "Proved Rsh64x64 shifts to zero" is replaced by the higher-level "Proved Div64 is unsigned" (the shift was in the signed expansion of div by constant), but otherwise prove is only finding more things to prove. One short example, in code that does x[i%len(x)]: < runtime/mfinal.go:131:34: Proved Rsh64x64 shifts to zero --- > runtime/mfinal.go:131:34: Proved Div64 is unsigned > runtime/mfinal.go:131:38: Proved IsInBounds A longer example: < crypto/internal/fips140/sha3/shake.go:28:30: Proved Rsh64x64 shifts to zero < crypto/internal/fips140/sha3/shake.go:38:27: Proved Rsh64x64 shifts to zero < crypto/internal/fips140/sha3/shake.go:53:46: Proved Rsh64x64 shifts to zero < crypto/internal/fips140/sha3/shake.go:55:46: Proved Rsh64x64 shifts to zero --- > crypto/internal/fips140/sha3/shake.go:28:30: Proved Div64 is unsigned > crypto/internal/fips140/sha3/shake.go:28:30: Proved IsInBounds > crypto/internal/fips140/sha3/shake.go:28:30: Proved IsSliceInBounds > crypto/internal/fips140/sha3/shake.go:38:27: Proved Div64 is unsigned > crypto/internal/fips140/sha3/shake.go:45:7: Proved IsSliceInBounds > crypto/internal/fips140/sha3/shake.go:46:4: Proved IsInBounds > crypto/internal/fips140/sha3/shake.go:53:46: Proved Div64 is unsigned > crypto/internal/fips140/sha3/shake.go:53:46: Proved IsInBounds > crypto/internal/fips140/sha3/shake.go:53:46: Proved IsSliceInBounds > crypto/internal/fips140/sha3/shake.go:55:46: Proved Div64 is unsigned > crypto/internal/fips140/sha3/shake.go:55:46: Proved IsInBounds > crypto/internal/fips140/sha3/shake.go:55:46: Proved IsSliceInBounds These diffs are due to the smaller opt being better and taking work away from prove: < image/jpeg/dct.go:307:5: Proved IsInBounds < image/jpeg/dct.go:308:5: Proved IsInBounds ... < image/jpeg/dct.go:442:5: Proved IsInBounds In the old opt, Mul by 8 was rewritten to Lsh by 3 early. This CL delays that rule to help prove recognize mods, but it also helps opt constant-fold the slice x[8i:8i+8:8*i+8]. Specifically, computing the length, opt can now do: (Sub64 (Add (Mul 8 i) 8) (Add (Mul 8 i) 8)) -> (Add 8 (Sub (Mul 8 i) (Mul 8 i))) -> (Add 8 (Mul 8 (Sub i i))) -> (Add 8 (Mul 8 0)) -> (Add 8 0) -> 8 The key step is (Sub (Mul x y) (Mul x z)) -> (Mul x (Sub y z)), Leaving the multiply as Mul enables using that step; the old rewrite to Lsh blocked it, leaving prove to figure out the length and then remove the bounds checks. But now opt can evaluate the length down to a constant 8 and then constant-fold away the bounds checks 0 < 8, 1 < 8, and so on. After that, the compiler has nothing left to prove. Benchmarks are noisy in general; I checked the assembly for the many large increases below, and the vast majority are unchanged and presumably hitting the caches differently in some way. The divisibility optimizations were not reliably triggering before. This leads to a very large improvement in some cases, like DivisiblePow2constI64, DivisibleconstI64 on 64-bit systems and DivisbleconstU64 on 32-bit systems. Another way the divisibility optimizations were unreliable before was incorrectly triggering for x/3, x%3 even though they are written not to do that. There is a real but small slowdown in the DivisibleWDivconst benchmarks on Mac because in the cases used in the benchmark, it is still faster (on Mac) to do the divisibility check than to remultiply. This may be worth further study. Perhaps when there is no rotate (meaning the divisor is odd), the divisibility optimization should be enabled always. In any event, this CL makes it possible to study that. benchmark \ host s7 linux-amd64 mac linux-arm64 linux-ppc64le linux-386 s7:GOARCH=386 linux-arm vs base vs base vs base vs base vs base vs base vs base vs base LoadAdd ~ ~ ~ ~ ~ -1.59% ~ ~ ExtShift ~ ~ -42.14% +0.10% ~ +1.44% +5.66% +8.50% Modify ~ ~ ~ ~ ~ ~ ~ -1.53% MullImm ~ ~ ~ ~ ~ +37.90% -21.87% +3.05% ConstModify ~ ~ ~ ~ -49.14% ~ ~ ~ BitSet ~ ~ ~ ~ -15.86% -14.57% +6.44% +0.06% BitClear ~ ~ ~ ~ ~ +1.78% +3.50% +0.06% BitToggle ~ ~ ~ ~ ~ -16.09% +2.91% ~ BitSetConst ~ ~ ~ ~ ~ ~ ~ -0.49% BitClearConst ~ ~ ~ ~ -28.29% ~ ~ -0.40% BitToggleConst ~ ~ ~ +8.89% -31.19% ~ ~ -0.77% MulNeg ~ ~ ~ ~ ~ ~ ~ ~ Mul2Neg ~ ~ -4.83% ~ ~ -13.75% -5.92% ~ DivconstI64 ~ ~ ~ ~ ~ -30.12% ~ +0.50% ModconstI64 ~ ~ -9.94% -4.63% ~ +3.15% ~ +5.32% DivisiblePow2constI64 -34.49% -12.58% ~ ~ -12.25% ~ ~ ~ DivisibleconstI64 -24.69% -25.06% -0.40% -2.27% -42.61% -3.31% ~ +1.63% DivisibleWDivconstI64 ~ ~ ~ ~ ~ -17.55% ~ -0.60% DivconstU64/3 ~ ~ ~ ~ ~ +1.51% ~ ~ DivconstU64/5 ~ ~ ~ ~ ~ ~ ~ ~ DivconstU64/37 ~ ~ -0.18% ~ ~ +2.70% ~ ~ DivconstU64/1234567 ~ ~ ~ ~ ~ ~ ~ +0.12% ModconstU64 ~ ~ ~ -0.24% ~ -5.10% -1.07% -1.56% DivisibleconstU64 ~ ~ ~ ~ ~ -29.01% -59.13% -50.72% DivisibleWDivconstU64 ~ ~ -12.18% -18.88% ~ -5.50% -3.91% +5.17% DivconstI32 ~ ~ -0.48% ~ -34.69% +89.01% -6.01% -16.67% ModconstI32 ~ +2.95% -0.33% ~ ~ -2.98% -5.40% -8.30% DivisiblePow2constI32 ~ ~ ~ ~ ~ ~ ~ -16.22% DivisibleconstI32 ~ ~ ~ ~ ~ -37.27% -47.75% -25.03% DivisibleWDivconstI32 -11.59% +5.22% -12.99% -23.83% ~ +45.95% -7.03% -10.01% DivconstU32 ~ ~ ~ ~ ~ +74.71% +4.81% ~ ModconstU32 ~ ~ +0.53% +0.18% ~ +51.16% ~ ~ DivisibleconstU32 ~ ~ ~ -0.62% ~ -4.25% ~ ~ DivisibleWDivconstU32 -2.77% +5.56% +11.12% -5.15% ~ +48.70% +25.11% -4.07% DivconstI16 -6.06% ~ -0.33% +0.22% ~ ~ -9.68% +5.47% ModconstI16 ~ ~ +4.44% +2.82% ~ ~ ~ +5.06% DivisiblePow2constI16 ~ ~ ~ ~ ~ ~ ~ -0.17% DivisibleconstI16 ~ ~ -0.23% ~ ~ ~ +4.60% +6.64% DivisibleWDivconstI16 -1.44% -0.43% +13.48% -5.76% ~ +1.62% -23.15% -9.06% DivconstU16 +1.61% ~ -0.35% -0.47% ~ ~ +15.59% ~ ModconstU16 ~ ~ ~ ~ ~ -0.72% ~ +14.23% DivisibleconstU16 ~ ~ -0.05% +3.00% ~ ~ ~ +5.06% DivisibleWDivconstU16 +52.10% +0.75% +17.28% +4.79% ~ -37.39% +5.28% -9.06% DivconstI8 ~ ~ -0.34% -0.96% ~ ~ -9.20% ~ ModconstI8 +2.29% ~ +4.38% +2.96% ~ ~ ~ ~ DivisiblePow2constI8 ~ ~ ~ ~ ~ ~ ~ ~ DivisibleconstI8 ~ ~ ~ ~ ~ ~ +6.04% ~ DivisibleWDivconstI8 -26.44% +1.69% +17.03% +4.05% ~ +32.48% -24.90% ~ DivconstU8 -4.50% +14.06% -0.28% ~ ~ ~ +4.16% +0.88% ModconstU8 ~ ~ +25.84% -0.64% ~ ~ ~ ~ DivisibleconstU8 ~ ~ -5.70% ~ ~ ~ ~ ~ DivisibleWDivconstU8 +49.55% +9.07% ~ +4.03% +53.87% -40.03% +39.72% -3.01% Mul2 ~ ~ ~ ~ ~ ~ ~ ~ MulNeg2 ~ ~ ~ ~ -11.73% ~ ~ -0.02% EfaceInteger ~ ~ ~ ~ ~ +18.11% ~ +2.53% TypeAssert +33.90% +2.86% ~ ~ ~ -1.07% -5.29% -1.04% Div64UnsignedSmall ~ ~ ~ ~ ~ ~ ~ ~ Div64Small ~ ~ ~ ~ ~ -0.88% ~ +2.39% Div64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +0.35% Div64SmallNegDividend ~ ~ ~ ~ ~ -0.84% ~ +3.57% Div64SmallNegBoth ~ ~ ~ ~ ~ -0.86% ~ +3.55% Div64Unsigned ~ ~ ~ ~ ~ ~ ~ -0.11% Div64 ~ ~ ~ ~ ~ ~ ~ +0.11% Div64NegDivisor ~ ~ ~ ~ ~ -1.29% ~ ~ Div64NegDividend ~ ~ ~ ~ ~ -1.44% ~ ~ Div64NegBoth ~ ~ ~ ~ ~ ~ ~ +0.28% Mod64UnsignedSmall ~ ~ ~ ~ ~ +0.48% ~ +0.93% Mod64Small ~ ~ ~ ~ ~ ~ ~ ~ Mod64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +1.44% Mod64SmallNegDividend ~ ~ ~ ~ ~ +0.22% ~ +1.37% Mod64SmallNegBoth ~ ~ ~ ~ ~ ~ ~ -2.22% Mod64Unsigned ~ ~ ~ ~ ~ -0.95% ~ +0.11% Mod64 ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegDivisor ~ ~ ~ ~ ~ ~ ~ -0.02% Mod64NegDividend ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegBoth ~ ~ ~ ~ ~ ~ ~ -0.02% MulconstI32/3 ~ ~ ~ -25.00% ~ ~ ~ +47.37% MulconstI32/5 ~ ~ ~ +33.28% ~ ~ ~ +32.21% MulconstI32/12 ~ ~ ~ -2.13% ~ ~ ~ -0.02% MulconstI32/120 ~ ~ ~ +2.93% ~ ~ ~ -0.03% MulconstI32/-120 ~ ~ ~ -2.17% ~ ~ ~ -0.03% MulconstI32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstI32/65538 ~ ~ ~ ~ ~ -33.38% ~ +0.04% MulconstI64/3 ~ ~ ~ +33.35% ~ -0.37% ~ -0.13% MulconstI64/5 ~ ~ ~ -25.00% ~ -0.34% ~ ~ MulconstI64/12 ~ ~ ~ +2.13% ~ +11.62% ~ +2.30% MulconstI64/120 ~ ~ ~ -1.98% ~ ~ ~ ~ MulconstI64/-120 ~ ~ ~ +0.75% ~ ~ ~ ~ MulconstI64/65537 ~ ~ ~ ~ ~ +5.61% ~ ~ MulconstI64/65538 ~ ~ ~ ~ ~ +5.25% ~ ~ MulconstU32/3 ~ +0.81% ~ +33.39% ~ +77.92% ~ -32.31% MulconstU32/5 ~ ~ ~ -24.97% ~ +77.92% ~ -24.47% MulconstU32/12 ~ ~ ~ +2.06% ~ ~ ~ +0.03% MulconstU32/120 ~ ~ ~ -2.74% ~ ~ ~ +0.03% MulconstU32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstU32/65538 ~ ~ ~ ~ ~ -33.42% ~ -0.03% MulconstU64/3 ~ ~ ~ +33.33% ~ -0.28% ~ +1.22% MulconstU64/5 ~ ~ ~ -25.00% ~ ~ ~ -0.64% MulconstU64/12 ~ ~ ~ +2.30% ~ +11.59% ~ +0.14% MulconstU64/120 ~ ~ ~ -2.82% ~ ~ ~ +0.04% MulconstU64/65537 ~ +0.37% ~ ~ ~ +5.58% ~ ~ MulconstU64/65538 ~ ~ ~ ~ ~ +5.16% ~ ~ ShiftArithmeticRight ~ ~ ~ ~ ~ -10.81% ~ +0.31% Switch8Predictable +14.69% ~ ~ ~ ~ -24.85% ~ ~ Switch8Unpredictable ~ -0.58% -3.80% ~ ~ -11.78% ~ -0.79% Switch32Predictable -10.33% +17.89% ~ ~ ~ +5.76% ~ ~ Switch32Unpredictable -3.15% +1.19% +9.42% ~ ~ -10.30% -5.09% +0.44% SwitchStringPredictable +70.88% +20.48% ~ ~ ~ +2.39% ~ +0.31% SwitchStringUnpredictable ~ +3.91% -5.06% -0.98% ~ +0.61% +2.03% ~ SwitchTypePredictable +146.58% -1.10% ~ -12.45% ~ -0.46% -3.81% ~ SwitchTypeUnpredictable +0.46% -0.83% ~ +4.18% ~ +0.43% ~ +0.62% SwitchInterfaceTypePredictable -13.41% -10.13% +11.03% ~ ~ -4.38% ~ +0.75% SwitchInterfaceTypeUnpredictable -6.37% -2.14% ~ -3.21% ~ -4.20% ~ +1.08% Fixes #63110. Fixes #75954. Change-Id: I55a876f08c6c14f419ce1a8cbba2eaae6c6efbf0 Reviewed-on: https://go-review.googlesource.com/c/go/+/714160 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Russ Cox <rsc@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-10-29 18:49:40 -07:00

Author

SHA1

Message

Date

Russ Cox

6e165b4d17

cmd/compile: implement Avg64u, Hmul64, Hmul64u for wasm

This lets us remove useAvg and useHmul from the division rules.
The compiler is simpler and the generated code is faster.

goos: wasip1
goarch: wasm
pkg: internal/strconv
                               │   old.txt   │               new.txt               │
                               │   sec/op    │   sec/op     vs base                │
AppendFloat/Decimal              192.8n ± 1%   194.6n ± 0%   +0.91% (p=0.000 n=10)
AppendFloat/Float                328.6n ± 0%   279.6n ± 0%  -14.93% (p=0.000 n=10)
AppendFloat/Exp                  335.6n ± 1%   289.2n ± 1%  -13.80% (p=0.000 n=10)
AppendFloat/NegExp               336.0n ± 0%   289.1n ± 1%  -13.97% (p=0.000 n=10)
AppendFloat/LongExp              332.4n ± 0%   285.2n ± 1%  -14.20% (p=0.000 n=10)
AppendFloat/Big                  348.2n ± 0%   300.1n ± 0%  -13.83% (p=0.000 n=10)
AppendFloat/BinaryExp            137.4n ± 0%   138.2n ± 0%   +0.55% (p=0.001 n=10)
AppendFloat/32Integer            193.3n ± 1%   196.5n ± 0%   +1.66% (p=0.000 n=10)
AppendFloat/32ExactFraction      283.3n ± 0%   268.9n ± 1%   -5.08% (p=0.000 n=10)
AppendFloat/32Point              279.9n ± 0%   266.5n ± 0%   -4.80% (p=0.000 n=10)
AppendFloat/32Exp                300.1n ± 0%   288.3n ± 1%   -3.90% (p=0.000 n=10)
AppendFloat/32NegExp             288.2n ± 1%   277.9n ± 1%   -3.59% (p=0.000 n=10)
AppendFloat/32Shortest           261.7n ± 0%   250.2n ± 0%   -4.39% (p=0.000 n=10)
AppendFloat/32Fixed8Hard         173.3n ± 1%   158.9n ± 1%   -8.31% (p=0.000 n=10)
AppendFloat/32Fixed9Hard         180.0n ± 0%   167.9n ± 2%   -6.70% (p=0.000 n=10)
AppendFloat/64Fixed1             167.1n ± 0%   149.6n ± 1%  -10.50% (p=0.000 n=10)
AppendFloat/64Fixed2             162.4n ± 1%   146.5n ± 0%   -9.73% (p=0.000 n=10)
AppendFloat/64Fixed2.5           165.5n ± 0%   149.4n ± 1%   -9.70% (p=0.000 n=10)
AppendFloat/64Fixed3             166.4n ± 1%   150.2n ± 0%   -9.74% (p=0.000 n=10)
AppendFloat/64Fixed4             163.7n ± 0%   149.6n ± 1%   -8.62% (p=0.000 n=10)
AppendFloat/64Fixed5Hard         182.8n ± 1%   167.1n ± 1%   -8.61% (p=0.000 n=10)
AppendFloat/64Fixed12            222.2n ± 0%   208.8n ± 0%   -6.05% (p=0.000 n=10)
AppendFloat/64Fixed16            197.6n ± 1%   181.7n ± 0%   -8.02% (p=0.000 n=10)
AppendFloat/64Fixed12Hard        194.5n ± 0%   181.0n ± 0%   -6.99% (p=0.000 n=10)
AppendFloat/64Fixed17Hard        205.1n ± 1%   191.9n ± 0%   -6.44% (p=0.000 n=10)
AppendFloat/64Fixed18Hard        6.269µ ± 0%   6.643µ ± 0%   +5.97% (p=0.000 n=10)
AppendFloat/64FixedF1            211.7n ± 1%   197.0n ± 0%   -6.95% (p=0.000 n=10)
AppendFloat/64FixedF2            189.4n ± 0%   174.2n ± 0%   -8.08% (p=0.000 n=10)
AppendFloat/64FixedF3            169.0n ± 0%   154.9n ± 0%   -8.32% (p=0.000 n=10)
AppendFloat/Slowpath64           321.2n ± 0%   274.2n ± 1%  -14.63% (p=0.000 n=10)
AppendFloat/SlowpathDenormal64   307.4n ± 1%   261.2n ± 0%  -15.03% (p=0.000 n=10)
AppendInt                        3.367µ ± 1%   3.376µ ± 0%        ~ (p=0.517 n=10)
AppendUint                       675.5n ± 0%   676.9n ± 0%        ~ (p=0.196 n=10)
AppendIntSmall                   28.13n ± 1%   28.17n ± 0%   +0.14% (p=0.015 n=10)
AppendUintVarlen/digits=1        20.70n ± 0%   20.51n ± 1%   -0.89% (p=0.018 n=10)
AppendUintVarlen/digits=2        20.43n ± 0%   20.27n ± 0%   -0.81% (p=0.001 n=10)
AppendUintVarlen/digits=3        38.48n ± 0%   37.93n ± 0%   -1.43% (p=0.000 n=10)
AppendUintVarlen/digits=4        41.10n ± 0%   38.78n ± 1%   -5.62% (p=0.000 n=10)
AppendUintVarlen/digits=5        42.25n ± 1%   42.11n ± 0%   -0.32% (p=0.041 n=10)
AppendUintVarlen/digits=6        45.40n ± 1%   43.14n ± 0%   -4.98% (p=0.000 n=10)
AppendUintVarlen/digits=7        46.81n ± 1%   46.03n ± 0%   -1.66% (p=0.000 n=10)
AppendUintVarlen/digits=8        48.88n ± 1%   46.59n ± 1%   -4.68% (p=0.000 n=10)
AppendUintVarlen/digits=9        49.94n ± 2%   49.41n ± 1%   -1.06% (p=0.000 n=10)
AppendUintVarlen/digits=10       57.28n ± 1%   56.92n ± 1%   -0.62% (p=0.045 n=10)
AppendUintVarlen/digits=11       60.09n ± 1%   58.11n ± 2%   -3.30% (p=0.000 n=10)
AppendUintVarlen/digits=12       62.22n ± 0%   61.85n ± 0%   -0.59% (p=0.000 n=10)
AppendUintVarlen/digits=13       64.94n ± 0%   62.92n ± 0%   -3.10% (p=0.000 n=10)
AppendUintVarlen/digits=14       65.42n ± 1%   65.19n ± 1%   -0.34% (p=0.005 n=10)
AppendUintVarlen/digits=15       68.17n ± 0%   66.13n ± 0%   -2.99% (p=0.000 n=10)
AppendUintVarlen/digits=16       70.21n ± 1%   70.09n ± 1%        ~ (p=0.517 n=10)
AppendUintVarlen/digits=17       72.93n ± 0%   70.49n ± 0%   -3.34% (p=0.000 n=10)
AppendUintVarlen/digits=18       73.01n ± 0%   72.75n ± 0%   -0.35% (p=0.000 n=10)
AppendUintVarlen/digits=19       79.27n ± 1%   79.49n ± 1%        ~ (p=0.671 n=10)
AppendUintVarlen/digits=20       82.18n ± 0%   80.43n ± 1%   -2.14% (p=0.000 n=10)
geomean                          143.4n        136.0n        -5.20%


Change-Id: I8245814a0259ad13cf9225f57db8e9fe3d2e4267
Reviewed-on: https://go-review.googlesource.com/c/go/+/717407
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>

2025-11-04 11:38:18 -08:00

Russ Cox

1e5bb416d8

cmd/compile: implement bits.Mul64 on 32-bit systems

This CL implements Mul64uhilo, Hmul64, Hmul64u, and Avg64u
on 32-bit systems, with the effect that constant division of both
int64s and uint64s can now be emitted directly in all cases,
and also that bits.Mul64 can be intrinsified on 32-bit systems.

Previously, constant division of uint64s by values 0 ≤ c ≤ 0xFFFF were
implemented as uint32 divisions by c and some fixup. After expanding
those smaller constant divisions, the code for i/999 required:

	(386) 7 mul, 10 add, 2 sub, 3 rotate, 3 shift (104 bytes)
	(arm) 7 mul, 9 add, 3 sub, 2 shift (104 bytes)
	(mips) 7 mul, 10 add, 5 sub, 6 shift, 3 sgtu (176 bytes)

For that much code, we might as well use a full 64x64->128 multiply
that can be used for all divisors, not just small ones.
Having done that, the same i/999 now generates:

	(386) 4 mul, 9 add, 2 sub, 2 or, 6 shift (112 bytes)
	(arm) 4 mul, 8 add, 2 sub, 2 or, 3 shift (92 bytes)
	(mips) 4 mul, 11 add, 3 sub, 6 shift, 8 sgtu, 4 or (196 bytes)

The size increase on 386 is due to a few extra register spills.
The size increase on mips is due to add-with-carry being hard.

The new approach is more general, letting us delete the old special case
and guarantee that all int64 and uint64 divisions by constants are
generated directly on 32-bit systems.

This especially speeds up code making heavy use of bits.Mul64 with
a constant argument, which happens in strconv and various crypto
packages. A few examples are benchmarked below.

pkg: cmd/compile/internal/test

benchmark \ host                      local  linux-amd64       s7  linux-386  s7:GOARCH=386
                                    vs base      vs base  vs base    vs base        vs base
DivconstI64                               ~            ~        ~    -49.66%        -21.02%
ModconstI64                               ~            ~        ~    -13.45%        +14.52%
DivisiblePow2constI64                     ~            ~        ~     +0.97%         -1.32%
DivisibleconstI64                         ~            ~        ~    -20.01%        -48.28%
DivisibleWDivconstI64                     ~            ~   -1.76%    -38.59%        -42.74%
DivconstU64/3                             ~            ~        ~    -13.82%         -4.09%
DivconstU64/5                             ~            ~        ~    -14.10%         -3.54%
DivconstU64/37                       -2.07%       -4.45%        ~    -19.60%         -9.55%
DivconstU64/1234567                       ~            ~        ~    -61.55%        -56.93%
ModconstU64                               ~            ~        ~     -6.25%              ~
DivisibleconstU64                         ~            ~        ~     -2.78%         -7.82%
DivisibleWDivconstU64                     ~            ~        ~     +4.23%         +2.56%

pkg: math/bits

benchmark \ host         s7  linux-amd64  linux-386  s7:GOARCH=386
                    vs base      vs base    vs base        vs base
Add                       ~            ~          ~              ~
Add32                +1.59%            ~          ~              ~
Add64                     ~            ~          ~              ~
Add64multiple             ~            ~          ~              ~
Sub                       ~            ~          ~              ~
Sub32                     ~            ~          ~              ~
Sub64                     ~            ~     -9.20%              ~
Sub64multiple             ~            ~          ~              ~
Mul                       ~            ~          ~              ~
Mul32                     ~            ~          ~              ~
Mul64                     ~            ~    -41.58%        -53.21%
Div                       ~            ~          ~              ~
Div32                     ~            ~          ~              ~
Div64                     ~            ~          ~              ~

pkg: strconv

benchmark \ host                       s7  linux-amd64  linux-386  s7:GOARCH=386
                                  vs base      vs base    vs base        vs base
ParseInt/Pos/7bit                       ~            ~    -11.08%         -6.75%
ParseInt/Pos/26bit                      ~            ~    -13.65%        -11.02%
ParseInt/Pos/31bit                      ~            ~    -14.65%         -9.71%
ParseInt/Pos/56bit                 -1.80%            ~    -17.97%        -10.78%
ParseInt/Pos/63bit                      ~            ~    -13.85%         -9.63%
ParseInt/Neg/7bit                       ~            ~    -12.14%         -7.26%
ParseInt/Neg/26bit                      ~            ~    -14.18%         -9.81%
ParseInt/Neg/31bit                      ~            ~    -14.51%         -9.02%
ParseInt/Neg/56bit                      ~            ~    -15.79%         -9.79%
ParseInt/Neg/63bit                      ~            ~    -15.68%        -11.07%
AppendFloat/Decimal                     ~            ~     -7.25%        -12.26%
AppendFloat/Float                       ~            ~    -15.96%        -19.45%
AppendFloat/Exp                         ~            ~    -13.96%        -17.76%
AppendFloat/NegExp                      ~            ~    -14.89%        -20.27%
AppendFloat/LongExp                     ~            ~    -12.68%        -17.97%
AppendFloat/Big                         ~            ~    -11.10%        -16.64%
AppendFloat/BinaryExp                   ~            ~          ~              ~
AppendFloat/32Integer                   ~            ~    -10.05%        -10.91%
AppendFloat/32ExactFraction             ~            ~     -8.93%        -13.00%
AppendFloat/32Point                     ~            ~    -10.36%        -14.89%
AppendFloat/32Exp                       ~            ~     -9.88%        -13.54%
AppendFloat/32NegExp                    ~            ~    -10.16%        -14.26%
AppendFloat/32Shortest                  ~            ~    -11.39%        -14.96%
AppendFloat/32Fixed8Hard                ~            ~          ~         -2.31%
AppendFloat/32Fixed9Hard                ~            ~          ~         -7.01%
AppendFloat/64Fixed1                    ~            ~     -2.83%         -8.23%
AppendFloat/64Fixed2                    ~            ~          ~         -7.94%
AppendFloat/64Fixed3                    ~            ~     -4.07%         -7.22%
AppendFloat/64Fixed4                    ~            ~     -7.24%         -7.62%
AppendFloat/64Fixed12                   ~            ~     -6.57%         -4.82%
AppendFloat/64Fixed16                   ~            ~     -4.00%         -5.81%
AppendFloat/64Fixed12Hard          -2.22%            ~     -4.07%         -6.35%
AppendFloat/64Fixed17Hard          -2.12%            ~          ~         -3.79%
AppendFloat/64Fixed18Hard          -1.89%            ~     +2.48%              ~
AppendFloat/Slowpath64             -1.85%            ~    -14.49%        -18.21%
AppendFloat/SlowpathDenormal64          ~            ~    -13.08%        -19.41%

pkg: crypto/internal/fips140/nistec/fiat

benchmark \ host         s7  linux-amd64  linux-386  s7:GOARCH=386
                    vs base      vs base    vs base        vs base
Mul/P224                  ~            ~    -29.95%        -39.60%
Mul/P384                  ~            ~    -37.11%        -63.33%
Mul/P521                  ~            ~    -26.62%        -12.42%
Square/P224          +1.46%            ~    -40.62%        -49.18%
Square/P384               ~            ~    -45.51%        -69.68%
Square/P521         +90.37%            ~    -25.26%        -11.23%

(The +90% is a separate problem and not real; that much variation
can be seen on that system by running the same binary from two
different files.)

pkg: crypto/internal/fips140/edwards25519

benchmark \ host                    s7  linux-amd64  linux-386  s7:GOARCH=386
                               vs base      vs base    vs base        vs base
EncodingDecoding                     ~            ~    -34.67%        -35.75%
ScalarBaseMult                       ~            ~    -31.25%        -30.29%
ScalarMult                           ~            ~    -33.45%        -32.54%
VarTimeDoubleScalarBaseMult          ~            ~    -33.78%        -33.68%

Change-Id: Id3c91d42cd01def6731b755e99f8f40c6ad1bb65
Reviewed-on: https://go-review.googlesource.com/c/go/+/716061
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>

2025-10-30 08:04:20 -07:00

Russ Cox

9bbda7c99d

cmd/compile: make prove understand div, mod better

This CL introduces new divisible and divmod passes that rewrite
divisibility checks and div, mod, and mul. These happen after
prove, so that prove can make better sense of the code for
deriving bounds, and they must run before decompose, so that
64-bit ops can be lowered to 32-bit ops on 32-bit systems.
And then they need another generic pass as well, to optimize
the generated code before decomposing.

The three opt passes are "opt", "middle opt", and "late opt".
(Perhaps instead they should be "generic", "opt", and "late opt"?)

The "late opt" pass repeats the "middle opt" work on any new code
that has been generated in the interim.
There will not be new divs or mods, but there may be new muls.

The x%c==0 rewrite rules are much simpler now, since they can
match before divs have been rewritten. This has the effect of
applying them more consistently and making the rewrite rules
independent of the exact div rewrites.

Prove is also now charged with marking signed div/mod as
unsigned when the arguments call for it, allowing simpler
code to be emitted in various cases. For example,
t.Seconds()/2 and len(x)/2 are now recognized as unsigned,
meaning they compile to a simple shift (unsigned division),
avoiding the more complex fixup we need for signed values.

https://gist.github.com/rsc/99d9d3bd99cde87b6a1a390e3d85aa32
shows a diff of 'go build -a -gcflags=-d=ssa/prove/debug=1 std'
output before and after. "Proved Rsh64x64 shifts to zero" is replaced
by the higher-level "Proved Div64 is unsigned" (the shift was in the
signed expansion of div by constant), but otherwise prove is only
finding more things to prove.

One short example, in code that does x[i%len(x)]:

< runtime/mfinal.go:131:34: Proved Rsh64x64 shifts to zero
---
> runtime/mfinal.go:131:34: Proved Div64 is unsigned
> runtime/mfinal.go:131:38: Proved IsInBounds

A longer example:

< crypto/internal/fips140/sha3/shake.go:28:30: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:38:27: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:53:46: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:55:46: Proved Rsh64x64 shifts to zero
---
> crypto/internal/fips140/sha3/shake.go:28:30: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:28:30: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:28:30: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:38:27: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:45:7: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:46:4: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:53:46: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:53:46: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:53:46: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:55:46: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:55:46: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:55:46: Proved IsSliceInBounds

These diffs are due to the smaller opt being better
and taking work away from prove:

< image/jpeg/dct.go:307:5: Proved IsInBounds
< image/jpeg/dct.go:308:5: Proved IsInBounds
...
< image/jpeg/dct.go:442:5: Proved IsInBounds

In the old opt, Mul by 8 was rewritten to Lsh by 3 early.
This CL delays that rule to help prove recognize mods,
but it also helps opt constant-fold the slice x[8*i:8*i+8:8*i+8].
Specifically, computing the length, opt can now do:

	(Sub64 (Add (Mul 8 i) 8) (Add (Mul 8 i) 8)) ->
	(Add 8 (Sub (Mul 8 i) (Mul 8 i))) ->
	(Add 8 (Mul 8 (Sub i i))) ->
	(Add 8 (Mul 8 0)) ->
	(Add 8 0) ->
	8

The key step is (Sub (Mul x y) (Mul x z)) -> (Mul x (Sub y z)),
Leaving the multiply as Mul enables using that step; the old
rewrite to Lsh blocked it, leaving prove to figure out the length
and then remove the bounds checks. But now opt can evaluate
the length down to a constant 8 and then constant-fold away
the bounds checks 0 < 8, 1 < 8, and so on. After that,
the compiler has nothing left to prove.

Benchmarks are noisy in general; I checked the assembly for the many
large increases below, and the vast majority are unchanged and
presumably hitting the caches differently in some way.

The divisibility optimizations were not reliably triggering before.
This leads to a very large improvement in some cases, like
DivisiblePow2constI64, DivisibleconstI64 on 64-bit systems
and DivisbleconstU64 on 32-bit systems.

Another way the divisibility optimizations were unreliable before
was incorrectly triggering for x/3, x%3 even though they are
written not to do that. There is a real but small slowdown
in the DivisibleWDivconst benchmarks on Mac because in the cases
used in the benchmark, it is still faster (on Mac) to do the
divisibility check than to remultiply.
This may be worth further study. Perhaps when there is no rotate
(meaning the divisor is odd), the divisibility optimization
should be enabled always. In any event, this CL makes it possible
to study that.

benchmark \ host                          s7  linux-amd64      mac  linux-arm64  linux-ppc64le  linux-386  s7:GOARCH=386  linux-arm
                                     vs base      vs base  vs base      vs base        vs base    vs base        vs base    vs base
LoadAdd                                    ~            ~        ~            ~              ~     -1.59%              ~          ~
ExtShift                                   ~            ~  -42.14%       +0.10%              ~     +1.44%         +5.66%     +8.50%
Modify                                     ~            ~        ~            ~              ~          ~              ~     -1.53%
MullImm                                    ~            ~        ~            ~              ~    +37.90%        -21.87%     +3.05%
ConstModify                                ~            ~        ~            ~        -49.14%          ~              ~          ~
BitSet                                     ~            ~        ~            ~        -15.86%    -14.57%         +6.44%     +0.06%
BitClear                                   ~            ~        ~            ~              ~     +1.78%         +3.50%     +0.06%
BitToggle                                  ~            ~        ~            ~              ~    -16.09%         +2.91%          ~
BitSetConst                                ~            ~        ~            ~              ~          ~              ~     -0.49%
BitClearConst                              ~            ~        ~            ~        -28.29%          ~              ~     -0.40%
BitToggleConst                             ~            ~        ~       +8.89%        -31.19%          ~              ~     -0.77%
MulNeg                                     ~            ~        ~            ~              ~          ~              ~          ~
Mul2Neg                                    ~            ~   -4.83%            ~              ~    -13.75%         -5.92%          ~
DivconstI64                                ~            ~        ~            ~              ~    -30.12%              ~     +0.50%
ModconstI64                                ~            ~   -9.94%       -4.63%              ~     +3.15%              ~     +5.32%
DivisiblePow2constI64                -34.49%      -12.58%        ~            ~        -12.25%          ~              ~          ~
DivisibleconstI64                    -24.69%      -25.06%   -0.40%       -2.27%        -42.61%     -3.31%              ~     +1.63%
DivisibleWDivconstI64                      ~            ~        ~            ~              ~    -17.55%              ~     -0.60%
DivconstU64/3                              ~            ~        ~            ~              ~     +1.51%              ~          ~
DivconstU64/5                              ~            ~        ~            ~              ~          ~              ~          ~
DivconstU64/37                             ~            ~   -0.18%            ~              ~     +2.70%              ~          ~
DivconstU64/1234567                        ~            ~        ~            ~              ~          ~              ~     +0.12%
ModconstU64                                ~            ~        ~       -0.24%              ~     -5.10%         -1.07%     -1.56%
DivisibleconstU64                          ~            ~        ~            ~              ~    -29.01%        -59.13%    -50.72%
DivisibleWDivconstU64                      ~            ~  -12.18%      -18.88%              ~     -5.50%         -3.91%     +5.17%
DivconstI32                                ~            ~   -0.48%            ~        -34.69%    +89.01%         -6.01%    -16.67%
ModconstI32                                ~       +2.95%   -0.33%            ~              ~     -2.98%         -5.40%     -8.30%
DivisiblePow2constI32                      ~            ~        ~            ~              ~          ~              ~    -16.22%
DivisibleconstI32                          ~            ~        ~            ~              ~    -37.27%        -47.75%    -25.03%
DivisibleWDivconstI32                -11.59%       +5.22%  -12.99%      -23.83%              ~    +45.95%         -7.03%    -10.01%
DivconstU32                                ~            ~        ~            ~              ~    +74.71%         +4.81%          ~
ModconstU32                                ~            ~   +0.53%       +0.18%              ~    +51.16%              ~          ~
DivisibleconstU32                          ~            ~        ~       -0.62%              ~     -4.25%              ~          ~
DivisibleWDivconstU32                 -2.77%       +5.56%  +11.12%       -5.15%              ~    +48.70%        +25.11%     -4.07%
DivconstI16                           -6.06%            ~   -0.33%       +0.22%              ~          ~         -9.68%     +5.47%
ModconstI16                                ~            ~   +4.44%       +2.82%              ~          ~              ~     +5.06%
DivisiblePow2constI16                      ~            ~        ~            ~              ~          ~              ~     -0.17%
DivisibleconstI16                          ~            ~   -0.23%            ~              ~          ~         +4.60%     +6.64%
DivisibleWDivconstI16                 -1.44%       -0.43%  +13.48%       -5.76%              ~     +1.62%        -23.15%     -9.06%
DivconstU16                           +1.61%            ~   -0.35%       -0.47%              ~          ~        +15.59%          ~
ModconstU16                                ~            ~        ~            ~              ~     -0.72%              ~    +14.23%
DivisibleconstU16                          ~            ~   -0.05%       +3.00%              ~          ~              ~     +5.06%
DivisibleWDivconstU16                +52.10%       +0.75%  +17.28%       +4.79%              ~    -37.39%         +5.28%     -9.06%
DivconstI8                                 ~            ~   -0.34%       -0.96%              ~          ~         -9.20%          ~
ModconstI8                            +2.29%            ~   +4.38%       +2.96%              ~          ~              ~          ~
DivisiblePow2constI8                       ~            ~        ~            ~              ~          ~              ~          ~
DivisibleconstI8                           ~            ~        ~            ~              ~          ~         +6.04%          ~
DivisibleWDivconstI8                 -26.44%       +1.69%  +17.03%       +4.05%              ~    +32.48%        -24.90%          ~
DivconstU8                            -4.50%      +14.06%   -0.28%            ~              ~          ~         +4.16%     +0.88%
ModconstU8                                 ~            ~  +25.84%       -0.64%              ~          ~              ~          ~
DivisibleconstU8                           ~            ~   -5.70%            ~              ~          ~              ~          ~
DivisibleWDivconstU8                 +49.55%       +9.07%        ~       +4.03%        +53.87%    -40.03%        +39.72%     -3.01%
Mul2                                       ~            ~        ~            ~              ~          ~              ~          ~
MulNeg2                                    ~            ~        ~            ~        -11.73%          ~              ~     -0.02%
EfaceInteger                               ~            ~        ~            ~              ~    +18.11%              ~     +2.53%
TypeAssert                           +33.90%       +2.86%        ~            ~              ~     -1.07%         -5.29%     -1.04%
Div64UnsignedSmall                         ~            ~        ~            ~              ~          ~              ~          ~
Div64Small                                 ~            ~        ~            ~              ~     -0.88%              ~     +2.39%
Div64SmallNegDivisor                       ~            ~        ~            ~              ~          ~              ~     +0.35%
Div64SmallNegDividend                      ~            ~        ~            ~              ~     -0.84%              ~     +3.57%
Div64SmallNegBoth                          ~            ~        ~            ~              ~     -0.86%              ~     +3.55%
Div64Unsigned                              ~            ~        ~            ~              ~          ~              ~     -0.11%
Div64                                      ~            ~        ~            ~              ~          ~              ~     +0.11%
Div64NegDivisor                            ~            ~        ~            ~              ~     -1.29%              ~          ~
Div64NegDividend                           ~            ~        ~            ~              ~     -1.44%              ~          ~
Div64NegBoth                               ~            ~        ~            ~              ~          ~              ~     +0.28%
Mod64UnsignedSmall                         ~            ~        ~            ~              ~     +0.48%              ~     +0.93%
Mod64Small                                 ~            ~        ~            ~              ~          ~              ~          ~
Mod64SmallNegDivisor                       ~            ~        ~            ~              ~          ~              ~     +1.44%
Mod64SmallNegDividend                      ~            ~        ~            ~              ~     +0.22%              ~     +1.37%
Mod64SmallNegBoth                          ~            ~        ~            ~              ~          ~              ~     -2.22%
Mod64Unsigned                              ~            ~        ~            ~              ~     -0.95%              ~     +0.11%
Mod64                                      ~            ~        ~            ~              ~          ~              ~          ~
Mod64NegDivisor                            ~            ~        ~            ~              ~          ~              ~     -0.02%
Mod64NegDividend                           ~            ~        ~            ~              ~          ~              ~          ~
Mod64NegBoth                               ~            ~        ~            ~              ~          ~              ~     -0.02%
MulconstI32/3                              ~            ~        ~      -25.00%              ~          ~              ~    +47.37%
MulconstI32/5                              ~            ~        ~      +33.28%              ~          ~              ~    +32.21%
MulconstI32/12                             ~            ~        ~       -2.13%              ~          ~              ~     -0.02%
MulconstI32/120                            ~            ~        ~       +2.93%              ~          ~              ~     -0.03%
MulconstI32/-120                           ~            ~        ~       -2.17%              ~          ~              ~     -0.03%
MulconstI32/65537                          ~            ~        ~            ~              ~          ~              ~     +0.03%
MulconstI32/65538                          ~            ~        ~            ~              ~    -33.38%              ~     +0.04%
MulconstI64/3                              ~            ~        ~      +33.35%              ~     -0.37%              ~     -0.13%
MulconstI64/5                              ~            ~        ~      -25.00%              ~     -0.34%              ~          ~
MulconstI64/12                             ~            ~        ~       +2.13%              ~    +11.62%              ~     +2.30%
MulconstI64/120                            ~            ~        ~       -1.98%              ~          ~              ~          ~
MulconstI64/-120                           ~            ~        ~       +0.75%              ~          ~              ~          ~
MulconstI64/65537                          ~            ~        ~            ~              ~     +5.61%              ~          ~
MulconstI64/65538                          ~            ~        ~            ~              ~     +5.25%              ~          ~
MulconstU32/3                              ~       +0.81%        ~      +33.39%              ~    +77.92%              ~    -32.31%
MulconstU32/5                              ~            ~        ~      -24.97%              ~    +77.92%              ~    -24.47%
MulconstU32/12                             ~            ~        ~       +2.06%              ~          ~              ~     +0.03%
MulconstU32/120                            ~            ~        ~       -2.74%              ~          ~              ~     +0.03%
MulconstU32/65537                          ~            ~        ~            ~              ~          ~              ~     +0.03%
MulconstU32/65538                          ~            ~        ~            ~              ~    -33.42%              ~     -0.03%
MulconstU64/3                              ~            ~        ~      +33.33%              ~     -0.28%              ~     +1.22%
MulconstU64/5                              ~            ~        ~      -25.00%              ~          ~              ~     -0.64%
MulconstU64/12                             ~            ~        ~       +2.30%              ~    +11.59%              ~     +0.14%
MulconstU64/120                            ~            ~        ~       -2.82%              ~          ~              ~     +0.04%
MulconstU64/65537                          ~       +0.37%        ~            ~              ~     +5.58%              ~          ~
MulconstU64/65538                          ~            ~        ~            ~              ~     +5.16%              ~          ~
ShiftArithmeticRight                       ~            ~        ~            ~              ~    -10.81%              ~     +0.31%
Switch8Predictable                   +14.69%            ~        ~            ~              ~    -24.85%              ~          ~
Switch8Unpredictable                       ~       -0.58%   -3.80%            ~              ~    -11.78%              ~     -0.79%
Switch32Predictable                  -10.33%      +17.89%        ~            ~              ~     +5.76%              ~          ~
Switch32Unpredictable                 -3.15%       +1.19%   +9.42%            ~              ~    -10.30%         -5.09%     +0.44%
SwitchStringPredictable              +70.88%      +20.48%        ~            ~              ~     +2.39%              ~     +0.31%
SwitchStringUnpredictable                  ~       +3.91%   -5.06%       -0.98%              ~     +0.61%         +2.03%          ~
SwitchTypePredictable               +146.58%       -1.10%        ~      -12.45%              ~     -0.46%         -3.81%          ~
SwitchTypeUnpredictable               +0.46%       -0.83%        ~       +4.18%              ~     +0.43%              ~     +0.62%
SwitchInterfaceTypePredictable       -13.41%      -10.13%  +11.03%            ~              ~     -4.38%              ~     +0.75%
SwitchInterfaceTypeUnpredictable      -6.37%       -2.14%        ~       -3.21%              ~     -4.20%              ~     +1.08%

Fixes #63110.
Fixes #75954.

Change-Id: I55a876f08c6c14f419ce1a8cbba2eaae6c6efbf0
Reviewed-on: https://go-review.googlesource.com/c/go/+/714160
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

2025-10-29 18:49:40 -07:00

3 commits