Stowage/go - Remotebranch.eu

Stowage/go

mirror of https://github.com/golang/go.git synced 2025-12-08 06:10:04 +00:00

Author	SHA1	Message	Date
Meng Zhuo	e7d47ac33d	cmd/compile: simplify negative on multiplication

goos: linux
goarch: amd64
pkg: cmd/compile/internal/test
cpu: AMD EPYC 7532 32-Core Processor
               │ simplify_base │            simplify_new             │
               │    sec/op     │   sec/op     vs base                │
SimplifyNegMul     623.0n ± 0%   319.3n ± 1%  -48.75% (p=0.000 n=10)

goos: linux
goarch: riscv64
pkg: cmd/compile/internal/test
cpu: Spacemit(R) X60
               │ simplify.base │            simplify.new             │
               │    sec/op     │   sec/op     vs base                │
SimplifyNegMul    10.928µ ± 0%   6.432µ ± 0%  -41.14% (p=0.000 n=10)

Change-Id: I1d9393cd19a0b948a5d3a512d627cdc0cf0b38be
Reviewed-on: https://go-review.googlesource.com/c/go/+/721520
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <markfreeman@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>	2025-11-21 12:40:29 -08:00
khr@golang.org	32f5aadd2f	cmd/compile: stack allocate backing stores during append

We can already stack allocate the backing store during append if the resulting backing store doesn't escape. See CL 664299.

This CL enables us to often stack allocate the backing store during append even if the result escapes. Typically, for code like:

    func f(n int) []int {
        var r []int
        for i := range n {
            r = append(r, i)
        }
        return r
    }

the backing store for r escapes, but only by returning it. Could we operate with r on the stack for most of its lifetime, and only move it to the heap at the return point?

The current implementation of append will need to do an allocation each time it calls growslice. This will happen on the 1st, 2nd, 4th, 8th, etc. append calls. The allocations done by all but the last growslice call will then immediately be garbage.

We'd like to avoid doing some of those intermediate allocations if possible. We rewrite the above code by introducing a move2heap operation:

    func f(n int) []int {
        var r []int
        for i := range n {
            r = append(r, i)
        }
        r = move2heap(r)
        return r
    }

Using the move2heap runtime function, which does:

    move2heap(r):
        If r is already backed by heap storage, return r.
        Otherwise, copy r to the heap and return the copy.

Now we can treat the backing store of r allocated at the append site as not escaping. Previous stack allocation optimizations now apply, which can use a fixed-size stack-allocated backing store for r when appending.

See the description in cmd/compile/internal/slice/slice.go for how we ensure that this optimization is safe.

Change-Id: I81f36e58bade2241d07f67967d8d547fff5302b8
Reviewed-on: https://go-review.googlesource.com/c/go/+/707755
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-11-20 09:19:39 -08:00
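The move2heap idea can be imitated by hand in ordinary Go. The following is only a user-level analogy, not the compiler's implementation: `collect` and `stackCap` are hypothetical names, and unlike the real move2heap this sketch always copies at the return point rather than checking whether the data already lives on the heap.

```go
package main

import "fmt"

// stackCap is a hypothetical fixed buffer size chosen for illustration.
const stackCap = 32

// collect builds the slice [0, n). It appends into a fixed-size local
// buffer first and copies the result to a freshly allocated slice only
// once, at the return point.
func collect(n int) []int {
	var buf [stackCap]int // local buffer; it is never returned directly
	r := buf[:0]
	for i := 0; i < n; i++ {
		r = append(r, i) // reallocates only if n exceeds stackCap
	}
	// Hand-written stand-in for move2heap (assumption: always copy).
	out := make([]int, len(r))
	copy(out, r)
	return out
}

func main() {
	fmt.Println(collect(5))
}
```

For small n this performs a single allocation at the end instead of one per growth step, which is the effect the commit message describes for the intermediate growslice calls.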
Keith Randall	ba634ca5c7	cmd/compile: fold boolean NOT into branches Gets rid of an EOR $1 instruction. Change-Id: Ib032b0cee9ac484329c978af9b1305446f8d5dac Reviewed-on: https://go-review.googlesource.com/c/go/+/721501 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Keith Randall <khr@google.com>	2025-11-18 09:31:58 -08:00
Keith Randall	e1a12c781f	cmd/compile: use 32x32->64 multiplies on arm64 Gets rid of some sign extensions. Change-Id: Ie67ef36b4ca1cd1a2cd9fa5d84578db553578a22 Reviewed-on: https://go-review.googlesource.com/c/go/+/721241 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Keith Randall <khr@google.com>	2025-11-17 13:45:54 -08:00
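As a rough illustration of the source pattern a 32x32->64 multiply covers, the sketch below widens two 32-bit operands before multiplying. The function names are hypothetical, and whether the compiler selects a widening multiply (arm64 has SMULL and UMULL for this) for exactly this code is an assumption, not something the commit message states.

```go
package main

import "fmt"

// mulWide multiplies two int32 values and returns the full 64-bit product.
// The operands are sign-extended in the source; a widening multiply
// instruction can make those extensions unnecessary.
func mulWide(a, b int32) int64 {
	return int64(a) * int64(b)
}

// mulWideU is the unsigned counterpart, using zero extension.
func mulWideU(a, b uint32) uint64 {
	return uint64(a) * uint64(b)
}

func main() {
	fmt.Println(mulWide(-3, 1<<30))
	fmt.Println(mulWideU(1<<31, 4))
}
```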
Meng Zhuo	2cdcc4150b	cmd/compile: fold negation into multiplication

goos: linux
goarch: riscv64
pkg: cmd/compile/internal/test
cpu: Spacemit(R) X60
        │ /root/mul.base.log │          /root/mul.new.log          │
        │       sec/op       │   sec/op     vs base                │
MulNeg           6.426µ ± 0%   4.501µ ± 0%  -29.96% (p=0.000 n=10)
Mul2Neg          9.000µ ± 0%   6.431µ ± 0%  -28.54% (p=0.000 n=10)
Mul2             1.263µ ± 0%   1.263µ ± 0%        ~ (p=1.000 n=10)
MulNeg2          1.577µ ± 0%   1.577µ ± 0%        ~ (p=0.211 n=10)
geomean          3.276µ        2.756µ       -15.89%

goos: linux
goarch: amd64
pkg: cmd/compile/internal/test
cpu: AMD EPYC 7532 32-Core Processor
        │ /root/base  │              /root/new              │
        │   sec/op    │   sec/op     vs base                │
MulNeg    691.9n ± 1%   319.4n ± 0%  -53.83% (p=0.000 n=10)
Mul2Neg   630.0n ± 0%   629.6n ± 0%   -0.07% (p=0.000 n=10)
Mul2      438.1n ± 0%   438.1n ± 0%        ~ (p=0.728 n=10)
MulNeg2   439.3n ± 0%   439.4n ± 0%        ~ (p=0.656 n=10)
geomean   538.2n        443.6n       -17.58%

Change-Id: Ice8e6c8d1e8e3009ba8a0b1b689205174e199019
Reviewed-on: https://go-review.googlesource.com/c/go/+/720180
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>	2025-11-14 11:01:22 -08:00
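The commit message does not show the rewrite rules themselves, so the following is only a sketch of the kind of identity such a fold relies on: negating a product equals multiplying by the negated constant, even under two's-complement wraparound. The constant 5 and the function names are assumptions made for illustration.

```go
package main

import "fmt"

// negMulConst computes -(x * 5), the unfolded form: a multiply then a negate.
func negMulConst(x int64) int64 { return -(x * 5) }

// mulNegConst computes x * -5, the folded form: a single multiply.
func mulNegConst(x int64) int64 { return x * -5 }

func main() {
	// Includes extreme values to show the identity survives overflow.
	for _, x := range []int64{0, 1, -7, 1 << 62, -1 << 63} {
		fmt.Println(x, negMulConst(x) == mulNegConst(x))
	}
}
```

Because both forms agree modulo 2^64 for every input, a compiler can replace one with the other without changing observable behavior.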
Michael Munday	0a569528ea	cmd/compile: optimize comparisons with single bit difference

Optimize comparisons with constants that differ in only one bit (i.e. by a power of 2). For example:

    x == 4 || x == 6 -> x|2 == 6
    x != 1 && x != 5 -> x|4 != 5

Change-Id: Ic61719e5118446d21cf15652d9da22f7d95b2a15
Reviewed-on: https://go-review.googlesource.com/c/go/+/719420
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>	2025-11-14 10:59:56 -08:00
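The two example rewrites from this commit message can be checked exhaustively for a small integer type. This sketch uses uint8 and hypothetical function names; it verifies the identities only and says nothing about how the compiler implements the rule.

```go
package main

import "fmt"

// Original and rewritten forms of the two examples in the commit message.
func orig1(x uint8) bool { return x == 4 || x == 6 }
func opt1(x uint8) bool  { return x|2 == 6 }
func orig2(x uint8) bool { return x != 1 && x != 5 }
func opt2(x uint8) bool  { return x|4 != 5 }

// equivalent reports whether each rewrite agrees with its original
// for every possible uint8 value.
func equivalent() bool {
	for i := 0; i < 256; i++ {
		x := uint8(i)
		if orig1(x) != opt1(x) || orig2(x) != opt2(x) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equivalent())
}
```

The rewrite works because OR-ing in the single differing bit maps both constants to the larger one, so one comparison replaces two.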
matloob@golang.org	d50a571ddf	test: fix tests to work with sizespecializedmalloc turned off Cq-Include-Trybots: luci.golang.try:gotip-linux-386-nosizespecializedmalloc,gotip-linux-amd64-nosizespecializedmalloc,gotip-linux-arm64-nosizespecializedmalloc Change-Id: I6a6a696465004b939c989afc058c4c3e1fb7134f Reviewed-on: https://go-review.googlesource.com/c/go/+/720401 Auto-Submit: Michael Matloob <matloob@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Matloob <matloob@google.com>	2025-11-13 16:57:31 -08:00
Michael Munday	34aef89366	cmd/compile: use FCLASSD for subnormal checks on riscv64 Only implemented for 64 bit floating point operations for now. goos: linux goarch: riscv64 pkg: math cpu: Spacemit(R) X60 │ sec/op │ sec/op vs base │ Acos 154.1n ± 0% 154.1n ± 0% ~ (p=0.303 n=10) Acosh 215.8n ± 6% 226.7n ± 0% ~ (p=0.439 n=10) Asin 149.2n ± 1% 149.2n ± 0% ~ (p=0.700 n=10) Asinh 262.1n ± 0% 258.5n ± 0% -1.37% (p=0.000 n=10) Atan 99.48n ± 0% 99.49n ± 0% ~ (p=0.836 n=10) Atanh 244.9n ± 0% 243.8n ± 0% -0.43% (p=0.002 n=10) Atan2 158.2n ± 1% 153.3n ± 0% -3.10% (p=0.000 n=10) Cbrt 186.8n ± 0% 181.1n ± 0% -3.03% (p=0.000 n=10) Ceil 36.71n ± 1% 36.71n ± 0% ~ (p=0.434 n=10) Copysign 6.531n ± 1% 6.526n ± 0% ~ (p=0.268 n=10) Cos 98.19n ± 0% 95.40n ± 0% -2.84% (p=0.000 n=10) Cosh 233.1n ± 0% 222.6n ± 0% -4.50% (p=0.000 n=10) Erf 122.5n ± 0% 114.2n ± 0% -6.78% (p=0.000 n=10) Erfc 126.0n ± 1% 116.6n ± 0% -7.46% (p=0.000 n=10) Erfinv 138.8n ± 0% 138.6n ± 0% ~ (p=0.082 n=10) Erfcinv 140.0n ± 0% 139.7n ± 0% ~ (p=0.359 n=10) Exp 193.3n ± 0% 184.2n ± 0% -4.68% (p=0.000 n=10) ExpGo 204.8n ± 0% 194.5n ± 0% -5.03% (p=0.000 n=10) Expm1 152.5n ± 1% 145.0n ± 0% -4.92% (p=0.000 n=10) Exp2 174.5n ± 0% 164.2n ± 0% -5.85% (p=0.000 n=10) Exp2Go 184.4n ± 1% 175.4n ± 0% -4.88% (p=0.000 n=10) Abs 4.912n ± 0% 4.914n ± 0% ~ (p=0.283 n=10) Dim 15.50n ± 1% 15.52n ± 1% ~ (p=0.331 n=10) Floor 36.89n ± 1% 36.76n ± 1% ~ (p=0.325 n=10) Max 31.05n ± 1% 31.17n ± 1% ~ (p=0.628 n=10) Min 31.01n ± 0% 31.06n ± 0% ~ (p=0.767 n=10) Mod 294.1n ± 0% 245.6n ± 0% -16.52% (p=0.000 n=10) Frexp 44.86n ± 1% 35.20n ± 0% -21.53% (p=0.000 n=10) Gamma 195.8n ± 0% 185.4n ± 1% -5.29% (p=0.000 n=10) Hypot 84.91n ± 0% 84.54n ± 1% -0.43% (p=0.006 n=10) HypotGo 96.70n ± 0% 95.42n ± 1% -1.32% (p=0.000 n=10) Ilogb 45.03n ± 0% 35.07n ± 1% -22.10% (p=0.000 n=10) J0 634.5n ± 0% 627.2n ± 0% -1.16% (p=0.000 n=10) J1 644.5n ± 0% 636.9n ± 0% -1.18% (p=0.000 n=10) Jn 1.357µ ± 0% 1.344µ ± 0% -0.92% (p=0.000 n=10) Ldexp 49.89n ± 0% 39.96n ± 0% 
-19.90% (p=0.000 n=10) Lgamma 186.6n ± 0% 184.3n ± 0% -1.21% (p=0.000 n=10) Log 150.4n ± 0% 141.1n ± 0% -6.15% (p=0.000 n=10) Logb 46.70n ± 0% 35.89n ± 0% -23.15% (p=0.000 n=10) Log1p 164.1n ± 0% 163.9n ± 0% ~ (p=0.122 n=10) Log10 153.1n ± 0% 143.5n ± 0% -6.24% (p=0.000 n=10) Log2 58.83n ± 0% 49.75n ± 0% -15.43% (p=0.000 n=10) Modf 40.82n ± 1% 40.78n ± 0% ~ (p=0.239 n=10) Nextafter32 49.15n ± 0% 48.93n ± 0% -0.44% (p=0.011 n=10) Nextafter64 43.33n ± 0% 43.23n ± 0% ~ (p=0.228 n=10) PowInt 269.4n ± 0% 243.8n ± 0% -9.49% (p=0.000 n=10) PowFrac 618.0n ± 0% 571.7n ± 0% -7.48% (p=0.000 n=10) Pow10Pos 13.09n ± 0% 13.05n ± 0% -0.31% (p=0.003 n=10) Pow10Neg 30.99n ± 1% 30.99n ± 0% ~ (p=0.173 n=10) Round 23.73n ± 0% 23.65n ± 0% -0.36% (p=0.011 n=10) RoundToEven 27.87n ± 0% 27.73n ± 0% -0.48% (p=0.003 n=10) Remainder 282.1n ± 0% 249.6n ± 0% -11.52% (p=0.000 n=10) Signbit 11.46n ± 0% 11.42n ± 0% -0.39% (p=0.003 n=10) Sin 115.2n ± 0% 113.2n ± 0% -1.74% (p=0.000 n=10) Sincos 140.6n ± 0% 138.6n ± 0% -1.39% (p=0.000 n=10) Sinh 252.0n ± 0% 241.4n ± 0% -4.21% (p=0.000 n=10) SqrtIndirect 4.909n ± 0% 4.893n ± 0% -0.34% (p=0.021 n=10) SqrtLatency 19.57n ± 1% 19.57n ± 0% ~ (p=0.087 n=10) SqrtIndirectLatency 19.64n ± 0% 19.57n ± 0% -0.36% (p=0.025 n=10) SqrtGoLatency 198.1n ± 0% 197.4n ± 0% -0.35% (p=0.014 n=10) SqrtPrime 5.733µ ± 0% 5.725µ ± 0% ~ (p=0.116 n=10) Tan 149.1n ± 0% 146.8n ± 0% -1.54% (p=0.000 n=10) Tanh 248.2n ± 1% 238.1n ± 0% -4.05% (p=0.000 n=10) Trunc 36.86n ± 0% 36.70n ± 0% -0.43% (p=0.029 n=10) Y0 638.2n ± 0% 633.6n ± 0% -0.71% (p=0.000 n=10) Y1 641.8n ± 0% 636.1n ± 0% -0.87% (p=0.000 n=10) Yn 1.358µ ± 0% 1.345µ ± 0% -0.92% (p=0.000 n=10) Float64bits 5.721n ± 0% 5.709n ± 0% -0.22% (p=0.044 n=10) Float64frombits 4.905n ± 0% 4.893n ± 0% ~ (p=0.266 n=10) Float32bits 12.27n ± 0% 12.23n ± 0% ~ (p=0.122 n=10) Float32frombits 4.909n ± 0% 4.893n ± 0% -0.32% (p=0.024 n=10) FMA 6.556n ± 0% 6.526n ± 0% ~ (p=0.283 n=10) geomean 86.82n 83.75n -3.54% Change-Id: 
I522297a79646d76543d516accce291f5a3cea337 Reviewed-on: https://go-review.googlesource.com/c/go/+/717560 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com>	2025-11-12 10:03:41 -08:00
Youlin Feng	c7ccbddf22	cmd/compile/internal/ssa: more aggressive on dead auto elim

Propagate "unread" across OpMoves. If the address of an auto is only used by an OpMove as its source arg, and the OpMove's target arg is the address of another auto, then this auto can be eliminated whenever the second auto can be eliminated.

This CL eliminates unnecessary memory copies and makes the frame smaller in the following code snippet:

    func contains(m map[string][16]int, k string) bool {
        _, ok := m[k]
        return ok
    }

These are the benchmark results followed by the benchmark code:

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
Map1Access2Ok-8   9.582n ± 2%   9.226n ± 0%   -3.72% (p=0.000 n=20)
Map2Access2Ok-8   13.79n ± 1%   10.24n ± 1%  -25.77% (p=0.000 n=20)
Map3Access2Ok-8   68.68n ± 1%   12.65n ± 1%  -81.58% (p=0.000 n=20)

    package main_test

    import "testing"

    var (
        m1 = map[int]int{}
        m2 = map[int][16]int{}
        m3 = map[int][256]int{}
    )

    func init() {
        for i := range 1000 {
            m1[i] = i
            m2[i] = [16]int{15: i}
            m3[i] = [256]int{255: i}
        }
    }

    func BenchmarkMap1Access2Ok(b *testing.B) {
        for i := range b.N {
            _, ok := m1[i%1000]
            if !ok {
                b.Errorf("%d not found", i)
            }
        }
    }

    func BenchmarkMap2Access2Ok(b *testing.B) {
        for i := range b.N {
            _, ok := m2[i%1000]
            if !ok {
                b.Errorf("%d not found", i)
            }
        }
    }

    func BenchmarkMap3Access2Ok(b *testing.B) {
        for i := range b.N {
            _, ok := m3[i%1000]
            if !ok {
                b.Errorf("%d not found", i)
            }
        }
    }

Fixes #75398

Change-Id: If75e9caaa50d460efc31a94565b9ba28c8158771
Reviewed-on: https://go-review.googlesource.com/c/go/+/702875
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>	2025-11-04 12:46:15 -08:00
Russ Cox	6e165b4d17	cmd/compile: implement Avg64u, Hmul64, Hmul64u for wasm This lets us remove useAvg and useHmul from the division rules. The compiler is simpler and the generated code is faster. goos: wasip1 goarch: wasm pkg: internal/strconv │ old.txt │ new.txt │ │ sec/op │ sec/op vs base │ AppendFloat/Decimal 192.8n ± 1% 194.6n ± 0% +0.91% (p=0.000 n=10) AppendFloat/Float 328.6n ± 0% 279.6n ± 0% -14.93% (p=0.000 n=10) AppendFloat/Exp 335.6n ± 1% 289.2n ± 1% -13.80% (p=0.000 n=10) AppendFloat/NegExp 336.0n ± 0% 289.1n ± 1% -13.97% (p=0.000 n=10) AppendFloat/LongExp 332.4n ± 0% 285.2n ± 1% -14.20% (p=0.000 n=10) AppendFloat/Big 348.2n ± 0% 300.1n ± 0% -13.83% (p=0.000 n=10) AppendFloat/BinaryExp 137.4n ± 0% 138.2n ± 0% +0.55% (p=0.001 n=10) AppendFloat/32Integer 193.3n ± 1% 196.5n ± 0% +1.66% (p=0.000 n=10) AppendFloat/32ExactFraction 283.3n ± 0% 268.9n ± 1% -5.08% (p=0.000 n=10) AppendFloat/32Point 279.9n ± 0% 266.5n ± 0% -4.80% (p=0.000 n=10) AppendFloat/32Exp 300.1n ± 0% 288.3n ± 1% -3.90% (p=0.000 n=10) AppendFloat/32NegExp 288.2n ± 1% 277.9n ± 1% -3.59% (p=0.000 n=10) AppendFloat/32Shortest 261.7n ± 0% 250.2n ± 0% -4.39% (p=0.000 n=10) AppendFloat/32Fixed8Hard 173.3n ± 1% 158.9n ± 1% -8.31% (p=0.000 n=10) AppendFloat/32Fixed9Hard 180.0n ± 0% 167.9n ± 2% -6.70% (p=0.000 n=10) AppendFloat/64Fixed1 167.1n ± 0% 149.6n ± 1% -10.50% (p=0.000 n=10) AppendFloat/64Fixed2 162.4n ± 1% 146.5n ± 0% -9.73% (p=0.000 n=10) AppendFloat/64Fixed2.5 165.5n ± 0% 149.4n ± 1% -9.70% (p=0.000 n=10) AppendFloat/64Fixed3 166.4n ± 1% 150.2n ± 0% -9.74% (p=0.000 n=10) AppendFloat/64Fixed4 163.7n ± 0% 149.6n ± 1% -8.62% (p=0.000 n=10) AppendFloat/64Fixed5Hard 182.8n ± 1% 167.1n ± 1% -8.61% (p=0.000 n=10) AppendFloat/64Fixed12 222.2n ± 0% 208.8n ± 0% -6.05% (p=0.000 n=10) AppendFloat/64Fixed16 197.6n ± 1% 181.7n ± 0% -8.02% (p=0.000 n=10) AppendFloat/64Fixed12Hard 194.5n ± 0% 181.0n ± 0% -6.99% (p=0.000 n=10) AppendFloat/64Fixed17Hard 205.1n ± 1% 191.9n ± 0% -6.44% (p=0.000 n=10) 
AppendFloat/64Fixed18Hard 6.269µ ± 0% 6.643µ ± 0% +5.97% (p=0.000 n=10) AppendFloat/64FixedF1 211.7n ± 1% 197.0n ± 0% -6.95% (p=0.000 n=10) AppendFloat/64FixedF2 189.4n ± 0% 174.2n ± 0% -8.08% (p=0.000 n=10) AppendFloat/64FixedF3 169.0n ± 0% 154.9n ± 0% -8.32% (p=0.000 n=10) AppendFloat/Slowpath64 321.2n ± 0% 274.2n ± 1% -14.63% (p=0.000 n=10) AppendFloat/SlowpathDenormal64 307.4n ± 1% 261.2n ± 0% -15.03% (p=0.000 n=10) AppendInt 3.367µ ± 1% 3.376µ ± 0% ~ (p=0.517 n=10) AppendUint 675.5n ± 0% 676.9n ± 0% ~ (p=0.196 n=10) AppendIntSmall 28.13n ± 1% 28.17n ± 0% +0.14% (p=0.015 n=10) AppendUintVarlen/digits=1 20.70n ± 0% 20.51n ± 1% -0.89% (p=0.018 n=10) AppendUintVarlen/digits=2 20.43n ± 0% 20.27n ± 0% -0.81% (p=0.001 n=10) AppendUintVarlen/digits=3 38.48n ± 0% 37.93n ± 0% -1.43% (p=0.000 n=10) AppendUintVarlen/digits=4 41.10n ± 0% 38.78n ± 1% -5.62% (p=0.000 n=10) AppendUintVarlen/digits=5 42.25n ± 1% 42.11n ± 0% -0.32% (p=0.041 n=10) AppendUintVarlen/digits=6 45.40n ± 1% 43.14n ± 0% -4.98% (p=0.000 n=10) AppendUintVarlen/digits=7 46.81n ± 1% 46.03n ± 0% -1.66% (p=0.000 n=10) AppendUintVarlen/digits=8 48.88n ± 1% 46.59n ± 1% -4.68% (p=0.000 n=10) AppendUintVarlen/digits=9 49.94n ± 2% 49.41n ± 1% -1.06% (p=0.000 n=10) AppendUintVarlen/digits=10 57.28n ± 1% 56.92n ± 1% -0.62% (p=0.045 n=10) AppendUintVarlen/digits=11 60.09n ± 1% 58.11n ± 2% -3.30% (p=0.000 n=10) AppendUintVarlen/digits=12 62.22n ± 0% 61.85n ± 0% -0.59% (p=0.000 n=10) AppendUintVarlen/digits=13 64.94n ± 0% 62.92n ± 0% -3.10% (p=0.000 n=10) AppendUintVarlen/digits=14 65.42n ± 1% 65.19n ± 1% -0.34% (p=0.005 n=10) AppendUintVarlen/digits=15 68.17n ± 0% 66.13n ± 0% -2.99% (p=0.000 n=10) AppendUintVarlen/digits=16 70.21n ± 1% 70.09n ± 1% ~ (p=0.517 n=10) AppendUintVarlen/digits=17 72.93n ± 0% 70.49n ± 0% -3.34% (p=0.000 n=10) AppendUintVarlen/digits=18 73.01n ± 0% 72.75n ± 0% -0.35% (p=0.000 n=10) AppendUintVarlen/digits=19 79.27n ± 1% 79.49n ± 1% ~ (p=0.671 n=10) AppendUintVarlen/digits=20 82.18n ± 0% 
80.43n ± 1% -2.14% (p=0.000 n=10) geomean 143.4n 136.0n -5.20% Change-Id: I8245814a0259ad13cf9225f57db8e9fe3d2e4267 Reviewed-on: https://go-review.googlesource.com/c/go/+/717407 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>	2025-11-04 11:38:18 -08:00
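To make the op names concrete, the sketch below gives portable Go definitions of what Hmul64u, Hmul64, and Avg64u are assumed to compute, judging by their names: the high 64 bits of an unsigned or signed 128-bit product, and an unsigned average that does not lose the carry. The lowercase function names are hypothetical, and this is not the wasm lowering from the CL.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hmul64u returns the high 64 bits of the unsigned 128-bit product x*y.
func hmul64u(x, y uint64) uint64 {
	hi, _ := bits.Mul64(x, y)
	return hi
}

// hmul64 returns the high 64 bits of the signed 128-bit product x*y.
// It starts from the unsigned high word and subtracts the standard
// correction terms for negative operands.
func hmul64(x, y int64) int64 {
	hi := hmul64u(uint64(x), uint64(y))
	if x < 0 {
		hi -= uint64(y)
	}
	if y < 0 {
		hi -= uint64(x)
	}
	return int64(hi)
}

// avg64u returns floor((x+y)/2) computed on the full 65-bit sum,
// so the result is correct even when x+y overflows uint64.
func avg64u(x, y uint64) uint64 {
	sum, carry := bits.Add64(x, y, 0)
	return sum>>1 | carry<<63
}

func main() {
	fmt.Println(hmul64u(1<<63, 4))
	fmt.Println(hmul64(-1, 1))
	fmt.Println(avg64u(^uint64(0), ^uint64(0)) == ^uint64(0))
}
```

Division by a constant is typically compiled into a high multiply plus shifts and adds, which is why having these ops available lets the division rules drop their special cases.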
Russ Cox	1e5bb416d8	cmd/compile: implement bits.Mul64 on 32-bit systems

This CL implements Mul64uhilo, Hmul64, Hmul64u, and Avg64u on 32-bit systems, with the effect that constant division of both int64s and uint64s can now be emitted directly in all cases, and also that bits.Mul64 can be intrinsified on 32-bit systems.

Previously, constant division of uint64s by values 0 ≤ c ≤ 0xFFFF was implemented as uint32 divisions by c and some fixup. After expanding those smaller constant divisions, the code for i/999 required:

    (386) 7 mul, 10 add, 2 sub, 3 rotate, 3 shift (104 bytes)
    (arm) 7 mul, 9 add, 3 sub, 2 shift (104 bytes)
    (mips) 7 mul, 10 add, 5 sub, 6 shift, 3 sgtu (176 bytes)

For that much code, we might as well use a full 64x64->128 multiply that can be used for all divisors, not just small ones.

Having done that, the same i/999 now generates:

    (386) 4 mul, 9 add, 2 sub, 2 or, 6 shift (112 bytes)
    (arm) 4 mul, 8 add, 2 sub, 2 or, 3 shift (92 bytes)
    (mips) 4 mul, 11 add, 3 sub, 6 shift, 8 sgtu, 4 or (196 bytes)

The size increase on 386 is due to a few extra register spills. The size increase on mips is due to add-with-carry being hard.

The new approach is more general, letting us delete the old special case and guarantee that all int64 and uint64 divisions by constants are generated directly on 32-bit systems.

This especially speeds up code making heavy use of bits.Mul64 with a constant argument, which happens in strconv and various crypto packages. A few examples are benchmarked below.

pkg: cmd/compile/internal/test benchmark \ host local linux-amd64 s7 linux-386 s7:GOARCH=386 vs base vs base vs base vs base vs base DivconstI64 ~ ~ ~ -49.66% -21.02% ModconstI64 ~ ~ ~ -13.45% +14.52% DivisiblePow2constI64 ~ ~ ~ +0.97% -1.32% DivisibleconstI64 ~ ~ ~ -20.01% -48.28% DivisibleWDivconstI64 ~ ~ -1.76% -38.59% -42.74% DivconstU64/3 ~ ~ ~ -13.82% -4.09% DivconstU64/5 ~ ~ ~ -14.10% -3.54% DivconstU64/37 -2.07% -4.45% ~ -19.60% -9.55% DivconstU64/1234567 ~ ~ ~ -61.55% -56.93% ModconstU64 ~ ~ ~ -6.25% ~ DivisibleconstU64 ~ ~ ~ -2.78% -7.82% DivisibleWDivconstU64 ~ ~ ~ +4.23% +2.56% pkg: math/bits benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base Add ~ ~ ~ ~ Add32 +1.59% ~ ~ ~ Add64 ~ ~ ~ ~ Add64multiple ~ ~ ~ ~ Sub ~ ~ ~ ~ Sub32 ~ ~ ~ ~ Sub64 ~ ~ -9.20% ~ Sub64multiple ~ ~ ~ ~ Mul ~ ~ ~ ~ Mul32 ~ ~ ~ ~ Mul64 ~ ~ -41.58% -53.21% Div ~ ~ ~ ~ Div32 ~ ~ ~ ~ Div64 ~ ~ ~ ~ pkg: strconv benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base ParseInt/Pos/7bit ~ ~ -11.08% -6.75% ParseInt/Pos/26bit ~ ~ -13.65% -11.02% ParseInt/Pos/31bit ~ ~ -14.65% -9.71% ParseInt/Pos/56bit -1.80% ~ -17.97% -10.78% ParseInt/Pos/63bit ~ ~ -13.85% -9.63% ParseInt/Neg/7bit ~ ~ -12.14% -7.26% ParseInt/Neg/26bit ~ ~ -14.18% -9.81% ParseInt/Neg/31bit ~ ~ -14.51% -9.02% ParseInt/Neg/56bit ~ ~ -15.79% -9.79% ParseInt/Neg/63bit ~ ~ -15.68% -11.07% AppendFloat/Decimal ~ ~ -7.25% -12.26% AppendFloat/Float ~ ~ -15.96% -19.45% AppendFloat/Exp ~ ~ -13.96% -17.76% AppendFloat/NegExp ~ ~ -14.89% -20.27% AppendFloat/LongExp ~ ~ -12.68% -17.97% AppendFloat/Big ~ ~ -11.10% -16.64% AppendFloat/BinaryExp ~ ~ ~ ~ AppendFloat/32Integer ~ ~ -10.05% -10.91% AppendFloat/32ExactFraction ~ ~ -8.93% -13.00% AppendFloat/32Point ~ ~ -10.36% -14.89% AppendFloat/32Exp ~ ~ -9.88% -13.54% AppendFloat/32NegExp ~ ~ -10.16% -14.26% AppendFloat/32Shortest ~ ~ -11.39% -14.96% AppendFloat/32Fixed8Hard ~ ~ ~ -2.31% AppendFloat/32Fixed9Hard ~ ~ ~ -7.01% 
AppendFloat/64Fixed1 ~ ~ -2.83% -8.23% AppendFloat/64Fixed2 ~ ~ ~ -7.94% AppendFloat/64Fixed3 ~ ~ -4.07% -7.22% AppendFloat/64Fixed4 ~ ~ -7.24% -7.62% AppendFloat/64Fixed12 ~ ~ -6.57% -4.82% AppendFloat/64Fixed16 ~ ~ -4.00% -5.81% AppendFloat/64Fixed12Hard -2.22% ~ -4.07% -6.35% AppendFloat/64Fixed17Hard -2.12% ~ ~ -3.79% AppendFloat/64Fixed18Hard -1.89% ~ +2.48% ~ AppendFloat/Slowpath64 -1.85% ~ -14.49% -18.21% AppendFloat/SlowpathDenormal64 ~ ~ -13.08% -19.41% pkg: crypto/internal/fips140/nistec/fiat benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base Mul/P224 ~ ~ -29.95% -39.60% Mul/P384 ~ ~ -37.11% -63.33% Mul/P521 ~ ~ -26.62% -12.42% Square/P224 +1.46% ~ -40.62% -49.18% Square/P384 ~ ~ -45.51% -69.68% Square/P521 +90.37% ~ -25.26% -11.23% (The +90% is a separate problem and not real; that much variation can be seen on that system by running the same binary from two different files.) pkg: crypto/internal/fips140/edwards25519 benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386 vs base vs base vs base vs base EncodingDecoding ~ ~ -34.67% -35.75% ScalarBaseMult ~ ~ -31.25% -30.29% ScalarMult ~ ~ -33.45% -32.54% VarTimeDoubleScalarBaseMult ~ ~ -33.78% -33.68% Change-Id: Id3c91d42cd01def6731b755e99f8f40c6ad1bb65 Reviewed-on: https://go-review.googlesource.com/c/go/+/716061 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>	2025-10-30 08:04:20 -07:00
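A 64x64->128 multiply can be assembled from 32-bit halves, which is the general shape of what a 32-bit target has to do. The sketch below is the textbook schoolbook decomposition written in portable Go, checked against `bits.Mul64`; it is not the SSA lowering added by the CL, and `mul64` and `agrees` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mul64 computes the 128-bit product of x and y using only products of
// 32-bit halves, returning the high and low 64-bit words.
func mul64(x, y uint64) (hi, lo uint64) {
	const mask32 = 1<<32 - 1
	x0, x1 := x&mask32, x>>32
	y0, y1 := y&mask32, y>>32

	w0 := x0 * y0        // low x low
	t := x1*y0 + w0>>32  // high x low, plus carry out of w0
	w1 := t & mask32
	w2 := t >> 32
	w1 += x0 * y1        // low x high, accumulated into the middle word

	hi = x1*y1 + w2 + w1>>32
	lo = x * y // the low word is just the wrapping product
	return hi, lo
}

// agrees reports whether mul64 matches the standard library for x and y.
func agrees(x, y uint64) bool {
	hi, lo := mul64(x, y)
	wantHi, wantLo := bits.Mul64(x, y)
	return hi == wantHi && lo == wantLo
}

func main() {
	fmt.Println(agrees(^uint64(0), ^uint64(0)))
	fmt.Println(agrees(0x123456789ABCDEF0, 999))
}
```

Four partial products are needed, which matches the "4 mul" counts quoted in the commit message for i/999 on 386, arm, and mips.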
Russ Cox	9bbda7c99d	cmd/compile: make prove understand div, mod better This CL introduces new divisible and divmod passes that rewrite divisibility checks and div, mod, and mul. These happen after prove, so that prove can make better sense of the code for deriving bounds, and they must run before decompose, so that 64-bit ops can be lowered to 32-bit ops on 32-bit systems. And then they need another generic pass as well, to optimize the generated code before decomposing. The three opt passes are "opt", "middle opt", and "late opt". (Perhaps instead they should be "generic", "opt", and "late opt"?) The "late opt" pass repeats the "middle opt" work on any new code that has been generated in the interim. There will not be new divs or mods, but there may be new muls. The x%c==0 rewrite rules are much simpler now, since they can match before divs have been rewritten. This has the effect of applying them more consistently and making the rewrite rules independent of the exact div rewrites. Prove is also now charged with marking signed div/mod as unsigned when the arguments call for it, allowing simpler code to be emitted in various cases. For example, t.Seconds()/2 and len(x)/2 are now recognized as unsigned, meaning they compile to a simple shift (unsigned division), avoiding the more complex fixup we need for signed values. https://gist.github.com/rsc/99d9d3bd99cde87b6a1a390e3d85aa32 shows a diff of 'go build -a -gcflags=-d=ssa/prove/debug=1 std' output before and after. "Proved Rsh64x64 shifts to zero" is replaced by the higher-level "Proved Div64 is unsigned" (the shift was in the signed expansion of div by constant), but otherwise prove is only finding more things to prove. 
One short example, in code that does x[i%len(x)]:

    < runtime/mfinal.go:131:34: Proved Rsh64x64 shifts to zero
    ---
    > runtime/mfinal.go:131:34: Proved Div64 is unsigned
    > runtime/mfinal.go:131:38: Proved IsInBounds

A longer example:

    < crypto/internal/fips140/sha3/shake.go:28:30: Proved Rsh64x64 shifts to zero
    < crypto/internal/fips140/sha3/shake.go:38:27: Proved Rsh64x64 shifts to zero
    < crypto/internal/fips140/sha3/shake.go:53:46: Proved Rsh64x64 shifts to zero
    < crypto/internal/fips140/sha3/shake.go:55:46: Proved Rsh64x64 shifts to zero
    ---
    > crypto/internal/fips140/sha3/shake.go:28:30: Proved Div64 is unsigned
    > crypto/internal/fips140/sha3/shake.go:28:30: Proved IsInBounds
    > crypto/internal/fips140/sha3/shake.go:28:30: Proved IsSliceInBounds
    > crypto/internal/fips140/sha3/shake.go:38:27: Proved Div64 is unsigned
    > crypto/internal/fips140/sha3/shake.go:45:7: Proved IsSliceInBounds
    > crypto/internal/fips140/sha3/shake.go:46:4: Proved IsInBounds
    > crypto/internal/fips140/sha3/shake.go:53:46: Proved Div64 is unsigned
    > crypto/internal/fips140/sha3/shake.go:53:46: Proved IsInBounds
    > crypto/internal/fips140/sha3/shake.go:53:46: Proved IsSliceInBounds
    > crypto/internal/fips140/sha3/shake.go:55:46: Proved Div64 is unsigned
    > crypto/internal/fips140/sha3/shake.go:55:46: Proved IsInBounds
    > crypto/internal/fips140/sha3/shake.go:55:46: Proved IsSliceInBounds

These diffs are due to the smaller opt being better and taking work away from prove:

    < image/jpeg/dct.go:307:5: Proved IsInBounds
    < image/jpeg/dct.go:308:5: Proved IsInBounds
    ...
    < image/jpeg/dct.go:442:5: Proved IsInBounds

In the old opt, Mul by 8 was rewritten to Lsh by 3 early. This CL delays that rule to help prove recognize mods, but it also helps opt constant-fold the slice x[8*i:8*i+8:8*i+8]. Specifically, computing the length, opt can now do:

    (Sub64 (Add (Mul 8 i) 8) (Mul 8 i)) ->
    (Add 8 (Sub (Mul 8 i) (Mul 8 i))) ->
    (Add 8 (Mul 8 (Sub i i))) ->
    (Add 8 (Mul 8 0)) ->
    (Add 8 0) ->
    8

The key step is (Sub (Mul x y) (Mul x z)) -> (Mul x (Sub y z)). Leaving the multiply as Mul enables using that step; the old rewrite to Lsh blocked it, leaving prove to figure out the length and then remove the bounds checks. But now opt can evaluate the length down to a constant 8 and then constant-fold away the bounds checks 0 < 8, 1 < 8, and so on. After that, the compiler has nothing left to prove.

Benchmarks are noisy in general; I checked the assembly for the many large increases below, and the vast majority are unchanged and presumably hitting the caches differently in some way.

The divisibility optimizations were not reliably triggering before. This leads to a very large improvement in some cases, like DivisiblePow2constI64, DivisibleconstI64 on 64-bit systems and DivisibleconstU64 on 32-bit systems.

Another way the divisibility optimizations were unreliable before was incorrectly triggering for x/3, x%3 even though they are written not to do that. There is a real but small slowdown in the DivisibleWDivconst benchmarks on Mac because in the cases used in the benchmark, it is still faster (on Mac) to do the divisibility check than to remultiply. This may be worth further study. Perhaps when there is no rotate (meaning the divisor is odd), the divisibility optimization should be enabled always. In any event, this CL makes it possible to study that.

benchmark \ host s7 linux-amd64 mac linux-arm64 linux-ppc64le linux-386 s7:GOARCH=386 linux-arm vs base vs base vs base vs base vs base vs base vs base vs base LoadAdd ~ ~ ~ ~ ~ -1.59% ~ ~ ExtShift ~ ~ -42.14% +0.10% ~ +1.44% +5.66% +8.50% Modify ~ ~ ~ ~ ~ ~ ~ -1.53% MullImm ~ ~ ~ ~ ~ +37.90% -21.87% +3.05% ConstModify ~ ~ ~ ~ -49.14% ~ ~ ~ BitSet ~ ~ ~ ~ -15.86% -14.57% +6.44% +0.06% BitClear ~ ~ ~ ~ ~ +1.78% +3.50% +0.06% BitToggle ~ ~ ~ ~ ~ -16.09% +2.91% ~ BitSetConst ~ ~ ~ ~ ~ ~ ~ -0.49% BitClearConst ~ ~ ~ ~ -28.29% ~ ~ -0.40% BitToggleConst ~ ~ ~ +8.89% -31.19% ~ ~ -0.77% MulNeg ~ ~ ~ ~ ~ ~ ~ ~ Mul2Neg ~ ~ -4.83% ~ ~ -13.75% -5.92% ~ DivconstI64 ~ ~ ~ ~ ~ -30.12% ~ +0.50% ModconstI64 ~ ~ -9.94% -4.63% ~ +3.15% ~ +5.32% DivisiblePow2constI64 -34.49% -12.58% ~ ~ -12.25% ~ ~ ~ DivisibleconstI64 -24.69% -25.06% -0.40% -2.27% -42.61% -3.31% ~ +1.63% DivisibleWDivconstI64 ~ ~ ~ ~ ~ -17.55% ~ -0.60% DivconstU64/3 ~ ~ ~ ~ ~ +1.51% ~ ~ DivconstU64/5 ~ ~ ~ ~ ~ ~ ~ ~ DivconstU64/37 ~ ~ -0.18% ~ ~ +2.70% ~ ~ DivconstU64/1234567 ~ ~ ~ ~ ~ ~ ~ +0.12% ModconstU64 ~ ~ ~ -0.24% ~ -5.10% -1.07% -1.56% DivisibleconstU64 ~ ~ ~ ~ ~ -29.01% -59.13% -50.72% DivisibleWDivconstU64 ~ ~ -12.18% -18.88% ~ -5.50% -3.91% +5.17% DivconstI32 ~ ~ -0.48% ~ -34.69% +89.01% -6.01% -16.67% ModconstI32 ~ +2.95% -0.33% ~ ~ -2.98% -5.40% -8.30% DivisiblePow2constI32 ~ ~ ~ ~ ~ ~ ~ -16.22% DivisibleconstI32 ~ ~ ~ ~ ~ -37.27% -47.75% -25.03% DivisibleWDivconstI32 -11.59% +5.22% -12.99% -23.83% ~ +45.95% -7.03% -10.01% DivconstU32 ~ ~ ~ ~ ~ +74.71% +4.81% ~ ModconstU32 ~ ~ +0.53% +0.18% ~ +51.16% ~ ~ DivisibleconstU32 ~ ~ ~ -0.62% ~ -4.25% ~ ~ DivisibleWDivconstU32 -2.77% +5.56% +11.12% -5.15% ~ +48.70% +25.11% -4.07% DivconstI16 -6.06% ~ -0.33% +0.22% ~ ~ -9.68% +5.47% ModconstI16 ~ ~ +4.44% +2.82% ~ ~ ~ +5.06% DivisiblePow2constI16 ~ ~ ~ ~ ~ ~ ~ -0.17% DivisibleconstI16 ~ ~ -0.23% ~ ~ ~ +4.60% +6.64% DivisibleWDivconstI16 -1.44% -0.43% +13.48% -5.76% ~ +1.62% -23.15% -9.06% DivconstU16 +1.61% ~ 
-0.35% -0.47% ~ ~ +15.59% ~ ModconstU16 ~ ~ ~ ~ ~ -0.72% ~ +14.23% DivisibleconstU16 ~ ~ -0.05% +3.00% ~ ~ ~ +5.06% DivisibleWDivconstU16 +52.10% +0.75% +17.28% +4.79% ~ -37.39% +5.28% -9.06% DivconstI8 ~ ~ -0.34% -0.96% ~ ~ -9.20% ~ ModconstI8 +2.29% ~ +4.38% +2.96% ~ ~ ~ ~ DivisiblePow2constI8 ~ ~ ~ ~ ~ ~ ~ ~ DivisibleconstI8 ~ ~ ~ ~ ~ ~ +6.04% ~ DivisibleWDivconstI8 -26.44% +1.69% +17.03% +4.05% ~ +32.48% -24.90% ~ DivconstU8 -4.50% +14.06% -0.28% ~ ~ ~ +4.16% +0.88% ModconstU8 ~ ~ +25.84% -0.64% ~ ~ ~ ~ DivisibleconstU8 ~ ~ -5.70% ~ ~ ~ ~ ~ DivisibleWDivconstU8 +49.55% +9.07% ~ +4.03% +53.87% -40.03% +39.72% -3.01% Mul2 ~ ~ ~ ~ ~ ~ ~ ~ MulNeg2 ~ ~ ~ ~ -11.73% ~ ~ -0.02% EfaceInteger ~ ~ ~ ~ ~ +18.11% ~ +2.53% TypeAssert +33.90% +2.86% ~ ~ ~ -1.07% -5.29% -1.04% Div64UnsignedSmall ~ ~ ~ ~ ~ ~ ~ ~ Div64Small ~ ~ ~ ~ ~ -0.88% ~ +2.39% Div64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +0.35% Div64SmallNegDividend ~ ~ ~ ~ ~ -0.84% ~ +3.57% Div64SmallNegBoth ~ ~ ~ ~ ~ -0.86% ~ +3.55% Div64Unsigned ~ ~ ~ ~ ~ ~ ~ -0.11% Div64 ~ ~ ~ ~ ~ ~ ~ +0.11% Div64NegDivisor ~ ~ ~ ~ ~ -1.29% ~ ~ Div64NegDividend ~ ~ ~ ~ ~ -1.44% ~ ~ Div64NegBoth ~ ~ ~ ~ ~ ~ ~ +0.28% Mod64UnsignedSmall ~ ~ ~ ~ ~ +0.48% ~ +0.93% Mod64Small ~ ~ ~ ~ ~ ~ ~ ~ Mod64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +1.44% Mod64SmallNegDividend ~ ~ ~ ~ ~ +0.22% ~ +1.37% Mod64SmallNegBoth ~ ~ ~ ~ ~ ~ ~ -2.22% Mod64Unsigned ~ ~ ~ ~ ~ -0.95% ~ +0.11% Mod64 ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegDivisor ~ ~ ~ ~ ~ ~ ~ -0.02% Mod64NegDividend ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegBoth ~ ~ ~ ~ ~ ~ ~ -0.02% MulconstI32/3 ~ ~ ~ -25.00% ~ ~ ~ +47.37% MulconstI32/5 ~ ~ ~ +33.28% ~ ~ ~ +32.21% MulconstI32/12 ~ ~ ~ -2.13% ~ ~ ~ -0.02% MulconstI32/120 ~ ~ ~ +2.93% ~ ~ ~ -0.03% MulconstI32/-120 ~ ~ ~ -2.17% ~ ~ ~ -0.03% MulconstI32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstI32/65538 ~ ~ ~ ~ ~ -33.38% ~ +0.04% MulconstI64/3 ~ ~ ~ +33.35% ~ -0.37% ~ -0.13% MulconstI64/5 ~ ~ ~ -25.00% ~ -0.34% ~ ~ MulconstI64/12 ~ ~ ~ +2.13% ~ +11.62% ~ +2.30% MulconstI64/120 ~ ~ ~ -1.98% ~ ~ ~ ~ 
MulconstI64/-120 ~ ~ ~ +0.75% ~ ~ ~ ~ MulconstI64/65537 ~ ~ ~ ~ ~ +5.61% ~ ~ MulconstI64/65538 ~ ~ ~ ~ ~ +5.25% ~ ~ MulconstU32/3 ~ +0.81% ~ +33.39% ~ +77.92% ~ -32.31% MulconstU32/5 ~ ~ ~ -24.97% ~ +77.92% ~ -24.47% MulconstU32/12 ~ ~ ~ +2.06% ~ ~ ~ +0.03% MulconstU32/120 ~ ~ ~ -2.74% ~ ~ ~ +0.03% MulconstU32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstU32/65538 ~ ~ ~ ~ ~ -33.42% ~ -0.03% MulconstU64/3 ~ ~ ~ +33.33% ~ -0.28% ~ +1.22% MulconstU64/5 ~ ~ ~ -25.00% ~ ~ ~ -0.64% MulconstU64/12 ~ ~ ~ +2.30% ~ +11.59% ~ +0.14% MulconstU64/120 ~ ~ ~ -2.82% ~ ~ ~ +0.04% MulconstU64/65537 ~ +0.37% ~ ~ ~ +5.58% ~ ~ MulconstU64/65538 ~ ~ ~ ~ ~ +5.16% ~ ~ ShiftArithmeticRight ~ ~ ~ ~ ~ -10.81% ~ +0.31% Switch8Predictable +14.69% ~ ~ ~ ~ -24.85% ~ ~ Switch8Unpredictable ~ -0.58% -3.80% ~ ~ -11.78% ~ -0.79% Switch32Predictable -10.33% +17.89% ~ ~ ~ +5.76% ~ ~ Switch32Unpredictable -3.15% +1.19% +9.42% ~ ~ -10.30% -5.09% +0.44% SwitchStringPredictable +70.88% +20.48% ~ ~ ~ +2.39% ~ +0.31% SwitchStringUnpredictable ~ +3.91% -5.06% -0.98% ~ +0.61% +2.03% ~ SwitchTypePredictable +146.58% -1.10% ~ -12.45% ~ -0.46% -3.81% ~ SwitchTypeUnpredictable +0.46% -0.83% ~ +4.18% ~ +0.43% ~ +0.62% SwitchInterfaceTypePredictable -13.41% -10.13% +11.03% ~ ~ -4.38% ~ +0.75% SwitchInterfaceTypeUnpredictable -6.37% -2.14% ~ -3.21% ~ -4.20% ~ +1.08% Fixes #63110. Fixes #75954. Change-Id: I55a876f08c6c14f419ce1a8cbba2eaae6c6efbf0 Reviewed-on: https://go-review.googlesource.com/c/go/+/714160 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Russ Cox <rsc@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-10-29 18:49:40 -07:00
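The divisibility optimization mentioned in this commit can be illustrated with a well-known technique for odd divisors: multiply by the divisor's modular inverse and compare against a threshold, with no division or remainder at all. The sketch below does this for x%3 == 0 on uint32. Whether the new divisible pass emits exactly this sequence is an assumption; `divisibleBy3` and `agreesUpTo` are hypothetical names.

```go
package main

import "fmt"

// divisibleBy3 reports whether x%3 == 0 without dividing.
// inv is the multiplicative inverse of 3 modulo 2^32, so for a multiple
// x = 3*q the wrapping product x*inv equals q, which is at most
// limit = (2^32-1)/3. Non-multiples map to values above limit.
func divisibleBy3(x uint32) bool {
	const inv = 0xAAAAAAAB // 3 * inv == 1 (mod 2^32)
	const limit = 0xFFFFFFFF / 3
	return x*inv <= limit
}

// agreesUpTo checks divisibleBy3 against x%3 == 0 for all x < n.
func agreesUpTo(n uint32) bool {
	for x := uint32(0); x < n; x++ {
		if divisibleBy3(x) != (x%3 == 0) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agreesUpTo(1 << 20))
}
```

For even divisors the same idea needs an extra rotate to handle the power-of-two factor, which is presumably the rotate the commit message refers to when it says an odd divisor has none.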
Russ Cox	915c1839fe	test/codegen: simplify asmcheck pattern matching Separate patterns in asmcheck by spaces instead of commas. Many patterns end in comma (like "MOV [$]123,") so separating patterns by comma is not great; they're already quoted, so spaces are fine. Also replace all tabs in the assembly lines with spaces before matching. Finally, replace \$ or \\$ with [$] as the matching idiom. The effect of all these is to make the patterns look like: // amd64:"BSFQ" "ORQ [$]256" instead of the old: // amd64:"BSFQ","ORQ\t\\$256" Update all tests as well. Change-Id: Ia39febe5d7f67ba115846422789e11b185d5c807 Reviewed-on: https://go-review.googlesource.com/c/go/+/716060 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com> Reviewed-by: Jorropo <jorropo.pgm@gmail.com>	2025-10-29 13:55:00 -07:00
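The pattern syntax described in the commit above (space-separated, double-quoted patterns, with `[$]` matching a literal dollar sign) can be illustrated with a small splitter. This is only a sketch: `splitPatterns` is a hypothetical helper written for this example, not the code used by the Go test harness, and it assumes every pattern is a Go-style double-quoted string.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// splitPatterns is a hypothetical helper (not the real asmcheck code).
// It splits a space-separated list of double-quoted patterns and
// returns the unquoted pattern strings.
func splitPatterns(s string) ([]string, error) {
	var out []string
	s = strings.TrimSpace(s)
	for s != "" {
		// QuotedPrefix returns the leading quoted string, quotes included.
		q, err := strconv.QuotedPrefix(s)
		if err != nil {
			return nil, fmt.Errorf("bad pattern list at %q: %v", s, err)
		}
		p, err := strconv.Unquote(q)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
		s = strings.TrimSpace(s[len(q):])
	}
	return out, nil
}

func main() {
	pats, err := splitPatterns(`"BSFQ" "ORQ [$]256"`)
	if err != nil {
		panic(err)
	}
	// An assembly line with tabs already replaced by spaces.
	line := "ORQ $256, AX"
	for _, p := range pats {
		fmt.Printf("%q matches %q: %v\n", p, line, regexp.MustCompile(p).MatchString(line))
	}
}
```

Because `[$]` is a character class, the pattern needs no backslashes, which is what makes the quoted form easy to write inside a Go comment.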
Jorropo	73d7635fae	cmd/compile: add generic rules to remove bool → int → bool roundtrips Change-Id: I8b0a3b64c89fe167d304f901a5d38470f35400ab Reviewed-on: https://go-review.googlesource.com/c/go/+/715200 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@golang.org>	2025-10-27 23:24:54 -07:00
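The commit message above gives no example, so the following is an assumed illustration of a bool → int → bool roundtrip at the source level. The helper names `b2i` and `roundTrip` are made up for this sketch; whether a given build actually folds the roundtrip away depends on the compiler version.

```go
package main

import "fmt"

// b2i converts a bool to 0 or 1 (a common Go idiom).
func b2i(b bool) int {
	if b {
		return 1
	}
	return 0
}

// roundTrip converts bool -> int -> bool. The result always equals b,
// so an optimizer can in principle replace the body with "return b".
func roundTrip(b bool) bool {
	return b2i(b) != 0
}

func main() {
	fmt.Println(roundTrip(true), roundTrip(false))
}
```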
Meng Zhuo	d7a52f9369	cmd/compile: use MOV(D|F) with const for Const(64|32)F on riscv64 The original Const64F lowering uses AUIPC + LD + FMVDX to load a float64 constant; we can use AUIPC + FLD instead, the same as Const32F. Change-Id: I8ca0a0e90d820a26e69b74cd25df3cc662132bf7 Reviewed-on: https://go-review.googlesource.com/c/go/+/703215 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Joel Sing <joel@sing.id.au> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>	2025-10-26 18:35:09 -07:00
David Chase	7056c71d32	cmd/compile: disable use of new saturating float-to-int conversions The new conversions can be activated (or bisected) with -gcflags=all=-d=converthash=PATTERN where PATTERN is either a hash string or n, qn, y, qy for no, quietly no, yes, quietly yes. This CL makes the default pattern be "qn" instead of the default-default which is an efficient encoding of "qy". Updates #75834 Change-Id: I88a9fd7880bc999132420c8d0a22a8fdc1e95a2a Reviewed-on: https://go-review.googlesource.com/c/go/+/711845 Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Bypass: David Chase <drchase@google.com>	2025-10-14 15:09:35 -07:00
Keith Randall	9b8742f2e7	cmd/compile: don't depend on arch-dependent conversions in the compiler Leave those constant foldings for runtime, similar to how we do it for NaN generation. These are the only instances I could find in cmd/compile/..., using objdump -d ../pkg/tool/darwin_arm64/compile | egrep "(fcvtz|>:)" | grep -B1 fcvt (There are instances in other places, like runtime and reflect, but I don't think those places would affect compiler output.) Change-Id: I4113fe4570115e4765825cf442cb1fde97cf2f27 Reviewed-on: https://go-review.googlesource.com/c/go/+/711281 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Keith Randall <khr@google.com>	2025-10-13 12:19:32 -07:00
Michael Matloob	19a30ea3f2	cmd/compile: call generated size-specialized malloc functions directly This change creates calls to size-specialized malloc functions instead of calls to newObject when we know the size of the allocation at compilation time. Most of it is a matter of calling the newObject function (which will create calls to the size-specialized functions) rather than the newObjectNonSpecialized function (which won't). In the newHeapaddr, small, non-pointer case, we'll create a non-specialized newObject and transform that into the appropriate size-specialized function when we produce the mallocgc in flushPendingHeapAllocations. We have to update some of the rewrites in generic.rules to also apply to the size-specialized functions when they apply to newObject. The messiest thing is we have to adjust the offset we use to save the memory profiler stack, because the depth of the call to profilealloc is two frames fewer in the size-specialized malloc functions compared to when newObject calls mallocgc. A bunch of tests have been adjusted to account for that. Change-Id: I6a6a6964c9037fb6719e392c4a498ed700b617d7 Reviewed-on: https://go-review.googlesource.com/c/go/+/707856 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Matloob <matloob@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org>	2025-10-09 14:59:40 -07:00
Michael Munday	97fd6bdecc	cmd/compile: fuse NaN checks with other comparisons NaN checks can often be merged into other comparisons by inverting them. For example, `math.IsNaN(x) || x > 0` is equivalent to `!(x <= 0)`. goos: linux goarch: amd64 pkg: math cpu: 12th Gen Intel(R) Core(TM) i7-12700T │ sec/op │ sec/op vs base │ Acos 4.315n ± 0% 4.314n ± 0% ~ (p=0.642 n=10) Acosh 8.398n ± 0% 7.779n ± 0% -7.37% (p=0.000 n=10) Asin 4.203n ± 0% 4.211n ± 0% +0.20% (p=0.001 n=10) Asinh 10.150n ± 0% 9.562n ± 0% -5.79% (p=0.000 n=10) Atan 2.363n ± 0% 2.363n ± 0% ~ (p=0.801 n=10) Atanh 8.192n ± 2% 7.685n ± 0% -6.20% (p=0.000 n=10) Atan2 4.013n ± 0% 4.010n ± 0% ~ (p=0.073 n=10) Cbrt 4.858n ± 0% 4.755n ± 0% -2.12% (p=0.000 n=10) Cos 4.596n ± 0% 4.357n ± 0% -5.20% (p=0.000 n=10) Cosh 5.071n ± 0% 5.071n ± 0% ~ (p=0.585 n=10) Erf 2.802n ± 1% 2.788n ± 0% -0.54% (p=0.002 n=10) Erfc 3.087n ± 1% 3.071n ± 0% ~ (p=0.320 n=10) Erfinv 3.981n ± 0% 3.965n ± 0% -0.41% (p=0.000 n=10) Erfcinv 3.985n ± 0% 3.977n ± 0% -0.20% (p=0.000 n=10) ExpGo 8.721n ± 2% 8.252n ± 0% -5.38% (p=0.000 n=10) Expm1 4.378n ± 0% 4.228n ± 0% -3.43% (p=0.000 n=10) Exp2 8.313n ± 0% 7.855n ± 0% -5.52% (p=0.000 n=10) Exp2Go 8.498n ± 2% 7.921n ± 0% -6.79% (p=0.000 n=10) Mod 15.16n ± 4% 12.20n ± 1% -19.58% (p=0.000 n=10) Frexp 1.780n ± 2% 1.496n ± 0% -15.96% (p=0.000 n=10) Gamma 4.378n ± 1% 4.013n ± 0% -8.35% (p=0.000 n=10) HypotGo 2.655n ± 5% 2.427n ± 1% -8.57% (p=0.000 n=10) Ilogb 1.912n ± 5% 1.749n ± 0% -8.53% (p=0.000 n=10) J0 22.43n ± 9% 20.46n ± 0% -8.76% (p=0.000 n=10) J1 21.03n ± 4% 19.96n ± 0% -5.09% (p=0.000 n=10) Jn 45.40n ± 1% 42.59n ± 0% -6.20% (p=0.000 n=10) Ldexp 2.312n ± 1% 1.944n ± 0% -15.94% (p=0.000 n=10) Lgamma 4.617n ± 1% 4.584n ± 0% -0.73% (p=0.000 n=10) Log 4.226n ± 0% 4.213n ± 0% -0.31% (p=0.001 n=10) Logb 1.771n ± 0% 1.775n ± 0% ~ (p=0.097 n=10) Log1p 5.102n ± 2% 5.001n ± 0% -1.97% (p=0.000 n=10) Log10 4.407n ± 0% 4.408n ± 0% ~ (p=1.000 n=10) Log2 2.416n ± 1% 2.138n ± 0% -11.51% (p=0.000 n=10) Modf 
1.669n ± 2% 1.611n ± 0% -3.50% (p=0.000 n=10) Nextafter32 2.186n ± 0% 2.185n ± 0% ~ (p=0.051 n=10) Nextafter64 2.182n ± 0% 2.184n ± 0% +0.09% (p=0.016 n=10) PowInt 11.39n ± 6% 10.68n ± 2% -6.24% (p=0.000 n=10) PowFrac 26.60n ± 2% 26.12n ± 0% -1.80% (p=0.000 n=10) Pow10Pos 0.5067n ± 4% 0.5003n ± 1% -1.27% (p=0.001 n=10) Pow10Neg 0.8552n ± 0% 0.8552n ± 0% ~ (p=0.928 n=10) Round 1.181n ± 0% 1.182n ± 0% +0.08% (p=0.001 n=10) RoundToEven 1.709n ± 0% 1.710n ± 0% ~ (p=0.053 n=10) Remainder 12.54n ± 5% 11.99n ± 2% -4.46% (p=0.000 n=10) Sin 3.933n ± 5% 3.926n ± 0% -0.17% (p=0.000 n=10) Sincos 5.672n ± 0% 5.522n ± 0% -2.65% (p=0.000 n=10) Sinh 5.447n ± 1% 5.444n ± 0% -0.06% (p=0.029 n=10) Tan 4.061n ± 0% 4.058n ± 0% -0.07% (p=0.005 n=10) Tanh 5.599n ± 0% 5.595n ± 0% -0.06% (p=0.042 n=10) Y0 20.75n ± 5% 19.73n ± 1% -4.92% (p=0.000 n=10) Y1 20.87n ± 2% 19.78n ± 1% -5.20% (p=0.000 n=10) Yn 44.50n ± 2% 42.04n ± 2% -5.53% (p=0.000 n=10) geomean 4.989n 4.791n -3.96% goos: linux goarch: riscv64 pkg: math cpu: Spacemit(R) X60 │ sec/op │ sec/op vs base │ Acos 159.9n ± 0% 159.9n ± 0% ~ (p=0.269 n=10) Acosh 244.7n ± 0% 235.0n ± 0% -3.98% (p=0.000 n=10) Asin 159.9n ± 0% 159.9n ± 0% ~ (p=0.154 n=10) Asinh 270.8n ± 0% 261.1n ± 0% -3.60% (p=0.000 n=10) Atan 119.1n ± 0% 119.1n ± 0% ~ (p=0.347 n=10) Atanh 260.2n ± 0% 261.8n ± 4% ~ (p=0.459 n=10) Atan2 186.8n ± 0% 186.8n ± 0% ~ (p=0.487 n=10) Cbrt 203.5n ± 0% 198.2n ± 0% -2.60% (p=0.000 n=10) Ceil 31.82n ± 0% 31.81n ± 0% ~ (p=0.714 n=10) Copysign 4.894n ± 0% 4.893n ± 0% ~ (p=0.161 n=10) Cos 107.6n ± 0% 103.6n ± 0% -3.76% (p=0.000 n=10) Cosh 259.0n ± 0% 252.8n ± 0% -2.39% (p=0.000 n=10) Erf 133.7n ± 0% 133.7n ± 0% ~ (p=0.720 n=10) Erfc 137.9n ± 0% 137.8n ± 0% -0.04% (p=0.033 n=10) Erfinv 173.7n ± 0% 168.8n ± 0% -2.82% (p=0.000 n=10) Erfcinv 173.7n ± 0% 168.8n ± 0% -2.82% (p=0.000 n=10) Exp 215.3n ± 0% 208.1n ± 0% -3.34% (p=0.000 n=10) ExpGo 226.7n ± 0% 220.6n ± 0% -2.69% (p=0.000 n=10) Expm1 164.8n ± 0% 159.0n ± 0% -3.52% (p=0.000 n=10) Exp2 
185.0n ± 0% 182.7n ± 0% -1.22% (p=0.000 n=10) Exp2Go 198.9n ± 0% 196.5n ± 0% -1.21% (p=0.000 n=10) Abs 4.894n ± 0% 4.893n ± 0% ~ (p=0.262 n=10) Dim 16.31n ± 0% 16.31n ± 0% ~ (p=1.000 n=10) Floor 31.81n ± 0% 31.81n ± 0% ~ (p=0.067 n=10) Max 26.11n ± 0% 26.10n ± 0% ~ (p=0.080 n=10) Min 26.10n ± 0% 26.10n ± 0% ~ (p=0.095 n=10) Mod 337.7n ± 0% 291.9n ± 0% -13.56% (p=0.000 n=10) Frexp 50.57n ± 0% 42.41n ± 0% -16.13% (p=0.000 n=10) Gamma 206.3n ± 0% 198.1n ± 0% -4.00% (p=0.000 n=10) Hypot 94.62n ± 0% 94.61n ± 0% ~ (p=0.437 n=10) HypotGo 109.3n ± 0% 109.3n ± 0% ~ (p=1.000 n=10) Ilogb 44.05n ± 0% 44.04n ± 0% -0.02% (p=0.025 n=10) J0 663.1n ± 0% 663.9n ± 0% +0.13% (p=0.002 n=10) J1 663.9n ± 0% 666.4n ± 0% +0.38% (p=0.000 n=10) Jn 1.404µ ± 0% 1.407µ ± 0% +0.21% (p=0.000 n=10) Ldexp 57.10n ± 0% 48.93n ± 0% -14.30% (p=0.000 n=10) Lgamma 185.1n ± 0% 187.6n ± 0% +1.32% (p=0.000 n=10) Log 182.7n ± 0% 170.1n ± 0% -6.87% (p=0.000 n=10) Logb 46.49n ± 0% 46.49n ± 0% ~ (p=0.675 n=10) Log1p 184.3n ± 0% 179.4n ± 0% -2.63% (p=0.000 n=10) Log10 184.3n ± 0% 171.2n ± 0% -7.08% (p=0.000 n=10) Log2 66.05n ± 0% 57.90n ± 0% -12.34% (p=0.000 n=10) Modf 34.25n ± 0% 34.24n ± 0% ~ (p=0.163 n=10) Nextafter32 49.33n ± 1% 48.93n ± 0% -0.81% (p=0.002 n=10) Nextafter64 43.64n ± 0% 43.23n ± 0% -0.93% (p=0.000 n=10) PowInt 267.6n ± 0% 251.2n ± 0% -6.11% (p=0.000 n=10) PowFrac 672.9n ± 0% 637.9n ± 0% -5.19% (p=0.000 n=10) Pow10Pos 13.87n ± 0% 13.87n ± 0% ~ (p=1.000 n=10) Pow10Neg 19.58n ± 62% 19.59n ± 62% ~ (p=0.355 n=10) Round 23.65n ± 0% 23.65n ± 0% ~ (p=1.000 n=10) RoundToEven 27.73n ± 0% 27.73n ± 0% ~ (p=0.635 n=10) Remainder 309.9n ± 0% 280.5n ± 0% -9.49% (p=0.000 n=10) Signbit 13.05n ± 0% 13.05n ± 0% ~ (p=1.000 n=10) ¹ Sin 120.7n ± 0% 120.7n ± 0% ~ (p=1.000 n=10) ¹ Sincos 148.4n ± 0% 143.5n ± 0% -3.30% (p=0.000 n=10) Sinh 275.6n ± 0% 267.5n ± 0% -2.94% (p=0.000 n=10) SqrtIndirect 3.262n ± 0% 3.262n ± 0% ~ (p=0.263 n=10) SqrtLatency 19.57n ± 0% 19.57n ± 0% ~ (p=0.582 n=10) SqrtIndirectLatency 19.57n ± 
0% 19.57n ± 0% ~ (p=1.000 n=10) SqrtGoLatency 203.2n ± 0% 197.6n ± 0% -2.78% (p=0.000 n=10) SqrtPrime 4.952µ ± 0% 4.952µ ± 0% -0.01% (p=0.025 n=10) Tan 153.3n ± 0% 153.3n ± 0% ~ (p=1.000 n=10) Tanh 280.5n ± 0% 272.4n ± 0% -2.91% (p=0.000 n=10) Trunc 31.81n ± 0% 31.81n ± 0% ~ (p=1.000 n=10) Y0 680.1n ± 0% 664.8n ± 0% -2.25% (p=0.000 n=10) Y1 684.2n ± 0% 669.6n ± 0% -2.14% (p=0.000 n=10) Yn 1.444µ ± 0% 1.410µ ± 0% -2.35% (p=0.000 n=10) Float64bits 5.709n ± 0% 5.708n ± 0% ~ (p=0.573 n=10) Float64frombits 4.893n ± 0% 4.893n ± 0% ~ (p=0.734 n=10) Float32bits 12.23n ± 0% 12.23n ± 0% ~ (p=0.628 n=10) Float32frombits 4.893n ± 0% 4.893n ± 0% ~ (p=0.971 n=10) FMA 4.893n ± 0% 4.893n ± 0% ~ (p=0.736 n=10) geomean 88.96n 87.05n -2.15% ¹ all samples are equal Change-Id: I8db8ac7b7b3430b946b89e88dd6c1546804125c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/697360 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Munday <mikemndy@gmail.com>	2025-10-08 08:11:17 -07:00
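The equivalence quoted in the commit message, `math.IsNaN(x) || x > 0` versus `!(x <= 0)`, holds because every ordered comparison involving NaN is false. A minimal sketch that checks it on a handful of inputs follows; the function names `withNaNCheck`, `fused`, and `allAgree` are chosen for this example only.

```go
package main

import (
	"fmt"
	"math"
)

// withNaNCheck is the original form: an explicit NaN test plus a comparison.
func withNaNCheck(x float64) bool { return math.IsNaN(x) || x > 0 }

// fused is the inverted single comparison. If x is NaN, x <= 0 is false,
// so the negation is true, matching the explicit NaN test.
func fused(x float64) bool { return !(x <= 0) }

// allAgree reports whether both forms agree on a set of edge-case inputs.
func allAgree() bool {
	inputs := []float64{
		math.NaN(), math.Inf(-1), -1, math.Copysign(0, -1), 0, 1, math.Inf(1),
	}
	for _, x := range inputs {
		if withNaNCheck(x) != fused(x) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("forms agree:", allAgree())
}
```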
Cherry Mui	1d62e92567	test/codegen: make sure assignment results are used. Some tests make assignments to an argument without reading it. With CL 708865, they are treated as dead stores and are removed. Make sure the results are used. Fixes #75745. Fixes #75746. Change-Id: I05580beb1006505ec1550e5fa245b54dcefd10b9 Reviewed-on: https://go-review.googlesource.com/c/go/+/708916 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org>	2025-10-06 14:51:23 -07:00
Cherry Mui	38b26f29f1	cmd/compile: remove stores to unread parameters Currently, we remove stores to local variables that are not read. We don't do that for arguments. But arguments and locals are essentially the same. Arguments are passed by value, and are not expected to be read in the caller's frame. So we can remove the writes to them as well. One exception is the cgo_unsafe_arg directive, which makes all the arguments effectively address-taken. cgo_unsafe_arg implies ABI0, so we just skip ABI0 functions' arguments. Cherry-picked from the dev.simd branch. This CL is not necessarily SIMD specific. Apply early to reduce risk. Change-Id: I8999fc50da6a87f22c1ec23e9a0c15483b6f7df8 Reviewed-on: https://go-review.googlesource.com/c/go/+/705815 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-on: https://go-review.googlesource.com/c/go/+/708865	2025-10-03 12:31:20 -07:00
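As a source-level illustration of the case this commit describes, the function below stores to a parameter that is never read afterwards. It is a hypothetical example, not taken from the CL; whether the store is actually removed depends on the compiler version and the function's ABI.

```go
package main

import "fmt"

// sum adds the first n elements of xs. The final assignment to n is a
// store to a parameter that is never read again; since arguments are
// passed by value, the caller cannot observe it either.
//
//go:noinline
func sum(xs []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += xs[i]
	}
	n = 0 // dead store: n is not read after this point
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4}, 3))
}
```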
Joel Sing	4ff8a457db	test/codegen: codify handling of floating point constants on arm64 While here, reorder Float32ConstantStore/Float64ConstantStore for consistency. Change-Id: Ic1b3e9f9474965d15bc94518d78d1a4a7bda93f3 Reviewed-on: https://go-review.googlesource.com/c/go/+/703756 Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Auto-Submit: Joel Sing <joel@sing.id.au> Reviewed-by: Keith Randall <khr@google.com>	2025-09-30 14:49:25 -07:00
Jake Bailey	97da068774	cmd/compile: eliminate nil checks on .dict arg The first arg of a generic function is the dictionary. This dictionary is never nil, but it gets a nil check because the dict arg is treated as a slice during construction. cmp.Compare[go.shape.int] was: 00006 (+41) TESTB AX, (AX) 00007 (+52) CMPQ CX, BX 00008 (52) JGT 14 00009 (+55) JGE 12 00010 (+56) MOVL $1, AX 00011 (56) RET 00012 (+58) XORL AX, AX 00013 (58) RET 00014 (+53) MOVQ $-1, AX 00015 (53) RET Note how the function begins with a TESTB that loads the dict to perform the nil check. This CL eliminates that nil check. For most generic functions, this doesn't matter too much, but generic functions that never actually use the dictionary (like cmp.Compare) are not uncommon, so I suspect this might help in hot code to avoid repeatedly touching the dictionary in memory, and in cases where the generic function is not inlined (and thus the dict dropped). compilecmp shows these changes (deduped): cmp.Compare[go.shape.float64] 73 -> 72 (-1.37%) cmp.Compare[go.shape.int] 26 -> 24 (-7.69%) cmp.Compare[go.shape.int32] 25 -> 23 (-8.00%) cmp.Compare[go.shape.int64] 26 -> 24 (-7.69%) cmp.Compare[go.shape.string] 142 -> 141 (-0.70%) cmp.Compare[go.shape.uint16] 26 -> 24 (-7.69%) cmp.Compare[go.shape.uint] 26 -> 24 (-7.69%) cmp.Compare[go.shape.uint32] 25 -> 23 (-8.00%) cmp.Compare[go.shape.uint64] 26 -> 24 (-7.69%) cmp.Compare[go.shape.uint8] 25 -> 23 (-8.00%) cmp.Compare[go.shape.uintptr] 26 -> 24 (-7.69%) cmp.Less[go.shape.float64] 35 -> 34 (-2.86%) cmp.Less[go.shape.int32] 8 -> 6 (-25.00%) cmp.Less[go.shape.int64] 9 -> 7 (-22.22%) cmp.Less[go.shape.int] 9 -> 7 (-22.22%) cmp.Less[go.shape.string] 112 -> 110 (-1.79%) cmp.Less[go.shape.uint16] 9 -> 7 (-22.22%) cmp.Less[go.shape.uint32] 8 -> 6 (-25.00%) cmp.Less[go.shape.uint64] 9 -> 7 (-22.22%) internal/synctest.Associate[go.shape.struct 114 -> 113 (-0.88%) internal/trace.(dataTable[go.shape.uint64,go.shape.string]).insert 805 -> 791 
(-1.74%) internal/trace.(dataTable[go.shape.uint64,go.shape.struct 858 -> 852 (-0.70%) main.(gState[go.shape.int64]).stop 2111 -> 2085 (-1.23%) main.(gState[go.shape.int64]).unblock 941 -> 923 (-1.91%) runtime.fmax[go.shape.float32] 85 -> 83 (-2.35%) runtime.fmax[go.shape.float64] 89 -> 87 (-2.25%) runtime.fmin[go.shape.float32] 85 -> 83 (-2.35%) runtime.fmin[go.shape.float64] 89 -> 87 (-2.25%) slices.BinarySearch[go.shape.[]string,go.shape.string] 346 -> 337 (-2.60%) slices.Concat[go.shape.[]uint8,go.shape.uint8] 462 -> 453 (-1.95%) slices.ContainsFunc[go.shape.[]cmd/vendor/github.com/google/pprof/profile.Sample,go.shape.uint8] 170 -> 169 (-0.59%) slices.ContainsFunc[go.shape.[]debug/dwarf.StructField,go.shape.uint8] 170 -> 169 (-0.59%) slices.ContainsFunc[go.shape.[]go/ast.Field,go.shape.uint8] 170 -> 169 (-0.59%) slices.ContainsFunc[go.shape.[]string,go.shape.string] 186 -> 181 (-2.69%) slices.Contains[go.shape.[]cmd/compile/internal/syntax.BranchStmt,go.shape.cmd/compile/internal/syntax.BranchStmt] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]cmd/compile/internal/syntax.Type,go.shape.interface 223 -> 219 (-1.79%) slices.Contains[go.shape.[]crypto/tls.CurveID,go.shape.uint16] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]crypto/tls.SignatureScheme,go.shape.uint16] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]go/ast.BranchStmt,go.shape.go/ast.BranchStmt] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]go/types.Type,go.shape.interface 223 -> 219 (-1.79%) slices.Contains[go.shape.[]int,go.shape.int] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]string,go.shape.string] 223 -> 219 (-1.79%) slices.Contains[go.shape.[]uint16,go.shape.uint16] 44 -> 42 (-4.55%) slices.Contains[go.shape.[]uint8,go.shape.uint8] 44 -> 42 (-4.55%) slices.Insert[go.shape.[]string,go.shape.string] 1189 -> 1170 (-1.60%) slices.medianCmpFunc[go.shape.struct 1118 -> 1113 (-0.45%) slices.medianCmpFunc[go.shape.struct 1214 -> 1209 (-0.41%) slices.medianCmpFunc[go.shape.struct 889 -> 887 (-0.22%) 
slices.medianCmpFunc[go.shape.struct 901 -> 874 (-3.00%) slices.order2Ordered[go.shape.float64] 89 -> 87 (-2.25%) slices.order2Ordered[go.shape.uint16] 75 -> 70 (-6.67%) slices.partialInsertionSortOrdered[go.shape.string] 1115 -> 1110 (-0.45%) slices.partialInsertionSortOrdered[go.shape.uint16] 358 -> 352 (-1.68%) slices.partitionEqualOrdered[go.shape.int] 208 -> 203 (-2.40%) slices.partitionEqualOrdered[go.shape.int32] 208 -> 198 (-4.81%) slices.partitionEqualOrdered[go.shape.int64] 208 -> 203 (-2.40%) slices.partitionEqualOrdered[go.shape.uint32] 208 -> 198 (-4.81%) slices.partitionEqualOrdered[go.shape.uint64] 208 -> 203 (-2.40%) slices.partitionOrdered[go.shape.float64] 538 -> 533 (-0.93%) slices.partitionOrdered[go.shape.int] 437 -> 427 (-2.29%) slices.partitionOrdered[go.shape.int64] 437 -> 427 (-2.29%) slices.partitionOrdered[go.shape.uint16] 447 -> 442 (-1.12%) slices.partitionOrdered[go.shape.uint64] 437 -> 427 (-2.29%) slices.rotateCmpFunc[go.shape.struct 1045 -> 1029 (-1.53%) slices.rotateCmpFunc[go.shape.struct 1205 -> 1163 (-3.49%) slices.rotateCmpFunc[go.shape.struct 1226 -> 1176 (-4.08%) slices.rotateCmpFunc[go.shape.struct 1322 -> 1272 (-3.78%) slices.rotateCmpFunc[go.shape.struct 1419 -> 1400 (-1.34%) slices.rotateCmpFunc[go.shape.uint8] 549 -> 538 (-2.00%) slices.rotateLeft[go.shape.string] 603 -> 588 (-2.49%) slices.rotateLeft[go.shape.uint8] 255 -> 250 (-1.96%) slices.siftDownOrdered[go.shape.int] 181 -> 171 (-5.52%) slices.siftDownOrdered[go.shape.int32] 181 -> 171 (-5.52%) slices.siftDownOrdered[go.shape.int64] 181 -> 171 (-5.52%) slices.siftDownOrdered[go.shape.string] 614 -> 592 (-3.58%) slices.siftDownOrdered[go.shape.uint32] 181 -> 171 (-5.52%) slices.siftDownOrdered[go.shape.uint64] 181 -> 171 (-5.52%) time.parseRFC3339[go.shape.string] 1774 -> 1758 (-0.90%) unique.(canonMap[go.shape.struct 280 -> 276 (-1.43%) unique.clone[go.shape.struct 311 -> 293 (-5.79%) 
weak.Make[go.shape.6880e4598856efac32416085c0172278cf0fb9e5050ce6518bd9b7f7d1662440] 136 -> 134 (-1.47%) weak.Make[go.shape.struct 136 -> 134 (-1.47%) weak.Make[go.shape.uint8] 136 -> 134 (-1.47%) Change-Id: I43dcea5f2aa37372f773e5edc6a2ef1dee0a8db7 Reviewed-on: https://go-review.googlesource.com/c/go/+/706655 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org>	2025-09-30 11:22:35 -07:00
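To make the "generic function that never uses its dictionary" case concrete, here is a small sketch in the shape of cmp.Compare. The function `compare` is written for this example and is simplified (it does not reproduce cmp.Compare's NaN ordering); the `//go:noinline` directive is only there so that a non-inlined instantiation exists to inspect.

```go
package main

import (
	"cmp"
	"fmt"
)

// compare is a simplified stand-in for cmp.Compare. Its body only uses
// < and > on T, so it has no need to consult the generic dictionary
// that is passed as the hidden first argument.
//
//go:noinline
func compare[T cmp.Ordered](x, y T) int {
	if x < y {
		return -1
	}
	if x > y {
		return +1
	}
	return 0
}

func main() {
	fmt.Println(compare(1, 2), compare("b", "a"), compare(3.0, 3.0))
}
```

Building with `go build -gcflags=-S` prints the generated assembly, which is one way to look for a leading nil-check instruction like the TESTB shown in the commit message.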
limeidan	af6999e60d	cmd/compile: implement jump table on loong64 Following CL 357330, use jump tables on Loong64. goos: linux goarch: loong64 pkg: cmd/compile/internal/test cpu: Loongson-3A6000-HV @ 2500.00MHz │ old │ new │ │ sec/op │ sec/op vs base │ Switch8Predictable 2.352n ± 0% 2.101n ± 0% -10.65% (p=0.000 n=10) Switch8Unpredictable 11.99n ± 0% 10.25n ± 0% -14.51% (p=0.000 n=10) Switch32Predictable 3.153n ± 0% 1.887n ± 1% -40.14% (p=0.000 n=10) Switch32Unpredictable 12.47n ± 0% 10.22n ± 0% -18.00% (p=0.000 n=10) SwitchStringPredictable 3.162n ± 0% 3.352n ± 0% +6.01% (p=0.000 n=10) SwitchStringUnpredictable 14.70n ± 0% 13.31n ± 0% -9.46% (p=0.000 n=10) SwitchTypePredictable 3.702n ± 0% 2.201n ± 0% -40.55% (p=0.000 n=10) SwitchTypeUnpredictable 16.18n ± 0% 14.48n ± 0% -10.51% (p=0.000 n=10) SwitchInterfaceTypePredictable 7.654n ± 0% 9.680n ± 0% +26.47% (p=0.000 n=10) SwitchInterfaceTypeUnpredictable 22.04n ± 0% 22.44n ± 0% +1.81% (p=0.000 n=10) geomean 7.441n 6.469n -13.07% Change-Id: Id6f30fa73349c60fac17670084daee56973a955f Reviewed-on: https://go-review.googlesource.com/c/go/+/705396 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>	2025-09-27 05:02:58 -07:00
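The benchmarks above exercise dense `switch` statements, which are the usual candidates for jump tables. The function below is a hypothetical example of such a switch; whether the compiler actually emits a jump table for it depends on the target architecture, the number and density of cases, and the compiler version.

```go
package main

import "fmt"

// opName maps a small, dense range of integer codes to names. With a
// jump table, the case is selected by one indexed branch instead of a
// chain (or binary search) of comparisons.
func opName(op int) string {
	switch op {
	case 0:
		return "add"
	case 1:
		return "sub"
	case 2:
		return "mul"
	case 3:
		return "div"
	case 4:
		return "and"
	case 5:
		return "or"
	case 6:
		return "xor"
	case 7:
		return "shl"
	default:
		return "unknown"
	}
}

func main() {
	for op := 0; op <= 8; op++ {
		fmt.Println(op, opName(op))
	}
}
```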
Xiaolin Zhao	78ef487a6f	cmd/compile: fix the issue of shift amount exceeding the valid range Fixes #75479 Change-Id: I362d3e49090e94f91a840dd5a475978b59222a00 Reviewed-on: https://go-review.googlesource.com/c/go/+/704135 Reviewed-by: Mark Freeman <markfreeman@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn>	2025-09-17 18:05:31 -07:00
Meng Zhuo	2469e92d8c	cmd/compile: combine doubling with shift on riscv64 Change-Id: I4bee2770fedf97e35b5a5b9187a8ba3c41f9ec2e Reviewed-on: https://go-review.googlesource.com/c/go/+/702697 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Joel Sing <joel@sing.id.au> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Keith Randall <khr@google.com>	2025-09-15 17:31:56 -07:00
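The commit message does not spell out the rewrite, so the following is only one reading of the title: a doubling `x + x` followed by a left shift by `c` can be expressed as a single shift by `c + 1`. The sketch checks that identity at the Go level; the function names are made up for this example.

```go
package main

import "fmt"

// doubleThenShift doubles x with an add, then shifts left by c.
func doubleThenShift(x uint64, c uint) uint64 { return (x + x) << c }

// singleShift performs the same computation with one shift.
func singleShift(x uint64, c uint) uint64 { return x << (c + 1) }

// agree checks the identity for a few values and every shift count
// from 0 to 63.
func agree() bool {
	xs := []uint64{0, 1, 3, 0x8000000000000001, ^uint64(0)}
	for _, x := range xs {
		for c := uint(0); c < 64; c++ {
			if doubleThenShift(x, c) != singleShift(x, c) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println("identity holds:", agree())
}
```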
Meng Zhuo	e5ee1f2600	test/codegen: check zerobase for newobject on 0-sized types This CL also adds riscv64 checks Change-Id: I693e4e606f470615f6b49085592d6d5ca61473d3 Reviewed-on: https://go-review.googlesource.com/c/go/+/703716 Reviewed-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com> Auto-Submit: Keith Randall <khr@google.com> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: Joel Sing <joel@sing.id.au> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>	2025-09-15 07:47:55 -07:00
Jake Bailey	dc960d0bfe	cmd/compile, reflect: further allow inlining of TypeFor Previous CLs optimized direct use of abi.Type, but reflect.Type is indirected, so it was not benefiting. For TypeFor, we can use toRType directly without a nil check because the types are statically known. Normally, I'd think SSA would remove the nil check, but an oddity (specifically, late fuse is required to remove the nil check, but opt doesn't run that late) means that the nil check persists and gets in the way. Manually writing the code in this instance seems to fix the problem. It also exposed another problem; depending on the ordering, writeType could get to a type symbol before SSA, thereby preventing Extra from being created on the symbol for later lookups that don't go through TypeLinksym directly. In writeType, for non-shape types, call TypeLinksym to ensure that the type is set up for later callers. That change itself passed toolstash -cmp. All up, this stack put through compilecmp shows a lot of improvement in various reflect-using packages, and reflect itself. 
It is too big to fit in the commit message but here's some info: compilecmp master -> HEAD master (`d767064170`): cmd/compile: mark abi.PtrType.Elem sym as used HEAD (846a94c568): cmd/compile, reflect: further allow inlining of TypeFor file before after Δ % addr2line 3735911 3735391 -520 -0.014% asm 6382235 6382091 -144 -0.002% buildid 3608568 3608360 -208 -0.006% cgo 5951816 5951480 -336 -0.006% compile 28362080 28339772 -22308 -0.079% cover 6668686 6661414 -7272 -0.109% dist 4311961 4311425 -536 -0.012% fix 3771706 3771474 -232 -0.006% link 8686073 8684993 -1080 -0.012% nm 3715923 3715459 -464 -0.012% objdump 6074366 6073774 -592 -0.010% pack 3025653 3025277 -376 -0.012% pprof 18269485 18261653 -7832 -0.043% test2json 3442726 3438390 -4336 -0.126% trace 16984831 16981767 -3064 -0.018% vet 10701931 10696355 -5576 -0.052% total 133693951 133639075 -54876 -0.041% runtime runtime.stkobjinit 240 -> 165 (-31.25%) runtime [cmd/compile] runtime.stkobjinit 240 -> 165 (-31.25%) reflect reflect.Value.Seq2.func3 309 -> 245 (-20.71%) reflect.Value.Seq2.func1.1 281 -> 198 (-29.54%) reflect.Value.Seq.func1.1 242 -> 165 (-31.82%) reflect.Value.Seq2.func2 360 -> 285 (-20.83%) reflect.Value.Seq.func4 281 -> 239 (-14.95%) reflect.Value.Seq2.func4 399 -> 284 (-28.82%) reflect.Value.Seq.func2 271 -> 230 (-15.13%) reflect.TypeFor[go.shape.uint64] 33 -> 18 (-45.45%) reflect.Value.Seq.func3 219 -> 178 (-18.72%) reflect [cmd/compile] reflect.Value.Seq2.func2 360 -> 285 (-20.83%) reflect.Value.Seq.func4 281 -> 239 (-14.95%) reflect.Value.Seq.func2 271 -> 230 (-15.13%) reflect.Value.Seq.func1.1 242 -> 165 (-31.82%) reflect.Value.Seq2.func1.1 281 -> 198 (-29.54%) reflect.Value.Seq2.func3 309 -> 245 (-20.71%) reflect.Value.Seq.func3 219 -> 178 (-18.72%) reflect.TypeFor[go.shape.uint64] 33 -> 18 (-45.45%) reflect.Value.Seq2.func4 399 -> 284 (-28.82%) fmt fmt.(*pp).fmtBytes 1723 -> 1691 (-1.86%) database/sql/driver reflect.TypeFor[go.shape.interface 33 -> 18 (-45.45%) database/sql/driver.init 
72 -> 57 (-20.83%) Change-Id: I9eb750cf0b7ebf532589f939431feb0a899e42ff Reviewed-on: https://go-review.googlesource.com/c/go/+/701301 Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-09-12 09:34:43 -07:00
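For readers unfamiliar with the API this commit optimizes, here is a minimal usage sketch of reflect.TypeFor (available since Go 1.22) next to the older pointer-and-Elem idiom. The wrapper names `uint64Type` and `uint64TypeOld` are chosen for this example.

```go
package main

import (
	"fmt"
	"reflect"
)

// uint64Type returns the reflect.Type for uint64 using the generic helper.
func uint64Type() reflect.Type {
	return reflect.TypeFor[uint64]()
}

// uint64TypeOld is the pre-TypeFor idiom: take the type of a nil
// pointer and ask for its element type.
func uint64TypeOld() reflect.Type {
	return reflect.TypeOf((*uint64)(nil)).Elem()
}

func main() {
	fmt.Println(uint64Type(), uint64Type() == uint64TypeOld())
}
```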
Keith Randall	80a2aae922	Revert "cmd/compile: improve stp merging for non-sequent cases" This reverts commit `4c63d798cb`. Reason for revert: Causes miscompilations. See issue 75365. Change-Id: Icd1fcfeb23d2ec524b16eb556030f43875e1c90d Reviewed-on: https://go-review.googlesource.com/c/go/+/702455 Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Mark Freeman <markfreeman@google.com>	2025-09-10 11:11:11 -07:00
Youlin Feng	a5fa5ea51c	cmd/compile/internal/ssa: expand runtime.memequal for length {3,5,6,7} This CL slightly speeds up strings.HasPrefix when testing constant prefixes of length {3,5,6,7}. goos: linux goarch: amd64 cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz │ old │ new │ │ sec/op │ sec/op vs base │ StringPrefix3-8 11.125n ± 2% 8.539n ± 1% -23.25% (p=0.000 n=20) StringPrefix5-8 11.170n ± 2% 8.700n ± 1% -22.11% (p=0.000 n=20) StringPrefix6-8 11.190n ± 2% 8.655n ± 1% -22.65% (p=0.000 n=20) StringPrefix7-8 11.095n ± 1% 8.878n ± 1% -19.98% (p=0.000 n=20) Change-Id: I510a80d59cf78680b57d68780d35d212d24030e2 Reviewed-on: https://go-review.googlesource.com/c/go/+/700816 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mark Freeman <markfreeman@google.com> Auto-Submit: Keith Randall <khr@golang.org>	2025-09-09 12:10:07 -07:00
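To give a feel for what expanding a short constant-length comparison means, the sketch below checks a 3-byte constant prefix with one 16-bit comparison and one 8-bit comparison instead of a general memory-compare call. This is a hand-written illustration under that assumption; the load widths the compiler's rewrite rules actually choose for lengths 3, 5, 6, and 7 may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixABC reports whether s starts with the constant "abc".
// The first two bytes are combined into a little-endian 16-bit value
// and compared at once; the third byte is compared separately.
func hasPrefixABC(s string) bool {
	if len(s) < 3 {
		return false
	}
	lo := uint16(s[0]) | uint16(s[1])<<8
	return lo == 'a'|'b'<<8 && s[2] == 'c'
}

func main() {
	for _, s := range []string{"abcdef", "abd", "ab", ""} {
		fmt.Println(s, hasPrefixABC(s), strings.HasPrefix(s, "abc"))
	}
}
```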
Melnikov Denis	4c63d798cb	cmd/compile: improve stp merging for non-sequent cases The original algorithm merges stores with the first mergeable store in the chain, but it misses some cases. Additionally reordering the stores in the chain in increasing order of memory access allows merging in these cases. Fixes #71987 Below are the results of the sweet benchmarks and the differences in .text section sizes │ old.results │ new.results │ │ sec/op │ sec/op vs base │ BleveIndexBatch100-4 7.614 ± 2% 7.548 ± 1% ~ (p=0.190 n=10) ESBuildThreeJS-4 821.3m ± 0% 819.0m ± 1% ~ (p=0.165 n=10) ESBuildRomeTS-4 206.2m ± 1% 204.4m ± 1% -0.90% (p=0.023 n=10) EtcdPut-4 64.89m ± 1% 64.94m ± 2% ~ (p=0.684 n=10) EtcdSTM-4 318.4m ± 0% 319.2m ± 1% ~ (p=0.631 n=10) GoBuildKubelet-4 157.4 ± 0% 157.6 ± 0% ~ (p=0.105 n=10) GoBuildKubeletLink-4 12.42 ± 2% 12.41 ± 1% ~ (p=0.529 n=10) GoBuildIstioctl-4 124.4 ± 0% 124.4 ± 0% ~ (p=0.579 n=10) GoBuildIstioctlLink-4 8.700 ± 1% 8.693 ± 1% ~ (p=0.912 n=10) GoBuildFrontend-4 46.52 ± 0% 46.50 ± 0% ~ (p=0.971 n=10) GoBuildFrontendLink-4 2.282 ± 1% 2.272 ± 1% ~ (p=0.529 n=10) GoBuildTsgo-4 75.02 ± 1% 75.31 ± 1% ~ (p=0.436 n=10) GoBuildTsgoLink-4 1.229 ± 1% 1.219 ± 1% -0.82% (p=0.035 n=10) GopherLuaKNucleotide-4 34.77 ± 5% 34.31 ± 1% -1.33% (p=0.015 n=10) MarkdownRenderXHTML-4 286.6m ± 0% 285.7m ± 1% ~ (p=0.315 n=10) Tile38QueryLoad-4 657.2µ ± 1% 660.3µ ± 0% ~ (p=0.436 n=10) geomean 2.570 2.563 -0.24% Executable Old .text New .text Change ------------------------------------------------------- benchmark 6504820 6504020 -0.01% bleve-index-bench 3903860 3903636 -0.01% esbuild 4801012 4801172 +0.00% esbuild-bench 1256404 1256340 -0.01% etcd 9188148 9187076 -0.01% etcd-bench 6462228 6461524 -0.01% go 5924468 5923892 -0.01% go-build-bench 1282004 1281940 -0.00% gopher-lua-bench 1639540 1639348 -0.01% markdown-bench 1478452 1478356 -0.01% tile38-bench 2753524 2753300 -0.01% tile38-server 10241380 10240068 -0.01% Change-Id: Ieb4fdfd656aca458f65fc45938de70550632bd13 
Reviewed-on: https://go-review.googlesource.com/c/go/+/698097 Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: Keith Randall <khr@google.com>	2025-09-09 12:10:01 -07:00
Xiaolin Zhao	f5b20689e9	cmd/compile: optimize loads from readonly globals into constants on loong64 Ref: CL 141118 Update #26498 Change-Id: I9c4ad2bedc4d50bd273bbe9119a898d4fca95e45 Reviewed-on: https://go-review.googlesource.com/c/go/+/700875 Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-09-05 08:42:28 -07:00
Xiaolin Zhao	3492e4262b	cmd/compile: simplify specific addition operations using the ADDV16 instruction On loong64, the addi.d instruction can only directly handle 12-bit immediate numbers. If a larger immediate number needs to be processed, it must first be placed in a register, and then the add.d instruction is used to complete the processing of the larger immediate number. If a larger immediate number c satisfies is32Bit(c) && c&0xffff == 0, then the ADDV16 instruction can be used to complete the addition operation. Removes 164 instructions from the go binary on loong64. Change-Id: I404de93cc4eaaa12fe424f5a0d61b03231215d1a Reviewed-on: https://go-review.googlesource.com/c/go/+/700536 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>	2025-09-05 08:18:04 -07:00
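The immediate-selection condition in the commit message can be written out as a small predicate. This is a sketch: `is32Bit` is re-implemented here to mirror the condition named in the message, while `fitsImm12` and `fitsADDV16` are hypothetical names for this example, and the signed 12-bit range for addi.d is stated as an assumption about the instruction encoding.

```go
package main

import "fmt"

// is32Bit reports whether n fits in a signed 32-bit integer.
func is32Bit(n int64) bool { return n == int64(int32(n)) }

// fitsImm12 reports whether c fits the signed 12-bit immediate that
// addi.d is assumed to accept.
func fitsImm12(c int64) bool { return -2048 <= c && c <= 2047 }

// fitsADDV16 implements the condition from the commit message:
// a 32-bit constant whose low 16 bits are all zero.
func fitsADDV16(c int64) bool { return is32Bit(c) && c&0xffff == 0 }

func main() {
	for _, c := range []int64{100, 0x10000, 0x12345, -0x10000, 1 << 40} {
		fmt.Printf("c=%#x imm12=%v addv16=%v\n", c, fitsImm12(c), fitsADDV16(c))
	}
}
```

Constants that satisfy neither predicate still need to be materialized in a register first, as the message describes.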
Youlin Feng	df29038486	cmd/compile/internal/ssa: load constant values from abi.PtrType.Elem This CL makes the generated code for reflect.TypeFor as simple as an intrinsic function. Fixes #75203 Change-Id: I7bb48787101f07e77ab5c583292e834c28a028d6 Reviewed-on: https://go-review.googlesource.com/c/go/+/700336 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Keith Randall <khr@golang.org>	2025-09-04 07:25:26 -07:00
limeidan	bd71b94659	cmd/compile/internal: optimizing add+sll rule using ALSLV instruction on loong64 Reduce the number of go toolchain instructions on loong64 as follows: file before after Δ % go 1573148 1571708 -1,440 -0.0915% gofmt 320578 320090 -488 -0.1522% asm 555066 554406 -660 -0.1189% cgo 481566 480926 -640 -0.1329% compile 2475962 2473880 -2,082 -0.0841% cover 516536 515920 -616 -0.1193% link 702172 701404 -768 -0.1094% preprofile 238626 238274 -352 -0.1475% vet 792928 792100 -828 -0.1044% Change-Id: I61e462726835959c60e1b4e5256d4020202418ab Reviewed-on: https://go-review.googlesource.com/c/go/+/693877 Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>	2025-08-25 12:30:16 -07:00
Xiaolin Zhao	44c5956bf7	test/codegen: add Mul2 and DivPow2 test for loong64 Change-Id: I29ccd105c5418955146a3f4873162963da489a70 Reviewed-on: https://go-review.googlesource.com/c/go/+/697935 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Carlos Amedee <carlos@golang.org>	2025-08-24 18:14:28 -07:00
Xiaolin Zhao	0aa8019e94	test/codegen: add Mul* test for loong64 Change-Id: Ica285212e4884a96fe9738b53cdc789b223bf2e3 Reviewed-on: https://go-review.googlesource.com/c/go/+/697895 Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: abner chenc <chenguoqi@loongson.cn>	2025-08-24 18:14:22 -07:00
Xiaolin Zhao	83420974b7	test/codegen: add sqrt* abs and copysign test for loong64 Change-Id: I645396fc4b00242f36a06f01550906805c0c1f73 Reviewed-on: https://go-review.googlesource.com/c/go/+/697955 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Carlos Amedee <carlos@golang.org>	2025-08-24 18:14:13 -07:00
limeidan	1843f1e9c0	cmd/compile: use zero register instead of specialized *zero instructions on loong64 Refer to CL 633075, loong64 has a zero(R0) register that can be used to do this. Change-Id: I846c6bdfcfd6dbfa18338afc13e34e350580ead4 Reviewed-on: https://go-review.googlesource.com/c/go/+/693876 Reviewed-by: Carlos Amedee <carlos@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org>	2025-08-21 11:23:05 -07:00
Xiaolin Zhao	9632ba8160	cmd/compile: optimize some patterns into revb2h/revb4h instruction on loong64 Pattern1: (the type of c is uint16) c>>8 | c<<8 To: revb2h c Pattern2: (the type of c is uint32) (c & 0xff00ff00)>>8 | (c & 0x00ff00ff)<<8 To: revb2h c Pattern3: (the type of c is uint64) (c & 0xff00ff00ff00ff00)>>8 | (c & 0x00ff00ff00ff00ff)<<8 To: revb4h c Change-Id: Ic6231a3f476cbacbea4bd00e31193d107cb86cda Reviewed-on: https://go-review.googlesource.com/c/go/+/696335 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-08-21 11:19:34 -07:00
Xiaolin Zhao	fa706ea50f	cmd/compile: optimize rule (x + x) << c to x << c+1 on loong64 Change-Id: I782f93510bba92ba60b298c1c1cde456c8bcec38 Reviewed-on: https://go-review.googlesource.com/c/go/+/697956 Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>	2025-08-21 11:16:49 -07:00
Michael Munday	320df537cc	cmd/compile: emit classify instructions for infinity tests on riscv64 The 'classify' instruction on RISC-V sets a bit in a mask to indicate the class a floating point value belongs to (e.g. whether the value is an infinity, a normal number, a subnormal number and so on). There are other places this instruction is useful but for now I've just used it for infinity tests. The gains are relatively small (~1-2 instructions per IsInf call) but using FCLASSD does potentially unlock further optimizations. It also reduces the number of loads from memory and the number of moves between general purpose and floating point register files. goos: linux goarch: riscv64 pkg: math cpu: Spacemit(R) X60 │ sec/op │ sec/op vs base │ Acos 159.9n ± 0% 173.7n ± 0% +8.66% (p=0.000 n=10) Acosh 249.8n ± 0% 254.4n ± 0% +1.86% (p=0.000 n=10) Asin 159.9n ± 0% 173.7n ± 0% +8.66% (p=0.000 n=10) Asinh 292.2n ± 0% 283.0n ± 0% -3.15% (p=0.000 n=10) Atan 119.1n ± 0% 119.0n ± 0% -0.08% (p=0.036 n=10) Atanh 265.1n ± 0% 271.6n ± 0% +2.43% (p=0.000 n=10) Atan2 194.9n ± 0% 186.7n ± 0% -4.23% (p=0.000 n=10) Cbrt 216.3n ± 0% 203.1n ± 0% -6.10% (p=0.000 n=10) Ceil 31.82n ± 0% 31.81n ± 0% ~ (p=0.063 n=10) Copysign 4.897n ± 0% 4.893n ± 3% -0.08% (p=0.038 n=10) Cos 123.9n ± 0% 107.7n ± 1% -13.03% (p=0.000 n=10) Cosh 293.0n ± 0% 264.6n ± 0% -9.68% (p=0.000 n=10) Erf 150.0n ± 0% 133.8n ± 0% -10.80% (p=0.000 n=10) Erfc 151.8n ± 0% 137.9n ± 0% -9.16% (p=0.000 n=10) Erfinv 173.8n ± 0% 173.8n ± 0% ~ (p=0.820 n=10) Erfcinv 173.8n ± 0% 173.8n ± 0% ~ (p=1.000 n=10) Exp 247.7n ± 0% 220.4n ± 0% -11.04% (p=0.000 n=10) ExpGo 261.4n ± 0% 232.5n ± 0% -11.04% (p=0.000 n=10) Expm1 176.2n ± 0% 164.9n ± 0% -6.41% (p=0.000 n=10) Exp2 220.4n ± 0% 190.2n ± 0% -13.70% (p=0.000 n=10) Exp2Go 232.5n ± 0% 204.0n ± 0% -12.22% (p=0.000 n=10) Abs 4.897n ± 0% 4.897n ± 0% ~ (p=0.726 n=10) Dim 16.32n ± 0% 16.31n ± 0% ~ (p=0.770 n=10) Floor 31.84n ± 0% 31.83n ± 0% ~ (p=0.677 n=10) Max 26.11n ± 0% 26.13n ± 0% ~ (p=0.290 
n=10) Min 26.10n ± 0% 26.11n ± 0% ~ (p=0.424 n=10) Mod 416.2n ± 0% 337.8n ± 0% -18.83% (p=0.000 n=10) Frexp 63.65n ± 0% 50.60n ± 0% -20.50% (p=0.000 n=10) Gamma 218.8n ± 0% 206.4n ± 0% -5.62% (p=0.000 n=10) Hypot 92.20n ± 0% 94.69n ± 0% +2.70% (p=0.000 n=10) HypotGo 107.7n ± 0% 109.3n ± 0% +1.49% (p=0.000 n=10) Ilogb 59.54n ± 0% 44.04n ± 0% -26.04% (p=0.000 n=10) J0 708.9n ± 0% 674.5n ± 0% -4.86% (p=0.000 n=10) J1 707.6n ± 0% 676.1n ± 0% -4.44% (p=0.000 n=10) Jn 1.513µ ± 0% 1.427µ ± 0% -5.68% (p=0.000 n=10) Ldexp 70.20n ± 0% 57.09n ± 0% -18.68% (p=0.000 n=10) Lgamma 201.5n ± 0% 185.3n ± 1% -8.01% (p=0.000 n=10) Log 201.5n ± 0% 182.7n ± 0% -9.35% (p=0.000 n=10) Logb 59.54n ± 0% 46.53n ± 0% -21.86% (p=0.000 n=10) Log1p 178.8n ± 0% 173.9n ± 6% -2.74% (p=0.021 n=10) Log10 201.4n ± 0% 184.3n ± 0% -8.49% (p=0.000 n=10) Log2 79.17n ± 0% 66.07n ± 0% -16.54% (p=0.000 n=10) Modf 34.27n ± 0% 34.25n ± 0% ~ (p=0.559 n=10) Nextafter32 49.34n ± 0% 49.37n ± 0% +0.05% (p=0.040 n=10) Nextafter64 43.66n ± 0% 43.66n ± 0% ~ (p=0.869 n=10) PowInt 309.1n ± 0% 267.4n ± 0% -13.49% (p=0.000 n=10) PowFrac 769.6n ± 0% 677.3n ± 0% -11.98% (p=0.000 n=10) Pow10Pos 13.88n ± 0% 13.88n ± 0% ~ (p=0.811 n=10) Pow10Neg 19.58n ± 0% 19.57n ± 0% ~ (p=0.993 n=10) Round 23.65n ± 0% 23.66n ± 0% ~ (p=0.354 n=10) RoundToEven 27.75n ± 0% 27.75n ± 0% ~ (p=0.971 n=10) Remainder 380.0n ± 0% 309.9n ± 0% -18.45% (p=0.000 n=10) Signbit 13.06n ± 0% 13.06n ± 0% ~ (p=1.000 n=10) Sin 133.8n ± 0% 120.8n ± 0% -9.75% (p=0.000 n=10) Sincos 160.7n ± 0% 147.7n ± 0% -8.12% (p=0.000 n=10) Sinh 305.9n ± 0% 277.9n ± 0% -9.17% (p=0.000 n=10) SqrtIndirect 3.265n ± 0% 3.264n ± 0% ~ (p=0.546 n=10) SqrtLatency 19.58n ± 0% 19.58n ± 0% ~ (p=0.973 n=10) SqrtIndirectLatency 19.59n ± 0% 19.58n ± 0% ~ (p=0.370 n=10) SqrtGoLatency 205.7n ± 0% 202.7n ± 0% -1.46% (p=0.000 n=10) SqrtPrime 4.953µ ± 0% 4.954µ ± 0% ~ (p=0.477 n=10) Tan 163.2n ± 0% 150.2n ± 0% -7.99% (p=0.000 n=10) Tanh 312.4n ± 0% 284.2n ± 0% -9.01% (p=0.000 n=10) Trunc 31.83n ± 
0% 31.83n ± 0% ~ (p=0.663 n=10) Y0 701.0n ± 0% 669.2n ± 0% -4.54% (p=0.000 n=10) Y1 704.5n ± 0% 672.4n ± 0% -4.55% (p=0.000 n=10) Yn 1.490µ ± 0% 1.422µ ± 0% -4.60% (p=0.000 n=10) Float64bits 5.713n ± 0% 5.710n ± 0% ~ (p=0.926 n=10) Float64frombits 4.896n ± 0% 4.896n ± 0% ~ (p=0.663 n=10) Float32bits 12.25n ± 0% 12.25n ± 0% ~ (p=0.571 n=10) Float32frombits 4.898n ± 0% 4.896n ± 0% ~ (p=0.754 n=10) FMA 4.895n ± 0% 4.895n ± 0% ~ (p=0.745 n=10) geomean 94.40n 89.43n -5.27% Change-Id: I4fe0f2e9f609e38d79463f9ba2519a3f9427432e Reviewed-on: https://go-review.googlesource.com/c/go/+/348389 Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Keith Randall <khr@google.com>	2025-08-13 20:33:56 -07:00
limeidan	90b7d7aaa2	cmd/compile/internal: optimize multiplication use new operation 'ADDshiftLLV' on loong64 goos: linux goarch: loong64 pkg: cmd/compile/internal/test cpu: Loongson-3A6000-HV @ 2500.00MHz │ old │ new │ │ sec/op │ sec/op vs base │ MulconstI32/3 0.8004n ± 0% 0.4247n ± 2% -46.94% (p=0.000 n=10) MulconstI32/5 0.8005n ± 0% 0.4256n ± 1% -46.83% (p=0.000 n=10) MulconstI32/12 1.2010n ± 0% 0.8005n ± 0% -33.35% (p=0.000 n=10) MulconstI32/120 0.8090n ± 0% 0.8067n ± 0% -0.28% (p=0.007 n=10) MulconstI32/-120 0.8109n ± 0% 0.8072n ± 0% -0.47% (p=0.000 n=10) MulconstI32/65537 0.8004n ± 0% 0.8004n ± 0% ~ (p=1.000 n=10) MulconstI32/65538 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.265 n=10) MulconstI64/3 0.8005n ± 0% 0.4241n ± 1% -47.02% (p=0.000 n=10) MulconstI64/5 0.8004n ± 0% 0.4249n ± 1% -46.91% (p=0.000 n=10) MulconstI64/12 1.2010n ± 0% 0.8004n ± 0% -33.36% (p=0.000 n=10) MulconstI64/120 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.635 n=10) MulconstI64/-120 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.837 n=10) MulconstI64/65537 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.837 n=10) MulconstI64/65538 0.8096n ± 0% 0.8004n ± 0% -1.14% (p=0.000 n=10) MulconstU32/3 0.8004n ± 0% 0.4263n ± 1% -46.75% (p=0.000 n=10) MulconstU32/5 0.8005n ± 0% 0.4262n ± 1% -46.76% (p=0.000 n=10) MulconstU32/12 1.2010n ± 0% 0.8005n ± 0% -33.35% (p=0.000 n=10) MulconstU32/120 0.8105n ± 0% 0.8096n ± 0% ~ (p=0.183 n=10) MulconstU32/65537 0.8004n ± 0% 0.8004n ± 0% ~ (p=1.000 n=10) MulconstU32/65538 0.8005n ± 0% 0.8005n ± 0% ~ (p=1.000 n=10) MulconstU64/3 0.8004n ± 0% 0.4265n ± 4% -46.71% (p=0.000 n=10) MulconstU64/5 0.8004n ± 0% 0.4256n ± 0% -46.82% (p=0.000 n=10) MulconstU64/12 1.2010n ± 0% 0.8004n ± 0% -33.36% (p=0.000 n=10) MulconstU64/120 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.387 n=10) MulconstU64/65537 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.265 n=10) MulconstU64/65538 0.8080n ± 0% 0.8004n ± 0% -0.93% (p=0.000 n=10) geomean 0.8539n 0.6597n -22.74% Change-Id: Ie33e88985d7639f481bbba540bc917b9f185c357 Reviewed-on: 
https://go-review.googlesource.com/c/go/+/693855 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-08-12 23:01:49 -07:00
Keith Randall	f04421ea9a	cmd/compile: soften test for 74788 We now (as of CL 678620) use float registers other than X0 for copying. Change-Id: Ifdecd5df7519663742eed0f292c98453754d4b25 Reviewed-on: https://go-review.googlesource.com/c/go/+/695275 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Jorropo <jorropo.pgm@gmail.com>	2025-08-12 10:05:55 -07:00
Michael Munday	084c0f8494	cmd/compile: allow InlMark operations to be speculatively executed Although InlMark takes a memory argument it ultimately becomes a NOP and therefore is safe to speculatively execute. Fixes #74915 Change-Id: I64317dd433e300ac28de2bcf201845083ec2ac82 Reviewed-on: https://go-review.googlesource.com/c/go/+/693795 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>	2025-08-11 00:52:23 -07:00
Xiaolin Zhao	a552737418	cmd/compile: fold negation into multiplication on loong64 This change also adds corresponding benchmark tests and codegen tests. The performance improvement on CPU Loongson-3A6000-HV is as follows: goos: linux goarch: loong64 pkg: cmd/compile/internal/test cpu: Loongson-3A6000-HV @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | MulNeg 828.4n ± 0% 655.9n ± 0% -20.82% (p=0.000 n=10) Mul2Neg 1062.0n ± 0% 826.8n ± 0% -22.15% (p=0.000 n=10) geomean 938.0n 736.4n -21.49% Change-Id: Ia999732880ec65be0c66cddc757a4868847e5b15 Reviewed-on: https://go-review.googlesource.com/c/go/+/682535 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mark Freeman <markfreeman@google.com>	2025-08-05 18:02:06 -07:00
Michael Munday	fcc036f03b	cmd/compile: optimise float <-> int register moves on riscv64 Use the FMV* instructions to move values between the floating point and integer register files. Note: I'm unsure why there is a slowdown in the Float32bits benchmark, I've checked and an FMVXS instruction is being used as expected. There are multiple loads and other instructions in the main loop. goos: linux goarch: riscv64 pkg: math cpu: Spacemit(R) X60 │ fmv-before.txt │ fmv-after.txt │ │ sec/op │ sec/op vs base │ Acos 122.7n ± 0% 122.7n ± 0% ~ (p=1.000 n=10) Acosh 197.2n ± 0% 191.5n ± 0% -2.89% (p=0.000 n=10) Asin 122.7n ± 0% 122.7n ± 0% ~ (p=0.474 n=10) Asinh 231.0n ± 0% 224.1n ± 0% -2.99% (p=0.000 n=10) Atan 91.39n ± 0% 91.41n ± 0% ~ (p=0.465 n=10) Atanh 210.3n ± 0% 203.4n ± 0% -3.26% (p=0.000 n=10) Atan2 149.6n ± 0% 149.6n ± 0% ~ (p=0.721 n=10) Cbrt 176.5n ± 0% 165.9n ± 0% -6.01% (p=0.000 n=10) Ceil 25.67n ± 0% 24.42n ± 0% -4.87% (p=0.000 n=10) Copysign 3.756n ± 0% 3.756n ± 0% ~ (p=0.149 n=10) Cos 95.15n ± 0% 95.15n ± 0% ~ (p=0.374 n=10) Cosh 228.6n ± 0% 224.7n ± 0% -1.71% (p=0.000 n=10) Erf 115.2n ± 0% 115.2n ± 0% ~ (p=0.474 n=10) Erfc 116.4n ± 0% 116.4n ± 0% ~ (p=0.628 n=10) Erfinv 133.3n ± 0% 133.3n ± 0% ~ (p=1.000 n=10) Erfcinv 133.3n ± 0% 133.3n ± 0% ~ (p=1.000 n=10) Exp 194.1n ± 0% 190.3n ± 0% -1.93% (p=0.000 n=10) ExpGo 204.7n ± 0% 200.3n ± 0% -2.15% (p=0.000 n=10) Expm1 137.7n ± 0% 135.2n ± 0% -1.82% (p=0.000 n=10) Exp2 173.4n ± 0% 169.0n ± 0% -2.54% (p=0.000 n=10) Exp2Go 182.8n ± 0% 178.4n ± 0% -2.41% (p=0.000 n=10) Abs 3.756n ± 0% 3.756n ± 0% ~ (p=0.157 n=10) Dim 12.52n ± 0% 12.52n ± 0% ~ (p=0.737 n=10) Floor 25.67n ± 0% 24.42n ± 0% -4.87% (p=0.000 n=10) Max 21.29n ± 0% 20.03n ± 0% -5.92% (p=0.000 n=10) Min 21.28n ± 0% 20.04n ± 0% -5.85% (p=0.000 n=10) Mod 344.9n ± 0% 319.2n ± 0% -7.45% (p=0.000 n=10) Frexp 55.71n ± 0% 48.85n ± 0% -12.30% (p=0.000 n=10) Gamma 165.9n ± 0% 167.8n ± 0% +1.15% (p=0.000 n=10) Hypot 73.24n ± 0% 70.74n ± 0% -3.41% (p=0.000 n=10) HypotGo 
84.50n ± 0% 82.63n ± 0% -2.21% (p=0.000 n=10) Ilogb 49.45n ± 0% 45.70n ± 0% -7.59% (p=0.000 n=10) J0 556.5n ± 0% 544.0n ± 0% -2.25% (p=0.000 n=10) J1 555.3n ± 0% 542.8n ± 0% -2.24% (p=0.000 n=10) Jn 1.181µ ± 0% 1.156µ ± 0% -2.12% (p=0.000 n=10) Ldexp 59.47n ± 0% 53.84n ± 0% -9.47% (p=0.000 n=10) Lgamma 167.2n ± 0% 154.6n ± 0% -7.51% (p=0.000 n=10) Log 160.9n ± 0% 154.6n ± 0% -3.92% (p=0.000 n=10) Logb 49.45n ± 0% 45.70n ± 0% -7.58% (p=0.000 n=10) Log1p 147.1n ± 0% 137.1n ± 0% -6.80% (p=0.000 n=10) Log10 162.1n ± 1% 154.6n ± 0% -4.63% (p=0.000 n=10) Log2 66.99n ± 0% 60.72n ± 0% -9.36% (p=0.000 n=10) Modf 29.42n ± 0% 26.29n ± 0% -10.64% (p=0.000 n=10) Nextafter32 41.95n ± 0% 37.88n ± 0% -9.70% (p=0.000 n=10) Nextafter64 38.82n ± 0% 33.49n ± 0% -13.73% (p=0.000 n=10) PowInt 252.3n ± 0% 237.3n ± 0% -5.95% (p=0.000 n=10) PowFrac 615.5n ± 0% 589.7n ± 0% -4.19% (p=0.000 n=10) Pow10Pos 10.64n ± 0% 10.64n ± 0% ~ (p=1.000 n=10) Pow10Neg 24.42n ± 0% 15.02n ± 0% -38.49% (p=0.000 n=10) Round 21.91n ± 0% 18.16n ± 0% -17.12% (p=0.000 n=10) RoundToEven 24.42n ± 0% 21.29n ± 0% -12.84% (p=0.000 n=10) Remainder 308.0n ± 0% 291.2n ± 0% -5.44% (p=0.000 n=10) Signbit 10.02n ± 0% 10.02n ± 0% ~ (p=1.000 n=10) Sin 102.7n ± 0% 102.7n ± 0% ~ (p=0.211 n=10) Sincos 124.0n ± 1% 123.3n ± 0% -0.56% (p=0.002 n=10) Sinh 239.1n ± 0% 234.7n ± 0% -1.84% (p=0.000 n=10) SqrtIndirect 2.504n ± 0% 2.504n ± 0% ~ (p=0.303 n=10) SqrtLatency 15.03n ± 0% 15.02n ± 0% ~ (p=0.598 n=10) SqrtIndirectLatency 15.02n ± 0% 15.02n ± 0% ~ (p=0.907 n=10) SqrtGoLatency 165.3n ± 0% 157.2n ± 0% -4.90% (p=0.000 n=10) SqrtPrime 3.801µ ± 0% 3.802µ ± 0% ~ (p=1.000 n=10) Tan 125.2n ± 0% 125.2n ± 0% ~ (p=0.458 n=10) Tanh 244.2n ± 0% 239.9n ± 0% -1.76% (p=0.000 n=10) Trunc 25.67n ± 0% 24.42n ± 0% -4.87% (p=0.000 n=10) Y0 550.2n ± 0% 538.1n ± 0% -2.21% (p=0.000 n=10) Y1 552.8n ± 0% 540.6n ± 0% -2.21% (p=0.000 n=10) Yn 1.168µ ± 0% 1.143µ ± 0% -2.14% (p=0.000 n=10) Float64bits 8.139n ± 0% 4.385n ± 0% -46.13% (p=0.000 n=10) 
Float64frombits 7.512n ± 0% 3.759n ± 0% -49.96% (p=0.000 n=10) Float32bits 8.138n ± 0% 9.393n ± 0% +15.42% (p=0.000 n=10) Float32frombits 7.513n ± 0% 3.757n ± 0% -49.98% (p=0.000 n=10) FMA 3.756n ± 0% 3.756n ± 0% ~ (p=0.246 n=10) geomean 77.43n 72.42n -6.47% Change-Id: I8dac69b1d17cb3d2af78d1c844d2b5d80000d667 Reviewed-on: https://go-review.googlesource.com/c/go/+/599235 Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Munday <mikemndy@gmail.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org>	2025-08-05 08:27:15 -07:00
Xiaolin Zhao	e071617222	cmd/compile: optimize multiplication rules on loong64 Improve multiplication strength reduction, refer to CL 626998, add additional 3 linear combination instructions for loong64. goos: linux goarch: loong64 pkg: cmd/compile/internal/test cpu: Loongson-3A6000-HV @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| MulconstI32/3 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstI32/5 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstI32/12 1.601n ± 0% 1.201n ± 0% -24.98% (p=0.000 n=10) MulconstI32/120 1.6010n ± 0% 0.8130n ± 0% -49.22% (p=0.000 n=10) MulconstI32/-120 1.6010n ± 0% 0.8109n ± 0% -49.35% (p=0.000 n=10) MulconstI32/65537 1.6275n ± 0% 0.8005n ± 0% -50.81% (p=0.000 n=10) MulconstI32/65538 1.6290n ± 0% 0.8004n ± 0% -50.87% (p=0.000 n=10) MulconstI64/3 1.6010n ± 0% 0.8004n ± 0% -50.01% (p=0.000 n=10) MulconstI64/5 1.6010n ± 0% 0.8004n ± 0% -50.01% (p=0.000 n=10) MulconstI64/12 1.601n ± 0% 1.201n ± 0% -24.98% (p=0.000 n=10) MulconstI64/120 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstI64/-120 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstI64/65537 1.6270n ± 0% 0.8005n ± 0% -50.80% (p=0.000 n=10) MulconstI64/65538 1.6290n ± 0% 0.8071n ± 1% -50.45% (p=0.000 n=10) MulconstU32/3 1.6010n ± 0% 0.8004n ± 0% -50.01% (p=0.000 n=10) MulconstU32/5 1.6010n ± 0% 0.8004n ± 0% -50.01% (p=0.000 n=10) MulconstU32/12 1.601n ± 0% 1.201n ± 0% -24.98% (p=0.000 n=10) MulconstU32/120 1.6010n ± 0% 0.8066n ± 0% -49.62% (p=0.000 n=10) MulconstU32/65537 1.6290n ± 0% 0.8005n ± 0% -50.86% (p=0.000 n=10) MulconstU32/65538 1.6280n ± 0% 0.8005n ± 0% -50.83% (p=0.000 n=10) MulconstU64/3 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstU64/5 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstU64/12 1.601n ± 0% 1.201n ± 0% -24.98% (p=0.000 n=10) MulconstU64/120 1.6010n ± 0% 0.8005n ± 0% -50.00% (p=0.000 n=10) MulconstU64/65537 1.6290n ± 0% 0.8005n ± 0% -50.86% (p=0.000 n=10) MulconstU64/65538 1.6300n ± 
0% 0.8067n ± 0% -50.51% (p=0.000 n=10) geomean 1.609n 0.8537n -46.95% goos: linux goarch: loong64 pkg: cmd/compile/internal/test cpu: Loongson-3A5000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| MulconstI32/3 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstI32/5 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstI32/12 1.601n ± 0% 1.202n ± 0% -24.92% (p=0.000 n=10) MulconstI32/120 1.6020n ± 0% 0.8012n ± 0% -49.99% (p=0.000 n=10) MulconstI32/-120 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstI32/65537 1.6020n ± 0% 0.8007n ± 0% -50.02% (p=0.000 n=10) MulconstI32/65538 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstI64/3 1.6015n ± 0% 0.8007n ± 0% -50.00% (p=0.000 n=10) MulconstI64/5 1.6020n ± 0% 0.8007n ± 0% -50.02% (p=0.000 n=10) MulconstI64/12 1.602n ± 0% 1.202n ± 0% -25.00% (p=0.000 n=10) MulconstI64/120 1.6030n ± 0% 0.8011n ± 0% -50.02% (p=0.000 n=10) MulconstI64/-120 1.6020n ± 0% 0.8007n ± 0% -50.02% (p=0.000 n=10) MulconstI64/65537 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstI64/65538 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstU32/3 1.6010n ± 0% 0.8006n ± 0% -49.99% (p=0.000 n=10) MulconstU32/5 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstU32/12 1.601n ± 0% 1.202n ± 0% -24.92% (p=0.000 n=10) MulconstU32/120 1.6010n ± 0% 0.8006n ± 0% -49.99% (p=0.000 n=10) MulconstU32/65537 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstU32/65538 1.6020n ± 0% 0.8009n ± 0% -50.01% (p=0.000 n=10) MulconstU64/3 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstU64/5 1.6010n ± 0% 0.8007n ± 0% -49.98% (p=0.000 n=10) MulconstU64/12 1.601n ± 0% 1.201n ± 0% -24.98% (p=0.000 n=10) MulconstU64/120 1.6020n ± 0% 0.8007n ± 0% -50.02% (p=0.000 n=10) MulconstU64/65537 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) MulconstU64/65538 1.6010n ± 0% 0.8007n ± 0% -49.99% (p=0.000 n=10) geomean 1.601n 0.8523n -46.77% Change-Id: I9fb0e47ca57875da171a347bf4828adfab41b875 Reviewed-on: 
https://go-review.googlesource.com/c/go/+/675455 Reviewed-by: Mark Freeman <mark@golang.org> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Keith Randall <khr@golang.org>	2025-08-01 08:42:40 -07:00
Keith Randall	eb7f515c4d	cmd/compile: use generated loops instead of DUFFZERO on amd64 goarch: amd64 cpu: 12th Gen Intel(R) Core(TM) i7-12700 │ base │ exp │ │ sec/op │ sec/op vs base │ MemclrKnownSize112-20 1.270n ± 14% 1.006n ± 0% -20.72% (p=0.000 n=10) MemclrKnownSize128-20 1.266n ± 0% 1.005n ± 0% -20.58% (p=0.000 n=10) MemclrKnownSize192-20 1.771n ± 0% 1.579n ± 1% -10.84% (p=0.000 n=10) MemclrKnownSize248-20 4.034n ± 0% 3.520n ± 0% -12.75% (p=0.000 n=10) MemclrKnownSize256-20 2.269n ± 0% 2.014n ± 0% -11.26% (p=0.000 n=10) MemclrKnownSize512-20 4.280n ± 0% 4.030n ± 0% -5.84% (p=0.000 n=10) MemclrKnownSize1024-20 8.309n ± 1% 8.057n ± 0% -3.03% (p=0.000 n=10) Change-Id: I8f1627e2a1e981ff351dc7178932b32a2627f765 Reviewed-on: https://go-review.googlesource.com/c/go/+/678937 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>	2025-07-31 17:12:39 -07:00
Michael Munday	cedf63616a	cmd/compile: add floating point min/max intrinsics on s390x Add the VECTOR FP (MINIMUM|MAXIMUM) instructions to the assembler and use them in the compiler to implement min and max. Note: I've allowed floating point registers to be used with the single element instructions (those with the W instead of V prefix) to allow easier integration into the compiler. Change-Id: I5f80a510bd248cf483cce95f1979bf63fbae7de6 Reviewed-on: https://go-review.googlesource.com/c/go/+/684715 Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mark Freeman <mark@golang.org> Reviewed-by: Keith Randall <khr@google.com>	2025-07-30 12:29:15 -07:00


620 commits