Mirror of https://github.com/golang/go.git, synced 2025-12-08 06:10:04 +00:00 (5447 commits).
509ddf3868 |
cmd/compile: ensure bloop only keeps addressable nodes alive
Fixes #76636 Change-Id: I881f88dbf62a901452c1d77e6ffca651451c7790 Reviewed-on: https://go-review.googlesource.com/c/go/+/725420 Auto-Submit: Cuong Manh Le <cuong.manhle.vn@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> |
3f94f3d4b2 |
test/codegen: fix shift tests on riscv64
These were broken by CL 721206, which changes Rsh to RshU for positive inputs. Change-Id: I9e38c3c428fb8aeb70cf51e7e76f4711c864f027 Reviewed-on: https://go-review.googlesource.com/c/go/+/723340 Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com> Reviewed-by: Mark Ryan <markdryan@rivosinc.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Jorropo <jorropo.pgm@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
2ac1f9cbc3 |
cmd/compile: avoid unnecessary interface conversion in bloop
Fixes #76482 Change-Id: I076568d8ae92ad6c9e0a5797cfe5bbfb615f63d2 Reviewed-on: https://go-review.googlesource.com/c/go/+/725180 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Auto-Submit: Cuong Manh Le <cuong.manhle.vn@gmail.com> |
4879151d1d |
cmd/compile: introduce alias analysis and automatically free non-aliased memory after growslice
This CL is part of a set of CLs that attempt to reduce how much work the GC must do. See the design in https://go.dev/design/74299-runtime-freegc

This CL updates the compiler to examine append calls to prove whether or not the slice is aliased. If proven unaliased, the compiler automatically inserts a call to a new runtime function introduced with this CL, runtime.growsliceNoAlias, which frees the old backing memory immediately after slice growth is complete and the old storage is logically dead.

Two append benchmarks below show promising results, executing up to ~2x faster and with up to a factor of ~3 reduction in memory with this CL. The approach works with multiple append calls for the same slice, including inside loops, and the final slice memory can be escaping, such as in a classic pattern of returning a slice from a function after the slice is built. (The final slice memory is never freed with this CL, though we have other work that tackles that.)

An example target for this CL is that we automatically free the intermediate memory for the appends in the loop in this function:

func f1(input []int) []int {
    var s []int
    for _, x := range input {
        s = append(s, g(x)) // s cannot be aliased here
        if h(x) {
            s = append(s, x) // s cannot be aliased here
        }
    }
    return s // slice escapes at end
}

In this case, the compiler and the runtime collaborate so that the heap allocated backing memory for s is automatically freed after a successful grow. (For the first grow, there is nothing to free, but for the second and subsequent growths, the old heap memory is freed automatically.) The new runtime.growsliceNoAlias is primarily implemented by calling runtime.freegc, which we introduced in CL 673695.

The high-level approach here is that we step through the IR starting from a slice declaration and look for any operations that either alias the slice or might do so, and treat any IR construct we don't specifically handle as a potential alias (and therefore conservatively fall back to treating the slice as aliased when encountering something not understood).

For loops, some additional care is required. We arrange the analysis so that an alias in the body of a loop causes all the appends in that same loop body to be marked aliased, even if the aliasing occurs after the append in the IR:

func f2() {
    var s []int
    for i := range 10 {
        s = append(s, i) // aliased due to next line
        alias = s
    }
}

For nested loops, we analyse the nesting appropriately so that, for example, this append is still proven as non-aliased in the inner loop even though it is aliased for the outer loop:

func f3() {
    for range 10 {
        var s []int
        for i := range 10 {
            s = append(s, i) // append using non-aliased slice
        }
        alias = s
    }
}

A good starting point is the beginning of the test/escape_alias.go file, which starts with ~10 introductory examples with brief comments that attempt to illustrate the high-level approach. For more details, see the new .../internal/escape/alias.go file, especially the (*aliasAnalysis).analyze method.

In the first benchmark, an append in a loop builds up a slice from nothing, where the slice elements are each 64 bytes. In the table below, 'count' is the number of appends. With 1 append, there is no opportunity for this CL to free memory. Once there are 2 appends, the growth from 1 element to 2 elements means the compiler-inserted growsliceNoAlias frees the 1-element array, and we see a ~33% reduction in memory use and a small reported speed improvement.
As the number of appends increases for example to 5, we are at a ~20% speed improvement and ~45% memory reduction, and so on until we reach ~40% faster and ~50% less memory allocated at the end of the table. There can be variation in the reported numbers based on -randlayout, so this table is for 30 different values of -randlayout with a total n=150. (Even so, there is still some variation, so we probably should not read too much into small changes.) This is with GOAMD64=v3 on a VM that gcc reports is cascadelake. goos: linux goarch: amd64 pkg: runtime cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ sec/op │ sec/op vs base │ Append64Bytes/count=1-4 31.09n ± 2% 31.69n ± 1% +1.95% (n=150) Append64Bytes/count=2-4 73.31n ± 1% 70.27n ± 0% -4.15% (n=150) Append64Bytes/count=3-4 142.7n ± 1% 124.6n ± 1% -12.68% (n=150) Append64Bytes/count=4-4 149.6n ± 1% 127.7n ± 0% -14.64% (n=150) Append64Bytes/count=5-4 277.1n ± 1% 213.6n ± 0% -22.90% (n=150) Append64Bytes/count=6-4 280.7n ± 1% 216.5n ± 1% -22.87% (n=150) Append64Bytes/count=10-4 544.3n ± 1% 386.6n ± 0% -28.97% (n=150) Append64Bytes/count=20-4 1058.5n ± 1% 715.6n ± 1% -32.39% (n=150) Append64Bytes/count=50-4 2.121µ ± 1% 1.404µ ± 1% -33.83% (n=150) Append64Bytes/count=100-4 4.152µ ± 1% 2.736µ ± 1% -34.11% (n=150) Append64Bytes/count=200-4 7.753µ ± 1% 4.882µ ± 1% -37.03% (n=150) Append64Bytes/count=400-4 15.163µ ± 2% 9.273µ ± 1% -38.84% (n=150) geomean 601.8n 455.0n -24.39% │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ B/op │ B/op vs base │ Append64Bytes/count=1-4 64.00 ± 0% 64.00 ± 0% ~ (n=150) Append64Bytes/count=2-4 192.0 ± 0% 128.0 ± 0% -33.33% (n=150) Append64Bytes/count=3-4 448.0 ± 0% 256.0 ± 0% -42.86% (n=150) Append64Bytes/count=4-4 448.0 ± 0% 256.0 ± 0% -42.86% (n=150) Append64Bytes/count=5-4 960.0 ± 0% 512.0 ± 0% -46.67% (n=150) Append64Bytes/count=6-4 960.0 ± 0% 512.0 ± 0% -46.67% (n=150) Append64Bytes/count=10-4 1.938Ki ± 0% 1.000Ki ± 0% -48.39% (n=150) Append64Bytes/count=20-4 3.938Ki ± 0% 2.001Ki ± 0% -49.18% (n=150) Append64Bytes/count=50-4 7.938Ki ± 0% 4.005Ki ± 0% -49.54% (n=150) Append64Bytes/count=100-4 15.938Ki ± 0% 8.021Ki ± 0% -49.67% (n=150) Append64Bytes/count=200-4 31.94Ki ± 0% 16.08Ki ± 0% -49.64% (n=150) Append64Bytes/count=400-4 63.94Ki ± 0% 32.33Ki ± 0% -49.44% (n=150) geomean 1.991Ki 1.124Ki -43.54% │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ allocs/op │ allocs/op vs base │ Append64Bytes/count=1-4 1.000 ± 0% 1.000 ± 0% ~ (n=150) Append64Bytes/count=2-4 2.000 ± 0% 1.000 ± 0% -50.00% (n=150) Append64Bytes/count=3-4 3.000 ± 0% 1.000 ± 0% -66.67% (n=150) Append64Bytes/count=4-4 3.000 ± 0% 1.000 ± 0% -66.67% (n=150) Append64Bytes/count=5-4 4.000 ± 0% 1.000 ± 0% -75.00% (n=150) Append64Bytes/count=6-4 4.000 ± 0% 1.000 ± 0% -75.00% (n=150) Append64Bytes/count=10-4 5.000 ± 0% 1.000 ± 0% -80.00% (n=150) Append64Bytes/count=20-4 6.000 ± 0% 1.000 ± 0% -83.33% (n=150) Append64Bytes/count=50-4 7.000 ± 0% 1.000 ± 0% -85.71% (n=150) Append64Bytes/count=100-4 8.000 ± 0% 1.000 ± 0% -87.50% (n=150) Append64Bytes/count=200-4 9.000 ± 0% 1.000 ± 0% -88.89% (n=150) Append64Bytes/count=400-4 10.000 ± 0% 1.000 ± 0% -90.00% (n=150) geomean 4.331 1.000 -76.91% The second benchmark is similar, but instead uses an 8-byte integer for the slice element. The first 4 appends in the loop never call into the runtime thanks to the excellent CL 664299 introduced by Keith in Go 1.25 that allows some <= 32 byte dynamically-sized slices to be on the stack, so this CL is neutral for <= 32 bytes. 
Once the 5th append occurs at count=5, a grow happens via the runtime and heap allocates as normal, but freegc does not yet have anything to free, so we see a small ~1.4ns penalty reported there. But once the second growth happens, the older heap memory is now automatically freed by freegc, so we start to see some benefit in memory reductions and speed improvements, starting at a tiny speed improvement (close to a wash, or maybe noise) by the second growth before count=10, and building up to ~2x faster with ~68% fewer allocated bytes reported. goos: linux goarch: amd64 pkg: runtime cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ sec/op │ sec/op vs base │ AppendInt/count=1-4 2.978n ± 0% 2.969n ± 0% -0.30% (p=0.000 n=150) AppendInt/count=4-4 4.292n ± 3% 4.163n ± 3% ~ (p=0.528 n=150) AppendInt/count=5-4 33.50n ± 0% 34.93n ± 0% +4.25% (p=0.000 n=150) AppendInt/count=10-4 76.21n ± 1% 75.67n ± 0% -0.72% (p=0.000 n=150) AppendInt/count=20-4 150.6n ± 1% 133.0n ± 0% -11.65% (n=150) AppendInt/count=50-4 284.1n ± 1% 225.6n ± 0% -20.59% (n=150) AppendInt/count=100-4 544.2n ± 1% 392.4n ± 1% -27.89% (n=150) AppendInt/count=200-4 1051.5n ± 1% 702.3n ± 0% -33.21% (n=150) AppendInt/count=400-4 2.041µ ± 1% 1.312µ ± 1% -35.70% (n=150) AppendInt/count=1000-4 5.224µ ± 2% 2.851µ ± 1% -45.43% (n=150) AppendInt/count=2000-4 11.770µ ± 1% 6.010µ ± 1% -48.94% (n=150) AppendInt/count=3000-4 17.747µ ± 2% 8.264µ ± 1% -53.44% (n=150) geomean 331.8n 246.4n -25.72% │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ B/op │ B/op vs base │ AppendInt/count=1-4 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=150) AppendInt/count=4-4 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=150) AppendInt/count=5-4 64.00 ± 0% 64.00 ± 0% ~ (p=1.000 n=150) AppendInt/count=10-4 192.0 ± 0% 128.0 ± 0% -33.33% (n=150) AppendInt/count=20-4 448.0 ± 0% 256.0 ± 0% -42.86% (n=150) AppendInt/count=50-4 960.0 ± 0% 512.0 ± 0% -46.67% (n=150) AppendInt/count=100-4 1.938Ki ± 0% 1.000Ki ± 0% -48.39% (n=150) AppendInt/count=200-4 3.938Ki ± 0% 2.001Ki ± 0% -49.18% (n=150) AppendInt/count=400-4 7.938Ki ± 0% 4.005Ki ± 0% -49.54% (n=150) AppendInt/count=1000-4 24.56Ki ± 0% 10.05Ki ± 0% -59.07% (n=150) AppendInt/count=2000-4 58.56Ki ± 0% 20.31Ki ± 0% -65.32% (n=150) AppendInt/count=3000-4 85.19Ki ± 0% 27.30Ki ± 0% -67.95% (n=150) geomean ² -42.81% │ old-1bb1f2bf0c │ freegc-8ba7421-ps16 │ │ allocs/op │ allocs/op vs base │ AppendInt/count=1-4 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=150) AppendInt/count=4-4 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=150) AppendInt/count=5-4 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=150) AppendInt/count=10-4 2.000 ± 0% 1.000 ± 0% -50.00% (n=150) AppendInt/count=20-4 3.000 ± 0% 1.000 ± 0% -66.67% (n=150) AppendInt/count=50-4 4.000 ± 0% 1.000 ± 0% -75.00% (n=150) AppendInt/count=100-4 5.000 ± 0% 1.000 ± 0% -80.00% (n=150) AppendInt/count=200-4 6.000 ± 0% 1.000 ± 0% -83.33% (n=150) AppendInt/count=400-4 7.000 ± 0% 1.000 ± 0% -85.71% (n=150) AppendInt/count=1000-4 9.000 ± 0% 1.000 ± 0% -88.89% (n=150) AppendInt/count=2000-4 11.000 ± 0% 1.000 ± 0% -90.91% (n=150) AppendInt/count=3000-4 12.000 ± 0% 1.000 ± 0% -91.67% (n=150) geomean ² -72.76% ² Of course, these are just microbenchmarks, but likely indicate there are some opportunities here. 
The immediately following CL 712422 tackles inlining and is able to get runtime.freegc working automatically with iterators such as used by slices.Collect, which becomes able to automatically free the intermediate memory from its repeated appends (which earlier in this work required a temporary hand edit to the slices package). For now, we only use the NoAlias version for element types without pointers while waiting on additional runtime support in CL 698515. Updates #74299 Change-Id: I1b9d286aa97c170dcc2e203ec0f8ca72d84e8221 Reviewed-on: https://go-review.googlesource.com/c/go/+/710015 Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> |
3531ac23d4 |
go/types, types2: replace setDefType with pending type check
Given a type definition of the form:
type T RHS
The setDefType function would set T.fromRHS as soon as we knew its
top-level type. For instance, in:
type S struct { ... }
S.fromRHS is set to a struct type before type-checking anything inside
the struct.
This permitted access to the (incomplete) RHS type in a cyclic type
declaration. Accessing this information is fraught (as it's incomplete),
but was used for reporting certain types of cycles.
This CL replaces setDefType with a check that ensures no value of type
T is used before its RHS is set up.
This CL is strictly more complete than what setDefType achieved. For
instance, it enables correct reporting for the below cycles:
type A [unsafe.Sizeof(A{})]int
var v any = 42
type B [v.(B)]int
func f() C {
return C{}
}
type C [unsafe.Sizeof(f())]int
Fixes #76383
Fixes #76384
Change-Id: I9dfab5b708013b418fa66e43362bb4d8483fedec
Reviewed-on: https://go-review.googlesource.com/c/go/+/724140
Auto-Submit: Mark Freeman <markfreeman@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
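As a minimal harness for seeing the improved cycle reporting (my sketch, not part of the CL; the package and file names are made up), one can type-check the second cycle above with go/types and print whatever errors come back:

package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "go/types"
)

// Type-check a tiny package containing one of the cyclic declarations quoted
// above and report the errors. No Importer is needed since nothing is imported.
const src = `package p

var v any = 42

type B [v.(B)]int
`

func main() {
    fset := token.NewFileSet()
    f, err := parser.ParseFile(fset, "p.go", src, 0)
    if err != nil {
        panic(err)
    }
    conf := types.Config{
        Error: func(err error) { fmt.Println(err) }, // collect every type-checking error
    }
    _, _ = conf.Check("p", fset, []*ast.File{f}, nil)
}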
3fd9cb1895 |
cmd/compile: fix bloop get name logic
This CL changes the getNameFrom implementation to pattern-match addressable patterns. Change-Id: If1faa22a3a012d501e911d8468a5702b348abf16 Reviewed-on: https://go-review.googlesource.com/c/go/+/724180 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> |
3c6bf6fbf3 |
cmd/compile: handle loops better during stack allocation of slices
Don't use the move2heap optimization if the move2heap is inside a loop deeper than the declaration of the slice. We really only want to do the move2heap operation once. Change-Id: I4a68d01609c2c9d4e0abe4580839e70059393a81 Reviewed-on: https://go-review.googlesource.com/c/go/+/722440 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
dda7c8253d |
cmd/compile,internal/bytealg: add MemEq intrinsic for runtime.memequal
Introduce a new MemEq SSA operation for runtime.memequal. The operation is initially implemented for arm64. The change adds opt rules (following the existing rules for calls to runtime.memequal) that work with MemEq, plus a later op, LoweredMemEq, which may be lowered differently for more constant-size cases in the future (for other targets as well as for arm64). The new MemEq SSA operation does not have a memory result, allowing CSE of load operations around it.

Code size difference (for arm64 linux):

Executable    Old .text   New .text   Change
--------------------------------------------
asm             1970420     1969668   -0.04%
cgo             1741220     1740212   -0.06%
compile         8956756     8959428   +0.03%
cover           1879332     1878772   -0.03%
link            2574116     2572660   -0.06%
preprofile       867124      866820   -0.04%
vet             2890404     2888596   -0.06%

Change-Id: I6ab507929b861884d17d5818cfbd152cf7879751 Reviewed-on: https://go-review.googlesource.com/c/go/+/686655 Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@golang.org> |
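As a rough illustration (my example, not from the CL), this is the sort of comparison that is typically compiled as a call to runtime.memequal and can therefore use the new MemEq op on arm64:

// Comparing arrays beyond a few machine words is typically lowered to a
// runtime.memequal call; with this CL that call is modeled by the MemEq op,
// which has no memory result, so loads around it can be CSE'd.
func equalKeys(a, b [128]byte) bool {
    return a == b
}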
62cd044a79 |
cmd/compile: add cases for StringLen to prove
Tricky index-offset logic had been added for slices, but not for strings. This fixes that, and also adds tests for the same behavior in string/slice cases, and adds a new test for code in prove that had been added but not explicitly tested. Fixes #76270. Change-Id: Ibd92b89e944d86b7f30b4486a9008e6f1ac6af7d Reviewed-on: https://go-review.googlesource.com/c/go/+/723980 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> |
220d73cc44 |
[dev.simd] all: merge master (8dd5b13) into dev.simd
Merge List: + 2025-11-24 |
feae743bdb |
cmd/compile: use 32x32->64 multiplies on loong64
Gets rid of some sign extensions, like arm64. Change-Id: I9fc37e15a82718bfcf53db8cab0c4e7baaa0a747 Reviewed-on: https://go-review.googlesource.com/c/go/+/721522 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
22f24f90b5 |
cmd/compile: change testing.B.Loop keep alive semantic
This CL implements the initial design of testing.B.Loop's keep-variables-alive semantics: https://github.com/golang/go/issues/61515#issuecomment-2407963248. Fixes #73137. Change-Id: I8060470dbcb0dda0819334f3615cc391ff0f6501 Reviewed-on: https://go-review.googlesource.com/c/go/+/716660 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> |
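A hedged sketch of the usage pattern this targets (makeInput and parse are placeholders, not from the CL): under the linked design, arguments and results of calls in the b.Loop body are kept alive, so the benchmarked call is not optimized away even though its result is discarded.

package demo // in a _test.go file

import "testing"

// makeInput and parse are stand-ins for whatever is being measured.
func makeInput() []byte  { return []byte("hello, world") }
func parse(p []byte) int { return len(p) }

func BenchmarkParse(b *testing.B) {
    input := makeInput() // setup before the first b.Loop call is excluded from timing
    for b.Loop() {
        parse(input) // arguments and result are kept alive, so the call is not optimized away
    }
}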
e7d47ac33d |
cmd/compile: simplify negative on multiplication
goos: linux
goarch: amd64
pkg: cmd/compile/internal/test
cpu: AMD EPYC 7532 32-Core Processor
│ simplify_base │ simplify_new │
│ sec/op │ sec/op vs base │
SimplifyNegMul 623.0n ± 0% 319.3n ± 1% -48.75% (p=0.000 n=10)
goos: linux
goarch: riscv64
pkg: cmd/compile/internal/test
cpu: Spacemit(R) X60
│ simplify.base │ simplify.new │
│ sec/op │ sec/op vs base │
SimplifyNegMul 10.928µ ± 0% 6.432µ ± 0% -41.14% (p=0.000 n=10)
Change-Id: I1d9393cd19a0b948a5d3a512d627cdc0cf0b38be
Reviewed-on: https://go-review.googlesource.com/c/go/+/721520
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <markfreeman@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
ec92bc6d63 |
cmd/compile: rewrite Rsh to RshU if arguments are proved positive
Fixes #76332 Change-Id: I9044025d5dc599531c7f88ed2870bcf3d8b0acbd Reviewed-on: https://go-review.googlesource.com/c/go/+/721206 Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> |
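A hedged sketch (my own, not the CL's test) of the kind of code this helps: once prove knows the operand is non-negative, the arithmetic right shift can be rewritten to an unsigned one, which later unsigned-only rules understand.

func bucket(x int64) int64 {
    if x < 0 {
        return 0
    }
    // Here prove knows x >= 0, so the signed shift can be rewritten to an
    // unsigned (logical) shift by the rule introduced in this CL.
    return x >> 6
}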
3820f94c1d |
cmd/compile: propagate unsigned relations for Rsh if arguments are positive
Updates #76332 Change-Id: Ifaa4d12897138d88d56b9d4e530c53dcee70bd58 Reviewed-on: https://go-review.googlesource.com/c/go/+/721205 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: Keith Randall <khr@google.com> |
f87aaec53d |
cmd/compile: fix integer overflow in prove pass
The detectSliceLenRelation function incorrectly deduced lower bounds for "len(s) - i" without checking if the subtraction could overflow (e.g. when i is negative). This led to incorrect elimination of bounds checks. Fixes: #76355 Change-Id: I30ada0e5f1425929ddd8ae1b66e55096ec209b5b Reviewed-on: https://go-review.googlesource.com/c/go/+/721920 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Keith Randall <khr@google.com> |
e3d4645693 |
[dev.simd] all: merge master (ca37d24) into dev.simd
Conflicts: - src/cmd/compile/internal/typecheck/builtin.go Merge List: + 2025-11-20 |
32f5aadd2f |
cmd/compile: stack allocate backing stores during append
We can already stack allocate the backing store during append if the
resulting backing store doesn't escape. See CL 664299.
This CL enables us to often stack allocate the backing store during
append *even if* the result escapes. Typically, for code like:
func f(n int) []int {
var r []int
for i := range n {
r = append(r, i)
}
return r
}
the backing store for r escapes, but only by returning it.
Could we operate with r on the stack for most of its lifetime,
and only move it to the heap at the return point?
The current implementation of append will need to do an allocation
each time it calls growslice. This will happen on the 1st, 2nd, 4th,
8th, etc. append calls. The allocations done by all but the
last growslice call will then immediately be garbage.
We'd like to avoid doing some of those intermediate allocations
if possible. We rewrite the above code by introducing a move2heap
operation:
func f(n int) []int {
var r []int
for i := range n {
r = append(r, i)
}
r = move2heap(r)
return r
}
Using the move2heap runtime function, which does:
move2heap(r):
If r is already backed by heap storage, return r.
Otherwise, copy r to the heap and return the copy.
Now we can treat the backing store of r allocated at the
append site as not escaping. Previous stack allocation
optimizations now apply, which can use a fixed-size
stack-allocated backing store for r when appending.
See the description in cmd/compile/internal/slice/slice.go
for how we ensure that this optimization is safe.
Change-Id: I81f36e58bade2241d07f67967d8d547fff5302b8
Reviewed-on: https://go-review.googlesource.com/c/go/+/707755
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2239520d1c |
test: go fmt prove.go tests
Change-Id: Ia4c2ceffcf2bfde862e9dba02a4b38245f868692 Reviewed-on: https://go-review.googlesource.com/c/go/+/721202 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Sean Liao <sean@liao.dev> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Mark Freeman <markfreeman@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> |
ba634ca5c7 |
cmd/compile: fold boolean NOT into branches
Gets rid of an EOR $1 instruction. Change-Id: Ib032b0cee9ac484329c978af9b1305446f8d5dac Reviewed-on: https://go-review.googlesource.com/c/go/+/721501 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Keith Randall <khr@google.com> |
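A small illustrative function (mine, not the CL's test): the negation below no longer needs a separate instruction to flip the boolean; the branch sense is inverted instead.

func pick(ok bool) int {
    if !ok { // the NOT is folded into the branch condition; no EOR $1 is emitted
        return 1
    }
    return 2
}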
e1a12c781f |
cmd/compile: use 32x32->64 multiplies on arm64
Gets rid of some sign extensions. Change-Id: Ie67ef36b4ca1cd1a2cd9fa5d84578db553578a22 Reviewed-on: https://go-review.googlesource.com/c/go/+/721241 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Keith Randall <khr@google.com> |
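For example (illustrative, not the CL's test case), a product of two sign-extended 32-bit values can now use a single widening multiply:

func dot(a, b int32) int64 {
    // Lowered to a 32x32->64 multiply (e.g. SMULL on arm64) instead of two
    // sign extensions followed by a full 64x64 multiply.
    return int64(a) * int64(b)
}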
934dbcea1a |
[dev.simd] simd: update CPU feature APIs
This CL also updates the internal uses of these APIs. This CL also fixes an unstable output issue left by previous CLs. Change-Id: Ibc38361d35e2af0c4943a48578f3c610b74ed14d Reviewed-on: https://go-review.googlesource.com/c/go/+/720020 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
c12c337099 |
cmd/compile: teach prove about subtract idioms
For v = x-y:
if y >= 0 then v <= x
if y <= x then v >= 0
(With appropriate guards against overflow/underflow.)
Fixes #76304
Change-Id: I8f8f1254156c347fa97802bd057a8379676720ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/720740
Reviewed-by: Mark Freeman <markfreeman@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
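A hedged sketch of the idiom (my example): with the new facts, v = len(s)-n is known to lie in [0, len(s)] once n is bounded, so the slicing below needs no bounds check.

func trimTail(s []byte, n int) []byte {
    if n < 0 || n > len(s) {
        return s
    }
    // n >= 0 gives len(s)-n <= len(s); n <= len(s) gives len(s)-n >= 0,
    // so the slice expression below is provably in bounds.
    return s[:len(s)-n]
}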
03ed43988f |
cmd/compile: allow multi-field structs to be stored directly in interfaces
If the struct is a bunch of 0-sized fields and one pointer field. Merged revert-of-revert for 4 CLs. original revert 681937 695016 693415 694996 693615 695015 694195 694995 Fixes #74092 Update #74888 Update #74908 Update #74935 (updated issues are bugs in the last attempt at this) Change-Id: I32246d49b8bac3bb080972dc06ab432a5480d560 Reviewed-on: https://go-review.googlesource.com/c/go/+/714421 Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> |
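Roughly, the newly allowed shape looks like this (illustrative type names, not from the CL): zero-sized fields plus a single pointer field, so the value is pointer-shaped and fits directly in the interface data word.

type handle struct {
    _ struct{} // zero-sized field
    p *byte    // the single pointer field
}

func box(h handle) any {
    // h is pointer-shaped, so it can now be stored directly in the
    // interface data word rather than being heap-allocated first.
    return h
}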
2cdcc4150b |
cmd/compile: fold negation into multiplication
goos: linux
goarch: riscv64
pkg: cmd/compile/internal/test
cpu: Spacemit(R) X60
│ /root/mul.base.log │ /root/mul.new.log │
│ sec/op │ sec/op vs base │
MulNeg 6.426µ ± 0% 4.501µ ± 0% -29.96% (p=0.000 n=10)
Mul2Neg 9.000µ ± 0% 6.431µ ± 0% -28.54% (p=0.000 n=10)
Mul2 1.263µ ± 0% 1.263µ ± 0% ~ (p=1.000 n=10)
MulNeg2 1.577µ ± 0% 1.577µ ± 0% ~ (p=0.211 n=10)
geomean 3.276µ 2.756µ -15.89%
goos: linux
goarch: amd64
pkg: cmd/compile/internal/test
cpu: AMD EPYC 7532 32-Core Processor
│ /root/base │ /root/new │
│ sec/op │ sec/op vs base │
MulNeg 691.9n ± 1% 319.4n ± 0% -53.83% (p=0.000 n=10)
Mul2Neg 630.0n ± 0% 629.6n ± 0% -0.07% (p=0.000 n=10)
Mul2 438.1n ± 0% 438.1n ± 0% ~ (p=0.728 n=10)
MulNeg2 439.3n ± 0% 439.4n ± 0% ~ (p=0.656 n=10)
geomean 538.2n 443.6n -17.58%
Change-Id: Ice8e6c8d1e8e3009ba8a0b1b689205174e199019
Reviewed-on: https://go-review.googlesource.com/c/go/+/720180
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
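An illustrative pair of functions (mine; the exact rewrite rules are in the CL) matching the MulNeg and Mul2Neg benchmarks above:

func mulNeg(x, y int64) int64 {
    // Rewritten along the lines of (-x)*y => -(x*y), so the negation can
    // combine with or cancel against other operations on the product.
    return -x * y
}

func mulNeg2(x, y int64) int64 {
    return -x * -y // both negations fold into the product and can cancel
}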
0a569528ea |
cmd/compile: optimize comparisons with single bit difference
Optimize comparisons with constants that only differ by 1 bit (i.e.
a power of 2). For example:
x == 4 || x == 6 -> x|2 == 6
x != 1 && x != 5 -> x|4 != 5
Change-Id: Ic61719e5118446d21cf15652d9da22f7d95b2a15
Reviewed-on: https://go-review.googlesource.com/c/go/+/719420
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
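Using the first example from the message, the kind of source that benefits:

func isFourOrSix(x uint64) bool {
    return x == 4 || x == 6 // now compiled as x|2 == 6: one OR and a single comparison
}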
d50a571ddf |
test: fix tests to work with sizespecializedmalloc turned off
Cq-Include-Trybots: luci.golang.try:gotip-linux-386-nosizespecializedmalloc,gotip-linux-amd64-nosizespecializedmalloc,gotip-linux-arm64-nosizespecializedmalloc Change-Id: I6a6a696465004b939c989afc058c4c3e1fb7134f Reviewed-on: https://go-review.googlesource.com/c/go/+/720401 Auto-Submit: Michael Matloob <matloob@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Matloob <matloob@google.com> |
d7a0c45642 |
[dev.simd] all: merge master (57362e9) into dev.simd
Conflicts:
- src/cmd/compile/internal/ir/symtab.go
- src/cmd/compile/internal/ssa/prove.go
- src/cmd/compile/internal/ssa/rewriteAMD64.go
- src/cmd/compile/internal/ssagen/intrinsics.go
- src/cmd/compile/internal/typecheck/builtin.go
- src/internal/buildcfg/exp.go
- src/internal/strconv/ftoa.go
- test/codegen/stack.go

Manually resolved some conflicts:
- Use internal/strconv for simd.String, remove internal/ftoa
- prove.go is just copied from the one on the main branch. We have cherry-picked the changes to prove.go to main branch, so our copy is identical to an old version of the one on the main branch. There are CLs landed after our cherry-picks. Just copy it over to adopt the new code.

Merge List: + 2025-11-13 |
34aef89366 |
cmd/compile: use FCLASSD for subnormal checks on riscv64
Only implemented for 64 bit floating point operations for now.
goos: linux
goarch: riscv64
pkg: math
cpu: Spacemit(R) X60
│ sec/op │ sec/op vs base │
Acos 154.1n ± 0% 154.1n ± 0% ~ (p=0.303 n=10)
Acosh 215.8n ± 6% 226.7n ± 0% ~ (p=0.439 n=10)
Asin 149.2n ± 1% 149.2n ± 0% ~ (p=0.700 n=10)
Asinh 262.1n ± 0% 258.5n ± 0% -1.37% (p=0.000 n=10)
Atan 99.48n ± 0% 99.49n ± 0% ~ (p=0.836 n=10)
Atanh 244.9n ± 0% 243.8n ± 0% -0.43% (p=0.002 n=10)
Atan2 158.2n ± 1% 153.3n ± 0% -3.10% (p=0.000 n=10)
Cbrt 186.8n ± 0% 181.1n ± 0% -3.03% (p=0.000 n=10)
Ceil 36.71n ± 1% 36.71n ± 0% ~ (p=0.434 n=10)
Copysign 6.531n ± 1% 6.526n ± 0% ~ (p=0.268 n=10)
Cos 98.19n ± 0% 95.40n ± 0% -2.84% (p=0.000 n=10)
Cosh 233.1n ± 0% 222.6n ± 0% -4.50% (p=0.000 n=10)
Erf 122.5n ± 0% 114.2n ± 0% -6.78% (p=0.000 n=10)
Erfc 126.0n ± 1% 116.6n ± 0% -7.46% (p=0.000 n=10)
Erfinv 138.8n ± 0% 138.6n ± 0% ~ (p=0.082 n=10)
Erfcinv 140.0n ± 0% 139.7n ± 0% ~ (p=0.359 n=10)
Exp 193.3n ± 0% 184.2n ± 0% -4.68% (p=0.000 n=10)
ExpGo 204.8n ± 0% 194.5n ± 0% -5.03% (p=0.000 n=10)
Expm1 152.5n ± 1% 145.0n ± 0% -4.92% (p=0.000 n=10)
Exp2 174.5n ± 0% 164.2n ± 0% -5.85% (p=0.000 n=10)
Exp2Go 184.4n ± 1% 175.4n ± 0% -4.88% (p=0.000 n=10)
Abs 4.912n ± 0% 4.914n ± 0% ~ (p=0.283 n=10)
Dim 15.50n ± 1% 15.52n ± 1% ~ (p=0.331 n=10)
Floor 36.89n ± 1% 36.76n ± 1% ~ (p=0.325 n=10)
Max 31.05n ± 1% 31.17n ± 1% ~ (p=0.628 n=10)
Min 31.01n ± 0% 31.06n ± 0% ~ (p=0.767 n=10)
Mod 294.1n ± 0% 245.6n ± 0% -16.52% (p=0.000 n=10)
Frexp 44.86n ± 1% 35.20n ± 0% -21.53% (p=0.000 n=10)
Gamma 195.8n ± 0% 185.4n ± 1% -5.29% (p=0.000 n=10)
Hypot 84.91n ± 0% 84.54n ± 1% -0.43% (p=0.006 n=10)
HypotGo 96.70n ± 0% 95.42n ± 1% -1.32% (p=0.000 n=10)
Ilogb 45.03n ± 0% 35.07n ± 1% -22.10% (p=0.000 n=10)
J0 634.5n ± 0% 627.2n ± 0% -1.16% (p=0.000 n=10)
J1 644.5n ± 0% 636.9n ± 0% -1.18% (p=0.000 n=10)
Jn 1.357µ ± 0% 1.344µ ± 0% -0.92% (p=0.000 n=10)
Ldexp 49.89n ± 0% 39.96n ± 0% -19.90% (p=0.000 n=10)
Lgamma 186.6n ± 0% 184.3n ± 0% -1.21% (p=0.000 n=10)
Log 150.4n ± 0% 141.1n ± 0% -6.15% (p=0.000 n=10)
Logb 46.70n ± 0% 35.89n ± 0% -23.15% (p=0.000 n=10)
Log1p 164.1n ± 0% 163.9n ± 0% ~ (p=0.122 n=10)
Log10 153.1n ± 0% 143.5n ± 0% -6.24% (p=0.000 n=10)
Log2 58.83n ± 0% 49.75n ± 0% -15.43% (p=0.000 n=10)
Modf 40.82n ± 1% 40.78n ± 0% ~ (p=0.239 n=10)
Nextafter32 49.15n ± 0% 48.93n ± 0% -0.44% (p=0.011 n=10)
Nextafter64 43.33n ± 0% 43.23n ± 0% ~ (p=0.228 n=10)
PowInt 269.4n ± 0% 243.8n ± 0% -9.49% (p=0.000 n=10)
PowFrac 618.0n ± 0% 571.7n ± 0% -7.48% (p=0.000 n=10)
Pow10Pos 13.09n ± 0% 13.05n ± 0% -0.31% (p=0.003 n=10)
Pow10Neg 30.99n ± 1% 30.99n ± 0% ~ (p=0.173 n=10)
Round 23.73n ± 0% 23.65n ± 0% -0.36% (p=0.011 n=10)
RoundToEven 27.87n ± 0% 27.73n ± 0% -0.48% (p=0.003 n=10)
Remainder 282.1n ± 0% 249.6n ± 0% -11.52% (p=0.000 n=10)
Signbit 11.46n ± 0% 11.42n ± 0% -0.39% (p=0.003 n=10)
Sin 115.2n ± 0% 113.2n ± 0% -1.74% (p=0.000 n=10)
Sincos 140.6n ± 0% 138.6n ± 0% -1.39% (p=0.000 n=10)
Sinh 252.0n ± 0% 241.4n ± 0% -4.21% (p=0.000 n=10)
SqrtIndirect 4.909n ± 0% 4.893n ± 0% -0.34% (p=0.021 n=10)
SqrtLatency 19.57n ± 1% 19.57n ± 0% ~ (p=0.087 n=10)
SqrtIndirectLatency 19.64n ± 0% 19.57n ± 0% -0.36% (p=0.025 n=10)
SqrtGoLatency 198.1n ± 0% 197.4n ± 0% -0.35% (p=0.014 n=10)
SqrtPrime 5.733µ ± 0% 5.725µ ± 0% ~ (p=0.116 n=10)
Tan 149.1n ± 0% 146.8n ± 0% -1.54% (p=0.000 n=10)
Tanh 248.2n ± 1% 238.1n ± 0% -4.05% (p=0.000 n=10)
Trunc 36.86n ± 0% 36.70n ± 0% -0.43% (p=0.029 n=10)
Y0 638.2n ± 0% 633.6n ± 0% -0.71% (p=0.000 n=10)
Y1 641.8n ± 0% 636.1n ± 0% -0.87% (p=0.000 n=10)
Yn 1.358µ ± 0% 1.345µ ± 0% -0.92% (p=0.000 n=10)
Float64bits 5.721n ± 0% 5.709n ± 0% -0.22% (p=0.044 n=10)
Float64frombits 4.905n ± 0% 4.893n ± 0% ~ (p=0.266 n=10)
Float32bits 12.27n ± 0% 12.23n ± 0% ~ (p=0.122 n=10)
Float32frombits 4.909n ± 0% 4.893n ± 0% -0.32% (p=0.024 n=10)
FMA 6.556n ± 0% 6.526n ± 0% ~ (p=0.283 n=10)
geomean 86.82n 83.75n -3.54%
Change-Id: I522297a79646d76543d516accce291f5a3cea337
Reviewed-on: https://go-review.googlesource.com/c/go/+/717560
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
86b4fe31d9 |
[dev.simd] cmd/compile: add masked merging ops and optimizations
This CL generates optimizations for the masked variants of AVX512 instructions for patterns of the form x.Op(y).Merge(z, mask) => OpMasked(z, x, y, mask), where OpMasked is resultInArg0. Change-Id: Ife7ccc9ddbf76ae921a085bd6a42b965da9bc179 Reviewed-on: https://go-review.googlesource.com/c/go/+/718160 Reviewed-by: David Chase <drchase@google.com> TryBot-Bypass: Junyang Shao <shaojunyang@google.com> |
771a1dc216 |
[dev.simd] cmd/compile: add peepholes for all masked ops and bug fixes
For 512 bits they are unchanged. This CL adds the optimization rules for 128/256 bits under a feature check. It also fixes a bug in the masked-load variants of instructions and makes them zeroing by default as well. Change-Id: I6fe395541c0cd509984a81841420e71c3af732f2 Reviewed-on: https://go-review.googlesource.com/c/go/+/717822 Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
c7ccbddf22 |
cmd/compile/internal/ssa: more aggressive on dead auto elim
Propagate "unread" across OpMoves. If the addr of this auto is only used
by an OpMove as its source arg, and the OpMove's target arg is the addr
of another auto. If the 2nd auto can be eliminated, this one can also be
eliminated.
This CL eliminates unnecessary memory copies and makes the frame smaller
in the following code snippet:
func contains(m map[string][16]int, k string) bool {
_, ok := m[k]
return ok
}
These are the benchmark results followed by the benchmark code:
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Map1Access2Ok-8 9.582n ± 2% 9.226n ± 0% -3.72% (p=0.000 n=20)
Map2Access2Ok-8 13.79n ± 1% 10.24n ± 1% -25.77% (p=0.000 n=20)
Map3Access2Ok-8 68.68n ± 1% 12.65n ± 1% -81.58% (p=0.000 n=20)
package main_test
import "testing"
var (
m1 = map[int]int{}
m2 = map[int][16]int{}
m3 = map[int][256]int{}
)
func init() {
for i := range 1000 {
m1[i] = i
m2[i] = [16]int{15:i}
m3[i] = [256]int{255:i}
}
}
func BenchmarkMap1Access2Ok(b *testing.B) {
for i := range b.N {
_, ok := m1[i%1000]
if !ok {
b.Errorf("%d not found", i)
}
}
}
func BenchmarkMap2Access2Ok(b *testing.B) {
for i := range b.N {
_, ok := m2[i%1000]
if !ok {
b.Errorf("%d not found", i)
}
}
}
func BenchmarkMap3Access2Ok(b *testing.B) {
for i := range b.N {
_, ok := m3[i%1000]
if !ok {
b.Errorf("%d not found", i)
}
}
}
Fixes #75398
Change-Id: If75e9caaa50d460efc31a94565b9ba28c8158771
Reviewed-on: https://go-review.googlesource.com/c/go/+/702875
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
6e165b4d17 |
cmd/compile: implement Avg64u, Hmul64, Hmul64u for wasm
This lets us remove useAvg and useHmul from the division rules.
The compiler is simpler and the generated code is faster.
goos: wasip1
goarch: wasm
pkg: internal/strconv
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
AppendFloat/Decimal 192.8n ± 1% 194.6n ± 0% +0.91% (p=0.000 n=10)
AppendFloat/Float 328.6n ± 0% 279.6n ± 0% -14.93% (p=0.000 n=10)
AppendFloat/Exp 335.6n ± 1% 289.2n ± 1% -13.80% (p=0.000 n=10)
AppendFloat/NegExp 336.0n ± 0% 289.1n ± 1% -13.97% (p=0.000 n=10)
AppendFloat/LongExp 332.4n ± 0% 285.2n ± 1% -14.20% (p=0.000 n=10)
AppendFloat/Big 348.2n ± 0% 300.1n ± 0% -13.83% (p=0.000 n=10)
AppendFloat/BinaryExp 137.4n ± 0% 138.2n ± 0% +0.55% (p=0.001 n=10)
AppendFloat/32Integer 193.3n ± 1% 196.5n ± 0% +1.66% (p=0.000 n=10)
AppendFloat/32ExactFraction 283.3n ± 0% 268.9n ± 1% -5.08% (p=0.000 n=10)
AppendFloat/32Point 279.9n ± 0% 266.5n ± 0% -4.80% (p=0.000 n=10)
AppendFloat/32Exp 300.1n ± 0% 288.3n ± 1% -3.90% (p=0.000 n=10)
AppendFloat/32NegExp 288.2n ± 1% 277.9n ± 1% -3.59% (p=0.000 n=10)
AppendFloat/32Shortest 261.7n ± 0% 250.2n ± 0% -4.39% (p=0.000 n=10)
AppendFloat/32Fixed8Hard 173.3n ± 1% 158.9n ± 1% -8.31% (p=0.000 n=10)
AppendFloat/32Fixed9Hard 180.0n ± 0% 167.9n ± 2% -6.70% (p=0.000 n=10)
AppendFloat/64Fixed1 167.1n ± 0% 149.6n ± 1% -10.50% (p=0.000 n=10)
AppendFloat/64Fixed2 162.4n ± 1% 146.5n ± 0% -9.73% (p=0.000 n=10)
AppendFloat/64Fixed2.5 165.5n ± 0% 149.4n ± 1% -9.70% (p=0.000 n=10)
AppendFloat/64Fixed3 166.4n ± 1% 150.2n ± 0% -9.74% (p=0.000 n=10)
AppendFloat/64Fixed4 163.7n ± 0% 149.6n ± 1% -8.62% (p=0.000 n=10)
AppendFloat/64Fixed5Hard 182.8n ± 1% 167.1n ± 1% -8.61% (p=0.000 n=10)
AppendFloat/64Fixed12 222.2n ± 0% 208.8n ± 0% -6.05% (p=0.000 n=10)
AppendFloat/64Fixed16 197.6n ± 1% 181.7n ± 0% -8.02% (p=0.000 n=10)
AppendFloat/64Fixed12Hard 194.5n ± 0% 181.0n ± 0% -6.99% (p=0.000 n=10)
AppendFloat/64Fixed17Hard 205.1n ± 1% 191.9n ± 0% -6.44% (p=0.000 n=10)
AppendFloat/64Fixed18Hard 6.269µ ± 0% 6.643µ ± 0% +5.97% (p=0.000 n=10)
AppendFloat/64FixedF1 211.7n ± 1% 197.0n ± 0% -6.95% (p=0.000 n=10)
AppendFloat/64FixedF2 189.4n ± 0% 174.2n ± 0% -8.08% (p=0.000 n=10)
AppendFloat/64FixedF3 169.0n ± 0% 154.9n ± 0% -8.32% (p=0.000 n=10)
AppendFloat/Slowpath64 321.2n ± 0% 274.2n ± 1% -14.63% (p=0.000 n=10)
AppendFloat/SlowpathDenormal64 307.4n ± 1% 261.2n ± 0% -15.03% (p=0.000 n=10)
AppendInt 3.367µ ± 1% 3.376µ ± 0% ~ (p=0.517 n=10)
AppendUint 675.5n ± 0% 676.9n ± 0% ~ (p=0.196 n=10)
AppendIntSmall 28.13n ± 1% 28.17n ± 0% +0.14% (p=0.015 n=10)
AppendUintVarlen/digits=1 20.70n ± 0% 20.51n ± 1% -0.89% (p=0.018 n=10)
AppendUintVarlen/digits=2 20.43n ± 0% 20.27n ± 0% -0.81% (p=0.001 n=10)
AppendUintVarlen/digits=3 38.48n ± 0% 37.93n ± 0% -1.43% (p=0.000 n=10)
AppendUintVarlen/digits=4 41.10n ± 0% 38.78n ± 1% -5.62% (p=0.000 n=10)
AppendUintVarlen/digits=5 42.25n ± 1% 42.11n ± 0% -0.32% (p=0.041 n=10)
AppendUintVarlen/digits=6 45.40n ± 1% 43.14n ± 0% -4.98% (p=0.000 n=10)
AppendUintVarlen/digits=7 46.81n ± 1% 46.03n ± 0% -1.66% (p=0.000 n=10)
AppendUintVarlen/digits=8 48.88n ± 1% 46.59n ± 1% -4.68% (p=0.000 n=10)
AppendUintVarlen/digits=9 49.94n ± 2% 49.41n ± 1% -1.06% (p=0.000 n=10)
AppendUintVarlen/digits=10 57.28n ± 1% 56.92n ± 1% -0.62% (p=0.045 n=10)
AppendUintVarlen/digits=11 60.09n ± 1% 58.11n ± 2% -3.30% (p=0.000 n=10)
AppendUintVarlen/digits=12 62.22n ± 0% 61.85n ± 0% -0.59% (p=0.000 n=10)
AppendUintVarlen/digits=13 64.94n ± 0% 62.92n ± 0% -3.10% (p=0.000 n=10)
AppendUintVarlen/digits=14 65.42n ± 1% 65.19n ± 1% -0.34% (p=0.005 n=10)
AppendUintVarlen/digits=15 68.17n ± 0% 66.13n ± 0% -2.99% (p=0.000 n=10)
AppendUintVarlen/digits=16 70.21n ± 1% 70.09n ± 1% ~ (p=0.517 n=10)
AppendUintVarlen/digits=17 72.93n ± 0% 70.49n ± 0% -3.34% (p=0.000 n=10)
AppendUintVarlen/digits=18 73.01n ± 0% 72.75n ± 0% -0.35% (p=0.000 n=10)
AppendUintVarlen/digits=19 79.27n ± 1% 79.49n ± 1% ~ (p=0.671 n=10)
AppendUintVarlen/digits=20 82.18n ± 0% 80.43n ± 1% -2.14% (p=0.000 n=10)
geomean 143.4n 136.0n -5.20%
Change-Id: I8245814a0259ad13cf9225f57db8e9fe3d2e4267
Reviewed-on: https://go-review.googlesource.com/c/go/+/717407
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
235b4e729d |
cmd/compile/internal/ssa: model right shift more precisely
Prove currently checks for 0 sign bit extraction (x>>63) at the end of the pass, but it is more general and more useful (and not really more work) to model right shift during value range tracking. This handles sign bit extraction (both 0 and -1) but also makes the value ranges available for proving bounds checks.

'go build -a -gcflags=-d=ssa/prove/debug=1 std' finds 105 new things to prove. https://gist.github.com/rsc/8ac41176e53ed9c2f1a664fc668e8336

For example, the compiler now recognizes that this code in strconv does not need to check the second shift for being ≥ 64:

msb := xHi >> 63
retMantissa := xHi >> (msb + 38)

nor does this code in regexp:

return b < utf8.RuneSelf && specialBytes[b%16]&(1<<(b/16)) != 0

This code in math no longer has a bounds check on the first index:

if 0 <= n && n <= 308 {
    return pow10postab32[uint(n)/32] * pow10tab[uint(n)%32]
}

The diff shows one "lost" proof in ycbcr.go but it's not really lost: the expression was folded to a constant instead, and that only shows up with debug=2. A diff of that output is at https://gist.github.com/rsc/9139ed46c6019ae007f5a1ba4bb3250f

Change-Id: I84087311e0a303f00e2820d957a6f8b29ee22519 Reviewed-on: https://go-review.googlesource.com/c/go/+/716140 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: David Chase <drchase@google.com> |
1e5bb416d8 |
cmd/compile: implement bits.Mul64 on 32-bit systems
This CL implements Mul64uhilo, Hmul64, Hmul64u, and Avg64u
on 32-bit systems, with the effect that constant division of both
int64s and uint64s can now be emitted directly in all cases,
and also that bits.Mul64 can be intrinsified on 32-bit systems.
Previously, constant division of uint64s by values 0 ≤ c ≤ 0xFFFF were
implemented as uint32 divisions by c and some fixup. After expanding
those smaller constant divisions, the code for i/999 required:
(386) 7 mul, 10 add, 2 sub, 3 rotate, 3 shift (104 bytes)
(arm) 7 mul, 9 add, 3 sub, 2 shift (104 bytes)
(mips) 7 mul, 10 add, 5 sub, 6 shift, 3 sgtu (176 bytes)
For that much code, we might as well use a full 64x64->128 multiply
that can be used for all divisors, not just small ones.
Having done that, the same i/999 now generates:
(386) 4 mul, 9 add, 2 sub, 2 or, 6 shift (112 bytes)
(arm) 4 mul, 8 add, 2 sub, 2 or, 3 shift (92 bytes)
(mips) 4 mul, 11 add, 3 sub, 6 shift, 8 sgtu, 4 or (196 bytes)
The size increase on 386 is due to a few extra register spills.
The size increase on mips is due to add-with-carry being hard.
The new approach is more general, letting us delete the old special case
and guarantee that all int64 and uint64 divisions by constants are
generated directly on 32-bit systems.
This especially speeds up code making heavy use of bits.Mul64 with
a constant argument, which happens in strconv and various crypto
packages. A few examples are benchmarked below.
pkg: cmd/compile/internal/test
benchmark \ host local linux-amd64 s7 linux-386 s7:GOARCH=386
vs base vs base vs base vs base vs base
DivconstI64 ~ ~ ~ -49.66% -21.02%
ModconstI64 ~ ~ ~ -13.45% +14.52%
DivisiblePow2constI64 ~ ~ ~ +0.97% -1.32%
DivisibleconstI64 ~ ~ ~ -20.01% -48.28%
DivisibleWDivconstI64 ~ ~ -1.76% -38.59% -42.74%
DivconstU64/3 ~ ~ ~ -13.82% -4.09%
DivconstU64/5 ~ ~ ~ -14.10% -3.54%
DivconstU64/37 -2.07% -4.45% ~ -19.60% -9.55%
DivconstU64/1234567 ~ ~ ~ -61.55% -56.93%
ModconstU64 ~ ~ ~ -6.25% ~
DivisibleconstU64 ~ ~ ~ -2.78% -7.82%
DivisibleWDivconstU64 ~ ~ ~ +4.23% +2.56%
pkg: math/bits
benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386
vs base vs base vs base vs base
Add ~ ~ ~ ~
Add32 +1.59% ~ ~ ~
Add64 ~ ~ ~ ~
Add64multiple ~ ~ ~ ~
Sub ~ ~ ~ ~
Sub32 ~ ~ ~ ~
Sub64 ~ ~ -9.20% ~
Sub64multiple ~ ~ ~ ~
Mul ~ ~ ~ ~
Mul32 ~ ~ ~ ~
Mul64 ~ ~ -41.58% -53.21%
Div ~ ~ ~ ~
Div32 ~ ~ ~ ~
Div64 ~ ~ ~ ~
pkg: strconv
benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386
vs base vs base vs base vs base
ParseInt/Pos/7bit ~ ~ -11.08% -6.75%
ParseInt/Pos/26bit ~ ~ -13.65% -11.02%
ParseInt/Pos/31bit ~ ~ -14.65% -9.71%
ParseInt/Pos/56bit -1.80% ~ -17.97% -10.78%
ParseInt/Pos/63bit ~ ~ -13.85% -9.63%
ParseInt/Neg/7bit ~ ~ -12.14% -7.26%
ParseInt/Neg/26bit ~ ~ -14.18% -9.81%
ParseInt/Neg/31bit ~ ~ -14.51% -9.02%
ParseInt/Neg/56bit ~ ~ -15.79% -9.79%
ParseInt/Neg/63bit ~ ~ -15.68% -11.07%
AppendFloat/Decimal ~ ~ -7.25% -12.26%
AppendFloat/Float ~ ~ -15.96% -19.45%
AppendFloat/Exp ~ ~ -13.96% -17.76%
AppendFloat/NegExp ~ ~ -14.89% -20.27%
AppendFloat/LongExp ~ ~ -12.68% -17.97%
AppendFloat/Big ~ ~ -11.10% -16.64%
AppendFloat/BinaryExp ~ ~ ~ ~
AppendFloat/32Integer ~ ~ -10.05% -10.91%
AppendFloat/32ExactFraction ~ ~ -8.93% -13.00%
AppendFloat/32Point ~ ~ -10.36% -14.89%
AppendFloat/32Exp ~ ~ -9.88% -13.54%
AppendFloat/32NegExp ~ ~ -10.16% -14.26%
AppendFloat/32Shortest ~ ~ -11.39% -14.96%
AppendFloat/32Fixed8Hard ~ ~ ~ -2.31%
AppendFloat/32Fixed9Hard ~ ~ ~ -7.01%
AppendFloat/64Fixed1 ~ ~ -2.83% -8.23%
AppendFloat/64Fixed2 ~ ~ ~ -7.94%
AppendFloat/64Fixed3 ~ ~ -4.07% -7.22%
AppendFloat/64Fixed4 ~ ~ -7.24% -7.62%
AppendFloat/64Fixed12 ~ ~ -6.57% -4.82%
AppendFloat/64Fixed16 ~ ~ -4.00% -5.81%
AppendFloat/64Fixed12Hard -2.22% ~ -4.07% -6.35%
AppendFloat/64Fixed17Hard -2.12% ~ ~ -3.79%
AppendFloat/64Fixed18Hard -1.89% ~ +2.48% ~
AppendFloat/Slowpath64 -1.85% ~ -14.49% -18.21%
AppendFloat/SlowpathDenormal64 ~ ~ -13.08% -19.41%
pkg: crypto/internal/fips140/nistec/fiat
benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386
vs base vs base vs base vs base
Mul/P224 ~ ~ -29.95% -39.60%
Mul/P384 ~ ~ -37.11% -63.33%
Mul/P521 ~ ~ -26.62% -12.42%
Square/P224 +1.46% ~ -40.62% -49.18%
Square/P384 ~ ~ -45.51% -69.68%
Square/P521 +90.37% ~ -25.26% -11.23%
(The +90% is a separate problem and not real; that much variation
can be seen on that system by running the same binary from two
different files.)
pkg: crypto/internal/fips140/edwards25519
benchmark \ host s7 linux-amd64 linux-386 s7:GOARCH=386
vs base vs base vs base vs base
EncodingDecoding ~ ~ -34.67% -35.75%
ScalarBaseMult ~ ~ -31.25% -30.29%
ScalarMult ~ ~ -33.45% -32.54%
VarTimeDoubleScalarBaseMult ~ ~ -33.78% -33.68%
Change-Id: Id3c91d42cd01def6731b755e99f8f40c6ad1bb65
Reviewed-on: https://go-review.googlesource.com/c/go/+/716061
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
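For reference, a small example (mine, not the CL's) of the pattern that now intrinsifies on 32-bit targets as well; a 64x64->128 multiply by a constant is also exactly what constant division lowers to:

package demo

import "math/bits"

// With this CL, bits.Mul64 is intrinsified on 32-bit systems too, so this
// compiles to a direct 64x64->128 multiply rather than falling back to the
// generic Go implementation.
func mulWide(x, y uint64) (hi, lo uint64) {
    return bits.Mul64(x, y)
}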
9bbda7c99d |
cmd/compile: make prove understand div, mod better
This CL introduces new divisible and divmod passes that rewrite divisibility checks and div, mod, and mul. These happen after prove, so that prove can make better sense of the code for deriving bounds, and they must run before decompose, so that 64-bit ops can be lowered to 32-bit ops on 32-bit systems. And then they need another generic pass as well, to optimize the generated code before decomposing. The three opt passes are "opt", "middle opt", and "late opt". (Perhaps instead they should be "generic", "opt", and "late opt"?) The "late opt" pass repeats the "middle opt" work on any new code that has been generated in the interim. There will not be new divs or mods, but there may be new muls.

The x%c==0 rewrite rules are much simpler now, since they can match before divs have been rewritten. This has the effect of applying them more consistently and making the rewrite rules independent of the exact div rewrites.

Prove is also now charged with marking signed div/mod as unsigned when the arguments call for it, allowing simpler code to be emitted in various cases. For example, t.Seconds()/2 and len(x)/2 are now recognized as unsigned, meaning they compile to a simple shift (unsigned division), avoiding the more complex fixup we need for signed values.

https://gist.github.com/rsc/99d9d3bd99cde87b6a1a390e3d85aa32 shows a diff of 'go build -a -gcflags=-d=ssa/prove/debug=1 std' output before and after. "Proved Rsh64x64 shifts to zero" is replaced by the higher-level "Proved Div64 is unsigned" (the shift was in the signed expansion of div by constant), but otherwise prove is only finding more things to prove. One short example, in code that does x[i%len(x)]:

< runtime/mfinal.go:131:34: Proved Rsh64x64 shifts to zero
---
> runtime/mfinal.go:131:34: Proved Div64 is unsigned
> runtime/mfinal.go:131:38: Proved IsInBounds

A longer example:

< crypto/internal/fips140/sha3/shake.go:28:30: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:38:27: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:53:46: Proved Rsh64x64 shifts to zero
< crypto/internal/fips140/sha3/shake.go:55:46: Proved Rsh64x64 shifts to zero
---
> crypto/internal/fips140/sha3/shake.go:28:30: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:28:30: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:28:30: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:38:27: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:45:7: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:46:4: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:53:46: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:53:46: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:53:46: Proved IsSliceInBounds
> crypto/internal/fips140/sha3/shake.go:55:46: Proved Div64 is unsigned
> crypto/internal/fips140/sha3/shake.go:55:46: Proved IsInBounds
> crypto/internal/fips140/sha3/shake.go:55:46: Proved IsSliceInBounds

These diffs are due to the smaller opt being better and taking work away from prove:

< image/jpeg/dct.go:307:5: Proved IsInBounds
< image/jpeg/dct.go:308:5: Proved IsInBounds
...
< image/jpeg/dct.go:442:5: Proved IsInBounds

In the old opt, Mul by 8 was rewritten to Lsh by 3 early. This CL delays that rule to help prove recognize mods, but it also helps opt constant-fold the slice x[8*i:8*i+8:8*i+8].
Specifically, computing the length, opt can now do: (Sub64 (Add (Mul 8 i) 8) (Add (Mul 8 i) 8)) -> (Add 8 (Sub (Mul 8 i) (Mul 8 i))) -> (Add 8 (Mul 8 (Sub i i))) -> (Add 8 (Mul 8 0)) -> (Add 8 0) -> 8 The key step is (Sub (Mul x y) (Mul x z)) -> (Mul x (Sub y z)), Leaving the multiply as Mul enables using that step; the old rewrite to Lsh blocked it, leaving prove to figure out the length and then remove the bounds checks. But now opt can evaluate the length down to a constant 8 and then constant-fold away the bounds checks 0 < 8, 1 < 8, and so on. After that, the compiler has nothing left to prove. Benchmarks are noisy in general; I checked the assembly for the many large increases below, and the vast majority are unchanged and presumably hitting the caches differently in some way. The divisibility optimizations were not reliably triggering before. This leads to a very large improvement in some cases, like DivisiblePow2constI64, DivisibleconstI64 on 64-bit systems and DivisbleconstU64 on 32-bit systems. Another way the divisibility optimizations were unreliable before was incorrectly triggering for x/3, x%3 even though they are written not to do that. There is a real but small slowdown in the DivisibleWDivconst benchmarks on Mac because in the cases used in the benchmark, it is still faster (on Mac) to do the divisibility check than to remultiply. This may be worth further study. Perhaps when there is no rotate (meaning the divisor is odd), the divisibility optimization should be enabled always. In any event, this CL makes it possible to study that. benchmark \ host s7 linux-amd64 mac linux-arm64 linux-ppc64le linux-386 s7:GOARCH=386 linux-arm vs base vs base vs base vs base vs base vs base vs base vs base LoadAdd ~ ~ ~ ~ ~ -1.59% ~ ~ ExtShift ~ ~ -42.14% +0.10% ~ +1.44% +5.66% +8.50% Modify ~ ~ ~ ~ ~ ~ ~ -1.53% MullImm ~ ~ ~ ~ ~ +37.90% -21.87% +3.05% ConstModify ~ ~ ~ ~ -49.14% ~ ~ ~ BitSet ~ ~ ~ ~ -15.86% -14.57% +6.44% +0.06% BitClear ~ ~ ~ ~ ~ +1.78% +3.50% +0.06% BitToggle ~ ~ ~ ~ ~ -16.09% +2.91% ~ BitSetConst ~ ~ ~ ~ ~ ~ ~ -0.49% BitClearConst ~ ~ ~ ~ -28.29% ~ ~ -0.40% BitToggleConst ~ ~ ~ +8.89% -31.19% ~ ~ -0.77% MulNeg ~ ~ ~ ~ ~ ~ ~ ~ Mul2Neg ~ ~ -4.83% ~ ~ -13.75% -5.92% ~ DivconstI64 ~ ~ ~ ~ ~ -30.12% ~ +0.50% ModconstI64 ~ ~ -9.94% -4.63% ~ +3.15% ~ +5.32% DivisiblePow2constI64 -34.49% -12.58% ~ ~ -12.25% ~ ~ ~ DivisibleconstI64 -24.69% -25.06% -0.40% -2.27% -42.61% -3.31% ~ +1.63% DivisibleWDivconstI64 ~ ~ ~ ~ ~ -17.55% ~ -0.60% DivconstU64/3 ~ ~ ~ ~ ~ +1.51% ~ ~ DivconstU64/5 ~ ~ ~ ~ ~ ~ ~ ~ DivconstU64/37 ~ ~ -0.18% ~ ~ +2.70% ~ ~ DivconstU64/1234567 ~ ~ ~ ~ ~ ~ ~ +0.12% ModconstU64 ~ ~ ~ -0.24% ~ -5.10% -1.07% -1.56% DivisibleconstU64 ~ ~ ~ ~ ~ -29.01% -59.13% -50.72% DivisibleWDivconstU64 ~ ~ -12.18% -18.88% ~ -5.50% -3.91% +5.17% DivconstI32 ~ ~ -0.48% ~ -34.69% +89.01% -6.01% -16.67% ModconstI32 ~ +2.95% -0.33% ~ ~ -2.98% -5.40% -8.30% DivisiblePow2constI32 ~ ~ ~ ~ ~ ~ ~ -16.22% DivisibleconstI32 ~ ~ ~ ~ ~ -37.27% -47.75% -25.03% DivisibleWDivconstI32 -11.59% +5.22% -12.99% -23.83% ~ +45.95% -7.03% -10.01% DivconstU32 ~ ~ ~ ~ ~ +74.71% +4.81% ~ ModconstU32 ~ ~ +0.53% +0.18% ~ +51.16% ~ ~ DivisibleconstU32 ~ ~ ~ -0.62% ~ -4.25% ~ ~ DivisibleWDivconstU32 -2.77% +5.56% +11.12% -5.15% ~ +48.70% +25.11% -4.07% DivconstI16 -6.06% ~ -0.33% +0.22% ~ ~ -9.68% +5.47% ModconstI16 ~ ~ +4.44% +2.82% ~ ~ ~ +5.06% DivisiblePow2constI16 ~ ~ ~ ~ ~ ~ ~ -0.17% DivisibleconstI16 ~ ~ -0.23% ~ ~ ~ +4.60% +6.64% DivisibleWDivconstI16 -1.44% -0.43% +13.48% -5.76% ~ +1.62% -23.15% -9.06% 
DivconstU16 +1.61% ~ -0.35% -0.47% ~ ~ +15.59% ~ ModconstU16 ~ ~ ~ ~ ~ -0.72% ~ +14.23% DivisibleconstU16 ~ ~ -0.05% +3.00% ~ ~ ~ +5.06% DivisibleWDivconstU16 +52.10% +0.75% +17.28% +4.79% ~ -37.39% +5.28% -9.06% DivconstI8 ~ ~ -0.34% -0.96% ~ ~ -9.20% ~ ModconstI8 +2.29% ~ +4.38% +2.96% ~ ~ ~ ~ DivisiblePow2constI8 ~ ~ ~ ~ ~ ~ ~ ~ DivisibleconstI8 ~ ~ ~ ~ ~ ~ +6.04% ~ DivisibleWDivconstI8 -26.44% +1.69% +17.03% +4.05% ~ +32.48% -24.90% ~ DivconstU8 -4.50% +14.06% -0.28% ~ ~ ~ +4.16% +0.88% ModconstU8 ~ ~ +25.84% -0.64% ~ ~ ~ ~ DivisibleconstU8 ~ ~ -5.70% ~ ~ ~ ~ ~ DivisibleWDivconstU8 +49.55% +9.07% ~ +4.03% +53.87% -40.03% +39.72% -3.01% Mul2 ~ ~ ~ ~ ~ ~ ~ ~ MulNeg2 ~ ~ ~ ~ -11.73% ~ ~ -0.02% EfaceInteger ~ ~ ~ ~ ~ +18.11% ~ +2.53% TypeAssert +33.90% +2.86% ~ ~ ~ -1.07% -5.29% -1.04% Div64UnsignedSmall ~ ~ ~ ~ ~ ~ ~ ~ Div64Small ~ ~ ~ ~ ~ -0.88% ~ +2.39% Div64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +0.35% Div64SmallNegDividend ~ ~ ~ ~ ~ -0.84% ~ +3.57% Div64SmallNegBoth ~ ~ ~ ~ ~ -0.86% ~ +3.55% Div64Unsigned ~ ~ ~ ~ ~ ~ ~ -0.11% Div64 ~ ~ ~ ~ ~ ~ ~ +0.11% Div64NegDivisor ~ ~ ~ ~ ~ -1.29% ~ ~ Div64NegDividend ~ ~ ~ ~ ~ -1.44% ~ ~ Div64NegBoth ~ ~ ~ ~ ~ ~ ~ +0.28% Mod64UnsignedSmall ~ ~ ~ ~ ~ +0.48% ~ +0.93% Mod64Small ~ ~ ~ ~ ~ ~ ~ ~ Mod64SmallNegDivisor ~ ~ ~ ~ ~ ~ ~ +1.44% Mod64SmallNegDividend ~ ~ ~ ~ ~ +0.22% ~ +1.37% Mod64SmallNegBoth ~ ~ ~ ~ ~ ~ ~ -2.22% Mod64Unsigned ~ ~ ~ ~ ~ -0.95% ~ +0.11% Mod64 ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegDivisor ~ ~ ~ ~ ~ ~ ~ -0.02% Mod64NegDividend ~ ~ ~ ~ ~ ~ ~ ~ Mod64NegBoth ~ ~ ~ ~ ~ ~ ~ -0.02% MulconstI32/3 ~ ~ ~ -25.00% ~ ~ ~ +47.37% MulconstI32/5 ~ ~ ~ +33.28% ~ ~ ~ +32.21% MulconstI32/12 ~ ~ ~ -2.13% ~ ~ ~ -0.02% MulconstI32/120 ~ ~ ~ +2.93% ~ ~ ~ -0.03% MulconstI32/-120 ~ ~ ~ -2.17% ~ ~ ~ -0.03% MulconstI32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstI32/65538 ~ ~ ~ ~ ~ -33.38% ~ +0.04% MulconstI64/3 ~ ~ ~ +33.35% ~ -0.37% ~ -0.13% MulconstI64/5 ~ ~ ~ -25.00% ~ -0.34% ~ ~ MulconstI64/12 ~ ~ ~ +2.13% ~ +11.62% ~ +2.30% MulconstI64/120 ~ ~ ~ -1.98% ~ ~ ~ ~ MulconstI64/-120 ~ ~ ~ +0.75% ~ ~ ~ ~ MulconstI64/65537 ~ ~ ~ ~ ~ +5.61% ~ ~ MulconstI64/65538 ~ ~ ~ ~ ~ +5.25% ~ ~ MulconstU32/3 ~ +0.81% ~ +33.39% ~ +77.92% ~ -32.31% MulconstU32/5 ~ ~ ~ -24.97% ~ +77.92% ~ -24.47% MulconstU32/12 ~ ~ ~ +2.06% ~ ~ ~ +0.03% MulconstU32/120 ~ ~ ~ -2.74% ~ ~ ~ +0.03% MulconstU32/65537 ~ ~ ~ ~ ~ ~ ~ +0.03% MulconstU32/65538 ~ ~ ~ ~ ~ -33.42% ~ -0.03% MulconstU64/3 ~ ~ ~ +33.33% ~ -0.28% ~ +1.22% MulconstU64/5 ~ ~ ~ -25.00% ~ ~ ~ -0.64% MulconstU64/12 ~ ~ ~ +2.30% ~ +11.59% ~ +0.14% MulconstU64/120 ~ ~ ~ -2.82% ~ ~ ~ +0.04% MulconstU64/65537 ~ +0.37% ~ ~ ~ +5.58% ~ ~ MulconstU64/65538 ~ ~ ~ ~ ~ +5.16% ~ ~ ShiftArithmeticRight ~ ~ ~ ~ ~ -10.81% ~ +0.31% Switch8Predictable +14.69% ~ ~ ~ ~ -24.85% ~ ~ Switch8Unpredictable ~ -0.58% -3.80% ~ ~ -11.78% ~ -0.79% Switch32Predictable -10.33% +17.89% ~ ~ ~ +5.76% ~ ~ Switch32Unpredictable -3.15% +1.19% +9.42% ~ ~ -10.30% -5.09% +0.44% SwitchStringPredictable +70.88% +20.48% ~ ~ ~ +2.39% ~ +0.31% SwitchStringUnpredictable ~ +3.91% -5.06% -0.98% ~ +0.61% +2.03% ~ SwitchTypePredictable +146.58% -1.10% ~ -12.45% ~ -0.46% -3.81% ~ SwitchTypeUnpredictable +0.46% -0.83% ~ +4.18% ~ +0.43% ~ +0.62% SwitchInterfaceTypePredictable -13.41% -10.13% +11.03% ~ ~ -4.38% ~ +0.75% SwitchInterfaceTypeUnpredictable -6.37% -2.14% ~ -3.21% ~ -4.20% ~ +1.08% Fixes #63110. Fixes #75954. 
Change-Id: I55a876f08c6c14f419ce1a8cbba2eaae6c6efbf0 Reviewed-on: https://go-review.googlesource.com/c/go/+/714160 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Russ Cox <rsc@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
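As a hedged illustration of the kinds of Go code affected (these functions are made up for this listing, not taken from the CL):

    // first returns the sum of the first and last bytes of the i'th 8-byte
    // window of p. The length of w is (i*8+8) - (i*8); opt can now simplify
    // that to the constant 8, so the w[0] and w[7] bounds checks constant-fold
    // away with nothing left for prove to do.
    func first(p []byte, i int) byte {
        w := p[i*8 : i*8+8]
        return w[0] + w[7]
    }

    // divisible3 is the shape of divisibility test that the (now reliably
    // triggering) divisibility optimization rewrites into a multiply rather
    // than a divide.
    func divisible3(x uint64) bool {
        return x%3 == 0
    }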
||
|
|
915c1839fe |
test/codegen: simplify asmcheck pattern matching
Separate patterns in asmcheck by spaces instead of commas. Many patterns end in a comma (like "MOV [$]123,"), so separating patterns by comma is not great; they're already quoted, so spaces are fine. Also replace all tabs in the assembly lines with spaces before matching. Finally, replace \$ or \\$ with [$] as the matching idiom. The effect of all these is to make the patterns look like:

// amd64:"BSFQ" "ORQ [$]256"

instead of the old:

// amd64:"BSFQ","ORQ\t\\$256"

Update all tests as well. Change-Id: Ia39febe5d7f67ba115846422789e11b185d5c807 Reviewed-on: https://go-review.googlesource.com/c/go/+/716060 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com> Reviewed-by: Jorropo <jorropo.pgm@gmail.com> |
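For illustration, this is how such a test reads with the new syntax; the function and the exact instruction asserted here are assumptions for the example, not taken from the CL:

    // or256 demonstrates the new pattern style: quoted patterns separated by
    // spaces, with [$] standing in for a literal $ in the assembly.
    func or256(x uint64) uint64 {
        // amd64:"ORQ [$]256"
        return x | 256
    }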
||
|
|
e452f4ac7d |
[dev.simd] cmd/compile: enhance inlining for closure-of-SIMD
We noticed some hand-translated code that used nested functions as the translation of asm macros, and they were too big to inline, and the resulting performance was underwhelming. Any such closures really need to be inlined. Because Gerrit removed votes from a previous patch set, and because in offline discussion we realized that this was actually a hard-to-abuse inlining hack, I decided to turn it up some more, and also add a "this one goes to 11" joke. The number is utterly unprincipled, only "simd is supposed to go fast, and this is a natural use of closures, and we don't want there to be issues where it doesn't go fast." The test verifies that the inlining occurs for a function that exceeds the current inlining threshold. Inspection of the generated code shows that it has the desired effect. Change-Id: I7a8b57c07d6482e6d98cedaf9622c960f956834d Reviewed-on: https://go-review.googlesource.com/c/go/+/715740 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> |
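A rough sketch of the pattern being described, with hypothetical names (not code from the CL): a small closure standing in for an asm macro, which is only fast if the inliner absorbs it into the loop:

    // addBlocks assumes dst, a, and b have the same length. The add4 closure
    // plays the role of an asm macro; with the raised inlining budget it can
    // be inlined into the loop instead of being called through a funcval.
    func addBlocks(dst, a, b []float32) {
        add4 := func(i int) {
            dst[i+0] = a[i+0] + b[i+0]
            dst[i+1] = a[i+1] + b[i+1]
            dst[i+2] = a[i+2] + b[i+2]
            dst[i+3] = a[i+3] + b[i+3]
        }
        for i := 0; i+4 <= len(dst); i += 4 {
            add4(i)
        }
    }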
||
|
|
9035f7aea5 |
runtime: use internal/strconv
Runtime doing its own number formatting dates back to when runtime was the bottom-most Go package. Those days are long gone. Use internal/strconv to avoid duplicating code and also to get better floating-point formatting:

% go1.24.6 run x.go
+1.234568e+004
% go run x.go
12345.678
%

With accurate floating point it becomes necessary to introduce separate printers for float32 vs float64 and for complex64 vs complex128. Otherwise float32(93.7) prints as 93.69999694824219. Change-Id: I25ae3f09519342dc3d1dcabf4711651423e00128 Reviewed-on: https://go-review.googlesource.com/c/go/+/716002 Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
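The x.go program behind the transcript is not shown; a minimal stand-in that exercises the runtime's own printer (rather than fmt) would be:

    package main

    func main() {
        // The builtin println is formatted by the runtime, so it reflects the
        // switch from %e-style output to shortest-decimal formatting.
        println(12345.678)
        // Needs the separate float32 printer to avoid 93.69999694824219.
        println(float32(93.7))
    }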
||
|
|
bd4dc413cd |
cmd/compile: don't optimize away a panicking interface comparison
We can't do direct pointer comparisons if the type is not a comparable type. Fixes #76008 Change-Id: I1687acff21832d2c2e8f3b875e7b5ec125702ef3 Reviewed-on: https://go-review.googlesource.com/c/go/+/713840 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com> Reviewed-by: Keith Randall <khr@google.com> |
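A hedged example of the case being protected (not the CL's test case):

    // equalFuncs compares two interface values; it must panic at run time
    // when both hold values of the same non-comparable dynamic type (for
    // example two func() values), rather than being optimized into a direct
    // pointer comparison that silently returns a result.
    func equalFuncs(a, b any) bool {
        return a == b
    }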
||
|
|
24af441437 |
cmd/compile: rewrite proved multiplies by 0 or 1 into CondSelect
Updates #76056 Change-Id: I64fe631ab381c74f902f877392530d7cc91860ab Reviewed-on: https://go-review.googlesource.com/c/go/+/715044 Reviewed-by: Michael Knyszek <mknyszek@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
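A hedged sketch of code this rewrite could apply to; the exact trigger conditions live in the CL, not here:

    // keepIf multiplies x by a value that prove knows is either 0 or 1,
    // which can now be rewritten into a conditional select between 0 and x.
    func keepIf(x int64, keep bool) int64 {
        m := int64(0)
        if keep {
            m = 1
        }
        return x * m
    }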
||
|
|
73d7635fae |
cmd/compile: add generic rules to remove bool → int → bool roundtrips
Change-Id: I8b0a3b64c89fe167d304f901a5d38470f35400ab Reviewed-on: https://go-review.googlesource.com/c/go/+/715200 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@golang.org> |
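For example, a bool → int → bool roundtrip of the kind the new rules can remove (illustrative only):

    // roundtrip encodes b as an int and immediately decodes it again; the
    // generic rules can now reduce the whole function body to just b.
    func roundtrip(b bool) bool {
        n := 0
        if b {
            n = 1
        }
        return n != 0
    }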
||
|
|
77dc138030 |
cmd/compile: teach prove about unsigned rounding-up divide
Change-Id: Ia7b5242c723f83ba85d12e4ca64a19fbbd126016 Reviewed-on: https://go-review.googlesource.com/c/go/+/714622 Auto-Submit: Jorropo <jorropo.pgm@gmail.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> |
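A hedged example of the expression shape involved:

    // ceilDiv8 is the usual unsigned rounding-up divide; prove can now
    // derive facts about the range of results of expressions of this form.
    func ceilDiv8(n uint) uint {
        return (n + 7) / 8
    }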
||
|
|
ca1264ac50 |
[dev.simd] test: add some trickier cases to ternary-boolean simd test
These new tests check a hypothesis about interactions between CPU features and common subexpressions. Happily, the hypothesis was not (yet) correct and the test did not fail. These are probably good to have in the corpus in case we decide to tinker with the rewrite in the future, or if someone wants to write a fuzzer and needs a little inspiration. Change-Id: I8ea6e1655a293c22e39bf53e4d2c5afd3dcb2510 Reviewed-on: https://go-review.googlesource.com/c/go/+/714803 Reviewed-by: Junyang Shao <shaojunyang@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> |
||
|
|
d7a52f9369 |
cmd/compile: use MOV(D|F) with const for Const(64|32)F on riscv64
The original Const64F used AUIPC + LD + FMVDX to load a float64 constant; we can now use AUIPC + FLD instead, the same as Const32F. Change-Id: I8ca0a0e90d820a26e69b74cd25df3cc662132bf7 Reviewed-on: https://go-review.googlesource.com/c/go/+/703215 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Joel Sing <joel@sing.id.au> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> |
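For reference, a trivial function whose constant load is the kind affected (illustrative only):

    // pi returns a float64 constant; on riscv64 the load can now be an
    // AUIPC+FLD pair instead of AUIPC+LD followed by FMVDX.
    func pi() float64 {
        return 3.141592653589793
    }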
||
|
|
f6b4711095 |
[dev.simd] cmd/compile, simd: add rewrite to convert logical expression trees into TERNLOG instructions
includes tests of both rewrite application and rewrite correctness Change-Id: I7983ccf87a8408af95bb6c447cb22f01beda9f61 Reviewed-on: https://go-review.googlesource.com/c/go/+/710697 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> |
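A hedged scalar sketch of the expression shape involved; the real rewrite applies to SIMD vector values, and the names here are not from the CL:

    // blend computes a three-input boolean function bit by bit, the kind of
    // logical expression tree that can collapse into a single TERNLOG
    // instruction when the operands are vectors.
    func blend(mask, a, b uint64) uint64 {
        return (mask & a) | (^mask & b)
    }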
||
|
|
519ae514ab |
cmd/compile: eliminate bound check for slices of the same length
If two slices start out with the same length and decrease in length by
the same amount on each round of the loop (or in the if block), then
we know their lengths are always equal.
For example:
    if len(a) != len(b) {
        return
    }
    for len(a) >= 4 {
        a = a[4:]
        b = b[4:] // proved here, omit boundary check
    }
    if len(a) == len(b) { // proved here
        //...
    }

Or, change 'for' to 'if':

    if len(a) != len(b) {
        return
    }
    if len(a) >= 4 {
        a = a[4:]
        b = b[4:]
    }
    if len(a) == len(b) { // proved here
        //...
    }
Fixes #75144
Change-Id: I4e5902a02b5cf8fdc122715a7dbd2fb5e9a8f5dc
Reviewed-on: https://go-review.googlesource.com/c/go/+/699155
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
|
||
|
|
7056c71d32 |
cmd/compile: disable use of new saturating float-to-int conversions
The new conversions can be activated (or bisected) with -gcflags=all=-d=converthash=PATTERN, where PATTERN is either a hash string or one of n, qn, y, qy, meaning no, quietly no, yes, quietly yes. This CL makes the default pattern "qn" instead of the default-default, which is an efficient encoding of "qy". Updates #75834 Change-Id: I88a9fd7880bc999132420c8d0a22a8fdc1e95a2a Reviewed-on: https://go-review.googlesource.com/c/go/+/711845 Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Bypass: David Chase <drchase@google.com> |
||
|
|
6d5b13793f |
Revert "cmd/compile: make 386 float-to-int conversions match amd64"
This reverts commit
|
||
|
|
bb2a14252b |
Revert "runtime: adjust softfloat corner cases to match amd64/arm64"
This reverts commit
|