592 commits

Author SHA1 Message Date
Keith Randall
80a2aae922 Revert "cmd/compile: improve stp merging for non-sequent cases"
This reverts commit 4c63d798cb.

Reason for revert: Causes miscompilations. See issue 75365.

Change-Id: Icd1fcfeb23d2ec524b16eb556030f43875e1c90d
Reviewed-on: https://go-review.googlesource.com/c/go/+/702455
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Mark Freeman <markfreeman@google.com>
2025-09-10 11:11:11 -07:00
Youlin Feng
a5fa5ea51c cmd/compile/internal/ssa: expand runtime.memequal for length {3,5,6,7}
This CL slightly speeds up strings.HasPrefix when testing constant
prefixes of length {3,5,6,7}.
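
As a rough illustration of why odd lengths can be compared without a call to a general-purpose routine, the sketch below checks 7 bytes with two overlapping 4-byte loads and 3 bytes with a 2-byte plus a 1-byte load. The function names equal7 and equal3 are hypothetical, and whether the CL emits exactly these load combinations is an assumption; this only shows the general idea in ordinary Go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// equal7 reports whether the first 7 bytes of a and b match, using two
// overlapping 4-byte loads (bytes 0..3 and bytes 3..6). Both slices must
// hold at least 7 bytes. Hypothetical helper, not compiler code.
func equal7(a, b []byte) bool {
	lo := binary.LittleEndian.Uint32(a[0:4]) == binary.LittleEndian.Uint32(b[0:4])
	hi := binary.LittleEndian.Uint32(a[3:7]) == binary.LittleEndian.Uint32(b[3:7])
	return lo && hi
}

// equal3 does the same for 3 bytes with one 2-byte and one 1-byte load.
// Both slices must hold at least 3 bytes.
func equal3(a, b []byte) bool {
	return binary.LittleEndian.Uint16(a[0:2]) == binary.LittleEndian.Uint16(b[0:2]) &&
		a[2] == b[2]
}

func main() {
	fmt.Println(equal7([]byte("prefix-rest"), []byte("prefix-")))
	fmt.Println(equal3([]byte("abc"), []byte("abd")))
}
```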

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
                │      old     │                 new                 │
                │    sec/op    │   sec/op     vs base                │
StringPrefix3-8   11.125n ± 2%   8.539n ± 1%  -23.25% (p=0.000 n=20)
StringPrefix5-8   11.170n ± 2%   8.700n ± 1%  -22.11% (p=0.000 n=20)
StringPrefix6-8   11.190n ± 2%   8.655n ± 1%  -22.65% (p=0.000 n=20)
StringPrefix7-8   11.095n ± 1%   8.878n ± 1%  -19.98% (p=0.000 n=20)

Change-Id: I510a80d59cf78680b57d68780d35d212d24030e2
Reviewed-on: https://go-review.googlesource.com/c/go/+/700816
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <markfreeman@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
2025-09-09 12:10:07 -07:00
Melnikov Denis
4c63d798cb cmd/compile: improve stp merging for non-sequent cases
The original algorithm merges each store with the first
mergeable store in the chain, but it misses some
cases. Additionally reordering the stores in the chain in increasing
order of memory access allows merging in these cases.

Fixes #71987

Here are the results of the sweet benchmarks and
the differences in .text section sizes:

                        │ old.results │            new.results             │
                        │   sec/op    │   sec/op     vs base               │
BleveIndexBatch100-4      7.614 ± 2%    7.548 ± 1%       ~ (p=0.190 n=10)
ESBuildThreeJS-4         821.3m ± 0%   819.0m ± 1%       ~ (p=0.165 n=10)
ESBuildRomeTS-4          206.2m ± 1%   204.4m ± 1%  -0.90% (p=0.023 n=10)
EtcdPut-4                64.89m ± 1%   64.94m ± 2%       ~ (p=0.684 n=10)
EtcdSTM-4                318.4m ± 0%   319.2m ± 1%       ~ (p=0.631 n=10)
GoBuildKubelet-4          157.4 ± 0%    157.6 ± 0%       ~ (p=0.105 n=10)
GoBuildKubeletLink-4      12.42 ± 2%    12.41 ± 1%       ~ (p=0.529 n=10)
GoBuildIstioctl-4         124.4 ± 0%    124.4 ± 0%       ~ (p=0.579 n=10)
GoBuildIstioctlLink-4     8.700 ± 1%    8.693 ± 1%       ~ (p=0.912 n=10)
GoBuildFrontend-4         46.52 ± 0%    46.50 ± 0%       ~ (p=0.971 n=10)
GoBuildFrontendLink-4     2.282 ± 1%    2.272 ± 1%       ~ (p=0.529 n=10)
GoBuildTsgo-4             75.02 ± 1%    75.31 ± 1%       ~ (p=0.436 n=10)
GoBuildTsgoLink-4         1.229 ± 1%    1.219 ± 1%  -0.82% (p=0.035 n=10)
GopherLuaKNucleotide-4    34.77 ± 5%    34.31 ± 1%  -1.33% (p=0.015 n=10)
MarkdownRenderXHTML-4    286.6m ± 0%   285.7m ± 1%       ~ (p=0.315 n=10)
Tile38QueryLoad-4        657.2µ ± 1%   660.3µ ± 0%       ~ (p=0.436 n=10)
geomean                   2.570         2.563       -0.24%

Executable            Old .text  New .text     Change
-------------------------------------------------------
benchmark               6504820    6504020     -0.01%
bleve-index-bench       3903860    3903636     -0.01%
esbuild                 4801012    4801172     +0.00%
esbuild-bench           1256404    1256340     -0.01%
etcd                    9188148    9187076     -0.01%
etcd-bench              6462228    6461524     -0.01%
go                      5924468    5923892     -0.01%
go-build-bench          1282004    1281940     -0.00%
gopher-lua-bench        1639540    1639348     -0.01%
markdown-bench          1478452    1478356     -0.01%
tile38-bench            2753524    2753300     -0.01%
tile38-server          10241380   10240068     -0.01%

Change-Id: Ieb4fdfd656aca458f65fc45938de70550632bd13
Reviewed-on: https://go-review.googlesource.com/c/go/+/698097
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Mark Freeman <markfreeman@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-09-09 12:10:01 -07:00
Xiaolin Zhao
f5b20689e9 cmd/compile: optimize loads from readonly globals into constants on loong64
Ref: CL 141118
Update #26498

Change-Id: I9c4ad2bedc4d50bd273bbe9119a898d4fca95e45
Reviewed-on: https://go-review.googlesource.com/c/go/+/700875
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-09-05 08:42:28 -07:00
Xiaolin Zhao
3492e4262b cmd/compile: simplify specific addition operations using the ADDV16 instruction
On loong64, the addi.d instruction can only handle 12-bit immediates
directly. A larger immediate must first be placed in a register, and
the add.d instruction is then used to perform the addition.
If a larger immediate c satisfies is32Bit(c) && c&0xffff == 0,
the ADDV16 instruction can perform the addition instead.
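
The condition quoted above can be tried out in plain Go. In this sketch the body of is32Bit is an assumption about what the helper of that name does (a signed 32-bit range check), and is12Bit and fitsADDV16 are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// is32Bit reports whether n fits in a signed 32-bit integer.
// Assumed definition, written to match the name used in the message.
func is32Bit(n int64) bool { return n == int64(int32(n)) }

// is12Bit reports whether n fits in a signed 12-bit immediate.
// Hypothetical helper for the addi.d case.
func is12Bit(n int64) bool { return -(1<<11) <= n && n < 1<<11 }

// fitsADDV16 mirrors the condition from the message: the constant fits
// in 32 bits and its low 16 bits are all zero. Hypothetical name.
func fitsADDV16(c int64) bool { return is32Bit(c) && c&0xffff == 0 }

func main() {
	for _, c := range []int64{100, 0x10000, -0x20000, 0x12345, 1 << 32} {
		fmt.Printf("%#x: 12-bit=%v addv16=%v\n", c, is12Bit(c), fitsADDV16(c))
	}
}
```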

Removes 164 instructions from the go binary on loong64.

Change-Id: I404de93cc4eaaa12fe424f5a0d61b03231215d1a
Reviewed-on: https://go-review.googlesource.com/c/go/+/700536
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2025-09-05 08:18:04 -07:00
Youlin Feng
df29038486 cmd/compile/internal/ssa: load constant values from abi.PtrType.Elem
This CL makes the generated code for reflect.TypeFor as simple as an
intrinsic function.

Fixes #75203

Change-Id: I7bb48787101f07e77ab5c583292e834c28a028d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/700336
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
2025-09-04 07:25:26 -07:00
limeidan
bd71b94659 cmd/compile/internal: optimizing add+sll rule using ALSLV instruction on loong64
Reduce the number of go toolchain instructions on loong64 as follows:

	file        before   after    Δ       %
	go          1573148  1571708  -1,440  -0.0915%
	gofmt       320578   320090   -488    -0.1522%
	asm         555066   554406   -660    -0.1189%
	cgo         481566   480926   -640    -0.1329%
	compile     2475962  2473880  -2,082  -0.0841%
	cover       516536   515920   -616    -0.1193%
	link        702172   701404   -768    -0.1094%
	preprofile  238626   238274   -352    -0.1475%
	vet         792928   792100   -828    -0.1044%

Change-Id: I61e462726835959c60e1b4e5256d4020202418ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/693877
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-08-25 12:30:16 -07:00
Xiaolin Zhao
44c5956bf7 test/codegen: add Mul2 and DivPow2 test for loong64
Change-Id: I29ccd105c5418955146a3f4873162963da489a70
Reviewed-on: https://go-review.googlesource.com/c/go/+/697935
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2025-08-24 18:14:28 -07:00
Xiaolin Zhao
0aa8019e94 test/codegen: add Mul* test for loong64
Change-Id: Ica285212e4884a96fe9738b53cdc789b223bf2e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/697895
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-08-24 18:14:22 -07:00
Xiaolin Zhao
83420974b7 test/codegen: add sqrt* abs and copysign test for loong64
Change-Id: I645396fc4b00242f36a06f01550906805c0c1f73
Reviewed-on: https://go-review.googlesource.com/c/go/+/697955
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2025-08-24 18:14:13 -07:00
limeidan
1843f1e9c0 cmd/compile: use zero register instead of specialized *zero instructions on loong64
See CL 633075; loong64 has a zero register (R0) that can be used for this.

Change-Id: I846c6bdfcfd6dbfa18338afc13e34e350580ead4
Reviewed-on: https://go-review.googlesource.com/c/go/+/693876
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
2025-08-21 11:23:05 -07:00
Xiaolin Zhao
9632ba8160 cmd/compile: optimize some patterns into revb2h/revb4h instruction on loong64
Pattern1: (the type of c is uint16)
    c>>8 | c<<8
To:
    revb2h c

Pattern2: (the type of c is uint32)
    (c & 0xff00ff00)>>8 | (c & 0x00ff00ff)<<8
To:
    revb2h c

Pattern3: (the type of c is uint64)
    (c & 0xff00ff00ff00ff00)>>8 | (c & 0x00ff00ff00ff00ff)<<8
To:
    revb4h c
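
The three source patterns can be written as ordinary Go functions to see what they compute: each swaps the two bytes inside every 16-bit halfword. The function names below are hypothetical; the expressions are copied from the patterns above.

```go
package main

import (
	"fmt"
	"math/bits"
)

// swap16 is Pattern1: swap the two bytes of a uint16.
func swap16(c uint16) uint16 { return c>>8 | c<<8 }

// swapHalfwords32 is Pattern2: swap the bytes within each of the two
// halfwords of a uint32.
func swapHalfwords32(c uint32) uint32 {
	return (c&0xff00ff00)>>8 | (c&0x00ff00ff)<<8
}

// swapHalfwords64 is Pattern3: swap the bytes within each of the four
// halfwords of a uint64.
func swapHalfwords64(c uint64) uint64 {
	return (c&0xff00ff00ff00ff00)>>8 | (c&0x00ff00ff00ff00ff)<<8
}

func main() {
	fmt.Printf("%#04x\n", swap16(0x1234))
	fmt.Printf("%#08x\n", swapHalfwords32(0x11223344))
	fmt.Printf("%#016x\n", swapHalfwords64(0x1122334455667788))
	// For 16 bits the pattern coincides with a full byte reversal.
	fmt.Println(swap16(0x1234) == bits.ReverseBytes16(0x1234))
}
```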

Change-Id: Ic6231a3f476cbacbea4bd00e31193d107cb86cda
Reviewed-on: https://go-review.googlesource.com/c/go/+/696335
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-08-21 11:19:34 -07:00
Xiaolin Zhao
fa706ea50f cmd/compile: optimize rule (x + x) << c to x << c+1 on loong64
Change-Id: I782f93510bba92ba60b298c1c1cde456c8bcec38
Reviewed-on: https://go-review.googlesource.com/c/go/+/697956
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2025-08-21 11:16:49 -07:00
Michael Munday
320df537cc cmd/compile: emit classify instructions for infinity tests on riscv64
The 'classify' instruction on RISC-V sets a bit in a mask to indicate
the class a floating point value belongs to (e.g. whether the value is
an infinity, a normal number, a subnormal number and so on). There are
other places this instruction is useful but for now I've just used it
for infinity tests.

The gains are relatively small (~1-2 instructions per IsInf call) but
using FCLASSD does potentially unlock further optimizations. It also
reduces the number of loads from memory and the number of moves
between general purpose and floating point register files.
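
To make the mask idea concrete, here is a software sketch of a classify-style operation and an infinity test built on top of it. The bit positions follow the RISC-V FCLASS result layout (bit 0 for negative infinity, bit 7 for positive infinity). The names classify and isInf are hypothetical, and this is an emulation in Go, not the code the compiler generates.

```go
package main

import (
	"fmt"
	"math"
)

// classify returns a mask with exactly one bit set, describing the class
// of f: 0 -Inf, 1 negative normal, 2 negative subnormal, 3 -0, 4 +0,
// 5 positive subnormal, 6 positive normal, 7 +Inf, 8 signaling NaN,
// 9 quiet NaN. Hypothetical software emulation.
func classify(f float64) uint {
	b := math.Float64bits(f)
	neg := b>>63 != 0
	exp := (b >> 52) & 0x7ff
	frac := b & (1<<52 - 1)
	switch {
	case exp == 0x7ff && frac == 0:
		if neg {
			return 1 << 0
		}
		return 1 << 7
	case exp == 0x7ff:
		if frac>>51 != 0 {
			return 1 << 9
		}
		return 1 << 8
	case exp == 0 && frac == 0:
		if neg {
			return 1 << 3
		}
		return 1 << 4
	case exp == 0:
		if neg {
			return 1 << 2
		}
		return 1 << 5
	default:
		if neg {
			return 1 << 1
		}
		return 1 << 6
	}
}

// isInf has the same contract as math.IsInf, but tests the class mask
// instead of comparing against constants.
func isInf(f float64, sign int) bool {
	var mask uint
	if sign >= 0 {
		mask |= 1 << 7 // +Inf
	}
	if sign <= 0 {
		mask |= 1 << 0 // -Inf
	}
	return classify(f)&mask != 0
}

func main() {
	for _, f := range []float64{math.Inf(1), math.Inf(-1), 1.5, math.NaN()} {
		fmt.Println(f, isInf(f, 0), math.IsInf(f, 0))
	}
}
```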

goos: linux
goarch: riscv64
pkg: math
cpu: Spacemit(R) X60
                    │        sec/op        │   sec/op     vs base                │
Acos                           159.9n ± 0%   173.7n ± 0%   +8.66% (p=0.000 n=10)
Acosh                          249.8n ± 0%   254.4n ± 0%   +1.86% (p=0.000 n=10)
Asin                           159.9n ± 0%   173.7n ± 0%   +8.66% (p=0.000 n=10)
Asinh                          292.2n ± 0%   283.0n ± 0%   -3.15% (p=0.000 n=10)
Atan                           119.1n ± 0%   119.0n ± 0%   -0.08% (p=0.036 n=10)
Atanh                          265.1n ± 0%   271.6n ± 0%   +2.43% (p=0.000 n=10)
Atan2                          194.9n ± 0%   186.7n ± 0%   -4.23% (p=0.000 n=10)
Cbrt                           216.3n ± 0%   203.1n ± 0%   -6.10% (p=0.000 n=10)
Ceil                           31.82n ± 0%   31.81n ± 0%        ~ (p=0.063 n=10)
Copysign                       4.897n ± 0%   4.893n ± 3%   -0.08% (p=0.038 n=10)
Cos                            123.9n ± 0%   107.7n ± 1%  -13.03% (p=0.000 n=10)
Cosh                           293.0n ± 0%   264.6n ± 0%   -9.68% (p=0.000 n=10)
Erf                            150.0n ± 0%   133.8n ± 0%  -10.80% (p=0.000 n=10)
Erfc                           151.8n ± 0%   137.9n ± 0%   -9.16% (p=0.000 n=10)
Erfinv                         173.8n ± 0%   173.8n ± 0%        ~ (p=0.820 n=10)
Erfcinv                        173.8n ± 0%   173.8n ± 0%        ~ (p=1.000 n=10)
Exp                            247.7n ± 0%   220.4n ± 0%  -11.04% (p=0.000 n=10)
ExpGo                          261.4n ± 0%   232.5n ± 0%  -11.04% (p=0.000 n=10)
Expm1                          176.2n ± 0%   164.9n ± 0%   -6.41% (p=0.000 n=10)
Exp2                           220.4n ± 0%   190.2n ± 0%  -13.70% (p=0.000 n=10)
Exp2Go                         232.5n ± 0%   204.0n ± 0%  -12.22% (p=0.000 n=10)
Abs                            4.897n ± 0%   4.897n ± 0%        ~ (p=0.726 n=10)
Dim                            16.32n ± 0%   16.31n ± 0%        ~ (p=0.770 n=10)
Floor                          31.84n ± 0%   31.83n ± 0%        ~ (p=0.677 n=10)
Max                            26.11n ± 0%   26.13n ± 0%        ~ (p=0.290 n=10)
Min                            26.10n ± 0%   26.11n ± 0%        ~ (p=0.424 n=10)
Mod                            416.2n ± 0%   337.8n ± 0%  -18.83% (p=0.000 n=10)
Frexp                          63.65n ± 0%   50.60n ± 0%  -20.50% (p=0.000 n=10)
Gamma                          218.8n ± 0%   206.4n ± 0%   -5.62% (p=0.000 n=10)
Hypot                          92.20n ± 0%   94.69n ± 0%   +2.70% (p=0.000 n=10)
HypotGo                        107.7n ± 0%   109.3n ± 0%   +1.49% (p=0.000 n=10)
Ilogb                          59.54n ± 0%   44.04n ± 0%  -26.04% (p=0.000 n=10)
J0                             708.9n ± 0%   674.5n ± 0%   -4.86% (p=0.000 n=10)
J1                             707.6n ± 0%   676.1n ± 0%   -4.44% (p=0.000 n=10)
Jn                             1.513µ ± 0%   1.427µ ± 0%   -5.68% (p=0.000 n=10)
Ldexp                          70.20n ± 0%   57.09n ± 0%  -18.68% (p=0.000 n=10)
Lgamma                         201.5n ± 0%   185.3n ± 1%   -8.01% (p=0.000 n=10)
Log                            201.5n ± 0%   182.7n ± 0%   -9.35% (p=0.000 n=10)
Logb                           59.54n ± 0%   46.53n ± 0%  -21.86% (p=0.000 n=10)
Log1p                          178.8n ± 0%   173.9n ± 6%   -2.74% (p=0.021 n=10)
Log10                          201.4n ± 0%   184.3n ± 0%   -8.49% (p=0.000 n=10)
Log2                           79.17n ± 0%   66.07n ± 0%  -16.54% (p=0.000 n=10)
Modf                           34.27n ± 0%   34.25n ± 0%        ~ (p=0.559 n=10)
Nextafter32                    49.34n ± 0%   49.37n ± 0%   +0.05% (p=0.040 n=10)
Nextafter64                    43.66n ± 0%   43.66n ± 0%        ~ (p=0.869 n=10)
PowInt                         309.1n ± 0%   267.4n ± 0%  -13.49% (p=0.000 n=10)
PowFrac                        769.6n ± 0%   677.3n ± 0%  -11.98% (p=0.000 n=10)
Pow10Pos                       13.88n ± 0%   13.88n ± 0%        ~ (p=0.811 n=10)
Pow10Neg                       19.58n ± 0%   19.57n ± 0%        ~ (p=0.993 n=10)
Round                          23.65n ± 0%   23.66n ± 0%        ~ (p=0.354 n=10)
RoundToEven                    27.75n ± 0%   27.75n ± 0%        ~ (p=0.971 n=10)
Remainder                      380.0n ± 0%   309.9n ± 0%  -18.45% (p=0.000 n=10)
Signbit                        13.06n ± 0%   13.06n ± 0%        ~ (p=1.000 n=10)
Sin                            133.8n ± 0%   120.8n ± 0%   -9.75% (p=0.000 n=10)
Sincos                         160.7n ± 0%   147.7n ± 0%   -8.12% (p=0.000 n=10)
Sinh                           305.9n ± 0%   277.9n ± 0%   -9.17% (p=0.000 n=10)
SqrtIndirect                   3.265n ± 0%   3.264n ± 0%        ~ (p=0.546 n=10)
SqrtLatency                    19.58n ± 0%   19.58n ± 0%        ~ (p=0.973 n=10)
SqrtIndirectLatency            19.59n ± 0%   19.58n ± 0%        ~ (p=0.370 n=10)
SqrtGoLatency                  205.7n ± 0%   202.7n ± 0%   -1.46% (p=0.000 n=10)
SqrtPrime                      4.953µ ± 0%   4.954µ ± 0%        ~ (p=0.477 n=10)
Tan                            163.2n ± 0%   150.2n ± 0%   -7.99% (p=0.000 n=10)
Tanh                           312.4n ± 0%   284.2n ± 0%   -9.01% (p=0.000 n=10)
Trunc                          31.83n ± 0%   31.83n ± 0%        ~ (p=0.663 n=10)
Y0                             701.0n ± 0%   669.2n ± 0%   -4.54% (p=0.000 n=10)
Y1                             704.5n ± 0%   672.4n ± 0%   -4.55% (p=0.000 n=10)
Yn                             1.490µ ± 0%   1.422µ ± 0%   -4.60% (p=0.000 n=10)
Float64bits                    5.713n ± 0%   5.710n ± 0%        ~ (p=0.926 n=10)
Float64frombits                4.896n ± 0%   4.896n ± 0%        ~ (p=0.663 n=10)
Float32bits                    12.25n ± 0%   12.25n ± 0%        ~ (p=0.571 n=10)
Float32frombits                4.898n ± 0%   4.896n ± 0%        ~ (p=0.754 n=10)
FMA                            4.895n ± 0%   4.895n ± 0%        ~ (p=0.745 n=10)
geomean                        94.40n        89.43n        -5.27%

Change-Id: I4fe0f2e9f609e38d79463f9ba2519a3f9427432e
Reviewed-on: https://go-review.googlesource.com/c/go/+/348389
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-08-13 20:33:56 -07:00
limeidan
90b7d7aaa2 cmd/compile/internal: optimize multiplication use new operation 'ADDshiftLLV' on loong64
goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A6000-HV @ 2500.00MHz
                  │     old      │                 new                  │
                  │    sec/op    │    sec/op     vs base                │
MulconstI32/3       0.8004n ± 0%   0.4247n ± 2%  -46.94% (p=0.000 n=10)
MulconstI32/5       0.8005n ± 0%   0.4256n ± 1%  -46.83% (p=0.000 n=10)
MulconstI32/12      1.2010n ± 0%   0.8005n ± 0%  -33.35% (p=0.000 n=10)
MulconstI32/120     0.8090n ± 0%   0.8067n ± 0%   -0.28% (p=0.007 n=10)
MulconstI32/-120    0.8109n ± 0%   0.8072n ± 0%   -0.47% (p=0.000 n=10)
MulconstI32/65537   0.8004n ± 0%   0.8004n ± 0%        ~ (p=1.000 n=10)
MulconstI32/65538   0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.265 n=10)
MulconstI64/3       0.8005n ± 0%   0.4241n ± 1%  -47.02% (p=0.000 n=10)
MulconstI64/5       0.8004n ± 0%   0.4249n ± 1%  -46.91% (p=0.000 n=10)
MulconstI64/12      1.2010n ± 0%   0.8004n ± 0%  -33.36% (p=0.000 n=10)
MulconstI64/120     0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.635 n=10)
MulconstI64/-120    0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.837 n=10)
MulconstI64/65537   0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.837 n=10)
MulconstI64/65538   0.8096n ± 0%   0.8004n ± 0%   -1.14% (p=0.000 n=10)
MulconstU32/3       0.8004n ± 0%   0.4263n ± 1%  -46.75% (p=0.000 n=10)
MulconstU32/5       0.8005n ± 0%   0.4262n ± 1%  -46.76% (p=0.000 n=10)
MulconstU32/12      1.2010n ± 0%   0.8005n ± 0%  -33.35% (p=0.000 n=10)
MulconstU32/120     0.8105n ± 0%   0.8096n ± 0%        ~ (p=0.183 n=10)
MulconstU32/65537   0.8004n ± 0%   0.8004n ± 0%        ~ (p=1.000 n=10)
MulconstU32/65538   0.8005n ± 0%   0.8005n ± 0%        ~ (p=1.000 n=10)
MulconstU64/3       0.8004n ± 0%   0.4265n ± 4%  -46.71% (p=0.000 n=10)
MulconstU64/5       0.8004n ± 0%   0.4256n ± 0%  -46.82% (p=0.000 n=10)
MulconstU64/12      1.2010n ± 0%   0.8004n ± 0%  -33.36% (p=0.000 n=10)
MulconstU64/120     0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.387 n=10)
MulconstU64/65537   0.8005n ± 0%   0.8005n ± 0%        ~ (p=0.265 n=10)
MulconstU64/65538   0.8080n ± 0%   0.8004n ± 0%   -0.93% (p=0.000 n=10)
geomean             0.8539n        0.6597n       -22.74%

Change-Id: Ie33e88985d7639f481bbba540bc917b9f185c357
Reviewed-on: https://go-review.googlesource.com/c/go/+/693855
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-08-12 23:01:49 -07:00
Keith Randall
f04421ea9a cmd/compile: soften test for 74788
We now (as of CL 678620) use float registers other than X0 for copying.

Change-Id: Ifdecd5df7519663742eed0f292c98453754d4b25
Reviewed-on: https://go-review.googlesource.com/c/go/+/695275
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
2025-08-12 10:05:55 -07:00
Michael Munday
084c0f8494 cmd/compile: allow InlMark operations to be speculatively executed
Although InlMark takes a memory argument it ultimately becomes a
NOP and therefore is safe to speculatively execute.

Fixes #74915

Change-Id: I64317dd433e300ac28de2bcf201845083ec2ac82
Reviewed-on: https://go-review.googlesource.com/c/go/+/693795
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2025-08-11 00:52:23 -07:00
Xiaolin Zhao
a552737418 cmd/compile: fold negation into multiplication on loong64
This change also adds corresponding benchmark and codegen tests.
The performance improvement on a Loongson-3A6000-HV CPU is as follows:

goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A6000-HV @ 2500.00MHz
        |  bench.old   |              bench.new              |
        |    sec/op    |   sec/op     vs base                |
MulNeg     828.4n ± 0%   655.9n ± 0%  -20.82% (p=0.000 n=10)
Mul2Neg   1062.0n ± 0%   826.8n ± 0%  -22.15% (p=0.000 n=10)
geomean    938.0n        736.4n       -21.49%
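
Folding of this kind rests on identities that hold for wrapping integer arithmetic. The sketch below checks two of them in Go; the function names are hypothetical, the constant 7 is arbitrary, and which exact rewrite rules the change adds is an assumption not confirmed by the message.

```go
package main

import "fmt"

// mulNegNeg multiplies two negated operands: two negations and a multiply.
func mulNegNeg(x, y int64) int64 { return (-x) * (-y) }

// mulNegNegFolded is the folded form: the two negations cancel.
func mulNegNegFolded(x, y int64) int64 { return x * y }

// negMulConst negates a product with a constant: a multiply and a negation.
func negMulConst(x int64) int64 { return -(x * 7) }

// negMulConstFolded moves the negation into the constant.
func negMulConstFolded(x int64) int64 { return x * -7 }

func main() {
	for _, x := range []int64{0, 3, -5, 1 << 62} {
		fmt.Println(
			mulNegNeg(x, 11) == mulNegNegFolded(x, 11),
			negMulConst(x) == negMulConstFolded(x),
		)
	}
}
```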

Change-Id: Ia999732880ec65be0c66cddc757a4868847e5b15
Reviewed-on: https://go-review.googlesource.com/c/go/+/682535
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <markfreeman@google.com>
2025-08-05 18:02:06 -07:00
Michael Munday
fcc036f03b cmd/compile: optimise float <-> int register moves on riscv64
Use the FMV* instructions to move values between the floating point and
integer register files.
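
For context, this is the kind of Go source that moves a value between the two register files: math.Float64bits and math.Float64frombits reinterpret the same 64 bits as an integer or a float. The helpers signbit and abs below are hypothetical examples written for illustration, not code from the math package.

```go
package main

import (
	"fmt"
	"math"
)

// signbit moves the float's bits to the integer side and tests the top bit.
func signbit(f float64) bool { return math.Float64bits(f)>>63 != 0 }

// abs clears the sign bit on the integer side and moves the result back
// to a float.
func abs(f float64) float64 {
	return math.Float64frombits(math.Float64bits(f) &^ (1 << 63))
}

func main() {
	fmt.Println(signbit(-2.5), signbit(2.5))
	fmt.Println(abs(-2.5))
}
```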

Note: I'm unsure why there is a slowdown in the Float32bits benchmark.
I've checked and an FMVXS instruction is being used as expected. There
are multiple loads and other instructions in the main loop.

goos: linux
goarch: riscv64
pkg: math
cpu: Spacemit(R) X60
                    │ fmv-before.txt │            fmv-after.txt            │
                    │     sec/op     │   sec/op     vs base                │
Acos                     122.7n ± 0%   122.7n ± 0%        ~ (p=1.000 n=10)
Acosh                    197.2n ± 0%   191.5n ± 0%   -2.89% (p=0.000 n=10)
Asin                     122.7n ± 0%   122.7n ± 0%        ~ (p=0.474 n=10)
Asinh                    231.0n ± 0%   224.1n ± 0%   -2.99% (p=0.000 n=10)
Atan                     91.39n ± 0%   91.41n ± 0%        ~ (p=0.465 n=10)
Atanh                    210.3n ± 0%   203.4n ± 0%   -3.26% (p=0.000 n=10)
Atan2                    149.6n ± 0%   149.6n ± 0%        ~ (p=0.721 n=10)
Cbrt                     176.5n ± 0%   165.9n ± 0%   -6.01% (p=0.000 n=10)
Ceil                     25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Copysign                 3.756n ± 0%   3.756n ± 0%        ~ (p=0.149 n=10)
Cos                      95.15n ± 0%   95.15n ± 0%        ~ (p=0.374 n=10)
Cosh                     228.6n ± 0%   224.7n ± 0%   -1.71% (p=0.000 n=10)
Erf                      115.2n ± 0%   115.2n ± 0%        ~ (p=0.474 n=10)
Erfc                     116.4n ± 0%   116.4n ± 0%        ~ (p=0.628 n=10)
Erfinv                   133.3n ± 0%   133.3n ± 0%        ~ (p=1.000 n=10)
Erfcinv                  133.3n ± 0%   133.3n ± 0%        ~ (p=1.000 n=10)
Exp                      194.1n ± 0%   190.3n ± 0%   -1.93% (p=0.000 n=10)
ExpGo                    204.7n ± 0%   200.3n ± 0%   -2.15% (p=0.000 n=10)
Expm1                    137.7n ± 0%   135.2n ± 0%   -1.82% (p=0.000 n=10)
Exp2                     173.4n ± 0%   169.0n ± 0%   -2.54% (p=0.000 n=10)
Exp2Go                   182.8n ± 0%   178.4n ± 0%   -2.41% (p=0.000 n=10)
Abs                      3.756n ± 0%   3.756n ± 0%        ~ (p=0.157 n=10)
Dim                      12.52n ± 0%   12.52n ± 0%        ~ (p=0.737 n=10)
Floor                    25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Max                      21.29n ± 0%   20.03n ± 0%   -5.92% (p=0.000 n=10)
Min                      21.28n ± 0%   20.04n ± 0%   -5.85% (p=0.000 n=10)
Mod                      344.9n ± 0%   319.2n ± 0%   -7.45% (p=0.000 n=10)
Frexp                    55.71n ± 0%   48.85n ± 0%  -12.30% (p=0.000 n=10)
Gamma                    165.9n ± 0%   167.8n ± 0%   +1.15% (p=0.000 n=10)
Hypot                    73.24n ± 0%   70.74n ± 0%   -3.41% (p=0.000 n=10)
HypotGo                  84.50n ± 0%   82.63n ± 0%   -2.21% (p=0.000 n=10)
Ilogb                    49.45n ± 0%   45.70n ± 0%   -7.59% (p=0.000 n=10)
J0                       556.5n ± 0%   544.0n ± 0%   -2.25% (p=0.000 n=10)
J1                       555.3n ± 0%   542.8n ± 0%   -2.24% (p=0.000 n=10)
Jn                       1.181µ ± 0%   1.156µ ± 0%   -2.12% (p=0.000 n=10)
Ldexp                    59.47n ± 0%   53.84n ± 0%   -9.47% (p=0.000 n=10)
Lgamma                   167.2n ± 0%   154.6n ± 0%   -7.51% (p=0.000 n=10)
Log                      160.9n ± 0%   154.6n ± 0%   -3.92% (p=0.000 n=10)
Logb                     49.45n ± 0%   45.70n ± 0%   -7.58% (p=0.000 n=10)
Log1p                    147.1n ± 0%   137.1n ± 0%   -6.80% (p=0.000 n=10)
Log10                    162.1n ± 1%   154.6n ± 0%   -4.63% (p=0.000 n=10)
Log2                     66.99n ± 0%   60.72n ± 0%   -9.36% (p=0.000 n=10)
Modf                     29.42n ± 0%   26.29n ± 0%  -10.64% (p=0.000 n=10)
Nextafter32              41.95n ± 0%   37.88n ± 0%   -9.70% (p=0.000 n=10)
Nextafter64              38.82n ± 0%   33.49n ± 0%  -13.73% (p=0.000 n=10)
PowInt                   252.3n ± 0%   237.3n ± 0%   -5.95% (p=0.000 n=10)
PowFrac                  615.5n ± 0%   589.7n ± 0%   -4.19% (p=0.000 n=10)
Pow10Pos                 10.64n ± 0%   10.64n ± 0%        ~ (p=1.000 n=10)
Pow10Neg                 24.42n ± 0%   15.02n ± 0%  -38.49% (p=0.000 n=10)
Round                    21.91n ± 0%   18.16n ± 0%  -17.12% (p=0.000 n=10)
RoundToEven              24.42n ± 0%   21.29n ± 0%  -12.84% (p=0.000 n=10)
Remainder                308.0n ± 0%   291.2n ± 0%   -5.44% (p=0.000 n=10)
Signbit                  10.02n ± 0%   10.02n ± 0%        ~ (p=1.000 n=10)
Sin                      102.7n ± 0%   102.7n ± 0%        ~ (p=0.211 n=10)
Sincos                   124.0n ± 1%   123.3n ± 0%   -0.56% (p=0.002 n=10)
Sinh                     239.1n ± 0%   234.7n ± 0%   -1.84% (p=0.000 n=10)
SqrtIndirect             2.504n ± 0%   2.504n ± 0%        ~ (p=0.303 n=10)
SqrtLatency              15.03n ± 0%   15.02n ± 0%        ~ (p=0.598 n=10)
SqrtIndirectLatency      15.02n ± 0%   15.02n ± 0%        ~ (p=0.907 n=10)
SqrtGoLatency            165.3n ± 0%   157.2n ± 0%   -4.90% (p=0.000 n=10)
SqrtPrime                3.801µ ± 0%   3.802µ ± 0%        ~ (p=1.000 n=10)
Tan                      125.2n ± 0%   125.2n ± 0%        ~ (p=0.458 n=10)
Tanh                     244.2n ± 0%   239.9n ± 0%   -1.76% (p=0.000 n=10)
Trunc                    25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Y0                       550.2n ± 0%   538.1n ± 0%   -2.21% (p=0.000 n=10)
Y1                       552.8n ± 0%   540.6n ± 0%   -2.21% (p=0.000 n=10)
Yn                       1.168µ ± 0%   1.143µ ± 0%   -2.14% (p=0.000 n=10)
Float64bits              8.139n ± 0%   4.385n ± 0%  -46.13% (p=0.000 n=10)
Float64frombits          7.512n ± 0%   3.759n ± 0%  -49.96% (p=0.000 n=10)
Float32bits              8.138n ± 0%   9.393n ± 0%  +15.42% (p=0.000 n=10)
Float32frombits          7.513n ± 0%   3.757n ± 0%  -49.98% (p=0.000 n=10)
FMA                      3.756n ± 0%   3.756n ± 0%        ~ (p=0.246 n=10)
geomean                  77.43n        72.42n        -6.47%

Change-Id: I8dac69b1d17cb3d2af78d1c844d2b5d80000d667
Reviewed-on: https://go-review.googlesource.com/c/go/+/599235
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Munday <mikemndy@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
2025-08-05 08:27:15 -07:00
Xiaolin Zhao
e071617222 cmd/compile: optimize multiplication rules on loong64
Improve multiplication strength reduction (see CL 626998) by
adding 3 additional linear combination instructions for loong64.
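
Strength reduction replaces a multiplication by a constant with a short sequence of shifts, adds, and subtracts. The sketch below shows one possible decomposition for several of the constants that appear in the benchmarks; whether the compiler picks these exact sequences is an assumption, and the function names are hypothetical.

```go
package main

import "fmt"

// Each function computes x*c without a multiply instruction.
func mul3(x int64) int64     { return x + x<<1 }        // x + 2x
func mul5(x int64) int64     { return x + x<<2 }        // x + 4x
func mul12(x int64) int64    { return (x + x<<1) << 2 } // 3x * 4
func mul120(x int64) int64   { return x<<7 - x<<3 }     // 128x - 8x
func mul65537(x int64) int64 { return x + x<<16 }       // x + 65536x

func main() {
	x := int64(12345)
	fmt.Println(mul3(x) == x*3, mul5(x) == x*5, mul12(x) == x*12)
	fmt.Println(mul120(x) == x*120, mul65537(x) == x*65537)
}
```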

goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A6000-HV @ 2500.00MHz
                  |  bench.old   |              bench.new               |
                  |    sec/op    |    sec/op     vs base                |
MulconstI32/3       1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstI32/5       1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstI32/12       1.601n ± 0%    1.201n ± 0%  -24.98% (p=0.000 n=10)
MulconstI32/120     1.6010n ± 0%   0.8130n ± 0%  -49.22% (p=0.000 n=10)
MulconstI32/-120    1.6010n ± 0%   0.8109n ± 0%  -49.35% (p=0.000 n=10)
MulconstI32/65537   1.6275n ± 0%   0.8005n ± 0%  -50.81% (p=0.000 n=10)
MulconstI32/65538   1.6290n ± 0%   0.8004n ± 0%  -50.87% (p=0.000 n=10)
MulconstI64/3       1.6010n ± 0%   0.8004n ± 0%  -50.01% (p=0.000 n=10)
MulconstI64/5       1.6010n ± 0%   0.8004n ± 0%  -50.01% (p=0.000 n=10)
MulconstI64/12       1.601n ± 0%    1.201n ± 0%  -24.98% (p=0.000 n=10)
MulconstI64/120     1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstI64/-120    1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstI64/65537   1.6270n ± 0%   0.8005n ± 0%  -50.80% (p=0.000 n=10)
MulconstI64/65538   1.6290n ± 0%   0.8071n ± 1%  -50.45% (p=0.000 n=10)
MulconstU32/3       1.6010n ± 0%   0.8004n ± 0%  -50.01% (p=0.000 n=10)
MulconstU32/5       1.6010n ± 0%   0.8004n ± 0%  -50.01% (p=0.000 n=10)
MulconstU32/12       1.601n ± 0%    1.201n ± 0%  -24.98% (p=0.000 n=10)
MulconstU32/120     1.6010n ± 0%   0.8066n ± 0%  -49.62% (p=0.000 n=10)
MulconstU32/65537   1.6290n ± 0%   0.8005n ± 0%  -50.86% (p=0.000 n=10)
MulconstU32/65538   1.6280n ± 0%   0.8005n ± 0%  -50.83% (p=0.000 n=10)
MulconstU64/3       1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstU64/5       1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstU64/12       1.601n ± 0%    1.201n ± 0%  -24.98% (p=0.000 n=10)
MulconstU64/120     1.6010n ± 0%   0.8005n ± 0%  -50.00% (p=0.000 n=10)
MulconstU64/65537   1.6290n ± 0%   0.8005n ± 0%  -50.86% (p=0.000 n=10)
MulconstU64/65538   1.6300n ± 0%   0.8067n ± 0%  -50.51% (p=0.000 n=10)
geomean              1.609n        0.8537n       -46.95%

goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A5000 @ 2500.00MHz
                  |  bench.old   |              bench.new               |
                  |    sec/op    |    sec/op     vs base                |
MulconstI32/3       1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstI32/5       1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstI32/12       1.601n ± 0%    1.202n ± 0%  -24.92% (p=0.000 n=10)
MulconstI32/120     1.6020n ± 0%   0.8012n ± 0%  -49.99% (p=0.000 n=10)
MulconstI32/-120    1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstI32/65537   1.6020n ± 0%   0.8007n ± 0%  -50.02% (p=0.000 n=10)
MulconstI32/65538   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstI64/3       1.6015n ± 0%   0.8007n ± 0%  -50.00% (p=0.000 n=10)
MulconstI64/5       1.6020n ± 0%   0.8007n ± 0%  -50.02% (p=0.000 n=10)
MulconstI64/12       1.602n ± 0%    1.202n ± 0%  -25.00% (p=0.000 n=10)
MulconstI64/120     1.6030n ± 0%   0.8011n ± 0%  -50.02% (p=0.000 n=10)
MulconstI64/-120    1.6020n ± 0%   0.8007n ± 0%  -50.02% (p=0.000 n=10)
MulconstI64/65537   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstI64/65538   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstU32/3       1.6010n ± 0%   0.8006n ± 0%  -49.99% (p=0.000 n=10)
MulconstU32/5       1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstU32/12       1.601n ± 0%    1.202n ± 0%  -24.92% (p=0.000 n=10)
MulconstU32/120     1.6010n ± 0%   0.8006n ± 0%  -49.99% (p=0.000 n=10)
MulconstU32/65537   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstU32/65538   1.6020n ± 0%   0.8009n ± 0%  -50.01% (p=0.000 n=10)
MulconstU64/3       1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstU64/5       1.6010n ± 0%   0.8007n ± 0%  -49.98% (p=0.000 n=10)
MulconstU64/12       1.601n ± 0%    1.201n ± 0%  -24.98% (p=0.000 n=10)
MulconstU64/120     1.6020n ± 0%   0.8007n ± 0%  -50.02% (p=0.000 n=10)
MulconstU64/65537   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
MulconstU64/65538   1.6010n ± 0%   0.8007n ± 0%  -49.99% (p=0.000 n=10)
geomean              1.601n        0.8523n       -46.77%

Change-Id: I9fb0e47ca57875da171a347bf4828adfab41b875
Reviewed-on: https://go-review.googlesource.com/c/go/+/675455
Reviewed-by: Mark Freeman <mark@golang.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
2025-08-01 08:42:40 -07:00
Keith Randall
eb7f515c4d cmd/compile: use generated loops instead of DUFFZERO on amd64
goarch: amd64
cpu: 12th Gen Intel(R) Core(TM) i7-12700
                        │     base      │                 exp                 │
                        │    sec/op     │   sec/op     vs base                │
MemclrKnownSize112-20      1.270n ± 14%   1.006n ± 0%  -20.72% (p=0.000 n=10)
MemclrKnownSize128-20      1.266n ±  0%   1.005n ± 0%  -20.58% (p=0.000 n=10)
MemclrKnownSize192-20      1.771n ±  0%   1.579n ± 1%  -10.84% (p=0.000 n=10)
MemclrKnownSize248-20      4.034n ±  0%   3.520n ± 0%  -12.75% (p=0.000 n=10)
MemclrKnownSize256-20      2.269n ±  0%   2.014n ± 0%  -11.26% (p=0.000 n=10)
MemclrKnownSize512-20      4.280n ±  0%   4.030n ± 0%   -5.84% (p=0.000 n=10)
MemclrKnownSize1024-20     8.309n ±  1%   8.057n ± 0%   -3.03% (p=0.000 n=10)

Change-Id: I8f1627e2a1e981ff351dc7178932b32a2627f765
Reviewed-on: https://go-review.googlesource.com/c/go/+/678937
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-07-31 17:12:39 -07:00
Michael Munday
cedf63616a cmd/compile: add floating point min/max intrinsics on s390x
Add the VECTOR FP (MINIMUM|MAXIMUM) instructions to the assembler and
use them in the compiler to implement min and max.

Note: I've allowed floating point registers to be used with the single
element instructions (those with the W instead of V prefix) to allow
easier integration into the compiler.
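
For reference, a minimal sketch of the source-level behaviour these instructions have to reproduce. It uses only the Go 1.21+ builtins min and max; the wrapper names fmin and fmax are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// fmin and fmax are hypothetical wrappers around the builtins,
// present only so the calls are evaluated at run time.
func fmin(a, b float64) float64 { return min(a, b) }
func fmax(a, b float64) float64 { return max(a, b) }

func main() {
	fmt.Println(fmin(2.5, 1.5), fmax(2.5, 1.5))

	// A NaN argument makes the result NaN for both builtins.
	fmt.Println(math.IsNaN(fmin(1, math.NaN())), math.IsNaN(fmax(1, math.NaN())))

	// Negative zero is treated as lower than positive zero.
	negZero := math.Copysign(0, -1)
	fmt.Println(math.Signbit(fmin(negZero, 0)), math.Signbit(fmax(negZero, 0)))
}
```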

Change-Id: I5f80a510bd248cf483cce95f1979bf63fbae7de6
Reviewed-on: https://go-review.googlesource.com/c/go/+/684715
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <mark@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2025-07-30 12:29:15 -07:00
Youlin Feng
cc571dab91 cmd/compile: deduplicate instructions when rewriting func results
After CL 628075, do not rely on the memory arg of an OpLocalAddr.

Fixes #74788

Change-Id: I4e893241e3949bb8f2d93c8b88cc102e155b725d
Reviewed-on: https://go-review.googlesource.com/c/go/+/691275
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
Reviewed-by: Mark Freeman <mark@golang.org>
2025-07-30 09:38:10 -07:00
Cuong Manh Le
bd94ae8903 cmd/compile: use unsigned power-of-two detector for unsigned mod
Same as CL 689815, but for modulus instead of division.
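
A small sketch of why an unsigned detector is useful here (an illustration only, not the compiler's code; isUnsignedPowerOfTwo and modPow2 are hypothetical names). A constant such as 1<<63 is a power of two as a uint64 but is negative when viewed as an int64, so a check written for signed values can miss it. For unsigned x, x % c reduces to a mask whenever c is a power of two.

```go
package main

import "fmt"

// isUnsignedPowerOfTwo reports whether c has exactly one bit set.
// Hypothetical helper for illustration.
func isUnsignedPowerOfTwo(c uint64) bool {
	return c != 0 && c&(c-1) == 0
}

// modPow2 computes x % c with a mask. It is only valid when c is a
// power of two.
func modPow2(x, c uint64) uint64 {
	return x & (c - 1)
}

func main() {
	c := uint64(1) << 63
	x := uint64(0xF000000000000123)

	// c is a power of two, yet negative as a signed value.
	fmt.Println(isUnsignedPowerOfTwo(c), int64(c) < 0)

	// The mask form agrees with the real modulus.
	fmt.Println(modPow2(x, c) == x%c)
}
```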

Updates #74485

Change-Id: I73000231c886a987a1093669ff207fd9117a8160
Reviewed-on: https://go-review.googlesource.com/c/go/+/689895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Auto-Submit: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2025-07-29 16:22:40 -07:00
Cuong Manh Le
f3582fc80e cmd/compile: add unsigned power-of-two detector
Fixes #74485

Change-Id: Ia22a58ac43bdc36c8414d555672a3a3eafc749ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/689815
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Cuong Manh Le <cuong.manhle.vn@gmail.com>
2025-07-29 16:22:37 -07:00
Michael Munday
46b5839231 test/codegen: fix failing condmove wasm tests
These recently added tests failed when using the -all_codegen flag.

Fixes #74770

Change-Id: Idea1ea02af2bd9f45c7d0a28d633c7442328e6df
Reviewed-on: https://go-review.googlesource.com/c/go/+/690715
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
Run-TryBot: Michael Munday <mikemndy@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Mark Freeman <mark@golang.org>
Auto-Submit: Jorropo <jorropo.pgm@gmail.com>
TryBot-Bypass: Michael Knyszek <mknyszek@google.com>
2025-07-28 11:01:53 -07:00
Jorropo
ce05ad448f cmd/compile: rewrite condselects into doublings and halvings
For performance see CL 685676.

This allows something like:
  if y { x *= 2 }

To be compiled to:
  SHLXQ BX, AX, AX

Instead of:
  MOVQ    AX, CX
  SHLQ    $1, CX
  MOVBLZX BL, DX
  TESTQ   DX, DX
  CMOVQNE CX, AX

Counting unique lines of code while running ./make.bash, there are 2 doublings and 4 halvings.
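
The source-level equivalence behind this rewrite can be sketched as follows. This is an illustration under assumptions: b2i is a hypothetical bool-to-integer helper, and unsigned operands are used so that halving is exactly a right shift by one.

```go
package main

import "fmt"

// b2i converts a bool to 0 or 1. Hypothetical helper for illustration.
func b2i(b bool) uint {
	if b {
		return 1
	}
	return 0
}

// doubleIf is the branchy form from the commit message.
func doubleIf(x uint64, y bool) uint64 {
	if y {
		x *= 2
	}
	return x
}

// doubleIfShift is the equivalent shift form.
func doubleIfShift(x uint64, y bool) uint64 { return x << b2i(y) }

// halveIf and halveIfShift are the same idea for halving.
func halveIf(x uint64, y bool) uint64 {
	if y {
		x /= 2
	}
	return x
}

func halveIfShift(x uint64, y bool) uint64 { return x >> b2i(y) }

func main() {
	for _, y := range []bool{false, true} {
		fmt.Println(doubleIf(21, y) == doubleIfShift(21, y),
			halveIf(21, y) == halveIfShift(21, y))
	}
}
```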

Change-Id: Ic0727cbf429528a2dbf17cbfc3b0121db8387444
Reviewed-on: https://go-review.googlesource.com/c/go/+/685695
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2025-07-24 14:42:15 -07:00
Jorropo
fcd28070fe cmd/compile: add opt branchelim to rewrite some CondSelect into math
This allows something like:
  if y { x++ }

To be compiled to:
  MOVBLZX BX, CX
  ADDQ CX, AX

Instead of:
  LEAQ    1(AX), CX
  MOVBLZX BL, DX
  TESTQ   DX, DX
  CMOVQNE CX, AX

Counting unique lines of code while running ./make.bash, there are 100 additions and 75 subtractions.
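
As a rough sketch of the "math form" at the source level (b2i is a hypothetical helper, not something taken from the CL): the conditional increment or decrement becomes an unconditional add or subtract of the condition converted to 0 or 1.

```go
package main

import "fmt"

// b2i converts a bool to 0 or 1. Hypothetical helper for illustration.
func b2i(b bool) int64 {
	if b {
		return 1
	}
	return 0
}

// incIf and decIf are the branchy forms.
func incIf(x int64, y bool) int64 {
	if y {
		x++
	}
	return x
}

func decIf(x int64, y bool) int64 {
	if y {
		x--
	}
	return x
}

// incIfMath and decIfMath are the equivalent math forms.
func incIfMath(x int64, y bool) int64 { return x + b2i(y) }
func decIfMath(x int64, y bool) int64 { return x - b2i(y) }

func main() {
	for _, y := range []bool{false, true} {
		fmt.Println(incIf(41, y) == incIfMath(41, y), decIf(41, y) == decIfMath(41, y))
	}
}
```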

See benchmark here: https://go.dev/play/p/DJf5COjwhd_s

Either it's a performance no-op or it is faster:

  goos: linux
  goarch: amd64
  cpu: AMD Ryzen 5 3600 6-Core Processor
                                          │ /tmp/old.logs │            /tmp/new.logs             │
                                          │    sec/op     │    sec/op     vs base                │
  CmovInlineConditionAddLatency-12           0.5443n ± 5%   0.5339n ± 3%   -1.90% (p=0.004 n=10)
  CmovInlineConditionAddThroughputBy6-12      1.492n ± 1%    1.494n ± 1%        ~ (p=0.955 n=10)
  CmovInlineConditionSubLatency-12           0.5419n ± 3%   0.5282n ± 3%   -2.52% (p=0.019 n=10)
  CmovInlineConditionSubThroughputBy6-12      1.587n ± 1%    1.584n ± 2%        ~ (p=0.492 n=10)
  CmovOutlineConditionAddLatency-12          0.5223n ± 1%   0.2639n ± 4%  -49.47% (p=0.000 n=10)
  CmovOutlineConditionAddThroughputBy6-12     1.159n ± 1%    1.097n ± 2%   -5.35% (p=0.000 n=10)
  CmovOutlineConditionSubLatency-12          0.5271n ± 3%   0.2654n ± 2%  -49.66% (p=0.000 n=10)
  CmovOutlineConditionSubThroughputBy6-12     1.053n ± 1%    1.050n ± 1%        ~ (p=1.000 n=10)
  geomean

There are other benefits not tested by this benchmark:
- the math form is usually a couple bytes shorter (ICACHE)
- the math form is usually 0~2 uops shorter (UCACHE)
- the math form usually has less register pressure*
- the math form can sometimes be optimized further

*regalloc rarely finds a way to use fewer registers

As far as pass ordering goes there are many possible options,
I've decided to reorder branchelim before late opt since:
- unlike running exclusively the CondSelect rules after branchelim,
  some extra optimizations might trigger on the adds or subs.
- I don't want to maintain a second generic.rules file containing only the
  rules that can trigger after branchelim.
- rerunning all of opt a third time increases compilation time for little gain.

By elimination, moving branchelim seems fine.

Change-Id: I869adf57e4d109948ee157cfc47144445146bafd
Reviewed-on: https://go-review.googlesource.com/c/go/+/685676
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-07-24 14:42:10 -07:00
Alexander Musman
bd80f74bc1 cmd/compile: fold shift through AND for slice operations
Fold a shift through AND when the AND gets a zero-or-one operand (e.g.
from arithmetic shift by 63 of a 64-bit value) for a common case with
slice operations:

    ASR     $63, R2, R2
    AND     R3<<3, R2, R2
    ADD     R2, R0, R2

As the operands are 64-bit, we can transform it to:

    AND     R2->63, R3, R2
    ADD     R2<<3, R0, R2
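
The identity behind the fold can be checked in Go. This is a sketch with hypothetical function names, not the rewrite rule itself: an arithmetic shift of an int64 by 63 yields 0 or -1 (all bits clear or all bits set), and ANDing with such a mask commutes with a left shift.

```go
package main

import "fmt"

// before mirrors the original sequence: mask & (i << 3), then add.
func before(x, i, base int64) int64 {
	m := x >> 63 // 0 or -1
	return base + (m & (i << 3))
}

// after mirrors the folded sequence: (mask & i) << 3, then add.
func after(x, i, base int64) int64 {
	m := x >> 63 // 0 or -1
	return base + ((m & i) << 3)
}

func main() {
	for _, x := range []int64{-5, 0, 7} {
		fmt.Println(before(x, 12345, 1000) == after(x, 12345, 1000))
	}
}
```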

Code size improvement:
compile: .text:     9088004 ->  9086292 (-0.02%)
etcd:    .text:    10500276 -> 10498964 (-0.01%)

Change-Id: Ibcd5e67173da39b77ceff77ca67812fb8be5a7b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/679895
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Mark Freeman <mark@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-07-24 13:47:20 -07:00
Alexander Musman
dcb479c2f9 cmd/compile: optimize slice bounds checking with SUB/SUBconst comparisons
Optimize ARM64 code generation for slice bounds checking by recognizing
patterns where comparisons to zero involve SUB or SUBconst operations.
This change adds SSA opt rules to simplify:
 (CMPconst [0] (SUB x y)) => (CMP x y)

The optimizations apply to EQ, NE, ULE, and UGT comparisons, enabling
more efficient bounds checking for slice operations.
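
One way to read why exactly these four conditions are covered (my interpretation, not taken from the CL; all function names are hypothetical): with wrapping arithmetic, x-y is zero exactly when x == y, and for unsigned values "<= 0" and "> 0" are just "== 0" and "!= 0". Signed orderings do not have this property because the subtraction can overflow.

```go
package main

import "fmt"

// eqViaSub compares the difference against zero, like CMPconst [0] (SUB x y).
func eqViaSub(x, y uint64) bool { return x-y == 0 }

// uleZero and ugtZero are the unsigned <= 0 and > 0 tests on a difference.
func uleZero(d uint64) bool { return d <= 0 }
func ugtZero(d uint64) bool { return d > 0 }

// ltViaSub shows the signed case that is NOT equivalent to x < y.
func ltViaSub(x, y int64) bool { return x-y < 0 }

func main() {
	x, y := uint64(3), uint64(10)
	fmt.Println(eqViaSub(x, y) == (x == y), uleZero(x-y) == (x == y), ugtZero(x-y) == (x != y))

	// Signed overflow: the smallest int64 minus 1 wraps to a positive value.
	a, b := int64(-1<<63), int64(1)
	fmt.Println(ltViaSub(a, b), a < b)
}
```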

Code size improvement:
compile: .text:    9088004  ->  9065988 (-0.24%)
etcd:    .text:    10500276 -> 10497092 (-0.03%)

Change-Id: I467cb27674351652bcacc52b87e1f19677bd46a8
Reviewed-on: https://go-review.googlesource.com/c/go/+/679915
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
2025-07-24 12:39:53 -07:00
Paul Murphy
ee7bfbdbcc cmd/compile/internal/ssa: fix PPC64 merging of (AND (S[RL]Dconst ...)
CL 622236 forgot to check the mask was also a 32 bit rotate mask. Add
a modified version of isPPC64WordRotateMask which validates that the mask is
contiguous and fits inside a uint32.

I don't think this is possible when merging SRDconst, the first check should
always reject such combines. But, be extra careful and do it there
too.
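
A simplified sketch of the kind of check described above. It is not the compiler's helper: isContiguousWordMask is a hypothetical name, and unlike a real rotate mask check this version deliberately rejects masks that wrap around the word.

```go
package main

import "fmt"

// isContiguousWordMask reports whether m is a non-empty, contiguous run
// of 1 bits that fits inside a uint32. Hypothetical simplification:
// wrap-around masks are not accepted.
func isContiguousWordMask(m uint64) bool {
	if m == 0 || m > 0xFFFFFFFF {
		return false
	}
	low := m & -m // lowest set bit
	// Adding the lowest set bit carries through a contiguous run and
	// leaves no bit in common with m.
	return (m+low)&m == 0
}

func main() {
	fmt.Println(isContiguousWordMask(0x00FF0000))  // contiguous, 32-bit
	fmt.Println(isContiguousWordMask(0x00FF00FF))  // two separate runs
	fmt.Println(isContiguousWordMask(0x100000000)) // does not fit in uint32
}
```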

Fixes #73153

Change-Id: Ie95f74ec5e7d89dc761511126db814f886a7a435
Reviewed-on: https://go-review.googlesource.com/c/go/+/679775
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Jayanth Krishnamurthy <jayanth.krishnamurthy@ibm.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
2025-06-09 20:33:27 -07:00
Jake Bailey
27ff0f249c cmd/compile/internal/ssa: eliminate string copies for calls to unique.Make
unique.Make always copies strings passed into it, so it's safe to not
copy byte slices converted to strings either. Handle this just like map
accesses with string(b) as keys.

This CL only handles unique.Make(string(b)), not nested cases like
unique.Make([2]string{string(b1), string(b2)}); this could be done in a
followup CL but the map lookup code in walk is sufficiently different
than the call handling code that I didn't attempt it. (SSA is much
easier).
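
A minimal usage sketch of the pattern this CL targets (requires Go 1.23 or later for the unique package; the intern function is a hypothetical wrapper). Because unique.Make keeps its own copy of the string, later writes to the byte slice do not affect the handle.

```go
package main

import (
	"fmt"
	"unique"
)

// intern is a hypothetical wrapper showing the unique.Make(string(b)) shape.
func intern(b []byte) unique.Handle[string] {
	return unique.Make(string(b))
}

func main() {
	b := []byte("hello")
	h1 := intern(b)

	// Mutating the slice afterwards must not change the interned value.
	b[0] = 'j'

	h2 := intern([]byte("hello"))
	fmt.Println(h1 == h2, h1.Value())
}
```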

Fixes #71926

Change-Id: Ic2f82f2f91963d563b4ddb1282bd49fc40da8b85
Reviewed-on: https://go-review.googlesource.com/c/go/+/672135
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-21 20:20:31 -07:00
thepudds
f4de2ecffb cmd/compile/internal/walk: convert composite literals to interfaces without allocating
Today, this interface conversion causes the struct literal
to be heap allocated:

    var sink any

    func example1() {
        sink = S{1, 1}
    }

For basic literals like integers that are directly used in
an interface conversion that would otherwise allocate, the compiler
is able to use read-only global storage (see #18704).

This CL extends that to struct and array literals as well by creating
read-only global storage that is able to represent for example S{1, 1},
and then using a pointer to that storage in the interface
when the interface conversion happens.

A more challenging example is:

    func example2() {
        v := S{1, 1}
        sink = v
    }

In this case, the struct literal is not directly part of the
interface conversion, but is instead assigned to a local variable.

To still avoid heap allocation in cases like this, in walk we
construct a cache that maps from expressions used in interface
conversions to earlier expressions that can be used to represent the
same value (via ir.ReassignOracle.StaticValue). This is somewhat
analogous to how we avoided heap allocation for basic literals in
CL 649077 earlier in our stack, though here we also need to do a
little more work to create the read-only global.

CL 649076 (also earlier in our stack) added most of the tests
along with debug diagnostics in convert.go to make it easier
to test this change.

See the writeup in #71359 for details.
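
One way to observe the effect is to count allocations with testing.AllocsPerRun. This is a sketch: the definition of S is an assumption (the message does not show it), and the numbers printed depend on the Go version used, so none are given here.

```go
package main

import (
	"fmt"
	"testing"
)

// S is assumed to be a small struct; the commit message does not define it.
type S struct{ a, b int }

var sink any

// example1 converts the literal directly to an interface.
func example1() {
	sink = S{1, 1}
}

// example2 goes through a local variable first.
func example2() {
	v := S{1, 1}
	sink = v
}

func main() {
	fmt.Println("example1 allocs/op:", testing.AllocsPerRun(100, example1))
	fmt.Println("example2 allocs/op:", testing.AllocsPerRun(100, example2))
}
```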

Fixes #71359
Fixes #71323
Updates #62653
Updates #53465
Updates #8618

Change-Id: I8924f0c69ff738ea33439bd6af7b4066af493b90
Reviewed-on: https://go-review.googlesource.com/c/go/+/649555
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-05-21 12:23:26 -07:00
Junyang Shao
d6c29c7156 cmd/compile: fix offset calculation error in memcombine
Fixes #73812

Change-Id: If7a6e103ae9e1442a2cf4a3c6b1270b6a1887196
Reviewed-on: https://go-review.googlesource.com/c/go/+/675175
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-21 12:17:08 -07:00
Xiaolin Zhao
4ce1c8e9e1 cmd/compile: add rules about ORN and ANDN
Reduce the number of go toolchain instructions on loong64 as follows.

    file      before    after     Δ       %
    addr2line 279880    279776  -104   -0.0372%
    asm       556638    556410  -228   -0.0410%
    buildid   272272    272072  -200   -0.0735%
    cgo       481522    481318  -204   -0.0424%
    compile   2457788   2457580 -208   -0.0085%
    covdata   323384    323280  -104   -0.0322%
    cover     518450    518234  -216   -0.0417%
    dist      340790    340686  -104   -0.0305%
    distpack  282456    282252  -204   -0.0722%
    doc       789932    789688  -244   -0.0309%
    fix       324332    324228  -104   -0.0321%
    link      704622    704390  -232   -0.0329%
    nm        277132    277028  -104   -0.0375%
    objdump   507862    507758  -104   -0.0205%
    pack      221774    221674  -100   -0.0451%
    pprof     1469816   1469552 -264   -0.0180%
    test2json 254836    254732  -104   -0.0408%
    trace     1100002   1099738 -264   -0.0240%
    vet       781078    780874  -204   -0.0261%
    go        1529116   1528848 -268   -0.0175%
    gofmt     318556    318448  -108   -0.0339%
    total     13792238 13788566 -3672  -0.0266%

Change-Id: I23fb3ebd41309252c7075e57ea7094e79f8c4fef
Reviewed-on: https://go-review.googlesource.com/c/go/+/674335
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
2025-05-21 08:28:37 -07:00
Xiaolin Zhao
d37a1bdd48 cmd/compile: fix the implementation of NORconst on loong64
In the loong64 instruction set, there is no NORI instruction,
so the immediate value in NORconst needs to be stored in a register
and then used with the three-register NOR instruction.
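
For readers unfamiliar with the operation, NOR of x with a constant c is simply ^(x | c). A tiny sketch (norConst and the constant 0xFF are made up for illustration):

```go
package main

import "fmt"

// norConst computes NOR of x with a constant. On loong64 the constant
// would first be materialized in a register, per the message above.
func norConst(x uint64) uint64 {
	const c = 0xFF // hypothetical immediate
	return ^(x | c)
}

func main() {
	fmt.Printf("%#x\n", norConst(0x1200))
}
```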

Change-Id: I5ef697450619317218cb3ef47fc07e238bdc2139
Reviewed-on: https://go-review.googlesource.com/c/go/+/673836
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-20 20:24:09 -07:00
Junyang Shao
113b25774e cmd/compile: memcombine different size stores
This CL implements the TODO in combineStores to allow combining
stores of different sizes, as long as the total size aligns to
2, 4, 8.
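
An example of the shape of code this is aimed at, as I understand it (a sketch; whether this exact function is merged into one store by the compiler is an assumption, and the function names are hypothetical). One 2-byte store and two 1-byte stores cover 4 bytes in total, the same bytes as a single 32-bit store.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// storeMixed writes x with stores of sizes 2, 1 and 1.
func storeMixed(b []byte, x uint32) {
	_ = b[3] // single bounds check up front
	binary.LittleEndian.PutUint16(b, uint16(x))
	b[2] = byte(x >> 16)
	b[3] = byte(x >> 24)
}

// storeWhole writes the same bytes with one 4-byte store.
func storeWhole(b []byte, x uint32) {
	binary.LittleEndian.PutUint32(b, x)
}

func main() {
	var m, w [4]byte
	storeMixed(m[:], 0xDEADBEEF)
	storeWhole(w[:], 0xDEADBEEF)
	fmt.Println(m == w)
}
```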

Fixes #72832.

Change-Id: I6d1d471335da90d851ad8f3b5a0cf10bdcfa17c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/661855
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-20 13:00:16 -07:00
Julian Zhu
dfebef1c04 cmd/compile: fold negation into addition/subtraction on arm64
Fold negation into addition/subtraction and avoid double negation.
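
The algebra being exploited, written out as Go (a sketch with hypothetical function names). The identities hold under wrapping integer arithmetic, including at the minimum int64 value.

```go
package main

import "fmt"

// addNeg is a + (-b), which can be emitted as a single subtraction.
func addNeg(a, b int64) int64 { return a + (-b) }

// subNeg is a - (-b), which can be emitted as a single addition.
func subNeg(a, b int64) int64 { return a - (-b) }

// negNeg is -(-a), which needs no instruction at all.
func negNeg(a int64) int64 { return -(-a) }

func main() {
	a, b := int64(10), int64(3)
	fmt.Println(addNeg(a, b) == a-b, subNeg(a, b) == a+b, negNeg(a) == a)
}
```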

platform: linux/arm64

file      before    after     Δ       %
addr2line 3628108   3628116   +8      +0.000%
asm       6208353   6207857   -496    -0.008%
buildid   3460682   3460418   -264    -0.008%
cgo       5572988   5572492   -496    -0.009%
compile   26042159  26041039  -1120   -0.004%
cover     6304328   6303472   -856    -0.014%
dist      4139330   4139098   -232    -0.006%
doc       9429305   9428065   -1240   -0.013%
fix       3997189   3996733   -456    -0.011%
link      8212128   8210280   -1848   -0.023%
nm        3620056   3619696   -360    -0.010%
objdump   5920289   5919233   -1056   -0.018%
pack      2892250   2891778   -472    -0.016%
pprof     17094569  17092745  -1824   -0.011%
test2json 3335825   3335529   -296    -0.009%
trace     15842080  15841456  -624    -0.004%
vet       9472194   9471106   -1088   -0.011%
go        19081541  19081509  -32     -0.000%
total     154253374 154240622 -12752  -0.008%

platform: darwin/arm64

file    before    after     Δ       %
compile 27152002  27135490  -16512  -0.061%
link    8372914   8356402   -16512  -0.197%
go      19154802  19154778  -24     -0.000%
total   157734180 157701132 -33048  -0.021%

Change-Id: I15a349bfbaf7333ec3e4a62ae4d06f3f371dfb1d
Reviewed-on: https://go-review.googlesource.com/c/go/+/673715
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-20 11:08:28 -07:00
Keith Randall
3baf53aec6 cmd/compile: derive bounds on signed %N for N a power of 2
-N+1 <= x % N <= N-1

This is useful for cases like:

func setBit(b []byte, i int) {
    b[i/8] |= 1<<(i%8)
}

The shift does not need protection against larger-than-7 cases.
(It does still need protection against <0 cases.)
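
The bound can be checked directly. In this sketch modInBounds is a hypothetical helper and N is fixed at 8 to match the setBit example; Go's % truncates toward zero, so the result takes the sign of the dividend.

```go
package main

import "fmt"

const N = 8

// modInBounds reports whether -N+1 <= i % N <= N-1.
func modInBounds(i int) bool {
	r := i % N
	return -N+1 <= r && r <= N-1
}

// setBit is the function from the commit message.
func setBit(b []byte, i int) {
	b[i/8] |= 1 << (i % 8)
}

func main() {
	ok := true
	for i := -1000; i <= 1000; i++ {
		ok = ok && modInBounds(i)
	}
	fmt.Println(ok)

	b := make([]byte, 2)
	setBit(b, 10) // byte 1, bit 2
	fmt.Println(b)
}
```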

Change-Id: Idf83101386af538548bfeb6e2928cea855610ce2
Reviewed-on: https://go-review.googlesource.com/c/go/+/672995
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-05-19 15:21:54 -07:00
Julian Zhu
d52679006c cmd/compile: fold negation into addition/subtraction on mipsx
Fold negation into addition/subtraction and avoid double negation.

file      before    after     Δ       %
addr2line 3742022   3741986   -36     -0.001%
asm       6668616   6668628   +12     +0.000%
buildid   3583786   3583630   -156    -0.004%
cgo       6020370   6019634   -736    -0.012%
compile   29416016  29417336  +1320   +0.004%
cover     6801903   6801675   -228    -0.003%
dist      4485916   4485816   -100    -0.002%
doc       10652787  10652251  -536    -0.005%
fix       4115988   4115560   -428    -0.010%
link      9002328   9001616   -712    -0.008%
nm        3733148   3732780   -368    -0.010%
objdump   6163292   6163068   -224    -0.004%
pack      2944768   2944604   -164    -0.006%
pprof     18909973  18908773  -1200   -0.006%
test2json 3394662   3394778   +116    +0.003%
trace     17350911  17349751  -1160   -0.007%
vet       10077727  10077527  -200    -0.002%
go        19118769  19118609  -160    -0.001%
total     166182982 166178022 -4960   -0.003%

Change-Id: Id55698800fd70f3cb2ff48393584456b87208921
Reviewed-on: https://go-review.googlesource.com/c/go/+/673556
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-05-19 11:27:35 -07:00
Julian Zhu
8097cf14d2 cmd/compile: fold negation into addition/subtraction on mips64x
Fold negation into addition/subtraction and avoid double negation.

file      before    after     Δ       %
addr2line 4007310   4007470   +160    +0.004%
asm       7007636   7007436   -200    -0.003%
buildid   3839268   3838972   -296    -0.008%
cgo       6353466   6352738   -728    -0.011%
compile   30426920  30426896  -24     -0.000%
cover     7005408   7004744   -664    -0.009%
dist      4651192   4650872   -320    -0.007%
doc       10606050  10606034  -16     -0.000%
fix       4446414   4446390   -24     -0.001%
link      9237736   9237024   -712    -0.008%
nm        3999107   3999323   +216    +0.005%
objdump   6762424   6762144   -280    -0.004%
pack      3270757   3270493   -264    -0.008%
pprof     19428299  19361939  -66360  -0.342%
test2json 3717345   3717217   -128    -0.003%
trace     17382273  17381657  -616    -0.004%
vet       10689481  10688985  -496    -0.005%
go        19118769  19118609  -160    -0.001%
total     171949855 171878943 -70912  -0.041%

Change-Id: I35c1f264d216c214ea3f56252a9ddab8ea850fa6
Reviewed-on: https://go-review.googlesource.com/c/go/+/673555
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-05-16 11:06:06 -07:00
Keith Randall
d681270714 cmd/compile: allow load-op merging in additional situations
x += *p

We want to do this with a single load+add operation on amd64.
The tricky part is that we don't want to combine if there are
other uses of x after this instruction.

Implement a simple detector that seems to capture a common situation -
x += *p is in a loop, and the other use of x is after loop exit.
In that case, it does not hurt to do the load+add combo.
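
A sketch of the loop shape described above (sum is a hypothetical example, not a test from the CL). The accumulator is updated from memory inside the loop and is otherwise only used after the loop exits. The generated code can be inspected with go build -gcflags=-S.

```go
package main

import "fmt"

// sum accumulates x += p[i] in a loop; the only other use of x is the
// return after the loop.
func sum(p []int64) int64 {
	var x int64
	for i := range p {
		x += p[i]
	}
	return x
}

func main() {
	fmt.Println(sum([]int64{1, 2, 3, 4}))
}
```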

Change-Id: I466174cce212e78bde83f908cc1f2752b560c49c
Reviewed-on: https://go-review.googlesource.com/c/go/+/672957
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-15 15:21:36 -07:00
Keith Randall
19f05770b0 cmd/compile: schedule induction variable increments late
for ..; ..; i++ {
 ...
}

We want to schedule the i++ late in the block, so that all other
uses of i in the block are scheduled first. That way, i++ can
happen in place in a register instead of requiring a temporary register.

Change-Id: Id777407c7e67a5ddbd8e58251099b0488138c0df
Reviewed-on: https://go-review.googlesource.com/c/go/+/672998
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-05-15 14:06:41 -07:00
Xiaolin Zhao
c31a5c571f cmd/compile: fold negation into addition/subtraction on loong64
This change also avoids double negation, and adds loong64 codegen for arithmetic tests.
Reduce the number of go toolchain instructions on loong64 as follows.

    file      before    after     Δ       %
    addr2line 279972    279896  -76    -0.0271%
    asm       556390    556310  -80    -0.0144%
    buildid   272376    272300  -76    -0.0279%
    cgo       481534    481550  +16    +0.0033%
    compile   2457992   2457396 -596   -0.0242%
    covdata   323488    323404  -84    -0.0260%
    cover     518630    518490  -140   -0.0270%
    dist      340894    340814  -80    -0.0235%
    distpack  282568    282484  -84    -0.0297%
    doc       790224    789984  -240   -0.0304%
    fix       324408    324348  -60    -0.0185%
    link      704910    704666  -244   -0.0346%
    nm        277220    277144  -76    -0.0274%
    objdump   508026    507878  -148   -0.0291%
    pack      221810    221786  -24    -0.0108%
    pprof     1470284   1469880 -404   -0.0275%
    test2json 254896    254852  -44    -0.0173%
    trace     1100390   1100074 -316   -0.0287%
    vet       781398    781142  -256   -0.0328%
    go        1529668   1529128 -540   -0.0353%
    gofmt     318668    318568  -100   -0.0314%
    total     13795746 13792094 -3652  -0.0265%

Change-Id: I88d1f12cfc4be0e92687c48e06a57213aa484aca
Reviewed-on: https://go-review.googlesource.com/c/go/+/672555
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-05-14 17:46:58 -07:00
Jakub Ciolek
c9d0fad5cb cmd/compile: add 2 phiopt cases
Add 2 more cases:

if a { x = value } else { x = a } => x = a && value
if a { x = a } else { x = value } => x = a || value
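
The two equivalences can be checked exhaustively, since there are only four input combinations (function names below are hypothetical):

```go
package main

import "fmt"

// andBranch is: if a { x = value } else { x = a }
func andBranch(a, value bool) bool {
	var x bool
	if a {
		x = value
	} else {
		x = a
	}
	return x
}

// orBranch is: if a { x = a } else { x = value }
func orBranch(a, value bool) bool {
	var x bool
	if a {
		x = a
	} else {
		x = value
	}
	return x
}

func main() {
	for _, a := range []bool{false, true} {
		for _, v := range []bool{false, true} {
			fmt.Println(andBranch(a, v) == (a && v), orBranch(a, v) == (a || v))
		}
	}
}
```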

AND case goes from:

00006 (8)	TESTB	AX, AX
00007 (8)	JNE	9
00008 (13)	MOVL	AX, BX
00009 (13)	MOVL	BX, AX
00010 (13)	RET

to:

00006 (13)	ANDL	BX, AX
00007 (13)	RET

OR goes from:

00006 (19)	TESTB	AX, AX
00007 (19)	JNE	9
00008 (24)	MOVL	BX, AX
00009 (24)	RET

to:

00006 (24)	ORL	BX, AX
00007 (24)	RET

compilecmp linux/amd64:

runtime
runtime.lock2 847 -> 869  (+2.60%)
runtime.addspecial 542 -> 517  (-4.61%)
runtime.tracebackPCs changed
runtime.scanstack changed
runtime.mallocinit changed
runtime.traceback2 2238 -> 2206  (-1.43%)

runtime [cmd/compile]
runtime.lock2 860 -> 882  (+2.56%)
runtime.scanstack changed
runtime.addspecial 542 -> 517  (-4.61%)
runtime.traceback2 2238 -> 2206  (-1.43%)
runtime.lockWithRank 870 -> 890  (+2.30%)
runtime.tracebackPCs changed
runtime.mallocinit changed

strconv
strconv.ryuFtoaFixed32 changed
strconv.ryuFtoaFixed64 639 -> 638  (-0.16%)
strconv.readFloat changed
strconv.ryuFtoaShortest changed

strings
strings.(*Replacer).build changed

strconv [cmd/compile]
strconv.readFloat changed
strconv.ryuFtoaFixed64 639 -> 638  (-0.16%)
strconv.ryuFtoaFixed32 changed
strconv.ryuFtoaShortest changed

strings [cmd/compile]
strings.(*Replacer).build changed

regexp
regexp.makeOnePass.func1 changed

regexp [cmd/compile]
regexp.makeOnePass.func1 changed

encoding/json
encoding/json.indirect changed

database/sql
database/sql.driverArgsConnLocked changed

vendor/golang.org/x/text/unicode/norm
vendor/golang.org/x/text/unicode/norm.Form.transform changed

go/doc/comment
go/doc/comment.parseSpans changed

internal/diff
internal/diff.tgs changed

log/slog
log/slog.(*handleState).appendNonBuiltIns 1898 -> 1877  (-1.11%)

testing/fstest
testing/fstest.(*fsTester).checkGlob changed

runtime/pprof
runtime/pprof.(*profileBuilder).build changed

cmd/internal/dwarf
cmd/internal/dwarf.isEmptyInlinedCall 254 -> 244  (-3.94%)

go/printer
go/printer.keepTypeColumn 302 -> 270  (-10.60%)
go/printer.(*printer).binaryExpr changed

cmd/compile/internal/syntax
cmd/compile/internal/syntax.(*scanner).rune changed
cmd/compile/internal/syntax.(*scanner).number 2137 -> 2153  (+0.75%)

Change-Id: I7f95f54b03a35d0b616c40f38b415a7feb71be73
Reviewed-on: https://go-review.googlesource.com/c/go/+/666835
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
TryBot-Bypass: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-08 10:18:37 -07:00
Keith Randall
12110c3f7e cmd/compile: improve multiplication strength reduction
Use an automatic algorithm to generate strength reduction code.
You give it all the linear combination (a*x+b*y) instructions in your
architecture, it figures out the rest.

Just amd64 and arm64 for now.
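
To make "linear combination instructions" concrete, here is a hand-written sketch. The helper lea is hypothetical and loosely models an amd64 LEA computing a + b*scale with scale in {1, 2, 4, 8}; the sequences below were chosen by hand and are not necessarily what the algorithm produces.

```go
package main

import "fmt"

// lea models one linear-combination instruction: a + b*scale.
// Hypothetical helper; scale is expected to be 1, 2, 4 or 8.
func lea(a, b, scale int64) int64 { return a + b*scale }

// mul5 computes 5x as x + 4x.
func mul5(x int64) int64 { return lea(x, x, 4) }

// mul45 computes 45x as 9x + 4*(9x), with 9x = x + 8x.
func mul45(x int64) int64 {
	t := lea(x, x, 8) // 9x
	return lea(t, t, 4)
}

// mul120 computes 120x as (15x) << 3, with 3x = x + 2x and 15x = 3x + 4*(3x).
func mul120(x int64) int64 {
	t := lea(x, x, 2) // 3x
	t = lea(t, t, 4)  // 15x
	return t << 3
}

func main() {
	x := int64(7)
	fmt.Println(mul5(x) == 5*x, mul45(x) == 45*x, mul120(x) == 120*x)
}
```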

Fixes #67575

Change-Id: I35c69382bebb1d2abf4bb4e7c43fd8548c6c59a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/626998
Reviewed-by: Jakub Ciolek <jakub@ciolek.dev>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-01 09:33:31 -07:00
Joel Sing
4d10d4ad84 cmd/compile,internal/cpu,runtime: intrinsify math/bits.OnesCount on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.OnesCount
using the CPOP/CPOPW machine instructions. Since the native Go
implementation of OnesCount is relatively expensive, it is also
worth emitting a check for Zbb support when compiled for rva20u64.

On a Banana Pi F3, with GORISCV64=rva22u64:

              │     oc.1     │                oc.2                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   4.389n ± 0%  -74.08% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.016n ± 0%  -11.10% (p=0.000 n=10)
OnesCount16-8    9.404n ± 0%   5.015n ± 0%  -46.67% (p=0.000 n=10)
OnesCount32-8   13.165n ± 0%   4.388n ± 0%  -66.67% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   4.388n ± 0%  -73.08% (p=0.000 n=10)
geomean          11.40n        4.629n       -59.40%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 and with Zbb
detection enabled:

              │     oc.3     │                oc.4                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   5.643n ± 0%  -66.67% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.642n ± 0%        ~ (p=0.447 n=10)
OnesCount16-8   10.030n ± 0%   6.896n ± 0%  -31.25% (p=0.000 n=10)
OnesCount32-8   13.170n ± 0%   5.642n ± 0%  -57.16% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   5.642n ± 0%  -65.39% (p=0.000 n=10)
geomean          11.55n        5.873n       -49.16%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 but with Zbb
detection disabled:

              │    oc.3     │                oc.5                 │
              │   sec/op    │   sec/op     vs base                │
OnesCount-8     16.93n ± 0%   29.47n ± 0%  +74.07% (p=0.000 n=10)
OnesCount8-8    5.642n ± 0%   5.643n ± 0%        ~ (p=0.191 n=10)
OnesCount16-8   10.03n ± 0%   15.05n ± 0%  +50.05% (p=0.000 n=10)
OnesCount32-8   13.17n ± 0%   18.18n ± 0%  +38.04% (p=0.000 n=10)
OnesCount64-8   16.30n ± 0%   21.94n ± 0%  +34.60% (p=0.000 n=10)
geomean         11.55n        15.84n       +37.16%

For hardware without Zbb, this adds ~5ns overhead, while for hardware
with Zbb we achieve a performance gain of up to 11ns. It is worth
noting that OnesCount8 is cheap enough that it is preferable to stick
with the generic version in this case.

Change-Id: Id657e40e0dd1b1ab8cc0fe0f8a68df4c9f2d7da5
Reviewed-on: https://go-review.googlesource.com/c/go/+/660856
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-01 05:57:41 -07:00
Joel Sing
90e8b8cdae cmd/compile: intrinsify math/bits.Bswap on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.Bswap
using the REV8 machine instruction.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                 │     rb.1     │                rb.2                 │
                 │    sec/op    │   sec/op     vs base                │
ReverseBytes-4     18.790n ± 0%   4.026n ± 0%  -78.57% (p=0.000 n=10)
ReverseBytes16-4    6.710n ± 0%   5.368n ± 0%  -20.00% (p=0.000 n=10)
ReverseBytes32-4   13.420n ± 0%   5.368n ± 0%  -60.00% (p=0.000 n=10)
ReverseBytes64-4   17.450n ± 0%   4.026n ± 0%  -76.93% (p=0.000 n=10)
geomean             13.11n        4.649n       -64.54%

Change-Id: I26eee34270b1721f7304bb1cddb0fda129b20ece
Reviewed-on: https://go-review.googlesource.com/c/go/+/660855
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-05-01 05:57:13 -07:00
Keith Randall
7d0cb2a2ad cmd/compile: constant fold 128-bit multiplies
The full 64x64->128 multiply comes up when using bits.Mul64.
The 64x64->64+overflow multiply comes up in unsafe.Slice when using
a constant length.
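
A minimal sketch of the two source patterns mentioned above, assuming
ordinary user code: bits.Mul64 called with constant operands, and
unsafe.Slice called with a constant length. The helper names are
hypothetical. Whether a given call is actually folded depends on the
compiler version.

```go
package main

import (
	"fmt"
	"math/bits"
	"unsafe"
)

// wideProduct multiplies two constants using the full 64x64->128 multiply.
// 2^63 * 4 = 2^65, so the high word is 2 and the low word is 0.
func wideProduct() (hi, lo uint64) {
	return bits.Mul64(1<<63, 4)
}

// firstFour builds a slice with a constant length; computing its size
// involves an element-size times length multiply with an overflow check.
func firstFour(p *[8]int32) []int32 {
	return unsafe.Slice(&p[0], 4)
}

func main() {
	hi, lo := wideProduct()
	fmt.Println(hi, lo)

	arr := [8]int32{10, 20, 30, 40, 50, 60, 70, 80}
	fmt.Println(firstFour(&arr))
}
```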

Change-Id: I298515162ca07d804b2d699d03bc957ca30a4ebc
Reviewed-on: https://go-review.googlesource.com/c/go/+/667175
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-22 10:24:18 -07:00
Keith Randall
8af32240c6 cmd/compile: don't evaluate side effects of range over array
If the thing we're ranging over is an array or ptr to array, and
it doesn't have a function call or channel receive in it, then we
shouldn't evaluate it.

Typecheck the ranged-over value as a constant in that case.
That makes the unified exporter replace the range expression
with a constant int.
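
The language rule behind this can be seen from user code. In the sketch
below (hypothetical helper names), ranging over a nil pointer to an array
with only an index variable does not dereference the pointer, because the
length is a constant. A range expression containing a function call is still
evaluated, exactly once.

```go
package main

import "fmt"

// sumIndices ranges over a nil *[4]int. With at most one iteration
// variable and a constant length, the range expression is not evaluated,
// so there is no nil dereference.
func sumIndices() int {
	var p *[4]int
	n := 0
	for i := range p {
		n += i
	}
	return n
}

var calls int

// makeArray has a visible side effect so its evaluation can be counted.
func makeArray() [3]int {
	calls++
	return [3]int{}
}

// countIterations ranges over a function call result; the call still runs.
func countIterations() int {
	n := 0
	for range makeArray() {
		n++
	}
	return n
}

func main() {
	fmt.Println(sumIndices())
	fmt.Println(countIterations(), calls)
}
```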

Change-Id: I0d4ea081de70d20cf6d1fa8d25ef6cb021975554
Reviewed-on: https://go-review.googlesource.com/c/go/+/659317
Reviewed-by: Junyang Shao <shaojunyang@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
2025-04-21 15:50:43 -07:00