Commit graph

37 commits

Author SHA1 Message Date
David Chase
00a8dacbe4 [dev.simd] cmd/compile: remove unused simd intrinsics "helpers"
turns out they weren't helpful enough.

Change-Id: I4fa99dc0e7513f25acaddd7fb06451b0134172b9
Reviewed-on: https://go-review.googlesource.com/c/go/+/681498
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-06-13 14:25:44 -07:00
David Chase
04b1030ae4 [dev.simd] cmd/compile: adapters for simd
This combines several CLs into a single patch of "glue"
for the generated SIMD extensions.

This glue includes GOEXPERIMENT checks that disable
the creation of user-visible "simd" types and
that disable the registration of "simd" intrinsics.

The simd type checks were changed to work for either
package "simd" or "internal/simd" so that moving that
package won't be quite so fragile.

cmd/compile, internal/simd: glue for adding SIMD extensions to Go
cmd/compile: theft of Cherry's sample SIMD compilation

Change-Id: Id44e2f4bafe74032c26de576a8691b6f7d977e01
Reviewed-on: https://go-review.googlesource.com/c/go/+/675598
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-05-28 08:46:30 -07:00
Julian Zhu
3dbc775d60 cmd/compile/internal: intrinsify publicationBarrier on mipsx
This enables publicationBarrier to be used as an intrinsic on mipsx.

Change-Id: Ic199f34b84b3058bcfab79aac8f2399ff21a97ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/674856
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-05-21 12:08:18 -07:00
Julian Zhu
94e3caeec1 cmd/compile/internal: intrinsify publicationBarrier on mips64x
This enables publicationBarrier to be used as an intrinsic on mips64x.

Change-Id: I4030ea65086c37ee1dcc1675d0d5d40ef8683851
Reviewed-on: https://go-review.googlesource.com/c/go/+/674855
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2025-05-21 12:06:44 -07:00
Guoqi Chen
da9c5b142c cmd/compile: add prefetch intrinsic support on loong64
This CL enables intrinsic support to emit the following prefetch
instructions for loong64 platform:
  1.Prefetch - prefetches data from memory address to cache;
  2.PrefetchStreamed - prefetches data from memory address, with a
    hint that this data is being streamed.

Benchmarks picked from go/test/bench/garbage
Parameters tested with:
GOMAXPROCS=8
tree2 -heapsize=1000000000 -cpus=8
tree -n=18
parser
peano

Benchmarks Loongson-3A6000-HV @ 2500.00MHz:
         |   bench.old   |              bench.new               |
         |    sec/op     |    sec/op      vs base               |
Tree2-8    1238.2µ ± 24%   999.9µ ± 453%       ~ (p=0.089 n=10)
Tree-8      277.4m ±  1%   275.5m ±   1%       ~ (p=0.063 n=10)
Parser-8     3.564 ±  0%    3.509 ±   1%  -1.56% (p=0.000 n=10)
Peano-8     39.12m ±  2%   38.85m ±   2%       ~ (p=0.353 n=10)
geomean     83.19m         78.28m         -5.90%

Change-Id: I59e9aa4f609a106d4f70706e6d6d1fe6738ab72a
Reviewed-on: https://go-review.googlesource.com/c/go/+/671876
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-05-19 00:27:10 -07:00
Joel Sing
4d10d4ad84 cmd/compile,internal/cpu,runtime: intrinsify math/bits.OnesCount on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.OnesCount
using the CPOP/CPOPW machine instructions. Since the native Go
implementation of OnesCount is relatively expensive, it is also
worth emitting a check for Zbb support when compiled for rva20u64.

On a Banana Pi F3, with GORISCV64=rva22u64:

              │     oc.1     │                oc.2                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   4.389n ± 0%  -74.08% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.016n ± 0%  -11.10% (p=0.000 n=10)
OnesCount16-8    9.404n ± 0%   5.015n ± 0%  -46.67% (p=0.000 n=10)
OnesCount32-8   13.165n ± 0%   4.388n ± 0%  -66.67% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   4.388n ± 0%  -73.08% (p=0.000 n=10)
geomean          11.40n        4.629n       -59.40%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 and with Zbb
detection enabled:

              │     oc.3     │                oc.4                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   5.643n ± 0%  -66.67% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.642n ± 0%        ~ (p=0.447 n=10)
OnesCount16-8   10.030n ± 0%   6.896n ± 0%  -31.25% (p=0.000 n=10)
OnesCount32-8   13.170n ± 0%   5.642n ± 0%  -57.16% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   5.642n ± 0%  -65.39% (p=0.000 n=10)
geomean          11.55n        5.873n       -49.16%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 but with Zbb
detection disabled:

              │    oc.3     │                oc.5                 │
              │   sec/op    │   sec/op     vs base                │
OnesCount-8     16.93n ± 0%   29.47n ± 0%  +74.07% (p=0.000 n=10)
OnesCount8-8    5.642n ± 0%   5.643n ± 0%        ~ (p=0.191 n=10)
OnesCount16-8   10.03n ± 0%   15.05n ± 0%  +50.05% (p=0.000 n=10)
OnesCount32-8   13.17n ± 0%   18.18n ± 0%  +38.04% (p=0.000 n=10)
OnesCount64-8   16.30n ± 0%   21.94n ± 0%  +34.60% (p=0.000 n=10)
geomean         11.55n        15.84n       +37.16%

For hardware without Zbb, this adds ~5ns overhead, while for hardware
with Zbb we achieve a performance gain up of up to 11ns. It is worth
noting that OnesCount8 is cheap enough that it is preferable to stick
with the generic version in this case.

Change-Id: Id657e40e0dd1b1ab8cc0fe0f8a68df4c9f2d7da5
Reviewed-on: https://go-review.googlesource.com/c/go/+/660856
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-01 05:57:41 -07:00
Joel Sing
90e8b8cdae cmd/compile: intrinsify math/bits.Bswap on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.Bswap
using the REV8 machine instruction.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                 │     rb.1     │                rb.2                 │
                 │    sec/op    │   sec/op     vs base                │
ReverseBytes-4     18.790n ± 0%   4.026n ± 0%  -78.57% (p=0.000 n=10)
ReverseBytes16-4    6.710n ± 0%   5.368n ± 0%  -20.00% (p=0.000 n=10)
ReverseBytes32-4   13.420n ± 0%   5.368n ± 0%  -60.00% (p=0.000 n=10)
ReverseBytes64-4   17.450n ± 0%   4.026n ± 0%  -76.93% (p=0.000 n=10)
geomean             13.11n        4.649n       -64.54%

Change-Id: I26eee34270b1721f7304bb1cddb0fda129b20ece
Reviewed-on: https://go-review.googlesource.com/c/go/+/660855
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-05-01 05:57:13 -07:00
Joel Sing
b70244ff7a cmd/compile: intrinsify math/bits.Len on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.Len using the
CLZ/CLZW machine instructions.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                 │   clz.b.1   │               clz.b.2               │
                 │   sec/op    │   sec/op     vs base                │
LeadingZeros-4     28.89n ± 0%   12.08n ± 0%  -58.19% (p=0.000 n=10)
LeadingZeros8-4    18.79n ± 0%   14.76n ± 0%  -21.45% (p=0.000 n=10)
LeadingZeros16-4   25.27n ± 0%   14.76n ± 0%  -41.59% (p=0.000 n=10)
LeadingZeros32-4   25.12n ± 0%   12.08n ± 0%  -51.92% (p=0.000 n=10)
LeadingZeros64-4   25.89n ± 0%   12.08n ± 0%  -53.35% (p=0.000 n=10)
geomean            24.55n        13.09n       -46.70%

Change-Id: I0dda684713dbdf5336af393f5ccbdae861c4f694
Reviewed-on: https://go-review.googlesource.com/c/go/+/652321
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-03-21 18:21:44 -07:00
Joel Sing
03cb8d408e cmd/compile/internal/ssagen: use an alias for math/bits.OnesCount
Currently, only amd64 has an intrinsic for math/bits.OnesCount, which
generates the same code as math/bits.OnesCount64. Replace this with
an alias that maps math/bits.OnesCount to math/bits.OnesCount64 on
64 bit platforms.

Change-Id: Ifa12a2173a201aacd52c3c22b9a948be6e314405
Reviewed-on: https://go-review.googlesource.com/c/go/+/659215
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-03-20 08:15:01 -07:00
Joel Sing
6fb7bdc96d cmd/compile: intrinsify math/bits.TrailingZeros on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.TrailingZeros
using the CTZ/CTZW machine instructions.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                  │   ctz.b.1    │               ctz.b.2               │
                  │    sec/op    │   sec/op     vs base                │
TrailingZeros-4     25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
TrailingZeros8-4     14.76n ± 0%   10.74n ± 0%  -27.24% (p=0.000 n=10)
TrailingZeros16-4    26.84n ± 0%   10.74n ± 0%  -59.99% (p=0.000 n=10)
TrailingZeros32-4   25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
TrailingZeros64-4   25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
geomean              23.09n        9.035n       -60.88%

Change-Id: I71edf2b988acb7a68e797afda4ee66d7a57d587e
Reviewed-on: https://go-review.googlesource.com/c/go/+/652320
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
2025-03-15 19:07:53 -07:00
Joel Sing
7e8ceadf85 cmd/compile/internal/ssagen: use an alias for math/bits.Len
Rather than using a specific intrinsic for math/bits.Len, use a pair of
aliases instead. This requires less code and automatically adapts when
platforms have a math/bits.Len32 or math/bits.Len64 intrinsic.

Change-Id: I28b300172daaee26ef82a7530d9e96123663f541
Reviewed-on: https://go-review.googlesource.com/c/go/+/656995
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Auto-Submit: Jorropo <jorropo.pgm@gmail.com>
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
2025-03-12 09:04:33 -07:00
Russ Cox
c9b07e8871 cmd/compile: use FMA on plan9, and drop UseFMA
Every OS uses FMA now.

Change-Id: Ia7ffa77c52c45aefca611ddc54e9dfffb27a48da
Reviewed-on: https://go-review.googlesource.com/c/go/+/655877
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-03-12 05:40:34 -07:00
Joel Sing
927fdb7843 cmd/compile: simplify intrinsification of TrailingZeros16 and TrailingZeros8
Decompose Ctz16 and Ctz8 within the SSA rules for LOONG64, MIPS, PPC64
and S390X, rather than having a custom intrinsic. Note that for PPC64 this
actually allows the existing Ctz16 and Ctz8 rules to be used.

Change-Id: I27a5e978f852b9d75396d2a80f5d7dfcb5ef7dd4
Reviewed-on: https://go-review.googlesource.com/c/go/+/651816
Reviewed-by: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2025-02-27 03:45:44 -08:00
Joel Sing
1b1c6b838e cmd/compile: simplify intrinsification of BitLen16 and BitLen8
Decompose BitLen16 and BitLen8 within the SSA rules for architectures that
support BitLen32 or BitLen64, rather than having a custom intrinsic.

Change-Id: Ie4188ce69d1021e63cec27a8e7418efb0714812b
Reviewed-on: https://go-review.googlesource.com/c/go/+/651817
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-02-26 02:02:07 -08:00
Michael Pratt
dce30a1920 cmd/compile: intrinsify swissmap match calls with SIMD on amd64
Use similar SIMD operations to the ones used in Abseil. We still
using 8-slot groups (even though the XMM registers could handle 16-slot
groups) to keep the implementation simpler (no changes to the memory
layout of maps).

Still, the implementations of matchH2 and matchEmpty are shorter than
the portable version using standard arithmetic operations. They also
return a packed bitset, which avoids the need to shift in bitset.first.

That said, the packed bitset is a downside in cognitive complexity, as
we have to think about two different possible representations. This
doesn't leak out of the API, but we do need to intrinsify bitset to
switch to a compatible implementation.

The compiler's intrinsics don't support intrinsifying methods, so the
implementations move to free functions.

This makes operations between 0-3% faster on my machine. e.g.,

MapGetHit/impl=runtimeMap/t=Int64/len=6-12                      12.34n ±  1%   11.42n ± 1%   -7.46% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=12-12                     15.14n ±  2%   14.88n ± 1%   -1.72% (p=0.009 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=18-12                     15.04n ±  6%   14.66n ± 2%   -2.53% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=24-12                     15.80n ±  1%   15.48n ± 3%        ~ (p=0.444 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=30-12                     15.55n ±  4%   14.77n ± 3%   -5.02% (p=0.004 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=64-12                     15.26n ±  1%   15.05n ± 1%        ~ (p=0.055 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=128-12                    15.34n ±  1%   15.02n ± 2%   -2.09% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=256-12                    15.42n ±  1%   15.15n ± 1%   -1.75% (p=0.001 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=512-12                    15.48n ±  1%   15.18n ± 1%   -1.94% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=1024-12                   17.38n ±  1%   17.05n ± 1%   -1.90% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=2048-12                   17.96n ±  0%   17.59n ± 1%   -2.06% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=4096-12                   18.36n ±  1%   18.18n ± 1%   -0.98% (p=0.013 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=8192-12                   18.75n ±  0%   18.31n ± 1%   -2.35% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=65536-12                  26.25n ±  0%   25.95n ± 1%   -1.14% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=262144-12                 44.24n ±  1%   44.06n ± 1%        ~ (p=0.181 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=1048576-12                85.02n ±  0%   85.35n ± 0%   +0.39% (p=0.032 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=4194304-12                98.87n ±  1%   98.85n ± 1%        ~ (p=0.799 n=25)

For #54766.

Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10,gotip-linux-amd64-goamd64v3
Change-Id: Ic1b852f02744404122cb3672900fd95f4625905e
Reviewed-on: https://go-review.googlesource.com/c/go/+/626277
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2024-11-20 22:41:14 +00:00
Guoqi Chen
fe7d97d032 cmd/compile, internal/runtime/atomic: add Xchg8 for loong64
In Loongson's new microstructure LA664 (Loongson-3A6000) and later, the atomic
instruction AMSWAP[DB]{B,H} [1] is supported. Therefore, the implementation of
the atomic operation exchange can be selected according to the CPUCFG flag LAM_BH:
AMSWAPDBB(full barrier) instruction is used on new microstructures, and traditional
LL-SC is used on LA464 (Loongson-3A5000) and older microstructures. This can
significantly improve the performance of Go programs on new microstructures.

Because Xchg8 implemented using traditional LL-SC uses too many temporary
registers, it is not suitable for intrinsics.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
BenchmarkXchg8             	100000000	        10.41 ns/op
BenchmarkXchg8-2           	100000000	        10.41 ns/op
BenchmarkXchg8-4           	100000000	        10.41 ns/op
BenchmarkXchg8Parallel     	96647592	        12.41 ns/op
BenchmarkXchg8Parallel-2   	58376136	        20.60 ns/op
BenchmarkXchg8Parallel-4   	78458899	        17.97 ns/op

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
BenchmarkXchg8             	38323825	        31.23 ns/op
BenchmarkXchg8-2           	38368219	        31.23 ns/op
BenchmarkXchg8-4           	37154156	        31.26 ns/op
BenchmarkXchg8Parallel     	37908301	        31.63 ns/op
BenchmarkXchg8Parallel-2   	30413440	        39.42 ns/op
BenchmarkXchg8Parallel-4   	30737626	        39.03 ns/op

For #69735

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html

Change-Id: I02ba68f66a2210b6902344fdc9975eb62de728ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/623058
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-11-20 02:58:50 +00:00
Guoqi Chen
5432cd96fd cmd/compiler,internal/runtime/atomic: optimize Cas{64,32} on loong64
In Loongson's new microstructure LA664 (Loongson-3A6000) and later, the atomic
compare-and-exchange instruction AMCAS[DB]{B,W,H,V} [1] is supported. Therefore,
the implementation of the atomic operation compare-and-swap can be selected according
to the CPUCFG flag LAMCAS: AMCASDB(full barrier) instruction is used on new
microstructures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older
microstructures. This can significantly improve the performance of Go programs on
new microstructures.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
         |  bench.old   |  bench.new                           |
         |   sec/op     |   sec/op       vs base               |
Cas        46.84n ±  0%   22.82n ±  0%  -51.28% (p=0.000 n=20)
Cas-2      47.58n ±  0%   29.57n ±  0%  -37.85% (p=0.000 n=20)
Cas-4      43.27n ± 20%   25.31n ± 13%  -41.50% (p=0.000 n=20)
Cas64      46.85n ±  0%   22.82n ±  0%  -51.29% (p=0.000 n=20)
Cas64-2    47.43n ±  0%   29.53n ±  0%  -37.74% (p=0.002 n=20)
Cas64-4    43.18n ±  0%   25.28n ±  2%  -41.46% (p=0.000 n=20)
geomean    45.82n         25.74n        -43.82%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
         |  bench.old  |  bench.new                         |
         |   sec/op    |   sec/op      vs base              |
Cas        50.05n ± 0%   51.26n ± 0%  +2.42% (p=0.000 n=20)
Cas-2      52.80n ± 0%   53.11n ± 0%  +0.59% (p=0.000 n=20)
Cas-4      55.97n ± 0%   57.31n ± 0%  +2.39% (p=0.000 n=20)
Cas64      50.05n ± 0%   51.26n ± 0%  +2.42% (p=0.000 n=20)
Cas64-2    52.68n ± 0%   53.11n ± 0%  +0.82% (p=0.000 n=20)
Cas64-4    55.96n ± 0%   57.26n ± 0%  +2.33% (p=0.000 n=20)
geomean    52.86n        53.83n       +1.82%

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html

Change-Id: I9b777c63c124fb492f61c903f77061fa2b4e5322
Reviewed-on: https://go-review.googlesource.com/c/go/+/613396
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-19 01:15:07 +00:00
Xiaolin Zhao
ab55465098 cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
                |  bench.old   |              bench.new               |
                |    sec/op    |    sec/op     vs base                |
TrailingZeros     1.7240n ± 0%   0.8120n ± 0%  -52.90% (p=0.000 n=20)
TrailingZeros8    1.0530n ± 0%   0.8015n ± 0%  -23.88% (p=0.000 n=20)
TrailingZeros16    2.072n ± 0%    1.015n ± 0%  -51.01% (p=0.000 n=20)
TrailingZeros32   1.7160n ± 0%   0.8122n ± 0%  -52.67% (p=0.000 n=20)
TrailingZeros64   2.0060n ± 0%   0.8125n ± 0%  -59.50% (p=0.000 n=20)
geomean            1.669n        0.8470n       -49.25%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
                |  bench.old   |              bench.new               |
                |    sec/op    |    sec/op     vs base                |
TrailingZeros     2.6275n ± 0%   0.9120n ± 0%  -65.29% (p=0.000 n=20)
TrailingZeros8     1.451n ± 0%    1.163n ± 0%  -19.85% (p=0.000 n=20)
TrailingZeros16    3.069n ± 0%    1.201n ± 0%  -60.87% (p=0.000 n=20)
TrailingZeros32   2.9060n ± 0%   0.9115n ± 0%  -68.63% (p=0.000 n=20)
TrailingZeros64   2.6305n ± 0%   0.9115n ± 0%  -65.35% (p=0.000 n=20)
geomean            2.456n         1.011n       -58.83%

This patch is a copy of CL 479498.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I1a5b2114a844dc0d02c8e68f41ce2443ac3b5fda
Reviewed-on: https://go-review.googlesource.com/c/go/+/624356
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
2024-11-13 00:57:25 +00:00
Guoqi Chen
fb9b946adc cmd/compile: optimize math/bits.OnesCount{16,32,64} implementation on loong64
Use Loong64's LSX instruction VPCNT to implement math/bits.OnesCount{16,32,64}
and make it intrinsic.

Benchmark results on loongson 3A5000 and 3A6000 machines:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000-HV @ 2500.00MHz
            |   bench.old   |   bench.new                          |
            |    sec/op     |    sec/op       vs base               |
OnesCount      4.413n ± 0%     1.401n ± 0%   -68.25% (p=0.000 n=10)
OnesCount8     1.364n ± 0%     1.363n ± 0%         ~ (p=0.130 n=10)
OnesCount16    2.112n ± 0%     1.534n ± 0%   -27.37% (p=0.000 n=10)
OnesCount32    4.533n ± 0%     1.529n ± 0%   -66.27% (p=0.000 n=10)
OnesCount64    4.565n ± 0%     1.531n ± 1%   -66.46% (p=0.000 n=10)
geomean        3.048n          1.470n        -51.78%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
            |   bench.old   |   bench.new                          |
            |    sec/op     |    sec/op       vs base              |
OnesCount       3.553n ± 0%     1.201n ± 0%  -66.20% (p=0.000 n=10)
OnesCount8     0.8021n ± 0%    0.8004n ± 0%   -0.21% (p=0.000 n=10)
OnesCount16     1.216n ± 0%     1.000n ± 0%  -17.76% (p=0.000 n=10)
OnesCount32     3.006n ± 0%     1.035n ± 0%  -65.57% (p=0.000 n=10)
OnesCount64     3.503n ± 0%     1.035n ± 0%  -70.45% (p=0.000 n=10)
geomean         2.053n          1.006n       -51.01%

Change-Id: I07a5b8da2bb48711b896387ec7625145804affc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/620978
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-12 00:48:04 +00:00
Xiaolin Zhao
583d750fa1 cmd/compile: wire up bits.Reverse intrinsics for loong64
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
          |  CL 624576   |               this CL                |
          |    sec/op    |    sec/op     vs base                |
Reverse     2.8130n ± 0%   0.8008n ± 0%  -71.53% (p=0.000 n=20)
Reverse8    0.7014n ± 0%   0.4040n ± 0%  -42.40% (p=0.000 n=20)
Reverse16   1.2975n ± 0%   0.6632n ± 1%  -48.89% (p=0.000 n=20)
Reverse32   2.7520n ± 0%   0.4042n ± 0%  -85.31% (p=0.000 n=20)
Reverse64   2.8970n ± 0%   0.4041n ± 0%  -86.05% (p=0.000 n=20)
geomean      1.828n        0.5116n       -72.01%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
          |  CL 624576   |               this CL                |
          |    sec/op    |    sec/op     vs base                |
Reverse     4.0050n ± 0%   0.8011n ± 0%  -80.00% (p=0.000 n=20)
Reverse8    0.8010n ± 0%   0.5210n ± 1%  -34.96% (p=0.000 n=20)
Reverse16   1.6160n ± 0%   0.6008n ± 0%  -62.82% (p=0.000 n=20)
Reverse32   3.8550n ± 0%   0.5179n ± 0%  -86.57% (p=0.000 n=20)
Reverse64   3.8050n ± 0%   0.5177n ± 0%  -86.40% (p=0.000 n=20)
geomean      2.378n        0.5828n       -75.49%

Updates #59120

This patch is a copy of CL 483656.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I98681091763279279c8404bd0295785f13ea1c8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/624276
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
2024-11-11 00:08:45 +00:00
Guoqi Chen
4b0da6b13f cmd/compiler,internal/runtime/atomic: optimize And{64,32,8} and Or{64,32,8} on loong64
Use loong64's atomic operation instruction AMANDDB{V,W,W} (full barrier) to implement
And{64,32,8}, AMORDB{V,W,W} (full barrier) to implement Or{64,32,8}.

Intrinsify And{64,32,8} and Or{64,32,8}, And this CL alias all of the And/Or operations
into sync/atomic package.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000-HV @ 2500.00MHz
                |   bench.old    |   bench.new                           |
                |   sec/op       |   sec/op        vs base               |
And32              27.73n ± 0%      10.81n ± 0%   -61.02% (p=0.000 n=20)
And32Parallel      28.96n ± 0%      12.41n ± 0%   -57.15% (p=0.000 n=20)
And64              27.73n ± 0%      10.81n ± 0%   -61.02% (p=0.000 n=20)
And64Parallel      28.96n ± 0%      12.41n ± 0%   -57.15% (p=0.000 n=20)
Or32               27.62n ± 0%      10.81n ± 0%   -60.86% (p=0.000 n=20)
Or32Parallel       28.96n ± 0%      12.41n ± 0%   -57.15% (p=0.000 n=20)
Or64               27.62n ± 0%      10.81n ± 0%   -60.86% (p=0.000 n=20)
Or64Parallel       28.97n ± 0%      12.41n ± 0%   -57.16% (p=0.000 n=20)
And8               29.15n ± 0%      13.21n ± 0%   -54.68% (p=0.000 n=20)
And                27.71n ± 0%      12.82n ± 0%   -53.74% (p=0.000 n=20)
And8Parallel       28.99n ± 0%      14.46n ± 0%   -50.12% (p=0.000 n=20)
AndParallel        29.12n ± 0%      14.42n ± 0%   -50.48% (p=0.000 n=20)
Or8                28.31n ± 0%      12.81n ± 0%   -54.75% (p=0.000 n=20)
Or                 27.72n ± 0%      12.81n ± 0%   -53.79% (p=0.000 n=20)
Or8Parallel        29.03n ± 0%      14.62n ± 0%   -49.64% (p=0.000 n=20)
OrParallel         29.12n ± 0%      14.42n ± 0%   -50.49% (p=0.000 n=20)
geomean            28.47n           12.58n        -55.80%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
                |   bench.old    |   bench.new                          |
                |   sec/op       |   sec/op        vs base              |
And32              30.02n ± 0%      14.81n ± 0%   -50.67% (p=0.000 n=20)
And32Parallel      30.83n ± 0%      15.61n ± 0%   -49.37% (p=0.000 n=20)
And64              30.02n ± 0%      14.81n ± 0%   -50.67% (p=0.000 n=20)
And64Parallel      30.83n ± 0%      15.61n ± 0%   -49.37% (p=0.000 n=20)
And8               30.42n ± 0%      14.41n ± 0%   -52.63% (p=0.000 n=20)
And                30.02n ± 0%      13.61n ± 0%   -54.66% (p=0.000 n=20)
And8Parallel       31.23n ± 0%      15.21n ± 0%   -51.30% (p=0.000 n=20)
AndParallel        30.83n ± 0%      14.41n ± 0%   -53.26% (p=0.000 n=20)
Or32               30.02n ± 0%      14.81n ± 0%   -50.67% (p=0.000 n=20)
Or32Parallel       30.83n ± 0%      15.61n ± 0%   -49.37% (p=0.000 n=20)
Or64               30.02n ± 0%      14.82n ± 0%   -50.63% (p=0.000 n=20)
Or64Parallel       30.83n ± 0%      15.61n ± 0%   -49.37% (p=0.000 n=20)
Or8                30.02n ± 0%      14.01n ± 0%   -53.33% (p=0.000 n=20)
Or                 30.02n ± 0%      13.61n ± 0%   -54.66% (p=0.000 n=20)
Or8Parallel        30.83n ± 0%      14.81n ± 0%   -51.96% (p=0.000 n=20)
OrParallel         30.83n ± 0%      14.41n ± 0%   -53.26% (p=0.000 n=20)
geomean            30.47n           14.75n        -51.61%

Change-Id: If008ff6a08b51905076f8ddb6e92f8e214d3f7b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/482756
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-11-11 00:08:08 +00:00
Xiaolin Zhao
e6cc9d228a cmd/compile: implement FMA codegen for loong64
Benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
    |  bench.old   |              bench.new              |
    |    sec/op    |   sec/op     vs base                |
FMA   25.930n ± 0%   2.002n ± 0%  -92.28% (p=0.000 n=10)

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
    |  bench.old   |              bench.new              |
    |    sec/op    |   sec/op     vs base                |
FMA   32.840n ± 0%   2.002n ± 0%  -93.90% (p=0.000 n=10)

Updates #59120

This patch is a copy of CL 483355.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I88b89d23f00864f9173a182a47ee135afec7ed6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/625335
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-11-08 01:05:48 +00:00
Guoqi Chen
e534989d18 cmd/compile/internal: intrinsify publicationBarrier on loong64
The publication barrier is a StoreStore barrier, which is implemented
by "DBAR 0x1A" [1] on loong64.

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A6000 @ 2500.00MHz
                     |   bench.old   |  bench.new                            |
                     |    sec/op     |   sec/op        vs base               |
Malloc8                 31.76n ± 0%     22.79n ± 0%   -28.24% (p=0.000 n=20)
Malloc8-2               25.46n ± 0%     18.33n ± 0%   -28.00% (p=0.000 n=20)
Malloc8-4               25.75n ± 0%     18.43n ± 0%   -28.41% (p=0.000 n=20)
Malloc16                62.97n ± 0%     42.41n ± 0%   -32.65% (p=0.000 n=20)
Malloc16-2              49.11n ± 0%     31.68n ± 0%   -35.50% (p=0.000 n=20)
Malloc16-4              49.64n ± 1%     31.95n ± 0%   -35.62% (p=0.000 n=20)
MallocTypeInfo8         58.57n ± 0%     46.51n ± 0%   -20.61% (p=0.000 n=20)
MallocTypeInfo8-2       51.43n ± 0%     38.01n ± 0%   -26.09% (p=0.000 n=20)
MallocTypeInfo8-4       51.65n ± 0%     38.15n ± 0%   -26.13% (p=0.000 n=20)
MallocTypeInfo16        68.07n ± 0%     51.62n ± 0%   -24.17% (p=0.000 n=20)
MallocTypeInfo16-2      54.73n ± 0%     41.13n ± 0%   -24.85% (p=0.000 n=20)
MallocTypeInfo16-4      55.05n ± 0%     41.28n ± 0%   -25.02% (p=0.000 n=20)
MallocLargeStruct       491.5n ± 0%     454.8n ± 0%    -7.47% (p=0.000 n=20)
MallocLargeStruct-2     351.8n ± 1%     323.8n ± 0%    -7.94% (p=0.000 n=20)
MallocLargeStruct-4     333.6n ± 0%     316.7n ± 0%    -5.10% (p=0.000 n=20)
geomean                 71.01n          53.78n        -24.26%

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html

Change-Id: Ica0c89db6f2bebd55d9b3207a1c462a9454e9268
Reviewed-on: https://go-review.googlesource.com/c/go/+/577515
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-11-08 01:04:43 +00:00
Guoqi Chen
ac345fb7e7 cmd/compiler,internal/runtime/atomic: optimize Store{64,32,8} on loong64
On Loong64, AMSWAPDB{W,V} instructions are supported by default, and AMSWAPDB{B,H} [1]
is a new instruction added by LA664(Loongson 3A6000) and later microarchitectures.
Therefore, AMSWAPDB{W,V} (full barrier) is used to implement AtomicStore{32,64}, and
the traditional MOVB or the new AMSWAPDBB is used to implement AtomicStore8 according
to the CPU feature.

The StoreRelease barrier on Loong64 is "dbar 0x12", but it is still necessary to
ensure consistency in the order of Store/Load [2].

LoweredAtomicStorezero{32,64} was removed because on loong64 the constant "0" uses
the R0 register, and there is no performance difference between the implementations
of LoweredAtomicStorezero{32,64} and LoweredAtomicStore{32,64}.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
                |  bench.old  |              bench.new              |
                |   sec/op    |   sec/op     vs base                |
AtomicStore64     19.61n ± 0%   13.61n ± 0%  -30.60% (p=0.000 n=20)
AtomicStore64-2   19.61n ± 0%   13.61n ± 0%  -30.57% (p=0.000 n=20)
AtomicStore64-4   19.62n ± 0%   13.61n ± 0%  -30.63% (p=0.000 n=20)
AtomicStore       19.61n ± 0%   13.61n ± 0%  -30.60% (p=0.000 n=20)
AtomicStore-2     19.62n ± 0%   13.61n ± 0%  -30.63% (p=0.000 n=20)
AtomicStore-4     19.62n ± 0%   13.62n ± 0%  -30.58% (p=0.000 n=20)
AtomicStore8      19.61n ± 0%   20.01n ± 0%   +2.04% (p=0.000 n=20)
AtomicStore8-2    19.62n ± 0%   20.02n ± 0%   +2.01% (p=0.000 n=20)
AtomicStore8-4    19.61n ± 0%   20.02n ± 0%   +2.09% (p=0.000 n=20)
geomean           19.61n        15.48n       -21.08%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
                |  bench.old  |              bench.new              |
                |   sec/op    |   sec/op     vs base                |
AtomicStore64     18.03n ± 0%   12.81n ± 0%  -28.93% (p=0.000 n=20)
AtomicStore64-2   18.02n ± 0%   12.81n ± 0%  -28.91% (p=0.000 n=20)
AtomicStore64-4   18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
AtomicStore       18.02n ± 0%   12.81n ± 0%  -28.91% (p=0.000 n=20)
AtomicStore-2     18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
AtomicStore-4     18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
AtomicStore8      18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
AtomicStore8-2    18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
AtomicStore8-4    18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
geomean           18.01n        12.81n       -28.89%

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
[2]: https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/loongarch/sync.md

Change-Id: I4ae5e8dd0e6f026129b6e503990a763ed40c6097
Reviewed-on: https://go-review.googlesource.com/c/go/+/581356
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
2024-11-07 02:19:55 +00:00
Xiaolin Zhao
d6fb0ab2c7 cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
ReverseBytes     2.0020n ± 0%   0.4040n ± 0%  -79.82% (p=0.000 n=20)
ReverseBytes16   0.8866n ± 1%   0.8007n ± 0%   -9.69% (p=0.000 n=20)
ReverseBytes32   1.2195n ± 0%   0.8007n ± 0%  -34.34% (p=0.000 n=20)
ReverseBytes64   2.0705n ± 0%   0.8008n ± 0%  -61.32% (p=0.000 n=20)
geomean           1.455n        0.6749n       -53.62%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
ReverseBytes     2.8040n ± 0%   0.5205n ± 0%  -81.44% (p=0.000 n=20)
ReverseBytes16   0.7066n ± 0%   0.8011n ± 0%  +13.37% (p=0.000 n=20)
ReverseBytes32   1.5500n ± 0%   0.8010n ± 0%  -48.32% (p=0.000 n=20)
ReverseBytes64   2.7665n ± 0%   0.8010n ± 0%  -71.05% (p=0.000 n=20)
geomean           1.707n        0.7192n       -57.87%

Updates #59120

This patch is a copy of CL 483357.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: If355354cd031533df91991fcc3392e5a6c314295
Reviewed-on: https://go-review.googlesource.com/c/go/+/624576
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-11-06 03:12:50 +00:00
Xiaolin Zhao
d98c51809d cmd/compile: wire up math/bits.Len intrinsics for loong64
For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (actually, the LeadingZeros micro-benchmarks
are currently still compiled with redundant adds/subs of 64, due to
interference of loop optimizations before lowering), but perf numbers
indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
               |  bench.old  |              bench.new              |
               |   sec/op    |   sec/op     vs base                |
LeadingZeros     3.660n ± 0%   1.348n ± 0%  -63.17% (p=0.000 n=20)
LeadingZeros8    1.777n ± 0%   1.767n ± 0%   -0.56% (p=0.000 n=20)
LeadingZeros16   2.816n ± 0%   1.770n ± 0%  -37.14% (p=0.000 n=20)
LeadingZeros32   5.293n ± 1%   1.683n ± 0%  -68.21% (p=0.000 n=20)
LeadingZeros64   3.622n ± 0%   1.349n ± 0%  -62.76% (p=0.000 n=20)
geomean          3.229n        1.571n       -51.35%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
LeadingZeros      2.410n ± 0%    1.103n ± 1%  -54.23% (p=0.000 n=20)
LeadingZeros8     1.236n ± 0%    1.501n ± 0%  +21.44% (p=0.000 n=20)
LeadingZeros16    2.106n ± 0%    1.501n ± 0%  -28.73% (p=0.000 n=20)
LeadingZeros32    2.860n ± 0%    1.324n ± 0%  -53.72% (p=0.000 n=20)
LeadingZeros64   2.6135n ± 0%   0.9509n ± 0%  -63.62% (p=0.000 n=20)
geomean           2.159n         1.256n       -41.81%

Updates #59120

This patch is a copy of CL 483356.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: Iee81a17f7da06d77a427e73dfcc016f2b15ae556
Reviewed-on: https://go-review.googlesource.com/c/go/+/624575
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2024-11-06 00:40:40 +00:00
Youlin Feng
0140aae6d0 cmd/compile: optimize Ctz64 on 386
Compared with the version generated by dec64.rules based on Ctz32,
the number of assembly instructions is reduced by half.

SwissMap uses TrailingZeros64 to find the first match in its control
group and may benefit from this CL on 386 architectures.

goos: linux
goarch: 386
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                   │   old.txt    │               new.txt                │
                   │    sec/op    │    sec/op     vs base                │
TrailingZeros64-20   0.8828n ± 1%   0.6299n ± 1%  -28.65% (p=0.000 n=20)

Change-Id: Iba08a3f4e13efd3349715dfb7fcd5fd470286cd3
Reviewed-on: https://go-review.googlesource.com/c/go/+/624376
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
2024-11-05 15:30:57 +00:00
Mauri de Souza Meneguzzo
3aa71c12ea cmd/compile, internal/runtime/atomic: add Xchg8 for arm64
For #69735

Change-Id: I61a2e561684c538eea705e60c8ebda6be3ef31a7
GitHub-Last-Rev: 3c7f4ec845
GitHub-Pull-Request: golang/go#69751
Reviewed-on: https://go-review.googlesource.com/c/go/+/617595
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2024-10-08 22:06:50 +00:00
Paul E. Murphy
15618840f6 cmd/compile: add internal/runtime/atomic.Xchg8 intrinsic for PPC64
This is minor extension of the existing support for 32 and
64 bit types.

For #69735

Change-Id: I6828ec223951d2b692e077dc507b000ac23c32a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/617496
Reviewed-by: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
2024-10-07 19:20:23 +00:00
Rhys Hiltner
dc8902f4eb cmd/compile/internal/ssa: intrinsify atomic.Xchg8 on amd64
For #68578

Change-Id: Ia9580579bfc4709945bfcf6ec3803d5d11812187
Reviewed-on: https://go-review.googlesource.com/c/go/+/606901
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-10-02 16:58:01 +00:00
Michael Pratt
d1c3f255fd runtime: move getclosureptr to internal/runtime/sys
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.

There is no immediate need for getclosureptr outside of runtime, but it
is moved for consistency with the other intrinsics.

For #54766.

Change-Id: Ia68b16a938c8cb84cb222469db28e3a83861be5d
Reviewed-on: https://go-review.googlesource.com/c/go/+/613262
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-09-17 17:01:26 +00:00
Michael Pratt
4f881115d4 runtime: move getcallersp to internal/runtime/sys
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.

For #54766.

Change-Id: I45a530422207dd94b5ad4eee51216c9410a84040
Reviewed-on: https://go-review.googlesource.com/c/go/+/613261
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-09-17 17:01:20 +00:00
Michael Pratt
81c92352a7 runtime: move getcallerpc to internal/runtime/sys
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.

For #54766.

Change-Id: I0b3eded3bb45af53e3eb5bab93e3792e6a8beb46
Reviewed-on: https://go-review.googlesource.com/c/go/+/613260
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-09-17 15:14:14 +00:00
Joel Sing
7c54e024e8 cmd/compile/internal/ssagen: add check for duplicate intrinsics
Add a check to ensure that intrinsics are not being overwritten.
Remove two S390X intrinsics that are being replaced by aliases and
are therefore ineffective.

Change-Id: I4187a169c14ca75c45a67f41a1d626d76b82bb72
Reviewed-on: https://go-review.googlesource.com/c/go/+/605479
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2024-08-27 14:35:25 +00:00
Joel Sing
dff15aa610 cmd/compile/internal/ssagen: provide intrinsicBuilders
Create an intrinsicBuilders type that has functions for adding and
looking up intrinsics. This makes the implementation more self contained,
readable and testable. Additionally, pass an *intrinsicBuildConfig to
initIntrinsics to improve testability without needing to modify package
level variables.

Change-Id: I0ee0a19c192dd6da9f1c5f1c29b98a3ad8161fe2
Reviewed-on: https://go-review.googlesource.com/c/go/+/605478
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Joel Sing <joel@sing.id.au>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
2024-08-27 14:27:25 +00:00
Paul E. Murphy
2b0a157d68 cmd/compile: intrinsify math.MulUintptr on PPC64
This can be done efficiently with few instructions.

This also adds MULHDUCC for further codegen improvement.

Change-Id: I06320ba4383a679341b911a237a360ef07b19168
Reviewed-on: https://go-review.googlesource.com/c/go/+/605975
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Archana Ravindar <aravinda@redhat.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-08-26 17:02:43 +00:00
Joel Sing
c6c9634515 cmd/compile/internal/ssagen: factor out intrinsics code
The intrinsic handling code is a good thousand lines in the fairly
large ssa.go file. This code is already reasonably self-contained - factor
it out into a separate file so that future changes are easier to manage
(and it becomes easier to add/change intrinsics for an architecture).

Change-Id: I3c18d3d1bb6332f1817d902150e736373bf1ac81
Reviewed-on: https://go-review.googlesource.com/c/go/+/605477
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-08-20 14:20:34 +00:00