Commit graph

29 commits

Author SHA1 Message Date
Meng Zhuo
d7a52f9369 cmd/compile: use MOV(D|F) with const for Const(64|32)F on riscv64
The original Const64F using: AUIPC + LD + FMVDX to load
float64 const, we can use AUIPC + FLD instead, same as Const32F.

Change-Id: I8ca0a0e90d820a26e69b74cd25df3cc662132bf7
Reviewed-on: https://go-review.googlesource.com/c/go/+/703215
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2025-10-26 18:35:09 -07:00
Jes Cok
46cc532900 cmd/compile/internal/ssa: fix typo in comment
Change-Id: Ic48a756b71a62be1c6c4cfe781c02b89010e2338
GitHub-Last-Rev: 8c0d89b475
GitHub-Pull-Request: golang/go#75985
Reviewed-on: https://go-review.googlesource.com/c/go/+/713041
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Auto-Submit: Jorropo <jorropo.pgm@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-10-21 08:43:34 -07:00
Meng Zhuo
b9a4a09b0f runtime: remove duff support for riscv64
Change-Id: I987d9f49fbd2650eef4224f72271bf752c54d39c
Reviewed-on: https://go-review.googlesource.com/c/go/+/700538
Reviewed-by: Mark Freeman <markfreeman@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
2025-09-09 19:42:25 -07:00
Meng Zhuo
4dac9e093f cmd/compile: use generated loops instead of DUFFCOPY on riscv64
MemmoveKnownSize112-4            632.1Mi ± 1%   1288.5Mi ± 0%  +103.85% (p=0.000 n=10)
MemmoveKnownSize128-4            636.1Mi ± 0%   1280.9Mi ± 1%  +101.36% (p=0.000 n=10)
MemmoveKnownSize192-4            645.3Mi ± 0%   1306.9Mi ± 1%  +102.53% (p=0.000 n=10)
MemmoveKnownSize248-4            650.2Mi ± 2%   1312.5Mi ± 1%  +101.87% (p=0.000 n=10)
MemmoveKnownSize256-4            650.7Mi ± 0%   1303.6Mi ± 1%  +100.33% (p=0.000 n=10)
MemmoveKnownSize512-4            658.2Mi ± 1%   1293.9Mi ± 0%   +96.60% (p=0.000 n=10)
MemmoveKnownSize1024-4           662.1Mi ± 0%   1312.6Mi ± 0%   +98.26% (p=0.000 n=10)

Change-Id: I43681ca029880025558b33ddc4295da3947c9b28
Reviewed-on: https://go-review.googlesource.com/c/go/+/700537
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Mark Freeman <markfreeman@google.com>
2025-09-09 19:42:09 -07:00
Meng Zhuo
879ff736d3 cmd/compile: use generated loops instead of DUFFZERO on riscv64
MemclrKnownSize112-4          5.602Gi ± 0%    5.601Gi ± 0%         ~ (p=0.363 n=10)
MemclrKnownSize128-4          6.933Gi ± 1%    6.545Gi ± 1%    -5.59% (p=0.000 n=10)
MemclrKnownSize192-4          8.055Gi ± 1%    7.804Gi ± 0%    -3.12% (p=0.000 n=10)
MemclrKnownSize248-4          8.489Gi ± 0%    8.718Gi ± 0%    +2.69% (p=0.000 n=10)
MemclrKnownSize256-4          8.762Gi ± 0%    8.763Gi ± 0%         ~ (p=0.494 n=10)
MemclrKnownSize512-4          9.514Gi ± 1%    9.514Gi ± 0%         ~ (p=0.529 n=10)
MemclrKnownSize1024-4         9.940Gi ± 0%    9.939Gi ± 1%         ~ (p=0.989 n=10)
ClearFat3-4                   1.300Gi ± 0%    1.301Gi ±  0%         ~ (p=0.447 n=10)
ClearFat4-4                   3.902Gi ± 0%    3.902Gi ±  0%         ~ (p=0.971 n=10)
ClearFat5-4                   665.8Mi ± 0%   1331.5Mi ±  0%  +100.01% (p=0.000 n=10)
ClearFat6-4                   665.8Mi ± 0%   1330.5Mi ±  0%   +99.82% (p=0.000 n=10)
ClearFat7-4                   490.7Mi ± 0%   1331.9Mi ±  0%  +171.45% (p=0.000 n=10)
ClearFat8-4                   5.201Gi ± 0%    5.202Gi ±  0%         ~ (p=0.123 n=10)
ClearFat9-4                   856.1Mi ± 0%   1331.6Mi ±  0%   +55.54% (p=0.000 n=10)
ClearFat10-4                  887.8Mi ± 0%   1331.9Mi ±  0%   +50.03% (p=0.000 n=10)
ClearFat11-4                  915.3Mi ± 0%   1331.1Mi ±  0%   +45.42% (p=0.000 n=10)
ClearFat12-4                  5.202Gi ± 0%    5.202Gi ±  0%         ~ (p=0.481 n=10)
ClearFat13-4                  961.5Mi ± 0%   1331.8Mi ±  0%   +38.50% (p=0.000 n=10)
ClearFat14-4                  981.0Mi ± 0%   1331.8Mi ±  0%   +35.76% (p=0.000 n=10)
ClearFat15-4                  951.3Mi ± 0%   1331.4Mi ±  0%   +39.96% (p=0.000 n=10)
ClearFat16-4                  1.600Gi ± 0%    5.202Gi ±  0%  +225.10% (p=0.000 n=10)
ClearFat18-4                  1.018Gi ± 0%    1.300Gi ±  0%   +27.77% (p=0.000 n=10)
ClearFat20-4                  2.601Gi ± 0%    4.938Gi ± 12%   +89.87% (p=0.000 n=10)
ClearFat24-4                  2.601Gi ± 0%    5.201Gi ±  0%   +99.96% (p=0.000 n=10)
ClearFat32-4                  1.982Gi ± 0%    5.203Gi ±  0%  +162.55% (p=0.000 n=10)
ClearFat40-4                  3.467Gi ± 0%    4.338Gi ±  0%   +25.11% (p=0.000 n=10)
ClearFat48-4                  3.671Gi ± 0%    5.201Gi ±  0%   +41.69% (p=0.000 n=10)
ClearFat56-4                  3.640Gi ± 0%    5.201Gi ±  0%   +42.88% (p=0.000 n=10)
ClearFat64-4                  2.250Gi ± 0%    5.202Gi ±  0%  +131.25% (p=0.000 n=10)
ClearFat72-4                  4.064Gi ± 0%    5.201Gi ±  0%   +27.97% (p=0.000 n=10)
ClearFat128-4                 4.496Gi ± 0%    5.203Gi ±  0%   +15.71% (p=0.000 n=10)
ClearFat256-4                 4.756Gi ± 0%    5.201Gi ±  0%    +9.36% (p=0.000 n=10)
ClearFat512-4                 2.512Gi ± 0%    5.201Gi ±  0%  +107.03% (p=0.000 n=10)
ClearFat1024-4                4.255Gi ± 0%    5.202Gi ±  0%   +22.26% (p=0.000 n=10)
ClearFat1032-4                4.260Gi ± 0%    5.201Gi ±  0%   +22.09% (p=0.000 n=10)
ClearFat1040-4                4.285Gi ± 1%    5.203Gi ±  0%   +21.41% (p=0.000 n=10)
geomean                       2.005Gi         3.020Gi         +50.58%

Change-Id: Iea1da734ff8eaf1b5a2822ae2bdb7f4fd9b65651
Reviewed-on: https://go-review.googlesource.com/c/go/+/699635
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Freeman <markfreeman@google.com>
2025-09-09 19:41:39 -07:00
Michael Munday
320df537cc cmd/compile: emit classify instructions for infinity tests on riscv64
The 'classify' instruction on RISC-V sets a bit in a mask to indicate
the class a floating point value belongs to (e.g. whether the value is
an infinity, a normal number, a subnormal number and so on). There are
other places this instruction is useful but for now I've just used it
for infinity tests.

The gains are relatively small (~1-2 instructions per IsInf call) but
using FCLASSD does potentially unlock further optimizations. It also
reduces the number of loads from memory and the number of moves
between general purpose and floating point register files.

goos: linux
goarch: riscv64
pkg: math
cpu: Spacemit(R) X60
                    │        sec/op        │   sec/op     vs base                │
Acos                           159.9n ± 0%   173.7n ± 0%   +8.66% (p=0.000 n=10)
Acosh                          249.8n ± 0%   254.4n ± 0%   +1.86% (p=0.000 n=10)
Asin                           159.9n ± 0%   173.7n ± 0%   +8.66% (p=0.000 n=10)
Asinh                          292.2n ± 0%   283.0n ± 0%   -3.15% (p=0.000 n=10)
Atan                           119.1n ± 0%   119.0n ± 0%   -0.08% (p=0.036 n=10)
Atanh                          265.1n ± 0%   271.6n ± 0%   +2.43% (p=0.000 n=10)
Atan2                          194.9n ± 0%   186.7n ± 0%   -4.23% (p=0.000 n=10)
Cbrt                           216.3n ± 0%   203.1n ± 0%   -6.10% (p=0.000 n=10)
Ceil                           31.82n ± 0%   31.81n ± 0%        ~ (p=0.063 n=10)
Copysign                       4.897n ± 0%   4.893n ± 3%   -0.08% (p=0.038 n=10)
Cos                            123.9n ± 0%   107.7n ± 1%  -13.03% (p=0.000 n=10)
Cosh                           293.0n ± 0%   264.6n ± 0%   -9.68% (p=0.000 n=10)
Erf                            150.0n ± 0%   133.8n ± 0%  -10.80% (p=0.000 n=10)
Erfc                           151.8n ± 0%   137.9n ± 0%   -9.16% (p=0.000 n=10)
Erfinv                         173.8n ± 0%   173.8n ± 0%        ~ (p=0.820 n=10)
Erfcinv                        173.8n ± 0%   173.8n ± 0%        ~ (p=1.000 n=10)
Exp                            247.7n ± 0%   220.4n ± 0%  -11.04% (p=0.000 n=10)
ExpGo                          261.4n ± 0%   232.5n ± 0%  -11.04% (p=0.000 n=10)
Expm1                          176.2n ± 0%   164.9n ± 0%   -6.41% (p=0.000 n=10)
Exp2                           220.4n ± 0%   190.2n ± 0%  -13.70% (p=0.000 n=10)
Exp2Go                         232.5n ± 0%   204.0n ± 0%  -12.22% (p=0.000 n=10)
Abs                            4.897n ± 0%   4.897n ± 0%        ~ (p=0.726 n=10)
Dim                            16.32n ± 0%   16.31n ± 0%        ~ (p=0.770 n=10)
Floor                          31.84n ± 0%   31.83n ± 0%        ~ (p=0.677 n=10)
Max                            26.11n ± 0%   26.13n ± 0%        ~ (p=0.290 n=10)
Min                            26.10n ± 0%   26.11n ± 0%        ~ (p=0.424 n=10)
Mod                            416.2n ± 0%   337.8n ± 0%  -18.83% (p=0.000 n=10)
Frexp                          63.65n ± 0%   50.60n ± 0%  -20.50% (p=0.000 n=10)
Gamma                          218.8n ± 0%   206.4n ± 0%   -5.62% (p=0.000 n=10)
Hypot                          92.20n ± 0%   94.69n ± 0%   +2.70% (p=0.000 n=10)
HypotGo                        107.7n ± 0%   109.3n ± 0%   +1.49% (p=0.000 n=10)
Ilogb                          59.54n ± 0%   44.04n ± 0%  -26.04% (p=0.000 n=10)
J0                             708.9n ± 0%   674.5n ± 0%   -4.86% (p=0.000 n=10)
J1                             707.6n ± 0%   676.1n ± 0%   -4.44% (p=0.000 n=10)
Jn                             1.513µ ± 0%   1.427µ ± 0%   -5.68% (p=0.000 n=10)
Ldexp                          70.20n ± 0%   57.09n ± 0%  -18.68% (p=0.000 n=10)
Lgamma                         201.5n ± 0%   185.3n ± 1%   -8.01% (p=0.000 n=10)
Log                            201.5n ± 0%   182.7n ± 0%   -9.35% (p=0.000 n=10)
Logb                           59.54n ± 0%   46.53n ± 0%  -21.86% (p=0.000 n=10)
Log1p                          178.8n ± 0%   173.9n ± 6%   -2.74% (p=0.021 n=10)
Log10                          201.4n ± 0%   184.3n ± 0%   -8.49% (p=0.000 n=10)
Log2                           79.17n ± 0%   66.07n ± 0%  -16.54% (p=0.000 n=10)
Modf                           34.27n ± 0%   34.25n ± 0%        ~ (p=0.559 n=10)
Nextafter32                    49.34n ± 0%   49.37n ± 0%   +0.05% (p=0.040 n=10)
Nextafter64                    43.66n ± 0%   43.66n ± 0%        ~ (p=0.869 n=10)
PowInt                         309.1n ± 0%   267.4n ± 0%  -13.49% (p=0.000 n=10)
PowFrac                        769.6n ± 0%   677.3n ± 0%  -11.98% (p=0.000 n=10)
Pow10Pos                       13.88n ± 0%   13.88n ± 0%        ~ (p=0.811 n=10)
Pow10Neg                       19.58n ± 0%   19.57n ± 0%        ~ (p=0.993 n=10)
Round                          23.65n ± 0%   23.66n ± 0%        ~ (p=0.354 n=10)
RoundToEven                    27.75n ± 0%   27.75n ± 0%        ~ (p=0.971 n=10)
Remainder                      380.0n ± 0%   309.9n ± 0%  -18.45% (p=0.000 n=10)
Signbit                        13.06n ± 0%   13.06n ± 0%        ~ (p=1.000 n=10)
Sin                            133.8n ± 0%   120.8n ± 0%   -9.75% (p=0.000 n=10)
Sincos                         160.7n ± 0%   147.7n ± 0%   -8.12% (p=0.000 n=10)
Sinh                           305.9n ± 0%   277.9n ± 0%   -9.17% (p=0.000 n=10)
SqrtIndirect                   3.265n ± 0%   3.264n ± 0%        ~ (p=0.546 n=10)
SqrtLatency                    19.58n ± 0%   19.58n ± 0%        ~ (p=0.973 n=10)
SqrtIndirectLatency            19.59n ± 0%   19.58n ± 0%        ~ (p=0.370 n=10)
SqrtGoLatency                  205.7n ± 0%   202.7n ± 0%   -1.46% (p=0.000 n=10)
SqrtPrime                      4.953µ ± 0%   4.954µ ± 0%        ~ (p=0.477 n=10)
Tan                            163.2n ± 0%   150.2n ± 0%   -7.99% (p=0.000 n=10)
Tanh                           312.4n ± 0%   284.2n ± 0%   -9.01% (p=0.000 n=10)
Trunc                          31.83n ± 0%   31.83n ± 0%        ~ (p=0.663 n=10)
Y0                             701.0n ± 0%   669.2n ± 0%   -4.54% (p=0.000 n=10)
Y1                             704.5n ± 0%   672.4n ± 0%   -4.55% (p=0.000 n=10)
Yn                             1.490µ ± 0%   1.422µ ± 0%   -4.60% (p=0.000 n=10)
Float64bits                    5.713n ± 0%   5.710n ± 0%        ~ (p=0.926 n=10)
Float64frombits                4.896n ± 0%   4.896n ± 0%        ~ (p=0.663 n=10)
Float32bits                    12.25n ± 0%   12.25n ± 0%        ~ (p=0.571 n=10)
Float32frombits                4.898n ± 0%   4.896n ± 0%        ~ (p=0.754 n=10)
FMA                            4.895n ± 0%   4.895n ± 0%        ~ (p=0.745 n=10)
geomean                        94.40n        89.43n        -5.27%

Change-Id: I4fe0f2e9f609e38d79463f9ba2519a3f9427432e
Reviewed-on: https://go-review.googlesource.com/c/go/+/348389
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2025-08-13 20:33:56 -07:00
Michael Munday
fcc036f03b cmd/compile: optimise float <-> int register moves on riscv64
Use the FMV* instructions to move values between the floating point and
integer register files.

Note: I'm unsure why there is a slowdown in the Float32bits benchmark,
I've checked and an FMVXS instruction is being used as expected. There
are multiple loads and other instructions in the main loop.

goos: linux
goarch: riscv64
pkg: math
cpu: Spacemit(R) X60
                    │ fmv-before.txt │            fmv-after.txt            │
                    │     sec/op     │   sec/op     vs base                │
Acos                     122.7n ± 0%   122.7n ± 0%        ~ (p=1.000 n=10)
Acosh                    197.2n ± 0%   191.5n ± 0%   -2.89% (p=0.000 n=10)
Asin                     122.7n ± 0%   122.7n ± 0%        ~ (p=0.474 n=10)
Asinh                    231.0n ± 0%   224.1n ± 0%   -2.99% (p=0.000 n=10)
Atan                     91.39n ± 0%   91.41n ± 0%        ~ (p=0.465 n=10)
Atanh                    210.3n ± 0%   203.4n ± 0%   -3.26% (p=0.000 n=10)
Atan2                    149.6n ± 0%   149.6n ± 0%        ~ (p=0.721 n=10)
Cbrt                     176.5n ± 0%   165.9n ± 0%   -6.01% (p=0.000 n=10)
Ceil                     25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Copysign                 3.756n ± 0%   3.756n ± 0%        ~ (p=0.149 n=10)
Cos                      95.15n ± 0%   95.15n ± 0%        ~ (p=0.374 n=10)
Cosh                     228.6n ± 0%   224.7n ± 0%   -1.71% (p=0.000 n=10)
Erf                      115.2n ± 0%   115.2n ± 0%        ~ (p=0.474 n=10)
Erfc                     116.4n ± 0%   116.4n ± 0%        ~ (p=0.628 n=10)
Erfinv                   133.3n ± 0%   133.3n ± 0%        ~ (p=1.000 n=10)
Erfcinv                  133.3n ± 0%   133.3n ± 0%        ~ (p=1.000 n=10)
Exp                      194.1n ± 0%   190.3n ± 0%   -1.93% (p=0.000 n=10)
ExpGo                    204.7n ± 0%   200.3n ± 0%   -2.15% (p=0.000 n=10)
Expm1                    137.7n ± 0%   135.2n ± 0%   -1.82% (p=0.000 n=10)
Exp2                     173.4n ± 0%   169.0n ± 0%   -2.54% (p=0.000 n=10)
Exp2Go                   182.8n ± 0%   178.4n ± 0%   -2.41% (p=0.000 n=10)
Abs                      3.756n ± 0%   3.756n ± 0%        ~ (p=0.157 n=10)
Dim                      12.52n ± 0%   12.52n ± 0%        ~ (p=0.737 n=10)
Floor                    25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Max                      21.29n ± 0%   20.03n ± 0%   -5.92% (p=0.000 n=10)
Min                      21.28n ± 0%   20.04n ± 0%   -5.85% (p=0.000 n=10)
Mod                      344.9n ± 0%   319.2n ± 0%   -7.45% (p=0.000 n=10)
Frexp                    55.71n ± 0%   48.85n ± 0%  -12.30% (p=0.000 n=10)
Gamma                    165.9n ± 0%   167.8n ± 0%   +1.15% (p=0.000 n=10)
Hypot                    73.24n ± 0%   70.74n ± 0%   -3.41% (p=0.000 n=10)
HypotGo                  84.50n ± 0%   82.63n ± 0%   -2.21% (p=0.000 n=10)
Ilogb                    49.45n ± 0%   45.70n ± 0%   -7.59% (p=0.000 n=10)
J0                       556.5n ± 0%   544.0n ± 0%   -2.25% (p=0.000 n=10)
J1                       555.3n ± 0%   542.8n ± 0%   -2.24% (p=0.000 n=10)
Jn                       1.181µ ± 0%   1.156µ ± 0%   -2.12% (p=0.000 n=10)
Ldexp                    59.47n ± 0%   53.84n ± 0%   -9.47% (p=0.000 n=10)
Lgamma                   167.2n ± 0%   154.6n ± 0%   -7.51% (p=0.000 n=10)
Log                      160.9n ± 0%   154.6n ± 0%   -3.92% (p=0.000 n=10)
Logb                     49.45n ± 0%   45.70n ± 0%   -7.58% (p=0.000 n=10)
Log1p                    147.1n ± 0%   137.1n ± 0%   -6.80% (p=0.000 n=10)
Log10                    162.1n ± 1%   154.6n ± 0%   -4.63% (p=0.000 n=10)
Log2                     66.99n ± 0%   60.72n ± 0%   -9.36% (p=0.000 n=10)
Modf                     29.42n ± 0%   26.29n ± 0%  -10.64% (p=0.000 n=10)
Nextafter32              41.95n ± 0%   37.88n ± 0%   -9.70% (p=0.000 n=10)
Nextafter64              38.82n ± 0%   33.49n ± 0%  -13.73% (p=0.000 n=10)
PowInt                   252.3n ± 0%   237.3n ± 0%   -5.95% (p=0.000 n=10)
PowFrac                  615.5n ± 0%   589.7n ± 0%   -4.19% (p=0.000 n=10)
Pow10Pos                 10.64n ± 0%   10.64n ± 0%        ~ (p=1.000 n=10)
Pow10Neg                 24.42n ± 0%   15.02n ± 0%  -38.49% (p=0.000 n=10)
Round                    21.91n ± 0%   18.16n ± 0%  -17.12% (p=0.000 n=10)
RoundToEven              24.42n ± 0%   21.29n ± 0%  -12.84% (p=0.000 n=10)
Remainder                308.0n ± 0%   291.2n ± 0%   -5.44% (p=0.000 n=10)
Signbit                  10.02n ± 0%   10.02n ± 0%        ~ (p=1.000 n=10)
Sin                      102.7n ± 0%   102.7n ± 0%        ~ (p=0.211 n=10)
Sincos                   124.0n ± 1%   123.3n ± 0%   -0.56% (p=0.002 n=10)
Sinh                     239.1n ± 0%   234.7n ± 0%   -1.84% (p=0.000 n=10)
SqrtIndirect             2.504n ± 0%   2.504n ± 0%        ~ (p=0.303 n=10)
SqrtLatency              15.03n ± 0%   15.02n ± 0%        ~ (p=0.598 n=10)
SqrtIndirectLatency      15.02n ± 0%   15.02n ± 0%        ~ (p=0.907 n=10)
SqrtGoLatency            165.3n ± 0%   157.2n ± 0%   -4.90% (p=0.000 n=10)
SqrtPrime                3.801µ ± 0%   3.802µ ± 0%        ~ (p=1.000 n=10)
Tan                      125.2n ± 0%   125.2n ± 0%        ~ (p=0.458 n=10)
Tanh                     244.2n ± 0%   239.9n ± 0%   -1.76% (p=0.000 n=10)
Trunc                    25.67n ± 0%   24.42n ± 0%   -4.87% (p=0.000 n=10)
Y0                       550.2n ± 0%   538.1n ± 0%   -2.21% (p=0.000 n=10)
Y1                       552.8n ± 0%   540.6n ± 0%   -2.21% (p=0.000 n=10)
Yn                       1.168µ ± 0%   1.143µ ± 0%   -2.14% (p=0.000 n=10)
Float64bits              8.139n ± 0%   4.385n ± 0%  -46.13% (p=0.000 n=10)
Float64frombits          7.512n ± 0%   3.759n ± 0%  -49.96% (p=0.000 n=10)
Float32bits              8.138n ± 0%   9.393n ± 0%  +15.42% (p=0.000 n=10)
Float32frombits          7.513n ± 0%   3.757n ± 0%  -49.98% (p=0.000 n=10)
FMA                      3.756n ± 0%   3.756n ± 0%        ~ (p=0.246 n=10)
geomean                  77.43n        72.42n        -6.47%

Change-Id: I8dac69b1d17cb3d2af78d1c844d2b5d80000d667
Reviewed-on: https://go-review.googlesource.com/c/go/+/599235
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Munday <mikemndy@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
2025-08-05 08:27:15 -07:00
Keith Randall
95693816a5 cmd/compile: move riscv64 over to new bounds check strategy
Change-Id: Idd9eaf051aa57f7fef7049c12085926030c35d70
Reviewed-on: https://go-review.googlesource.com/c/go/+/682401
Reviewed-by: Mark Freeman <mark@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-08-04 10:08:13 -07:00
Joel Sing
4d10d4ad84 cmd/compile,internal/cpu,runtime: intrinsify math/bits.OnesCount on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.OnesCount
using the CPOP/CPOPW machine instructions. Since the native Go
implementation of OnesCount is relatively expensive, it is also
worth emitting a check for Zbb support when compiled for rva20u64.

On a Banana Pi F3, with GORISCV64=rva22u64:

              │     oc.1     │                oc.2                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   4.389n ± 0%  -74.08% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.016n ± 0%  -11.10% (p=0.000 n=10)
OnesCount16-8    9.404n ± 0%   5.015n ± 0%  -46.67% (p=0.000 n=10)
OnesCount32-8   13.165n ± 0%   4.388n ± 0%  -66.67% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   4.388n ± 0%  -73.08% (p=0.000 n=10)
geomean          11.40n        4.629n       -59.40%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 and with Zbb
detection enabled:

              │     oc.3     │                oc.4                 │
              │    sec/op    │   sec/op     vs base                │
OnesCount-8     16.930n ± 0%   5.643n ± 0%  -66.67% (p=0.000 n=10)
OnesCount8-8     5.642n ± 0%   5.642n ± 0%        ~ (p=0.447 n=10)
OnesCount16-8   10.030n ± 0%   6.896n ± 0%  -31.25% (p=0.000 n=10)
OnesCount32-8   13.170n ± 0%   5.642n ± 0%  -57.16% (p=0.000 n=10)
OnesCount64-8   16.300n ± 0%   5.642n ± 0%  -65.39% (p=0.000 n=10)
geomean          11.55n        5.873n       -49.16%

On a Banana Pi F3, compiled with GORISCV64=rva20u64 but with Zbb
detection disabled:

              │    oc.3     │                oc.5                 │
              │   sec/op    │   sec/op     vs base                │
OnesCount-8     16.93n ± 0%   29.47n ± 0%  +74.07% (p=0.000 n=10)
OnesCount8-8    5.642n ± 0%   5.643n ± 0%        ~ (p=0.191 n=10)
OnesCount16-8   10.03n ± 0%   15.05n ± 0%  +50.05% (p=0.000 n=10)
OnesCount32-8   13.17n ± 0%   18.18n ± 0%  +38.04% (p=0.000 n=10)
OnesCount64-8   16.30n ± 0%   21.94n ± 0%  +34.60% (p=0.000 n=10)
geomean         11.55n        15.84n       +37.16%

For hardware without Zbb, this adds ~5ns overhead, while for hardware
with Zbb we achieve a performance gain up of up to 11ns. It is worth
noting that OnesCount8 is cheap enough that it is preferable to stick
with the generic version in this case.

Change-Id: Id657e40e0dd1b1ab8cc0fe0f8a68df4c9f2d7da5
Reviewed-on: https://go-review.googlesource.com/c/go/+/660856
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-01 05:57:41 -07:00
Joel Sing
90e8b8cdae cmd/compile: intrinsify math/bits.Bswap on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.Bswap
using the REV8 machine instruction.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                 │     rb.1     │                rb.2                 │
                 │    sec/op    │   sec/op     vs base                │
ReverseBytes-4     18.790n ± 0%   4.026n ± 0%  -78.57% (p=0.000 n=10)
ReverseBytes16-4    6.710n ± 0%   5.368n ± 0%  -20.00% (p=0.000 n=10)
ReverseBytes32-4   13.420n ± 0%   5.368n ± 0%  -60.00% (p=0.000 n=10)
ReverseBytes64-4   17.450n ± 0%   4.026n ± 0%  -76.93% (p=0.000 n=10)
geomean             13.11n        4.649n       -64.54%

Change-Id: I26eee34270b1721f7304bb1cddb0fda129b20ece
Reviewed-on: https://go-review.googlesource.com/c/go/+/660855
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-05-01 05:57:13 -07:00
Joel Sing
b70244ff7a cmd/compile: intrinsify math/bits.Len on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.Len using the
CLZ/CLZW machine instructions.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                 │   clz.b.1   │               clz.b.2               │
                 │   sec/op    │   sec/op     vs base                │
LeadingZeros-4     28.89n ± 0%   12.08n ± 0%  -58.19% (p=0.000 n=10)
LeadingZeros8-4    18.79n ± 0%   14.76n ± 0%  -21.45% (p=0.000 n=10)
LeadingZeros16-4   25.27n ± 0%   14.76n ± 0%  -41.59% (p=0.000 n=10)
LeadingZeros32-4   25.12n ± 0%   12.08n ± 0%  -51.92% (p=0.000 n=10)
LeadingZeros64-4   25.89n ± 0%   12.08n ± 0%  -53.35% (p=0.000 n=10)
geomean            24.55n        13.09n       -46.70%

Change-Id: I0dda684713dbdf5336af393f5ccbdae861c4f694
Reviewed-on: https://go-review.googlesource.com/c/go/+/652321
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-03-21 18:21:44 -07:00
Joel Sing
6fb7bdc96d cmd/compile: intrinsify math/bits.TrailingZeros on riscv64
For riscv64/rva22u64 and above, we can intrinsify math/bits.TrailingZeros
using the CTZ/CTZW machine instructions.

On a StarFive VisionFive 2 with GORISCV64=rva22u64:

                  │   ctz.b.1    │               ctz.b.2               │
                  │    sec/op    │   sec/op     vs base                │
TrailingZeros-4     25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
TrailingZeros8-4     14.76n ± 0%   10.74n ± 0%  -27.24% (p=0.000 n=10)
TrailingZeros16-4    26.84n ± 0%   10.74n ± 0%  -59.99% (p=0.000 n=10)
TrailingZeros32-4   25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
TrailingZeros64-4   25.500n ± 0%   8.052n ± 0%  -68.42% (p=0.000 n=10)
geomean              23.09n        9.035n       -60.88%

Change-Id: I71edf2b988acb7a68e797afda4ee66d7a57d587e
Reviewed-on: https://go-review.googlesource.com/c/go/+/652320
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
2025-03-15 19:07:53 -07:00
Michael Pratt
81c92352a7 runtime: move getcallerpc to internal/runtime/sys
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.

For #54766.

Change-Id: I0b3eded3bb45af53e3eb5bab93e3792e6a8beb46
Reviewed-on: https://go-review.googlesource.com/c/go/+/613260
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-09-17 15:14:14 +00:00
Joel Sing
0ee5d20b1f cmd/compile,cmd/internal/obj/riscv: always provide ANDN, ORN and XNOR for riscv64
The ANDN, ORN and XNOR RISC-V Zbb extension instructions are easily
synthesised. Make them always available by adding support to the
riscv64 assembler so that we either emit two instruction sequences,
or a single instruction, when permitted by the GORISCV64 profile.
This means that these instructions can be used unconditionally,
simplifying compiler rewrite rules, codegen tests and manually
written assembly.

Around 180 instructions are removed from the Go binary on riscv64
when built with rva22u64.

Change-Id: Ib2d90f2593a306530dc0ed08a981acde4d01be20
Reviewed-on: https://go-review.googlesource.com/c/go/+/611895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Tim King <taking@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-09-12 15:03:44 +00:00
Joel Sing
e126129d76 cmd/compile/internal/ssa: combine shift and addition for riscv64 rva22u64
When GORISCV64 enables rva22u64, combined shift and addition using the
SH1ADD, SH2ADD and SH3ADD instructions that are available via the Zba
extension. This results in more than 2000 instructions being removed
from the Go binary on riscv64.

Change-Id: Ia62ae7dda3d8083cff315113421bee73f518eea8
Reviewed-on: https://go-review.googlesource.com/c/go/+/606636
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
2024-08-28 13:46:24 +00:00
Joel Sing
3d9a89b057 cmd/compile: use integer min/max instructions on riscv64
When GORISCV64 enables rva22u64, make use of integer MIN/MINU/MAX/MAXU
instructions in compiler rewrite rules.

Change-Id: I4e7c514516acad03f2869d4c8936f06582cf7ea9
Reviewed-on: https://go-review.googlesource.com/c/go/+/559660
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
2024-08-19 13:46:02 +00:00
Joel Sing
997636760e cmd/compile,cmd/internal/obj: provide rotation pseudo-instructions for riscv64
Provide and use rotation pseudo-instructions for riscv64. The RISC-V bitmanip
extension adds support for hardware rotation instructions in the form of ROL,
ROLW, ROR, RORI, RORIW and RORW. These are easily implemented in the assembler
as pseudo-instructions for CPUs that do not support the bitmanip extension.

This approach provides a number of advantages, including reducing the rewrite
rules needed in the compiler, simplifying codegen tests and most importantly,
allowing these instructions to be used in assembly (for example, riscv64
optimised versions of SHA-256 and SHA-512). When bitmanip support is added,
these instruction sequences can simply be replaced with a single instruction
if permitted by the GORISCV64 profile.

Change-Id: Ia23402e1a82f211ac760690deb063386056ae1fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/565015
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
2024-03-07 14:57:07 +00:00
Joel Sing
daa58db486 cmd/compile: improve rotations for riscv64
Enable canRotate for riscv64, enable rotation intrinsics and provide
better rewrite implementations for rotations. By avoiding Lsh*x64
and Rsh*Ux64 we can produce better code, especially for 32 and 64
bit rotations. By enabling canRotate we also benefit from the generic
rotation rewrite rules.

Benchmark on a StarFive VisionFive 2:

               │   rotate.1   │              rotate.2               │
               │    sec/op    │   sec/op     vs base                │
RotateLeft-4     14.700n ± 0%   8.016n ± 0%  -45.47% (p=0.000 n=10)
RotateLeft8-4     14.70n ± 0%   10.69n ± 0%  -27.28% (p=0.000 n=10)
RotateLeft16-4    14.70n ± 0%   12.02n ± 0%  -18.23% (p=0.000 n=10)
RotateLeft32-4   13.360n ± 0%   8.016n ± 0%  -40.00% (p=0.000 n=10)
RotateLeft64-4   13.360n ± 0%   8.016n ± 0%  -40.00% (p=0.000 n=10)
geomean           14.15n        9.208n       -34.92%

Change-Id: I1a2036fdc57cf88ebb6617eb8d92e1d187e183b2
Reviewed-on: https://go-review.googlesource.com/c/go/+/560315
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
2024-02-16 11:59:07 +00:00
Meng Zhuo
09ed9a6585 cmd/compile: implement float min/max in hardware for riscv64
CL 514596 adds float min/max for amd64, this CL adds it for riscv64.

The behavior of the RISC-V FMIN/FMAX instructions almost match Go's
requirements.

However according to RISCV spec 8.3 "NaN Generation and Propagation"
>> if at least one input is a signaling NaN, or if both inputs are quiet
>> NaNs, the result is the canonical NaN. If one operand is a quiet NaN
>> and the other is not a NaN, the result is the non-NaN operand.

Go using quiet NaN as NaN and according to Go spec
>> if any argument is a NaN, the result is a NaN

This requires the float min/max implementation to check whether one
of operand is qNaN before float mix/max actually execute.

This CL also fix a typo in minmax test.

Benchmark on Visionfive2
goos: linux
goarch: riscv64
pkg: runtime
         │ float_minmax.old.bench │       float_minmax.new.bench        │
         │         sec/op         │   sec/op     vs base                │
MinFloat             158.20n ± 0%   28.13n ± 0%  -82.22% (p=0.000 n=10)
MaxFloat             158.10n ± 0%   28.12n ± 0%  -82.21% (p=0.000 n=10)
geomean               158.1n        28.12n       -82.22%

Update #59488

Change-Id: Iab48be6d32b8882044fb8c821438ca8840e5493d
Reviewed-on: https://go-review.googlesource.com/c/go/+/514775
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Run-TryBot: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2024-01-26 01:41:50 +00:00
Ubuntu
8fc043ccfa cmd/compile: optimize right shifts of int32 on riscv64
The compiler is currently sign extending 32 bit signed integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit signed values (sraw and sraiw) which sign extend
the result of the shift to 64 bits.  Change the compiler so that
it uses sraw and sraiw for shifts of signed 32 bit integers reducing
in most cases the number of instructions needed to perform the shift.

Here are some examples of code sequences that are changed by this
patch:

int32(a) >> 2

  before:

    sll     x5,x10,0x20
    sra     x10,x5,0x22

  after:

    sraw    x10,x10,0x2

int32(v) >> int(s)

  before:

    sext.w  x5,x10
    sltiu   x6,x11,64
    add     x6,x6,-1
    or      x6,x11,x6
    sra     x10,x5,x6

  after:

    sltiu   x5,x11,32
    add     x5,x5,-1
    or      x5,x11,x5
    sraw    x10,x10,x5

int32(v) >> (int(s) & 31)

  before:

    sext.w  x5,x10
    and     x6,x11,63
    sra     x10,x5,x6

after:

    and     x5,x11,31
    sraw    x10,x10,x5

int32(100) >> int(a)

  before:

    bltz    x10,<target address calls runtime.panicshift>
    sltiu   x5,x10,64
    add     x5,x5,-1
    or      x5,x10,x5
    li      x6,100
    sra     x10,x6,x5

  after:

    bltz    x10,<target address calls runtime.panicshift>
    sltiu   x5,x10,32
    add     x5,x5,-1
    or      x5,x10,x5
    li      x6,100
    sraw    x10,x6,x5

int32(v) >> (int(s) & 63)

  before:

    sext.w  x5,x10
    and     x6,x11,63
    sra     x10,x5,x6

  after:

    and     x5,x11,63
    sltiu   x6,x5,32
    add     x6,x6,-1
    or      x5,x5,x6
    sraw    x10,x10,x5

In most cases we eliminate one instruction.  In the case where
we shift a int32 constant by a variable the number of instructions
generated is identical.  A sra is simply replaced by a sraw.  In the
unusual case where we shift right by a variable anded with a constant
> 31 but < 64, we generate two additional instructions.  As this is
an unusual case we do not try to optimize for it.

Some improvements can be seen in some of the existing benchmarks,
notably in the utf8 package which performs right shifts of runes
which are signed 32 bit integers.

                      |  utf8-old   |              utf8-new            |
                      |   sec/op    |   sec/op     vs base             |
EncodeASCIIRune-4       17.68n ± 0%   17.67n ± 0%       ~ (p=0.312 n=10)
EncodeJapaneseRune-4    35.34n ± 0%   34.53n ± 1%  -2.31% (p=0.000 n=10)
AppendASCIIRune-4       3.213n ± 0%   3.213n ± 0%       ~ (p=0.318 n=10)
AppendJapaneseRune-4    36.14n ± 0%   35.35n ± 0%  -2.19% (p=0.000 n=10)
DecodeASCIIRune-4       28.11n ± 0%   27.36n ± 0%  -2.69% (p=0.000 n=10)
DecodeJapaneseRune-4    38.55n ± 0%   38.58n ± 0%       ~ (p=0.612 n=10)

Change-Id: I60a91cbede9ce65597571c7b7dd9943eeb8d3cc2
Reviewed-on: https://go-review.googlesource.com/c/go/+/535115
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
2023-10-30 14:47:06 +00:00
Joel Sing
f711892a8a cmd/compile/internal: stop lowering OpConvert on riscv64
Lowering for OpConvert was removed for all architectures in CL#108496,
prior to the riscv64 port being upstreamed. Remove lowering of OpConvert
on riscv64, which brings it inline with all other architectures. This
results in 1,600+ instructions being removed from the riscv64 go binary.

Change-Id: Iaaf1f8b397875926604048b66ad8ac91a98c871e
Reviewed-on: https://go-review.googlesource.com/c/go/+/533335
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
2023-10-07 12:31:59 +00:00
Mark Ryan
561bf0457f cmd/compile: optimize right shifts of uint32 on riscv
The compiler is currently zero extending 32 bit unsigned integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit unsigned values (srlw and srliw) which zero extend
the result of the shift to 64 bits.  Change the compiler so that
it uses srlw and srliw for 32 bit unsigned shifts reducing in most
cases the number of instructions needed to perform the shift.

Here are some examples of code sequences that are changed by this
patch:

uint32(a) >> 2

  before:

    sll     x5,x10,0x20
    srl     x10,x5,0x22

  after:

    srlw    x10,x10,0x2

uint32(a) >> int(b)

  before:

    sll     x5,x10,0x20
    srl     x5,x5,0x20
    srl     x5,x5,x11
    sltiu   x6,x11,64
    neg     x6,x6
    and     x10,x5,x6

  after:

    srlw    x5,x10,x11
    sltiu   x6,x11,32
    neg     x6,x6
    and     x10,x5,x6

bits.RotateLeft32(uint32(a), 1)

  before:

    sll     x5,x10,0x1
    sll     x6,x10,0x20
    srl     x7,x6,0x3f
    or      x5,x5,x7

  after:

   sll     x5,x10,0x1
   srlw    x6,x10,0x1f
   or      x10,x5,x6

bits.RotateLeft32(uint32(a), int(b))

  before:
    and     x6,x11,31
    sll     x7,x10,x6
    sll     x8,x10,0x20
    srl     x8,x8,0x20
    add     x6,x6,-32
    neg     x6,x6
    srl     x9,x8,x6
    sltiu   x6,x6,64
    neg     x6,x6
    and     x6,x9,x6
    or      x6,x6,x7

  after:

    and     x5,x11,31
    sll     x6,x10,x5
    add     x5,x5,-32
    neg     x5,x5
    srlw    x7,x10,x5
    sltiu   x5,x5,32
    neg     x5,x5
    and     x5,x7,x5
    or      x10,x6,x5

The one regression observed is the following case, an unbounded right
shift of a uint32 where the value we're shifting by is known to be
< 64 but > 31.  As this is an unusual case this commit does not
optimize for it, although the existing code does.

uint32(a) >> (b & 63)

  before:

    sll     x5,x10,0x20
    srl     x5,x5,0x20
    and     x6,x11,63
    srl     x10,x5,x6

  after

    and     x5,x11,63
    srlw    x6,x10,x5
    sltiu   x5,x5,32
    neg     x5,x5
    and     x10,x6,x5

Here we have one extra instruction.

Some benchmark highlights, generated on a VisionFive2 8GB running
Ubuntu 23.04.

pkg: math/bits
LeadingZeros32-4    18.64n ± 0%     17.32n ± 0%   -7.11% (p=0.000 n=10)
LeadingZeros64-4    15.47n ± 0%     15.51n ± 0%   +0.26% (p=0.027 n=10)
TrailingZeros16-4   18.48n ± 0%     17.68n ± 0%   -4.33% (p=0.000 n=10)
TrailingZeros32-4   16.87n ± 0%     16.07n ± 0%   -4.74% (p=0.000 n=10)
TrailingZeros64-4   15.26n ± 0%     15.27n ± 0%   +0.07% (p=0.043 n=10)
OnesCount32-4       20.08n ± 0%     19.29n ± 0%   -3.96% (p=0.000 n=10)
RotateLeft-4        8.864n ± 0%     8.838n ± 0%   -0.30% (p=0.006 n=10)
RotateLeft32-4      8.837n ± 0%     8.032n ± 0%   -9.11% (p=0.000 n=10)
Reverse32-4         29.77n ± 0%     26.52n ± 0%  -10.93% (p=0.000 n=10)
ReverseBytes32-4    9.640n ± 0%     8.838n ± 0%   -8.32% (p=0.000 n=10)
Sub32-4             8.835n ± 0%     8.035n ± 0%   -9.06% (p=0.000 n=10)
geomean             11.50n          11.33n        -1.45%

pkg: crypto/md5
Hash8Bytes-4             1.486µ ± 0%   1.426µ ± 0%  -4.04% (p=0.000 n=10)
Hash64-4                 2.079µ ± 0%   1.968µ ± 0%  -5.36% (p=0.000 n=10)
Hash128-4                2.720µ ± 0%   2.557µ ± 0%  -5.99% (p=0.000 n=10)
Hash256-4                3.996µ ± 0%   3.733µ ± 0%  -6.58% (p=0.000 n=10)
Hash512-4                6.541µ ± 0%   6.072µ ± 0%  -7.18% (p=0.000 n=10)
Hash1K-4                 11.64µ ± 0%   10.75µ ± 0%  -7.58% (p=0.000 n=10)
Hash8K-4                 82.95µ ± 0%   76.32µ ± 0%  -7.99% (p=0.000 n=10)
Hash1M-4                10.436m ± 0%   9.591m ± 0%  -8.10% (p=0.000 n=10)
Hash8M-4                 83.50m ± 0%   76.73m ± 0%  -8.10% (p=0.000 n=10)
Hash8BytesUnaligned-4    1.494µ ± 0%   1.434µ ± 0%  -4.02% (p=0.000 n=10)
Hash1KUnaligned-4        11.64µ ± 0%   10.76µ ± 0%  -7.52% (p=0.000 n=10)
Hash8KUnaligned-4        83.01µ ± 0%   76.32µ ± 0%  -8.07% (p=0.000 n=10)
geomean                  28.32µ        26.42µ       -6.72%

Change-Id: I20483a6668cca1b53fe83944bee3706aadcf8693
Reviewed-on: https://go-review.googlesource.com/c/go/+/528975
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-10-07 12:31:38 +00:00
Xianmiao Qu
d98f74b31e cmd/compile/internal: intrinsify publicationBarrier on riscv64
This enables publicationBarrier to be used as an intrinsic
on riscv64, optimizing the required function call and return
instructions for invoking the "runtime.publicationBarrier"
function.

This function is called by mallocgc. The benchmark results for malloc tested on Lichee-Pi-4A(TH1520, RISC-V 2.0G C910 x4) are as follows.

goos: linux
goarch: riscv64
pkg: runtime
                    │   old.txt   │              new.txt               │
                    │   sec/op    │   sec/op     vs base               │
Malloc8-4             92.78n ± 1%   90.77n ± 1%  -2.17% (p=0.001 n=10)
Malloc16-4            156.5n ± 1%   151.7n ± 2%  -3.10% (p=0.000 n=10)
MallocTypeInfo8-4     131.7n ± 1%   130.6n ± 2%       ~ (p=0.165 n=10)
MallocTypeInfo16-4    186.5n ± 2%   186.2n ± 1%       ~ (p=0.956 n=10)
MallocLargeStruct-4   1.345µ ± 1%   1.355µ ± 1%       ~ (p=0.093 n=10)
geomean               216.9n        214.5n       -1.10%


Change-Id: Ieab6c02309614bac5c1b12b5ee3311f988ff644d
Reviewed-on: https://go-review.googlesource.com/c/go/+/531719
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
2023-10-03 19:29:38 +00:00
Meng Zhuo
63ab68ddc5 cmd/compile: add single-precision FMA code generation for riscv64
This CL adds FMADDS,FMSUBS,FNMADDS,FNMSUBS SSA support for riscv

Change-Id: I1e7dd322b46b9e0f4923dbba256303d69ed12066
Reviewed-on: https://go-review.googlesource.com/c/go/+/506616
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
2023-08-22 12:05:36 +00:00
Meng Zhuo
05f9511582 cmd/compile: improve FP FMA performance on riscv64
FMADD/FMSUB/FNSUB are an efficient FP FMA instructions, which can
be used by the compiler to improve FP performance.

Erf               188.0n ± 2%   139.5n ± 2%  -25.82% (p=0.000 n=10)
Erfc              193.6n ± 1%   143.2n ± 1%  -26.01% (p=0.000 n=10)
Erfinv            244.4n ± 2%   172.6n ± 0%  -29.40% (p=0.000 n=10)
Erfcinv           244.7n ± 2%   173.0n ± 1%  -29.31% (p=0.000 n=10)
geomean           216.0n        156.3n       -27.65%

Ref: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA
11.6 Single-Precision Floating-Point Computational Instructions

Change-Id: I89aa3a4df7576fdd47f4a6ee608ac16feafd093c
Reviewed-on: https://go-review.googlesource.com/c/go/+/506036
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-08-22 08:38:08 +00:00
Keith Randall
21d82e6ac8 cmd/compile: batch write barrier calls
Have the write barrier call return a pointer to a buffer into which
the generated code records pointers that need write barrier treatment.

Change-Id: I7871764298e0aa1513de417010c8d46b296b199e
Reviewed-on: https://go-review.googlesource.com/c/go/+/447781
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Bypass: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-02-24 00:21:13 +00:00
Keith Randall
12befc3ce3 cmd/compile: improve scheduling pass
Convert the scheduling pass from scheduling backwards to scheduling forwards.

Forward scheduling makes it easier to prioritize scheduling values as
soon as they are ready, which is important for things like nil checks,
select ops, etc.

Forward scheduling is also quite a bit clearer. It was originally
backwards because computing uses is tricky, but I found a way to do it
simply and with n lg n complexity. The new scheme also makes it easy
to add new scheduling edges if needed.

Fixes #42673
Update #56568

Change-Id: Ibbb38c52d191f50ce7a94f8c1cbd3cd9b614ea8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/270940
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2023-01-20 04:54:01 +00:00
Keith Randall
45dc81d856 cmd/compile: add memory argument to GetCallerSP
We need to make sure that when we get the stack pointer, we get it
at the right time.

V = GetCallerSP
Call()
W = GetCallerSP

If Call causes a stack growth, then we will be in a situation
where V != W. So it matters when GetCallerSP operations get scheduled.
Add a memory argument to GetCallerSP so it can't be reordered with
things like calls.

Change-Id: I6cc801134c38e358c5a1ec0c09d38379a16a4184
Reviewed-on: https://go-review.googlesource.com/c/go/+/453515
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Martin Möhrmann <martin@golang.org>
Reviewed-by: Robert Griesemer <gri@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-01-19 22:43:22 +00:00
Russ Cox
164406ad93 cmd/compile: rename gen and builtin to _gen and _builtin
These two directories are full of //go:build ignore files.
We can ignore them more easily by putting an underscore
at the start of the name. That also works around a bug
in Go 1.17 that was not fixed until Go 1.17.3.

Change-Id: Ia5389b65c79b1e6d08e4fef374d335d776d44ead
Reviewed-on: https://go-review.googlesource.com/c/go/+/435472
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-10-04 19:35:46 +00:00
Renamed from src/cmd/compile/internal/ssa/gen/RISCV64Ops.go (Browse further)