Also rewrite subtraction from zero to NEG/NEGW.
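A sketch of the new rules (the exact riscv64 spellings may differ):

(SUB (MOVDconst [0]) x) -> (NEG x)
(SUBW (MOVDconst [0]) x) -> (NEGW x)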
Change-Id: I216e286d1860055f2a07fe2f772cd50f366ea097
Reviewed-on: https://go-review.googlesource.com/c/go/+/221691
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This commit adds a new cmd/compile flag -spectre,
which accepts a comma-separated list of possible
Spectre mitigations to apply, or the empty string (none),
or "all". The only known mitigation right now is "index",
which uses conditional moves to ensure that x86-64 CPUs
do not speculate past index bounds checks.
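For example, to enable just this mitigation for all packages in a
build:

go build -gcflags=all=-spectre=index ./...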
Speculating past index bounds checks may be problematic
on systems running privileged servers that accept requests
from untrusted users who can execute their own programs
on the same machine. (And some more constraints that
make it even more unlikely in practice.)
The cases this protects against are analogous to the ones
Microsoft explains in the "Array out of bounds load/store feeding ..."
sections here:
https://docs.microsoft.com/en-us/cpp/security/developer-guidance-speculative-execution?view=vs-2019#array-out-of-bounds-load-feeding-an-indirect-branch
Change-Id: Ib7532d7e12466b17e04c4e2075c2a456dc98f610
Reviewed-on: https://go-review.googlesource.com/c/go/+/222660
Reviewed-by: Keith Randall <khr@golang.org>
This allows zero stores to be performed using the zero register,
rather than loading a separate register with zero.
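A sketch of the kind of rewrite this enables, assuming
storezero-style op names as on other ports:

(MOVDstore ptr (MOVDconst [0]) mem) -> (MOVDstorezero ptr mem)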
Change-Id: Ic81d8dbcdacbb2ca2c3f77682ff5ad7cdc33d18d
Reviewed-on: https://go-review.googlesource.com/c/go/+/221684
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Const64 gets lowered to MOVDconst.
Change rules using interior Const64 to use MOVDconst instead,
to be less dependent on rule application order.
As a result of doing this, some of the rules end up being
exact duplicates; remove those.
We had those exact duplicates because of the order dependency;
ppc64 had no way to optimize away shifts by a constant
if the initial lowering didn't catch it.
Add those optimizations as well.
The outcome is the same, but this makes the overall rules more robust.
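A sketch of the shape of the change (not the exact ppc64 rules):

(Lsh64x64 x (Const64 [c])) && uint64(c) < 64 -> (SLDconst x [c]) // before
(Lsh64x64 x (MOVDconst [c])) && uint64(c) < 64 -> (SLDconst x [c]) // after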
Change-Id: Iadd97a9fe73d52358d571d022ace145e506d160b
Reviewed-on: https://go-review.googlesource.com/c/go/+/220877
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
The goal here is improved AuxInt printing in ssa.html.
Instead of displaying an inscrutable encoded integer,
it displays something like
v25 (28) = UBFX <int> [lsb=4,width=8] v52
which is much nicer for debugging.
Change-Id: I40713ff7f4a857c4557486cdf73c2dff137511ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/221420
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
SignExt32to64 can be implemented with a single ADDIW instruction,
rather than the two shifts currently in use.
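The lowering then becomes, roughly (a sketch; the actual rule may
spell types differently):

(SignExt32to64 x) -> (ADDIW [0] x)

ADDIW adds its immediate to the low 32 bits and sign-extends the
result to 64 bits, so adding zero is exactly a sign extension.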
Change-Id: Ie1bbaef4018f1ba5162773fc64fa5a887457cfc9
Reviewed-on: https://go-review.googlesource.com/c/go/+/220922
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Use SUBW to perform a 32-bit subtraction, rather than zero extending from
32 to 64 bits. This reduces Eq32 and Neq32 to two instructions, rather than
the four instructions required previously.
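A sketch of the resulting lowerings, assuming the usual RISC-V
SEQZ/SNEZ pseudo-instructions:

(Eq32 x y) -> (SEQZ (SUBW x y))
(Neq32 x y) -> (SNEZ (SUBW x y))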
Change-Id: Ib2798324881e9db842c864e91a0c1b1e48c4b67b
Reviewed-on: https://go-review.googlesource.com/c/go/+/220921
Reviewed-by: Cherry Zhang <cherryyz@google.com>
The generic Greater and Geq ops can always be replaced with the Less and
Leq ops. This CL therefore removes them. This simplifies the compiler since
it reduces the number of operations that need handling in both code and in
rewrite rules. This will be especially true when adding control flow
optimizations such as the integer-in-range optimizations in CL 165998.
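The replacement is a simple operand swap; for 64-bit signed
integers, for example, the rewrites are:

(Greater64 x y) -> (Less64 y x)
(Geq64 x y) -> (Leq64 y x)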
Change-Id: If0648b2b19998ac1bddccbf251283f3be4ec3040
Reviewed-on: https://go-review.googlesource.com/c/go/+/220417
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
And use this newfound power to more precisely describe some PPC64 ops.
Change-Id: Idb2b669d74fbab5f3508edf19f7e3347306b0daf
Reviewed-on: https://go-review.googlesource.com/c/go/+/217002
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
HMUL is commutative. However, it has asymmetric register requirements.
There are existing rewrite rules to place arguments in preferable slots.
Due to a bug, the existing rulegen commutativity engine doesn't generate
the commuted form of the HMUL rules.
The commuted forms of those rewrite rules cause infinite loops.
In order to fix the rulegen commutativity bug,
we need to choose between eliminating
those rewrite rules and marking HMUL ops as not commutative.
This change chooses the latter, since doing so yields better
optimization results on std+cmd.
Removing the rewrite rules yields only text size regressions:
file                                 before   after    Δ        %
runtime.s                            477257  477269  +12  +0.003%
time.s                                83552   83612  +60  +0.072%
encoding/asn1.s                       57378   57382   +4  +0.007%
cmd/go/internal/modfetch/codehost.s   89822   89829   +7  +0.008%
cmd/internal/test2json.s               9459    9466   +7  +0.074%
cmd/go/internal/test.s                57665   57678  +13  +0.023%
Marking HMUL as not commutative actually yields (mostly) improvements:
file                                 before   after    Δ        %
runtime.s                            477257  477247  -10  -0.002%
math.s                                35985   35992   +7  +0.019%
strconv.s                             53486   53462  -24  -0.045%
syscall.s                             82483   82446  -37  -0.045%
time.s                                83552   83561   +9  +0.011%
os.s                                  52691   52684   -7  -0.013%
archive/zip.s                         42285   42272  -13  -0.031%
encoding/asn1.s                       57378   57329  -49  -0.085%
encoding/base64.s                     12156   12094  -62  -0.510%
net.s                                296286  296276  -10  -0.003%
encoding/base32.s                      9720    9658  -62  -0.638%
net/http.s                           560931  560907  -24  -0.004%
net/smtp.s                            14421   14411  -10  -0.069%
cmd/vendor/golang.org/x/sys/unix.s    74307   74266  -41  -0.055%
The regressions are minor, and are in functions math.cbrt,
time.Time.String, and time.Date.
Change-Id: I9f6d9ee71654e5b70381cac77b0ac26011f4ea12
Reviewed-on: https://go-review.googlesource.com/c/go/+/213701
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
Based on riscv-go port.
Updates #27532
Change-Id: Ia329daa243db63ff334053b8807ea96b97ce3acf
Reviewed-on: https://go-review.googlesource.com/c/go/+/204631
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Count Values with side effects but no use as live, and don't fuse
branches that contain such Values. (This can happen, e.g., when a
side-effecting Value is followed by an infinite loop.) Otherwise this
may lead to miscompilation (a side effect firing under the wrong
condition) or an ICE (two stores live simultaneously).
Fixes #36005.
Change-Id: If202eae4b37cb7f0311d6ca120ffa46609925157
Reviewed-on: https://go-review.googlesource.com/c/go/+/210179
Reviewed-by: Keith Randall <khr@golang.org>
Intrinsify these functions to match other platforms. Update the
sequence of instructions used in the assembly implementations to
match the intrinsics.
Also, add a micro benchmark so we can more easily measure the
performance of these two functions:
name            old time/op  new time/op  delta
And8-8          5.33ns ± 7%  2.55ns ± 8%  -52.12%  (p=0.000 n=20+20)
And8Parallel-8  7.39ns ± 5%  3.74ns ± 4%  -49.34%  (p=0.000 n=20+20)
Or8-8           4.84ns ±15%  2.64ns ±11%  -45.50%  (p=0.000 n=20+20)
Or8Parallel-8   7.27ns ± 3%  3.84ns ± 4%  -47.10%  (p=0.000 n=19+20)
By using a 'rotate then xor selected bits' instruction combined with
either a 'load and and' or a 'load and or' instruction, we can
implement And8 and Or8 with far fewer instructions. Replacing
'compare and swap' with atomic instructions may also improve
performance when there is contention.
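For reference, a non-atomic Go sketch of the semantics being
intrinsified (the real runtime/internal/atomic versions perform
these read-modify-write updates atomically):

package sketch

// And8 clears the bits of *ptr that are clear in val.
func And8(ptr *uint8, val uint8) { *ptr &= val }

// Or8 sets the bits of *ptr that are set in val.
func Or8(ptr *uint8, val uint8) { *ptr |= val }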
Change-Id: I28bb8032052b73ae8ccdf6e4c612d2877085fa01
Reviewed-on: https://go-review.googlesource.com/c/go/+/204277
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
We'll use CTR as a scratch register for call injection. Mark code
sequences that use CTR as unsafe for async preemption. Currently
it is only used in LoweredZero and LoweredMove. It is unfortunate
that they are nonpreemptible. But I think it is still better than
using LR for call injection and marking all leaf functions
nonpreemptible.
Also mark the prologue of large frame functions nonpreemptible,
as we write below SP.
Change-Id: I05a75431499f3f4b2f23651a7b17f7fcf2afbe06
Reviewed-on: https://go-review.googlesource.com/c/go/+/203823
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Mark atomic LL/SC loops as unsafe for async preemption, as they
use REGTMP.
Change-Id: I5be7f93ad3ee337049ec7c3efd6fdc30eef87d97
Reviewed-on: https://go-review.googlesource.com/c/go/+/203719
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This API was added for #25819, where it was discussed as math.FMA.
The commit adding it used math.Fma, presumably for consistency
with the rest of the unusual names in package math
(Sincos, Acosh, Erfcinv, Float32bits, etc).
I believe that using an idiomatic Go name is more important here
than consistency with these other names, most of which are historical
baggage from C's standard library.
Early additions like Float32frombits happened before "uppercase for export"
(so they were originally like "float32frombits") and they were not properly
reconsidered when we uppercased the symbols to export them.
That's a mistake we live with.
The names of functions we have added since then, and even a few
that were legacy, are more properly Go-cased, such as IsNaN, IsInf,
and RoundToEven, rather than Isnan, Isinf, and Roundtoeven.
And also constants like MaxFloat32.
For new API, we should keep using proper Go-cased symbols
instead of minimally-upper-cased-C symbols.
So math.FMA, not math.Fma.
This API has not yet been released, so this change does not break
the compatibility promise.
This CL also modifies cmd/compile, since the compiler knows
the name of the function. I could have stopped at changing the
string constants, but it seemed to make more sense to use a
consistent casing everywhere.
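With the rename, callers write for example:

package main

import (
	"fmt"
	"math"
)

func main() {
	// FMA(x, y, z) computes x*y + z with a single rounding.
	fmt.Println(math.FMA(2, 3, 1)) // 7
}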
Change-Id: I0f6f3407f41e99bfa8239467345c33945088896e
Reviewed-on: https://go-review.googlesource.com/c/go/+/205317
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Introduce a mechanism for marking architecture-specific Ops
unsafe. And mark ones that use REGTMP on ARM64, as for async
preemption we will be using REGTMP as a temporary register in the
injected call.
Change-Id: I8ff22e87d8f9cb10d02a2f0af7c12ad6d7d58f54
Reviewed-on: https://go-review.googlesource.com/c/go/+/203459
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
CL 203284 added compiler intrinsics for atomic Load8 and Store8 on
several architectures, but missed the lowering on MIPS. This CL fixes
that.
Updates #10958, #24543.
Change-Id: I82e88971554fe8c33ad2bf195a633c44b9ac4cf7
Reviewed-on: https://go-review.googlesource.com/c/go/+/203977
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
For #10958, #24543, but makes sense on its own.
Change-Id: I2a87dab66b82a1863e4b6512b1f8def51463ce2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/203284
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This change introduces an arm intrinsic that generates the FMULAD
instruction for the fused-multiply-add operation on systems that
support it. System support is detected via cpu.ARM.HasVFPv4. A rewrite
rule translates the generic intrinsic to FMULAD.
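A sketch of the rewrite (the operand order is an assumption here,
with the addend placed first since FMULAD accumulates into its
first argument):

ARM: (Fma x y z) -> (FMULAD z x y)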
Updates #25819.
Change-Id: I8459e5dd1cdbdca35f88a78dbeb7d387f1e20efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/142117
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
To permit ssa-level optimization, this change introduces an amd64 intrinsic
that generates the VFMADD231SD instruction for the fused-multiply-add
operation on systems that support it. System support is detected via
cpu.X86.HasFMA. A rewrite rule can then translate the generic ssa intrinsic
("Fma") to VFMADD231SD.
The benchmark compares the software implementation (old) with the intrinsic
(new).
name   old time/op  new time/op  delta
Fma-4  27.2ns ± 1%  1.0ns ± 9%   -96.48%  (p=0.008 n=5+5)
Updates #25819.
Change-Id: I966655e5f96817a5d06dff5942418a3915b09584
Reviewed-on: https://go-review.googlesource.com/c/go/+/137156
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
In order to make math.FMA a compiler intrinsic for ISAs like ARM64,
PPC64[le], and S390X, a generic 3-argument opcode "Fma" is provided and
rewritten as
ARM64: (Fma x y z) -> (FMADDD z x y)
PPC64: (Fma x y z) -> (FMADD x y z)
S390X: (Fma x y z) -> (FMADD z x y)
Updates #25819.
Change-Id: Ie5bc628311e6feeb28ddf9adaa6e702c8c291efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/131959
Run-TryBot: Akhil Indurti <aindurti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Avoid an out-of-range error when calling LongString on a generic
block.
Change-Id: I33ca88940d899bc71e3155bc63d2aa925cf83230
Reviewed-on: https://go-review.googlesource.com/c/go/+/200737
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Prior to this CL conditional branches on s390x always used an
extended mnemonic such as BNE, BLT and so on to represent branch
instructions with different condition code masks. This CL adds
support for numeric condition code masks to the s390x SSA backend
so that we can encode the condition under which a Block's
successor is chosen as a field in that Block rather than in its
type.
This change will be useful as we come to add support for combined
compare-and-branch instructions. Rather than trying to add extended
mnemonics for every possible combination of mask and
compare-and-branch instruction, we can instead use a single mnemonic
for each instruction.
Change-Id: Idb7458f187b50906877d683695c291dff5279553
Reviewed-on: https://go-review.googlesource.com/c/go/+/197178
Reviewed-by: Keith Randall <khr@golang.org>
PPC64's ANDCC, ORCC, XORCC SSA ops produce a flags value, which
should not have the register mask of an integer register.
Fixes #34468.
Change-Id: Ic762e423b20275fd9f8118dae7951c258d59738c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196960
Reviewed-by: Keith Randall <khr@golang.org>
Before this change, wasm only used float variables with a size of
64 bits and applied rounding to 32-bit precision where necessary.
This change adds proper 32-bit float variables.
Reduces the size of pkg/js_wasm by 254 bytes.
Change-Id: Ieabe846a8cb283d66def3cdf11e2523b3b31f345
Reviewed-on: https://go-review.googlesource.com/c/go/+/195117
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This cleans up the isel code generation in ssa for ppc64x.
Currently there is no isel op; the isel code is generated only from
pseudo ops in ppc64/ssa.go, and only with operands whose values are
0 or 1. When the isel is generated, there is always a load of 1 into
the temp register before it.
This change implements the isel op so it can be used in PPC64.rules,
and can recognize operand values other than 0 or 1. This also
eliminates the forced load of 1, so it will be loaded only if
needed.
This will make the isel code generation consistent with other ops,
and allow future rule changes that can take advantage of having
a more general purpose isel rule.
Change-Id: I363e1dbd3f7f5dfecb53187ad51cce409a8d1f8d
Reviewed-on: https://go-review.googlesource.com/c/go/+/195057
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
This change adds an intrinsic for Mul64 on s390x. To achieve that, a
new assembly instruction, MLGR, is introduced in s390x/asmz.go. This
assembly instruction corresponds directly to an existing instruction
on Z and multiplies two 64-bit unsigned integers, storing the result
in two separate registers. In this case, we require the multiplicand
to be stored in register R3 and the output result (the high and low
64 bits of the product) to be stored in R2 and R3 respectively.
A test case is also added.
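The Go-level API this accelerates is math/bits.Mul64; for example:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Full 128-bit product of two 64-bit operands, returned as
	// high and low 64-bit halves.
	hi, lo := bits.Mul64(3, 1<<63)
	fmt.Println(hi, lo) // 1 9223372036854775808
}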
Benchmark:
name      old time/op  new time/op  delta
Mul-18    11.1ns ± 0%  1.4ns ± 0%   -87.39%  (p=0.002 n=8+10)
Mul32-18  2.07ns ± 0%  2.07ns ± 0%  ~        (all equal)
Mul64-18  11.1ns ± 1%  1.4ns ± 0%   -87.42%  (p=0.000 n=10+10)
Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This CL gets rid of the MOVDreg and MOVDnop SSA operations on
s390x. They were originally inserted to help avoid situations
where a sign/zero extension was elided but a spill invalidated
the optimization. It's not really clear we need to do this though
(amd64 doesn't have these ops for example) so long as we are
careful when removing sign/zero extensions. Also, the MOVDreg
technique doesn't work if the register is spilled before the
MOVDreg op (I haven't seen that in practice).
Removing these ops reduces the complexity of the rules and also
allows us to unblock optimizations. For example, the compiler can
now merge the loads in binary.{Big,Little}Endian.PutUint16 which
it wasn't able to do before. This CL reduces the size of the .text
section in the go tool by about 4.7KB (0.09%).
Change-Id: Icaddae7f2e4f9b2debb6fabae845adb3f73b41db
Reviewed-on: https://go-review.googlesource.com/c/go/+/173897
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Currently we use R16 and R17 for ARM64's Duff's devices.
According to the ARM64 ABI, R16 and R17 can be used by the
(external) linker as scratch registers in trampolines. So don't use
these registers to pass information across functions.
It seems unlikely that calling Duff's devices would need a
trampoline in normal cases. But it could happen if the call
target is out of the 128 MB direct jump limit.
The choice of R20 and R21 is somewhat arbitrary. The register
allocator allocates from low-numbered registers, so high-numbered
registers are less likely to hold live values and force spills.
Fixes #32773.
Change-Id: Id22d555b5afeadd4efcf62797d1580d641c39218
Reviewed-on: https://go-review.googlesource.com/c/go/+/183842
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
The z/Architecture does not guarantee that a load following a store
will not be reordered with that store, unless they access the same
address. Therefore if we want to ensure the sequential consistency
of atomic loads and stores we need to perform serialization
operations after atomic stores.
We do not need to serialize in the runtime when using StoreRel[ease]
and LoadAcq[uire]. The z/Architecture already provides sufficient
ordering guarantees for these operations.
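A sketch of the resulting store sequence in Go assembler syntax,
assuming SYNC as the serialization instruction (as used elsewhere
in the runtime's s390x assembly):

MOVD R1, (R2) // atomic 64-bit store
SYNC          // serialize: order the store before later loads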
name              old time/op  new time/op  delta
AtomicLoad64-16   0.51ns ± 0%  0.51ns ± 0%  ~        (all equal)
AtomicStore64-16  0.51ns ± 0%  0.60ns ± 9%  +16.47%  (p=0.000 n=17+20)
AtomicLoad-16     0.51ns ± 0%  0.51ns ± 0%  ~        (all equal)
AtomicStore-16    0.51ns ± 0%  0.60ns ± 9%  +16.50%  (p=0.000 n=18+20)
Fixes #32428.
Change-Id: I88d19a4010c46070e4fff4b41587efe4c628d4d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/180439
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This CL adds intrinsics for the 64-bit addition and subtraction
functions in math/bits. These intrinsics use the condition code
to propagate the carry or borrow bit.
To make the carry chains more efficient, I've removed the
'clobberFlags' property from most of the load and store operations.
Originally these ops did clobber flags when using offsets that
didn't fit in a signed 20-bit integer; however, that is no longer
true.
As with other platforms, the intrinsics are faster when executed in
a chain rather than a loop, because currently we need to spill and
restore the carry bit between each loop iteration.
be able to reduce the need to do this on s390x (e.g. by using
compare-and-branch instructions that do not clobber flags) in the
future.
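For example, a multi-word addition keeps the carry in the condition
code across the whole chain; this is the kind of pattern the
'multiple' benchmarks below measure (add192 is a hypothetical
helper, not part of this CL):

package main

import (
	"fmt"
	"math/bits"
)

// add192 adds two 192-bit integers, each given as three 64-bit
// words in little-endian order, threading the carry through the
// chain.
func add192(x0, x1, x2, y0, y1, y2 uint64) (z0, z1, z2 uint64) {
	var c uint64
	z0, c = bits.Add64(x0, y0, 0)
	z1, c = bits.Add64(x1, y1, c)
	z2, _ = bits.Add64(x2, y2, c)
	return
}

func main() {
	fmt.Println(add192(^uint64(0), 0, 0, 1, 0, 0)) // 0 1 0
}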
name           old time/op  new time/op  delta
Add64          1.21ns ± 2%  2.03ns ± 2%  +67.18%  (p=0.000 n=7+10)
Add64multiple  2.98ns ± 3%  1.03ns ± 0%  -65.39%  (p=0.000 n=10+9)
Sub64          1.23ns ± 4%  2.03ns ± 1%  +64.85%  (p=0.000 n=10+10)
Sub64multiple  3.73ns ± 4%  1.04ns ± 1%  -72.28%  (p=0.000 n=10+8)
Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This change creates an intrinsic for Add64 for ppc64x and adds a
test case for it.
name               old time/op  new time/op  delta
Add64-160          1.90ns ±40%  2.29ns ± 0%  ~        (p=0.119 n=5+5)
Add64multiple-160  6.69ns ± 2%  2.45ns ± 4%  -63.47%  (p=0.016 n=4+5)
Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>