The new RoundToEven function can be implemented as a single FIDBR
instruction on s390x.
name old time/op new time/op delta
RoundToEven 5.32ns ± 1% 0.86ns ± 1% -83.86% (p=0.000 n=10+10)
Change-Id: Iaf597e57a0d1085961701e3c75ff4f6f6dcebb5f
Reviewed-on: https://go-review.googlesource.com/74350
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This change adds three new instructions:
- LPDFR: load positive (math.Abs(x))
- LNDFR: load negative (-math.Abs(x))
- CPSDR: copy sign (math.Copysign(x, y))
By making use of GPR <-> FPR moves we can now compile math.Abs and
math.Copysign to these instructions using SSA rules.
This CL also adds new rules to merge address generation into combined
load operations. This makes GPR <-> FPR move matching more reliable.
name old time/op new time/op delta
Copysign 1.85ns ± 0% 1.40ns ± 1% -24.65% (p=0.000 n=8+10)
Abs 1.58ns ± 1% 0.73ns ± 1% -53.64% (p=0.000 n=10+10)
The geo mean improvement for all math package benchmarks was 4.6%.
Change-Id: I0cec35c5c1b3fb45243bf666b56b57faca981bc9
Reviewed-on: https://go-review.googlesource.com/73950
Run-TryBot: Michael Munday <mike.munday@ibm.com>
Reviewed-by: Keith Randall <khr@golang.org>
This adds support for math Abs, Copysign to be instrinsics on ppc64x.
New instruction FCPSGN is added to generate fcpsgn. Some new
rules are added to improve the int<->float conversions that are
generated mainly due to the Float64bits and Float64frombits in
the math package. PPC64.rules is also modified as suggested
in the review for CL 63290.
Improvements:
benchmark old ns/op new ns/op delta
BenchmarkAbs-16 1.12 0.69 -38.39%
BenchmarkCopysign-16 1.30 0.93 -28.46%
BenchmarkNextafter32-16 9.34 8.05 -13.81%
BenchmarkFrexp-16 8.81 7.60 -13.73%
Others that used Copysign also saw smaller improvements.
I attempted to make this work using rules since that
seems to be preferred, but due to the use of Float64bits and
Float64frombits in these functions, several rules had to be added and
even then not all cases were matched. Using rules became too
complicated and seemed too fragile for these.
Updates #21390
Change-Id: Ia265da9a18355e08000818a4fba1a40e9e031995
Reviewed-on: https://go-review.googlesource.com/67130
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Keith Randall <khr@golang.org>
This pragma is not actually honored by the compiler.
The tests implicitly relied on the inliner being unable
to inline closures with captured variables, which will
soon change.
Fixes#22208
Change-Id: I13abc9c930b9156d43ec216f8efb768952a29439
Reviewed-on: https://go-review.googlesource.com/73211
Reviewed-by: Michael Munday <mike.munday@ibm.com>
This CL optimizes assembly for len() or cap() division
by a power of 2 constants:
func lenDiv(s []int) int {
return len(s) / 16
}
amd64 assembly before the CL:
MOVQ "".s+16(SP), AX
MOVQ AX, CX
SARQ $63, AX
SHRQ $60, AX
ADDQ CX, AX
SARQ $4, AX
MOVQ AX, "".~r1+32(SP)
RET
amd64 assembly after the CL:
MOVQ "".s+16(SP), AX
SHRQ $4, AX
MOVQ AX, "".~r1+32(SP)
RET
The CL relies on the fact that len() and cap() result cannot
be negative.
Trigger stats for the added SSA rules on linux/amd64 when running
make.bash:
46 Div64
12 Mod64
The added SSA rules may trigger on more cases in the future
when SSA values will be populated with the info on their
lower bounds.
For instance:
func f(i int16) int16 {
if i < 3 {
return -1
}
// Lower bound of i is 3 here -> i is non-negative,
// so unsigned arithmetics may be used here.
return i % 16
}
Change-Id: I8bc6be5a03e71157ced533c01416451ff6f1a7f0
Reviewed-on: https://go-review.googlesource.com/65530
Reviewed-by: Keith Randall <khr@golang.org>
amd64 and 386 have rules to reduce multiplication by a positive power
of two, but a more general reduction (both for positive and negative
powers of two) is already performed by generic rules that were added
in CL 36323 to replace walkmul (see lines 166:173 in generic.rules).
The x86 and amd64 rules are never triggered during all.bash and can be
removed, reducing rules duplication.
The change also adds a few code generation tests for amd64 and 386.
Change-Id: I566d48186643bd722a4c0137fe94e513b8b20e36
Reviewed-on: https://go-review.googlesource.com/68450
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Combine setcc and store of result into setcc that writes directly to memory.
Triggers 200+ times in go tool.
Fixes#21630
Change-Id: Iafa22607426f4120140c88fae4b9aecb46e0bba8
Reviewed-on: https://go-review.googlesource.com/67950
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Ceil, Floor and Trunc are pre-existing intrinsics. Round is a new
function and has been added as an intrinsic in this CL. All of the
functions can be implemented as a single 'LOAD FP INTEGER'
instruction, FIDBR, on s390x.
name old time/op new time/op delta
Ceil 2.34ns ± 0% 0.85ns ± 0% -63.74% (p=0.000 n=5+4)
Floor 2.33ns ± 0% 0.85ns ± 1% -63.35% (p=0.008 n=5+5)
Round 4.23ns ± 0% 0.85ns ± 0% -79.89% (p=0.000 n=5+4)
Trunc 2.35ns ± 0% 0.85ns ± 0% -63.83% (p=0.029 n=4+4)
Change-Id: Idee7ba24a2899d12bf9afee4eedd6b4aaad3c510
Reviewed-on: https://go-review.googlesource.com/63890
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Add generic rules to propagate floating point constants through
comparisons and integer conversions. These new rules seldom trigger
in the standard library so there is no performance change, however
I think it is worth adding them anyway for completeness.
Change-Id: I9db5222746508a2996f1cafb72f4e0cf2541de07
Reviewed-on: https://go-review.googlesource.com/63795
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This adds rules to match the code in math/bits RotateLeft,
RotateLeft32, and RotateLef64 to allow them to be inlined.
The rules are complicated because the code in these function
use different types, and the non-const version of these
shifts generate Mask and Carry instructions that become
subexpressions during the match process.
Also adds a testcase to asm_test.go.
Improvement in math/bits:
BenchmarkRotateLeft-16 1.57 1.32 -15.92%
BenchmarkRotateLeft32-16 1.60 1.37 -14.37%
BenchmarkRotateLeft64-16 1.57 1.32 -15.92%
Updates #21390
Change-Id: Ib6f17669ecc9cab54f18d690be27e2225ca654a4
Reviewed-on: https://go-review.googlesource.com/59932
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
This CL adds generic SSA rules to propagate constants through raw bits
conversions between floats and integers. This allows constants to
propagate through some math functions. For example, math.Copysign(0, -1)
is now constant folded to a load of -0.0.
Requires a fix to the ARM assembler which loaded -0.0 as +0.0.
Change-Id: I52649a4691077c7414f19d17bb599a6743c23ac2
Reviewed-on: https://go-review.googlesource.com/62250
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Do the similar thing to CL 55143 to reduce IMUL.
Change-Id: I1bd38f618058e3cd74fac181f003610ea13f2294
Reviewed-on: https://go-review.googlesource.com/56252
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Normal shift rules plus constant folding are enough to generate
efficient shift-by-constant instructions.
Add test to make sure we don't generate comparisons for constant
shifts.
TODO: there are still constant shift rules on PPC64. If they
are removed, the constant folding rules are not enough to remove
all the test and mask stuff for constant shifts. Leave them in
for now.
Fixes#20663.
Change-Id: I724cc324aa8607762d0c8aacf9bfa641bda5c2a1
Reviewed-on: https://go-review.googlesource.com/60330
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Just to make it clearer which regexps are positive and which
regexps are negative.
Change-Id: Ia190e89be28048fcae2491506f552afad90a5f85
Reviewed-on: https://go-review.googlesource.com/59490
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
The SSA compiler currently generates MOVOstore instructions
to optimize 16 bytes moves on AMD64 architecture.
However, we can't use the MOVOstore instruction on Plan 9,
because floating point operations are not allowed in the
note handler.
We rely on the useSSE flag to disable the use of the
MOVOstore instruction on Plan 9 and replace it by two
MOVQstore instructions.
Fixes#21625
Change-Id: Idfefcceadccafe1752b059b5fe113ce566c0e71c
Reviewed-on: https://go-review.googlesource.com/59171
Run-TryBot: David du Colombier <0intro@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
Sometimes (often for calls) we generate code like this:
MOVQ (addr),AX
MOVQ 8(addr),BX
MOVQ AX,(otheraddr)
MOVQ BX,8(otheraddr)
Replace it with
MOVUPS (addr),X0
MOVUPS X0,(otheraddr)
For completeness do the same for 8,16,32-bit loads/stores too.
Shaves 1% from code sections of go tool.
/localdisk/itocar/golang/bin/go 10293917
go_old 10334877 [40960 bytes]
read-only data = 682 bytes (0.040769%)
global text (code) = 38961 bytes (1.036503%)
Total difference 39643 bytes (0.674628%)
Updates #6853
Change-Id: I1f0d2f60273a63a079b58927cd1c4e3429d2e7ae
Reviewed-on: https://go-review.googlesource.com/57130
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Implement int reg <-> fp reg moves on amd64.
If we see a load to int reg followed by an int->fp move, then we can just
load to the fp reg instead. Same for stores.
math.Abs is now:
MOVQ "".x+8(SP), AX
SHLQ $1, AX
SHRQ $1, AX
MOVQ AX, "".~r1+16(SP)
math.Copysign is now:
MOVQ "".x+8(SP), AX
SHLQ $1, AX
SHRQ $1, AX
MOVQ "".y+16(SP), CX
SHRQ $63, CX
SHLQ $63, CX
ORQ CX, AX
MOVQ AX, "".~r2+24(SP)
math.Float64bits is now:
MOVSD "".x+8(SP), X0
MOVSD X0, "".~r1+16(SP)
(it would be nicer to use a non-SSE reg for this, nothing is perfect)
And due to the fix for #21440, the inlined version of these improve as well.
name old time/op new time/op delta
Abs 1.38ns ± 5% 0.89ns ±10% -35.54% (p=0.000 n=10+10)
Copysign 1.56ns ± 7% 1.35ns ± 6% -13.77% (p=0.000 n=9+10)
Fixes#13095
Change-Id: Ibd7f2792412a6668608780b0688a77062e1f1499
Reviewed-on: https://go-review.googlesource.com/58732
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
This is a crude compiler pass to eliminate stores to auto variables
that are only ever written to.
Eliminates an unnecessary store to x from the following code:
func f() int {
var x := 1
return *(&x)
}
Fixes#19765.
Change-Id: If2c63a8ae67b8c590b6e0cc98a9610939a3eeffa
Reviewed-on: https://go-review.googlesource.com/38746
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This change adds to the code-generation harness in asm_test.go support
for the use of a '$' placeholder name for test functions.
A few of uninformative function names are also changed to use the
placeholder, to confirm that the change works as expected.
Fixes#21500
Change-Id: Iba168bd85efc9822253305d003b06682cf8a6c5c
Reviewed-on: https://go-review.googlesource.com/57292
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
We already combine const stores up-to MOVQstoreconst.
Combine 2 64-bit stores of const zero into 1 sse store of 128-bit zero.
Shaves significant (>1%) amount of code from go tool:
/localdisk/itocar/golang/bin/go 10334877
go_old 10388125 [53248 bytes]
global text (code) = 51041 bytes (1.343944%)
read-only data = 663 bytes (0.039617%)
Total difference 51704 bytes (0.873981%)
Change-Id: I7bc40968023c3a69f379b10fbb433cdb11364f1b
Reviewed-on: https://go-review.googlesource.com/56250
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
Reviewed-by: Keith Randall <khr@golang.org>
There are a few cases where this can be useful. Apart from the obvious
(and silly)
100*n + 200*n
where we generate one IMUL instead of two, consider:
15*n + 31*n
Currently, the compiler strength-reduces both imuls, generating:
0x0000 00000 MOVQ "".n+8(SP), AX
0x0005 00005 MOVQ AX, CX
0x0008 00008 SHLQ $4, AX
0x000c 00012 SUBQ CX, AX
0x000f 00015 MOVQ CX, DX
0x0012 00018 SHLQ $5, CX
0x0016 00022 SUBQ DX, CX
0x0019 00025 ADDQ CX, AX
0x001c 00028 MOVQ AX, "".~r1+16(SP)
0x0021 00033 RET
But combining the imuls is both faster and shorter:
0x0000 00000 MOVQ "".n+8(SP), AX
0x0005 00005 IMULQ $46, AX
0x0009 00009 MOVQ AX, "".~r1+16(SP)
0x000e 00014 RET
even without strength-reduction.
Moreover, consider:
5*n + 7*(n+1) + 11*(n+2)
We already have a rule that rewrites 7(n+1) into 7n+7, so the
generated code (without imuls merging) looks like this:
0x0000 00000 MOVQ "".n+8(SP), AX
0x0005 00005 LEAQ (AX)(AX*4), CX
0x0009 00009 MOVQ AX, DX
0x000c 00012 NEGQ AX
0x000f 00015 LEAQ (AX)(DX*8), AX
0x0013 00019 ADDQ CX, AX
0x0016 00022 LEAQ (DX)(CX*2), CX
0x001a 00026 LEAQ 29(AX)(CX*1), AX
0x001f 00031 MOVQ AX, "".~r1+16(SP)
But with imuls merging, the 5n, 7n and 11n factors get merged, and the
generated code looks like this:
0x0000 00000 MOVQ "".n+8(SP), AX
0x0005 00005 IMULQ $23, AX
0x0009 00009 ADDQ $29, AX
0x000d 00013 MOVQ AX, "".~r1+16(SP)
0x0012 00018 RET
Which is both faster and shorter; that's also the exact same code that
clang and the intel c compiler generate for the above expression.
Change-Id: Ib4d5503f05d2f2efe31a1be14e2fe6cac33730a9
Reviewed-on: https://go-review.googlesource.com/55143
Reviewed-by: Keith Randall <khr@golang.org>
For address of an auto or arg, on all non-x86 architectures
the assembler backend encodes the actual SP offset in the
instruction but leaves the offset in Prog unchanged. When the
assembly is printed in compile -S, it shows an offset
relative to pseudo FP/SP with an actual hardware SP base
register (e.g. R13 on ARM). This is confusing. Unset the
base register if it is indeed SP, so the assembly output is
consistent. If the base register isn't SP, it should be an
error and the error output contains the actual base register.
For address loading instructions, the base register isn't set
in the compiler on non-x86 architectures. Set it. Normally it
is SP and will be unset in the change mentioned above for
printing. If it is not, it will be an error and the error
output contains the actual base register.
No change in generated binary, only printed assembly. Passes
"go build -a -toolexec 'toolstash -cmp' std cmd" on all
architectures.
Fixes#21064.
Change-Id: Ifafe8d5f9b437efbe824b63b3cbc2f5f6cdc1fd5
Reviewed-on: https://go-review.googlesource.com/49432
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
The writebarrier test has to change.
Now that T23 composite literals are passed to the backend,
they get SSA'd, so writes to their fields are treated separately,
so the relevant part of the first write to t23 is now a dead store.
Preserve the intent of the test by splitting it up into two functions.
Reduces code size a bit:
name old object-bytes new object-bytes delta
Template 386k ± 0% 386k ± 0% ~ (all equal)
Unicode 202k ± 0% 202k ± 0% ~ (all equal)
GoTypes 1.16M ± 0% 1.16M ± 0% ~ (all equal)
Compiler 3.92M ± 0% 3.91M ± 0% -0.19% (p=0.008 n=5+5)
SSA 7.91M ± 0% 7.91M ± 0% ~ (all equal)
Flate 228k ± 0% 228k ± 0% -0.05% (p=0.008 n=5+5)
GoParser 283k ± 0% 283k ± 0% ~ (all equal)
Reflect 952k ± 0% 952k ± 0% -0.06% (p=0.008 n=5+5)
Tar 188k ± 0% 188k ± 0% -0.09% (p=0.008 n=5+5)
XML 406k ± 0% 406k ± 0% -0.02% (p=0.008 n=5+5)
[Geo mean] 649k 648k -0.04%
Fixes#18872
Change-Id: Ifeed0f71f13849732999aa731cc2bf40c0f0e32a
Reviewed-on: https://go-review.googlesource.com/43154
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
ARM64 assembler backend only accepts loads and stores with small
or aligned offset. The compiler therefore can only fold small or
aligned offsets into loads and stores. For locals and args, their
offsets to SP are not known until very late, and the compiler
makes conservative decision not folding some of them. However,
in most cases, the offset is indeed small or aligned, and can
be folded into load and store (but actually not).
This CL adds support of loads and stores with large and unaligned
offsets. When the offset doesn't fit into the instruction, it
uses two instructions and (for very large offset) the constant
pool. This way, the compiler doesn't need to be conservative,
and can simply fold the offset.
To make it work, the assembler's optab matching rules need to be
changed. Before, MOVD accepts C_UAUTO32K which matches multiple
of 8 between 0 and 32K, and also C_UAUTO16K, which may not be
multiple of 8 and does not fit into MOVD instruction. The
assembler errors in the latter case. This change makes it only
matches multiple of 8 (or offsets within ±256, which also fits
in instruction), and uses the large-or-unaligned-offset rule
for things doesn't fit (without error). Other sized move rules
are changed similarly.
Class C_UAUTO64K and C_UOREG64K are removed, as they are never
used.
In shared library, load/store of global is rewritten to using
GOT and temp register, which conflicts with the use of temp
register for assembling large offset. So the folding is disabled
for globals in shared library mode.
Reduce cmd/go binary size by 2%.
name old time/op new time/op delta
BinaryTree17-8 8.67s ± 0% 8.61s ± 0% -0.60% (p=0.000 n=9+10)
Fannkuch11-8 6.24s ± 0% 6.19s ± 0% -0.83% (p=0.000 n=10+9)
FmtFprintfEmpty-8 116ns ± 0% 116ns ± 0% ~ (all equal)
FmtFprintfString-8 196ns ± 0% 192ns ± 0% -1.89% (p=0.000 n=10+10)
FmtFprintfInt-8 199ns ± 0% 198ns ± 0% -0.35% (p=0.001 n=9+10)
FmtFprintfIntInt-8 294ns ± 0% 293ns ± 0% -0.34% (p=0.000 n=8+8)
FmtFprintfPrefixedInt-8 318ns ± 1% 318ns ± 1% ~ (p=1.000 n=10+10)
FmtFprintfFloat-8 537ns ± 0% 531ns ± 0% -1.17% (p=0.000 n=9+10)
FmtManyArgs-8 1.19µs ± 1% 1.18µs ± 1% -1.41% (p=0.001 n=10+10)
GobDecode-8 17.2ms ± 1% 17.3ms ± 2% ~ (p=0.165 n=10+10)
GobEncode-8 14.7ms ± 1% 14.7ms ± 2% ~ (p=0.631 n=10+10)
Gzip-8 837ms ± 0% 836ms ± 0% -0.14% (p=0.006 n=9+10)
Gunzip-8 141ms ± 0% 139ms ± 0% -1.24% (p=0.000 n=9+10)
HTTPClientServer-8 256µs ± 1% 253µs ± 1% -1.35% (p=0.000 n=10+10)
JSONEncode-8 40.1ms ± 1% 41.3ms ± 1% +3.06% (p=0.000 n=10+9)
JSONDecode-8 157ms ± 1% 156ms ± 1% -0.83% (p=0.001 n=9+8)
Mandelbrot200-8 8.94ms ± 0% 8.94ms ± 0% +0.02% (p=0.000 n=9+9)
GoParse-8 8.69ms ± 0% 8.54ms ± 1% -1.69% (p=0.000 n=8+10)
RegexpMatchEasy0_32-8 227ns ± 1% 228ns ± 1% +0.48% (p=0.016 n=10+9)
RegexpMatchEasy0_1K-8 1.92µs ± 0% 1.63µs ± 0% -15.08% (p=0.000 n=10+9)
RegexpMatchEasy1_32-8 256ns ± 0% 251ns ± 0% -2.19% (p=0.000 n=10+9)
RegexpMatchEasy1_1K-8 2.38µs ± 0% 2.09µs ± 0% -12.49% (p=0.000 n=10+9)
RegexpMatchMedium_32-8 352ns ± 0% 354ns ± 0% +0.39% (p=0.002 n=10+9)
RegexpMatchMedium_1K-8 106µs ± 0% 106µs ± 0% -0.05% (p=0.005 n=10+9)
RegexpMatchHard_32-8 5.92µs ± 0% 5.89µs ± 0% -0.40% (p=0.000 n=9+8)
RegexpMatchHard_1K-8 180µs ± 0% 179µs ± 0% -0.14% (p=0.000 n=10+9)
Revcomp-8 1.20s ± 0% 1.13s ± 0% -6.29% (p=0.000 n=9+8)
Template-8 159ms ± 1% 154ms ± 1% -3.14% (p=0.000 n=9+10)
TimeParse-8 800ns ± 3% 769ns ± 1% -3.91% (p=0.000 n=10+10)
TimeFormat-8 826ns ± 2% 817ns ± 2% -1.04% (p=0.050 n=10+10)
[Geo mean] 145µs 143µs -1.79%
Change-Id: I5fc42087cee9b54ea414f8ef6d6d020b80eb5985
Reviewed-on: https://go-review.googlesource.com/42172
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
This updates PPC64.rules to include rules to generate rotates
for ADD, OR, XOR operators that combine two opposite shifts
that sum to 32 or 64.
To support this change opcodes for ROTL and ROTLW were added to
be used like the rotldi and rotlwi extended mnemonics.
This provides the following improvement in sha3:
BenchmarkPermutationFunction-8 302.83 376.40 1.24x
BenchmarkSha3_512_MTU-8 98.64 121.92 1.24x
BenchmarkSha3_384_MTU-8 136.80 168.30 1.23x
BenchmarkSha3_256_MTU-8 169.21 211.29 1.25x
BenchmarkSha3_224_MTU-8 179.76 221.19 1.23x
BenchmarkShake128_MTU-8 212.87 263.23 1.24x
BenchmarkShake256_MTU-8 196.62 245.60 1.25x
BenchmarkShake256_16x-8 163.57 194.37 1.19x
BenchmarkShake256_1MiB-8 199.02 248.74 1.25x
BenchmarkSha3_512_1MiB-8 106.55 133.13 1.25x
Fixes#20030
Change-Id: I484c56f48395d32f53ff3ecb3ac6cb8191cfee44
Reviewed-on: https://go-review.googlesource.com/40992
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
To preserve reproducible builds, the text entries
during compilation will be sorted before being printed.
TestAssembly currently assumes that function init
comes after all user-defined functions.
Remove that assumption.
Instead of looking for "TEXT" to tell you where
a function ends--which may now yield lots of
non-function-code junk--look for a line beginning
with non-whitespace.
Updates #15756
Change-Id: Ibc82dba6143d769ef4c391afc360e523b1a51348
Reviewed-on: https://go-review.googlesource.com/39853
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Currently we expand comparison with small constant strings into len check
and a sequence of byte comparisons. Generate 16/32/64-bit comparisons,
instead of bytewise on 386 and amd64. Also increase limits on what is
considered small constant string.
Shaves ~30kb (0.5%) from go executable.
This also updates test/prove.go to keep test case valid.
Change-Id: I99ae8871a1d00c96363c6d03d0b890782fa7e1d9
Reviewed-on: https://go-review.googlesource.com/38776
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
For an AND that masks out leading or trailing bits, generic rules
rewrite it to a pair of shifts. On ARM64, the mask actually can
fit into an AND instruction. So we rewrite it back to AND.
Fixes#19857.
Change-Id: I479d7320ae4f29bb3f0056d5979bde4478063a8f
Reviewed-on: https://go-review.googlesource.com/39651
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Popcount instructions on amd64 are not guaranteed to be
present, so we must guard their call. Rewrite rules can't
generate control flow at the moment, so the intrinsifier
needs to generate that code.
name old time/op new time/op delta
OnesCount-8 2.47ns ± 5% 1.04ns ± 2% -57.70% (p=0.000 n=10+10)
OnesCount16-8 1.05ns ± 1% 0.78ns ± 0% -25.56% (p=0.000 n=9+8)
OnesCount32-8 1.63ns ± 5% 1.04ns ± 2% -35.96% (p=0.000 n=10+10)
OnesCount64-8 2.45ns ± 0% 1.04ns ± 1% -57.55% (p=0.000 n=6+10)
Update #18616
Change-Id: I4aff2cc9aa93787898d7b22055fe272a7cf95673
Reviewed-on: https://go-review.googlesource.com/38320
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
We want to merge a load and op into a single instruction
l = LOAD ptr mem
y = OP x l
into
y = OPload x ptr mem
However, all of our OPload instructions require that y uses
the same register as x. If x is needed past this instruction, then
we must copy x somewhere else, losing the whole benefit of merging
the instructions in the first place.
Disable this optimization if x is live past the OP.
Also disable this optimization if the OP is in a deeper loop than the load.
Update #19595
Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
name old time/op new time/op delta
LeadingZeros-4 2.00ns ± 0% 1.34ns ± 1% -33.02% (p=0.000 n=8+10)
LeadingZeros16-4 1.62ns ± 0% 1.57ns ± 0% -3.09% (p=0.001 n=8+9)
LeadingZeros32-4 2.14ns ± 0% 1.48ns ± 0% -30.84% (p=0.002 n=8+10)
LeadingZeros64-4 2.06ns ± 1% 1.33ns ± 0% -35.08% (p=0.000 n=8+8)
8-bit args is a special case - the Go code is really fast because
it is just a single table lookup. So I've disabled that for now.
Intrinsics were actually slower:
LeadingZeros8-4 1.22ns ± 3% 1.58ns ± 1% +29.56% (p=0.000 n=10+10)
Update #18616
Change-Id: Ia9c289b9ba59c583ea64060470315fd637e814cf
Reviewed-on: https://go-review.googlesource.com/38311
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
Implement math/bits.TrailingZerosX using intrinsics.
Generally reorganize the intrinsic spec a bit.
The instrinsics data structure is now built at init time.
This will make doing the other functions in math/bits easier.
Update sys.CtzX to return int instead of uint{64,32} so it
matches math/bits.TrailingZerosX.
Improve the intrinsics a bit for amd64. We don't need the CMOV
for <64 bit versions.
Update #18616
Change-Id: Ic1c5339c943f961d830ae56f12674d7b29d4ff39
Reviewed-on: https://go-review.googlesource.com/38155
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
This is now handled by os/exec.
Updates #12868
Change-Id: Ic21a6ff76a9b9517437ff1acf3a9195f9604bb45
Reviewed-on: https://go-review.googlesource.com/37698
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Explcitly block fused multiply-add pattern matching when a cast is used
after the multiplication, for example:
- (a * b) + c // can emit fused multiply-add
- float64(a * b) + c // cannot emit fused multiply-add
float{32,64} and complex{64,128} casts of matching types are now kept
as OCONV operations rather than being replaced with OCONVNOP operations
because they now imply a rounding operation (and therefore aren't a
no-op anymore).
Operations (for example, multiplication) on complex types may utilize
fused multiply-add and -subtract instructions internally. There is no
way to disable this behavior at the moment.
Improves the performance of the floating point implementation of
poly1305:
name old speed new speed delta
64 246MB/s ± 0% 275MB/s ± 0% +11.48% (p=0.000 n=10+8)
1K 312MB/s ± 0% 357MB/s ± 0% +14.41% (p=0.000 n=10+10)
64Unaligned 246MB/s ± 0% 274MB/s ± 0% +11.43% (p=0.000 n=10+10)
1KUnaligned 312MB/s ± 0% 357MB/s ± 0% +14.39% (p=0.000 n=10+8)
Updates #17895.
Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16
Reviewed-on: https://go-review.googlesource.com/36963
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Follow-up to CL 37270.
This considerably reduces the time to run the test.
Before:
real 0m7.638s
user 0m14.341s
sys 0m2.244s
After:
real 0m4.867s
user 0m7.107s
sys 0m1.842s
Change-Id: I8837a5da0979a1c365e1ce5874d81708249a4129
Reviewed-on: https://go-review.googlesource.com/37461
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <munday@ca.ibm.com>
TestAssembly was very slow, leading to it being skipped by default.
This is not surprising, it separately invoked the compiler and
parsed the result many times.
Now the test assembles one source file for arch/os combination,
containing the relevant functions.
Tests for each arch/os run in parallel.
Now the test runs approximately 10x faster on my Intel(R) Core(TM)
i5-6600 CPU @ 3.30GHz.
Fixes#18966
Change-Id: I45ab97630b627a32e17900c109f790eb4c0e90d9
Reviewed-on: https://go-review.googlesource.com/37270
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
... and same for stores. This does for binary.BigEndian.Uint16() what
was already done for Uint32 and Uint64 with BSWAP in 10f75748 (CL 32222).
Here is how generated code changes e.g. for the following function
(omitting saying the same prologue/epilogue):
func get16(b [2]byte) uint16 {
return binary.BigEndian.Uint16(b[:])
}
"".get16 t=1 size=21 args=0x10 locals=0x0
// before
0x0000 00000 (x.go:15) MOVBLZX "".b+9(FP), AX
0x0005 00005 (x.go:15) MOVBLZX "".b+8(FP), CX
0x000a 00010 (x.go:15) SHLL $8, CX
0x000d 00013 (x.go:15) ORL CX, AX
// after
0x0000 00000 (x.go:15) MOVWLZX "".b+8(FP), AX
0x0005 00005 (x.go:15) ROLW $8, AX
encoding/binary is speedup overall a bit:
name old time/op new time/op delta
ReadSlice1000Int32s-4 4.83µs ± 0% 4.83µs ± 0% ~ (p=0.206 n=4+5)
ReadStruct-4 1.29µs ± 2% 1.28µs ± 1% -1.27% (p=0.032 n=4+5)
ReadInts-4 384ns ± 1% 385ns ± 1% ~ (p=0.968 n=4+5)
WriteInts-4 534ns ± 3% 526ns ± 0% -1.54% (p=0.048 n=4+5)
WriteSlice1000Int32s-4 5.02µs ± 0% 5.11µs ± 3% ~ (p=0.175 n=4+5)
PutUint16-4 0.59ns ± 0% 0.49ns ± 2% -16.95% (p=0.016 n=4+5)
PutUint32-4 0.52ns ± 0% 0.52ns ± 0% ~ (all equal)
PutUint64-4 0.53ns ± 0% 0.53ns ± 0% ~ (all equal)
PutUvarint32-4 19.9ns ± 0% 19.9ns ± 1% ~ (p=0.556 n=4+5)
PutUvarint64-4 54.5ns ± 1% 54.2ns ± 0% ~ (p=0.333 n=4+5)
name old speed new speed delta
ReadSlice1000Int32s-4 829MB/s ± 0% 828MB/s ± 0% ~ (p=0.190 n=4+5)
ReadStruct-4 58.0MB/s ± 2% 58.7MB/s ± 1% +1.30% (p=0.032 n=4+5)
ReadInts-4 78.0MB/s ± 1% 77.8MB/s ± 1% ~ (p=0.968 n=4+5)
WriteInts-4 56.1MB/s ± 3% 57.0MB/s ± 0% ~ (p=0.063 n=4+5)
WriteSlice1000Int32s-4 797MB/s ± 0% 783MB/s ± 3% ~ (p=0.190 n=4+5)
PutUint16-4 3.37GB/s ± 0% 4.07GB/s ± 2% +20.83% (p=0.016 n=4+5)
PutUint32-4 7.73GB/s ± 0% 7.72GB/s ± 0% ~ (p=0.556 n=4+5)
PutUint64-4 15.1GB/s ± 0% 15.1GB/s ± 0% ~ (p=0.905 n=4+5)
PutUvarint32-4 201MB/s ± 0% 201MB/s ± 0% ~ (p=0.905 n=4+5)
PutUvarint64-4 147MB/s ± 1% 147MB/s ± 0% ~ (p=0.286 n=4+5)
( "a bit" only because most of the time is spent in reflection-like things
there, not actual bytes decoding. Even for direct PutUint16 benchmark the
looping adds overhead and lowers visible benefit. For code-generated encoders /
decoders actual effect is more than 20% )
Adding Uint32 and Uint64 raw benchmarks too for completeness.
NOTE I had to adjust load-combining rule for bswap case to match first 2 bytes
loads as result of "2-bytes load+shift" -> "loadw + rorw 8" rewrite. Reason is:
for loads+shift, even e.g. into uint16 var
var b []byte
var v uin16
v = uint16(b[1]) | uint16(b[0])<<8
the compiler eventually generates L(ong) shift - SHLLconst [8], probably
because it is more straightforward / other reasons to work on the whole
register. This way 2 bytes rewriting rule is using SHLLconst (not SHLWconst) in
its pattern, and then it always gets matched first, even if 2-byte rule comes
syntactically after 4-byte rule in AMD64.rules because 4-bytes rule seemingly
needs more applyRewrite() cycles to trigger. If 2-bytes rule gets matched for
inner half of
var b []byte
var v uin32
v = uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24
and we keep 4-byte load rule unchanged, the result will be MOVW + RORW $8 and
then series of byte loads and shifts - not one MOVL + BSWAPL.
There is no such problem for stores: there compiler, since it probably knows
store destination is 2 bytes wide, uses SHRWconst 8 (not SHRLconst 8) and thus
2-byte store rule is not a subset of rule for 4-byte stores.
Fixes#17151 (int16 was last missing piece there)
Change-Id: Idc03ba965bfce2b94fef456b02ff6742194748f6
Reviewed-on: https://go-review.googlesource.com/34636
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL 35261 introduces special handling of zero-valued STRUCTLIT for
efficient struct zeroing. But it didn't cover all use cases, for
example, CONVNOP STRUCTLIT is not handled.
On the other hand, CL 34566 handles zeroing earlier, so we don't
need the change in CL 35261 for efficient zeroing. Other uses of
zero-valued struct literals are very rare. So undo the change in
walk.go in CL 35261.
Add a test for efficient zeroing.
Fixes#19084.
Change-Id: I0807f7423fb44d47bf325b3c1ce9611a14953853
Reviewed-on: https://go-review.googlesource.com/36955
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@golang.org>