Commit graph

68 commits

Author SHA1 Message Date
Alberto Donizetti
b3c0fe1d14 cmd/compile: use typed aux in arm64 MOVstore rules
Introduces a few casts, mostly to fix rules that mix int64 and int32
off1 and off2.
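
To illustrate why casts are needed when a rule combines an int32 offset with an int64 one, here is a minimal sketch. The helper name foldOffset is hypothetical and is not taken from the CL; is32Bit is written out here so the example is self-contained.

```go
package main

import "fmt"

// is32Bit reports whether n fits in a signed 32-bit integer.
func is32Bit(n int64) bool { return n == int64(int32(n)) }

// foldOffset is a hypothetical helper: it combines an int32 offset (off1)
// with an int64 offset (off2) and reports whether the sum still fits in
// 32 bits. The explicit casts mirror what a typed rule has to spell out.
func foldOffset(off1 int32, off2 int64) (int32, bool) {
	sum := int64(off1) + off2
	if !is32Bit(sum) {
		return 0, false
	}
	return int32(sum), true
}

func main() {
	fmt.Println(foldOffset(16, 8))        // small offsets fold cleanly
	fmt.Println(foldOffset(1<<30, 1<<30)) // the sum no longer fits in int32
}
```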

Passes

  GOARCH=arm64 gotip build -toolexec 'toolstash -cmp' -a std

Change-Id: I1ec75211f3bb8e521dcc5217cf29ab0655a84d79
Reviewed-on: https://go-review.googlesource.com/c/go/+/230840
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-05-04 16:05:00 +00:00
Alberto Donizetti
666c9aedd4 cmd/compile: switch to typed auxint for arm64 TBZ/TBNZ block
This CL changes the arm64 TBZ/TBNZ block from using Aux to using
a (typed) AuxInt. The corresponding rules have also been changed
to be typed.

Passes

  GOARCH=arm64 gotip build -toolexec 'toolstash -cmp' -a std

Change-Id: I98d0cd2a791948f1db13259c17fb1b9b2807a043
Reviewed-on: https://go-review.googlesource.com/c/go/+/230839
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-04-30 17:30:54 +00:00
Austin Clements
2bad2f7eba cmd/compile: mark PanicBounds/Extend as calls
PanicBounds and PanicExtend are lowered to runtime calls (with a
non-Go ABI), but are not currently marked as calls. Since liveness
analysis only emits stack maps at calls in the runtime, this means
these panic call sites in the runtime won't get a stack map. These
almost immediately turn into throws in the runtime, but there's still
a chance they'll try to grow the stack first, which would lead to a
different panic.

To fix this, mark these operations as calls.

Outside the runtime, we currently emit stack maps for everything that
isn't an unsafe-point, so these panic calls get stack maps by default.
However, we're about to move to emitting stack maps only at call
sites, at which point this will start to matter outside the runtime as
well.

I confirmed that this has no effect on anything but PCDATA/FUNCDATA in
runtime and net/http.

For #36365.

Change-Id: Ic5bb463fd152cc320c815dc04cf62005261ae169
Reviewed-on: https://go-review.googlesource.com/c/go/+/230539
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-04-29 21:29:14 +00:00
Josh Bleecher Snyder
2cf3ebaf3d cmd/compile: add dedicated ARM64BitField aux type
The goal here is improved AuxInt printing in ssa.html.
Instead of displaying an inscrutable encoded integer,
it displays something like

v25 (28) = UBFX <int> [lsb=4,width=8] v52

which is much nicer for debugging.
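
The idea can be sketched as a small integer type with a String method. The type name bitField and the packing used here (lsb in the high byte, width in the low byte) are assumptions for illustration and may not match the compiler's actual encoding.

```go
package main

import "fmt"

// bitField is a hypothetical stand-in for a dedicated aux type.
// Assumed packing: lsb in the high byte, width in the low byte.
type bitField int16

// makeBitField packs lsb and width into a single integer.
func makeBitField(lsb, width int) bitField { return bitField(lsb<<8 | width) }

func (b bitField) lsb() int   { return int(uint16(b) >> 8) }
func (b bitField) width() int { return int(b) & 0xff }

// String decodes the packed value into a readable form.
func (b bitField) String() string {
	return fmt.Sprintf("lsb=%d,width=%d", b.lsb(), b.width())
}

func main() {
	bf := makeBitField(4, 8)
	fmt.Printf("raw integer: %d\n", int16(bf)) // the encoded form
	fmt.Printf("typed aux:   [%v]\n", bf)      // the decoded form
}
```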

Change-Id: I40713ff7f4a857c4557486cdf73c2dff137511ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/221420
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-02-28 14:52:13 +00:00
Cherry Zhang
1b0b980904 runtime: add async preemption support on ARM64
This CL adds support for call injection and async preemption on
ARM64.

There seems to be no way to return from the injected call without
clobbering *any* register. So we have to clobber one, which is
chosen to be REGTMP. Previous CLs have marked code sequences
that use REGTMP as async-nonpreemptible.

Change-Id: Ieca4e3ba5557adf3d0f5d923bce5f1769b58e30b
Reviewed-on: https://go-review.googlesource.com/c/go/+/203461
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2019-11-07 19:18:12 +00:00
Cherry Zhang
4a7ed1fab7 cmd/compile: mark architecture-specific unsafe points
Introduce a mechanism for marking architecture-specific Ops as
unsafe points, and mark the ones that use REGTMP on ARM64, because
async preemption will use REGTMP as a temporary register in the
injected call.

Change-Id: I8ff22e87d8f9cb10d02a2f0af7c12ad6d7d58f54
Reviewed-on: https://go-review.googlesource.com/c/go/+/203459
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
2019-11-05 02:55:11 +00:00
Austin Clements
97592b3c14 cmd/compile: intrinsics for runtime/internal/atomic.Store8
For #10958, #24543, but makes sense on its own.

Change-Id: I2a87dab66b82a1863e4b6512b1f8def51463ce2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/203284
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-10-29 03:18:55 +00:00
Michael Munday
9c2e7e8bed cmd/compile: allow multiple SSA block control values
Control values are used to choose which successor of a block is
jumped to. Typically a control value takes the form of a 'flags'
value that represents the result of a comparison. Some
architectures however use a variable in a register as a control
value.

Up until now we have managed with a single control value per block.
However some architectures (e.g. s390x and riscv64) have combined
compare-and-branch instructions that take two variables in registers
as parameters. To generate these instructions we need to support 2
control values per block.

This CL allows up to 2 control values to be used in a block in
order to support the addition of compare-and-branch instructions.
I have implemented s390x compare-and-branch instructions in a
different CL.
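
A simplified sketch of a block that carries up to two control values follows. The Value and Block types here are hypothetical stand-ins written for this example, not the compiler's real definitions.

```go
package main

import "fmt"

// Value and Block are simplified, hypothetical stand-ins for SSA types.
type Value struct{ Name string }

type Block struct {
	Kind     string
	Controls [2]*Value // unused slots are nil
}

// NumControls counts the control values in use (0, 1 or 2).
func (b *Block) NumControls() int {
	n := 0
	for _, c := range b.Controls {
		if c != nil {
			n++
		}
	}
	return n
}

func main() {
	flags := &Value{Name: "flags"}
	x, y := &Value{Name: "x"}, &Value{Name: "y"}

	// A conventional conditional branch on a single flags value.
	ifBlock := &Block{Kind: "If", Controls: [2]*Value{flags}}
	// A fused compare-and-branch that reads two register values.
	cmpBr := &Block{Kind: "CompareAndBranch", Controls: [2]*Value{x, y}}

	fmt.Println(ifBlock.Kind, ifBlock.NumControls())
	fmt.Println(cmpBr.Kind, cmpBr.NumControls())
}
```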

Passes toolstash-check -all.

Results of compilebench:

name                      old time/op       new time/op       delta
Template                        208ms ± 1%        209ms ± 1%    ~     (p=0.289 n=20+20)
Unicode                        83.7ms ± 1%       83.3ms ± 3%  -0.49%  (p=0.017 n=18+18)
GoTypes                         748ms ± 1%        748ms ± 0%    ~     (p=0.460 n=20+18)
Compiler                        3.47s ± 1%        3.48s ± 1%    ~     (p=0.070 n=19+18)
SSA                             11.5s ± 1%        11.7s ± 1%  +1.64%  (p=0.000 n=19+18)
Flate                           130ms ± 1%        130ms ± 1%    ~     (p=0.588 n=19+20)
GoParser                        160ms ± 1%        161ms ± 1%    ~     (p=0.211 n=20+20)
Reflect                         465ms ± 1%        467ms ± 1%  +0.42%  (p=0.007 n=20+20)
Tar                             184ms ± 1%        185ms ± 2%    ~     (p=0.087 n=18+20)
XML                             253ms ± 1%        253ms ± 1%    ~     (p=0.377 n=20+18)
LinkCompiler                    769ms ± 2%        774ms ± 2%    ~     (p=0.070 n=19+19)
ExternalLinkCompiler            3.59s ±11%        3.68s ± 6%    ~     (p=0.072 n=20+20)
LinkWithoutDebugCompiler        446ms ± 5%        454ms ± 3%  +1.79%  (p=0.002 n=19+20)
StdCmd                          26.0s ± 2%        26.0s ± 2%    ~     (p=0.799 n=20+20)

name                      old user-time/op  new user-time/op  delta
Template                        238ms ± 5%        240ms ± 5%    ~     (p=0.142 n=20+20)
Unicode                         105ms ±11%        106ms ±10%    ~     (p=0.512 n=20+20)
GoTypes                         876ms ± 2%        873ms ± 4%    ~     (p=0.647 n=20+19)
Compiler                        4.17s ± 2%        4.19s ± 1%    ~     (p=0.093 n=20+18)
SSA                             13.9s ± 1%        14.1s ± 1%  +1.45%  (p=0.000 n=18+18)
Flate                           145ms ±13%        146ms ± 5%    ~     (p=0.851 n=20+18)
GoParser                        185ms ± 5%        188ms ± 7%    ~     (p=0.174 n=20+20)
Reflect                         534ms ± 3%        538ms ± 2%    ~     (p=0.105 n=20+18)
Tar                             215ms ± 4%        211ms ± 9%    ~     (p=0.079 n=19+20)
XML                             295ms ± 6%        295ms ± 5%    ~     (p=0.968 n=20+20)
LinkCompiler                    832ms ± 4%        837ms ± 7%    ~     (p=0.707 n=17+20)
ExternalLinkCompiler            1.58s ± 8%        1.60s ± 4%    ~     (p=0.296 n=20+19)
LinkWithoutDebugCompiler        478ms ±12%        489ms ±10%    ~     (p=0.429 n=20+20)

name                      old object-bytes  new object-bytes  delta
Template                        559kB ± 0%        559kB ± 0%    ~     (all equal)
Unicode                         216kB ± 0%        216kB ± 0%    ~     (all equal)
GoTypes                        2.03MB ± 0%       2.03MB ± 0%    ~     (all equal)
Compiler                       8.07MB ± 0%       8.07MB ± 0%  -0.06%  (p=0.000 n=20+20)
SSA                            27.1MB ± 0%       27.3MB ± 0%  +0.89%  (p=0.000 n=20+20)
Flate                           343kB ± 0%        343kB ± 0%    ~     (all equal)
GoParser                        441kB ± 0%        441kB ± 0%    ~     (all equal)
Reflect                        1.36MB ± 0%       1.36MB ± 0%    ~     (all equal)
Tar                             487kB ± 0%        487kB ± 0%    ~     (all equal)
XML                             632kB ± 0%        632kB ± 0%    ~     (all equal)

name                      old export-bytes  new export-bytes  delta
Template                       18.5kB ± 0%       18.5kB ± 0%    ~     (all equal)
Unicode                        7.92kB ± 0%       7.92kB ± 0%    ~     (all equal)
GoTypes                        35.0kB ± 0%       35.0kB ± 0%    ~     (all equal)
Compiler                        109kB ± 0%        110kB ± 0%  +0.72%  (p=0.000 n=20+20)
SSA                             137kB ± 0%        138kB ± 0%  +0.58%  (p=0.000 n=20+20)
Flate                          4.89kB ± 0%       4.89kB ± 0%    ~     (all equal)
GoParser                       8.49kB ± 0%       8.49kB ± 0%    ~     (all equal)
Reflect                        11.4kB ± 0%       11.4kB ± 0%    ~     (all equal)
Tar                            10.5kB ± 0%       10.5kB ± 0%    ~     (all equal)
XML                            16.7kB ± 0%       16.7kB ± 0%    ~     (all equal)

name                      old text-bytes    new text-bytes    delta
HelloSize                       761kB ± 0%        761kB ± 0%    ~     (all equal)
CmdGoSize                      10.8MB ± 0%       10.8MB ± 0%    ~     (all equal)

name                      old data-bytes    new data-bytes    delta
HelloSize                      10.7kB ± 0%       10.7kB ± 0%    ~     (all equal)
CmdGoSize                       312kB ± 0%        312kB ± 0%    ~     (all equal)

name                      old bss-bytes     new bss-bytes     delta
HelloSize                       122kB ± 0%        122kB ± 0%    ~     (all equal)
CmdGoSize                       146kB ± 0%        146kB ± 0%    ~     (all equal)

name                      old exe-bytes     new exe-bytes     delta
HelloSize                      1.13MB ± 0%       1.13MB ± 0%    ~     (all equal)
CmdGoSize                      15.1MB ± 0%       15.1MB ± 0%    ~     (all equal)

Change-Id: I3cc2f9829a109543d9a68be4a21775d2d3e9801f
Reviewed-on: https://go-review.googlesource.com/c/go/+/196557
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Keith Randall <khr@golang.org>
2019-10-02 09:56:36 +00:00
Cherry Zhang
4ea7aa7cf3 cmd/compile, runtime: use R20, R21 in ARM64's Duff's devices
Currently we use R16 and R17 for ARM64's Duff's devices.
According to ARM64 ABI, R16 and R17 can be used by the (external)
linker as scratch registers in trampolines. So don't use these
registers to pass information across functions.

It seems unlikely that calling Duff's devices would need a
trampoline in normal cases. But it could happen if the call
target is out of the 128 MB direct jump limit.

The choice of R20 and R21 is somewhat arbitrary. The register
allocator allocates from low-numbered registers. High-numbered
registers are chosen because they are unlikely to hold a live
value, which would otherwise force a spill.

Fixes #32773.

Change-Id: Id22d555b5afeadd4efcf62797d1580d641c39218
Reviewed-on: https://go-review.googlesource.com/c/go/+/183842
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2019-06-26 16:01:47 +00:00
Austin Clements
4a4e05b0b1 cmd/compile,runtime/internal/atomic: add Load8
Change-Id: Id52a5730cf9207ee7ccebac4ef12791dc5720e7c
Reviewed-on: https://go-review.googlesource.com/c/go/+/172283
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2019-05-03 19:25:37 +00:00
erifan01
f8f265b9cf cmd/compile: intrinsify math/bits.Sub64 for arm64
This CL intrinsifies Sub64 with the arm64 instruction sequence NEGS, SBCS,
NGC and NEG, and optimizes the case of borrowing chains.

Benchmarks:
name              old time/op       new time/op       delta
Sub-64            2.500000ns +- 0%  2.048000ns +- 1%  -18.08%  (p=0.000 n=10+10)
Sub32-64          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Sub64-64          2.500000ns +- 0%  2.080000ns +- 0%  -16.80%  (p=0.000 n=10+7)
Sub64multiple-64  7.090000ns +- 0%  2.090000ns +- 0%  -70.52%  (p=0.000 n=10+10)
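
A borrowing chain is what arises when the borrow-out of one math/bits.Sub64 call feeds the next. A minimal sketch, assuming a hypothetical two-word uint128 type defined only for this example:

```go
package main

import (
	"fmt"
	"math/bits"
)

// uint128 is a hypothetical two-word integer used only for illustration.
type uint128 struct{ hi, lo uint64 }

// sub128 computes a - b, chaining the borrow from the low word into the
// high word. The final borrow is 1 if the subtraction wrapped around.
func sub128(a, b uint128) (uint128, uint64) {
	lo, borrow := bits.Sub64(a.lo, b.lo, 0)
	hi, borrow := bits.Sub64(a.hi, b.hi, borrow)
	return uint128{hi: hi, lo: lo}, borrow
}

func main() {
	a := uint128{hi: 1, lo: 0} // 2^64
	b := uint128{hi: 0, lo: 1} // 1
	d, borrow := sub128(a, b)
	fmt.Printf("hi=%#x lo=%#x borrow=%d\n", d.hi, d.lo, borrow)
}
```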

Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655
Reviewed-on: https://go-review.googlesource.com/c/go/+/159018
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-04-22 14:40:20 +00:00
erifan01
d0cbf9bf53 cmd/compile: follow up intrinsifying math/bits.Add64 for arm64
This CL addresses the additional review comments on CL 159017.

Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/168857
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-22 15:09:47 +00:00
erifan01
5714c91b53 cmd/compile: intrinsify math/bits.Add64 for arm64
This CL intrinsifies Add64 with the arm64 instruction sequence ADDS, ADCS
and ADC, and optimizes the case of carry chains. The CL also changes the
test code so that the intrinsic implementation can be tested.

Benchmarks:
name               old time/op       new time/op       delta
Add-224            2.500000ns +- 0%  2.090000ns +- 4%  -16.40%  (p=0.000 n=9+10)
Add32-224          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Add64-224          2.500000ns +- 0%  1.577778ns +- 2%  -36.89%  (p=0.000 n=10+9)
Add64multiple-224  6.000000ns +- 0%  2.000000ns +- 0%  -66.67%  (p=0.000 n=10+10)
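
A carry chain is the pattern where each math/bits.Add64 call consumes the carry produced by the previous one. A minimal sketch of multi-word addition; the helper addWords and its little-endian word order are assumptions made for this example:

```go
package main

import (
	"fmt"
	"math/bits"
)

// addWords adds two equal-length multi-word integers stored least
// significant word first, threading the carry through every word.
// It is a hypothetical helper and assumes len(x) == len(y).
func addWords(x, y []uint64) ([]uint64, uint64) {
	z := make([]uint64, len(x))
	var carry uint64
	for i := range x {
		z[i], carry = bits.Add64(x[i], y[i], carry)
	}
	return z, carry
}

func main() {
	x := []uint64{^uint64(0), ^uint64(0)} // 2^128 - 1
	y := []uint64{1, 0}                   // 1
	z, carry := addWords(x, y)
	fmt.Println(z, carry)
}
```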

Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-20 05:39:49 +00:00
Keith Randall
2c423f063b cmd/compile,runtime: provide index information on bounds check failure
A few examples (for accessing a slice of length 3):

   s[-1]    runtime error: index out of range [-1]
   s[3]     runtime error: index out of range [3] with length 3
   s[-1:0]  runtime error: slice bounds out of range [-1:]
   s[3:0]   runtime error: slice bounds out of range [3:0]
   s[3:-1]  runtime error: slice bounds out of range [:-1]
   s[3:4]   runtime error: slice bounds out of range [:4] with capacity 3
   s[0:3:4] runtime error: slice bounds out of range [::4] with capacity 3

Note that in cases where there are multiple things wrong with the
indexes (e.g. s[3:-1]), we report one of those errors kind of
arbitrarily, currently the rightmost one.
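
One way to observe these messages is to recover from the panic and read the error text. A minimal sketch; the helper name indexMessage is hypothetical:

```go
package main

import "fmt"

// indexMessage indexes s at i and returns the text of the runtime
// error if the access panics, or "no panic" otherwise.
func indexMessage(s []int, i int) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			if err, ok := r.(error); ok {
				msg = err.Error()
			} else {
				msg = fmt.Sprint(r)
			}
		}
	}()
	_ = s[i] // the bounds check still runs for a discarded result
	return "no panic"
}

func main() {
	s := []int{10, 20, 30}
	for _, i := range []int{1, 3, -1} {
		fmt.Printf("s[%d]: %s\n", i, indexMessage(s, i))
	}
}
```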

An exhaustive set of examples is in issue30116[u].out in the CL.

The message text has the same prefix as the old message text. That
leads to slightly awkward phrasing but hopefully minimizes the chance
that code depending on the error text will break.

Increases the size of the go binary by 0.5% (amd64). The panic functions
take arguments in registers in order to keep the size of the compiled code
as small as possible.

Fixes #30116

Change-Id: Idb99a827b7888822ca34c240eca87b7e44a04fdd
Reviewed-on: https://go-review.googlesource.com/c/go/+/161477
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2019-03-18 17:33:38 +00:00
fanzha02
27cce773d3 cmd/compile: optimize arm64 comparison of x and 0.0 with "FCMP $(0.0), Fn"
Code:
func comp(x float64) bool {return x < 0}

Previous version:
  FMOVD	"".x(FP), F0
  FMOVD	ZR, F1
  FCMPD	F1, F0
  CSET	MI, R0
  MOVB	R0, "".~r1+8(FP)
  RET	(R30)

Optimized version:
  FMOVD	"".x(FP), F0
  FCMPD	$(0.0), F0
  CSET	MI, R0
  MOVB	R0, "".~r1+8(FP)
  RET	(R30)

Math package benchmark results:
name                   old time/op          new time/op          delta
Acos-8                   77.500000ns +- 0%    77.400000ns +- 0%   -0.13%  (p=0.000 n=9+10)
Acosh-8                  98.600000ns +- 0%    98.100000ns +- 0%   -0.51%  (p=0.000 n=10+9)
Asin-8                   67.600000ns +- 0%    66.600000ns +- 0%   -1.48%  (p=0.000 n=9+10)
Asinh-8                 108.000000ns +- 0%   109.000000ns +- 0%   +0.93%  (p=0.000 n=10+10)
Atan-8                   36.788889ns +- 0%    36.000000ns +- 0%   -2.14%  (p=0.000 n=9+10)
Atanh-8                 104.000000ns +- 0%   105.000000ns +- 0%   +0.96%  (p=0.000 n=10+10)
Atan2-8                  67.100000ns +- 0%    66.600000ns +- 0%   -0.75%  (p=0.000 n=10+10)
Cbrt-8                   89.100000ns +- 0%    82.000000ns +- 0%   -7.97%  (p=0.000 n=10+10)
Erf-8                    43.500000ns +- 0%    43.000000ns +- 0%   -1.15%  (p=0.000 n=10+10)
Erfc-8                   49.000000ns +- 0%    48.220000ns +- 0%   -1.59%  (p=0.000 n=9+10)
Erfinv-8                 59.100000ns +- 0%    58.600000ns +- 0%   -0.85%  (p=0.000 n=10+10)
Erfcinv-8                59.100000ns +- 0%    58.600000ns +- 0%   -0.85%  (p=0.000 n=10+10)
Expm1-8                  56.600000ns +- 0%    56.040000ns +- 0%   -0.99%  (p=0.000 n=8+10)
Exp2Go-8                 97.600000ns +- 0%    99.400000ns +- 0%   +1.84%  (p=0.000 n=10+10)
Dim-8                     2.500000ns +- 0%     2.250000ns +- 0%  -10.00%  (p=0.000 n=10+10)
Mod-8                   108.000000ns +- 0%   106.000000ns +- 0%   -1.85%  (p=0.000 n=8+8)
Frexp-8                  12.000000ns +- 0%    12.500000ns +- 0%   +4.17%  (p=0.000 n=10+10)
Gamma-8                  67.100000ns +- 0%    67.600000ns +- 0%   +0.75%  (p=0.000 n=10+10)
Hypot-8                  17.100000ns +- 0%    17.000000ns +- 0%   -0.58%  (p=0.002 n=8+10)
Ilogb-8                   9.010000ns +- 0%     8.510000ns +- 0%   -5.55%  (p=0.000 n=10+9)
J1-8                    288.000000ns +- 0%   287.000000ns +- 0%   -0.35%  (p=0.000 n=10+10)
Jn-8                    605.000000ns +- 0%   604.000000ns +- 0%   -0.17%  (p=0.001 n=8+9)
Logb-8                   10.600000ns +- 0%    10.500000ns +- 0%   -0.94%  (p=0.000 n=9+10)
Log2-8                   16.500000ns +- 0%    17.000000ns +- 0%   +3.03%  (p=0.000 n=10+10)
PowFrac-8               232.000000ns +- 0%   233.000000ns +- 0%   +0.43%  (p=0.000 n=10+10)
Remainder-8              70.600000ns +- 0%    69.600000ns +- 0%   -1.42%  (p=0.000 n=10+10)
SqrtGoLatency-8          77.600000ns +- 0%    76.600000ns +- 0%   -1.29%  (p=0.000 n=10+10)
Tanh-8                   97.600000ns +- 0%    94.100000ns +- 0%   -3.59%  (p=0.000 n=10+10)
Y1-8                    289.000000ns +- 0%   288.000000ns +- 0%   -0.35%  (p=0.000 n=10+10)
Yn-8                    603.000000ns +- 0%   589.000000ns +- 0%   -2.32%  (p=0.000 n=10+10)

Change-Id: I6920734f8662b329aa58f5b8e4eeae73b409984d
Reviewed-on: https://go-review.googlesource.com/c/go/+/164719
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07 22:04:09 +00:00
fanzha02
6efd51c6b7 cmd/compile: change the condition flags of floating-point comparisons in arm64 backend
The current compiler reverses operands to work around NaN in
"less than" and "less than or equal" comparisons. But if we
want to use "FCMPD/FCMPS $(0.0), Fn" as an optimization,
that workaround does not work, because the assembler does
not support the instruction "FCMPD/FCMPS Fn, $(0.0)".

This CL sets condition flags for floating-point comparisons
to resolve this problem.

Change-Id: Ia48076a1da95da64596d6e68304018cb301ebe33
Reviewed-on: https://go-review.googlesource.com/c/go/+/164718
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-07 21:23:52 +00:00
Ben Shi
5aeecc4530 cmd/compile: optimize arm64's code with more shifted operations
This CL optimizes arm64's NEG/MVN/TST/CMN with a shifted operand.
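
For illustration, these are the kinds of source expressions that could plausibly match such rules. Which instruction each function compiles to is an assumption and depends on the compiler version; the function names are hypothetical.

```go
package main

import "fmt"

// negShift negates a shifted value (a candidate for NEG with a shifted operand).
func negShift(x int64) int64 { return -(x << 3) }

// notShift complements a shifted value (a candidate for MVN with a shifted operand).
func notShift(x int64) int64 { return ^(x >> 2) }

// testShift tests bits against a shifted mask (a candidate for TST).
func testShift(a, b uint64) bool { return a&(b<<4) == 0 }

// cmnShift compares a sum with zero (a candidate for CMN).
func cmnShift(a, b int64) bool { return a+(b<<1) == 0 }

func main() {
	fmt.Println(negShift(1), notShift(8), testShift(1, 1), cmnShift(-2, 1))
}
```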

1. The total size of pkg/android_arm64 decreases by about 0.2KB, excluding
cmd/compile/.

2. The go1 benchmark shows no regression, excluding noise.
name                     old time/op    new time/op    delta
BinaryTree17-4              16.4s ± 1%     16.4s ± 1%    ~     (p=0.914 n=29+29)
Fannkuch11-4                8.72s ± 0%     8.72s ± 0%    ~     (p=0.274 n=30+29)
FmtFprintfEmpty-4           174ns ± 0%     174ns ± 0%    ~     (all equal)
FmtFprintfString-4          370ns ± 0%     370ns ± 0%    ~     (all equal)
FmtFprintfInt-4             419ns ± 0%     419ns ± 0%    ~     (all equal)
FmtFprintfIntInt-4          672ns ± 1%     675ns ± 2%    ~     (p=0.217 n=28+30)
FmtFprintfPrefixedInt-4     806ns ± 0%     806ns ± 0%    ~     (p=0.402 n=30+28)
FmtFprintfFloat-4          1.09µs ± 0%    1.09µs ± 0%  +0.02%  (p=0.011 n=22+27)
FmtManyArgs-4              2.67µs ± 0%    2.68µs ± 0%    ~     (p=0.279 n=29+30)
GobDecode-4                33.1ms ± 1%    33.1ms ± 0%    ~     (p=0.052 n=28+29)
GobEncode-4                29.6ms ± 0%    29.6ms ± 0%  +0.08%  (p=0.013 n=28+29)
Gzip-4                      1.38s ± 2%     1.39s ± 2%    ~     (p=0.071 n=29+29)
Gunzip-4                    139ms ± 0%     139ms ± 0%    ~     (p=0.265 n=29+29)
HTTPClientServer-4          789µs ± 4%     785µs ± 4%    ~     (p=0.206 n=29+28)
JSONEncode-4               49.7ms ± 0%    49.6ms ± 0%  -0.24%  (p=0.000 n=30+30)
JSONDecode-4                266ms ± 1%     267ms ± 1%  +0.34%  (p=0.000 n=30+30)
Mandelbrot200-4            16.6ms ± 0%    16.6ms ± 0%    ~     (p=0.835 n=28+30)
GoParse-4                  15.9ms ± 0%    15.8ms ± 0%  -0.29%  (p=0.000 n=27+30)
RegexpMatchEasy0_32-4       380ns ± 0%     381ns ± 0%  +0.18%  (p=0.000 n=30+30)
RegexpMatchEasy0_1K-4      1.18µs ± 0%    1.19µs ± 0%  +0.23%  (p=0.000 n=30+30)
RegexpMatchEasy1_32-4       357ns ± 0%     358ns ± 0%  +0.28%  (p=0.000 n=29+29)
RegexpMatchEasy1_1K-4      2.04µs ± 0%    2.04µs ± 0%  +0.06%  (p=0.006 n=30+30)
RegexpMatchMedium_32-4      589ns ± 0%     590ns ± 0%  +0.24%  (p=0.000 n=28+30)
RegexpMatchMedium_1K-4      162µs ± 0%     162µs ± 0%  -0.01%  (p=0.027 n=26+29)
RegexpMatchHard_32-4       9.58µs ± 0%    9.58µs ± 0%    ~     (p=0.935 n=30+30)
RegexpMatchHard_1K-4        287µs ± 0%     287µs ± 0%    ~     (p=0.387 n=29+30)
Revcomp-4                   2.50s ± 0%     2.50s ± 0%  -0.10%  (p=0.020 n=28+28)
Template-4                  310ms ± 0%     310ms ± 1%    ~     (p=0.406 n=30+30)
TimeParse-4                1.68µs ± 0%    1.68µs ± 0%  +0.03%  (p=0.014 n=30+17)
TimeFormat-4               1.65µs ± 0%    1.66µs ± 0%  +0.32%  (p=0.000 n=27+29)
[Geo mean]                  247µs          247µs       +0.05%

name                     old speed      new speed      delta
GobDecode-4              23.2MB/s ± 0%  23.2MB/s ± 0%  -0.08%  (p=0.032 n=27+29)
GobEncode-4              26.0MB/s ± 0%  25.9MB/s ± 0%  -0.10%  (p=0.011 n=29+29)
Gzip-4                   14.1MB/s ± 2%  14.0MB/s ± 2%    ~     (p=0.081 n=29+29)
Gunzip-4                  139MB/s ± 0%   139MB/s ± 0%    ~     (p=0.290 n=29+29)
JSONEncode-4             39.0MB/s ± 0%  39.1MB/s ± 0%  +0.25%  (p=0.000 n=29+30)
JSONDecode-4             7.30MB/s ± 1%  7.28MB/s ± 1%  -0.33%  (p=0.000 n=30+30)
GoParse-4                3.65MB/s ± 0%  3.66MB/s ± 0%  +0.29%  (p=0.000 n=27+30)
RegexpMatchEasy0_32-4    84.1MB/s ± 0%  84.0MB/s ± 0%  -0.17%  (p=0.000 n=30+28)
RegexpMatchEasy0_1K-4     864MB/s ± 0%   862MB/s ± 0%  -0.24%  (p=0.000 n=30+30)
RegexpMatchEasy1_32-4    89.5MB/s ± 0%  89.3MB/s ± 0%  -0.18%  (p=0.000 n=28+24)
RegexpMatchEasy1_1K-4     502MB/s ± 0%   502MB/s ± 0%  -0.05%  (p=0.008 n=30+29)
RegexpMatchMedium_32-4   1.70MB/s ± 0%  1.69MB/s ± 0%  -0.59%  (p=0.000 n=29+30)
RegexpMatchMedium_1K-4   6.31MB/s ± 0%  6.31MB/s ± 0%  +0.05%  (p=0.005 n=30+26)
RegexpMatchHard_32-4     3.34MB/s ± 0%  3.34MB/s ± 0%    ~     (all equal)
RegexpMatchHard_1K-4     3.57MB/s ± 0%  3.57MB/s ± 0%    ~     (all equal)
Revcomp-4                 102MB/s ± 0%   102MB/s ± 0%  +0.10%  (p=0.022 n=28+28)
Template-4               6.26MB/s ± 0%  6.26MB/s ± 1%    ~     (p=0.768 n=30+30)
[Geo mean]               24.2MB/s       24.1MB/s       -0.08%

Change-Id: I494f9db7f8a568a00e9c74ae25086a58b2221683
Reviewed-on: https://go-review.googlesource.com/137976
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-28 15:05:17 +00:00
fanzha02
a19a83c8ef cmd/compile: optimize math.Float64(32)bits and math.Float64(32)frombits on arm64
Use float <-> int register moves without conversion instead of stores
and loads to move float <-> int values.

Math package benchmark results.
name                 old time/op  new time/op  delta
Acosh                 153ns ± 0%   147ns ± 0%   -3.92%  (p=0.000 n=10+10)
Asinh                 183ns ± 0%   177ns ± 0%   -3.28%  (p=0.000 n=10+10)
Atanh                 157ns ± 0%   155ns ± 0%   -1.27%  (p=0.000 n=10+10)
Atan2                 118ns ± 0%   117ns ± 1%   -0.59%  (p=0.003 n=10+10)
Cbrt                  119ns ± 0%   114ns ± 0%   -4.20%  (p=0.000 n=10+10)
Copysign             7.51ns ± 0%  6.51ns ± 0%  -13.32%  (p=0.000 n=9+10)
Cos                  73.1ns ± 0%  70.6ns ± 0%   -3.42%  (p=0.000 n=10+10)
Cosh                  119ns ± 0%   121ns ± 0%   +1.68%  (p=0.000 n=10+9)
ExpGo                 154ns ± 0%   149ns ± 0%   -3.05%  (p=0.000 n=9+10)
Expm1                 101ns ± 0%    99ns ± 0%   -1.88%  (p=0.000 n=10+10)
Exp2Go                150ns ± 0%   146ns ± 0%   -2.67%  (p=0.000 n=10+10)
Abs                  7.01ns ± 0%  6.01ns ± 0%  -14.27%  (p=0.000 n=10+9)
Mod                   234ns ± 0%   212ns ± 0%   -9.40%  (p=0.000 n=9+10)
Frexp                34.5ns ± 0%  30.0ns ± 0%  -13.04%  (p=0.000 n=10+10)
Gamma                 112ns ± 0%   111ns ± 0%   -0.89%  (p=0.000 n=10+10)
Hypot                73.6ns ± 0%  68.6ns ± 0%   -6.79%  (p=0.000 n=10+10)
HypotGo              77.1ns ± 0%  72.1ns ± 0%   -6.49%  (p=0.000 n=10+10)
Ilogb                31.0ns ± 0%  28.0ns ± 0%   -9.68%  (p=0.000 n=10+10)
J0                    437ns ± 0%   434ns ± 0%   -0.62%  (p=0.000 n=10+10)
J1                    433ns ± 0%   431ns ± 0%   -0.46%  (p=0.000 n=10+10)
Jn                    927ns ± 0%   922ns ± 0%   -0.54%  (p=0.000 n=10+10)
Ldexp                41.5ns ± 0%  37.0ns ± 0%  -10.84%  (p=0.000 n=9+10)
Log                   124ns ± 0%   118ns ± 0%   -4.84%  (p=0.000 n=10+9)
Logb                 34.0ns ± 0%  32.0ns ± 0%   -5.88%  (p=0.000 n=10+10)
Log1p                 110ns ± 0%   108ns ± 0%   -1.82%  (p=0.000 n=10+10)
Log10                 136ns ± 0%   132ns ± 0%   -2.94%  (p=0.000 n=10+10)
Log2                 51.6ns ± 0%  47.1ns ± 0%   -8.72%  (p=0.000 n=10+10)
Nextafter32          33.0ns ± 0%  30.5ns ± 0%   -7.58%  (p=0.000 n=10+10)
Nextafter64          29.0ns ± 0%  26.5ns ± 0%   -8.62%  (p=0.000 n=10+10)
PowInt                169ns ± 0%   160ns ± 0%   -5.33%  (p=0.000 n=10+10)
PowFrac               375ns ± 0%   361ns ± 0%   -3.73%  (p=0.000 n=10+10)
RoundToEven          14.0ns ± 0%  12.5ns ± 0%  -10.71%  (p=0.000 n=10+10)
Remainder             206ns ± 0%   192ns ± 0%   -6.80%  (p=0.000 n=10+9)
Signbit              6.01ns ± 0%  5.51ns ± 0%   -8.32%  (p=0.000 n=10+9)
Sin                  70.1ns ± 0%  69.6ns ± 0%   -0.71%  (p=0.000 n=10+10)
Sincos               99.1ns ± 0%  99.6ns ± 0%   +0.50%  (p=0.000 n=9+10)
SqrtGoLatency         178ns ± 0%   146ns ± 0%  -17.70%  (p=0.000 n=8+10)
SqrtPrime            9.19µs ± 0%  9.20µs ± 0%   +0.01%  (p=0.000 n=9+9)
Tanh                  125ns ± 1%   127ns ± 0%   +1.36%  (p=0.000 n=10+10)
Y0                    428ns ± 0%   426ns ± 0%   -0.47%  (p=0.000 n=10+10)
Y1                    431ns ± 0%   429ns ± 0%   -0.46%  (p=0.000 n=10+9)
Yn                    906ns ± 0%   901ns ± 0%   -0.55%  (p=0.000 n=10+10)
Float64bits          4.50ns ± 0%  3.50ns ± 0%  -22.22%  (p=0.000 n=10+10)
Float64frombits      4.00ns ± 0%  3.50ns ± 0%  -12.50%  (p=0.000 n=10+9)
Float32bits          4.50ns ± 0%  3.50ns ± 0%  -22.22%  (p=0.002 n=8+10)
Float32frombits      4.00ns ± 0%  3.50ns ± 0%  -12.50%  (p=0.000 n=10+10)

Change-Id: Iba829e15d5624962fe0c699139ea783efeefabc2
Reviewed-on: https://go-review.googlesource.com/129715
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-09-17 20:49:04 +00:00
erifan01
8149db4f64 cmd/compile: intrinsify math.RoundToEven and math.Abs on arm64
math.RoundToEven can be done with one arm64 instruction, FRINTND, so intrinsify it to improve performance.
The current pure Go implementation of Abs is translated into five instructions on arm64:
str, ldr, and, str, ldr. The intrinsic implementation requires only one instruction, so
intrinsifying it is worthwhile for performance.

Benchmarks:
name           old time/op  new time/op  delta
Abs-8          3.50ns ± 0%  1.50ns ± 0%  -57.14%  (p=0.000 n=10+10)
RoundToEven-8  9.26ns ± 0%  1.50ns ± 0%  -83.80%  (p=0.000 n=10+10)

Change-Id: I9456b26ab282b544dfac0154fc86f17aed96ac3d
Reviewed-on: https://go-review.googlesource.com/116535
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-09-13 14:52:51 +00:00
erifan01
204cc14bdd cmd/compile: implement non-constant rotates using ROR on arm64
Add some rules to match the Go code like:
	y &= 63
	x << y | x >> (64-y)
or
	y &= 63
	x >> y | x << (64-y)
as a ROR instruction. Make math/bits.RotateLeft faster on arm64.

Extends CL 132435 to arm64.

Benchmarks of math/bits.RotateLeftxxN:
name            old time/op       new time/op       delta
RotateLeft-8    3.548750ns +- 1%  2.003750ns +- 0%  -43.54%  (p=0.000 n=8+8)
RotateLeft8-8   3.925000ns +- 0%  3.925000ns +- 0%     ~     (p=1.000 n=8+8)
RotateLeft16-8  3.925000ns +- 0%  3.927500ns +- 0%     ~     (p=0.608 n=8+8)
RotateLeft32-8  3.925000ns +- 0%  2.002500ns +- 0%  -48.98%  (p=0.000 n=8+8)
RotateLeft64-8  3.536250ns +- 0%  2.003750ns +- 0%  -43.34%  (p=0.000 n=8+8)

Change-Id: I77622cd7f39b917427e060647321f5513973232c
Reviewed-on: https://go-review.googlesource.com/122542
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-07 14:52:02 +00:00
Ben Shi
0e9f1de0b7 cmd/compile: optimize arm64's comparison
Add more optimizations using TST/CMN.

1. A tiny benchmark shows more than 12% improvement.
TSTCMN-4                    378µs ± 0%     332µs ± 0%  -12.15%  (p=0.000 n=30+27)
(https://github.com/benshi001/ugo1/blob/master/tstcmn_test.go)

2. There is little regression in the go1 benchmark, excluding noise.

name                     old time/op    new time/op    delta
BinaryTree17-4              19.1s ± 0%     19.1s ± 0%    ~     (p=0.994 n=28+29)
Fannkuch11-4                10.0s ± 0%     10.0s ± 0%    ~     (p=0.198 n=30+25)
FmtFprintfEmpty-4           233ns ± 0%     233ns ± 0%  +0.14%  (p=0.002 n=24+30)
FmtFprintfString-4          428ns ± 0%     428ns ± 0%    ~     (all equal)
FmtFprintfInt-4             472ns ± 0%     472ns ± 0%    ~     (all equal)
FmtFprintfIntInt-4          725ns ± 0%     725ns ± 0%    ~     (all equal)
FmtFprintfPrefixedInt-4     889ns ± 0%     888ns ± 0%    ~     (p=0.632 n=28+30)
FmtFprintfFloat-4          1.20µs ± 0%    1.20µs ± 0%  +0.05%  (p=0.001 n=18+30)
FmtManyArgs-4              3.00µs ± 0%    2.99µs ± 0%  -0.07%  (p=0.001 n=27+30)
GobDecode-4                42.1ms ± 0%    42.2ms ± 0%  +0.29%  (p=0.000 n=28+28)
GobEncode-4                38.6ms ± 9%    38.8ms ± 9%    ~     (p=0.912 n=30+30)
Gzip-4                      2.07s ± 1%     2.05s ± 1%  -0.64%  (p=0.000 n=29+30)
Gunzip-4                    175ms ± 0%     175ms ± 0%  -0.15%  (p=0.001 n=30+30)
HTTPClientServer-4          872µs ± 5%     880µs ± 6%    ~     (p=0.196 n=30+29)
JSONEncode-4               88.5ms ± 1%    89.8ms ± 1%  +1.49%  (p=0.000 n=23+24)
JSONDecode-4                393ms ± 1%     390ms ± 1%  -0.89%  (p=0.000 n=28+30)
Mandelbrot200-4            19.5ms ± 0%    19.5ms ± 0%    ~     (p=0.405 n=29+28)
GoParse-4                  19.9ms ± 0%    20.0ms ± 0%  +0.27%  (p=0.000 n=30+30)
RegexpMatchEasy0_32-4       431ns ± 0%     431ns ± 0%    ~     (p=1.000 n=30+30)
RegexpMatchEasy0_1K-4      1.61µs ± 0%    1.61µs ± 0%    ~     (p=0.527 n=26+26)
RegexpMatchEasy1_32-4       443ns ± 0%     443ns ± 0%    ~     (all equal)
RegexpMatchEasy1_1K-4      2.58µs ± 1%    2.58µs ± 1%    ~     (p=0.578 n=27+25)
RegexpMatchMedium_32-4      740ns ± 0%     740ns ± 0%    ~     (p=0.357 n=30+30)
RegexpMatchMedium_1K-4      223µs ± 0%     223µs ± 0%  +0.16%  (p=0.000 n=30+29)
RegexpMatchHard_32-4       12.3µs ± 0%    12.3µs ± 0%    ~     (p=0.236 n=27+27)
RegexpMatchHard_1K-4        371µs ± 0%     371µs ± 0%  +0.09%  (p=0.000 n=30+27)
Revcomp-4                   2.85s ± 0%     2.85s ± 0%    ~     (p=0.057 n=28+25)
Template-4                  408ms ± 1%     409ms ± 1%    ~     (p=0.117 n=29+29)
TimeParse-4                1.93µs ± 0%    1.93µs ± 0%    ~     (p=0.535 n=29+28)
TimeFormat-4               1.99µs ± 0%    1.99µs ± 0%    ~     (p=0.168 n=29+28)
[Geo mean]                  306µs          307µs       +0.07%

name                     old speed      new speed      delta
GobDecode-4              18.3MB/s ± 0%  18.2MB/s ± 0%  -0.31%  (p=0.000 n=28+29)
GobEncode-4              19.9MB/s ± 8%  19.8MB/s ± 9%    ~     (p=0.923 n=30+30)
Gzip-4                   9.39MB/s ± 1%  9.45MB/s ± 1%  +0.65%  (p=0.000 n=29+30)
Gunzip-4                  111MB/s ± 0%   111MB/s ± 0%  +0.15%  (p=0.001 n=30+30)
JSONEncode-4             21.9MB/s ± 1%  21.6MB/s ± 1%  -1.45%  (p=0.000 n=23+23)
JSONDecode-4             4.94MB/s ± 1%  4.98MB/s ± 1%  +0.84%  (p=0.000 n=27+30)
GoParse-4                2.91MB/s ± 0%  2.90MB/s ± 0%  -0.34%  (p=0.000 n=21+22)
RegexpMatchEasy0_32-4    74.1MB/s ± 0%  74.1MB/s ± 0%    ~     (p=0.469 n=29+28)
RegexpMatchEasy0_1K-4     634MB/s ± 0%   634MB/s ± 0%    ~     (p=0.978 n=24+28)
RegexpMatchEasy1_32-4    72.2MB/s ± 0%  72.2MB/s ± 0%    ~     (p=0.064 n=27+29)
RegexpMatchEasy1_1K-4     396MB/s ± 1%   396MB/s ± 1%    ~     (p=0.583 n=27+25)
RegexpMatchMedium_32-4   1.35MB/s ± 0%  1.35MB/s ± 0%    ~     (all equal)
RegexpMatchMedium_1K-4   4.60MB/s ± 0%  4.59MB/s ± 0%  -0.14%  (p=0.000 n=30+26)
RegexpMatchHard_32-4     2.61MB/s ± 0%  2.61MB/s ± 0%    ~     (all equal)
RegexpMatchHard_1K-4     2.76MB/s ± 0%  2.76MB/s ± 0%    ~     (all equal)
Revcomp-4                89.1MB/s ± 0%  89.1MB/s ± 0%    ~     (p=0.059 n=28+25)
Template-4               4.75MB/s ± 1%  4.75MB/s ± 1%    ~     (p=0.106 n=29+29)
[Geo mean]               18.3MB/s       18.3MB/s       -0.07%

Change-Id: I3cd76ce63e84b0c3cebabf9fa3573b76a7343899
Reviewed-on: https://go-review.googlesource.com/124935
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-05 02:51:28 +00:00
Ben Shi
b444215116 cmd/compile: optimize ARM64's code with MADD/MSUB
MADD does MUL-ADD in a single instruction, and MSUB does the
same for MUL-SUB.

The CL implements the optimization with MADD/MSUB.
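
As an illustration, a minimal sketch of Go source shapes in which a multiply feeds directly into an add or a subtract. The function names mulAdd and mulSub are hypothetical, and whether the compiler actually selects MADD/MSUB for them depends on the rewrite rules in this CL, which are not reproduced here.

```go
package main

import "fmt"

// mulAdd computes a + b*c: a multiply whose result is only used by an add.
func mulAdd(a, b, c int64) int64 { return a + b*c }

// mulSub computes a - b*c: a multiply whose result is only used by a subtract.
func mulSub(a, b, c int64) int64 { return a - b*c }

func main() {
	fmt.Println(mulAdd(10, 3, 4), mulSub(10, 3, 4))
}
```

One way to check the generated code, assuming an arm64 toolchain is available, is to build with GOARCH=arm64 and inspect the output of go tool objdump.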

1. The total size of pkg/android_arm64/ decreases by about 20KB,
excluding cmd/compile/.

2. The go1 benchmark shows a little improvement for RegexpMatchHard_32-4
and Template-4, excluding noise.

name                     old time/op    new time/op    delta
BinaryTree17-4              16.3s ± 1%     16.5s ± 1%  +1.41%  (p=0.000 n=26+28)
Fannkuch11-4                8.79s ± 1%     8.76s ± 0%  -0.36%  (p=0.000 n=26+28)
FmtFprintfEmpty-4           172ns ± 0%     172ns ± 0%    ~     (all equal)
FmtFprintfString-4          362ns ± 1%     364ns ± 0%  +0.55%  (p=0.000 n=30+30)
FmtFprintfInt-4             416ns ± 0%     416ns ± 0%    ~     (p=0.099 n=22+30)
FmtFprintfIntInt-4          655ns ± 1%     660ns ± 1%  +0.76%  (p=0.000 n=30+30)
FmtFprintfPrefixedInt-4     810ns ± 0%     809ns ± 0%  -0.08%  (p=0.009 n=29+29)
FmtFprintfFloat-4          1.08µs ± 0%    1.09µs ± 0%  +0.61%  (p=0.000 n=30+29)
FmtManyArgs-4              2.70µs ± 0%    2.69µs ± 0%  -0.23%  (p=0.000 n=29+28)
GobDecode-4                32.2ms ± 1%    32.1ms ± 1%  -0.39%  (p=0.000 n=27+26)
GobEncode-4                27.4ms ± 2%    27.4ms ± 1%    ~     (p=0.864 n=28+28)
Gzip-4                      1.53s ± 1%     1.52s ± 1%  -0.30%  (p=0.031 n=29+29)
Gunzip-4                    146ms ± 0%     146ms ± 0%  -0.14%  (p=0.001 n=25+30)
HTTPClientServer-4         1.00ms ± 4%    0.98ms ± 6%  -1.65%  (p=0.001 n=29+30)
JSONEncode-4               67.3ms ± 1%    67.2ms ± 1%    ~     (p=0.520 n=28+28)
JSONDecode-4                329ms ± 5%     330ms ± 4%    ~     (p=0.142 n=30+30)
Mandelbrot200-4            17.3ms ± 0%    17.3ms ± 0%    ~     (p=0.055 n=26+29)
GoParse-4                  16.9ms ± 1%    17.0ms ± 1%  +0.82%  (p=0.000 n=30+30)
RegexpMatchEasy0_32-4       382ns ± 0%     382ns ± 0%    ~     (all equal)
RegexpMatchEasy0_1K-4      1.33µs ± 0%    1.33µs ± 0%  -0.25%  (p=0.000 n=30+27)
RegexpMatchEasy1_32-4       361ns ± 0%     361ns ± 0%  -0.08%  (p=0.002 n=30+28)
RegexpMatchEasy1_1K-4      2.11µs ± 0%    2.09µs ± 0%  -0.54%  (p=0.000 n=30+29)
RegexpMatchMedium_32-4      594ns ± 0%     592ns ± 0%  -0.32%  (p=0.000 n=30+30)
RegexpMatchMedium_1K-4      173µs ± 0%     172µs ± 0%  -0.77%  (p=0.000 n=29+27)
RegexpMatchHard_32-4       10.4µs ± 0%    10.1µs ± 0%  -3.63%  (p=0.000 n=28+27)
RegexpMatchHard_1K-4        306µs ± 0%     301µs ± 0%  -1.64%  (p=0.000 n=29+30)
Revcomp-4                   2.51s ± 1%     2.52s ± 0%  +0.18%  (p=0.017 n=26+27)
Template-4                  394ms ± 3%     382ms ± 3%  -3.22%  (p=0.000 n=28+28)
TimeParse-4                1.67µs ± 0%    1.67µs ± 0%  +0.05%  (p=0.030 n=27+30)
TimeFormat-4               1.72µs ± 0%    1.70µs ± 0%  -0.79%  (p=0.000 n=28+26)
[Geo mean]                  259µs          259µs       -0.33%

name                     old speed      new speed      delta
GobDecode-4              23.8MB/s ± 1%  23.9MB/s ± 1%  +0.40%  (p=0.001 n=27+26)
GobEncode-4              28.0MB/s ± 2%  28.0MB/s ± 1%    ~     (p=0.863 n=28+28)
Gzip-4                   12.7MB/s ± 1%  12.7MB/s ± 1%  +0.32%  (p=0.026 n=29+29)
Gunzip-4                  133MB/s ± 0%   133MB/s ± 0%  +0.15%  (p=0.001 n=24+30)
JSONEncode-4             28.8MB/s ± 1%  28.9MB/s ± 1%    ~     (p=0.475 n=28+28)
JSONDecode-4             5.89MB/s ± 4%  5.87MB/s ± 5%    ~     (p=0.174 n=29+30)
GoParse-4                3.43MB/s ± 0%  3.40MB/s ± 1%  -0.83%  (p=0.000 n=28+30)
RegexpMatchEasy0_32-4    83.6MB/s ± 0%  83.6MB/s ± 0%    ~     (p=0.848 n=28+29)
RegexpMatchEasy0_1K-4     768MB/s ± 0%   770MB/s ± 0%  +0.25%  (p=0.000 n=30+27)
RegexpMatchEasy1_32-4    88.5MB/s ± 0%  88.5MB/s ± 0%    ~     (p=0.086 n=29+29)
RegexpMatchEasy1_1K-4     486MB/s ± 0%   489MB/s ± 0%  +0.54%  (p=0.000 n=30+29)
RegexpMatchMedium_32-4   1.68MB/s ± 0%  1.69MB/s ± 0%  +0.60%  (p=0.000 n=30+23)
RegexpMatchMedium_1K-4   5.90MB/s ± 0%  5.95MB/s ± 0%  +0.85%  (p=0.000 n=18+20)
RegexpMatchHard_32-4     3.07MB/s ± 0%  3.18MB/s ± 0%  +3.72%  (p=0.000 n=29+26)
RegexpMatchHard_1K-4     3.35MB/s ± 0%  3.40MB/s ± 0%  +1.69%  (p=0.000 n=30+30)
Revcomp-4                 101MB/s ± 0%   101MB/s ± 0%  -0.18%  (p=0.018 n=26+27)
Template-4               4.92MB/s ± 4%  5.09MB/s ± 3%  +3.31%  (p=0.000 n=28+28)
[Geo mean]               22.4MB/s       22.6MB/s       +0.62%

Change-Id: I8f304b272785739f57b3c8f736316f658f8c1b2a
Reviewed-on: https://go-review.googlesource.com/129119
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-04 20:41:58 +00:00
Ben Shi
3ca3e89bb6 cmd/compile: optimize arm64 with indexed FP load/store
The FP load/store instructions on arm64 have register-indexed forms,
and this CL makes the compiler use them.
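
A minimal sketch of the kind of code that loads floating-point values at a variable index, which is where a base-plus-index addressing form could apply. The function dot is a hypothetical example; the exact instruction selection depends on the rules in this CL and on later changes.

```go
package main

import "fmt"

// dot loads a[i] and b[i] at a variable index i on every iteration,
// so each FP load has a base register and an index register available.
func dot(a, b []float64) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var s float64
	for i := 0; i < n; i++ {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	fmt.Println(dot([]float64{1, 2, 3}, []float64{4, 5, 6}))
}
```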

1. The total size of pkg/android_arm64 (excluding cmd/compile)
decreases by about 400 bytes.

2. There is no regression in the go1 benchmark; the GobEncode test
case even improves slightly, excluding noise.

name                     old time/op    new time/op    delta
BinaryTree17-4              19.0s ± 0%     19.0s ± 1%    ~     (p=0.817 n=29+29)
Fannkuch11-4                9.94s ± 0%     9.95s ± 0%  +0.03%  (p=0.010 n=24+30)
FmtFprintfEmpty-4           233ns ± 0%     233ns ± 0%    ~     (all equal)
FmtFprintfString-4          427ns ± 0%     427ns ± 0%    ~     (p=0.649 n=30+30)
FmtFprintfInt-4             471ns ± 0%     471ns ± 0%    ~     (all equal)
FmtFprintfIntInt-4          730ns ± 0%     730ns ± 0%    ~     (all equal)
FmtFprintfPrefixedInt-4     889ns ± 0%     889ns ± 0%    ~     (all equal)
FmtFprintfFloat-4          1.21µs ± 0%    1.21µs ± 0%  +0.04%  (p=0.012 n=20+30)
FmtManyArgs-4              2.99µs ± 0%    2.99µs ± 0%    ~     (p=0.651 n=29+29)
GobDecode-4                42.4ms ± 1%    42.3ms ± 1%  -0.27%  (p=0.001 n=29+28)
GobEncode-4                37.8ms ±11%    36.0ms ± 0%  -4.67%  (p=0.000 n=30+26)
Gzip-4                      1.98s ± 1%     1.96s ± 1%  -1.26%  (p=0.000 n=30+30)
Gunzip-4                    175ms ± 0%     175ms ± 0%    ~     (p=0.988 n=29+29)
HTTPClientServer-4          854µs ± 5%     860µs ± 5%    ~     (p=0.236 n=28+29)
JSONEncode-4               88.8ms ± 0%    87.9ms ± 0%  -1.00%  (p=0.000 n=24+26)
JSONDecode-4                390ms ± 1%     392ms ± 2%  +0.48%  (p=0.025 n=30+30)
Mandelbrot200-4            19.5ms ± 0%    19.5ms ± 0%    ~     (p=0.894 n=24+29)
GoParse-4                  20.3ms ± 0%    20.1ms ± 1%  -0.94%  (p=0.000 n=27+26)
RegexpMatchEasy0_32-4       451ns ± 0%     451ns ± 0%    ~     (p=0.578 n=30+30)
RegexpMatchEasy0_1K-4      1.63µs ± 0%    1.63µs ± 0%    ~     (p=0.298 n=30+28)
RegexpMatchEasy1_32-4       431ns ± 0%     434ns ± 0%  +0.67%  (p=0.000 n=30+29)
RegexpMatchEasy1_1K-4      2.60µs ± 0%    2.64µs ± 0%  +1.36%  (p=0.000 n=28+26)
RegexpMatchMedium_32-4      744ns ± 0%     744ns ± 0%    ~     (p=0.474 n=29+29)
RegexpMatchMedium_1K-4      223µs ± 0%     223µs ± 0%  -0.08%  (p=0.038 n=26+30)
RegexpMatchHard_32-4       12.2µs ± 0%    12.3µs ± 0%  +0.27%  (p=0.000 n=29+30)
RegexpMatchHard_1K-4        373µs ± 0%     373µs ± 0%    ~     (p=0.219 n=29+28)
Revcomp-4                   2.84s ± 0%     2.84s ± 0%    ~     (p=0.130 n=28+28)
Template-4                  394ms ± 1%     392ms ± 1%  -0.52%  (p=0.001 n=30+30)
TimeParse-4                1.93µs ± 0%    1.93µs ± 0%    ~     (p=0.587 n=29+30)
TimeFormat-4               2.00µs ± 0%    2.00µs ± 0%  +0.07%  (p=0.001 n=28+27)
[Geo mean]                  306µs          305µs       -0.17%

name                     old speed      new speed      delta
GobDecode-4              18.1MB/s ± 1%  18.2MB/s ± 1%  +0.27%  (p=0.001 n=29+28)
GobEncode-4              20.3MB/s ±10%  21.3MB/s ± 0%  +4.64%  (p=0.000 n=30+26)
Gzip-4                   9.79MB/s ± 1%  9.91MB/s ± 1%  +1.28%  (p=0.000 n=30+30)
Gunzip-4                  111MB/s ± 0%   111MB/s ± 0%    ~     (p=0.988 n=29+29)
JSONEncode-4             21.8MB/s ± 0%  22.1MB/s ± 0%  +1.02%  (p=0.000 n=24+26)
JSONDecode-4             4.97MB/s ± 1%  4.95MB/s ± 2%  -0.45%  (p=0.031 n=30+30)
GoParse-4                2.85MB/s ± 1%  2.88MB/s ± 1%  +1.03%  (p=0.000 n=30+26)
RegexpMatchEasy0_32-4    70.9MB/s ± 0%  70.9MB/s ± 0%    ~     (p=0.904 n=29+28)
RegexpMatchEasy0_1K-4     627MB/s ± 0%   627MB/s ± 0%    ~     (p=0.156 n=30+30)
RegexpMatchEasy1_32-4    74.2MB/s ± 0%  73.7MB/s ± 0%  -0.67%  (p=0.000 n=30+29)
RegexpMatchEasy1_1K-4     393MB/s ± 0%   388MB/s ± 0%  -1.34%  (p=0.000 n=28+26)
RegexpMatchMedium_32-4   1.34MB/s ± 0%  1.34MB/s ± 0%    ~     (all equal)
RegexpMatchMedium_1K-4   4.59MB/s ± 0%  4.59MB/s ± 0%  +0.07%  (p=0.035 n=25+30)
RegexpMatchHard_32-4     2.61MB/s ± 0%  2.61MB/s ± 0%  -0.11%  (p=0.002 n=28+30)
RegexpMatchHard_1K-4     2.75MB/s ± 0%  2.75MB/s ± 0%  +0.15%  (p=0.001 n=30+24)
Revcomp-4                89.4MB/s ± 0%  89.4MB/s ± 0%    ~     (p=0.140 n=28+28)
Template-4               4.93MB/s ± 1%  4.95MB/s ± 1%  +0.51%  (p=0.001 n=30+30)
[Geo mean]               18.4MB/s       18.4MB/s       +0.37%

Change-Id: I9a6b521a971b21cfb51064e8e9b853cef8a1d071
Reviewed-on: https://go-review.googlesource.com/124636
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-08-28 02:37:18 +00:00
Ben Shi
096229b2ec cmd/compile: add missing type information for some arm/arm64 rules
Some indexed load/store rules lack type information; this
CL adds it.

Change-Id: Icac315ccb83a2f5bf30b056d4667d5b59eb4e5e2
Reviewed-on: https://go-review.googlesource.com/128455
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-08-27 15:22:45 +00:00
Wei Xiao
0a7ac93c27 cmd/compile: improve atomic add intrinsics with ARMv8.1 new instruction
ARMv8.1 added a new instruction (LDADDAL) for atomic memory operations. This
CL improves the existing atomic add intrinsics with the new instruction. Since
the new instruction is only guaranteed to be present from ARMv8.1 onward, we
guard its use with a check on the CPU feature.
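
A minimal sketch of a contended atomic add from user code. It uses the public sync/atomic package rather than the runtime-internal Xadd/Xadd64 measured in this message; that sync/atomic.AddInt64 is compiled through the same intrinsic is an assumption here, and parallelCount is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// parallelCount starts the given number of goroutines, each of which
// atomically increments a shared counter perWorker times.
func parallelCount(workers, perWorker int) int64 {
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(parallelCount(8, 1000))
}
```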

Performance result on ARMv8.1 machine:
name        old time/op  new time/op  delta
Xadd-224    1.05µs ± 6%  0.02µs ± 4%  -98.06%  (p=0.000 n=10+8)
Xadd64-224  1.05µs ± 3%  0.02µs ±13%  -98.10%  (p=0.000 n=9+10)
[Geo mean]  1.05µs       0.02µs       -98.08%

Performance result on ARMv8.0 machine:
name        old time/op  new time/op  delta
Xadd-46      538ns ± 1%   541ns ± 1%  +0.62%  (p=0.000 n=9+9)
Xadd64-46    505ns ± 1%   508ns ± 0%  +0.48%  (p=0.003 n=9+8)
[Geo mean]   521ns        524ns       +0.55%

Change-Id: If4b5d8d0e2d6f84fe1492a4f5de0789910ad0ee9
Reviewed-on: https://go-review.googlesource.com/81877
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-06-21 14:52:43 +00:00
Cherry Zhang
44b826bb28 cmd/compile: use a different register for updated value in AtomicAnd8/Or8 on ARM64
The ARM64 manual says it is "constrained unpredictable" if the src
and dst registers of STLXRB are the same, although it doesn't seem
to cause any problem on real hardware so far. Fix by allocating
a different register to hold the updated value for
AtomicAnd8/Or8. We do this by making the ops return <val,mem>
like AtomicAdd, although val will not be used elsewhere.

Fixes #25823.

Change-Id: I735b9822f99877b3c7aee67a65e62b7278dc40df
Reviewed-on: https://go-review.googlesource.com/117976
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Wei Xiao <Wei.Xiao@arm.com>
2018-06-12 20:22:50 +00:00
Wei Xiao
bd8a88729c cmd/compile: intrinsify runtime.getcallerpc on arm64
Add a compiler intrinsic for getcallerpc on arm64 for better code generation.

Change-Id: I897e670a2b8ffa1a8c2fdc638f5b2c44bda26318
Reviewed-on: https://go-review.googlesource.com/109276
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-30 13:29:14 +00:00
Ben Shi
aaf73c6d1e cmd/compile: optimize ARM64 with shifted register indexed load/store
ARM64 supports efficient instructions that combine shift, addition, and
load/store, such as "MOVD (R0)(R1<<3), R2" and "MOVWU R6, (R4)(R1<<2)".

This CL optimizes the compiler to emit such efficient instructions. Some
test data follows.
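
The address arithmetic behind the shifted form can be sketched in Go. The helper elemOffset is hypothetical; it only demonstrates that the byte offset of element i in a []uint64 equals i<<3, which is the quantity a "(R0)(R1<<3)" operand folds into the load. It makes no claim about what the compiler emits for any particular function.

```go
package main

import (
	"fmt"
	"unsafe"
)

// elemOffset returns the byte distance between xs[0] and xs[i].
// It panics if i is out of range, like any slice index.
func elemOffset(xs []uint64, i int) uintptr {
	base := uintptr(unsafe.Pointer(&xs[0]))
	elem := uintptr(unsafe.Pointer(&xs[i]))
	return elem - base
}

func main() {
	xs := make([]uint64, 16)
	i := 5
	// For 8-byte elements the offset is the index shifted left by 3.
	fmt.Println(elemOffset(xs, i), uintptr(i)<<3)
}
```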

1. binary size before/after
binary                 size change
pkg/linux_arm64        +80.1KB
pkg/tool/linux_arm64   +121.9KB
go                     -4.3KB
gofmt                  -64KB

2. go1 benchmark
There is a big improvement for the test case Fannkuch11, and a slight
improvement for some others, excluding noise.

name                     old time/op    new time/op    delta
BinaryTree17-4              43.9s ± 2%     44.0s ± 2%     ~     (p=0.820 n=30+30)
Fannkuch11-4                30.6s ± 2%     24.5s ± 3%  -19.93%  (p=0.000 n=25+30)
FmtFprintfEmpty-4           500ns ± 0%     499ns ± 0%   -0.11%  (p=0.000 n=23+25)
FmtFprintfString-4         1.03µs ± 0%    1.04µs ± 3%     ~     (p=0.065 n=29+30)
FmtFprintfInt-4            1.15µs ± 3%    1.15µs ± 4%   -0.56%  (p=0.000 n=30+30)
FmtFprintfIntInt-4         1.80µs ± 5%    1.82µs ± 0%     ~     (p=0.094 n=30+24)
FmtFprintfPrefixedInt-4    2.17µs ± 5%    2.20µs ± 0%     ~     (p=0.100 n=30+23)
FmtFprintfFloat-4          3.08µs ± 3%    3.09µs ± 4%     ~     (p=0.123 n=30+30)
FmtManyArgs-4              7.41µs ± 4%    7.17µs ± 1%   -3.26%  (p=0.000 n=30+23)
GobDecode-4                93.7ms ± 0%    94.7ms ± 4%     ~     (p=0.685 n=24+30)
GobEncode-4                78.7ms ± 7%    77.1ms ± 0%     ~     (p=0.729 n=30+23)
Gzip-4                      4.01s ± 0%     3.97s ± 5%   -1.11%  (p=0.037 n=24+30)
Gunzip-4                    389ms ± 4%     384ms ± 0%     ~     (p=0.155 n=30+23)
HTTPClientServer-4          536µs ± 1%     537µs ± 1%     ~     (p=0.236 n=30+30)
JSONEncode-4                179ms ± 1%     182ms ± 6%     ~     (p=0.763 n=24+30)
JSONDecode-4                843ms ± 0%     839ms ± 6%   -0.42%  (p=0.003 n=25+30)
Mandelbrot200-4            46.5ms ± 0%    46.5ms ± 0%   +0.02%  (p=0.000 n=26+26)
GoParse-4                  44.3ms ± 6%    43.3ms ± 0%     ~     (p=0.067 n=30+27)
RegexpMatchEasy0_32-4      1.07µs ± 7%    1.07µs ± 4%     ~     (p=0.835 n=30+30)
RegexpMatchEasy0_1K-4      5.51µs ± 0%    5.49µs ± 0%   -0.35%  (p=0.000 n=23+26)
RegexpMatchEasy1_32-4      1.01µs ± 0%    1.02µs ± 4%   +0.96%  (p=0.014 n=24+30)
RegexpMatchEasy1_1K-4      7.43µs ± 0%    7.18µs ± 0%   -3.41%  (p=0.000 n=23+24)
RegexpMatchMedium_32-4     1.78µs ± 0%    1.81µs ± 4%   +1.47%  (p=0.012 n=23+30)
RegexpMatchMedium_1K-4      547µs ± 1%     542µs ± 3%   -0.90%  (p=0.003 n=24+30)
RegexpMatchHard_32-4       30.4µs ± 0%    29.7µs ± 0%   -2.15%  (p=0.000 n=19+23)
RegexpMatchHard_1K-4        913µs ± 0%     915µs ± 6%   +0.25%  (p=0.012 n=24+30)
Revcomp-4                   6.32s ± 1%     6.42s ± 4%     ~     (p=0.342 n=25+30)
Template-4                  868ms ± 6%     878ms ± 6%   +1.15%  (p=0.000 n=30+30)
TimeParse-4                4.57µs ± 4%    4.59µs ± 3%   +0.65%  (p=0.010 n=29+30)
TimeFormat-4               4.51µs ± 0%    4.50µs ± 0%   -0.27%  (p=0.000 n=27+24)
[Geo mean]                  695µs          689µs        -0.92%

name                     old speed      new speed      delta
GobDecode-4              8.19MB/s ± 0%  8.12MB/s ± 4%     ~     (p=0.680 n=24+30)
GobEncode-4              9.76MB/s ± 7%  9.96MB/s ± 0%     ~     (p=0.616 n=30+23)
Gzip-4                   4.84MB/s ± 0%  4.89MB/s ± 4%   +1.16%  (p=0.030 n=24+30)
Gunzip-4                 49.9MB/s ± 4%  50.6MB/s ± 0%     ~     (p=0.162 n=30+23)
JSONEncode-4             10.9MB/s ± 1%  10.7MB/s ± 6%     ~     (p=0.575 n=24+30)
JSONDecode-4             2.30MB/s ± 0%  2.32MB/s ± 5%   +0.72%  (p=0.003 n=22+30)
GoParse-4                1.31MB/s ± 6%  1.34MB/s ± 0%   +2.26%  (p=0.002 n=30+27)
RegexpMatchEasy0_32-4    30.0MB/s ± 6%  30.0MB/s ± 4%     ~     (p=1.000 n=30+30)
RegexpMatchEasy0_1K-4     186MB/s ± 0%   187MB/s ± 0%   +0.35%  (p=0.000 n=23+26)
RegexpMatchEasy1_32-4    31.8MB/s ± 0%  31.5MB/s ± 4%   -0.92%  (p=0.012 n=25+30)
RegexpMatchEasy1_1K-4     138MB/s ± 0%   143MB/s ± 0%   +3.53%  (p=0.000 n=23+24)
RegexpMatchMedium_32-4    560kB/s ± 0%   553kB/s ± 4%   -1.19%  (p=0.005 n=23+30)
RegexpMatchMedium_1K-4   1.87MB/s ± 0%  1.89MB/s ± 3%   +1.04%  (p=0.002 n=24+30)
RegexpMatchHard_32-4     1.05MB/s ± 0%  1.08MB/s ± 0%   +2.40%  (p=0.000 n=19+23)
RegexpMatchHard_1K-4     1.12MB/s ± 0%  1.12MB/s ± 5%   +0.12%  (p=0.006 n=25+30)
Revcomp-4                40.2MB/s ± 1%  39.6MB/s ± 4%     ~     (p=0.242 n=25+30)
Template-4               2.24MB/s ± 6%  2.21MB/s ± 6%   -1.15%  (p=0.000 n=30+30)
[Geo mean]               7.87MB/s       7.91MB/s        +0.44%

Change-Id: If374cb7abf83537aa0a176f73c0f736f7800db03
Reviewed-on: https://go-review.googlesource.com/108735
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-27 20:02:05 +00:00
Balaram Makam
f524268c40 cmd/compile: optimize ARM64 code with CMN/TST
Use CMN/TST to simplify comparisons. This can reduce register
pressure by removing single def/use registers, for example:
ADDW R0, R1, R8 -> CMNW R1, R0 ; CMN is an alias of ADDS.
CBZW R8, label  -> BEQ  label  ; single def/use of R8 removed.
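
At the source level, the pattern above corresponds to comparing a sum or a bitwise AND against zero without otherwise using the result. A minimal sketch follows; the function names are hypothetical, and whether a given comparison is rewritten to CMN or TST depends on the rules in this CL.

```go
package main

import "fmt"

// sumIsZero compares a+b with zero; the sum itself is never used,
// which is the shape a flag-setting add (CMN) can replace.
func sumIsZero(a, b int32) bool { return a+b == 0 }

// noCommonBits compares a&b with zero; the AND result is never used,
// which is the shape a flag-setting and (TST) can replace.
func noCommonBits(a, b uint64) bool { return a&b == 0 }

func main() {
	fmt.Println(sumIsZero(7, -7), noCommonBits(0xf0, 0x0f))
}
```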

Little change in performance of go1 benchmark on Amberwing:
name                   old time/op    new time/op    delta
RegexpMatchEasy0_32       247ns ± 0%     246ns ± 0%  -0.40%  (p=0.008 n=5+5)
RegexpMatchEasy0_1K       581ns ± 0%     580ns ± 0%    ~     (p=0.079 n=4+5)
RegexpMatchEasy1_32       244ns ± 0%     243ns ± 0%  -0.41%  (p=0.008 n=5+5)
RegexpMatchEasy1_1K       804ns ± 0%     806ns ± 0%  +0.25%  (p=0.016 n=5+4)
RegexpMatchMedium_32      313ns ± 0%     311ns ± 0%  -0.64%  (p=0.008 n=5+5)
RegexpMatchMedium_1K     52.2µs ± 0%    51.9µs ± 0%  -0.51%  (p=0.008 n=5+5)
RegexpMatchHard_32       2.76µs ± 3%    2.74µs ± 0%    ~     (p=0.683 n=5+5)
RegexpMatchHard_1K       78.8µs ± 0%    78.9µs ± 0%  +0.04%  (p=0.008 n=5+5)
FmtFprintfEmpty          58.6ns ± 0%    57.7ns ± 0%  -1.54%  (p=0.008 n=5+5)
FmtFprintfString          118ns ± 0%     115ns ± 0%  -2.54%  (p=0.008 n=5+5)
FmtFprintfInt             119ns ± 0%     119ns ± 0%    ~     (all equal)
FmtFprintfIntInt          192ns ± 0%     192ns ± 0%    ~     (all equal)
FmtFprintfPrefixedInt     224ns ± 0%     205ns ± 0%  -8.48%  (p=0.008 n=5+5)
FmtFprintfFloat           336ns ± 0%     333ns ± 1%    ~     (p=0.683 n=5+5)
FmtManyArgs               779ns ± 1%     760ns ± 1%  -2.41%  (p=0.008 n=5+5)
Gzip                      437ms ± 0%     436ms ± 0%  -0.27%  (p=0.008 n=5+5)
HTTPClientServer         90.1µs ± 1%    91.1µs ± 0%  +1.19%  (p=0.008 n=5+5)
JSONEncode               20.1ms ± 0%    20.2ms ± 1%    ~     (p=0.690 n=5+5)
JSONDecode               94.5ms ± 1%    94.1ms ± 1%    ~     (p=0.095 n=5+5)
Mandelbrot200            5.37ms ± 0%    5.37ms ± 0%    ~     (p=0.421 n=5+5)
TimeParse                 450ns ± 0%     446ns ± 0%  -0.89%  (p=0.000 n=5+4)
TimeFormat                483ns ± 1%     473ns ± 0%  -2.19%  (p=0.008 n=5+5)
Template                 90.6ms ± 0%    89.7ms ± 0%  -0.93%  (p=0.008 n=5+5)
GoParse                  5.97ms ± 0%    6.01ms ± 0%  +0.65%  (p=0.008 n=5+5)
BinaryTree17              11.8s ± 0%     11.7s ± 0%  -0.28%  (p=0.016 n=5+5)
Revcomp                   669ms ± 0%     669ms ± 0%    ~     (p=0.222 n=5+5)
Fannkuch11                3.28s ± 0%     3.34s ± 0%  +1.72%  (p=0.016 n=4+5)
[Geo mean]               46.6µs         46.3µs       -0.74%

name                   old speed      new speed      delta
RegexpMatchEasy0_32     129MB/s ± 0%   130MB/s ± 0%  +0.32%  (p=0.016 n=5+4)
RegexpMatchEasy0_1K    1.76GB/s ± 0%  1.76GB/s ± 0%  +0.13%  (p=0.016 n=4+5)
RegexpMatchEasy1_32     131MB/s ± 0%   132MB/s ± 0%  +0.32%  (p=0.008 n=5+5)
RegexpMatchEasy1_1K    1.27GB/s ± 0%  1.27GB/s ± 0%  -0.24%  (p=0.016 n=5+4)
RegexpMatchMedium_32   3.19MB/s ± 0%  3.21MB/s ± 0%  +0.63%  (p=0.008 n=5+5)
RegexpMatchMedium_1K   19.6MB/s ± 0%  19.7MB/s ± 0%  +0.51%  (p=0.029 n=4+4)
RegexpMatchHard_32     11.6MB/s ± 2%  11.7MB/s ± 0%    ~     (p=1.000 n=5+5)
RegexpMatchHard_1K     13.0MB/s ± 0%  13.0MB/s ± 0%    ~     (p=0.079 n=4+5)
Gzip                   44.4MB/s ± 0%  44.5MB/s ± 0%  +0.27%  (p=0.008 n=5+5)
JSONEncode             96.4MB/s ± 0%  96.2MB/s ± 1%    ~     (p=0.579 n=5+5)
JSONDecode             20.5MB/s ± 1%  20.6MB/s ± 1%    ~     (p=0.111 n=5+5)
Template               21.4MB/s ± 0%  21.6MB/s ± 0%  +0.94%  (p=0.008 n=5+5)
GoParse                9.70MB/s ± 0%  9.63MB/s ± 0%  -0.68%  (p=0.016 n=4+5)
Revcomp                 380MB/s ± 0%   380MB/s ± 0%    ~     (p=0.222 n=5+5)
[Geo mean]             55.3MB/s       55.4MB/s       +0.23%

Change-Id: I2e5338138991d9bc984e67b51212aa5d1b0f2a6b
Reviewed-on: https://go-review.googlesource.com/97335
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
2018-04-26 14:13:12 +00:00
Austin Clements
8871c930be cmd/compile: don't lower OpConvert
Currently, each architecture lowers OpConvert to an arch-specific
OpXXXconvert. This is silly because OpConvert means the same thing on
all architectures and is logically a no-op that exists only to keep
track of conversions to and from unsafe.Pointer. Furthermore, lowering
it makes it harder to recognize in other analyses, particularly
liveness analysis.

This CL eliminates the lowering of OpConvert, leaving it as the
generic op until code generation time.

The main complexity here is that we still need to register-allocate
OpConvert operations. Currently, each arch's lowered OpConvert
specifies all GP registers in its register mask. Ideally, OpConvert
wouldn't affect value homing at all, and we could just copy the home
of OpConvert's source, but this can potentially home an OpConvert in a
LocalSlot, which neither regalloc nor stackalloc expect. Rather than
try to disentangle this assumption from regalloc and stackalloc, we
continue to register-allocate OpConvert, but teach regalloc that
OpConvert can be allocated to any allocatable GP register.

For #24543.

Change-Id: I795a6aee5fd94d4444a7bafac3838a400c9f7bb6
Reviewed-on: https://go-review.googlesource.com/108496
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2018-04-20 18:46:39 +00:00
Ben Shi
34f5f8a580 cmd/compile: optimize ARM64 with register indexed load/store
ARM64 supports load/store instructions with a memory operand whose
address is calculated as base register + index register.

In this CL,
1. Some rules are added to the compiler's ARM64 backend to emit
such efficient instructions.
2. An incorrect load-combining rule is fixed.

The go1 benchmark does show improvement.

name                     old time/op    new time/op    delta
BinaryTree17-4              44.5s ± 2%     44.1s ± 1%   -0.81%  (p=0.000 n=28+29)
Fannkuch11-4                32.7s ± 3%     30.5s ± 0%   -6.79%  (p=0.000 n=30+26)
FmtFprintfEmpty-4           499ns ± 0%     506ns ± 5%   +1.39%  (p=0.003 n=25+30)
FmtFprintfString-4         1.07µs ± 0%    1.04µs ± 4%   -3.17%  (p=0.000 n=23+30)
FmtFprintfInt-4            1.15µs ± 4%    1.13µs ± 0%   -1.55%  (p=0.000 n=30+23)
FmtFprintfIntInt-4         1.77µs ± 4%    1.74µs ± 0%   -1.71%  (p=0.000 n=30+24)
FmtFprintfPrefixedInt-4    2.37µs ± 5%    2.12µs ± 0%  -10.56%  (p=0.000 n=30+23)
FmtFprintfFloat-4          3.03µs ± 1%    3.03µs ± 4%   -0.13%  (p=0.003 n=25+30)
FmtManyArgs-4              7.38µs ± 1%    7.43µs ± 4%   +0.59%  (p=0.003 n=25+30)
GobDecode-4                 101ms ± 6%      95ms ± 5%   -5.55%  (p=0.000 n=30+30)
GobEncode-4                78.0ms ± 4%    78.8ms ± 6%   +1.05%  (p=0.000 n=30+30)
Gzip-4                      4.25s ± 0%     4.27s ± 4%   +0.45%  (p=0.003 n=24+30)
Gunzip-4                    428ms ± 1%     420ms ± 0%   -1.88%  (p=0.000 n=23+23)
HTTPClientServer-4          549µs ± 1%     541µs ± 1%   -1.56%  (p=0.000 n=29+29)
JSONEncode-4                194ms ± 0%     188ms ± 4%     ~     (p=0.417 n=23+30)
JSONDecode-4                890ms ± 5%     831ms ± 0%   -6.55%  (p=0.000 n=30+23)
Mandelbrot200-4            47.3ms ± 2%    46.5ms ± 0%     ~     (p=0.980 n=30+26)
GoParse-4                  43.1ms ± 6%    43.8ms ± 6%   +1.65%  (p=0.000 n=30+30)
RegexpMatchEasy0_32-4      1.06µs ± 0%    1.07µs ± 3%     ~     (p=0.092 n=23+30)
RegexpMatchEasy0_1K-4      5.53µs ± 0%    5.51µs ± 0%   -0.24%  (p=0.000 n=25+25)
RegexpMatchEasy1_32-4      1.02µs ± 3%    1.01µs ± 0%   -1.27%  (p=0.000 n=30+24)
RegexpMatchEasy1_1K-4      7.26µs ± 0%    7.33µs ± 0%   +0.95%  (p=0.000 n=23+26)
RegexpMatchMedium_32-4     1.84µs ± 7%    1.79µs ± 1%     ~     (p=0.333 n=30+23)
RegexpMatchMedium_1K-4      553µs ± 0%     547µs ± 0%   -1.14%  (p=0.000 n=24+22)
RegexpMatchHard_32-4       30.8µs ± 1%    30.3µs ± 0%   -1.40%  (p=0.000 n=24+24)
RegexpMatchHard_1K-4        928µs ± 0%     929µs ± 5%   +0.12%  (p=0.013 n=23+30)
Revcomp-4                   8.13s ± 4%     6.32s ± 1%  -22.23%  (p=0.000 n=30+23)
Template-4                  899ms ± 6%     854ms ± 1%   -5.01%  (p=0.000 n=30+24)
TimeParse-4                4.66µs ± 4%    4.59µs ± 1%   -1.57%  (p=0.000 n=30+23)
TimeFormat-4               4.58µs ± 0%    4.61µs ± 0%   +0.57%  (p=0.000 n=26+24)
[Geo mean]                  717µs          698µs        -2.55%

name                     old speed      new speed      delta
GobDecode-4              7.63MB/s ± 6%  8.08MB/s ± 5%   +5.88%  (p=0.000 n=30+30)
GobEncode-4              9.85MB/s ± 4%  9.75MB/s ± 6%   -1.04%  (p=0.000 n=30+30)
Gzip-4                   4.56MB/s ± 0%  4.55MB/s ± 4%   -0.36%  (p=0.003 n=24+30)
Gunzip-4                 45.3MB/s ± 1%  46.2MB/s ± 0%   +1.92%  (p=0.000 n=23+23)
JSONEncode-4             10.0MB/s ± 0%  10.4MB/s ± 4%     ~     (p=0.403 n=23+30)
JSONDecode-4             2.18MB/s ± 5%  2.33MB/s ± 0%   +6.91%  (p=0.000 n=30+23)
GoParse-4                1.34MB/s ± 5%  1.32MB/s ± 5%   -1.66%  (p=0.000 n=30+30)
RegexpMatchEasy0_32-4    30.2MB/s ± 0%  29.8MB/s ± 3%     ~     (p=0.099 n=23+30)
RegexpMatchEasy0_1K-4     185MB/s ± 0%   186MB/s ± 0%   +0.24%  (p=0.000 n=25+25)
RegexpMatchEasy1_32-4    31.4MB/s ± 3%  31.8MB/s ± 0%   +1.24%  (p=0.000 n=30+24)
RegexpMatchEasy1_1K-4     141MB/s ± 0%   140MB/s ± 0%   -0.94%  (p=0.000 n=23+26)
RegexpMatchMedium_32-4    541kB/s ± 6%   560kB/s ± 0%   +3.45%  (p=0.000 n=30+23)
RegexpMatchMedium_1K-4   1.85MB/s ± 0%  1.87MB/s ± 0%   +1.08%  (p=0.000 n=24+23)
RegexpMatchHard_32-4     1.04MB/s ± 1%  1.06MB/s ± 1%   +1.48%  (p=0.000 n=24+24)
RegexpMatchHard_1K-4     1.10MB/s ± 0%  1.10MB/s ± 5%   +0.15%  (p=0.004 n=23+30)
Revcomp-4                31.3MB/s ± 4%  40.2MB/s ± 1%  +28.52%  (p=0.000 n=30+23)
Template-4               2.16MB/s ± 6%  2.27MB/s ± 1%   +5.18%  (p=0.000 n=30+24)
[Geo mean]               7.57MB/s       7.79MB/s        +2.98%

fixes #24907

Change-Id: I94afd0e3f53d62a1cf5e452f3dd6daf61be21785
Reviewed-on: https://go-review.googlesource.com/107376
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-04-19 15:08:10 +00:00
Balaram Makam
d7c7d88b2c cmd/compile: intrinsify math/big.mulWW on ARM64
Performance numbers on Amberwing:

pkg: math/big
name                            old time/op    new time/op    delta
QuoRem                            3.08µs ± 0%    2.93µs ± 1%   -4.89%  (p=0.008 n=5+5)
ModSqrt225_Tonelli                 721µs ± 0%     718µs ± 0%   -0.46%  (p=0.008 n=5+5)
ModSqrt224_3Mod4                   218µs ± 0%     217µs ± 0%   -0.27%  (p=0.008 n=5+5)
ModSqrt5430_Tonelli                2.91s ± 0%     2.91s ± 0%     ~     (p=0.222 n=5+5)
ModSqrt5430_3Mod4                  970ms ± 0%     970ms ± 0%     ~     (p=0.151 n=5+5)
Sqrt                              45.9µs ± 0%    43.8µs ± 0%   -4.63%  (p=0.008 n=5+5)
IntSqr/1                          19.9ns ± 0%    17.3ns ± 0%  -13.07%  (p=0.008 n=5+5)
IntSqr/2                          52.6ns ± 0%    50.8ns ± 0%   -3.35%  (p=0.008 n=5+5)
IntSqr/3                          70.4ns ± 0%    69.4ns ± 0%     ~     (p=0.079 n=4+5)
IntSqr/5                           103ns ± 0%      99ns ± 0%   -3.98%  (p=0.008 n=5+5)
IntSqr/8                           179ns ± 0%     178ns ± 0%   -0.56%  (p=0.008 n=5+5)
IntSqr/10                          272ns ± 0%     272ns ± 0%     ~     (all equal)
IntSqr/20                          763ns ± 0%     787ns ± 0%   +3.15%  (p=0.016 n=5+4)
IntSqr/30                         1.25µs ± 1%    1.29µs ± 1%   +3.27%  (p=0.008 n=5+5)
IntSqr/50                         2.64µs ± 0%    2.71µs ± 0%   +2.61%  (p=0.008 n=5+5)
IntSqr/80                         5.67µs ± 0%    5.72µs ± 0%   +0.88%  (p=0.008 n=5+5)
IntSqr/100                        8.05µs ± 0%    8.09µs ± 0%   +0.45%  (p=0.008 n=5+5)
IntSqr/200                        28.0µs ± 0%    28.1µs ± 0%     ~     (p=0.151 n=5+5)
IntSqr/300                        59.4µs ± 0%    59.6µs ± 0%   +0.36%  (p=0.008 n=5+5)
IntSqr/500                         141µs ± 0%     141µs ± 0%   +0.08%  (p=0.008 n=5+5)
IntSqr/800                         280µs ± 0%     280µs ± 0%   -0.12%  (p=0.008 n=5+5)
IntSqr/1000                        429µs ± 0%     428µs ± 0%   -0.27%  (p=0.008 n=5+5)

pkg: crypto-ecdsa
name      old time/op    new time/op    delta
SignP384    7.85ms ± 1%    7.61ms ± 1%  -3.12%  (p=0.008 n=5+5)
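
For context, mulWW is math/big's unexported word-by-word multiply, which returns the high and low words of the full double-width product. A minimal sketch of the same operation, assuming a 64-bit word and using the exported math/bits.Mul64 as a stand-in (mulWide is a hypothetical name, not part of this change):

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulWide returns the 128-bit product of x and y as (hi, lo).
// Hypothetical stand-in for the unexported math/big.mulWW.
func mulWide(x, y uint64) (hi, lo uint64) {
	return bits.Mul64(x, y)
}

func main() {
	// 2^63 * 4 = 2^65, which does not fit in a single 64-bit word.
	hi, lo := mulWide(1<<63, 4)
	fmt.Println(hi, lo)
}
```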

Change-Id: I1ab30856cc0e570f6312f0bd8914779b55adbc16
Reviewed-on: https://go-review.googlesource.com/104135
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-04 18:37:24 +00:00
Geoff Berry
e244a7a7d3 cmd/compile/internal/ssa: add patterns for arm64 bitfield opcodes
Add patterns to match common idioms for EXTR, BFI, BFXIL, SBFIZ, SBFX,
UBFIZ and UBFX opcodes.
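
As an illustration of the kind of source idiom involved, the sketch below shows a shift-then-mask extraction (the shape of UBFX) and a mask-then-shift insertion into zero (the shape of UBFIZ). The function names are hypothetical, and whether a given compiler version selects these opcodes for this exact code is an assumption:

```go
package main

import "fmt"

// extractField returns the 8-bit field of x that starts at bit 3.
// Shift-then-mask is the idiom an unsigned bitfield extract covers.
func extractField(x uint64) uint64 {
	return (x >> 3) & 0xff
}

// insertZeroed keeps the low 8 bits of x and moves them to bit 3,
// leaving all other bits zero (the shape of "ubfiz x0, x0, #3, #8").
func insertZeroed(x uint64) uint64 {
	return (x & 0xff) << 3
}

func main() {
	fmt.Printf("%#x %#x\n", extractField(0x7f8), insertZeroed(0x1ff))
}
```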

go1 benchmarks results on Amberwing:
name                   old time/op    new time/op    delta
FmtManyArgs               786ns ± 2%     714ns ± 1%  -9.20%  (p=0.000 n=10+10)
Gzip                      437ms ± 0%     402ms ± 0%  -7.99%  (p=0.000 n=10+10)
FmtFprintfIntInt          196ns ± 0%     182ns ± 0%  -7.28%  (p=0.000 n=10+9)
FmtFprintfPrefixedInt     207ns ± 0%     199ns ± 0%  -3.86%  (p=0.000 n=10+10)
FmtFprintfFloat           324ns ± 0%     316ns ± 0%  -2.47%  (p=0.000 n=10+8)
FmtFprintfInt             119ns ± 0%     117ns ± 0%  -1.68%  (p=0.000 n=10+9)
GobDecode                12.8ms ± 2%    12.6ms ± 1%  -1.62%  (p=0.002 n=10+10)
JSONDecode               94.4ms ± 1%    93.4ms ± 0%  -1.10%  (p=0.000 n=10+10)
RegexpMatchEasy0_32       247ns ± 0%     245ns ± 0%  -0.65%  (p=0.000 n=10+10)
RegexpMatchMedium_32      314ns ± 0%     312ns ± 0%  -0.64%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K       541ns ± 0%     538ns ± 0%  -0.55%  (p=0.000 n=10+9)
TimeParse                 450ns ± 1%     448ns ± 1%  -0.42%  (p=0.035 n=9+9)
RegexpMatchEasy1_32       244ns ± 0%     243ns ± 0%  -0.41%  (p=0.000 n=10+10)
GoParse                  6.03ms ± 0%    6.00ms ± 0%  -0.40%  (p=0.002 n=10+10)
RegexpMatchEasy1_1K       779ns ± 0%     777ns ± 0%  -0.26%  (p=0.000 n=10+10)
RegexpMatchHard_32       2.75µs ± 0%    2.74µs ± 1%  -0.06%  (p=0.026 n=9+9)
BinaryTree17              11.7s ± 0%     11.6s ± 0%    ~     (p=0.089 n=10+10)
HTTPClientServer         89.1µs ± 1%    89.5µs ± 2%    ~     (p=0.436 n=10+10)
RegexpMatchHard_1K       78.9µs ± 0%    79.5µs ± 2%    ~     (p=0.469 n=10+10)
FmtFprintfEmpty          58.5ns ± 0%    58.5ns ± 0%    ~     (all equal)
GobEncode                12.0ms ± 1%    12.1ms ± 0%    ~     (p=0.075 n=10+10)
Revcomp                   669ms ± 0%     668ms ± 0%    ~     (p=0.091 n=7+9)
Mandelbrot200            5.35ms ± 0%    5.36ms ± 0%  +0.07%  (p=0.000 n=9+9)
RegexpMatchMedium_1K     52.1µs ± 0%    52.1µs ± 0%  +0.10%  (p=0.000 n=9+9)
Fannkuch11                3.25s ± 0%     3.26s ± 0%  +0.36%  (p=0.000 n=9+10)
FmtFprintfString          114ns ± 1%     115ns ± 0%  +0.52%  (p=0.011 n=10+10)
JSONEncode               20.2ms ± 0%    20.3ms ± 0%  +0.65%  (p=0.000 n=10+10)
Template                 91.3ms ± 0%    92.3ms ± 0%  +1.08%  (p=0.000 n=10+10)
TimeFormat                484ns ± 0%     495ns ± 1%  +2.30%  (p=0.000 n=9+10)

There are some opportunities to improve this change further by adding
patterns to match the "extended register" versions of ADD/SUB/CMP, but I
think that should be evaluated on its own.  The regressions in Template
and TimeFormat would likely be recovered by this, as they seem to be due
to generating:

    ubfiz x0, x0, #3, #8
    add x1, x2, x0

instead of

    add x1, x2, x0, lsl #3

Change-Id: I5644a8d70ac7a98e784a377a2b76ab47a3415a4b
Reviewed-on: https://go-review.googlesource.com/88355
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-15 14:10:41 +00:00
Meng Zhuo
8916773a3d runtime, cmd/compile: use ldp for DUFFCOPY on ARM64
name         old time/op  new time/op  delta
CopyFat8     2.15ns ± 1%  2.19ns ± 6%     ~     (p=0.171 n=8+9)
CopyFat12    2.15ns ± 0%  2.17ns ± 2%     ~     (p=0.137 n=8+10)
CopyFat16    2.17ns ± 3%  2.15ns ± 0%     ~     (p=0.211 n=10+10)
CopyFat24    2.16ns ± 1%  2.15ns ± 0%     ~     (p=0.087 n=10+10)
CopyFat32    11.5ns ± 0%  12.8ns ± 2%  +10.87%  (p=0.000 n=8+10)
CopyFat64    20.2ns ± 2%  12.9ns ± 0%  -36.11%  (p=0.000 n=10+10)
CopyFat128   37.2ns ± 0%  21.5ns ± 0%  -42.20%  (p=0.000 n=10+10)
CopyFat256   71.6ns ± 0%  38.7ns ± 0%  -45.95%  (p=0.000 n=10+10)
CopyFat512    140ns ± 0%    73ns ± 0%  -47.86%  (p=0.000 n=10+9)
CopyFat520    142ns ± 0%    74ns ± 0%  -47.54%  (p=0.000 n=10+10)
CopyFat1024   277ns ± 0%   141ns ± 0%  -49.10%  (p=0.000 n=10+10)

Change-Id: If54bc571add5db674d5e081579c87e80153d0a5a
Reviewed-on: https://go-review.googlesource.com/97395
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-03-06 04:14:59 +00:00
Heschi Kreinick
caa1b4afbd cmd/compile/internal/ssa: note zero-width Ops
Add a bool to opInfo to indicate if an Op never results in any
instructions. This is a conservative approximation: some operations,
like Copy, may or may not generate code depending on their arguments.

I built the list by reading each arch's ssaGenValue function. Hopefully
I got them all.

Change-Id: I130b251b65f18208294e129bb7ddc3f91d57d31d
Reviewed-on: https://go-review.googlesource.com/97957
Reviewed-by: Keith Randall <khr@golang.org>
2018-03-02 18:55:45 +00:00
Ben Shi
1057624985 cmd/compile: optimize ARM64 code with EON/ORN
EON and ORN are efficient ARM64 instructions. EON computes (x ^ ^y)
in a single operation, and ORN does the same for (x | ^y).
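
A minimal sketch of the two source patterns in Go (function names are hypothetical; whether each compiles to a single EON or ORN depends on the compiler version and target):

```go
package main

import "fmt"

// xorNot computes x ^ ^y, the pattern that EON covers.
func xorNot(x, y uint64) uint64 {
	return x ^ ^y
}

// orNot computes x | ^y, the pattern that ORN covers.
func orNot(x, y uint64) uint64 {
	return x | ^y
}

func main() {
	fmt.Printf("%#x %#x\n", xorNot(0xF0, 0xFF), orNot(0x0F, 0xFF))
}
```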

This CL implements that optimization. Benchmark results on a
Raspberry Pi 3 running ArchLinux follow.

1. A specific test gets about 13% improvement.
EONORN                      181µs ± 0%     157µs ± 0%  -13.26%  (p=0.000 n=26+23)
(https://github.com/benshi001/ugo1/blob/master/eonorn_test.go)

2. There is little change in the go1 benchmark, excluding noise.
name                     old time/op    new time/op    delta
BinaryTree17-4              44.1s ± 2%     44.0s ± 2%    ~     (p=0.513 n=30+30)
Fannkuch11-4                32.9s ± 3%     32.8s ± 3%  -0.12%  (p=0.024 n=30+30)
FmtFprintfEmpty-4           561ns ± 9%     558ns ± 9%    ~     (p=0.654 n=30+30)
FmtFprintfString-4         1.09µs ± 4%    1.09µs ± 3%    ~     (p=0.158 n=30+30)
FmtFprintfInt-4            1.12µs ± 0%    1.12µs ± 0%    ~     (p=0.917 n=23+28)
FmtFprintfIntInt-4         1.73µs ± 0%    1.76µs ± 4%    ~     (p=0.665 n=23+30)
FmtFprintfPrefixedInt-4    2.15µs ± 1%    2.15µs ± 0%    ~     (p=0.389 n=27+26)
FmtFprintfFloat-4          3.18µs ± 4%    3.13µs ± 0%  -1.50%  (p=0.003 n=30+23)
FmtManyArgs-4              7.32µs ± 4%    7.21µs ± 0%    ~     (p=0.220 n=30+25)
GobDecode-4                99.1ms ± 9%    97.0ms ± 0%  -2.07%  (p=0.000 n=30+23)
GobEncode-4                83.3ms ± 3%    82.4ms ± 4%    ~     (p=0.321 n=30+30)
Gzip-4                      4.39s ± 4%     4.32s ± 2%  -1.42%  (p=0.017 n=30+23)
Gunzip-4                    440ms ± 0%     447ms ± 4%  +1.54%  (p=0.006 n=24+30)
HTTPClientServer-4          547µs ± 1%     537µs ± 1%  -1.91%  (p=0.000 n=30+30)
JSONEncode-4                211ms ± 0%     211ms ± 0%  +0.04%  (p=0.000 n=23+24)
JSONDecode-4                847ms ± 0%     847ms ± 0%    ~     (p=0.158 n=25+25)
Mandelbrot200-4            46.5ms ± 0%    46.5ms ± 0%  -0.04%  (p=0.000 n=25+24)
GoParse-4                  43.4ms ± 0%    43.4ms ± 0%    ~     (p=0.494 n=24+25)
RegexpMatchEasy0_32-4      1.03µs ± 0%    1.03µs ± 0%    ~     (all equal)
RegexpMatchEasy0_1K-4      4.02µs ± 3%    3.98µs ± 0%  -0.95%  (p=0.003 n=30+24)
RegexpMatchEasy1_32-4      1.01µs ± 3%    1.01µs ± 2%    ~     (p=0.629 n=30+30)
RegexpMatchEasy1_1K-4      6.39µs ± 0%    6.39µs ± 0%    ~     (p=0.564 n=24+23)
RegexpMatchMedium_32-4     1.80µs ± 3%    1.78µs ± 0%    ~     (p=0.155 n=30+24)
RegexpMatchMedium_1K-4      555µs ± 0%     563µs ± 3%  +1.55%  (p=0.004 n=27+30)
RegexpMatchHard_32-4       31.0µs ± 4%    30.5µs ± 1%  -1.58%  (p=0.000 n=30+23)
RegexpMatchHard_1K-4        947µs ± 4%     931µs ± 0%  -1.66%  (p=0.009 n=30+24)
Revcomp-4                   7.71s ± 4%     7.71s ± 4%    ~     (p=0.196 n=29+30)
Template-4                  877ms ± 0%     878ms ± 0%  +0.16%  (p=0.018 n=23+27)
TimeParse-4                4.75µs ± 1%    4.74µs ± 0%    ~     (p=0.895 n=24+23)
TimeFormat-4               4.83µs ± 4%    4.83µs ± 4%    ~     (p=0.767 n=30+30)
[Geo mean]                  709µs          707µs       -0.35%

name                     old speed      new speed      delta
GobDecode-4              7.75MB/s ± 8%  7.91MB/s ± 0%  +2.03%  (p=0.001 n=30+23)
GobEncode-4              9.22MB/s ± 3%  9.32MB/s ± 4%    ~     (p=0.389 n=30+30)
Gzip-4                   4.43MB/s ± 4%  4.43MB/s ± 4%    ~     (p=0.888 n=30+30)
Gunzip-4                 44.1MB/s ± 0%  43.4MB/s ± 4%  -1.46%  (p=0.009 n=24+30)
JSONEncode-4             9.18MB/s ± 0%  9.18MB/s ± 0%    ~     (p=0.308 n=16+24)
JSONDecode-4             2.29MB/s ± 0%  2.29MB/s ± 0%    ~     (all equal)
GoParse-4                1.33MB/s ± 0%  1.33MB/s ± 0%    ~     (all equal)
RegexpMatchEasy0_32-4    30.9MB/s ± 0%  30.9MB/s ± 0%    ~     (p=1.000 n=23+24)
RegexpMatchEasy0_1K-4     255MB/s ± 3%   257MB/s ± 0%  +0.92%  (p=0.004 n=30+24)
RegexpMatchEasy1_32-4    31.7MB/s ± 3%  31.6MB/s ± 2%    ~     (p=0.603 n=30+30)
RegexpMatchEasy1_1K-4     160MB/s ± 0%   160MB/s ± 0%    ~     (p=0.435 n=24+23)
RegexpMatchMedium_32-4    554kB/s ± 3%   560kB/s ± 0%  +1.08%  (p=0.004 n=30+24)
RegexpMatchMedium_1K-4   1.85MB/s ± 0%  1.82MB/s ± 3%  -1.48%  (p=0.001 n=27+30)
RegexpMatchHard_32-4     1.03MB/s ± 4%  1.05MB/s ± 1%  +1.51%  (p=0.027 n=30+23)
RegexpMatchHard_1K-4     1.08MB/s ± 4%  1.10MB/s ± 0%  +1.69%  (p=0.002 n=30+25)
Revcomp-4                33.0MB/s ± 4%  33.0MB/s ± 4%    ~     (p=0.272 n=29+30)
Template-4               2.21MB/s ± 0%  2.21MB/s ± 0%    ~     (all equal)
[Geo mean]               7.75MB/s       7.77MB/s       +0.29%

3. There is little regression in the compilecmp benchmark.
name        old time/op       new time/op       delta
Template          2.28s ± 3%        2.28s ± 4%    ~     (p=0.739 n=10+10)
Unicode           1.34s ± 4%        1.32s ± 3%    ~     (p=0.113 n=10+9)
GoTypes           8.10s ± 3%        8.18s ± 3%    ~     (p=0.393 n=10+10)
Compiler          39.0s ± 3%        39.2s ± 3%    ~     (p=0.393 n=10+10)
SSA                114s ± 3%         115s ± 2%    ~     (p=0.631 n=10+10)
Flate             1.41s ± 2%        1.42s ± 3%    ~     (p=0.353 n=10+10)
GoParser          1.81s ± 1%        1.83s ± 2%    ~     (p=0.211 n=10+9)
Reflect           5.06s ± 2%        5.06s ± 2%    ~     (p=0.912 n=10+10)
Tar               2.19s ± 3%        2.20s ± 3%    ~     (p=0.247 n=10+10)
XML               2.65s ± 2%        2.67s ± 5%    ~     (p=0.796 n=10+10)
[Geo mean]        4.92s             4.93s       +0.27%

name        old user-time/op  new user-time/op  delta
Template          2.81s ± 2%        2.81s ± 3%    ~     (p=0.971 n=10+10)
Unicode           1.70s ± 3%        1.67s ± 5%    ~     (p=0.315 n=10+10)
GoTypes           9.71s ± 1%        9.78s ± 1%  +0.71%  (p=0.023 n=10+10)
Compiler          47.3s ± 1%        47.1s ± 3%    ~     (p=0.579 n=10+10)
SSA                143s ± 2%         143s ± 2%    ~     (p=0.280 n=10+10)
Flate             1.70s ± 3%        1.71s ± 3%    ~     (p=0.481 n=10+10)
GoParser          2.21s ± 3%        2.21s ± 1%    ~     (p=0.549 n=10+9)
Reflect           5.89s ± 1%        5.87s ± 2%    ~     (p=0.739 n=10+10)
Tar               2.66s ± 2%        2.63s ± 2%    ~     (p=0.105 n=10+10)
XML               3.16s ± 3%        3.18s ± 2%    ~     (p=0.143 n=10+10)
[Geo mean]        5.97s             5.97s       -0.06%

name        old text-bytes    new text-bytes    delta
HelloSize         637kB ± 0%        637kB ± 0%    ~     (all equal)

name        old data-bytes    new data-bytes    delta
HelloSize        9.46kB ± 0%       9.46kB ± 0%    ~     (all equal)

name        old bss-bytes     new bss-bytes     delta
HelloSize         125kB ± 0%        125kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.24MB ± 0%       1.24MB ± 0%    ~     (all equal)

Change-Id: Ie27357d65c5ce9d07afdffebe1e2daadcaa3369f
Reviewed-on: https://go-review.googlesource.com/97036
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-28 23:42:40 +00:00
Ben Shi
7113d3a512 cmd/compile: fix FP accuracy issue introduced by FMA optimization on ARM64
Two ARM64 rules are added to avoid an FP accuracy issue that causes a
build failure.
https://build.golang.org/log/1360f5c9ef3f37968216350283c1013e9681725d

fixes #24033

Change-Id: I9b74b584ab5cc53fa49476de275dc549adf97610
Reviewed-on: https://go-review.googlesource.com/96355
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-22 15:28:08 +00:00
Ben Shi
f4c3072cf5 cmd/compile: improve FP performance on ARM64
FMADD/FMSUB/FNMADD/FNMSUB are efficient FP instructions, which can
be used by the compiler to improve FP performance. This CL implements
this optimization.
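
A fused multiply-add rounds once instead of twice, so its result can differ from a separate multiply and add. The sketch below illustrates the difference using math.FMA (Go 1.14 and later) for the fused form and an explicit float64 conversion, which the Go spec says prevents fusion, for the unfused form. The function names and inputs are chosen for illustration only:

```go
package main

import (
	"fmt"
	"math"
)

// fused computes x*y + z with a single rounding.
func fused(x, y, z float64) float64 {
	return math.FMA(x, y, z)
}

// unfused rounds the product before adding; the explicit conversion
// keeps the compiler from fusing the two operations.
func unfused(x, y, z float64) float64 {
	return float64(x*y) + z
}

func main() {
	// x*y is exactly 1 - 2^-60, which rounds to 1 in float64.
	x, y, z := 1+1.0/(1<<30), 1-1.0/(1<<30), -1.0
	fmt.Println(fused(x, y, z), unfused(x, y, z))
}
```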

1. The compilecmp benchmark shows little change.
name        old time/op       new time/op       delta
Template          2.35s ± 4%        2.38s ± 4%    ~     (p=0.161 n=15+15)
Unicode           1.36s ± 5%        1.36s ± 4%    ~     (p=0.685 n=14+13)
GoTypes           8.11s ± 3%        8.13s ± 2%    ~     (p=0.624 n=15+15)
Compiler          40.5s ± 2%        40.7s ± 2%    ~     (p=0.137 n=15+15)
SSA                115s ± 3%         116s ± 1%    ~     (p=0.270 n=15+14)
Flate             1.46s ± 4%        1.45s ± 5%    ~     (p=0.870 n=15+15)
GoParser          1.85s ± 2%        1.87s ± 3%    ~     (p=0.477 n=14+15)
Reflect           5.11s ± 4%        5.10s ± 2%    ~     (p=0.624 n=15+15)
Tar               2.23s ± 3%        2.23s ± 5%    ~     (p=0.624 n=15+15)
XML               2.72s ± 5%        2.74s ± 3%    ~     (p=0.290 n=15+14)
[Geo mean]        5.02s             5.03s       +0.29%

name        old user-time/op  new user-time/op  delta
Template          2.90s ± 2%        2.90s ± 3%    ~     (p=0.780 n=14+15)
Unicode           1.71s ± 5%        1.70s ± 3%    ~     (p=0.458 n=14+13)
GoTypes           9.77s ± 2%        9.76s ± 2%    ~     (p=0.838 n=15+15)
Compiler          49.1s ± 2%        49.1s ± 2%    ~     (p=0.902 n=15+15)
SSA                144s ± 1%         144s ± 2%    ~     (p=0.567 n=15+15)
Flate             1.75s ± 5%        1.74s ± 3%    ~     (p=0.461 n=15+15)
GoParser          2.22s ± 2%        2.21s ± 3%    ~     (p=0.233 n=15+15)
Reflect           5.99s ± 2%        5.95s ± 1%    ~     (p=0.093 n=14+15)
Tar               2.68s ± 2%        2.67s ± 3%    ~     (p=0.310 n=14+15)
XML               3.22s ± 2%        3.24s ± 3%    ~     (p=0.512 n=15+15)
[Geo mean]        6.08s             6.07s       -0.19%

name        old text-bytes    new text-bytes    delta
HelloSize         641kB ± 0%        641kB ± 0%    ~     (all equal)

name        old data-bytes    new data-bytes    delta
HelloSize        9.46kB ± 0%       9.46kB ± 0%    ~     (all equal)

name        old bss-bytes     new bss-bytes     delta
HelloSize         125kB ± 0%        125kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.24MB ± 0%       1.24MB ± 0%    ~     (all equal)

2. The go1 benchmark shows little improvement in total (excluding noise),
but some improvement in test case Mandelbrot200 and FmtFprintfFloat.
name                     old time/op    new time/op    delta
BinaryTree17-4              42.1s ± 2%     42.0s ± 2%    ~     (p=0.453 n=30+28)
Fannkuch11-4                33.5s ± 3%     33.3s ± 3%  -0.38%  (p=0.045 n=30+30)
FmtFprintfEmpty-4           534ns ± 0%     534ns ± 0%    ~     (all equal)
FmtFprintfString-4         1.09µs ± 0%    1.09µs ± 0%  -0.27%  (p=0.000 n=23+17)
FmtFprintfInt-4            1.16µs ± 3%    1.16µs ± 3%    ~     (p=0.714 n=30+30)
FmtFprintfIntInt-4         1.76µs ± 1%    1.77µs ± 0%  +0.15%  (p=0.002 n=23+23)
FmtFprintfPrefixedInt-4    2.21µs ± 3%    2.20µs ± 3%    ~     (p=0.390 n=30+30)
FmtFprintfFloat-4          3.28µs ± 0%    3.11µs ± 0%  -5.01%  (p=0.000 n=25+26)
FmtManyArgs-4              7.18µs ± 0%    7.19µs ± 0%  +0.13%  (p=0.000 n=24+25)
GobDecode-4                94.9ms ± 0%    95.6ms ± 5%  +0.83%  (p=0.002 n=23+29)
GobEncode-4                80.7ms ± 4%    79.8ms ± 0%  -1.11%  (p=0.003 n=30+24)
Gzip-4                      4.58s ± 4%     4.59s ± 3%  +0.26%  (p=0.002 n=30+26)
Gunzip-4                    449ms ± 4%     443ms ± 0%    ~     (p=0.096 n=30+26)
HTTPClientServer-4          553µs ± 1%     548µs ± 1%  -0.96%  (p=0.000 n=30+30)
JSONEncode-4                215ms ± 4%     214ms ± 4%  -0.29%  (p=0.000 n=30+30)
JSONDecode-4                868ms ± 4%     875ms ± 5%  +0.79%  (p=0.008 n=30+30)
Mandelbrot200-4            51.4ms ± 0%    46.7ms ± 3%  -9.09%  (p=0.000 n=25+26)
GoParse-4                  42.1ms ± 0%    41.8ms ± 0%  -0.61%  (p=0.000 n=25+24)
RegexpMatchEasy0_32-4      1.02µs ± 4%    1.02µs ± 4%  -0.17%  (p=0.000 n=30+30)
RegexpMatchEasy0_1K-4      3.90µs ± 0%    3.95µs ± 4%    ~     (p=0.516 n=23+30)
RegexpMatchEasy1_32-4       970ns ± 3%     973ns ± 3%    ~     (p=0.951 n=30+30)
RegexpMatchEasy1_1K-4      6.43µs ± 3%    6.33µs ± 0%  -1.62%  (p=0.000 n=30+25)
RegexpMatchMedium_32-4     1.75µs ± 0%    1.75µs ± 0%    ~     (p=0.422 n=25+24)
RegexpMatchMedium_1K-4      568µs ± 3%     562µs ± 0%    ~     (p=0.079 n=30+24)
RegexpMatchHard_32-4       30.8µs ± 0%    31.2µs ± 4%  +1.46%  (p=0.018 n=23+30)
RegexpMatchHard_1K-4        932µs ± 0%     946µs ± 3%  +1.49%  (p=0.000 n=24+30)
Revcomp-4                   7.69s ± 3%     7.69s ± 2%  +0.04%  (p=0.032 n=24+25)
Template-4                  893ms ± 5%     880ms ± 6%  -1.53%  (p=0.000 n=30+30)
TimeParse-4                4.90µs ± 3%    4.84µs ± 0%    ~     (p=0.080 n=30+25)
TimeFormat-4               4.70µs ± 1%    4.76µs ± 0%  +1.21%  (p=0.000 n=23+26)
[Geo mean]                  710µs          706µs       -0.63%

name                     old speed      new speed      delta
GobDecode-4              8.09MB/s ± 0%  8.03MB/s ± 5%  -0.77%  (p=0.002 n=23+29)
GobEncode-4              9.52MB/s ± 4%  9.62MB/s ± 0%  +1.07%  (p=0.003 n=30+24)
Gzip-4                   4.24MB/s ± 4%  4.23MB/s ± 3%  -0.35%  (p=0.002 n=30+26)
Gunzip-4                 43.2MB/s ± 4%  43.8MB/s ± 0%    ~     (p=0.123 n=30+26)
JSONEncode-4             9.03MB/s ± 4%  9.06MB/s ± 4%  +0.28%  (p=0.000 n=30+30)
JSONDecode-4             2.24MB/s ± 4%  2.22MB/s ± 5%  -0.79%  (p=0.008 n=30+30)
GoParse-4                1.38MB/s ± 1%  1.38MB/s ± 0%    ~     (p=0.401 n=25+17)
RegexpMatchEasy0_32-4    31.4MB/s ± 4%  31.5MB/s ± 3%  +0.16%  (p=0.000 n=30+30)
RegexpMatchEasy0_1K-4     262MB/s ± 0%   259MB/s ± 4%    ~     (p=0.693 n=23+30)
RegexpMatchEasy1_32-4    33.0MB/s ± 3%  32.9MB/s ± 3%    ~     (p=0.139 n=30+30)
RegexpMatchEasy1_1K-4     159MB/s ± 3%   162MB/s ± 0%  +1.60%  (p=0.000 n=30+25)
RegexpMatchMedium_32-4    570kB/s ± 0%   570kB/s ± 0%    ~     (all equal)
RegexpMatchMedium_1K-4   1.80MB/s ± 3%  1.82MB/s ± 0%  +1.09%  (p=0.007 n=30+24)
RegexpMatchHard_32-4     1.04MB/s ± 0%  1.03MB/s ± 3%  -1.38%  (p=0.003 n=23+30)
RegexpMatchHard_1K-4     1.10MB/s ± 0%  1.08MB/s ± 3%  -1.52%  (p=0.000 n=24+30)
Revcomp-4                33.0MB/s ± 3%  33.0MB/s ± 2%    ~     (p=0.128 n=24+25)
Template-4               2.17MB/s ± 5%  2.21MB/s ± 6%  +1.61%  (p=0.000 n=30+30)
[Geo mean]               7.79MB/s       7.79MB/s       +0.05%

Change-Id: Ied3dbdb5ba8e386168629cba06fcd4263bbb83e1
Reviewed-on: https://go-review.googlesource.com/94901
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-02-22 04:10:07 +00:00
Ben Shi
3c8b824453 cmd/compile: optimize ARM64 code with MNEG
A pair of MUL/NEG instructions can be combined into a single MNEG on ARM64.
This CL implements this optimization.
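
A minimal sketch of the source pattern (negMul is a hypothetical name; the exact code that triggers the rule is an assumption):

```go
package main

import "fmt"

// negMul multiplies and then negates, the MUL/NEG pair that a
// single MNEG can replace.
func negMul(a, b int64) int64 {
	return -(a * b)
}

func main() {
	fmt.Println(negMul(3, 4))
}
```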

1. A special test case gets big improvement.
(https://github.com/benshi001/ugo1/blob/master/mneg_test.go)
name                     old time/op    new time/op    delta
MNEG-4                      315µs ± 0%     260µs ± 0%  -17.39%  (p=0.000 n=24+25)

2. There is little change in the go1 benchmark, excluding noise.
name                     old time/op    new time/op    delta
BinaryTree17-4              42.2s ± 2%     41.9s ± 2%  -0.82%  (p=0.001 n=30+26)
Fannkuch11-4                32.9s ± 0%     32.9s ± 0%  -0.01%  (p=0.006 n=20+26)
FmtFprintfEmpty-4           541ns ± 3%     534ns ± 0%  -1.24%  (p=0.003 n=30+26)
FmtFprintfString-4         1.09µs ± 0%    1.10µs ± 3%    ~     (p=0.142 n=23+30)
FmtFprintfInt-4            1.14µs ± 0%    1.14µs ± 0%    ~     (p=0.435 n=24+24)
FmtFprintfIntInt-4         1.76µs ± 0%    1.76µs ± 0%    ~     (p=0.508 n=24+26)
FmtFprintfPrefixedInt-4    2.20µs ± 3%    2.17µs ± 0%  -1.10%  (p=0.017 n=30+24)
FmtFprintfFloat-4          3.28µs ± 0%    3.28µs ± 0%    ~     (p=0.579 n=24+24)
FmtManyArgs-4              7.30µs ± 0%    7.30µs ± 0%    ~     (p=0.662 n=26+27)
GobDecode-4                94.8ms ± 0%    94.8ms ± 0%  +0.07%  (p=0.010 n=25+23)
GobEncode-4                80.9ms ± 4%    80.6ms ± 4%    ~     (p=0.901 n=30+30)
Gzip-4                      4.45s ± 0%     4.49s ± 0%  +0.98%  (p=0.000 n=25+24)
Gunzip-4                    450ms ± 3%     443ms ± 0%    ~     (p=0.942 n=30+26)
HTTPClientServer-4          548µs ± 1%     551µs ± 1%  +0.60%  (p=0.000 n=29+30)
JSONEncode-4                210ms ± 0%     211ms ± 0%  +0.03%  (p=0.000 n=23+25)
JSONDecode-4                866ms ± 5%     877ms ± 5%    ~     (p=0.187 n=30+30)
Mandelbrot200-4            51.4ms ± 0%    52.0ms ± 3%  +1.15%  (p=0.001 n=24+30)
GoParse-4                  42.9ms ± 5%    41.9ms ± 0%  -2.24%  (p=0.000 n=30+26)
RegexpMatchEasy0_32-4      1.02µs ± 3%    1.01µs ± 0%    ~     (p=0.247 n=30+26)
RegexpMatchEasy0_1K-4      3.90µs ± 0%    3.90µs ± 0%    ~     (p=0.062 n=24+24)
RegexpMatchEasy1_32-4       955ns ± 0%     956ns ± 0%  +0.16%  (p=0.000 n=25+23)
RegexpMatchEasy1_1K-4      6.42µs ± 3%    6.37µs ± 0%  -0.81%  (p=0.012 n=30+24)
RegexpMatchMedium_32-4     1.77µs ± 3%    1.79µs ± 0%  +1.28%  (p=0.003 n=30+24)
RegexpMatchMedium_1K-4      561µs ± 0%     569µs ± 3%  +1.50%  (p=0.000 n=25+30)
RegexpMatchHard_32-4       31.0µs ± 4%    30.8µs ± 0%    ~     (p=1.000 n=26+26)
RegexpMatchHard_1K-4        945µs ± 3%     945µs ± 3%    ~     (p=0.513 n=30+30)
Revcomp-4                   7.76s ± 4%     7.68s ± 0%    ~     (p=0.464 n=29+23)
Template-4                  903ms ± 5%     904ms ± 5%    ~     (p=0.248 n=30+30)
TimeParse-4                4.80µs ± 0%    4.80µs ± 0%    ~     (p=0.081 n=25+26)
TimeFormat-4               4.70µs ± 1%    4.70µs ± 1%    ~     (p=0.763 n=24+26)
[Geo mean]                  709µs          708µs       -0.09%

name                     old speed      new speed      delta
GobDecode-4              8.10MB/s ± 0%  8.09MB/s ± 0%    ~     (p=0.160 n=25+23)
GobEncode-4              9.49MB/s ± 4%  9.53MB/s ± 4%    ~     (p=0.360 n=30+30)
Gzip-4                   4.36MB/s ± 0%  4.32MB/s ± 0%  -0.92%  (p=0.000 n=25+24)
Gunzip-4                 43.2MB/s ± 3%  43.8MB/s ± 0%    ~     (p=0.980 n=30+26)
JSONEncode-4             9.22MB/s ± 0%  9.22MB/s ± 0%  -0.04%  (p=0.005 n=23+25)
JSONDecode-4             2.24MB/s ± 5%  2.21MB/s ± 4%    ~     (p=0.252 n=30+30)
GoParse-4                1.35MB/s ± 5%  1.38MB/s ± 0%  +2.00%  (p=0.003 n=30+26)
RegexpMatchEasy0_32-4    31.5MB/s ± 3%  31.8MB/s ± 0%    ~     (p=0.110 n=30+26)
RegexpMatchEasy0_1K-4     263MB/s ± 0%   263MB/s ± 0%    ~     (p=0.111 n=24+24)
RegexpMatchEasy1_32-4    33.5MB/s ± 0%  33.4MB/s ± 0%  -0.16%  (p=0.003 n=25+23)
RegexpMatchEasy1_1K-4     160MB/s ± 3%   161MB/s ± 0%  +0.78%  (p=0.012 n=30+24)
RegexpMatchMedium_32-4    565kB/s ± 3%   560kB/s ± 0%  -0.83%  (p=0.001 n=30+24)
RegexpMatchMedium_1K-4   1.83MB/s ± 0%  1.80MB/s ± 3%  -1.56%  (p=0.000 n=25+30)
RegexpMatchHard_32-4     1.03MB/s ± 3%  1.04MB/s ± 0%  +1.46%  (p=0.000 n=30+26)
RegexpMatchHard_1K-4     1.08MB/s ± 3%  1.09MB/s ± 3%    ~     (p=0.444 n=30+30)
Revcomp-4                32.8MB/s ± 4%  33.1MB/s ± 0%    ~     (p=0.858 n=29+23)
Template-4               2.15MB/s ± 5%  2.15MB/s ± 5%    ~     (p=0.646 n=30+30)
[Geo mean]               7.79MB/s       7.81MB/s       +0.21%

3. There is no regression in the compilecmp benchmark.
name        old time/op       new time/op       delta
Template          2.35s ± 4%        2.33s ± 3%    ~     (p=0.796 n=10+10)
Unicode           1.35s ± 6%        1.35s ± 5%    ~     (p=1.000 n=9+10)
GoTypes           8.10s ± 3%        8.14s ± 3%    ~     (p=0.604 n=9+10)
Compiler          40.5s ± 2%        40.2s ± 2%    ~     (p=0.065 n=10+9)
SSA                115s ± 2%         115s ± 2%    ~     (p=0.447 n=9+10)
Flate             1.45s ± 3%        1.45s ± 4%    ~     (p=0.739 n=10+10)
GoParser          1.85s ± 3%        1.86s ± 2%    ~     (p=0.853 n=10+10)
Reflect           5.11s ± 2%        5.10s ± 2%    ~     (p=0.971 n=10+10)
Tar               2.23s ± 5%        2.23s ± 3%    ~     (p=0.796 n=10+10)
XML               2.67s ± 2%        2.69s ± 2%    ~     (p=0.549 n=9+10)
[Geo mean]        5.00s             5.00s       +0.02%

name        old user-time/op  new user-time/op  delta
Template          2.88s ± 2%        2.86s ± 2%    ~     (p=0.529 n=10+10)
Unicode           1.70s ± 7%        1.69s ± 5%    ~     (p=0.853 n=10+10)
GoTypes           9.72s ± 1%        9.73s ± 1%    ~     (p=0.684 n=10+10)
Compiler          49.0s ± 1%        48.9s ± 1%    ~     (p=0.631 n=10+10)
SSA                144s ± 1%         144s ± 2%    ~     (p=0.684 n=10+10)
Flate             1.71s ± 4%        1.72s ± 4%    ~     (p=0.853 n=10+10)
GoParser          2.23s ± 2%        2.23s ± 2%    ~     (p=0.971 n=10+10)
Reflect           5.98s ± 2%        5.96s ± 2%    ~     (p=0.481 n=10+10)
Tar               2.68s ± 3%        2.67s ± 2%    ~     (p=0.393 n=10+10)
XML               3.21s ± 3%        3.22s ± 1%    ~     (p=0.604 n=10+9)
[Geo mean]        6.05s             6.05s       -0.04%

name        old text-bytes    new text-bytes    delta
HelloSize         641kB ± 0%        641kB ± 0%    ~     (all equal)

name        old data-bytes    new data-bytes    delta
HelloSize        9.46kB ± 0%       9.46kB ± 0%    ~     (all equal)

name        old bss-bytes     new bss-bytes     delta
HelloSize         125kB ± 0%        125kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.24MB ± 0%       1.24MB ± 0%    ~     (all equal)

Change-Id: I9ed9128f0114e0f1ebb08ca2d042c90fcb2b1dcd
Reviewed-on: https://go-review.googlesource.com/95075
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-20 15:23:23 +00:00
philhofer
2d0172c3a7 cmd/compile/internal/ssa: emit csel on arm64
Introduce a new SSA pass to generate CondSelect instructions,
and add CondSelect lowering rules for arm64.

In order to make the CSEL instruction easier to optimize,
and to simplify the introduction of CSNEG, CSINC, and CSINV
in the future, modify the CSEL instruction to accept a condition
code in the aux field.
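
As an illustration, a small if/else that only chooses between two values is the kind of code a conditional select can express without a branch. This is a sketch with a hypothetical function name; whether CSEL is actually emitted for it depends on the compiler version and its heuristics:

```go
package main

import "fmt"

// minInt picks one of two values based on a comparison. No other work
// happens on either path, so the choice can be a select, not a branch.
func minInt(a, b int) int {
	m := b
	if a < b {
		m = a
	}
	return m
}

func main() {
	fmt.Println(minInt(3, 5), minInt(9, 2))
}
```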

Notably, this change makes the go1 Gzip benchmark
more than 10% faster.

Benchmarks on a Cavium ThunderX:

name                      old time/op    new time/op    delta
BinaryTree17-96              15.9s ± 6%     16.0s ± 4%     ~     (p=0.968 n=10+9)
Fannkuch11-96                7.17s ± 0%     7.00s ± 0%   -2.43%  (p=0.000 n=8+9)
FmtFprintfEmpty-96           208ns ± 1%     207ns ± 0%     ~     (p=0.152 n=10+8)
FmtFprintfString-96          379ns ± 0%     375ns ± 0%   -0.95%  (p=0.000 n=10+9)
FmtFprintfInt-96             385ns ± 0%     383ns ± 0%   -0.52%  (p=0.000 n=9+10)
FmtFprintfIntInt-96          591ns ± 0%     586ns ± 0%   -0.85%  (p=0.006 n=7+9)
FmtFprintfPrefixedInt-96     656ns ± 0%     667ns ± 0%   +1.71%  (p=0.000 n=10+10)
FmtFprintfFloat-96           967ns ± 0%     984ns ± 0%   +1.78%  (p=0.000 n=10+10)
FmtManyArgs-96              2.35µs ± 0%    2.25µs ± 0%   -4.63%  (p=0.000 n=9+8)
GobDecode-96                31.0ms ± 0%    30.8ms ± 0%   -0.36%  (p=0.006 n=9+9)
GobEncode-96                24.4ms ± 0%    24.5ms ± 0%   +0.30%  (p=0.000 n=9+9)
Gzip-96                      1.60s ± 0%     1.43s ± 0%  -10.58%  (p=0.000 n=9+10)
Gunzip-96                    167ms ± 0%     169ms ± 0%   +0.83%  (p=0.000 n=8+9)
HTTPClientServer-96          311µs ± 1%     308µs ± 0%   -0.75%  (p=0.000 n=10+10)
JSONEncode-96               65.0ms ± 0%    64.8ms ± 0%   -0.25%  (p=0.000 n=9+8)
JSONDecode-96                262ms ± 1%     261ms ± 1%     ~     (p=0.579 n=10+10)
Mandelbrot200-96            18.0ms ± 0%    18.1ms ± 0%   +0.17%  (p=0.000 n=8+10)
GoParse-96                  14.0ms ± 0%    14.1ms ± 1%   +0.42%  (p=0.003 n=9+10)
RegexpMatchEasy0_32-96       644ns ± 2%     645ns ± 2%     ~     (p=0.836 n=10+10)
RegexpMatchEasy0_1K-96      3.70µs ± 0%    3.49µs ± 0%   -5.58%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-96       662ns ± 2%     657ns ± 2%     ~     (p=0.137 n=10+10)
RegexpMatchEasy1_1K-96      4.47µs ± 0%    4.31µs ± 0%   -3.48%  (p=0.000 n=10+10)
RegexpMatchMedium_32-96      844ns ± 2%     849ns ± 1%     ~     (p=0.208 n=10+10)
RegexpMatchMedium_1K-96      179µs ± 0%     182µs ± 0%   +1.20%  (p=0.000 n=10+10)
RegexpMatchHard_32-96       10.0µs ± 0%    10.1µs ± 0%   +0.48%  (p=0.000 n=10+9)
RegexpMatchHard_1K-96        297µs ± 0%     297µs ± 0%   -0.14%  (p=0.000 n=10+10)
Revcomp-96                   3.08s ± 0%     3.13s ± 0%   +1.56%  (p=0.000 n=9+9)
Template-96                  276ms ± 2%     275ms ± 1%     ~     (p=0.393 n=10+10)
TimeParse-96                1.37µs ± 0%    1.36µs ± 0%   -0.53%  (p=0.000 n=10+7)
TimeFormat-96               1.40µs ± 0%    1.42µs ± 0%   +0.97%  (p=0.000 n=10+10)
[Geo mean]                   264µs          262µs        -0.77%

Change-Id: Ie54eee4b3092af53e6da3baa6d1755098f57f3a2
Reviewed-on: https://go-review.googlesource.com/55670
Run-TryBot: Philip Hofer <phofer@umich.edu>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2018-02-20 06:00:54 +00:00
Chad Rosier
07f0f09563 cmd/compile: make math.Ceil/Floor/Round/Trunc intrinsics on arm64
name       old time/op  new time/op  delta
Ceil        550ns ± 0%   486ns ± 7%  -11.64%  (p=0.000 n=13+18)
Floor       495ns ±19%   512ns ±12%     ~     (p=0.164 n=20+20)
Round       550ns ± 0%   487ns ± 8%  -11.49%  (p=0.000 n=12+19)
Trunc       563ns ± 7%   488ns ±13%  -13.44%  (p=0.000 n=15+2)
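
For reference, a short sketch of the four functions named in the title, showing how they differ on the same input (roundings is a hypothetical helper, not part of this change):

```go
package main

import (
	"fmt"
	"math"
)

// roundings applies the four rounding functions to x.
func roundings(x float64) (ceil, floor, round, trunc float64) {
	return math.Ceil(x), math.Floor(x), math.Round(x), math.Trunc(x)
}

func main() {
	// Ceil rounds up, Floor rounds down, Round rounds half away from
	// zero, and Trunc drops the fractional part.
	fmt.Println(roundings(-1.5))
	fmt.Println(roundings(2.5))
}
```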

Change-Id: I53f234b160b3c026a277506e2cf977d150379464
Reviewed-on: https://go-review.googlesource.com/88295
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-16 15:37:57 +00:00
Balaram Makam
fcba05148f cmd/compile: arm64 intrinsics for math/bits.OnesCount
This adds math/bits intrinsics for OnesCount on arm64.
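
A minimal usage sketch of the functions being intrinsified (popcount is a hypothetical wrapper; callers need no source changes to benefit):

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcount returns the number of set bits in x.
func popcount(x uint64) int {
	return bits.OnesCount64(x)
}

func main() {
	fmt.Println(popcount(0xFF00), bits.OnesCount8(0x0F), bits.OnesCount32(1<<31))
}
```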

name         old time/op  new time/op  delta
OnesCount    3.81ns ± 0%  1.60ns ± 0%  -57.96%  (p=0.000 n=7+8)
OnesCount8   1.60ns ± 0%  1.60ns ± 0%     ~     (all equal)
OnesCount16  2.41ns ± 0%  1.60ns ± 0%  -33.61%  (p=0.000 n=8+8)
OnesCount32  4.17ns ± 0%  1.60ns ± 0%  -61.58%  (p=0.000 n=8+8)
OnesCount64  3.80ns ± 0%  1.60ns ± 0%  -57.84%  (p=0.000 n=8+8)

Update #18616

Conflicts:
	src/cmd/compile/internal/gc/asm_test.go

Change-Id: I63ac2f63acafdb1f60656ab8a56be0b326eec5cb
Reviewed-on: https://go-review.googlesource.com/90835
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-15 23:00:20 +00:00
Ben Shi
ebb77aa867 cmd/compile/internal/ssa: optimize arm64 with FNMULS/FNMULD
FNMULS and FNMULD are efficient arm64 instructions that can be used
to improve FP performance. This CL uses them to optimize pairs of neg-mul
operations.
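
A minimal sketch of the source-level shape a neg-mul pair takes, assuming the pattern is a negated floating-point product. The function names are hypothetical; whether a particular toolchain emits FNMULD/FNMULS for them can be checked by inspecting the assembly, for example with GOARCH=arm64 go build -gcflags=-S.

```go
package main

import "fmt"

// negMul64 negates a float64 product: a multiply followed by a negate.
func negMul64(x, y float64) float64 {
	return -(x * y)
}

// negMul32 is the float32 counterpart.
func negMul32(x, y float32) float32 {
	return -(x * y)
}

func main() {
	fmt.Println(negMul64(3, 4), negMul32(1.5, 2)) // prints "-12 -3"
}
```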

Here are benchmark results on a Raspberry Pi 3 running Arch Linux.

1. A special test case gets about 15% improvement.
(https://github.com/benshi001/ugo1/blob/master/fpmul_test.go)
FPMul-4                     485µs ± 0%     410µs ± 0%  -15.49%  (p=0.000 n=26+23)

2. There is little regression in the go1 benchmark (excluding noise).
name                     old time/op    new time/op    delta
BinaryTree17-4              42.0s ± 3%     42.1s ± 2%    ~     (p=0.542 n=39+40)
Fannkuch11-4                33.3s ± 3%     32.9s ± 1%    ~     (p=0.200 n=40+32)
FmtFprintfEmpty-4           534ns ± 0%     534ns ± 0%    ~     (all equal)
FmtFprintfString-4         1.09µs ± 1%    1.09µs ± 0%    ~     (p=0.950 n=32+32)
FmtFprintfInt-4            1.14µs ± 0%    1.14µs ± 1%    ~     (p=0.571 n=32+31)
FmtFprintfIntInt-4         1.79µs ± 3%    1.76µs ± 0%  -1.42%  (p=0.004 n=40+34)
FmtFprintfPrefixedInt-4    2.17µs ± 0%    2.17µs ± 0%    ~     (p=0.073 n=31+34)
FmtFprintfFloat-4          3.33µs ± 3%    3.28µs ± 0%  -1.46%  (p=0.001 n=40+34)
FmtManyArgs-4              7.28µs ± 6%    7.19µs ± 0%    ~     (p=0.641 n=40+33)
GobDecode-4                96.5ms ± 4%    96.5ms ± 9%    ~     (p=0.214 n=40+40)
GobEncode-4                79.5ms ± 0%    80.7ms ± 4%  +1.51%  (p=0.000 n=34+40)
Gzip-4                      4.53s ± 4%     4.56s ± 4%  +0.60%  (p=0.000 n=40+40)
Gunzip-4                    451ms ± 3%     442ms ± 0%  -1.93%  (p=0.000 n=40+32)
HTTPClientServer-4          530µs ± 1%     535µs ± 1%  +0.88%  (p=0.000 n=39+39)
JSONEncode-4                214ms ± 4%     211ms ± 0%    ~     (p=0.059 n=40+31)
JSONDecode-4                865ms ± 5%     864ms ± 4%  -0.06%  (p=0.003 n=40+40)
Mandelbrot200-4            52.0ms ± 3%    52.1ms ± 3%    ~     (p=0.556 n=40+40)
GoParse-4                  43.1ms ± 8%    42.1ms ± 0%    ~     (p=0.083 n=40+33)
RegexpMatchEasy0_32-4      1.02µs ± 3%    1.02µs ± 4%  +0.06%  (p=0.020 n=40+40)
RegexpMatchEasy0_1K-4      3.90µs ± 0%    3.96µs ± 3%  +1.58%  (p=0.000 n=31+40)
RegexpMatchEasy1_32-4       967ns ± 4%     981ns ± 3%  +1.40%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4      6.41µs ± 4%    6.43µs ± 3%    ~     (p=0.386 n=40+40)
RegexpMatchMedium_32-4     1.76µs ± 3%    1.78µs ± 3%  +1.08%  (p=0.000 n=40+40)
RegexpMatchMedium_1K-4      561µs ± 0%     562µs ± 0%  +0.09%  (p=0.003 n=34+31)
RegexpMatchHard_32-4       31.5µs ± 2%    31.1µs ± 4%  -1.17%  (p=0.000 n=30+40)
RegexpMatchHard_1K-4        960µs ± 3%     950µs ± 4%  -1.02%  (p=0.016 n=40+40)
Revcomp-4                   7.79s ± 7%     7.79s ± 4%    ~     (p=0.859 n=40+40)
Template-4                  889ms ± 6%     872ms ± 3%  -1.86%  (p=0.025 n=40+31)
TimeParse-4                4.80µs ± 0%    4.89µs ± 3%  +1.71%  (p=0.001 n=31+40)
TimeFormat-4               4.70µs ± 1%    4.78µs ± 3%  +1.57%  (p=0.000 n=33+40)
[Geo mean]                  710µs          709µs       -0.13%

name                     old speed      new speed      delta
GobDecode-4              7.96MB/s ± 4%  7.96MB/s ± 9%    ~     (p=0.174 n=40+40)
GobEncode-4              9.65MB/s ± 0%  9.51MB/s ± 4%  -1.45%  (p=0.000 n=34+40)
Gzip-4                   4.29MB/s ± 4%  4.26MB/s ± 4%  -0.59%  (p=0.000 n=40+40)
Gunzip-4                 43.0MB/s ± 3%  43.9MB/s ± 0%  +1.90%  (p=0.000 n=40+32)
JSONEncode-4             9.09MB/s ± 4%  9.22MB/s ± 0%    ~     (p=0.429 n=40+31)
JSONDecode-4             2.25MB/s ± 5%  2.25MB/s ± 4%    ~     (p=0.278 n=40+40)
GoParse-4                1.35MB/s ± 7%  1.37MB/s ± 0%    ~     (p=0.071 n=40+25)
RegexpMatchEasy0_32-4    31.5MB/s ± 3%  31.5MB/s ± 4%  -0.08%  (p=0.018 n=40+40)
RegexpMatchEasy0_1K-4     263MB/s ± 0%   259MB/s ± 3%  -1.51%  (p=0.000 n=31+40)
RegexpMatchEasy1_32-4    33.1MB/s ± 4%  32.6MB/s ± 3%  -1.38%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4     160MB/s ± 4%   159MB/s ± 3%    ~     (p=0.364 n=40+40)
RegexpMatchMedium_32-4    565kB/s ± 3%   562kB/s ± 2%    ~     (p=0.208 n=40+40)
RegexpMatchMedium_1K-4   1.82MB/s ± 0%  1.82MB/s ± 0%  -0.27%  (p=0.000 n=34+31)
RegexpMatchHard_32-4     1.02MB/s ± 3%  1.03MB/s ± 4%  +1.04%  (p=0.000 n=32+40)
RegexpMatchHard_1K-4     1.07MB/s ± 4%  1.08MB/s ± 4%  +0.94%  (p=0.003 n=40+40)
Revcomp-4                32.6MB/s ± 7%  32.6MB/s ± 4%    ~     (p=0.965 n=40+40)
Template-4               2.18MB/s ± 6%  2.22MB/s ± 3%  +1.83%  (p=0.020 n=40+31)
[Geo mean]               7.77MB/s       7.78MB/s       +0.16%

3. There is little change in the compilecmp benchmark (excluding noise).
name        old time/op       new time/op       delta
Template          2.37s ± 3%        2.35s ± 4%    ~     (p=0.529 n=10+10)
Unicode           1.38s ± 8%        1.36s ± 5%    ~     (p=0.247 n=10+10)
GoTypes           8.10s ± 2%        8.10s ± 2%    ~     (p=0.971 n=10+10)
Compiler          40.5s ± 4%        40.8s ± 1%    ~     (p=0.529 n=10+10)
SSA                115s ± 2%         115s ± 3%    ~     (p=0.684 n=10+10)
Flate             1.45s ± 5%        1.46s ± 3%    ~     (p=0.796 n=10+10)
GoParser          1.86s ± 4%        1.84s ± 2%    ~     (p=0.095 n=9+10)
Reflect           5.11s ± 2%        5.13s ± 2%    ~     (p=0.315 n=10+10)
Tar               2.22s ± 3%        2.23s ± 1%    ~     (p=0.299 n=9+7)
XML               2.72s ± 3%        2.72s ± 3%    ~     (p=0.912 n=10+10)
[Geo mean]        5.03s             5.02s       -0.21%

name        old user-time/op  new user-time/op  delta
Template          2.92s ± 2%        2.89s ± 1%    ~     (p=0.247 n=10+10)
Unicode           1.71s ± 5%        1.69s ± 4%    ~     (p=0.393 n=10+10)
GoTypes           9.78s ± 2%        9.76s ± 2%    ~     (p=0.631 n=10+10)
Compiler          49.1s ± 2%        49.1s ± 1%    ~     (p=0.796 n=10+10)
SSA                144s ± 1%         144s ± 2%    ~     (p=0.796 n=10+10)
Flate             1.74s ± 2%        1.73s ± 3%    ~     (p=0.842 n=10+9)
GoParser          2.23s ± 3%        2.25s ± 2%    ~     (p=0.143 n=10+10)
Reflect           5.93s ± 3%        5.98s ± 2%    ~     (p=0.211 n=10+9)
Tar               2.65s ± 2%        2.69s ± 3%  +1.51%  (p=0.010 n=9+10)
XML               3.25s ± 2%        3.21s ± 1%  -1.24%  (p=0.035 n=10+9)
[Geo mean]        6.07s             6.07s       -0.08%

name        old text-bytes    new text-bytes    delta
HelloSize         641kB ± 0%        641kB ± 0%    ~     (all equal)

name        old data-bytes    new data-bytes    delta
HelloSize        9.46kB ± 0%       9.46kB ± 0%    ~     (all equal)

name        old bss-bytes     new bss-bytes     delta
HelloSize         125kB ± 0%        125kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.24MB ± 0%       1.24MB ± 0%    ~     (all equal)

Change-Id: Id095d998c380eef929755124084df02446a6b7c1
Reviewed-on: https://go-review.googlesource.com/92555
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-14 15:22:05 +00:00
Austin Clements
79594ee95a runtime: buffered write barrier for arm64
Updates #22460.

Change-Id: I5f8fbece9545840f5fc4c9834e2050b0920776f0
Reviewed-on: https://go-review.googlesource.com/92699
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-13 16:34:19 +00:00
Cherry Zhang
6f3e5e637c cmd/compile: intrinsify runtime.getcallersp
Add a compiler intrinsic for getcallersp, so that we are able to get
rid of the argument (not done in this CL).

Change-Id: Ic38fda1c694f918328659ab44654198fb116668d
Reviewed-on: https://go-review.googlesource.com/69350
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
2017-10-10 15:15:21 +00:00
Wei Xiao
c02fc1605a cmd/compile: memory clearing optimization for arm64
Use "STP (ZR, ZR), O(R)" instead of "MOVD ZR, O(R)" to implement memory clearing.
Also improve assembler support for STP/LDP.
Results (A57@2GHzx8):

benchmark                   old ns/op     new ns/op     delta
BenchmarkClearFat8-8        1.00          1.00          +0.00%
BenchmarkClearFat12-8       1.01          1.01          +0.00%
BenchmarkClearFat16-8       1.01          1.01          +0.00%
BenchmarkClearFat24-8       1.52          1.52          +0.00%
BenchmarkClearFat32-8       3.00          2.02          -32.67%
BenchmarkClearFat40-8       3.50          2.52          -28.00%
BenchmarkClearFat48-8       3.50          3.03          -13.43%
BenchmarkClearFat56-8       4.00          3.50          -12.50%
BenchmarkClearFat64-8       4.25          4.00          -5.88%
BenchmarkClearFat128-8      8.01          8.01          +0.00%
BenchmarkClearFat256-8      16.1          16.0          -0.62%
BenchmarkClearFat512-8      32.1          32.0          -0.31%
BenchmarkClearFat1024-8     64.1          64.1          +0.00%
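
As a rough sketch of the kind of code that exercises memory clearing, the hypothetical 32-byte type below is zeroed with a single assignment. The type and function names are assumptions for illustration; the instructions actually chosen for the clear depend on the compiler version and the object size.

```go
package main

import "fmt"

// fat32 is a hypothetical 32-byte value (four 8-byte words).
type fat32 struct {
	a, b, c, d uint64
}

// zero clears the value in place; on arm64 the compiler decides how to
// emit the stores for a clear of this size.
func zero(p *fat32) {
	*p = fat32{}
}

func main() {
	v := fat32{1, 2, 3, 4}
	zero(&v)
	fmt.Println(v == fat32{}) // prints "true"
}
```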

Change-Id: Ie5f5eac271ff685884775005825f206167a5c146
Reviewed-on: https://go-review.googlesource.com/55610
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-08-25 20:09:06 +00:00
philhofer
c59b495963 cmd/compile: add support for arm64 bit-test instructions
Add support for generating TBZ/TBNZ instructions.

The bit-test-and-branch pattern shows up in a number of
important places, including the runtime (gc bitmaps).

Before this change, there were 3 TB[N]?Z instructions in the Go tool,
all of which were in hand-written assembly. After this change, there
are 285. Also, the go1 benchmark binary gets about 4.5kB smaller.
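
A minimal sketch of the bit-test-and-branch pattern at the source level, assuming a test of one constant bit followed by a branch. The constant flagBit and the function hasFlag are hypothetical; whether a TBZ or TBNZ is emitted for them depends on the compiler version.

```go
package main

import "fmt"

// flagBit is a hypothetical bit position used for illustration.
const flagBit = 5

// hasFlag tests a single constant bit and branches on the result.
func hasFlag(x uint64) bool {
	if x&(1<<flagBit) != 0 {
		return true
	}
	return false
}

func main() {
	fmt.Println(hasFlag(32), hasFlag(31)) // prints "true false"
}
```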

Fixes #21361

Change-Id: I170c138b852754b9b8df149966ca5e62e6dfa771
Reviewed-on: https://go-review.googlesource.com/54470
Run-TryBot: Philip Hofer <phofer@umich.edu>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-08-15 13:39:11 +00:00
Keith Randall
1e72bf6218 cmd/compile: experiment which clobbers all dead pointer fields
The experiment "clobberdead" clobbers all pointer fields that the
compiler thinks are dead, just before and after every safepoint.
Useful for debugging the generation of live pointer bitmaps.

Helped find the following issues:
Update #15936
Update #16026
Update #16095
Update #18860

Change-Id: Id1d12f86845e3d93bae903d968b1eac61fc461f9
Reviewed-on: https://go-review.googlesource.com/23924
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-04-21 20:19:50 +00:00
Matthew Dempsky
691755304c cmd/compile/internal/ssa: populate SymEffects for SSA Ops
Changes to ${GOARCH}Ops.go files were mechanically produced using
github.com/mdempsky/ssa-symops, a one-off tool that inserts
"SymEffect: X" elements by pattern matching against the Op names.

Change-Id: Ibf3e481ffd588647f2a31662d72114b740ccbfcf
Reviewed-on: https://go-review.googlesource.com/38084
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-14 18:34:45 +00:00
Matthew Dempsky
08d8d5c986 cmd/compile/internal/ssa: replace {Defer,Go}Call with StaticCall
Passes toolstash-check -all.

Change-Id: Icf8b75364e4761a5e56567f503b2c1cb17382ed2
Reviewed-on: https://go-review.googlesource.com/38080
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-13 19:44:36 +00:00