225 commits

Author SHA1 Message Date
Lynn Boger
a8b2e4a630 cmd/compile: improve LoweredMove performance on ppc64x
This change improves the performance for LoweredMove on ppc64le
and ppc64.

benchmark                   old ns/op     new ns/op     delta
BenchmarkCopyFat8-16        0.93          0.69          -25.81%
BenchmarkCopyFat12-16       2.61          1.85          -29.12%
BenchmarkCopyFat16-16       9.68          1.89          -80.48%
BenchmarkCopyFat24-16       4.48          1.85          -58.71%
BenchmarkCopyFat32-16       6.12          1.82          -70.26%
BenchmarkCopyFat64-16       21.2          2.70          -87.26%
BenchmarkCopyFat128-16      29.6          3.97          -86.59%
BenchmarkCopyFat256-16      52.6          13.4          -74.52%
BenchmarkCopyFat512-16      97.1          18.7          -80.74%
BenchmarkCopyFat1024-16     186           35.3          -81.02%

BenchmarkAssertE2TLarge-16      14.2          5.06          -64.37%

Fixes #19785

Change-Id: I7d5e0052712b75811c02c7d86c5112e5649ad782
Reviewed-on: https://go-review.googlesource.com/38950
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-31 21:24:09 +00:00
Ben Shi
8577f81a10 cmd/compile/internal: Optimization with RBIT and REV
By checking GOARM in ssa/gen/ARM.rules, each intermediate operator
can be implemented via a different instruction sequence.

It is up to the user to choose between compatibility and efficiency.

Bswap32(x) is optimized to REV(x) when GOARM >= 6.
CTZ(x) is optimized to CLZ(RBIT x) when GOARM == 7.
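
The two identities behind these rewrites can be checked in plain Go. This is
only a sketch: the helper names ctzViaClzRbit and bswap32 are hypothetical,
and math/bits stands in for the ARM instructions rather than reproducing the
actual rules in ARM.rules.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ctzViaClzRbit counts trailing zeros by reversing the bits and then
// counting leading zeros, mirroring the CTZ(x) -> CLZ(RBIT x) rewrite.
func ctzViaClzRbit(x uint32) int {
	return bits.LeadingZeros32(bits.Reverse32(x))
}

// bswap32 swaps the byte order by hand; a single REV does the same work.
func bswap32(x uint32) uint32 {
	return x>>24 | (x>>8)&0xff00 | (x<<8)&0xff0000 | x<<24
}

func main() {
	for _, x := range []uint32{0, 1, 8, 0x80000000, 0x12345678} {
		// Both comparisons are expected to hold for every input.
		fmt.Println(x,
			ctzViaClzRbit(x) == bits.TrailingZeros32(x),
			bswap32(x) == bits.ReverseBytes32(x))
	}
}
```

The hand-written shifts in bswap32 show roughly what has to be emitted when
no byte-reverse instruction is available.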

Change-Id: Ie9ee645fa39333fa79ad84ed4d1cefac30422814
Reviewed-on: https://go-review.googlesource.com/35610
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-03-31 15:10:24 +00:00
Keith Randall
68da265c8e Revert "cmd/compile: automatically handle commuting ops in rewrite rules"
This reverts commit 041ecb697f.

Reason for revert: Not working on S390x and some 386 archs.
I have a guess why the S390x is failing.  No clue on the 386 yet.
Revert until I can figure it out.

Change-Id: I64f1ce78fa6d1037ebe7ee2a8a8107cb4c1db70c
Reviewed-on: https://go-review.googlesource.com/38790
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-29 18:06:44 +00:00
Keith Randall
041ecb697f cmd/compile: automatically handle commuting ops in rewrite rules
We have lots of rewrite rules that vary only in the fact that
we have 2 versions for the 2 different orderings of various
commuting ops. For example:

(ADDL x (MOVLconst [c])) -> (ADDLconst [c] x)
(ADDL (MOVLconst [c]) x) -> (ADDLconst [c] x)

It can get unwieldy quickly, especially when there is more than
one commuting op in a rule.

Our existing "fix" for this problem is to have rules that
canonicalize the operations first. For example:

(Eq64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Eq64 (Const64 <t> [c]) x)

Subsequent rules can then assume if there is a constant arg to Eq64,
it will be the first one. This fix kinda works, but it is fragile and
only works when we remember to include the required extra rules.

The fundamental problem is that the rule matcher doesn't
know anything about commuting ops. This CL fixes that fact.

We already have information about which ops commute. (The register
allocator takes advantage of commutativity.)  The rule generator now
automatically generates multiple rules for a single source rule when
there are commutative ops in the rule. We can now drop all of our
almost-duplicate source-level rules and the canonicalization rules.
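
A toy version of that expansion, as a sketch only: the expr type, the
commutative table and the expand, node, leaf and orChain helpers are all
hypothetical and far simpler than the real rule generator. It does show why
the number of generated rules doubles with every commutative op in a source
rule.

```go
package main

import "fmt"

// expr is a toy S-expression: an op with arguments, or a leaf (no args).
type expr struct {
	op   string
	args []expr
}

// commutative lists the toy ops whose two arguments may be swapped.
var commutative = map[string]bool{"ADDL": true, "ORL": true}

func leaf(name string) expr               { return expr{op: name} }
func node(op string, args ...expr) expr { return expr{op: op, args: args} }

func (e expr) String() string {
	if len(e.args) == 0 {
		return e.op
	}
	s := "(" + e.op
	for _, a := range e.args {
		s += " " + a.String()
	}
	return s + ")"
}

// expand returns every reordering of e obtained by swapping the arguments
// of commutative two-argument ops, at any depth.
func expand(e expr) []expr {
	if len(e.args) == 0 {
		return []expr{e}
	}
	// Cross product of the expansions of each argument.
	combos := [][]expr{{}}
	for _, a := range e.args {
		var next [][]expr
		for _, prefix := range combos {
			for _, v := range expand(a) {
				row := append(append([]expr{}, prefix...), v)
				next = append(next, row)
			}
		}
		combos = next
	}
	var out []expr
	for _, args := range combos {
		out = append(out, node(e.op, args...))
		if commutative[e.op] && len(args) == 2 {
			out = append(out, node(e.op, args[1], args[0]))
		}
	}
	return out
}

// orChain builds a left-nested chain of ORL ops over n leaf loads.
func orChain(n int) expr {
	e := leaf("load0")
	for i := 1; i < n; i++ {
		e = node("ORL", e, leaf(fmt.Sprintf("load%d", i)))
	}
	return e
}

func main() {
	for _, v := range expand(node("ADDL", leaf("x"), node("MOVLconst", leaf("[c]")))) {
		fmt.Println(v)
	}
	fmt.Println(len(expand(orChain(8))))
}
```

In this toy model a chain of seven two-argument ORs over eight loads expands
to 2^7 = 128 variants, which is consistent with the figure quoted for the
8-way OR rules below.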

I have some CLs in progress that will be a lot less verbose when
the rule generator handles commutativity for me.

I had to reorganize the load-combining rules a bit. The 8-way OR rules
generated 128 different reorderings, which was causing the generator
to put too much code in the rewrite*.go files (the big ones were going
from 25K lines to 132K lines). Instead I reorganized the rules to
combine pairs of loads at a time. The generated rule files are now
actually a bit (5%) smaller.
[Note to reviewers: check these carefully. Most of the other rule
changes are trivial.]

Make.bash times are ~unchanged.

Compiler benchmarks are not observably different. Probably because
we don't spend much compiler time in rule matching anyway.

I've also done a pass over all of our ops adding commutative markings
for ops which hadn't had them previously.

Fixes #18292

Change-Id: I999b1307272e91965b66754576019dedcbe7527a
Reviewed-on: https://go-review.googlesource.com/38666
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2017-03-29 16:22:09 +00:00
Lynn Boger
23bd919136 cmd/compile: improve LoweredZero performance for ppc64x
This change improves the performance of the LoweredZero rule
on ppc64x.

The improvement can be seen in the runtime ClearFat
benchmarks:

BenchmarkClearFat12-16       2.40          0.69          -71.25%
BenchmarkClearFat16-16       9.98          0.93          -90.68%
BenchmarkClearFat24-16       4.75          0.93          -80.42%
BenchmarkClearFat32-16       6.02          0.93          -84.55%
BenchmarkClearFat40-16       7.19          1.16          -83.87%
BenchmarkClearFat48-16       15.0          1.39          -90.73%
BenchmarkClearFat56-16       9.95          1.62          -83.72%
BenchmarkClearFat64-16       18.0          1.86          -89.67%
BenchmarkClearFat128-16      30.0          8.08          -73.07%
BenchmarkClearFat256-16      52.5          11.3          -78.48%
BenchmarkClearFat512-16      97.0          19.0          -80.41%
BenchmarkClearFat1024-16     244           34.2          -85.98%

Fixes: #19532

Change-Id: If493e28bc1d8e61bc79978498be9f5336a36cd3f
Reviewed-on: https://go-review.googlesource.com/38096
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <munday@ca.ibm.com>
2017-03-21 15:08:02 +00:00
Michael Munday
17570a9afb cmd/compile: emit fused multiply-{add,subtract} on ppc64x
A follow-on to CL 36963 adding support for ppc64x.

Performance changes (as posted on the issue):

poly1305:
benchmark               old ns/op new ns/op delta
Benchmark64-16          172       151       -12.21%
Benchmark1K-16          1828      1523      -16.68%
Benchmark64Unaligned-16 172       151       -12.21%
Benchmark1KUnaligned-16 1827      1523      -16.64%

math:
BenchmarkAcos-16        43.9      39.9      -9.11%
BenchmarkAcosh-16       57.0      45.8      -19.65%
BenchmarkAsin-16        35.8      33.0      -7.82%
BenchmarkAsinh-16       68.6      60.8      -11.37%
BenchmarkAtan-16        19.8      16.2      -18.18%
BenchmarkAtanh-16       65.5      57.5      -12.21%
BenchmarkAtan2-16       45.4      34.2      -24.67%
BenchmarkGamma-16       37.6      26.0      -30.85%
BenchmarkLgamma-16      40.0      28.2      -29.50%
BenchmarkLog1p-16       35.1      29.1      -17.09%
BenchmarkSin-16         22.7      18.4      -18.94%
BenchmarkSincos-16      31.7      23.7      -25.24%
BenchmarkSinh-16        146       131       -10.27%
BenchmarkY0-16          130       107       -17.69%
BenchmarkY1-16          127       107       -15.75%
BenchmarkYn-16          278       235       -15.47%

Updates #17895.

Change-Id: I1c16199715d20c9c4bd97c4a950bcfa69eb688c1
Reviewed-on: https://go-review.googlesource.com/38095
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2017-03-20 20:01:29 +00:00
Keith Randall
42e97468a1 cmd/compile: intrinsic for math/bits.Reverse on ARM64
I don't know that it exists for any other architectures.

Update #18616

Change-Id: Idfe5dee251764d32787915889ec0be4bebc5be24
Reviewed-on: https://go-review.googlesource.com/38323
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-03-17 18:07:18 +00:00
Keith Randall
495b167919 cmd/compile: intrinsics for math/bits.{Len,LeadingZeros}
name              old time/op  new time/op  delta
LeadingZeros-4    2.00ns ± 0%  1.34ns ± 1%  -33.02%  (p=0.000 n=8+10)
LeadingZeros16-4  1.62ns ± 0%  1.57ns ± 0%   -3.09%  (p=0.001 n=8+9)
LeadingZeros32-4  2.14ns ± 0%  1.48ns ± 0%  -30.84%  (p=0.002 n=8+10)
LeadingZeros64-4  2.06ns ± 1%  1.33ns ± 0%  -35.08%  (p=0.000 n=8+8)

8-bit args are a special case - the Go code is really fast because
it is just a single table lookup.  So I've disabled that for now.
Intrinsics were actually slower:
LeadingZeros8-4   1.22ns ± 3%  1.58ns ± 1%  +29.56%  (p=0.000 n=10+10)

Update #18616

Change-Id: Ia9c289b9ba59c583ea64060470315fd637e814cf
Reviewed-on: https://go-review.googlesource.com/38311
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-03-16 22:53:49 +00:00
Cherry Zhang
c8f38b3398 cmd/compile: use type information in Aux for Store size
Remove size AuxInt in Store, and alignment in Move/Zero. We still
pass size AuxInt to Move/Zero, as it is used for partial Move/Zero
lowering (e.g. cmd/compile/internal/ssa/gen/386.rules:288).
SizeAndAlign is gone.

Passes "toolstash -cmp" on std.

Change-Id: I1ca34652b65dd30de886940e789fcf41d521475d
Reviewed-on: https://go-review.googlesource.com/38150
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-16 14:25:04 +00:00
Cherry Zhang
211c8c9f1a cmd/compile: pass types on SSA Store/Move/Zero ops
For SSA Store/Move/Zero ops, attach the type of the value being
stored to the op as the Aux field. This type will be used for
write barrier insertion (in a followup CL). Since SSA passes
do not accurately propagate types of values (because of type
casting), we can't simply use type of the store's arguments
for write barrier insertion.

Passes "toolstash -cmp" on std.

Updates #17583.

Change-Id: I051d5e5c482931640d1d7d879b2a6bb91f2e0056
Reviewed-on: https://go-review.googlesource.com/36838
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-16 14:22:53 +00:00
Matthew Dempsky
91d08e3bca cmd/compile/internal/ssa: remove unused OpFunc
Change-Id: I0f7eec2e0c15a355422d5ae7289508a5bd33b971
Reviewed-on: https://go-review.googlesource.com/38171
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-03-14 19:28:25 +00:00
Matthew Dempsky
691755304c cmd/compile/internal/ssa: populate SymEffects for SSA Ops
Changes to ${GOARCH}Ops.go files were mechanically produced using
github.com/mdempsky/ssa-symops, a one-off tool that inserts
"SymEffect: X" elements by pattern matching against the Op names.

Change-Id: Ibf3e481ffd588647f2a31662d72114b740ccbfcf
Reviewed-on: https://go-review.googlesource.com/38084
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-14 18:34:45 +00:00
Matthew Dempsky
1cdf4bf33f cmd/compile/internal/ssa: add SymEffect attribute to SSA Ops
To replace the progeffects tables for liveness analysis.

Change-Id: Idc4b990665cb0a9aa300d62cdf8ad12e51c5b991
Reviewed-on: https://go-review.googlesource.com/38083
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-14 18:34:38 +00:00
Matthew Dempsky
cc71aa9ac4 cmd/compile/internal/ssa: make ARM's udiv like other calls
Passes toolstash-check -all.

Change-Id: Id389f8158cf33a3c0fcef373615b5351e7c74b5b
Reviewed-on: https://go-review.googlesource.com/38082
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-13 21:29:02 +00:00
Matthew Dempsky
08d8d5c986 cmd/compile/internal/ssa: replace {Defer,Go}Call with StaticCall
Passes toolstash-check -all.

Change-Id: Icf8b75364e4761a5e56567f503b2c1cb17382ed2
Reviewed-on: https://go-review.googlesource.com/38080
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-13 19:44:36 +00:00
Cherry Zhang
6fd5e2549a cmd/compile: mark MOVWF/MOVFW clobbering F15 on ARM
The assembler back end uses F15 as a temporary register in these
instructions.

Checked the assembler back end and made sure that this is the
only case clobbering F15.

Fixes #19403.

Change-Id: I02b9e00fdd9229db899f501c8e9b306e02912d83
Reviewed-on: https://go-review.googlesource.com/37792
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-05 18:31:27 +00:00
Matthew Dempsky
02e36f8c87 cmd/compile/internal/ssa: remove Hmul{8,16}{,u} ops
Change-Id: I90865921584ae4bdfb6c220d439b14593d72b6f9
Reviewed-on: https://go-review.googlesource.com/37752
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-03-03 20:47:36 +00:00
Cherry Zhang
5bfd1ef036 cmd/compile: get rid of "volatile" in SSA
A value is "volatile" if it is a pointer to the argument region
on stack which will be clobbered by function call. This is used
to make sure the value is safe when inserting write barrier calls.
The writebarrier pass can tell whether a value is such a pointer.
Therefore no need to mark it when building SSA and thread this
information through.

Passes "toolstash -cmp" on std.

Updates #17583.

Change-Id: Idc5fc0d710152b94b3c504ce8db55ea9ff5b5195
Reviewed-on: https://go-review.googlesource.com/36835
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-03 13:26:15 +00:00
Keith Randall
13c35a1b20 cmd/compile: ppc64x no longer needs a scratch stack location
After https://go-review.googlesource.com/c/36725/, ppc64x no longer
needs a temp stack location for int reg <-> fp reg moves.

Update #18922

Change-Id: Ib4319784f7a855f593dfa5231604ca2c24e4c882
Reviewed-on: https://go-review.googlesource.com/37651
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2017-03-01 22:14:21 +00:00
Lynn Boger
95c9583a18 cmd/compile: intrinsify atomics on ppc64x
This adds the necessary changes so that atomics are treated as
intrinsics on ppc64x.

The implementations of And8 and Or8 require power8 for
both ppc64 and ppc64le.  This is a new requirement
for ppc64.

Fixes #8739

Change-Id: Icb85e2755a49166ee3652668279f6ed5ebbca901
Reviewed-on: https://go-review.googlesource.com/36832
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-01 19:56:01 +00:00
Josh Bleecher Snyder
2183135554 cmd/compile: recognize bit test patterns on amd64
Updates #18943

Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08
Reviewed-on: https://go-review.googlesource.com/36329
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-01 00:36:04 +00:00
Michael Munday
bd8a39b67a cmd/compile: emit fused multiply-{add,subtract} instructions on s390x
Explicitly block fused multiply-add pattern matching when a cast is used
after the multiplication, for example:

    - (a * b) + c        // can emit fused multiply-add
    - float64(a * b) + c // cannot emit fused multiply-add

float{32,64} and complex{64,128} casts of matching types are now kept
as OCONV operations rather than being replaced with OCONVNOP operations
because they now imply a rounding operation (and therefore aren't a
no-op anymore).
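
The effect of the rounding cast can be observed with a small sketch. It
assumes math.FMA (added in Go 1.14, so later than this change) to force the
fused result on any architecture; the function names rounded and fused and
the chosen inputs are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// eps is 2^-30, exactly representable as a float64.
const eps = 1.0 / (1 << 30)

// rounded rounds the product to float64 before the add, because the
// explicit conversion prevents fusion.
func rounded(a, b, c float64) float64 {
	return float64(a*b) + c
}

// fused computes a*b + c with a single rounding at the end.
func fused(a, b, c float64) float64 {
	return math.FMA(a, b, c)
}

func main() {
	a := 1 + eps
	b := 1 - eps
	c := -1.0
	// The exact product is 1 - 2^-60, which rounds to 1 as a float64,
	// so the two results are expected to differ.
	fmt.Println(rounded(a, b, c))
	fmt.Println(fused(a, b, c))
}
```

Without the cast, (a * b) + c may produce either result depending on whether
the target emits a fused instruction.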

Operations (for example, multiplication) on complex types may utilize
fused multiply-add and -subtract instructions internally. There is no
way to disable this behavior at the moment.

Improves the performance of the floating point implementation of
poly1305:

name         old speed     new speed     delta
64           246MB/s ± 0%  275MB/s ± 0%  +11.48%   (p=0.000 n=10+8)
1K           312MB/s ± 0%  357MB/s ± 0%  +14.41%  (p=0.000 n=10+10)
64Unaligned  246MB/s ± 0%  274MB/s ± 0%  +11.43%  (p=0.000 n=10+10)
1KUnaligned  312MB/s ± 0%  357MB/s ± 0%  +14.39%   (p=0.000 n=10+8)

Updates #17895.

Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16
Reviewed-on: https://go-review.googlesource.com/36963
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-02-28 15:34:20 +00:00
David Chase
11b283092a cmd/compile: add opcode flag hasSideEffects for do-not-remove
Added a flag to generic and various architectures' atomic
operations that are judged to have observable side effects
and thus cannot be dead-code-eliminated.

Test requires GOMAXPROCS > 1 without preemption in loop.

Fixes #19182.

Change-Id: Id2230031abd2cca0bbb32fd68fc8a58fb912070f
Reviewed-on: https://go-review.googlesource.com/37333
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-22 15:15:47 +00:00
Keith Randall
cfb0d34992 cmd/compile: amd64, allow XCHG on stack pointers
XCHG needs to allow the stack pointer as an argument because we have a
rewrite that incorporates the address of a local variable into the
instruction.

Fixes #19184

Change-Id: Ic438e6e1946332cdce3864d15abecd41b911b2a9
Reviewed-on: https://go-review.googlesource.com/37253
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-19 17:16:01 +00:00
Ilya Tocar
21c71d7788 cmd/compile/internal/ssa: combine load + op on AMD64
On AMD64 most operations can have one operand in memory.
Combine a load and its dependent operation into one new operation
where possible. I've seen no significant performance changes on go1,
but this allows removing ~1.8kb of code from the go tool. And in the math
package I see, e.g.:

Remainder-6            70.0ns ± 0%   64.6ns ± 0%   -7.76%  (p=0.000 n=9+1
Change-Id: I88b8602b1d55da8ba548a34eb7da4b25d59a297e
Reviewed-on: https://go-review.googlesource.com/36793
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-02-17 22:21:49 +00:00
Keith Randall
708ba22a0c cmd/compile: move constant divide strength reduction to SSA rules
Currently the conversion from constant divides to multiplies is mostly
done during the walk pass.  This is suboptimal because SSA can
determine that the value being divided by is constant more often
(e.g. after inlining).
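
As an illustration of the strength reduction itself, here is a sketch for
one divisor. The helper name div3 is hypothetical and the constant is the
well-known magic number for unsigned 32-bit division by 3; the SSA rules
derive such constants for arbitrary divisors, which this example does not
attempt.

```go
package main

import "fmt"

// div3 divides a uint32 by 3 without a divide instruction.
// 0xAAAAAAAB is ceil(2^33 / 3), so (x * 0xAAAAAAAB) >> 33 equals x / 3
// for every 32-bit unsigned x.
func div3(x uint32) uint32 {
	return uint32((uint64(x) * 0xAAAAAAAB) >> 33)
}

func main() {
	for _, x := range []uint32{0, 1, 2, 3, 100, 1000000007, 0xFFFFFFFF} {
		// The last column is expected to be true for every input.
		fmt.Println(x, div3(x), div3(x) == x/3)
	}
}
```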

Change-Id: If1a9b993edd71be37396b9167f77da271966f85f
Reviewed-on: https://go-review.googlesource.com/37015
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2017-02-17 06:16:44 +00:00
Lynn Boger
695f12c21a cmd/compile: rules change to use ANDN more effectively on ppc64x
Currently there are cases where an XOR with -1 followed by an AND
is generated when it could be done with just an ANDN instruction.

Changes to PPC64.rules and required files allow this change
in generated code.  Examples of this occur in sha3 among others.
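
At the source level the pattern looks roughly like the sketch below. The
function names are hypothetical, and ^y is Go's spelling of "y XOR all
ones"; both functions compute the same value, and the second form maps
directly onto a single AND-with-complement.

```go
package main

import "fmt"

// andViaXor computes x AND (y XOR -1) as two separate operations.
func andViaXor(x, y uint64) uint64 {
	notY := ^y // same as y XOR 0xFFFFFFFFFFFFFFFF
	return x & notY
}

// andNot uses Go's bit-clear operator, the single-operation form.
func andNot(x, y uint64) uint64 {
	return x &^ y
}

func main() {
	x, y := uint64(0xF0F0F0F0F0F0F0F0), uint64(0xFF00FF00FF00FF00)
	fmt.Printf("%#x %#x\n", andViaXor(x, y), andNot(x, y))
}
```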

Fixes: #18918

Change-Id: I647cb9b4a4aaeebb27db85f8bf75487d78f720c9
Reviewed-on: https://go-review.googlesource.com/36218
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2017-02-09 18:57:19 +00:00
Michael Munday
ddf807fce8 cmd/compile: fix type propagation through s390x SSA rules
This CL fixes two issues:

1. Load ops were initially always lowered to unsigned loads, even
   for signed types. This was fine by itself; however, LoadReg ops
   (used to re-load spilled values) were lowered to signed loads
   for signed types. This meant that spills could invalidate
   optimizations that assumed the original unsigned load.

2. Types were not always being maintained correctly through rules
   designed to eliminate unnecessary zero and sign extensions.
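
The first issue can be pictured with a small model of the two kinds of load.
This is a sketch with hypothetical helper names, not the s390x lowering
itself: it only shows that the same byte in memory yields different 64-bit
register contents depending on which extension the load performs.

```go
package main

import "fmt"

// loadZeroExt models an unsigned 8-bit load into a 64-bit register.
func loadZeroExt(mem byte) uint64 { return uint64(mem) }

// loadSignExt models a signed 8-bit load into a 64-bit register.
func loadSignExt(mem byte) uint64 { return uint64(int64(int8(mem))) }

func main() {
	mem := byte(0x80) // the bit pattern of int8(-128)
	fmt.Printf("%#x\n", loadZeroExt(mem))
	fmt.Printf("%#x\n", loadSignExt(mem))
}
```

An optimization that relied on the upper bits being zero after the original
load would no longer hold if a spill reloaded the value with the
sign-extending form.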

Fixes #18906.

Change-Id: I95785dcadba03f7e3e94524677e7d8d3d3b9b737
Reviewed-on: https://go-review.googlesource.com/36256
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-03 21:27:21 +00:00
Cherry Zhang
fddc004537 cmd/compile: remove nil check for Zero/Move on 386, AMD64, S390X
Fixes #18003.

Change-Id: Iadcc5c424c64badecfb5fdbd4dbd9197df56182c
Reviewed-on: https://go-review.googlesource.com/33421
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-02-02 21:28:38 +00:00
Keith Randall
01c8719f8b cmd/compile: move rotate instruction generation to SSA
Remove rotate generation from walk.  Remove OLROT and ssa.Lrot* opcodes.
Generate rotates during SSA lowering for architectures that have them.

This CL will allow rotates to be generated in more situations,
like when the shift values are determined to be constant
only after some analysis.
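
The source pattern that such lowering looks for is a pair of opposing
shifts combined with OR. A sketch, using the hypothetical helper rotl32 and
math/bits.RotateLeft32 as the reference; it is valid only for shift counts
from 0 to 31.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotl32 rotates x left by k bits using two shifts and an OR.
// It assumes 0 <= k <= 31.
func rotl32(x uint32, k uint) uint32 {
	return x<<k | x>>(32-k)
}

func main() {
	x := uint32(0x12345678)
	for _, k := range []uint{0, 1, 8, 31} {
		// The comparison is expected to hold for each k in range.
		fmt.Println(k, rotl32(x, k) == bits.RotateLeft32(x, int(k)))
	}
}
```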

Fixes #18254

Change-Id: I8d6d684ff5ce2511aceaddfda98b908007851079
Reviewed-on: https://go-review.googlesource.com/34232
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-02 17:57:15 +00:00
Vladimir Stefanovic
247fc4a98e cmd/compile/internal/ssa: add support for GOARCH=mips{,le}
Change-Id: I632d4aef7295778ba5018d98bcb06a68bcf07ce1
Reviewed-on: https://go-review.googlesource.com/31478
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-11-08 19:40:43 +00:00
Keith Randall
741445068f cmd/compile: make [0]T and [1]T SSAable types
We used to have to keep on-stack copies of these types.
Now they can be registerized.

[0]T is kind of trivial but might as well handle it.

This change enables another change I'm working on to improve how x.(T)
expressions are handled (#17405).  This CL helps because now all
types that are direct interface types are registerizeable (e.g. [1]*byte).

No higher-degree arrays for now because non-constant indexes are hard.

Update #17405

Change-Id: I2399940965d17b3969ae66f6fe447a8cefdd6edd
Reviewed-on: https://go-review.googlesource.com/32416
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2016-10-31 19:44:19 +00:00
Austin Clements
8a7f0ad0b5 cmd/compile: use typedmemclr for zeroing if there are pointers
Currently, zeroing generates an ssa.OpZero, which never has write
barriers, even if the assignment is an OASWB. The hybrid barrier
requires write barriers on zeroing, so change OASWB to generate an
ssa.OpZeroWB when assigning the zero value, which turns into a
typedmemclr.

Updates #17503.

Change-Id: Ib37ac5e39f578447dbd6b36a6a54117d5624784d
Reviewed-on: https://go-review.googlesource.com/31451
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-10-28 19:13:23 +00:00
Cherry Zhang
f9238a76ff cmd/compile: make LR allocatable in non-leaf functions on ARM
The mechanism was initially introduced (and reviewed) in CL 30597
on S390X.

Reduce number of "spilled value remains" by 0.4% in cmd/go.

Disabled on ARMv5 because LR is clobbered almost everywhere with
inserted softfloat calls.

Change-Id: I2934737ce2455909647ed2118fe2bd6f0aa5ac52
Reviewed-on: https://go-review.googlesource.com/32178
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2016-10-28 14:25:33 +00:00
Keith Randall
deb4177cf0 cmd/compile: use masks instead of branches for slicing
When we do

  var x []byte = ...
  y := x[i:]

We can't just use y.ptr = x.ptr + i, as the new pointer may point to the
next object in memory after the backing array.
We used to fix this by doing:

  y.cap = x.cap - i
  delta := i
  if y.cap == 0 {
    delta = 0
  }
  y.ptr = x.ptr + delta

That generates a branch in what is otherwise straight-line code.

Better to do:

  y.cap = x.cap - i
  mask := (y.cap - 1) >> 63 // -1 if y.cap==0, 0 otherwise
  y.ptr = x.ptr + i &^ mask

It's about the same number of instructions (~4, depending on what
parts are constant, and the target architecture), but it is all
inline. It plays nicely with CSE, and the mask can be computed
in parallel with the index (in cases where a multiply is required).
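
A runnable model of the mask version, as a sketch: the sliceHeader type and
sliceFrom helper are hypothetical, elements are one byte wide, and the
bounds check (0 <= i <= cap) is assumed to have happened already.

```go
package main

import "fmt"

// sliceHeader is a toy stand-in for a []byte header.
type sliceHeader struct {
	ptr      uintptr
	length   int64
	capacity int64
}

// sliceFrom models y := x[i:] without a branch.
func sliceFrom(x sliceHeader, i int64) sliceHeader {
	newCap := x.capacity - i
	// Arithmetic shift of a signed value: -1 if newCap == 0, 0 otherwise.
	mask := (newCap - 1) >> 63
	// When newCap == 0 the offset is cleared, so ptr never moves past
	// the end of the backing array.
	delta := i &^ mask
	return sliceHeader{
		ptr:      x.ptr + uintptr(delta),
		length:   x.length - i,
		capacity: newCap,
	}
}

func main() {
	x := sliceHeader{ptr: 1000, length: 8, capacity: 8}
	fmt.Println(sliceFrom(x, 3))
	fmt.Println(sliceFrom(x, 8))
}
```

Slicing at the very end (i == cap) leaves ptr unchanged, which is the case
the old branch handled.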

It is a minor win in both speed and space.

Change-Id: Ied60465a0b8abb683c02208402e5bb7ac0e8370f
Reviewed-on: https://go-review.googlesource.com/32022
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-10-27 20:22:49 +00:00
Cherry Zhang
4f6d479186 cmd/compile: make LR allocatable in non-leaf functions on MIPS64
The mechanism was initially introduced (and reviewed) in CL 30597
on S390X.

Change-Id: I83024d2fc84c8efc23fbda52b3ad83073f42cb93
Reviewed-on: https://go-review.googlesource.com/32179
Reviewed-by: David Chase <drchase@google.com>
2016-10-27 15:35:20 +00:00
Cherry Zhang
5c59cb4aa3 cmd/compile: make LR allocatable in non-leaf functions on ARM64
The mechanism was initially introduced (and reviewed) in CL 30597
on S390X.

Change-Id: I12fbe6e9269b2936690e0ec896cb6b5aa40ad7da
Reviewed-on: https://go-review.googlesource.com/32180
Reviewed-by: David Chase <drchase@google.com>
2016-10-27 15:35:06 +00:00
Cherry Zhang
f6aec889e1 cmd/compile: add a writebarrier phase in SSA
When the compiler inserts write barriers, the frontend makes
conservative decisions at an early stage. This may have false
positives which result in write barriers for stack writes.

A new phase, writebarrier, is added to the SSA backend, to delay
the decision and eliminate false positives. The frontend still
makes conservative decisions. When building SSA, instead of
emitting runtime calls directly, it emits WB ops (StoreWB,
MoveWB, etc.), which will be expanded to branches and runtime
calls in writebarrier phase. Writes to static locations on stack
are detected and write barriers are removed.

All write barriers of stack writes found by the script from
issue #17330 are eliminated (except two false positives).

Fixes #17330.

Change-Id: I9bd66333da9d0ceb64dcaa3c6f33502798d1a0f8
Reviewed-on: https://go-review.googlesource.com/31131
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2016-10-25 21:53:40 +00:00
Michael Munday
517a44d57e cmd/compile: intrinsify atomic operations on s390x
Implements the following intrinsics on s390x:
 - AtomicAdd{32,64}
 - AtomicCompareAndSwap{32,64}
 - AtomicExchange{32,64}
 - AtomicLoad{32,64,Ptr}
 - AtomicStore{32,64,PtrNoWB}

I haven't added rules for And8 or Or8 yet.

Change-Id: I647af023a8e513718e90e98a60191e7af6167314
Reviewed-on: https://go-review.googlesource.com/31614
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-10-25 12:23:49 +00:00
Michael Munday
1cfb5c3fd5 cmd/compile: merge loads into operations on s390x
Adds the new canMergeLoad function which can be used by rules to
decide whether a load can be merged into an operation. The function
ensures that the merge will not reorder the load relative to memory
operations (for example, stores) in such a way that the block can no
longer be scheduled.

This new function enables transformations such as:

MOVD 0(R1), R2
ADD  R2, R3

to:

ADD  0(R1), R3

The two-operand form of the following instructions can now read a
single memory operand:

 - ADD
 - ADDC
 - ADDW
 - MULLD
 - MULLW
 - SUB
 - SUBC
 - SUBE
 - SUBW
 - AND
 - ANDW
 - OR
 - ORW
 - XOR
 - XORW

Improves SHA3 performance by 6-8%.

Updates #15054.

Change-Id: Ibcb9122126cd1a26f2c01c0dfdbb42fe5e7b5b94
Reviewed-on: https://go-review.googlesource.com/29272
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-10-17 19:45:20 +00:00
Michael Munday
15817e409b cmd/compile: make link register allocatable in non-leaf functions
We save and restore the link register in non-leaf functions because
it is clobbered by CALLs. It is therefore available for general
purpose use.

Only enabled on s390x currently. The RC4 benchmarks in particular
benefit from the extra register:

name     old speed     new speed     delta
RC4_128  243MB/s ± 2%  341MB/s ± 2%  +40.46%  (p=0.008 n=5+5)
RC4_1K   267MB/s ± 0%  359MB/s ± 1%  +34.32%  (p=0.008 n=5+5)
RC4_8K   271MB/s ± 0%  362MB/s ± 0%  +33.61%  (p=0.008 n=5+5)

Change-Id: Id23bff95e771da9425353da2f32668b8e34ba09f
Reviewed-on: https://go-review.googlesource.com/30597
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-10-11 18:52:35 +00:00
Cherry Zhang
2756d56c89 cmd/compile: intrinsify math/big.mulWW, divWW on AMD64
Change-Id: I59f7afa7a5803d19f8b21fe70fc85ef997bb3a85
Reviewed-on: https://go-review.googlesource.com/30542
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2016-10-11 16:07:46 +00:00
David Chase
2f0b8f88df cmd/compile: PPC64, elide unnecessary sign extension
Inputs to store[BHW] and cmpW(U) need not be correct
in more bits than are used by the instruction.

Added a pattern tailored to what appears to be cgo boilerplate.
Added a pattern (also seen in cgo boilerplate and hashing)
to replace {EQ,NE}-CMP-ANDconst with {EQ,NE}-ANDCCconst.
Added a pattern to clean up ANDconst shift distance inputs
(this was seen in hashing).

Simplify repeated and,or,xor.

Fixes #17109.

Change-Id: I68eac83e3e614d69ffe473a08953048c8b066d88
Reviewed-on: https://go-review.googlesource.com/30455
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-10-10 12:22:40 +00:00
Michael Munday
45b26a93f3 cmd/{asm,compile}: replace TESTB op with CMPWconst on s390x
TESTB was implemented as AND $0xff, Rx, REGTMP. Unfortunately there
is no 3-operand AND-with-immediate instruction and so it was emulated
by the assembler using two instructions.

This CL uses CMPW instead of AND and also optimizes CMPW to use
the chi instruction where possible.

Overall this CL reduces the size of the .text section of the
bin/go binary by ~2%.

Change-Id: Ic335c29fc1129378fcbb1265bfb10f5b744a0f3f
Reviewed-on: https://go-review.googlesource.com/30690
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-10-07 20:02:59 +00:00
Michael Munday
dd1dcf9496 cmd/{asm,compile}: add ANDW, ORW and XORW instructions to s390x
Adds the following instructions and uses them in the SSA backend:

 - ANDW
 - ORW
 - XORW

The instruction encodings for 32-bit operations are typically shorter,
particularly when an immediate is used. For example, XORW $-1, R1
only requires one instruction, whereas XOR requires two.

Also removes some unused instructions (that were emulated):

 - ANDN
 - NAND
 - ORN
 - NOR

Change-Id: Iff2a16f52004ba498720034e354be9771b10cac4
Reviewed-on: https://go-review.googlesource.com/30291
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-10-06 02:59:04 +00:00
Cherry Zhang
b662e524e4 cmd/compile: use CBZ/CBNZ instructions on ARM64
These are conditional branches that take a register instead of
flags as the control value.

Reduce binary size by 0.7%, text size by 2.4% (cmd/go as an
example).

Change-Id: I0020cfde745f9eab680b8b949ad28c87fe183afd
Reviewed-on: https://go-review.googlesource.com/30030
Reviewed-by: David Chase <drchase@google.com>
2016-10-05 18:22:56 +00:00
Matthew Dempsky
c28f55c502 cmd/compile/internal/ssa: add Op.UsesScratch method
Passes toolstash/buildall.

Change-Id: I928a2ef39fb10091957f35bb3f1564498f6f1b83
Reviewed-on: https://go-review.googlesource.com/30312
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-10-04 20:56:56 +00:00
Michael Munday
962dc4b44d cmd/compile: improve load/store merging on s390x
This commit makes the process of load/store merging more incremental
for both big and little endian operations. It also adds support for
32-bit shifts (needed to merge 16- and 32-bit loads/stores).

In addition, the merging of little endian stores is now supported.
Little endian stores are now up to 30 times faster.

Change-Id: Iefdd81eda4a65b335f23c3ff222146540083ad9c
Reviewed-on: https://go-review.googlesource.com/29956
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-09-30 14:41:43 +00:00
Keith Randall
98938189a1 cmd/compile: remove duplicate nilchecks
Mark nil check operations as faulting if their arg is zero.
This lets the late nilcheck pass remove duplicates.

Fixes #17242.

Change-Id: I4c9938d8a5a1e43edd85b4a66f0b34004860bcd9
Reviewed-on: https://go-review.googlesource.com/29952
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-09-27 23:54:01 +00:00
Cherry Zhang
9d4b40f55d runtime, cmd/compile: implement and use DUFFCOPY on ARM64
Change-Id: I8984eac30e5df78d4b94f19412135d3cc36969f8
Reviewed-on: https://go-review.googlesource.com/29910
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2016-09-27 15:07:31 +00:00