Commit graph

38 commits

Author SHA1 Message Date
Lynn Boger
9248ff46a8 cmd/compile: add rotates to PPC64.rules
This updates PPC64.rules with rules that generate rotates for ADD, OR,
and XOR operators that combine two opposite shifts whose amounts sum
to 32 or 64.

To support this change, opcodes for ROTL and ROTLW were added to
be used like the rotldi and rotlwi extended mnemonics.
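
For illustration, a shift pair of this shape should now lower to a
single rotate (a hedged sketch, not code from this CL):

	func rotl8(x uint32) uint32 {
		// The shift amounts sum to 32, so this matches the new
		// rules and becomes one ROTLW instead of two shifts + OR.
		return x<<8 | x>>24
	}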

This provides the following improvement in sha3:

benchmark                          old MB/s     new MB/s     speedup
BenchmarkPermutationFunction-8     302.83       376.40       1.24x
BenchmarkSha3_512_MTU-8            98.64        121.92       1.24x
BenchmarkSha3_384_MTU-8            136.80       168.30       1.23x
BenchmarkSha3_256_MTU-8            169.21       211.29       1.25x
BenchmarkSha3_224_MTU-8            179.76       221.19       1.23x
BenchmarkShake128_MTU-8            212.87       263.23       1.24x
BenchmarkShake256_MTU-8            196.62       245.60       1.25x
BenchmarkShake256_16x-8            163.57       194.37       1.19x
BenchmarkShake256_1MiB-8           199.02       248.74       1.25x
BenchmarkSha3_512_1MiB-8           106.55       133.13       1.25x

Fixes #20030

Change-Id: I484c56f48395d32f53ff3ecb3ac6cb8191cfee44
Reviewed-on: https://go-review.googlesource.com/40992
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-04-20 18:05:22 +00:00
Keith Randall
7e07e635f3 cmd/compile: implement non-constant rotates
Makes math/bits.Rotate{Left,Right} fast on amd64.
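
For illustration, a hedged sketch of a call that benefits (the rotate
count is a variable, not a constant):

	import "math/bits"

	// k need not be a constant; amd64 can now emit a
	// variable-count rotate instead of falling back to shifts.
	func rot(x uint64, k int) uint64 {
		return bits.RotateLeft64(x, k)
	}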

name              old time/op  new time/op  delta
RotateLeft-12     7.42ns ± 6%  5.45ns ± 6%  -26.54%   (p=0.000 n=9+10)
RotateLeft8-12    4.77ns ± 5%  3.42ns ± 7%  -28.25%   (p=0.000 n=8+10)
RotateLeft16-12   4.82ns ± 8%  3.40ns ± 7%  -29.36%  (p=0.000 n=10+10)
RotateLeft32-12   4.87ns ± 7%  3.48ns ± 7%  -28.51%    (p=0.000 n=8+9)
RotateLeft64-12   5.23ns ±10%  3.35ns ± 6%  -35.97%   (p=0.000 n=9+10)
RotateRight-12    7.59ns ± 8%  5.71ns ± 1%  -24.72%   (p=0.000 n=10+8)
RotateRight8-12   4.98ns ± 7%  3.36ns ± 9%  -32.55%  (p=0.000 n=10+10)
RotateRight16-12  5.12ns ± 2%  3.45ns ± 5%  -32.62%  (p=0.000 n=10+10)
RotateRight32-12  4.80ns ± 6%  3.42ns ±16%  -28.68%  (p=0.000 n=10+10)
RotateRight64-12  4.78ns ± 6%  3.42ns ± 6%  -28.50%  (p=0.000 n=10+10)

Update #18940

Change-Id: Ie79fb5581c489ed4d3b859314c5e669a134c119b
Reviewed-on: https://go-review.googlesource.com/39711
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2017-04-17 23:19:45 +00:00
Josh Bleecher Snyder
3d0a898385 cmd/compile: improve output when TestAssembly build fails
Change-Id: Ibee84399d81463d3e7d5319626bb0d6b60b86bd9
Reviewed-on: https://go-review.googlesource.com/40861
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-04-17 03:12:34 +00:00
Josh Bleecher Snyder
0d36999a0f cmd/compile: make TestAssembly resilient to output ordering
To preserve reproducible builds, the text entries
during compilation will be sorted before being printed.
TestAssembly currently assumes that function init
comes after all user-defined functions.
Remove that assumption.
Instead of looking for "TEXT" to tell you where
a function ends--which may now yield lots of
non-function-code junk--look for a line beginning
with non-whitespace.

Updates #15756

Change-Id: Ibc82dba6143d769ef4c391afc360e523b1a51348
Reviewed-on: https://go-review.googlesource.com/39853
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2017-04-13 02:30:29 +00:00
Ilya Tocar
e4a500ce14 cmd/compile/internal/gc: improve comparison with constant strings
Currently we expand comparisons with small constant strings into a
length check and a sequence of byte comparisons. Generate 16/32/64-bit
comparisons instead of bytewise ones on 386 and amd64. Also increase
the limit on what is considered a small constant string.
Shaves ~30kb (0.5%) from the go executable.
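
For illustration, a hedged sketch of the kind of comparison affected
(the exact size limits live in the CL, not here):

	// After the length check, the 8 bytes of the constant can be
	// compared with a single 64-bit load and compare on amd64,
	// rather than one comparison per byte.
	func isGreeting(s string) bool {
		return s == "hi hello"
	}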

This also updates test/prove.go to keep test case valid.

Change-Id: I99ae8871a1d00c96363c6d03d0b890782fa7e1d9
Reviewed-on: https://go-review.googlesource.com/38776
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2017-04-07 15:40:25 +00:00
Cherry Zhang
257b01f8f4 cmd/compile: use ANDconst to mask out leading/trailing bits on ARM64
For an AND that masks out leading or trailing bits, generic rules
rewrite it to a pair of shifts. On ARM64, the mask actually can
fit into an AND instruction. So we rewrite it back to AND.
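
For illustration (a hedged sketch, not code from this CL):

	// Generic rules rewrite this mask to (x<<48)>>48; on ARM64 the
	// constant fits an AND immediate, so it becomes a single AND again.
	func low16(x uint64) uint64 {
		return x & 0xffff
	}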

Fixes #19857.

Change-Id: I479d7320ae4f29bb3f0056d5979bde4478063a8f
Reviewed-on: https://go-review.googlesource.com/39651
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2017-04-06 17:59:32 +00:00
Keith Randall
5cadc91b3c cmd/compile: intrinsics for math/bits.OnesCount
Popcount instructions on amd64 are not guaranteed to be
present, so we must guard their call.  Rewrite rules can't
generate control flow at the moment, so the intrinsifier
needs to generate that code.
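
For illustration, a hedged sketch of a call that is now intrinsified:

	import "math/bits"

	// Lowered to POPCNT behind a CPU-feature check, with a call
	// to the pure-Go fallback when the instruction is absent.
	func pop(x uint64) int {
		return bits.OnesCount64(x)
	}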

name           old time/op  new time/op  delta
OnesCount-8    2.47ns ± 5%  1.04ns ± 2%  -57.70%  (p=0.000 n=10+10)
OnesCount16-8  1.05ns ± 1%  0.78ns ± 0%  -25.56%    (p=0.000 n=9+8)
OnesCount32-8  1.63ns ± 5%  1.04ns ± 2%  -35.96%  (p=0.000 n=10+10)
OnesCount64-8  2.45ns ± 0%  1.04ns ± 1%  -57.55%   (p=0.000 n=6+10)

Update #18616

Change-Id: I4aff2cc9aa93787898d7b22055fe272a7cf95673
Reviewed-on: https://go-review.googlesource.com/38320
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-04-04 02:40:11 +00:00
Keith Randall
63a72fd447 cmd/compile: strength-reduce floating point
x*2 -> x+x
x/c, c power of 2 -> x*(1/c)
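
For illustration (a hedged sketch, not code from this CL):

	func double(x float64) float64 { return x * 2 }  // becomes x + x
	func quarter(x float64) float64 { return x / 4 } // becomes x * 0.25 (exact: 4 is a power of two)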

Fixes #19827

Change-Id: I74c9f0b5b49b2ed26c0990314c7d1d5f9631b6f1
Reviewed-on: https://go-review.googlesource.com/39295
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2017-04-03 21:27:03 +00:00
Keith Randall
86dc86b4f9 cmd/compile: don't merge load+op if other op arg is still live
We want to merge a load and op into a single instruction

    l = LOAD ptr mem
    y = OP x l

into

    y = OPload x ptr mem

However, all of our OPload instructions require that y uses
the same register as x. If x is needed past this instruction, then
we must copy x somewhere else, losing the whole benefit of merging
the instructions in the first place.

Disable this optimization if x is live past the OP.

Also disable this optimization if the OP is in a deeper loop than the load.
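
For illustration, a hedged Go-level sketch of the liveness condition
(names are mine):

	func f(p *int, x int) (int, int) {
		y := *p + x // folding the load into the ADD would tie y to x's register...
		z := x * 3  // ...but x is still live here, so the load is kept separate
		return y, z
	}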

Update #19595

Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
2017-03-23 15:53:04 +00:00
Michael Munday
17570a9afb cmd/compile: emit fused multiply-{add,subtract} on ppc64x
A follow on to CL 36963 adding support for ppc64x.

Performance changes (as posted on the issue):

poly1305:
benchmark               old ns/op new ns/op delta
Benchmark64-16          172       151       -12.21%
Benchmark1K-16          1828      1523      -16.68%
Benchmark64Unaligned-16 172       151       -12.21%
Benchmark1KUnaligned-16 1827      1523      -16.64%

math:
benchmark               old ns/op new ns/op delta
BenchmarkAcos-16        43.9      39.9      -9.11%
BenchmarkAcosh-16       57.0      45.8      -19.65%
BenchmarkAsin-16        35.8      33.0      -7.82%
BenchmarkAsinh-16       68.6      60.8      -11.37%
BenchmarkAtan-16        19.8      16.2      -18.18%
BenchmarkAtanh-16       65.5      57.5      -12.21%
BenchmarkAtan2-16       45.4      34.2      -24.67%
BenchmarkGamma-16       37.6      26.0      -30.85%
BenchmarkLgamma-16      40.0      28.2      -29.50%
BenchmarkLog1p-16       35.1      29.1      -17.09%
BenchmarkSin-16         22.7      18.4      -18.94%
BenchmarkSincos-16      31.7      23.7      -25.24%
BenchmarkSinh-16        146       131       -10.27%
BenchmarkY0-16          130       107       -17.69%
BenchmarkY1-16          127       107       -15.75%
BenchmarkYn-16          278       235       -15.47%

Updates #17895.

Change-Id: I1c16199715d20c9c4bd97c4a950bcfa69eb688c1
Reviewed-on: https://go-review.googlesource.com/38095
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2017-03-20 20:01:29 +00:00
Keith Randall
495b167919 cmd/compile: intrinsics for math/bits.{Len,LeadingZeros}
name              old time/op  new time/op  delta
LeadingZeros-4    2.00ns ± 0%  1.34ns ± 1%  -33.02%  (p=0.000 n=8+10)
LeadingZeros16-4  1.62ns ± 0%  1.57ns ± 0%   -3.09%  (p=0.001 n=8+9)
LeadingZeros32-4  2.14ns ± 0%  1.48ns ± 0%  -30.84%  (p=0.002 n=8+10)
LeadingZeros64-4  2.06ns ± 1%  1.33ns ± 0%  -35.08%  (p=0.000 n=8+8)

8-bit args are a special case - the Go code is really fast because
it is just a single table lookup, so I've disabled that for now.
The intrinsic was actually slower:
LeadingZeros8-4   1.22ns ± 3%  1.58ns ± 1%  +29.56%  (p=0.000 n=10+10)

Update #18616

Change-Id: Ia9c289b9ba59c583ea64060470315fd637e814cf
Reviewed-on: https://go-review.googlesource.com/38311
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-03-16 22:53:49 +00:00
Keith Randall
dd9892e31b cmd/compile: intrinsify math/bits.ReverseBytes
Update #18616

Change-Id: I0c2d643cbbeb131b4c9b12194697afa4af48e1d2
Reviewed-on: https://go-review.googlesource.com/38166
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-03-16 19:41:56 +00:00
Keith Randall
d5dc490519 cmd/compile: intrinsics for math/bits.TrailingZerosX
Implement math/bits.TrailingZerosX using intrinsics.

Generally reorganize the intrinsic spec a bit.
The intrinsics data structure is now built at init time.
This will make doing the other functions in math/bits easier.

Update sys.CtzX to return int instead of uint{64,32} so it
matches math/bits.TrailingZerosX.

Improve the intrinsics a bit for amd64.  We don't need the CMOV
for <64 bit versions.
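
For illustration, a hedged sketch of an affected call:

	import "math/bits"

	// Intrinsified; for the sub-64-bit versions the input can be
	// padded with a set high bit so BSF never sees zero, which is
	// what makes the CMOV unnecessary.
	func tz(x uint32) int {
		return bits.TrailingZeros32(x)
	}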

Update #18616

Change-Id: Ic1c5339c943f961d830ae56f12674d7b29d4ff39
Reviewed-on: https://go-review.googlesource.com/38155
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-03-16 02:44:16 +00:00
Josh Bleecher Snyder
3a90bfb253 cmd/dist, cmd/compile: eliminate mergeEnvLists copies
This is now handled by os/exec.

Updates #12868

Change-Id: Ic21a6ff76a9b9517437ff1acf3a9195f9604bb45
Reviewed-on: https://go-review.googlesource.com/37698
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-03-02 22:26:23 +00:00
Josh Bleecher Snyder
2183135554 cmd/compile: recognize bit test patterns on amd64
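
For illustration, a hedged sketch of the kind of pattern recognized:

	// A test of a single bit can now compile to a BT instruction
	// instead of a shift, AND, and compare.
	func isSet(x uint64, i uint) bool {
		return x&(1<<i) != 0
	}
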
Updates #18943

Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08
Reviewed-on: https://go-review.googlesource.com/36329
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-01 00:36:04 +00:00
Michael Munday
bd8a39b67a cmd/compile: emit fused multiply-{add,subtract} instructions on s390x
Explicitly block fused multiply-add pattern matching when a cast is used
after the multiplication, for example:

    - (a * b) + c        // can emit fused multiply-add
    - float64(a * b) + c // cannot emit fused multiply-add

float{32,64} and complex{64,128} casts of matching types are now kept
as OCONV operations rather than being replaced with OCONVNOP operations
because they now imply a rounding operation (and therefore aren't a
no-op anymore).

Operations (for example, multiplication) on complex types may utilize
fused multiply-add and -subtract instructions internally. There is no
way to disable this behavior at the moment.
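
For illustration (a hedged sketch; the instruction selection is up to
the backend):

	func fused(a, b, c float64) float64 {
		return a*b + c // eligible for a single fused multiply-add
	}

	func unfused(a, b, c float64) float64 {
		return float64(a*b) + c // the cast forces a rounding step, so no fusion
	}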

Improves the performance of the floating point implementation of
poly1305:

name         old speed     new speed     delta
64           246MB/s ± 0%  275MB/s ± 0%  +11.48%   (p=0.000 n=10+8)
1K           312MB/s ± 0%  357MB/s ± 0%  +14.41%  (p=0.000 n=10+10)
64Unaligned  246MB/s ± 0%  274MB/s ± 0%  +11.43%  (p=0.000 n=10+10)
1KUnaligned  312MB/s ± 0%  357MB/s ± 0%  +14.39%   (p=0.000 n=10+8)

Updates #17895.

Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16
Reviewed-on: https://go-review.googlesource.com/36963
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-02-28 15:34:20 +00:00
Josh Bleecher Snyder
e458264aca cmd/compile: fix dolinkobj flag in TestAssembly
Follow-up to CL 37270.

This considerably reduces the time to run the test.

Before:

real	0m7.638s
user	0m14.341s
sys	0m2.244s

After:

real	0m4.867s
user	0m7.107s
sys	0m1.842s

Change-Id: I8837a5da0979a1c365e1ce5874d81708249a4129
Reviewed-on: https://go-review.googlesource.com/37461
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <munday@ca.ibm.com>
2017-02-25 14:39:29 +00:00
Lorenzo Masini
fb1f47a77c cmd/compile: speed up TestAssembly
TestAssembly was very slow, leading to it being skipped by default.
This is not surprising: it separately invoked the compiler and
parsed the result many times.

Now the test assembles one source file per arch/os combination,
containing the relevant functions.

Tests for each arch/os run in parallel.

Now the test runs approximately 10x faster on my Intel(R) Core(TM)
i5-6600 CPU @ 3.30GHz.

Fixes #18966

Change-Id: I45ab97630b627a32e17900c109f790eb4c0e90d9
Reviewed-on: https://go-review.googlesource.com/37270
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-02-24 21:23:43 +00:00
Kirill Smelkov
4477fd097f cmd/compile/internal/ssa: combine 2 byte loads + shifts into word load + rolw 8 on AMD64
... and same for stores. This does for binary.BigEndian.Uint16() what
was already done for Uint32 and Uint64 with BSWAP in 10f75748 (CL 32222).

Here is how generated code changes e.g. for the following function
(omitting the prologue/epilogue, which stay the same):

	func get16(b [2]byte) uint16 {
		return binary.BigEndian.Uint16(b[:])
	}

"".get16 t=1 size=21 args=0x10 locals=0x0

	// before
        0x0000 00000 (x.go:15)  MOVBLZX "".b+9(FP), AX
        0x0005 00005 (x.go:15)  MOVBLZX "".b+8(FP), CX
        0x000a 00010 (x.go:15)  SHLL    $8, CX
        0x000d 00013 (x.go:15)  ORL     CX, AX

	// after
	0x0000 00000 (x.go:15)	MOVWLZX	"".b+8(FP), AX
	0x0005 00005 (x.go:15)	ROLW	$8, AX

Overall this speeds up encoding/binary a bit:

name                    old time/op    new time/op    delta
ReadSlice1000Int32s-4     4.83µs ± 0%    4.83µs ± 0%     ~     (p=0.206 n=4+5)
ReadStruct-4              1.29µs ± 2%    1.28µs ± 1%   -1.27%  (p=0.032 n=4+5)
ReadInts-4                 384ns ± 1%     385ns ± 1%     ~     (p=0.968 n=4+5)
WriteInts-4                534ns ± 3%     526ns ± 0%   -1.54%  (p=0.048 n=4+5)
WriteSlice1000Int32s-4    5.02µs ± 0%    5.11µs ± 3%     ~     (p=0.175 n=4+5)
PutUint16-4               0.59ns ± 0%    0.49ns ± 2%  -16.95%  (p=0.016 n=4+5)
PutUint32-4               0.52ns ± 0%    0.52ns ± 0%     ~     (all equal)
PutUint64-4               0.53ns ± 0%    0.53ns ± 0%     ~     (all equal)
PutUvarint32-4            19.9ns ± 0%    19.9ns ± 1%     ~     (p=0.556 n=4+5)
PutUvarint64-4            54.5ns ± 1%    54.2ns ± 0%     ~     (p=0.333 n=4+5)

name                    old speed      new speed      delta
ReadSlice1000Int32s-4    829MB/s ± 0%   828MB/s ± 0%     ~     (p=0.190 n=4+5)
ReadStruct-4            58.0MB/s ± 2%  58.7MB/s ± 1%   +1.30%  (p=0.032 n=4+5)
ReadInts-4              78.0MB/s ± 1%  77.8MB/s ± 1%     ~     (p=0.968 n=4+5)
WriteInts-4             56.1MB/s ± 3%  57.0MB/s ± 0%     ~     (p=0.063 n=4+5)
WriteSlice1000Int32s-4   797MB/s ± 0%   783MB/s ± 3%     ~     (p=0.190 n=4+5)
PutUint16-4             3.37GB/s ± 0%  4.07GB/s ± 2%  +20.83%  (p=0.016 n=4+5)
PutUint32-4             7.73GB/s ± 0%  7.72GB/s ± 0%     ~     (p=0.556 n=4+5)
PutUint64-4             15.1GB/s ± 0%  15.1GB/s ± 0%     ~     (p=0.905 n=4+5)
PutUvarint32-4           201MB/s ± 0%   201MB/s ± 0%     ~     (p=0.905 n=4+5)
PutUvarint64-4           147MB/s ± 1%   147MB/s ± 0%     ~     (p=0.286 n=4+5)

( "a bit" only because most of the time is spent in reflection-like things
  there, not actual bytes decoding. Even for direct PutUint16 benchmark the
  looping adds overhead and lowers visible benefit. For code-generated encoders /
  decoders actual effect is more than 20% )

Adding Uint32 and Uint64 raw benchmarks too for completeness.

NOTE: I had to adjust the load-combining rule for the bswap case to match
the first 2 byte loads that result from the "2-byte load+shift" ->
"loadw + rorw 8" rewrite. The reason: for loads+shift, even e.g. into a
uint16 var

	var b []byte
	var v uint16
	v = uint16(b[1]) | uint16(b[0])<<8

the compiler eventually generates the L(ong) shift - SHLLconst [8] -
probably because it is more straightforward (among other reasons) to
work on the whole register. As a result, the 2-byte rewriting rule uses
SHLLconst (not SHLWconst) in its pattern, and it always gets matched
first, even though the 2-byte rule comes syntactically after the 4-byte
rule in AMD64.rules, because the 4-byte rule seemingly needs more
applyRewrite() cycles to trigger. If the 2-byte rule gets matched for
the inner half of

	var b []byte
	var v uint32
	v = uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24

and we keep the 4-byte load rule unchanged, the result will be MOVW +
RORW $8 followed by a series of byte loads and shifts - not a single
MOVL + BSWAPL.

There is no such problem for stores: there the compiler, since it
presumably knows the store destination is 2 bytes wide, uses SHRWconst 8
(not SHRLconst 8), and thus the 2-byte store rule is not a subset of the
4-byte store rule.

Fixes #17151 (int16 was the last missing piece there)

Change-Id: Idc03ba965bfce2b94fef456b02ff6742194748f6
Reviewed-on: https://go-review.googlesource.com/34636
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-02-14 22:17:08 +00:00
Cherry Zhang
78200799a2 cmd/compile: undo special handling of zero-valued STRUCTLIT
CL 35261 introduces special handling of zero-valued STRUCTLIT for
efficient struct zeroing. But it didn't cover all use cases; for
example, CONVNOP STRUCTLIT is not handled.

On the other hand, CL 34566 handles zeroing earlier, so we don't
need the change in CL 35261 for efficient zeroing. Other uses of
zero-valued struct literals are very rare. So undo the change in
walk.go in CL 35261.

Add a test for efficient zeroing.
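
For illustration, a hedged sketch of the pattern the new test covers:

	type T struct{ a, b, c int64 }

	// Should compile to direct zero stores into *t,
	// not construction of a temporary followed by a copy.
	func reset(t *T) {
		*t = T{}
	}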

Fixes #19084.

Change-Id: I0807f7423fb44d47bf325b3c1ce9611a14953853
Reviewed-on: https://go-review.googlesource.com/36955
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2017-02-14 18:57:56 +00:00
Kirill Smelkov
bd91e3569a cmd/compile/internal/ssa: generate bswap/store for indexed bigendian byte stores too on AMD64
Commit 10f75748 (CL 32222) added rewrite rules to combine byte loads/stores +
shifts into larger loads/stores + bswap. For loads, both MOVBload and
MOVBloadidx1 were handled, but for stores only MOVBstore was covered;
MOVBstoreidx was missing from the rewrite pattern. Fix it.

Here is how generated code changes for the following 2 functions
(omitting the prologue/epilogue, which stay the same):

    func put32(b []byte, i int, v uint32) {
            binary.BigEndian.PutUint32(b[i:], v)
    }

    func put64(b []byte, i int, v uint64) {
            binary.BigEndian.PutUint64(b[i:], v)
    }

"".put32 t=1 size=100 args=0x28 locals=0x0

	// before
	0x0032 00050 (x.go:5)	MOVL	CX, DX
	0x0034 00052 (x.go:5)	SHRL	$24, CX
	0x0037 00055 (x.go:5)	MOVQ	"".b+8(FP), BX
	0x003c 00060 (x.go:5)	MOVB	CL, (BX)(AX*1)
	0x003f 00063 (x.go:5)	MOVL	DX, CX
	0x0041 00065 (x.go:5)	SHRL	$16, DX
	0x0044 00068 (x.go:5)	MOVB	DL, 1(BX)(AX*1)
	0x0048 00072 (x.go:5)	MOVL	CX, DX
	0x004a 00074 (x.go:5)	SHRL	$8, CX
	0x004d 00077 (x.go:5)	MOVB	CL, 2(BX)(AX*1)
	0x0051 00081 (x.go:5)	MOVB	DL, 3(BX)(AX*1)

	// after
	0x0032 00050 (x.go:5)	BSWAPL	CX
	0x0034 00052 (x.go:5)	MOVQ	"".b+8(FP), DX
	0x0039 00057 (x.go:5)	MOVL	CX, (DX)(AX*1)

"".put64 t=1 size=155 args=0x28 locals=0x0

	// before
	0x0037 00055 (x.go:9)	MOVQ	CX, DX
	0x003a 00058 (x.go:9)	SHRQ	$56, CX
	0x003e 00062 (x.go:9)	MOVQ	"".b+8(FP), BX
	0x0043 00067 (x.go:9)	MOVB	CL, (BX)(AX*1)
	0x0046 00070 (x.go:9)	MOVQ	DX, CX
	0x0049 00073 (x.go:9)	SHRQ	$48, DX
	0x004d 00077 (x.go:9)	MOVB	DL, 1(BX)(AX*1)
	0x0051 00081 (x.go:9)	MOVQ	CX, DX
	0x0054 00084 (x.go:9)	SHRQ	$40, CX
	0x0058 00088 (x.go:9)	MOVB	CL, 2(BX)(AX*1)
	0x005c 00092 (x.go:9)	MOVQ	DX, CX
	0x005f 00095 (x.go:9)	SHRQ	$32, DX
	0x0063 00099 (x.go:9)	MOVB	DL, 3(BX)(AX*1)
	0x0067 00103 (x.go:9)	MOVQ	CX, DX
	0x006a 00106 (x.go:9)	SHRQ	$24, CX
	0x006e 00110 (x.go:9)	MOVB	CL, 4(BX)(AX*1)
	0x0072 00114 (x.go:9)	MOVQ	DX, CX
	0x0075 00117 (x.go:9)	SHRQ	$16, DX
	0x0079 00121 (x.go:9)	MOVB	DL, 5(BX)(AX*1)
	0x007d 00125 (x.go:9)	MOVQ	CX, DX
	0x0080 00128 (x.go:9)	SHRQ	$8, CX
	0x0084 00132 (x.go:9)	MOVB	CL, 6(BX)(AX*1)
	0x0088 00136 (x.go:9)	MOVB	DL, 7(BX)(AX*1)

	// after
	0x0033 00051 (x.go:9)	BSWAPQ	CX
	0x0036 00054 (x.go:9)	MOVQ	"".b+8(FP), DX
	0x003b 00059 (x.go:9)	MOVQ	CX, (DX)(AX*1)

Updates #17151

Change-Id: I3f4a7f28f210e62e153e60da5abd1d39508cc6c4
Reviewed-on: https://go-review.googlesource.com/34635
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
2017-02-14 18:35:43 +00:00
Kirill Smelkov
e2948f7efe cmd/compile: Show arch/os when something in TestAssembly fails
It is not always obvious at first glance, when looking at a
TestAssembly failure, in which context the code was generated. For
example, x86 and x86-64 are similar, and those of us who do not work
with assembly every day can even mistake the s390x version for
something similar to x86.

So when something fails, let's print the whole test context - this
includes the OS and arch, which were previously missing. An example failure:

before:

--- FAIL: TestAssembly (40.48s)
        asm_test.go:46: expected:       MOVWZ   \(.*\),
                go:
                import "encoding/binary"
                func f(b []byte) uint32 {
                        return binary.LittleEndian.Uint32(b)
                }

                asm:"".f t=1 size=160 args=0x20 locals=0x0
		...

after:

--- FAIL: TestAssembly (40.43s)
        asm_test.go:46: linux/s390x: expected:  MOVWZ   \(.*\),
                go:
                import "encoding/binary"
                func f(b []byte) uint32 {
                        return binary.LittleEndian.Uint32(b)
                }

                asm:"".f t=1 size=160 args=0x20 locals=0x0

Motivated-by: #18946#issuecomment-279491071

Change-Id: I61089ceec05da7a165718a7d69dec4227dd0e993
Reviewed-on: https://go-review.googlesource.com/36881
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-02-13 20:30:31 +00:00
Michael Munday
074b73b1b2 cmd/compile: fix s390x load-combining rules
MOVD{reg,nop} operations (added in CL 36256) inserted to preserve
type information were blocking the load-combining rules. Fix this
by merging type changes into loads wherever possible.

Fixes #19059.

Change-Id: I8a1df06eb0f231b40ae43107d4a3bd0b9c441b59
Reviewed-on: https://go-review.googlesource.com/36843
Run-TryBot: Michael Munday <munday@ca.ibm.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-02-13 20:04:14 +00:00
Keith Randall
b548eee3d9 cmd/compile: fix load-combining rules
CL 33632 reorders args of commutative ops in order to make
CSE for commutative ops more robust.  Unfortunately, that
broke the load-combining rules which depend on a certain ordering
of OR ops' arguments.

Introduce some additional rules that order OR ops' arguments
consistently so that the load-combining rules fire.
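
For illustration, a hedged sketch of code that depends on these rules
firing:

	// The byte-load/shift/OR expansion below is recombined into a
	// single 32-bit load, but only if the OR arguments appear in
	// the order the load-combining rules expect.
	func load32(b []byte) uint32 {
		return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
	}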

Note: there's also something else wrong with the s390x rules.
I've filed #19059 for that.

Fixes #18946

Change-Id: I0a5447196bd88a55ccee683c69a57b943a9972e1
Reviewed-on: https://go-review.googlesource.com/36911
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2017-02-13 18:29:51 +00:00
Josh Bleecher Snyder
5faba3057d cmd/compile: use constants directly for fast map access calls
CL 35554 taught order.go to use static variables
for constants that needed to be addressable for runtime routines.
However, there is one class of runtime routines that
do not actually need an addressable value: fast map access routines.
This CL teaches order.go to avoid using static variables
for addressability in those cases.
Instead, it avoids introducing a temp at all,
which the backend would just have to optimize away.
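
For illustration, a hedged sketch of an access that no longer needs a
static copy of its key:

	// The constant key is passed to the fast string-keyed map
	// accessor by value, so no addressable temporary is introduced.
	func lookup(m map[string]int) int {
		return m["key"]
	}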

Fixes #19015.

Change-Id: I5ef780c604fac3fb48dabb23a344435e283cb832
Reviewed-on: https://go-review.googlesource.com/36693
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-10 04:57:20 +00:00
Keith Randall
01c8719f8b cmd/compile: move rotate instruction generation to SSA
Remove rotate generation from walk.  Remove OLROT and ssa.Lrot* opcodes.
Generate rotates during SSA lowering for architectures that have them.

This CL will allow rotates to be generated in more situations,
like when the shift values are determined to be constant
only after some analysis.
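
For illustration, a hedged sketch of a case walk could not catch but
SSA lowering can:

	func rot(x uint32) uint32 {
		k := uint(3)
		k += 5 // k becomes a known constant only after SSA analysis
		return x<<k | x>>(32-k) // still lowered to a rotate
	}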

Fixes #18254

Change-Id: I8d6d684ff5ce2511aceaddfda98b908007851079
Reviewed-on: https://go-review.googlesource.com/34232
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-02-02 17:57:15 +00:00
Russ Cox
47ce87877b all: merge dev.inline into master
Change-Id: I7715581a04e513dcda9918e853fa6b1ddc703770
2017-02-01 09:47:23 -05:00
Kirill Smelkov
c44da14440 cmd/compile/internal/ssa: add tests for BSWAP on stores on AMD64
Commit 10f75748 (CL 32222) taught the AMD64 backend to rewrite a series
of byte loads or stores with corresponding shifts into a single long or
quad load or store plus the appropriate BSWAP. However, it did not add
tests for stores - only loads were tested.

Fix it.

NOTE: Tests for indexed stores are not added because 10f75748 did not
add support for indexed stores - only indexed loads were handled then.

Change-Id: I48c867ebe7622ac8e691d43741feed1d40cca0d7
Reviewed-on: https://go-review.googlesource.com/34634
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-12-21 16:36:45 +00:00
Keith Randall
8d21691044 cmd/compile: test for correct zeroing
Make sure we generate the right code for zeroing a structure.

Check in after Matthew's CL (34564).

Update #18370

Change-Id: I987087f979d99227a880b34c44d9d4de6c25ba0c
Reviewed-on: https://go-review.googlesource.com/34565
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
2016-12-19 17:36:35 +00:00
Robert Griesemer
eab3707d6d [dev.inline] cmd/compile: rename various fields from Lineno to Pos
Various minor adjustments.

Change-Id: Iedfb97989f7bedaa3e9e8993b167e05f162434a7
Reviewed-on: https://go-review.googlesource.com/34136
Reviewed-by: David Lazar <lazard@golang.org>
2016-12-08 21:35:18 +00:00
Ilya Tocar
10f757486e cmd/compile/internal/ssa: generate bswap on AMD64
Generate bswap+load/store for reading/writing big endian data.
Helps encoding/binary.
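
For illustration (a hedged sketch, not code from this CL):

	import "encoding/binary"

	// Now a single 8-byte load plus BSWAPQ on amd64,
	// instead of eight byte loads and shifts.
	func be64(b []byte) uint64 {
		return binary.BigEndian.Uint64(b)
	}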

name                    old time/op    new time/op    delta
ReadSlice1000Int32s-8     5.06µs ± 8%    4.58µs ± 8%   -9.50%        (p=0.000 n=10+10)
ReadStruct-8              1.07µs ± 0%    1.05µs ± 0%   -1.51%         (p=0.000 n=9+10)
ReadInts-8                 367ns ± 0%     363ns ± 0%   -1.15%          (p=0.000 n=8+9)
WriteInts-8                475ns ± 1%     469ns ± 0%   -1.45%        (p=0.000 n=10+10)
WriteSlice1000Int32s-8    5.03µs ± 3%    4.50µs ± 3%  -10.45%          (p=0.000 n=9+9)
PutUvarint32-8            17.2ns ± 0%    17.2ns ± 0%     ~     (all samples are equal)
PutUvarint64-8            46.7ns ± 0%    46.7ns ± 0%     ~           (p=0.509 n=10+10)

name                    old speed      new speed      delta
ReadSlice1000Int32s-8    791MB/s ± 8%   875MB/s ± 8%  +10.53%        (p=0.000 n=10+10)
ReadStruct-8            70.0MB/s ± 0%  71.1MB/s ± 0%   +1.54%         (p=0.000 n=9+10)
ReadInts-8              81.6MB/s ± 0%  82.6MB/s ± 0%   +1.21%          (p=0.000 n=9+9)
WriteInts-8             63.0MB/s ± 1%  63.9MB/s ± 0%   +1.45%        (p=0.000 n=10+10)
WriteSlice1000Int32s-8   796MB/s ± 4%   888MB/s ± 3%  +11.65%          (p=0.000 n=9+9)
PutUvarint32-8           233MB/s ± 0%   233MB/s ± 0%     ~           (p=0.089 n=10+10)
PutUvarint64-8           171MB/s ± 0%   171MB/s ± 0%     ~            (p=0.137 n=10+9)

Change-Id: Ia2dbdef92198eaa7e2af5443a8ed586d4b401ffb
Reviewed-on: https://go-review.googlesource.com/32222
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-11-03 12:34:12 +00:00
Josh Bleecher Snyder
66504485eb cmd/compile/internal/gc: make tests run faster
TestAssembly takes 20s on my machine,
which is too slow for normal operation.
Marking as -short has its dangers (#17472),
but hopefully we'll soon have a builder for that.

All the SSA tests are hermetic and not time sensitive
and can thus be run in parallel.
Reduces the cmd/compile/internal/gc test time during
all.bash on my laptop from 42s to 7s.

Updates #17751

Change-Id: Idd876421db23b9fa3475e8a9b3355a5dc92a5a29
Reviewed-on: https://go-review.googlesource.com/32585
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-11-03 01:07:08 +00:00
Keith Randall
26a6131bac cmd/compile: fix 4-byte unaligned load rules
The 2-byte rule was firing before the 4-byte rule, preventing
the 4-byte rule from firing.  Update the 4-byte rule to use
the results of the 2-byte rule instead.

Add some tests to make sure we don't regress again.

Fixes #17147

Change-Id: Icfeccd9f2b96450981086a52edd76afb3191410a
Reviewed-on: https://go-review.googlesource.com/29382
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2016-09-23 19:32:37 +00:00
Keith Randall
842b05832f all: use testenv.GoToolPath instead of "go"
This change makes sure that tests are run with the correct
version of the go tool.  The correct version is the one that
we invoked with "go test", not the one that is first in our path.
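
For illustration, a hedged sketch of the pattern as used inside the
tree (note internal/testenv is importable only from the standard
library):

	import (
		"internal/testenv"
		"os/exec"
		"testing"
	)

	func TestExample(t *testing.T) {
		// Resolve the same go binary that launched "go test",
		// not whatever "go" happens to be first in $PATH.
		out, err := exec.Command(testenv.GoToolPath(t), "version").CombinedOutput()
		if err != nil {
			t.Fatalf("go version: %v\n%s", err, out)
		}
	}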

Fixes #16577

Change-Id: If22c8f8c3ec9e7c35d094362873819f2fbb8559b
Reviewed-on: https://go-review.googlesource.com/28089
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-08-30 22:49:11 +00:00
Cherry Zhang
29f0984a35 cmd/compile: don't set line number to 0 when building SSA
The frontend may emit nodes with the line number missing. In this case,
use the parent's line number. Instead of changing every call site of
pushLine, do it in pushLine itself.

Fixes #16214.

Change-Id: I80390550b56e4d690fc770b01ff725b892ffd6dc
Reviewed-on: https://go-review.googlesource.com/24641
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-07-01 01:12:24 +00:00
David du Colombier
e29e0ba19a cmd/compile: fix TestAssembly on Plan 9
Since CL 23620, TestAssembly is failing on Plan 9.

In CL 23620, the process environment is passed to 'go tool compile'
after setting GOARCH. On Plan 9, if GOARCH is already set in the
process environment, it would take precedence. On Unix, it works
as expected because the first GOARCH found takes precedence.

This change uses the mergeEnvLists function from cmd/go/main.go
to merge the two environment lists such that variables with the
same name in "in" replace those in "out".
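
For reference, a hedged sketch of that helper's behavior as described
above (the real code lives in cmd/go/main.go and may differ in detail):

	import "strings"

	// mergeEnvLists merges the two environment lists such that
	// variables with the same name in "in" replace those in "out".
	func mergeEnvLists(in, out []string) []string {
	NextVar:
		for _, inkv := range in {
			k := strings.SplitAfterN(inkv, "=", 2)[0] // "NAME=" prefix
			for i, outkv := range out {
				if strings.HasPrefix(outkv, k) {
					out[i] = inkv
					continue NextVar
				}
			}
			out = append(out, inkv)
		}
		return out
	}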

Change-Id: Idee22058343932ee18666dda331c562c89c33507
Reviewed-on: https://go-review.googlesource.com/23593
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David du Colombier <0intro@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-06-01 13:33:43 +00:00
Michael Hudson-Doyle
2885e07c25 cmd/compile: pass process env to 'go tool compile' in compileToAsm
In particular, this stops the test failing when GOROOT and GOROOT_FINAL are
different.

Change-Id: Ibf6cc0a173f1d965ee8aa31eee2698b223f1ceec
Reviewed-on: https://go-review.googlesource.com/23620
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-06-01 03:55:09 +00:00
Keith Randall
9369f22b84 cmd/compile: testing harness for checking generated assembly
Add a test which compiles a function and checks the
generated assembly to make sure certain patterns are present.
This test allows us to do white box tests of the compiler
to make sure optimizations don't regress.

Added a few simple tests for now.  More to come.

Change-Id: I4ab5ce5d95b9e04e7d0d9328ffae47b8d1f95e74
Reviewed-on: https://go-review.googlesource.com/23403
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-05-26 23:07:01 +00:00