Stowage/go - Remotebranch.eu

mirror of https://github.com/golang/go.git synced 2025-12-08 06:10:04 +00:00

Author	SHA1	Message	Date
Ilya Tocar	f3884680fc	cmd/compile/internal/ssa: inline memmove with known size Replace calls to memmove with known (constant) size, with OpMove. Do it only if it is safe from aliasing point of view. Helps with code like this: append(buf,"const str"...) In strconv this provides nice benefit: Quote-6 731ns ± 2% 647ns ± 3% -11.41% (p=0.000 n=10+10) QuoteRune-6 117ns ± 5% 111ns ± 1% -4.54% (p=0.000 n=10+10) AppendQuote-6 475ns ± 0% 396ns ± 0% -16.59% (p=0.000 n=9+10) AppendQuoteRune-6 32.0ns ± 0% 27.4ns ± 0% -14.41% (p=0.000 n=8+9) Change-Id: I7704f5c51b46aed2d8f033de74c75140fc35036c Reviewed-on: https://go-review.googlesource.com/54394 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> Reviewed-by: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-11-02 20:30:25 +00:00
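The pattern this commit targets can be sketched as follows. This is only an illustration of the source-level shape named in the message (`append(buf, "const str"...)`); the function name `appendConst` is hypothetical, and whether the copy is actually lowered to an inline move depends on the compiler version and target.

```go
package main

import "fmt"

// appendConst is a hypothetical helper. The appended string is a constant,
// so the number of bytes to copy (9) is known at compile time. That is the
// case the commit describes as eligible for replacing the memmove call.
func appendConst(buf []byte) []byte {
	return append(buf, "const str"...)
}

func main() {
	buf := make([]byte, 0, 32)
	buf = appendConst(buf)
	fmt.Printf("%q (len %d)\n", buf, len(buf))
}
```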
Michael Munday	4745604bcb	cmd/compile: intrinsify math.RoundToEven on s390x The new RoundToEven function can be implemented as a single FIDBR instruction on s390x. name old time/op new time/op delta RoundToEven 5.32ns ± 1% 0.86ns ± 1% -83.86% (p=0.000 n=10+10) Change-Id: Iaf597e57a0d1085961701e3c75ff4f6f6dcebb5f Reviewed-on: https://go-review.googlesource.com/74350 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-31 18:04:27 +00:00
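For readers unfamiliar with the function being intrinsified, here is a small sketch of how `math.RoundToEven` differs from `math.Round` on halfway cases. The helper `roundBoth` is a hypothetical name used only for this example; it says nothing about how the s390x backend emits FIDBR.

```go
package main

import (
	"fmt"
	"math"
)

// roundBoth is a hypothetical helper that returns both roundings of x:
// ties-to-even (math.RoundToEven) and ties-away-from-zero (math.Round).
func roundBoth(x float64) (even, away float64) {
	return math.RoundToEven(x), math.Round(x)
}

func main() {
	for _, x := range []float64{0.5, 1.5, 2.5, -2.5} {
		even, away := roundBoth(x)
		fmt.Printf("x=%v RoundToEven=%v Round=%v\n", x, even, away)
	}
}
```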
Michael Munday	96cdacb971	cmd/asm, cmd/compile: optimize math.Abs and math.Copysign on s390x This change adds three new instructions: - LPDFR: load positive (math.Abs(x)) - LNDFR: load negative (-math.Abs(x)) - CPSDR: copy sign (math.Copysign(x, y)) By making use of GPR <-> FPR moves we can now compile math.Abs and math.Copysign to these instructions using SSA rules. This CL also adds new rules to merge address generation into combined load operations. This makes GPR <-> FPR move matching more reliable. name old time/op new time/op delta Copysign 1.85ns ± 0% 1.40ns ± 1% -24.65% (p=0.000 n=8+10) Abs 1.58ns ± 1% 0.73ns ± 1% -53.64% (p=0.000 n=10+10) The geo mean improvement for all math package benchmarks was 4.6%. Change-Id: I0cec35c5c1b3fb45243bf666b56b57faca981bc9 Reviewed-on: https://go-review.googlesource.com/73950 Run-TryBot: Michael Munday <mike.munday@ibm.com> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-30 23:42:51 +00:00
Austin Clements	7e343134d3	cmd/compile: compiler support for buffered write barrier This CL implements the compiler support for calling the buffered write barrier added by the previous CL. Since the buffered write barrier is only implemented on amd64 right now, this still supports the old, eager write barrier as well. There's little overhead to supporting both and this way a few tests in test/fixedbugs that expect to have liveness maps at write barrier calls can easily opt-in to the old, eager barrier. This significantly improves the performance of the write barrier: name old time/op new time/op delta WriteBarrier-12 73.5ns ±20% 19.2ns ±27% -73.90% (p=0.000 n=19+18) It also reduces the size of binaries because the write barrier call is more compact: name old object-bytes new object-bytes delta Template 398k ± 0% 393k ± 0% -1.14% (p=0.008 n=5+5) Unicode 208k ± 0% 206k ± 0% -1.00% (p=0.008 n=5+5) GoTypes 1.18M ± 0% 1.15M ± 0% -2.00% (p=0.008 n=5+5) Compiler 4.05M ± 0% 3.88M ± 0% -4.26% (p=0.008 n=5+5) SSA 8.25M ± 0% 8.11M ± 0% -1.59% (p=0.008 n=5+5) Flate 228k ± 0% 224k ± 0% -1.83% (p=0.008 n=5+5) GoParser 295k ± 0% 284k ± 0% -3.62% (p=0.008 n=5+5) Reflect 1.00M ± 0% 0.99M ± 0% -0.70% (p=0.008 n=5+5) Tar 339k ± 0% 333k ± 0% -1.67% (p=0.008 n=5+5) XML 404k ± 0% 395k ± 0% -2.10% (p=0.008 n=5+5) [Geo mean] 704k 690k -2.00% name old exe-bytes new exe-bytes delta HelloSize 1.05M ± 0% 1.04M ± 0% -1.55% (p=0.008 n=5+5) https://perf.golang.org/search?q=upload:20171027.1 (Amusingly, this also reduces compiler allocations by 0.75%, which, combined with the better write barrier, speeds up the compiler overall by 2.10%. See the perf link.) 
It slightly improves the performance of most of the go1 benchmarks and improves the performance of the x/benchmarks: name old time/op new time/op delta BinaryTree17-12 2.40s ± 1% 2.47s ± 1% +2.69% (p=0.000 n=19+19) Fannkuch11-12 2.95s ± 0% 2.95s ± 0% +0.21% (p=0.000 n=20+19) FmtFprintfEmpty-12 41.8ns ± 4% 41.4ns ± 2% -1.03% (p=0.014 n=20+20) FmtFprintfString-12 68.7ns ± 2% 67.5ns ± 1% -1.75% (p=0.000 n=20+17) FmtFprintfInt-12 79.0ns ± 3% 77.1ns ± 1% -2.40% (p=0.000 n=19+17) FmtFprintfIntInt-12 127ns ± 1% 123ns ± 3% -3.42% (p=0.000 n=20+20) FmtFprintfPrefixedInt-12 152ns ± 1% 150ns ± 1% -1.02% (p=0.000 n=18+17) FmtFprintfFloat-12 211ns ± 1% 209ns ± 0% -0.99% (p=0.000 n=20+16) FmtManyArgs-12 500ns ± 0% 496ns ± 0% -0.73% (p=0.000 n=17+20) GobDecode-12 6.44ms ± 1% 6.53ms ± 0% +1.28% (p=0.000 n=20+19) GobEncode-12 5.46ms ± 0% 5.46ms ± 1% ~ (p=0.550 n=19+20) Gzip-12 220ms ± 1% 216ms ± 0% -1.75% (p=0.000 n=19+19) Gunzip-12 38.8ms ± 0% 38.6ms ± 0% -0.30% (p=0.000 n=18+19) HTTPClientServer-12 79.0µs ± 1% 78.2µs ± 1% -1.01% (p=0.000 n=20+20) JSONEncode-12 11.9ms ± 0% 11.9ms ± 0% -0.29% (p=0.000 n=20+19) JSONDecode-12 52.6ms ± 0% 52.2ms ± 0% -0.68% (p=0.000 n=19+20) Mandelbrot200-12 3.69ms ± 0% 3.68ms ± 0% -0.36% (p=0.000 n=20+20) GoParse-12 3.13ms ± 1% 3.18ms ± 1% +1.67% (p=0.000 n=19+20) RegexpMatchEasy0_32-12 73.2ns ± 1% 72.3ns ± 1% -1.19% (p=0.000 n=19+18) RegexpMatchEasy0_1K-12 241ns ± 0% 239ns ± 0% -0.83% (p=0.000 n=17+16) RegexpMatchEasy1_32-12 68.6ns ± 1% 69.0ns ± 1% +0.47% (p=0.015 n=18+16) RegexpMatchEasy1_1K-12 364ns ± 0% 361ns ± 0% -0.67% (p=0.000 n=16+17) RegexpMatchMedium_32-12 104ns ± 1% 103ns ± 1% -0.79% (p=0.001 n=20+15) RegexpMatchMedium_1K-12 33.8µs ± 3% 34.0µs ± 2% ~ (p=0.267 n=20+19) RegexpMatchHard_32-12 1.64µs ± 1% 1.62µs ± 2% -1.25% (p=0.000 n=19+18) RegexpMatchHard_1K-12 49.2µs ± 0% 48.7µs ± 1% -0.93% (p=0.000 n=19+18) Revcomp-12 391ms ± 5% 396ms ± 7% ~ (p=0.154 n=19+19) Template-12 63.1ms ± 0% 59.5ms ± 0% -5.76% (p=0.000 n=18+19) TimeParse-12 307ns ± 
0% 306ns ± 0% -0.39% (p=0.000 n=19+17) TimeFormat-12 325ns ± 0% 323ns ± 0% -0.50% (p=0.000 n=19+19) [Geo mean] 47.3µs 46.9µs -0.67% https://perf.golang.org/search?q=upload:20171026.1 name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.25ms ± 1% 2.20ms ± 1% -2.31% (p=0.000 n=18+18) HTTP-12 12.6µs ± 0% 12.6µs ± 0% -0.72% (p=0.000 n=18+17) JSON-12 11.0ms ± 0% 11.0ms ± 1% -0.68% (p=0.000 n=17+19) https://perf.golang.org/search?q=upload:20171026.2 Updates #14951. Updates #22460. Change-Id: Id4c0932890a1d41020071bec73b8522b1367d3e7 Reviewed-on: https://go-review.googlesource.com/73712 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-10-30 18:12:46 +00:00
Lynn Boger	4d0151ede5	cmd/compile,cmd/internal/obj/ppc64: make math.Abs,math.Copysign intrinsics on ppc64x This adds support for math Abs, Copysign to be intrinsics on ppc64x. New instruction FCPSGN is added to generate fcpsgn. Some new rules are added to improve the int<->float conversions that are generated mainly due to the Float64bits and Float64frombits in the math package. PPC64.rules is also modified as suggested in the review for CL 63290. Improvements: benchmark old ns/op new ns/op delta BenchmarkAbs-16 1.12 0.69 -38.39% BenchmarkCopysign-16 1.30 0.93 -28.46% BenchmarkNextafter32-16 9.34 8.05 -13.81% BenchmarkFrexp-16 8.81 7.60 -13.73% Others that used Copysign also saw smaller improvements. I attempted to make this work using rules since that seems to be preferred, but due to the use of Float64bits and Float64frombits in these functions, several rules had to be added and even then not all cases were matched. Using rules became too complicated and seemed too fragile for these. Updates #21390 Change-Id: Ia265da9a18355e08000818a4fba1a40e9e031995 Reviewed-on: https://go-review.googlesource.com/67130 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-30 13:56:39 +00:00
Hugues Bruant	3c46f49f94	cmd/compile: fix incorrect go:noinline usage This pragma is not actually honored by the compiler. The tests implicitly relied on the inliner being unable to inline closures with captured variables, which will soon change. Fixes #22208 Change-Id: I13abc9c930b9156d43ec216f8efb768952a29439 Reviewed-on: https://go-review.googlesource.com/73211 Reviewed-by: Michael Munday <mike.munday@ibm.com>	2017-10-30 07:48:21 +00:00
Aliaksandr Valialkin	0011cfbe2b	cmd/compile: optimize signed non-negative div/mod by a power of 2 This CL optimizes assembly for len() or cap() division by a power of 2 constants: func lenDiv(s []int) int { return len(s) / 16 } amd64 assembly before the CL: MOVQ "".s+16(SP), AX MOVQ AX, CX SARQ $63, AX SHRQ $60, AX ADDQ CX, AX SARQ $4, AX MOVQ AX, "".~r1+32(SP) RET amd64 assembly after the CL: MOVQ "".s+16(SP), AX SHRQ $4, AX MOVQ AX, "".~r1+32(SP) RET The CL relies on the fact that len() and cap() result cannot be negative. Trigger stats for the added SSA rules on linux/amd64 when running make.bash: 46 Div64 12 Mod64 The added SSA rules may trigger on more cases in the future when SSA values will be populated with the info on their lower bounds. For instance: func f(i int16) int16 { if i < 3 { return -1 } // Lower bound of i is 3 here -> i is non-negative, // so unsigned arithmetics may be used here. return i % 16 } Change-Id: I8bc6be5a03e71157ced533c01416451ff6f1a7f0 Reviewed-on: https://go-review.googlesource.com/65530 Reviewed-by: Keith Randall <khr@golang.org>	2017-10-06 15:15:39 +00:00
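The difference between the two assembly sequences in this commit can be modelled in Go. The sketch below is an assumption-labelled illustration: `signedDiv16` and `nonNegDiv16` are hypothetical names that mirror the "before" and "after" instruction sequences, while `lenDiv` is the function quoted in the message.

```go
package main

import "fmt"

// lenDiv is the example from the commit message.
func lenDiv(s []int) int {
	return len(s) / 16
}

// signedDiv16 models the longer sequence needed for an arbitrary signed
// value: add 15 when x is negative so that the arithmetic shift rounds
// toward zero, as Go's integer division requires.
func signedDiv16(x int64) int64 {
	bias := int64(uint64(x>>63) >> 60) // 15 if x < 0, otherwise 0
	return (x + bias) >> 4
}

// nonNegDiv16 models the shorter sequence: a single logical shift.
// It matches x / 16 only when x is known to be non-negative.
func nonNegDiv16(x int64) int64 {
	return int64(uint64(x) >> 4)
}

func main() {
	fmt.Println(lenDiv(make([]int, 35)))
	for _, x := range []int64{35, 16, 0, -1, -17} {
		fmt.Printf("x=%d x/16=%d signed=%d nonNeg=%d\n",
			x, x/16, signedDiv16(x), nonNegDiv16(x))
	}
}
```

The single shift disagrees with `x / 16` for negative inputs, which is why the rule depends on `len()` and `cap()` never returning a negative value.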
Alberto Donizetti	03614562ca	cmd/compile: remove x86 arch-specific rules for +2ⁿ multiplication amd64 and 386 have rules to reduce multiplication by a positive power of two, but a more general reduction (both for positive and negative powers of two) is already performed by generic rules that were added in CL 36323 to replace walkmul (see lines 166:173 in generic.rules). The x86 and amd64 rules are never triggered during all.bash and can be removed, reducing rules duplication. The change also adds a few code generation tests for amd64 and 386. Change-Id: I566d48186643bd722a4c0137fe94e513b8b20e36 Reviewed-on: https://go-review.googlesource.com/68450 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-06 09:30:57 +00:00
Ilya Tocar	6b8a3c8889	cmd/compile/internal/amd64: add SETccmem Combine setcc and store of result into setcc that writes directly to memory. Triggers 200+ times in go tool. Fixes #21630 Change-Id: Iafa22607426f4120140c88fae4b9aecb46e0bba8 Reviewed-on: https://go-review.googlesource.com/67950 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-05 20:53:28 +00:00
Michael Munday	7582494e06	cmd/compile: add s390x intrinsics for Ceil, Floor, Round and Trunc Ceil, Floor and Trunc are pre-existing intrinsics. Round is a new function and has been added as an intrinsic in this CL. All of the functions can be implemented as a single 'LOAD FP INTEGER' instruction, FIDBR, on s390x. name old time/op new time/op delta Ceil 2.34ns ± 0% 0.85ns ± 0% -63.74% (p=0.000 n=5+4) Floor 2.33ns ± 0% 0.85ns ± 1% -63.35% (p=0.008 n=5+5) Round 4.23ns ± 0% 0.85ns ± 0% -79.89% (p=0.000 n=5+4) Trunc 2.35ns ± 0% 0.85ns ± 0% -63.83% (p=0.029 n=4+4) Change-Id: Idee7ba24a2899d12bf9afee4eedd6b4aaad3c510 Reviewed-on: https://go-review.googlesource.com/63890 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-09-20 10:01:35 +00:00
Michael Munday	95b146e8eb	cmd/compile: improve floating point constant propagation Add generic rules to propagate floating point constants through comparisons and integer conversions. These new rules seldom trigger in the standard library so there is no performance change, however I think it is worth adding them anyway for completeness. Change-Id: I9db5222746508a2996f1cafb72f4e0cf2541de07 Reviewed-on: https://go-review.googlesource.com/63795 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-09-14 23:08:33 +00:00
Lynn Boger	fa3fe2e3c6	cmd/compile, math/bits: add rotate rules to PPC64.rules This adds rules to match the code in math/bits RotateLeft, RotateLeft32, and RotateLeft64 to allow them to be inlined. The rules are complicated because the code in these functions uses different types, and the non-const versions of these shifts generate Mask and Carry instructions that become subexpressions during the match process. Also adds a testcase to asm_test.go. Improvement in math/bits: BenchmarkRotateLeft-16 1.57 1.32 -15.92% BenchmarkRotateLeft32-16 1.60 1.37 -14.37% BenchmarkRotateLeft64-16 1.57 1.32 -15.92% Updates #21390 Change-Id: Ib6f17669ecc9cab54f18d690be27e2225ca654a4 Reviewed-on: https://go-review.googlesource.com/59932 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>	2017-09-11 20:44:22 +00:00
Michael Munday	9da29b687f	cmd/compile: propagate constants through math.Float{32,64}{,from}bits This CL adds generic SSA rules to propagate constants through raw bits conversions between floats and integers. This allows constants to propagate through some math functions. For example, math.Copysign(0, -1) is now constant folded to a load of -0.0. Requires a fix to the ARM assembler which loaded -0.0 as +0.0. Change-Id: I52649a4691077c7414f19d17bb599a6743c23ac2 Reviewed-on: https://go-review.googlesource.com/62250 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-09-08 17:24:03 +00:00
Keith Randall	aed1c119fd	cmd/compile: fix assembly test Bad merge, missed changing to keyed literal structs. Bug introduced in CL 56252 Change-Id: I55cccff4990bd25e6387f6c90919ee5866900d7f Reviewed-on: https://go-review.googlesource.com/61290 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-09-03 16:24:24 +00:00
Cholerae Hu	fb165eaffd	cmd/compile: combine xn - yn into (x-y)*n Do the similar thing to CL 55143 to reduce IMUL. Change-Id: I1bd38f618058e3cd74fac181f003610ea13f2294 Reviewed-on: https://go-review.googlesource.com/56252 Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-09-03 14:29:38 +00:00
Cherry Zhang	7846500a5a	cmd/compile: remove redundant constant shift rules Normal shift rules plus constant folding are enough to generate efficient shift-by-constant instructions. Add test to make sure we don't generate comparisons for constant shifts. TODO: there are still constant shift rules on PPC64. If they are removed, the constant folding rules are not enough to remove all the test and mask stuff for constant shifts. Leave them in for now. Fixes #20663. Change-Id: I724cc324aa8607762d0c8aacf9bfa641bda5c2a1 Reviewed-on: https://go-review.googlesource.com/60330 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-08-31 02:08:48 +00:00
Keith Randall	2b079c3c04	cmd/compile: use keyed struct for asm tests Just to make it clearer which regexps are positive and which regexps are negative. Change-Id: Ia190e89be28048fcae2491506f552afad90a5f85 Reviewed-on: https://go-review.googlesource.com/59490 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-08-28 17:34:25 +00:00
David du Colombier	adbfdfe377	cmd/compile: don't use MOVOstore for move on plan9/amd64 The SSA compiler currently generates MOVOstore instructions to optimize 16 bytes moves on AMD64 architecture. However, we can't use the MOVOstore instruction on Plan 9, because floating point operations are not allowed in the note handler. We rely on the useSSE flag to disable the use of the MOVOstore instruction on Plan 9 and replace it by two MOVQstore instructions. Fixes #21625 Change-Id: Idfefcceadccafe1752b059b5fe113ce566c0e71c Reviewed-on: https://go-review.googlesource.com/59171 Run-TryBot: David du Colombier <0intro@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>	2017-08-28 16:21:28 +00:00
Ilya Tocar	9c99512d18	cmd/compile/internal/ssa: combine consecutive loads and stores on amd64 Sometimes (often for calls) we generate code like this: MOVQ (addr),AX MOVQ 8(addr),BX MOVQ AX,(otheraddr) MOVQ BX,8(otheraddr) Replace it with MOVUPS (addr),X0 MOVUPS X0,(otheraddr) For completeness do the same for 8,16,32-bit loads/stores too. Shaves 1% from code sections of go tool. /localdisk/itocar/golang/bin/go 10293917 go_old 10334877 [40960 bytes] read-only data = 682 bytes (0.040769%) global text (code) = 38961 bytes (1.036503%) Total difference 39643 bytes (0.674628%) Updates #6853 Change-Id: I1f0d2f60273a63a079b58927cd1c4e3429d2e7ae Reviewed-on: https://go-review.googlesource.com/57130 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-08-25 20:05:17 +00:00
Keith Randall	fb05948d9e	cmd/compile,math: improve code generation for math.Abs Implement int reg <-> fp reg moves on amd64. If we see a load to int reg followed by an int->fp move, then we can just load to the fp reg instead. Same for stores. math.Abs is now: MOVQ "".x+8(SP), AX SHLQ $1, AX SHRQ $1, AX MOVQ AX, "".~r1+16(SP) math.Copysign is now: MOVQ "".x+8(SP), AX SHLQ $1, AX SHRQ $1, AX MOVQ "".y+16(SP), CX SHRQ $63, CX SHLQ $63, CX ORQ CX, AX MOVQ AX, "".~r2+24(SP) math.Float64bits is now: MOVSD "".x+8(SP), X0 MOVSD X0, "".~r1+16(SP) (it would be nicer to use a non-SSE reg for this, nothing is perfect) And due to the fix for #21440, the inlined version of these improve as well. name old time/op new time/op delta Abs 1.38ns ± 5% 0.89ns ±10% -35.54% (p=0.000 n=10+10) Copysign 1.56ns ± 7% 1.35ns ± 6% -13.77% (p=0.000 n=9+10) Fixes #13095 Change-Id: Ibd7f2792412a6668608780b0688a77062e1f1499 Reviewed-on: https://go-review.googlesource.com/58732 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>	2017-08-25 19:15:01 +00:00
Michael Munday	744ebfde04	cmd/compile: eliminate stores to unread auto variables This is a crude compiler pass to eliminate stores to auto variables that are only ever written to. Eliminates an unnecessary store to x from the following code: func f() int { x := 1 return *(&x) } Fixes #19765. Change-Id: If2c63a8ae67b8c590b6e0cc98a9610939a3eeffa Reviewed-on: https://go-review.googlesource.com/38746 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-08-24 16:53:56 +00:00
Alberto Donizetti	8bca7ef607	cmd/compile: support placeholder name '$' in code generation tests This change adds to the code-generation harness in asm_test.go support for the use of a '$' placeholder name for test functions. A few uninformative function names are also changed to use the placeholder, to confirm that the change works as expected. Fixes #21500 Change-Id: Iba168bd85efc9822253305d003b06682cf8a6c5c Reviewed-on: https://go-review.googlesource.com/57292 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-08-22 19:42:32 +00:00
Ilya Tocar	da34ddf24b	cmd/compile/internal/ssa: combine more const stores We already combine const stores up-to MOVQstoreconst. Combine 2 64-bit stores of const zero into 1 sse store of 128-bit zero. Shaves significant (>1%) amount of code from go tool: /localdisk/itocar/golang/bin/go 10334877 go_old 10388125 [53248 bytes] global text (code) = 51041 bytes (1.343944%) read-only data = 663 bytes (0.039617%) Total difference 51704 bytes (0.873981%) Change-Id: I7bc40968023c3a69f379b10fbb433cdb11364f1b Reviewed-on: https://go-review.googlesource.com/56250 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Giovanni Bajo <rasky@develer.com> Reviewed-by: Keith Randall <khr@golang.org>	2017-08-17 17:40:40 +00:00
Alberto Donizetti	a0453a180f	cmd/compile: combine xn + yn into (x+y)n There are a few cases where this can be useful. Apart from the obvious (and silly) 100n + 200n where we generate one IMUL instead of two, consider: 15n + 31n Currently, the compiler strength-reduces both imuls, generating: 0x0000 00000 MOVQ "".n+8(SP), AX 0x0005 00005 MOVQ AX, CX 0x0008 00008 SHLQ $4, AX 0x000c 00012 SUBQ CX, AX 0x000f 00015 MOVQ CX, DX 0x0012 00018 SHLQ $5, CX 0x0016 00022 SUBQ DX, CX 0x0019 00025 ADDQ CX, AX 0x001c 00028 MOVQ AX, "".~r1+16(SP) 0x0021 00033 RET But combining the imuls is both faster and shorter: 0x0000 00000 MOVQ "".n+8(SP), AX 0x0005 00005 IMULQ $46, AX 0x0009 00009 MOVQ AX, "".~r1+16(SP) 0x000e 00014 RET even without strength-reduction. Moreover, consider: 5n + 7(n+1) + 11(n+2) We already have a rule that rewrites 7(n+1) into 7n+7, so the generated code (without imuls merging) looks like this: 0x0000 00000 MOVQ "".n+8(SP), AX 0x0005 00005 LEAQ (AX)(AX4), CX 0x0009 00009 MOVQ AX, DX 0x000c 00012 NEGQ AX 0x000f 00015 LEAQ (AX)(DX8), AX 0x0013 00019 ADDQ CX, AX 0x0016 00022 LEAQ (DX)(CX2), CX 0x001a 00026 LEAQ 29(AX)(CX1), AX 0x001f 00031 MOVQ AX, "".~r1+16(SP) But with imuls merging, the 5n, 7n and 11n factors get merged, and the generated code looks like this: 0x0000 00000 MOVQ "".n+8(SP), AX 0x0005 00005 IMULQ $23, AX 0x0009 00009 ADDQ $29, AX 0x000d 00013 MOVQ AX, "".~r1+16(SP) 0x0012 00018 RET Which is both faster and shorter; that's also the exact same code that clang and the intel c compiler generate for the above expression. Change-Id: Ib4d5503f05d2f2efe31a1be14e2fe6cac33730a9 Reviewed-on: https://go-review.googlesource.com/55143 Reviewed-by: Keith Randall <khr@golang.org>	2017-08-16 16:51:59 +00:00
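The algebra behind this rewrite can be verified directly. The sketch below uses hypothetical function names and simply checks that the merged forms from the commit message (46n and 23n + 29) agree with the original expressions; it does not inspect the generated assembly.

```go
package main

import "fmt"

// Two-term case from the commit message: 15n + 31n == 46n.
func sumSeparate(n int64) int64 { return 15*n + 31*n }
func sumMerged(n int64) int64   { return 46 * n }

// Three-term case: 5n + 7(n+1) + 11(n+2) == 23n + 29,
// since 5+7+11 = 23 and 7*1 + 11*2 = 29.
func polySeparate(n int64) int64 { return 5*n + 7*(n+1) + 11*(n+2) }
func polyMerged(n int64) int64   { return 23*n + 29 }

func main() {
	for _, n := range []int64{-3, 0, 1, 1000} {
		fmt.Println(n, sumSeparate(n), sumMerged(n), polySeparate(n), polyMerged(n))
	}
}
```

Because Go integer arithmetic wraps modulo 2^64, the identities hold even when intermediate products overflow, so the rewrite needs no range check.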
Cherry Zhang	f20944de78	cmd/compile: set/unset base register for better assembly print For address of an auto or arg, on all non-x86 architectures the assembler backend encodes the actual SP offset in the instruction but leaves the offset in Prog unchanged. When the assembly is printed in compile -S, it shows an offset relative to pseudo FP/SP with an actual hardware SP base register (e.g. R13 on ARM). This is confusing. Unset the base register if it is indeed SP, so the assembly output is consistent. If the base register isn't SP, it should be an error and the error output contains the actual base register. For address loading instructions, the base register isn't set in the compiler on non-x86 architectures. Set it. Normally it is SP and will be unset in the change mentioned above for printing. If it is not, it will be an error and the error output contains the actual base register. No change in generated binary, only printed assembly. Passes "go build -a -toolexec 'toolstash -cmp' std cmd" on all architectures. Fixes #21064. Change-Id: Ifafe8d5f9b437efbe824b63b3cbc2f5f6cdc1fd5 Reviewed-on: https://go-review.googlesource.com/49432 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>	2017-08-02 12:24:02 +00:00
Ilya Tocar	3bdc2f3abf	cmd/compile/internal/gc: speed-up small array comparison Currently we inline array comparisons for arrays with at most 4 elements. Compare arrays with small size, but more than 4 elements (e. g. [16]byte) with larger compares. This provides very slightly smaller binaries, and results in faster code. ArrayEqual-6 7.41ns ± 0% 3.17ns ± 0% -57.15% (p=0.000 n=10+10) For go tool: global text (code) = -559 bytes (-0.014566%) This also helps mapaccess1_faststr, and maps in general: MapDelete/Str/1-6 195ns ± 1% 186ns ± 2% -4.47% (p=0.000 n=10+10) MapDelete/Str/2-6 211ns ± 1% 177ns ± 1% -16.01% (p=0.000 n=10+10) MapDelete/Str/4-6 225ns ± 1% 183ns ± 1% -18.49% (p=0.000 n=8+10) MapStringKeysEight_16-6 31.3ns ± 0% 28.6ns ± 0% -8.63% (p=0.000 n=6+9) MapStringKeysEight_32-6 29.2ns ± 0% 27.6ns ± 0% -5.45% (p=0.000 n=10+10) MapStringKeysEight_64-6 29.1ns ± 1% 27.5ns ± 0% -5.46% (p=0.000 n=10+10) MapStringKeysEight_1M-6 29.1ns ± 1% 27.6ns ± 0% -5.49% (p=0.000 n=10+10) Change-Id: I9ec98e41b233031e0e96c4e13d86a324f628ed4a Reviewed-on: https://go-review.googlesource.com/40771 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-06-01 15:46:16 +00:00
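The kind of comparison affected by this change looks like the following at the source level. `equal16` is a hypothetical name; the `[16]byte` type is the example given in the commit message, and the instructions actually emitted depend on the compiler version and architecture.

```go
package main

import "fmt"

// equal16 compares two small arrays with ==. A [16]byte has more than
// four elements but is only 16 bytes wide, which is the case the commit
// describes as being compared with larger loads instead of per element.
func equal16(a, b [16]byte) bool {
	return a == b
}

func main() {
	var a, b [16]byte
	fmt.Println(equal16(a, b))
	b[15] = 1
	fmt.Println(equal16(a, b))
}
```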
Josh Bleecher Snyder	ee69c21747	cmd/compile: don't use statictmps for SSA-able composite literals The writebarrier test has to change. Now that T23 composite literals are passed to the backend, they get SSA'd, so writes to their fields are treated separately, so the relevant part of the first write to t23 is now a dead store. Preserve the intent of the test by splitting it up into two functions. Reduces code size a bit: name old object-bytes new object-bytes delta Template 386k ± 0% 386k ± 0% ~ (all equal) Unicode 202k ± 0% 202k ± 0% ~ (all equal) GoTypes 1.16M ± 0% 1.16M ± 0% ~ (all equal) Compiler 3.92M ± 0% 3.91M ± 0% -0.19% (p=0.008 n=5+5) SSA 7.91M ± 0% 7.91M ± 0% ~ (all equal) Flate 228k ± 0% 228k ± 0% -0.05% (p=0.008 n=5+5) GoParser 283k ± 0% 283k ± 0% ~ (all equal) Reflect 952k ± 0% 952k ± 0% -0.06% (p=0.008 n=5+5) Tar 188k ± 0% 188k ± 0% -0.09% (p=0.008 n=5+5) XML 406k ± 0% 406k ± 0% -0.02% (p=0.008 n=5+5) [Geo mean] 649k 648k -0.04% Fixes #18872 Change-Id: Ifeed0f71f13849732999aa731cc2bf40c0f0e32a Reviewed-on: https://go-review.googlesource.com/43154 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-05-11 18:28:40 +00:00
Cherry Zhang	fb0ccc5d0a	cmd/internal/obj/arm64, cmd/compile: improve offset folding on ARM64 The ARM64 assembler backend only accepts loads and stores with small or aligned offsets. The compiler therefore can only fold small or aligned offsets into loads and stores. For locals and args, their offsets to SP are not known until very late, and the compiler makes the conservative decision not to fold some of them. However, in most cases, the offset is indeed small or aligned, and could be folded into the load or store (but currently is not). This CL adds support for loads and stores with large and unaligned offsets. When the offset doesn't fit into the instruction, it uses two instructions and (for very large offsets) the constant pool. This way, the compiler doesn't need to be conservative, and can simply fold the offset. To make it work, the assembler's optab matching rules need to be changed. Before, MOVD accepted C_UAUTO32K, which matches multiples of 8 between 0 and 32K, and also C_UAUTO16K, which may not be a multiple of 8 and does not fit into a MOVD instruction. The assembler errors in the latter case. This change makes it match only multiples of 8 (or offsets within ±256, which also fit in the instruction), and uses the large-or-unaligned-offset rule for things that don't fit (without error). Other sized move rules are changed similarly. Classes C_UAUTO64K and C_UOREG64K are removed, as they are never used. In shared library mode, load/store of a global is rewritten to use the GOT and a temp register, which conflicts with the use of the temp register for assembling large offsets. So the folding is disabled for globals in shared library mode. Reduces cmd/go binary size by 2%. 
name old time/op new time/op delta BinaryTree17-8 8.67s ± 0% 8.61s ± 0% -0.60% (p=0.000 n=9+10) Fannkuch11-8 6.24s ± 0% 6.19s ± 0% -0.83% (p=0.000 n=10+9) FmtFprintfEmpty-8 116ns ± 0% 116ns ± 0% ~ (all equal) FmtFprintfString-8 196ns ± 0% 192ns ± 0% -1.89% (p=0.000 n=10+10) FmtFprintfInt-8 199ns ± 0% 198ns ± 0% -0.35% (p=0.001 n=9+10) FmtFprintfIntInt-8 294ns ± 0% 293ns ± 0% -0.34% (p=0.000 n=8+8) FmtFprintfPrefixedInt-8 318ns ± 1% 318ns ± 1% ~ (p=1.000 n=10+10) FmtFprintfFloat-8 537ns ± 0% 531ns ± 0% -1.17% (p=0.000 n=9+10) FmtManyArgs-8 1.19µs ± 1% 1.18µs ± 1% -1.41% (p=0.001 n=10+10) GobDecode-8 17.2ms ± 1% 17.3ms ± 2% ~ (p=0.165 n=10+10) GobEncode-8 14.7ms ± 1% 14.7ms ± 2% ~ (p=0.631 n=10+10) Gzip-8 837ms ± 0% 836ms ± 0% -0.14% (p=0.006 n=9+10) Gunzip-8 141ms ± 0% 139ms ± 0% -1.24% (p=0.000 n=9+10) HTTPClientServer-8 256µs ± 1% 253µs ± 1% -1.35% (p=0.000 n=10+10) JSONEncode-8 40.1ms ± 1% 41.3ms ± 1% +3.06% (p=0.000 n=10+9) JSONDecode-8 157ms ± 1% 156ms ± 1% -0.83% (p=0.001 n=9+8) Mandelbrot200-8 8.94ms ± 0% 8.94ms ± 0% +0.02% (p=0.000 n=9+9) GoParse-8 8.69ms ± 0% 8.54ms ± 1% -1.69% (p=0.000 n=8+10) RegexpMatchEasy0_32-8 227ns ± 1% 228ns ± 1% +0.48% (p=0.016 n=10+9) RegexpMatchEasy0_1K-8 1.92µs ± 0% 1.63µs ± 0% -15.08% (p=0.000 n=10+9) RegexpMatchEasy1_32-8 256ns ± 0% 251ns ± 0% -2.19% (p=0.000 n=10+9) RegexpMatchEasy1_1K-8 2.38µs ± 0% 2.09µs ± 0% -12.49% (p=0.000 n=10+9) RegexpMatchMedium_32-8 352ns ± 0% 354ns ± 0% +0.39% (p=0.002 n=10+9) RegexpMatchMedium_1K-8 106µs ± 0% 106µs ± 0% -0.05% (p=0.005 n=10+9) RegexpMatchHard_32-8 5.92µs ± 0% 5.89µs ± 0% -0.40% (p=0.000 n=9+8) RegexpMatchHard_1K-8 180µs ± 0% 179µs ± 0% -0.14% (p=0.000 n=10+9) Revcomp-8 1.20s ± 0% 1.13s ± 0% -6.29% (p=0.000 n=9+8) Template-8 159ms ± 1% 154ms ± 1% -3.14% (p=0.000 n=9+10) TimeParse-8 800ns ± 3% 769ns ± 1% -3.91% (p=0.000 n=10+10) TimeFormat-8 826ns ± 2% 817ns ± 2% -1.04% (p=0.050 n=10+10) [Geo mean] 145µs 143µs -1.79% Change-Id: I5fc42087cee9b54ea414f8ef6d6d020b80eb5985 Reviewed-on: 
https://go-review.googlesource.com/42172 Run-TryBot: Cherry Zhang <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com>	2017-05-09 19:41:00 +00:00
Martin Möhrmann	f9bec9eb42	cmd/compile: use MOVL instead of MOVQ for small constants on amd64 The encoding of MOVL to a register is 2 bytes shorter than for MOVQ. The upper 32bit are automatically zeroed when MOVL to a register is used. Replaces 1657 MOVQ by MOVL in the go binary. Reduces go binary size by 4 kilobyte. name old time/op new time/op delta BinaryTree17 1.93s ± 0% 1.93s ± 0% -0.32% (p=0.000 n=9+9) Fannkuch11 2.66s ± 0% 2.48s ± 0% -6.60% (p=0.000 n=9+9) FmtFprintfEmpty 31.8ns ± 0% 31.6ns ± 0% -0.63% (p=0.000 n=10+10) FmtFprintfString 52.0ns ± 0% 51.9ns ± 0% -0.19% (p=0.000 n=10+10) FmtFprintfInt 55.6ns ± 0% 54.6ns ± 0% -1.80% (p=0.002 n=8+10) FmtFprintfIntInt 87.7ns ± 0% 84.8ns ± 0% -3.31% (p=0.000 n=9+9) FmtFprintfPrefixedInt 98.9ns ± 0% 102.0ns ± 0% +3.10% (p=0.000 n=10+10) FmtFprintfFloat 165ns ± 0% 164ns ± 0% -0.61% (p=0.000 n=10+10) FmtManyArgs 368ns ± 0% 361ns ± 0% -1.98% (p=0.000 n=8+10) GobDecode 4.53ms ± 0% 4.58ms ± 0% +1.08% (p=0.000 n=9+10) GobEncode 3.74ms ± 0% 3.73ms ± 0% -0.27% (p=0.000 n=10+10) Gzip 164ms ± 0% 163ms ± 0% -0.48% (p=0.000 n=10+10) Gunzip 26.7ms ± 0% 26.6ms ± 0% -0.13% (p=0.000 n=9+10) HTTPClientServer 30.4µs ± 1% 30.3µs ± 1% -0.41% (p=0.016 n=10+10) JSONEncode 10.9ms ± 0% 11.0ms ± 0% +0.70% (p=0.000 n=10+10) JSONDecode 36.8ms ± 0% 37.0ms ± 0% +0.59% (p=0.000 n=9+10) Mandelbrot200 3.20ms ± 0% 3.21ms ± 0% +0.44% (p=0.000 n=9+10) GoParse 2.35ms ± 0% 2.35ms ± 0% +0.26% (p=0.000 n=10+9) RegexpMatchEasy0_32 58.3ns ± 0% 58.4ns ± 0% +0.17% (p=0.000 n=10+10) RegexpMatchEasy0_1K 138ns ± 0% 142ns ± 0% +2.68% (p=0.000 n=10+10) RegexpMatchEasy1_32 55.1ns ± 0% 55.6ns ± 1% ~ (p=0.104 n=10+10) RegexpMatchEasy1_1K 242ns ± 0% 243ns ± 0% +0.41% (p=0.000 n=10+10) RegexpMatchMedium_32 87.4ns ± 0% 89.9ns ± 0% +2.86% (p=0.000 n=10+10) RegexpMatchMedium_1K 27.4µs ± 0% 27.4µs ± 0% +0.15% (p=0.000 n=10+10) RegexpMatchHard_32 1.30µs ± 0% 1.32µs ± 1% +1.91% (p=0.000 n=10+10) RegexpMatchHard_1K 39.0µs ± 0% 39.5µs ± 0% +1.38% (p=0.000 n=10+10) Revcomp 
316ms ± 0% 319ms ± 0% +1.13% (p=0.000 n=9+8) Template 40.6ms ± 0% 40.6ms ± 0% ~ (p=0.123 n=10+10) TimeParse 224ns ± 0% 224ns ± 0% ~ (all equal) TimeFormat 230ns ± 0% 225ns ± 0% -2.17% (p=0.000 n=10+10) Change-Id: I32a099b65f9e6d4ad7288ed48546655c534757d8 Reviewed-on: https://go-review.googlesource.com/38630 Run-TryBot: Martin Möhrmann <moehrmann@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-05-01 20:59:58 +00:00
Lynn Boger	9248ff46a8	cmd/compile: add rotates to PPC64.rules This updates PPC64.rules to include rules to generate rotates for ADD, OR, XOR operators that combine two opposite shifts that sum to 32 or 64. To support this change opcodes for ROTL and ROTLW were added to be used like the rotldi and rotlwi extended mnemonics. This provides the following improvement in sha3: BenchmarkPermutationFunction-8 302.83 376.40 1.24x BenchmarkSha3_512_MTU-8 98.64 121.92 1.24x BenchmarkSha3_384_MTU-8 136.80 168.30 1.23x BenchmarkSha3_256_MTU-8 169.21 211.29 1.25x BenchmarkSha3_224_MTU-8 179.76 221.19 1.23x BenchmarkShake128_MTU-8 212.87 263.23 1.24x BenchmarkShake256_MTU-8 196.62 245.60 1.25x BenchmarkShake256_16x-8 163.57 194.37 1.19x BenchmarkShake256_1MiB-8 199.02 248.74 1.25x BenchmarkSha3_512_1MiB-8 106.55 133.13 1.25x Fixes #20030 Change-Id: I484c56f48395d32f53ff3ecb3ac6cb8191cfee44 Reviewed-on: https://go-review.googlesource.com/40992 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Michael Munday <munday@ca.ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-04-20 18:05:22 +00:00
Keith Randall	7e07e635f3	cmd/compile: implement non-constant rotates Makes math/bits.Rotate{Left,Right} fast on amd64. name old time/op new time/op delta RotateLeft-12 7.42ns ± 6% 5.45ns ± 6% -26.54% (p=0.000 n=9+10) RotateLeft8-12 4.77ns ± 5% 3.42ns ± 7% -28.25% (p=0.000 n=8+10) RotateLeft16-12 4.82ns ± 8% 3.40ns ± 7% -29.36% (p=0.000 n=10+10) RotateLeft32-12 4.87ns ± 7% 3.48ns ± 7% -28.51% (p=0.000 n=8+9) RotateLeft64-12 5.23ns ±10% 3.35ns ± 6% -35.97% (p=0.000 n=9+10) RotateRight-12 7.59ns ± 8% 5.71ns ± 1% -24.72% (p=0.000 n=10+8) RotateRight8-12 4.98ns ± 7% 3.36ns ± 9% -32.55% (p=0.000 n=10+10) RotateRight16-12 5.12ns ± 2% 3.45ns ± 5% -32.62% (p=0.000 n=10+10) RotateRight32-12 4.80ns ± 6% 3.42ns ±16% -28.68% (p=0.000 n=10+10) RotateRight64-12 4.78ns ± 6% 3.42ns ± 6% -28.50% (p=0.000 n=10+10) Update #18940 Change-Id: Ie79fb5581c489ed4d3b859314c5e669a134c119b Reviewed-on: https://go-review.googlesource.com/39711 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>	2017-04-17 23:19:45 +00:00
Josh Bleecher Snyder	3d0a898385	cmd/compile: improve output when TestAssembly build fails Change-Id: Ibee84399d81463d3e7d5319626bb0d6b60b86bd9 Reviewed-on: https://go-review.googlesource.com/40861 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2017-04-17 03:12:34 +00:00
Josh Bleecher Snyder	0d36999a0f	cmd/compile: make TestAssembly resilient to output ordering To preserve reproducible builds, the text entries during compilation will be sorted before being printed. TestAssembly currently assumes that function init comes after all user-defined functions. Remove that assumption. Instead of looking for "TEXT" to tell you where a function ends--which may now yield lots of non-function-code junk--look for a line beginning with non-whitespace. Updates #15756 Change-Id: Ibc82dba6143d769ef4c391afc360e523b1a51348 Reviewed-on: https://go-review.googlesource.com/39853 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Matthew Dempsky <mdempsky@google.com>	2017-04-13 02:30:29 +00:00
Ilya Tocar	e4a500ce14	cmd/compile/internal/gc: improve comparison with constant strings Currently we expand comparison with small constant strings into len check and a sequence of byte comparisons. Generate 16/32/64-bit comparisons, instead of bytewise on 386 and amd64. Also increase limits on what is considered small constant string. Shaves ~30kb (0.5%) from go executable. This also updates test/prove.go to keep test case valid. Change-Id: I99ae8871a1d00c96363c6d03d0b890782fa7e1d9 Reviewed-on: https://go-review.googlesource.com/38776 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>	2017-04-07 15:40:25 +00:00
Cherry Zhang	257b01f8f4	cmd/compile: use ANDconst to mask out leading/trailing bits on ARM64 For an AND that masks out leading or trailing bits, generic rules rewrite it to a pair of shifts. On ARM64, the mask actually can fit into an AND instruction. So we rewrite it back to AND. Fixes #19857. Change-Id: I479d7320ae4f29bb3f0056d5979bde4478063a8f Reviewed-on: https://go-review.googlesource.com/39651 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>	2017-04-06 17:59:32 +00:00
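The ARM64 commit above relies on a mask of leading bits being expressible either as an AND or as a pair of shifts. A small sketch of that equivalence follows; the function names and the choice of an 8-bit mask are assumptions made for illustration, not taken from the commit.

```go
package main

import "fmt"

// clearTop8Shifts clears the top 8 bits with a left/right shift pair.
func clearTop8Shifts(x uint64) uint64 {
	return x << 8 >> 8
}

// clearTop8And clears the top 8 bits with a single AND against a mask.
func clearTop8And(x uint64) uint64 {
	return x & 0x00FFFFFFFFFFFFFF
}

func main() {
	for _, x := range []uint64{0, 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0} {
		// The two columns should match for every input.
		fmt.Printf("%#016x -> shifts=%#016x and=%#016x\n",
			x, clearTop8Shifts(x), clearTop8And(x))
	}
}
```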
Keith Randall	5cadc91b3c	cmd/compile: intrinsics for math/bits.OnesCount Popcount instructions on amd64 are not guaranteed to be present, so we must guard their call. Rewrite rules can't generate control flow at the moment, so the intrinsifier needs to generate that code. name old time/op new time/op delta OnesCount-8 2.47ns ± 5% 1.04ns ± 2% -57.70% (p=0.000 n=10+10) OnesCount16-8 1.05ns ± 1% 0.78ns ± 0% -25.56% (p=0.000 n=9+8) OnesCount32-8 1.63ns ± 5% 1.04ns ± 2% -35.96% (p=0.000 n=10+10) OnesCount64-8 2.45ns ± 0% 1.04ns ± 1% -57.55% (p=0.000 n=6+10) Update #18616 Change-Id: I4aff2cc9aa93787898d7b22055fe272a7cf95673 Reviewed-on: https://go-review.googlesource.com/38320 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-04-04 02:40:11 +00:00
Keith Randall	63a72fd447	cmd/compile: strength-reduce floating point x*2 -> x+x; x/c, c power of 2 -> x*(1/c) Fixes #19827 Change-Id: I74c9f0b5b49b2ed26c0990314c7d1d5f9631b6f1 Reviewed-on: https://go-review.googlesource.com/39295 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>	2017-04-03 21:27:03 +00:00
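The strength reduction in this commit rests on two floating-point identities: doubling a value equals adding it to itself, and dividing by a power of two equals multiplying by its reciprocal, because that reciprocal is exactly representable. The sketch below checks both identities on a few values; the function names and the divisor 8 are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

func double(x float64) float64      { return x * 2 }
func addSelf(x float64) float64     { return x + x }
func divBy8(x float64) float64      { return x / 8 }
func mulByEighth(x float64) float64 { return x * 0.125 } // 1/8 is exact in binary

func main() {
	values := []float64{0, 1.5, -3.75, math.MaxFloat64, math.SmallestNonzeroFloat64}
	for _, x := range values {
		fmt.Printf("x=%g double==addSelf: %v divBy8==mulByEighth: %v\n",
			x, double(x) == addSelf(x), divBy8(x) == mulByEighth(x))
	}
}
```

The same rewrite is not valid for a divisor such as 3, since 1/3 has no exact binary representation and the two expressions can round differently.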
Keith Randall	86dc86b4f9	cmd/compile: don't merge load+op if other op arg is still live We want to merge a load and op into a single instruction l = LOAD ptr mem y = OP x l into y = OPload x ptr mem However, all of our OPload instructions require that y uses the same register as x. If x is needed past this instruction, then we must copy x somewhere else, losing the whole benefit of merging the instructions in the first place. Disable this optimization if x is live past the OP. Also disable this optimization if the OP is in a deeper loop than the load. Update #19595 Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a Reviewed-on: https://go-review.googlesource.com/38337 Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>	2017-03-23 15:53:04 +00:00
Michael Munday	17570a9afb	cmd/compile: emit fused multiply-{add,subtract} on ppc64x A follow on to CL 36963 adding support for ppc64x. Performance changes (as posted on the issue): poly1305: benchmark old ns/op new ns/op delta Benchmark64-16 172 151 -12.21% Benchmark1K-16 1828 1523 -16.68% Benchmark64Unaligned-16 172 151 -12.21% Benchmark1KUnaligned-16 1827 1523 -16.64% math: BenchmarkAcos-16 43.9 39.9 -9.11% BenchmarkAcosh-16 57.0 45.8 -19.65% BenchmarkAsin-16 35.8 33.0 -7.82% BenchmarkAsinh-16 68.6 60.8 -11.37% BenchmarkAtan-16 19.8 16.2 -18.18% BenchmarkAtanh-16 65.5 57.5 -12.21% BenchmarkAtan2-16 45.4 34.2 -24.67% BenchmarkGamma-16 37.6 26.0 -30.85% BenchmarkLgamma-16 40.0 28.2 -29.50% BenchmarkLog1p-16 35.1 29.1 -17.09% BenchmarkSin-16 22.7 18.4 -18.94% BenchmarkSincos-16 31.7 23.7 -25.24% BenchmarkSinh-16 146 131 -10.27% BenchmarkY0-16 130 107 -17.69% BenchmarkY1-16 127 107 -15.75% BenchmarkYn-16 278 235 -15.47% Updates #17895. Change-Id: I1c16199715d20c9c4bd97c4a950bcfa69eb688c1 Reviewed-on: https://go-review.googlesource.com/38095 Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>	2017-03-20 20:01:29 +00:00
Keith Randall	495b167919	cmd/compile: intrinsics for math/bits.{Len,LeadingZeros} name old time/op new time/op delta LeadingZeros-4 2.00ns ± 0% 1.34ns ± 1% -33.02% (p=0.000 n=8+10) LeadingZeros16-4 1.62ns ± 0% 1.57ns ± 0% -3.09% (p=0.001 n=8+9) LeadingZeros32-4 2.14ns ± 0% 1.48ns ± 0% -30.84% (p=0.002 n=8+10) LeadingZeros64-4 2.06ns ± 1% 1.33ns ± 0% -35.08% (p=0.000 n=8+8) 8-bit args is a special case - the Go code is really fast because it is just a single table lookup. So I've disabled that for now. Intrinsics were actually slower: LeadingZeros8-4 1.22ns ± 3% 1.58ns ± 1% +29.56% (p=0.000 n=10+10) Update #18616 Change-Id: Ia9c289b9ba59c583ea64060470315fd637e814cf Reviewed-on: https://go-review.googlesource.com/38311 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-03-16 22:53:49 +00:00
Keith Randall	dd9892e31b	cmd/compile: intrinsify math/bits.ReverseBytes Update #18616 Change-Id: I0c2d643cbbeb131b4c9b12194697afa4af48e1d2 Reviewed-on: https://go-review.googlesource.com/38166 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-03-16 19:41:56 +00:00
Keith Randall	d5dc490519	cmd/compile: intrinsics for math/bits.TrailingZerosX Implement math/bits.TrailingZerosX using intrinsics. Generally reorganize the intrinsic spec a bit. The intrinsics data structure is now built at init time. This will make doing the other functions in math/bits easier. Update sys.CtzX to return int instead of uint{64,32} so it matches math/bits.TrailingZerosX. Improve the intrinsics a bit for amd64. We don't need the CMOV for <64 bit versions. Update #18616 Change-Id: Ic1c5339c943f961d830ae56f12674d7b29d4ff39 Reviewed-on: https://go-review.googlesource.com/38155 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-03-16 02:44:16 +00:00
Josh Bleecher Snyder	3a90bfb253	cmd/dist, cmd/compile: eliminate mergeEnvLists copies This is now handled by os/exec. Updates #12868 Change-Id: Ic21a6ff76a9b9517437ff1acf3a9195f9604bb45 Reviewed-on: https://go-review.googlesource.com/37698 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2017-03-02 22:26:23 +00:00
Josh Bleecher Snyder	2183135554	cmd/compile: recognize bit test patterns on amd64 Updates #18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08 Reviewed-on: https://go-review.googlesource.com/36329 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-03-01 00:36:04 +00:00
Michael Munday	bd8a39b67a	cmd/compile: emit fused multiply-{add,subtract} instructions on s390x Explicitly block fused multiply-add pattern matching when a cast is used after the multiplication, for example: - (a * b) + c // can emit fused multiply-add - float64(a * b) + c // cannot emit fused multiply-add float{32,64} and complex{64,128} casts of matching types are now kept as OCONV operations rather than being replaced with OCONVNOP operations because they now imply a rounding operation (and therefore aren't a no-op anymore). Operations (for example, multiplication) on complex types may utilize fused multiply-add and -subtract instructions internally. There is no way to disable this behavior at the moment. Improves the performance of the floating point implementation of poly1305: name old speed new speed delta 64 246MB/s ± 0% 275MB/s ± 0% +11.48% (p=0.000 n=10+8) 1K 312MB/s ± 0% 357MB/s ± 0% +14.41% (p=0.000 n=10+10) 64Unaligned 246MB/s ± 0% 274MB/s ± 0% +11.43% (p=0.000 n=10+10) 1KUnaligned 312MB/s ± 0% 357MB/s ± 0% +14.39% (p=0.000 n=10+8) Updates #17895. Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16 Reviewed-on: https://go-review.googlesource.com/36963 Run-TryBot: Michael Munday <munday@ca.ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-02-28 15:34:20 +00:00
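The two expression forms named in this commit message can be written out as a small program. This is a sketch under assumptions: the function names and inputs are made up for illustration, and the printed values depend on whether the target architecture and Go version actually fuse the first form, so no expected output is given.

```go
package main

import (
	"fmt"
	"math"
)

// mayFuse has the form (a * b) + c, which the compiler is allowed
// to turn into a single fused multiply-add on supporting targets.
func mayFuse(a, b, c float64) float64 {
	return a*b + c
}

// noFuse has the form float64(a * b) + c. The explicit conversion
// rounds the product first, so fusion is not permitted here.
func noFuse(a, b, c float64) float64 {
	return float64(a*b) + c
}

func main() {
	// The exact product a*b is 1 - 2^-60, which rounds to 1 in float64,
	// so the two forms can differ when the first one is fused.
	a := 1 + math.Ldexp(1, -30)
	b := 1 - math.Ldexp(1, -30)
	c := -1.0
	fmt.Println("mayFuse:", mayFuse(a, b, c))
	fmt.Println("noFuse: ", noFuse(a, b, c))
}
```

Adding the explicit conversion is the way to request separately rounded multiply and add where bit-for-bit reproducibility across architectures matters.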
Josh Bleecher Snyder	e458264aca	cmd/compile: fix dolinkobj flag in TestAssembly Follow-up to CL 37270. This considerably reduces the time to run the test. Before: real 0m7.638s user 0m14.341s sys 0m2.244s After: real 0m4.867s user 0m7.107s sys 0m1.842s Change-Id: I8837a5da0979a1c365e1ce5874d81708249a4129 Reviewed-on: https://go-review.googlesource.com/37461 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Michael Munday <munday@ca.ibm.com>	2017-02-25 14:39:29 +00:00
Lorenzo Masini	fb1f47a77c	cmd/compile: speed up TestAssembly TestAssembly was very slow, leading to it being skipped by default. This is not surprising: it separately invoked the compiler and parsed the result many times. Now the test assembles one source file per arch/os combination, containing the relevant functions. Tests for each arch/os run in parallel. Now the test runs approximately 10x faster on my Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz. Fixes #18966 Change-Id: I45ab97630b627a32e17900c109f790eb4c0e90d9 Reviewed-on: https://go-review.googlesource.com/37270 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2017-02-24 21:23:43 +00:00
Kirill Smelkov	4477fd097f	cmd/compile/internal/ssa: combine 2 byte loads + shifts into word load + rolw 8 on AMD64 ... and same for stores. This does for binary.BigEndian.Uint16() what was already done for Uint32 and Uint64 with BSWAP in `10f75748` (CL 32222). Here is how generated code changes e.g. for the following function (omitting the unchanged prologue/epilogue): func get16(b [2]byte) uint16 { return binary.BigEndian.Uint16(b[:]) } "".get16 t=1 size=21 args=0x10 locals=0x0 // before 0x0000 00000 (x.go:15) MOVBLZX "".b+9(FP), AX 0x0005 00005 (x.go:15) MOVBLZX "".b+8(FP), CX 0x000a 00010 (x.go:15) SHLL $8, CX 0x000d 00013 (x.go:15) ORL CX, AX // after 0x0000 00000 (x.go:15) MOVWLZX "".b+8(FP), AX 0x0005 00005 (x.go:15) ROLW $8, AX encoding/binary is sped up a bit overall: name old time/op new time/op delta ReadSlice1000Int32s-4 4.83µs ± 0% 4.83µs ± 0% ~ (p=0.206 n=4+5) ReadStruct-4 1.29µs ± 2% 1.28µs ± 1% -1.27% (p=0.032 n=4+5) ReadInts-4 384ns ± 1% 385ns ± 1% ~ (p=0.968 n=4+5) WriteInts-4 534ns ± 3% 526ns ± 0% -1.54% (p=0.048 n=4+5) WriteSlice1000Int32s-4 5.02µs ± 0% 5.11µs ± 3% ~ (p=0.175 n=4+5) PutUint16-4 0.59ns ± 0% 0.49ns ± 2% -16.95% (p=0.016 n=4+5) PutUint32-4 0.52ns ± 0% 0.52ns ± 0% ~ (all equal) PutUint64-4 0.53ns ± 0% 0.53ns ± 0% ~ (all equal) PutUvarint32-4 19.9ns ± 0% 19.9ns ± 1% ~ (p=0.556 n=4+5) PutUvarint64-4 54.5ns ± 1% 54.2ns ± 0% ~ (p=0.333 n=4+5) name old speed new speed delta ReadSlice1000Int32s-4 829MB/s ± 0% 828MB/s ± 0% ~ (p=0.190 n=4+5) ReadStruct-4 58.0MB/s ± 2% 58.7MB/s ± 1% +1.30% (p=0.032 n=4+5) ReadInts-4 78.0MB/s ± 1% 77.8MB/s ± 1% ~ (p=0.968 n=4+5) WriteInts-4 56.1MB/s ± 3% 57.0MB/s ± 0% ~ (p=0.063 n=4+5) WriteSlice1000Int32s-4 797MB/s ± 0% 783MB/s ± 3% ~ (p=0.190 n=4+5) PutUint16-4 3.37GB/s ± 0% 4.07GB/s ± 2% +20.83% (p=0.016 n=4+5) PutUint32-4 7.73GB/s ± 0% 7.72GB/s ± 0% ~ (p=0.556 n=4+5) PutUint64-4 15.1GB/s ± 0% 15.1GB/s ± 0% ~ (p=0.905 n=4+5) PutUvarint32-4 201MB/s ± 0% 201MB/s ± 0% ~ (p=0.905 n=4+5) PutUvarint64-4 147MB/s 
± 1% 147MB/s ± 0% ~ (p=0.286 n=4+5) ( "a bit" only because most of the time is spent in reflection-like things there, not actual bytes decoding. Even for direct PutUint16 benchmark the looping adds overhead and lowers visible benefit. For code-generated encoders / decoders actual effect is more than 20% ) Adding Uint32 and Uint64 raw benchmarks too for completeness. NOTE I had to adjust load-combining rule for bswap case to match first 2 bytes loads as result of "2-bytes load+shift" -> "loadw + rorw 8" rewrite. Reason is: for loads+shift, even e.g. into uint16 var var b []byte var v uint16 v = uint16(b[1]) | uint16(b[0])<<8 the compiler eventually generates L(ong) shift - SHLLconst [8], probably because it is more straightforward / other reasons to work on the whole register. This way 2 bytes rewriting rule is using SHLLconst (not SHLWconst) in its pattern, and then it always gets matched first, even if 2-byte rule comes syntactically after 4-byte rule in AMD64.rules because 4-bytes rule seemingly needs more applyRewrite() cycles to trigger. If 2-bytes rule gets matched for inner half of var b []byte var v uint32 v = uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24 and we keep 4-byte load rule unchanged, the result will be MOVW + RORW $8 and then series of byte loads and shifts - not one MOVL + BSWAPL. There is no such problem for stores: there compiler, since it probably knows store destination is 2 bytes wide, uses SHRWconst 8 (not SHRLconst 8) and thus 2-byte store rule is not a subset of rule for 4-byte stores. Fixes #17151 (int16 was last missing piece there) Change-Id: Idc03ba965bfce2b94fef456b02ff6742194748f6 Reviewed-on: https://go-review.googlesource.com/34636 Reviewed-by: Ilya Tocar <ilya.tocar@intel.com> Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-02-14 22:17:08 +00:00
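The source-level pattern this commit targets can be reproduced in a few lines. The sketch below puts the get16 function quoted in the commit message next to the hand-written load-and-shift form it mentions; get16Manual is a hypothetical name introduced here, and the program only checks that both forms return the same value, not what instructions are emitted.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// get16 is the function quoted in the commit message.
func get16(b [2]byte) uint16 {
	return binary.BigEndian.Uint16(b[:])
}

// get16Manual spells out the two byte loads, the shift and the OR.
// Hypothetical helper, named here for illustration.
func get16Manual(b []byte) uint16 {
	return uint16(b[1]) | uint16(b[0])<<8
}

func main() {
	b := [2]byte{0x12, 0x34}
	fmt.Printf("library=%#04x manual=%#04x\n", get16(b), get16Manual(b[:]))
}
```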
Cherry Zhang	78200799a2	cmd/compile: undo special handling of zero-valued STRUCTLIT CL 35261 introduces special handling of zero-valued STRUCTLIT for efficient struct zeroing. But it didn't cover all use cases, for example, CONVNOP STRUCTLIT is not handled. On the other hand, CL 34566 handles zeroing earlier, so we don't need the change in CL 35261 for efficient zeroing. Other uses of zero-valued struct literals are very rare. So undo the change in walk.go in CL 35261. Add a test for efficient zeroing. Fixes #19084. Change-Id: I0807f7423fb44d47bf325b3c1ce9611a14953853 Reviewed-on: https://go-review.googlesource.com/36955 Reviewed-by: Matthew Dempsky <mdempsky@google.com> Reviewed-by: Keith Randall <khr@golang.org>	2017-02-14 18:57:56 +00:00
Kirill Smelkov	bd91e3569a	cmd/compile/internal/ssa: generate bswap/store for indexed bigendian byte stores too on AMD64 Commit `10f75748` (CL 32222) added rewrite rules to combine byte loads/stores + shifts into larger loads/stores + bswap. For loads both MOVBload and MOVBloadidx1 were handled but for store only MOVBstore was there without MOVBstoreidx added to rewrite pattern. Fix it. Here is how generated code changes for the following 2 functions (omitting the unchanged prologue/epilogue): func put32(b []byte, i int, v uint32) { binary.BigEndian.PutUint32(b[i:], v) } func put64(b []byte, i int, v uint64) { binary.BigEndian.PutUint64(b[i:], v) } "".put32 t=1 size=100 args=0x28 locals=0x0 // before 0x0032 00050 (x.go:5) MOVL CX, DX 0x0034 00052 (x.go:5) SHRL $24, CX 0x0037 00055 (x.go:5) MOVQ "".b+8(FP), BX 0x003c 00060 (x.go:5) MOVB CL, (BX)(AX*1) 0x003f 00063 (x.go:5) MOVL DX, CX 0x0041 00065 (x.go:5) SHRL $16, DX 0x0044 00068 (x.go:5) MOVB DL, 1(BX)(AX*1) 0x0048 00072 (x.go:5) MOVL CX, DX 0x004a 00074 (x.go:5) SHRL $8, CX 0x004d 00077 (x.go:5) MOVB CL, 2(BX)(AX*1) 0x0051 00081 (x.go:5) MOVB DL, 3(BX)(AX*1) // after 0x0032 00050 (x.go:5) BSWAPL CX 0x0034 00052 (x.go:5) MOVQ "".b+8(FP), DX 0x0039 00057 (x.go:5) MOVL CX, (DX)(AX*1) "".put64 t=1 size=155 args=0x28 locals=0x0 // before 0x0037 00055 (x.go:9) MOVQ CX, DX 0x003a 00058 (x.go:9) SHRQ $56, CX 0x003e 00062 (x.go:9) MOVQ "".b+8(FP), BX 0x0043 00067 (x.go:9) MOVB CL, (BX)(AX*1) 0x0046 00070 (x.go:9) MOVQ DX, CX 0x0049 00073 (x.go:9) SHRQ $48, DX 0x004d 00077 (x.go:9) MOVB DL, 1(BX)(AX*1) 0x0051 00081 (x.go:9) MOVQ CX, DX 0x0054 00084 (x.go:9) SHRQ $40, CX 0x0058 00088 (x.go:9) MOVB CL, 2(BX)(AX*1) 0x005c 00092 (x.go:9) MOVQ DX, CX 0x005f 00095 (x.go:9) SHRQ $32, DX 0x0063 00099 (x.go:9) MOVB DL, 3(BX)(AX*1) 0x0067 00103 (x.go:9) MOVQ CX, DX 0x006a 00106 (x.go:9) SHRQ $24, CX 0x006e 00110 (x.go:9) MOVB CL, 4(BX)(AX*1) 0x0072 00114 (x.go:9) MOVQ DX, CX 0x0075 00117 (x.go:9) SHRQ $16, DX 0x0079 00121 (x.go:9) MOVB DL, 5(BX)(AX*1) 0x007d 00125 (x.go:9) MOVQ CX, DX 0x0080 00128 (x.go:9) SHRQ $8, CX 0x0084 00132 (x.go:9) MOVB CL, 6(BX)(AX*1) 0x0088 00136 (x.go:9) MOVB DL, 7(BX)(AX*1) // after 0x0033 00051 (x.go:9) BSWAPQ CX 0x0036 00054 (x.go:9) MOVQ "".b+8(FP), DX 0x003b 00059 (x.go:9) MOVQ CX, (DX)(AX*1) Updates #17151 Change-Id: I3f4a7f28f210e62e153e60da5abd1d39508cc6c4 Reviewed-on: https://go-review.googlesource.com/34635 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>	2017-02-14 18:35:43 +00:00


117 commits