go/src/cmd/compile/internal/ssa/looprotate.go

104 lines
2.2 KiB
Go

cmd/compile: rotate loops so conditional branch is at the end Old loops look like this: loop: CMPQ ... JGE exit ... JMP loop exit: New loops look like this: JMP entry loop: ... entry: CMPQ ... JLT loop This removes one instruction (the unconditional jump) from the inner loop. Kinda surprisingly, it matters. This is a bit different than the peeling that the old obj library did in that we don't duplicate the loop exit test. We just jump to the test. I'm not sure if it is better or worse to do that (peeling gets rid of the JMP but means more code duplication), but this CL is certainly a much simpler compiler change, so I'll try this way first. The obj library used to do peeling before CL https://go-review.googlesource.com/c/36205 turned it off. Fixes #15837 (remove obj instruction reordering) The reordering is already removed, this CL implements the only part of that reordering that we'd like to keep. Fixes #14758 (append loop) name old time/op new time/op delta Foo-12 817ns ± 4% 538ns ± 0% -34.08% (p=0.000 n=10+9) Bar-12 850ns ±11% 570ns ±13% -32.88% (p=0.000 n=10+10) Update #19595 (BLAS slowdown) name old time/op new time/op delta DgemvMedMedNoTransIncN-12 13.2µs ± 9% 10.2µs ± 1% -22.26% (p=0.000 n=9+9) Fixes #19633 (append loop) name old time/op new time/op delta Foo-12 810ns ± 1% 540ns ± 0% -33.30% (p=0.000 n=8+9) Update #18977 (Fannkuch11 regression) name old time/op new time/op delta Fannkuch11-8 2.80s ± 0% 3.01s ± 0% +7.47% (p=0.000 n=9+10) This one makes no sense. There's strictly 1 less instruction in the inner loop (17 instead of 18). They are exactly the same instructions except for the JMP that has been elided. go1 benchmarks generally don't look very impressive. But the gains for the specific issues above make this CL still probably worth it. 
name old time/op new time/op delta BinaryTree17-8 2.32s ± 0% 2.34s ± 0% +1.14% (p=0.000 n=9+7) Fannkuch11-8 2.80s ± 0% 3.01s ± 0% +7.47% (p=0.000 n=9+10) FmtFprintfEmpty-8 44.1ns ± 1% 46.1ns ± 1% +4.53% (p=0.000 n=10+10) FmtFprintfString-8 67.8ns ± 0% 74.4ns ± 1% +9.80% (p=0.000 n=10+9) FmtFprintfInt-8 74.9ns ± 0% 78.4ns ± 0% +4.67% (p=0.000 n=8+10) FmtFprintfIntInt-8 117ns ± 1% 123ns ± 1% +4.69% (p=0.000 n=9+10) FmtFprintfPrefixedInt-8 160ns ± 1% 146ns ± 0% -8.22% (p=0.000 n=8+10) FmtFprintfFloat-8 214ns ± 0% 206ns ± 0% -3.91% (p=0.000 n=8+8) FmtManyArgs-8 468ns ± 0% 497ns ± 1% +6.09% (p=0.000 n=8+10) GobDecode-8 6.16ms ± 0% 6.21ms ± 1% +0.76% (p=0.000 n=9+10) GobEncode-8 4.90ms ± 0% 4.92ms ± 1% +0.37% (p=0.028 n=9+10) Gzip-8 209ms ± 0% 212ms ± 0% +1.33% (p=0.000 n=10+10) Gunzip-8 36.6ms ± 0% 38.0ms ± 1% +4.03% (p=0.000 n=9+9) HTTPClientServer-8 84.2µs ± 0% 86.0µs ± 1% +2.14% (p=0.000 n=9+9) JSONEncode-8 13.6ms ± 3% 13.8ms ± 1% +1.55% (p=0.003 n=9+10) JSONDecode-8 53.2ms ± 5% 52.9ms ± 0% ~ (p=0.280 n=10+10) Mandelbrot200-8 3.78ms ± 0% 3.78ms ± 1% ~ (p=0.661 n=10+9) GoParse-8 2.89ms ± 0% 2.94ms ± 2% +1.50% (p=0.000 n=10+10) RegexpMatchEasy0_32-8 68.5ns ± 2% 68.9ns ± 1% ~ (p=0.136 n=10+10) RegexpMatchEasy0_1K-8 220ns ± 1% 225ns ± 1% +2.41% (p=0.000 n=10+10) RegexpMatchEasy1_32-8 64.7ns ± 0% 64.5ns ± 0% -0.28% (p=0.042 n=10+10) RegexpMatchEasy1_1K-8 348ns ± 1% 355ns ± 0% +1.90% (p=0.000 n=10+10) RegexpMatchMedium_32-8 102ns ± 1% 105ns ± 1% +2.95% (p=0.000 n=10+10) RegexpMatchMedium_1K-8 33.1µs ± 3% 32.5µs ± 0% -1.75% (p=0.000 n=10+10) RegexpMatchHard_32-8 1.71µs ± 1% 1.70µs ± 1% -0.84% (p=0.002 n=10+9) RegexpMatchHard_1K-8 51.1µs ± 0% 50.8µs ± 1% -0.48% (p=0.004 n=10+10) Revcomp-8 411ms ± 1% 402ms ± 0% -2.22% (p=0.000 n=10+9) Template-8 61.8ms ± 1% 59.7ms ± 0% -3.44% (p=0.000 n=9+9) TimeParse-8 306ns ± 0% 318ns ± 0% +3.83% (p=0.000 n=10+10) TimeFormat-8 320ns ± 0% 318ns ± 1% -0.53% (p=0.012 n=7+10) Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b Reviewed-on: 
https://go-review.googlesource.com/38431 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
2017-03-21 14:51:38 -07:00
// Copyright 2017 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package ssa

// loopRotate converts loops with a check-loop-condition-at-beginning
// to loops with a check-loop-condition-at-end.
// This removes one unconditional jump from each loop iteration.
//
// Before:
//
//	loop:
//	  CMPQ ...
//	  JGE exit
//	  ...
//	  JMP loop
//	exit:
//
// After:
//
//	  JMP entry
//	loop:
//	  ...
//	entry:
//	  CMPQ ...
//	  JLT loop
func loopRotate(f *Func) {
	loopnest := f.loopnest()
	if len(loopnest.loops) == 0 {
		return
	}
cmd/compile: don't break up contiguous blocks in looprotate looprotate finds loop headers and arranges for them to be placed after the body of the loop. This eliminates a jump from the body. However, if the loop header is a series of contiguously laid out blocks, the rotation introduces a new jump in that series. This CL expands the "loop header" to move to be the entire run of contiguously laid out blocks in the same loop. This shrinks object files a little, and actually speeds up the compiler noticeably. Numbers below. Fannkuch performance seems to vary a lot by machine. On my laptop: name old time/op new time/op delta Fannkuch11-8 2.89s ± 2% 2.85s ± 3% -1.22% (p=0.000 n=50+50) This has a significant affect on the append benchmarks in #14758: name old time/op new time/op delta Foo-8 312ns ± 3% 276ns ± 2% -11.37% (p=0.000 n=30+29) Bar-8 565ns ± 2% 456ns ± 2% -19.27% (p=0.000 n=27+28) Updates #18977 Fixes #20355 name old time/op new time/op delta Template 205ms ± 5% 204ms ± 8% ~ (p=0.903 n=92+99) Unicode 85.3ms ± 4% 85.1ms ± 3% ~ (p=0.191 n=92+94) GoTypes 512ms ± 4% 507ms ± 4% -0.93% (p=0.000 n=95+97) Compiler 2.38s ± 3% 2.35s ± 3% -1.27% (p=0.000 n=98+95) SSA 4.67s ± 3% 4.64s ± 3% -0.62% (p=0.000 n=95+96) Flate 117ms ± 3% 117ms ± 3% ~ (p=0.099 n=84+86) GoParser 139ms ± 4% 137ms ± 4% -0.90% (p=0.000 n=97+98) Reflect 329ms ± 5% 326ms ± 6% -0.97% (p=0.002 n=99+98) Tar 102ms ± 6% 101ms ± 5% -0.97% (p=0.006 n=97+97) XML 198ms ±10% 196ms ±13% ~ (p=0.087 n=100+100) [Geo mean] 318ms 316ms -0.72% name old user-time/op new user-time/op delta Template 250ms ± 7% 250ms ± 7% ~ (p=0.850 n=94+92) Unicode 107ms ± 8% 106ms ± 5% -0.76% (p=0.005 n=98+91) GoTypes 665ms ± 5% 659ms ± 5% -0.85% (p=0.003 n=93+98) Compiler 3.15s ± 3% 3.10s ± 3% -1.60% (p=0.000 n=99+98) SSA 6.82s ± 3% 6.72s ± 4% -1.55% (p=0.000 n=94+98) Flate 138ms ± 8% 138ms ± 6% ~ (p=0.369 n=94+92) GoParser 170ms ± 5% 168ms ± 6% -1.13% (p=0.002 n=96+98) Reflect 412ms ± 8% 416ms ± 8% ~ (p=0.169 n=100+100) Tar 123ms ±18% 
123ms ±14% ~ (p=0.896 n=100+100) XML 236ms ± 9% 234ms ±11% ~ (p=0.124 n=100+100) [Geo mean] 401ms 398ms -0.63% name old alloc/op new alloc/op delta Template 38.8MB ± 0% 38.8MB ± 0% ~ (p=0.222 n=5+5) Unicode 28.7MB ± 0% 28.7MB ± 0% ~ (p=0.421 n=5+5) GoTypes 109MB ± 0% 109MB ± 0% ~ (p=0.056 n=5+5) Compiler 457MB ± 0% 457MB ± 0% +0.07% (p=0.008 n=5+5) SSA 1.10GB ± 0% 1.10GB ± 0% +0.05% (p=0.008 n=5+5) Flate 24.5MB ± 0% 24.5MB ± 0% ~ (p=0.222 n=5+5) GoParser 30.9MB ± 0% 31.0MB ± 0% +0.21% (p=0.016 n=5+5) Reflect 73.4MB ± 0% 73.4MB ± 0% ~ (p=0.421 n=5+5) Tar 25.5MB ± 0% 25.5MB ± 0% ~ (p=0.548 n=5+5) XML 40.9MB ± 0% 40.9MB ± 0% ~ (p=0.151 n=5+5) [Geo mean] 71.6MB 71.6MB +0.07% name old allocs/op new allocs/op delta Template 394k ± 0% 394k ± 0% ~ (p=1.000 n=5+5) Unicode 344k ± 0% 343k ± 0% ~ (p=0.310 n=5+5) GoTypes 1.16M ± 0% 1.16M ± 0% ~ (p=1.000 n=5+5) Compiler 4.42M ± 0% 4.42M ± 0% ~ (p=1.000 n=5+5) SSA 9.80M ± 0% 9.80M ± 0% ~ (p=0.095 n=5+5) Flate 237k ± 1% 238k ± 1% ~ (p=0.310 n=5+5) GoParser 320k ± 0% 322k ± 1% +0.50% (p=0.032 n=5+5) Reflect 958k ± 0% 957k ± 0% ~ (p=0.548 n=5+5) Tar 252k ± 1% 252k ± 0% ~ (p=1.000 n=5+5) XML 400k ± 0% 400k ± 0% ~ (p=0.841 n=5+5) [Geo mean] 741k 742k +0.06% name old object-bytes new object-bytes delta Template 386k ± 0% 386k ± 0% -0.05% (p=0.008 n=5+5) Unicode 202k ± 0% 202k ± 0% -0.01% (p=0.008 n=5+5) GoTypes 1.16M ± 0% 1.16M ± 0% -0.06% (p=0.008 n=5+5) Compiler 3.91M ± 0% 3.91M ± 0% -0.06% (p=0.008 n=5+5) SSA 7.91M ± 0% 7.92M ± 0% +0.01% (p=0.008 n=5+5) Flate 228k ± 0% 227k ± 0% -0.04% (p=0.008 n=5+5) GoParser 283k ± 0% 283k ± 0% -0.06% (p=0.008 n=5+5) Reflect 952k ± 0% 951k ± 0% -0.02% (p=0.008 n=5+5) Tar 187k ± 0% 187k ± 0% -0.04% (p=0.008 n=5+5) XML 406k ± 0% 406k ± 0% -0.05% (p=0.008 n=5+5) [Geo mean] 648k 648k -0.04% Change-Id: I8630c4291a0eb2f7e7927bc04d7cc0efef181094 Reviewed-on: https://go-review.googlesource.com/43491 Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> 
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-05-14 14:52:09 -07:00
	idToIdx := make([]int, f.NumBlocks())
	for i, b := range f.Blocks {
		idToIdx[b.ID] = i
	}
	// Set of blocks we're moving, by ID.
	move := map[ID]struct{}{}
	// Map from block ID to the moving blocks that should
	// come right after it.
	after := map[ID][]*Block{}
	// Check each loop header and decide if we want to move it.
	for _, loop := range loopnest.loops {
		b := loop.header
		var p *Block // b's in-loop predecessor
		for _, e := range b.Preds {
			if e.b.Kind != BlockPlain {
				continue
			}
			if loopnest.b2l[e.b.ID] != loop {
				continue
			}
			p = e.b
		}
		if p == nil || p == b {
			continue
		}
		after[p.ID] = []*Block{b}
		for {
			nextIdx := idToIdx[b.ID] + 1
			if nextIdx >= len(f.Blocks) { // reached end of function (maybe impossible?)
				break
			}
			nextb := f.Blocks[nextIdx]
			if nextb == p { // original loop predecessor is next
				break
			}
			if loopnest.b2l[nextb.ID] != loop { // about to leave loop
				break
			}
			after[p.ID] = append(after[p.ID], nextb)
			b = nextb
		}
		// Place b (and the contiguous run of loop blocks that follows it) after p.
		for _, b := range after[p.ID] {
			move[b.ID] = struct{}{}
		}
	}
	// Move blocks to their destinations in a single pass.
	// We rely here on the fact that loop headers must come
	// before the rest of the loop. And that relies on the
	// fact that we only identify reducible loops.
	j := 0
	for i, b := range f.Blocks {
		if _, ok := move[b.ID]; ok {
			continue
		}
		f.Blocks[j] = b
		j++
		for _, a := range after[b.ID] {
			if j > i {
				f.Fatalf("head before tail in loop %s", b)
			}
			f.Blocks[j] = a
			j++
		}
	}
	if j != len(f.Blocks) {
		f.Fatalf("bad reordering in looprotate")
	}
}