go/src/cmd/compile/internal/ssa/dom.go

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package ssa
// This file contains code to compute the dominator tree
// of a control-flow graph.
// postorder computes a postorder traversal ordering for the
// basic blocks in f. Unreachable blocks will not appear.
func postorder(f *Func) []*Block {
return postorderWithNumbering(f, nil)
}
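// blockAndIndex is a stack entry for the iterative DFS in postorderWithNumbering.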
type blockAndIndex struct {
b *Block
index int // index is the number of successor edges of b that have already been explored.
}
// postorderWithNumbering provides a DFS postordering.
// This seems to make loop-finding more robust.
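// If ponums is non-nil, it is filled in with each block's postorder
// number, indexed by block ID.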
func postorderWithNumbering(f *Func, ponums []int32) []*Block {
seen := make([]bool, f.NumBlocks())
// result ordering
order := make([]*Block, 0, len(f.Blocks))
// stack of blocks and next child to visit
// A constant bound allows this to be stack-allocated. 32 is
// enough to cover almost every postorderWithNumbering call.
s := make([]blockAndIndex, 0, 32)
s = append(s, blockAndIndex{b: f.Entry})
seen[f.Entry.ID] = true
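// Iterative DFS: the top of the stack is the block currently being
// visited, and its index field counts how many of its successor edges
// have been explored so far. A block is appended to order only after
// all of its successors have been visited, which yields a postorder.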
for len(s) > 0 {
tos := len(s) - 1
x := s[tos]
b := x.b
if i := x.index; i < len(b.Succs) {
s[tos].index++
bb := b.Succs[i].Block()
if !seen[bb.ID] {
seen[bb.ID] = true
s = append(s, blockAndIndex{b: bb})
}
continue
}
s = s[:tos]
if ponums != nil {
ponums[b.ID] = int32(len(order))
}
order = append(order, b)
}
return order
}
type linkedBlocks func(*Block) []Edge
const nscratchslices = 7
// experimentally, functions with 512 or fewer blocks account
// for 75% of memory (size) allocation for dominator computation
// in make.bash.
const minscratchblocks = 512
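// scratchBlocksForDom carves the cache's scratch storage into seven
// equal-length ID slices, one per work array used by the Lengauer-Tarjan
// computation, growing cache.domblockstore if it is too small and clearing
// the portion that will be reused otherwise.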
func (cache *Cache) scratchBlocksForDom(maxBlockID int) (a, b, c, d, e, f, g []ID) {
tot := maxBlockID * nscratchslices
scratch := cache.domblockstore
if len(scratch) < tot {
// req = max(1.5*tot, nscratchslices*minscratchblocks)
// 50% padding allows for graph growth in later phases.
req := (tot * 3) >> 1
if req < nscratchslices*minscratchblocks {
req = nscratchslices * minscratchblocks
}
scratch = make([]ID, req)
cache.domblockstore = scratch
} else {
// Clear as much of scratch as we will (re)use
scratch = scratch[0:tot]
for i := range scratch {
scratch[i] = 0
}
}
a = scratch[0*maxBlockID : 1*maxBlockID]
b = scratch[1*maxBlockID : 2*maxBlockID]
c = scratch[2*maxBlockID : 3*maxBlockID]
d = scratch[3*maxBlockID : 4*maxBlockID]
e = scratch[4*maxBlockID : 5*maxBlockID]
f = scratch[5*maxBlockID : 6*maxBlockID]
g = scratch[6*maxBlockID : 7*maxBlockID]
return
}
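// dominators computes the dominator tree for f using the Lengauer-Tarjan
// algorithm. It returns a slice which maps block ID to the immediate
// dominator of that block; unreachable blocks and the entry block map to nil.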
func dominators(f *Func) []*Block {
preds := func(b *Block) []Edge { return b.Preds }
succs := func(b *Block) []Edge { return b.Succs }
// TODO: benchmark and try to find criteria for swapping between
// dominatorsSimple and dominatorsLTOrig.
return f.dominatorsLTOrig(f.Entry, preds, succs)
}
// dominatorsLTOrig runs Lengauer-Tarjan to compute a dominator tree starting at
// entry and using predFn/succFn to find predecessors/successors to allow
// computing both dominator and post-dominator trees.
func (f *Func) dominatorsLTOrig(entry *Block, predFn linkedBlocks, succFn linkedBlocks) []*Block {
// Adapted directly from the original TOPLAS article's "simple" algorithm
maxBlockID := entry.Func.NumBlocks()
semi, vertex, label, parent, ancestor, bucketHead, bucketLink := f.Cache.scratchBlocksForDom(maxBlockID)
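// Work arrays, all indexed by block ID except vertex:
//   semi       - DFS number of the block, later the DFS number of its semidominator
//   vertex     - indexed by DFS number, maps back to the block ID
//   label      - used with ancestor as the link-eval forest for evalOrig/linkOrig
//   parent     - DFS tree parent
//   ancestor   - forest parent used by linkOrig/evalOrig
//   bucketHead - head of the list of blocks whose semidominator is this block
//   bucketLink - next entry in that list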
// This version uses integers for most of the computation,
// to make the work arrays smaller and pointer-free.
// fromID translates from ID to *Block where that is needed.
fromID := make([]*Block, maxBlockID)
for _, v := range f.Blocks {
fromID[v.ID] = v
}
idom := make([]*Block, maxBlockID)
// Step 1. Carry out a depth first search of the problem graph. Number
// the vertices from 1 to n as they are reached during the search.
n := f.dfsOrig(entry, succFn, semi, vertex, label, parent)
for i := n; i >= 2; i-- {
w := vertex[i]
// step 2 in TOPLAS paper
for _, e := range predFn(fromID[w]) {
v := e.b
if semi[v.ID] == 0 {
// skip unreachable predecessor
// not in original, but we're using existing pred instead of building one.
continue
}
u := evalOrig(v.ID, ancestor, semi, label)
if semi[u] < semi[w] {
semi[w] = semi[u]
}
}
// add w to bucket[vertex[semi[w]]]
// The bucket is implemented as a linked list threaded through
// a pair of arrays.
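// bucketHead[x] is the ID of the first block in x's bucket (0 if empty),
// and bucketLink[y] is the ID of the block after y in the same bucket,
// so adding w just prepends it to the list. For example, after
// bucketHead[x] = y and bucketLink[y] = z, bucket x holds y then z.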
vsw := vertex[semi[w]]
bucketLink[w] = bucketHead[vsw]
bucketHead[vsw] = w
linkOrig(parent[w], w, ancestor)
// step 3 in TOPLAS paper
for v := bucketHead[parent[w]]; v != 0; v = bucketLink[v] {
u := evalOrig(v, ancestor, semi, label)
if semi[u] < semi[v] {
idom[v] = fromID[u]
} else {
idom[v] = fromID[parent[w]]
}
}
}
// step 4 in TOPLAS paper
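// Walk the vertices in increasing DFS order, turning the relative
// dominators recorded in step 3 into immediate dominators.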
for i := ID(2); i <= n; i++ {
w := vertex[i]
if idom[w].ID != vertex[semi[w]] {
idom[w] = idom[idom[w].ID]
}
}
return idom
}
// dfsOrig performs a depth first search over the blocks, starting at the
// entry block (in arbitrary order). This is a de-recursed version of dfs
// from the original Lengauer-Tarjan TOPLAS article. It's important to return
// the same values for parent as the original algorithm.
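// The return value n is the number of blocks reached; afterwards semi[b]
// holds b's DFS number (0 for unreachable blocks), vertex maps DFS numbers
// back to block IDs, label[b] = b, and parent[b] is b's DFS tree parent.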
func (f *Func) dfsOrig(entry *Block, succFn linkedBlocks, semi, vertex, label, parent []ID) ID {
n := ID(0)
s := make([]*Block, 0, 256)
s = append(s, entry)
for len(s) > 0 {
v := s[len(s)-1]
s = s[:len(s)-1]
// recursing on v
if semi[v.ID] != 0 {
continue // already visited
}
n++
semi[v.ID] = n
vertex[n] = v.ID
label[v.ID] = v.ID
// ancestor[v] already zero
for _, e := range succFn(v) {
w := e.b
// if it has a dfnum, we've already visited it
if semi[w.ID] == 0 {
// yes, w can be pushed multiple times.
s = append(s, w)
parent[w.ID] = v.ID // keep overwriting this till it is visited.
}
}
}
return n
}
// compressOrig is the "simple" compress function from the LT paper.
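// It compresses the ancestor path from v toward the forest root while
// propagating the label with the smallest semidominator number down the path.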
func compressOrig(v ID, ancestor, semi, label []ID) {
if ancestor[ancestor[v]] != 0 {
compressOrig(ancestor[v], ancestor, semi, label)
if semi[label[ancestor[v]]] < semi[label[v]] {
label[v] = label[ancestor[v]]
}
ancestor[v] = ancestor[ancestor[v]]
}
}
// evalOrig is the "simple" eval function from the LT paper.
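// If v is a forest root it returns v; otherwise it compresses v's path and
// returns the vertex with the minimum semidominator number on the path from
// the root (exclusive) down to v.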
func evalOrig(v ID, ancestor, semi, label []ID) ID {
if ancestor[v] == 0 {
return v
}
compressOrig(v, ancestor, semi, label)
return label[v]
}
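// linkOrig adds the edge (v, w) to the link-eval forest by making v the
// ancestor of w.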
func linkOrig(v, w ID, ancestor []ID) {
ancestor[w] = v
}
// dominatorsSimple computes the dominator tree for f. It returns a slice
// which maps block ID to the immediate dominator of that block.
// Unreachable blocks map to nil. The entry block maps to nil.
func dominatorsSimple(f *Func) []*Block {
// A simple iterative algorithm for now:
// Cooper, Harvey, Kennedy, "A Simple, Fast Dominance Algorithm".
idom := make([]*Block, f.NumBlocks())
// Compute postorder walk
post := f.postorder()
// Make map from block id to order index (for intersect call)
postnum := make([]int, f.NumBlocks())
for i, b := range post {
postnum[b.ID] = i
}
// Make the entry block a self-loop
idom[f.Entry.ID] = f.Entry
if postnum[f.Entry.ID] != len(post)-1 {
f.Fatalf("entry block %v not last in postorder", f.Entry)
}
// Compute relaxation of idom entries
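// Iterate to a fixed point: process blocks in reverse postorder, setting
// each block's idom to the common dominator (intersection) of all
// predecessors that already have an idom assigned, until nothing changes.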
for {
changed := false
for i := len(post) - 2; i >= 0; i-- {
b := post[i]
var d *Block
for _, e := range b.Preds {
p := e.b
if idom[p.ID] == nil {
continue
}
if d == nil {
d = p
continue
}
d = intersect(d, p, postnum, idom)
}
if d != idom[b.ID] {
idom[b.ID] = d
changed = true
}
}
if !changed {
break
}
}
// Set idom of entry block to nil instead of itself.
idom[f.Entry.ID] = nil
return idom
}
// intersect finds the closest dominator of both b and c.
// It requires a postorder numbering of all the blocks.
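// Starting from b and c, it repeatedly replaces whichever block has the
// smaller postorder number with that block's immediate dominator until
// the two meet.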
func intersect(b, c *Block, postnum []int, idom []*Block) *Block {
// TODO: This loop is O(n^2). It used to be used in nilcheck,
// see BenchmarkNilCheckDeep*.
for b != c {
if postnum[b.ID] < postnum[c.ID] {
b = idom[b.ID]
} else {
c = idom[c.ID]
}
}
return b
}