When using a concurrent backend,
the overall compilation time is bounded
in part by the slowest function to compile.
The number of top-level statements in a function
is an easily calculated and fairly reliable
proxy for compilation time.
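To see why scheduling big functions first helps, here is a toy model
(not part of this CL): statement counts stand in for compile times,
and the greedy dispatch below mirrors handing each function to
whichever of k workers frees up first.

	package main

	import (
		"fmt"
		"sort"
	)

	// makespan reports when the last of k workers finishes, given
	// per-function sizes dispatched greedily in queue order.
	func makespan(sizes []int, k int) int {
		finish := make([]int, k)
		for _, s := range sizes {
			min := 0
			for w := range finish {
				if finish[w] < finish[min] {
					min = w
				}
			}
			finish[min] += s
		}
		max := 0
		for _, f := range finish {
			if f > max {
				max = f
			}
		}
		return max
	}

	func main() {
		sizes := []int{1, 2, 3, 4, 5, 6, 40} // one big function among small ones
		fmt.Println(makespan(sizes, 4)) // 43: the big function starts last
		sort.Sort(sort.Reverse(sort.IntSlice(sizes)))
		fmt.Println(makespan(sizes, 4)) // 40: the small ones pack in around it
	}
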
Here's a standard compilecmp output for -c=8 with this CL:
name       old time/op      new time/op      delta
Template      127ms ± 4%       125ms ± 6%    -1.33%  (p=0.000 n=47+50)
Unicode      84.8ms ± 4%      84.5ms ± 4%      ~     (p=0.217 n=49+49)
GoTypes       289ms ± 3%       287ms ± 3%    -0.78%  (p=0.002 n=48+50)
Compiler      1.36s ± 3%       1.34s ± 2%    -1.29%  (p=0.000 n=49+47)
SSA           2.95s ± 3%       2.77s ± 4%    -6.23%  (p=0.000 n=50+49)
Flate        70.7ms ± 3%      70.9ms ± 2%      ~     (p=0.112 n=50+49)
GoParser     85.0ms ± 3%      83.0ms ± 4%    -2.31%  (p=0.000 n=48+49)
Reflect       229ms ± 3%       225ms ± 4%    -1.83%  (p=0.000 n=49+49)
Tar          70.2ms ± 3%      69.4ms ± 3%    -1.17%  (p=0.000 n=49+49)
XML           115ms ± 7%       114ms ± 6%      ~     (p=0.158 n=49+47)

name       old user-time/op  new user-time/op  delta
Template      352ms ± 5%       342ms ± 8%    -2.74%  (p=0.000 n=49+50)
Unicode       117ms ± 5%       118ms ± 4%    +0.88%  (p=0.005 n=46+48)
GoTypes       986ms ± 3%       980ms ± 4%      ~     (p=0.110 n=46+48)
Compiler      4.39s ± 2%       4.43s ± 4%    +0.97%  (p=0.002 n=50+50)
SSA           12.0s ± 2%       13.3s ± 3%   +11.33%  (p=0.000 n=49+49)
Flate         222ms ± 5%       219ms ± 6%    -1.56%  (p=0.002 n=50+50)
GoParser      271ms ± 5%       268ms ± 4%    -0.83%  (p=0.036 n=49+48)
Reflect       560ms ± 4%       571ms ± 3%    +1.90%  (p=0.000 n=50+49)
Tar           183ms ± 3%       183ms ± 3%      ~     (p=0.903 n=45+50)
XML           364ms ±13%       391ms ± 4%    +7.16%  (p=0.000 n=50+40)
A more interesting way to view the data is to look at
the ratio of the time taken to compile
the slowest-to-compile function to the overall
time spent compiling functions.
If this ratio is small (near 0), then increased concurrency might help.
If this ratio is big (near 1), then we're bounded by that single function.
I instrumented the compiler to emit this ratio per-package,
ran 'go build -a -gcflags=-c=C -p=P std cmd' three times,
for varying values of C and P,
and collected the ratios encountered into an ASCII histogram.
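The per-package computation amounts to something like the following
sketch (a hypothetical reconstruction; the instrumentation itself was
not committed):

	package main

	import "fmt"

	// bucket maps a package's per-function compile times to a
	// histogram column in 0..9, where 9 means the slowest function
	// accounts for 90% or more of the package's compile time.
	func bucket(times []float64) int {
		var total, slowest float64
		for _, t := range times {
			total += t
			if t > slowest {
				slowest = t
			}
		}
		b := int(10 * slowest / total)
		if b > 9 { // a single-function package has ratio exactly 1
			b = 9
		}
		return b
	}

	func main() {
		fmt.Println(bucket([]float64{1, 1, 1, 1}))  // 2: well balanced
		fmt.Println(bucket([]float64{9, 0.5, 0.5})) // 9: one function dominates
	}
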
Here's c=1 p=1, which is a non-concurrent backend, single process at a time:
90%|
80%|
70%|
60%|
50%|
40%|
30%|
20%|**
10%|***
0%|*********
----+----------
|0123456789
The x-axis is floor(10*ratio), so the first column indicates the percent of
ratios that fell in the 0% to 9.9999% range.
We can see in this histogram that more concurrency will help;
in most cases, the ratio is small.
Here's c=8 p=1, before this CL:
90%|
80%|
70%|
60%|
50%|
40%|
30%| *
20%| *
10%|* * *
0%|**********
----+----------
|0123456789
In 30-40% of cases, we're mostly bound by the compilation time
of a single function.
Here's c=8 p=1, after this CL:
90%|
80%|
70%|
60%|
50%| *
40%| *
30%| *
20%| *
10%| *
0%|**********
----+----------
|0123456789
The sorting pays off; we are bound by the
compilation time of a single function in over half of packages.
A column whose only * is in the bottom row indicates a value in the 0-10% range.
The actual values for this chart are:
0: 5%, 1: 1%, 2: 1%, 3: 4%, 4: 5%, 5: 7%, 6: 7%, 7: 7%, 8: 9%, 9: 55%
This indicates that efforts to increase or enable more concurrency,
e.g. by optimizing mutexes or increasing the value of c,
will probably not bear fruit.
That matches what compilecmp tells us.
Further optimization efforts should thus focus instead on one of:
(1) making more functions compile concurrently
(2) improving the compilation time of the slowest functions
(3) speeding up the remaining serial parts of the compiler
(4) automatically splitting up some large autogenerated functions
into small ones, as discussed in #19751
I hope to spend more time on (1) before the freeze.
Adding process parallelism doesn't change the story much.
For example, here's c=8 p=8, after this CL:
90%|
80%|
70%|
60%|
50%|
40%| *
30%| *
20%| *
10%| ***
0%|**********
----+----------
|0123456789
Since we don't need to worry much about p,
these histograms can help us select a good
general value of c to use as a default,
assuming we're not bounded by GOMAXPROCS.
Here are some charts after this CL, for c from 1 to 8:
c=1 p=1
90%|
80%|
70%|
60%|
50%|
40%|
30%|
20%|**
10%|***
0%|*********
----+----------
|0123456789
c=2 p=1
90%|
80%|
70%|
60%|
50%|
40%|
30%|
20%|
10%| **** *
0%|**********
----+----------
|0123456789
c=3 p=1
90%|
80%|
70%|
60%|
50%|
40%|
30%|
20%| *
10%| ** * *
0%|**********
----+----------
|0123456789
c=4 p=1
90%|
80%|
70%|
60%|
50%|
40%|
30%| *
20%| *
10%| * *
0%|**********
----+----------
|0123456789
c=5 p=1
90%|
80%|
70%|
60%|
50%|
40%|
30%| *
20%| *
10%| * *
0%|**********
----+----------
|0123456789
c=6 p=1
90%|
80%|
70%|
60%|
50%|
40%| *
30%| *
20%| *
10%| *
0%|**********
----+----------
|0123456789
c=7 p=1
90%|
80%|
70%|
60%|
50%| *
40%| *
30%| *
20%| *
10%| **
0%|**********
----+----------
|0123456789
c=8 p=1
90%|
80%|
70%|
60%|
50%| *
40%| *
30%| *
20%| *
10%| *
0%|**********
----+----------
|0123456789
Given the increased user-CPU costs as
c increases, it looks like c=4 is probably
the sweet spot, at least for now.
Pleasingly, this matches (and explains)
the results of the standard benchmarking
that I have done.
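For illustration, a default along these lines could be computed as in
the following sketch (hypothetical; this CL leaves c controlled by the
flag):

	package main

	import (
		"fmt"
		"runtime"
	)

	// defaultBackendWorkers picks a backend concurrency, capping at
	// the empirical sweet spot of 4 and at the available parallelism.
	// Hypothetical helper, not part of this CL.
	func defaultBackendWorkers() int {
		c := 4
		if n := runtime.GOMAXPROCS(0); n < c {
			c = n
		}
		return c
	}

	func main() {
		fmt.Println(defaultBackendWorkers())
	}
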
Updates #15756
Change-Id: I82b606c06efd34a5dbd1afdbcf66a605905b2aeb
Reviewed-on: https://go-review.googlesource.com/41192
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
// Copyright 2011 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package gc

import (
	"cmd/compile/internal/ssa"
	"cmd/compile/internal/types"
	"cmd/internal/dwarf"
	"cmd/internal/obj"
	"cmd/internal/objabi"
	"cmd/internal/src"
	"cmd/internal/sys"
	"fmt"
	"sort"
	"sync"
)

// "Portable" code generation.

var (
	nBackendWorkers int     // number of concurrent backend workers, set by a compiler flag
	compilequeue    []*Node // functions waiting to be compiled
)

// emitptrargsmap emits the pointer map for Curfn's arguments and results,
// as the symbol "fn.args_stackmap". It is used for functions with no Go
// body, typically ones implemented in assembly.
func emitptrargsmap() {
	if Curfn.funcname() == "_" {
		return
	}
	sym := lookup(fmt.Sprintf("%s.args_stackmap", Curfn.funcname()))
	lsym := sym.Linksym()

	nptr := int(Curfn.Type.ArgWidth() / int64(Widthptr))
	bv := bvalloc(int32(nptr) * 2)
	nbitmap := 1
	if Curfn.Type.Results().NumFields() > 0 {
		nbitmap = 2
	}
	off := duint32(lsym, 0, uint32(nbitmap))
	off = duint32(lsym, off, uint32(bv.n))
	var xoffset int64
	if Curfn.IsMethod() {
		xoffset = 0
		onebitwalktype1(Curfn.Type.Recvs(), &xoffset, bv)
	}

	if Curfn.Type.Params().NumFields() > 0 {
		xoffset = 0
		onebitwalktype1(Curfn.Type.Params(), &xoffset, bv)
	}

	off = dbvec(lsym, off, bv)
	if Curfn.Type.Results().NumFields() > 0 {
		xoffset = 0
		onebitwalktype1(Curfn.Type.Results(), &xoffset, bv)
		off = dbvec(lsym, off, bv)
	}

	ggloblsym(lsym, int32(off), obj.RODATA|obj.LOCAL)
}

// cmpstackvarlt reports whether the stack variable a sorts before b.
//
// Sort the list of stack variables. Autos after anything else,
// within autos, unused after used, within used, things with
// pointers first, zeroed things first, and then decreasing size.
// Because autos are laid out in decreasing addresses
// on the stack, pointers first, zeroed things first and decreasing size
// really means, in memory, things with pointers needing zeroing at
// the top of the stack and increasing in size.
// Non-autos sort on offset.
func cmpstackvarlt(a, b *Node) bool {
	if (a.Class() == PAUTO) != (b.Class() == PAUTO) {
		return b.Class() == PAUTO
	}

	if a.Class() != PAUTO {
		return a.Xoffset < b.Xoffset
	}

	if a.Used() != b.Used() {
		return a.Used()
	}

	ap := types.Haspointers(a.Type)
	bp := types.Haspointers(b.Type)
	if ap != bp {
		return ap
	}

	ap = a.Name.Needzero()
	bp = b.Name.Needzero()
	if ap != bp {
		return ap
	}

	if a.Type.Width != b.Type.Width {
		return a.Type.Width > b.Type.Width
	}

	return a.Sym.Name < b.Sym.Name
}

// byStackVar implements sort.Interface for []*Node using cmpstackvarlt.
type byStackVar []*Node

func (s byStackVar) Len() int           { return len(s) }
func (s byStackVar) Less(i, j int) bool { return cmpstackvarlt(s[i], s[j]) }
func (s byStackVar) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// AllocFrame computes the stack frame layout for f's function: it marks
// which automatic variables are actually used, sorts fn.Dcl with
// byStackVar, truncates the unused tail, and assigns each used auto a
// frame offset, accumulating the frame size (stksize) and the size of
// its pointer-containing prefix (stkptrsize).
func (s *ssafn) AllocFrame(f *ssa.Func) {
	s.stksize = 0
	s.stkptrsize = 0
	fn := s.curfn.Func

	// Mark the PAUTO's unused.
	for _, ln := range fn.Dcl {
		if ln.Class() == PAUTO {
			ln.SetUsed(false)
		}
	}

	for _, l := range f.RegAlloc {
		if ls, ok := l.(ssa.LocalSlot); ok {
			ls.N.(*Node).SetUsed(true)
		}
	}

	scratchUsed := false
	for _, b := range f.Blocks {
		for _, v := range b.Values {
			switch a := v.Aux.(type) {
			case *ssa.ArgSymbol:
				n := a.Node.(*Node)
				// Don't modify nodfp; it is a global.
				if n != nodfp {
					n.SetUsed(true)
				}
			case *ssa.AutoSymbol:
				a.Node.(*Node).SetUsed(true)
			}

			if !scratchUsed {
				scratchUsed = v.Op.UsesScratch()
			}
		}
	}

	if f.Config.NeedsFpScratch && scratchUsed {
		s.scratchFpMem = tempAt(src.NoXPos, s.curfn, types.Types[TUINT64])
	}

	sort.Sort(byStackVar(fn.Dcl))

	// Reassign stack offsets of the locals that are used.
	for i, n := range fn.Dcl {
		if n.Op != ONAME || n.Class() != PAUTO {
			continue
		}
		if !n.Used() {
			fn.Dcl = fn.Dcl[:i]
			break
		}

		dowidth(n.Type)
		w := n.Type.Width
		if w >= thearch.MAXWIDTH || w < 0 {
			Fatalf("bad width")
		}
		s.stksize += w
		s.stksize = Rnd(s.stksize, int64(n.Type.Align))
		if types.Haspointers(n.Type) {
			s.stkptrsize = s.stksize
		}
		if thearch.LinkArch.InFamily(sys.MIPS, sys.MIPS64, sys.ARM, sys.ARM64, sys.PPC64, sys.S390X) {
			s.stksize = Rnd(s.stksize, int64(Widthptr))
		}
		n.Xoffset = -s.stksize
	}

	s.stksize = Rnd(s.stksize, int64(Widthreg))
	s.stkptrsize = Rnd(s.stkptrsize, int64(Widthreg))
}

// compile prepares fn for the backend: it orders and walks the function
// body, then either compiles fn immediately or enqueues it on
// compilequeue for the concurrent backend workers.
func compile(fn *Node) {
	Curfn = fn
	dowidth(fn.Type)

	if fn.Nbody.Len() == 0 {
		emitptrargsmap()
		return
	}

	saveerrors()

	order(fn)
	if nerrors != 0 {
		return
	}

	walk(fn)
	if nerrors != 0 {
		return
	}
	if instrumenting {
		instrument(fn)
	}

	// From this point, there should be no uses of Curfn. Enforce that.
	Curfn = nil

	// Set up the function's LSym early to avoid data races with the assemblers.
	fn.Func.initLSym()

	if compilenow() {
		compileSSA(fn, 0)
	} else {
		compilequeue = append(compilequeue, fn)
	}
}

// compilenow reports whether to compile immediately.
// If functions are not compiled immediately,
// they are enqueued in compilequeue,
// which is drained by compileFunctions.
func compilenow() bool {
	return nBackendWorkers == 1
}

// compileSSA builds an SSA backend function,
// uses it to generate a plist,
// and flushes that plist to machine code.
// worker indicates which of the backend workers is doing the processing.
func compileSSA(fn *Node, worker int) {
	ssafn := buildssa(fn, worker)
	pp := newProgs(fn, worker)
	genssa(ssafn, pp)
	if pp.Text.To.Offset < 1<<31 {
		pp.Flush()
	} else {
		largeStackFramesMu.Lock()
		largeStackFrames = append(largeStackFrames, fn.Pos)
		largeStackFramesMu.Unlock()
	}
	// fieldtrack must be called after pp.Flush. See issue 20014.
	fieldtrack(pp.Text.From.Sym, fn.Func.FieldTrack)
	pp.Free()
}

// compileFunctions compiles all functions in compilequeue.
// It fans out nBackendWorkers to do the work
// and waits for them to complete.
func compileFunctions() {
	if len(compilequeue) != 0 {
		// Compile the longest functions first,
		// since they're most likely to be the slowest.
		// This helps avoid stragglers.
		obj.SortSlice(compilequeue, func(i, j int) bool {
			return compilequeue[i].Nbody.Len() > compilequeue[j].Nbody.Len()
		})
		var wg sync.WaitGroup
		c := make(chan *Node)
		for i := 0; i < nBackendWorkers; i++ {
			wg.Add(1)
			go func(worker int) {
				for fn := range c {
					compileSSA(fn, worker)
				}
				wg.Done()
			}(i)
		}
		for _, fn := range compilequeue {
			c <- fn
		}
		close(c)
		compilequeue = nil
		wg.Wait()
	}
}

// debuginfo collects DWARF variable records for curfn's parameters and
// stack-allocated locals, registering each with the function's LSym and
// returning them sorted by frame offset.
func debuginfo(fnsym *obj.LSym, curfn interface{}) []*dwarf.Var {
	fn := curfn.(*Node)
	if expect := fn.Func.Nname.Sym.Linksym(); fnsym != expect {
		Fatalf("unexpected fnsym: %v != %v", fnsym, expect)
	}

	var vars []*dwarf.Var
	for _, n := range fn.Func.Dcl {
		if n.Op != ONAME { // might be OTYPE or OLITERAL
			continue
		}

		var name obj.AddrName
		var abbrev int
		offs := n.Xoffset

		switch n.Class() {
		case PAUTO:
			if !n.Used() {
				Fatalf("debuginfo unused node (AllocFrame should truncate fn.Func.Dcl)")
			}
			name = obj.NAME_AUTO

			abbrev = dwarf.DW_ABRV_AUTO
			if Ctxt.FixedFrameSize() == 0 {
				offs -= int64(Widthptr)
			}
			if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
				offs -= int64(Widthptr)
			}

		case PPARAM, PPARAMOUT:
			name = obj.NAME_PARAM

			abbrev = dwarf.DW_ABRV_PARAM
			offs += Ctxt.FixedFrameSize()

		default:
			continue
		}

		gotype := ngotype(n).Linksym()
		fnsym.Func.Autom = append(fnsym.Func.Autom, &obj.Auto{
			Asym:    Ctxt.Lookup(n.Sym.Name),
			Aoffset: int32(n.Xoffset),
			Name:    name,
			Gotype:  gotype,
		})

		if n.IsAutoTmp() {
			continue
		}

		typename := dwarf.InfoPrefix + gotype.Name[len("type."):]
		vars = append(vars, &dwarf.Var{
			Name:   n.Sym.Name,
			Abbrev: abbrev,
			Offset: int32(offs),
			Type:   Ctxt.Lookup(typename),
		})
	}

	// Stable sort so that ties are broken with declaration order.
	sort.Stable(dwarf.VarsByOffset(vars))

	return vars
}

// fieldtrack adds R_USEFIELD relocations to fnsym to record any
// struct fields that it used.
func fieldtrack(fnsym *obj.LSym, tracked map[*types.Sym]struct{}) {
	if fnsym == nil {
		return
	}
	if objabi.Fieldtrack_enabled == 0 || len(tracked) == 0 {
		return
	}

	trackSyms := make([]*types.Sym, 0, len(tracked))
	for sym := range tracked {
		trackSyms = append(trackSyms, sym)
	}
	sort.Sort(symByName(trackSyms))
	for _, sym := range trackSyms {
		r := obj.Addrel(fnsym)
		r.Sym = sym.Linksym()
		r.Type = objabi.R_USEFIELD
	}
}

type symByName []*types.Sym

func (a symByName) Len() int           { return len(a) }
func (a symByName) Less(i, j int) bool { return a[i].Name < a[j].Name }
func (a symByName) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }