go/src/cmd/compile/internal/ssa/_gen/main.go

569 lines
17 KiB
Go
Raw Normal View History

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// The gen command generates Go code (in the parent directory) for all
// the architecture-specific opcodes, blocks, and rewrites.
package main
import (
"bytes"
"flag"
"fmt"
"go/format"
"log"
"math/bits"
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
"os"
"path"
"regexp"
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
"runtime"
"runtime/pprof"
"runtime/trace"
"slices"
"sort"
"strings"
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
"sync"
)
// TODO: capitalize these types, so that we can more easily tell variable names
// apart from type names, and avoid awkward func parameters like "arch arch".
type arch struct {
name string
pkg string // obj package to import for this arch.
genfile string // source file containing opcode code generation.
ops []opData
blocks []blockData
regnames []string
ParamIntRegNames string
ParamFloatRegNames string
gpregmask regMask
fpregmask regMask
fp32regmask regMask
fp64regmask regMask
specialregmask regMask
framepointerreg int8
linkreg int8
generic bool
imports []string
}
type opData struct {
name string
reg regInfo
asm string
typ string // default result type
aux string
rematerializeable bool
argLength int32 // number of arguments, if -1, then this operation has a variable number of arguments
commutative bool // this operation is commutative on its first 2 arguments (e.g. addition)
resultInArg0 bool // (first, if a tuple) output of v and v.Args[0] must be allocated to the same register
resultNotInArgs bool // outputs must not be allocated to the same registers as inputs
clobberFlags bool // this op clobbers flags register
needIntTemp bool // need a temporary free integer register
call bool // is a function call
tailCall bool // is a tail call
nilCheck bool // this op is a nil check on arg0
faultOnNilArg0 bool // this op will fault if arg0 is nil (and aux encodes a small offset)
faultOnNilArg1 bool // this op will fault if arg1 is nil (and aux encodes a small offset)
hasSideEffects bool // for "reasons", not to be eliminated. E.g., atomic store, #19182.
zeroWidth bool // op never translates into any machine code. example: copy, which may sometimes translate to machine code, is not zero-width.
unsafePoint bool // this op is an unsafe point, i.e. not safe for async preemption
symEffect string // effect this op has on symbol in aux
scale uint8 // amd64/386 indexed load scale
}
type blockData struct {
cmd/compile: allow multiple SSA block control values Control values are used to choose which successor of a block is jumped to. Typically a control value takes the form of a 'flags' value that represents the result of a comparison. Some architectures however use a variable in a register as a control value. Up until now we have managed with a single control value per block. However some architectures (e.g. s390x and riscv64) have combined compare-and-branch instructions that take two variables in registers as parameters. To generate these instructions we need to support 2 control values per block. This CL allows up to 2 control values to be used in a block in order to support the addition of compare-and-branch instructions. I have implemented s390x compare-and-branch instructions in a different CL. Passes toolstash-check -all. Results of compilebench: name old time/op new time/op delta Template 208ms ± 1% 209ms ± 1% ~ (p=0.289 n=20+20) Unicode 83.7ms ± 1% 83.3ms ± 3% -0.49% (p=0.017 n=18+18) GoTypes 748ms ± 1% 748ms ± 0% ~ (p=0.460 n=20+18) Compiler 3.47s ± 1% 3.48s ± 1% ~ (p=0.070 n=19+18) SSA 11.5s ± 1% 11.7s ± 1% +1.64% (p=0.000 n=19+18) Flate 130ms ± 1% 130ms ± 1% ~ (p=0.588 n=19+20) GoParser 160ms ± 1% 161ms ± 1% ~ (p=0.211 n=20+20) Reflect 465ms ± 1% 467ms ± 1% +0.42% (p=0.007 n=20+20) Tar 184ms ± 1% 185ms ± 2% ~ (p=0.087 n=18+20) XML 253ms ± 1% 253ms ± 1% ~ (p=0.377 n=20+18) LinkCompiler 769ms ± 2% 774ms ± 2% ~ (p=0.070 n=19+19) ExternalLinkCompiler 3.59s ±11% 3.68s ± 6% ~ (p=0.072 n=20+20) LinkWithoutDebugCompiler 446ms ± 5% 454ms ± 3% +1.79% (p=0.002 n=19+20) StdCmd 26.0s ± 2% 26.0s ± 2% ~ (p=0.799 n=20+20) name old user-time/op new user-time/op delta Template 238ms ± 5% 240ms ± 5% ~ (p=0.142 n=20+20) Unicode 105ms ±11% 106ms ±10% ~ (p=0.512 n=20+20) GoTypes 876ms ± 2% 873ms ± 4% ~ (p=0.647 n=20+19) Compiler 4.17s ± 2% 4.19s ± 1% ~ (p=0.093 n=20+18) SSA 13.9s ± 1% 14.1s ± 1% +1.45% (p=0.000 n=18+18) Flate 145ms ±13% 146ms ± 5% ~ (p=0.851 n=20+18) GoParser 185ms ± 5% 188ms ± 7% ~ (p=0.174 n=20+20) Reflect 534ms ± 3% 538ms ± 2% ~ (p=0.105 n=20+18) Tar 215ms ± 4% 211ms ± 9% ~ (p=0.079 n=19+20) XML 295ms ± 6% 295ms ± 5% ~ (p=0.968 n=20+20) LinkCompiler 832ms ± 4% 837ms ± 7% ~ (p=0.707 n=17+20) ExternalLinkCompiler 1.58s ± 8% 1.60s ± 4% ~ (p=0.296 n=20+19) LinkWithoutDebugCompiler 478ms ±12% 489ms ±10% ~ (p=0.429 n=20+20) name old object-bytes new object-bytes delta Template 559kB ± 0% 559kB ± 0% ~ (all equal) Unicode 216kB ± 0% 216kB ± 0% ~ (all equal) GoTypes 2.03MB ± 0% 2.03MB ± 0% ~ (all equal) Compiler 8.07MB ± 0% 8.07MB ± 0% -0.06% (p=0.000 n=20+20) SSA 27.1MB ± 0% 27.3MB ± 0% +0.89% (p=0.000 n=20+20) Flate 343kB ± 0% 343kB ± 0% ~ (all equal) GoParser 441kB ± 0% 441kB ± 0% ~ (all equal) Reflect 1.36MB ± 0% 1.36MB ± 0% ~ (all equal) Tar 487kB ± 0% 487kB ± 0% ~ (all equal) XML 632kB ± 0% 632kB ± 0% ~ (all equal) name old export-bytes new export-bytes delta Template 18.5kB ± 0% 18.5kB ± 0% ~ (all equal) Unicode 7.92kB ± 0% 7.92kB ± 0% ~ (all equal) GoTypes 35.0kB ± 0% 35.0kB ± 0% ~ (all equal) Compiler 109kB ± 0% 110kB ± 0% +0.72% (p=0.000 n=20+20) SSA 137kB ± 0% 138kB ± 0% +0.58% (p=0.000 n=20+20) Flate 4.89kB ± 0% 4.89kB ± 0% ~ (all equal) GoParser 8.49kB ± 0% 8.49kB ± 0% ~ (all equal) Reflect 11.4kB ± 0% 11.4kB ± 0% ~ (all equal) Tar 10.5kB ± 0% 10.5kB ± 0% ~ (all equal) XML 16.7kB ± 0% 16.7kB ± 0% ~ (all equal) name old text-bytes new text-bytes delta HelloSize 761kB ± 0% 761kB ± 0% ~ (all equal) CmdGoSize 10.8MB ± 0% 10.8MB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 10.7kB ± 0% 10.7kB ± 0% ~ (all equal) CmdGoSize 312kB ± 0% 312kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 122kB ± 0% 122kB ± 0% ~ (all equal) CmdGoSize 146kB ± 0% 146kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.13MB ± 0% 1.13MB ± 0% ~ (all equal) CmdGoSize 15.1MB ± 0% 15.1MB ± 0% ~ (all equal) Change-Id: I3cc2f9829a109543d9a68be4a21775d2d3e9801f Reviewed-on: https://go-review.googlesource.com/c/go/+/196557 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-12 20:19:58 +01:00
name string // the suffix for this block ("EQ", "LT", etc.)
controls int // the number of control values this type of block requires
aux string // the type of the Aux/AuxInt value, if any
}
type regInfo struct {
// inputs[i] encodes the set of registers allowed for the i'th input.
// Inputs that don't use registers (flags, memory, etc.) should be 0.
inputs []regMask
// clobbers encodes the set of registers that are overwritten by
// the instruction (other than the output registers).
clobbers regMask
// outputs[i] encodes the set of registers allowed for the i'th output.
outputs []regMask
}
type regMask uint64
func (a arch) regMaskComment(r regMask) string {
var buf strings.Builder
for i := uint64(0); r != 0; i++ {
if r&1 != 0 {
if buf.Len() == 0 {
buf.WriteString(" //")
}
buf.WriteString(" ")
buf.WriteString(a.regnames[i])
}
r >>= 1
}
return buf.String()
}
var archs []arch
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
var memprofile = flag.String("memprofile", "", "write memory profile to `file`")
var tracefile = flag.String("trace", "", "write trace to `file`")
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
func main() {
flag.Parse()
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile()
}
if *tracefile != "" {
f, err := os.Create(*tracefile)
if err != nil {
log.Fatalf("failed to create trace output file: %v", err)
}
defer func() {
if err := f.Close(); err != nil {
log.Fatalf("failed to close trace file: %v", err)
}
}()
if err := trace.Start(f); err != nil {
log.Fatalf("failed to start trace: %v", err)
}
defer trace.Stop()
}
slices.SortFunc(archs, func(a, b arch) int {
return strings.Compare(a.name, b.name)
})
// The generate tasks are run concurrently, since they are CPU-intensive
// that can easily make use of many cores on a machine.
//
// Note that there is no limit on the concurrency at the moment. On a
// four-core laptop at the time of writing, peak RSS usually reaches
// ~200MiB, which seems doable by practically any machine nowadays. If
// that stops being the case, we can cap this func to a fixed number of
// architectures being generated at once.
tasks := []func(){
genOp,
genAllocators,
}
for _, a := range archs {
a := a // the funcs are ran concurrently at a later time
tasks = append(tasks, func() {
genRules(a)
genSplitLoadRules(a)
cmd/compile: add late lower pass for last rules to run Usually optimization rules have corresponding priorities, some need to be run first, some run next, and some run last, which produces the best code. But currently our optimization rules have no priority, this CL adds a late lower pass that runs those rules that need to be run at last, such as split unreasonable constant folding. This pass can be seen as the second round of the lower pass. For example: func foo(a, b uint64) uint64 { d := a+0x1234568 d1 := b+0x1234568 return d&d1 } The code generated by the master branch: 0x0004 00004 ADD $19088744, R0, R2 // movz+movk+add 0x0010 00016 ADD $19088744, R1, R1 // movz+movk+add 0x001c 00028 AND R1, R2, R0 This is because the current constant folding optimization rules do not take into account the range of constants, causing the constant to be loaded repeatedly. This CL splits these unreasonable constants folding in the late lower pass. With this CL the generated code: 0x0004 00004 MOVD $19088744, R2 // movz+movk 0x000c 00012 ADD R0, R2, R3 0x0010 00016 ADD R1, R2, R1 0x0014 00020 AND R1, R3, R0 This CL also adds constant folding optimization for ADDS instruction. In addition, in order not to introduce the codegen regression, an optimization rule is added to change the addition of a negative number into a subtraction of a positive number. go1 benchmarks: name old time/op new time/op delta BinaryTree17-8 1.22s ± 1% 1.24s ± 0% +1.56% (p=0.008 n=5+5) Fannkuch11-8 1.54s ± 0% 1.53s ± 0% -0.69% (p=0.016 n=4+5) FmtFprintfEmpty-8 14.1ns ± 0% 14.1ns ± 0% ~ (p=0.079 n=4+5) FmtFprintfString-8 26.0ns ± 0% 26.1ns ± 0% +0.23% (p=0.008 n=5+5) FmtFprintfInt-8 32.3ns ± 0% 32.9ns ± 1% +1.72% (p=0.008 n=5+5) FmtFprintfIntInt-8 54.5ns ± 0% 55.5ns ± 0% +1.83% (p=0.008 n=5+5) FmtFprintfPrefixedInt-8 61.5ns ± 0% 62.0ns ± 0% +0.93% (p=0.008 n=5+5) FmtFprintfFloat-8 72.0ns ± 0% 73.6ns ± 0% +2.24% (p=0.008 n=5+5) FmtManyArgs-8 221ns ± 0% 224ns ± 0% +1.22% (p=0.008 n=5+5) GobDecode-8 1.91ms ± 0% 1.93ms ± 0% +0.98% (p=0.008 n=5+5) GobEncode-8 1.40ms ± 1% 1.39ms ± 0% -0.79% (p=0.032 n=5+5) Gzip-8 115ms ± 0% 117ms ± 1% +1.17% (p=0.008 n=5+5) Gunzip-8 19.4ms ± 1% 19.3ms ± 0% -0.71% (p=0.016 n=5+4) HTTPClientServer-8 27.0µs ± 0% 27.3µs ± 0% +0.80% (p=0.008 n=5+5) JSONEncode-8 3.36ms ± 1% 3.33ms ± 0% ~ (p=0.056 n=5+5) JSONDecode-8 17.5ms ± 2% 17.8ms ± 0% +1.71% (p=0.016 n=5+4) Mandelbrot200-8 2.29ms ± 0% 2.29ms ± 0% ~ (p=0.151 n=5+5) GoParse-8 1.35ms ± 1% 1.36ms ± 1% ~ (p=0.056 n=5+5) RegexpMatchEasy0_32-8 24.5ns ± 0% 24.5ns ± 0% ~ (p=0.444 n=4+5) RegexpMatchEasy0_1K-8 131ns ±11% 118ns ± 6% ~ (p=0.056 n=5+5) RegexpMatchEasy1_32-8 22.9ns ± 0% 22.9ns ± 0% ~ (p=0.905 n=4+5) RegexpMatchEasy1_1K-8 126ns ± 0% 127ns ± 0% ~ (p=0.063 n=4+5) RegexpMatchMedium_32-8 486ns ± 5% 483ns ± 0% ~ (p=0.381 n=5+4) RegexpMatchMedium_1K-8 15.4µs ± 1% 15.5µs ± 0% ~ (p=0.151 n=5+5) RegexpMatchHard_32-8 687ns ± 0% 686ns ± 0% ~ (p=0.103 n=5+5) RegexpMatchHard_1K-8 20.7µs ± 0% 20.7µs ± 1% ~ (p=0.151 n=5+5) Revcomp-8 175ms ± 2% 176ms ± 3% ~ (p=1.000 n=5+5) Template-8 20.4ms ± 6% 20.1ms ± 2% ~ (p=0.151 n=5+5) TimeParse-8 112ns ± 0% 113ns ± 0% +0.97% (p=0.016 n=5+4) TimeFormat-8 156ns ± 0% 145ns ± 0% -7.14% (p=0.029 n=4+4) Change-Id: I3ced26e89041f873ac989586514ccc5ee09f13da Reviewed-on: https://go-review.googlesource.com/c/go/+/425134 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-08-17 10:01:17 +00:00
genLateLowerRules(a)
})
}
var wg sync.WaitGroup
for _, task := range tasks {
task := task
wg.Add(1)
go func() {
task()
wg.Done()
}()
}
wg.Wait()
cmd/compile: teach rulegen to remove unused decls First, add cpu and memory profiling flags, as these are useful to see where rulegen is spending its time. It now takes many seconds to run on a recent laptop, so we have to keep an eye on what it's doing. Second, stop writing '_ = var' lines to keep imports and variables used at all times. Now that rulegen removes all such unused names, they're unnecessary. To perform the removal, lean on go/types to first detect what names are unused. We can configure it to give us all the type-checking errors in a file, so we can collect all "declared but not used" errors in a single pass. We then use astutil.Apply to remove the relevant nodes based on the line information from each unused error. This allows us to apply the changes without having to do extra parser+printer roundtrips to plaintext, which are far too expensive. We need to do multiple such passes, as removing an unused variable declaration might then make another declaration unused. Two passes are enough to clean every file at the moment, so add a limit of three passes for now to avoid eating cpu uncontrollably by accident. The resulting performance of the changes above is a ~30% loss across the table, since go/types is fairly expensive. The numbers were obtained with 'benchcmd Rulegen go run *.go', which involves compiling rulegen itself, but that seems reflective of how the program is used. name old time/op new time/op delta Rulegen 5.61s ± 0% 7.36s ± 0% +31.17% (p=0.016 n=5+4) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 9.92s ± 1% +37.76% (p=0.016 n=5+4) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 169ms ±17% +25.66% (p=0.032 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 85.6MB ± 2% +20.56% (p=0.008 n=5+5) We can live with a bit more resource usage, but the time/op getting close to 10s isn't good. To win that back, introduce concurrency in main.go. This further increases resource usage a bit, but the real time on this quad-core laptop is greatly reduced. The final benchstat is as follows: name old time/op new time/op delta Rulegen 5.61s ± 0% 3.97s ± 1% -29.26% (p=0.008 n=5+5) name old user-time/op new user-time/op delta Rulegen 7.20s ± 1% 13.91s ± 1% +93.09% (p=0.008 n=5+5) name old sys-time/op new sys-time/op delta Rulegen 135ms ±19% 269ms ± 9% +99.17% (p=0.008 n=5+5) name old peak-RSS-bytes new peak-RSS-bytes delta Rulegen 71.0MB ± 2% 226.3MB ± 1% +218.72% (p=0.008 n=5+5) It might be possible to reduce the cpu or memory usage in the future, such as configuring go/types to do less work, or taking shortcuts to avoid having to run it many times. For now, ~2x cpu and ~4x memory usage seems like a fair trade for a faster and better rulegen. Finally, we can remove the old code that tried to remove some unused variables in a hacky and unmaintainable way. Change-Id: Iff9e83e3f253babf5a1bd48cc993033b8550cee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/189798 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-08-10 19:27:45 +02:00
if *memprofile != "" {
f, err := os.Create(*memprofile)
if err != nil {
log.Fatal("could not create memory profile: ", err)
}
defer f.Close()
runtime.GC() // get up-to-date statistics
if err := pprof.WriteHeapProfile(f); err != nil {
log.Fatal("could not write memory profile: ", err)
}
}
}
func genOp() {
w := new(bytes.Buffer)
fmt.Fprintf(w, "// Code generated from _gen/*Ops.go using 'go generate'; DO NOT EDIT.\n")
fmt.Fprintln(w)
fmt.Fprintln(w, "package ssa")
fmt.Fprintln(w, "import (")
fmt.Fprintln(w, "\"cmd/internal/obj\"")
for _, a := range archs {
if a.pkg != "" {
fmt.Fprintf(w, "%q\n", a.pkg)
}
}
fmt.Fprintln(w, ")")
// generate Block* declarations
fmt.Fprintln(w, "const (")
fmt.Fprintln(w, "BlockInvalid BlockKind = iota")
for _, a := range archs {
fmt.Fprintln(w)
for _, d := range a.blocks {
fmt.Fprintf(w, "Block%s%s\n", a.Name(), d.name)
}
}
fmt.Fprintln(w, ")")
// generate block kind string method
fmt.Fprintln(w, "var blockString = [...]string{")
fmt.Fprintln(w, "BlockInvalid:\"BlockInvalid\",")
for _, a := range archs {
fmt.Fprintln(w)
for _, b := range a.blocks {
fmt.Fprintf(w, "Block%s%s:\"%s\",\n", a.Name(), b.name, b.name)
}
}
fmt.Fprintln(w, "}")
fmt.Fprintln(w, "func (k BlockKind) String() string {return blockString[k]}")
cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 07:29:31 -07:00
// generate block kind auxint method
fmt.Fprintln(w, "func (k BlockKind) AuxIntType() string {")
fmt.Fprintln(w, "switch k {")
cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 07:29:31 -07:00
for _, a := range archs {
for _, b := range a.blocks {
if b.auxIntType() == "invalid" {
cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 07:29:31 -07:00
continue
}
fmt.Fprintf(w, "case Block%s%s: return \"%s\"\n", a.Name(), b.name, b.auxIntType())
cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 07:29:31 -07:00
}
}
fmt.Fprintln(w, "}")
fmt.Fprintln(w, "return \"\"")
fmt.Fprintln(w, "}")
cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 07:29:31 -07:00
// generate Op* declarations
fmt.Fprintln(w, "const (")
fmt.Fprintln(w, "OpInvalid Op = iota") // make sure OpInvalid is 0.
for _, a := range archs {
fmt.Fprintln(w)
for _, v := range a.ops {
if v.name == "Invalid" {
continue
}
fmt.Fprintf(w, "Op%s%s\n", a.Name(), v.name)
}
}
fmt.Fprintln(w, ")")
// generate OpInfo table
fmt.Fprintln(w, "var opcodeTable = [...]opInfo{")
fmt.Fprintln(w, " { name: \"OpInvalid\" },")
for _, a := range archs {
fmt.Fprintln(w)
pkg := path.Base(a.pkg)
for _, v := range a.ops {
if v.name == "Invalid" {
continue
}
fmt.Fprintln(w, "{")
fmt.Fprintf(w, "name:\"%s\",\n", v.name)
// flags
if v.aux != "" {
fmt.Fprintf(w, "auxType: aux%s,\n", v.aux)
}
fmt.Fprintf(w, "argLen: %d,\n", v.argLength)
if v.rematerializeable {
if v.reg.clobbers != 0 {
log.Fatalf("%s is rematerializeable and clobbers registers", v.name)
}
if v.clobberFlags {
log.Fatalf("%s is rematerializeable and clobbers flags", v.name)
}
fmt.Fprintln(w, "rematerializeable: true,")
}
if v.commutative {
fmt.Fprintln(w, "commutative: true,")
}
if v.resultInArg0 {
fmt.Fprintln(w, "resultInArg0: true,")
// OpConvert's register mask is selected dynamically,
// so don't try to check it in the static table.
if v.name != "Convert" && v.reg.inputs[0] != v.reg.outputs[0] {
log.Fatalf("%s: input[0] and output[0] must use the same registers for %s", a.name, v.name)
}
if v.name != "Convert" && v.commutative && v.reg.inputs[1] != v.reg.outputs[0] {
log.Fatalf("%s: input[1] and output[0] must use the same registers for %s", a.name, v.name)
}
}
if v.resultNotInArgs {
fmt.Fprintln(w, "resultNotInArgs: true,")
}
if v.clobberFlags {
fmt.Fprintln(w, "clobberFlags: true,")
}
if v.needIntTemp {
fmt.Fprintln(w, "needIntTemp: true,")
}
if v.call {
fmt.Fprintln(w, "call: true,")
}
if v.tailCall {
fmt.Fprintln(w, "tailCall: true,")
}
if v.nilCheck {
fmt.Fprintln(w, "nilCheck: true,")
}
if v.faultOnNilArg0 {
fmt.Fprintln(w, "faultOnNilArg0: true,")
if v.aux != "Sym" && v.aux != "SymOff" && v.aux != "SymValAndOff" && v.aux != "Int64" && v.aux != "Int32" && v.aux != "" {
log.Fatalf("faultOnNilArg0 with aux %s not allowed", v.aux)
}
}
if v.faultOnNilArg1 {
fmt.Fprintln(w, "faultOnNilArg1: true,")
if v.aux != "Sym" && v.aux != "SymOff" && v.aux != "SymValAndOff" && v.aux != "Int64" && v.aux != "Int32" && v.aux != "" {
log.Fatalf("faultOnNilArg1 with aux %s not allowed", v.aux)
}
}
if v.hasSideEffects {
fmt.Fprintln(w, "hasSideEffects: true,")
}
if v.zeroWidth {
fmt.Fprintln(w, "zeroWidth: true,")
}
if v.unsafePoint {
fmt.Fprintln(w, "unsafePoint: true,")
}
needEffect := strings.HasPrefix(v.aux, "Sym")
if v.symEffect != "" {
if !needEffect {
log.Fatalf("symEffect with aux %s not allowed", v.aux)
}
fmt.Fprintf(w, "symEffect: Sym%s,\n", strings.Replace(v.symEffect, ",", "|Sym", -1))
} else if needEffect {
log.Fatalf("symEffect needed for aux %s", v.aux)
}
if a.name == "generic" {
fmt.Fprintln(w, "generic:true,")
fmt.Fprintln(w, "},") // close op
// generic ops have no reg info or asm
continue
}
if v.asm != "" {
fmt.Fprintf(w, "asm: %s.A%s,\n", pkg, v.asm)
}
if v.scale != 0 {
fmt.Fprintf(w, "scale: %d,\n", v.scale)
}
fmt.Fprintln(w, "reg:regInfo{")
// Compute input allocation order. We allocate from the
// most to the least constrained input. This order guarantees
// that we will always be able to find a register.
var s []intPair
for i, r := range v.reg.inputs {
if r != 0 {
s = append(s, intPair{countRegs(r), i})
}
}
if len(s) > 0 {
sort.Sort(byKey(s))
fmt.Fprintln(w, "inputs: []inputInfo{")
for _, p := range s {
r := v.reg.inputs[p.val]
fmt.Fprintf(w, "{%d,%d},%s\n", p.val, r, a.regMaskComment(r))
}
fmt.Fprintln(w, "},")
}
if v.reg.clobbers > 0 {
fmt.Fprintf(w, "clobbers: %d,%s\n", v.reg.clobbers, a.regMaskComment(v.reg.clobbers))
}
// reg outputs
s = s[:0]
for i, r := range v.reg.outputs {
s = append(s, intPair{countRegs(r), i})
}
if len(s) > 0 {
sort.Sort(byKey(s))
fmt.Fprintln(w, "outputs: []outputInfo{")
for _, p := range s {
r := v.reg.outputs[p.val]
fmt.Fprintf(w, "{%d,%d},%s\n", p.val, r, a.regMaskComment(r))
}
fmt.Fprintln(w, "},")
}
fmt.Fprintln(w, "},") // close reg info
fmt.Fprintln(w, "},") // close op
}
}
fmt.Fprintln(w, "}")
fmt.Fprintln(w, "func (o Op) Asm() obj.As {return opcodeTable[o].asm}")
fmt.Fprintln(w, "func (o Op) Scale() int16 {return int16(opcodeTable[o].scale)}")
// generate op string method
fmt.Fprintln(w, "func (o Op) String() string {return opcodeTable[o].name }")
fmt.Fprintln(w, "func (o Op) SymEffect() SymEffect { return opcodeTable[o].symEffect }")
fmt.Fprintln(w, "func (o Op) IsCall() bool { return opcodeTable[o].call }")
fmt.Fprintln(w, "func (o Op) IsTailCall() bool { return opcodeTable[o].tailCall }")
fmt.Fprintln(w, "func (o Op) HasSideEffects() bool { return opcodeTable[o].hasSideEffects }")
fmt.Fprintln(w, "func (o Op) UnsafePoint() bool { return opcodeTable[o].unsafePoint }")
fmt.Fprintln(w, "func (o Op) ResultInArg0() bool { return opcodeTable[o].resultInArg0 }")
// generate registers
for _, a := range archs {
if a.generic {
continue
}
fmt.Fprintf(w, "var registers%s = [...]Register {\n", a.name)
var gcRegN int
num := map[string]int8{}
for i, r := range a.regnames {
num[r] = int8(i)
pkg := a.pkg[len("cmd/internal/obj/"):]
var objname string // name in cmd/internal/obj/$ARCH
switch r {
case "SB":
// SB isn't a real register. cmd/internal/obj expects 0 in this case.
objname = "0"
case "SP":
objname = pkg + ".REGSP"
case "g":
objname = pkg + ".REGG"
default:
objname = pkg + ".REG_" + r
}
// Assign a GC register map index to registers
// that may contain pointers.
gcRegIdx := -1
if a.gpregmask&(1<<uint(i)) != 0 {
gcRegIdx = gcRegN
gcRegN++
}
fmt.Fprintf(w, " {%d, %s, %d, \"%s\"},\n", i, objname, gcRegIdx, r)
}
parameterRegisterList := func(paramNamesString string) []int8 {
paramNamesString = strings.TrimSpace(paramNamesString)
if paramNamesString == "" {
return nil
}
paramNames := strings.Split(paramNamesString, " ")
var paramRegs []int8
for _, regName := range paramNames {
if regName == "" {
// forgive extra spaces
continue
}
if regNum, ok := num[regName]; ok {
paramRegs = append(paramRegs, regNum)
delete(num, regName)
} else {
log.Fatalf("parameter register %s for architecture %s not a register name (or repeated in parameter list)", regName, a.name)
}
}
return paramRegs
}
paramIntRegs := parameterRegisterList(a.ParamIntRegNames)
paramFloatRegs := parameterRegisterList(a.ParamFloatRegNames)
if gcRegN > 32 {
// Won't fit in a uint32 mask.
log.Fatalf("too many GC registers (%d > 32) on %s", gcRegN, a.name)
}
fmt.Fprintln(w, "}")
fmt.Fprintf(w, "var paramIntReg%s = %#v\n", a.name, paramIntRegs)
fmt.Fprintf(w, "var paramFloatReg%s = %#v\n", a.name, paramFloatRegs)
fmt.Fprintf(w, "var gpRegMask%s = regMask(%d)\n", a.name, a.gpregmask)
fmt.Fprintf(w, "var fpRegMask%s = regMask(%d)\n", a.name, a.fpregmask)
if a.fp32regmask != 0 {
fmt.Fprintf(w, "var fp32RegMask%s = regMask(%d)\n", a.name, a.fp32regmask)
}
if a.fp64regmask != 0 {
fmt.Fprintf(w, "var fp64RegMask%s = regMask(%d)\n", a.name, a.fp64regmask)
}
fmt.Fprintf(w, "var specialRegMask%s = regMask(%d)\n", a.name, a.specialregmask)
fmt.Fprintf(w, "var framepointerReg%s = int8(%d)\n", a.name, a.framepointerreg)
fmt.Fprintf(w, "var linkReg%s = int8(%d)\n", a.name, a.linkreg)
}
// gofmt result
b := w.Bytes()
var err error
b, err = format.Source(b)
if err != nil {
fmt.Printf("%s\n", w.Bytes())
panic(err)
}
if err := os.WriteFile("../opGen.go", b, 0666); err != nil {
log.Fatalf("can't write output: %v\n", err)
}
// Check that the arch genfile handles all the arch-specific opcodes.
// This is very much a hack, but it is better than nothing.
//
// Do a single regexp pass to record all ops being handled in a map, and
// then compare that with the ops list. This is much faster than one
// regexp pass per opcode.
for _, a := range archs {
if a.genfile == "" {
continue
}
pattern := fmt.Sprintf(`\Wssa\.Op%s([a-zA-Z0-9_]+)\W`, a.name)
rxOp, err := regexp.Compile(pattern)
if err != nil {
log.Fatalf("bad opcode regexp %s: %v", pattern, err)
}
src, err := os.ReadFile(a.genfile)
if err != nil {
log.Fatalf("can't read %s: %v", a.genfile, err)
}
seen := make(map[string]bool, len(a.ops))
for _, m := range rxOp.FindAllSubmatch(src, -1) {
seen[string(m[1])] = true
}
for _, op := range a.ops {
if !seen[op.name] {
log.Fatalf("Op%s%s has no code generation in %s", a.name, op.name, a.genfile)
}
}
}
}
// Name returns the name of the architecture for use in Op* and Block* enumerations.
func (a arch) Name() string {
s := a.name
if s == "generic" {
s = ""
}
return s
}
// countRegs returns the number of set bits in the register mask.
func countRegs(r regMask) int {
return bits.OnesCount64(uint64(r))
}
// for sorting a pair of integers by key
type intPair struct {
key, val int
}
type byKey []intPair
func (a byKey) Len() int { return len(a) }
func (a byKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a byKey) Less(i, j int) bool { return a[i].key < a[j].key }