go/src/cmd/compile/internal/ssa/_gen/genericOps.go

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

// Generic opcodes typically specify a width. The inputs and outputs
// of that op are the given number of bits wide. There is no notion of
// "sign", so Add32 can be used both for signed and unsigned 32-bit
// addition.
// Signed/unsigned is explicit with the extension ops
// (SignExt*/ZeroExt*) and implicit as the arg to some opcodes
// (e.g. the second argument to shifts is unsigned). If not mentioned,
// all args take signed inputs, or don't care whether their inputs
// are signed or unsigned.
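// For illustration only (not part of the op table, placeholder variable names):
// both int32 and uint32 addition lower to the same Add32, while comparisons
// must pick the signed or unsigned variant.
//
//	var a, b int32
//	var c, d uint32
//	_ = a + b // Add32
//	_ = c + d // Add32 as well; only the width matters
//	_ = a < b // Less32 (signed)
//	_ = c < d // Less32U (unsigned)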
var genericOps = []opData{
// Pseudo-op.
{name: "Last", argLength: -1}, // return last element of tuple; for "let" bindings
// 2-input arithmetic
// Types must be consistent with Go typing. Add, for example, must take two values
// of the same type and produce that same type.
{name: "Add8", argLength: 2, commutative: true}, // arg0 + arg1
{name: "Add16", argLength: 2, commutative: true},
{name: "Add32", argLength: 2, commutative: true},
{name: "Add64", argLength: 2, commutative: true},
{name: "AddPtr", argLength: 2}, // For address calculations. arg0 is a pointer and arg1 is an int.
{name: "Add32F", argLength: 2, commutative: true},
{name: "Add64F", argLength: 2, commutative: true},
{name: "Sub8", argLength: 2}, // arg0 - arg1
{name: "Sub16", argLength: 2},
{name: "Sub32", argLength: 2},
{name: "Sub64", argLength: 2},
{name: "SubPtr", argLength: 2},
{name: "Sub32F", argLength: 2},
{name: "Sub64F", argLength: 2},
{name: "Mul8", argLength: 2, commutative: true}, // arg0 * arg1
{name: "Mul16", argLength: 2, commutative: true},
{name: "Mul32", argLength: 2, commutative: true},
{name: "Mul64", argLength: 2, commutative: true},
{name: "Mul32F", argLength: 2, commutative: true},
{name: "Mul64F", argLength: 2, commutative: true},
{name: "Div32F", argLength: 2}, // arg0 / arg1
{name: "Div64F", argLength: 2},
{name: "Hmul32", argLength: 2, commutative: true},
{name: "Hmul32u", argLength: 2, commutative: true},
{name: "Hmul64", argLength: 2, commutative: true},
{name: "Hmul64u", argLength: 2, commutative: true},
{name: "Mul32uhilo", argLength: 2, typ: "(UInt32,UInt32)", commutative: true}, // arg0 * arg1, returns (hi, lo)
{name: "Mul64uhilo", argLength: 2, typ: "(UInt64,UInt64)", commutative: true}, // arg0 * arg1, returns (hi, lo)
{name: "Mul32uover", argLength: 2, typ: "(UInt32,Bool)", commutative: true}, // Let x = arg0*arg1 (full 32x32-> 64 unsigned multiply), returns (uint32(x), (uint32(x) != x))
{name: "Mul64uover", argLength: 2, typ: "(UInt64,Bool)", commutative: true}, // Let x = arg0*arg1 (full 64x64->128 unsigned multiply), returns (uint64(x), (uint64(x) != x))
// Weird special instructions for use in the strength reduction of divides.
// These ops compute unsigned (arg0 + arg1) / 2, correct to all
// 32/64 bits, even when the intermediate result of the add has 33/65 bits.
// These ops can assume arg0 >= arg1.
// Note: these ops aren't commutative!
{name: "Avg32u", argLength: 2, typ: "UInt32"}, // 32-bit platforms only
{name: "Avg64u", argLength: 2, typ: "UInt64"}, // 64-bit platforms only
// For Div16, Div32 and Div64, AuxInt non-zero means that the divisor has been proved to be not -1
// or that the dividend is not the most negative value.
{name: "Div8", argLength: 2}, // arg0 / arg1, signed
{name: "Div8u", argLength: 2}, // arg0 / arg1, unsigned
{name: "Div16", argLength: 2, aux: "Bool"},
{name: "Div16u", argLength: 2},
{name: "Div32", argLength: 2, aux: "Bool"},
{name: "Div32u", argLength: 2},
{name: "Div64", argLength: 2, aux: "Bool"},
{name: "Div64u", argLength: 2},
{name: "Div128u", argLength: 3}, // arg0:arg1 / arg2 (128-bit divided by 64-bit), returns (q, r)
// For Mod16, Mod32 and Mod64, AuxInt non-zero means that the divisor has been proved to be not -1.
{name: "Mod8", argLength: 2}, // arg0 % arg1, signed
{name: "Mod8u", argLength: 2}, // arg0 % arg1, unsigned
{name: "Mod16", argLength: 2, aux: "Bool"},
{name: "Mod16u", argLength: 2},
{name: "Mod32", argLength: 2, aux: "Bool"},
{name: "Mod32u", argLength: 2},
{name: "Mod64", argLength: 2, aux: "Bool"},
{name: "Mod64u", argLength: 2},
{name: "And8", argLength: 2, commutative: true}, // arg0 & arg1
{name: "And16", argLength: 2, commutative: true},
{name: "And32", argLength: 2, commutative: true},
{name: "And64", argLength: 2, commutative: true},
{name: "Or8", argLength: 2, commutative: true}, // arg0 | arg1
{name: "Or16", argLength: 2, commutative: true},
{name: "Or32", argLength: 2, commutative: true},
{name: "Or64", argLength: 2, commutative: true},
{name: "Xor8", argLength: 2, commutative: true}, // arg0 ^ arg1
{name: "Xor16", argLength: 2, commutative: true},
{name: "Xor32", argLength: 2, commutative: true},
{name: "Xor64", argLength: 2, commutative: true},
// For shifts, AxB means the shifted value has A bits and the shift amount has B bits.
// Shift amounts are considered unsigned.
// If arg1 is known to be nonnegative and less than the number of bits in arg0,
// then auxInt may be set to 1.
// This enables better code generation on some platforms.
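// For example (an illustrative source pattern with placeholder uint64 names;
// the prove pass supplies the bound on s):
//
//	if s < 64 {
//		y = x << s // Lsh64x64 with auxInt=1: no clamp or bounds test needed
//	}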
{name: "Lsh8x8", argLength: 2, aux: "Bool"}, // arg0 << arg1
{name: "Lsh8x16", argLength: 2, aux: "Bool"},
{name: "Lsh8x32", argLength: 2, aux: "Bool"},
{name: "Lsh8x64", argLength: 2, aux: "Bool"},
{name: "Lsh16x8", argLength: 2, aux: "Bool"},
{name: "Lsh16x16", argLength: 2, aux: "Bool"},
{name: "Lsh16x32", argLength: 2, aux: "Bool"},
{name: "Lsh16x64", argLength: 2, aux: "Bool"},
{name: "Lsh32x8", argLength: 2, aux: "Bool"},
{name: "Lsh32x16", argLength: 2, aux: "Bool"},
{name: "Lsh32x32", argLength: 2, aux: "Bool"},
{name: "Lsh32x64", argLength: 2, aux: "Bool"},
{name: "Lsh64x8", argLength: 2, aux: "Bool"},
{name: "Lsh64x16", argLength: 2, aux: "Bool"},
{name: "Lsh64x32", argLength: 2, aux: "Bool"},
{name: "Lsh64x64", argLength: 2, aux: "Bool"},
{name: "Rsh8x8", argLength: 2, aux: "Bool"}, // arg0 >> arg1, signed
{name: "Rsh8x16", argLength: 2, aux: "Bool"},
{name: "Rsh8x32", argLength: 2, aux: "Bool"},
{name: "Rsh8x64", argLength: 2, aux: "Bool"},
{name: "Rsh16x8", argLength: 2, aux: "Bool"},
{name: "Rsh16x16", argLength: 2, aux: "Bool"},
{name: "Rsh16x32", argLength: 2, aux: "Bool"},
{name: "Rsh16x64", argLength: 2, aux: "Bool"},
{name: "Rsh32x8", argLength: 2, aux: "Bool"},
{name: "Rsh32x16", argLength: 2, aux: "Bool"},
{name: "Rsh32x32", argLength: 2, aux: "Bool"},
{name: "Rsh32x64", argLength: 2, aux: "Bool"},
{name: "Rsh64x8", argLength: 2, aux: "Bool"},
{name: "Rsh64x16", argLength: 2, aux: "Bool"},
{name: "Rsh64x32", argLength: 2, aux: "Bool"},
{name: "Rsh64x64", argLength: 2, aux: "Bool"},
{name: "Rsh8Ux8", argLength: 2, aux: "Bool"}, // arg0 >> arg1, unsigned
{name: "Rsh8Ux16", argLength: 2, aux: "Bool"},
{name: "Rsh8Ux32", argLength: 2, aux: "Bool"},
{name: "Rsh8Ux64", argLength: 2, aux: "Bool"},
{name: "Rsh16Ux8", argLength: 2, aux: "Bool"},
{name: "Rsh16Ux16", argLength: 2, aux: "Bool"},
{name: "Rsh16Ux32", argLength: 2, aux: "Bool"},
{name: "Rsh16Ux64", argLength: 2, aux: "Bool"},
{name: "Rsh32Ux8", argLength: 2, aux: "Bool"},
{name: "Rsh32Ux16", argLength: 2, aux: "Bool"},
{name: "Rsh32Ux32", argLength: 2, aux: "Bool"},
{name: "Rsh32Ux64", argLength: 2, aux: "Bool"},
{name: "Rsh64Ux8", argLength: 2, aux: "Bool"},
{name: "Rsh64Ux16", argLength: 2, aux: "Bool"},
{name: "Rsh64Ux32", argLength: 2, aux: "Bool"},
{name: "Rsh64Ux64", argLength: 2, aux: "Bool"},
// 2-input comparisons
{name: "Eq8", argLength: 2, commutative: true, typ: "Bool"}, // arg0 == arg1
{name: "Eq16", argLength: 2, commutative: true, typ: "Bool"},
{name: "Eq32", argLength: 2, commutative: true, typ: "Bool"},
{name: "Eq64", argLength: 2, commutative: true, typ: "Bool"},
{name: "EqPtr", argLength: 2, commutative: true, typ: "Bool"},
{name: "EqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "EqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "Eq32F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Eq64F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq8", argLength: 2, commutative: true, typ: "Bool"}, // arg0 != arg1
{name: "Neq16", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq32", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq64", argLength: 2, commutative: true, typ: "Bool"},
{name: "NeqPtr", argLength: 2, commutative: true, typ: "Bool"},
{name: "NeqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "NeqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
{name: "Neq32F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Neq64F", argLength: 2, commutative: true, typ: "Bool"},
{name: "Less8", argLength: 2, typ: "Bool"}, // arg0 < arg1, signed
{name: "Less8U", argLength: 2, typ: "Bool"}, // arg0 < arg1, unsigned
{name: "Less16", argLength: 2, typ: "Bool"},
{name: "Less16U", argLength: 2, typ: "Bool"},
{name: "Less32", argLength: 2, typ: "Bool"},
{name: "Less32U", argLength: 2, typ: "Bool"},
{name: "Less64", argLength: 2, typ: "Bool"},
{name: "Less64U", argLength: 2, typ: "Bool"},
{name: "Less32F", argLength: 2, typ: "Bool"},
{name: "Less64F", argLength: 2, typ: "Bool"},
{name: "Leq8", argLength: 2, typ: "Bool"}, // arg0 <= arg1, signed
{name: "Leq8U", argLength: 2, typ: "Bool"}, // arg0 <= arg1, unsigned
{name: "Leq16", argLength: 2, typ: "Bool"},
{name: "Leq16U", argLength: 2, typ: "Bool"},
{name: "Leq32", argLength: 2, typ: "Bool"},
{name: "Leq32U", argLength: 2, typ: "Bool"},
{name: "Leq64", argLength: 2, typ: "Bool"},
{name: "Leq64U", argLength: 2, typ: "Bool"},
{name: "Leq32F", argLength: 2, typ: "Bool"},
{name: "Leq64F", argLength: 2, typ: "Bool"},
// the type of a CondSelect is the same as the type of its first
// two arguments, which should be register-width scalars; the third
// argument should be a boolean
{name: "CondSelect", argLength: 3}, // arg2 ? arg0 : arg1
// boolean ops
{name: "AndB", argLength: 2, commutative: true, typ: "Bool"}, // arg0 && arg1 (not shortcircuited)
{name: "OrB", argLength: 2, commutative: true, typ: "Bool"}, // arg0 || arg1 (not shortcircuited)
{name: "EqB", argLength: 2, commutative: true, typ: "Bool"}, // arg0 == arg1
{name: "NeqB", argLength: 2, commutative: true, typ: "Bool"}, // arg0 != arg1
{name: "Not", argLength: 1, typ: "Bool"}, // !arg0, boolean
// 1-input ops
{name: "Neg8", argLength: 1}, // -arg0
{name: "Neg16", argLength: 1},
{name: "Neg32", argLength: 1},
{name: "Neg64", argLength: 1},
{name: "Neg32F", argLength: 1},
{name: "Neg64F", argLength: 1},
{name: "Com8", argLength: 1}, // ^arg0
{name: "Com16", argLength: 1},
{name: "Com32", argLength: 1},
{name: "Com64", argLength: 1},
{name: "Ctz8", argLength: 1}, // Count trailing (low order) zeroes (returns 0-8)
{name: "Ctz16", argLength: 1}, // Count trailing (low order) zeroes (returns 0-16)
{name: "Ctz32", argLength: 1}, // Count trailing (low order) zeroes (returns 0-32)
{name: "Ctz64", argLength: 1}, // Count trailing (low order) zeroes (returns 0-64)
{name: "Ctz64On32", argLength: 2}, // Count trailing (low order) zeroes (returns 0-64) in arg[1]<<32+arg[0]
{name: "Ctz8NonZero", argLength: 1}, // same as above, but arg[0] known to be non-zero, returns 0-7
{name: "Ctz16NonZero", argLength: 1}, // same as above, but arg[0] known to be non-zero, returns 0-15
{name: "Ctz32NonZero", argLength: 1}, // same as above, but arg[0] known to be non-zero, returns 0-31
{name: "Ctz64NonZero", argLength: 1}, // same as above, but arg[0] known to be non-zero, returns 0-63
{name: "BitLen8", argLength: 1}, // Number of bits in arg[0] (returns 0-8)
{name: "BitLen16", argLength: 1}, // Number of bits in arg[0] (returns 0-16)
{name: "BitLen32", argLength: 1}, // Number of bits in arg[0] (returns 0-32)
{name: "BitLen64", argLength: 1}, // Number of bits in arg[0] (returns 0-64)
{name: "Bswap16", argLength: 1}, // Swap bytes
{name: "Bswap32", argLength: 1}, // Swap bytes
{name: "Bswap64", argLength: 1}, // Swap bytes
{name: "BitRev8", argLength: 1}, // Reverse the bits in arg[0]
{name: "BitRev16", argLength: 1}, // Reverse the bits in arg[0]
{name: "BitRev32", argLength: 1}, // Reverse the bits in arg[0]
{name: "BitRev64", argLength: 1}, // Reverse the bits in arg[0]
{name: "PopCount8", argLength: 1}, // Count bits in arg[0]
{name: "PopCount16", argLength: 1}, // Count bits in arg[0]
{name: "PopCount32", argLength: 1}, // Count bits in arg[0]
{name: "PopCount64", argLength: 1}, // Count bits in arg[0]
// RotateLeftX instructions rotate the X bits of arg[0] to the left
// by the low lg_2(X) bits of arg[1], interpreted as an unsigned value.
// Note that this works out regardless of the bit width or signedness of
// arg[1]. In particular, RotateLeft by x is the same as RotateRight by -x.
{name: "RotateLeft64", argLength: 2},
{name: "RotateLeft32", argLength: 2},
{name: "RotateLeft16", argLength: 2},
{name: "RotateLeft8", argLength: 2},
// Square root.
// Special cases:
// +∞ → +∞
// ±0 → ±0 (sign preserved)
// x<0 → NaN
// NaN → NaN
{name: "Sqrt", argLength: 1}, // √arg0 (floating point, double precision)
{name: "Sqrt32", argLength: 1}, // √arg0 (floating point, single precision)
// Round to integer, float64 only.
// Special cases:
// ±∞ → ±∞ (sign preserved)
// ±0 → ±0 (sign preserved)
// NaN → NaN
{name: "Floor", argLength: 1}, // round arg0 toward -∞
{name: "Ceil", argLength: 1}, // round arg0 toward +∞
{name: "Trunc", argLength: 1}, // round arg0 toward 0
{name: "Round", argLength: 1}, // round arg0 to nearest, ties away from 0
{name: "RoundToEven", argLength: 1}, // round arg0 to nearest, ties to even
// Modify the sign bit
{name: "Abs", argLength: 1}, // absolute value arg0
{name: "Copysign", argLength: 2}, // copy sign from arg0 to arg1
// Integer min/max implementation, if hardware is available.
{name: "Min64", argLength: 2}, // min(arg0,arg1), signed
{name: "Max64", argLength: 2}, // max(arg0,arg1), signed
{name: "Min64u", argLength: 2}, // min(arg0,arg1), unsigned
{name: "Max64u", argLength: 2}, // max(arg0,arg1), unsigned
// Float min/max implementation, if hardware is available.
{name: "Min64F", argLength: 2}, // min(arg0,arg1)
{name: "Min32F", argLength: 2}, // min(arg0,arg1)
{name: "Max64F", argLength: 2}, // max(arg0,arg1)
{name: "Max32F", argLength: 2}, // max(arg0,arg1)
// 3-input opcode.
// Fused-multiply-add, float64 only.
// When a*b+c is exactly zero (before rounding), then the result is +0 or -0.
// The 0's sign is determined according to the standard rules for the
// addition (-0 if both a*b and c are -0, +0 otherwise).
//
// Otherwise, when a*b+c rounds to zero, then the resulting 0's sign is
// determined by the sign of the exact result a*b+c.
// See section 6.3 in ieee754.
//
// When the multiply is an infinity times a zero, the result is NaN.
// See section 7.2 in ieee754.
{name: "FMA", argLength: 3}, // compute (a*b)+c without intermediate rounding
// Data movement. Max argument length for Phi is indefinite.
{name: "Phi", argLength: -1, zeroWidth: true}, // select an argument based on which predecessor block we came from
{name: "Copy", argLength: 1}, // output = arg0
// Convert converts between pointers and integers.
// We have a special op for this so as to not confuse GC
// (particularly stack maps). It takes a memory arg so it
// gets correctly ordered with respect to GC safepoints.
// It gets compiled to nothing, so its result must be in the same
// register as its argument. regalloc knows it can use any
// allocatable integer register for OpConvert.
// arg0=ptr/int arg1=mem, output=int/ptr
{name: "Convert", argLength: 2, zeroWidth: true, resultInArg0: true},
// constants. Constant values are stored in the aux or
// auxint fields.
{name: "ConstBool", aux: "Bool"}, // auxint is 0 for false and 1 for true
{name: "ConstString", aux: "String"}, // value is aux.(string)
{name: "ConstNil", typ: "BytePtr"}, // nil pointer
{name: "Const8", aux: "Int8"}, // auxint is sign-extended 8 bits
{name: "Const16", aux: "Int16"}, // auxint is sign-extended 16 bits
{name: "Const32", aux: "Int32"}, // auxint is sign-extended 32 bits
// Note: ConstX are sign-extended even when the type of the value is unsigned.
// For instance, uint8(0xaa) is stored as auxint=0xffffffffffffffaa.
{name: "Const64", aux: "Int64"}, // value is auxint
// Note: for both Const32F and Const64F, we disallow encoding NaNs.
// Signaling NaNs are tricky because if you do anything with them, they become quiet.
// Particularly, converting a 32 bit sNaN to 64 bit and back converts it to a qNaN.
// See issue 36399 and 36400.
// Encodings of +inf, -inf, and -0 are fine.
{name: "Const32F", aux: "Float32"}, // value is math.Float64frombits(uint64(auxint)) and is exactly representable as float 32
{name: "Const64F", aux: "Float64"}, // value is math.Float64frombits(uint64(auxint))
{name: "ConstInterface"}, // nil interface
{name: "ConstSlice"}, // nil slice
// Constant-like things
{name: "InitMem", zeroWidth: true}, // memory input to the function.
{name: "Arg", aux: "SymOff", symEffect: "Read", zeroWidth: true}, // argument to the function. aux=GCNode of arg, off = offset in that arg.
// Like Arg, these are generic ops that survive lowering. AuxInt is a register index, and the actual output register for each index is defined by the architecture.
// AuxInt = integer argument index (not a register number). ABI-specified spill loc obtained from function
{name: "ArgIntReg", aux: "NameOffsetInt8", zeroWidth: true}, // argument to the function in an int reg.
{name: "ArgFloatReg", aux: "NameOffsetInt8", zeroWidth: true}, // argument to the function in a float reg.
// The address of a variable. arg0 is the base pointer.
// If the variable is a global, the base pointer will be SB and
// the Aux field will be a *obj.LSym.
// If the variable is a local, the base pointer will be SP and
// the Aux field will be a *ir.Name
{name: "Addr", argLength: 1, aux: "Sym", symEffect: "Addr"}, // Address of a variable. Arg0=SB. Aux identifies the variable.
{name: "LocalAddr", argLength: 2, aux: "Sym", symEffect: "Addr"}, // Address of a variable. Arg0=SP. Arg1=mem. Aux identifies the variable.
{name: "SP", zeroWidth: true, fixedReg: true}, // stack pointer
{name: "SB", typ: "Uintptr", zeroWidth: true, fixedReg: true}, // static base pointer (a.k.a. globals pointer)
{name: "Invalid"}, // unused value
{name: "SPanchored", typ: "Uintptr", argLength: 2, zeroWidth: true}, // arg0 = SP, arg1 = mem. Result is identical to arg0, but cannot be scheduled before memory state arg1.
// Memory operations
{name: "Load", argLength: 2}, // Load from arg0. arg1=memory
{name: "Dereference", argLength: 2}, // Load from arg0. arg1=memory. Helper op for arg/result passing, result is an otherwise not-SSA-able "value".
{name: "Store", argLength: 3, typ: "Mem", aux: "Typ"}, // Store arg1 to arg0. arg2=memory, aux=type. Returns memory.
// masked memory operations.
// TODO add 16 and 8
{name: "LoadMasked8", argLength: 3}, // Load from arg0, arg1 = mask of 8-bits, arg2 = memory
{name: "LoadMasked16", argLength: 3}, // Load from arg0, arg1 = mask of 16-bits, arg2 = memory
{name: "LoadMasked32", argLength: 3}, // Load from arg0, arg1 = mask of 32-bits, arg2 = memory
{name: "LoadMasked64", argLength: 3}, // Load from arg0, arg1 = mask of 64-bits, arg2 = memory
{name: "StoreMasked8", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 8-bits, arg3 = memory
{name: "StoreMasked16", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 16-bits, arg3 = memory
{name: "StoreMasked32", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 32-bits, arg3 = memory
{name: "StoreMasked64", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 64-bits, arg3 = memory
// Normally we require that the source and destination of Move do not overlap.
// There is an exception when we know all the loads will happen before all
// the stores. In that case, overlap is ok. See
// memmove inlining in generic.rules. When inlineablememmovesize (in ../rewrite.go)
// returns true, we must do all loads before all stores, when lowering Move.
// The type of Move is used for the write barrier pass to insert write barriers
// and for alignment on some architectures.
// For pointerless types, it is possible for the type to be inaccurate.
// For type alignment and pointer information, use the type in Aux;
// for type size, use the size in AuxInt.
// The "inline runtime.memmove" rewrite rule generates Moves with inaccurate types,
// such as type byte instead of the more accurate type [8]byte.
{name: "Move", argLength: 3, typ: "Mem", aux: "TypSize"}, // arg0=destptr, arg1=srcptr, arg2=mem, auxint=size, aux=type. Returns memory.
{name: "Zero", argLength: 2, typ: "Mem", aux: "TypSize"}, // arg0=destptr, arg1=mem, auxint=size, aux=type. Returns memory.
// Memory operations with write barriers.
// Expand to runtime calls. Write barrier will be removed if write on stack.
{name: "StoreWB", argLength: 3, typ: "Mem", aux: "Typ"}, // Store arg1 to arg0. arg2=memory, aux=type. Returns memory.
{name: "MoveWB", argLength: 3, typ: "Mem", aux: "TypSize"}, // arg0=destptr, arg1=srcptr, arg2=mem, auxint=size, aux=type. Returns memory.
{name: "ZeroWB", argLength: 2, typ: "Mem", aux: "TypSize"}, // arg0=destptr, arg1=mem, auxint=size, aux=type. Returns memory.
{name: "WBend", argLength: 1, typ: "Mem"}, // Write barrier code is done, interrupting is now allowed.
// WB invokes runtime.gcWriteBarrier. This is not a normal
// call: it takes arguments in registers, doesn't clobber
// general-purpose registers (the exact clobber set is
// arch-dependent), and is not a safe-point.
{name: "WB", argLength: 1, typ: "(BytePtr,Mem)", aux: "Int64"}, // arg0=mem, auxint=# of buffer entries needed. Returns buffer pointer and memory.
{name: "HasCPUFeature", argLength: 0, typ: "bool", aux: "Sym", symEffect: "None"}, // aux=place that this feature flag can be loaded from
// PanicBounds and PanicExtend generate a runtime panic.
// Their arguments provide index values to use in panic messages.
// Both PanicBounds and PanicExtend have an AuxInt value from the BoundsKind type (in ../op.go).
// PanicBounds' index is int sized.
// PanicExtend's index is int64 sized. (PanicExtend is only used on 32-bit archs.)
{name: "PanicBounds", argLength: 3, aux: "Int64", typ: "Mem", call: true}, // arg0=idx, arg1=len, arg2=mem, returns memory.
{name: "PanicExtend", argLength: 4, aux: "Int64", typ: "Mem", call: true}, // arg0=idxHi, arg1=idxLo, arg2=len, arg3=mem, returns memory.
// Function calls. Arguments to the call have already been written to the stack.
// Return values appear on the stack. The method receiver, if any, is treated
// as a phantom first argument.
// TODO(josharian): ClosureCall and InterCall should have Int32 aux
// to match StaticCall's 32 bit arg size limit.
// TODO(drchase,josharian): could the arg size limit be bundled into the rules for CallOff?
// Before lowering, LECalls receive their fixed inputs (first), memory (last),
// and a variable number of input values in the middle.
// They produce a variable number of result values.
// These values are not necessarily "SSA-able"; they can be too large,
// but in that case inputs are loaded immediately before with OpDereference,
// and outputs are stored immediately with OpStore.
//
// After call expansion, Calls have the same fixed-middle-memory arrangement of inputs,
// with the difference that the "middle" is only the register-resident inputs,
// and the non-register inputs are instead stored at ABI-defined offsets from SP
// (and the stores thread through the memory that is ultimately an input to the call).
// Outputs follow a similar pattern; register-resident outputs are the leading elements
// of a Result-typed output, with memory last, and any memory-resident outputs have been
// stored to ABI-defined locations. Each non-memory input or output fits in a register.
//
// Subsequent architecture-specific lowering only changes the opcode.
{name: "ClosureCall", argLength: -1, aux: "CallOff", call: true}, // arg0=code pointer, arg1=context ptr, arg2..argN-1 are register inputs, argN=memory. auxint=arg size. Returns Result of register results, plus memory.
{name: "StaticCall", argLength: -1, aux: "CallOff", call: true}, // call function aux.(*obj.LSym), arg0..argN-1 are register inputs, argN=memory. auxint=arg size. Returns Result of register results, plus memory.
{name: "InterCall", argLength: -1, aux: "CallOff", call: true}, // interface call. arg0=code pointer, arg1..argN-1 are register inputs, argN=memory, auxint=arg size. Returns Result of register results, plus memory.
{name: "TailCall", argLength: -1, aux: "CallOff", call: true}, // tail call function aux.(*obj.LSym), arg0..argN-1 are register inputs, argN=memory. auxint=arg size. Returns Result of register results, plus memory.
{name: "ClosureLECall", argLength: -1, aux: "CallOff", call: true}, // late-expanded closure call. arg0=code pointer, arg1=context ptr, arg2..argN-1 are inputs, argN is mem. auxint = arg size. Result is tuple of result(s), plus mem.
{name: "StaticLECall", argLength: -1, aux: "CallOff", call: true}, // late-expanded static call function aux.(*ssa.AuxCall.Fn). arg0..argN-1 are inputs, argN is mem. auxint = arg size. Result is tuple of result(s), plus mem.
{name: "InterLECall", argLength: -1, aux: "CallOff", call: true}, // late-expanded interface call. arg0=code pointer, arg1..argN-1 are inputs, argN is mem. auxint = arg size. Result is tuple of result(s), plus mem.
{name: "TailLECall", argLength: -1, aux: "CallOff", call: true}, // late-expanded static tail call function aux.(*ssa.AuxCall.Fn). arg0..argN-1 are inputs, argN is mem. auxint = arg size. Result is tuple of result(s), plus mem.
// Conversions: signed extensions, zero (unsigned) extensions, truncations
{name: "SignExt8to16", argLength: 1, typ: "Int16"},
{name: "SignExt8to32", argLength: 1, typ: "Int32"},
{name: "SignExt8to64", argLength: 1, typ: "Int64"},
{name: "SignExt16to32", argLength: 1, typ: "Int32"},
{name: "SignExt16to64", argLength: 1, typ: "Int64"},
{name: "SignExt32to64", argLength: 1, typ: "Int64"},
{name: "ZeroExt8to16", argLength: 1, typ: "UInt16"},
{name: "ZeroExt8to32", argLength: 1, typ: "UInt32"},
{name: "ZeroExt8to64", argLength: 1, typ: "UInt64"},
{name: "ZeroExt16to32", argLength: 1, typ: "UInt32"},
{name: "ZeroExt16to64", argLength: 1, typ: "UInt64"},
{name: "ZeroExt32to64", argLength: 1, typ: "UInt64"},
{name: "Trunc16to8", argLength: 1},
{name: "Trunc32to8", argLength: 1},
{name: "Trunc32to16", argLength: 1},
{name: "Trunc64to8", argLength: 1},
{name: "Trunc64to16", argLength: 1},
{name: "Trunc64to32", argLength: 1},
{name: "Cvt32to32F", argLength: 1},
{name: "Cvt32to64F", argLength: 1},
{name: "Cvt64to32F", argLength: 1},
{name: "Cvt64to64F", argLength: 1},
{name: "Cvt32Fto32", argLength: 1},
{name: "Cvt32Fto64", argLength: 1},
{name: "Cvt64Fto32", argLength: 1},
{name: "Cvt64Fto64", argLength: 1},
{name: "Cvt32Fto64F", argLength: 1},
{name: "Cvt64Fto32F", argLength: 1},
{name: "CvtBoolToUint8", argLength: 1},
// Force rounding to precision of type.
{name: "Round32F", argLength: 1},
{name: "Round64F", argLength: 1},
// Automatically inserted safety checks
{name: "IsNonNil", argLength: 1, typ: "Bool"}, // arg0 != nil
{name: "IsInBounds", argLength: 2, typ: "Bool"}, // 0 <= arg0 < arg1. arg1 is guaranteed >= 0.
{name: "IsSliceInBounds", argLength: 2, typ: "Bool"}, // 0 <= arg0 <= arg1. arg1 is guaranteed >= 0.
{name: "NilCheck", argLength: 2, nilCheck: true}, // arg0=ptr, arg1=mem. Panics if arg0 is nil. Returns the ptr unmodified.
// Pseudo-ops
{name: "GetG", argLength: 1, zeroWidth: true}, // runtime.getg() (read g pointer). arg0=mem
{name: "GetClosurePtr"}, // get closure pointer from dedicated register
{name: "GetCallerPC"}, // for GetCallerPC intrinsic
{name: "GetCallerSP", argLength: 1}, // for GetCallerSP intrinsic. arg0=mem.
// Indexing operations
{name: "PtrIndex", argLength: 2}, // arg0=ptr, arg1=index. Computes ptr+sizeof(*v.type)*index, where index is extended to ptrwidth type
{name: "OffPtr", argLength: 1, aux: "Int64"}, // arg0 + auxint (arg0 and result are pointers)
// Slices
{name: "SliceMake", argLength: 3}, // arg0=ptr, arg1=len, arg2=cap
{name: "SlicePtr", argLength: 1, typ: "BytePtr"}, // ptr(arg0)
{name: "SliceLen", argLength: 1}, // len(arg0)
{name: "SliceCap", argLength: 1}, // cap(arg0)
// SlicePtrUnchecked, like SlicePtr, extracts the pointer from a slice.
// SlicePtr values are assumed non-nil, because they are guarded by bounds checks.
// SlicePtrUnchecked values can be nil.
{name: "SlicePtrUnchecked", argLength: 1},
// Complex (part/whole)
{name: "ComplexMake", argLength: 2}, // arg0=real, arg1=imag
{name: "ComplexReal", argLength: 1}, // real(arg0)
{name: "ComplexImag", argLength: 1}, // imag(arg0)
// Strings
{name: "StringMake", argLength: 2}, // arg0=ptr, arg1=len
{name: "StringPtr", argLength: 1, typ: "BytePtr"}, // ptr(arg0)
{name: "StringLen", argLength: 1, typ: "Int"}, // len(arg0)
// Interfaces
{name: "IMake", argLength: 2}, // arg0=itab, arg1=data
{name: "ITab", argLength: 1, typ: "Uintptr"}, // arg0=interface, returns itable field
{name: "IData", argLength: 1}, // arg0=interface, returns data field
// Structs
{name: "StructMake", argLength: -1}, // args...=field0..n-1. Returns struct with n fields.
{name: "StructSelect", argLength: 1, aux: "Int64"}, // arg0=struct, auxint=field index. Returns the auxint'th field.
// Arrays
{name: "ArrayMake0"}, // Returns array with 0 elements
{name: "ArrayMake1", argLength: 1}, // Returns array with 1 element
{name: "ArraySelect", argLength: 1, aux: "Int64"}, // arg0=array, auxint=index. Returns a[i].
// Spill&restore ops for the register allocator. These are
// semantically identical to OpCopy; they do not take/return
// stores like regular memory ops do. We can get away without memory
// args because we know there is no aliasing of spill slots on the stack.
{name: "StoreReg", argLength: 1},
{name: "LoadReg", argLength: 1},
// Used during ssa construction. Like Copy, but the arg has not been specified yet.
{name: "FwdRef", aux: "Sym", symEffect: "None"},
// Unknown value. Used for Values whose values don't matter because they are dead code.
{name: "Unknown"},
{name: "VarDef", argLength: 1, aux: "Sym", typ: "Mem", symEffect: "None", zeroWidth: true}, // aux is a *ir.Name of a variable that is about to be initialized. arg0=mem, returns mem
// TODO: what's the difference between VarLive and KeepAlive?
{name: "VarLive", argLength: 1, aux: "Sym", symEffect: "Read", zeroWidth: true}, // aux is a *ir.Name of a variable that must be kept live. arg0=mem, returns mem
{name: "KeepAlive", argLength: 2, typ: "Mem", zeroWidth: true}, // arg[0] is a value that must be kept alive until this mark. arg[1]=mem, returns mem
// InlMark marks the start of an inlined function body. Its AuxInt field
// distinguishes which entry in the local inline tree it is marking.
{name: "InlMark", argLength: 1, aux: "Int32", typ: "Void"}, // arg[0]=mem, returns void.
// Ops for breaking 64-bit operations on 32-bit architectures
{name: "Int64Make", argLength: 2, typ: "UInt64"}, // arg0=hi, arg1=lo
{name: "Int64Hi", argLength: 1, typ: "UInt32"}, // high 32-bit of arg0
{name: "Int64Lo", argLength: 1, typ: "UInt32"}, // low 32-bit of arg0
{name: "Add32carry", argLength: 2, commutative: true, typ: "(UInt32,Flags)"}, // arg0 + arg1, returns (value, carry)
{name: "Add32withcarry", argLength: 3, commutative: true}, // arg0 + arg1 + arg2, arg2=carry (0 or 1)
{name: "Add32carrywithcarry", argLength: 3, commutative: true, typ: "(UInt32,Flags)"}, // arg0 + arg1 + arg2, arg2=carry, returns (value, carry)
{name: "Sub32carry", argLength: 2, typ: "(UInt32,Flags)"}, // arg0 - arg1, returns (value, carry)
{name: "Sub32withcarry", argLength: 3}, // arg0 - arg1 - arg2, arg2=carry (0 or 1)
{name: "Add64carry", argLength: 3, commutative: true, typ: "(UInt64,UInt64)"}, // arg0 + arg1 + arg2, arg2 must be 0 or 1. returns (value, value>>64)
{name: "Sub64borrow", argLength: 3, typ: "(UInt64,UInt64)"}, // arg0 - (arg1 + arg2), arg2 must be 0 or 1. returns (value, value>>64&1)
{name: "Signmask", argLength: 1, typ: "Int32"}, // 0 if arg0 >= 0, -1 if arg0 < 0
{name: "Zeromask", argLength: 1, typ: "UInt32"}, // 0 if arg0 == 0, 0xffffffff if arg0 != 0
{name: "Slicemask", argLength: 1}, // 0 if arg0 == 0, -1 if arg0 > 0, undef if arg0<0. Type is native int size.
{name: "SpectreIndex", argLength: 2}, // arg0 if 0 <= arg0 < arg1, 0 otherwise. Type is native int size.
{name: "SpectreSliceIndex", argLength: 2}, // arg0 if 0 <= arg0 <= arg1, 0 otherwise. Type is native int size.
{name: "Cvt32Uto32F", argLength: 1}, // uint32 -> float32, only used on 32-bit arch
{name: "Cvt32Uto64F", argLength: 1}, // uint32 -> float64, only used on 32-bit arch
{name: "Cvt32Fto32U", argLength: 1}, // float32 -> uint32, only used on 32-bit arch
{name: "Cvt64Fto32U", argLength: 1}, // float64 -> uint32, only used on 32-bit arch
{name: "Cvt64Uto32F", argLength: 1}, // uint64 -> float32, only used on archs that has the instruction
{name: "Cvt64Uto64F", argLength: 1}, // uint64 -> float64, only used on archs that has the instruction
{name: "Cvt32Fto64U", argLength: 1}, // float32 -> uint64, only used on archs that has the instruction
{name: "Cvt64Fto64U", argLength: 1}, // float64 -> uint64, only used on archs that has the instruction
// pseudo-ops for breaking Tuple
{name: "Select0", argLength: 1, zeroWidth: true}, // the first component of a tuple
{name: "Select1", argLength: 1, zeroWidth: true}, // the second component of a tuple
{name: "MakeTuple", argLength: 2}, // arg0 arg1 are components of a "Tuple" (like the result from a 128bits op).
{name: "SelectN", argLength: 1, aux: "Int64"}, // arg0=result, auxint=field index. Returns the auxint'th member.
{name: "SelectNAddr", argLength: 1, aux: "Int64"}, // arg0=result, auxint=field index. Returns the address of auxint'th member. Used for un-SSA-able result types.
{name: "MakeResult", argLength: -1}, // arg0 .. are components of a "Result" (like the result from a Call). The last arg should be memory (like the result from a call).
// Atomic operations used for semantically inlining sync/atomic and
// internal/runtime/atomic. Atomic loads return a new memory so that
// the loads are properly ordered with respect to other loads and
// stores.
{name: "AtomicLoad8", argLength: 2, typ: "(UInt8,Mem)"}, // Load from arg0. arg1=memory. Returns loaded value and new memory.
{name: "AtomicLoad32", argLength: 2, typ: "(UInt32,Mem)"}, // Load from arg0. arg1=memory. Returns loaded value and new memory.
{name: "AtomicLoad64", argLength: 2, typ: "(UInt64,Mem)"}, // Load from arg0. arg1=memory. Returns loaded value and new memory.
{name: "AtomicLoadPtr", argLength: 2, typ: "(BytePtr,Mem)"}, // Load from arg0. arg1=memory. Returns loaded value and new memory.
{name: "AtomicLoadAcq32", argLength: 2, typ: "(UInt32,Mem)"}, // Load from arg0. arg1=memory. Lock acquisition, returns loaded value and new memory.
{name: "AtomicLoadAcq64", argLength: 2, typ: "(UInt64,Mem)"}, // Load from arg0. arg1=memory. Lock acquisition, returns loaded value and new memory.
{name: "AtomicStore8", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStore32", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStore64", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStorePtrNoWB", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStoreRel32", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Lock release, returns memory.
{name: "AtomicStoreRel64", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Lock release, returns memory.
{name: "AtomicExchange8", argLength: 3, typ: "(UInt8,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicExchange32", argLength: 3, typ: "(UInt32,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicExchange64", argLength: 3, typ: "(UInt64,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicAdd32", argLength: 3, typ: "(UInt32,Mem)", hasSideEffects: true}, // Do *arg0 += arg1. arg2=memory. Returns sum and new memory.
{name: "AtomicAdd64", argLength: 3, typ: "(UInt64,Mem)", hasSideEffects: true}, // Do *arg0 += arg1. arg2=memory. Returns sum and new memory.
{name: "AtomicCompareAndSwap32", argLength: 4, typ: "(Bool,Mem)", hasSideEffects: true}, // if *arg0==arg1, then set *arg0=arg2. Returns true if store happens and new memory.
{name: "AtomicCompareAndSwap64", argLength: 4, typ: "(Bool,Mem)", hasSideEffects: true}, // if *arg0==arg1, then set *arg0=arg2. Returns true if store happens and new memory.
{name: "AtomicCompareAndSwapRel32", argLength: 4, typ: "(Bool,Mem)", hasSideEffects: true}, // if *arg0==arg1, then set *arg0=arg2. Lock release, reports whether store happens and new memory.
// Older atomic logical operations which don't return the old value.
{name: "AtomicAnd8", argLength: 3, typ: "Mem", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns memory.
{name: "AtomicOr8", argLength: 3, typ: "Mem", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns memory.
{name: "AtomicAnd32", argLength: 3, typ: "Mem", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns memory.
{name: "AtomicOr32", argLength: 3, typ: "Mem", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns memory.
// Newer atomic logical operations which return the old value.
{name: "AtomicAnd64value", argLength: 3, typ: "(Uint64, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicAnd32value", argLength: 3, typ: "(Uint32, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicAnd8value", argLength: 3, typ: "(Uint8, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr64value", argLength: 3, typ: "(Uint64, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr32value", argLength: 3, typ: "(Uint32, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr8value", argLength: 3, typ: "(Uint8, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
// Atomic operation variants
// These variants have the same semantics as the atomic operations above,
// but they generate more efficient code on certain modern machines, selected
// with run-time CPU feature detection.
// On ARM64, they are used when the LSE hardware feature is available (either
// known at compile time or detected at runtime); if LSE is not available,
// the basic atomic operations are used instead.
{name: "AtomicStore8Variant", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStore32Variant", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicStore64Variant", argLength: 3, typ: "Mem", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns memory.
{name: "AtomicAdd32Variant", argLength: 3, typ: "(UInt32,Mem)", hasSideEffects: true}, // Do *arg0 += arg1. arg2=memory. Returns sum and new memory.
{name: "AtomicAdd64Variant", argLength: 3, typ: "(UInt64,Mem)", hasSideEffects: true}, // Do *arg0 += arg1. arg2=memory. Returns sum and new memory.
{name: "AtomicExchange8Variant", argLength: 3, typ: "(UInt8,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicExchange32Variant", argLength: 3, typ: "(UInt32,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicExchange64Variant", argLength: 3, typ: "(UInt64,Mem)", hasSideEffects: true}, // Store arg1 to *arg0. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicCompareAndSwap32Variant", argLength: 4, typ: "(Bool,Mem)", hasSideEffects: true}, // if *arg0==arg1, then set *arg0=arg2. Returns true if store happens and new memory.
{name: "AtomicCompareAndSwap64Variant", argLength: 4, typ: "(Bool,Mem)", hasSideEffects: true}, // if *arg0==arg1, then set *arg0=arg2. Returns true if store happens and new memory.
{name: "AtomicAnd64valueVariant", argLength: 3, typ: "(Uint64, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr64valueVariant", argLength: 3, typ: "(Uint64, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicAnd32valueVariant", argLength: 3, typ: "(Uint32, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr32valueVariant", argLength: 3, typ: "(Uint32, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicAnd8valueVariant", argLength: 3, typ: "(Uint8, Mem)", hasSideEffects: true}, // *arg0 &= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
{name: "AtomicOr8valueVariant", argLength: 3, typ: "(Uint8, Mem)", hasSideEffects: true}, // *arg0 |= arg1. arg2=memory. Returns old contents of *arg0 and new memory.
// Publication barrier
{name: "PubBarrier", argLength: 1, hasSideEffects: true}, // Do data barrier. arg0=memory.
// Clobber experiment op
{name: "Clobber", argLength: 0, typ: "Void", aux: "SymOff", symEffect: "None"}, // write an invalid pointer value to the given pointer slot of a stack variable
{name: "ClobberReg", argLength: 0, typ: "Void"}, // clobber a register
// Prefetch instruction
{name: "PrefetchCache", argLength: 2, hasSideEffects: true}, // Do prefetch arg0 to cache. arg0=addr, arg1=memory.
{name: "PrefetchCacheStreamed", argLength: 2, hasSideEffects: true}, // Do non-temporal or streamed prefetch arg0 to cache. arg0=addr, arg1=memory.
// Helper instruction which is semantically equivalent to calling runtime.memequal, but some targets may prefer to custom lower it later, e.g. for specific constant sizes.
{name: "MemEq", argLength: 4, commutative: true, typ: "Bool"}, // arg0=ptr0, arg1=ptr1, arg2=size, arg3=memory.
// SIMD
{name: "ZeroSIMD", argLength: 0}, // zero value of a vector
// Convert integers to masks
{name: "Cvt16toMask8x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt32toMask8x32", argLength: 1}, // arg0 = integer mask value
{name: "Cvt64toMask8x64", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask16x8", argLength: 1}, // arg0 = integer mask value
{name: "Cvt16toMask16x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt32toMask16x32", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask32x4", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask32x8", argLength: 1}, // arg0 = integer mask value
{name: "Cvt16toMask32x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x2", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x4", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x8", argLength: 1}, // arg0 = integer mask value
// Convert masks to integers
{name: "CvtMask8x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask8x32to32", argLength: 1}, // arg0 = mask
{name: "CvtMask8x64to64", argLength: 1}, // arg0 = mask
{name: "CvtMask16x8to8", argLength: 1}, // arg0 = mask
{name: "CvtMask16x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask16x32to32", argLength: 1}, // arg0 = mask
{name: "CvtMask32x4to8", argLength: 1}, // arg0 = mask
{name: "CvtMask32x8to8", argLength: 1}, // arg0 = mask
{name: "CvtMask32x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask64x2to8", argLength: 1}, // arg0 = mask
{name: "CvtMask64x4to8", argLength: 1}, // arg0 = mask
{name: "CvtMask64x8to8", argLength: 1}, // arg0 = mask
// Returns true if arg0 is all zero.
{name: "IsZeroVec", argLength: 1},
}
//    kind        controls          successors           implicit exit
//    ------------------------------------------------------------------
//    Exit        [return mem]      []                   yes
//    Ret         [return mem]      []                   yes
//    RetJmp      [return mem]      []                   yes
//    Plain       []                [next]
//    If          [boolean Value]   [then, else]
//    First       []                [always, never]
//    Defer       [mem]             [nopanic, recovery]  (control opcode should be OpStaticCall to runtime.defer*)
//    JumpTable   [integer Value]   [succ1,succ2,..]
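// Illustrative sketch: a Go "if cond { ... } else { ... }" produces a block
// of kind If whose single control is the boolean Value and whose two
// successors are the then and else blocks, while "return x" ends in a Ret
// block whose control is the final memory value.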
var genericBlocks = []blockData{
{name: "Plain"}, // a single successor
{name: "If", controls: 1}, // if Controls[0] goto Succs[0] else goto Succs[1]
	{name: "Defer", controls: 1},     // Succs[0]=defer queued, Succs[1]=defer recovery branch (jmp performed by runtime). Controls[0] is call op (of memory type).
	{name: "Ret", controls: 1},       // no successors, Controls[0] value is memory result
	{name: "RetJmp", controls: 1},    // no successors, Controls[0] value is a tail call
	{name: "Exit", controls: 1},      // no successors, Controls[0] value generates a panic
	{name: "JumpTable", controls: 1}, // multiple successors, the integer Controls[0] selects which one

	// transient block state used for dead code removal
	{name: "First"}, // 2 successors, always takes the first one (second is dead)
}
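
// Illustrative sketch only (nothing below is generated code): the block
// kinds above are what the SSA backend builds control flow from. For
// example, a source-level
//
//	if x < 0 { a() } else { b() }
//
// becomes an If block whose Controls[0] is the boolean comparison value,
// branching to Succs[0] when the control is true and to Succs[1]
// otherwise, while a sufficiently dense switch may instead become a
// JumpTable block whose integer Controls[0] indexes directly into its
// successor list.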

// init registers the generic (architecture-independent) ops and blocks
// with the generator.
func init() {
	genericOps = append(genericOps, simdGenericOps()...)
	archs = append(archs, arch{
		name:    "generic",
		ops:     genericOps,
		blocks:  genericBlocks,
		generic: true,
	})
}
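
// Rough sketch of how the registration above is consumed: the generator's
// main loop (which lives elsewhere in this package) walks archs and emits
// the opcode and block-kind tables for each entry, conceptually
//
//	for _, a := range archs {
//		// emit tables for a.ops and a.blocks
//	}
//
// so the generic ops and blocks defined in this file end up alongside the
// machine-specific ones.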