2015-02-27 22:57:28 -05:00
// Derived from Inferno utils/6l/obj.c and utils/6l/span.c
2016-08-28 17:04:46 -07:00
// https://bitbucket.org/inferno-os/inferno-os/src/default/utils/6l/obj.c
// https://bitbucket.org/inferno-os/inferno-os/src/default/utils/6l/span.c
2015-02-27 22:57:28 -05:00
//
// Copyright © 1994-1999 Lucent Technologies Inc. All rights reserved.
// Portions Copyright © 1995-1997 C H Forsyth (forsyth@terzarima.net)
// Portions Copyright © 1997-1999 Vita Nuova Limited
// Portions Copyright © 2000-2007 Vita Nuova Holdings Limited (www.vitanuova.com)
// Portions Copyright © 2004,2006 Bruce Ellis
// Portions Copyright © 2005-2007 C H Forsyth (forsyth@terzarima.net)
// Revisions Copyright © 2000-2007 Lucent Technologies Inc. and others
2016-04-10 14:32:26 -07:00
// Portions Copyright © 2009 The Go Authors. All rights reserved.
2015-02-27 22:57:28 -05:00
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
package ld
import (
2019-03-17 15:51:30 +01:00
"bufio"
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
"bytes"
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
"cmd/internal/gcprog"
2017-04-18 12:53:25 -07:00
"cmd/internal/objabi"
2016-04-06 12:01:40 -07:00
"cmd/internal/sys"
2020-02-12 17:20:00 -05:00
"cmd/link/internal/loader"
2017-10-04 17:54:04 -04:00
"cmd/link/internal/sym"
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
"compress/zlib"
"encoding/binary"
2015-02-27 22:57:28 -05:00
"fmt"
"log"
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
"os"
2016-03-09 16:23:25 +02:00
"sort"
2015-06-04 15:15:48 -04:00
"strconv"
2015-02-27 22:57:28 -05:00
"strings"
2016-04-18 14:50:14 -04:00
"sync"
2015-02-27 22:57:28 -05:00
)
2018-11-22 11:46:44 +01:00
// isRuntimeDepPkg reports whether pkg is the runtime package or its dependency
2016-09-14 14:47:12 -04:00
func isRuntimeDepPkg ( pkg string ) bool {
switch pkg {
case "runtime" ,
2018-03-01 16:38:41 -08:00
"sync/atomic" , // runtime may call to sync/atomic, due to go:linkname
"internal/bytealg" , // for IndexByte
"internal/cpu" : // for cpu features
2016-09-14 14:47:12 -04:00
return true
}
return strings . HasPrefix ( pkg , "runtime/internal/" ) && ! strings . HasSuffix ( pkg , "_test" )
}
2017-06-08 08:26:19 -04:00
// Estimate the max size needed to hold any new trampolines created for this function. This
// is used to determine when the section can be split if it becomes too large, to ensure that
// the trampolines are in the same section as the function that uses them.
2017-10-04 17:54:04 -04:00
func maxSizeTrampolinesPPC64 ( s * sym . Symbol , isTramp bool ) uint64 {
2018-04-01 00:58:48 +03:00
// If thearch.Trampoline is nil, then trampoline support is not available on this arch.
2017-06-08 08:26:19 -04:00
// A trampoline does not need any dependent trampolines.
2018-04-01 00:58:48 +03:00
if thearch . Trampoline == nil || isTramp {
2017-06-08 08:26:19 -04:00
return 0
}
n := uint64 ( 0 )
for ri := range s . R {
r := & s . R [ ri ]
2019-09-20 03:25:16 +10:00
if r . Type . IsDirectCallOrJump ( ) {
2017-06-08 08:26:19 -04:00
n ++
}
}
// Trampolines in ppc64 are 4 instructions.
return n * 16
}
2016-09-14 14:47:12 -04:00
// detect too-far jumps in function s, and add trampolines if necessary
2017-06-08 08:26:19 -04:00
// ARM, PPC64 & PPC64LE support trampoline insertion for internal and external linking
// On PPC64 & PPC64LE the text sections might be split but will still insert trampolines
// where necessary.
2017-10-04 17:54:04 -04:00
func trampoline ( ctxt * Link , s * sym . Symbol ) {
2018-04-01 00:58:48 +03:00
if thearch . Trampoline == nil {
2016-09-14 14:47:12 -04:00
return // no need or no support of trampolines on this arch
}
for ri := range s . R {
r := & s . R [ ri ]
2019-09-20 03:25:16 +10:00
if ! r . Type . IsDirectCallOrJump ( ) {
2016-09-14 14:47:12 -04:00
continue
}
2019-09-29 16:59:56 -07:00
if Symaddr ( r . Sym ) == 0 && ( r . Sym . Type != sym . SDYNIMPORT && r . Sym . Type != sym . SUNDEFEXT ) {
2016-09-14 14:47:12 -04:00
if r . Sym . File != s . File {
if ! isRuntimeDepPkg ( s . File ) || ! isRuntimeDepPkg ( r . Sym . File ) {
2020-02-24 21:07:26 -05:00
ctxt . errorUnresolved ( s , r )
2016-09-14 14:47:12 -04:00
}
// runtime and its dependent packages may call to each other.
// they are fine, as they will be laid down together.
}
continue
}
2018-04-01 00:58:48 +03:00
thearch . Trampoline ( ctxt , r , s )
2016-09-14 14:47:12 -04:00
}
}
2018-07-06 12:45:03 -04:00
// relocsym resolve relocations in "s". The main loop walks through
// the list of relocations attached to "s" and resolves them where
// applicable. Relocations are often architecture-specific, requiring
// calls into the 'archreloc' and/or 'archrelocvariant' functions for
// the architecture. When external linking is in effect, it may not be
// possible to completely resolve the address/offset for a symbol, in
// which case the goal is to lay the groundwork for turning a given
// relocation into an external reloc (to be applied by the external
// linker). For more on how relocations work in general, see
//
// "Linkers and Loaders", by John R. Levine (Morgan Kaufmann, 1999), ch. 7
//
// This is a performance-critical function for the linker; be careful
// to avoid introducing unnecessary allocations in the main loop.
2020-02-24 21:12:49 -05:00
// TODO: This function is called in parallel. When the Loader wavefront
// reaches here, calls into the loader need to be parallel as well.
2020-03-24 14:35:45 -04:00
func relocsym ( target * Target , ldr * loader . Loader , err * ErrorReporter , syms * ArchSyms , s * sym . Symbol ) {
2019-04-04 00:29:16 -04:00
if len ( s . R ) == 0 {
return
}
if s . Attr . ReadOnly ( ) {
// The symbol's content is backed by read-only memory.
// Copy it to writable memory to apply relocations.
s . P = append ( [ ] byte ( nil ) , s . P ... )
s . Attr . Set ( sym . AttrReadOnly , false )
}
2015-03-02 12:35:15 -05:00
for ri := int32 ( 0 ) ; ri < int32 ( len ( s . R ) ) ; ri ++ {
2017-10-01 14:39:04 +11:00
r := & s . R [ ri ]
2017-10-24 16:08:46 -04:00
if r . Done {
// Relocation already processed by an earlier phase.
continue
}
2017-08-27 22:00:00 +09:00
r . Done = true
2017-10-01 14:39:04 +11:00
off := r . Off
siz := int32 ( r . Siz )
2015-02-27 22:57:28 -05:00
if off < 0 || off + siz > int32 ( len ( s . P ) ) {
2016-09-17 09:39:33 -04:00
rname := ""
if r . Sym != nil {
rname = r . Sym . Name
}
Errorf ( s , "invalid relocation %s: %d+%d not in [%d,%d)" , rname , off , siz , 0 , len ( s . P ) )
2015-02-27 22:57:28 -05:00
continue
}
2018-05-21 21:00:01 +03:00
if r . Sym != nil && ( ( r . Sym . Type == sym . Sxxx && ! r . Sym . Attr . VisibilityHidden ( ) ) || r . Sym . Type == sym . SXREF ) {
2015-03-30 02:59:10 +00:00
// When putting the runtime but not main into a shared library
// these symbols are undefined and that's OK.
2020-02-24 20:59:02 -05:00
if target . IsShared ( ) || target . IsPlugin ( ) {
if r . Sym . Name == "main.main" || ( ! target . IsPlugin ( ) && r . Sym . Name == "main..inittask" ) {
2017-10-04 17:54:04 -04:00
r . Sym . Type = sym . SDYNIMPORT
2016-07-28 13:04:41 -04:00
} else if strings . HasPrefix ( r . Sym . Name , "go.info." ) {
// Skip go.info symbols. They are only needed to communicate
// DWARF info between the compiler and linker.
continue
}
2015-03-30 02:59:10 +00:00
} else {
2020-02-24 21:07:26 -05:00
err . errorUnresolved ( s , r )
2015-03-30 02:59:10 +00:00
continue
}
2015-02-27 22:57:28 -05:00
}
2019-04-16 02:49:09 +00:00
if r . Type >= objabi . ElfRelocOffset {
2015-02-27 22:57:28 -05:00
continue
}
if r . Siz == 0 { // informational relocation - no work to do
continue
}
2015-03-30 02:59:10 +00:00
// We need to be able to reference dynimport symbols when linking against
2018-11-23 13:29:59 +01:00
// shared libraries, and Solaris, Darwin and AIX need it always
2020-02-24 20:59:02 -05:00
if ! target . IsSolaris ( ) && ! target . IsDarwin ( ) && ! target . IsAIX ( ) && r . Sym != nil && r . Sym . Type == sym . SDYNIMPORT && ! target . IsDynlinkingGo ( ) && ! r . Sym . Attr . SubSymbol ( ) {
if ! ( target . IsPPC64 ( ) && target . IsExternal ( ) && r . Sym . Name == ".TOC." ) {
Errorf ( s , "unhandled relocation for %s (type %d (%s) rtype %d (%s))" , r . Sym . Name , r . Sym . Type , r . Sym . Type , r . Type , sym . RelocName ( target . Arch , r . Type ) )
cmd/compile, cmd/link, runtime: on ppc64x, maintain the TOC pointer in R2 when compiling PIC
The PowerPC ISA does not have a PC-relative load instruction, which poses
obvious challenges when generating position-independent code. The way the ELFv2
ABI addresses this is to specify that r2 points to a per "module" (shared
library or executable) TOC pointer. Maintaining this pointer requires
cooperation between codegen and the system linker:
* Non-leaf functions leave space on the stack at r1+24 to save the TOC pointer.
* A call to a function that *might* have to go via a PLT stub must be followed
by a nop instruction that the system linker can replace with "ld r1, 24(r1)"
to restore the TOC pointer (only when dynamically linking Go code).
* When calling a function via a function pointer, the address of the function
must be in r12, and the first couple of instructions (the "global entry
point") of the called function use this to derive the address of the TOC
for the module it is in.
* When calling a function that is implemented in the same module, the system
linker adjusts the call to skip over the instructions mentioned above (the
"local entry point"), assuming that r2 is already correctly set.
So this changeset adds the global entry point instructions, sets the metadata so
the system linker knows where the local entry point is, inserts code to save the
TOC pointer at 24(r1), adds a nop after any call not known to be local and copes
with the odd non-local code transfer in the runtime (e.g. the stuff around
jmpdefer). It does not actually compile PIC yet.
Change-Id: I7522e22bdfd2f891745a900c60254fe9e372c854
Reviewed-on: https://go-review.googlesource.com/15967
Reviewed-by: Russ Cox <rsc@golang.org>
2015-10-16 15:42:09 +13:00
}
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
if r . Sym != nil && r . Sym . Type != sym . STLSBSS && r . Type != objabi . R_WEAKADDROFF && ! r . Sym . Attr . Reachable ( ) {
2016-09-17 09:39:33 -04:00
Errorf ( s , "unreachable sym in relocation: %s" , r . Sym . Name )
2015-02-27 22:57:28 -05:00
}
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) {
2019-04-23 11:34:58 -04:00
r . InitExt ( )
}
2016-03-18 16:57:54 -04:00
// TODO(mundaym): remove this special case - see issue 14218.
2020-02-24 20:59:02 -05:00
if target . IsS390X ( ) {
2016-03-18 16:57:54 -04:00
switch r . Type {
2017-04-18 12:53:25 -07:00
case objabi . R_PCRELDBL :
2019-04-23 11:34:58 -04:00
r . InitExt ( )
2017-04-18 12:53:25 -07:00
r . Type = objabi . R_PCREL
2017-10-04 17:54:04 -04:00
r . Variant = sym . RV_390_DBL
2017-04-18 12:53:25 -07:00
case objabi . R_CALL :
2019-04-23 11:34:58 -04:00
r . InitExt ( )
2017-10-04 17:54:04 -04:00
r . Variant = sym . RV_390_DBL
2016-03-18 16:57:54 -04:00
}
}
2017-10-01 14:39:04 +11:00
var o int64
2015-02-27 22:57:28 -05:00
switch r . Type {
default :
2015-08-03 14:08:17 +12:00
switch siz {
default :
2016-09-17 09:39:33 -04:00
Errorf ( s , "bad reloc size %#x for %s" , uint32 ( siz ) , r . Sym . Name )
2015-08-03 14:08:17 +12:00
case 1 :
o = int64 ( s . P [ off ] )
case 2 :
2020-02-24 20:59:02 -05:00
o = int64 ( target . Arch . ByteOrder . Uint16 ( s . P [ off : ] ) )
2015-08-03 14:08:17 +12:00
case 4 :
2020-02-24 20:59:02 -05:00
o = int64 ( target . Arch . ByteOrder . Uint32 ( s . P [ off : ] ) )
2015-08-03 14:08:17 +12:00
case 8 :
2020-02-24 20:59:02 -05:00
o = int64 ( target . Arch . ByteOrder . Uint64 ( s . P [ off : ] ) )
2015-08-03 14:08:17 +12:00
}
2020-02-24 21:07:26 -05:00
if offset , ok := thearch . Archreloc ( target , syms , r , s , o ) ; ok {
cmd/link: fewer allocs in ld.Arch.Archreloc
Archreloc had this signature:
func(*Link, *sym.Reloc, *sym.Symbol, *int64) bool
The last *int64 argument is used as out parameter.
Passed valus could be allocated on stack, but escape analysis
fails here, leading to high number of unwanted allocs.
If instead 4th arg is passed by value, and modified values is returned,
no problems with allocations arise:
func(*Link, *sym.Reloc, *sym.Symbol, int64) (int64, bool)
There are 2 benefits:
1. code becomes more readable.
2. less allocations.
For linking "hello world" example from net/http:
name old time/op new time/op delta
Linker-4 530ms ± 2% 520ms ± 2% -1.83% (p=0.001 n=17+16)
It's top 1 in alloc_objects from memprofile:
flat flat% sum% cum cum%
229379 33.05% 33.05% 229379 33.05% cmd/link/internal/ld.relocsym
...
list relocsym:
229379 229379 (flat, cum) 33.05% of Total
229379 229379 183: var o int64
After the patch, ~230k of int64 allocs (~ 1.75mb) removed.
Passes toolshash-check (toolstash cmp).
Change-Id: I25504fe27967bcff70c4b7338790f3921d15473d
Reviewed-on: https://go-review.googlesource.com/113637
Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-05-17 19:50:29 +03:00
o = offset
} else {
2020-02-24 20:59:02 -05:00
Errorf ( s , "unknown reloc to %v: %d (%s)" , r . Sym . Name , r . Type , sym . RelocName ( target . Arch , r . Type ) )
2015-02-27 22:57:28 -05:00
}
2017-04-18 12:53:25 -07:00
case objabi . R_TLS_LE :
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && target . IsElf ( ) {
2017-08-27 22:00:00 +09:00
r . Done = false
2015-09-02 10:35:54 +12:00
if r . Sym == nil {
2020-02-24 21:07:26 -05:00
r . Sym = syms . Tlsg
2015-09-02 10:35:54 +12:00
}
r . Xsym = r . Sym
2015-02-27 22:57:28 -05:00
r . Xadd = r . Add
o = 0
2020-02-24 20:59:02 -05:00
if ! target . IsAMD64 ( ) {
2015-02-27 22:57:28 -05:00
o = r . Add
}
break
}
2020-02-24 20:59:02 -05:00
if target . IsElf ( ) && target . IsARM ( ) {
2015-09-02 10:35:54 +12:00
// On ELF ARM, the thread pointer is 8 bytes before
// the start of the thread-local data block, so add 8
// to the actual TLS offset (r->sym->value).
// This 8 seems to be a fundamental constant of
// ELF on ARM (or maybe Glibc on ARM); it is not
// related to the fact that our own TLS storage happens
// to take up 8 bytes.
o = 8 + r . Sym . Value
2020-02-24 20:59:02 -05:00
} else if target . IsElf ( ) || target . IsPlan9 ( ) || target . IsDarwin ( ) {
2020-02-24 21:07:26 -05:00
o = int64 ( syms . Tlsoffset ) + r . Add
2020-02-24 20:59:02 -05:00
} else if target . IsWindows ( ) {
2015-04-23 21:53:48 +12:00
o = r . Add
} else {
2020-02-24 20:59:02 -05:00
log . Fatalf ( "unexpected R_TLS_LE relocation for %v" , target . HeadType )
2015-04-23 21:53:48 +12:00
}
2017-04-18 12:53:25 -07:00
case objabi . R_TLS_IE :
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && target . IsElf ( ) {
2017-08-27 22:00:00 +09:00
r . Done = false
2015-09-02 10:35:54 +12:00
if r . Sym == nil {
2020-02-24 21:07:26 -05:00
r . Sym = syms . Tlsg
2015-09-02 10:35:54 +12:00
}
r . Xsym = r . Sym
2015-02-27 22:57:28 -05:00
r . Xadd = r . Add
o = 0
2020-02-24 20:59:02 -05:00
if ! target . IsAMD64 ( ) {
2015-02-27 22:57:28 -05:00
o = r . Add
}
break
}
2020-02-24 20:59:02 -05:00
if target . IsPIE ( ) && target . IsElf ( ) {
2016-09-06 07:46:59 -04:00
// We are linking the final executable, so we
// can optimize any TLS IE relocation to LE.
2018-04-01 00:58:48 +03:00
if thearch . TLSIEtoLE == nil {
2020-02-24 20:59:02 -05:00
log . Fatalf ( "internal linking of TLS IE not supported on %v" , target . Arch . Family )
2016-09-06 07:46:59 -04:00
}
2018-04-01 00:58:48 +03:00
thearch . TLSIEtoLE ( s , int ( off ) , int ( r . Siz ) )
2020-02-24 21:07:26 -05:00
o = int64 ( syms . Tlsoffset )
// TODO: o += r.Add when !target.IsAmd64()?
2016-09-06 07:46:59 -04:00
// Why do we treat r.Add differently on AMD64?
// Is the external linker using Xadd at all?
2016-09-06 07:46:59 -04:00
} else {
log . Fatalf ( "cannot handle R_TLS_IE (sym %s) when linking internally" , s . Name )
}
2017-04-18 12:53:25 -07:00
case objabi . R_ADDR :
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && r . Sym . Type != sym . SCONST {
2017-08-27 22:00:00 +09:00
r . Done = false
2015-02-27 22:57:28 -05:00
// set up addend for eventual relocation via outer symbol.
2017-10-01 14:39:04 +11:00
rs := r . Sym
2015-02-27 22:57:28 -05:00
r . Xadd = r . Add
for rs . Outer != nil {
2016-09-17 09:39:33 -04:00
r . Xadd += Symaddr ( rs ) - Symaddr ( rs . Outer )
2015-02-27 22:57:28 -05:00
rs = rs . Outer
}
2019-09-29 16:59:56 -07:00
if rs . Type != sym . SHOSTOBJ && rs . Type != sym . SDYNIMPORT && rs . Type != sym . SUNDEFEXT && rs . Sect == nil {
2016-09-17 09:39:33 -04:00
Errorf ( s , "missing section for relocation target %s" , rs . Name )
2015-02-27 22:57:28 -05:00
}
r . Xsym = rs
o = r . Xadd
2020-02-24 20:59:02 -05:00
if target . IsElf ( ) {
if target . IsAMD64 ( ) {
2015-02-27 22:57:28 -05:00
o = 0
}
2020-02-24 20:59:02 -05:00
} else if target . IsDarwin ( ) {
2017-10-04 17:54:04 -04:00
if rs . Type != sym . SHOSTOBJ {
2018-04-21 13:27:13 +02:00
o += Symaddr ( rs )
2015-02-27 22:57:28 -05:00
}
2020-02-24 20:59:02 -05:00
} else if target . IsWindows ( ) {
2015-03-09 03:05:40 -04:00
// nothing to do
2020-02-24 20:59:02 -05:00
} else if target . IsAIX ( ) {
2019-02-20 16:16:38 +01:00
o = Symaddr ( r . Sym ) + r . Add
2015-02-27 22:57:28 -05:00
} else {
2020-02-24 20:59:02 -05:00
Errorf ( s , "unhandled pcrel relocation to %s on %v" , rs . Name , target . HeadType )
2015-02-27 22:57:28 -05:00
}
break
}
2018-11-23 14:20:19 +01:00
// On AIX, a second relocation must be done by the loader,
// as section addresses can change once loaded.
2018-09-28 17:02:16 +02:00
// The "default" symbol address is still needed by the loader so
// the current relocation can't be skipped.
2020-02-24 20:59:02 -05:00
if target . IsAIX ( ) && r . Sym . Type != sym . SDYNIMPORT {
2018-11-23 14:20:19 +01:00
// It's not possible to make a loader relocation in a
// symbol which is not inside .data section.
// FIXME: It should be forbidden to have R_ADDR from a
// symbol which isn't in .data. However, as .text has the
// same address once loaded, this is possible.
if s . Sect . Seg == & Segdata {
2020-03-24 12:55:14 -04:00
Xcoffadddynrel ( target , ldr , s , r )
2018-09-28 17:02:16 +02:00
}
}
2015-02-27 22:57:28 -05:00
2016-09-17 09:39:33 -04:00
o = Symaddr ( r . Sym ) + r . Add
2015-02-27 22:57:28 -05:00
// On amd64, 4-byte offsets will be sign-extended, so it is impossible to
// access more than 2GB of static data; fail at link time is better than
2015-07-10 17:17:11 -06:00
// fail at runtime. See https://golang.org/issue/7980.
2015-02-27 22:57:28 -05:00
// Instead of special casing only amd64, we treat this as an error on all
// 64-bit architectures so as to be future-proof.
2020-02-24 20:59:02 -05:00
if int32 ( o ) < 0 && target . Arch . PtrSize > 4 && siz == 4 {
2016-09-17 09:39:33 -04:00
Errorf ( s , "non-pc-relative relocation address for %s is too big: %#x (%#x + %#x)" , r . Sym . Name , uint64 ( o ) , Symaddr ( r . Sym ) , r . Add )
2015-04-09 07:37:17 -04:00
errorexit ( )
2015-02-27 22:57:28 -05:00
}
2017-10-24 16:08:46 -04:00
case objabi . R_DWARFSECREF :
2017-05-22 20:17:31 -04:00
if r . Sym . Sect == nil {
2016-09-17 09:39:33 -04:00
Errorf ( s , "missing DWARF section for relocation target %s" , r . Sym . Name )
2016-03-14 09:23:04 -07:00
}
2017-05-02 16:46:01 +02:00
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) {
2017-08-27 22:00:00 +09:00
r . Done = false
cmd/link: suppress unnecessary DWARF relocs that confuse dsymutil
During Mach-O linking, dsymutil takes the DWARF from individual object
files and combines it into a debug archive. Because it's content-aware,
it doesn't need our help to do its job. Nonetheless, it does try to
honor relocations that are present in its input.
When dsymutil encounters a relocation, it uses the value of that
relocation as an index into the debug map to find its final location.
When it does that, it's assuming that the value is an address in the
object file. But DWARF references are section-relative. So when it
processes a relocation for a DWARF reference, it gets confused,
and if the value happens to match the address of a function or
data symbol, it will rewrite it incorrectly.
Since the relocations don't help, and can hurt, drop them when
externally linking a Mach-O binary.
Fixes #22068
Change-Id: I8ec36da626575d9f6c8d0e7a0b76eab8ba22d62c
Reviewed-on: https://go-review.googlesource.com/68330
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Crawshaw <crawshaw@golang.org>
2017-10-04 18:30:38 -04:00
// On most platforms, the external linker needs to adjust DWARF references
// as it combines DWARF sections. However, on Darwin, dsymutil does the
// DWARF linking, and it understands how to follow section offsets.
// Leaving in the relocation records confuses it (see
// https://golang.org/issue/22068) so drop them for Darwin.
2020-02-24 20:59:02 -05:00
if target . IsDarwin ( ) {
cmd/link: suppress unnecessary DWARF relocs that confuse dsymutil
During Mach-O linking, dsymutil takes the DWARF from individual object
files and combines it into a debug archive. Because it's content-aware,
it doesn't need our help to do its job. Nonetheless, it does try to
honor relocations that are present in its input.
When dsymutil encounters a relocation, it uses the value of that
relocation as an index into the debug map to find its final location.
When it does that, it's assuming that the value is an address in the
object file. But DWARF references are section-relative. So when it
processes a relocation for a DWARF reference, it gets confused,
and if the value happens to match the address of a function or
data symbol, it will rewrite it incorrectly.
Since the relocations don't help, and can hurt, drop them when
externally linking a Mach-O binary.
Fixes #22068
Change-Id: I8ec36da626575d9f6c8d0e7a0b76eab8ba22d62c
Reviewed-on: https://go-review.googlesource.com/68330
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Crawshaw <crawshaw@golang.org>
2017-10-04 18:30:38 -04:00
r . Done = true
}
2017-02-08 12:30:30 +11:00
// PE code emits IMAGE_REL_I386_SECREL and IMAGE_REL_AMD64_SECREL
2017-10-24 16:08:46 -04:00
// for R_DWARFSECREF relocations, while R_ADDR is replaced with
2017-02-08 12:30:30 +11:00
// IMAGE_REL_I386_DIR32, IMAGE_REL_AMD64_ADDR64 and IMAGE_REL_AMD64_ADDR32.
2017-10-24 16:08:46 -04:00
// Do not replace R_DWARFSECREF with R_ADDR for windows -
2017-02-08 12:30:30 +11:00
// let PE code emit correct relocations.
2020-02-24 20:59:02 -05:00
if ! target . IsWindows ( ) {
2017-04-18 12:53:25 -07:00
r . Type = objabi . R_ADDR
2017-02-08 12:30:30 +11:00
}
2016-03-14 09:23:04 -07:00
2020-03-24 14:35:45 -04:00
r . Xsym = r . Sym . Sect . Sym
2017-05-22 20:17:31 -04:00
r . Xadd = r . Add + Symaddr ( r . Sym ) - int64 ( r . Sym . Sect . Vaddr )
2017-05-02 16:46:01 +02:00
2016-03-14 09:23:04 -07:00
o = r . Xadd
2020-02-24 20:59:02 -05:00
if target . IsElf ( ) && target . IsAMD64 ( ) {
2016-03-14 09:23:04 -07:00
o = 0
}
break
}
2017-05-22 20:17:31 -04:00
o = Symaddr ( r . Sym ) + r . Add - int64 ( r . Sym . Sect . Vaddr )
2017-04-18 12:53:25 -07:00
case objabi . R_WEAKADDROFF :
2016-11-21 16:58:55 -05:00
if ! r . Sym . Attr . Reachable ( ) {
continue
}
fallthrough
2017-04-18 12:53:25 -07:00
case objabi . R_ADDROFF :
2016-08-25 11:07:33 -05:00
// The method offset tables using this relocation expect the offset to be relative
// to the start of the first text section, even if there are multiple.
if r . Sym . Sect . Name == ".text" {
2017-04-18 21:52:06 +12:00
o = Symaddr ( r . Sym ) - int64 ( Segtext . Sections [ 0 ] . Vaddr ) + r . Add
2016-08-25 11:07:33 -05:00
} else {
o = Symaddr ( r . Sym ) - int64 ( r . Sym . Sect . Vaddr ) + r . Add
}
2016-03-27 10:21:48 -04:00
2017-10-21 12:45:23 +02:00
case objabi . R_ADDRCUOFF :
// debug_range and debug_loc elements use this relocation type to get an
// offset from the start of the compile unit.
2019-11-02 17:25:39 -04:00
o = Symaddr ( r . Sym ) + r . Add - Symaddr ( r . Sym . Unit . Textp [ 0 ] )
2017-10-21 12:45:23 +02:00
2015-02-27 22:57:28 -05:00
// r->sym can be null when CALL $(constant) is transformed from absolute PC to relative PC call.
2017-04-18 12:53:25 -07:00
case objabi . R_GOTPCREL :
2020-02-24 20:59:02 -05:00
if target . IsDynlinkingGo ( ) && target . IsDarwin ( ) && r . Sym != nil && r . Sym . Type != sym . SCONST {
2017-08-27 22:00:00 +09:00
r . Done = false
2016-11-13 21:20:58 -05:00
r . Xadd = r . Add
r . Xadd -= int64 ( r . Siz ) // relative to address after the relocated chunk
r . Xsym = r . Sym
o = r . Xadd
o += int64 ( r . Siz )
break
}
fallthrough
2017-04-18 12:53:25 -07:00
case objabi . R_CALL , objabi . R_PCREL :
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && r . Sym != nil && r . Sym . Type == sym . SUNDEFEXT {
2019-09-29 16:59:56 -07:00
// pass through to the external linker.
r . Done = false
r . Xadd = 0
2020-02-24 20:59:02 -05:00
if target . IsElf ( ) {
2019-09-29 16:59:56 -07:00
r . Xadd -= int64 ( r . Siz )
}
r . Xsym = r . Sym
o = 0
break
}
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && r . Sym != nil && r . Sym . Type != sym . SCONST && ( r . Sym . Sect != s . Sect || r . Type == objabi . R_GOTPCREL ) {
2017-08-27 22:00:00 +09:00
r . Done = false
2015-02-27 22:57:28 -05:00
// set up addend for eventual relocation via outer symbol.
2017-10-01 14:39:04 +11:00
rs := r . Sym
2015-02-27 22:57:28 -05:00
r . Xadd = r . Add
for rs . Outer != nil {
2016-09-17 09:39:33 -04:00
r . Xadd += Symaddr ( rs ) - Symaddr ( rs . Outer )
2015-02-27 22:57:28 -05:00
rs = rs . Outer
}
r . Xadd -= int64 ( r . Siz ) // relative to address after the relocated chunk
2017-10-04 17:54:04 -04:00
if rs . Type != sym . SHOSTOBJ && rs . Type != sym . SDYNIMPORT && rs . Sect == nil {
2016-09-17 09:39:33 -04:00
Errorf ( s , "missing section for relocation target %s" , rs . Name )
2015-02-27 22:57:28 -05:00
}
r . Xsym = rs
o = r . Xadd
2020-02-24 20:59:02 -05:00
if target . IsElf ( ) {
if target . IsAMD64 ( ) {
2015-02-27 22:57:28 -05:00
o = 0
}
2020-02-24 20:59:02 -05:00
} else if target . IsDarwin ( ) {
2017-04-18 12:53:25 -07:00
if r . Type == objabi . R_CALL {
2020-02-24 20:59:02 -05:00
if target . IsExternal ( ) && rs . Type == sym . SDYNIMPORT {
switch target . Arch . Family {
2018-04-23 07:30:32 -07:00
case sys . AMD64 :
// AMD64 dynamic relocations are relative to the end of the relocation.
o += int64 ( r . Siz )
case sys . I386 :
// I386 dynamic relocations are relative to the start of the section.
o -= int64 ( r . Off ) // offset in symbol
o -= int64 ( s . Value - int64 ( s . Sect . Vaddr ) ) // offset of symbol in section
}
} else {
if rs . Type != sym . SHOSTOBJ {
o += int64 ( uint64 ( Symaddr ( rs ) ) - rs . Sect . Vaddr )
}
o -= int64 ( r . Off ) // relative to section offset, not symbol
2015-02-27 22:57:28 -05:00
}
2020-02-24 20:59:02 -05:00
} else if target . IsARM ( ) {
2016-04-26 15:17:56 -04:00
// see ../arm/asm.go:/machoreloc1
2018-02-25 20:08:22 +09:00
o += Symaddr ( rs ) - s . Value - int64 ( r . Off )
2015-02-27 22:57:28 -05:00
} else {
o += int64 ( r . Siz )
}
2020-02-24 20:59:02 -05:00
} else if target . IsWindows ( ) && target . IsAMD64 ( ) { // only amd64 needs PCREL
2015-03-13 22:10:48 -04:00
// PE/COFF's PC32 relocation uses the address after the relocated
// bytes as the base. Compensate by skewing the addend.
o += int64 ( r . Siz )
2015-02-27 22:57:28 -05:00
} else {
2020-02-24 20:59:02 -05:00
Errorf ( s , "unhandled pcrel relocation to %s on %v" , rs . Name , target . HeadType )
2015-02-27 22:57:28 -05:00
}
break
}
o = 0
if r . Sym != nil {
2016-09-17 09:39:33 -04:00
o += Symaddr ( r . Sym )
2015-02-27 22:57:28 -05:00
}
2016-09-06 12:33:36 -04:00
o += r . Add - ( s . Value + int64 ( r . Off ) + int64 ( r . Siz ) )
2017-04-18 12:53:25 -07:00
case objabi . R_SIZE :
2015-02-27 22:57:28 -05:00
o = r . Sym . Size + r . Add
2019-02-20 16:20:56 +01:00
case objabi . R_XCOFFREF :
2020-02-24 20:59:02 -05:00
if ! target . IsAIX ( ) {
2019-02-20 16:20:56 +01:00
Errorf ( s , "find XCOFF R_REF on non-XCOFF files" )
}
2020-02-24 20:59:02 -05:00
if ! target . IsExternal ( ) {
2019-02-20 16:20:56 +01:00
Errorf ( s , "find XCOFF R_REF with internal linking" )
}
r . Xsym = r . Sym
r . Xadd = r . Add
r . Done = false
// This isn't a real relocation so it must not update
// its offset value.
continue
2019-04-04 00:30:25 -04:00
case objabi . R_DWARFFILEREF :
// The final file index is saved in r.Add in dwarf.go:writelines.
o = r . Add
2015-02-27 22:57:28 -05:00
}
2020-02-24 20:59:02 -05:00
if target . IsPPC64 ( ) || target . IsS390X ( ) {
2019-04-23 11:34:58 -04:00
r . InitExt ( )
if r . Variant != sym . RV_NONE {
2020-02-24 21:07:26 -05:00
o = thearch . Archrelocvariant ( target , syms , r , s , o )
2019-04-23 11:34:58 -04:00
}
2015-02-27 22:57:28 -05:00
}
2015-03-01 13:32:49 -05:00
if false {
nam := "<nil>"
2018-04-12 17:07:14 -04:00
var addr int64
2015-03-01 13:32:49 -05:00
if r . Sym != nil {
nam = r . Sym . Name
2018-04-12 17:07:14 -04:00
addr = Symaddr ( r . Sym )
}
xnam := "<nil>"
if r . Xsym != nil {
xnam = r . Xsym . Name
2015-03-01 13:32:49 -05:00
}
2020-02-24 20:59:02 -05:00
fmt . Printf ( "relocate %s %#x (%#x+%#x, size %d) => %s %#x +%#x (xsym: %s +%#x) [type %d (%s)/%d, %x]\n" , s . Name , s . Value + int64 ( off ) , s . Value , r . Off , r . Siz , nam , addr , r . Add , xnam , r . Xadd , r . Type , sym . RelocName ( target . Arch , r . Type ) , r . Variant , o )
2015-03-01 13:32:49 -05:00
}
2015-02-27 22:57:28 -05:00
switch siz {
default :
2016-09-17 09:39:33 -04:00
Errorf ( s , "bad reloc size %#x for %s" , uint32 ( siz ) , r . Sym . Name )
2015-02-27 22:57:28 -05:00
fallthrough
// TODO(rsc): Remove.
case 1 :
s . P [ off ] = byte ( int8 ( o ) )
case 2 :
if o != int64 ( int16 ( o ) ) {
2016-09-17 09:39:33 -04:00
Errorf ( s , "relocation address for %s is too big: %#x" , r . Sym . Name , o )
2015-02-27 22:57:28 -05:00
}
2017-10-01 14:39:04 +11:00
i16 := int16 ( o )
2020-02-24 20:59:02 -05:00
target . Arch . ByteOrder . PutUint16 ( s . P [ off : ] , uint16 ( i16 ) )
2015-02-27 22:57:28 -05:00
case 4 :
2017-04-18 12:53:25 -07:00
if r . Type == objabi . R_PCREL || r . Type == objabi . R_CALL {
2015-02-27 22:57:28 -05:00
if o != int64 ( int32 ( o ) ) {
2016-09-17 09:39:33 -04:00
Errorf ( s , "pc-relative relocation address for %s is too big: %#x" , r . Sym . Name , o )
2015-02-27 22:57:28 -05:00
}
} else {
if o != int64 ( int32 ( o ) ) && o != int64 ( uint32 ( o ) ) {
2016-09-17 09:39:33 -04:00
Errorf ( s , "non-pc-relative relocation address for %s is too big: %#x" , r . Sym . Name , uint64 ( o ) )
2015-02-27 22:57:28 -05:00
}
}
2017-10-01 14:39:04 +11:00
fl := int32 ( o )
2020-02-24 20:59:02 -05:00
target . Arch . ByteOrder . PutUint32 ( s . P [ off : ] , uint32 ( fl ) )
2015-02-27 22:57:28 -05:00
case 8 :
2020-02-24 20:59:02 -05:00
target . Arch . ByteOrder . PutUint64 ( s . P [ off : ] , uint64 ( o ) )
2015-02-27 22:57:28 -05:00
}
}
}
2016-08-19 22:40:38 -04:00
func ( ctxt * Link ) reloc ( ) {
2020-02-24 21:08:12 -05:00
var wg sync . WaitGroup
2020-02-24 21:07:26 -05:00
target := & ctxt . Target
2020-03-24 12:55:14 -04:00
ldr := ctxt . loader
2020-02-24 21:07:26 -05:00
reporter := & ctxt . ErrorReporter
syms := & ctxt . ArchSyms
2020-02-24 21:08:12 -05:00
wg . Add ( 3 )
go func ( ) {
for _ , s := range ctxt . Textp {
2020-03-24 14:35:45 -04:00
relocsym ( target , ldr , reporter , syms , s )
2020-02-24 21:08:12 -05:00
}
wg . Done ( )
} ( )
go func ( ) {
2020-03-09 11:08:04 -04:00
for _ , s := range ctxt . datap {
2020-03-24 14:35:45 -04:00
relocsym ( target , ldr , reporter , syms , s )
2020-02-24 21:08:12 -05:00
}
wg . Done ( )
} ( )
go func ( ) {
for _ , s := range dwarfp {
2020-03-24 14:35:45 -04:00
relocsym ( target , ldr , reporter , syms , s )
2020-02-24 21:08:12 -05:00
}
wg . Done ( )
} ( )
wg . Wait ( )
2015-02-27 22:57:28 -05:00
}
2020-03-24 09:20:01 -04:00
func windynrelocsym ( ctxt * Link , rel * loader . SymbolBuilder , s loader . Sym ) {
var su * loader . SymbolBuilder
var rslice [ ] loader . Reloc
relocs := ctxt . loader . Relocs ( s )
for ri := 0 ; ri < relocs . Count ; ri ++ {
r := relocs . At2 ( ri )
targ := r . Sym ( )
if targ == 0 {
2017-09-09 11:23:29 +09:00
continue
2015-02-27 22:57:28 -05:00
}
2020-03-24 09:20:01 -04:00
rt := r . Type ( )
if ! ctxt . loader . AttrReachable ( targ ) {
if rt == objabi . R_WEAKADDROFF {
2015-02-27 22:57:28 -05:00
continue
}
2020-03-24 09:20:01 -04:00
ctxt . Errorf ( s , "dynamic relocation to unreachable symbol %s" ,
ctxt . loader . SymName ( targ ) )
2017-09-09 11:23:29 +09:00
}
2020-03-24 09:20:01 -04:00
tplt := ctxt . loader . SymPlt ( targ )
tgot := ctxt . loader . SymGot ( targ )
if tplt == - 2 && tgot != - 2 { // make dynimport JMP table for PE object files.
tplt := int32 ( rel . Size ( ) )
ctxt . loader . SetPlt ( targ , tplt )
if su == nil {
su = ctxt . loader . MakeSymbolUpdater ( s )
rslice = su . Relocs ( )
}
r := & rslice [ ri ]
r . Sym = rel . Sym ( )
r . Add = int64 ( tplt )
2017-09-09 11:23:29 +09:00
// jmp *addr
2018-07-24 15:13:41 -07:00
switch ctxt . Arch . Family {
default :
2020-03-24 09:20:01 -04:00
ctxt . Errorf ( s , "unsupported arch %v" , ctxt . Arch . Family )
2018-07-24 15:13:41 -07:00
return
case sys . I386 :
2017-09-30 15:06:44 +00:00
rel . AddUint8 ( 0xff )
rel . AddUint8 ( 0x25 )
2020-03-24 09:20:01 -04:00
rel . AddAddrPlus ( ctxt . Arch , targ , 0 )
2017-09-30 15:06:44 +00:00
rel . AddUint8 ( 0x90 )
rel . AddUint8 ( 0x90 )
2018-07-24 15:13:41 -07:00
case sys . AMD64 :
2017-09-30 15:06:44 +00:00
rel . AddUint8 ( 0xff )
rel . AddUint8 ( 0x24 )
rel . AddUint8 ( 0x25 )
2020-03-24 09:20:01 -04:00
rel . AddAddrPlus4 ( ctxt . Arch , targ , 0 )
2017-09-30 15:06:44 +00:00
rel . AddUint8 ( 0x90 )
2015-02-27 22:57:28 -05:00
}
2020-03-24 09:20:01 -04:00
} else if tplt >= 0 {
if su == nil {
su = ctxt . loader . MakeSymbolUpdater ( s )
rslice = su . Relocs ( )
}
r := & rslice [ ri ]
r . Sym = rel . Sym ( )
r . Add = int64 ( tplt )
2015-02-27 22:57:28 -05:00
}
2017-09-09 11:23:29 +09:00
}
}
2015-02-27 22:57:28 -05:00
2018-06-16 16:23:52 +10:00
// windynrelocsyms generates jump table to C library functions that will be
// added later. windynrelocsyms writes the table into .rel symbol.
func ( ctxt * Link ) windynrelocsyms ( ) {
2020-03-24 09:20:01 -04:00
if ! ( ctxt . IsWindows ( ) && iscgo && ctxt . IsInternal ( ) ) {
2015-02-27 22:57:28 -05:00
return
}
2018-06-16 16:23:52 +10:00
2020-03-24 09:20:01 -04:00
rel := ctxt . loader . LookupOrCreateSym ( ".rel" , 0 )
relu := ctxt . loader . MakeSymbolUpdater ( rel )
relu . SetType ( sym . STEXT )
2018-06-16 16:23:52 +10:00
2020-03-24 09:20:01 -04:00
for _ , s := range ctxt . Textp2 {
windynrelocsym ( ctxt , relu , s )
2018-06-16 16:23:52 +10:00
}
2020-03-24 09:20:01 -04:00
ctxt . Textp2 = append ( ctxt . Textp2 , rel )
2018-06-16 16:23:52 +10:00
}
2015-02-27 22:57:28 -05:00
2018-06-16 16:23:52 +10:00
func dynrelocsym ( ctxt * Link , s * sym . Symbol ) {
2020-03-04 15:27:47 -05:00
target := & ctxt . Target
2020-03-24 12:55:14 -04:00
ldr := ctxt . loader
2020-03-04 15:27:47 -05:00
syms := & ctxt . ArchSyms
2018-04-06 21:41:06 +01:00
for ri := range s . R {
2016-04-20 10:36:49 -04:00
r := & s . R [ ri ]
2017-10-05 10:20:17 -04:00
if ctxt . BuildMode == BuildModePIE && ctxt . LinkMode == LinkInternal {
2016-09-05 23:49:53 -04:00
// It's expected that some relocations will be done
// later by relocsym (R_TLS_LE, R_ADDROFF), so
// don't worry if Adddynrel returns false.
2020-03-24 12:55:14 -04:00
thearch . Adddynrel ( target , ldr , syms , s , r )
2016-09-05 23:49:53 -04:00
continue
}
2018-09-28 17:02:16 +02:00
2019-04-16 02:49:09 +00:00
if r . Sym != nil && r . Sym . Type == sym . SDYNIMPORT || r . Type >= objabi . ElfRelocOffset {
2016-03-02 07:59:49 -05:00
if r . Sym != nil && ! r . Sym . Attr . Reachable ( ) {
2016-09-17 09:39:33 -04:00
Errorf ( s , "dynamic relocation to unreachable symbol %s" , r . Sym . Name )
2015-02-27 22:57:28 -05:00
}
2020-03-24 12:55:14 -04:00
if ! thearch . Adddynrel ( target , ldr , syms , s , r ) {
2017-10-04 17:54:04 -04:00
Errorf ( s , "unsupported dynamic relocation for symbol %s (type=%d (%s) stype=%d (%s))" , r . Sym . Name , r . Type , sym . RelocName ( ctxt . Arch , r . Type ) , r . Sym . Type , r . Sym . Type )
2016-09-05 23:49:53 -04:00
}
2015-02-27 22:57:28 -05:00
}
}
}
2017-10-04 17:54:04 -04:00
func dynreloc ( ctxt * Link , data * [ sym . SXREF ] [ ] * sym . Symbol ) {
2018-06-16 16:23:52 +10:00
if ctxt . HeadType == objabi . Hwindows {
return
}
2015-02-27 22:57:28 -05:00
// -d suppresses dynamic loader format, so we may as well not
// compute these sections or mark their symbols as reachable.
2018-06-16 16:23:52 +10:00
if * FlagD {
2015-02-27 22:57:28 -05:00
return
}
2016-08-19 22:40:38 -04:00
for _ , s := range ctxt . Textp {
dynrelocsym ( ctxt , s )
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
for _ , syms := range data {
2017-10-04 17:54:04 -04:00
for _ , s := range syms {
dynrelocsym ( ctxt , s )
2016-04-18 14:50:14 -04:00
}
2015-02-27 22:57:28 -05:00
}
2017-10-07 13:43:38 -04:00
if ctxt . IsELF {
2016-08-19 22:40:38 -04:00
elfdynhash ( ctxt )
2015-02-27 22:57:28 -05:00
}
}
2020-03-20 12:24:35 -04:00
func Codeblk ( ctxt * Link , out * OutBuf , addr int64 , size int64 ) {
CodeblkPad ( ctxt , out , addr , size , zeros [ : ] )
2016-06-11 17:12:28 -07:00
}
2020-03-17 10:24:40 -04:00
func CodeblkPad ( ctxt * Link , out * OutBuf , addr int64 , size int64 , pad [ ] byte ) {
2016-08-21 18:34:24 -04:00
if * flagA {
2017-10-01 02:37:20 +00:00
ctxt . Logf ( "codeblk [%#x,%#x) at offset %#x\n" , addr , addr + size , ctxt . Out . Offset ( ) )
2015-02-27 22:57:28 -05:00
}
2020-03-17 10:24:40 -04:00
writeBlocks ( out , ctxt . outSem , ctxt . Textp , addr , size , pad )
2015-02-27 22:57:28 -05:00
/* again for printing */
2016-08-21 18:34:24 -04:00
if ! * flagA {
2015-02-27 22:57:28 -05:00
return
}
2016-08-19 22:40:38 -04:00
syms := ctxt . Textp
2017-10-04 17:54:04 -04:00
for i , s := range syms {
if ! s . Attr . Reachable ( ) {
2015-02-27 22:57:28 -05:00
continue
}
2017-10-04 17:54:04 -04:00
if s . Value >= addr {
2016-04-22 09:38:41 +12:00
syms = syms [ i : ]
2015-02-27 22:57:28 -05:00
break
}
}
2015-03-02 12:35:15 -05:00
eaddr := addr + size
2017-10-04 17:54:04 -04:00
for _ , s := range syms {
if ! s . Attr . Reachable ( ) {
2015-02-27 22:57:28 -05:00
continue
}
2017-10-04 17:54:04 -04:00
if s . Value >= eaddr {
2015-02-27 22:57:28 -05:00
break
}
2017-10-04 17:54:04 -04:00
if addr < s . Value {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "%-20s %.8x|" , "_" , uint64 ( addr ) )
2017-10-04 17:54:04 -04:00
for ; addr < s . Value ; addr ++ {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( " %.2x" , 0 )
2015-02-27 22:57:28 -05:00
}
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\n" )
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
ctxt . Logf ( "%.6x\t%-20s\n" , uint64 ( addr ) , s . Name )
2018-04-06 21:41:06 +01:00
q := s . P
2015-02-27 22:57:28 -05:00
2015-08-11 10:59:24 -04:00
for len ( q ) >= 16 {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "%.6x\t% x\n" , uint64 ( addr ) , q [ : 16 ] )
2015-02-27 22:57:28 -05:00
addr += 16
q = q [ 16 : ]
}
2015-08-11 10:59:24 -04:00
if len ( q ) > 0 {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "%.6x\t% x\n" , uint64 ( addr ) , q )
2015-08-11 10:59:24 -04:00
addr += int64 ( len ( q ) )
2015-02-27 22:57:28 -05:00
}
}
if addr < eaddr {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "%-20s %.8x|" , "_" , uint64 ( addr ) )
2015-02-27 22:57:28 -05:00
for ; addr < eaddr ; addr ++ {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( " %.2x" , 0 )
2015-02-27 22:57:28 -05:00
}
}
}
2020-03-17 10:24:40 -04:00
const blockSize = 1 << 20 // 1MB chunks written at a time.
// writeBlocks writes a specified chunk of symbols to the output buffer. It
// breaks the write up into ≥blockSize chunks to write them out, and schedules
// as many goroutines as necessary to accomplish this task. This call then
// blocks, waiting on the writes to complete. Note that we use the sem parameter
// to limit the number of concurrent writes taking place.
func writeBlocks ( out * OutBuf , sem chan int , syms [ ] * sym . Symbol , addr , size int64 , pad [ ] byte ) {
for i , s := range syms {
if s . Value >= addr && ! s . Attr . SubSymbol ( ) {
syms = syms [ i : ]
break
}
}
var wg sync . WaitGroup
max , lastAddr , written := int64 ( blockSize ) , addr + size , int64 ( 0 )
for addr < lastAddr {
// Find the last symbol we'd write.
idx := - 1
length := int64 ( 0 )
for i , s := range syms {
// If the next symbol's size would put us out of bounds on the total length,
// stop looking.
if s . Value + s . Size > lastAddr {
break
}
// We're gonna write this symbol.
idx = i
// If we cross over the max size, we've got enough symbols.
if s . Value + s . Size > addr + max {
break
}
}
// If we didn't find any symbols to write, we're done here.
if idx < 0 {
break
}
2020-03-31 14:36:19 -04:00
// Compute the length to write, including padding.
if idx + 1 < len ( syms ) {
length = syms [ idx + 1 ] . Value - addr
} else {
length = lastAddr - addr
}
2020-03-17 10:24:40 -04:00
// Start the block output operator.
if o , err := out . View ( uint64 ( out . Offset ( ) + written ) ) ; err == nil {
sem <- 1
wg . Add ( 1 )
go func ( o * OutBuf , syms [ ] * sym . Symbol , addr , size int64 , pad [ ] byte ) {
writeBlock ( o , syms , addr , size , pad )
wg . Done ( )
<- sem
} ( o , syms , addr , length , pad )
} else { // output not mmaped, don't parallelize.
writeBlock ( out , syms , addr , length , pad )
}
// Prepare for the next loop.
if idx != - 1 {
syms = syms [ idx + 1 : ]
}
written += length
addr += length
}
wg . Wait ( )
}
func writeBlock ( out * OutBuf , syms [ ] * sym . Symbol , addr , size int64 , pad [ ] byte ) {
2016-04-18 14:50:14 -04:00
for i , s := range syms {
2020-03-17 10:24:40 -04:00
if s . Value >= addr && ! s . Attr . SubSymbol ( ) {
2016-04-18 14:50:14 -04:00
syms = syms [ i : ]
break
}
}
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
// This doesn't distinguish the memory size from the file
// size, and it lays out the file based on Symbol.Value, which
// is the virtual address. DWARF compression changes file sizes,
// so dwarfcompress will fix this up later if necessary.
2016-04-18 14:50:14 -04:00
eaddr := addr + size
for _ , s := range syms {
2017-04-28 13:01:03 +12:00
if s . Attr . SubSymbol ( ) {
2016-04-18 14:50:14 -04:00
continue
}
if s . Value >= eaddr {
break
}
if s . Value < addr {
2016-09-17 09:39:33 -04:00
Errorf ( s , "phase error: addr=%#x but sym=%#x type=%d" , addr , s . Value , s . Type )
2016-04-18 14:50:14 -04:00
errorexit ( )
}
if addr < s . Value {
2019-03-17 15:51:30 +01:00
out . WriteStringPad ( "" , int ( s . Value - addr ) , pad )
2016-04-18 14:50:14 -04:00
addr = s . Value
}
2019-04-03 23:35:44 -04:00
out . WriteSym ( s )
2016-04-18 14:50:14 -04:00
addr += int64 ( len ( s . P ) )
if addr < s . Value + s . Size {
2019-03-17 15:51:30 +01:00
out . WriteStringPad ( "" , int ( s . Value + s . Size - addr ) , pad )
2016-04-18 14:50:14 -04:00
addr = s . Value + s . Size
}
if addr != s . Value + s . Size {
2016-09-17 09:39:33 -04:00
Errorf ( s , "phase error: addr=%#x value+size=%#x" , addr , s . Value + s . Size )
2016-04-18 14:50:14 -04:00
errorexit ( )
}
if s . Value + s . Size >= eaddr {
break
}
}
if addr < eaddr {
2019-03-17 15:51:30 +01:00
out . WriteStringPad ( "" , int ( eaddr - addr ) , pad )
2016-04-18 14:50:14 -04:00
}
2020-03-17 10:24:40 -04:00
}
type writeFn func ( * Link , * OutBuf , int64 , int64 )
// WriteParallel handles scheduling parallel execution of data write functions.
func WriteParallel ( wg * sync . WaitGroup , fn writeFn , ctxt * Link , seek , vaddr , length uint64 ) {
if out , err := ctxt . Out . View ( seek ) ; err != nil {
ctxt . Out . SeekSet ( int64 ( seek ) )
fn ( ctxt , ctxt . Out , int64 ( vaddr ) , int64 ( length ) )
} else {
wg . Add ( 1 )
go func ( ) {
defer wg . Done ( )
fn ( ctxt , out , int64 ( vaddr ) , int64 ( length ) )
} ( )
}
2016-04-18 14:50:14 -04:00
}
2020-03-20 12:24:35 -04:00
func Datblk ( ctxt * Link , out * OutBuf , addr , size int64 ) {
2020-03-17 10:24:40 -04:00
writeDatblkToOutBuf ( ctxt , out , addr , size )
}
2019-04-03 22:41:48 -04:00
// Used only on Wasm for now.
2019-03-17 15:51:30 +01:00
func DatblkBytes ( ctxt * Link , addr int64 , size int64 ) [ ] byte {
buf := bytes . NewBuffer ( make ( [ ] byte , 0 , size ) )
out := & OutBuf { w : bufio . NewWriter ( buf ) }
writeDatblkToOutBuf ( ctxt , out , addr , size )
out . Flush ( )
return buf . Bytes ( )
}
func writeDatblkToOutBuf ( ctxt * Link , out * OutBuf , addr int64 , size int64 ) {
2016-08-21 18:34:24 -04:00
if * flagA {
2017-10-01 02:37:20 +00:00
ctxt . Logf ( "datblk [%#x,%#x) at offset %#x\n" , addr , addr + size , ctxt . Out . Offset ( ) )
2015-02-27 22:57:28 -05:00
}
2020-03-17 10:24:40 -04:00
writeBlocks ( out , ctxt . outSem , ctxt . datap , addr , size , zeros [ : ] )
2015-02-27 22:57:28 -05:00
/* again for printing */
2016-08-21 18:34:24 -04:00
if ! * flagA {
2015-02-27 22:57:28 -05:00
return
}
2020-03-09 11:08:04 -04:00
syms := ctxt . datap
2016-04-18 14:50:14 -04:00
for i , sym := range syms {
2015-02-27 22:57:28 -05:00
if sym . Value >= addr {
2016-04-18 14:50:14 -04:00
syms = syms [ i : ]
2015-02-27 22:57:28 -05:00
break
}
}
2015-03-02 12:35:15 -05:00
eaddr := addr + size
2016-04-18 14:50:14 -04:00
for _ , sym := range syms {
2015-02-27 22:57:28 -05:00
if sym . Value >= eaddr {
break
}
if addr < sym . Value {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\t%.8x| 00 ...\n" , uint64 ( addr ) )
2015-02-27 22:57:28 -05:00
addr = sym . Value
}
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "%s\n\t%.8x|" , sym . Name , uint64 ( addr ) )
2016-04-20 10:36:49 -04:00
for i , b := range sym . P {
if i > 0 && i % 16 == 0 {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\n\t%.8x|" , uint64 ( addr ) + uint64 ( i ) )
2015-02-27 22:57:28 -05:00
}
2016-08-25 12:32:42 +10:00
ctxt . Logf ( " %.2x" , b )
2015-02-27 22:57:28 -05:00
}
addr += int64 ( len ( sym . P ) )
for ; addr < sym . Value + sym . Size ; addr ++ {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( " %.2x" , 0 )
2015-02-27 22:57:28 -05:00
}
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\n" )
2015-02-27 22:57:28 -05:00
2017-10-05 10:20:17 -04:00
if ctxt . LinkMode != LinkExternal {
2016-04-20 10:36:49 -04:00
continue
}
2018-05-17 19:47:52 +03:00
for i := range sym . R {
2018-08-27 11:49:38 -04:00
r := & sym . R [ i ] // Copying sym.Reloc has measurable impact on performance
2016-04-20 10:36:49 -04:00
rsname := ""
2019-01-23 09:01:37 +01:00
rsval := int64 ( 0 )
2016-04-20 10:36:49 -04:00
if r . Sym != nil {
rsname = r . Sym . Name
2019-01-23 09:01:37 +01:00
rsval = r . Sym . Value
2016-04-20 10:36:49 -04:00
}
typ := "?"
switch r . Type {
2017-04-18 12:53:25 -07:00
case objabi . R_ADDR :
2016-04-20 10:36:49 -04:00
typ = "addr"
2017-04-18 12:53:25 -07:00
case objabi . R_PCREL :
2016-04-20 10:36:49 -04:00
typ = "pcrel"
2017-04-18 12:53:25 -07:00
case objabi . R_CALL :
2016-04-20 10:36:49 -04:00
typ = "call"
2015-02-27 22:57:28 -05:00
}
2019-01-23 09:01:37 +01:00
ctxt . Logf ( "\treloc %.8x/%d %s %s+%#x [%#x]\n" , uint ( sym . Value + int64 ( r . Off ) ) , r . Siz , typ , rsname , r . Add , rsval + r . Add )
2015-02-27 22:57:28 -05:00
}
}
if addr < eaddr {
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\t%.8x| 00 ...\n" , uint ( addr ) )
2015-02-27 22:57:28 -05:00
}
2016-08-25 12:32:42 +10:00
ctxt . Logf ( "\t%.8x|\n" , uint ( eaddr ) )
2015-02-27 22:57:28 -05:00
}
2020-03-20 12:24:35 -04:00
func Dwarfblk ( ctxt * Link , out * OutBuf , addr int64 , size int64 ) {
2020-03-17 10:24:40 -04:00
if * flagA {
ctxt . Logf ( "dwarfblk [%#x,%#x) at offset %#x\n" , addr , addr + size , ctxt . Out . Offset ( ) )
}
writeBlocks ( out , ctxt . outSem , dwarfp , addr , size , zeros [ : ] )
}
2016-02-29 11:49:49 +02:00
var zeros [ 512 ] byte
2015-02-27 22:57:28 -05:00
2018-01-09 10:10:46 -05:00
var (
strdata = make ( map [ string ] string )
strnames [ ] string
)
2015-06-29 13:03:11 -04:00
2016-08-19 22:40:38 -04:00
func addstrdata1 ( ctxt * Link , arg string ) {
2017-10-05 15:49:32 +02:00
eq := strings . Index ( arg , "=" )
2016-10-25 08:59:35 -04:00
dot := strings . LastIndex ( arg [ : eq + 1 ] , "." )
if eq < 0 || dot < 0 {
2015-05-21 14:35:02 -04:00
Exitf ( "-X flag requires argument of the form importpath.name=value" )
2015-02-27 22:57:28 -05:00
}
2017-10-21 07:29:46 -04:00
pkg := arg [ : dot ]
2017-10-05 10:20:17 -04:00
if ctxt . BuildMode == BuildModePlugin && pkg == "main" {
2017-10-01 20:28:53 -04:00
pkg = * flagPluginPath
}
2017-10-21 07:29:46 -04:00
pkg = objabi . PathToPrefix ( pkg )
2018-01-09 10:10:46 -05:00
name := pkg + arg [ dot : eq ]
value := arg [ eq + 1 : ]
if _ , ok := strdata [ name ] ; ! ok {
strnames = append ( strnames , name )
}
strdata [ name ] = value
2015-02-27 22:57:28 -05:00
}
2018-06-26 13:48:05 +07:00
// addstrdata sets the initial value of the string variable name to value.
2020-02-12 17:20:00 -05:00
func addstrdata ( arch * sys . Arch , l * loader . Loader , name , value string ) {
s := l . Lookup ( name , 0 )
if s == 0 {
2018-01-09 10:10:46 -05:00
return
}
2020-02-12 17:20:00 -05:00
if goType := l . SymGoType ( s ) ; goType == 0 {
return
} else if typeName := l . SymName ( goType ) ; typeName != "type.string" {
Errorf ( nil , "%s: cannot set with -X: not a var of type string (%s)" , name , typeName )
2018-01-09 10:10:46 -05:00
return
}
2020-03-13 16:56:00 -04:00
if ! l . AttrReachable ( s ) {
return // don't bother setting unreachable variable
}
2020-02-18 10:54:26 -05:00
bld := l . MakeSymbolUpdater ( s )
2020-02-12 17:20:00 -05:00
if bld . Type ( ) == sym . SBSS {
bld . SetType ( sym . SDATA )
2018-01-09 10:10:46 -05:00
}
2020-02-12 17:20:00 -05:00
p := fmt . Sprintf ( "%s.str" , name )
2020-02-18 10:54:26 -05:00
sp := l . LookupOrCreateSym ( p , 0 )
sbld := l . MakeSymbolUpdater ( sp )
2015-02-27 22:57:28 -05:00
2020-02-12 17:20:00 -05:00
sbld . Addstring ( value )
sbld . SetType ( sym . SRODATA )
2015-02-27 22:57:28 -05:00
2020-02-12 17:20:00 -05:00
bld . SetSize ( 0 )
bld . SetData ( make ( [ ] byte , 0 , arch . PtrSize * 2 ) )
bld . SetReadOnly ( false )
bld . SetRelocs ( nil )
bld . AddAddrPlus ( arch , sp , 0 )
bld . AddUint ( arch , uint64 ( len ( value ) ) )
2015-02-27 22:57:28 -05:00
}
2018-01-09 10:10:46 -05:00
func ( ctxt * Link ) dostrdata ( ) {
for _ , name := range strnames {
2020-02-12 17:20:00 -05:00
addstrdata ( ctxt . Arch , ctxt . loader , name , strdata [ name ] )
2015-06-29 13:03:11 -04:00
}
}
2017-10-04 17:54:04 -04:00
func Addstring ( s * sym . Symbol , str string ) int64 {
2015-02-27 22:57:28 -05:00
if s . Type == 0 {
2017-10-04 17:54:04 -04:00
s . Type = sym . SNOPTRDATA
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
s . Attr |= sym . AttrReachable
2016-04-07 18:00:57 +03:00
r := s . Size
2015-02-27 22:57:28 -05:00
if s . Name == ".shstrtab" {
2016-09-17 09:39:33 -04:00
elfsetstring ( s , str , int ( r ) )
2015-02-27 22:57:28 -05:00
}
2016-04-07 18:00:57 +03:00
s . P = append ( s . P , str ... )
s . P = append ( s . P , 0 )
s . Size = int64 ( len ( s . P ) )
return r
2015-02-27 22:57:28 -05:00
}
2015-04-11 12:05:21 +08:00
// addgostring adds str, as a Go string value, to s. symname is the name of the
// symbol used to define the string data and must be unique per linked object.
2017-10-04 17:54:04 -04:00
func addgostring ( ctxt * Link , s * sym . Symbol , symname , str string ) {
sdata := ctxt . Syms . Lookup ( symname , 0 )
if sdata . Type != sym . Sxxx {
2016-09-17 09:39:33 -04:00
Errorf ( s , "duplicate symname in addgostring: %s" , symname )
2015-04-11 12:05:21 +08:00
}
2017-10-04 17:54:04 -04:00
sdata . Attr |= sym . AttrReachable
sdata . Attr |= sym . AttrLocal
sdata . Type = sym . SRODATA
sdata . Size = int64 ( len ( str ) )
sdata . P = [ ] byte ( str )
s . AddAddr ( ctxt . Arch , sdata )
2017-09-30 15:06:44 +00:00
s . AddUint ( ctxt . Arch , uint64 ( len ( str ) ) )
2015-04-11 12:05:21 +08:00
}
2017-10-04 17:54:04 -04:00
func addinitarrdata ( ctxt * Link , s * sym . Symbol ) {
2015-03-17 09:47:01 -07:00
p := s . Name + ".ptr"
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
sp := ctxt . Syms . Lookup ( p , 0 )
2017-10-04 17:54:04 -04:00
sp . Type = sym . SINITARR
2015-03-17 09:47:01 -07:00
sp . Size = 0
2017-10-04 17:54:04 -04:00
sp . Attr |= sym . AttrDuplicateOK
2017-09-30 15:06:44 +00:00
sp . AddAddr ( ctxt . Arch , s )
2015-03-17 09:47:01 -07:00
}
2016-03-02 21:24:04 -05:00
// symalign returns the required alignment for the given symbol s.
2017-10-04 17:54:04 -04:00
func symalign ( s * sym . Symbol ) int32 {
2018-04-01 00:58:48 +03:00
min := int32 ( thearch . Minalign )
2016-03-02 21:24:04 -05:00
if s . Align >= min {
2015-02-27 22:57:28 -05:00
return s . Align
2016-03-02 21:24:04 -05:00
} else if s . Align != 0 {
return min
2015-02-27 22:57:28 -05:00
}
2016-10-13 22:31:46 +03:00
if strings . HasPrefix ( s . Name , "go.string." ) || strings . HasPrefix ( s . Name , "type..namedata." ) {
2016-03-07 22:48:07 -05:00
// String data is just bytes.
// If we align it, we waste a lot of space to padding.
2016-03-18 16:57:54 -04:00
return min
2016-03-07 22:48:07 -05:00
}
2018-04-01 00:58:48 +03:00
align := int32 ( thearch . Maxalign )
2016-03-02 21:24:04 -05:00
for int64 ( align ) > s . Size && align > min {
2015-02-27 22:57:28 -05:00
align >>= 1
}
2019-02-20 15:48:22 +01:00
s . Align = align
2015-02-27 22:57:28 -05:00
return align
}
2017-10-04 17:54:04 -04:00
func aligndatsize ( datsize int64 , s * sym . Symbol ) int64 {
2015-02-27 22:57:28 -05:00
return Rnd ( datsize , int64 ( symalign ( s ) ) )
}
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
const debugGCProg = false
2015-02-27 22:57:28 -05:00
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
type GCProg struct {
2016-08-19 22:40:38 -04:00
ctxt * Link
2017-10-04 17:54:04 -04:00
sym * sym . Symbol
2016-08-19 22:40:38 -04:00
w gcprog . Writer
2015-02-27 22:57:28 -05:00
}
2016-08-19 22:40:38 -04:00
func ( p * GCProg ) Init ( ctxt * Link , name string ) {
p . ctxt = ctxt
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
p . sym = ctxt . Syms . Lookup ( name , 0 )
2016-08-19 22:40:38 -04:00
p . w . Init ( p . writeByte ( ctxt ) )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
if debugGCProg {
fmt . Fprintf ( os . Stderr , "ld: start GCProg %s\n" , name )
p . w . Debug ( os . Stderr )
2015-02-27 22:57:28 -05:00
}
}
2016-08-19 22:40:38 -04:00
func ( p * GCProg ) writeByte ( ctxt * Link ) func ( x byte ) {
return func ( x byte ) {
2017-09-30 15:06:44 +00:00
p . sym . AddUint8 ( x )
2016-08-19 22:40:38 -04:00
}
2015-02-27 22:57:28 -05:00
}
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
func ( p * GCProg ) End ( size int64 ) {
2017-09-30 21:10:49 +00:00
p . w . ZeroUntil ( size / int64 ( p . ctxt . Arch . PtrSize ) )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
p . w . End ( )
if debugGCProg {
fmt . Fprintf ( os . Stderr , "ld: end GCProg\n" )
2015-02-27 22:57:28 -05:00
}
}
2017-10-04 17:54:04 -04:00
func ( p * GCProg ) AddSym ( s * sym . Symbol ) {
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
typ := s . Gotype
2017-10-04 17:54:04 -04:00
// Things without pointers should be in sym.SNOPTRDATA or sym.SNOPTRBSS;
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
// everything we see should have pointers and should therefore have a type.
if typ == nil {
2016-09-19 14:13:07 -04:00
switch s . Name {
case "runtime.data" , "runtime.edata" , "runtime.bss" , "runtime.ebss" :
// Ignore special symbols that are sometimes laid out
// as real symbols. See comment about dyld on darwin in
// the address function.
return
}
2016-09-17 09:39:33 -04:00
Errorf ( s , "missing Go type information for global symbol: size %d" , s . Size )
2015-02-27 22:57:28 -05:00
return
}
2017-09-30 21:10:49 +00:00
ptrsize := int64 ( p . ctxt . Arch . PtrSize )
2019-10-03 18:25:21 -04:00
nptr := decodetypePtrdata ( p . ctxt . Arch , typ . P ) / ptrsize
2015-02-27 22:57:28 -05:00
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
if debugGCProg {
fmt . Fprintf ( os . Stderr , "gcprog sym: %s at %d (ptr=%d+%d)\n" , s . Name , s . Value , s . Value / ptrsize , nptr )
2015-04-28 00:28:47 -04:00
}
2015-02-27 22:57:28 -05:00
2019-10-03 18:25:21 -04:00
if decodetypeUsegcprog ( p . ctxt . Arch , typ . P ) == 0 {
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
// Copy pointers from mask into program.
2016-08-22 10:33:13 -04:00
mask := decodetypeGcmask ( p . ctxt , typ )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
for i := int64 ( 0 ) ; i < nptr ; i ++ {
if ( mask [ i / 8 ] >> uint ( i % 8 ) ) & 1 != 0 {
p . w . Ptr ( s . Value / ptrsize + i )
2015-02-27 22:57:28 -05:00
}
}
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
return
2015-02-27 22:57:28 -05:00
}
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
// Copy program.
2016-08-22 10:33:13 -04:00
prog := decodetypeGcprog ( p . ctxt , typ )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
p . w . ZeroUntil ( s . Value / ptrsize )
2015-05-25 16:13:50 +12:00
p . w . Append ( prog [ 4 : ] , nptr )
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
// dataSortKey is used to sort a slice of data symbol *sym.Symbol pointers.
2016-12-08 22:15:40 +00:00
// The sort keys are kept inline to improve cache behavior while sorting.
2016-03-09 16:23:25 +02:00
type dataSortKey struct {
2016-04-18 14:50:14 -04:00
size int64
name string
2017-10-04 17:54:04 -04:00
sym * sym . Symbol
2016-03-09 16:23:25 +02:00
}
2016-04-18 14:50:14 -04:00
type bySizeAndName [ ] dataSortKey
2016-03-09 16:23:25 +02:00
2016-04-18 14:50:14 -04:00
func ( d bySizeAndName ) Len ( ) int { return len ( d ) }
func ( d bySizeAndName ) Swap ( i , j int ) { d [ i ] , d [ j ] = d [ j ] , d [ i ] }
func ( d bySizeAndName ) Less ( i , j int ) bool {
s1 , s2 := d [ i ] , d [ j ]
if s1 . size != s2 . size {
return s1 . size < s2 . size
2016-04-04 13:07:24 -04:00
}
2016-04-18 14:50:14 -04:00
return s1 . name < s2 . name
2016-03-09 16:23:25 +02:00
}
2017-11-08 09:43:56 +01:00
// cutoff is the maximum data section size permitted by the linker
// (see issue #9862).
const cutoff = 2e9 // 2 GB (or so; looks better in errors than 2^31)
2016-04-19 08:59:56 -04:00
2017-10-04 17:54:04 -04:00
func checkdatsize ( ctxt * Link , datsize int64 , symn sym . SymKind ) {
2016-04-19 08:59:56 -04:00
if datsize > cutoff {
2017-11-08 09:43:56 +01:00
Errorf ( nil , "too much data in section %v (over %v bytes)" , symn , cutoff )
2016-04-19 08:59:56 -04:00
}
2015-02-27 22:57:28 -05:00
}
2020-03-09 10:35:31 -04:00
// fixZeroSizedSymbols gives a few special symbols with zero size some space.
func fixZeroSizedSymbols ( ctxt * Link ) {
// The values in moduledata are filled out by relocations
// pointing to the addresses of these special symbols.
// Typically these symbols have no size and are not laid
// out with their matching section.
//
// However on darwin, dyld will find the special symbol
// in the first loaded module, even though it is local.
//
// (An hypothesis, formed without looking in the dyld sources:
// these special symbols have no size, so their address
// matches a real symbol. The dynamic linker assumes we
// want the normal symbol with the same address and finds
// it in the other module.)
//
// To work around this we lay out the symbls whose
// addresses are vital for multi-module programs to work
// as normal symbols, and give them a little size.
//
// On AIX, as all DATA sections are merged together, ld might not put
// these symbols at the beginning of their respective section if there
// aren't real symbols, their alignment might not match the
// first symbol alignment. Therefore, there are explicitly put at the
// beginning of their section with the same alignment.
if ! ( ctxt . DynlinkingGo ( ) && ctxt . HeadType == objabi . Hdarwin ) && ! ( ctxt . HeadType == objabi . Haix && ctxt . LinkMode == LinkExternal ) {
return
}
2016-03-09 16:23:25 +02:00
2020-03-09 10:35:31 -04:00
bss := ctxt . Syms . Lookup ( "runtime.bss" , 0 )
bss . Size = 8
bss . Attr . Set ( sym . AttrSpecial , false )
ctxt . Syms . Lookup ( "runtime.ebss" , 0 ) . Attr . Set ( sym . AttrSpecial , false )
2016-09-19 14:13:07 -04:00
2020-03-09 10:35:31 -04:00
data := ctxt . Syms . Lookup ( "runtime.data" , 0 )
data . Size = 8
data . Attr . Set ( sym . AttrSpecial , false )
2016-09-19 14:13:07 -04:00
2020-03-09 10:35:31 -04:00
edata := ctxt . Syms . Lookup ( "runtime.edata" , 0 )
edata . Attr . Set ( sym . AttrSpecial , false )
if ctxt . HeadType == objabi . Haix {
// XCOFFTOC symbols are part of .data section.
edata . Type = sym . SXCOFFTOC
}
2019-02-20 16:20:56 +01:00
2020-03-09 10:35:31 -04:00
types := ctxt . Syms . Lookup ( "runtime.types" , 0 )
types . Type = sym . STYPE
types . Size = 8
types . Attr . Set ( sym . AttrSpecial , false )
2019-02-20 16:20:56 +01:00
2020-03-09 10:35:31 -04:00
etypes := ctxt . Syms . Lookup ( "runtime.etypes" , 0 )
etypes . Type = sym . SFUNCTAB
etypes . Attr . Set ( sym . AttrSpecial , false )
if ctxt . HeadType == objabi . Haix {
rodata := ctxt . Syms . Lookup ( "runtime.rodata" , 0 )
rodata . Type = sym . SSTRING
rodata . Size = 8
rodata . Attr . Set ( sym . AttrSpecial , false )
ctxt . Syms . Lookup ( "runtime.erodata" , 0 ) . Attr . Set ( sym . AttrSpecial , false )
}
}
// makeRelroForSharedLib creates a section of readonly data if necessary.
func makeRelroForSharedLib ( target * Link , data * [ sym . SXREF ] [ ] * sym . Symbol ) {
if ! target . UseRelro ( ) {
return
}
// "read only" data with relocations needs to go in its own section
// when building a shared library. We do this by boosting objects of
// type SXXX with relocations to type SXXXRELRO.
for _ , symnro := range sym . ReadOnly {
symnrelro := sym . RelROMap [ symnro ]
ro := [ ] * sym . Symbol { }
relro := data [ symnrelro ]
for _ , s := range data [ symnro ] {
isRelro := len ( s . R ) > 0
switch s . Type {
case sym . STYPE , sym . STYPERELRO , sym . SGOFUNCRELRO :
// Symbols are not sorted yet, so it is possible
// that an Outer symbol has been changed to a
// relro Type before it reaches here.
isRelro = true
case sym . SFUNCTAB :
if target . IsAIX ( ) && s . Name == "runtime.etypes" {
// runtime.etypes must be at the end of
// the relro datas.
isRelro = true
}
}
if isRelro {
s . Type = symnrelro
if s . Outer != nil {
s . Outer . Type = s . Type
}
relro = append ( relro , s )
} else {
ro = append ( ro , s )
}
}
2019-02-20 16:20:56 +01:00
2020-03-09 10:35:31 -04:00
// Check that we haven't made two symbols with the same .Outer into
// different types (because references two symbols with non-nil Outer
// become references to the outer symbol + offset it's vital that the
// symbol and the outer end up in the same section).
for _ , s := range relro {
if s . Outer != nil && s . Outer . Type != s . Type {
Errorf ( s , "inconsistent types for symbol and its Outer %s (%v != %v)" ,
s . Outer . Name , s . Type , s . Outer . Type )
}
2019-02-20 16:20:56 +01:00
}
2020-03-09 10:35:31 -04:00
data [ symnro ] = ro
data [ symnrelro ] = relro
2016-09-19 14:13:07 -04:00
}
2020-03-09 10:35:31 -04:00
}
func ( ctxt * Link ) dodata ( ) {
// Give zeros sized symbols space if necessary.
fixZeroSizedSymbols ( ctxt )
2016-09-19 14:13:07 -04:00
2016-04-18 14:50:14 -04:00
// Collect data symbols by type into data.
2017-10-04 17:54:04 -04:00
var data [ sym . SXREF ] [ ] * sym . Symbol
2016-09-20 14:59:39 +12:00
for _ , s := range ctxt . Syms . Allsym {
2017-04-28 13:01:03 +12:00
if ! s . Attr . Reachable ( ) || s . Attr . Special ( ) || s . Attr . SubSymbol ( ) {
2015-02-27 22:57:28 -05:00
continue
}
2017-10-04 17:54:04 -04:00
if s . Type <= sym . STEXT || s . Type >= sym . SXREF {
2016-04-18 14:50:14 -04:00
continue
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
data [ s . Type ] = append ( data [ s . Type ] , s )
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
// Now that we have the data symbols, but before we start
// to assign addresses, record all the necessary
// dynamic relocations. These will grow the relocation
// symbol, which is itself data.
//
// On darwin, we need the symbol table numbers for dynreloc.
2017-10-07 13:49:44 -04:00
if ctxt . HeadType == objabi . Hdarwin {
2016-08-19 22:40:38 -04:00
machosymorder ( ctxt )
2015-02-27 22:57:28 -05:00
}
2016-08-19 22:40:38 -04:00
dynreloc ( ctxt , & data )
2015-02-27 22:57:28 -05:00
2020-03-09 10:35:31 -04:00
// Move any RO data with relocations to a separate section.
makeRelroForSharedLib ( ctxt , & data )
2015-03-30 15:45:33 +02:00
2016-04-18 14:50:14 -04:00
// Sort symbols.
2017-10-04 17:54:04 -04:00
var dataMaxAlign [ sym . SXREF ] int32
2016-04-18 14:50:14 -04:00
var wg sync . WaitGroup
for symn := range data {
2017-10-04 17:54:04 -04:00
symn := sym . SymKind ( symn )
2016-04-18 14:50:14 -04:00
wg . Add ( 1 )
go func ( ) {
2016-08-19 22:40:38 -04:00
data [ symn ] , dataMaxAlign [ symn ] = dodataSect ( ctxt , symn , data [ symn ] )
2016-04-18 14:50:14 -04:00
wg . Done ( )
} ( )
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
wg . Wait ( )
2015-02-27 22:57:28 -05:00
2019-02-20 16:20:56 +01:00
if ctxt . HeadType == objabi . Haix && ctxt . LinkMode == LinkExternal {
// These symbols must have the same alignment as their section.
// Otherwize, ld might change the layout of Go sections.
ctxt . Syms . ROLookup ( "runtime.data" , 0 ) . Align = dataMaxAlign [ sym . SDATA ]
ctxt . Syms . ROLookup ( "runtime.bss" , 0 ) . Align = dataMaxAlign [ sym . SBSS ]
}
2016-04-18 14:50:14 -04:00
// Allocate sections.
// Data is processed before segtext, because we need
// to see all symbols in the .data and .bss sections in order
// to generate garbage collection information.
2015-03-02 12:35:15 -05:00
datsize := int64 ( 0 )
2015-02-27 22:57:28 -05:00
2016-09-07 14:45:27 -04:00
// Writable data sections that do not need any specialized handling.
2017-10-04 17:54:04 -04:00
writable := [ ] sym . SymKind {
2019-04-22 23:02:37 -04:00
sym . SBUILDINFO ,
2017-10-04 17:54:04 -04:00
sym . SELFSECT ,
sym . SMACHO ,
sym . SMACHOGOT ,
sym . SWINDOWS ,
2016-04-18 14:50:14 -04:00
}
2016-09-07 14:45:27 -04:00
for _ , symn := range writable {
2016-04-18 14:50:14 -04:00
for _ , s := range data [ symn ] {
2017-09-30 21:10:49 +00:00
sect := addsection ( ctxt . Arch , & Segdata , s . Name , 06 )
2016-04-18 14:50:14 -04:00
sect . Align = symalign ( s )
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SDATA
2016-04-18 14:50:14 -04:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2016-04-18 14:50:14 -04:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
2016-08-19 22:40:38 -04:00
checkdatsize ( ctxt , datsize , symn )
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
// .got (and .toc on ppc64)
2017-10-04 17:54:04 -04:00
if len ( data [ sym . SELFGOT ] ) > 0 {
2017-09-30 21:10:49 +00:00
sect := addsection ( ctxt . Arch , & Segdata , ".got" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SELFGOT ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SELFGOT ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SDATA
2015-02-27 22:57:28 -05:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
// Resolve .TOC. symbol for this object file (ppc64)
2018-04-06 21:41:06 +01:00
toc := ctxt . Syms . ROLookup ( ".TOC." , int ( s . Version ) )
2015-02-27 22:57:28 -05:00
if toc != nil {
toc . Sect = sect
toc . Outer = s
toc . Sub = s . Sub
s . Sub = toc
toc . Value = 0x8000
}
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SELFGOT )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
/* pointer-free data */
2017-09-30 21:10:49 +00:00
sect := addsection ( ctxt . Arch , & Segdata , ".noptrdata" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SNOPTRDATA ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.noptrdata" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.enoptrdata" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SNOPTRDATA ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SDATA
2015-02-27 22:57:28 -05:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SNOPTRDATA )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
2017-10-07 13:28:51 -04:00
hasinitarr := ctxt . linkShared
2015-04-01 14:17:43 +13:00
2015-02-27 22:57:28 -05:00
/* shared library initializer */
2017-10-05 10:20:17 -04:00
switch ctxt . BuildMode {
case BuildModeCArchive , BuildModeCShared , BuildModeShared , BuildModePlugin :
2015-04-01 14:17:43 +13:00
hasinitarr = true
}
2019-03-25 10:33:49 +01:00
if ctxt . HeadType == objabi . Haix {
if len ( data [ sym . SINITARR ] ) > 0 {
Errorf ( nil , "XCOFF format doesn't allow .init_array section" )
}
}
2018-12-07 14:19:01 -08:00
if hasinitarr && len ( data [ sym . SINITARR ] ) > 0 {
2017-09-30 21:10:49 +00:00
sect := addsection ( ctxt . Arch , & Segdata , ".init_array" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SINITARR ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SINITARR ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SINITARR )
2015-02-27 22:57:28 -05:00
}
/* data */
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segdata , ".data" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SDATA ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.data" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.edata" , 0 ) . Sect = sect
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
var gc GCProg
2016-08-19 22:40:38 -04:00
gc . Init ( ctxt , "runtime.gcdata" )
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SDATA ] {
2015-02-27 22:57:28 -05:00
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SDATA
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
gc . AddSym ( s )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2019-02-20 16:20:56 +01:00
gc . End ( datsize - int64 ( sect . Vaddr ) )
2018-09-28 17:02:16 +02:00
// On AIX, TOC entries must be the last of .data
2019-02-20 16:20:56 +01:00
// These aren't part of gc as they won't change during the runtime.
2018-09-28 17:02:16 +02:00
for _ , s := range data [ sym . SXCOFFTOC ] {
s . Sect = sect
s . Type = sym . SDATA
datsize = aligndatsize ( datsize , s )
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
datsize += s . Size
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SDATA )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
/* bss */
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segdata , ".bss" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SBSS ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.bss" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.ebss" , 0 ) . Sect = sect
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
gc = GCProg { }
2016-08-19 22:40:38 -04:00
gc . Init ( ctxt , "runtime.gcbss" )
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SBSS ] {
2015-02-27 22:57:28 -05:00
s . Sect = sect
datsize = aligndatsize ( datsize , s )
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
gc . AddSym ( s )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SBSS )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
runtime: replace GC programs with simpler encoding, faster decoder
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes #9625.
Fixes #10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
2015-05-08 01:43:18 -04:00
gc . End ( int64 ( sect . Length ) )
2015-02-27 22:57:28 -05:00
/* pointer-free bss */
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segdata , ".noptrbss" , 06 )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SNOPTRBSS ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.noptrbss" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.enoptrbss" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SNOPTRBSS ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.end" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SNOPTRBSS )
2015-02-27 22:57:28 -05:00
2019-10-18 17:39:39 -07:00
// Coverage instrumentation counters for libfuzzer.
if len ( data [ sym . SLIBFUZZER_EXTRA_COUNTER ] ) > 0 {
sect := addsection ( ctxt . Arch , & Segdata , "__libfuzzer_extra_counters" , 06 )
sect . Align = dataMaxAlign [ sym . SLIBFUZZER_EXTRA_COUNTER ]
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
for _ , s := range data [ sym . SLIBFUZZER_EXTRA_COUNTER ] {
datsize = aligndatsize ( datsize , s )
s . Sect = sect
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
datsize += s . Size
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
checkdatsize ( ctxt , datsize , sym . SLIBFUZZER_EXTRA_COUNTER )
}
2017-10-04 17:54:04 -04:00
if len ( data [ sym . STLSBSS ] ) > 0 {
var sect * sym . Section
2019-02-20 16:16:38 +01:00
if ( ctxt . IsELF || ctxt . HeadType == objabi . Haix ) && ( ctxt . LinkMode == LinkExternal || ! * FlagD ) {
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segdata , ".tbss" , 06 )
sect . Align = int32 ( ctxt . Arch . PtrSize )
2015-08-11 12:29:00 +12:00
sect . Vaddr = 0
}
2015-02-27 22:57:28 -05:00
datsize = 0
2015-08-11 12:29:00 +12:00
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . STLSBSS ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2015-08-11 12:29:00 +12:00
s . Value = datsize
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . STLSBSS )
2015-02-27 22:57:28 -05:00
2015-08-11 12:29:00 +12:00
if sect != nil {
sect . Length = uint64 ( datsize )
2015-02-27 22:57:28 -05:00
}
}
/ *
* We finished data , begin read - only data .
* Not all systems support a separate read - only non - executable data section .
2018-06-02 15:35:25 +10:00
* ELF and Windows PE systems do .
2015-02-27 22:57:28 -05:00
* OS X and Plan 9 do not .
* And if we ' re using external linking mode , the point is moot ,
* since it ' s not our decision ; that code expects the sections in
* segtext .
* /
2017-10-04 17:54:04 -04:00
var segro * sym . Segment
2017-10-07 13:43:38 -04:00
if ctxt . IsELF && ctxt . LinkMode == LinkInternal {
2015-02-27 22:57:28 -05:00
segro = & Segrodata
2018-06-02 15:35:25 +10:00
} else if ctxt . HeadType == objabi . Hwindows {
segro = & Segrodata
2015-02-27 22:57:28 -05:00
} else {
segro = & Segtext
}
datsize = 0
/* read-only executable ELF, Mach-O sections */
2017-10-04 17:54:04 -04:00
if len ( data [ sym . STEXT ] ) != 0 {
Errorf ( nil , "dodata found an sym.STEXT symbol: %s" , data [ sym . STEXT ] [ 0 ] . Name )
2016-04-18 14:50:14 -04:00
}
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SELFRXSECT ] {
2017-09-30 21:10:49 +00:00
sect := addsection ( ctxt . Arch , & Segtext , s . Name , 04 )
2015-02-27 22:57:28 -05:00
sect . Align = symalign ( s )
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2015-02-27 22:57:28 -05:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SELFRXSECT )
2015-02-27 22:57:28 -05:00
}
/* read-only data */
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , segro , ".rodata" , 04 )
2015-02-27 22:57:28 -05:00
sect . Vaddr = 0
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.rodata" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.erodata" , 0 ) . Sect = sect
2017-10-05 10:20:17 -04:00
if ! ctxt . UseRelro ( ) {
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.types" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.etypes" , 0 ) . Sect = sect
2016-03-27 10:21:48 -04:00
}
2017-10-04 17:54:04 -04:00
for _ , symn := range sym . ReadOnly {
2016-04-19 08:59:56 -04:00
align := dataMaxAlign [ symn ]
2016-04-18 14:50:14 -04:00
if sect . Align < align {
sect . Align = align
}
}
datsize = Rnd ( datsize , int64 ( sect . Align ) )
2017-10-04 17:54:04 -04:00
for _ , symn := range sym . ReadOnly {
2019-02-20 15:54:11 +01:00
symnStartValue := datsize
2016-04-18 14:50:14 -04:00
for _ , s := range data [ symn ] {
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-04-18 14:50:14 -04:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2016-04-18 14:50:14 -04:00
}
2016-08-19 22:40:38 -04:00
checkdatsize ( ctxt , datsize , symn )
2019-02-20 15:54:11 +01:00
if ctxt . HeadType == objabi . Haix {
// Read-only symbols might be wrapped inside their outer
// symbol.
// XCOFF symbol table needs to know the size of
// these outer symbols.
xcoffUpdateOuterSize ( ctxt , datsize - symnStartValue , symn )
}
2015-02-27 22:57:28 -05:00
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
2016-09-05 23:29:16 -04:00
/* read-only ELF, Mach-O sections */
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SELFROSECT ] {
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , segro , s . Name , 04 )
2016-09-05 23:29:16 -04:00
sect . Align = symalign ( s )
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-09-05 23:29:16 -04:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
datsize += s . Size
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SELFROSECT )
2016-09-05 23:29:16 -04:00
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SMACHOPLT ] {
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , segro , s . Name , 04 )
2016-09-05 23:29:16 -04:00
sect . Align = symalign ( s )
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-09-05 23:29:16 -04:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
datsize += s . Size
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SMACHOPLT )
2016-09-05 23:29:16 -04:00
2015-05-21 13:07:19 +12:00
// There is some data that are conceptually read-only but are written to by
// relocations. On GNU systems, we can arrange for the dynamic linker to
// mprotect sections after relocations are applied by giving them write
// permissions in the object file and calling them ".data.rel.ro.FOO". We
// divide the .rodata section between actual .rodata and .data.rel.ro.rodata,
// but for the other sections that this applies to, we just write a read-only
// .FOO section or a read-write .data.rel.ro.FOO section depending on the
// situation.
// TODO(mwhudson): It would make sense to do this more widely, but it makes
// the system linker segfault on darwin.
2017-10-04 17:54:04 -04:00
addrelrosection := func ( suffix string ) * sym . Section {
2017-09-30 21:10:49 +00:00
return addsection ( ctxt . Arch , segro , suffix , 04 )
2016-09-05 23:29:16 -04:00
}
2015-05-21 13:07:19 +12:00
2017-10-05 10:20:17 -04:00
if ctxt . UseRelro ( ) {
2019-11-25 22:17:29 -08:00
segrelro := & Segrelrodata
if ctxt . LinkMode == LinkExternal && ctxt . HeadType != objabi . Haix {
// Using a separate segment with an external
// linker results in some programs moving
// their data sections unexpectedly, which
// corrupts the moduledata. So we use the
// rodata segment and let the external linker
// sort out a rel.ro segment.
segrelro = segro
} else {
// Reset datsize for new segment.
datsize = 0
}
2017-10-04 17:54:04 -04:00
addrelrosection = func ( suffix string ) * sym . Section {
2019-11-25 22:17:29 -08:00
return addsection ( ctxt . Arch , segrelro , ".data.rel.ro" + suffix , 06 )
2016-09-05 23:29:16 -04:00
}
2019-11-25 22:17:29 -08:00
2015-05-21 13:07:19 +12:00
/* data only written by relocations */
2016-09-05 23:29:16 -04:00
sect = addrelrosection ( "" )
2015-05-21 13:07:19 +12:00
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.types" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.etypes" , 0 ) . Sect = sect
2019-02-20 16:20:56 +01:00
2017-10-04 17:54:04 -04:00
for _ , symnro := range sym . ReadOnly {
symn := sym . RelROMap [ symnro ]
2016-04-19 08:59:56 -04:00
align := dataMaxAlign [ symn ]
2016-04-18 14:50:14 -04:00
if sect . Align < align {
sect . Align = align
}
}
datsize = Rnd ( datsize , int64 ( sect . Align ) )
2019-11-25 22:17:29 -08:00
sect . Vaddr = uint64 ( datsize )
for i , symnro := range sym . ReadOnly {
if i == 0 && symnro == sym . STYPE && ctxt . HeadType != objabi . Haix {
// Skip forward so that no type
// reference uses a zero offset.
// This is unlikely but possible in small
// programs with no other read-only data.
datsize ++
}
2017-10-04 17:54:04 -04:00
symn := sym . RelROMap [ symnro ]
2019-02-20 15:54:11 +01:00
symnStartValue := datsize
2016-04-18 14:50:14 -04:00
for _ , s := range data [ symn ] {
datsize = aligndatsize ( datsize , s )
if s . Outer != nil && s . Outer . Sect != nil && s . Outer . Sect != sect {
2016-09-17 09:39:33 -04:00
Errorf ( s , "s.Outer (%s) in different section from s, %s != %s" , s . Outer . Name , s . Outer . Sect . Name , sect . Name )
2016-04-18 14:50:14 -04:00
}
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-04-18 14:50:14 -04:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-05-21 13:07:19 +12:00
}
2016-08-19 22:40:38 -04:00
checkdatsize ( ctxt , datsize , symn )
2019-02-20 15:54:11 +01:00
if ctxt . HeadType == objabi . Haix {
// Read-only symbols might be wrapped inside their outer
// symbol.
// XCOFF symbol table needs to know the size of
// these outer symbols.
xcoffUpdateOuterSize ( ctxt , datsize - symnStartValue , symn )
}
2015-05-21 13:07:19 +12:00
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
2015-02-27 22:57:28 -05:00
/* typelink */
2016-09-05 23:29:16 -04:00
sect = addrelrosection ( ".typelink" )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . STYPELINK ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
2016-10-19 07:33:16 +03:00
typelink := ctxt . Syms . Lookup ( "runtime.typelink" , 0 )
typelink . Sect = sect
2017-10-04 17:54:04 -04:00
typelink . Type = sym . SRODATA
2016-10-19 07:33:16 +03:00
datsize += typelink . Size
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . STYPELINK )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
2016-03-17 07:00:33 -07:00
/* itablink */
2016-09-05 23:29:16 -04:00
sect = addrelrosection ( ".itablink" )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SITABLINK ]
2016-03-17 07:00:33 -07:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.itablink" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.eitablink" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SITABLINK ] {
2016-03-17 07:00:33 -07:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-03-17 07:00:33 -07:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2016-03-17 07:00:33 -07:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SITABLINK )
2016-03-17 07:00:33 -07:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
2019-02-20 15:54:11 +01:00
if ctxt . HeadType == objabi . Haix {
// Store .itablink size because its symbols are wrapped
// under an outer symbol: runtime.itablink.
xcoffUpdateOuterSize ( ctxt , int64 ( sect . Length ) , sym . SITABLINK )
}
2016-03-17 07:00:33 -07:00
2015-02-27 22:57:28 -05:00
/* gosymtab */
2016-09-05 23:29:16 -04:00
sect = addrelrosection ( ".gosymtab" )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SSYMTAB ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.symtab" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.esymtab" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SSYMTAB ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2015-02-27 22:57:28 -05:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SSYMTAB )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
/* gopclntab */
2016-09-05 23:29:16 -04:00
sect = addrelrosection ( ".gopclntab" )
2017-10-04 17:54:04 -04:00
sect . Align = dataMaxAlign [ sym . SPCLNTAB ]
2015-02-27 22:57:28 -05:00
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
ctxt . Syms . Lookup ( "runtime.pclntab" , 0 ) . Sect = sect
ctxt . Syms . Lookup ( "runtime.epclntab" , 0 ) . Sect = sect
2017-10-04 17:54:04 -04:00
for _ , s := range data [ sym . SPCLNTAB ] {
2015-02-27 22:57:28 -05:00
datsize = aligndatsize ( datsize , s )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2015-02-27 22:57:28 -05:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SRODATA )
2015-02-27 22:57:28 -05:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
// 6g uses 4-byte relocation offsets, so the entire segment must fit in 32 bits.
if datsize != int64 ( uint32 ( datsize ) ) {
2016-09-17 09:39:33 -04:00
Errorf ( nil , "read-only data segment too large: %d" , datsize )
2015-02-27 22:57:28 -05:00
}
2017-10-04 17:54:04 -04:00
for symn := sym . SELFRXSECT ; symn < sym . SXREF ; symn ++ {
2020-03-09 11:08:04 -04:00
ctxt . datap = append ( ctxt . datap , data [ symn ] ... )
2016-04-18 14:50:14 -04:00
}
2018-04-24 13:05:10 +02:00
dwarfGenerateDebugSyms ( ctxt )
2016-03-14 09:23:04 -07:00
2016-04-22 10:31:14 +12:00
var i int
2017-05-22 20:17:31 -04:00
for ; i < len ( dwarfp ) ; i ++ {
s := dwarfp [ i ]
2017-10-04 17:54:04 -04:00
if s . Type != sym . SDWARFSECT {
2016-04-22 10:31:14 +12:00
break
}
2017-05-02 16:46:01 +02:00
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segdwarf , s . Name , 04 )
2020-03-24 14:35:45 -04:00
sect . Sym = s
2016-03-14 09:23:04 -07:00
sect . Align = 1
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-03-14 09:23:04 -07:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2016-04-19 08:59:56 -04:00
datsize += s . Size
2016-03-14 09:23:04 -07:00
sect . Length = uint64 ( datsize ) - sect . Vaddr
}
2017-10-04 17:54:04 -04:00
checkdatsize ( ctxt , datsize , sym . SDWARFSECT )
2016-03-14 09:23:04 -07:00
2017-05-22 20:17:31 -04:00
for i < len ( dwarfp ) {
curType := dwarfp [ i ] . Type
2017-10-04 17:54:04 -04:00
var sect * sym . Section
2020-03-24 14:35:45 -04:00
var sectname string
2017-05-22 20:17:31 -04:00
switch curType {
2017-10-04 17:54:04 -04:00
case sym . SDWARFINFO :
2020-03-24 14:35:45 -04:00
sectname = ".debug_info"
2017-10-04 17:54:04 -04:00
case sym . SDWARFRANGE :
2020-03-24 14:35:45 -04:00
sectname = ".debug_ranges"
2017-10-04 17:54:04 -04:00
case sym . SDWARFLOC :
2020-03-24 14:35:45 -04:00
sectname = ".debug_loc"
2017-05-22 20:17:31 -04:00
default :
2019-08-09 11:36:03 -04:00
// Error is unrecoverable, so panic.
panic ( fmt . Sprintf ( "unknown DWARF section %v" , curType ) )
2017-05-22 20:17:31 -04:00
}
2020-03-24 14:35:45 -04:00
sect = addsection ( ctxt . Arch , & Segdwarf , sectname , 04 )
sect . Sym = ctxt . Syms . ROLookup ( sectname , 0 )
2016-03-14 09:23:04 -07:00
sect . Align = 1
datsize = Rnd ( datsize , int64 ( sect . Align ) )
sect . Vaddr = uint64 ( datsize )
2017-05-22 20:17:31 -04:00
for ; i < len ( dwarfp ) ; i ++ {
s := dwarfp [ i ]
if s . Type != curType {
2016-04-22 10:31:14 +12:00
break
}
2016-03-14 09:23:04 -07:00
s . Sect = sect
2017-10-04 17:54:04 -04:00
s . Type = sym . SRODATA
2016-03-14 09:23:04 -07:00
s . Value = int64 ( uint64 ( datsize ) - sect . Vaddr )
2017-10-04 17:54:04 -04:00
s . Attr |= sym . AttrLocal
2016-04-19 08:59:56 -04:00
datsize += s . Size
2019-02-20 16:29:00 +01:00
if ctxt . HeadType == objabi . Haix && curType == sym . SDWARFLOC {
// Update the size of .debug_loc for this symbol's
// package.
addDwsectCUSize ( ".debug_loc" , s . File , uint64 ( s . Size ) )
}
2016-03-14 09:23:04 -07:00
}
sect . Length = uint64 ( datsize ) - sect . Vaddr
2017-05-22 20:17:31 -04:00
checkdatsize ( ctxt , datsize , curType )
2016-03-14 09:23:04 -07:00
}
2015-02-27 22:57:28 -05:00
/* number the sections */
2015-03-02 12:35:15 -05:00
n := int32 ( 1 )
2015-02-27 22:57:28 -05:00
2017-04-18 21:52:06 +12:00
for _ , sect := range Segtext . Sections {
2015-02-27 22:57:28 -05:00
sect . Extnum = int16 ( n )
n ++
}
2017-04-18 21:52:06 +12:00
for _ , sect := range Segrodata . Sections {
2015-02-27 22:57:28 -05:00
sect . Extnum = int16 ( n )
n ++
}
2017-04-18 21:52:06 +12:00
for _ , sect := range Segrelrodata . Sections {
2016-09-05 23:29:16 -04:00
sect . Extnum = int16 ( n )
n ++
}
2017-04-18 21:52:06 +12:00
for _ , sect := range Segdata . Sections {
2015-02-27 22:57:28 -05:00
sect . Extnum = int16 ( n )
n ++
}
2017-04-18 21:52:06 +12:00
for _ , sect := range Segdwarf . Sections {
2016-03-14 09:23:04 -07:00
sect . Extnum = int16 ( n )
n ++
}
2015-02-27 22:57:28 -05:00
}
2015-06-04 15:15:48 -04:00
2017-10-04 17:54:04 -04:00
func dodataSect ( ctxt * Link , symn sym . SymKind , syms [ ] * sym . Symbol ) ( result [ ] * sym . Symbol , maxAlign int32 ) {
2017-10-07 13:49:44 -04:00
if ctxt . HeadType == objabi . Hdarwin {
2016-04-18 14:50:14 -04:00
// Some symbols may no longer belong in syms
// due to movement in machosymorder.
2017-10-04 17:54:04 -04:00
newSyms := make ( [ ] * sym . Symbol , 0 , len ( syms ) )
2016-04-18 14:50:14 -04:00
for _ , s := range syms {
2016-09-07 14:45:27 -04:00
if s . Type == symn {
2016-04-18 14:50:14 -04:00
newSyms = append ( newSyms , s )
}
}
syms = newSyms
}
2017-10-04 17:54:04 -04:00
var head , tail * sym . Symbol
2016-09-19 14:13:07 -04:00
symsSort := make ( [ ] dataSortKey , 0 , len ( syms ) )
for _ , s := range syms {
2016-04-18 14:50:14 -04:00
if s . Attr . OnList ( ) {
log . Fatalf ( "symbol %s listed multiple times" , s . Name )
}
2017-10-04 17:54:04 -04:00
s . Attr |= sym . AttrOnList
2016-04-19 08:59:56 -04:00
switch {
case s . Size < int64 ( len ( s . P ) ) :
2016-09-17 09:39:33 -04:00
Errorf ( s , "initialize bounds (%d < %d)" , s . Size , len ( s . P ) )
2016-04-19 08:59:56 -04:00
case s . Size < 0 :
2016-09-17 09:39:33 -04:00
Errorf ( s , "negative size (%d bytes)" , s . Size )
2016-04-19 08:59:56 -04:00
case s . Size > cutoff :
2016-09-17 09:39:33 -04:00
Errorf ( s , "symbol too large (%d bytes)" , s . Size )
2016-04-18 14:50:14 -04:00
}
2016-09-19 14:13:07 -04:00
// If the usually-special section-marker symbols are being laid
// out as regular symbols, put them either at the beginning or
// end of their section.
2019-02-20 16:20:56 +01:00
if ( ctxt . DynlinkingGo ( ) && ctxt . HeadType == objabi . Hdarwin ) || ( ctxt . HeadType == objabi . Haix && ctxt . LinkMode == LinkExternal ) {
2016-09-19 14:13:07 -04:00
switch s . Name {
2019-02-20 16:20:56 +01:00
case "runtime.text" , "runtime.bss" , "runtime.data" , "runtime.types" , "runtime.rodata" :
2016-09-19 14:13:07 -04:00
head = s
continue
2019-02-20 16:20:56 +01:00
case "runtime.etext" , "runtime.ebss" , "runtime.edata" , "runtime.etypes" , "runtime.erodata" :
2016-09-19 14:13:07 -04:00
tail = s
continue
}
}
key := dataSortKey {
2016-04-18 14:50:14 -04:00
size : s . Size ,
name : s . Name ,
2016-08-22 10:27:20 +12:00
sym : s ,
2016-04-18 14:50:14 -04:00
}
switch s . Type {
2017-10-04 17:54:04 -04:00
case sym . SELFGOT :
2016-04-18 14:50:14 -04:00
// For ppc64, we want to interleave the .got and .toc sections
2017-10-04 17:54:04 -04:00
// from input files. Both are type sym.SELFGOT, so in that case
2016-04-18 14:50:14 -04:00
// we skip size comparison and fall through to the name
// comparison (conveniently, .got sorts before .toc).
2016-09-19 14:13:07 -04:00
key . size = 0
2016-04-18 14:50:14 -04:00
}
2016-09-19 14:13:07 -04:00
symsSort = append ( symsSort , key )
2016-04-18 14:50:14 -04:00
}
sort . Sort ( bySizeAndName ( symsSort ) )
2016-09-19 14:13:07 -04:00
off := 0
if head != nil {
syms [ 0 ] = head
off ++
}
2016-04-18 14:50:14 -04:00
for i , symSort := range symsSort {
2016-09-19 14:13:07 -04:00
syms [ i + off ] = symSort . sym
2016-08-22 10:27:20 +12:00
align := symalign ( symSort . sym )
2016-04-19 08:59:56 -04:00
if maxAlign < align {
maxAlign = align
}
2016-04-18 14:50:14 -04:00
}
2016-09-19 14:13:07 -04:00
if tail != nil {
syms [ len ( syms ) - 1 ] = tail
}
2016-04-18 14:50:14 -04:00
2017-10-07 13:43:38 -04:00
if ctxt . IsELF && symn == sym . SELFROSECT {
2016-04-18 14:50:14 -04:00
// Make .rela and .rela.plt contiguous, the ELF ABI requires this
// and Solaris actually cares.
reli , plti := - 1 , - 1
for i , s := range syms {
switch s . Name {
case ".rel.plt" , ".rela.plt" :
plti = i
case ".rel" , ".rela" :
reli = i
}
}
if reli >= 0 && plti >= 0 && plti != reli + 1 {
2016-04-20 19:10:20 -04:00
var first , second int
if plti > reli {
first , second = reli , plti
} else {
first , second = plti , reli
2016-04-18 14:50:14 -04:00
}
2016-04-20 19:10:20 -04:00
rel , plt := syms [ reli ] , syms [ plti ]
copy ( syms [ first + 2 : ] , syms [ first + 1 : second ] )
syms [ first + 0 ] = rel
syms [ first + 1 ] = plt
2016-12-01 22:48:52 -08:00
// Make sure alignment doesn't introduce a gap.
// Setting the alignment explicitly prevents
// symalign from basing it on the size and
// getting it wrong.
2017-09-30 21:10:49 +00:00
rel . Align = int32 ( ctxt . Arch . RegSize )
plt . Align = int32 ( ctxt . Arch . RegSize )
2016-04-18 14:50:14 -04:00
}
}
2016-04-19 08:59:56 -04:00
return syms , maxAlign
2016-04-18 14:50:14 -04:00
}
2015-06-04 15:15:48 -04:00
// Add buildid to beginning of text segment, on non-ELF systems.
// Non-ELF binary formats are not always flexible enough to
// give us a place to put the Go build ID. On those systems, we put it
// at the very beginning of the text segment.
// This ``header'' is read by cmd/go.
2016-08-19 22:40:38 -04:00
func ( ctxt * Link ) textbuildid ( ) {
2017-10-07 13:43:38 -04:00
if ctxt . IsELF || ctxt . BuildMode == BuildModePlugin || * flagBuildid == "" {
2015-06-04 15:15:48 -04:00
return
}
2020-03-26 12:54:21 -04:00
ldr := ctxt . loader
s := ldr . CreateSymForUpdate ( "go.buildid" , 0 )
s . SetReachable ( true )
2015-06-04 15:15:48 -04:00
// The \xff is invalid UTF-8, meant to make it less likely
// to find one of these accidentally.
2016-08-21 18:34:24 -04:00
data := "\xff Go build ID: " + strconv . Quote ( * flagBuildid ) + "\n \xff"
2020-03-26 12:54:21 -04:00
s . SetType ( sym . STEXT )
s . SetData ( [ ] byte ( data ) )
s . SetSize ( int64 ( len ( data ) ) )
2015-06-04 15:15:48 -04:00
2020-03-26 12:54:21 -04:00
ctxt . Textp2 = append ( ctxt . Textp2 , 0 )
copy ( ctxt . Textp2 [ 1 : ] , ctxt . Textp2 )
ctxt . Textp2 [ 0 ] = s . Sym ( )
2015-06-04 15:15:48 -04:00
}
2015-02-27 22:57:28 -05:00
2019-04-22 23:02:37 -04:00
func ( ctxt * Link ) buildinfo ( ) {
if ctxt . linkShared || ctxt . BuildMode == BuildModePlugin {
// -linkshared and -buildmode=plugin get confused
// about the relocations in go.buildinfo
// pointing at the other data sections.
// The version information is only available in executables.
return
}
s := ctxt . Syms . Lookup ( ".go.buildinfo" , 0 )
s . Attr |= sym . AttrReachable
s . Type = sym . SBUILDINFO
s . Align = 16
// The \xff is invalid UTF-8, meant to make it less likely
// to find one of these accidentally.
const prefix = "\xff Go buildinf:" // 14 bytes, plus 2 data bytes filled in below
data := make ( [ ] byte , 32 )
copy ( data , prefix )
data [ len ( prefix ) ] = byte ( ctxt . Arch . PtrSize )
data [ len ( prefix ) + 1 ] = 0
if ctxt . Arch . ByteOrder == binary . BigEndian {
data [ len ( prefix ) + 1 ] = 1
}
s . P = data
s . Size = int64 ( len ( s . P ) )
s1 := ctxt . Syms . Lookup ( "runtime.buildVersion" , 0 )
s2 := ctxt . Syms . Lookup ( "runtime.modinfo" , 0 )
s . R = [ ] sym . Reloc {
{ Off : 16 , Siz : uint8 ( ctxt . Arch . PtrSize ) , Type : objabi . R_ADDR , Sym : s1 } ,
{ Off : 16 + int32 ( ctxt . Arch . PtrSize ) , Siz : uint8 ( ctxt . Arch . PtrSize ) , Type : objabi . R_ADDR , Sym : s2 } ,
}
}
2015-02-27 22:57:28 -05:00
// assign addresses to text
2016-08-19 22:40:38 -04:00
func ( ctxt * Link ) textaddress ( ) {
2017-09-30 21:10:49 +00:00
addsection ( ctxt . Arch , & Segtext , ".text" , 05 )
2015-02-27 22:57:28 -05:00
// Assign PCs in text segment.
// Could parallelize, by assigning to text
// and then letting threads copy down, but probably not worth it.
2017-04-18 21:52:06 +12:00
sect := Segtext . Sections [ 0 ]
2015-02-27 22:57:28 -05:00
sect . Align = int32 ( Funcalign )
2016-09-19 14:13:07 -04:00
text := ctxt . Syms . Lookup ( "runtime.text" , 0 )
text . Sect = sect
2019-02-20 16:20:56 +01:00
if ctxt . HeadType == objabi . Haix && ctxt . LinkMode == LinkExternal {
// Setting runtime.text has a real symbol prevents ld to
// change its base address resulting in wrong offsets for
// reflect methods.
text . Align = sect . Align
text . Size = 0x8
}
2016-09-19 14:13:07 -04:00
2019-02-20 16:20:56 +01:00
if ( ctxt . DynlinkingGo ( ) && ctxt . HeadType == objabi . Hdarwin ) || ( ctxt . HeadType == objabi . Haix && ctxt . LinkMode == LinkExternal ) {
2016-09-19 14:13:07 -04:00
etext := ctxt . Syms . Lookup ( "runtime.etext" , 0 )
etext . Sect = sect
ctxt . Textp = append ( ctxt . Textp , etext , nil )
copy ( ctxt . Textp [ 1 : ] , ctxt . Textp )
ctxt . Textp [ 0 ] = text
}
2016-08-21 18:34:24 -04:00
va := uint64 ( * FlagTextAddr )
2016-08-25 11:07:33 -05:00
n := 1
2015-02-27 22:57:28 -05:00
sect . Vaddr = va
2016-09-14 14:47:12 -04:00
ntramps := 0
2017-10-04 17:54:04 -04:00
for _ , s := range ctxt . Textp {
sect , n , va = assignAddress ( ctxt , sect , n , s , va , false )
2016-09-14 14:47:12 -04:00
2017-10-04 17:54:04 -04:00
trampoline ( ctxt , s ) // resolve jumps, may add trampolines if jump too far
2016-09-14 14:47:12 -04:00
// lay down trampolines after each function
for ; ntramps < len ( ctxt . tramps ) ; ntramps ++ {
tramp := ctxt . tramps [ ntramps ]
2019-02-20 16:48:43 +01:00
if ctxt . HeadType == objabi . Haix && strings . HasPrefix ( tramp . Name , "runtime.text." ) {
// Already set in assignAddress
continue
}
2017-06-08 08:26:19 -04:00
sect , n , va = assignAddress ( ctxt , sect , n , tramp , va , true )
2015-02-27 22:57:28 -05:00
}
2016-09-14 14:47:12 -04:00
}
sect . Length = va - sect . Vaddr
ctxt . Syms . Lookup ( "runtime.etext" , 0 ) . Sect = sect
// merge tramps into Textp, keeping Textp in address order
if ntramps != 0 {
2017-10-04 17:54:04 -04:00
newtextp := make ( [ ] * sym . Symbol , 0 , len ( ctxt . Textp ) + ntramps )
2016-09-14 14:47:12 -04:00
i := 0
2017-10-04 17:54:04 -04:00
for _ , s := range ctxt . Textp {
for ; i < ntramps && ctxt . tramps [ i ] . Value < s . Value ; i ++ {
2016-09-14 14:47:12 -04:00
newtextp = append ( newtextp , ctxt . tramps [ i ] )
}
2017-10-04 17:54:04 -04:00
newtextp = append ( newtextp , s )
2015-02-27 22:57:28 -05:00
}
2016-09-14 14:47:12 -04:00
newtextp = append ( newtextp , ctxt . tramps [ i : ntramps ] ... )
ctxt . Textp = newtextp
}
}
// assigns address for a text symbol, returns (possibly new) section, its number, and the address
// Note: once we have trampoline insertion support for external linking, this function
// will not need to create new text sections, and so no need to return sect and n.
2017-10-04 17:54:04 -04:00
func assignAddress ( ctxt * Link , sect * sym . Section , n int , s * sym . Symbol , va uint64 , isTramp bool ) ( * sym . Section , int , uint64 ) {
2018-03-04 12:59:15 +01:00
if thearch . AssignAddress != nil {
return thearch . AssignAddress ( ctxt , sect , n , s , va , isTramp )
}
2017-10-04 17:54:04 -04:00
s . Sect = sect
2017-04-28 13:01:03 +12:00
if s . Attr . SubSymbol ( ) {
2016-09-14 14:47:12 -04:00
return sect , n , va
}
2017-10-04 17:54:04 -04:00
if s . Align != 0 {
va = uint64 ( Rnd ( int64 ( va ) , int64 ( s . Align ) ) )
2016-09-14 14:47:12 -04:00
} else {
va = uint64 ( Rnd ( int64 ( va ) , int64 ( Funcalign ) ) )
}
funcsize := uint64 ( MINFUNC ) // spacing required for findfunctab
2017-10-04 17:54:04 -04:00
if s . Size > MINFUNC {
funcsize = uint64 ( s . Size )
2016-09-14 14:47:12 -04:00
}
2016-08-25 11:07:33 -05:00
2016-09-14 14:47:12 -04:00
// On ppc64x a text section should not be larger than 2^26 bytes due to the size of
// call target offset field in the bl instruction. Splitting into smaller text
// sections smaller than this limit allows the GNU linker to modify the long calls
// appropriately. The limit allows for the space needed for tables inserted by the linker.
2016-08-25 11:07:33 -05:00
2016-09-14 14:47:12 -04:00
// If this function doesn't fit in the current text section, then create a new one.
2016-08-25 11:07:33 -05:00
2016-09-14 14:47:12 -04:00
// Only break at outermost syms.
2016-08-25 11:07:33 -05:00
2019-02-20 16:48:43 +01:00
if ctxt . Arch . InFamily ( sys . PPC64 ) && s . Outer == nil && ctxt . LinkMode == LinkExternal && va - sect . Vaddr + funcsize + maxSizeTrampolinesPPC64 ( s , isTramp ) > 0x1c00000 {
2016-09-14 14:47:12 -04:00
// Set the length for the previous text section
sect . Length = va - sect . Vaddr
2016-08-25 11:07:33 -05:00
2016-09-14 14:47:12 -04:00
// Create new section, set the starting Vaddr
2017-09-30 21:10:49 +00:00
sect = addsection ( ctxt . Arch , & Segtext , ".text" , 05 )
2016-09-14 14:47:12 -04:00
sect . Vaddr = va
2017-10-04 17:54:04 -04:00
s . Sect = sect
2016-08-25 11:07:33 -05:00
2016-09-14 14:47:12 -04:00
// Create a symbol for the start of the secondary text sections
2019-02-20 16:48:43 +01:00
ntext := ctxt . Syms . Lookup ( fmt . Sprintf ( "runtime.text.%d" , n ) , 0 )
ntext . Sect = sect
if ctxt . HeadType == objabi . Haix {
// runtime.text.X must be a real symbol on AIX.
// Assign its address directly in order to be the
// first symbol of this new section.
ntext . Type = sym . STEXT
ntext . Size = int64 ( MINFUNC )
ntext . Attr |= sym . AttrReachable
ntext . Attr |= sym . AttrOnList
ctxt . tramps = append ( ctxt . tramps , ntext )
ntext . Value = int64 ( va )
va += uint64 ( ntext . Size )
if s . Align != 0 {
va = uint64 ( Rnd ( int64 ( va ) , int64 ( s . Align ) ) )
} else {
va = uint64 ( Rnd ( int64 ( va ) , int64 ( Funcalign ) ) )
}
}
2016-09-14 14:47:12 -04:00
n ++
2015-02-27 22:57:28 -05:00
}
2019-02-20 16:48:43 +01:00
s . Value = 0
for sub := s ; sub != nil ; sub = sub . Sub {
sub . Value += int64 ( va )
}
2016-09-14 14:47:12 -04:00
va += funcsize
2015-02-27 22:57:28 -05:00
2016-09-14 14:47:12 -04:00
return sect , n , va
2015-02-27 22:57:28 -05:00
}
2018-05-04 14:55:31 -04:00
// address assigns virtual addresses to all segments and sections and
// returns all segments in file order.
func ( ctxt * Link ) address ( ) [ ] * sym . Segment {
var order [ ] * sym . Segment // Layout order
2016-08-21 18:34:24 -04:00
va := uint64 ( * FlagTextAddr )
2018-05-04 14:55:31 -04:00
order = append ( order , & Segtext )
2015-02-27 22:57:28 -05:00
Segtext . Rwx = 05
Segtext . Vaddr = va
2017-04-18 21:52:06 +12:00
for _ , s := range Segtext . Sections {
2015-02-27 22:57:28 -05:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( s . Align ) ) )
s . Vaddr = va
va += s . Length
}
2016-08-21 18:34:24 -04:00
Segtext . Length = va - uint64 ( * FlagTextAddr )
2015-02-27 22:57:28 -05:00
2017-04-18 21:52:06 +12:00
if len ( Segrodata . Sections ) > 0 {
2015-02-27 22:57:28 -05:00
// align to page boundary so as not to mix
// rodata and executable text.
2016-09-05 23:29:16 -04:00
//
// Note: gold or GNU ld will reduce the size of the executable
// file by arranging for the relro segment to end at a page
// boundary, and overlap the end of the text segment with the
// start of the relro segment in the file. The PT_LOAD segments
// will be such that the last page of the text segment will be
// mapped twice, once r-x and once starting out rw- and, after
// relocation processing, changed to r--.
//
// Ideally the last page of the text segment would not be
// writable even for this short period.
2016-08-21 18:34:24 -04:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( * FlagRound ) ) )
2015-02-27 22:57:28 -05:00
2018-05-04 14:55:31 -04:00
order = append ( order , & Segrodata )
2015-02-27 22:57:28 -05:00
Segrodata . Rwx = 04
Segrodata . Vaddr = va
2017-04-18 21:52:06 +12:00
for _ , s := range Segrodata . Sections {
2015-02-27 22:57:28 -05:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( s . Align ) ) )
s . Vaddr = va
va += s . Length
}
Segrodata . Length = va - Segrodata . Vaddr
}
2017-04-18 21:52:06 +12:00
if len ( Segrelrodata . Sections ) > 0 {
2016-09-05 23:29:16 -04:00
// align to page boundary so as not to mix
// rodata, rel-ro data, and executable text.
va = uint64 ( Rnd ( int64 ( va ) , int64 ( * FlagRound ) ) )
2019-02-20 16:16:38 +01:00
if ctxt . HeadType == objabi . Haix {
// Relro data are inside data segment on AIX.
va += uint64 ( XCOFFDATABASE ) - uint64 ( XCOFFTEXTBASE )
}
2016-09-05 23:29:16 -04:00
2018-05-04 14:55:31 -04:00
order = append ( order , & Segrelrodata )
2016-09-05 23:29:16 -04:00
Segrelrodata . Rwx = 06
Segrelrodata . Vaddr = va
2017-04-18 21:52:06 +12:00
for _ , s := range Segrelrodata . Sections {
2016-09-05 23:29:16 -04:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( s . Align ) ) )
s . Vaddr = va
va += s . Length
}
Segrelrodata . Length = va - Segrelrodata . Vaddr
}
2015-02-27 22:57:28 -05:00
2016-08-21 18:34:24 -04:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( * FlagRound ) ) )
2019-02-20 16:16:38 +01:00
if ctxt . HeadType == objabi . Haix && len ( Segrelrodata . Sections ) == 0 {
2018-11-23 15:12:04 +01:00
// Data sections are moved to an unreachable segment
// to ensure that they are position-independent.
2019-02-20 16:16:38 +01:00
// Already done if relro sections exist.
2018-11-23 15:12:04 +01:00
va += uint64 ( XCOFFDATABASE ) - uint64 ( XCOFFTEXTBASE )
}
2018-05-04 14:55:31 -04:00
order = append ( order , & Segdata )
2015-02-27 22:57:28 -05:00
Segdata . Rwx = 06
Segdata . Vaddr = va
2017-10-04 17:54:04 -04:00
var data * sym . Section
var noptr * sym . Section
var bss * sym . Section
var noptrbss * sym . Section
2017-04-18 21:52:06 +12:00
for i , s := range Segdata . Sections {
2019-02-20 16:16:38 +01:00
if ( ctxt . IsELF || ctxt . HeadType == objabi . Haix ) && s . Name == ".tbss" {
2015-08-11 12:29:00 +12:00
continue
}
2017-10-01 14:39:04 +11:00
vlen := int64 ( s . Length )
2019-02-20 16:16:38 +01:00
if i + 1 < len ( Segdata . Sections ) && ! ( ( ctxt . IsELF || ctxt . HeadType == objabi . Haix ) && Segdata . Sections [ i + 1 ] . Name == ".tbss" ) {
2017-04-18 21:52:06 +12:00
vlen = int64 ( Segdata . Sections [ i + 1 ] . Vaddr - s . Vaddr )
2015-02-27 22:57:28 -05:00
}
s . Vaddr = va
va += uint64 ( vlen )
Segdata . Length = va - Segdata . Vaddr
if s . Name == ".data" {
data = s
}
if s . Name == ".noptrdata" {
noptr = s
}
if s . Name == ".bss" {
bss = s
}
if s . Name == ".noptrbss" {
noptrbss = s
}
}
2018-05-04 14:55:31 -04:00
// Assign Segdata's Filelen omitting the BSS. We do this here
// simply because right now we know where the BSS starts.
2015-02-27 22:57:28 -05:00
Segdata . Filelen = bss . Vaddr - Segdata . Vaddr
2016-08-21 18:34:24 -04:00
va = uint64 ( Rnd ( int64 ( va ) , int64 ( * FlagRound ) ) )
2018-05-04 14:55:31 -04:00
order = append ( order , & Segdwarf )
2016-03-14 09:23:04 -07:00
Segdwarf . Rwx = 06
Segdwarf . Vaddr = va
2017-04-18 21:52:06 +12:00
for i , s := range Segdwarf . Sections {
2017-10-01 14:39:04 +11:00
vlen := int64 ( s . Length )
2017-04-18 21:52:06 +12:00
if i + 1 < len ( Segdwarf . Sections ) {
vlen = int64 ( Segdwarf . Sections [ i + 1 ] . Vaddr - s . Vaddr )
2016-03-14 09:23:04 -07:00
}
s . Vaddr = va
va += uint64 ( vlen )
2017-10-07 13:49:44 -04:00
if ctxt . HeadType == objabi . Hwindows {
2016-03-14 09:23:04 -07:00
va = uint64 ( Rnd ( int64 ( va ) , PEFILEALIGN ) )
}
Segdwarf . Length = va - Segdwarf . Vaddr
}
2016-09-05 23:29:16 -04:00
var (
2017-04-18 21:52:06 +12:00
text = Segtext . Sections [ 0 ]
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
rodata = ctxt . Syms . Lookup ( "runtime.rodata" , 0 ) . Sect
itablink = ctxt . Syms . Lookup ( "runtime.itablink" , 0 ) . Sect
symtab = ctxt . Syms . Lookup ( "runtime.symtab" , 0 ) . Sect
pclntab = ctxt . Syms . Lookup ( "runtime.pclntab" , 0 ) . Sect
types = ctxt . Syms . Lookup ( "runtime.types" , 0 ) . Sect
2016-09-05 23:29:16 -04:00
)
2016-08-25 11:07:33 -05:00
lasttext := text
// Could be multiple .text sections
2017-04-18 21:52:06 +12:00
for _ , sect := range Segtext . Sections {
if sect . Name == ".text" {
lasttext = sect
}
2016-08-25 11:07:33 -05:00
}
2015-02-27 22:57:28 -05:00
2020-03-09 11:08:04 -04:00
for _ , s := range ctxt . datap {
2016-04-18 14:50:14 -04:00
if s . Sect != nil {
s . Value += int64 ( s . Sect . Vaddr )
2015-02-27 22:57:28 -05:00
}
2016-04-18 14:50:14 -04:00
for sub := s . Sub ; sub != nil ; sub = sub . Sub {
sub . Value += s . Value
2015-02-27 22:57:28 -05:00
}
}
2016-04-18 14:50:14 -04:00
2017-10-04 17:54:04 -04:00
for _ , s := range dwarfp {
if s . Sect != nil {
s . Value += int64 ( s . Sect . Vaddr )
2016-03-14 09:23:04 -07:00
}
2017-10-04 17:54:04 -04:00
for sub := s . Sub ; sub != nil ; sub = sub . Sub {
sub . Value += s . Value
2016-03-14 09:23:04 -07:00
}
}
2015-02-27 22:57:28 -05:00
2017-10-05 10:20:17 -04:00
if ctxt . BuildMode == BuildModeShared {
cmd/link: use ctxt.{Lookup,ROLookup} in favour of function versions of same
Done with two eg templates:
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linklookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.Lookup(name, v)
}
package p
import (
"cmd/link/internal/ld"
)
func before(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ld.Linkrlookup(ctxt, name, v)
}
func after(ctxt *ld.Link, name string, v int) *ld.Symbol {
return ctxt.Syms.ROLookup(name, v)
}
Change-Id: I00647dbf62294557bd24c29ad1f108fc786335f1
Reviewed-on: https://go-review.googlesource.com/29343
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-09-20 15:06:08 +12:00
s := ctxt . Syms . Lookup ( "go.link.abihashbytes" , 0 )
sectSym := ctxt . Syms . Lookup ( ".note.go.abihash" , 0 )
2015-05-25 13:59:08 +12:00
s . Sect = sectSym . Sect
s . Value = int64 ( sectSym . Sect . Vaddr + 16 )
}
2017-10-04 17:54:04 -04:00
ctxt . xdefine ( "runtime.text" , sym . STEXT , int64 ( text . Vaddr ) )
ctxt . xdefine ( "runtime.etext" , sym . STEXT , int64 ( lasttext . Vaddr + lasttext . Length ) )
2016-08-25 11:07:33 -05:00
// If there are multiple text sections, create runtime.text.n for
// their section Vaddr, using n for index
n := 1
2017-04-18 21:52:06 +12:00
for _ , sect := range Segtext . Sections [ 1 : ] {
2017-10-28 17:35:27 +01:00
if sect . Name != ".text" {
2017-04-18 21:52:06 +12:00
break
}
2017-10-28 17:35:27 +01:00
symname := fmt . Sprintf ( "runtime.text.%d" , n )
2019-02-20 16:48:43 +01:00
if ctxt . HeadType != objabi . Haix || ctxt . LinkMode != LinkExternal {
// Addresses are already set on AIX with external linker
// because these symbols are part of their sections.
ctxt . xdefine ( symname , sym . STEXT , int64 ( sect . Vaddr ) )
}
2017-10-28 17:35:27 +01:00
n ++
2016-08-25 11:07:33 -05:00
}
2017-10-04 17:54:04 -04:00
ctxt . xdefine ( "runtime.rodata" , sym . SRODATA , int64 ( rodata . Vaddr ) )
ctxt . xdefine ( "runtime.erodata" , sym . SRODATA , int64 ( rodata . Vaddr + rodata . Length ) )
ctxt . xdefine ( "runtime.types" , sym . SRODATA , int64 ( types . Vaddr ) )
ctxt . xdefine ( "runtime.etypes" , sym . SRODATA , int64 ( types . Vaddr + types . Length ) )
ctxt . xdefine ( "runtime.itablink" , sym . SRODATA , int64 ( itablink . Vaddr ) )
ctxt . xdefine ( "runtime.eitablink" , sym . SRODATA , int64 ( itablink . Vaddr + itablink . Length ) )
s := ctxt . Syms . Lookup ( "runtime.gcdata" , 0 )
s . Attr |= sym . AttrLocal
ctxt . xdefine ( "runtime.egcdata" , sym . SRODATA , Symaddr ( s ) + s . Size )
ctxt . Syms . Lookup ( "runtime.egcdata" , 0 ) . Sect = s . Sect
s = ctxt . Syms . Lookup ( "runtime.gcbss" , 0 )
s . Attr |= sym . AttrLocal
ctxt . xdefine ( "runtime.egcbss" , sym . SRODATA , Symaddr ( s ) + s . Size )
ctxt . Syms . Lookup ( "runtime.egcbss" , 0 ) . Sect = s . Sect
ctxt . xdefine ( "runtime.symtab" , sym . SRODATA , int64 ( symtab . Vaddr ) )
ctxt . xdefine ( "runtime.esymtab" , sym . SRODATA , int64 ( symtab . Vaddr + symtab . Length ) )
ctxt . xdefine ( "runtime.pclntab" , sym . SRODATA , int64 ( pclntab . Vaddr ) )
ctxt . xdefine ( "runtime.epclntab" , sym . SRODATA , int64 ( pclntab . Vaddr + pclntab . Length ) )
ctxt . xdefine ( "runtime.noptrdata" , sym . SNOPTRDATA , int64 ( noptr . Vaddr ) )
ctxt . xdefine ( "runtime.enoptrdata" , sym . SNOPTRDATA , int64 ( noptr . Vaddr + noptr . Length ) )
ctxt . xdefine ( "runtime.bss" , sym . SBSS , int64 ( bss . Vaddr ) )
ctxt . xdefine ( "runtime.ebss" , sym . SBSS , int64 ( bss . Vaddr + bss . Length ) )
ctxt . xdefine ( "runtime.data" , sym . SDATA , int64 ( data . Vaddr ) )
ctxt . xdefine ( "runtime.edata" , sym . SDATA , int64 ( data . Vaddr + data . Length ) )
ctxt . xdefine ( "runtime.noptrbss" , sym . SNOPTRBSS , int64 ( noptrbss . Vaddr ) )
ctxt . xdefine ( "runtime.enoptrbss" , sym . SNOPTRBSS , int64 ( noptrbss . Vaddr + noptrbss . Length ) )
ctxt . xdefine ( "runtime.end" , sym . SBSS , int64 ( Segdata . Vaddr + Segdata . Length ) )
2018-05-04 14:55:31 -04:00
2020-03-25 17:10:16 -04:00
if ctxt . IsSolaris ( ) {
// On Solaris, in the runtime it sets the external names of the
// end symbols. Unset them and define separate symbols, so we
// keep both.
etext := ctxt . Syms . ROLookup ( "runtime.etext" , 0 )
edata := ctxt . Syms . ROLookup ( "runtime.edata" , 0 )
end := ctxt . Syms . ROLookup ( "runtime.end" , 0 )
etext . SetExtname ( "runtime.etext" )
edata . SetExtname ( "runtime.edata" )
end . SetExtname ( "runtime.end" )
ctxt . xdefine ( "_etext" , etext . Type , etext . Value )
ctxt . xdefine ( "_edata" , edata . Type , edata . Value )
ctxt . xdefine ( "_end" , end . Type , end . Value )
ctxt . Syms . ROLookup ( "_etext" , 0 ) . Sect = etext . Sect
ctxt . Syms . ROLookup ( "_edata" , 0 ) . Sect = edata . Sect
ctxt . Syms . ROLookup ( "_end" , 0 ) . Sect = end . Sect
}
2018-05-04 14:55:31 -04:00
return order
}
// layout assigns file offsets and lengths to the segments in order.
2019-04-03 22:41:48 -04:00
// Returns the file size containing all the segments.
func ( ctxt * Link ) layout ( order [ ] * sym . Segment ) uint64 {
2018-05-04 14:55:31 -04:00
var prev * sym . Segment
for _ , seg := range order {
if prev == nil {
seg . Fileoff = uint64 ( HEADR )
} else {
switch ctxt . HeadType {
default :
// Assuming the previous segment was
// aligned, the following rounding
// should ensure that this segment's
// VA ≡ Fileoff mod FlagRound.
seg . Fileoff = uint64 ( Rnd ( int64 ( prev . Fileoff + prev . Filelen ) , int64 ( * FlagRound ) ) )
if seg . Vaddr % uint64 ( * FlagRound ) != seg . Fileoff % uint64 ( * FlagRound ) {
Exitf ( "bad segment rounding (Vaddr=%#x Fileoff=%#x FlagRound=%#x)" , seg . Vaddr , seg . Fileoff , * FlagRound )
}
case objabi . Hwindows :
seg . Fileoff = prev . Fileoff + uint64 ( Rnd ( int64 ( prev . Filelen ) , PEFILEALIGN ) )
case objabi . Hplan9 :
seg . Fileoff = prev . Fileoff + prev . Filelen
}
}
if seg != & Segdata {
// Link.address already set Segdata.Filelen to
// account for BSS.
seg . Filelen = seg . Length
}
prev = seg
}
2019-04-03 22:41:48 -04:00
return prev . Fileoff + prev . Filelen
2015-02-27 22:57:28 -05:00
}
2016-09-14 14:47:12 -04:00
// add a trampoline with symbol s (to be laid down after the current function)
2017-10-04 17:54:04 -04:00
func ( ctxt * Link ) AddTramp ( s * sym . Symbol ) {
s . Type = sym . STEXT
s . Attr |= sym . AttrReachable
s . Attr |= sym . AttrOnList
2016-09-14 14:47:12 -04:00
ctxt . tramps = append ( ctxt . tramps , s )
if * FlagDebugTramp > 0 && ctxt . Debugvlog > 0 {
ctxt . Logf ( "trampoline %s inserted\n" , s )
}
}
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
// compressSyms compresses syms and returns the contents of the
// compressed section. If the section would get larger, it returns nil.
func compressSyms ( ctxt * Link , syms [ ] * sym . Symbol ) [ ] byte {
var total int64
for _ , sym := range syms {
total += sym . Size
}
var buf bytes . Buffer
buf . Write ( [ ] byte ( "ZLIB" ) )
var sizeBytes [ 8 ] byte
binary . BigEndian . PutUint64 ( sizeBytes [ : ] , uint64 ( total ) )
buf . Write ( sizeBytes [ : ] )
2020-02-24 20:59:02 -05:00
var relocbuf [ ] byte // temporary buffer for applying relocations
2018-07-11 16:13:04 -04:00
// Using zlib.BestSpeed achieves very nearly the same
// compression levels of zlib.DefaultCompression, but takes
// substantially less time. This is important because DWARF
// compression can be a significant fraction of link time.
z , err := zlib . NewWriterLevel ( & buf , zlib . BestSpeed )
if err != nil {
log . Fatalf ( "NewWriterLevel failed: %s" , err )
}
2020-02-24 21:07:26 -05:00
target := & ctxt . Target
2020-03-24 12:55:14 -04:00
ldr := ctxt . loader
2020-02-24 21:07:26 -05:00
reporter := & ctxt . ErrorReporter
archSyms := & ctxt . ArchSyms
2019-04-04 00:29:16 -04:00
for _ , s := range syms {
// s.P may be read-only. Apply relocations in a
2019-04-10 10:18:52 -04:00
// temporary buffer, and immediately write it out.
2019-04-04 00:29:16 -04:00
oldP := s . P
wasReadOnly := s . Attr . ReadOnly ( )
if len ( s . R ) != 0 && wasReadOnly {
2020-02-24 20:59:02 -05:00
relocbuf = append ( relocbuf [ : 0 ] , s . P ... )
s . P = relocbuf
2020-02-24 21:12:49 -05:00
// TODO: This function call needs to be parallelized when the loader wavefront gets here.
2019-04-04 00:29:16 -04:00
s . Attr . Set ( sym . AttrReadOnly , false )
}
2020-03-24 14:35:45 -04:00
relocsym ( target , ldr , reporter , archSyms , s )
2019-04-04 00:29:16 -04:00
if _ , err := z . Write ( s . P ) ; err != nil {
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
log . Fatalf ( "compression failed: %s" , err )
}
2019-04-04 00:29:16 -04:00
for i := s . Size - int64 ( len ( s . P ) ) ; i > 0 ; {
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
b := zeros [ : ]
if i < int64 ( len ( b ) ) {
b = b [ : i ]
}
n , err := z . Write ( b )
if err != nil {
log . Fatalf ( "compression failed: %s" , err )
}
i -= int64 ( n )
}
2019-04-04 00:29:16 -04:00
// Restore s.P if a temporary buffer was used. If compression
// is not beneficial, we'll go back to use the uncompressed
// contents, in which case we still need s.P.
if len ( s . R ) != 0 && wasReadOnly {
s . P = oldP
s . Attr . Set ( sym . AttrReadOnly , wasReadOnly )
for i := range s . R {
s . R [ i ] . Done = false
}
2019-04-10 10:18:52 -04:00
}
cmd/link: compress DWARF sections in ELF binaries
Forked from CL 111895.
The trickiest part of this is that the binary layout code (blk,
elfshbits, and various other things) assumes a constant offset between
symbols' and sections' file locations and their virtual addresses.
Compression, of course, breaks this constant offset. But we need to
assign virtual addresses to everything before compression in order to
resolve relocations before compression. As a result, compression needs
to re-compute the "address" of the DWARF sections and symbols based on
their compressed size. Luckily, these are at the end of the file, so
this doesn't perturb any other sections or symbols. (And there is, of
course, a surprising amount of code that assumes the DWARF segment
comes last, so what's one more place?)
Relevant benchmarks:
name old time/op new time/op delta
StdCmd 10.3s ± 2% 10.8s ± 1% +5.43% (p=0.000 n=30+30)
name old text-bytes new text-bytes delta
HelloSize 746kB ± 0% 746kB ± 0% ~ (all equal)
CmdGoSize 8.41MB ± 0% 8.41MB ± 0% ~ (all equal)
[Geo mean] 2.50MB 2.50MB +0.00%
name old data-bytes new data-bytes delta
HelloSize 10.6kB ± 0% 10.6kB ± 0% ~ (all equal)
CmdGoSize 252kB ± 0% 252kB ± 0% ~ (all equal)
[Geo mean] 51.5kB 51.5kB +0.00%
name old bss-bytes new bss-bytes delta
HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal)
CmdGoSize 145kB ± 0% 145kB ± 0% ~ (all equal)
[Geo mean] 135kB 135kB +0.00%
name old exe-bytes new exe-bytes delta
HelloSize 1.60MB ± 0% 1.05MB ± 0% -34.39% (p=0.000 n=30+30)
CmdGoSize 16.5MB ± 0% 11.3MB ± 0% -31.76% (p=0.000 n=30+30)
[Geo mean] 5.14MB 3.44MB -33.08%
Fixes #11799.
Updates #6853.
Change-Id: I64197afe4c01a237523a943088051ee056331c6f
Reviewed-on: https://go-review.googlesource.com/118276
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-05-05 21:49:40 -04:00
}
if err := z . Close ( ) ; err != nil {
log . Fatalf ( "compression failed: %s" , err )
}
if int64 ( buf . Len ( ) ) >= total {
// Compression didn't save any space.
return nil
}
return buf . Bytes ( )
}