// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Garbage collector (GC).
//
// The GC runs concurrently with mutator threads, is type accurate (aka precise), and allows multiple
// GC threads to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is
// non-generational and non-compacting. Allocation is done using size-segregated, per-P allocation
// areas to minimize fragmentation while eliminating locks in the common case.
//
// The algorithm decomposes into several steps.
// This is a high level description of the algorithm being used. For an overview of GC a good
// place to start is Richard Jones' gchandbook.org.
//
// The algorithm's intellectual heritage includes Dijkstra's on-the-fly algorithm, see
// Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. 1978.
// On-the-fly garbage collection: an exercise in cooperation. Commun. ACM 21, 11 (November 1978),
// 966-975.
// For journal quality proofs that these steps are complete, correct, and terminate see
// Hudson, R., and Moss, J.E.B. Copying Garbage Collection without stopping the world.
// Concurrency and Computation: Practice and Experience 15(3-5), 2003.
//
// 1. GC performs sweep termination.
//
//    a. Stop the world. This causes all Ps to reach a GC safe-point.
//
//    b. Sweep any unswept spans. There will only be unswept spans if
//    this GC cycle was forced before the expected time.
//
// 2. GC performs the mark phase.
//
//    a. Prepare for the mark phase by setting gcphase to _GCmark
//    (from _GCoff), enabling the write barrier, enabling mutator
//    assists, and enqueueing root mark jobs. No objects may be
//    scanned until all Ps have enabled the write barrier, which is
//    accomplished using STW.
//
//    b. Start the world. From this point, GC work is done by mark
//    workers started by the scheduler and by assists performed as
//    part of allocation. The write barrier shades both the
//    overwritten pointer and the new pointer value for any pointer
//    writes (see mbarrier.go for details). Newly allocated objects
//    are immediately marked black.
//
//    c. GC performs root marking jobs. This includes scanning all
//    stacks, shading all globals, and shading any heap pointers in
//    off-heap runtime data structures. Scanning a stack stops a
//    goroutine, shades any pointers found on its stack, and then
//    resumes the goroutine.
//
//    d. GC drains the work queue of grey objects, scanning each grey
//    object to black and shading all pointers found in the object
//    (which in turn may add those pointers to the work queue).
//
//    e. Because GC work is spread across local caches, GC uses a
//    distributed termination algorithm to detect when there are no
//    more root marking jobs or grey objects (see gcMarkDone). At this
//    point, GC transitions to mark termination.
//
// 3. GC performs mark termination.
//
//    a. Stop the world.
//
//    b. Set gcphase to _GCmarktermination, and disable workers and
//    assists.
//
//    c. Perform housekeeping like flushing mcaches.
//
// 4. GC performs the sweep phase.
//
//    a. Prepare for the sweep phase by setting gcphase to _GCoff,
//    setting up sweep state and disabling the write barrier.
//
//    b. Start the world. From this point on, newly allocated objects
//    are white, and allocating sweeps spans before use if necessary.
//
//    c. GC does concurrent sweeping in the background and in response
//    to allocation. See description below.
//
// 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.
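//
// In terms of gcphase, the steps above amount to the following cycle. This
// is only a summary sketch of what the steps already say; the steps
// themselves are authoritative.
//
//	_GCoff              sweeping, write barrier disabled (steps 4, 5, and 1)
//	_GCmark             concurrent marking, write barrier enabled (step 2)
//	_GCmarktermination  world stopped for mark termination (step 3)
//	_GCoff              back to sweeping (step 4)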

// Concurrent sweep.
//
// The sweep phase proceeds concurrently with normal program execution.
// The heap is swept span-by-span both lazily (when a goroutine needs another span)
// and concurrently in a background goroutine (this helps programs that are not CPU bound).
// At the end of STW mark termination all spans are marked as "needs sweeping".
//
// The background sweeper goroutine simply sweeps spans one-by-one.
//
// To avoid requesting more OS memory while there are unswept spans, when a
// goroutine needs another span, it first attempts to reclaim that much memory
// by sweeping. When a goroutine needs to allocate a new small-object span, it
// sweeps small-object spans for the same object size until it frees at least
// one object. When a goroutine needs to allocate a large-object span from the heap,
// it sweeps spans until it frees at least that many pages into the heap. There is
// one case where this may not suffice: if a goroutine sweeps and frees two
// nonadjacent one-page spans to the heap, it will allocate a new two-page
// span, but there can still be other one-page unswept spans which could be
// combined into a two-page span.
//
// It's critical to ensure that no operations proceed on unswept spans (that would corrupt
// mark bits in the GC bitmap). During GC all mcaches are flushed into the central cache,
// so they are empty. When a goroutine grabs a new span into mcache, it sweeps it.
// When a goroutine explicitly frees an object or sets a finalizer, it ensures that
// the span is swept (either by sweeping it, or by waiting for the concurrent sweep to finish).
// The finalizer goroutine is kicked off only when all spans are swept.
// When the next GC starts, it sweeps all not-yet-swept spans (if any).

// GC rate.
// Next GC is after we've allocated an extra amount of memory proportional to
// the amount already in use. The proportion is controlled by the GOGC environment variable
// (100 by default). If GOGC=100 and we're using 4M, we'll GC again when we get to 8M
// (this mark is tracked in the gcController.heapGoal variable). This keeps the GC cost in
// linear proportion to the allocation cost. Adjusting GOGC just changes the linear constant
// (and also the amount of extra memory used).
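//
// As a worked example of the rule above (a sketch of the arithmetic only;
// the pacer's actual computation of the goal may include further
// adjustments), with L bytes in use after a GC the next GC is due at roughly
//
//	goal = L + L*GOGC/100
//
// so for L = 4M:
//
//	GOGC=50   goal = 4M + 2M = 6M
//	GOGC=100  goal = 4M + 4M = 8M
//	GOGC=200  goal = 4M + 8M = 12M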

// Oblets
//
// In order to prevent long pauses while scanning large objects and to
// improve parallelism, the garbage collector breaks up scan jobs for
// objects larger than maxObletBytes into "oblets" of at most
// maxObletBytes. When scanning encounters the beginning of a large
// object, it scans only the first oblet and enqueues the remaining
// oblets as new scan jobs.
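//
// For illustration, assuming maxObletBytes is 128 KB (an assumption made
// here; see the constant's definition for the actual value), a 1 MB object
// is split into
//
//	1 MB / 128 KB = 8 oblets
//
// The scan job that encounters the start of the object scans oblet 0 and
// enqueues oblets 1 through 7 as separate scan jobs, which other workers
// may then pick up in parallel.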

package runtime

import (
	"internal/cpu"
	"runtime/internal/atomic"
	"unsafe"
)

const (
	_DebugGC         = 0
	_ConcurrentSweep = true
	_FinBlockSize    = 4 * 1024

	// debugScanConservative enables debug logging for stack
	// frames that are scanned conservatively.
	debugScanConservative = false

	// sweepMinHeapDistance is a lower bound on the heap distance
	// (in bytes) reserved for concurrent sweeping between GC
	// cycles.
	sweepMinHeapDistance = 1024 * 1024
)

func gcinit() {
	if unsafe.Sizeof(workbuf{}) != _WorkbufSize {
		throw("size of Workbuf is suboptimal")
	}
	// No sweep on the first cycle.
	mheap_.sweepDrained = 1

	// Initialize GC pacer state.
	// Use the environment variable GOGC for the initial gcPercent value.
	gcController.init(readGOGC())

	work.startSema = 1
	work.markDoneSema = 1
	lockInit(&work.sweepWaiters.lock, lockRankSweepWaiters)
	lockInit(&work.assistQueue.lock, lockRankAssistQueue)
	lockInit(&work.wbufSpans.lock, lockRankWbufSpans)
}

// gcenable is called after the bulk of the runtime initialization,
// just before we're about to start letting user code run.
// It kicks off the background sweeper goroutine, the background
// scavenger goroutine, and enables GC.
func gcenable() {
	// Kick off sweeping and scavenging.
	c := make(chan int, 2)
	go bgsweep(c)
	go bgscavenge(c)
	<-c
	<-c
	memstats.enablegc = true // now that runtime is initialized, GC is okay
}

// Garbage collector phase.
// Tells the write barrier and synchronization code what work to perform.
var gcphase uint32

// The compiler knows about this variable.
// If you change it, you must change builtin/runtime.go, too.
// If you change the first four bytes, you must also change the write
// barrier insertion code.
var writeBarrier struct {
	enabled bool    // compiler emits a check of this before calling write barrier
	pad     [3]byte // compiler uses 32-bit load for "enabled" field
	needed  bool    // whether we need a write barrier for current GC phase
	cgo     bool    // whether we need a write barrier for a cgo check
	alignme uint64  // guarantee alignment so that compiler can use a 32 or 64-bit load
}
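
// Layout note for writeBarrier. This is a sketch derived from the field
// order above, assuming pad always stays zero; it is not a statement about
// the exact code the compiler emits. enabled occupies byte 0 and pad
// occupies bytes 1-3, so
//
//	enabled == false  =>  bytes 0-3 are 00 00 00 00  =>  32-bit load == 0
//	enabled == true   =>  bytes 0-3 are 01 00 00 00  =>  32-bit load != 0
//
// The zero/nonzero result is the same on little- and big-endian machines,
// so a single 32-bit load and compare is enough to test enabled.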

// gcBlackenEnabled is 1 if mutator assists and background mark
// workers are allowed to blacken objects. This must only be set when
// gcphase == _GCmark.
var gcBlackenEnabled uint32
|
|
|
|
|
|
|
|
|
|
const (
|
2015-06-25 12:24:44 -04:00
|
|
|
_GCoff = iota // GC not running; sweeping in background, write barrier disabled
|
2016-03-30 17:02:23 -04:00
|
|
|
_GCmark // GC marking roots and workbufs: allocate black, write barrier ENABLED
|
2015-04-24 14:00:55 -04:00
|
|
|
_GCmarktermination // GC mark termination: allocate black, Ps help GC, write barrier ENABLED
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
//go:nosplit
|
|
|
|
|
func setGCPhase(x uint32) {
|
2015-11-02 14:09:24 -05:00
|
|
|
atomic.Store(&gcphase, x)
|
2015-11-13 17:45:22 -08:00
|
|
|
writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
|
|
|
|
|
writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
|
2015-04-24 14:00:55 -04:00
|
|
|
}
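As a rough illustration of the two assignments in setGCPhase, the standalone sketch below derives the same pair of flags from a phase value. The names gcOff, gcMark, gcMarkTermination, barrierState, and deriveBarrier are hypothetical stand-ins rather than runtime identifiers, the cgo parameter stands in for writeBarrier.cgo, and the atomic store of gcphase is left out.

```go
package main

import "fmt"

// Hypothetical stand-ins for _GCoff, _GCmark, and _GCmarktermination.
const (
	gcOff = iota
	gcMark
	gcMarkTermination
)

// barrierState holds the two flags that setGCPhase derives from the phase.
type barrierState struct {
	needed  bool // the GC itself requires the write barrier
	enabled bool // the barrier is on, either for the GC or for cgo checks
}

// deriveBarrier mirrors the two assignments in setGCPhase: the barrier is
// needed in either marking phase, and enabled if it is needed or if the
// (assumed) cgo flag is set.
func deriveBarrier(phase uint32, cgo bool) barrierState {
	needed := phase == gcMark || phase == gcMarkTermination
	return barrierState{needed: needed, enabled: needed || cgo}
}

func main() {
	for _, phase := range []uint32{gcOff, gcMark, gcMarkTermination} {
		for _, cgo := range []bool{false, true} {
			fmt.Printf("phase=%d cgo=%v -> %+v\n", phase, cgo, deriveBarrier(phase, cgo))
		}
	}
}
```

Keeping a single precomputed enabled flag is what lets a write barrier use site test one value instead of re-deriving the condition from the phase each time.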
|
|
|
|
|
|
2015-04-15 17:01:30 -04:00
|
|
|
// gcMarkWorkerMode represents the mode that a concurrent mark worker
|
|
|
|
|
// should operate in.
|
|
|
|
|
//
|
|
|
|
|
// Concurrent marking happens through four different mechanisms. One
|
|
|
|
|
// is mutator assists, which happen in response to allocations and are
|
|
|
|
|
// not scheduled. The other three are variations in the per-P mark
|
|
|
|
|
// workers and are distinguished by gcMarkWorkerMode.
|
|
|
|
|
type gcMarkWorkerMode int
|
|
|
|
|
|
|
|
|
|
const (
|
runtime: manage gcBgMarkWorkers with a global pool
Background mark workers perform per-P marking work. Currently each
worker is assigned a P at creation time. The worker "attaches" to the P
via p.gcBgMarkWorker, making itself (usually) available to
findRunnableGCWorker for scheduling GC work.
While running gcMarkDone, the worker "detaches" from the P (by clearing
p.gcBgMarkWorker), since it may park for other reasons and should not be
scheduled by findRunnableGCWorker.
Unfortunately, this design is complex and difficult to reason about. We
simplify things by changing the design to eliminate the hard P
attachment. Rather than workers always performing work from the same P,
workers perform work for whichever P they find themselves on. On park,
the workers are placed in a pool of free workers, which each P's
findRunnableGCWorker can use to run a worker for its P.
Now if a worker parks in gcMarkDone, a P may simply use another worker
from the pool to complete its own work.
The P's GC worker mode is used to communicate the mode to run to the
selected worker. It is also used to emit the appropriate worker
EvGoStart tracepoint. This is a slight change, as this G may be
preempted (e.g., in gcMarkDone). When it is rescheduled, the trace
viewer will show it as a normal goroutine again. It is currently a bit
difficult to connect to the original worker tracepoint, as the viewer
does not display the goid for the original worker (though the data is in
the trace file).
Change-Id: Id7bd3a364dc18a4d2b1c99c4dc4810fae1293c1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/262348
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Trust: Michael Pratt <mpratt@google.com>
2020-10-13 12:39:13 -04:00
|
|
|
// gcMarkWorkerNotWorker indicates that the next scheduled G is not
|
|
|
|
|
// starting work and the mode should be ignored.
|
|
|
|
|
gcMarkWorkerNotWorker gcMarkWorkerMode = iota
|
|
|
|
|
|
2015-04-15 17:01:30 -04:00
|
|
|
// gcMarkWorkerDedicatedMode indicates that the P of a mark
|
|
|
|
|
// worker is dedicated to running that mark worker. The mark
|
runtime: eliminate getfull barrier from concurrent mark
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.
In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).
Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.
This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).
The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.
Updates #12041 and #8687.
No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.
name old time/op new time/op delta
XBenchGarbage-12 5.80ms ± 2% 6.32ms ± 4% +9.03% (p=0.000 n=20+20)
Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-10-26 16:29:25 -04:00
|
|
|
// worker should run without preemption.
|
2020-10-13 12:39:13 -04:00
|
|
|
gcMarkWorkerDedicatedMode
|
2015-04-15 17:01:30 -04:00
|
|
|
|
|
|
|
|
// gcMarkWorkerFractionalMode indicates that a P is currently
|
|
|
|
|
// running the "fractional" mark worker. The fractional worker
|
2017-10-04 17:12:28 -04:00
|
|
|
// is necessary when GOMAXPROCS*gcBackgroundUtilization is not
|
2021-02-23 03:12:56 +00:00
|
|
|
// an integer and using only dedicated workers would result in
|
|
|
|
|
// utilization too far from the target of gcBackgroundUtilization.
|
|
|
|
|
// The fractional worker should run until it is preempted and
|
|
|
|
|
// will be scheduled to pick up the fractional part of
|
|
|
|
|
// GOMAXPROCS*gcBackgroundUtilization.
|
2015-04-15 17:01:30 -04:00
|
|
|
gcMarkWorkerFractionalMode
|
|
|
|
|
|
|
|
|
|
// gcMarkWorkerIdleMode indicates that a P is running the mark
|
|
|
|
|
// worker because it has nothing else to do. The idle worker
|
|
|
|
|
// should run until it is preempted and account its time
|
|
|
|
|
// against gcController.idleMarkTime.
|
|
|
|
|
gcMarkWorkerIdleMode
|
|
|
|
|
)
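The comment on gcMarkWorkerFractionalMode describes splitting GOMAXPROCS*gcBackgroundUtilization into a whole number of dedicated workers plus a fractional remainder. A minimal sketch of that arithmetic follows, assuming a utilization target of 0.25 (the 25% figure quoted in the commit messages). The constant backgroundUtilization and the function splitWorkers are hypothetical, and the sketch simply takes the floor; the rounding policy the runtime actually applies is not shown in this section.

```go
package main

import (
	"fmt"
	"math"
)

// backgroundUtilization is an assumed target, taken from the 25% CPU goal
// mentioned in the commit messages.
const backgroundUtilization = 0.25

// splitWorkers returns the number of whole (dedicated) workers and the
// leftover fraction of a P that a fractional worker would have to cover.
// Taking the floor is a simplification made for this sketch.
func splitWorkers(gomaxprocs int) (dedicated int, fractional float64) {
	total := float64(gomaxprocs) * backgroundUtilization
	whole := math.Floor(total)
	return int(whole), total - whole
}

func main() {
	for _, n := range []int{1, 2, 4, 6, 8} {
		d, f := splitWorkers(n)
		fmt.Printf("GOMAXPROCS=%d dedicated=%d fractional=%.2f\n", n, d, f)
	}
}
```

When the product is an integer (for example GOMAXPROCS=4 or 8 under this assumption) the remainder is zero and no fractional worker is needed.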
|
|
|
|
|
|
2016-10-07 17:25:26 -04:00
|
|
|
// gcMarkWorkerModeStrings are the string labels of gcMarkWorkerModes
|
|
|
|
|
// to use in execution traces.
|
|
|
|
|
var gcMarkWorkerModeStrings = [...]string{
|
2020-10-13 12:39:13 -04:00
|
|
|
"Not worker",
|
2016-10-07 17:25:26 -04:00
|
|
|
"GC (dedicated)",
|
|
|
|
|
"GC (fractional)",
|
|
|
|
|
"GC (idle)",
|
|
|
|
|
}
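The label table is indexed by mode value, so its entries must stay in the same order as the gcMarkWorkerMode constants. The sketch below shows that pairing in isolation; workerMode, modeStrings, and label are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// workerMode is a hypothetical stand-in for gcMarkWorkerMode.
type workerMode int

const (
	notWorker workerMode = iota
	dedicated
	fractional
	idle
)

// modeStrings is indexed by workerMode, so its order must match the
// constant block above.
var modeStrings = [...]string{
	"Not worker",
	"GC (dedicated)",
	"GC (fractional)",
	"GC (idle)",
}

// label returns the trace label for a mode.
func label(m workerMode) string {
	return modeStrings[m]
}

func main() {
	for m := notWorker; m <= idle; m++ {
		fmt.Printf("%d -> %s\n", m, label(m))
	}
}
```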
|
|
|
|
|
|
2018-11-02 15:18:43 +00:00
|
|
|
// pollFractionalWorkerExit reports whether a fractional mark worker
|
2017-10-04 16:15:35 -04:00
|
|
|
// should self-preempt. It assumes it is called from the fractional
|
|
|
|
|
// worker.
|
|
|
|
|
func pollFractionalWorkerExit() bool {
|
|
|
|
|
// This should be kept in sync with the fractional worker
|
|
|
|
|
// scheduler logic in findRunnableGCWorker.
|
|
|
|
|
now := nanotime()
|
|
|
|
|
delta := now - gcController.markStartTime
|
|
|
|
|
if delta <= 0 {
|
|
|
|
|
return true
|
|
|
|
|
}
|
|
|
|
|
p := getg().m.p.ptr()
|
2017-10-05 12:16:45 -04:00
|
|
|
selfTime := p.gcFractionalMarkTime + (now - p.gcMarkWorkerStartTime)
|
2017-10-04 16:15:35 -04:00
|
|
|
// Add some slack to the utilization goal so that the
|
|
|
|
|
// fractional worker isn't behind again the instant it exits.
|
|
|
|
|
return float64(selfTime)/float64(delta) > 1.2*gcController.fractionalUtilizationGoal
|
|
|
|
|
}
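The self-preemption check above can be restated as a pure function, which makes the 1.2 slack factor easier to experiment with. This is only a sketch: shouldExit and its parameters are hypothetical, with plain integers standing in for nanotime(), gcController.markStartTime, p.gcFractionalMarkTime, and p.gcMarkWorkerStartTime, and a float standing in for gcController.fractionalUtilizationGoal.

```go
package main

import "fmt"

// shouldExit reports whether a fractional worker has used more than its
// share of the time since marking started. All parameters are stand-ins
// for runtime state (times in nanoseconds).
func shouldExit(now, markStart, priorFractional, workerStart int64, goal float64) bool {
	delta := now - markStart
	if delta <= 0 {
		// No measurable time has elapsed, so exit rather than divide by zero.
		return true
	}
	// Time accumulated by earlier runs plus the current run.
	selfTime := priorFractional + (now - workerStart)
	// The 1.2 factor adds slack so the worker is not immediately behind
	// its goal again after exiting.
	return float64(selfTime)/float64(delta) > 1.2*goal
}

func main() {
	// 100ns of work in a 1000ns window against a 25% goal: keep running.
	fmt.Println(shouldExit(1000, 0, 0, 900, 0.25))
	// 400ns of work in the same window: over the 30% threshold, so exit.
	fmt.Println(shouldExit(1000, 0, 300, 900, 0.25))
}
```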
|
|
|
|
|
|
2015-02-19 15:48:40 -05:00
|
|
|
var work struct {
|
2018-06-05 08:14:57 +02:00
|
|
|
full lfstack // lock-free list of full workbuf blocks
|
|
|
|
|
empty lfstack // lock-free list of empty workbuf blocks
|
|
|
|
|
pad0 cpu.CacheLinePad // prevents false-sharing between full/empty and nproc/nwait
|
runtime: perform concurrent scan in GC workers
Currently the concurrent root scan is performed in its entirety by the
GC coordinator before entering concurrent mark (which enables GC
workers). This scan is done sequentially, which can prolong the scan
phase, delay the mark phase, and means that the scan phase does not
obey the 25% CPU goal. Furthermore, there's no need to complete the
root scan before starting marking (in fact, we already allow GC
assists to happen during the scan phase), so this acts as an
unnecessary barrier between root scanning and marking.
This change shifts the root scan work out of the GC coordinator and in
to the GC workers. The coordinator simply sets up the scan state and
enqueues the right number of root scan jobs. The GC workers then drain
the root scan jobs prior to draining heap scan jobs.
This parallelizes the root scan process, makes it obey the 25% CPU
goal, and effectively eliminates root scanning as an isolated phase,
allowing the system to smoothly transition from root scanning to heap
marking. This also eliminates a major non-STW responsibility of the GC
coordinator, which will make it easier to switch to a decentralized
state machine. Finally, it puts us in a good position to perform root
scanning in assists as well, which will help satisfy assists at the
beginning of the GC cycle.
This is mostly straightforward. One tricky aspect is that we have to
deal with preemption deadlock: where two non-preemptible goroutines
are trying to preempt each other to perform a stack scan. Given the
context where this happens, the only instance of this is two
background workers trying to scan each other. We avoid this by simply
not scanning the stacks of background workers during the concurrent
phase; this is safe because we'll scan them during mark termination
(and their stacks are *very* small and should not contain any new
pointers).
This change also switches the root marking during mark termination to
use the same gcDrain-based code path as concurrent mark. This
shouldn't affect performance because STW root marking was already
parallel and tasks switched to heap marking immediately when no more
root marking tasks were available. However, it simplifies the code and
unifies these code paths.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark, possibly by making GC run slightly more
frequently.
name old time/op new time/op delta
XBenchGarbage-12 5.10ms ± 1% 5.24ms ± 1% +2.87% (p=0.000 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.25s ± 3% 3.20s ± 5% -1.57% (p=0.013 n=20+20)
Fannkuch11-12 2.45s ± 1% 2.46s ± 1% +0.38% (p=0.019 n=20+18)
FmtFprintfEmpty-12 49.7ns ± 3% 49.9ns ± 4% ~ (p=0.851 n=19+20)
FmtFprintfString-12 170ns ± 2% 170ns ± 1% ~ (p=0.775 n=20+19)
FmtFprintfInt-12 161ns ± 1% 160ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 267ns ± 1% 270ns ± 1% +1.04% (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12 238ns ± 2% 238ns ± 1% ~ (p=0.133 n=18+19)
FmtFprintfFloat-12 311ns ± 1% 310ns ± 2% -0.35% (p=0.023 n=20+19)
FmtManyArgs-12 1.08µs ± 1% 1.06µs ± 1% -2.31% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.63ms ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 6.49ms ± 1% 6.52ms ± 1% +0.37% (p=0.015 n=20+20)
Gzip-12 319ms ± 3% 318ms ± 1% ~ (p=0.975 n=19+17)
Gunzip-12 41.9ms ± 1% 42.1ms ± 2% +0.65% (p=0.004 n=19+20)
HTTPClientServer-12 61.7µs ± 1% 62.6µs ± 1% +1.40% (p=0.000 n=18+20)
JSONEncode-12 16.8ms ± 1% 16.9ms ± 1% ~ (p=0.239 n=20+18)
JSONDecode-12 58.4ms ± 1% 60.7ms ± 1% +3.85% (p=0.000 n=19+20)
Mandelbrot200-12 3.86ms ± 0% 3.86ms ± 1% ~ (p=0.092 n=18+19)
GoParse-12 3.75ms ± 2% 3.75ms ± 2% ~ (p=0.708 n=19+20)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 2% +0.60% (p=0.010 n=17+20)
RegexpMatchEasy0_1K-12 341ns ± 1% 342ns ± 2% ~ (p=0.203 n=20+19)
RegexpMatchEasy1_32-12 82.5ns ± 2% 83.2ns ± 2% +0.83% (p=0.007 n=19+19)
RegexpMatchEasy1_1K-12 495ns ± 1% 495ns ± 2% ~ (p=0.970 n=19+18)
RegexpMatchMedium_32-12 130ns ± 2% 130ns ± 2% +0.59% (p=0.039 n=19+20)
RegexpMatchMedium_1K-12 39.2µs ± 1% 39.3µs ± 1% ~ (p=0.214 n=18+18)
RegexpMatchHard_32-12 2.03µs ± 2% 2.02µs ± 1% ~ (p=0.166 n=18+19)
RegexpMatchHard_1K-12 61.0µs ± 1% 60.9µs ± 1% ~ (p=0.169 n=20+18)
Revcomp-12 533ms ± 1% 535ms ± 1% ~ (p=0.071 n=19+17)
Template-12 68.1ms ± 2% 73.0ms ± 1% +7.26% (p=0.000 n=19+20)
TimeParse-12 355ns ± 2% 356ns ± 2% ~ (p=0.530 n=19+20)
TimeFormat-12 357ns ± 2% 347ns ± 1% -2.59% (p=0.000 n=20+19)
[Geo mean] 62.1µs 62.3µs +0.31%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.9MB/s ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 118MB/s ± 1% 118MB/s ± 1% -0.37% (p=0.015 n=20+20)
Gzip-12 60.9MB/s ± 3% 60.9MB/s ± 1% ~ (p=0.944 n=19+17)
Gunzip-12 464MB/s ± 1% 461MB/s ± 2% -0.64% (p=0.004 n=19+20)
JSONEncode-12 115MB/s ± 1% 115MB/s ± 1% ~ (p=0.236 n=20+18)
JSONDecode-12 33.2MB/s ± 1% 32.0MB/s ± 1% -3.71% (p=0.000 n=19+20)
GoParse-12 15.5MB/s ± 2% 15.5MB/s ± 2% ~ (p=0.702 n=19+20)
RegexpMatchEasy0_32-12 320MB/s ± 1% 318MB/s ± 2% ~ (p=0.094 n=18+20)
RegexpMatchEasy0_1K-12 3.00GB/s ± 1% 2.99GB/s ± 1% ~ (p=0.194 n=20+19)
RegexpMatchEasy1_32-12 388MB/s ± 2% 385MB/s ± 2% -0.83% (p=0.008 n=19+19)
RegexpMatchEasy1_1K-12 2.07GB/s ± 1% 2.07GB/s ± 1% ~ (p=0.964 n=19+18)
RegexpMatchMedium_32-12 7.68MB/s ± 1% 7.64MB/s ± 2% -0.57% (p=0.020 n=19+20)
RegexpMatchMedium_1K-12 26.1MB/s ± 1% 26.1MB/s ± 1% ~ (p=0.211 n=18+18)
RegexpMatchHard_32-12 15.8MB/s ± 1% 15.8MB/s ± 1% ~ (p=0.180 n=18+19)
RegexpMatchHard_1K-12 16.8MB/s ± 1% 16.8MB/s ± 2% ~ (p=0.236 n=20+19)
Revcomp-12 477MB/s ± 1% 475MB/s ± 1% ~ (p=0.071 n=19+17)
Template-12 28.5MB/s ± 2% 26.6MB/s ± 1% -6.77% (p=0.000 n=19+20)
[Geo mean] 100MB/s 99.0MB/s -0.82%
Change-Id: I875bf6ceb306d1ee2f470cabf88aa6ede27c47a0
Reviewed-on: https://go-review.googlesource.com/16059
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-10-19 13:46:32 -04:00
|
|
|
|
2017-03-20 14:05:48 -04:00
|
|
|
wbufSpans struct {
|
|
|
|
|
lock mutex
|
2017-03-20 17:25:59 -04:00
|
|
|
// free is a list of spans dedicated to workbufs, but
|
|
|
|
|
// that don't currently contain any workbufs.
|
|
|
|
|
free mSpanList
|
2017-03-20 14:05:48 -04:00
|
|
|
// busy is a list of all spans containing workbufs on
|
|
|
|
|
// one of the workbuf lists.
|
|
|
|
|
busy mSpanList
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Restore 64-bit alignment on 32-bit.
|
|
|
|
|
_ uint32
|
|
|
|
|
|
2016-11-14 18:24:37 -05:00
|
|
|
// bytesMarked is the number of bytes marked this cycle. This
|
|
|
|
|
// includes bytes blackened in scanned objects, noscan objects
|
|
|
|
|
// that go straight to black, and permagrey objects scanned by
|
|
|
|
|
// markroot during the concurrent scan phase. This is updated
|
|
|
|
|
// atomically during the cycle. Updates may be batched
|
|
|
|
|
// arbitrarily, since the value is only read at the end of the
|
|
|
|
|
// cycle.
|
|
|
|
|
//
|
|
|
|
|
// Because of benign races during marking, this number may not
|
|
|
|
|
// be the exact number of marked bytes, but it should be very
|
|
|
|
|
// close.
|
|
|
|
|
//
|
|
|
|
|
// Put this field here because it needs 64-bit atomic access
|
|
|
|
|
// (and thus 8-byte alignment even on 32-bit architectures).
|
|
|
|
|
bytesMarked uint64
|
|
|
|
|
|
2015-10-19 13:46:32 -04:00
|
|
|
markrootNext uint32 // next markroot job
|
|
|
|
|
markrootJobs uint32 // number of markroot jobs
|
|
|
|
|
|
2018-08-26 21:33:26 -04:00
|
|
|
nproc uint32
|
|
|
|
|
tstart int64
|
|
|
|
|
nwait uint32
|
2014-11-15 08:00:38 -05:00
|
|
|
|
2015-10-16 16:52:26 -04:00
|
|
|
// Number of roots of various root types. Set by gcMarkRootPrepare.
|
2017-02-09 11:50:26 -05:00
|
|
|
nDataRoots, nBSSRoots, nSpanRoots, nStackRoots int
|
2015-10-16 16:52:26 -04:00
|
|
|
|
2020-12-22 19:22:14 +08:00
|
|
|
// Base indexes of each root type. Set by gcMarkRootPrepare.
|
|
|
|
|
baseData, baseBSS, baseSpans, baseStacks, baseEnd uint32
|
|
|
|
|
|
2015-10-23 14:15:18 -04:00
|
|
|
// Each type of GC state transition is protected by a lock.
|
|
|
|
|
// Since multiple threads can simultaneously detect the state
|
|
|
|
|
// transition condition, any thread that detects a transition
|
|
|
|
|
// condition must acquire the appropriate transition lock,
|
|
|
|
|
// re-check the transition condition and return if it no
|
|
|
|
|
// longer holds or perform the transition if it does.
|
|
|
|
|
// Likewise, any transition must invalidate the transition
|
|
|
|
|
// condition before releasing the lock. This ensures that each
|
|
|
|
|
// transition is performed by exactly one thread and threads
|
|
|
|
|
// that need the transition to happen block until it has
|
|
|
|
|
// happened.
|
|
|
|
|
//
|
|
|
|
|
// startSema protects the transition from "off" to mark or
|
|
|
|
|
// mark termination.
|
|
|
|
|
startSema uint32
|
2018-08-03 17:13:09 -04:00
|
|
|
// markDoneSema protects transitions from mark to mark termination.
|
2015-10-26 11:27:37 -04:00
|
|
|
markDoneSema uint32
|
2015-10-23 14:15:18 -04:00
|
|
|
|
runtime: multi-threaded, utilization-scheduled background mark
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-03-23 21:07:33 -04:00
|
|
|
bgMarkReady note // signal background mark worker has started
|
|
|
|
|
bgMarkDone uint32 // cas to 1 when at a background mark completion point
|
2015-04-22 17:44:36 -04:00
|
|
|
// Background mark completion signaling
|
2015-06-01 18:16:03 -04:00
|
|
|
|
2015-10-23 15:17:04 -04:00
|
|
|
// mode is the concurrency mode of the current GC cycle.
|
|
|
|
|
mode gcMode
|
|
|
|
|
|
2017-02-27 10:46:12 -05:00
|
|
|
// userForced indicates the current GC cycle was forced by an
|
|
|
|
|
// explicit user call.
|
|
|
|
|
userForced bool
|
|
|
|
|
|
2015-04-01 13:47:35 -04:00
|
|
|
// totaltime is the CPU nanoseconds spent in GC since the
|
|
|
|
|
// program started if debug.gctrace > 0.
|
|
|
|
|
totaltime int64
|
2015-03-12 16:53:57 -04:00
|
|
|
|
2021-03-31 22:55:06 +00:00
|
|
|
// initialHeapLive is the value of gcController.heapLive at the
|
runtime: fix underflow in next_gc calculation
Currently, it's possible for the next_gc calculation to underflow.
Since next_gc is unsigned, this wraps around and effectively disables
GC for the rest of the program's execution. Besides being obviously
wrong, this is causing test failures on 32-bit because some tests are
running out of heap.
This underflow happens for two reasons, both having to do with how we
estimate the reachable heap size at the end of the GC cycle.
One reason is that this calculation depends on the value of heap_live
at the beginning of the GC cycle, but we currently only record that
value during a concurrent GC and not during a forced STW GC. Fix this
by moving the recorded value from gcController to work and recording
it on a common code path.
The other reason is that we use the amount of allocation during the GC
cycle as an approximation of the amount of floating garbage and
subtract it from the marked heap to estimate the reachable heap.
However, since this is only an approximation, it's possible for the
amount of allocation during the cycle to be *larger* than the marked
heap size (since the runtime allocates white and it's possible for
these allocations to never be made reachable from the heap). Currently
this causes wrap-around in our estimate of the reachable heap size,
which in turn causes wrap-around in next_gc. Fix this by bottoming out
the reachable heap estimate at 0, in which case we just fall back to
triggering GC at heapminimum (which is okay since this only happens on
small heaps).
Fixes #10555, fixes #10556, and fixes #10559.
Change-Id: Iad07b529c03772356fede2ae557732f13ebfdb63
Reviewed-on: https://go-review.googlesource.com/9286
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-23 13:02:31 -04:00
|
|
|
// beginning of this GC cycle.
|
|
|
|
|
initialHeapLive uint64
|
2015-10-14 21:31:33 -04:00
|
|
|
|
|
|
|
|
// assistQueue is a queue of assists that are blocked because
|
|
|
|
|
// there was neither enough credit to steal nor enough work to
|
|
|
|
|
// do.
|
|
|
|
|
assistQueue struct {
|
2018-08-09 23:47:37 -04:00
|
|
|
lock mutex
|
|
|
|
|
q gQueue
|
2015-10-14 21:31:33 -04:00
|
|
|
}
|
2015-10-23 15:17:04 -04:00
|
|
|
|
2017-02-23 21:50:19 -05:00
|
|
|
// sweepWaiters is a list of blocked goroutines to wake when
|
|
|
|
|
// we transition from mark termination to sweep.
|
|
|
|
|
sweepWaiters struct {
|
|
|
|
|
lock mutex
|
2018-08-10 10:28:44 -04:00
|
|
|
list gList
|
2017-02-23 21:50:19 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// cycles is the number of completed GC cycles, where a GC
|
|
|
|
|
// cycle is sweep termination, mark, mark termination, and
|
|
|
|
|
// sweep. This differs from memstats.numgc, which is
|
|
|
|
|
// incremented at mark termination.
|
|
|
|
|
cycles uint32
|
|
|
|
|
|
2015-10-23 15:17:04 -04:00
|
|
|
// Timing/utilization stats for this cycle.
|
|
|
|
|
stwprocs, maxprocs int32
|
|
|
|
|
tSweepTerm, tMark, tMarkTerm, tEnd int64 // nanotime() of phase start
|
|
|
|
|
|
|
|
|
|
pauseNS int64 // total STW time this cycle
|
|
|
|
|
pauseStart int64 // nanotime() of last STW
|
|
|
|
|
|
|
|
|
|
// debug.gctrace heap sizes for this cycle.
|
|
|
|
|
heap0, heap1, heap2, heapGoal uint64
|
2015-01-06 14:58:49 -05:00
|
|
|
}
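// Illustrative sketch, not runtime code: the transition-lock rule
// described for startSema and markDoneSema above (detect the condition,
// acquire the lock, re-check, perform the transition, and invalidate
// the condition before releasing) can be modeled with ordinary sync
// primitives. The names machine, tryStart and runRace are hypothetical,
// and sync.Mutex stands in for the runtime's internal semaphore.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// machine is a hypothetical two-state machine (0 = off, 1 = mark).
type machine struct {
	startLock   sync.Mutex // plays the role of work.startSema
	phase       int32      // read atomically outside the lock
	transitions int        // written only while startLock is held
}

// shouldStart is the transition condition: true while the machine is off.
func (m *machine) shouldStart() bool {
	return atomic.LoadInt32(&m.phase) == 0
}

// tryStart performs the off-to-mark transition at most once, no matter
// how many goroutines detect the condition at the same time.
func (m *machine) tryStart() bool {
	if !m.shouldStart() {
		return false
	}
	m.startLock.Lock()
	defer m.startLock.Unlock()
	// Re-check under the lock: another goroutine may have won the race.
	if !m.shouldStart() {
		return false
	}
	m.transitions++
	// Invalidate the condition before the lock is released.
	atomic.StoreInt32(&m.phase, 1)
	return true
}

// runRace starts n goroutines that all try the transition and reports
// how many transitions happened and how many callers performed one.
func runRace(n int) (transitions, winners int) {
	var m machine
	var won int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if m.tryStart() {
				atomic.AddInt32(&won, 1)
			}
		}()
	}
	wg.Wait()
	return m.transitions, int(atomic.LoadInt32(&won))
}

func main() {
	transitions, winners := runRace(16)
	fmt.Println("transitions:", transitions, "winners:", winners)
}
```

// Goroutines that lose the race block on the lock until the winner has
// finished, so they return only after the transition has happened.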
|
|
|
|
|
|
2015-07-18 23:22:18 -07:00
|
|
|
// GC runs a garbage collection and blocks the caller until the
|
|
|
|
|
// garbage collection is complete. It may also block the entire
|
|
|
|
|
// program.
|
2015-02-19 13:38:46 -05:00
|
|
|
func GC() {
|
2017-02-23 21:50:19 -05:00
|
|
|
// We consider a cycle to be: sweep termination, mark, mark
|
|
|
|
|
// termination, and sweep. This function shouldn't return
|
|
|
|
|
// until a full cycle has been completed, from beginning to
|
|
|
|
|
// end. Hence, we always want to finish up the current cycle
|
|
|
|
|
// and start a new one. That means:
|
|
|
|
|
//
|
|
|
|
|
// 1. In sweep termination, mark, or mark termination of cycle
|
|
|
|
|
// N, wait until mark termination N completes and transitions
|
|
|
|
|
// to sweep N.
|
|
|
|
|
//
|
|
|
|
|
// 2. In sweep N, help with sweep N.
|
|
|
|
|
//
|
|
|
|
|
// At this point we can begin a full cycle N+1.
|
|
|
|
|
//
|
|
|
|
|
// 3. Trigger cycle N+1 by starting sweep termination N+1.
|
|
|
|
|
//
|
|
|
|
|
// 4. Wait for mark termination N+1 to complete.
|
|
|
|
|
//
|
|
|
|
|
// 5. Help with sweep N+1 until it's done.
|
|
|
|
|
//
|
|
|
|
|
// This all has to be written to deal with the fact that the
|
|
|
|
|
// GC may move ahead on its own. For example, when we block
|
|
|
|
|
// until mark termination N, we may wake up in cycle N+2.
|
|
|
|
|
|
2018-03-26 17:44:05 -04:00
|
|
|
// Wait until the current sweep termination, mark, and mark
|
|
|
|
|
// termination complete.
|
2017-02-23 21:50:19 -05:00
|
|
|
n := atomic.Load(&work.cycles)
|
2018-03-26 17:44:05 -04:00
|
|
|
gcWaitOnMark(n)
|
2017-02-23 21:50:19 -05:00
|
|
|
|
|
|
|
|
// We're now in sweep N or later. Trigger GC cycle N+1, which
|
|
|
|
|
// will first finish sweep N if necessary and then enter sweep
|
|
|
|
|
// termination N+1.
|
2018-08-13 16:14:19 -04:00
|
|
|
gcStart(gcTrigger{kind: gcTriggerCycle, n: n + 1})
|
2017-02-23 21:50:19 -05:00
|
|
|
|
|
|
|
|
// Wait for mark termination N+1 to complete.
|
2018-03-26 17:44:05 -04:00
|
|
|
gcWaitOnMark(n + 1)
|
2017-02-23 21:50:19 -05:00
|
|
|
|
|
|
|
|
// Finish sweep N+1 before returning. We do this both to
|
|
|
|
|
// complete the cycle and because runtime.GC() is often used
|
|
|
|
|
// as part of tests and benchmarks to get the system into a
|
|
|
|
|
// relatively stable and isolated state.
|
2018-09-25 17:32:03 -04:00
|
|
|
for atomic.Load(&work.cycles) == n+1 && sweepone() != ^uintptr(0) {
|
2017-02-23 21:50:19 -05:00
|
|
|
sweep.nbgsweep++
|
|
|
|
|
Gosched()
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Callers may assume that the heap profile reflects the
|
|
|
|
|
// just-completed cycle when this returns (historically this
|
|
|
|
|
// happened because this was a STW GC), but right now the
|
|
|
|
|
// profile still reflects mark termination N, not N+1.
|
|
|
|
|
//
|
|
|
|
|
// As soon as all of the sweep frees from cycle N+1 are done,
|
|
|
|
|
// we can go ahead and publish the heap profile.
|
|
|
|
|
//
|
|
|
|
|
// First, wait for sweeping to finish. (We know there are no
|
|
|
|
|
// more spans on the sweep queue, but we may be concurrently
|
|
|
|
|
// sweeping spans, so we have to wait.)
|
2021-04-06 19:25:28 -04:00
|
|
|
for atomic.Load(&work.cycles) == n+1 && !isSweepDone() {
|
2017-02-23 21:50:19 -05:00
|
|
|
Gosched()
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Now we're really done with sweeping, so we can publish the
|
|
|
|
|
// stable heap profile. Only do this if we haven't already hit
|
|
|
|
|
// another mark termination.
|
|
|
|
|
mp := acquirem()
|
|
|
|
|
cycle := atomic.Load(&work.cycles)
|
|
|
|
|
if cycle == n+1 || (gcphase == _GCmark && cycle == n+2) {
|
|
|
|
|
mProf_PostSweep()
|
|
|
|
|
}
|
|
|
|
|
releasem(mp)
|
2014-11-11 17:05:02 -05:00
|
|
|
}
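// Illustrative usage sketch, not runtime code: from user code, the
// blocking behavior documented for GC above can be observed through
// the public runtime.MemStats.NumGC counter. The helper name
// cyclesAcrossGC is hypothetical. Because the call waits for at least
// one further mark termination, the counter should advance by at least
// one; it may advance by more if the collector moves ahead on its own.

```go
package main

import (
	"fmt"
	"runtime"
)

// cyclesAcrossGC reports how far MemStats.NumGC advanced during a
// single runtime.GC call.
func cyclesAcrossGC() uint32 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	runtime.GC() // blocks until a full cycle has completed
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	fmt.Println("GC cycles completed during runtime.GC():", cyclesAcrossGC())
}
```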
|
|
|
|
|
|
2018-03-26 17:44:05 -04:00
|
|
|
// gcWaitOnMark blocks until GC finishes the Nth mark phase. If GC has
|
|
|
|
|
// already completed this mark phase, it returns immediately.
|
|
|
|
|
func gcWaitOnMark(n uint32) {
|
|
|
|
|
for {
|
|
|
|
|
// Disable phase transitions.
|
|
|
|
|
lock(&work.sweepWaiters.lock)
|
|
|
|
|
nMarks := atomic.Load(&work.cycles)
|
|
|
|
|
if gcphase != _GCmark {
|
|
|
|
|
// We've already completed this cycle's mark.
|
|
|
|
|
nMarks++
|
|
|
|
|
}
|
|
|
|
|
if nMarks > n {
|
|
|
|
|
// We're done.
|
|
|
|
|
unlock(&work.sweepWaiters.lock)
|
|
|
|
|
return
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Wait until sweep termination, mark, and mark
|
|
|
|
|
// termination of cycle N complete.
|
2018-08-10 10:28:44 -04:00
|
|
|
work.sweepWaiters.list.push(getg())
|
2018-03-06 21:28:24 -08:00
|
|
|
goparkunlock(&work.sweepWaiters.lock, waitReasonWaitForGCCycle, traceEvGoBlock, 1)
|
2018-03-26 17:44:05 -04:00
|
|
|
}
|
|
|
|
|
}
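// Illustrative sketch, not runtime code: the counting rule used by
// gcWaitOnMark above (the number of completed marks equals the cycle
// counter, plus one when no mark is in progress) can be modeled in
// user code. The type cycleTracker and its methods are hypothetical,
// and sync.Cond stands in for parking on work.sweepWaiters.

```go
package main

import (
	"fmt"
	"sync"
)

// cycleTracker is a hypothetical model of work.cycles plus the phase.
type cycleTracker struct {
	mu      sync.Mutex
	cond    *sync.Cond
	cycles  uint32 // number of cycles started so far
	marking bool   // true while the current cycle is in its mark phase
}

func newCycleTracker() *cycleTracker {
	t := &cycleTracker{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// startMark begins the mark phase of a new cycle.
func (t *cycleTracker) startMark() {
	t.mu.Lock()
	t.cycles++
	t.marking = true
	t.mu.Unlock()
}

// finishMark ends the current mark phase and wakes all waiters.
func (t *cycleTracker) finishMark() {
	t.mu.Lock()
	t.marking = false
	t.cond.Broadcast()
	t.mu.Unlock()
}

// waitOnMark blocks until mark phase n has finished. It returns
// immediately if that mark phase is already complete.
func (t *cycleTracker) waitOnMark(n uint32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for {
		nMarks := t.cycles
		if !t.marking {
			// The current cycle's mark has already completed.
			nMarks++
		}
		if nMarks > n {
			return
		}
		t.cond.Wait()
	}
}

func main() {
	tr := newCycleTracker()
	tr.waitOnMark(0) // returns at once: nothing is marking yet
	tr.startMark()
	done := make(chan struct{})
	go func() {
		tr.waitOnMark(1)
		close(done)
	}()
	tr.finishMark()
	<-done
	fmt.Println("mark 1 observed as finished")
}
```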
|
|
|
|
|
|
2015-09-24 14:30:09 -04:00
|
|
|
// gcMode indicates how concurrent a GC cycle should be.
|
|
|
|
|
type gcMode int
|
|
|
|
|
|
2015-02-19 15:48:40 -05:00
|
|
|
const (
|
2015-09-24 14:30:09 -04:00
|
|
|
gcBackgroundMode gcMode = iota // concurrent GC and sweep
|
|
|
|
|
gcForceMode // stop-the-world GC now, concurrent sweep
|
2016-12-06 17:42:42 -05:00
|
|
|
gcForceBlockMode // stop-the-world GC now and STW sweep (forced by user)
|
2015-02-19 15:48:40 -05:00
|
|
|
)
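// Illustrative sketch, not runtime code: gcStart later picks one of
// the modes above from debug.gcstoptheworld. The standalone copy below
// re-declares the constants so that it compiles on its own; the helper
// modeFor and the String method are hypothetical additions.

```go
package main

import "fmt"

// gcMode and its constants are local copies of the declarations above.
type gcMode int

const (
	gcBackgroundMode gcMode = iota // concurrent GC and sweep
	gcForceMode                    // stop-the-world GC now, concurrent sweep
	gcForceBlockMode               // stop-the-world GC now and STW sweep
)

// String gives each mode a readable name for printing.
func (m gcMode) String() string {
	switch m {
	case gcBackgroundMode:
		return "background"
	case gcForceMode:
		return "force"
	case gcForceBlockMode:
		return "force+block"
	}
	return "unknown"
}

// modeFor maps a gcstoptheworld debug value to a mode. Any value other
// than 1 or 2 leaves the collector in background mode.
func modeFor(gcstoptheworld int32) gcMode {
	mode := gcBackgroundMode
	if gcstoptheworld == 1 {
		mode = gcForceMode
	} else if gcstoptheworld == 2 {
		mode = gcForceBlockMode
	}
	return mode
}

func main() {
	for _, v := range []int32{0, 1, 2} {
		fmt.Printf("gcstoptheworld=%d -> %v\n", v, modeFor(v))
	}
}
```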
|
|
|
|
|
|
2017-01-09 11:35:42 -05:00
|
|
|
// A gcTrigger is a predicate for starting a GC cycle. Specifically,
|
|
|
|
|
// it is an exit condition for the _GCoff phase.
|
|
|
|
|
type gcTrigger struct {
|
|
|
|
|
kind gcTriggerKind
|
2017-02-23 21:50:19 -05:00
|
|
|
now int64 // gcTriggerTime: current time
|
|
|
|
|
n uint32 // gcTriggerCycle: cycle number to start
|
2017-01-09 11:35:42 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
type gcTriggerKind int
|
|
|
|
|
|
|
|
|
|
const (
|
|
|
|
|
// gcTriggerHeap indicates that a cycle should be started when
|
|
|
|
|
// the heap size reaches the trigger heap size computed by the
|
|
|
|
|
// controller.
|
2017-09-25 15:01:29 -04:00
|
|
|
gcTriggerHeap gcTriggerKind = iota
|
2017-01-09 11:35:42 -05:00
|
|
|
|
|
|
|
|
// gcTriggerTime indicates that a cycle should be started when
|
|
|
|
|
// it's been more than forcegcperiod nanoseconds since the
|
|
|
|
|
// previous GC cycle.
|
|
|
|
|
gcTriggerTime
|
2017-02-23 21:50:19 -05:00
|
|
|
|
|
|
|
|
// gcTriggerCycle indicates that a cycle should be started if
|
|
|
|
|
// we have not yet started cycle number gcTrigger.n (relative
|
|
|
|
|
// to work.cycles).
|
|
|
|
|
gcTriggerCycle
|
2017-01-09 11:35:42 -05:00
|
|
|
)
|
|
|
|
|
|
2018-11-02 15:18:43 +00:00
|
|
|
// test reports whether the trigger condition is satisfied, meaning
|
2017-01-09 11:35:42 -05:00
|
|
|
// that the exit condition for the _GCoff phase has been met. The exit
|
|
|
|
|
// condition should be tested when allocating.
|
|
|
|
|
func (t gcTrigger) test() bool {
|
2017-09-25 15:01:29 -04:00
|
|
|
if !memstats.enablegc || panicking != 0 || gcphase != _GCoff {
|
2017-01-09 11:35:42 -05:00
|
|
|
return false
|
|
|
|
|
}
|
|
|
|
|
switch t.kind {
|
|
|
|
|
case gcTriggerHeap:
|
2021-03-31 22:55:06 +00:00
|
|
|
// Non-atomic access to gcController.heapLive for performance. If
|
2017-04-21 11:45:44 -04:00
|
|
|
// we are going to trigger on this, this thread just
|
2021-03-31 22:55:06 +00:00
|
|
|
// atomically wrote gcController.heapLive anyway and we'll see our
|
2017-04-21 11:45:44 -04:00
|
|
|
// own write.
|
2021-03-31 22:55:06 +00:00
|
|
|
return gcController.heapLive >= gcController.trigger
|
2017-01-09 11:35:42 -05:00
|
|
|
case gcTriggerTime:
|
2021-04-01 17:56:32 +00:00
|
|
|
if gcController.gcPercent < 0 {
|
2017-09-25 14:58:13 -04:00
|
|
|
return false
|
|
|
|
|
}
|
2017-01-09 11:35:42 -05:00
|
|
|
lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
|
|
|
|
|
return lastgc != 0 && t.now-lastgc > forcegcperiod
|
2017-02-23 21:50:19 -05:00
|
|
|
case gcTriggerCycle:
|
|
|
|
|
// t.n > work.cycles, but accounting for wraparound.
|
|
|
|
|
return int32(t.n-work.cycles) > 0
|
2017-01-09 11:35:42 -05:00
|
|
|
}
|
|
|
|
|
return true
|
2015-10-23 14:15:18 -04:00
|
|
|
}
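// Illustrative sketch, not runtime code: the gcTriggerCycle case above
// compares two uint32 cycle numbers by converting their difference to
// int32, which keeps the comparison correct when the counter wraps.
// The helper names cycleAfter and naiveAfter are hypothetical.

```go
package main

import "fmt"

// cycleAfter reports whether cycle number n is ahead of cycles,
// treating the uint32 counter as circular.
func cycleAfter(n, cycles uint32) bool {
	return int32(n-cycles) > 0
}

// naiveAfter is the plain comparison, which breaks at wraparound.
func naiveAfter(n, cycles uint32) bool {
	return n > cycles
}

func main() {
	last := ^uint32(0) // largest uint32; the next cycle number wraps to 0
	fmt.Println(cycleAfter(5, 4), naiveAfter(5, 4))
	fmt.Println(cycleAfter(0, last), naiveAfter(0, last))
}
```

// The signed-difference form is meaningful only while the two values
// are less than 2^31 apart.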
|
|
|
|
|
|
2018-08-13 16:14:19 -04:00
|
|
|
// gcStart starts the GC. It transitions from _GCoff to _GCmark (if
|
|
|
|
|
// debug.gcstoptheworld == 0) or performs all of GC (if
|
|
|
|
|
// debug.gcstoptheworld != 0).
|
2015-10-23 14:15:18 -04:00
|
|
|
//
|
|
|
|
|
// This may return without performing this transition in some cases,
|
|
|
|
|
// such as when called on a system stack or with locks held.
|
2018-08-13 16:14:19 -04:00
|
|
|
func gcStart(trigger gcTrigger) {
|
2015-10-23 14:15:18 -04:00
|
|
|
// Since this is called from malloc and malloc is called in
|
|
|
|
|
// the guts of a number of libraries that might be holding
|
|
|
|
|
// locks, don't attempt to start GC in non-preemptible or
|
|
|
|
|
// potentially unstable situations.
|
|
|
|
|
mp := acquirem()
|
|
|
|
|
if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
|
|
|
|
|
releasem(mp)
|
|
|
|
|
return
|
|
|
|
|
}
|
|
|
|
|
releasem(mp)
|
|
|
|
|
mp = nil
|
|
|
|
|
|
2015-10-23 15:04:37 -04:00
|
|
|
// Pick up the remaining unswept/not being swept spans concurrently
|
|
|
|
|
//
|
|
|
|
|
// This shouldn't happen if we're being invoked in background
|
|
|
|
|
// mode since proportional sweep should have just finished
|
|
|
|
|
// sweeping everything, but rounding errors, etc, may leave a
|
|
|
|
|
// few spans unswept. In forced mode, this is necessary since
|
|
|
|
|
// GC can be forced at any point in the sweeping cycle.
|
|
|
|
|
//
|
|
|
|
|
// We check the transition condition continuously here in case
|
|
|
|
|
// this G gets delayed into the next GC cycle.
|
2018-09-25 17:32:03 -04:00
|
|
|
for trigger.test() && sweepone() != ^uintptr(0) {
|
2015-10-23 15:04:37 -04:00
|
|
|
sweep.nbgsweep++
|
|
|
|
|
}
|
|
|
|
|
|
2015-10-23 14:15:18 -04:00
|
|
|
// Perform GC initialization and the sweep termination
|
|
|
|
|
// transition.
|
2017-02-23 11:54:43 -05:00
|
|
|
semacquire(&work.startSema)
|
|
|
|
|
// Re-check transition condition under transition lock.
|
|
|
|
|
if !trigger.test() {
|
|
|
|
|
semrelease(&work.startSema)
|
|
|
|
|
return
|
2015-10-23 14:15:18 -04:00
|
|
|
}
|
|
|
|
|
|
2016-12-06 17:42:42 -05:00
|
|
|
// For stats, check if this GC was forced by the user.
|
2017-09-25 15:01:29 -04:00
|
|
|
work.userForced = trigger.kind == gcTriggerCycle
|
2016-12-06 17:42:42 -05:00
|
|
|
|
2015-10-23 14:15:18 -04:00
|
|
|
// In gcstoptheworld debug mode, upgrade the mode accordingly.
|
|
|
|
|
// We do this after re-checking the transition condition so
|
|
|
|
|
// that multiple goroutines that detect the heap trigger don't
|
|
|
|
|
// start multiple STW GCs.
|
2018-08-13 16:14:19 -04:00
|
|
|
mode := gcBackgroundMode
|
|
|
|
|
if debug.gcstoptheworld == 1 {
|
|
|
|
|
mode = gcForceMode
|
|
|
|
|
} else if debug.gcstoptheworld == 2 {
|
|
|
|
|
mode = gcForceBlockMode
|
2015-10-23 14:15:18 -04:00
|
|
|
}
|
|
|
|
|
|
2017-08-19 22:33:51 +02:00
|
|
|
// Ok, we're doing it! Stop everybody else
|
runtime: don't hold worldsema across mark phase
This change makes it so that worldsema isn't held across the mark phase.
This means that various operations like ReadMemStats may now stop the
world during the mark phase, reducing latency on such operations.
Only three such operations are still not allowed to occur during
marking: GOMAXPROCS, StartTrace, and StopTrace.
For the former it's because any change to GOMAXPROCS impacts GC mark
background worker scheduling and the details there are tricky.
For the latter two it's because tracing needs to observe consistent GC
start and GC end events, and if StartTrace or StopTrace may stop the
world during marking, then it's possible for it to see a GC end event
without a start or GC start event without an end, respectively.
To ensure that GOMAXPROCS and StartTrace/StopTrace cannot proceed until
marking is complete, the runtime now holds a new semaphore, gcsema,
across the mark phase just like it used to with worldsema.
This change is being landed once more after being reverted in the Go
1.14 release cycle, since CL 215157 allows it to have a positive
effect on system performance.
For the benchmark BenchmarkReadMemStatsLatency in the runtime, which
measures ReadMemStats latencies while the GC is exercised, the tail of
these latencies reduced dramatically on an 8-core machine:
name old 50%tile-ns new 50%tile-ns delta
ReadMemStatsLatency-8 4.40M ±74% 0.12M ± 2% -97.35% (p=0.008 n=5+5)
name old 90%tile-ns new 90%tile-ns delta
ReadMemStatsLatency-8 102M ± 6% 0M ±14% -99.79% (p=0.008 n=5+5)
name old 99%tile-ns new 99%tile-ns delta
ReadMemStatsLatency-8 147M ±18% 4M ±57% -97.43% (p=0.008 n=5+5)
Fixes #19812.
Change-Id: If66c3c97d171524ae29f0e7af4bd33509d9fd0bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/216557
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-06-17 19:03:09 +00:00
|
|
|
semacquire(&gcsema)
|
2016-12-13 16:45:55 +01:00
|
|
|
semacquire(&worldsema)
|
2014-12-12 18:41:57 +01:00
|
|
|
|
2015-07-01 11:04:19 -04:00
|
|
|
if trace.enabled {
|
|
|
|
|
traceGCStart()
|
|
|
|
|
}
|
|
|
|
|
|
2018-08-23 13:14:19 -04:00
|
|
|
// Check that all Ps have finished deferred mcache flushes.
|
|
|
|
|
for _, p := range allp {
|
|
|
|
|
if fg := atomic.Load(&p.mcache.flushGen); fg != mheap_.sweepgen {
|
|
|
|
|
println("runtime: p", p.id, "flushGen", fg, "!= sweepgen", mheap_.sweepgen)
|
|
|
|
|
throw("p mcache not flushed")
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
runtime: implement STW GC in terms of concurrent GC
Currently, STW GC works very differently from concurrent GC. The
largest difference is that in concurrent GC, all marking work is done
by background mark workers during the mark phase, while in STW GC, all
marking work is done by gchelper during the mark termination phase.
This is a consequence of the evolution of Go's GC from a STW GC by
incrementally moving work from STW mark termination into concurrent
mark. However, at this point, the STW code paths exist only as a
debugging mode. Having separate code paths for this increases the
maintenance burden and complexity of the garbage collector. At the
same time, these code paths aren't tested nearly as well, making it
far more likely that they will bit-rot.
This CL reverses the relationship between STW GC and concurrent GC by
re-implementing STW GC in terms of concurrent GC.
This builds on the new scheduler support for disabling user goroutine
scheduling. During sweep termination, it disables user scheduling, so
when the GC starts the world again for concurrent mark, it's really
only "concurrent" with itself.
There are several code paths that were specific to STW GC that are now
vestigial. We'll remove these in the follow-up CLs.
Updates #26903.
Change-Id: Ia3883d2fcf7ab1d89bdc9c8ee54bf9bffb32c096
Reviewed-on: https://go-review.googlesource.com/c/134780
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2018-08-13 16:30:54 -04:00
|
|
|
gcBgMarkStartWorkers()
|
2016-03-01 15:09:24 -05:00
|
|
|
|
2019-05-17 14:48:04 +00:00
|
|
|
systemstack(gcResetMarkState)
|
2016-03-01 15:09:24 -05:00
|
|
|
|
runtime: fix gctrace STW CPU time and CPU fraction
The CPU time reported in the gctrace for STW phases is simply
work.stwprocs times the wall-clock duration of these phases. However,
work.stwprocs is set to gcprocs(), which is wrong for multiple
reasons:
1. gcprocs is intended to limit the number of Ms used for mark
termination based on how well the garbage collector actually
scales, but the gctrace wants to report how much CPU time is being
stolen from the application. During STW, that's *all* of the CPU,
regardless of how many the garbage collector can actually use.
2. gcprocs assumes it's being called during STW, so it limits its
result to sched.nmidle+1. However, we're not calling it during STW,
so sched.nmidle is typically quite small, even if GOMAXPROCS is
quite large.
Fix this by setting work.stwprocs to min(ncpu, GOMAXPROCS). This also
fixes the overall GC CPU fraction, which is based on the computed CPU
times.
Fixes #22725.
Change-Id: I64b5ce87e28dbec6870aa068ce7aecdd28c058d1
Reviewed-on: https://go-review.googlesource.com/77710
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-11-14 15:08:32 -08:00
|
|
|
work.stwprocs, work.maxprocs = gomaxprocs, gomaxprocs
|
|
|
|
|
if work.stwprocs > ncpu {
|
|
|
|
|
// This is used to compute CPU time of the STW phases,
|
|
|
|
|
// so it can't be more than ncpu, even if GOMAXPROCS is.
|
|
|
|
|
work.stwprocs = ncpu
|
|
|
|
|
}
|
2021-03-31 22:55:06 +00:00
|
|
|
work.heap0 = atomic.Load64(&gcController.heapLive)
|
2015-10-23 15:17:04 -04:00
|
|
|
work.pauseNS = 0
|
|
|
|
|
work.mode = mode
|
|
|
|
|
|
2017-07-18 11:21:15 -04:00
|
|
|
now := nanotime()
|
|
|
|
|
work.tSweepTerm = now
|
2015-10-23 15:17:04 -04:00
|
|
|
work.pauseStart = now
|
2017-07-21 14:25:28 -04:00
|
|
|
if trace.enabled {
|
|
|
|
|
traceGCSTWStart(1)
|
|
|
|
|
}
|
2015-05-15 16:00:50 -04:00
|
|
|
systemstack(stopTheWorldWithSema)
|
runtime: remove sweep wait loop in finishsweep_m
In general, finishsweep_m must block until any spans that are
concurrently being swept have been swept. It accomplishes this by
looping over all spans, which, as in the previous commit, takes
~1ms/heap GB. Unfortunately, we do this during the STW sweep
termination phase, so multi-gigabyte heaps can push our STW time past
10ms.
However, there's no need to do this wait if the world is stopped
because, in effect, stopping the world already had to wait for
anything that was sweeping (and if it didn't, the wait in
finishsweep_m would deadlock). Hence, we can simply skip this loop if
the world is stopped, such as during sweep termination. In fact,
currently all calls to finishsweep_m are STW, but this hasn't always
been the case and may not be the case in the future, so we keep the
logic around.
For 24GB heaps, this reduces max pause time by 75% relative to tip and
by 90% relative to Go 1.5. Notably, all pauses are now well under
10ms. Here are the results for the garbage benchmark:
------------- max pause ------------
Heap Procs after change before change 1.5.1
24GB 12 3.8ms 16ms 37ms
24GB 4 3.7ms 16ms 37ms
4GB 4 3.7ms 3ms 6.9ms
In the 4GB/4P case, it seems the "before change" run got lucky: the
max went up, but the 99%ile pause time went down from 3ms to 2.04ms.
Change-Id: Ica22189559f231d408ef2815019c9dbb5f38bf31
Reviewed-on: https://go-review.googlesource.com/15071
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-09-26 14:00:57 -04:00
|
|
|
// Finish sweep before we start concurrent scan.
|
|
|
|
|
systemstack(func() {
|
2016-10-05 21:22:33 -04:00
|
|
|
finishsweep_m()
|
2015-09-26 14:00:57 -04:00
|
|
|
})
|
runtime: add new mcentral implementation
Currently mcentral is implemented as a couple of linked lists of spans
protected by a lock. Unfortunately this design leads to significant lock
contention.
The span ownership model is also confusing and complicated. In-use spans
jump between being owned by multiple sources, generally some combination
of a gcSweepBuf, a concurrent sweeper, an mcentral or an mcache.
So first to address contention, this change replaces those linked lists
with gcSweepBufs which have an atomic fast path. Then, we change up the
ownership model: a span may be simultaneously owned only by an mcentral
and the page reclaimer. Otherwise, an mcentral (which now consists of
sweep bufs), a sweeper, or an mcache are the sole owners of a span at
any given time. This dramatically simplifies reasoning about span
ownership in the runtime.
As a result of this new ownership model, sweeping is now driven by
walking over the mcentrals rather than having its own global list of
spans. Because we no longer have a global list and we traditionally
haven't used the mcentrals for large object spans, we no longer have
anywhere to put large objects. So, this change also makes it so that we
keep large object spans in the appropriate mcentral lists.
In terms of the static lock ranking, we add the spanSet spine locks in
pretty much the same place as the mcentral locks, since they have the
potential to be manipulated both on the allocation and sweep paths, like
the mcentral locks.
This new implementation is turned on by default via a feature flag
called go115NewMCentralImpl.
Benchmark results for 1 KiB allocation throughput (5 runs each):
name \ MiB/s go113 go114 gotip gotip+this-patch
AllocKiB-1 1.71k ± 1% 1.68k ± 1% 1.59k ± 2% 1.71k ± 1%
AllocKiB-2 2.46k ± 1% 2.51k ± 1% 2.54k ± 1% 2.93k ± 1%
AllocKiB-4 4.27k ± 1% 4.41k ± 2% 4.33k ± 1% 5.01k ± 2%
AllocKiB-8 4.38k ± 3% 5.24k ± 1% 5.46k ± 1% 8.23k ± 1%
AllocKiB-12 4.38k ± 3% 4.49k ± 1% 5.10k ± 1% 10.04k ± 0%
AllocKiB-16 4.31k ± 1% 4.14k ± 3% 4.22k ± 0% 10.42k ± 0%
AllocKiB-20 4.26k ± 1% 3.98k ± 1% 4.09k ± 1% 10.46k ± 3%
AllocKiB-24 4.20k ± 1% 3.97k ± 1% 4.06k ± 1% 10.74k ± 1%
AllocKiB-28 4.15k ± 0% 4.00k ± 0% 4.20k ± 0% 10.76k ± 1%
Fixes #37487.
Change-Id: I92d47355acacf9af2c41bf080c08a8c1638ba210
Reviewed-on: https://go-review.googlesource.com/c/go/+/221182
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2020-02-20 20:58:45 +00:00
|
|
|
|
2015-03-05 17:33:08 -05:00
|
|
|
// clearpools before we start the GC. If we wait, the memory will not be
|
|
|
|
|
// reclaimed until the next GC cycle.
|
|
|
|
|
clearpools()
|
2015-02-19 15:48:40 -05:00
|
|
|
|
2017-02-23 21:50:19 -05:00
|
|
|
work.cycles++
|
2015-03-12 12:08:47 -04:00
|
|
|
|
2018-08-13 16:30:54 -04:00
|
|
|
gcController.startCycle()
|
2021-04-01 18:38:14 +00:00
|
|
|
work.heapGoal = gcController.heapGoal
|
2018-08-13 16:30:54 -04:00
|
|
|
|
|
|
|
|
// In STW mode, disable scheduling of user Gs. This may also
|
|
|
|
|
// disable scheduling of this goroutine, so it may block as
|
|
|
|
|
// soon as we start the world again.
|
|
|
|
|
if mode != gcBackgroundMode {
|
|
|
|
|
schedEnableUser(false)
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Enter concurrent mark phase and enable
|
|
|
|
|
// write barriers.
|
|
|
|
|
//
|
|
|
|
|
// Because the world is stopped, all Ps will
|
|
|
|
|
// observe that write barriers are enabled by
|
|
|
|
|
// the time we start the world and begin
|
|
|
|
|
// scanning.
|
|
|
|
|
//
|
|
|
|
|
// Write barriers must be enabled before assists are
|
|
|
|
|
// enabled because they must be enabled before
|
|
|
|
|
// any non-leaf heap objects are marked. Since
|
|
|
|
|
// allocations are blocked until assists can
|
|
|
|
|
// happen, we want to enable assists as early as
|
|
|
|
|
// possible.
|
|
|
|
|
setGCPhase(_GCmark)
|
|
|
|
|
|
|
|
|
|
gcBgMarkPrepare() // Must happen before assist enable.
|
|
|
|
|
gcMarkRootPrepare()
|
|
|
|
|
|
|
|
|
|
// Mark all active tinyalloc blocks. Since we're
|
|
|
|
|
// allocating from these, they need to be black like
|
|
|
|
|
// other allocations. The alternative is to blacken
|
|
|
|
|
// the tiny block on every allocation from it, which
|
|
|
|
|
// would slow down the tiny allocator.
|
|
|
|
|
gcMarkTinyAllocs()
|
|
|
|
|
|
|
|
|
|
// At this point all Ps have enabled the write
|
|
|
|
|
// barrier, thus maintaining the no white to
|
|
|
|
|
// black invariant. Enable mutator assists to
|
|
|
|
|
// put back-pressure on fast allocating
|
|
|
|
|
// mutators.
|
|
|
|
|
atomic.Store(&gcBlackenEnabled, 1)
|
|
|
|
|
|
|
|
|
|
// Assists and workers can start the moment we start
|
|
|
|
|
// the world.
|
|
|
|
|
gcController.markStartTime = now
|
|
|
|
|
|
2020-04-15 18:01:00 +00:00
|
|
|
// In STW mode, we could block the instant systemstack
|
|
|
|
|
// returns, so make sure we're not preemptible.
|
|
|
|
|
mp = acquirem()
|
|
|
|
|
|
2018-08-13 16:30:54 -04:00
|
|
|
// Concurrent mark.
|
|
|
|
|
systemstack(func() {
|
|
|
|
|
now = startTheWorldWithSema(trace.enabled)
|
2015-10-24 21:19:52 -04:00
|
|
|
work.pauseNS += now - work.pauseStart
|
2015-10-23 15:17:04 -04:00
|
|
|
work.tMark = now
|
2020-08-06 21:59:13 +00:00
|
|
|
memstats.gcPauseDist.record(now - work.pauseStart)
|
2018-08-13 16:30:54 -04:00
|
|
|
})
|
runtime: don't hold worldsema across mark phase
This change makes it so that worldsema isn't held across the mark phase.
This means that various operations like ReadMemStats may now stop the
world during the mark phase, reducing latency on such operations.
Only three such operations are still not allowed to occur during
marking: GOMAXPROCS, StartTrace, and StopTrace.
For the first, it's because any change to GOMAXPROCS impacts GC mark
background worker scheduling and the details there are tricky.
For the latter two it's because tracing needs to observe consistent GC
start and GC end events, and if StartTrace or StopTrace may stop the
world during marking, then it's possible for it to see a GC end event
without a start or GC start event without an end, respectively.
To ensure that GOMAXPROCS and StartTrace/StopTrace cannot proceed until
marking is complete, the runtime now holds a new semaphore, gcsema,
across the mark phase just like it used to with worldsema.
This change is being landed once more after being reverted in the Go
1.14 release cycle, since CL 215157 allows it to have a positive
effect on system performance.
For the benchmark BenchmarkReadMemStatsLatency in the runtime, which
measures ReadMemStats latencies while the GC is exercised, the tail of
these latencies reduced dramatically on an 8-core machine:
name                   old 50%tile-ns  new 50%tile-ns  delta
ReadMemStatsLatency-8  4.40M ±74%      0.12M ± 2%      -97.35%  (p=0.008 n=5+5)
name                   old 90%tile-ns  new 90%tile-ns  delta
ReadMemStatsLatency-8  102M ± 6%       0M ±14%         -99.79%  (p=0.008 n=5+5)
name                   old 99%tile-ns  new 99%tile-ns  delta
ReadMemStatsLatency-8  147M ±18%       4M ±57%         -97.43%  (p=0.008 n=5+5)
Fixes #19812.
Change-Id: If66c3c97d171524ae29f0e7af4bd33509d9fd0bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/216557
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
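The split between the two semaphores described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: `sema`, `collector`, `startMark`, `finishMark`, `canStopWorld`, and `canChangeProcs` are hypothetical names, and a buffered channel stands in for the runtime's semaphore.

```go
package main

// sema is a binary semaphore built on a buffered channel of capacity 1.
type sema chan struct{}

func newSema() sema { return make(sema, 1) }

func (s sema) acquire() { s <- struct{}{} }
func (s sema) release() { <-s }

// tryAcquire takes the semaphore only if it is free right now.
func (s sema) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// collector models the two locks: gcSema is held for the whole mark
// phase, worldSema only while the world is actually stopped.
type collector struct {
	gcSema, worldSema sema
}

// startMark takes both semaphores, then drops worldSema once the
// (elided) stop-the-world setup is done and marking is under way.
func (c *collector) startMark() {
	c.gcSema.acquire()
	c.worldSema.acquire()
	// ... stop the world, set up marking, start the world ...
	c.worldSema.release()
}

// finishMark reacquires worldSema for mark termination, then
// releases both semaphores.
func (c *collector) finishMark() {
	c.worldSema.acquire()
	// ... mark termination ...
	c.worldSema.release()
	c.gcSema.release()
}

// canStopWorld models an operation such as ReadMemStats that only
// needs worldSema.
func (c *collector) canStopWorld() bool {
	if !c.worldSema.tryAcquire() {
		return false
	}
	c.worldSema.release()
	return true
}

// canChangeProcs models an operation such as GOMAXPROCS that must
// wait for marking to finish, so it also needs gcSema.
func (c *collector) canChangeProcs() bool {
	if !c.gcSema.tryAcquire() {
		return false
	}
	c.gcSema.release()
	return true
}
```

During the modeled mark phase `canStopWorld` succeeds while `canChangeProcs` does not, which is the behavior the commit message attributes to ReadMemStats and GOMAXPROCS respectively.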
2019-06-17 19:03:09 +00:00
|
|
|
|
|
|
|
|
// Release the world sema before Gosched() in STW mode
|
|
|
|
|
// because we will need to reacquire it later but before
|
|
|
|
|
// this goroutine becomes runnable again, and we could
|
|
|
|
|
// self-deadlock otherwise.
|
|
|
|
|
semrelease(&worldsema)
|
2020-04-15 18:01:00 +00:00
|
|
|
releasem(mp)
|
2019-06-17 19:03:09 +00:00
|
|
|
|
2020-04-15 18:01:00 +00:00
|
|
|
// Make sure we block instead of returning to user code
|
|
|
|
|
// in STW mode.
|
2018-08-13 16:30:54 -04:00
|
|
|
if mode != gcBackgroundMode {
|
|
|
|
|
Gosched()
|
2015-10-23 15:55:03 -04:00
|
|
|
}
|
|
|
|
|
|
2017-02-23 11:54:43 -05:00
|
|
|
semrelease(&work.startSema)
|
2015-10-23 15:55:03 -04:00
|
|
|
}
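The ordering constraint spelled out in the comments of the function above (write barriers must be on before assists are enabled) can be illustrated with a small state sketch. The names `phase`, `gcState`, `setPhase`, and `enableAssists` are hypothetical and exist only for this example. The rule that the write barrier is on during mark and mark termination is an assumption of this sketch, based on the comments above that it is enabled on entering the mark phase and remains enabled when mark termination starts.

```go
package main

import "errors"

// phase mirrors the three GC phases named in the code above.
type phase int

const (
	phaseOff phase = iota
	phaseMark
	phaseMarkTermination
)

// gcState is a toy model of the flags that gcStart flips.
type gcState struct {
	phase        phase
	writeBarrier bool
	assists      bool
}

// setPhase changes the phase. In this sketch the write barrier is on
// exactly during mark and mark termination.
func (s *gcState) setPhase(p phase) {
	s.phase = p
	s.writeBarrier = p == phaseMark || p == phaseMarkTermination
}

var errNoWriteBarrier = errors.New("assists enabled before write barrier")

// enableAssists refuses to turn assists on unless the write barrier
// is already enabled, making the ordering requirement explicit.
func (s *gcState) enableAssists() error {
	if !s.writeBarrier {
		return errNoWriteBarrier
	}
	s.assists = true
	return nil
}
```

Calling `enableAssists` before `setPhase(phaseMark)` returns an error in this model; in the function above the same order is kept by calling `setGCPhase(_GCmark)` before storing 1 to `gcBlackenEnabled`.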
|
|
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// gcMarkDoneFlushed counts the number of P's with flushed work.
|
|
|
|
|
//
|
|
|
|
|
// Ideally this would be a captured local in gcMarkDone, but forEachP
|
|
|
|
|
// escapes its callback closure, so it can't capture anything.
|
2015-10-24 21:30:59 -04:00
|
|
|
//
|
2018-08-03 17:13:09 -04:00
|
|
|
// This is protected by markDoneSema.
|
|
|
|
|
var gcMarkDoneFlushed uint32
|
|
|
|
|
|
|
|
|
|
// gcMarkDone transitions the GC from mark to mark termination if all
|
|
|
|
|
// reachable objects have been marked (that is, there are no grey
|
|
|
|
|
// objects and can be no more in the future). Otherwise, it flushes
|
|
|
|
|
// all local work to the global queues where it can be discovered by
|
|
|
|
|
// other workers.
|
|
|
|
|
//
|
|
|
|
|
// This should be called when all local mark work has been drained and
|
|
|
|
|
// there are no remaining workers. Specifically, when
|
|
|
|
|
//
|
|
|
|
|
// work.nwait == work.nproc && !gcMarkWorkAvailable(p)
|
2015-11-23 11:37:12 -05:00
|
|
|
//
|
|
|
|
|
// The calling context must be preemptible.
|
|
|
|
|
//
|
2018-08-03 17:13:09 -04:00
|
|
|
// Flushing local work is important because idle Ps may have local
|
|
|
|
|
// work queued. This is the only way to make that work visible and
|
|
|
|
|
// drive GC to completion.
|
|
|
|
|
//
|
|
|
|
|
// It is explicitly okay to have write barriers in this function. If
|
|
|
|
|
// it does transition to mark termination, then all reachable objects
|
|
|
|
|
// have been marked, so the write barrier cannot shade any more
|
|
|
|
|
// objects.
|
2015-10-24 21:30:59 -04:00
|
|
|
func gcMarkDone() {
|
2018-08-03 17:13:09 -04:00
|
|
|
// Ensure only one thread is running the ragged barrier at a
|
|
|
|
|
// time.
|
2016-12-13 16:45:55 +01:00
|
|
|
semacquire(&work.markDoneSema)
|
2015-10-24 21:30:59 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
top:
|
2015-10-26 11:27:37 -04:00
|
|
|
// Re-check transition condition under transition lock.
|
2018-08-03 17:13:09 -04:00
|
|
|
//
|
|
|
|
|
// It's critical that this checks the global work queues are
|
|
|
|
|
// empty before performing the ragged barrier. Otherwise,
|
|
|
|
|
// there could be global work that a P could take after the P
|
|
|
|
|
// has passed the ragged barrier.
|
2015-10-26 11:27:37 -04:00
|
|
|
if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
|
|
|
|
|
semrelease(&work.markDoneSema)
|
|
|
|
|
return
|
|
|
|
|
}
|
2015-06-01 18:16:03 -04:00
|
|
|
|
2019-06-17 19:03:09 +00:00
|
|
|
// forEachP needs worldsema to execute, and we'll need it to
|
|
|
|
|
// stop the world later, so acquire worldsema now.
|
|
|
|
|
semacquire(&worldsema)
|
|
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Flush all local buffers and collect flushedWork flags.
|
|
|
|
|
gcMarkDoneFlushed = 0
|
|
|
|
|
systemstack(func() {
|
2018-12-06 21:51:51 +00:00
|
|
|
gp := getg().m.curg
|
|
|
|
|
// Mark the user stack as preemptible so that it may be scanned.
|
|
|
|
|
// Otherwise, our attempt to force all P's to a safepoint could
|
|
|
|
|
// result in a deadlock as we attempt to preempt a worker that's
|
|
|
|
|
// trying to preempt us (e.g. for a stack scan).
|
|
|
|
|
casgstatus(gp, _Grunning, _Gwaiting)
|
2018-08-03 17:13:09 -04:00
|
|
|
forEachP(func(_p_ *p) {
|
|
|
|
|
// Flush the write barrier buffer, since this may add
|
|
|
|
|
// work to the gcWork.
|
|
|
|
|
wbBufFlush1(_p_)
|
2020-10-14 17:18:27 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Flush the gcWork, since this may create global work
|
|
|
|
|
// and set the flushedWork flag.
|
|
|
|
|
//
|
|
|
|
|
// TODO(austin): Break up these workbufs to
|
|
|
|
|
// better distribute work.
|
|
|
|
|
_p_.gcw.dispose()
|
|
|
|
|
// Collect the flushedWork flag.
|
|
|
|
|
if _p_.gcw.flushedWork {
|
|
|
|
|
atomic.Xadd(&gcMarkDoneFlushed, 1)
|
|
|
|
|
_p_.gcw.flushedWork = false
|
|
|
|
|
}
|
2015-06-01 18:16:03 -04:00
|
|
|
})
|
2018-12-06 21:51:51 +00:00
|
|
|
casgstatus(gp, _Gwaiting, _Grunning)
|
2018-08-03 17:13:09 -04:00
|
|
|
})
|
2015-06-01 18:16:03 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
if gcMarkDoneFlushed != 0 {
|
|
|
|
|
// More grey objects were discovered since the
|
|
|
|
|
// previous termination check, so there may be more
|
|
|
|
|
// work to do. Keep going. It's possible the
|
|
|
|
|
// transition condition became true again during the
|
|
|
|
|
// ragged barrier, so re-check it.
|
2019-06-17 19:03:09 +00:00
|
|
|
semrelease(&worldsema)
|
2018-08-03 17:13:09 -04:00
|
|
|
goto top
|
|
|
|
|
}
|
2015-03-18 11:22:12 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// There was no global work, no local work, and no Ps
|
|
|
|
|
// communicated work since we took markDoneSema. Therefore
|
|
|
|
|
// there are no grey objects and no more objects can be
|
|
|
|
|
// shaded. Transition to mark termination.
|
|
|
|
|
now := nanotime()
|
|
|
|
|
work.tMarkTerm = now
|
|
|
|
|
work.pauseStart = now
|
|
|
|
|
getg().m.preemptoff = "gcing"
|
|
|
|
|
if trace.enabled {
|
|
|
|
|
traceGCSTWStart(0)
|
|
|
|
|
}
|
|
|
|
|
systemstack(stopTheWorldWithSema)
|
|
|
|
|
// The gcphase is _GCmark, it will transition to _GCmarktermination
|
|
|
|
|
// below. The important thing is that the wb remains active until
|
|
|
|
|
// all marking is complete. This includes writes made by the GC.
|
2015-03-12 17:56:14 -04:00
|
|
|
|
2020-10-14 17:18:27 -04:00
|
|
|
// There is sometimes work left over when we enter mark termination due
|
|
|
|
|
// to write barriers performed after the completion barrier above.
|
|
|
|
|
// Detect this and resume concurrent mark. This is obviously
|
|
|
|
|
// unfortunate.
|
|
|
|
|
//
|
|
|
|
|
// See issue #27993 for details.
|
|
|
|
|
//
|
|
|
|
|
// Switch to the system stack to call wbBufFlush1, though in this case
|
|
|
|
|
// it doesn't matter because we're non-preemptible anyway.
|
|
|
|
|
restart := false
|
|
|
|
|
systemstack(func() {
|
2018-11-26 14:41:23 -05:00
|
|
|
for _, p := range allp {
|
2020-10-14 17:18:27 -04:00
|
|
|
wbBufFlush1(p)
|
|
|
|
|
if !p.gcw.empty() {
|
|
|
|
|
restart = true
|
|
|
|
|
break
|
2018-11-26 14:41:23 -05:00
|
|
|
}
|
|
|
|
|
}
|
2020-10-14 17:18:27 -04:00
|
|
|
})
|
|
|
|
|
if restart {
|
|
|
|
|
getg().m.preemptoff = ""
|
2019-01-03 14:48:30 -05:00
|
|
|
systemstack(func() {
|
2020-10-14 17:18:27 -04:00
|
|
|
now := startTheWorldWithSema(true)
|
|
|
|
|
work.pauseNS += now - work.pauseStart
|
2020-08-06 21:59:13 +00:00
|
|
|
memstats.gcPauseDist.record(now - work.pauseStart)
|
2019-01-03 14:48:30 -05:00
|
|
|
})
|
2020-10-14 17:18:27 -04:00
|
|
|
semrelease(&worldsema)
|
|
|
|
|
goto top
|
2018-11-26 14:41:23 -05:00
|
|
|
}
|
|
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Disable assists and background workers. We must do
|
|
|
|
|
// this before waking blocked assists.
|
|
|
|
|
atomic.Store(&gcBlackenEnabled, 0)
|
2016-01-15 13:28:41 -05:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Wake all blocked assists. These will run when we
|
|
|
|
|
// start the world again.
|
|
|
|
|
gcWakeAllAssists()
|
2015-10-14 21:31:33 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Likewise, release the transition lock. Blocked
|
|
|
|
|
// workers and assists will run when we start the
|
|
|
|
|
// world again.
|
|
|
|
|
semrelease(&work.markDoneSema)
|
2015-10-26 11:27:37 -04:00
|
|
|
|
2018-08-13 16:30:54 -04:00
|
|
|
// In STW mode, re-enable user goroutines. These will be
|
|
|
|
|
// queued to run after we start the world.
|
|
|
|
|
schedEnableUser(true)
|
|
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// endCycle depends on all gcWork cache stats being flushed.
|
|
|
|
|
// The termination algorithm above ensured this, up to
|
|
|
|
|
// allocations since the ragged barrier.
|
2021-04-01 19:09:40 +00:00
|
|
|
nextTriggerRatio := gcController.endCycle(work.userForced)
|
2015-10-26 11:27:37 -04:00
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Perform mark termination. This will restart the world.
|
|
|
|
|
gcMarkTermination(nextTriggerRatio)
|
2015-10-26 11:27:37 -04:00
|
|
|
}
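The termination check performed by gcMarkDone above (check the global queue, flush every local buffer, retry if anything was flushed) can be sketched in a single-threaded form. This is a simplified model under stated assumptions: `worker`, `marker`, `flushAll`, and `markDone` are hypothetical names, integers stand in for grey objects, and the real function additionally deals with semaphores, safepoints, and stopping the world.

```go
package main

// worker holds locally buffered grey objects and a flag recording
// whether it has pushed work to the global queue since the last check.
type worker struct {
	local       []int
	flushedWork bool
}

// marker owns the global work queue and the set of workers.
type marker struct {
	global  []int
	workers []*worker
}

// flushAll moves every local buffer to the global queue and returns
// how many workers had flushed work, clearing their flags as it goes.
func (m *marker) flushAll() int {
	flushed := 0
	for _, w := range m.workers {
		if len(w.local) > 0 {
			m.global = append(m.global, w.local...)
			w.local = nil
			w.flushedWork = true
		}
		if w.flushedWork {
			flushed++
			w.flushedWork = false
		}
	}
	return flushed
}

// markDone reports whether marking can transition to termination.
// It returns false if global work exists, and re-checks from the top
// whenever a flush round found flushed work.
func (m *marker) markDone() bool {
	for {
		if len(m.global) != 0 {
			return false // workers must keep draining
		}
		if m.flushAll() != 0 {
			continue // something was flushed; re-check the condition
		}
		return true // no global work, no local work, nothing flushed
	}
}
```

The global queue is checked before each flush round, matching the comment above that this order is critical; a round that reports zero flushed workers is what justifies the conclusion that no grey objects remain.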
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2020-07-07 17:55:40 -04:00
|
|
|
// World must be stopped and mark assists and background workers must be
|
|
|
|
|
// disabled.
|
2017-03-31 17:09:41 -04:00
|
|
|
func gcMarkTermination(nextTriggerRatio float64) {
|
2020-07-07 17:55:40 -04:00
|
|
|
// Start marktermination (write barrier remains enabled for now).
|
runtime: replace needwb() with writeBarrierEnabled
Reduce the write barrier check to a single load and compare
so that it can be inlined into write barrier use sites.
Makes the standard write barrier a little faster too.
name                            old                   new                   delta
BenchmarkBinaryTree17           17.9s × (0.99,1.01)   17.9s × (1.00,1.01)   ~
BenchmarkFannkuch11             4.35s × (1.00,1.00)   4.43s × (1.00,1.00)   +1.81%
BenchmarkFmtFprintfEmpty        120ns × (0.93,1.06)   110ns × (1.00,1.06)   -7.92%
BenchmarkFmtFprintfString       479ns × (0.99,1.00)   487ns × (0.99,1.00)   +1.67%
BenchmarkFmtFprintfInt          452ns × (0.99,1.02)   450ns × (0.99,1.00)   ~
BenchmarkFmtFprintfIntInt       766ns × (0.99,1.01)   762ns × (1.00,1.00)   ~
BenchmarkFmtFprintfPrefixedInt  576ns × (0.98,1.01)   584ns × (0.99,1.01)   ~
BenchmarkFmtFprintfFloat        730ns × (1.00,1.01)   738ns × (1.00,1.00)   +1.16%
BenchmarkFmtManyArgs            2.84µs × (0.99,1.00)  2.80µs × (1.00,1.01)  -1.22%
BenchmarkGobDecode              39.3ms × (0.98,1.01)  39.0ms × (0.99,1.00)  ~
BenchmarkGobEncode              39.5ms × (0.99,1.01)  37.8ms × (0.98,1.01)  -4.33%
BenchmarkGzip                   663ms × (1.00,1.01)   661ms × (0.99,1.01)   ~
BenchmarkGunzip                 143ms × (1.00,1.00)   142ms × (1.00,1.00)   ~
BenchmarkHTTPClientServer       132µs × (0.99,1.01)   132µs × (0.99,1.01)   ~
BenchmarkJSONEncode             57.4ms × (0.99,1.01)  56.3ms × (0.99,1.01)  -1.96%
BenchmarkJSONDecode             139ms × (0.99,1.00)   138ms × (0.99,1.01)   ~
BenchmarkMandelbrot200          6.03ms × (1.00,1.00)  6.01ms × (1.00,1.00)  ~
BenchmarkGoParse                10.3ms × (0.89,1.14)  10.2ms × (0.87,1.05)  ~
BenchmarkRegexpMatchEasy0_32    209ns × (1.00,1.00)   208ns × (1.00,1.00)   ~
BenchmarkRegexpMatchEasy0_1K    591ns × (0.99,1.00)   588ns × (1.00,1.00)   ~
BenchmarkRegexpMatchEasy1_32    184ns × (0.99,1.02)   182ns × (0.99,1.01)   ~
BenchmarkRegexpMatchEasy1_1K    1.01µs × (1.00,1.00)  0.99µs × (1.00,1.01)  -2.33%
BenchmarkRegexpMatchMedium_32   330ns × (1.00,1.00)   323ns × (1.00,1.01)   -2.12%
BenchmarkRegexpMatchMedium_1K   92.6µs × (1.00,1.00)  89.9µs × (1.00,1.00)  -2.92%
BenchmarkRegexpMatchHard_32     4.80µs × (0.95,1.00)  4.72µs × (0.95,1.01)  ~
BenchmarkRegexpMatchHard_1K     136µs × (1.00,1.00)   133µs × (1.00,1.01)   -1.86%
BenchmarkRevcomp                900ms × (0.99,1.04)   900ms × (1.00,1.05)   ~
BenchmarkTemplate               172ms × (1.00,1.00)   168ms × (0.99,1.01)   -2.07%
BenchmarkTimeParse              637ns × (1.00,1.00)   637ns × (1.00,1.00)   ~
BenchmarkTimeFormat             744ns × (1.00,1.01)   738ns × (1.00,1.00)   -0.67%
Change-Id: I4ecc925805da1f5ee264377f1f7574f54ee575e7
Reviewed-on: https://go-review.googlesource.com/9321
Reviewed-by: Austin Clements <austin@google.com>
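The commit message above reduces the write barrier check to a single load and compare so it can be inlined at each use site. A minimal sketch of what such a guarded pointer store might look like follows. Only the flag name `writeBarrierEnabled` comes from the commit message; `shaded`, `shade`, and `writePointer` are hypothetical. Shading just the newly written pointer (a Dijkstra-style insertion barrier) is a simplification chosen for brevity, not a claim about the runtime's actual barrier.

```go
package main

// writeBarrierEnabled stands in for the single flag named in the
// commit message. Checking it costs one load and one compare.
var writeBarrierEnabled bool

// shaded records the pointers greyed by the barrier. This bookkeeping
// is hypothetical and exists only so the effect can be observed.
var shaded []*int

// shade marks p as grey so the collector will scan it later.
func shade(p *int) {
	if p != nil {
		shaded = append(shaded, p)
	}
}

// writePointer stores ptr into *slot. The barrier work happens only
// when the flag is set, so the fast path is a plain store.
func writePointer(slot **int, ptr *int) {
	if writeBarrierEnabled {
		shade(ptr)
	}
	*slot = ptr
}
```

Because the guard is a plain boolean test, the common case with the collector idle adds very little to each pointer store, which is consistent with the small deltas in the benchmark table above.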
2015-04-24 14:00:55 -04:00
|
|
|
setGCPhase(_GCmarktermination)
|
2015-03-05 17:33:08 -05:00
|
|
|
|
2021-03-31 22:55:06 +00:00
|
|
|
work.heap1 = gcController.heapLive
|
2015-02-19 13:38:46 -05:00
|
|
|
startTime := nanotime()
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-03-23 18:15:14 -04:00
|
|
|
mp := acquirem()
|
|
|
|
|
mp.preemptoff = "gcing"
|
2015-02-19 16:43:27 -05:00
|
|
|
_g_ := getg()
|
|
|
|
|
_g_.m.traceback = 2
|
|
|
|
|
gp := _g_.m.curg
|
|
|
|
|
casgstatus(gp, _Grunning, _Gwaiting)
|
2018-03-06 21:28:24 -08:00
|
|
|
gp.waitreason = waitReasonGarbageCollection
|
2015-02-19 16:43:27 -05:00
|
|
|
|
2016-03-01 23:21:55 +00:00
|
|
|
// Run gc on the g0 stack. We do this so that the g stack
|
|
|
|
|
// we're currently running on will no longer change. Cuts
|
2015-02-19 13:38:46 -05:00
|
|
|
// the root set down a bit (g0 stacks are not scanned, and
|
|
|
|
|
// we don't need to scan gc's internal state). We also
|
|
|
|
|
// need to switch to g0 so we can shrink the stack.
|
2015-02-19 16:21:00 -05:00
|
|
|
systemstack(func() {
|
2017-04-03 12:10:56 -04:00
|
|
|
gcMark(startTime)
|
2015-07-30 19:39:16 -04:00
|
|
|
// Must return immediately.
|
|
|
|
|
// The outer function's stack may have moved
|
|
|
|
|
// during gcMark (it shrinks stacks, including the
|
|
|
|
|
// outer function's stack), so we must not refer
|
|
|
|
|
// to any of its variables. Return back to the
|
|
|
|
|
// non-system stack to pick up the new addresses
|
|
|
|
|
// before continuing.
|
|
|
|
|
})
|
|
|
|
|
|
|
|
|
|
systemstack(func() {
|
2015-10-23 15:17:04 -04:00
|
|
|
work.heap2 = work.bytesMarked
|
2015-02-19 16:43:27 -05:00
|
|
|
if debug.gccheckmark > 0 {
|
2018-08-14 17:04:04 -04:00
|
|
|
// Run a full non-parallel, stop-the-world
|
|
|
|
|
// mark using checkmark bits, to check that we
|
|
|
|
|
// didn't forget to mark anything during the
|
|
|
|
|
// concurrent mark process.
|
2020-06-05 16:48:03 -04:00
|
|
|
startCheckmarks()
|
2015-06-26 13:56:58 -04:00
|
|
|
gcResetMarkState()
|
2018-08-14 17:04:04 -04:00
|
|
|
gcw := &getg().m.p.ptr().gcw
|
2018-08-16 12:25:38 -04:00
|
|
|
gcDrain(gcw, 0)
|
2018-08-14 17:04:04 -04:00
|
|
|
wbBufFlush1(getg().m.p.ptr())
|
|
|
|
|
gcw.dispose()
|
2020-06-05 16:48:03 -04:00
|
|
|
endCheckmarks()
|
2015-02-19 15:48:40 -05:00
|
|
|
}
|
2015-03-05 17:33:08 -05:00
|
|
|
|
|
|
|
|
// marking is complete so we can turn the write barrier off
|
2015-04-24 14:00:55 -04:00
|
|
|
setGCPhase(_GCoff)
|
2015-10-23 15:17:04 -04:00
|
|
|
gcSweep(work.mode)
|
2015-02-19 13:38:46 -05:00
|
|
|
})
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-02-19 16:43:27 -05:00
|
|
|
_g_.m.traceback = 0
|
|
|
|
|
casgstatus(gp, _Gwaiting, _Grunning)
|
2015-02-19 16:21:00 -05:00
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
if trace.enabled {
|
|
|
|
|
traceGCDone()
|
|
|
|
|
}
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
// all done
|
|
|
|
|
mp.preemptoff = ""
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-03-05 17:33:08 -05:00
|
|
|
if gcphase != _GCoff {
|
|
|
|
|
throw("gc done but gcphase != _GCoff")
|
|
|
|
|
}
|
|
|
|
|
|
2021-04-01 18:38:14 +00:00
|
|
|
// Record heapGoal and heap_inuse for scavenger.
|
|
|
|
|
gcController.lastHeapGoal = gcController.heapGoal
|
2019-09-03 19:54:32 +00:00
|
|
|
memstats.last_heap_inuse = memstats.heap_inuse
|
|
|
|
|
|
2017-04-03 12:10:56 -04:00
|
|
|
// Update GC trigger and pacing for the next cycle.
|
2021-04-01 16:31:29 +00:00
|
|
|
gcController.commit(nextTriggerRatio)
|
2017-04-03 12:10:56 -04:00
|
|
|
|
2015-07-01 11:04:19 -04:00
|
|
|
// Update timing memstats
|
2017-02-03 19:26:13 -05:00
|
|
|
now := nanotime()
|
|
|
|
|
sec, nsec, _ := time_now()
|
|
|
|
|
unixNow := sec*1e9 + int64(nsec)
|
2015-10-23 15:17:04 -04:00
|
|
|
work.pauseNS += now - work.pauseStart
|
|
|
|
|
work.tEnd = now
|
2020-08-06 21:59:13 +00:00
|
|
|
memstats.gcPauseDist.record(now - work.pauseStart)
|
2017-02-03 19:26:13 -05:00
|
|
|
atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
|
|
|
|
|
atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
|
2015-10-23 15:17:04 -04:00
|
|
|
memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
|
2015-07-01 11:04:19 -04:00
|
|
|
memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
|
2015-10-23 15:17:04 -04:00
|
|
|
memstats.pause_total_ns += uint64(work.pauseNS)
|
2015-07-01 11:04:19 -04:00
|
|
|
|
2015-07-29 14:02:34 -04:00
|
|
|
// Update work.totaltime.
|
2015-10-23 15:17:04 -04:00
|
|
|
sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
|
2015-07-29 14:02:34 -04:00
|
|
|
// We report idle marking time below, but omit it from the
|
|
|
|
|
// overall utilization here since it's "free".
|
|
|
|
|
markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
|
2015-10-23 15:17:04 -04:00
|
|
|
markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
|
runtime: perform concurrent scan in GC workers
Currently the concurrent root scan is performed in its entirety by the
GC coordinator before entering concurrent mark (which enables GC
workers). This scan is done sequentially, which can prolong the scan
phase, delay the mark phase, and means that the scan phase does not
obey the 25% CPU goal. Furthermore, there's no need to complete the
root scan before starting marking (in fact, we already allow GC
assists to happen during the scan phase), so this acts as an
unnecessary barrier between root scanning and marking.
This change shifts the root scan work out of the GC coordinator and in
to the GC workers. The coordinator simply sets up the scan state and
enqueues the right number of root scan jobs. The GC workers then drain
the root scan jobs prior to draining heap scan jobs.
This parallelizes the root scan process, makes it obey the 25% CPU
goal, and effectively eliminates root scanning as an isolated phase,
allowing the system to smoothly transition from root scanning to heap
marking. This also eliminates a major non-STW responsibility of the GC
coordinator, which will make it easier to switch to a decentralized
state machine. Finally, it puts us in a good position to perform root
scanning in assists as well, which will help satisfy assists at the
beginning of the GC cycle.
This is mostly straightforward. One tricky aspect is that we have to
deal with preemption deadlock: where two non-preemptible goroutines
are trying to preempt each other to perform a stack scan. Given the
context where this happens, the only instance of this is two
background workers trying to scan each other. We avoid this by simply
not scanning the stacks of background workers during the concurrent
phase; this is safe because we'll scan them during mark termination
(and their stacks are *very* small and should not contain any new
pointers).
This change also switches the root marking during mark termination to
use the same gcDrain-based code path as concurrent mark. This
shouldn't affect performance because STW root marking was already
parallel and tasks switched to heap marking immediately when no more
root marking tasks were available. However, it simplifies the code and
unifies these code paths.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark, possibly by making GC run slightly more
frequently.
name old time/op new time/op delta
XBenchGarbage-12 5.10ms ± 1% 5.24ms ± 1% +2.87% (p=0.000 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.25s ± 3% 3.20s ± 5% -1.57% (p=0.013 n=20+20)
Fannkuch11-12 2.45s ± 1% 2.46s ± 1% +0.38% (p=0.019 n=20+18)
FmtFprintfEmpty-12 49.7ns ± 3% 49.9ns ± 4% ~ (p=0.851 n=19+20)
FmtFprintfString-12 170ns ± 2% 170ns ± 1% ~ (p=0.775 n=20+19)
FmtFprintfInt-12 161ns ± 1% 160ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 267ns ± 1% 270ns ± 1% +1.04% (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12 238ns ± 2% 238ns ± 1% ~ (p=0.133 n=18+19)
FmtFprintfFloat-12 311ns ± 1% 310ns ± 2% -0.35% (p=0.023 n=20+19)
FmtManyArgs-12 1.08µs ± 1% 1.06µs ± 1% -2.31% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.63ms ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 6.49ms ± 1% 6.52ms ± 1% +0.37% (p=0.015 n=20+20)
Gzip-12 319ms ± 3% 318ms ± 1% ~ (p=0.975 n=19+17)
Gunzip-12 41.9ms ± 1% 42.1ms ± 2% +0.65% (p=0.004 n=19+20)
HTTPClientServer-12 61.7µs ± 1% 62.6µs ± 1% +1.40% (p=0.000 n=18+20)
JSONEncode-12 16.8ms ± 1% 16.9ms ± 1% ~ (p=0.239 n=20+18)
JSONDecode-12 58.4ms ± 1% 60.7ms ± 1% +3.85% (p=0.000 n=19+20)
Mandelbrot200-12 3.86ms ± 0% 3.86ms ± 1% ~ (p=0.092 n=18+19)
GoParse-12 3.75ms ± 2% 3.75ms ± 2% ~ (p=0.708 n=19+20)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 2% +0.60% (p=0.010 n=17+20)
RegexpMatchEasy0_1K-12 341ns ± 1% 342ns ± 2% ~ (p=0.203 n=20+19)
RegexpMatchEasy1_32-12 82.5ns ± 2% 83.2ns ± 2% +0.83% (p=0.007 n=19+19)
RegexpMatchEasy1_1K-12 495ns ± 1% 495ns ± 2% ~ (p=0.970 n=19+18)
RegexpMatchMedium_32-12 130ns ± 2% 130ns ± 2% +0.59% (p=0.039 n=19+20)
RegexpMatchMedium_1K-12 39.2µs ± 1% 39.3µs ± 1% ~ (p=0.214 n=18+18)
RegexpMatchHard_32-12 2.03µs ± 2% 2.02µs ± 1% ~ (p=0.166 n=18+19)
RegexpMatchHard_1K-12 61.0µs ± 1% 60.9µs ± 1% ~ (p=0.169 n=20+18)
Revcomp-12 533ms ± 1% 535ms ± 1% ~ (p=0.071 n=19+17)
Template-12 68.1ms ± 2% 73.0ms ± 1% +7.26% (p=0.000 n=19+20)
TimeParse-12 355ns ± 2% 356ns ± 2% ~ (p=0.530 n=19+20)
TimeFormat-12 357ns ± 2% 347ns ± 1% -2.59% (p=0.000 n=20+19)
[Geo mean] 62.1µs 62.3µs +0.31%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.9MB/s ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 118MB/s ± 1% 118MB/s ± 1% -0.37% (p=0.015 n=20+20)
Gzip-12 60.9MB/s ± 3% 60.9MB/s ± 1% ~ (p=0.944 n=19+17)
Gunzip-12 464MB/s ± 1% 461MB/s ± 2% -0.64% (p=0.004 n=19+20)
JSONEncode-12 115MB/s ± 1% 115MB/s ± 1% ~ (p=0.236 n=20+18)
JSONDecode-12 33.2MB/s ± 1% 32.0MB/s ± 1% -3.71% (p=0.000 n=19+20)
GoParse-12 15.5MB/s ± 2% 15.5MB/s ± 2% ~ (p=0.702 n=19+20)
RegexpMatchEasy0_32-12 320MB/s ± 1% 318MB/s ± 2% ~ (p=0.094 n=18+20)
RegexpMatchEasy0_1K-12 3.00GB/s ± 1% 2.99GB/s ± 1% ~ (p=0.194 n=20+19)
RegexpMatchEasy1_32-12 388MB/s ± 2% 385MB/s ± 2% -0.83% (p=0.008 n=19+19)
RegexpMatchEasy1_1K-12 2.07GB/s ± 1% 2.07GB/s ± 1% ~ (p=0.964 n=19+18)
RegexpMatchMedium_32-12 7.68MB/s ± 1% 7.64MB/s ± 2% -0.57% (p=0.020 n=19+20)
RegexpMatchMedium_1K-12 26.1MB/s ± 1% 26.1MB/s ± 1% ~ (p=0.211 n=18+18)
RegexpMatchHard_32-12 15.8MB/s ± 1% 15.8MB/s ± 1% ~ (p=0.180 n=18+19)
RegexpMatchHard_1K-12 16.8MB/s ± 1% 16.8MB/s ± 2% ~ (p=0.236 n=20+19)
Revcomp-12 477MB/s ± 1% 475MB/s ± 1% ~ (p=0.071 n=19+17)
Template-12 28.5MB/s ± 2% 26.6MB/s ± 1% -6.77% (p=0.000 n=19+20)
[Geo mean] 100MB/s 99.0MB/s -0.82%
Change-Id: I875bf6ceb306d1ee2f470cabf88aa6ede27c47a0
Reviewed-on: https://go-review.googlesource.com/16059
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-10-19 13:46:32 -04:00
|
|
|
cycleCpu := sweepTermCpu + markCpu + markTermCpu
|
2015-07-29 14:02:34 -04:00
|
|
|
work.totaltime += cycleCpu
|
|
|
|
|
|
|
|
|
|
// Compute overall GC CPU utilization.
|
|
|
|
|
totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
|
|
|
|
|
memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)
|
|
|
|
|
|
2015-12-14 15:07:40 -05:00
|
|
|
// Reset sweep state.
|
|
|
|
|
sweep.nbgsweep = 0
|
|
|
|
|
sweep.npausesweep = 0
|
|
|
|
|
|
2017-02-27 10:46:12 -05:00
|
|
|
if work.userForced {
|
|
|
|
|
memstats.numforcedgc++
|
|
|
|
|
}
|
|
|
|
|
|
2017-02-23 21:50:19 -05:00
|
|
|
// Bump GC cycle count and wake goroutines waiting on sweep.
|
|
|
|
|
lock(&work.sweepWaiters.lock)
|
|
|
|
|
memstats.numgc++
|
2018-08-10 10:33:05 -04:00
|
|
|
injectglist(&work.sweepWaiters.list)
|
2017-02-23 21:50:19 -05:00
|
|
|
unlock(&work.sweepWaiters.lock)
|
|
|
|
|
|
2017-03-01 21:03:20 -05:00
|
|
|
// Finish the current heap profiling cycle and start a new
|
|
|
|
|
// heap profiling cycle. We do this before starting the world
|
|
|
|
|
// so events don't leak into the wrong cycle.
|
|
|
|
|
mProf_NextCycle()
|
2017-03-01 13:58:22 -05:00
|
|
|
|
runtime: block sweep completion on all sweep paths
The runtime currently has two different notions of sweep completion:
1. All spans are either swept or have begun sweeping.
2. The sweeper has *finished* sweeping all spans.
Most things depend on condition 1. Notably, GC correctness depends on
condition 1, but since all sweep operations are non-preemptible, the STW
at the beginning of GC forces condition 1 to become condition 2.
runtime.GC(), however, depends on condition 2, since the intent is to
complete a full GC cycle, and also update the heap profile (which
can only be done after sweeping is complete).
However, the way we compute condition 2 is racy right now and may in
fact only indicate condition 1. Specifically, sweepone blocks
condition 2 until all sweepone calls are done, but there are many
other ways to enter the sweeper that don't block this. Hence, sweepone
may see that there are no more spans in the sweep list and see that
it's the last sweepone and declare sweeping done, while there's some
other sweeper still working on a span.
Fix this by making sure every entry to the sweeper participates in the
protocol that blocks condition 2. To make sure we get this right, this
CL introduces a type to track sweep blocking and (lightly) enforces
span sweep ownership via the type system. This has the nice
side-effect of abstracting the pattern of acquiring sweep ownership
that's currently repeated in many different places.
Fixes #45315.
Change-Id: I7fab30170c5ae14c8b2f10998628735b8be6d901
Reviewed-on: https://go-review.googlesource.com/c/go/+/307915
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2021-04-02 15:54:24 -04:00
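The two notions of sweep completion described in the commit message above can be illustrated with a small model. This is only a sketch under stated assumptions: sweepTracker and its methods are hypothetical names invented here, and the mutex-based bookkeeping is a simplification, not the runtime's sweepLocker. The point it shows is that "all spans claimed" (condition 1) can hold while "all sweepers finished" (condition 2) does not, unless every entry into the sweeper is counted.

```go
package main

import (
	"fmt"
	"sync"
)

// sweepTracker is a hypothetical, simplified model of sweep-completion
// tracking. It is not the runtime's implementation.
type sweepTracker struct {
	mu        sync.Mutex
	active    int // sweepers currently inside the sweeper
	unclaimed int // spans that no sweeper has started on yet
}

// begin registers a sweeper. Every path into the sweeper must call it,
// otherwise completion can be declared too early.
func (t *sweepTracker) begin() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active++
}

// claim takes ownership of one unswept span, reporting false if none remain.
func (t *sweepTracker) claim() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.unclaimed == 0 {
		return false
	}
	t.unclaimed--
	return true
}

// allClaimed reports condition 1: every span is swept or being swept.
func (t *sweepTracker) allClaimed() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.unclaimed == 0
}

// end unregisters a sweeper and reports condition 2: no spans remain
// unclaimed and no other sweeper is still working.
func (t *sweepTracker) end() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active--
	return t.active == 0 && t.unclaimed == 0
}

func main() {
	tr := &sweepTracker{unclaimed: 2}

	// Two sweepers enter and each claims one span.
	tr.begin()
	tr.claim()
	tr.begin()
	tr.claim()

	fmt.Println("condition 1 (all claimed):", tr.allClaimed())
	fmt.Println("first sweeper done, condition 2:", tr.end())
	fmt.Println("second sweeper done, condition 2:", tr.end())
}
```

If one of the two sweepers skipped begin, the other sweeper's end would report completion while a span was still being swept, which is the race the commit message describes.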
|
|
|
// There may be stale spans in mcaches that need to be swept.
|
|
|
|
|
// Those aren't tracked in any sweep lists, so we need to
|
|
|
|
|
// count them against sweep completion until we ensure all
|
|
|
|
|
// those spans have been forced out.
|
|
|
|
|
sl := newSweepLocker()
|
|
|
|
|
sl.blockCompletion()
|
|
|
|
|
|
2017-07-21 14:25:28 -04:00
|
|
|
systemstack(func() { startTheWorldWithSema(true) })
|
2015-10-27 17:48:18 -04:00
|
|
|
|
2017-03-01 13:58:22 -05:00
|
|
|
// Flush the heap profile so we can start a new cycle next GC.
|
|
|
|
|
// This is relatively expensive, so we don't do it with the
|
|
|
|
|
// world stopped.
|
2017-03-01 21:03:20 -05:00
|
|
|
mProf_Flush()
|
2016-09-11 20:03:14 -04:00
|
|
|
|
2017-03-20 17:25:59 -04:00
|
|
|
// Prepare workbufs for freeing by the sweeper. We do this
|
|
|
|
|
// asynchronously because it can take non-trivial time.
|
|
|
|
|
prepareFreeWorkbufs()
|
|
|
|
|
|
2015-10-27 17:48:18 -04:00
|
|
|
// Free stack spans. This must be done between GC cycles.
|
|
|
|
|
systemstack(freeStackSpans)
|
|
|
|
|
|
2018-08-23 13:14:19 -04:00
|
|
|
// Ensure all mcaches are flushed. Each P will flush its own
|
|
|
|
|
// mcache before allocating, but idle Ps may not. Since this
|
|
|
|
|
// is necessary to sweep all spans, we need to ensure all
|
|
|
|
|
// mcaches are flushed before we start the next GC cycle.
|
|
|
|
|
systemstack(func() {
|
|
|
|
|
forEachP(func(_p_ *p) {
|
|
|
|
|
_p_.mcache.prepareForSweep()
|
|
|
|
|
})
|
|
|
|
|
})
|
2021-04-02 15:54:24 -04:00
|
|
|
// Now that we've swept stale spans in mcaches, they don't
|
|
|
|
|
// count against unswept spans.
|
|
|
|
|
sl.dispose()
|
2018-08-23 13:14:19 -04:00
|
|
|
|
2015-12-14 15:19:07 -05:00
|
|
|
// Print gctrace before dropping worldsema. As soon as we drop
|
|
|
|
|
// worldsema another cycle could start and smash the stats
|
|
|
|
|
// we're trying to print.
|
2015-03-26 18:48:42 -04:00
|
|
|
if debug.gctrace > 0 {
|
2015-07-29 14:02:34 -04:00
|
|
|
util := int(memstats.gc_cpu_fraction * 100)
|
2015-04-01 13:47:35 -04:00
|
|
|
|
2015-03-26 18:48:42 -04:00
|
|
|
var sbuf [24]byte
|
|
|
|
|
printlock()
|
2015-07-20 15:48:53 -04:00
|
|
|
print("gc ", memstats.numgc,
|
2015-10-23 15:17:04 -04:00
|
|
|
" @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
|
runtime: increase precision of gctrace times
Currently we truncate gctrace clock and CPU times to millisecond
precision. As a result, many phases are typically printed as 0, which
is fine for user consumption, but makes gathering statistics and
reports over GC traces difficult.
In 1.4, the gctrace line printed times in microseconds. This was
better for statistics, but not as easy for users to read or interpret,
and it generally made the trace lines longer.
This change strikes a balance between these extremes by printing
milliseconds, but including the decimal part to two significant
figures down to microsecond precision. This remains easy to read and
interpret, but includes more precision when it's useful.
For example, where the code currently prints,
gc #29 @1.629s 0%: 0+2+0+12+0 ms clock, 0+2+0+0/12/0+0 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
this prints,
gc #29 @1.629s 0%: 0.005+2.1+0+12+0.29 ms clock, 0.005+2.1+0+0/12/0+0.29 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
Fixes #10970.
Change-Id: I249624779433927cd8b0947b986df9060c289075
Reviewed-on: https://go-review.googlesource.com/10554
Reviewed-by: Russ Cox <rsc@golang.org>
2015-05-30 21:47:00 -04:00
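The formatting rule described above (milliseconds, with the decimal part limited to two significant figures and microsecond precision) can be sketched as follows. This is an illustration only: formatNSAsMS is a hypothetical helper that returns a string via strconv, whereas the runtime's fmtNSAsMS writes into a caller-supplied byte buffer and cannot use strconv.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatNSAsMS is a hypothetical stand-in for the formatting rule in the
// commit message: print nanoseconds as milliseconds, keeping at most two
// significant digits and at most three decimal places.
func formatNSAsMS(ns uint64) string {
	if ns >= 10e6 {
		// 10ms or more: whole milliseconds are precise enough.
		return strconv.FormatUint(ns/1e6, 10)
	}
	x := ns / 1e3 // truncate to microseconds
	if x == 0 {
		return "0"
	}
	// Drop low-order digits until at most two remain, tracking how many
	// decimal places the remaining digits represent.
	dec := 3
	for x >= 100 {
		x /= 10
		dec--
	}
	s := strconv.FormatUint(x, 10)
	// Left-pad with zeros so there is at least one digit before the point.
	for len(s) <= dec {
		s = "0" + s
	}
	return s[:len(s)-dec] + "." + s[len(s)-dec:]
}

func main() {
	for _, ns := range []uint64{0, 5000, 290000, 2100000, 12000000} {
		fmt.Printf("%d ns -> %s ms\n", ns, formatNSAsMS(ns))
	}
}
```

With these inputs the sketch reproduces the values shown in the example trace line: 5000 ns prints as 0.005, 2.1 ms as 2.1, 290 µs as 0.29, and 12 ms as 12.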
|
|
|
util, "%: ")
|
2015-10-23 15:17:04 -04:00
|
|
|
prev := work.tSweepTerm
|
2016-01-08 14:57:26 -05:00
|
|
|
for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
|
2015-05-30 21:47:00 -04:00
|
|
|
if i != 0 {
|
|
|
|
|
print("+")
|
|
|
|
|
}
|
|
|
|
|
print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
|
|
|
|
|
prev = ns
|
|
|
|
|
}
|
|
|
|
|
print(" ms clock, ")
|
2016-01-08 14:57:26 -05:00
|
|
|
for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
|
|
|
|
|
if i == 2 || i == 3 {
|
2015-05-30 21:47:00 -04:00
|
|
|
// Separate mark time components with /.
|
|
|
|
|
print("/")
|
|
|
|
|
} else if i != 0 {
|
|
|
|
|
print("+")
|
|
|
|
|
}
|
|
|
|
|
print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
|
|
|
|
|
}
|
|
|
|
|
print(" ms cpu, ",
|
2015-10-23 15:17:04 -04:00
|
|
|
work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
|
|
|
|
|
work.heapGoal>>20, " MB goal, ",
|
|
|
|
|
work.maxprocs, " P")
|
2017-02-27 10:46:12 -05:00
|
|
|
if work.userForced {
|
2015-03-26 18:48:42 -04:00
|
|
|
print(" (forced)")
|
|
|
|
|
}
|
|
|
|
|
print("\n")
|
|
|
|
|
printunlock()
|
|
|
|
|
}
|
|
|
|
|
|
2015-12-14 15:19:07 -05:00
|
|
|
semrelease(&worldsema)
|
runtime: don't hold worldsema across mark phase
This change makes it so that worldsema isn't held across the mark phase.
This means that various operations like ReadMemStats may now stop the
world during the mark phase, reducing latency on such operations.
Only three such operations are still not allowed to occur during
marking: GOMAXPROCS, StartTrace, and StopTrace.
For the former it's because any change to GOMAXPROCS impacts GC mark
background worker scheduling and the details there are tricky.
For the latter two it's because tracing needs to observe consistent GC
start and GC end events, and if StartTrace or StopTrace may stop the
world during marking, then it's possible for it to see a GC end event
without a start or GC start event without an end, respectively.
To ensure that GOMAXPROCS and StartTrace/StopTrace cannot proceed until
marking is complete, the runtime now holds a new semaphore, gcsema,
across the mark phase just like it used to with worldsema.
This change is being landed once more after being reverted in the Go
1.14 release cycle, since CL 215157 allows it to have a positive
effect on system performance.
For the benchmark BenchmarkReadMemStatsLatency in the runtime, which
measures ReadMemStats latencies while the GC is exercised, the tail of
these latencies reduced dramatically on an 8-core machine:
name old 50%tile-ns new 50%tile-ns delta
ReadMemStatsLatency-8 4.40M ±74% 0.12M ± 2% -97.35% (p=0.008 n=5+5)
name old 90%tile-ns new 90%tile-ns delta
ReadMemStatsLatency-8 102M ± 6% 0M ±14% -99.79% (p=0.008 n=5+5)
name old 99%tile-ns new 99%tile-ns delta
ReadMemStatsLatency-8 147M ±18% 4M ±57% -97.43% (p=0.008 n=5+5)
Fixes #19812.
Change-Id: If66c3c97d171524ae29f0e7af4bd33509d9fd0bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/216557
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-06-17 19:03:09 +00:00
|
|
|
semrelease(&gcsema)
|
2015-12-14 15:19:07 -05:00
|
|
|
// Careful: another GC cycle may start now.
|
|
|
|
|
|
|
|
|
|
releasem(mp)
|
|
|
|
|
mp = nil
|
|
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
// now that gc is done, kick off finalizer thread if needed
|
|
|
|
|
if !concurrentSweep {
|
|
|
|
|
// give the queued finalizers, if any, a chance to run
|
|
|
|
|
Gosched()
|
2014-11-11 17:05:02 -05:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
runtime: manage gcBgMarkWorkers with a global pool
Background mark workers perform per-P marking work. Currently each
worker is assigned a P at creation time. The worker "attaches" to the P
via p.gcBgMarkWorker, making itself (usually) available to
findRunnableGCWorker for scheduling GC work.
While running gcMarkDone, the worker "detaches" from the P (by clearing
p.gcBgMarkWorker), since it may park for other reasons and should not be
scheduled by findRunnableGCWorker.
Unfortunately, this design is complex and difficult to reason about. We
simplify things by changing the design to eliminate the hard P
attachment. Rather than workers always performing work from the same P,
workers perform work for whichever P they find themselves on. On park,
the workers are placed in a pool of free workers, which each P's
findRunnableGCWorker can use to run a worker for its P.
Now if a worker parks in gcMarkDone, a P may simply use another worker
from the pool to complete its own work.
The P's GC worker mode is used to communicate the mode to run to the
selected worker. It is also used to emit the appropriate worker
EvGoStart tracepoint. This is a slight change, as this G may be
preempted (e.g., in gcMarkDone). When it is rescheduled, the trace
viewer will show it as a normal goroutine again. It is currently a bit
difficult to connect to the original worker tracepoint, as the viewer
does not display the goid for the original worker (though the data is in
the trace file).
Change-Id: Id7bd3a364dc18a4d2b1c99c4dc4810fae1293c1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/262348
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Trust: Michael Pratt <mpratt@google.com>
2020-10-13 12:39:13 -04:00
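The pool-based design described above can be modeled in a few lines. This is a rough sketch, not the runtime's code: workerPool, put, and get are hypothetical names, and a mutex-protected slice is used here purely for readability in place of whatever structure the runtime uses. It shows the key property from the commit message: workers are not tied to a P, so a P whose previous worker has parked elsewhere can simply take another one from the pool.

```go
package main

import (
	"fmt"
	"sync"
)

// worker stands in for a background mark worker goroutine.
type worker struct{ id int }

// workerPool is a hypothetical global pool of parked workers.
type workerPool struct {
	mu   sync.Mutex
	free []*worker
}

// put parks a worker in the pool, making it available to any P.
func (p *workerPool) put(w *worker) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.free = append(p.free, w)
}

// get removes and returns a parked worker, or nil if the pool is empty.
func (p *workerPool) get() *worker {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.free) == 0 {
		return nil
	}
	w := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return w
}

func main() {
	pool := &workerPool{}
	pool.put(&worker{id: 1})
	pool.put(&worker{id: 2})

	// A P takes whichever worker happens to be free.
	first := pool.get()

	// Suppose the first worker parks for another reason and is not
	// returned to the pool. The same P can take a different worker.
	second := pool.get()

	fmt.Println("took workers", first.id, "and", second.id)
	fmt.Println("pool empty:", pool.get() == nil)
}
```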
|
|
|
// gcBgMarkStartWorkers prepares background mark worker goroutines. These
|
|
|
|
|
// goroutines will not run until the mark phase, but they must be started while
|
|
|
|
|
// the world is not stopped and from a regular G stack. The caller must hold
|
|
|
|
|
// worldsema.
|
runtime: multi-threaded, utilization-scheduled background mark
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-03-23 21:07:33 -04:00
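The CPU-share arithmetic in the commit message above is easy to check by hand. The sketch below assumes the simple model stated there: a single GC goroutine that runs continuously occupies one of GOMAXPROCS Ps, so it consumes 1/GOMAXPROCS of the available CPU. singleGoroutineShare and markGoal are names made up for this illustration.

```go
package main

import "fmt"

// markGoal is the 25% background GC CPU goal named in the commit message.
const markGoal = 0.25

// singleGoroutineShare (hypothetical helper) returns the fraction of total
// CPU consumed by one goroutine running continuously on one of gomaxprocs Ps.
func singleGoroutineShare(gomaxprocs int) float64 {
	return 1 / float64(gomaxprocs)
}

func main() {
	for _, procs := range []int{1, 2, 4, 8} {
		share := singleGoroutineShare(procs)
		fmt.Printf("GOMAXPROCS=%d: share=%.3f goal=%.2f over=%t under=%t\n",
			procs, share, markGoal, share > markGoal, share < markGoal)
	}
}
```

Only GOMAXPROCS=4 lands exactly on the goal, which is why the change moves to per-P workers scheduled against a utilization target instead of a single fixed goroutine.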
|
|
|
func gcBgMarkStartWorkers() {
|
2020-10-13 12:39:13 -04:00
|
|
|
// Background marking is performed by per-P G's. Ensure that each P has
|
|
|
|
|
// a background GC G.
|
|
|
|
|
//
|
|
|
|
|
// Worker Gs don't exit if gomaxprocs is reduced. If it is raised
|
|
|
|
|
// again, we can reuse the old workers; no need to create new workers.
|
|
|
|
|
for gcBgMarkWorkerCount < gomaxprocs {
|
|
|
|
|
go gcBgMarkWorker()
|
2020-10-13 18:11:26 -04:00
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
notetsleepg(&work.bgMarkReady, -1)
|
|
|
|
|
noteclear(&work.bgMarkReady)
|
2020-10-13 18:11:26 -04:00
|
|
|
// The worker is now guaranteed to be added to the pool before
|
|
|
|
|
// its P's next findRunnableGCWorker.
|
|
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
gcBgMarkWorkerCount++
|
runtime: multi-threaded, utilization-scheduled background mark
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-03-23 21:07:33 -04:00
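As a rough illustration of the 25% goal described in the commit message above, the sketch below splits a CPU budget into whole dedicated workers plus a fractional remainder. The names `backgroundUtilization` and `markWorkerBudget` are hypothetical, and rounding down is an assumption; this is not the runtime's actual controller code.

```go
package main

// backgroundUtilization is the fraction of total CPU time the commit message
// gives as the goal for background marking (assumed constant here).
const backgroundUtilization = 0.25

// markWorkerBudget is a hypothetical helper: given the number of Ps, it
// returns how many Ps could be fully dedicated to marking and what fraction
// of one further P would be needed to reach the goal exactly.
func markWorkerBudget(procs int) (dedicated int, fractional float64) {
	total := float64(procs) * backgroundUtilization
	dedicated = int(total) // round down to whole workers
	fractional = total - float64(dedicated)
	return dedicated, fractional
}
```

With 4 Ps this yields one dedicated worker and no remainder; with 1 P it yields no dedicated worker and a 0.25 fractional share, which matches the GOMAXPROCS=1 problem the message describes.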
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// gcBgMarkPrepare sets up state for background marking.
|
|
|
|
|
// Mutator assists must not yet be enabled.
|
|
|
|
|
func gcBgMarkPrepare() {
|
|
|
|
|
// Background marking will stop when the work queues are empty
|
|
|
|
|
// and there are no more workers (note that, since this is
|
|
|
|
|
// concurrent, this may be a transient state, but mark
|
|
|
|
|
// termination will clean it up). Between background workers
|
|
|
|
|
// and assists, we don't really know how many workers there
|
|
|
|
|
// will be, so we pretend to have an arbitrarily large number
|
|
|
|
|
// of workers, almost all of which are "waiting". While a
|
|
|
|
|
// worker is working it decrements nwait. If nproc == nwait,
|
|
|
|
|
// there are no workers.
|
|
|
|
|
work.nproc = ^uint32(0)
|
|
|
|
|
work.nwait = ^uint32(0)
|
|
|
|
|
}
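The nproc/nwait convention in the comment above can be sketched outside the runtime as follows. The type `workCounters` and its methods are hypothetical names chosen for illustration; only the two initial values and the "nproc == nwait means no workers" rule come from the source.

```go
package main

import "sync/atomic"

// workCounters is a hypothetical stand-in for the nproc/nwait pair.
type workCounters struct {
	nproc uint32
	nwait uint32
}

// prepare mirrors gcBgMarkPrepare: pretend there is an arbitrarily large
// number of workers, all of which are waiting.
func (w *workCounters) prepare() {
	w.nproc = ^uint32(0)
	atomic.StoreUint32(&w.nwait, ^uint32(0))
}

// startWork records that one worker became active by decrementing nwait.
func (w *workCounters) startWork() {
	atomic.AddUint32(&w.nwait, ^uint32(0)) // adding all-ones subtracts 1
}

// stopWork records that one worker went back to waiting. It reports whether
// that worker was the last active one (nwait == nproc again).
func (w *workCounters) stopWork() bool {
	return atomic.AddUint32(&w.nwait, 1) == w.nproc
}

// idle reports whether no worker is currently active.
func (w *workCounters) idle() bool {
	return atomic.LoadUint32(&w.nwait) == w.nproc
}
```

Because both counters start at the same value, the number of active workers never has to be known in advance; only the difference between nproc and nwait matters.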
|
|
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
// gcBgMarkWorkerNode is an entry in the gcBgMarkWorkerPool. It points to a single
|
|
|
|
|
// gcBgMarkWorker goroutine.
|
|
|
|
|
type gcBgMarkWorkerNode struct {
|
|
|
|
|
// Unused workers are managed in a lock-free stack. This field must be first.
|
|
|
|
|
node lfnode
|
|
|
|
|
|
|
|
|
|
// The g of this worker.
|
|
|
|
|
gp guintptr
|
|
|
|
|
|
|
|
|
|
// Release this m on park. This is used to communicate with the unlock
|
|
|
|
|
// function, which cannot access the G's stack. It is unused outside of
|
|
|
|
|
// gcBgMarkWorker().
|
|
|
|
|
m muintptr
|
|
|
|
|
}
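The "lock-free stack" of unused workers mentioned above can be approximated in ordinary Go with a Treiber stack. This is only a sketch under stated assumptions: `worker`, `stackNode`, and `workerPool` are hypothetical names, `atomic.Pointer` needs Go 1.19 or later, and the runtime's own lfstack works differently (it embeds the node in the worker struct, which is why that field must come first).

```go
package main

import "sync/atomic"

// worker is a hypothetical stand-in for a parked background mark worker.
type worker struct {
	id int
}

// stackNode links one pooled worker to the next. A fresh node is allocated
// on every push, so a node is never reused while another goroutine may still
// hold a pointer to it. That avoids the ABA problem at the cost of an
// allocation per push.
type stackNode struct {
	w    *worker
	next *stackNode
}

// workerPool is a lock-free (Treiber) stack of idle workers.
type workerPool struct {
	head atomic.Pointer[stackNode]
}

// push makes w available to any caller of pop.
func (p *workerPool) push(w *worker) {
	n := &stackNode{w: w}
	for {
		old := p.head.Load()
		n.next = old // n is not yet published, so a plain write is safe
		if p.head.CompareAndSwap(old, n) {
			return
		}
	}
}

// pop removes and returns the most recently pushed worker, or nil if the
// pool is empty.
func (p *workerPool) pop() *worker {
	for {
		old := p.head.Load()
		if old == nil {
			return nil
		}
		if p.head.CompareAndSwap(old, old.next) {
			return old.w
		}
	}
}
```

In this model a worker would call push on itself when it parks, and each P's scheduler would call pop to obtain a worker to run.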
|
|
|
|
|
|
|
|
|
|
func gcBgMarkWorker() {
|
runtime: never pass stack pointers to gopark
gopark calls the unlock function after setting the G to _Gwaiting.
This means it's generally unsafe to access the G's stack from the
unlock function because the G may start running on another P. Once we
start shrinking stacks concurrently, a stack shrink could also move
the stack the moment after it enters _Gwaiting and before the unlock
function is called.
Document this restriction and fix the two places where we currently
violate it.
This is unlikely to be a problem in practice for these two places
right now, but they're already skating on thin ice. For example, the
following sequence could in principle cause corruption, deadlock, or a
panic in the select code:
On M1/P1:
1. G1 selects on channels A and B.
2. selectgoImpl calls gopark.
3. gopark puts G1 in _Gwaiting.
4. gopark calls selparkcommit.
5. selparkcommit releases the lock on channel A.
On M2/P2:
6. G2 sends to channel A.
7. The send puts G1 in _Grunnable and puts it on P2's run queue.
8. The scheduler runs, selects G1, puts it in _Grunning, and resumes G1.
9. On G1, the sellock immediately following the gopark gets called.
10. sellock grows and moves the stack.
On M1/P1:
11. selparkcommit continues to scan the lock order for the next
channel to unlock, but it's now reading from a freed (and possibly
reused) stack.
This shouldn't happen in practice because step 10 isn't the first call
to sellock, so the stack should already be big enough. However, once
we start shrinking stacks concurrently, this reasoning won't work any
more.
For #12967.
Change-Id: I3660c5be37e5be9f87433cb8141bdfdf37fadc4c
Reviewed-on: https://go-review.googlesource.com/20038
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-02-26 10:50:54 -05:00
|
|
|
gp := getg()
|
|
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
// We pass node to a gopark unlock function, so it can't be on
|
2016-02-26 10:50:54 -05:00
|
|
|
// the stack (see gopark). Prevent deadlock from recursively
|
|
|
|
|
// starting GC by disabling preemption.
|
|
|
|
|
gp.m.preemptoff = "GC worker init"
|
2020-10-13 12:39:13 -04:00
|
|
|
node := new(gcBgMarkWorkerNode)
|
2016-02-26 10:50:54 -05:00
|
|
|
gp.m.preemptoff = ""
|
|
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
node.gp.set(gp)
|
2020-10-13 18:11:26 -04:00
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
node.m.set(acquirem())
|
2015-03-23 21:07:33 -04:00
|
|
|
notewakeup(&work.bgMarkReady)
|
2020-10-13 18:11:26 -04:00
|
|
|
// After this point, the background mark worker is generally scheduled
|
|
|
|
|
// cooperatively by gcController.findRunnableGCWorker. While performing
|
|
|
|
|
// work on the P, preemption is disabled because we are working on
|
|
|
|
|
// P-local work buffers. When the preempt flag is set, this puts itself
|
|
|
|
|
// into _Gwaiting to be woken up by gcController.findRunnableGCWorker
|
|
|
|
|
// at the appropriate time.
|
|
|
|
|
//
|
|
|
|
|
// When preemption is enabled (e.g., while in gcMarkDone), this worker
|
|
|
|
|
// may be preempted and scheduled as a _Grunnable G from a runq. That is
|
|
|
|
|
// fine; it will eventually gopark again for further scheduling via
|
|
|
|
|
// findRunnableGCWorker.
|
|
|
|
|
//
|
|
|
|
|
// Since we disable preemption before notifying bgMarkReady, we
|
|
|
|
|
// guarantee that this G will be in the worker pool for the next
|
|
|
|
|
// findRunnableGCWorker. This isn't strictly necessary, but it reduces
|
|
|
|
|
// latency between _GCmark starting and the workers starting.
|
2015-10-26 11:27:37 -04:00
|
|
|
|
2015-03-23 21:07:33 -04:00
|
|
|
for {
|
2020-10-13 12:39:13 -04:00
|
|
|
// Go to sleep until woken by
|
2020-10-13 18:11:26 -04:00
|
|
|
// gcController.findRunnableGCWorker.
|
2020-10-13 12:39:13 -04:00
|
|
|
gopark(func(g *g, nodep unsafe.Pointer) bool {
|
|
|
|
|
node := (*gcBgMarkWorkerNode)(nodep)
|
2016-01-19 22:45:37 -05:00
|
|
|
|
2020-10-13 18:11:26 -04:00
|
|
|
if mp := node.m.ptr(); mp != nil {
|
|
|
|
|
// The worker G is no longer running; release
|
|
|
|
|
// the M.
|
|
|
|
|
//
|
|
|
|
|
// N.B. it is _safe_ to release the M as soon
|
|
|
|
|
// as we are no longer performing P-local mark
|
|
|
|
|
// work.
|
|
|
|
|
//
|
|
|
|
|
// However, since we cooperatively stop work
|
|
|
|
|
// when gp.preempt is set, if we releasem in
|
|
|
|
|
// the loop then the following call to gopark
|
|
|
|
|
// would immediately preempt the G. This is
|
|
|
|
|
// also safe, but inefficient: the G must
|
|
|
|
|
// schedule again only to enter gopark and park
|
|
|
|
|
// again. Thus, we defer the release until
|
|
|
|
|
// after parking the G.
|
|
|
|
|
releasem(mp)
|
|
|
|
|
}
|
2020-10-13 12:39:13 -04:00
|
|
|
|
|
|
|
|
// Release this G to the pool.
|
|
|
|
|
gcBgMarkWorkerPool.push(&node.node)
|
|
|
|
|
// Note that at this point, the G may immediately be
|
|
|
|
|
// rescheduled and may be running.
|
2015-03-23 21:07:33 -04:00
|
|
|
return true
|
2020-10-13 12:39:13 -04:00
|
|
|
}, unsafe.Pointer(node), waitReasonGCWorkerIdle, traceEvGoBlock, 0)
|
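The pool design described in the commit message above can be sketched as follows. This is a simplified illustration, not the runtime's implementation: the pool data structure is not shown in this excerpt, so a mutex-protected slice is swapped in, and the names worker, workerPool, park and take are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// worker stands in for a background mark worker goroutine (hypothetical).
type worker struct{ id int }

// workerPool is a free list of parked workers. No worker is tied to a P.
type workerPool struct {
	mu   sync.Mutex
	free []*worker
}

// park places a worker in the pool once it has nothing to do.
func (p *workerPool) park(w *worker) {
	p.mu.Lock()
	p.free = append(p.free, w)
	p.mu.Unlock()
}

// take removes and returns some parked worker, or nil if the pool is empty.
func (p *workerPool) take() *worker {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := len(p.free)
	if n == 0 {
		return nil
	}
	w := p.free[n-1]
	p.free = p.free[:n-1]
	return w
}

func main() {
	pool := &workerPool{}
	pool.park(&worker{id: 1})
	pool.park(&worker{id: 2})

	// Any P may run whichever worker happens to be free.
	for pid := 0; pid < 3; pid++ {
		if w := pool.take(); w != nil {
			fmt.Printf("P%d runs worker %d\n", pid, w.id)
		} else {
			fmt.Printf("P%d found no free worker\n", pid)
		}
	}
}
```

Because a P takes any free worker, a worker that is parked elsewhere (for example in gcMarkDone) no longer prevents that P from doing mark work.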
2015-03-23 21:07:33 -04:00
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
// Preemption must not occur here, or another G might see
// p.gcMarkWorkerMode.
|
2015-03-23 21:07:33 -04:00
|
|
|
|
|
|
|
|
// Disable preemption so we can use the gcw. If the
// scheduler wants to preempt us, we'll stop draining,
// dispose the gcw, and then preempt.
|
2020-10-13 12:39:13 -04:00
|
|
|
node.m.set(acquirem())
|
2020-10-13 18:11:26 -04:00
|
|
|
pp := gp.m.p.ptr() // P can't change with preemption disabled.
|
2015-03-23 21:07:33 -04:00
|
|
|
|
2015-03-27 17:01:53 -04:00
|
|
|
if gcBlackenEnabled == 0 {
|
2020-10-13 12:39:13 -04:00
|
|
|
println("worker mode", pp.gcMarkWorkerMode)
|
2015-03-27 17:01:53 -04:00
|
|
|
throw("gcBgMarkWorker: blackening not enabled")
|
|
|
|
|
}
|
|
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
if pp.gcMarkWorkerMode == gcMarkWorkerNotWorker {
|
|
|
|
|
throw("gcBgMarkWorker: mode not set")
|
|
|
|
|
}
|
|
|
|
|
|
2015-03-23 21:07:33 -04:00
|
|
|
startTime := nanotime()
|
2020-10-13 12:39:13 -04:00
|
|
|
pp.gcMarkWorkerStartTime = startTime
|
2015-03-23 21:07:33 -04:00
|
|
|
|
2015-11-02 14:09:24 -05:00
|
|
|
decnwait := atomic.Xadd(&work.nwait, -1)
|
2015-06-01 18:16:03 -04:00
|
|
|
if decnwait == work.nproc {
|
|
|
|
|
println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
|
|
|
|
|
throw("work.nwait was > work.nproc")
|
|
|
|
|
}
|
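The nwait check above relies on the atomic add returning the new value: if the decremented count equals nproc, the count before the decrement was nproc+1, which breaks the invariant nwait <= nproc. A minimal sketch of that pattern, using sync/atomic in place of the runtime-internal atomic package and a hypothetical workState type and startWork method:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// workState is a hypothetical stand-in for the fields used by the check.
type workState struct {
	nproc int32 // number of workers participating
	nwait int32 // number of workers currently waiting; must be <= nproc
}

// startWork records that one worker has stopped waiting. It reports an
// error if the counter was already above nproc before the decrement.
func (w *workState) startWork() error {
	dec := atomic.AddInt32(&w.nwait, -1) // returns the new value
	if dec == w.nproc {
		return fmt.Errorf("nwait was > nproc: nwait=%d nproc=%d", dec, w.nproc)
	}
	return nil
}

func main() {
	ok := &workState{nproc: 4, nwait: 4}
	fmt.Println(ok.startWork(), ok.nwait)

	bad := &workState{nproc: 4, nwait: 5}
	fmt.Println(bad.startWork())
}
```

The sketch returns an error where the runtime calls throw, since an ordinary program would not normally abort on this condition.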
2015-03-23 21:07:33 -04:00
|
|
|
|
runtime: scan mark worker stacks like normal
Currently, markroot delays scanning mark worker stacks until mark
termination by putting the mark worker G directly on the rescan list
when it encounters one during the mark phase. Without this, since mark
workers are non-preemptible, two mark workers that attempt to scan
each other's stacks can deadlock.
However, this is annoyingly asymmetric and causes some real problems.
First, markroot does not own the G at that point, so it's not
technically safe to add it to the rescan list. I haven't been able to
find a specific problem this could cause, but I suspect it's the root
cause of issue #17099. Second, this will interfere with the hybrid
barrier, since there is no stack rescanning during mark termination
with the hybrid barrier.
This commit switches to a different approach. We move the mark
worker's call to gcDrain to the system stack and set the mark worker's
status to _Gwaiting for the duration of the drain to indicate that
it's preemptible. This lets another mark worker scan its G stack while
the drain is running on the system stack. We don't return to the G
stack until we can switch back to _Grunning, which ensures we don't
race with a stack scan. This lets us eliminate the special case for
mark worker stack scans and scan them just like any other goroutine.
The only subtlety to this approach is that we have to disable stack
shrinking for mark workers; they could be referring to captured
variables from the G stack, so it's not safe to move their stacks.
Updates #17099 and #17503.
Change-Id: Ia5213949ec470af63e24dfce01df357c12adbbea
Reviewed-on: https://go-review.googlesource.com/31820
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-24 14:20:07 -04:00
|
|
|
systemstack(func() {
|
|
|
|
|
// Mark our goroutine preemptible so its stack
// can be scanned. This lets two mark workers
// scan each other (otherwise, they would
// deadlock). We must not modify anything on
// the G stack. However, stack shrinking is
// disabled for mark workers, so it is safe to
// read from the G stack.
|
|
|
|
|
casgstatus(gp, _Grunning, _Gwaiting)
|
2020-10-13 12:39:13 -04:00
|
|
|
switch pp.gcMarkWorkerMode {
|
2016-10-24 14:20:07 -04:00
|
|
|
default:
|
|
|
|
|
throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
|
|
|
|
|
case gcMarkWorkerDedicatedMode:
|
2020-10-13 12:39:13 -04:00
|
|
|
gcDrain(&pp.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
|
2017-06-23 17:54:39 -04:00
|
|
|
if gp.preempt {
|
|
|
|
|
// We were preempted. This is
// a useful signal to kick
// everything out of the run
// queue so it can run
// somewhere else.
|
2021-04-23 21:25:06 +08:00
|
|
|
if drainQ, n := runqdrain(pp); n > 0 {
|
|
|
|
|
lock(&sched.lock)
|
|
|
|
|
globrunqputbatch(&drainQ, int32(n))
|
|
|
|
|
unlock(&sched.lock)
|
2017-06-23 17:54:39 -04:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
// Go back to draining, this time
// without preemption.
|
2020-10-13 12:39:13 -04:00
|
|
|
gcDrain(&pp.gcw, gcDrainFlushBgCredit)
|
2016-10-30 20:20:17 -04:00
|
|
|
case gcMarkWorkerFractionalMode:
|
2020-10-13 12:39:13 -04:00
|
|
|
gcDrain(&pp.gcw, gcDrainFractional|gcDrainUntilPreempt|gcDrainFlushBgCredit)
|
2016-10-30 20:20:17 -04:00
|
|
|
case gcMarkWorkerIdleMode:
|
2020-10-13 12:39:13 -04:00
|
|
|
gcDrain(&pp.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
|
runtime: scan mark worker stacks like normal
Currently, markroot delays scanning mark worker stacks until mark
termination by putting the mark worker G directly on the rescan list
when it encounters one during the mark phase. Without this, since mark
workers are non-preemptible, two mark workers that attempt to scan
each other's stacks can deadlock.
However, this is annoyingly asymmetric and causes some real problems.
First, markroot does not own the G at that point, so it's not
technically safe to add it to the rescan list. I haven't been able to
find a specific problem this could cause, but I suspect it's the root
cause of issue #17099. Second, this will interfere with the hybrid
barrier, since there is no stack rescanning during mark termination
with the hybrid barrier.
This commit switches to a different approach. We move the mark
worker's call to gcDrain to the system stack and set the mark worker's
status to _Gwaiting for the duration of the drain to indicate that
it's preemptible. This lets another mark worker scan its G stack while
the drain is running on the system stack. We don't return to the G
stack until we can switch back to _Grunning, which ensures we don't
race with a stack scan. This lets us eliminate the special case for
mark worker stack scans and scan them just like any other goroutine.
The only subtlety to this approach is that we have to disable stack
shrinking for mark workers; they could be referring to captured
variables from the G stack, so it's not safe to move their stacks.
Updates #17099 and #17503.
Change-Id: Ia5213949ec470af63e24dfce01df357c12adbbea
Reviewed-on: https://go-review.googlesource.com/31820
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-24 14:20:07 -04:00
|
|
|
}
|
|
|
|
|
casgstatus(gp, _Gwaiting, _Grunning)
|
|
|
|
|
})
|
2015-07-24 16:38:19 -04:00
|
|
|
|
2015-10-26 16:48:36 -04:00
|
|
|
// Account for time.
|
|
|
|
|
duration := nanotime() - startTime
|
2020-10-13 12:39:13 -04:00
|
|
|
switch pp.gcMarkWorkerMode {
|
2015-10-26 16:48:36 -04:00
|
|
|
case gcMarkWorkerDedicatedMode:
|
2015-11-02 14:09:24 -05:00
|
|
|
atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
|
|
|
|
|
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
|
2015-10-26 16:48:36 -04:00
|
|
|
case gcMarkWorkerFractionalMode:
|
2015-11-02 14:09:24 -05:00
|
|
|
atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
|
2020-10-13 12:39:13 -04:00
|
|
|
atomic.Xaddint64(&pp.gcFractionalMarkTime, duration)
|
2015-10-26 16:48:36 -04:00
|
|
|
case gcMarkWorkerIdleMode:
|
2015-11-02 14:09:24 -05:00
|
|
|
atomic.Xaddint64(&gcController.idleMarkTime, duration)
|
2015-10-26 16:48:36 -04:00
|
|
|
}
|
|
|
|
|
|
runtime: eliminate getfull barrier from concurrent mark
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.
In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).
Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.
This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).
The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.
Updates #12041 and #8687.
No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.
name              old time/op  new time/op  delta
XBenchGarbage-12  5.80ms ± 2%  6.32ms ± 4%  +9.03%  (p=0.000 n=20+20)
Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-10-26 16:29:25 -04:00
|
|
|
// Was this the last worker and did we run out
|
|
|
|
|
// of work?
|
2015-11-02 14:09:24 -05:00
|
|
|
incnwait := atomic.Xadd(&work.nwait, +1)
|
2015-10-26 16:29:25 -04:00
|
|
|
if incnwait > work.nproc {
|
2020-10-13 12:39:13 -04:00
|
|
|
println("runtime: p.gcMarkWorkerMode=", pp.gcMarkWorkerMode,
|
2015-10-26 16:29:25 -04:00
|
|
|
"work.nwait=", incnwait, "work.nproc=", work.nproc)
|
|
|
|
|
throw("work.nwait > work.nproc")
|
2015-06-01 18:16:03 -04:00
|
|
|
}
|
runtime: multi-threaded, utilization-scheduled background mark
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-03-23 21:07:33 -04:00
|
|
|
|
2020-10-13 12:39:13 -04:00
|
|
|
// We'll releasem after this point and thus this P may run
|
|
|
|
|
// something else. We must clear the worker mode to avoid
|
|
|
|
|
// attributing the mode to a different (non-worker) G in
|
|
|
|
|
// traceGoStart.
|
|
|
|
|
pp.gcMarkWorkerMode = gcMarkWorkerNotWorker
|
|
|
|
|
|
2015-04-22 17:44:36 -04:00
|
|
|
// If this worker reached a background mark completion
|
|
|
|
|
// point, signal the main GC goroutine.
|
2015-10-26 16:29:25 -04:00
|
|
|
if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
|
2020-10-13 18:11:26 -04:00
|
|
|
// We don't need the P-local buffers here, allow
|
|
|
|
|
// preemption because we may schedule like a regular
|
|
|
|
|
// goroutine in gcMarkDone (block on locks, etc).
|
2020-10-13 12:39:13 -04:00
|
|
|
releasem(node.m.ptr())
|
2020-10-13 18:11:26 -04:00
|
|
|
node.m.set(nil)
|
2015-10-26 11:27:37 -04:00
|
|
|
|
2015-10-24 21:30:59 -04:00
|
|
|
gcMarkDone()
|
2015-03-23 21:07:33 -04:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
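The "last worker out" check at the end of the worker loop above can be modeled outside the runtime. The following is a minimal sketch under stated assumptions, not the runtime's implementation: `markState`, `simulate`, and the `pending` and `processed` counters are hypothetical, and a single atomic counter stands in for the real work queues. It shows only the idea that each worker atomically increments an idle count when it runs out of work, and a worker that brings the count up to the number of workers while no work remains signals completion.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// markState is a hypothetical stand-in for the nwait/nproc pair.
type markState struct {
	nproc     int32 // total number of workers
	nwait     int32 // workers not currently draining; starts at nproc
	pending   int64 // hypothetical count of queued work units
	processed int64 // units actually drained
	signals   int32 // how many workers observed completion
}

// workAvailable reports whether any queued work remains.
func (s *markState) workAvailable() bool {
	return atomic.LoadInt64(&s.pending) > 0
}

// worker drains units until none remain, then performs the completion check.
func (s *markState) worker() {
	atomic.AddInt32(&s.nwait, -1) // this worker is now busy
	for {
		if atomic.AddInt64(&s.pending, -1) < 0 {
			atomic.AddInt64(&s.pending, +1) // nothing to take; undo
			break
		}
		atomic.AddInt64(&s.processed, 1)
	}
	incnwait := atomic.AddInt32(&s.nwait, +1)
	if incnwait > s.nproc {
		panic("nwait > nproc")
	}
	// Every worker is idle and no work remains: signal completion.
	// More than one worker may observe this, so the receiver of the
	// signal must tolerate duplicates.
	if incnwait == s.nproc && !s.workAvailable() {
		atomic.AddInt32(&s.signals, 1)
	}
}

// simulate runs nproc workers over the given number of work units.
func simulate(nproc int32, units int64) (signals int32, processed int64) {
	s := &markState{nproc: nproc, nwait: nproc, pending: units}
	var wg sync.WaitGroup
	for i := int32(0); i < nproc; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.worker()
		}()
	}
	wg.Wait()
	return s.signals, s.processed
}

func main() {
	signals, processed := simulate(4, 1000)
	fmt.Println("processed:", processed, "completion signalled:", signals >= 1)
}
```

In this model the worker that performs the final increment always sees `nwait == nproc` with no pending work, so completion is signalled at least once per run. An early finisher can also signal before other workers have started, which is why the receiver has to re-check rather than trust a single signal.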
|
|
|
|
|
|
2018-11-02 15:18:43 +00:00
|
|
|
// gcMarkWorkAvailable reports whether executing a mark worker
|
2015-10-19 13:35:25 -04:00
|
|
|
// on p is potentially useful. p may be nil, in which case it only
|
|
|
|
|
// checks the global sources of work.
|
2015-05-18 16:02:37 -04:00
|
|
|
func gcMarkWorkAvailable(p *p) bool {
|
2015-10-19 13:35:25 -04:00
|
|
|
if p != nil && !p.gcw.empty() {
|
2015-05-18 16:02:37 -04:00
|
|
|
return true
|
|
|
|
|
}
|
2017-03-07 16:38:29 -05:00
|
|
|
if !work.full.empty() {
|
2015-05-18 16:02:37 -04:00
|
|
|
return true // global work available
|
|
|
|
|
}
|
runtime: perform concurrent scan in GC workers
Currently the concurrent root scan is performed in its entirety by the
GC coordinator before entering concurrent mark (which enables GC
workers). This scan is done sequentially, which can prolong the scan
phase, delay the mark phase, and means that the scan phase does not
obey the 25% CPU goal. Furthermore, there's no need to complete the
root scan before starting marking (in fact, we already allow GC
assists to happen during the scan phase), so this acts as an
unnecessary barrier between root scanning and marking.
This change shifts the root scan work out of the GC coordinator and
into the GC workers. The coordinator simply sets up the scan state and
enqueues the right number of root scan jobs. The GC workers then drain
the root scan jobs prior to draining heap scan jobs.
This parallelizes the root scan process, makes it obey the 25% CPU
goal, and effectively eliminates root scanning as an isolated phase,
allowing the system to smoothly transition from root scanning to heap
marking. This also eliminates a major non-STW responsibility of the GC
coordinator, which will make it easier to switch to a decentralized
state machine. Finally, it puts us in a good position to perform root
scanning in assists as well, which will help satisfy assists at the
beginning of the GC cycle.
This is mostly straightforward. One tricky aspect is that we have to
deal with preemption deadlock: where two non-preemptible goroutines
are trying to preempt each other to perform a stack scan. Given the
context where this happens, the only instance of this is two
background workers trying to scan each other. We avoid this by simply
not scanning the stacks of background workers during the concurrent
phase; this is safe because we'll scan them during mark termination
(and their stacks are *very* small and should not contain any new
pointers).
This change also switches the root marking during mark termination to
use the same gcDrain-based code path as concurrent mark. This
shouldn't affect performance because STW root marking was already
parallel and tasks switched to heap marking immediately when no more
root marking tasks were available. However, it simplifies the code and
unifies these code paths.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark, possibly by making GC run slightly more
frequently.
name              old time/op  new time/op  delta
XBenchGarbage-12  5.10ms ± 1%  5.24ms ± 1%  +2.87%  (p=0.000 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.25s ± 3% 3.20s ± 5% -1.57% (p=0.013 n=20+20)
Fannkuch11-12 2.45s ± 1% 2.46s ± 1% +0.38% (p=0.019 n=20+18)
FmtFprintfEmpty-12 49.7ns ± 3% 49.9ns ± 4% ~ (p=0.851 n=19+20)
FmtFprintfString-12 170ns ± 2% 170ns ± 1% ~ (p=0.775 n=20+19)
FmtFprintfInt-12 161ns ± 1% 160ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 267ns ± 1% 270ns ± 1% +1.04% (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12 238ns ± 2% 238ns ± 1% ~ (p=0.133 n=18+19)
FmtFprintfFloat-12 311ns ± 1% 310ns ± 2% -0.35% (p=0.023 n=20+19)
FmtManyArgs-12 1.08µs ± 1% 1.06µs ± 1% -2.31% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.63ms ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 6.49ms ± 1% 6.52ms ± 1% +0.37% (p=0.015 n=20+20)
Gzip-12 319ms ± 3% 318ms ± 1% ~ (p=0.975 n=19+17)
Gunzip-12 41.9ms ± 1% 42.1ms ± 2% +0.65% (p=0.004 n=19+20)
HTTPClientServer-12 61.7µs ± 1% 62.6µs ± 1% +1.40% (p=0.000 n=18+20)
JSONEncode-12 16.8ms ± 1% 16.9ms ± 1% ~ (p=0.239 n=20+18)
JSONDecode-12 58.4ms ± 1% 60.7ms ± 1% +3.85% (p=0.000 n=19+20)
Mandelbrot200-12 3.86ms ± 0% 3.86ms ± 1% ~ (p=0.092 n=18+19)
GoParse-12 3.75ms ± 2% 3.75ms ± 2% ~ (p=0.708 n=19+20)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 2% +0.60% (p=0.010 n=17+20)
RegexpMatchEasy0_1K-12 341ns ± 1% 342ns ± 2% ~ (p=0.203 n=20+19)
RegexpMatchEasy1_32-12 82.5ns ± 2% 83.2ns ± 2% +0.83% (p=0.007 n=19+19)
RegexpMatchEasy1_1K-12 495ns ± 1% 495ns ± 2% ~ (p=0.970 n=19+18)
RegexpMatchMedium_32-12 130ns ± 2% 130ns ± 2% +0.59% (p=0.039 n=19+20)
RegexpMatchMedium_1K-12 39.2µs ± 1% 39.3µs ± 1% ~ (p=0.214 n=18+18)
RegexpMatchHard_32-12 2.03µs ± 2% 2.02µs ± 1% ~ (p=0.166 n=18+19)
RegexpMatchHard_1K-12 61.0µs ± 1% 60.9µs ± 1% ~ (p=0.169 n=20+18)
Revcomp-12 533ms ± 1% 535ms ± 1% ~ (p=0.071 n=19+17)
Template-12 68.1ms ± 2% 73.0ms ± 1% +7.26% (p=0.000 n=19+20)
TimeParse-12 355ns ± 2% 356ns ± 2% ~ (p=0.530 n=19+20)
TimeFormat-12 357ns ± 2% 347ns ± 1% -2.59% (p=0.000 n=20+19)
[Geo mean] 62.1µs 62.3µs +0.31%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.9MB/s ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 118MB/s ± 1% 118MB/s ± 1% -0.37% (p=0.015 n=20+20)
Gzip-12 60.9MB/s ± 3% 60.9MB/s ± 1% ~ (p=0.944 n=19+17)
Gunzip-12 464MB/s ± 1% 461MB/s ± 2% -0.64% (p=0.004 n=19+20)
JSONEncode-12 115MB/s ± 1% 115MB/s ± 1% ~ (p=0.236 n=20+18)
JSONDecode-12 33.2MB/s ± 1% 32.0MB/s ± 1% -3.71% (p=0.000 n=19+20)
GoParse-12 15.5MB/s ± 2% 15.5MB/s ± 2% ~ (p=0.702 n=19+20)
RegexpMatchEasy0_32-12 320MB/s ± 1% 318MB/s ± 2% ~ (p=0.094 n=18+20)
RegexpMatchEasy0_1K-12 3.00GB/s ± 1% 2.99GB/s ± 1% ~ (p=0.194 n=20+19)
RegexpMatchEasy1_32-12 388MB/s ± 2% 385MB/s ± 2% -0.83% (p=0.008 n=19+19)
RegexpMatchEasy1_1K-12 2.07GB/s ± 1% 2.07GB/s ± 1% ~ (p=0.964 n=19+18)
RegexpMatchMedium_32-12 7.68MB/s ± 1% 7.64MB/s ± 2% -0.57% (p=0.020 n=19+20)
RegexpMatchMedium_1K-12 26.1MB/s ± 1% 26.1MB/s ± 1% ~ (p=0.211 n=18+18)
RegexpMatchHard_32-12 15.8MB/s ± 1% 15.8MB/s ± 1% ~ (p=0.180 n=18+19)
RegexpMatchHard_1K-12 16.8MB/s ± 1% 16.8MB/s ± 2% ~ (p=0.236 n=20+19)
Revcomp-12 477MB/s ± 1% 475MB/s ± 1% ~ (p=0.071 n=19+17)
Template-12 28.5MB/s ± 2% 26.6MB/s ± 1% -6.77% (p=0.000 n=19+20)
[Geo mean] 100MB/s 99.0MB/s -0.82%
Change-Id: I875bf6ceb306d1ee2f470cabf88aa6ede27c47a0
Reviewed-on: https://go-review.googlesource.com/16059
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-10-19 13:46:32 -04:00
|
|
|
if work.markrootNext < work.markrootJobs {
|
|
|
|
|
return true // root scan work available
|
|
|
|
|
}
|
2015-05-18 16:02:37 -04:00
|
|
|
return false
|
|
|
|
|
}
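The order of checks in gcMarkWorkAvailable can be sketched outside the runtime as follows. This is a minimal illustration, not the runtime's code: the types localQueue, proc, and globalWork and the function markWorkAvailable are hypothetical stand-ins for p.gcw, p, and work, and a plain slice replaces the runtime's list of full work buffers.

```go
package main

import "fmt"

// localQueue is a hypothetical stand-in for a P's gcWork cache.
type localQueue struct{ ptrs []uintptr }

func (q *localQueue) empty() bool { return len(q.ptrs) == 0 }

// proc is a hypothetical stand-in for the runtime's p.
type proc struct{ gcw localQueue }

// globalWork is a hypothetical stand-in for the runtime's work struct.
type globalWork struct {
	full         []uintptr // stand-in for the global list of full work buffers
	markrootNext uint32    // next root scan job to hand out
	markrootJobs uint32    // total number of root scan jobs
}

// markWorkAvailable follows the same order of checks as gcMarkWorkAvailable:
// the local cache first, then global buffers, then unclaimed root scan jobs.
// p may be nil, in which case only the global sources are checked.
func markWorkAvailable(p *proc, w *globalWork) bool {
	if p != nil && !p.gcw.empty() {
		return true
	}
	if len(w.full) != 0 {
		return true // global work available
	}
	return w.markrootNext < w.markrootJobs // root scan work available
}

func main() {
	drained := &globalWork{markrootNext: 4, markrootJobs: 4}
	busy := &proc{gcw: localQueue{ptrs: []uintptr{0x1000}}}

	fmt.Println(markWorkAvailable(nil, drained))  // no local P, nothing global
	fmt.Println(markWorkAvailable(busy, drained)) // work only in the local cache
	fmt.Println(markWorkAvailable(nil, &globalWork{markrootNext: 1, markrootJobs: 4}))
}
```

Checking the local cache first keeps the common case free of any access to shared state; the global checks only matter once the local cache is empty or no P is supplied.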
|
|
|
|
|
|
2015-02-19 16:43:27 -05:00
|
|
|
// gcMark runs the mark (or, for concurrent GC, mark termination)
|
2016-09-11 16:55:34 -04:00
|
|
|
// All gcWork caches must be empty.
|
2015-02-19 15:48:40 -05:00
|
|
|
// STW is in effect at this point.
|
2021-04-09 23:56:44 +08:00
|
|
|
func gcMark(startTime int64) {
|
2014-11-11 17:05:02 -05:00
|
|
|
if debug.allocfreetrace > 0 {
|
|
|
|
|
tracegc()
|
|
|
|
|
}
|
|
|
|
|
|
2015-03-05 17:33:08 -05:00
|
|
|
if gcphase != _GCmarktermination {
|
|
|
|
|
throw("in gcMark expecting to see gcphase as _GCmarktermination")
|
|
|
|
|
}
|
2021-04-09 23:56:44 +08:00
|
|
|
work.tstart = startTime
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2018-08-16 12:17:32 -04:00
|
|
|
// Check that there's no marking work remaining.
|
2018-08-16 12:32:46 -04:00
|
|
|
if work.full != 0 || work.markrootNext < work.markrootJobs {
|
|
|
|
|
print("runtime: full=", hex(work.full), " next=", work.markrootNext, " jobs=", work.markrootJobs, " nDataRoots=", work.nDataRoots, " nBSSRoots=", work.nBSSRoots, " nSpanRoots=", work.nSpanRoots, " nStackRoots=", work.nStackRoots, "\n")
|
2018-08-16 12:17:32 -04:00
|
|
|
panic("non-empty mark queue after concurrent mark")
|
runtime: avoid getfull() barrier most of the time
With the hybrid barrier, unless we're doing a STW GC or hit a very
rare race (~once per all.bash) that can start mark termination before
all of the work is drained, we don't need to drain the work queue at
all. Even draining an empty work queue is rather expensive since we
have to enter the getfull() barrier, so it's worth avoiding this.
Conveniently, it's quite easy to detect whether or not we actually
need the getfull() barrier: since the world is stopped when we enter
mark termination, everything must have flushed its work to the work
queue, so we can just check the queue. If the queue is empty and we
haven't queued up any jobs that may create more work (which should
always be the case with the hybrid barrier), we can simply have all GC
workers perform non-blocking drains.
Also conveniently, this solution is quite safe. If we do somehow screw
something up and there's work on the work queue, some worker will
still process it, it just may not happen in parallel.
This is not the "right" solution, but it's simple, expedient,
low-risk, and maintains compatibility with debug.gcrescanstacks. When
we remove the gcrescanstacks fallback in Go 1.9, we should also fix
the race that starts mark termination early, and then we can eliminate
work draining from mark termination.
Updates #17503.
Change-Id: I7b3cd5de6a248ab29d78c2b42aed8b7443641361
Reviewed-on: https://go-review.googlesource.com/32186
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-26 17:05:41 -04:00
|
|
|
}
|
|
|
|
|
|
2016-03-14 13:51:23 -04:00
|
|
|
if debug.gccheckmark > 0 {
|
|
|
|
|
// This is expensive when there's a large number of
|
|
|
|
|
// Gs, so only do it if checkmark is also enabled.
|
|
|
|
|
gcMarkRootCheck()
|
|
|
|
|
}
|
2014-11-15 08:00:38 -05:00
|
|
|
if work.full != 0 {
|
2014-12-27 20:58:00 -08:00
|
|
|
throw("work.full != 0")
|
2014-11-15 08:00:38 -05:00
|
|
|
}
|
|
|
|
|
|
2018-08-03 17:13:09 -04:00
|
|
|
// Clear out buffers and double-check that all gcWork caches
|
|
|
|
|
// are empty. This should be ensured by gcMarkDone before we
|
|
|
|
|
// enter mark termination.
|
|
|
|
|
//
|
|
|
|
|
// TODO: We could clear out buffers just before mark if this
|
|
|
|
|
// has a non-negligible impact on STW time.
|
2017-06-13 12:01:56 -04:00
|
|
|
for _, p := range allp {
|
2018-08-03 17:13:09 -04:00
|
|
|
// The write barrier may have buffered pointers since
|
|
|
|
|
// the gcMarkDone barrier. However, since the barrier
|
|
|
|
|
// ensured all reachable objects were marked, all of
|
|
|
|
|
// these must be pointers to black objects. Hence we
|
|
|
|
|
// can just discard the write barrier buffer.
|
2020-10-14 17:18:27 -04:00
|
|
|
if debug.gccheckmark > 0 {
|
2018-08-03 17:13:09 -04:00
|
|
|
// For debugging, flush the buffer and make
|
|
|
|
|
// sure it really was all marked.
|
|
|
|
|
wbBufFlush1(p)
|
|
|
|
|
} else {
|
|
|
|
|
p.wbBuf.reset()
|
|
|
|
|
}
|
|
|
|
|
|
2017-06-13 12:01:56 -04:00
|
|
|
gcw := &p.gcw
|
2016-04-16 18:27:38 -04:00
|
|
|
if !gcw.empty() {
|
2018-11-19 10:36:45 -05:00
|
|
|
printlock()
|
|
|
|
|
print("runtime: P ", p.id, " flushedWork ", gcw.flushedWork)
|
|
|
|
|
if gcw.wbuf1 == nil {
|
|
|
|
|
print(" wbuf1=<nil>")
|
|
|
|
|
} else {
|
|
|
|
|
print(" wbuf1.n=", gcw.wbuf1.nobj)
|
|
|
|
|
}
|
|
|
|
|
if gcw.wbuf2 == nil {
|
|
|
|
|
print(" wbuf2=<nil>")
|
|
|
|
|
} else {
|
|
|
|
|
print(" wbuf2.n=", gcw.wbuf2.nobj)
|
|
|
|
|
}
|
|
|
|
|
print("\n")
|
runtime: replace per-M workbuf cache with per-P gcWork cache
Currently, each M has a cache of the most recently used *workbuf. This
is used primarily by the write barrier so it doesn't have to access
the global workbuf lists on every write barrier. It's also used by
stack scanning because it's convenient.
This cache is important for write barrier performance, but this
particular approach has several downsides. It's faster than no cache,
but far from optimal (as the benchmarks below show). It's complex:
access to the cache is sprinkled through most of the workbuf list
operations and it requires special care to transform into and back out
of the gcWork cache that's actually used for scanning and marking. It
requires atomic exchanges to take ownership of the cached workbuf and
to return it to the M's cache even though it's almost always used by
only the current M. Since it's per-M, flushing these caches is O(# of
Ms), which may be high. And it has some significant subtleties: for
example, in general the cache shouldn't be used after the
harvestwbufs() in mark termination because it could hide work from
mark termination, but stack scanning can happen after this and *will*
use the cache (but it turns out this is okay because it will always be
followed by a getfull(), which drains the cache).
This change replaces this cache with a per-P gcWork object. This
gcWork cache can be used directly by scanning and marking (as long as
preemption is disabled, which is a general requirement of gcWork).
Since it's per-P, it doesn't require synchronization, which simplifies
things and means the only atomic operations in the write barrier are
occasionally fetching new work buffers and setting a mark bit if the
object isn't already marked. This cache can be flushed in O(# of Ps),
which is generally small. It follows a simple flushing rule: the cache
can be used during any phase, but during mark termination it must be
flushed before allowing preemption. This also makes the dispose during
mutator assist no longer necessary, which eliminates the vast majority
of gcWork dispose calls and reduces contention on the global workbuf
lists. And it's a lot faster on some benchmarks:
benchmark                          old ns/op       new ns/op       delta
BenchmarkBinaryTree17              11963668673     11206112763     -6.33%
BenchmarkFannkuch11                2643217136      2649182499      +0.23%
BenchmarkFmtFprintfEmpty           70.4            70.2            -0.28%
BenchmarkFmtFprintfString          364             307             -15.66%
BenchmarkFmtFprintfInt             317             282             -11.04%
BenchmarkFmtFprintfIntInt          512             483             -5.66%
BenchmarkFmtFprintfPrefixedInt     404             380             -5.94%
BenchmarkFmtFprintfFloat           521             479             -8.06%
BenchmarkFmtManyArgs               2164            1894            -12.48%
BenchmarkGobDecode                 30366146        22429593        -26.14%
BenchmarkGobEncode                 29867472        26663152        -10.73%
BenchmarkGzip                      391236616       396779490       +1.42%
BenchmarkGunzip                    96639491        96297024        -0.35%
BenchmarkHTTPClientServer          100110          70763           -29.31%
BenchmarkJSONEncode                51866051        52511382        +1.24%
BenchmarkJSONDecode                103813138       86094963        -17.07%
BenchmarkMandelbrot200             4121834         4120886         -0.02%
BenchmarkGoParse                   16472789        5879949         -64.31%
BenchmarkRegexpMatchEasy0_32       140             140             +0.00%
BenchmarkRegexpMatchEasy0_1K       394             394             +0.00%
BenchmarkRegexpMatchEasy1_32       120             120             +0.00%
BenchmarkRegexpMatchEasy1_1K       621             614             -1.13%
BenchmarkRegexpMatchMedium_32      209             202             -3.35%
BenchmarkRegexpMatchMedium_1K      54889           55175           +0.52%
BenchmarkRegexpMatchHard_32        2682            2675            -0.26%
BenchmarkRegexpMatchHard_1K        79383           79524           +0.18%
BenchmarkRevcomp                   584116718       584595320       +0.08%
BenchmarkTemplate                  125400565       109620196       -12.58%
BenchmarkTimeParse                 386             387             +0.26%
BenchmarkTimeFormat                580             447             -22.93%
(Best out of 10 runs. The delta of averages is similar.)
This also puts us in a good position to flush these caches when
nearing the end of concurrent marking, which will let us increase the
size of the work buffers while still controlling mark termination
pause time.
Change-Id: I2dd94c8517a19297a98ec280203cccaa58792522
Reviewed-on: https://go-review.googlesource.com/9178
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2015-04-19 15:22:20 -04:00
|
|
|
throw("P has cached GC work at end of mark termination")
|
2016-04-16 18:27:38 -04:00
|
|
|
}
|
2018-08-03 17:13:09 -04:00
|
|
|
// There may still be cached empty buffers, which we
|
|
|
|
|
// need to flush since we're going to free them. Also,
|
|
|
|
|
// there may be non-zero stats because we allocated
|
|
|
|
|
// black after the gcMarkDone barrier.
|
|
|
|
|
gcw.dispose()
|
2015-04-19 15:22:20 -04:00
|
|
|
}
|
|
|
|
|
|
2016-09-15 14:30:31 -04:00
|
|
|
// Update the marked heap stat.
|
2021-03-31 22:55:06 +00:00
|
|
|
gcController.heapMarked = work.bytesMarked
|
runtime: use reachable heap estimate to set trigger/goal
Currently, we set the heap goal for the next GC cycle using the size
of the marked heap at the end of the current cycle. This can lead to a
bad feedback loop if the mutator is rapidly allocating and releasing
pointers that can significantly bloat heap size.
If the GC were STW, the marked heap size would be exactly the
reachable heap size (call it stwLive). However, in concurrent GC,
marked=stwLive+floatLive, where floatLive is the amount of "floating
garbage": objects that were reachable at some point during the cycle
and were marked, but which are no longer reachable by the end of the
cycle. If the GC cycle is short, then the mutator doesn't have much
time to create floating garbage, so marked≈stwLive. However, if the GC
cycle is long and the mutator is allocating and creating floating
garbage very rapidly, then it's possible that marked≫stwLive. Since
the runtime currently sets the heap goal based on marked, this will
cause it to set a high heap goal. This means that 1) the next GC cycle
will take longer because of the larger heap and 2) the assist ratio
will be low because of the large distance between the trigger and the
goal. The combination of these lets the mutator produce even more
floating garbage in the next cycle, which further exacerbates the
problem.
For example, on the garbage benchmark with GOMAXPROCS=1, this causes
the heap to grow to ~500MB and the garbage collector to retain upwards
of ~300MB of heap, while the true reachable heap size is ~32MB. This,
in turn, causes the GC cycle to take upwards of ~3 seconds.
Fix this bad feedback loop by estimating the true reachable heap size
(stwLive) and using this rather than the marked heap size
(stwLive+floatLive) as the basis for the GC trigger and heap goal.
This breaks the bad feedback loop and causes the mutator to assist
more, which decreases the rate at which it can create floating
garbage. On the same garbage benchmark, this reduces the maximum heap
size to ~73MB, the retained heap to ~40MB, and the duration of the GC
cycle to ~200ms.
Change-Id: I7712244c94240743b266f9eb720c03802799cdd1
Reviewed-on: https://go-review.googlesource.com/9177
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 14:24:25 -04:00
|
|
|
|
2020-07-24 19:58:31 +00:00
|
|
|
// Flush scanAlloc from each mcache since we're about to modify
|
2021-03-31 22:55:06 +00:00
|
|
|
// heapScan directly. If we were to flush this later, then scanAlloc
|
runtime: flush local_scan directly and more often
Now that local_scan is the last mcache-based statistic that is flushed
by purgecachedstats, and heap_scan and gcController.revise may be
interacted with concurrently, we don't need to flush heap_scan at
arbitrary locations where the heap is locked, and we don't need
purgecachedstats and cachestats anymore. Instead, we can flush
local_scan at the same time we update heap_live in refill, so the two
updates may share the same revise call.
Clean up unused functions, remove code that would cause the heap to get
locked in the allocSpan when it didn't need to (other than to flush
local_scan), and flush local_scan explicitly in a few important places.
Notably we need to flush local_scan whenever we flush the other stats,
but it doesn't need to be donated anywhere, so have releaseAll do the
flushing. Also, we need to flush local_scan before we set heap_scan at
the end of a GC, which was previously handled by cachestats. Just do so
explicitly -- it's not much code and it becomes a lot more clear why we
need to do so.
Change-Id: I35ac081784df7744d515479896a41d530653692d
Reviewed-on: https://go-review.googlesource.com/c/go/+/246968
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
2020-07-23 22:36:58 +00:00
|
|
|
// might have incorrect information.
|
|
|
|
|
for _, p := range allp {
|
|
|
|
|
c := p.mcache
|
|
|
|
|
if c == nil {
|
|
|
|
|
continue
|
|
|
|
|
}
|
2021-03-31 22:55:06 +00:00
|
|
|
gcController.heapScan += uint64(c.scanAlloc)
|
2020-07-24 19:58:31 +00:00
|
|
|
c.scanAlloc = 0
|
2020-07-23 22:36:58 +00:00
|
|
|
}
|
|
|
|
|
|
runtime: fix (sometimes major) underestimation of heap_live
Currently, we update memstats.heap_live from mcache.local_cachealloc
whenever we lock the heap (e.g., to obtain a fresh span or to release
an unused span). However, under the right circumstances,
local_cachealloc can accumulate allocations up to the size of
the *entire heap* without flushing them to heap_live. Specifically,
since span allocations from an mcentral don't lock the heap, if a
large number of pages are held in an mcentral and the application
continues to use and free objects of that size class (e.g., the
BinaryTree17 benchmark), local_cachealloc won't be flushed until the
mcentral runs out of spans.
This is a problem because, unlike many of the memory statistics that
are purely informative, heap_live is used to determine when the
garbage collector should start and how hard it should work.
This commit eliminates local_cachealloc, instead atomically updating
heap_live directly. To control contention, we do this only when
obtaining a span from an mcentral. Furthermore, we make heap_live
conservative: allocating a span assumes that all free slots in that
span will be used and accounts for these when the span is
allocated, *before* the objects themselves are. This is important
because 1) this triggers the GC earlier than necessary rather than
potentially too late and 2) this leads to a conservative GC rate
rather than a GC rate that is potentially too low.
Alternatively, we could have flushed local_cachealloc when it passed
some threshold, but this would require determining a threshold and
would cause heap_live to underestimate the true value rather than
overestimate.
Fixes #12199.
name old time/op new time/op delta
BinaryTree17-12 2.88s ± 4% 2.88s ± 1% ~ (p=0.470 n=19+19)
Fannkuch11-12 2.48s ± 1% 2.48s ± 1% ~ (p=0.243 n=16+19)
FmtFprintfEmpty-12 50.9ns ± 2% 50.7ns ± 1% ~ (p=0.238 n=15+14)
FmtFprintfString-12 175ns ± 1% 171ns ± 1% -2.48% (p=0.000 n=18+18)
FmtFprintfInt-12 159ns ± 1% 158ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 270ns ± 1% 265ns ± 2% -1.67% (p=0.000 n=18+18)
FmtFprintfPrefixedInt-12 235ns ± 1% 234ns ± 0% ~ (p=0.362 n=18+19)
FmtFprintfFloat-12 309ns ± 1% 308ns ± 1% -0.41% (p=0.001 n=18+19)
FmtManyArgs-12 1.10µs ± 1% 1.08µs ± 0% -1.96% (p=0.000 n=19+18)
GobDecode-12 7.81ms ± 1% 7.80ms ± 1% ~ (p=0.425 n=18+19)
GobEncode-12 6.53ms ± 1% 6.53ms ± 1% ~ (p=0.817 n=19+19)
Gzip-12 312ms ± 1% 312ms ± 2% ~ (p=0.967 n=19+20)
Gunzip-12 42.0ms ± 1% 41.9ms ± 1% ~ (p=0.172 n=19+19)
HTTPClientServer-12 63.7µs ± 1% 63.8µs ± 1% ~ (p=0.639 n=19+19)
JSONEncode-12 16.4ms ± 1% 16.4ms ± 1% ~ (p=0.954 n=19+19)
JSONDecode-12 58.5ms ± 1% 57.8ms ± 1% -1.27% (p=0.000 n=18+19)
Mandelbrot200-12 3.86ms ± 1% 3.88ms ± 0% +0.44% (p=0.000 n=18+18)
GoParse-12 3.67ms ± 2% 3.66ms ± 1% -0.52% (p=0.001 n=18+19)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 0% ~ (p=0.257 n=19+18)
RegexpMatchEasy0_1K-12 347ns ± 1% 347ns ± 1% ~ (p=0.527 n=18+18)
RegexpMatchEasy1_32-12 83.7ns ± 2% 83.1ns ± 2% ~ (p=0.096 n=18+19)
RegexpMatchEasy1_1K-12 509ns ± 1% 505ns ± 1% -0.75% (p=0.000 n=18+19)
RegexpMatchMedium_32-12 130ns ± 2% 129ns ± 1% ~ (p=0.962 n=20+20)
RegexpMatchMedium_1K-12 39.5µs ± 2% 39.4µs ± 1% ~ (p=0.376 n=20+19)
RegexpMatchHard_32-12 2.04µs ± 0% 2.04µs ± 1% ~ (p=0.195 n=18+17)
RegexpMatchHard_1K-12 61.4µs ± 1% 61.4µs ± 1% ~ (p=0.885 n=19+19)
Revcomp-12 540ms ± 2% 542ms ± 4% ~ (p=0.552 n=19+17)
Template-12 69.6ms ± 1% 71.2ms ± 1% +2.39% (p=0.000 n=20+20)
TimeParse-12 357ns ± 1% 357ns ± 1% ~ (p=0.883 n=18+20)
TimeFormat-12 379ns ± 1% 362ns ± 1% -4.53% (p=0.000 n=18+19)
[Geo mean] 62.0µs 61.8µs -0.44%
name              old time/op  new time/op  delta
XBenchGarbage-12  5.89ms ± 2%  5.81ms ± 2%  -1.41%  (p=0.000 n=19+18)
Change-Id: I96b31cca6ae77c30693a891cff3fe663fa2447a0
Reviewed-on: https://go-review.googlesource.com/17748
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2015-12-11 17:49:14 -05:00
|
|
|
// Update other GC heap size stats. This must happen after
|
|
|
|
|
// cachestats (which flushes local statistics to these) and
|
2021-03-31 22:55:06 +00:00
|
|
|
// flushallmcaches (which modifies gcController.heapLive).
|
|
|
|
|
gcController.heapLive = work.bytesMarked
|
|
|
|
|
gcController.heapScan = uint64(gcController.scanWork)
|
2015-01-28 15:57:46 -05:00
|
|
|
|
2014-12-12 18:41:57 +01:00
|
|
|
if trace.enabled {
|
runtime: introduce heap_live; replace use of heap_alloc in GC
Currently there are two main consumers of memstats.heap_alloc:
updatememstats (aka ReadMemStats) and shouldtriggergc.
updatememstats recomputes heap_alloc from the ground up, so we don't
need to keep heap_alloc up to date for it. shouldtriggergc wants to
know how many bytes were marked by the previous GC plus how many bytes
have been allocated since then, but this *isn't* what heap_alloc
tracks. heap_alloc also includes objects that are not marked and
haven't yet been swept.
Introduce a new memstat called heap_live that actually tracks what
shouldtriggergc wants to know and stop keeping heap_alloc up to date.
Unlike heap_alloc, heap_live follows a simple sawtooth that drops
during each mark termination and increases monotonically between GCs.
heap_alloc, on the other hand, has much more complicated behavior: it
may drop during sweep termination, slowly decreases from background
sweeping between GCs, is roughly unaffected by allocation as long as
there are unswept spans (because we sweep and allocate at the same
rate), and may go up after background sweeping is done depending on
the GC trigger.
heap_live simplifies computing next_gc and using it to figure out when
to trigger garbage collection. Currently, we guess next_gc at the end
of a cycle and update it as we sweep and get a better idea of how much
heap was marked. Now, since we're directly tracking how much heap is
marked, we can directly compute next_gc.
This also corrects bugs that could cause us to trigger GC early.
Currently, in any case where sweep termination actually finds spans to
sweep, heap_alloc is an overestimation of live heap, so we'll trigger
GC too early. heap_live, on the other hand, is unaffected by sweeping.
Change-Id: I1f96807b6ed60d4156e8173a8e68745ffc742388
Reviewed-on: https://go-review.googlesource.com/8389
Reviewed-by: Russ Cox <rsc@golang.org>
2015-03-30 18:01:32 -04:00
|
|
|
traceHeapAlloc()
|
2014-12-12 18:41:57 +01:00
|
|
|
}
|
2015-02-19 16:43:27 -05:00
|
|
|
}
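The effect of floating garbage on the heap goal, which motivates recording the marked heap in gcMark, can be worked through with a small calculation. This is a sketch under stated assumptions, not the pacer's actual code: heapGoal is a hypothetical helper that applies the simplified GOGC rule goal = basis * (1 + GOGC/100), and the floatLive figure of 268MB is chosen only so that marked comes out near the ~300MB retained heap quoted in the commit message, against a reachable heap of ~32MB.

```go
package main

import "fmt"

// heapGoal is a hypothetical helper applying the simplified GOGC rule:
// goal = basis * (1 + gogc/100). The real pacer is more involved.
func heapGoal(basisBytes, gogc uint64) uint64 {
	return basisBytes + basisBytes*gogc/100
}

func main() {
	const MB = 1 << 20

	stwLive := uint64(32 * MB)    // truly reachable heap (figure from the commit message)
	floatLive := uint64(268 * MB) // floating garbage; illustrative assumption
	marked := stwLive + floatLive // what a concurrent cycle actually marks

	// Basing the goal on marked inflates it by the floating garbage.
	fmt.Println("goal from marked heap:   ", heapGoal(marked, 100)/MB, "MB")
	// Basing it on the reachable estimate keeps the goal close to the live data.
	fmt.Println("goal from reachable heap:", heapGoal(stwLive, 100)/MB, "MB")
}
```

With GOGC=100 the goal is twice the basis, so every byte of floating garbage counted in the basis adds two bytes to the next cycle's goal, which is the feedback loop the commit message describes.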
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2019-05-17 14:48:04 +00:00
|
|
|
// gcSweep must be called on the system stack because it acquires the heap
|
|
|
|
|
// lock. See mheap for details.
|
runtime: add new mcentral implementation
Currently mcentral is implemented as a couple of linked lists of spans
protected by a lock. Unfortunately this design leads to significant lock
contention.
The span ownership model is also confusing and complicated. In-use spans
jump between being owned by multiple sources, generally some combination
of a gcSweepBuf, a concurrent sweeper, an mcentral or an mcache.
So first to address contention, this change replaces those linked lists
with gcSweepBufs which have an atomic fast path. Then, we change up the
ownership model: a span may be simultaneously owned only by an mcentral
and the page reclaimer. Otherwise, an mcentral (which now consists of
sweep bufs), a sweeper, or an mcache are the sole owners of a span at
any given time. This dramatically simplifies reasoning about span
ownership in the runtime.
As a result of this new ownership model, sweeping is now driven by
walking over the mcentrals rather than having its own global list of
spans. Because we no longer have a global list and we traditionally
haven't used the mcentrals for large object spans, we no longer have
anywhere to put large objects. So, this change also makes it so that we
keep large object spans in the appropriate mcentral lists.
In terms of the static lock ranking, we add the spanSet spine locks in
pretty much the same place as the mcentral locks, since they have the
potential to be manipulated both on the allocation and sweep paths, like
the mcentral locks.
This new implementation is turned on by default via a feature flag
called go115NewMCentralImpl.
Benchmark results for 1 KiB allocation throughput (5 runs each):
name \ MiB/s go113 go114 gotip gotip+this-patch
AllocKiB-1 1.71k ± 1% 1.68k ± 1% 1.59k ± 2% 1.71k ± 1%
AllocKiB-2 2.46k ± 1% 2.51k ± 1% 2.54k ± 1% 2.93k ± 1%
AllocKiB-4 4.27k ± 1% 4.41k ± 2% 4.33k ± 1% 5.01k ± 2%
AllocKiB-8 4.38k ± 3% 5.24k ± 1% 5.46k ± 1% 8.23k ± 1%
AllocKiB-12 4.38k ± 3% 4.49k ± 1% 5.10k ± 1% 10.04k ± 0%
AllocKiB-16 4.31k ± 1% 4.14k ± 3% 4.22k ± 0% 10.42k ± 0%
AllocKiB-20 4.26k ± 1% 3.98k ± 1% 4.09k ± 1% 10.46k ± 3%
AllocKiB-24 4.20k ± 1% 3.97k ± 1% 4.06k ± 1% 10.74k ± 1%
AllocKiB-28 4.15k ± 0% 4.00k ± 0% 4.20k ± 0% 10.76k ± 1%
Fixes #37487.
Change-Id: I92d47355acacf9af2c41bf080c08a8c1638ba210
Reviewed-on: https://go-review.googlesource.com/c/go/+/221182
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2020-02-20 20:58:45 +00:00
|
|
|
//
|
|
|
|
|
// The world must be stopped.
|
|
|
|
|
//
|
2019-05-17 14:48:04 +00:00
|
|
|
//go:systemstack
|
2015-09-24 14:30:09 -04:00
|
|
|
func gcSweep(mode gcMode) {
|
2020-10-28 18:06:05 -04:00
|
|
|
assertWorldStopped()
|
|
|
|
|
|
2015-03-05 17:33:08 -05:00
|
|
|
if gcphase != _GCoff {
|
|
|
|
|
throw("gcSweep being done but phase is not GCoff")
|
|
|
|
|
}
|
2014-11-15 08:00:38 -05:00
|
|
|
|
2015-02-19 16:21:42 -05:00
|
|
|
lock(&mheap_.lock)
|
2014-11-11 17:05:02 -05:00
|
|
|
mheap_.sweepgen += 2
|
2021-04-06 19:25:28 -04:00
|
|
|
mheap_.sweepDrained = 0
|
2017-04-03 15:47:11 -04:00
|
|
|
mheap_.pagesSwept = 0
|
runtime: implement efficient page reclaimer
When we attempt to allocate an N page span (either for a large
allocation or when an mcentral runs dry), we first try to sweep spans
to release N pages. Currently, this can be extremely expensive:
sweeping a span to emptiness is the hardest thing to ask for and the
sweeper generally doesn't know where to even look for potentially
fruitful results. Since this is on the critical path of many
allocations, this is unfortunate.
This CL changes how we reclaim empty spans. Instead of trying lots of
spans and hoping for the best, it uses the newly introduced span marks
to efficiently find empty spans. The span marks (and in-use bits) are
in a dense bitmap, so these spans can be found with an efficient
sequential memory scan. This approach can scan for unmarked spans at
about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We
could probably significantly improve the rate at which it can free
unmarked spans, but that's a separate issue.
Like the current reclaimer, this is still linear in the number of
spans that are swept, but the constant factor is now so vanishingly
small that it doesn't matter.
The benchmark in #18155 demonstrates both significant page reclaiming
delays, and object reclaiming delays. With "-retain-count=20000000
-preallocate=true -loop-count=3", the benchmark demonstrates several
page reclaiming delays on the order of 40ms. After this change, the
page reclaims are insignificant. The longest sweeps are still ~150ms,
but are object reclaiming delays. We'll address those in the next
several CLs.
Updates #18155.
Fixes #21378 by completely replacing the logic that had that bug.
Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064
Reviewed-on: https://go-review.googlesource.com/c/138959
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2018-09-27 11:34:07 -04:00
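The bitmap scan described in this commit message can be illustrated with a minimal sketch. It assumes two dense bitmaps with one bit per span, packed 64 per word; the names `findUnmarked`, `inUse`, and `marks` are hypothetical and are not the runtime's actual data structures.

```go
package main

import (
	"fmt"
	"math/bits"
)

// findUnmarked returns the indexes of spans that are in use but were not
// marked in the last cycle. inUse and marks are hypothetical dense bitmaps
// (one bit per span) and are assumed to have the same length.
func findUnmarked(inUse, marks []uint64) []int {
	var free []int
	for w := range inUse {
		// Bits set in inUse and clear in marks are reclaim candidates.
		cand := inUse[w] &^ marks[w]
		for cand != 0 {
			b := bits.TrailingZeros64(cand)
			free = append(free, w*64+b)
			cand &= cand - 1 // clear the lowest set bit
		}
	}
	return free
}

func main() {
	inUse := []uint64{0b1011, 0b1}
	marks := []uint64{0b0010, 0b1}
	fmt.Println(findUnmarked(inUse, marks)) // prints [0 3]
}
```

Because each word covers 64 spans and most words are skipped with a single AND-NOT, a sequential pass of this kind stays cheap even though it is linear in the number of spans.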
|
|
|
mheap_.sweepArenas = mheap_.allArenas
|
|
|
|
|
mheap_.reclaimIndex = 0
|
|
|
|
|
mheap_.reclaimCredit = 0
|
2014-11-11 17:05:02 -05:00
|
|
|
unlock(&mheap_.lock)
|
|
|
|
|
|
2020-02-19 16:37:48 +00:00
|
|
|
sweep.centralIndex.clear()
|
2020-02-20 20:58:45 +00:00
|
|
|
|
2015-02-19 16:43:27 -05:00
|
|
|
if !_ConcurrentSweep || mode == gcForceBlockMode {
|
|
|
|
|
// Special case synchronous sweep.
|
runtime: finish sweeping before concurrent GC starts
Currently, the concurrent sweep follows a 1:1 rule: when allocation
needs a span, it sweeps a span (likewise, when a large allocation
needs N pages, it sweeps until it frees N pages). This rule worked
well for the STW collector (especially when GOGC==100) because it did
no more sweeping than necessary to keep the heap from growing, would
generally finish sweeping just before GC, and ensured good temporal
locality between sweeping a page and allocating from it.
It doesn't work well with concurrent GC. Since concurrent GC requires
starting GC earlier (sometimes much earlier), the sweep often won't be
done when GC starts. Unfortunately, the first thing GC has to do is
finish the sweep. In the meantime, the mutator can continue
allocating, pushing the heap size even closer to the goal size. This
worked okay with the 7/8ths trigger, but it gets into a vicious cycle
with the GC trigger controller: if the mutator is allocating quickly
and driving the trigger lower, more and more sweep work will be left
to GC; this both causes GC to take longer (allowing the mutator to
allocate more during GC) and delays the start of the concurrent mark
phase, which throws off the GC controller's statistics and generally
causes it to push the trigger even lower.
As an example of a particularly bad case, the garbage benchmark with
GOMAXPROCS=4 and -benchmem 512 (MB) spends the first 0.4-0.8 seconds
of each GC cycle sweeping, during which the heap grows by between
109MB and 252MB.
To fix this, this change replaces the 1:1 sweep rule with a
proportional sweep rule. At the end of GC, GC knows exactly how much
heap allocation will occur before the next concurrent GC as well as
how many span pages must be swept. This change computes this "sweep
ratio" and when mallocgc asks for a span, the mcentral sweeps
enough spans to bring the swept span count into ratio with the
allocated byte count.
On the benchmark from above, this entirely eliminates sweeping at the
beginning of GC, which reduces the time between startGC readying the
GC goroutine and GC stopping the world for sweep termination to ~100µs
during which the heap grows at most 134KB.
Change-Id: I35422d6bba0c2310d48bb1f8f30a72d29e98c1af
Reviewed-on: https://go-review.googlesource.com/8921
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-13 23:34:57 -04:00
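The proportional sweep rule from this commit message can be sketched as follows. This is a simplified model, not the runtime's implementation: the `sweeper` type and its fields are hypothetical, and `heapDistance` (bytes expected to be allocated before the next GC) is assumed to be non-zero.

```go
package main

import "fmt"

// sweeper is a hypothetical, simplified model of proportional sweeping.
type sweeper struct {
	pagesPerByte float64 // sweep ratio computed at the end of GC
	pagesSwept   uint64  // pages swept so far in this cycle
	allocBytes   uint64  // bytes allocated since the end of GC
}

// newSweeper computes the sweep ratio from the number of pages that must be
// swept and the allocation expected before the next GC (assumed > 0).
func newSweeper(pagesToSweep, heapDistance uint64) *sweeper {
	return &sweeper{pagesPerByte: float64(pagesToSweep) / float64(heapDistance)}
}

// alloc records an allocation of n bytes and returns how many pages must be
// swept to bring the swept page count back into ratio.
func (s *sweeper) alloc(n uint64) uint64 {
	s.allocBytes += n
	target := uint64(s.pagesPerByte * float64(s.allocBytes))
	if target <= s.pagesSwept {
		return 0
	}
	owed := target - s.pagesSwept
	s.pagesSwept = target
	return owed
}

func main() {
	s := newSweeper(1000, 4<<20) // 1000 pages to sweep over 4 MiB of allocation
	fmt.Println(s.alloc(1 << 20)) // prints 250
	fmt.Println(s.alloc(1 << 20)) // prints 250
	fmt.Println(s.alloc(2 << 20)) // prints 500
}
```

In this model all 1000 pages have been swept by the time the 4 MiB allocation budget is used up, which is the property the commit message is after: sweeping finishes just before the next GC starts.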
|
|
|
// Record that no proportional sweeping has to happen.
|
|
|
|
|
lock(&mheap_.lock)
|
|
|
|
|
mheap_.sweepPagesPerByte = 0
|
|
|
|
|
unlock(&mheap_.lock)
|
2014-11-11 17:05:02 -05:00
|
|
|
// Sweep all spans eagerly.
|
|
|
|
|
for sweepone() != ^uintptr(0) {
|
|
|
|
|
sweep.npausesweep++
|
|
|
|
|
}
|
2017-03-20 17:25:59 -04:00
|
|
|
// Free workbufs eagerly.
|
|
|
|
|
prepareFreeWorkbufs()
|
|
|
|
|
for freeSomeWbufs(false) {
|
|
|
|
|
}
|
2017-03-01 21:03:20 -05:00
|
|
|
// All "free" events for this mark/sweep cycle have
|
|
|
|
|
// now happened, so we can make this profile cycle
|
|
|
|
|
// available immediately.
|
2017-03-01 13:58:22 -05:00
|
|
|
mProf_NextCycle()
|
|
|
|
|
mProf_Flush()
|
2015-02-19 16:43:27 -05:00
|
|
|
return
|
2014-11-11 17:05:02 -05:00
|
|
|
}
|
|
|
|
|
|
2015-02-19 16:43:27 -05:00
|
|
|
// Background sweep.
|
|
|
|
|
lock(&sweep.lock)
|
2015-03-05 16:04:17 -05:00
|
|
|
if sweep.parked {
|
2015-02-19 16:43:27 -05:00
|
|
|
sweep.parked = false
|
2016-05-17 18:21:54 -04:00
|
|
|
ready(sweep.g, 0, true)
|
2015-02-19 16:43:27 -05:00
|
|
|
}
|
|
|
|
|
unlock(&sweep.lock)
|
|
|
|
|
}
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-10-17 23:57:53 -04:00
|
|
|
// gcResetMarkState resets global state prior to marking (concurrent
|
2016-03-01 15:09:24 -05:00
|
|
|
// or STW) and resets the stack scan state of all Gs.
|
|
|
|
|
//
|
|
|
|
|
// This is safe to do without the world stopped because any Gs created
|
|
|
|
|
// during or after this will start out in the reset state.
|
2019-05-17 14:48:04 +00:00
|
|
|
//
|
|
|
|
|
// gcResetMarkState must be called on the system stack because it acquires
|
|
|
|
|
// the heap lock. See mheap for details.
|
|
|
|
|
//
|
|
|
|
|
//go:systemstack
|
2015-10-17 23:57:53 -04:00
|
|
|
func gcResetMarkState() {
|
2020-12-23 15:05:37 -05:00
|
|
|
// This may be called during a concurrent phase, so lock to make sure
|
2015-02-24 22:20:38 -05:00
|
|
|
// allgs doesn't change.
|
2020-12-23 15:05:37 -05:00
|
|
|
forEachG(func(gp *g) {
|
2019-09-27 14:13:22 -04:00
|
|
|
gp.gcscandone = false // set to true in gcphasework
|
runtime: directly track GC assist balance
Currently we track the per-G GC assist balance as two monotonically
increasing values: the bytes allocated by the G this cycle (gcalloc)
and the scan work performed by the G this cycle (gcscanwork). The
assist balance is hence assistRatio*gcalloc - gcscanwork.
This works, but has two important downsides:
1) It requires floating-point math to figure out if a G is in debt or
not. This makes it inappropriate to check for assist debt in the
hot path of mallocgc, so we only do this when a G allocates a new
span. As a result, Gs can operate "in the red", leading to
under-assist and extended GC cycle length.
2) Revising the assist ratio during a GC cycle can lead to an "assist
burst". If you think of plotting the scan work performed versus
heap size, the assist ratio controls the slope of this line.
However, in the current system, the target line always passes
through 0 at the heap size that triggered GC, so if the runtime
increases the assist ratio, there has to be a potentially large
assist to jump from the current amount of scan work up to the new
target scan work for the current heap size.
This commit replaces this approach with directly tracking the GC
assist balance in terms of allocation credit bytes. Allocating N bytes
simply decreases this by N and assisting raises it by the amount of
scan work performed divided by the assist ratio (to get back to
bytes).
This will make it cheap to figure out if a G is in debt, which will
let us efficiently check if an assist is necessary *before* performing
an allocation and hence keep Gs "in the black".
This also fixes assist bursts because the assist ratio is now in terms
of *remaining* work, rather than work from the beginning of the GC
cycle. Hence, the plot of scan work versus heap size becomes
continuous: we can revise the slope, but this slope always starts from
where we are right now, rather than where we were at the beginning of
the cycle.
Change-Id: Ia821c5f07f8a433e8da7f195b52adfedd58bdf2c
Reviewed-on: https://go-review.googlesource.com/15408
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-10-04 20:16:57 -07:00
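The credit-based accounting described in this commit message might look roughly like the sketch below. The `goroutine` type, the `alloc` method, and the `workPerByte` parameter are hypothetical names for illustration; the runtime's real bookkeeping is more involved.

```go
package main

import "fmt"

// goroutine is a hypothetical stand-in for a G; assistBytes is its
// allocation credit in bytes (negative means the G is in debt).
type goroutine struct {
	assistBytes int64
}

// alloc charges n bytes against the credit. If the G ends up in debt, it
// returns the scan work needed to pay the debt off and credits that work
// back in bytes. workPerByte is the assumed assist ratio (scan work units
// per allocated byte) and must be positive.
func (g *goroutine) alloc(n int64, workPerByte float64) (scanWork int64) {
	g.assistBytes -= n
	if g.assistBytes >= 0 {
		return 0 // integer-only check on the hot path: still in the black
	}
	debt := -g.assistBytes
	scanWork = int64(float64(debt) * workPerByte)
	// Convert the work performed back into bytes of credit.
	g.assistBytes += int64(float64(scanWork) / workPerByte)
	return scanWork
}

func main() {
	g := &goroutine{assistBytes: 64}
	fmt.Println(g.alloc(32, 0.5), g.assistBytes)  // prints 0 32
	fmt.Println(g.alloc(132, 0.5), g.assistBytes) // prints 50 0
}
```

Since the ratio is only applied to the outstanding debt, changing `workPerByte` between calls changes the slope from the current point onward and does not trigger a catch-up burst.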
|
|
|
gp.gcAssistBytes = 0
|
2020-12-23 15:05:37 -05:00
|
|
|
})
|
2015-02-24 22:20:38 -05:00
|
|
|
|
2018-09-26 15:59:21 -04:00
|
|
|
// Clear page marks. This is just 1MB per 64GB of heap, so the
|
|
|
|
|
// time here is pretty trivial.
|
|
|
|
|
lock(&mheap_.lock)
|
|
|
|
|
arenas := mheap_.allArenas
|
|
|
|
|
unlock(&mheap_.lock)
|
|
|
|
|
for _, ai := range arenas {
|
|
|
|
|
ha := mheap_.arenas[ai.l1()][ai.l2()]
|
|
|
|
|
for i := range ha.pageMarks {
|
|
|
|
|
ha.pageMarks[i] = 0
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2015-06-26 13:56:58 -04:00
|
|
|
work.bytesMarked = 0
|
2021-03-31 22:55:06 +00:00
|
|
|
work.initialHeapLive = atomic.Load64(&gcController.heapLive)
|
2015-06-26 13:56:58 -04:00
|
|
|
}
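The "1MB per 64GB of heap" figure in the comment above can be checked with a short calculation. This sketch assumes one mark bit per page and an 8 KiB page size; both are assumptions stated here for the arithmetic, and `pageMarkBytes` is a hypothetical helper.

```go
package main

import "fmt"

// pageMarkBytes returns the size in bytes of a bitmap holding one bit per
// page for a heap of heapBytes, given the page size in bytes.
func pageMarkBytes(heapBytes, pageSize uint64) uint64 {
	pages := heapBytes / pageSize
	return pages / 8 // eight page bits per byte
}

func main() {
	// 64 GiB heap, 8 KiB pages (assumed).
	fmt.Println(pageMarkBytes(64<<30, 8<<10)) // prints 1048576
}
```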
|
|
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
// Hooks for other packages
|
2014-11-11 17:05:02 -05:00
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
var poolcleanup func()
|
|
|
|
|
|
|
|
|
|
//go:linkname sync_runtime_registerPoolCleanup sync.runtime_registerPoolCleanup
|
|
|
|
|
func sync_runtime_registerPoolCleanup(f func()) {
|
|
|
|
|
poolcleanup = f
|
2014-11-11 17:05:02 -05:00
|
|
|
}
|
|
|
|
|
|
2015-02-19 13:38:46 -05:00
|
|
|
func clearpools() {
|
|
|
|
|
// clear sync.Pools
|
|
|
|
|
if poolcleanup != nil {
|
|
|
|
|
poolcleanup()
|
2014-11-11 17:05:02 -05:00
|
|
|
}
|
|
|
|
|
|
2015-02-03 00:33:02 +03:00
|
|
|
// Clear central sudog cache.
|
|
|
|
|
// Leave per-P caches alone, they have strictly bounded size.
|
|
|
|
|
// Disconnect cached list before dropping it on the floor,
|
|
|
|
|
// so that a dangling ref to one entry does not pin all of them.
|
|
|
|
|
lock(&sched.sudoglock)
|
|
|
|
|
var sg, sgnext *sudog
|
|
|
|
|
for sg = sched.sudogcache; sg != nil; sg = sgnext {
|
|
|
|
|
sgnext = sg.next
|
|
|
|
|
sg.next = nil
|
|
|
|
|
}
|
|
|
|
|
sched.sudogcache = nil
|
|
|
|
|
unlock(&sched.sudoglock)
|
|
|
|
|
|
2021-06-08 18:45:18 -04:00
|
|
|
// Clear central defer pool.
|
2015-02-05 13:35:41 +00:00
|
|
|
// Leave per-P pools alone, they have strictly bounded size.
|
|
|
|
|
lock(&sched.deferlock)
|
2021-06-08 18:45:18 -04:00
|
|
|
// Disconnect cached list before dropping it on the floor,
|
|
|
|
|
// so that a dangling ref to one entry does not pin all of them.
|
|
|
|
|
var d, dlink *_defer
|
|
|
|
|
for d = sched.deferpool; d != nil; d = dlink {
|
|
|
|
|
dlink = d.link
|
|
|
|
|
d.link = nil
|
2015-02-05 13:35:41 +00:00
|
|
|
}
|
2021-06-08 18:45:18 -04:00
|
|
|
sched.deferpool = nil
|
2015-02-05 13:35:41 +00:00
|
|
|
unlock(&sched.deferlock)
|
2015-02-19 13:38:46 -05:00
|
|
|
}
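The "disconnect before dropping" pattern used for both caches above can be shown in isolation. This is a minimal sketch with a hypothetical `node` type; it only demonstrates why unlinking matters when something still holds a reference into the middle of the list.

```go
package main

import "fmt"

// node is a hypothetical singly linked cache entry.
type node struct {
	id   int
	next *node
}

// dropList unlinks every entry so that a stale reference to one node
// cannot keep the rest of the list reachable.
func dropList(head *node) {
	var n, next *node
	for n = head; n != nil; n = next {
		next = n.next
		n.next = nil
	}
}

func main() {
	c := &node{id: 3}
	b := &node{id: 2, next: c}
	a := &node{id: 1, next: b}

	stale := b // a dangling reference into the middle of the list
	dropList(a)
	fmt.Println(stale.next == nil) // prints true
}
```

Without the unlinking loop, `stale` would still reach node 3 through `next`, so the garbage collector could not free the tail of the list.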
|
|
|
|
|
|
2017-10-22 18:10:08 -04:00
|
|
|
// Timing
|
|
|
|
|
|
2015-03-26 18:48:42 -04:00
|
|
|
// itoaDiv formats val/(10**dec) into buf.
|
|
|
|
|
func itoaDiv(buf []byte, val uint64, dec int) []byte {
|
|
|
|
|
i := len(buf) - 1
|
|
|
|
|
idec := i - dec
|
|
|
|
|
for val >= 10 || i >= idec {
|
|
|
|
|
buf[i] = byte(val%10 + '0')
|
|
|
|
|
i--
|
|
|
|
|
if i == idec {
|
|
|
|
|
buf[i] = '.'
|
|
|
|
|
i--
|
|
|
|
|
}
|
|
|
|
|
val /= 10
|
|
|
|
|
}
|
|
|
|
|
buf[i] = byte(val + '0')
|
|
|
|
|
return buf[i:]
|
|
|
|
|
}
|
runtime: increase precision of gctrace times
Currently we truncate gctrace clock and CPU times to millisecond
precision. As a result, many phases are typically printed as 0, which
is fine for user consumption, but makes gathering statistics and
reports over GC traces difficult.
In 1.4, the gctrace line printed times in microseconds. This was
better for statistics, but not as easy for users to read or interpret,
and it generally made the trace lines longer.
This change strikes a balance between these extremes by printing
milliseconds, but including the decimal part to two significant
figures down to microsecond precision. This remains easy to read and
interpret, but includes more precision when it's useful.
For example, where the code currently prints,
gc #29 @1.629s 0%: 0+2+0+12+0 ms clock, 0+2+0+0/12/0+0 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
this prints,
gc #29 @1.629s 0%: 0.005+2.1+0+12+0.29 ms clock, 0.005+2.1+0+0/12/0+0.29 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
Fixes #10970.
Change-Id: I249624779433927cd8b0947b986df9060c289075
Reviewed-on: https://go-review.googlesource.com/10554
Reviewed-by: Russ Cox <rsc@golang.org>
2015-05-30 21:47:00 -04:00
|
|
|
|
|
|
|
|
// fmtNSAsMS nicely formats ns nanoseconds as milliseconds.
|
|
|
|
|
func fmtNSAsMS(buf []byte, ns uint64) []byte {
|
|
|
|
|
if ns >= 10e6 {
|
|
|
|
|
// Format as whole milliseconds.
|
|
|
|
|
return itoaDiv(buf, ns/1e6, 0)
|
|
|
|
|
}
|
|
|
|
|
// Format two digits of precision, with at most three decimal places.
|
|
|
|
|
x := ns / 1e3
|
|
|
|
|
if x == 0 {
|
|
|
|
|
buf[0] = '0'
|
|
|
|
|
return buf[:1]
|
|
|
|
|
}
|
|
|
|
|
dec := 3
|
|
|
|
|
for x >= 100 {
|
|
|
|
|
x /= 10
|
|
|
|
|
dec--
|
|
|
|
|
}
|
|
|
|
|
return itoaDiv(buf, x, dec)
|
|
|
|
|
}
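To see what these two formatting helpers produce, the following standalone program copies them unchanged into a `main` package and prints a few values. The 24-byte buffer size and the sample inputs are choices made for this illustration only.

```go
package main

import "fmt"

// itoaDiv formats val/(10**dec) into buf, filling it from the right.
func itoaDiv(buf []byte, val uint64, dec int) []byte {
	i := len(buf) - 1
	idec := i - dec
	for val >= 10 || i >= idec {
		buf[i] = byte(val%10 + '0')
		i--
		if i == idec {
			buf[i] = '.'
			i--
		}
		val /= 10
	}
	buf[i] = byte(val + '0')
	return buf[i:]
}

// fmtNSAsMS formats ns nanoseconds as milliseconds.
func fmtNSAsMS(buf []byte, ns uint64) []byte {
	if ns >= 10e6 {
		// Format as whole milliseconds.
		return itoaDiv(buf, ns/1e6, 0)
	}
	// Two digits of precision, with at most three decimal places.
	x := ns / 1e3
	if x == 0 {
		buf[0] = '0'
		return buf[:1]
	}
	dec := 3
	for x >= 100 {
		x /= 10
		dec--
	}
	return itoaDiv(buf, x, dec)
}

func main() {
	var buf [24]byte
	fmt.Println(string(itoaDiv(buf[:], 12345, 2)))       // prints 123.45
	fmt.Println(string(fmtNSAsMS(buf[:], 5_000)))        // prints 0.005
	fmt.Println(string(fmtNSAsMS(buf[:], 2_100_000)))    // prints 2.1
	fmt.Println(string(fmtNSAsMS(buf[:], 290_000)))      // prints 0.29
	fmt.Println(string(fmtNSAsMS(buf[:], 12_000_000)))   // prints 12
}
```

The returned slice aliases `buf`, so each result has to be consumed (here, converted to a string) before the buffer is reused.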
|
2021-03-24 10:45:20 -04:00
|
|
|
|
|
|
|
|
// Helpers for testing GC.
|
|
|
|
|
|
|
|
|
|
// gcTestMoveStackOnNextCall causes the stack to be moved on a call
|
|
|
|
|
// immediately following the call to this. It may not work correctly
|
|
|
|
|
// if any other work appears after this call (such as returning).
|
|
|
|
|
// Typically the following call should be marked go:noinline so it
|
|
|
|
|
// performs a stack check.
|
2021-03-31 12:13:58 -04:00
|
|
|
//
|
|
|
|
|
// In rare cases this may not cause the stack to move, specifically if
|
|
|
|
|
// there's a preemption between this call and the next.
|
2021-03-24 10:45:20 -04:00
|
|
|
func gcTestMoveStackOnNextCall() {
|
|
|
|
|
gp := getg()
|
2021-04-01 16:50:53 -04:00
|
|
|
gp.stackguard0 = stackForceMove
|
2021-03-24 10:45:20 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// gcTestIsReachable performs a GC and returns a bit set where bit i
|
|
|
|
|
// is set if ptrs[i] is reachable.
|
|
|
|
|
func gcTestIsReachable(ptrs ...unsafe.Pointer) (mask uint64) {
|
|
|
|
|
// This takes the pointers as unsafe.Pointers in order to keep
|
|
|
|
|
// them live long enough for us to attach specials. After
|
|
|
|
|
// that, we drop our references to them.
|
|
|
|
|
|
|
|
|
|
if len(ptrs) > 64 {
|
|
|
|
|
panic("too many pointers for uint64 mask")
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Block GC while we attach specials and drop our references
|
|
|
|
|
// to ptrs. Otherwise, if a GC is in progress, it could mark
|
|
|
|
|
// them reachable via this function before we have a chance to
|
|
|
|
|
// drop them.
|
|
|
|
|
semacquire(&gcsema)
|
|
|
|
|
|
|
|
|
|
// Create reachability specials for ptrs.
|
|
|
|
|
specials := make([]*specialReachable, len(ptrs))
|
|
|
|
|
for i, p := range ptrs {
|
|
|
|
|
lock(&mheap_.speciallock)
|
|
|
|
|
s := (*specialReachable)(mheap_.specialReachableAlloc.alloc())
|
|
|
|
|
unlock(&mheap_.speciallock)
|
|
|
|
|
s.special.kind = _KindSpecialReachable
|
|
|
|
|
if !addspecial(p, &s.special) {
|
|
|
|
|
throw("already have a reachable special (duplicate pointer?)")
|
|
|
|
|
}
|
|
|
|
|
specials[i] = s
|
|
|
|
|
// Make sure we don't retain ptrs.
|
|
|
|
|
ptrs[i] = nil
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
semrelease(&gcsema)
|
|
|
|
|
|
|
|
|
|
// Force a full GC and sweep.
|
|
|
|
|
GC()
|
|
|
|
|
|
|
|
|
|
// Process specials.
|
|
|
|
|
for i, s := range specials {
|
|
|
|
|
if !s.done {
|
|
|
|
|
printlock()
|
|
|
|
|
println("runtime: object", i, "was not swept")
|
|
|
|
|
throw("IsReachable failed")
|
|
|
|
|
}
|
|
|
|
|
if s.reachable {
|
|
|
|
|
mask |= 1 << i
|
|
|
|
|
}
|
|
|
|
|
lock(&mheap_.speciallock)
|
|
|
|
|
mheap_.specialReachableAlloc.free(unsafe.Pointer(s))
|
|
|
|
|
unlock(&mheap_.speciallock)
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return mask
|
|
|
|
|
}
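The bit-set convention of the return value (bit i set when ptrs[i] is reachable) can be sketched outside the runtime. The helpers `reachableMask` and `isSet` below are hypothetical and only model how such a mask is built and read; they do not perform a GC.

```go
package main

import "fmt"

// reachableMask packs up to 64 booleans into a bit set, bit i for entry i.
func reachableMask(reachable []bool) (mask uint64) {
	if len(reachable) > 64 {
		panic("too many entries for uint64 mask")
	}
	for i, r := range reachable {
		if r {
			mask |= 1 << i
		}
	}
	return mask
}

// isSet reports whether bit i of mask is set.
func isSet(mask uint64, i int) bool {
	return mask&(1<<i) != 0
}

func main() {
	mask := reachableMask([]bool{true, false, true})
	fmt.Println(mask)           // prints 5
	fmt.Println(isSet(mask, 1)) // prints false
	fmt.Println(isSet(mask, 2)) // prints true
}
```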
|
|
|
|
|
|
|
|
|
|
// gcTestPointerClass returns the category of what p points to, one of:
|
|
|
|
|
// "heap", "stack", "data", "bss", "other". This is useful for checking
|
|
|
|
|
// that a test is doing what it's intended to do.
|
|
|
|
|
//
|
|
|
|
|
// This is nosplit simply to avoid extra pointer shuffling that may
|
|
|
|
|
// complicate a test.
|
|
|
|
|
//
|
|
|
|
|
//go:nosplit
|
|
|
|
|
func gcTestPointerClass(p unsafe.Pointer) string {
|
|
|
|
|
p2 := uintptr(noescape(p))
|
|
|
|
|
gp := getg()
|
|
|
|
|
if gp.stack.lo <= p2 && p2 < gp.stack.hi {
|
|
|
|
|
return "stack"
|
|
|
|
|
}
|
|
|
|
|
if base, _, _ := findObject(p2, 0, 0); base != 0 {
|
|
|
|
|
return "heap"
|
|
|
|
|
}
|
|
|
|
|
for _, datap := range activeModules() {
|
|
|
|
|
if datap.data <= p2 && p2 < datap.edata || datap.noptrdata <= p2 && p2 < datap.enoptrdata {
|
|
|
|
|
return "data"
|
|
|
|
|
}
|
|
|
|
|
if datap.bss <= p2 && p2 < datap.ebss || datap.noptrbss <= p2 && p2 < datap.enoptrbss {
|
|
|
|
|
return "bss"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
KeepAlive(p)
|
|
|
|
|
return "other"
|
|
|
|
|
}
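The address checks in this function are half-open range tests of the form lo <= p && p < hi, where hi is one past the end of the region. A minimal sketch of that classification, with a hypothetical `region` table and made-up addresses, follows.

```go
package main

import "fmt"

// region is a hypothetical half-open address range [lo, hi).
type region struct {
	name   string
	lo, hi uintptr
}

// classify returns the name of the first region containing p, or "other".
func classify(p uintptr, regions []region) string {
	for _, r := range regions {
		if r.lo <= p && p < r.hi {
			return r.name
		}
	}
	return "other"
}

func main() {
	regions := []region{
		{"data", 0x1000, 0x2000},
		{"bss", 0x2000, 0x3000},
	}
	fmt.Println(classify(0x1fff, regions)) // prints data
	fmt.Println(classify(0x2000, regions)) // prints bss
	fmt.Println(classify(0x3000, regions)) // prints other
}
```

Using a strict upper bound keeps adjacent regions from overlapping: the address 0x2000 belongs to exactly one region, and the end address of the last region is classified as outside it.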
|