go/src/runtime/mgc.go

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Garbage collector (GC).
//
// The GC runs concurrently with mutator threads, is type accurate (aka precise), and allows multiple
// GC threads to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is
// non-generational and non-compacting. Allocation is done using size-segregated per-P allocation
// areas to minimize fragmentation while eliminating locks in the common case.
//
// The algorithm decomposes into several steps.
// This is a high-level description of the algorithm being used. For an overview of GC, a good
// place to start is Richard Jones' gchandbook.org.
//
// The algorithm's intellectual heritage includes Dijkstra's on-the-fly algorithm, see
// Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. 1978.
// On-the-fly garbage collection: an exercise in cooperation. Commun. ACM 21, 11 (November 1978),
// 966-975.
// For journal quality proofs that these steps are complete, correct, and terminate see
// Hudson, R., and Moss, J.E.B. Copying Garbage Collection without stopping the world.
// Concurrency and Computation: Practice and Experience 15(3-5), 2003.
//
// 1. GC performs sweep termination.
//
// a. Stop the world. This causes all Ps to reach a GC safe-point.
//
// b. Sweep any unswept spans. There will only be unswept spans if
// this GC cycle was forced before the expected time.
//
// 2. GC performs the "mark 1" sub-phase. In this sub-phase, Ps are
// allowed to locally cache parts of the work queue.
//
// a. Prepare for the mark phase by setting gcphase to _GCmark
// (from _GCoff), enabling the write barrier, enabling mutator
// assists, and enqueueing root mark jobs. No objects may be
// scanned until all Ps have enabled the write barrier, which is
// accomplished using STW.
//
// b. Start the world. From this point, GC work is done by mark
// workers started by the scheduler and by assists performed as
// part of allocation. The write barrier shades both the
// overwritten pointer and the new pointer value for any pointer
// writes (see mbarrier.go for details). Newly allocated objects
// are immediately marked black.
//
// c. GC performs root marking jobs. This includes scanning all
// stacks, shading all globals, and shading any heap pointers in
// off-heap runtime data structures. Scanning a stack stops a
// goroutine, shades any pointers found on its stack, and then
// resumes the goroutine.
//
// d. GC drains the work queue of grey objects, scanning each grey
// object to black and shading all pointers found in the object
// (which in turn may add those pointers to the work queue).
//
// 3. Once the global work queue is empty (but local work queue caches
// may still contain work), GC performs the "mark 2" sub-phase.
//
// a. GC stops all workers, disables local work queue caches,
// flushes each P's local work queue cache to the global work queue
// cache, and reenables workers.
//
// b. GC again drains the work queue, as in 2d above.
//
// 4. Once the work queue is empty, GC performs mark termination.
//
// a. Stop the world.
//
// b. Set gcphase to _GCmarktermination, and disable workers and
// assists.
//
// c. Drain any remaining work from the work queue (typically there
// will be none).
//
// d. Perform other housekeeping like flushing mcaches.
//
// 5. GC performs the sweep phase.
//
// a. Prepare for the sweep phase by setting gcphase to _GCoff,
// setting up sweep state and disabling the write barrier.
//
// b. Start the world. From this point on, newly allocated objects
// are white, and allocating sweeps spans before use if necessary.
//
// c. GC does concurrent sweeping in the background and in response
// to allocation. See description below.
//
// 6. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.
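//
// A compact illustration of the cycle above, with schematic helper names
// (the real sequencing lives in gcStart, gcMarkDone, and gcMarkTermination):
//
//	stopTheWorld()                 // 1a: sweep termination
//	finishSweeping()               // 1b: sweep any unswept spans
//	setGCPhase(_GCmark)            // 2a: write barrier and assists on
//	startTheWorld()                // 2b: concurrent mark begins
//	...                            // 2c-3b: root marking, drain work queues
//	stopTheWorld()                 // 4a: mark termination
//	setGCPhase(_GCmarktermination) // 4b
//	...                            // 4c-4d: drain leftovers, flush mcaches
//	setGCPhase(_GCoff)             // 5a: set up sweep, write barrier off
//	startTheWorld()                // 5b: concurrent sweep proceeds
//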
// Concurrent sweep.
//
// The sweep phase proceeds concurrently with normal program execution.
// The heap is swept span-by-span both lazily (when a goroutine needs another span)
// and concurrently in a background goroutine (this helps programs that are not CPU bound).
// At the end of STW mark termination all spans are marked as "needs sweeping".
//
// The background sweeper goroutine simply sweeps spans one-by-one.
//
// To avoid requesting more OS memory while there are unswept spans, when a
// goroutine needs another span, it first attempts to reclaim that much memory
// by sweeping. When a goroutine needs to allocate a new small-object span, it
// sweeps small-object spans for the same object size until it frees at least
// one object. When a goroutine needs to allocate a large-object span from the heap,
// it sweeps spans until it frees at least that many pages into the heap. There is
// one case where this may not suffice: if a goroutine sweeps and frees two
// nonadjacent one-page spans to the heap, it will allocate a new two-page
// span, but there can still be other one-page unswept spans which could be
// combined into a two-page span.
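//
// The allocation-time sweeping described above amounts to a loop of roughly
// this shape (enoughReclaimed is a hypothetical helper; the real logic lives
// in the mcentral and mheap allocation paths, and sweepone is in mgcsweep.go):
//
//	for !enoughReclaimed() {
//		if sweepone() == ^uintptr(0) {
//			break // nothing left to sweep
//		}
//	}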
//
// It's critical to ensure that no operations proceed on unswept spans (that would corrupt
// mark bits in the GC bitmap). During GC all mcaches are flushed into the central cache,
// so they are empty. When a goroutine grabs a new span into mcache, it sweeps it.
// When a goroutine explicitly frees an object or sets a finalizer, it ensures that
// the span is swept (either by sweeping it, or by waiting for the concurrent sweep to finish).
// The finalizer goroutine is kicked off only when all spans are swept.
// When the next GC starts, it sweeps all not-yet-swept spans (if any).
//
// GC rate.
// Next GC is after we've allocated an extra amount of memory proportional to
// the amount already in use. The proportion is controlled by GOGC environment variable
// (100 by default). If GOGC=100 and we're using 4M, we'll GC again when we get to 8M
// (this mark is tracked in the next_gc variable). This keeps the GC cost in linear
// proportion to the allocation cost. Adjusting GOGC just changes the linear constant
// (and also the amount of extra memory used).
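//
// For example, with GOGC=100 the goal is simply double the retained heap, and
// in general (assuming GOGC >= 0):
//
//	next_gc = marked_heap + marked_heap*GOGC/100
//
// so GOGC=200 triggers at 3x the marked heap and GOGC=50 at 1.5x.
//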
// Oblets
//
// In order to prevent long pauses while scanning large objects and to
// improve parallelism, the garbage collector breaks up scan jobs for
// objects larger than maxObletBytes into "oblets" of at most
// maxObletBytes. When scanning encounters the beginning of a large
// object, it scans only the first oblet and enqueues the remaining
// oblets as new scan jobs.
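//
// For instance, a 1 MiB object with maxObletBytes of 128 KiB is scanned as
// eight oblets: the first 128 KiB right away and the remaining seven enqueued
// as separate jobs. Roughly (a sketch; see scanobject in mgcmark.go):
//
//	oblets := (size + maxObletBytes - 1) / maxObletBytes
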
package runtime
import (
"runtime/internal/atomic"
"runtime/internal/sys"
"unsafe"
)
const (
_DebugGC = 0
_ConcurrentSweep = true
_FinBlockSize = 4 * 1024
// sweepMinHeapDistance is a lower bound on the heap distance
// (in bytes) reserved for concurrent sweeping between GC
// cycles. This will be scaled by gcpercent/100.
sweepMinHeapDistance = 1024 * 1024
)
// heapminimum is the minimum heap size at which to trigger GC.
// For small heaps, this overrides the usual GOGC*live set rule.
//
// When there is a very small live set but a lot of allocation, simply
// collecting when the heap reaches GOGC*live results in many GC
// cycles and high total per-GC overhead. This minimum amortizes this
// per-GC overhead while keeping the heap reasonably small.
//
// During initialization this is set to 4MB*GOGC/100. In the case of
// GOGC==0, this will set heapminimum to 0, resulting in constant
// collection even when the heap size is small, which is useful for
// debugging.
var heapminimum uint64 = defaultHeapMinimum
// defaultHeapMinimum is the value of heapminimum for GOGC==100.
const defaultHeapMinimum = 4 << 20
// Initialized from $GOGC. GOGC=off means no GC.
var gcpercent int32
func gcinit() {
if unsafe.Sizeof(workbuf{}) != _WorkbufSize {
throw("size of Workbuf is suboptimal")
}
// No sweep on the first cycle.
mheap_.sweepdone = 1
// Set a reasonable initial GC trigger.
memstats.triggerRatio = 7 / 8.0
// Fake a heap_marked value so it looks like a trigger at
// heapminimum is the appropriate growth from heap_marked.
// This will go into computing the initial GC goal.
memstats.heap_marked = uint64(float64(heapminimum) / (1 + memstats.triggerRatio))
// Set gcpercent from the environment. This will also compute
// and set the GC trigger and goal.
_ = setGCPercent(readgogc())
work.startSema = 1
work.markDoneSema = 1
}
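
// exampleInitialTrigger is an illustrative sketch (not a runtime code path) of
// the identity gcinit relies on above: faking heap_marked as
// heapminimum/(1+triggerRatio) means the first trigger,
// heap_marked*(1+triggerRatio), works out to roughly heapminimum.
func exampleInitialTrigger(heapMinimum uint64, triggerRatio float64) uint64 {
	heapMarked := uint64(float64(heapMinimum) / (1 + triggerRatio))
	return uint64(float64(heapMarked) * (1 + triggerRatio))
}
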
func readgogc() int32 {
p := gogetenv("GOGC")
if p == "off" {
return -1
}
if n, ok := atoi32(p); ok {
return n
}
return 100
}
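
// As an illustration of the parsing above (example values, not additional
// special cases):
//
//	GOGC unset or unparsable -> 100 (the default)
//	GOGC=off                 -> -1  (disables GC)
//	GOGC=200                 -> 200
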
// gcenable is called after the bulk of the runtime initialization,
// just before we're about to start letting user code run.
// It kicks off the background sweeper goroutine and enables GC.
func gcenable() {
c := make(chan int, 1)
go bgsweep(c)
<-c
memstats.enablegc = true // now that runtime is initialized, GC is okay
}
//go:linkname setGCPercent runtime/debug.setGCPercent
func setGCPercent(in int32) (out int32) {
lock(&mheap_.lock)
out = gcpercent
if in < 0 {
in = -1
}
gcpercent = in
heapminimum = defaultHeapMinimum * uint64(gcpercent) / 100
// Update pacing in response to gcpercent change.
gcSetTriggerRatio(memstats.triggerRatio)
unlock(&mheap_.lock)
// If we just disabled GC, wait for any concurrent GC to
// finish so we always return with no GC running.
if in < 0 {
// Disable phase transitions.
lock(&work.sweepWaiters.lock)
if gcphase == _GCmark {
// GC is active. Wait until we reach sweeping.
gp := getg()
gp.schedlink = work.sweepWaiters.head
work.sweepWaiters.head.set(gp)
goparkunlock(&work.sweepWaiters.lock, "wait for GC cycle", traceEvGoBlock, 1)
} else {
// GC isn't active.
unlock(&work.sweepWaiters.lock)
}
}
return out
}
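
// User code normally reaches this through runtime/debug, which this function
// backs via the go:linkname above. For example:
//
//	old := debug.SetGCPercent(200) // grow the heap more between collections
//	defer debug.SetGCPercent(old)
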
// Garbage collector phase.
// Indicates to the write barrier and synchronization task what to perform.
var gcphase uint32
// The compiler knows about this variable.
// If you change it, you must change builtin/runtime.go, too.
// If you change the first four bytes, you must also change the write
// barrier insertion code.
var writeBarrier struct {
enabled bool // compiler emits a check of this before calling write barrier
pad [3]byte // compiler uses 32-bit load for "enabled" field
needed bool // whether we need a write barrier for current GC phase
cgo bool // whether we need a write barrier for a cgo check
alignme uint64 // guarantee alignment so that compiler can use a 32 or 64-bit load
}
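
// The check the compiler emits reduces to a single load and branch on the
// enabled field; schematically (barrierCall is a stand-in name; see
// mbarrier.go for the real barrier):
//
//	if writeBarrier.enabled {
//		barrierCall(dst, src)
//	} else {
//		*dst = src
//	}
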
// gcBlackenEnabled is 1 if mutator assists and background mark
// workers are allowed to blacken objects. This must only be set when
// gcphase == _GCmark.
var gcBlackenEnabled uint32
// gcBlackenPromptly indicates that optimizations that may
// hide work from the global work queue should be disabled.
//
// If gcBlackenPromptly is true, per-P gcWork caches should
// be flushed immediately and new objects should be allocated black.
//
// There is a tension between allocating objects white and
// allocating them black. If white and the objects die before being
// marked they can be collected during this GC cycle. On the other
// hand allocating them black will reduce _GCmarktermination latency
// since more work is done in the mark phase. This tension is resolved
// by allocating white until the mark phase is approaching its end and
// then allocating black for the remainder of the mark phase.
var gcBlackenPromptly bool
const (
_GCoff = iota // GC not running; sweeping in background, write barrier disabled
_GCmark // GC marking roots and workbufs: allocate black, write barrier ENABLED
_GCmarktermination // GC mark termination: allocate black, P's help GC, write barrier ENABLED
)
//go:nosplit
func setGCPhase(x uint32) {
atomic.Store(&gcphase, x)
writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}
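
// Given the assignments above, the net effect per phase is:
//
//	_GCoff             -> writeBarrier.enabled == writeBarrier.cgo
//	_GCmark            -> writeBarrier.enabled == true
//	_GCmarktermination -> writeBarrier.enabled == true
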
// gcMarkWorkerMode represents the mode that a concurrent mark worker
// should operate in.
//
// Concurrent marking happens through four different mechanisms. One
// is mutator assists, which happen in response to allocations and are
// not scheduled. The other three are variations in the per-P mark
// workers and are distinguished by gcMarkWorkerMode.
type gcMarkWorkerMode int
const (
// gcMarkWorkerDedicatedMode indicates that the P of a mark
// worker is dedicated to running that mark worker. The mark
// worker should run without preemption.
gcMarkWorkerDedicatedMode gcMarkWorkerMode = iota
// gcMarkWorkerFractionalMode indicates that a P is currently
// running the "fractional" mark worker. The fractional worker
// is necessary when GOMAXPROCS*gcBackgroundUtilization is not
// an integer. The fractional worker should run until it is
// preempted and will be scheduled to pick up the fractional
// part of GOMAXPROCS*gcBackgroundUtilization.
gcMarkWorkerFractionalMode
// gcMarkWorkerIdleMode indicates that a P is running the mark
// worker because it has nothing else to do. The idle worker
// should run until it is preempted and account its time
// against gcController.idleMarkTime.
gcMarkWorkerIdleMode
)
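
// exampleWorkerSplit is an illustrative sketch, not the runtime's scheduling
// logic (which lives in gcControllerState.startCycle): a utilization target
// splits into whole dedicated workers plus a fractional remainder that the
// fractional worker mode covers.
func exampleWorkerSplit(gomaxprocs int, utilization float64) (dedicated int, fractional float64) {
	target := float64(gomaxprocs) * utilization
	dedicated = int(target)
	fractional = target - float64(dedicated)
	return
}
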
// gcMarkWorkerModeStrings are the string labels of gcMarkWorkerModes
// to use in execution traces.
var gcMarkWorkerModeStrings = [...]string{
"GC (dedicated)",
"GC (fractional)",
"GC (idle)",
}
// gcController implements the GC pacing controller that determines
// when to trigger concurrent garbage collection and how much marking
// work to do in mutator assists and background marking.
//
// It uses a feedback control algorithm to adjust the memstats.gc_trigger
// trigger based on the heap growth and GC CPU utilization each cycle.
// This algorithm optimizes for heap growth to match GOGC and for CPU
// utilization between assist and background marking to be 25% of
// GOMAXPROCS. The high-level design of this algorithm is documented
// at https://golang.org/s/go15gcpacing.
//
// All fields of gcController are used only during a single mark
// cycle.
var gcController gcControllerState
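
// exampleTriggerFeedback is an illustrative sketch of the proportional
// feedback described above: nudge the trigger ratio toward the GOGC growth
// goal, discounted by how far CPU utilization strayed from its goal. The
// parameter names are hypothetical; the real update happens in
// gcControllerState.endCycle.
func exampleTriggerFeedback(trigger, goalGrowth, actualGrowth, util, goalUtil, gain float64) float64 {
	err := goalGrowth - trigger - util/goalUtil*(actualGrowth-trigger)
	return trigger + gain*err
}
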
type gcControllerState struct {
// scanWork is the total scan work performed this cycle. This
// is updated atomically during the cycle. Updates occur in
// bounded batches, since it is both written and read
// throughout the cycle. At the end of the cycle, this is how
// much of the retained heap is scannable.
//
// Currently this is the bytes of heap scanned. For most uses,
// this is an opaque unit of work, but for estimation the
// definition is important.
scanWork int64
// bgScanCredit is the scan work credit accumulated by the
// concurrent background scan. This credit is accumulated by
// the background scan and stolen by mutator assists. This is
// updated atomically. Updates occur in bounded batches, since
// it is both written and read throughout the cycle.
bgScanCredit int64
// assistTime is the nanoseconds spent in mutator assists
// during this cycle. This is updated atomically. Updates
// occur in bounded batches, since it is both written and read
// throughout the cycle.
assistTime int64
// dedicatedMarkTime is the nanoseconds spent in dedicated
// mark workers during this cycle. This is updated atomically
// at the end of the concurrent mark phase.
dedicatedMarkTime int64
// fractionalMarkTime is the nanoseconds spent in the
// fractional mark worker during this cycle. This is updated
// atomically throughout the cycle and will be up-to-date if
// the fractional mark worker is not currently running.
fractionalMarkTime int64
// idleMarkTime is the nanoseconds spent in idle marking
// during this cycle. This is updated atomically throughout
// the cycle.
idleMarkTime int64
// markStartTime is the absolute start time in nanoseconds
// that assists and background mark workers started.
markStartTime int64
// dedicatedMarkWorkersNeeded is the number of dedicated mark
// workers that need to be started. This is computed at the
// beginning of each cycle and decremented atomically as
// dedicated mark workers get started.
dedicatedMarkWorkersNeeded int64
// assistWorkPerByte is the ratio of scan work to allocated
// bytes that should be performed by mutator assists. This is
// computed at the beginning of each cycle and updated every
// time heap_scan is updated.
assistWorkPerByte float64
// assistBytesPerWork is 1/assistWorkPerByte.
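//
// Illustrative example (values are made up, not taken from the
// runtime): if assistWorkPerByte were 0.5, a goroutine that
// allocates 1 MB during the cycle would be expected to perform
// about 0.5 MB of scan work in assists; assistBytesPerWork (2 in
// that case) is simply the inverse, bytes of allocation per unit
// of scan work.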
assistBytesPerWork float64
// fractionalUtilizationGoal is the fraction of wall clock
// time that should be spent in the fractional mark worker on
// each P that isn't running a dedicated worker.
//
// For example, if the utilization goal is 25% and there are
// no dedicated workers, this will be 0.25. If the goal is
// 25%, there is one dedicated worker, and GOMAXPROCS is 5,
// this will be 0.05 to make up the missing 5%.
//
// If this is zero, no fractional workers are needed.
fractionalUtilizationGoal float64
_ [sys.CacheLineSize]byte
}
// startCycle resets the GC controller's state and computes estimates
// for a new GC cycle. The caller must hold worldsema.
func (c *gcControllerState) startCycle() {
c.scanWork = 0
c.bgScanCredit = 0
c.assistTime = 0
c.dedicatedMarkTime = 0
c.fractionalMarkTime = 0
c.idleMarkTime = 0
// If this is the first GC cycle or we're operating on a very
// small heap, fake heap_marked so it looks like gc_trigger is
// the appropriate growth from heap_marked, even though the
// real heap_marked may not have a meaningful value (on the
// first cycle) or may be much smaller (resulting in a large
// error response).
if memstats.gc_trigger <= heapminimum {
memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio))
}
// Re-compute the heap goal for this cycle in case something
// changed. This is the same calculation we use elsewhere.
memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
if gcpercent < 0 {
memstats.next_gc = ^uint64(0)
}
// Ensure that the heap goal is at least a little larger than
// the current live heap size. This may not be the case if GC
// start is delayed or if the allocation that pushed heap_live
// over gc_trigger is large or if the trigger is really close to
// the GOGC goal. Assist is proportional to this distance, so enforce a
// minimum distance, even if it means going over the GOGC goal
// by a tiny bit.
if memstats.next_gc < memstats.heap_live+1024*1024 {
memstats.next_gc = memstats.heap_live + 1024*1024
}
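// Worked example (illustrative numbers only): with GOGC=100 and
// heap_marked = 100 MB, next_gc = 100 MB + 100 MB*100/100 = 200 MB.
// If a delayed GC start means heap_live has already reached
// 199.5 MB, the check above raises the goal to heap_live + 1 MB =
// 200.5 MB so that assists still have some runway.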
// Compute the background mark utilization goal. In general,
// this may not come out exactly. We round the number of
// dedicated workers so that the utilization is closest to
// 25%. For small GOMAXPROCS, this would introduce too much
// error, so we add fractional workers in that case.
totalUtilizationGoal := float64(gomaxprocs) * gcBackgroundUtilization
c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal + 0.5)
utilError := float64(c.dedicatedMarkWorkersNeeded)/totalUtilizationGoal - 1
const maxUtilError = 0.3
if utilError < -maxUtilError || utilError > maxUtilError {
// Rounding put us more than 30% off our goal. With
// gcBackgroundUtilization of 25%, this happens for
// GOMAXPROCS<=3 or GOMAXPROCS=6. Enable fractional
// workers to compensate.
if float64(c.dedicatedMarkWorkersNeeded) > totalUtilizationGoal {
// Too many dedicated workers.
c.dedicatedMarkWorkersNeeded--
}
c.fractionalUtilizationGoal = (totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)) / float64(gomaxprocs)
} else {
c.fractionalUtilizationGoal = 0
}
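// Worked examples of the rounding above, assuming the 25%
// gcBackgroundUtilization referred to in the comments:
//
// GOMAXPROCS=4: totalUtilizationGoal = 1.0, so exactly one
// dedicated worker is used and no fractional worker is needed.
//
// GOMAXPROCS=1: totalUtilizationGoal = 0.25 rounds to 0 dedicated
// workers and utilError = -1, so the fractional goal becomes
// (0.25-0)/1 = 0.25.
//
// GOMAXPROCS=6: totalUtilizationGoal = 1.5 rounds to 2 dedicated
// workers and utilError = +1/3 > 0.3, so one dedicated worker is
// dropped and the fractional goal becomes (1.5-1)/6 ≈ 0.083.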
// Clear per-P state
for _, p := range allp {
p.gcAssistTime = 0
p.gcFractionalMarkTime = 0
}
// Compute initial values for controls that are updated
// throughout the cycle.
c.revise()
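// When GODEBUG=gcpacertrace=1 is set, the print below emits one
// line per cycle whose layout follows directly from the print
// call (the values shown are whatever the pacer computed):
//
// pacer: assist ratio=<assistWorkPerByte> (scan <heap_scan> MB in <initialHeapLive>-><next_gc> MB) workers=<dedicatedMarkWorkersNeeded>+<fractionalUtilizationGoal>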
if debug.gcpacertrace > 0 {
print("pacer: assist ratio=", c.assistWorkPerByte,
" (scan ", memstats.heap_scan>>20, " MB in ",
work.initialHeapLive>>20, "->",
memstats.next_gc>>20, " MB)",
" workers=", c.dedicatedMarkWorkersNeeded,
"+", c.fractionalUtilizationGoal, "\n")
}
}
// revise updates the assist ratio during the GC cycle to account for
// improved estimates. This should be called either under STW or
// whenever memstats.heap_scan, memstats.heap_live, or
// memstats.next_gc is updated (with mheap_.lock held).
//
// It should only be called when gcBlackenEnabled != 0 (because this
// is when assists are enabled and the necessary statistics are
// available).
func (c *gcControllerState) revise() {
gcpercent := gcpercent
if gcpercent < 0 {
// If GC is disabled but we're running a forced GC,
// act like GOGC is huge for the below calculations.
gcpercent = 100000
}
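// Since startCycle sets next_gc to ^uint64(0) when gcpercent < 0,
// a forced GC with GOGC=off takes the soft-goal branch below, and
// with gcpercent = 100000 the expected scan work works out to
// heap_scan*100/100100, roughly 0.1% of the scannable heap, so
// such a cycle schedules almost no mutator assists.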
live := atomic.Load64(&memstats.heap_live)
var heapGoal, scanWorkExpected int64
if live <= memstats.next_gc {
// We're under the soft goal. Pace GC to complete at
// next_gc assuming the heap is in steady-state.
heapGoal = int64(memstats.next_gc)
// Compute the expected scan work remaining.
//
// This is estimated based on the expected
// steady-state scannable heap. For example, with
// GOGC=100, only half of the scannable heap is
// expected to be live, so that's what we target.
//
// (This is a float calculation to avoid overflowing on
// 100*heap_scan.)
scanWorkExpected = int64(float64(memstats.heap_scan) * 100 / float64(100+gcpercent))
} else {
// We're past the soft goal. Pace GC so that in the
// worst case it will complete by the hard goal.
const maxOvershoot = 1.1
heapGoal = int64(float64(memstats.next_gc) * maxOvershoot)
// Compute the upper bound on the scan work remaining.
scanWorkExpected = int64(memstats.heap_scan)
}
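// Worked example (illustrative numbers): with GOGC=100 and
// heap_scan = 64 MB, the soft-goal branch expects
// 64*100/(100+100) = 32 MB of scan work, matching the assumption
// that half the scannable heap is live in steady state. Past the
// soft goal, the estimate rises to the full 64 MB and the hard
// goal becomes 1.1*next_gc.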
// Compute the remaining scan work estimate.
//
// Note that we currently count allocations during GC as both
// scannable heap (heap_scan) and scan work completed
// (scanWork), so allocation will change this difference
// slowly in the soft regime and not at all in the hard
// regime.
scanWorkRemaining := scanWorkExpected - c.scanWork
if scanWorkRemaining < 1000 {
// We set a somewhat arbitrary lower bound on
// remaining scan work since if we aim a little high,
// we can miss by a little.
//
// We *do* need to enforce that this is at least 1,
// since marking is racy and double-scanning objects
// may legitimately make the remaining scan work
// negative, even in the hard goal regime.
scanWorkRemaining = 1000
}
// Compute the heap distance remaining.
heapRemaining := heapGoal - int64(live)
if heapRemaining <= 0 {
// This shouldn't happen, but if it does, avoid
// dividing by zero or setting the assist negative.
heapRemaining = 1
}
// Compute the mutator assist ratio so that by the time the
// mutator allocates the remaining heap bytes up to next_gc,
// it will have done (or stolen) the remaining amount of scan
// work.
c.assistWorkPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
c.assistBytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
}
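
// The two ratios above are just reciprocal conversions between scan work
// and allocation bytes. The sketch below is illustrative only and is not
// part of the runtime; exampleAssistCredit and its parameters are
// hypothetical names chosen for the example. Assuming 4 units of scan work
// remain for every 16 bytes of remaining heap distance, assistWorkPerByte
// is 0.25 and assistBytesPerWork is 4: allocating a byte spends a byte of
// assist credit, and each unit of assist scan work performed earns back 4
// bytes of credit.
func exampleAssistCredit(creditBytes, allocBytes, scanWork int64, assistBytesPerWork float64) int64 {
	// Allocation spends credit byte for byte.
	creditBytes -= allocBytes
	// Assist scan work is converted back into bytes of credit using the
	// bytes-per-work ratio computed by the pacer.
	creditBytes += int64(float64(scanWork) * assistBytesPerWork)
	return creditBytes
}
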
// endCycle computes the trigger ratio for the next cycle.
func (c *gcControllerState) endCycle() float64 {
if work.userForced {
// Forced GC means this cycle didn't start at the
// trigger, so where it finished isn't good
// information about how to adjust the trigger.
// Just leave it where it is.
return memstats.triggerRatio
}
// Proportional response gain for the trigger controller. Must
// be in [0, 1]. Lower values smooth out transient effects but
// take longer to respond to phase changes. Higher values
// react to phase changes quickly, but are more affected by
// transient changes. Values near 1 may be unstable.
const triggerGain = 0.5
// Compute next cycle trigger ratio. First, this computes the
// "error" for this cycle; that is, how far off the trigger
// was from what it should have been, accounting for both heap
// growth and GC CPU utilization. We compute the actual heap
// growth during this cycle and scale that by how far off from
// the goal CPU utilization we were (to estimate the heap
// growth if we had the desired CPU utilization). The
// difference between this estimate and the GOGC-based goal
// heap growth is the error.
goalGrowthRatio := float64(gcpercent) / 100
actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1
assistDuration := nanotime() - c.markStartTime
// Assume background mark hit its utilization goal.
utilization := gcBackgroundUtilization
// Add assist utilization; avoid divide by zero.
if assistDuration > 0 {
utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
}
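
// For example (illustrative numbers only): with a background utilization
// of 0.25, a mark phase lasting 100ms on 4 Ps, and 40ms of total assist
// time, the assist term adds 40ms/(100ms*4) = 0.10, for a measured
// utilization of 0.35.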
triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)
// Finally, we adjust the trigger for next time by this error,
// damped by the proportional gain.
triggerRatio := memstats.triggerRatio + triggerGain*triggerError
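
// As a worked example (illustrative numbers only): suppose GOGC=100, so
// goalGrowthRatio is 1.0, the cycle was triggered at a triggerRatio of
// 0.6, the heap actually grew by 0.8 before marking finished, and the
// measured utilization was 0.35 against a goal of 0.30. Then
//
//	triggerError = 1.0 - 0.6 - (0.35/0.30)*(0.8-0.6) ≈ 0.17
//
// and the damped update moves the next trigger about halfway toward
// correcting that error: triggerRatio ≈ 0.6 + 0.5*0.17 ≈ 0.68.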
if debug.gcpacertrace > 0 {
// Print controller state in terms of the design
// document.
H_m_prev := memstats.heap_marked
h_t := memstats.triggerRatio
H_T := memstats.gc_trigger
h_a := actualGrowthRatio
H_a := memstats.heap_live
h_g := goalGrowthRatio
H_g := int64(float64(H_m_prev) * (1 + h_g))
u_a := utilization
u_g := gcGoalUtilization
W_a := c.scanWork
print("pacer: H_m_prev=", H_m_prev,
" h_t=", h_t, " H_T=", H_T,
" h_a=", h_a, " H_a=", H_a,
" h_g=", h_g, " H_g=", H_g,
" u_a=", u_a, " u_g=", u_g,
" W_a=", W_a,
" goalΔ=", goalGrowthRatio-h_t,
" actualΔ=", h_a-h_t,
" u_a/u_g=", u_a/u_g,
"\n")
}
return triggerRatio
}
// enlistWorker encourages another dedicated mark worker to start on
// another P if there are spare worker slots. It is used by putfull
// when more work is made available.
//
//go:nowritebarrier
func (c *gcControllerState) enlistWorker() {
// If there are idle Ps, wake one so it will run an idle worker.
// NOTE: This is suspected of causing deadlocks. See golang.org/issue/19112.
//
// if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
// wakep()
// return
// }
// There are no idle Ps. If we need more dedicated workers,
// try to preempt a running P so it will switch to a worker.
if c.dedicatedMarkWorkersNeeded <= 0 {
return
}
// Pick a random other P to preempt.
if gomaxprocs <= 1 {
return
}
gp := getg()
if gp == nil || gp.m == nil || gp.m.p == 0 {
return
}
myID := gp.m.p.ptr().id
for tries := 0; tries < 5; tries++ {
id := int32(fastrandn(uint32(gomaxprocs - 1)))
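// fastrandn(gomaxprocs-1) picks uniformly among the other Ps:
// ids at or above our own are shifted up by one below so we
// never select this P.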
if id >= myID {
id++
}
p := allp[id]
if p.status != _Prunning {
continue
}
if preemptone(p) {
return
}
}
}
// findRunnableGCWorker returns the background mark worker for _p_ if it
// should be run. This must only be called when gcBlackenEnabled != 0.
func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
if gcBlackenEnabled == 0 {
throw("gcControllerState.findRunnable: blackening not enabled")
}
if _p_.gcBgMarkWorker == 0 {
// The mark worker associated with this P is blocked
// performing a mark transition. We can't run it
// because it may be on some other run or wait queue.
return nil
}
if !gcMarkWorkAvailable(_p_) {
// No work to be done right now. This can happen at
// the end of the mark phase when there are still
// assists tapering off. Don't bother running a worker
// now because it'll just return immediately.
return nil
}
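// decIfPositive decrements *ptr if it is positive and reports
// whether it did so. If a concurrent decrement drives the value
// negative, the decrement is undone and it reports false.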
decIfPositive := func(ptr *int64) bool {
if *ptr > 0 {
if atomic.Xaddint64(ptr, -1) >= 0 {
return true
}
// We lost a race
atomic.Xaddint64(ptr, +1)
}
return false
}
if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
// This P is now dedicated to marking until the end of
// the concurrent mark phase.
_p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
} else if c.fractionalUtilizationGoal == 0 {
// No need for fractional workers.
return nil
} else {
// Is this P behind on the fractional utilization
// goal?
//
// This should be kept in sync with pollFractionalWorkerExit.
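// For example, if the fractional utilization goal were 0.05 (an
// illustrative value), a P whose fractional worker has already
// accounted for 6% of the elapsed mark time is ahead of the goal
// and does not start another fractional worker.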
delta := nanotime() - gcController.markStartTime
if delta > 0 && float64(_p_.gcFractionalMarkTime)/float64(delta) > c.fractionalUtilizationGoal {
// Nope. No need to run a fractional worker.
return nil
}
// Run a fractional worker.
_p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
}
// Run the background mark worker
gp := _p_.gcBgMarkWorker.ptr()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.enabled {
traceGoUnpark(gp, 0)
}
return gp
}
// pollFractionalWorkerExit returns true if a fractional mark worker
// should self-preempt. It assumes it is called from the fractional
// worker.
func pollFractionalWorkerExit() bool {
// This should be kept in sync with the fractional worker
// scheduler logic in findRunnableGCWorker.
now := nanotime()
delta := now - gcController.markStartTime
if delta <= 0 {
return true
}
p := getg().m.p.ptr()
selfTime := p.gcFractionalMarkTime + (now - p.gcMarkWorkerStartTime)
// Add some slack to the utilization goal so that the
// fractional worker isn't behind again the instant it exits.
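// For example, with an illustrative goal of 0.05, the worker
// self-preempts once its own mark time exceeds 6% of the elapsed
// mark phase, so it stays above the threshold findRunnableGCWorker
// checks for a while instead of being rescheduled immediately.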
return float64(selfTime)/float64(delta) > 1.2*gcController.fractionalUtilizationGoal
}
// gcSetTriggerRatio sets the trigger ratio and updates everything
// derived from it: the absolute trigger, the heap goal, mark pacing,
// and sweep pacing.
//
// This can be called any time. If GC is in the middle of a
// concurrent phase, it will adjust the pacing of that phase.
//
// This depends on gcpercent, memstats.heap_marked, and
// memstats.heap_live. These must be up to date.
//
// mheap_.lock must be held or the world must be stopped.
func gcSetTriggerRatio(triggerRatio float64) {
// Set the trigger ratio, capped to reasonable bounds.
if triggerRatio < 0 {
// This can happen if the mutator is allocating very
// quickly or the GC is scanning very slowly.
triggerRatio = 0
} else if gcpercent >= 0 {
// Ensure there's always a little margin so that the
// mutator assist ratio isn't infinity.
maxTriggerRatio := 0.95 * float64(gcpercent) / 100
if triggerRatio > maxTriggerRatio {
triggerRatio = maxTriggerRatio
}
}
memstats.triggerRatio = triggerRatio
// Compute the absolute GC trigger from the trigger ratio.
//
// We trigger the next GC cycle when the allocated heap has
// grown by the trigger ratio over the marked heap size.
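// For example, with heap_marked of 100 MB and a trigger ratio of
// 0.7, the next cycle triggers once the live heap reaches roughly
// 170 MB.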
trigger := ^uint64(0)
if gcpercent >= 0 {
trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))
// Don't trigger below the minimum heap size.
minTrigger := heapminimum
if !gosweepdone() {
// Concurrent sweep happens in the heap growth
// from heap_live to gc_trigger, so ensure
// that concurrent sweep has some heap growth
// in which to perform sweeping before we
// start the next GC cycle.
sweepMin := atomic.Load64(&memstats.heap_live) + sweepMinHeapDistance*uint64(gcpercent)/100
if sweepMin > minTrigger {
minTrigger = sweepMin
}
}
if trigger < minTrigger {
trigger = minTrigger
}
if int64(trigger) < 0 {
print("runtime: next_gc=", memstats.next_gc, " heap_marked=", memstats.heap_marked, " heap_live=", memstats.heap_live, " initialHeapLive=", work.initialHeapLive, "triggerRatio=", triggerRatio, " minTrigger=", minTrigger, "\n")
throw("gc_trigger underflow")
}
}
memstats.gc_trigger = trigger
// Compute the next GC goal, which is when the allocated heap
// has grown by GOGC/100 over the heap marked by the last
// cycle.
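// For example, with GOGC=100 and heap_marked of 100 MB, the goal
// is 200 MB.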
goal := ^uint64(0)
if gcpercent >= 0 {
goal = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
if goal < trigger {
// The trigger ratio is always less than GOGC/100, but
// other bounds on the trigger may have raised it.
// Push up the goal, too.
goal = trigger
}
}
memstats.next_gc = goal
if trace.enabled {
traceNextGC()
}
// Update mark pacing.
if gcphase != _GCoff {
gcController.revise()
}
// Update sweep pacing.
if gosweepdone() {
mheap_.sweepPagesPerByte = 0
} else {
// Concurrent sweep needs to sweep all of the in-use
// pages by the time the allocated heap reaches the GC
// trigger. Compute the ratio of in-use pages to sweep
// per byte allocated, accounting for the fact that
// some might already be swept.
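// For example, if roughly 10,000 pages remain unswept and the
// heap may grow by about 50 MB before the trigger, the sweeper
// must retire about 0.0002 pages per byte allocated, i.e. about
// one page per 5 KB of allocation.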
heapLiveBasis := atomic.Load64(&memstats.heap_live)
heapDistance := int64(trigger) - int64(heapLiveBasis)
// Add a little margin so rounding errors and
// concurrent sweep are less likely to leave pages
// unswept when GC starts.
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
// Avoid setting the sweep ratio extremely high
heapDistance = _PageSize
}
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
mheap_.sweepPagesPerByte = 0
} else {
mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
mheap_.sweepHeapLiveBasis = heapLiveBasis
// Write pagesSweptBasis last, since this
// signals concurrent sweeps to recompute
// their debt.
atomic.Store64(&mheap_.pagesSweptBasis, pagesSwept)
}
}
}
// gcGoalUtilization is the goal CPU utilization for
// marking as a fraction of GOMAXPROCS.
const gcGoalUtilization = 0.30
// gcBackgroundUtilization is the fixed CPU utilization for background
// marking. It must be <= gcGoalUtilization. The difference between
// gcGoalUtilization and gcBackgroundUtilization will be made up by
// mark assists. The scheduler will aim to use within 50% of this
// goal.
//
// Setting this to < gcGoalUtilization avoids saturating the trigger
// feedback controller when there are no assists, which allows it to
// better control CPU and heap growth. However, the larger the gap,
// the more mutator assists are expected to happen, which impact
// mutator latency.
const gcBackgroundUtilization = 0.25
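// For a concrete (illustrative) example: with GOMAXPROCS=8, the
// scheduler dedicates about gcBackgroundUtilization*8 = 2 Ps to
// background mark workers, and mark assists are expected to
// contribute roughly the remaining
// (gcGoalUtilization-gcBackgroundUtilization)*8 = 0.4 of a P toward
// the overall 30% goal.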
// gcCreditSlack is the amount of scan work credit that can
// accumulate locally before updating gcController.scanWork and,
// optionally, gcController.bgScanCredit. Lower values give a more
// accurate assist ratio and make it more likely that assists will
// successfully steal background credit. Higher values reduce memory
// contention.
const gcCreditSlack = 2000
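// As a rough sketch of the flow (numbers are illustrative): a worker
// accumulates scan work credit locally while draining; once the
// local total reaches gcCreditSlack (2000 bytes of scan work), it is
// flushed to gcController.scanWork and, optionally, to
// gcController.bgScanCredit, where assists can steal it.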
// gcAssistTimeSlack is the nanoseconds of mutator assist time that
// can accumulate on a P before updating gcController.assistTime.
const gcAssistTimeSlack = 5000
// gcOverAssistWork determines how many extra units of scan work a GC
// assist does when an assist happens. This amortizes the cost of an
// assist by pre-paying for this many bytes of future allocations.
const gcOverAssistWork = 64 << 10
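// For example (illustrative numbers): an assist that owes 4 KB of
// scan work performs 4<<10 + gcOverAssistWork = 68 KB of work and
// banks the surplus as assist credit, so roughly the goroutine's
// next 64 KB worth of allocation requires no further assist.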
var work struct {
full lfstack // lock-free list of full blocks workbuf
empty lfstack // lock-free list of empty blocks workbuf
pad0 [sys.CacheLineSize]uint8 // prevents false-sharing between full/empty and nproc/nwait
wbufSpans struct {
lock mutex
// free is a list of spans dedicated to workbufs, but
// that don't currently contain any workbufs.
free mSpanList
// busy is a list of all spans containing workbufs on
// one of the workbuf lists.
busy mSpanList
}
// Restore 64-bit alignment on 32-bit.
_ uint32
// bytesMarked is the number of bytes marked this cycle. This
// includes bytes blackened in scanned objects, noscan objects
// that go straight to black, and permagrey objects scanned by
// markroot during the concurrent scan phase. This is updated
// atomically during the cycle. Updates may be batched
// arbitrarily, since the value is only read at the end of the
// cycle.
//
// Because of benign races during marking, this number may not
// be the exact number of marked bytes, but it should be very
// close.
//
// Put this field here because it needs 64-bit atomic access
// (and thus 8-byte alignment even on 32-bit architectures).
bytesMarked uint64
markrootNext uint32 // next markroot job
markrootJobs uint32 // number of markroot jobs
nproc uint32
tstart int64
nwait uint32
ndone uint32
alldone note
// helperDrainBlock indicates that GC mark termination helpers
// should pass gcDrainBlock to gcDrain to block in the
// getfull() barrier. Otherwise, they should pass gcDrainNoBlock.
//
// TODO: This is a temporary fallback to work around races
// that cause early mark termination.
helperDrainBlock bool
// Number of roots of various root types. Set by gcMarkRootPrepare.
nFlushCacheRoots int
nDataRoots, nBSSRoots, nSpanRoots, nStackRoots int
// markrootDone indicates that roots have been marked at least
// once during the current GC cycle. This is checked by root
// marking operations that have to happen only during the
// first root marking pass, whether that's during the
// concurrent mark phase in current GC or mark termination in
// STW GC.
markrootDone bool
// Each type of GC state transition is protected by a lock.
// Since multiple threads can simultaneously detect the state
// transition condition, any thread that detects a transition
// condition must acquire the appropriate transition lock,
// re-check the transition condition and return if it no
// longer holds or perform the transition if it does.
// Likewise, any transition must invalidate the transition
// condition before releasing the lock. This ensures that each
// transition is performed by exactly one thread and threads
// that need the transition to happen block until it has
// happened.
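//
// A sketch of that pattern (placeholder names; see gcStart below for
// the real use of startSema):
//
//	semacquire(&transitionLock)
//	if !transitionCondition() {
//		// Someone else already performed the transition.
//		semrelease(&transitionLock)
//		return
//	}
//	// ... perform the transition, which invalidates the
//	// condition ...
//	semrelease(&transitionLock)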
//
// startSema protects the transition from "off" to mark or
// mark termination.
startSema uint32
// markDoneSema protects transitions from mark 1 to mark 2 and
// from mark 2 to mark termination.
markDoneSema uint32
bgMarkReady note // signal background mark worker has started
bgMarkDone uint32 // cas to 1 when at a background mark completion point
// Background mark completion signaling
// mode is the concurrency mode of the current GC cycle.
mode gcMode
// userForced indicates the current GC cycle was forced by an
// explicit user call.
userForced bool
// totaltime is the CPU nanoseconds spent in GC since the
// program started if debug.gctrace > 0.
totaltime int64
// initialHeapLive is the value of memstats.heap_live at the
// beginning of this GC cycle.
initialHeapLive uint64
// assistQueue is a queue of assists that are blocked because
// there was neither enough credit to steal nor enough work to
// do.
assistQueue struct {
lock mutex
head, tail guintptr
}
// sweepWaiters is a list of blocked goroutines to wake when
// we transition from mark termination to sweep.
sweepWaiters struct {
lock mutex
head guintptr
}
// cycles is the number of completed GC cycles, where a GC
// cycle is sweep termination, mark, mark termination, and
// sweep. This differs from memstats.numgc, which is
// incremented at mark termination.
cycles uint32
// Timing/utilization stats for this cycle.
stwprocs, maxprocs int32
tSweepTerm, tMark, tMarkTerm, tEnd int64 // nanotime() of phase start
pauseNS int64 // total STW time this cycle
pauseStart int64 // nanotime() of last STW
// debug.gctrace heap sizes for this cycle.
heap0, heap1, heap2, heapGoal uint64
}
// GC runs a garbage collection and blocks the caller until the
// garbage collection is complete. It may also block the entire
// program.
func GC() {
// We consider a cycle to be: sweep termination, mark, mark
// termination, and sweep. This function shouldn't return
// until a full cycle has been completed, from beginning to
// end. Hence, we always want to finish up the current cycle
// and start a new one. That means:
//
// 1. In sweep termination, mark, or mark termination of cycle
// N, wait until mark termination N completes and transitions
// to sweep N.
//
// 2. In sweep N, help with sweep N.
//
// At this point we can begin a full cycle N+1.
//
// 3. Trigger cycle N+1 by starting sweep termination N+1.
//
// 4. Wait for mark termination N+1 to complete.
//
// 5. Help with sweep N+1 until it's done.
//
// This all has to be written to deal with the fact that the
// GC may move ahead on its own. For example, when we block
// until mark termination N, we may wake up in cycle N+2.
gp := getg()
// Prevent the GC phase or cycle count from changing.
lock(&work.sweepWaiters.lock)
n := atomic.Load(&work.cycles)
if gcphase == _GCmark {
// Wait until sweep termination, mark, and mark
// termination of cycle N complete.
gp.schedlink = work.sweepWaiters.head
work.sweepWaiters.head.set(gp)
goparkunlock(&work.sweepWaiters.lock, "wait for GC cycle", traceEvGoBlock, 1)
} else {
// We're in sweep N already.
unlock(&work.sweepWaiters.lock)
}
// We're now in sweep N or later. Trigger GC cycle N+1, which
// will first finish sweep N if necessary and then enter sweep
// termination N+1.
gcStart(gcBackgroundMode, gcTrigger{kind: gcTriggerCycle, n: n + 1})
// Wait for mark termination N+1 to complete.
lock(&work.sweepWaiters.lock)
if gcphase == _GCmark && atomic.Load(&work.cycles) == n+1 {
gp.schedlink = work.sweepWaiters.head
work.sweepWaiters.head.set(gp)
goparkunlock(&work.sweepWaiters.lock, "wait for GC cycle", traceEvGoBlock, 1)
} else {
unlock(&work.sweepWaiters.lock)
}
// Finish sweep N+1 before returning. We do this both to
// complete the cycle and because runtime.GC() is often used
// as part of tests and benchmarks to get the system into a
// relatively stable and isolated state.
for atomic.Load(&work.cycles) == n+1 && gosweepone() != ^uintptr(0) {
sweep.nbgsweep++
Gosched()
}
// Callers may assume that the heap profile reflects the
// just-completed cycle when this returns (historically this
// happened because this was a STW GC), but right now the
// profile still reflects mark termination N, not N+1.
//
// As soon as all of the sweep frees from cycle N+1 are done,
// we can go ahead and publish the heap profile.
//
// First, wait for sweeping to finish. (We know there are no
// more spans on the sweep queue, but we may be concurrently
// sweeping spans, so we have to wait.)
for atomic.Load(&work.cycles) == n+1 && atomic.Load(&mheap_.sweepers) != 0 {
Gosched()
}
// Now we're really done with sweeping, so we can publish the
// stable heap profile. Only do this if we haven't already hit
// another mark termination.
mp := acquirem()
cycle := atomic.Load(&work.cycles)
if cycle == n+1 || (gcphase == _GCmark && cycle == n+2) {
mProf_PostSweep()
}
releasem(mp)
}
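// A typical use in tests and benchmarks (illustrative, not part of
// the runtime; w is any io.Writer and the profile comes from
// runtime/pprof): force a full cycle so the heap profile is stable
// before capturing it.
//
//	runtime.GC()
//	pprof.Lookup("heap").WriteTo(w, 0)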
// gcMode indicates how concurrent a GC cycle should be.
type gcMode int
const (
gcBackgroundMode gcMode = iota // concurrent GC and sweep
gcForceMode // stop-the-world GC now, concurrent sweep
gcForceBlockMode // stop-the-world GC now and STW sweep (forced by user)
)
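// gcBackgroundMode is the normal mode. The forced modes are selected
// via the GODEBUG setting gcstoptheworld=1 (gcForceMode) or
// gcstoptheworld=2 (gcForceBlockMode); see the mode upgrade in
// gcStart below.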
// A gcTrigger is a predicate for starting a GC cycle. Specifically,
// it is an exit condition for the _GCoff phase.
type gcTrigger struct {
kind gcTriggerKind
now int64 // gcTriggerTime: current time
n uint32 // gcTriggerCycle: cycle number to start
}
type gcTriggerKind int
const (
// gcTriggerAlways indicates that a cycle should be started
// unconditionally, even if GOGC is off or we're in a cycle
// right now. This cannot be consolidated with other cycles.
gcTriggerAlways gcTriggerKind = iota
// gcTriggerHeap indicates that a cycle should be started when
// the heap size reaches the trigger heap size computed by the
// controller.
gcTriggerHeap
// gcTriggerTime indicates that a cycle should be started when
// it's been more than forcegcperiod nanoseconds since the
// previous GC cycle.
gcTriggerTime
// gcTriggerCycle indicates that a cycle should be started if
// we have not yet started cycle number gcTrigger.n (relative
// to work.cycles).
gcTriggerCycle
)
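// As a rough sketch of how these triggers are used elsewhere in the
// runtime (call sites paraphrased, not verbatim): mallocgc tests
// gcTrigger{kind: gcTriggerHeap}, sysmon and forcegchelper test
// gcTrigger{kind: gcTriggerTime, now: nanotime()}, and runtime.GC
// (above) uses gcTrigger{kind: gcTriggerCycle, n: n + 1}. Each site
// calls gcStart(gcBackgroundMode, t) when t.test() reports true.
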
// test returns true if the trigger condition is satisfied, meaning
// that the exit condition for the _GCoff phase has been met. The exit
// condition should be tested when allocating.
func (t gcTrigger) test() bool {
if !memstats.enablegc || panicking != 0 {
return false
}
if t.kind == gcTriggerAlways {
return true
}
if gcphase != _GCoff {
return false
}
switch t.kind {
case gcTriggerHeap:
// Non-atomic access to heap_live for performance. If
// we are going to trigger on this, this thread just
// atomically wrote heap_live anyway and we'll see our
// own write.
return memstats.heap_live >= memstats.gc_trigger
case gcTriggerTime:
if gcpercent < 0 {
return false
}
lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
return lastgc != 0 && t.now-lastgc > forcegcperiod
case gcTriggerCycle:
// t.n > work.cycles, but accounting for wraparound.
return int32(t.n-work.cycles) > 0
}
return true
}
// gcStart transitions the GC from _GCoff to _GCmark (in
// gcBackgroundMode) or runs straight through mark termination (in the
// stop-the-world modes) by performing sweep termination and GC
// initialization.
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(mode gcMode, trigger gcTrigger) {
// Since this is called from malloc and malloc is called in
// the guts of a number of libraries that might be holding
// locks, don't attempt to start GC in non-preemptible or
// potentially unstable situations.
mp := acquirem()
if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
releasem(mp)
return
}
releasem(mp)
mp = nil
// Concurrently sweep the remaining unswept spans (those not already
// being swept).
//
// This shouldn't happen if we're being invoked in background
// mode since proportional sweep should have just finished
// sweeping everything, but rounding errors, etc, may leave a
// few spans unswept. In forced mode, this is necessary since
// GC can be forced at any point in the sweeping cycle.
//
// We check the transition condition continuously here in case
// this G gets delayed into the next GC cycle.
for trigger.test() && gosweepone() != ^uintptr(0) {
sweep.nbgsweep++
}
// Perform GC initialization and the sweep termination
// transition.
semacquire(&work.startSema)
// Re-check transition condition under transition lock.
if !trigger.test() {
semrelease(&work.startSema)
return
}
// For stats, check if this GC was forced by the user.
work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle
// In gcstoptheworld debug mode, upgrade the mode accordingly.
// We do this after re-checking the transition condition so
// that multiple goroutines that detect the heap trigger don't
// start multiple STW GCs.
if mode == gcBackgroundMode {
if debug.gcstoptheworld == 1 {
mode = gcForceMode
} else if debug.gcstoptheworld == 2 {
mode = gcForceBlockMode
}
}
// Ok, we're doing it! Stop everybody else
semacquire(&worldsema)
if trace.enabled {
traceGCStart()
}
if mode == gcBackgroundMode {
gcBgMarkStartWorkers()
}
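// Clear per-goroutine mark state and cycle-wide mark accounting left
// over from the previous cycle before setting up this one.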
gcResetMarkState()
work.stwprocs, work.maxprocs = gomaxprocs, gomaxprocs
if work.stwprocs > ncpu {
// This is used to compute CPU time of the STW phases,
// so it can't be more than ncpu, even if GOMAXPROCS is.
work.stwprocs = ncpu
}
work.heap0 = atomic.Load64(&memstats.heap_live)
work.pauseNS = 0
work.mode = mode
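// Record when sweep termination began; pauseStart marks the start of
// this cycle's first STW window for pause accounting and gctrace.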
now := nanotime()
work.tSweepTerm = now
work.pauseStart = now
if trace.enabled {
traceGCSTWStart(1)
}
systemstack(stopTheWorldWithSema)
// Finish sweep before we start concurrent scan.
systemstack(func() {
finishsweep_m()
})
// clearpools before we start the GC. If we wait, the memory will not
// be reclaimed until the next GC cycle.
clearpools()
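// Advance the cycle number. runtime.GC and gcTriggerCycle triggers
// compare against work.cycles.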
work.cycles++
if mode == gcBackgroundMode { // Do as much work concurrently as possible
gcController.startCycle()
work.heapGoal = memstats.next_gc
// Enter concurrent mark phase and enable
// write barriers.
//
// Because the world is stopped, all Ps will
// observe that write barriers are enabled by
// the time we start the world and begin
// scanning.
//
// Write barriers must be enabled before assists are
// enabled because they must be enabled before
// any non-leaf heap objects are marked. Since
// allocations are blocked until assists can
// happen, we want to enable assists as early as
// possible.
setGCPhase(_GCmark)
gcBgMarkPrepare() // Must happen before assist enable.
gcMarkRootPrepare()
// Mark all active tinyalloc blocks. Since we're
// allocating from these, they need to be black like
// other allocations. The alternative is to blacken
// the tiny block on every allocation from it, which
// would slow down the tiny allocator.
gcMarkTinyAllocs()
// At this point all Ps have enabled the write
// barrier, thus maintaining the no white to
// black invariant. Enable mutator assists to
// put back-pressure on fast allocating
// mutators.
atomic.Store(&gcBlackenEnabled, 1)
// Assists and workers can start the moment we start
// the world.
gcController.markStartTime = now
// Concurrent mark.
systemstack(func() {
now = startTheWorldWithSema(trace.enabled)
})
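// Accumulate the sweep termination STW pause and record when
// concurrent mark began.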
work.pauseNS += now - work.pauseStart
work.tMark = now
} else {
if trace.enabled {
// Switch to mark termination STW.
traceGCSTWDone()
traceGCSTWStart(0)
}
t := nanotime()
work.tMark, work.tMarkTerm = t, t
work.heapGoal = work.heap0
// Perform mark termination. This will restart the world.
gcMarkTermination(memstats.triggerRatio)
}
semrelease(&work.startSema)
}
// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
func gcMarkDone() {
top:
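// work.markDoneSema is the transition lock: it serializes the
// mark 1 -> mark 2 and mark 2 -> mark termination transitions.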
semacquire(&work.markDoneSema)
// Re-check transition condition under transition lock.
if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
semrelease(&work.markDoneSema)
return
}
// Disallow starting new workers so that any remaining workers
// in the current mark phase will drain out.
//
// TODO(austin): Should dedicated workers keep an eye on this
// and exit gcDrain promptly?
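//
// Driving dedicatedMarkWorkersNeeded far negative keeps the scheduler
// from starting new dedicated mark workers during this transition; the
// mark 1 -> mark 2 path below restores the count once caches are
// flushed.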
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
prevFractionalGoal := gcController.fractionalUtilizationGoal
gcController.fractionalUtilizationGoal = 0
if !gcBlackenPromptly {
// Transition from mark 1 to mark 2.
//
// The global work list is empty, but there can still be work
// sitting in the per-P work caches.
// Flush and disable work caches.
// Disallow caching workbufs and indicate that we're in mark 2.
gcBlackenPromptly = true
// Prevent completion of mark 2 until we've flushed
// cached workbufs.
atomic.Xadd(&work.nwait, -1)
// GC is set up for mark 2. Let Gs blocked on the
// transition lock go while we flush caches.
semrelease(&work.markDoneSema)
systemstack(func() {
// Flush all currently cached workbufs and
// ensure all Ps see gcBlackenPromptly. This
// also blocks until any remaining mark 1
// workers have exited their loop so we can
// start new mark 2 workers.
forEachP(func(_p_ *p) {
wbBufFlush1(_p_)
_p_.gcw.dispose()
})
})
// Check that roots are marked. We should be able to
// do this before the forEachP, but based on issue
// #16083 there may be a (harmless) race where we can
// enter mark 2 while some workers are still scanning
// stacks. The forEachP ensures these scans are done.
//
// TODO(austin): Figure out the race and fix this
// properly.
gcMarkRootCheck()
// Now we can start up mark 2 workers.
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
gcController.fractionalUtilizationGoal = prevFractionalGoal
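// Re-add ourselves to nwait. If every worker is now waiting and no
// mark work remains, mark 2 finished while we were flushing caches,
// so loop back and take the mark termination path.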
incnwait := atomic.Xadd(&work.nwait, +1)
if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
// This loop will make progress because
// gcBlackenPromptly is now true, so it won't
// take this same "if" branch.
goto top
}
} else {
// Transition to mark termination.
now := nanotime()
work.tMarkTerm = now
work.pauseStart = now
getg().m.preemptoff = "gcing"
if trace.enabled {
traceGCSTWStart(0)
}
systemstack(stopTheWorldWithSema)
// The gcphase is _GCmark; it will transition to _GCmarktermination
// below. The important thing is that the write barrier remains active
// until all marking is complete. This includes writes made by the GC.
// Record that one root marking pass has completed.
work.markrootDone = true
// Disable assists and background workers. We must do
// this before waking blocked assists.
atomic.Store(&gcBlackenEnabled, 0)
// Wake all blocked assists. These will run when we
// start the world again.
gcWakeAllAssists()
// Likewise, release the transition lock. Blocked
// workers and assists will run when we start the
// world again.
semrelease(&work.markDoneSema)
// endCycle depends on all gcWork cache stats being
// flushed. This is ensured by mark 2.
nextTriggerRatio := gcController.endCycle()
// Perform mark termination. This will restart the world.
gcMarkTermination(nextTriggerRatio)
}
}
func gcMarkTermination(nextTriggerRatio float64) {
// World is stopped.
// Start marktermination which includes enabling the write barrier.
atomic.Store(&gcBlackenEnabled, 0)
gcBlackenPromptly = false
setGCPhase(_GCmarktermination)
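// Snapshot the live heap size at the start of mark termination for
// gctrace reporting.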
work.heap1 = memstats.heap_live
startTime := nanotime()
mp := acquirem()
mp.preemptoff = "gcing"
_g_ := getg()
_g_.m.traceback = 2
gp := _g_.m.curg
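// Mark the current user goroutine as waiting (reason "garbage
// collection") while mark termination runs on the system stack.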
casgstatus(gp, _Grunning, _Gwaiting)
gp.waitreason = "garbage collection"
// Run gc on the g0 stack. We do this so that the g stack
// we're currently running on will no longer change. Cuts
// the root set down a bit (g0 stacks are not scanned, and
// we don't need to scan gc's internal state). We also
// need to switch to g0 so we can shrink the stack.
systemstack(func() {
gcMark(startTime)
// Must return immediately.
// The outer function's stack may have moved
// during gcMark (it shrinks stacks, including the
// outer function's stack), so we must not refer
// to any of its variables. Return back to the
// non-system stack to pick up the new addresses
// before continuing.
})
systemstack(func() {
work.heap2 = work.bytesMarked
if debug.gccheckmark > 0 {
// Run a full stop-the-world mark using checkmark bits,
// to check that we didn't forget to mark anything during
// the concurrent mark process.
gcResetMarkState()
initCheckmarks()
gcMark(startTime)
clearCheckmarks()
}
// marking is complete so we can turn the write barrier off
setGCPhase(_GCoff)
gcSweep(work.mode)
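// With gctrace > 1, redo a complete STW mark over the same heap and
// sweep again before reporting.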
if debug.gctrace > 1 {
startTime = nanotime()
// The g stacks have been scanned, so they have
// gcscanvalid==true and gcscandone==true.
// Reset these so that all stacks will be rescanned.
gcResetMarkState()
finishsweep_m()
// We're still stopped-the-world, but gcphase is _GCoff; reset it to
// _GCmarktermination. At this point all objects will be found during
// the gcMark below, which does a complete STW mark and object scan.
setGCPhase(_GCmarktermination)
gcMark(startTime)
setGCPhase(_GCoff) // marking is done, turn off wb.
gcSweep(work.mode)
}
})
_g_.m.traceback = 0
casgstatus(gp, _Gwaiting, _Grunning)
if trace.enabled {
traceGCDone()
}
// all done
mp.preemptoff = ""
if gcphase != _GCoff {
throw("gc done but gcphase != _GCoff")
}
// Update GC trigger and pacing for the next cycle.
gcSetTriggerRatio(nextTriggerRatio)
// Update timing memstats
now := nanotime()
sec, nsec, _ := time_now()
unixNow := sec*1e9 + int64(nsec)
work.pauseNS += now - work.pauseStart
work.tEnd = now
atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
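// Record this pause in the fixed-size circular buffers exposed
// through runtime.MemStats (PauseNs and PauseEnd).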
memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
memstats.pause_total_ns += uint64(work.pauseNS)
// Update work.totaltime.
sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
// We report idle marking time below, but omit it from the
// overall utilization here since it's "free".
markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
cycleCpu := sweepTermCpu + markCpu + markTermCpu
work.totaltime += cycleCpu
// Compute overall GC CPU utilization.
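// sched.totaltime is only accumulated when GOMAXPROCS changes
// (procresize), so add the CPU time available at the current
// gomaxprocs since the last change.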
totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)
// Reset sweep state.
sweep.nbgsweep = 0
sweep.npausesweep = 0
if work.userForced {
memstats.numforcedgc++
}
// Bump GC cycle count and wake goroutines waiting on sweep.
lock(&work.sweepWaiters.lock)
memstats.numgc++
injectglist(work.sweepWaiters.head.ptr())
work.sweepWaiters.head = 0
unlock(&work.sweepWaiters.lock)
// Finish the current heap profiling cycle and start a new
// heap profiling cycle. We do this before starting the world
// so events don't leak into the wrong cycle.
mProf_NextCycle()
systemstack(func() { startTheWorldWithSema(true) })
// Flush the heap profile so we can start a new cycle next GC.
// This is relatively expensive, so we don't do it with the
// world stopped.
mProf_Flush()
// Prepare workbufs for freeing by the sweeper. We do this
// asynchronously because it can take non-trivial time.
prepareFreeWorkbufs()
// Free stack spans. This must be done between GC cycles.
systemstack(freeStackSpans)
// Print gctrace before dropping worldsema. As soon as we drop
// worldsema another cycle could start and smash the stats
// we're trying to print.
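// The trace is enabled with GODEBUG=gctrace=1. The printed line has
// the form (values illustrative):
//
//	gc 17 @3.821s 2%: 0.032+2.9+0.27 ms clock, 0.12+0.54/2.8/5.6+1.0 ms cpu, 18->19->10 MB, 20 MB goal, 4 P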
if debug.gctrace > 0 {
util := int(memstats.gc_cpu_fraction * 100)
var sbuf [24]byte
printlock()
print("gc ", memstats.numgc,
" @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
util, "%: ")
prev := work.tSweepTerm
for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
if i != 0 {
print("+")
}
print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
prev = ns
}
print(" ms clock, ")
for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
if i == 2 || i == 3 {
// Separate mark time components with /.
print("/")
} else if i != 0 {
print("+")
}
print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
}
print(" ms cpu, ",
work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
work.heapGoal>>20, " MB goal, ",
work.maxprocs, " P")
if work.userForced {
print(" (forced)")
}
print("\n")
printunlock()
}
semrelease(&worldsema)
// Careful: another GC cycle may start now.
releasem(mp)
mp = nil
// now that gc is done, kick off finalizer thread if needed
if !concurrentSweep {
// give the queued finalizers, if any, a chance to run
Gosched()
}
}
// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the world is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
// Background marking is performed by per-P G's. Ensure that
// each P has a background GC G.
for _, p := range allp {
if p.gcBgMarkWorker == 0 {
go gcBgMarkWorker(p)
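// bgMarkReady is a single note shared by all workers, so wait
// for this worker to signal that it is ready before clearing
// the note and starting the next worker.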
notetsleepg(&work.bgMarkReady, -1)
noteclear(&work.bgMarkReady)
}
}
}
// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
// Background marking will stop when the work queues are empty
// and there are no more workers (note that, since this is
// concurrent, this may be a transient state, but mark
// termination will clean it up). Between background workers
// and assists, we don't really know how many workers there
// will be, so we pretend to have an arbitrarily large number
// of workers, almost all of which are "waiting". While a
// worker is working it decrements nwait. If nproc == nwait,
// there are no workers.
work.nproc = ^uint32(0)
work.nwait = ^uint32(0)
}
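// gcBgMarkWorker is the entry point for a P's background mark worker
// goroutine. It parks until woken by the scheduler
// (gcController.findRunnableGCWorker), drains the P's gcw in the mode
// selected for the P, and parks again, repeating until the P's worker
// association is broken.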
func gcBgMarkWorker(_p_ *p) {
gp := getg()
type parkInfo struct {
m muintptr // Release this m on park.
attach puintptr // If non-nil, attach to this p on park.
}
// We pass park to a gopark unlock function, so it can't be on
// the stack (see gopark). Prevent deadlock from recursively
// starting GC by disabling preemption.
gp.m.preemptoff = "GC worker init"
park := new(parkInfo)
gp.m.preemptoff = ""
park.m.set(acquirem())
park.attach.set(_p_)
// Inform gcBgMarkStartWorkers that this worker is ready.
// After this point, the background mark worker is scheduled
// cooperatively by gcController.findRunnableGCWorker. Hence, it must
// never be preempted, as this would put it into _Grunnable
// and put it on a run queue. Instead, when the preempt flag
// is set, this puts itself into _Gwaiting to be woken up by
// gcController.findRunnableGCWorker at the appropriate time.
notewakeup(&work.bgMarkReady)
for {
// Go to sleep until woken by gcController.findRunnableGCWorker.
// We can't releasem yet since even the call to gopark
// may be preempted.
gopark(func(g *g, parkp unsafe.Pointer) bool {
park := (*parkInfo)(parkp)
// The worker G is no longer running, so it's
// now safe to allow preemption.
releasem(park.m.ptr())
// If the worker isn't attached to its P,
// attach now. During initialization and after
// a phase change, the worker may have been
// running on a different P. As soon as we
// attach, the owner P may schedule the
// worker, so this must be done after the G is
// stopped.
if park.attach != 0 {
p := park.attach.ptr()
park.attach.set(nil)
// cas the worker because we may be
// racing with a new worker starting
// on this P.
if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
// The P got a new worker.
// Exit this worker.
return false
}
}
return true
}, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0)
// Loop until the P dies and disassociates this
// worker (the P may later be reused, in which case
// it will get a new worker) or we failed to associate.
if _p_.gcBgMarkWorker.ptr() != gp {
break
}
// Disable preemption so we can use the gcw. If the
// scheduler wants to preempt us, we'll stop draining,
// dispose the gcw, and then preempt.
park.m.set(acquirem())
if gcBlackenEnabled == 0 {
throw("gcBgMarkWorker: blackening not enabled")
}
startTime := nanotime()
_p_.gcMarkWorkerStartTime = startTime
decnwait := atomic.Xadd(&work.nwait, -1)
if decnwait == work.nproc {
println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
throw("work.nwait was > work.nproc")
}
systemstack(func() {
// Mark our goroutine preemptible so its stack
// can be scanned. This lets two mark workers
// scan each other (otherwise, they would
// deadlock). We must not modify anything on
// the G stack. However, stack shrinking is
// disabled for mark workers, so it is safe to
// read from the G stack.
casgstatus(gp, _Grunning, _Gwaiting)
switch _p_.gcMarkWorkerMode {
default:
throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
case gcMarkWorkerDedicatedMode:
gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
if gp.preempt {
// We were preempted. This is
// a useful signal to kick
// everything out of the run
// queue so it can run
// somewhere else.
lock(&sched.lock)
for {
gp, _ := runqget(_p_)
if gp == nil {
break
}
globrunqput(gp)
}
unlock(&sched.lock)
}
// Go back to draining, this time
// without preemption.
gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
case gcMarkWorkerFractionalMode:
gcDrain(&_p_.gcw, gcDrainFractional|gcDrainUntilPreempt|gcDrainFlushBgCredit)
case gcMarkWorkerIdleMode:
gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
}
casgstatus(gp, _Gwaiting, _Grunning)
})
// If we are nearing the end of mark, dispose
// of the cache promptly. We must do this
// before signaling that we're no longer
// working so that other workers can't observe
// no workers and no work while we have this
// cached, and before we compute done.
if gcBlackenPromptly {
_p_.gcw.dispose()
}
// Account for time.
duration := nanotime() - startTime
switch _p_.gcMarkWorkerMode {
case gcMarkWorkerDedicatedMode:
atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
case gcMarkWorkerFractionalMode:
atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
atomic.Xaddint64(&_p_.gcFractionalMarkTime, duration)
case gcMarkWorkerIdleMode:
atomic.Xaddint64(&gcController.idleMarkTime, duration)
}
// Was this the last worker and did we run out
// of work?
incnwait := atomic.Xadd(&work.nwait, +1)
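// Sanity check: nwait can never exceed nproc; if it does, the
// worker accounting above has gone wrong.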
if incnwait > work.nproc {
println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
"work.nwait=", incnwait, "work.nproc=", work.nproc)
throw("work.nwait > work.nproc")
}
// If this worker reached a background mark completion
// point, signal the main GC goroutine.
if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
// Make this G preemptible and disassociate it
// as the worker for this P so
// findRunnableGCWorker doesn't try to
// schedule it.
_p_.gcBgMarkWorker.set(nil)
releasem(park.m.ptr())
gcMarkDone()
// Disable preemption and prepare to reattach
// to the P.
//
// We may be running on a different P at this
// point, so we can't reattach until this G is
// parked.
park.m.set(acquirem())
park.attach.set(_p_)
}
}
}
// gcMarkWorkAvailable returns true if executing a mark worker
// on p is potentially useful. p may be nil, in which case it only
// checks the global sources of work.
func gcMarkWorkAvailable(p *p) bool {
if p != nil && !p.gcw.empty() {
return true
}
if !work.full.empty() {
return true // global work available
}
if work.markrootNext < work.markrootJobs {
return true // root scan work available
}
return false
}
// gcMark runs the mark (or, for concurrent GC, mark termination) phase.
// All gcWork caches must be empty.
// STW is in effect at this point.
//TODO go:nowritebarrier
func gcMark(start_time int64) {
if debug.allocfreetrace > 0 {
tracegc()
}
if gcphase != _GCmarktermination {
throw("in gcMark expecting to see gcphase as _GCmarktermination")
}
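// Record the start time of the mark termination phase.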
work.tstart = start_time
// Queue root marking jobs.
gcMarkRootPrepare()
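// Prepare the parallel drain: nproc is the number of procs that
// will take part in mark termination (including this one); nwait
// and ndone track how many have run out of work and finished.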
work.nwait = 0
work.ndone = 0
work.nproc = uint32(gcprocs())
if work.full == 0 && work.nDataRoots+work.nBSSRoots+work.nSpanRoots+work.nStackRoots == 0 {
// There's no work on the work queue and no root jobs
// that can produce work, so don't bother entering the
// getfull() barrier.
//
// This will be the situation the vast majority of the
// time after concurrent mark. However, we still need
// a fallback for STW GC and because there are some
// known races that occasionally leave work around for
// mark termination.
//
// We're still hedging our bets here: if we do
// accidentally produce some work, we'll still process
// it, just not necessarily in parallel.
//
// TODO(austin): Fix the races and remove
// work draining from mark termination so we don't
// need the fallback path.
work.helperDrainBlock = false
} else {
work.helperDrainBlock = true
}
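// If more than one proc is available, wake the helper Ms so the
// drain below runs in parallel. The last helper to finish signals
// alldone (see gchelper).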
if work.nproc > 1 {
noteclear(&work.alldone)
helpgc(int32(work.nproc))
}
gchelperstart()
gcw := &getg().m.p.ptr().gcw
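// Drain the work queue using this P's gcWork. In the common case
// (helperDrainBlock is false) the queue is already empty and the
// drain returns immediately.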
if work.helperDrainBlock {
gcDrain(gcw, gcDrainBlock)
} else {
gcDrain(gcw, gcDrainNoBlock)
}
gcw.dispose()
if debug.gccheckmark > 0 {
// This is expensive when there's a large number of
// Gs, so only do it if checkmark is also enabled.
gcMarkRootCheck()
}
if work.full != 0 {
throw("work.full != 0")
}
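// Wait for the mark termination helpers (gchelper) running on the
// other Ps to finish; the last one to finish wakes work.alldone.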
if work.nproc > 1 {
notesleep(&work.alldone)
}
// Record that at least one root marking pass has completed.
work.markrootDone = true
// Double-check that all gcWork caches are empty. This should
// be ensured by mark 2 before we enter mark termination.
for _, p := range allp {
gcw := &p.gcw
if !gcw.empty() {
throw("P has cached GC work at end of mark termination")
}
if gcw.scanWork != 0 || gcw.bytesMarked != 0 {
throw("P has unflushed stats at end of mark termination")
}
}
cachestats()
// Update the marked heap stat.
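// heap_marked is what the next cycle's GC trigger and heap goal
// are computed from.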
memstats.heap_marked = work.bytesMarked
// Update other GC heap size stats. This must happen after
// cachestats (which flushes local statistics to these) and
// flushallmcaches (which modifies heap_live).
memstats.heap_live = work.bytesMarked
memstats.heap_scan = uint64(gcController.scanWork)
if trace.enabled {
traceHeapAlloc()
}
}
func gcSweep(mode gcMode) {
if gcphase != _GCoff {
throw("gcSweep being done but phase is not GCoff")
}
lock(&mheap_.lock)
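// Bump sweepgen: spans with sweepgen == mheap_.sweepgen-2 now need
// sweeping, and spans with sweepgen == mheap_.sweepgen are swept.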
mheap_.sweepgen += 2
mheap_.sweepdone = 0
if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
// We should have drained this list during the last
// sweep phase. We certainly need to start this phase
// with an empty swept list.
throw("non-empty swept list")
}
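// Reset the count of pages swept this cycle; the proportional
// sweeper uses this to pace background sweeping against allocation.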
mheap_.pagesSwept = 0
unlock(&mheap_.lock)
if !_ConcurrentSweep || mode == gcForceBlockMode {
// Special case synchronous sweep.
// Record that no proportional sweeping has to happen.
lock(&mheap_.lock)
mheap_.sweepPagesPerByte = 0
unlock(&mheap_.lock)
// Sweep all spans eagerly.
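// sweepone sweeps one span and returns ^uintptr(0) once there is
// nothing left to sweep.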
for sweepone() != ^uintptr(0) {
sweep.npausesweep++
}
// Free workbufs eagerly.
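// freeSomeWbufs frees a batch of workbufs back to the heap and
// reports whether more remain, so loop until it returns false.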
prepareFreeWorkbufs()
for freeSomeWbufs(false) {
}
// All "free" events for this mark/sweep cycle have
// now happened, so we can make this profile cycle
// available immediately.
mProf_NextCycle()
mProf_Flush()
return
}
// Background sweep.
lock(&sweep.lock)
if sweep.parked {
sweep.parked = false
ready(sweep.g, 0, true)
}
unlock(&sweep.lock)
}
// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
// This may be called during a concurrent phase, so make sure
// allgs doesn't change.
lock(&allglock)
for _, gp := range allgs {
gp.gcscandone = false // set to true in gcphasework
gp.gcscanvalid = false // stack has not been scanned
gp.gcAssistBytes = 0
}
unlock(&allglock)
work.bytesMarked = 0
work.initialHeapLive = atomic.Load64(&memstats.heap_live)
work.markrootDone = false
}
// Hooks for other packages
var poolcleanup func()
//go:linkname sync_runtime_registerPoolCleanup sync.runtime_registerPoolCleanup
func sync_runtime_registerPoolCleanup(f func()) {
poolcleanup = f
}
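// On the sync side, the registration looks roughly like the sketch
// below (illustrative only, not the exact sync/pool.go source):
//
//	// Implemented in the runtime (this file).
//	func runtime_registerPoolCleanup(cleanup func())
//
//	func init() {
//		runtime_registerPoolCleanup(poolCleanup)
//	}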
func clearpools() {
// clear sync.Pools
if poolcleanup != nil {
poolcleanup()
}
// Clear central sudog cache.
// Leave per-P caches alone, they have strictly bounded size.
// Disconnect cached list before dropping it on the floor,
// so that a dangling ref to one entry does not pin all of them.
lock(&sched.sudoglock)
var sg, sgnext *sudog
for sg = sched.sudogcache; sg != nil; sg = sgnext {
sgnext = sg.next
sg.next = nil
}
sched.sudogcache = nil
unlock(&sched.sudoglock)
// Clear central defer pools.
// Leave per-P pools alone, they have strictly bounded size.
lock(&sched.deferlock)
for i := range sched.deferpool {
// disconnect cached list before dropping it on the floor,
// so that a dangling ref to one entry does not pin all of them.
var d, dlink *_defer
for d = sched.deferpool[i]; d != nil; d = dlink {
dlink = d.link
d.link = nil
}
sched.deferpool[i] = nil
}
unlock(&sched.deferlock)
}
// gchelper runs mark termination tasks on Ps other than the P
// coordinating mark termination.
//
// The caller is responsible for ensuring that this has a P to run on,
// even though it's running during STW. Because of this, it's allowed
// to have write barriers.
//
//go:yeswritebarrierrec
func gchelper() {
_g_ := getg()
_g_.m.traceback = 2
gchelperstart()
// Parallel mark over GC roots and heap
if gcphase == _GCmarktermination {
gcw := &_g_.m.p.ptr().gcw
if work.helperDrainBlock {
gcDrain(gcw, gcDrainBlock) // blocks in getfull
} else {
gcDrain(gcw, gcDrainNoBlock)
}
gcw.dispose()
}
nproc := atomic.Load(&work.nproc) // work.nproc can change right after we increment work.ndone
if atomic.Xadd(&work.ndone, +1) == nproc-1 {
notewakeup(&work.alldone)
}
_g_.m.traceback = 0
}
func gchelperstart() {
_g_ := getg()
if _g_.m.helpgc < 0 || _g_.m.helpgc >= _MaxGcproc {
throw("gchelperstart: bad m->helpgc")
}
if _g_ != _g_.m.g0 {
throw("gchelper not running on g0 stack")
}
}
// Timing
// itoaDiv formats val/(10**dec) into buf.
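// For example, itoaDiv(buf, 1234, 3) produces "1.234" and
// itoaDiv(buf, 42, 0) produces "42". buf must be large enough to
// hold the result.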
func itoaDiv(buf []byte, val uint64, dec int) []byte {
i := len(buf) - 1
idec := i - dec
for val >= 10 || i >= idec {
buf[i] = byte(val%10 + '0')
i--
if i == idec {
buf[i] = '.'
i--
}
val /= 10
}
buf[i] = byte(val + '0')
return buf[i:]
}
// fmtNSAsMS nicely formats ns nanoseconds as milliseconds.
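// For example, 2500000ns formats as "2.5" and 15000000ns as "15".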
func fmtNSAsMS(buf []byte, ns uint64) []byte {
if ns >= 10e6 {
// Format as whole milliseconds.
return itoaDiv(buf, ns/1e6, 0)
}
// Format two digits of precision, with at most three decimal places.
x := ns / 1e3
if x == 0 {
buf[0] = '0'
return buf[:1]
}
dec := 3
for x >= 100 {
x /= 10
dec--
}
return itoaDiv(buf, x, dec)
}