runtime/metrics: add STW stopping and total time metrics

This CL adds four new time histogram metrics:

/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

The "stopping" metrics measure the time taken to start a stop-the-world
pause. i.e., how long it takes stopTheWorldWithSema to stop all Ps.
This can be used to detect STW struggling to preempt Ps.

The "total" metrics measure the total duration of a stop-the-world
pause, from starting to stop-the-world until the world is started again.
This includes the time spent in the "start" phase.

The "gc" metrics are used for GC-related STW pauses. The "other" metrics
are used for all other STW pauses.

All of these metrics start timing in stopTheWorldWithSema only after
successfully acquiring sched.lock, thus excluding lock contention on
sched.lock. The reasoning behind this is that while waiting on
sched.lock the world is not stopped at all (all other Ps can run), so
the impact of this contention is primarily limited to the goroutine
attempting to stop-the-world. Additionally, we already have some
visibility into sched.lock contention via contention profiles (#57071).

/sched/pauses/total/gc:seconds is conceptually equivalent to
/gc/pauses:seconds, so the latter is marked as deprecated and returns
the same histogram as the former.

In the implementation, there are a few minor differences:

* For both mark and sweep termination stops, /gc/pauses:seconds started
  timing prior to calling startTheWorldWithSema, thus including lock
  contention.

These details are minor enough, that I do not believe the slight change
in reporting will matter. For mark termination stops, moving timing stop
into startTheWorldWithSema does have the side effect of requiring moving
other GC metric calculations outside of the STW, as they depend on the
same end time.

Fixes #63340

Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This commit is contained in:
Michael Pratt 2023-10-10 15:28:32 -04:00 committed by Gopher Robot
parent a0df23888f
commit 6ef98ac87c
17 changed files with 425 additions and 100 deletions

View file

@ -383,7 +383,7 @@ var ReadUnaligned32 = readUnaligned32
var ReadUnaligned64 = readUnaligned64
func CountPagesInUse() (pagesInUse, counted uintptr) {
stopTheWorld(stwForTestCountPagesInUse)
stw := stopTheWorld(stwForTestCountPagesInUse)
pagesInUse = mheap_.pagesInUse.Load()
@ -393,7 +393,7 @@ func CountPagesInUse() (pagesInUse, counted uintptr) {
}
}
startTheWorld()
startTheWorld(stw)
return
}
@ -426,7 +426,7 @@ func (p *ProfBuf) Close() {
}
func ReadMetricsSlow(memStats *MemStats, samplesp unsafe.Pointer, len, cap int) {
stopTheWorld(stwForTestReadMetricsSlow)
stw := stopTheWorld(stwForTestReadMetricsSlow)
// Initialize the metrics beforehand because this could
// allocate and skew the stats.
@ -461,13 +461,13 @@ func ReadMetricsSlow(memStats *MemStats, samplesp unsafe.Pointer, len, cap int)
})
metricsUnlock()
startTheWorld()
startTheWorld(stw)
}
// ReadMemStatsSlow returns both the runtime-computed MemStats and
// MemStats accumulated by scanning the heap.
func ReadMemStatsSlow() (base, slow MemStats) {
stopTheWorld(stwForTestReadMemStatsSlow)
stw := stopTheWorld(stwForTestReadMemStatsSlow)
// Run on the system stack to avoid stack growth allocation.
systemstack(func() {
@ -544,7 +544,7 @@ func ReadMemStatsSlow() (base, slow MemStats) {
getg().m.mallocing--
})
startTheWorld()
startTheWorld(stw)
return
}
@ -1324,7 +1324,7 @@ func CheckScavengedBitsCleared(mismatches []BitsMismatch) (n int, ok bool) {
}
func PageCachePagesLeaked() (leaked uintptr) {
stopTheWorld(stwForTestPageCachePagesLeaked)
stw := stopTheWorld(stwForTestPageCachePagesLeaked)
// Walk over destroyed Ps and look for unflushed caches.
deadp := allp[len(allp):cap(allp)]
@ -1336,7 +1336,7 @@ func PageCachePagesLeaked() (leaked uintptr) {
}
}
startTheWorld()
startTheWorld(stw)
return
}