go/src/runtime/mheap.go


// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Page heap.
//
// See malloc.go for overview.
package runtime
import (
"internal/abi"
"internal/cpu"
"internal/goarch"
"internal/goexperiment"
"internal/runtime/atomic"
"internal/runtime/gc"
"internal/runtime/sys"
"unsafe"
)
const (
// minPhysPageSize is a lower-bound on the physical page size. The
// true physical page size may be larger than this. In contrast,
// sys.PhysPageSize is an upper-bound on the physical page size.
minPhysPageSize = 4096
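// A lower bound matters when code must visit every physical page in a
// range: stepping by the upper bound (sys.PhysPageSize) could skip
// pages if the true page size is smaller. A minimal sketch of that
// pattern, with start/end as illustrative bounds:
//
//	for p := start; p < end; p += minPhysPageSize {
//		// query or touch the physical page containing p
//	}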
// maxPhysPageSize is the maximum page size the runtime supports.
maxPhysPageSize = 512 << 10
// maxPhysHugePageSize sets an upper-bound on the maximum huge page size
// that the runtime supports.
maxPhysHugePageSize = pallocChunkBytes
// pagesPerReclaimerChunk indicates how many pages to scan from the
// pageInUse bitmap at a time. Used by the page reclaimer.
//
// Higher values reduce contention on scanning indexes (such as
// h.reclaimIndex), but increase the minimum latency of the
// operation.
//
// The time required to scan this many pages can vary a lot depending
// on how many spans are actually freed. Experimentally, it can
// scan for pages at ~300 GB/ms on a 2.6GHz Core i7, but can only
// free spans at ~32 MB/ms. Using 512 pages bounds this at
// roughly 100µs.
//
// Must be a multiple of the pageInUse bitmap element size and
// must also evenly divide pagesPerArena.
pagesPerReclaimerChunk = 512
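// A rough sketch of where the "roughly 100µs" figure above comes from,
// assuming the runtime's 8 KiB page size and the ~32 MB/ms free rate
// quoted in the comment (illustrative arithmetic only):
//
//	chunkBytes := 512 * (8 << 10)                        // 4 MiB of pages per reclaimer chunk
//	worstCaseMs := float64(chunkBytes) / float64(32<<20) // ≈ 0.13 ms if every page must be freed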
// physPageAlignedStacks indicates whether stack allocations must be
// physical page aligned. This is a requirement for MAP_STACK on
// OpenBSD.
physPageAlignedStacks = GOOS == "openbsd"
)
// Main malloc heap.
// The heap itself is the "free" and "scav" treaps,
// but all the other global data is here too.
//
// mheap must not be heap-allocated because it contains mSpanLists,
// which must not be heap-allocated.
type mheap struct {
_ sys.NotInHeap
// lock must only be acquired on the system stack, otherwise a g
// could self-deadlock if its stack grows with the lock held.
lock mutex
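// A minimal sketch of the acquisition pattern the comment above implies:
// callers switch to the system stack before taking the lock so that no
// stack growth can happen while it is held (illustrative only):
//
//	systemstack(func() {
//		lock(&mheap_.lock)
//		// ... operate on the heap ...
//		unlock(&mheap_.lock)
//	})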
pages pageAlloc // page allocation data structure
sweepgen uint32 // sweep generation, see comment in mspan; written during STW
// allspans is a slice of all mspans ever created. Each mspan
// appears exactly once.
//
// The memory for allspans is manually managed and can be
// reallocated and move as the heap grows.
//
// In general, allspans is protected by mheap_.lock, which
// prevents concurrent access as well as freeing the backing
// store. Accesses during STW might not hold the lock, but
// must ensure that allocation cannot happen around the
// access (since that may free the backing store).
allspans []*mspan // all spans out there
// Proportional sweep
//
// These parameters represent a linear function from gcController.heapLive
// to page sweep count. The proportional sweep system works to
// stay in the black by keeping the current page sweep count
// above this line at the current gcController.heapLive.
//
// The line has slope sweepPagesPerByte and passes through a
// basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
// any given time, the system is at (gcController.heapLive,
// pagesSwept) in this space.
//
// It is important that the line pass through a point we
// control rather than simply starting at a 0,0 origin
// because that lets us adjust sweep pacing at any time while
// accounting for current progress. If we could only adjust
// the slope, it would create a discontinuity in debt if any
// progress has already been made.
pagesInUse atomic.Uint64 // pages of spans in stats mSpanInUse
pagesSwept atomic.Uint64 // pages swept this cycle
pagesSweptBasis atomic.Uint64 // pagesSwept to use as the origin of the sweep ratio
sweepHeapLiveBasis uint64 // value of gcController.heapLive to use as the origin of sweep ratio; written with lock, read without
sweepPagesPerByte float64 // proportional sweep ratio; written with lock, read without
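// A minimal sketch of evaluating the pace line described above at the
// current gcController.heapLive. The real pacing logic lives in the
// sweeper (mgcsweep.go); the local names here are illustrative:
//
//	live := gcController.heapLive.Load()
//	pagesTarget := h.sweepPagesPerByte*float64(live-h.sweepHeapLiveBasis) + float64(h.pagesSweptBasis.Load())
//	sweepDebt := pagesTarget - float64(h.pagesSwept.Load())
//	// sweepDebt > 0 means sweeping is behind the line and allocation
//	// should help sweep to catch up.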
// Page reclaimer state
// reclaimIndex is the page index in heapArenas of the next page to
// reclaim. Specifically, it refers to page (i %
// pagesPerArena) of arena heapArenas[i / pagesPerArena].
//
// If this is >= 1<<63, the page reclaimer is done scanning
// the page marks.
reclaimIndex atomic.Uint64
// reclaimCredit is spare credit for extra pages swept. Since
// the page reclaimer works in large chunks, it may reclaim
// more than requested. Any spare pages released go to this
// credit pool.
reclaimCredit atomic.Uintptr
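// A minimal sketch of how a page index taken from reclaimIndex
// decomposes into an arena and a page within it, per the comment above
// (the real loop lives in the page reclaimer; illustrative only):
//
//	idx := h.reclaimIndex.Load() // next page index to reclaim
//	arena := h.heapArenas[idx/pagesPerArena]
//	page := idx % pagesPerArena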
_ cpu.CacheLinePad // prevents false-sharing between arenas and preceding variables
// arenas is the heap arena map. It points to the metadata for
// the heap for every arena frame of the entire usable virtual
// address space.
//
// Use arenaIndex to compute indexes into this array.
//
// For regions of the address space that are not backed by the
// Go heap, the arena map contains nil.
//
// Modifications are protected by mheap_.lock. Reads can be
// performed without locking; however, a given entry can
// transition from nil to non-nil at any time when the lock
// isn't held. (Entries never transition back to nil.)
//
// In general, this is a two-level mapping consisting of an L1
// map and possibly many L2 maps. This saves space when there
// are a huge number of arena frames. However, on many
// platforms (even 64-bit), arenaL1Bits is 0, making this
// effectively a single-level map. In this case, arenas[0]
// will never be nil.
arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
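// A minimal sketch of the two-level lookup the comments above describe,
// in the spirit of how spanOf and similar lookups use this map
// (illustrative only):
//
//	ai := arenaIndex(p)     // arena frame index for address p
//	l2 := h.arenas[ai.l1()] // may be nil if nothing in this L1 region is mapped
//	if l2 != nil {
//		ha := l2[ai.l2()] // *heapArena, or nil for non-heap address space
//		_ = ha
//	}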
// arenasHugePages indicates whether arenas' L2 entries are eligible
// to be backed by huge pages.
arenasHugePages bool
// heapArenaAlloc is pre-reserved space for allocating heapArena
// objects. This is only used on 32-bit, where we pre-reserve
// this space to avoid interleaving it with the heap itself.
heapArenaAlloc linearAlloc
// arenaHints is a list of addresses at which to attempt to
// add more heap arenas. This is initially populated with a
// set of general hint addresses, and grown with the bounds of
// actual heap arena ranges.
arenaHints *arenaHint
// arena is a pre-reserved space for allocating heap arenas
// (the actual arenas). This is only used on 32-bit.
arena linearAlloc
// heapArenas is the arenaIndex of every arena mapped for the heap.
// This can be used to iterate through the heap address space.
//
// Access is protected by mheap_.lock. However, since this is
// append-only and old backing arrays are never freed, it is
// safe to acquire mheap_.lock, copy the slice header, and
// then release mheap_.lock.
heapArenas []arenaIdx
// userArenaArenas is the arenaIndex of every arena mapped for
// user arenas.
//
// Access is protected by mheap_.lock. However, since this is
// append-only and old backing arrays are never freed, it is
// safe to acquire mheap_.lock, copy the slice header, and
// then release mheap_.lock.
userArenaArenas []arenaIdx
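// A minimal sketch of the snapshot pattern both comments above rely on:
// copy the slice header under the lock, then iterate without it, which
// is safe because the slices are append-only and old backing arrays are
// never freed (illustrative only):
//
//	lock(&mheap_.lock)
//	arenas := h.heapArenas // copies only the slice header
//	unlock(&mheap_.lock)
//	for _, ai := range arenas {
//		_ = ai // safe to use without holding mheap_.lock
//	}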
// sweepArenas is a snapshot of heapArenas taken at the
// beginning of the sweep cycle. This can be read safely by
// simply blocking GC (by disabling preemption).
sweepArenas []arenaIdx
// markArenas is a snapshot of heapArenas taken at the beginning
// of the mark cycle. Because heapArenas is append-only, neither
// this slice nor its contents will change during the mark, so
// it can be read safely.
markArenas []arenaIdx
// curArena is the arena that the heap is currently growing
// into. This should always be physPageSize-aligned.
curArena struct {
base, end uintptr
}
// central free lists for small size classes.
// the padding makes sure that the mcentrals are
// spaced CacheLinePadSize bytes apart, so that each mcentral.lock
// gets its own cache line.
// central is indexed by spanClass.
central [numSpanClasses]struct {
mcentral mcentral
pad [(cpu.CacheLinePadSize - unsafe.Sizeof(mcentral{})%cpu.CacheLinePadSize) % cpu.CacheLinePadSize]byte
}
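// A minimal sketch of what the pad expression above computes: it rounds
// each array element up to the next multiple of cpu.CacheLinePadSize,
// and the outer % makes the pad zero when the size is already a
// multiple. Illustrative arithmetic assuming a 64-byte pad size and a
// hypothetical 200-byte mcentral:
//
//	pad := (64 - 200%64) % 64 // = 56, so each element occupies 200+56 = 256 bytes, a whole number of cache lines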
spanalloc fixalloc // allocator for span
spanSPMCAlloc fixalloc // allocator for spanSPMC, protected by work.spanSPMCs.lock
cachealloc fixalloc // allocator for mcache
specialfinalizeralloc fixalloc // allocator for specialfinalizer
specialCleanupAlloc fixalloc // allocator for specialCleanup
specialCheckFinalizerAlloc fixalloc // allocator for specialCheckFinalizer
specialTinyBlockAlloc fixalloc // allocator for specialTinyBlock
specialprofilealloc fixalloc // allocator for specialprofile
specialReachableAlloc fixalloc // allocator for specialReachable
specialPinCounterAlloc fixalloc // allocator for specialPinCounter
specialWeakHandleAlloc fixalloc // allocator for specialWeakHandle
specialBubbleAlloc fixalloc // allocator for specialBubble
speciallock mutex // lock for special record allocators.
arenaHintAlloc fixalloc // allocator for arenaHints
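// A minimal sketch of how these fixed-size allocators are used, in the
// style of the special-record allocation paths later in this file
// (illustrative only):
//
//	lock(&mheap_.speciallock)
//	s := (*specialfinalizer)(mheap_.specialfinalizeralloc.alloc())
//	unlock(&mheap_.speciallock)
//	// ... initialize s, then attach it to the object's span ...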
// User arena state.
//
// Protected by mheap_.lock.
userArena struct {
// arenaHints is a list of addresses at which to attempt to
// add more heap arenas for user arena chunks. This is initially
// populated with a set of general hint addresses, and grown with
// the bounds of actual heap arena ranges.
arenaHints *arenaHint
// quarantineList is a list of user arena spans that have been set to fault, but
// are waiting for all pointers into them to go away. Sweeping handles
// identifying when this is true, and moves the span to the ready list.
quarantineList mSpanList
// readyList is a list of empty user arena spans that are ready for reuse.
readyList mSpanList
}
// cleanupID is a counter which is incremented each time a cleanup special is added
// to a span. It's used to create globally unique identifiers for individual cleanups.
// cleanupID is protected by mheap_.speciallock. It must only be incremented while holding
// the lock. ID 0 is reserved. Users should increment first, then read the value.
cleanupID uint64
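// Illustrative sketch (not part of mheap.go): the increment-first, then-read
// protocol for handing out IDs, shown with a hypothetical mutex and counter
// standing in for mheap_.speciallock and cleanupID. ID 0 is never handed out
// because the counter is bumped before it is read. (Standalone; imports "sync".)
//
//	var (
//		cleanupMu  sync.Mutex
//		cleanupCtr uint64
//	)
//
//	func nextCleanupID() uint64 {
//		cleanupMu.Lock()
//		cleanupCtr++     // increment first...
//		id := cleanupCtr // ...then read; the first ID handed out is 1
//		cleanupMu.Unlock()
//		return id
//	}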
_ cpu.CacheLinePad
immortalWeakHandles immortalWeakHandleMap
unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
}
var mheap_ mheap
runtime: make the heap bitmap sparse This splits the heap bitmap into separate chunks for every 64MB of the heap and introduces an index mapping from virtual address to metadata. It modifies the heapBits abstraction to use this two-level structure. Finally, it modifies heapBitsSetType to unroll the bitmap into the object itself and then copy it out if the bitmap would span discontiguous bitmap chunks. This is a step toward supporting general sparse heaps, which will eliminate address space conflict failures as well as the limit on the heap size. It's also advantageous for 32-bit. 32-bit already supports discontiguous heaps by always starting the arena at address 0. However, as a result, with a contiguous bitmap, if the kernel chooses a high address (near 2GB) for a heap mapping, the runtime is forced to map up to 128MB of heap bitmap. Now the runtime can map sections of the bitmap for just the parts of the address space used by the heap. Updates #10460. This slightly slows down the x/garbage and compilebench benchmarks. However, I think the slowdown is acceptably small. name old time/op new time/op delta Template 178ms ± 1% 180ms ± 1% +0.78% (p=0.029 n=10+10) Unicode 85.7ms ± 2% 86.5ms ± 2% ~ (p=0.089 n=10+10) GoTypes 594ms ± 0% 599ms ± 1% +0.70% (p=0.000 n=9+9) Compiler 2.86s ± 0% 2.87s ± 0% +0.40% (p=0.001 n=9+9) SSA 7.23s ± 2% 7.29s ± 2% +0.94% (p=0.029 n=10+10) Flate 116ms ± 1% 117ms ± 1% +0.99% (p=0.000 n=9+9) GoParser 146ms ± 1% 146ms ± 0% ~ (p=0.193 n=10+7) Reflect 399ms ± 0% 403ms ± 1% +0.89% (p=0.001 n=10+10) Tar 173ms ± 1% 174ms ± 1% +0.91% (p=0.013 n=10+9) XML 208ms ± 1% 210ms ± 1% +0.93% (p=0.000 n=10+10) [Geo mean] 368ms 371ms +0.79% name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.17ms ± 1% 2.21ms ± 1% +2.15% (p=0.000 n=20+20) Change-Id: I037fd283221976f4f61249119d6b97b100bcbc66 Reviewed-on: https://go-review.googlesource.com/85883 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>
2017-12-08 22:57:53 -05:00
// A heapArena stores metadata for a heap arena. heapArenas are stored
// outside of the Go heap and accessed via the mheap_.arenas index.
type heapArena struct {
_ sys.NotInHeap
// spans maps from virtual address page ID within this arena to *mspan.
// For allocated spans, their pages map to the span itself.
// For free spans, only the lowest and highest pages map to the span itself.
// Internal pages map to an arbitrary span.
// For pages that have never been allocated, spans entries are nil.
//
// Modifications are protected by mheap.lock. Reads can be
// performed without locking, but ONLY from indexes that are
// known to contain in-use or stack spans. This means there
// must not be a safe-point between establishing that an
// address is live and looking it up in the spans array.
spans [pagesPerArena]*mspan
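// Illustrative sketch (not part of mheap.go): how an address is reduced to the
// page ID used to index a spans-style table. The constants and the flat,
// single-level arena index are assumptions for illustration; the real runtime
// also applies arenaBaseOffset and, on some platforms, a two-level arena index.
//
//	const (
//		arenaBytes = 64 << 20 // assumed 64 MiB heap arenas
//		pageBytes  = 8 << 10  // assumed 8 KiB pages
//	)
//
//	// spanIndex returns which arena an address falls in and the page ID
//	// within that arena, i.e. the index into that arena's spans array.
//	func spanIndex(addr uintptr) (arena, pageID uintptr) {
//		return addr / arenaBytes, (addr % arenaBytes) / pageBytes
//	}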
// pageInUse is a bitmap that indicates which spans are in
// state mSpanInUse. This bitmap is indexed by page number,
// but only the bit corresponding to the first page in each
// span is used.
//
// Reads and writes are atomic.
pageInUse [pagesPerArena / 8]uint8
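// Illustrative sketch (not part of mheap.go): a pageInUse-style bitmap keyed by
// the span's first page. The helpers below are hypothetical and non-atomic for
// brevity; the runtime performs these reads and writes with 8-bit atomics.
//
//	func setPageInUse(bits []uint8, firstPage uintptr) {
//		bits[firstPage/8] |= 1 << (firstPage % 8)
//	}
//
//	func pageInUseBit(bits []uint8, firstPage uintptr) bool {
//		return bits[firstPage/8]&(1<<(firstPage%8)) != 0
//	}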
runtime: mark span when marking any object on the span This adds a mark bit for each span that is set if any objects on the span are marked. This will be used for sweeping. For #18155. The impact of this is negligible for most benchmarks, and < 1% for GC-heavy benchmarks. name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.18ms ± 0% 2.20ms ± 1% +0.88% (p=0.000 n=16+18) (https://perf.golang.org/search?q=upload:20180928.1) name old time/op new time/op delta BinaryTree17-12 2.68s ± 1% 2.68s ± 1% ~ (p=0.707 n=17+19) Fannkuch11-12 2.28s ± 0% 2.39s ± 0% +4.95% (p=0.000 n=19+18) FmtFprintfEmpty-12 40.3ns ± 4% 39.4ns ± 2% -2.27% (p=0.000 n=17+18) FmtFprintfString-12 67.9ns ± 1% 68.3ns ± 1% +0.55% (p=0.000 n=18+19) FmtFprintfInt-12 75.7ns ± 1% 76.1ns ± 1% +0.44% (p=0.005 n=18+19) FmtFprintfIntInt-12 123ns ± 1% 121ns ± 1% -1.00% (p=0.000 n=18+18) FmtFprintfPrefixedInt-12 150ns ± 0% 148ns ± 0% -1.33% (p=0.000 n=16+13) FmtFprintfFloat-12 208ns ± 0% 204ns ± 0% -1.92% (p=0.000 n=13+17) FmtManyArgs-12 501ns ± 1% 498ns ± 0% -0.55% (p=0.000 n=19+17) GobDecode-12 6.24ms ± 0% 6.25ms ± 1% ~ (p=0.113 n=20+19) GobEncode-12 5.33ms ± 0% 5.29ms ± 1% -0.72% (p=0.000 n=20+18) Gzip-12 220ms ± 1% 218ms ± 1% -1.02% (p=0.000 n=19+19) Gunzip-12 35.5ms ± 0% 35.7ms ± 0% +0.45% (p=0.000 n=16+18) HTTPClientServer-12 77.9µs ± 1% 77.7µs ± 1% -0.30% (p=0.047 n=20+19) JSONEncode-12 8.82ms ± 0% 8.93ms ± 0% +1.20% (p=0.000 n=18+17) JSONDecode-12 47.3ms ± 0% 47.0ms ± 0% -0.49% (p=0.000 n=17+18) Mandelbrot200-12 3.69ms ± 0% 3.68ms ± 0% -0.25% (p=0.000 n=19+18) GoParse-12 3.13ms ± 1% 3.13ms ± 1% ~ (p=0.640 n=20+20) RegexpMatchEasy0_32-12 76.2ns ± 1% 76.2ns ± 1% ~ (p=0.818 n=20+19) RegexpMatchEasy0_1K-12 226ns ± 0% 226ns ± 0% -0.22% (p=0.001 n=17+18) RegexpMatchEasy1_32-12 71.9ns ± 1% 72.0ns ± 1% ~ (p=0.653 n=18+18) RegexpMatchEasy1_1K-12 355ns ± 1% 356ns ± 1% ~ (p=0.160 n=18+19) RegexpMatchMedium_32-12 106ns ± 1% 106ns ± 1% ~ (p=0.325 n=17+20) RegexpMatchMedium_1K-12 31.1µs ± 2% 31.2µs ± 0% +0.59% (p=0.007 n=19+15) RegexpMatchHard_32-12 1.54µs ± 2% 1.53µs ± 2% -0.78% (p=0.021 n=17+18) RegexpMatchHard_1K-12 46.0µs ± 1% 45.9µs ± 1% -0.31% (p=0.025 n=17+19) Revcomp-12 391ms ± 1% 394ms ± 2% +0.80% (p=0.000 n=17+19) Template-12 59.9ms ± 1% 59.9ms ± 1% ~ (p=0.428 n=20+19) TimeParse-12 304ns ± 1% 312ns ± 0% +2.88% (p=0.000 n=20+17) TimeFormat-12 318ns ± 0% 326ns ± 0% +2.64% (p=0.000 n=20+17) (https://perf.golang.org/search?q=upload:20180928.2) Change-Id: I336b9bf054113580a24103192904c8c76593e90e Reviewed-on: https://go-review.googlesource.com/c/138958 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
2018-09-26 15:59:21 -04:00
// pageMarks is a bitmap that indicates which spans have any
// marked objects on them. Like pageInUse, only the bit
// corresponding to the first page in each span is used.
//
// Writes are done atomically during marking. Reads are
// non-atomic and lock-free since they only occur during
// sweeping (and hence never race with writes).
//
// This is used to quickly find whole spans that can be freed.
//
// TODO(austin): It would be nice if this was uint64 for
// faster scanning, but we don't have 64-bit atomic bit
// operations.
pageMarks [pagesPerArena / 8]uint8
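// Illustrative sketch (not part of mheap.go): how a sweeper can combine a
// pageInUse-style bitmap with a pageMarks-style bitmap to quickly find in-use
// spans with no marked objects. The byte-wise AND NOT lets whole groups of
// eight spans be skipped when the result is zero; freeSpan is a hypothetical
// callback, and the real reclaimer also checks span state before freeing.
//
//	func reclaimUnmarked(inUse, marks []uint8, freeSpan func(firstPage uintptr)) {
//		for i := range inUse {
//			unmarked := inUse[i] &^ marks[i] // in-use spans with nothing marked
//			for bit := uintptr(0); bit < 8; bit++ {
//				if unmarked&(1<<bit) != 0 {
//					freeSpan(uintptr(i)*8 + bit)
//				}
//			}
//		}
//	}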
runtime: add bitmap-based markrootSpans implementation Currently markrootSpans, the scanning routine which scans span specials (particularly finalizers) as roots, uses sweepSpans to shard work and find spans to mark. However, as part of a future CL to change span ownership and how mcentral works, we want to avoid having markrootSpans use the sweep bufs to find specials, so in this change we introduce a new mechanism. Much like for the page reclaimer, we set up a per-page bitmap where the first page for a span is marked if the span contains any specials, and unmarked if it has no specials. This bitmap is updated by addspecial, removespecial, and during sweeping. markrootSpans then shards this bitmap into mark work and markers iterate over the bitmap looking for spans with specials to mark. Unlike the page reclaimer, we don't need to use the pageInUse bits because having a special implies that a span is in-use. While in terms of computational complexity this design is technically worse, because it needs to iterate over the mapped heap, in practice this iteration is very fast (we can skip over large swathes of the heap very quickly) and we only look at spans that have any specials at all, rather than having to touch each span. This new implementation of markrootSpans is behind a feature flag called go115NewMarkrootSpans. Updates #37487. Change-Id: I8ea07b6c11059f6d412fe419e0ab512d989377b8 Reviewed-on: https://go-review.googlesource.com/c/go/+/221178 Run-TryBot: Michael Knyszek <mknyszek@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com>
2020-02-20 20:03:39 +00:00
// pageSpecials is a bitmap that indicates which spans have
// specials (finalizers or other). Like pageInUse, only the bit
// corresponding to the first page in each span is used.
//
// Writes are done atomically whenever a special is added to
// a span and whenever the last special is removed from a span.
// Reads are done atomically to find spans containing specials
// during marking.
pageSpecials [pagesPerArena / 8]uint8
// pageUseSpanInlineMarkBits is a bitmap where each bit corresponds
// to a span, as only spans one page in size can have inline mark bits.
// The bit indicates that the span has a spanInlineMarkBits struct
// stored directly at the top end of the span's memory.
pageUseSpanInlineMarkBits [pagesPerArena / 8]uint8
runtime: move checkmarks to a separate bitmap Currently, the GC stores the object marks for checkmarks mode in the heap bitmap using a rather complex encoding: for one word objects, the checkmark is stored in the pointer/scalar bit since one word objects must be pointers; for larger objects, the checkmark is stored in what would be the scan/dead bit for the second word of the object. This encoding made more sense when the runtime used the first scan/dead bit as the regular mark bit, but we moved away from that long ago. This encoding and overloading of the heap bitmap bits causes a great deal of complexity in many parts of the allocator and garbage collector and leads to some subtle bugs like #15903. This CL moves the checkmarks mark bits into their own per-arena bitmap and reclaims the second scan/dead bit as a regular scan/dead bit. I tested this by enabling doubleCheck mode in heapBitsSetType and running in both regular and GODEBUG=gccheckmark=1 mode. Fixes #15903. No performance degradation. (Very slight improvement on a few benchmarks, but it's probably just noise.) name old time/op new time/op delta BiogoIgor 16.6s ± 1% 16.4s ± 1% -0.94% (p=0.000 n=25+24) BiogoKrishna 19.2s ± 3% 19.2s ± 3% ~ (p=0.638 n=23+25) BleveIndexBatch100 6.12s ± 5% 6.17s ± 4% ~ (p=0.170 n=25+25) CompileTemplate 206ms ± 1% 205ms ± 1% -0.43% (p=0.005 n=24+24) CompileUnicode 82.2ms ± 2% 81.5ms ± 2% -0.95% (p=0.001 n=22+22) CompileGoTypes 755ms ± 3% 754ms ± 4% ~ (p=0.715 n=25+25) CompileCompiler 3.73s ± 1% 3.73s ± 1% ~ (p=0.445 n=25+24) CompileSSA 8.67s ± 1% 8.66s ± 1% ~ (p=0.836 n=24+22) CompileFlate 134ms ± 2% 133ms ± 1% -0.66% (p=0.001 n=24+23) CompileGoParser 164ms ± 1% 163ms ± 1% -0.85% (p=0.000 n=24+24) CompileReflect 466ms ± 5% 466ms ± 3% ~ (p=0.863 n=25+25) CompileTar 182ms ± 1% 182ms ± 1% -0.31% (p=0.048 n=24+24) CompileXML 249ms ± 1% 248ms ± 1% -0.32% (p=0.031 n=21+25) CompileStdCmd 10.3s ± 1% 10.3s ± 1% ~ (p=0.459 n=23+23) FoglemanFauxGLRenderRotateBoat 8.66s ± 1% 8.62s ± 1% -0.47% (p=0.000 n=23+24) FoglemanPathTraceRenderGopherIter1 20.3s ± 3% 20.2s ± 2% ~ (p=0.893 n=25+25) GopherLuaKNucleotide 29.7s ± 1% 29.8s ± 2% ~ (p=0.421 n=24+25) MarkdownRenderXHTML 246ms ± 1% 247ms ± 1% ~ (p=0.558 n=25+24) Tile38WithinCircle100kmRequest 779µs ± 4% 779µs ± 3% ~ (p=0.954 n=25+25) Tile38IntersectsCircle100kmRequest 1.02ms ± 3% 1.01ms ± 4% ~ (p=0.658 n=25+25) Tile38KNearestLimit100Request 984µs ± 4% 986µs ± 4% ~ (p=0.627 n=24+25) [Geo mean] 552ms 551ms -0.19% https://perf.golang.org/search?q=upload:20200723.6 Change-Id: Ic703f26a83fb034941dc6f4788fc997d56890dec Reviewed-on: https://go-review.googlesource.com/c/go/+/244539 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Martin Möhrmann <moehrmann@google.com>
2020-06-05 16:48:03 -04:00
// checkmarks stores the debug.gccheckmark state. It is only
// used if debug.gccheckmark > 0 or debug.checkfinalizers > 0.
checkmarks *checkmarksMap
// zeroedBase marks the first byte of the first page in this
// arena which hasn't been used yet and is therefore already
// zero. zeroedBase is relative to the arena base.
// Increases monotonically until it hits heapArenaBytes.
//
// This field is sufficient to determine if an allocation
// needs to be zeroed because the page allocator follows an
// address-ordered first-fit policy.
//
// Read atomically and written with an atomic CAS.
zeroedBase uintptr
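// Illustrative sketch (not part of mheap.go): using a zeroedBase-style
// watermark to decide whether an allocation needs zeroing, then advancing the
// watermark with a CAS. Offsets are arena-relative, as for the real field.
// The type and method below are hypothetical. (Standalone; imports "sync/atomic".)
//
//	type arenaZeroTracker struct {
//		zeroedBase atomic.Uintptr // first never-used (hence still zero) offset
//	}
//
//	// needsZero reports whether [off, off+size) may contain dirty memory and
//	// must be zeroed, and bumps the watermark past the allocation.
//	func (a *arenaZeroTracker) needsZero(off, size uintptr) bool {
//		zb := a.zeroedBase.Load()
//		dirty := off < zb // allocation overlaps previously used memory
//		for limit := off + size; limit > zb; {
//			if a.zeroedBase.CompareAndSwap(zb, limit) {
//				break
//			}
//			zb = a.zeroedBase.Load()
//		}
//		return dirty
//	}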
}
runtime: use sparse mappings for the heap This replaces the contiguous heap arena mapping with a potentially sparse mapping that can support heap mappings anywhere in the address space. This has several advantages over the current approach: * There is no longer any limit on the size of the Go heap. (Currently it's limited to 512GB.) Hence, this fixes #10460. * It eliminates many failure modes of heap initialization and growing. In particular it eliminates any possibility of panicking with an address space conflict. This can happen for many reasons and even causes a low but steady rate of TSAN test failures because of conflicts with the TSAN runtime. See #16936 and #11993. * It eliminates the notion of "non-reserved" heap, which was added because creating huge address space reservations (particularly on 64-bit) led to huge process VSIZE. This was at best confusing and at worst conflicted badly with ulimit -v. However, the non-reserved heap logic is complicated, can race with other mappings in non-pure Go binaries (e.g., #18976), and requires that the entire heap be either reserved or non-reserved. We currently maintain the latter property, but it's quite difficult to convince yourself of that, and hence difficult to keep correct. This logic is still present, but will be removed in the next CL. * It fixes problems on 32-bit where skipping over parts of the address space leads to mapping huge (and never-to-be-used) metadata structures. See #19831. This also completely rewrites and significantly simplifies mheap.sysAlloc, which has been a source of many bugs. E.g., #21044, #20259, #18651, and #13143 (and maybe #23222). This change also makes it possible to allocate individual objects larger than 512GB. As a result, a few tests that expected huge allocations to fail needed to be changed to make even larger allocations. However, at the moment attempting to allocate a humongous object may cause the program to freeze for several minutes on Linux as we fall back to probing every page with addrspace_free. That logic (and this failure mode) will be removed in the next CL. Fixes #10460. Fixes #22204 (since it rewrites the code involved). This slightly slows down compilebench and the x/benchmarks garbage benchmark.
name old time/op new time/op delta Template 184ms ± 1% 185ms ± 1% ~ (p=0.065 n=10+9) Unicode 86.9ms ± 3% 86.3ms ± 1% ~ (p=0.631 n=10+10) GoTypes 599ms ± 0% 602ms ± 0% +0.56% (p=0.000 n=10+9) Compiler 2.87s ± 1% 2.89s ± 1% +0.51% (p=0.002 n=9+10) SSA 7.29s ± 1% 7.25s ± 1% ~ (p=0.182 n=10+9) Flate 118ms ± 2% 118ms ± 1% ~ (p=0.113 n=9+9) GoParser 147ms ± 1% 148ms ± 1% +1.07% (p=0.003 n=9+10) Reflect 401ms ± 1% 404ms ± 1% +0.71% (p=0.003 n=10+9) Tar 175ms ± 1% 175ms ± 1% ~ (p=0.604 n=9+10) XML 209ms ± 1% 210ms ± 1% ~ (p=0.052 n=10+10) (https://perf.golang.org/search?q=upload:20171231.4) name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.23ms ± 1% 2.25ms ± 1% +0.84% (p=0.000 n=19+19) (https://perf.golang.org/search?q=upload:20171231.3) Relative to the start of the sparse heap changes (starting at and including "runtime: fix various contiguous bitmap assumptions"), overall slowdown is roughly 1% on GC-intensive benchmarks: name old time/op new time/op delta Template 183ms ± 1% 185ms ± 1% +1.32% (p=0.000 n=9+9) Unicode 84.9ms ± 2% 86.3ms ± 1% +1.65% (p=0.000 n=9+10) GoTypes 595ms ± 1% 602ms ± 0% +1.19% (p=0.000 n=9+9) Compiler 2.86s ± 0% 2.89s ± 1% +0.91% (p=0.000 n=9+10) SSA 7.19s ± 0% 7.25s ± 1% +0.75% (p=0.000 n=8+9) Flate 117ms ± 1% 118ms ± 1% +1.10% (p=0.000 n=10+9) GoParser 146ms ± 2% 148ms ± 1% +1.48% (p=0.002 n=10+10) Reflect 398ms ± 1% 404ms ± 1% +1.51% (p=0.000 n=10+9) Tar 173ms ± 1% 175ms ± 1% +1.17% (p=0.000 n=10+10) XML 208ms ± 1% 210ms ± 1% +0.62% (p=0.011 n=10+10) [Geo mean] 369ms 373ms +1.17% (https://perf.golang.org/search?q=upload:20180101.2) name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.22ms ± 1% 2.25ms ± 1% +1.51% (p=0.000 n=20+19) (https://perf.golang.org/search?q=upload:20180101.3) Change-Id: I5daf4cfec24b252e5a57001f0a6c03f22479d0f0 Reviewed-on: https://go-review.googlesource.com/85887 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>
2017-12-19 22:05:23 -08:00
// arenaHint is a hint for where to grow the heap arenas. See
// mheap_.arenaHints.
type arenaHint struct {
_ sys.NotInHeap
addr uintptr
down bool
next *arenaHint
}
// An mspan is a run of pages.
//
// When a mspan is in the heap free treap, state == mSpanFree
// and heapmap(s->start) == span, heapmap(s->start+s->npages-1) == span.
// If the mspan is in the heap scav treap, then in addition to the
// above scavenged == true. scavenged == false in all other cases.
//
// When a mspan is allocated, state == mSpanInUse or mSpanManual
// and heapmap(i) == span for all s->start <= i < s->start+s->npages.
// Every mspan is in one doubly-linked list, either in the mheap's
// busy list or one of the mcentral's span lists.
// An mspan representing actual memory has state mSpanInUse,
// mSpanManual, or mSpanFree. Transitions between these states are
runtime: don't free stack spans during GC Memory for stacks is manually managed by the runtime and, currently (with one exception) we free stack spans immediately when the last stack on a span is freed. However, the garbage collector assumes that spans can never transition from non-free to free during scan or mark. This disagreement makes it possible for the garbage collector to mark uninitialized objects and is blocking us from re-enabling the bad pointer test in the garbage collector (issue #9880). For example, the following sequence will result in marking an uninitialized object: 1. scanobject loads a pointer slot out of the object it's scanning. This happens to be one of the special pointers from the heap into a stack. Call the pointer p and suppose it points into X's stack. 2. X, running on another thread, grows its stack and frees its old stack. 3. The old stack happens to be large or was the last stack in its span, so X frees this span, setting it to state _MSpanFree. 4. The span gets reused as a heap span. 5. scanobject calls heapBitsForObject, which loads the span containing p, which is now in state _MSpanInUse, but doesn't necessarily have an object at p. The not-object at p gets marked, and at this point all sorts of things can go wrong. We already have a partial solution to this. When shrinking a stack, we put the old stack on a queue to be freed at the end of garbage collection. This was done to address exactly this problem, but wasn't a complete solution. This commit generalizes this solution to both shrinking and growing stacks. For stacks that fit in the stack pool, we simply don't free the span, even if its reference count reaches zero. It's fine to reuse the span for other stacks, and this enables that. At the end of GC, we sweep for cached stack spans with a zero reference count and free them. For larger stacks, we simply queue the stack span to be freed at the end of GC. Ideally, we would reuse these large stack spans the way we can small stack spans, but that's a more invasive change that will have to wait until after the freeze. Fixes #11267. Change-Id: Ib7f2c5da4845cc0268e8dc098b08465116972a71 Reviewed-on: https://go-review.googlesource.com/11502 Reviewed-by: Russ Cox <rsc@golang.org>
2015-06-22 10:24:50 -04:00
// constrained as follows:
//
// - A span may transition from free to in-use or manual during any GC
// phase.
//
// - During sweeping (gcphase == _GCoff), a span may transition from
// in-use to free (as a result of sweeping) or manual to free (as a
// result of stacks being freed).
//
// - During GC (gcphase != _GCoff), a span *must not* transition from
// manual or in-use to free. Because concurrent GC may read a pointer
// and then look up its span, the span state must be monotonic.
runtime: atomically set span state and use as publication barrier When everything is working correctly, any pointer the garbage collector encounters can only point into a fully initialized heap span, since the span must have been initialized before that pointer could escape the heap allocator and become visible to the GC. However, in various cases, we try to be defensive against bad pointers. In findObject, this is just a sanity check: we never expect to find a bad pointer, but programming errors can lead to them. In spanOfHeap, we don't necessarily trust the pointer and we're trying to check if it really does point to the heap, though it should always point to something. Conservative scanning takes this to a new level, since it can only guess that a word may be a pointer and verify this. In all of these cases, we have a problem that the span lookup and check can race with span initialization, since the span becomes visible to lookups before it's fully initialized. Furthermore, we're about to start initializing the span without the heap lock held, which is going to introduce races where accesses were previously protected by the heap lock. To address this, this CL makes accesses to mspan.state atomic, and ensures that the span is fully initialized before setting the state to mSpanInUse. All loads are now atomic, and in any case where we don't trust the pointer, it first atomically loads the span state and checks that it's mSpanInUse, after which it will have synchronized with span initialization and can safely check the other span fields. For #10958, #24543, but a good fix in general. Change-Id: I518b7c63555b02064b98aa5f802c92b758fef853 Reviewed-on: https://go-review.googlesource.com/c/go/+/203286 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
2019-10-23 11:25:38 -04:00
//
// Setting mspan.state to mSpanInUse or mSpanManual must be done
// atomically and only after all other span fields are valid.
// Likewise, if inspecting a span is contingent on it being
// mSpanInUse, the state should be loaded atomically and checked
// before depending on other fields. This allows the garbage collector
// to safely deal with potentially invalid pointers, since resolving
// such pointers may race with a span being allocated.
type mSpanState uint8
const (
mSpanDead mSpanState = iota
mSpanInUse // allocated for garbage collected heap
mSpanManual // allocated for manual management (e.g., stack allocator)
)
// mSpanStateNames are the names of the span states, indexed by
// mSpanState.
var mSpanStateNames = []string{
"mSpanDead",
"mSpanInUse",
"mSpanManual",
}
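// Illustrative sketch (not part of mheap.go): mSpanStateNames is indexed
// directly by the state value, so turning a state into a printable name is a
// bounds-checked table lookup. The helper below is hypothetical, not a
// function the runtime defines.
//
//	func spanStateName(s mSpanState) string {
//		if int(s) < len(mSpanStateNames) {
//			return mSpanStateNames[s]
//		}
//		return "unknown"
//	}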
// mSpanStateBox holds an atomic.Uint8 to provide atomic operations on
// an mSpanState. This is a separate type to disallow accidental comparison
// or assignment with mSpanState.
type mSpanStateBox struct {
s atomic.Uint8
}
// It is nosplit to match get, below.
//go:nosplit
func (b *mSpanStateBox) set(s mSpanState) {
b.s.Store(uint8(s))
}
// It is nosplit because it's called indirectly by typedmemclr,
// which must not be preempted.
//go:nosplit
func (b *mSpanStateBox) get() mSpanState {
return mSpanState(b.s.Load())
}
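// Illustrative sketch (not part of mheap.go): the publication protocol the
// mSpanState comment above describes, using a hypothetical, trimmed-down span
// type around the real mSpanStateBox. The writer fills in every field before
// the atomic state store; the reader checks the atomically loaded state before
// trusting any other field.
//
//	type miniSpan struct {
//		base, npages uintptr
//		state        mSpanStateBox
//	}
//
//	func publish(s *miniSpan, base, npages uintptr) {
//		s.base = base
//		s.npages = npages
//		s.state.set(mSpanInUse) // only now is the span visible as in-use
//	}
//
//	func inspect(s *miniSpan) (base uintptr, ok bool) {
//		if s.state.get() != mSpanInUse {
//			return 0, false // not (or not yet) a fully initialized in-use span
//		}
//		return s.base, true
//	}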
type mspan struct {
_ sys.NotInHeap
next *mspan // next span in list, or nil if none
prev *mspan // previous span in list, or nil if none
list *mSpanList // For debugging.
startAddr uintptr // address of first byte of span aka s.base()
npages uintptr // number of pages in span
manualFreeList gclinkptr // list of free objects in mSpanManual spans
// freeindex is the slot index between 0 and nelems at which to begin scanning
// for the next free object in this span.
// Each allocation scans allocBits starting at freeindex until it encounters a 0
// indicating a free object. freeindex is then adjusted so that subsequent scans begin
// just past the newly discovered free object.
//
// If freeindex == nelems, this span has no free objects.
//
// allocBits is a bitmap of objects in this span.
// If n >= freeindex and allocBits[n/8] & (1<<(n%8)) is 0
// then object n is free;
// otherwise, object n is allocated. Bits starting at nelems are
// undefined and should never be referenced.
//
// Object n starts at address n*elemsize + (start << pageShift).
freeindex uint16
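// Illustrative sketch (not part of mheap.go): the bitmap walk the comment
// above describes, over a plain []uint8 standing in for allocBits. A clear bit
// at index n >= freeindex means object n is free; the caller would then set
// freeindex just past n. This ignores allocCache and the rest of the span.
//
//	func nextFree(allocBits []uint8, freeindex, nelems uint16) (idx uint16, ok bool) {
//		for n := freeindex; n < nelems; n++ {
//			if allocBits[n/8]&(1<<(n%8)) == 0 {
//				return n, true // object n is free
//			}
//		}
//		return nelems, false // no free objects left in this span
//	}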
// TODO: Look up nelems from sizeclass and remove this field if it
// helps performance.
nelems uint16 // number of objects in the span.
// freeIndexForScan is like freeindex, except that freeindex is
// used by the allocator whereas freeIndexForScan is used by the
// GC scanner. They are two fields so that the GC sees the object
// is allocated only when the object and the heap bits are
// initialized (see also the assignment of freeIndexForScan in
// mallocgc, and issue 54596).
freeIndexForScan uint16
// Temporary storage for the object index that caused this span to
// be queued for scanning.
//
// Used only with goexperiment.GreenTeaGC.
scanIdx uint16
// Cache of the allocBits at freeindex. allocCache is shifted
// such that the lowest bit corresponds to the bit freeindex.
// allocCache holds the complement of allocBits, thus allowing
// ctz (count trailing zero) to use it directly.
// allocCache may contain bits beyond s.nelems; the caller must ignore
// these.
allocCache uint64
// allocBits and gcmarkBits hold pointers to a span's mark and
// allocation bits. The pointers are 8 byte aligned.
// There are four arenas where this data is held.
// free: Dirty arenas that are no longer accessed
// and can be reused.
// next: Holds information to be used in the next GC cycle.
// current: Information being used during this GC cycle.
// previous: Information being used during the last GC cycle.
// A new GC cycle starts with the call to finishsweep_m.
// finishsweep_m moves the previous arena to the free arena,
// the current arena to the previous arena, and
// the next arena to the current arena.
// The next arena is populated as the spans request
// memory to hold gcmarkBits for the next GC cycle as well
// as allocBits for newly allocated spans.
//
// The pointer arithmetic is done "by hand" instead of using
// arrays to avoid bounds checks along critical performance
// paths.
// The sweep will free the old allocBits and set allocBits to the
// gcmarkBits. The gcmarkBits are replaced with freshly zeroed
// memory.
allocBits *gcBits
gcmarkBits *gcBits
pinnerBits *gcBits // bitmap for pinned objects; accessed atomically
// sweep generation:
// if sweepgen == h->sweepgen - 2, the span needs sweeping
// if sweepgen == h->sweepgen - 1, the span is currently being swept
// if sweepgen == h->sweepgen, the span is swept and ready to use
// if sweepgen == h->sweepgen + 1, the span was cached before sweep began and is still cached, and needs sweeping
// if sweepgen == h->sweepgen + 3, the span was swept and then cached and is still cached
// h->sweepgen is incremented by 2 after every GC
sweepgen uint32
divMul uint32 // for divide by elemsize
allocCount uint16 // number of allocated objects
spanclass spanClass // size class and noscan (uint8)
state mSpanStateBox // mSpanInUse etc; accessed atomically (get/set methods)
needzero uint8 // needs to be zeroed before allocation
isUserArenaChunk bool // whether or not this span represents a user arena
allocCountBeforeCache uint16 // a copy of allocCount that is stored just before this span is cached
elemsize uintptr // computed from sizeclass or from npages
limit uintptr // end of data in span
speciallock mutex // guards specials list and changes to pinnerBits
specials *special // linked list of special records sorted by offset.
userArenaChunkFree addrRange // interval for managing chunk allocation
largeType *_type // malloc header for large objects.
}
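// The following is an illustrative sketch, not part of the runtime: it spells
// out the sweepgen protocol documented in the mspan struct above as an
// explicit state machine. The name sweepStateHint and the result strings are
// invented for this example; hSweepgen stands in for mheap_.sweepgen.
func sweepStateHint(spanSweepgen, hSweepgen uint32) string {
	switch spanSweepgen {
	case hSweepgen - 2:
		return "needs sweeping"
	case hSweepgen - 1:
		return "currently being swept"
	case hSweepgen:
		return "swept and ready to use"
	case hSweepgen + 1:
		return "cached before sweep began, still cached, needs sweeping"
	case hSweepgen + 3:
		return "swept and then cached, still cached"
	default:
		return "unexpected sweepgen value"
	}
}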
func (s *mspan) base() uintptr {
return s.startAddr
}
func (s *mspan) layout() (size, n, total uintptr) {
total = s.npages << gc.PageShift
size = s.elemsize
if size > 0 {
n = total / size
}
return
}
// recordspan adds a newly allocated span to h.allspans.
//
// This only happens the first time a span is allocated from
// mheap.spanalloc (it is not called when a span is reused).
//
// Write barriers are disallowed here because it can be called from
// gcWork when allocating new workbufs. However, because it's an
// indirect call from the fixalloc initializer, the compiler can't see
// this.
//
// The heap lock must be held.
//
//go:nowritebarrierrec
func recordspan(vh unsafe.Pointer, p unsafe.Pointer) {
h := (*mheap)(vh)
s := (*mspan)(p)
assertLockHeld(&h.lock)
if len(h.allspans) >= cap(h.allspans) {
n := 64 * 1024 / goarch.PtrSize
if n < cap(h.allspans)*3/2 {
n = cap(h.allspans) * 3 / 2
}
var new []*mspan
sp := (*slice)(unsafe.Pointer(&new))
sp.array = sysAlloc(uintptr(n)*goarch.PtrSize, &memstats.other_sys, "allspans array")
if sp.array == nil {
throw("runtime: cannot allocate memory")
}
sp.len = len(h.allspans)
sp.cap = n
if len(h.allspans) > 0 {
copy(new, h.allspans)
}
oldAllspans := h.allspans
*(*notInHeapSlice)(unsafe.Pointer(&h.allspans)) = *(*notInHeapSlice)(unsafe.Pointer(&new))
if len(oldAllspans) != 0 {
sysFree(unsafe.Pointer(&oldAllspans[0]), uintptr(cap(oldAllspans))*unsafe.Sizeof(oldAllspans[0]), &memstats.other_sys)
}
}
h.allspans = h.allspans[:len(h.allspans)+1]
h.allspans[len(h.allspans)-1] = s
}
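// An illustrative sketch, not part of the runtime: the capacity policy
// recordspan applies above when h.allspans is full. It starts at 64 KiB worth
// of pointer slots and grows the capacity by 1.5x once past that floor.
// allspansNextCap is an invented name for this example.
func allspansNextCap(oldCap int) int {
	n := 64 * 1024 / goarch.PtrSize
	if n < oldCap*3/2 {
		n = oldCap * 3 / 2
	}
	return n
}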
// A spanClass represents the size class and noscan-ness of a span.
//
// Each size class has a noscan spanClass and a scan spanClass. The
// noscan spanClass contains only noscan objects, which do not contain
// pointers and thus do not need to be scanned by the garbage
// collector.
type spanClass uint8
const (
numSpanClasses = gc.NumSizeClasses << 1
tinySpanClass = spanClass(tinySizeClass<<1 | 1)
)
func makeSpanClass(sizeclass uint8, noscan bool) spanClass {
return spanClass(sizeclass<<1) | spanClass(bool2int(noscan))
}
//go:nosplit
func (sc spanClass) sizeclass() int8 {
return int8(sc >> 1)
}
//go:nosplit
func (sc spanClass) noscan() bool {
return sc&1 != 0
}
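// An illustrative sketch, not part of the runtime: makeSpanClass packs the
// size class into the upper seven bits and the noscan flag into bit 0, and
// sizeclass and noscan recover the original values. spanClassExample is an
// invented name; size class 5 is an arbitrary choice for the example.
func spanClassExample() (int8, bool) {
	sc := makeSpanClass(5, true)       // size class 5, noscan objects
	return sc.sizeclass(), sc.noscan() // 5, true
}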
// arenaIndex returns the index into mheap_.arenas of the arena
// containing metadata for p. This index combines an index into the
// L1 map and an index into the L2 map and should be used as
// mheap_.arenas[ai.l1()][ai.l2()].
//
// If p is outside the range of valid heap addresses, either l1() or
// l2() will be out of bounds.
//
// It is nosplit because it's called by spanOf and several other
// nosplit functions.
//
//go:nosplit
func arenaIndex(p uintptr) arenaIdx {
return arenaIdx((p - arenaBaseOffset) / heapArenaBytes)
}
// arenaBase returns the low address of the region covered by heap
// arena i.
func arenaBase(i arenaIdx) uintptr {
return uintptr(i)*heapArenaBytes + arenaBaseOffset
}
type arenaIdx uint
// l1 returns the "l1" portion of an arenaIdx.
//
// Marked nosplit because it's called by spanOf and other nosplit
// functions.
//
//go:nosplit
func (i arenaIdx) l1() uint {
if arenaL1Bits == 0 {
// Let the compiler optimize this away if there's no
// L1 map.
return 0
} else {
return uint(i) >> arenaL1Shift
}
}
// l2 returns the "l2" portion of an arenaIdx.
//
// Marked nosplit because it's called by spanOf and other nosplit
// functions.
//
//go:nosplit
func (i arenaIdx) l2() uint {
if arenaL1Bits == 0 {
return uint(i)
} else {
return uint(i) & (1<<arenaL2Bits - 1)
}
}
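// An illustrative sketch, not part of the runtime: arenaBase(arenaIndex(p))
// rounds p down to the start of the heap arena containing p, and l1/l2 split
// that index for the two-level mheap_.arenas map. arenaLookupExample is an
// invented name for this example.
func arenaLookupExample(p uintptr) (l1, l2 uint, base uintptr) {
	ai := arenaIndex(p)
	return ai.l1(), ai.l2(), arenaBase(ai)
}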
// inheap reports whether b is a pointer into a (potentially dead) heap object.
// It returns false for pointers into mSpanManual spans.
// Non-preemptible because it is used by write barriers.
//
//go:nowritebarrier
//go:nosplit
func inheap(b uintptr) bool {
return spanOfHeap(b) != nil
}
// inHeapOrStack is a variant of inheap that returns true for pointers
// into any allocated heap span.
//
//go:nowritebarrier
//go:nosplit
func inHeapOrStack(b uintptr) bool {
s := spanOf(b)
if s == nil || b < s.base() {
return false
}
switch s.state.get() {
case mSpanInUse, mSpanManual:
return b < s.limit
default:
return false
}
}
// spanOf returns the span of p. If p does not point into the heap
// arena or no span has ever contained p, spanOf returns nil.
//
// If p does not point to allocated memory, this may return a non-nil
// span that does *not* contain p. If this is a possibility, the
// caller should either call spanOfHeap or check the span bounds
// explicitly.
//
// Must be nosplit because it has callers that are nosplit.
//
//go:nosplit
func spanOf(p uintptr) *mspan {
// This function looks big, but we use a lot of constant
// folding around arenaL1Bits to get it under the inlining
// budget. Also, many of the checks here are safety checks
// that Go needs to do anyway, so the generated code is quite
// short.
ri := arenaIndex(p)
if arenaL1Bits == 0 {
// If there's no L1, then ri.l1() can't be out of bounds but ri.l2() can.
if ri.l2() >= uint(len(mheap_.arenas[0])) {
return nil
}
} else {
// If there's an L1, then ri.l1() can be out of bounds but ri.l2() can't.
if ri.l1() >= uint(len(mheap_.arenas)) {
return nil
}
}
l2 := mheap_.arenas[ri.l1()]
if arenaL1Bits != 0 && l2 == nil { // Should never happen if there's no L1.
return nil
}
ha := l2[ri.l2()]
if ha == nil {
return nil
}
return ha.spans[(p/pageSize)%pagesPerArena]
}
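// The sketch below is illustrative only and not part of the runtime proper.
// spanContainsSketch is a hypothetical helper demonstrating the caller
// pattern required by the spanOf doc comment above: because spanOf may
// return a non-nil span that does not actually contain p, a caller that
// does not go through spanOfHeap must check the span bounds explicitly.
func spanContainsSketch(p uintptr) bool {
	s := spanOf(p)
	if s == nil {
		return false
	}
	// Explicit bounds check, as the spanOf doc comment requires.
	return p >= s.base() && p < s.limit
}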
// spanOfUnchecked is equivalent to spanOf, but the caller must ensure
// that p points into an allocated heap arena.
//
// Must be nosplit because it has callers that are nosplit.
//
//go:nosplit
func spanOfUnchecked(p uintptr) *mspan {
ai := arenaIndex(p)
return mheap_.arenas[ai.l1()][ai.l2()].spans[(p/pageSize)%pagesPerArena]
}
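// The sketch below is illustrative only and not part of the runtime proper.
// spanOfKnownArenaSketch is a hypothetical helper showing when
// spanOfUnchecked may be used in place of spanOf: only once the caller has
// already established that p points into an allocated heap arena, here by
// consulting heapArenaOf (defined later in this file).
func spanOfKnownArenaSketch(p uintptr) *mspan {
	if heapArenaOf(p) == nil {
		return nil
	}
	// Safe: p is known to lie within an allocated heap arena.
	return spanOfUnchecked(p)
}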
// spanOfHeap is like spanOf, but returns nil if p does not point to a
// heap object.
//
// Must be nosplit because it has callers that are nosplit.
//
//go:nosplit
func spanOfHeap(p uintptr) *mspan {
s := spanOf(p)
// s is nil if it's never been allocated. Otherwise, we check
// its state first because we don't trust this pointer, so we
// have to synchronize with span initialization. Then, it's
// still possible we picked up a stale span pointer, so we
// have to check the span's bounds.
if s == nil || s.state.get() != mSpanInUse || p < s.base() || p >= s.limit {
return nil
}
return s
}
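// The sketch below is illustrative only and not part of the runtime proper.
// isHeapPointerSketch is a hypothetical helper showing how spanOfHeap is
// meant to be consumed when a pointer is untrusted (for example, a guessed
// pointer during conservative scanning): spanOfHeap already synchronizes
// with span initialization and performs the bounds check, so a non-nil
// result can be trusted.
func isHeapPointerSketch(p uintptr) bool {
	return spanOfHeap(p) != nil
}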
// pageIndexOf returns the arena, page index, and page mask for pointer p.
// The caller must ensure p is in the heap.
func pageIndexOf(p uintptr) (arena *heapArena, pageIdx uintptr, pageMask uint8) {
ai := arenaIndex(p)
arena = mheap_.arenas[ai.l1()][ai.l2()]
pageIdx = ((p / pageSize) / 8) % uintptr(len(arena.pageInUse))
pageMask = byte(1 << ((p / pageSize) % 8))
return
}
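// The sketch below is illustrative only and not part of the runtime proper.
// pageIsInUseSketch is a hypothetical helper showing how the (arena,
// pageIdx, pageMask) triple returned by pageIndexOf is combined with a
// per-arena page bitmap such as pageInUse. As with pageIndexOf itself, the
// caller must ensure p is in the heap; concurrent readers would additionally
// need to load the bitmap byte atomically.
func pageIsInUseSketch(p uintptr) bool {
	arena, pageIdx, pageMask := pageIndexOf(p)
	return arena.pageInUse[pageIdx]&pageMask != 0
}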
// heapArenaOf returns the heap arena for p, if one exists.
func heapArenaOf(p uintptr) *heapArena {
ri := arenaIndex(p)
if arenaL1Bits == 0 {
// If there's no L1, then ri.l1() can't be out of bounds but ri.l2() can.
if ri.l2() >= uint(len(mheap_.arenas[0])) {
return nil
}
} else {
// If there's an L1, then ri.l1() can be out of bounds but ri.l2() can't.
if ri.l1() >= uint(len(mheap_.arenas)) {
return nil
}
}
l2 := mheap_.arenas[ri.l1()]
if arenaL1Bits != 0 && l2 == nil { // Should never happen if there's no L1.
return nil
}
return l2[ri.l2()]
}
// Initialize the heap.
func (h *mheap) init() {
lockInit(&h.lock, lockRankMheap)
lockInit(&h.speciallock, lockRankMheapSpecial)
h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
h.spanSPMCAlloc.init(unsafe.Sizeof(spanSPMC{}), nil, nil, &memstats.gcMiscSys)
h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
h.specialCleanupAlloc.init(unsafe.Sizeof(specialCleanup{}), nil, nil, &memstats.other_sys)
h.specialCheckFinalizerAlloc.init(unsafe.Sizeof(specialCheckFinalizer{}), nil, nil, &memstats.other_sys)
h.specialTinyBlockAlloc.init(unsafe.Sizeof(specialTinyBlock{}), nil, nil, &memstats.other_sys)
h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)
h.specialReachableAlloc.init(unsafe.Sizeof(specialReachable{}), nil, nil, &memstats.other_sys)
h.specialPinCounterAlloc.init(unsafe.Sizeof(specialPinCounter{}), nil, nil, &memstats.other_sys)
h.specialWeakHandleAlloc.init(unsafe.Sizeof(specialWeakHandle{}), nil, nil, &memstats.gcMiscSys)
h.specialBubbleAlloc.init(unsafe.Sizeof(specialBubble{}), nil, nil, &memstats.other_sys)
h.arenaHintAlloc.init(unsafe.Sizeof(arenaHint{}), nil, nil, &memstats.other_sys)
// Don't zero mspan allocations. Background sweeping can
// inspect a span concurrently with allocating it, so it's
// important that the span's sweepgen survive across freeing
// and re-allocating a span to prevent background sweeping
// from improperly cas'ing it from 0.
//
// This is safe because mspan contains no heap pointers.
h.spanalloc.zero = false
// h->mapcache needs no init
for i := range h.central {
h.central[i].mcentral.init(spanClass(i))
}
h.pages.init(&h.lock, &memstats.gcMiscSys, false)
xRegInitAlloc()
}
// reclaim sweeps and reclaims at least npage pages into the heap.
// It is called before allocating npage pages to keep growth in check.
//
// reclaim implements the page-reclaimer half of the sweeper.
//
// h.lock must NOT be held.
func (h *mheap) reclaim(npage uintptr) {
// TODO(austin): Half of the time spent freeing spans is in
// locking/unlocking the heap (even with low contention). We
// could make the slow path here several times faster by
// batching heap frees.
// Bail early if there's no more reclaim work.
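	// (reclaimIndex is the next page index to reclaim; once every arena
	// has been visited it is parked at the 1<<63 sentinel stored below,
	// so any value at or above that means the whole heap has already
	// been scanned and there is nothing left to reclaim.)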
if h.reclaimIndex.Load() >= 1<<63 {
return
}
// Disable preemption so the GC can't start while we're
// sweeping, so we can read h.sweepArenas, and so
// traceGCSweepStart/Done pair on the P.
mp := acquirem()
trace := traceAcquire()
if trace.ok() {
trace.GCSweepStart()
traceRelease(trace)
}
arenas := h.sweepArenas
locked := false
for npage > 0 {
// Pull from accumulated credit first.
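		// (reclaimCredit holds pages that earlier reclaim calls freed
		// beyond their own need; see the Add further down. Spending
		// credit here avoids sweeping more spans than necessary.)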
if credit := h.reclaimCredit.Load(); credit > 0 {
take := credit
if take > npage {
// Take only what we need.
take = npage
}
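			// The CompareAndSwap below can lose a race with another
			// reclaimer. If it does, no credit is taken here and the
			// loop simply re-reads the credit on the next iteration.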
if h.reclaimCredit.CompareAndSwap(credit, credit-take) {
npage -= take
}
continue
}
// Claim a chunk of work.
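		// (reclaimIndex.Add atomically advances the shared cursor by
		// pagesPerReclaimerChunk pages; subtracting the chunk size back
		// out yields the first page index of the chunk claimed here.)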
idx := uintptr(h.reclaimIndex.Add(pagesPerReclaimerChunk) - pagesPerReclaimerChunk)
if idx/pagesPerArena >= uintptr(len(arenas)) {
// Page reclaiming is done.
h.reclaimIndex.Store(1 << 63)
break
}
if !locked {
// Lock the heap for reclaimChunk.
lock(&h.lock)
locked = true
}
// Scan this chunk.
nfound := h.reclaimChunk(arenas, idx, pagesPerReclaimerChunk)
if nfound <= npage {
npage -= nfound
} else {
// Put spare pages toward global credit.
h.reclaimCredit.Add(nfound - npage)
npage = 0
}
}
if locked {
unlock(&h.lock)
}
trace = traceAcquire()
if trace.ok() {
trace.GCSweepDone()
traceRelease(trace)
}
releasem(mp)
}
// reclaimChunk sweeps unmarked spans that start at page indexes [pageIdx, pageIdx+n).
// It returns the number of pages returned to the heap.
//
// h.lock must be held and the caller must be non-preemptible. Note: h.lock may be
// temporarily unlocked and re-locked in order to do sweeping or if tracing is
// enabled.
func (h *mheap) reclaimChunk(arenas []arenaIdx, pageIdx, n uintptr) uintptr {
// The heap lock must be held because this accesses the
// heapArena.spans arrays using potentially non-live pointers.
// In particular, if a span were freed and merged concurrently
// with this probing heapArena.spans, it would be possible to
// observe arbitrary, stale span pointers.
assertLockHeld(&h.lock)
n0 := n
var nFreed uintptr
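	// Register as an active sweeper so sweep termination waits for this
	// work. An invalid locker means sweeping for this cycle has already
	// completed, so there is nothing left to reclaim.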
sl := sweep.active.begin()
if !sl.valid {
return 0
}
for n > 0 {
ai := arenas[pageIdx/pagesPerArena]
ha := h.arenas[ai.l1()][ai.l2()]
// Get a chunk of the bitmap to work on.
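		// (pageInUse and pageMarks store one bit per page, eight pages
		// per byte. Only the bit for a span's first page is used, which
		// is how spans that start in this chunk are located.)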
arenaPage := uint(pageIdx % pagesPerArena)
inUse := ha.pageInUse[arenaPage/8:]
marked := ha.pageMarks[arenaPage/8:]
if uintptr(len(inUse)) > n/8 {
inUse = inUse[:n/8]
marked = marked[:n/8]
}
// Scan this bitmap chunk for spans that are in-use
// but have no marked objects on them.
for i := range inUse {
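			// &^ is AND NOT: each set bit here is the first page of an
			// in-use span with no marked objects, i.e. a span that can
			// be swept and potentially freed.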
inUseUnmarked := atomic.Load8(&inUse[i]) &^ marked[i]
if inUseUnmarked == 0 {
continue
}
for j := uint(0); j < 8; j++ {
if inUseUnmarked&(1<<j) != 0 {
s := ha.spans[arenaPage+uint(i)*8+j]
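					// tryAcquire takes sweep ownership of the span. It
					// fails if the span was already swept, or is being
					// swept by someone else, in which case it is skipped.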
if s, ok := sl.tryAcquire(s); ok {
npages := s.npages
unlock(&h.lock)
if s.sweep(false) {
nFreed += npages
}
lock(&h.lock)
// Reload inUse. It's possible nearby
// spans were freed when we dropped the
// lock and we don't want to get stale
// pointers from the spans array.
inUseUnmarked = atomic.Load8(&inUse[i]) &^ marked[i]
}
}
}
}
// Advance.
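		// Each bitmap byte covers 8 pages, so advance by len(inUse)*8
		// pages regardless of how many spans were actually freed.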
pageIdx += uintptr(len(inUse) * 8)
n -= uintptr(len(inUse) * 8)
}
sweep.active.end(sl)
trace := traceAcquire()
if trace.ok() {
unlock(&h.lock)
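		// Drop the heap lock while emitting the trace event, then retake
		// it below; the caller expects h.lock to still be held on return.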
// Account for pages scanned but not reclaimed.
trace.GCSweepSpan((n0 - nFreed) * pageSize)
traceRelease(trace)
lock(&h.lock)
}
assertLockHeld(&h.lock) // Must be locked on return.
return nFreed
}
// spanAllocType represents the type of allocation to make, or
// the type of allocation to be freed.
type spanAllocType uint8
const (
spanAllocHeap spanAllocType = iota // heap span
spanAllocStack // stack span
spanAllocWorkBuf // work buf span
)
// manual returns true if the span allocation is manually managed.
func (s spanAllocType) manual() bool {
return s != spanAllocHeap
}
// alloc allocates a new span of npage pages from the GC'd heap.
//
// spanclass indicates the span's size class and scannability.
//
// Returns a span that has been fully initialized. span.needzero indicates
// whether the span has been zeroed. Note that it may not be.
func (h *mheap) alloc(npages uintptr, spanclass spanClass) *mspan {
// Don't do any operations that lock the heap on the G stack.
// It might trigger stack growth, and the stack growth code needs
// to be able to allocate heap.
var s *mspan
systemstack(func() {
// To prevent excessive heap growth, before allocating n pages
// we need to sweep and reclaim at least n pages.
if !isSweepDone() {
h.reclaim(npages)
}
s = h.allocSpan(npages, spanAllocHeap, spanclass)
})
return s
}
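
// Illustrative sketch (not part of this file): a large-object allocation path
// might reach mheap.alloc roughly like below. "size" and "noscan" are
// placeholder inputs, the overflow checks of the real code are omitted, and
// size class 0 is the conventional class for large objects.
//
//	npages := (size + pageSize - 1) / pageSize // round up to whole pages
//	spc := makeSpanClass(0, noscan)
//	s := mheap_.alloc(npages, spc)
//	if s == nil {
//		throw("out of memory")
//	}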
// allocManual allocates a manually-managed span of npage pages.
// allocManual returns nil if allocation fails.
//
// allocManual adds the bytes used to *stat, which should be a
// memstats in-use field. Unlike allocations in the GC'd heap, the
// allocation does *not* count toward heapInUse.
//
// The memory backing the returned span may not be zeroed if
// span.needzero is set.
//
// allocManual must be called on the system stack because it may
// acquire the heap lock via allocSpan. See mheap for details.
//
// If new code is written to call allocManual, do NOT use an
// existing spanAllocType value and instead declare a new one.
//
//go:systemstack
func (h *mheap) allocManual(npages uintptr, typ spanAllocType) *mspan {
if !typ.manual() {
throw("manual span allocation called with non-manually-managed type")
}
return h.allocSpan(npages, typ, 0)
}
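
// Illustrative sketch (not part of this file): a hypothetical new caller of
// allocManual would declare its own spanAllocType constant (spanAllocExample
// below is made up) and call from the system stack, doing its own accounting:
//
//	// appended to the spanAllocType const block:
//	//	spanAllocExample // hypothetical manually-managed span type
//
//	systemstack(func() {
//		s := mheap_.allocManual(npages, spanAllocExample)
//		if s == nil {
//			throw("out of memory")
//		}
//		// Caller-side accounting here and, eventually, a matching freeManual.
//	})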
// setSpans modifies the span map so [spanOf(base), spanOf(base+npage*pageSize))
// is s.
func (h *mheap) setSpans(base, npage uintptr, s *mspan) {
p := base / pageSize
ai := arenaIndex(base)
ha := h.arenas[ai.l1()][ai.l2()]
for n := uintptr(0); n < npage; n++ {
i := (p + n) % pagesPerArena
if i == 0 {
ai = arenaIndex(base + n*pageSize)
ha = h.arenas[ai.l1()][ai.l2()]
}
ha.spans[i] = s
}
}
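
// Worked example (numbers are illustrative, assuming 8 KiB pages and 64 MiB
// heap arenas, so pagesPerArena == 8192): a 3-page span whose base is the last
// page of an arena visits page indexes
//
//	(p+n) % pagesPerArena == 8191, 0, 1
//
// across the loop iterations. The i == 0 case reloads ai and ha, so the last
// two entries land in the next arena's spans array.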
// allocNeedsZero checks if the region of address space [base, base+npage*pageSize),
// assumed to be allocated, needs to be zeroed, updating heap arena metadata for
// future allocations.
//
// This must be called each time pages are allocated from the heap, even if the page
// allocator can otherwise prove the memory it's allocating is already zero because
// it's fresh from the operating system. It updates heapArena metadata that is
// critical for future page allocations.
//
// There are no locking constraints on this method.
func (h *mheap) allocNeedsZero(base, npage uintptr) (needZero bool) {
for npage > 0 {
ai := arenaIndex(base)
ha := h.arenas[ai.l1()][ai.l2()]
zeroedBase := atomic.Loaduintptr(&ha.zeroedBase)
arenaBase := base % heapArenaBytes
if arenaBase < zeroedBase {
// We extended into the non-zeroed part of the
// arena, so this region needs to be zeroed before use.
//
// zeroedBase is monotonically increasing, so if we see this now then
// we can be sure we need to zero this memory region.
//
// We still need to update zeroedBase for this arena, and
// potentially more arenas.
needZero = true
}
// We may observe arenaBase > zeroedBase if we're racing with one or more
// allocations which are acquiring memory directly before us in the address
// space. But, because we know no one else is acquiring *this* memory, it's
// still safe to not zero.
// Compute how far into the arena we extend, capped
// at heapArenaBytes.
arenaLimit := arenaBase + npage*pageSize
if arenaLimit > heapArenaBytes {
arenaLimit = heapArenaBytes
}
// Increase ha.zeroedBase so it's >= arenaLimit.
// We may be racing with other updates.
for arenaLimit > zeroedBase {
if atomic.Casuintptr(&ha.zeroedBase, zeroedBase, arenaLimit) {
break
}
zeroedBase = atomic.Loaduintptr(&ha.zeroedBase)
// Double check basic conditions of zeroedBase.
if zeroedBase <= arenaLimit && zeroedBase > arenaBase {
// The zeroedBase moved into the space we were trying to
// claim. That's very bad, and indicates someone allocated
// the same region we did.
throw("potentially overlapping in-use allocations detected")
}
}
// Move base forward and subtract from npage to move into
// the next arena, or finish.
base += arenaLimit - arenaBase
npage -= (arenaLimit - arenaBase) / pageSize
}
return
}
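
// The zeroedBase update above is an instance of a "monotonically advance a
// watermark with CAS" idiom. A minimal standalone sketch (hypothetical names,
// not runtime API):
//
//	func advance(mark *uintptr, limit uintptr) {
//		old := atomic.Loaduintptr(mark)
//		for limit > old {
//			if atomic.Casuintptr(mark, old, limit) {
//				return
//			}
//			old = atomic.Loaduintptr(mark)
//		}
//	}
//
// Losing the race is fine as long as the winner advanced the mark at least as
// far. allocNeedsZero additionally throws if a concurrent update moved
// zeroedBase into (arenaBase, arenaLimit], since only an overlapping
// allocation of the same region could do that.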
// tryAllocMSpan attempts to allocate an mspan object from
// the P-local cache, but may fail.
//
// h.lock need not be held.
//
// The caller must ensure that its P won't change underneath
// it during this function. Currently we enforce this by requiring
// that the function run on the system stack, because that's
// the only place it is used now. In the future, this requirement
// may be relaxed if its use is necessary elsewhere.
//
//go:systemstack
func (h *mheap) tryAllocMSpan() *mspan {
pp := getg().m.p.ptr()
// If we don't have a p or the cache is empty, we can't do
// anything here.
if pp == nil || pp.mspancache.len == 0 {
return nil
}
// Pull off the last entry in the cache.
s := pp.mspancache.buf[pp.mspancache.len-1]
pp.mspancache.len--
return s
}
// allocMSpanLocked allocates an mspan object.
//
// h.lock must be held.
//
// allocMSpanLocked must be called on the system stack because
// its caller holds the heap lock. See mheap for details.
// Running on the system stack also ensures that we won't
// switch Ps during this function. See tryAllocMSpan for details.
//
//go:systemstack
func (h *mheap) allocMSpanLocked() *mspan {
assertLockHeld(&h.lock)
pp := getg().m.p.ptr()
if pp == nil {
// We don't have a p so just do the normal thing.
return (*mspan)(h.spanalloc.alloc())
}
// Refill the cache if necessary.
if pp.mspancache.len == 0 {
const refillCount = len(pp.mspancache.buf) / 2
for i := 0; i < refillCount; i++ {
pp.mspancache.buf[i] = (*mspan)(h.spanalloc.alloc())
}
pp.mspancache.len = refillCount
}
// Pull off the last entry in the cache.
s := pp.mspancache.buf[pp.mspancache.len-1]
pp.mspancache.len--
return s
}
// freeMSpanLocked frees an mspan object.
//
// h.lock must be held.
//
// freeMSpanLocked must be called on the system stack because
// its caller holds the heap lock. See mheap for details.
// Running on the system stack also ensures that we won't
// switch Ps during this function. See tryAllocMSpan for details.
//
//go:systemstack
func (h *mheap) freeMSpanLocked(s *mspan) {
assertLockHeld(&h.lock)
pp := getg().m.p.ptr()
// First try to free the mspan directly to the cache.
if pp != nil && pp.mspancache.len < len(pp.mspancache.buf) {
pp.mspancache.buf[pp.mspancache.len] = s
pp.mspancache.len++
return
}
// Failing that (or if we don't have a p), just free it to
// the heap.
h.spanalloc.free(unsafe.Pointer(s))
}
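
// Taken together, tryAllocMSpan, allocMSpanLocked, and freeMSpanLocked form a
// P-local free list in front of h.spanalloc. A rough outline of the calling
// pattern, all on the system stack (allocSpan below is the real user; this is
// only a simplified sketch):
//
//	s := h.tryAllocMSpan() // lock-free fast path, may return nil
//	if s == nil {
//		lock(&h.lock)
//		s = h.allocMSpanLocked() // refills the P-local cache from h.spanalloc
//		unlock(&h.lock)
//	}
//	// ... initialize and use s ...
//	lock(&h.lock)
//	h.freeMSpanLocked(s) // prefers the P-local cache, else h.spanalloc
//	unlock(&h.lock)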
// allocSpan allocates an mspan which owns npages worth of memory.
//
// If typ.manual() == false, allocSpan allocates a heap span of class spanclass
// and updates heap accounting. If typ.manual() == true, allocSpan allocates a
// manually-managed span (spanclass is ignored), and the caller is
// responsible for any accounting related to its use of the span. Either
// way, allocSpan will atomically add the bytes in the newly allocated
// span to *sysStat.
//
// The returned span is fully initialized.
//
// h.lock must not be held.
//
// allocSpan must be called on the system stack both because it acquires
// the heap lock and because it must block GC transitions.
//
//go:systemstack
func (h *mheap) allocSpan(npages uintptr, typ spanAllocType, spanclass spanClass) (s *mspan) {
// Function-global state.
gp := getg()
base, scav := uintptr(0), uintptr(0)
growth := uintptr(0)
// On some platforms we need to provide physical page aligned stack
// allocations. Where the page size is less than the physical page
// size, we already manage to do this by default.
needPhysPageAlign := physPageAlignedStacks && typ == spanAllocStack && pageSize < physPageSize
// If the allocation is small enough, try the page cache!
// The page cache does not support aligned allocations, so we cannot use
// it if we need to provide a physical page aligned stack allocation.
pp := gp.m.p.ptr()
if !needPhysPageAlign && pp != nil && npages < pageCachePages/4 {
c := &pp.pcache
// If the cache is empty, refill it.
if c.empty() {
lock(&h.lock)
*c = h.pages.allocToCache()
unlock(&h.lock)
}
// Try to allocate from the cache.
base, scav = c.alloc(npages)
if base != 0 {
s = h.tryAllocMSpan()
if s != nil {
goto HaveSpan
}
// We have a base but no mspan, so we need
// to lock the heap.
}
}
// For one reason or another, we couldn't get the
// whole job done without the heap lock.
lock(&h.lock)
if needPhysPageAlign {
// Overallocate by a physical page to allow for later alignment.
extraPages := physPageSize / pageSize
// Find a big enough region first, but then only allocate the
// aligned portion. We can't just allocate and then free the
// edges because we need to account for scavenged memory, and
// that's difficult with alloc.
//
// Note that we skip updates to searchAddr here. It's OK if
// it's stale and higher than normal; it'll operate correctly,
// just come with a performance cost.
base, _ = h.pages.find(npages + extraPages)
if base == 0 {
var ok bool
growth, ok = h.grow(npages + extraPages)
if !ok {
unlock(&h.lock)
return nil
}
base, _ = h.pages.find(npages + extraPages)
if base == 0 {
throw("grew heap, but no adequate free space found")
}
}
base = alignUp(base, physPageSize)
scav = h.pages.allocRange(base, npages)
}
if base == 0 {
// Try to acquire a base address.
base, scav = h.pages.alloc(npages)
if base == 0 {
var ok bool
growth, ok = h.grow(npages)
if !ok {
unlock(&h.lock)
return nil
}
base, scav = h.pages.alloc(npages)
if base == 0 {
throw("grew heap, but no adequate free space found")
}
}
}
if s == nil {
// We failed to get an mspan earlier, so grab
// one now that we have the heap lock.
s = h.allocMSpanLocked()
}
unlock(&h.lock)
HaveSpan:
// Decide if we need to scavenge in response to what we just allocated.
// Specifically, we track the maximum amount of memory to scavenge of all
// the alternatives below, assuming that the maximum satisfies *all*
// conditions we check (e.g. if we need to scavenge X to satisfy the
// memory limit and Y to satisfy heap-growth scavenging, and Y > X, then
// it's fine to pick Y, because the memory limit is still satisfied).
//
// It's fine to do this after allocating because we expect any scavenged
// pages not to get touched until we return. Simultaneously, it's important
// to do this before calling sysUsed because that may commit address space.
bytesToScavenge := uintptr(0)
forceScavenge := false
if limit := gcController.memoryLimit.Load(); !gcCPULimiter.limiting() {
// Assist with scavenging to maintain the memory limit by the amount
// that we expect to page in.
inuse := gcController.mappedReady.Load()
// Be careful about overflow, especially with uintptrs. Even on 32-bit platforms
// someone can set a really big memory limit that isn't math.MaxInt64.
if uint64(scav)+inuse > uint64(limit) {
bytesToScavenge = uintptr(uint64(scav) + inuse - uint64(limit))
forceScavenge = true
}
}
if goal := scavenge.gcPercentGoal.Load(); goal != ^uint64(0) && growth > 0 {
// We just caused a heap growth, so scavenge down what will soon be used.
// By scavenging inline we deal with the failure to allocate out of
// memory fragments by scavenging the memory fragments that are least
// likely to be re-used.
//
// Only bother with this because we're not using a memory limit. We don't
// care about heap growths as long as we're under the memory limit, and the
// previous check for scavenging already handles that.
if retained := heapRetained(); retained+uint64(growth) > goal {
// The scavenging algorithm requires the heap lock to be dropped so it
// can acquire it only sparingly. This is a potentially expensive operation
// so it frees up other goroutines to allocate in the meanwhile. In fact,
// they can make use of the growth we just created.
todo := growth
if overage := uintptr(retained + uint64(growth) - goal); todo > overage {
todo = overage
}
if todo > bytesToScavenge {
bytesToScavenge = todo
}
}
}
// There are a few very limited circumstances where we won't have a P here.
// It's OK to simply skip scavenging in these cases. Something else will notice
// and pick up the tab.
var now int64
if pp != nil && bytesToScavenge > 0 {
// Measure how long we spent scavenging and add that measurement to the assist
// time so we can track it for the GC CPU limiter.
//
// Limiter event tracking might be disabled if we end up here
// while on a mark worker.
start := nanotime()
track := pp.limiterEvent.start(limiterEventScavengeAssist, start)
// Scavenge, but back out if the limiter turns on.
released := h.pages.scavenge(bytesToScavenge, func() bool {
return gcCPULimiter.limiting()
}, forceScavenge)
mheap_.pages.scav.releasedEager.Add(released)
// Finish up accounting.
now = nanotime()
if track {
pp.limiterEvent.stop(limiterEventScavengeAssist, now)
}
scavenge.assistTime.Add(now - start)
}
// Initialize the span.
h.initSpan(s, typ, spanclass, base, npages)
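	// When the runtime is built with the valgrind build tag, register the
	// new span with Valgrind as a block carved out of the owning arena's
	// mempool. Individual objects within the span are reported to Valgrind
	// separately when they are allocated.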
if valgrindenabled {
valgrindMempoolMalloc(unsafe.Pointer(arenaBase(arenaIndex(base))), unsafe.Pointer(base), npages*pageSize)
}
// Commit and account for any scavenged memory that the span now owns.
nbytes := npages * pageSize
if scav != 0 {
// sysUsed all the pages that are actually available
// in the span since some of them might be scavenged.
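		// scav counts the bytes of this span that were previously released
		// to the OS; committing them here moves that amount out of
		// heapReleased, and the remaining (nbytes - scav) is moved out of
		// heapFree just below.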
sysUsed(unsafe.Pointer(base), nbytes, scav)
gcController.heapReleased.add(-int64(scav))
}
// Update stats.
gcController.heapFree.add(-int64(nbytes - scav))
if typ == spanAllocHeap {
gcController.heapInUse.add(int64(nbytes))
}
// Update consistent stats.
stats := memstats.heapStats.acquire()
atomic.Xaddint64(&stats.committed, int64(scav))
atomic.Xaddint64(&stats.released, -int64(scav))
switch typ {
case spanAllocHeap:
atomic.Xaddint64(&stats.inHeap, int64(nbytes))
case spanAllocStack:
atomic.Xaddint64(&stats.inStacks, int64(nbytes))
case spanAllocWorkBuf:
atomic.Xaddint64(&stats.inWorkBufs, int64(nbytes))
}
memstats.heapStats.release()
// Trace the span alloc.
if traceAllocFreeEnabled() {
trace := traceAcquire()
if trace.ok() {
trace.SpanAlloc(s)
traceRelease(trace)
}
}
return s
}
// initSpan initializes a blank span s which will represent the range
// [base, base+npages*pageSize). typ is the type of span being allocated.
func (h *mheap) initSpan(s *mspan, typ spanAllocType, spanclass spanClass, base, npages uintptr) {
// At this point, both s != nil and base != 0, and the heap
// lock is no longer held. Initialize the span.
s.init(base, npages)
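	// allocNeedsZero reports whether this address range may still contain
	// stale data from a previous use. If so, needzero tells the allocator
	// to zero memory before handing out objects from this span.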
if h.allocNeedsZero(base, npages) {
s.needzero = 1
}
nbytes := npages * pageSize
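	// Manually-managed spans (stacks, GC work buffers, and similar
	// runtime-internal memory) are not swept by the GC; they track free
	// objects with manualFreeList rather than the alloc bits used below
	// for heap spans.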
if typ.manual() {
s.manualFreeList = 0
s.nelems = 0
s.state.set(mSpanManual)
} else {
// We must set span properties before the span is published anywhere
// since we're not holding the heap lock.
s.spanclass = spanclass
if sizeclass := spanclass.sizeclass(); sizeclass == 0 {
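			// Size class 0 means a large object with a dedicated span:
			// the whole span is a single element, so no division magic
			// is needed.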
s.elemsize = nbytes
s.nelems = 1
s.divMul = 0
} else {
s.elemsize = uintptr(gc.SizeClassToSize[sizeclass])
if goexperiment.GreenTeaGC {
var reserve uintptr
if gcUsesSpanInlineMarkBits(s.elemsize) {
// Reserve space for the inline mark bits.
reserve += unsafe.Sizeof(spanInlineMarkBits{})
}
if heapBitsInSpan(s.elemsize) && !s.spanclass.noscan() {
// Reserve space for the pointer/scan bitmap at the end.
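					// With 8-byte pointers this is one bit per
					// pointer-sized word (nbytes/8/8); for example, an
					// 8 KiB span reserves 8192/8/8 = 128 bytes here.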
reserve += nbytes / goarch.PtrSize / 8
}
s.nelems = uint16((nbytes - reserve) / s.elemsize)
} else {
if !s.spanclass.noscan() && heapBitsInSpan(s.elemsize) {
// Reserve space for the pointer/scan bitmap at the end.
s.nelems = uint16((nbytes - (nbytes / goarch.PtrSize / 8)) / s.elemsize)
} else {
s.nelems = uint16(nbytes / s.elemsize)
}
}
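			// divMul is a precomputed magic multiplier for this size class
			// that lets the runtime turn division by elemsize into a
			// multiply-and-shift when computing object indices.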
s.divMul = gc.SizeClassToDivMagic[sizeclass]
}
// Initialize mark and allocation structures.
s.freeindex = 0
s.freeIndexForScan = 0
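		// allocCache caches a 64-bit window of the inverted allocBits
		// starting at freeindex, so the allocator can find a free slot
		// with a single bit scan.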
s.allocCache = ^uint64(0) // all 1s indicating all free.
s.gcmarkBits = newMarkBits(uintptr(s.nelems))
s.allocBits = newAllocBits(uintptr(s.nelems))
// Adjust s.limit down to the object-containing part of the span.
s.limit = s.base() + s.elemsize*uintptr(s.nelems)
// It's safe to access h.sweepgen without the heap lock because it's
// only ever updated with the world stopped and we run on the
// systemstack which blocks a STW transition.
atomic.Store(&s.sweepgen, h.sweepgen)
// Now that the span is filled in, set its state. This
// is a publication barrier for the other fields in
// the span. While valid pointers into this span
// should never be visible until the span is returned,
// if the garbage collector finds an invalid pointer,
// access to the span may race with initialization of
// the span. We resolve this race by atomically
// setting the state after the span is fully
// initialized, and atomically checking the state in
// any situation where a pointer is suspect.
s.state.set(mSpanInUse)
}
// Publish the span in various locations.
// This is safe to call without the lock held because the slots
// related to this span will only ever be read or modified by
// this thread until pointers into the span are published (and
// we execute a publication barrier at the end of this function
// before that happens) or pageInUse is updated.
h.setSpans(s.base(), npages, s)
if !typ.manual() {
// Mark in-use span in arena page bitmap.
//
// This publishes the span to the page sweeper, so
// it's imperative that the span be completely initialized
// prior to this line.
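		// Only the bit for the span's first page is used: pageIndexOf
		// maps s.base() to the owning heapArena along with the byte index
		// and one-bit mask for that page in its pageInUse bitmap.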
arena, pageIdx, pageMask := pageIndexOf(s.base())
atomic.Or8(&arena.pageInUse[pageIdx], pageMask)
// Mark packed span.
if gcUsesSpanInlineMarkBits(s.elemsize) {
atomic.Or8(&arena.pageUseSpanInlineMarkBits[pageIdx], pageMask)
}
// Update related page sweeper stats.
h.pagesInUse.Add(npages)
}
// Make sure the newly allocated span will be observed
// by the GC before pointers into the span are published.
publicationBarrier()
}
// Try to add at least npage pages of memory to the heap,
// returning how much the heap grew by and whether it worked.
//
// h.lock must be held.
func (h *mheap) grow(npage uintptr) (uintptr, bool) {
assertLockHeld(&h.lock)
firstGrow := h.curArena.base == 0
// We must grow the heap in whole palloc chunks.
// We call sysMap below but note that because we
// round up to pallocChunkPages which is on the order
// of MiB (generally >= to the huge page size) we
// won't be calling it too much.
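	// (With the default 8 KiB page size a palloc chunk covers 4 MiB, so the
	// request below is rounded up to whole 4 MiB units.)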
ask := alignUp(npage, pallocChunkPages) * pageSize
totalGrowth := uintptr(0)
// This may overflow because ask could be very large
// and is otherwise unrelated to h.curArena.base.
end := h.curArena.base + ask
nBase := alignUp(end, physPageSize)
if nBase > h.curArena.end || /* overflow */ end < h.curArena.base {
// Not enough room in the current arena. Allocate more
// arena space. This may not be contiguous with the
// current arena, so we have to request the full ask.
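		// sysAlloc maps whole heap arenas (64 MiB on most 64-bit
		// platforms), so asize may exceed ask; any surplus becomes part
		// of the new current arena for future growth.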
av, asize := h.sysAlloc(ask, &h.arenaHints, &h.heapArenas)
if av == nil {
inUse := gcController.heapFree.load() + gcController.heapReleased.load() + gcController.heapInUse.load()
print("runtime: out of memory: cannot allocate ", ask, "-byte block (", inUse, " in use)\n")
return 0, false
}
if uintptr(av) == h.curArena.end {
// The new space is contiguous with the old
// space, so just extend the current space.
h.curArena.end = uintptr(av) + asize
} else {
// The new space is discontiguous. Track what
// remains of the current space and switch to
// the new space. This should be rare.
if size := h.curArena.end - h.curArena.base; size != 0 {
// Transition this space from Reserved to Prepared and mark it
// as released since we'll be able to start using it after updating
// the page allocator and releasing the lock at any time.
sysMap(unsafe.Pointer(h.curArena.base), size, &gcController.heapReleased, "heap")
// Update stats.
stats := memstats.heapStats.acquire()
atomic.Xaddint64(&stats.released, int64(size))
memstats.heapStats.release()
// Update the page allocator's structures to make this
// space ready for allocation.
h.pages.grow(h.curArena.base, size)
totalGrowth += size
}
// Switch to the new space.
h.curArena.base = uintptr(av)
h.curArena.end = uintptr(av) + asize
if firstGrow && randomizeHeapBase {
// The top heapAddrBits-logHeapArenaBytes bits are randomized. We now
// want to randomize the next
// logHeapArenaBytes-log2(pallocChunkBytes) bits, making sure
// h.curArena.base is aligned to pallocChunkBytes.
bits := logHeapArenaBytes - logPallocChunkBytes
offset := nextHeapRandBits(bits)
h.curArena.base = alignDown(h.curArena.base|(offset<<logPallocChunkBytes), pallocChunkBytes)
}
}
// Recalculate nBase.
// We know this won't overflow, because sysAlloc returned
// a valid region starting at h.curArena.base which is at
// least ask bytes in size.
nBase = alignUp(h.curArena.base+ask, physPageSize)
}
// Grow into the current arena.
v := h.curArena.base
h.curArena.base = nBase
// Transition the space we're going to use from Reserved to Prepared.
//
// The allocation is always aligned to the heap arena
// size, which is always > physPageSize, so it's safe to
// just add directly to heapReleased.
sysMap(unsafe.Pointer(v), nBase-v, &gcController.heapReleased, "heap")
// The memory just allocated counts as both released
// and idle, even though it's not yet backed by spans.
stats := memstats.heapStats.acquire()
atomic.Xaddint64(&stats.released, int64(nBase-v))
memstats.heapStats.release()
// Update the page allocator's structures to make this
// space ready for allocation.
h.pages.grow(v, nBase-v)
totalGrowth += nBase - v
if firstGrow && randomizeHeapBase {
// The top heapAddrBits-log2(pallocChunkBytes) bits are now randomized.
// We finally want to randomize the next
// log2(pallocChunkBytes)-log2(pageSize) bits, while maintaining
// alignment to pageSize. We do this by calculating a random number of
// pages into the current arena, and marking them as allocated. The
// address of the next available page becomes our fully randomized base
// heap address. (A worked sketch of this arithmetic follows the function.)
randOffset := nextHeapRandBits(logPallocChunkBytes)
randNumPages := alignDown(randOffset, pageSize) / pageSize
if randNumPages != 0 {
h.pages.markRandomPaddingPages(v, randNumPages)
}
}
return totalGrowth, true
}
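// A worked sketch of the firstGrow randomization above. The constants are
// platform-dependent; the numbers below assume a typical 64-bit Linux
// configuration (64 MiB heap arenas, 4 MiB palloc chunks, 8 KiB pages) and
// are illustrative only:
//
//	// Step 1 (switching to a new arena): randomize the
//	// logHeapArenaBytes-logPallocChunkBytes bits below the arena bits,
//	// keeping the base aligned to pallocChunkBytes.
//	bits := logHeapArenaBytes - logPallocChunkBytes // 26 - 22 = 4
//	offset := nextHeapRandBits(bits)                // in [0, 16)
//	base = alignDown(base|(offset<<logPallocChunkBytes), pallocChunkBytes)
//
//	// Step 2 (first grow into that arena): randomize the remaining
//	// in-chunk offset in whole pages by marking a random number of
//	// leading pages as allocated padding.
//	randOffset := nextHeapRandBits(logPallocChunkBytes)        // in [0, 4 MiB)
//	randNumPages := alignDown(randOffset, pageSize) / pageSize // in [0, 512)
//
// where alignDown(x, a) = x &^ (a-1) for a power-of-two a, so the padding
// is always a whole number of pages.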
// Free the span back into the heap.
func (h *mheap) freeSpan(s *mspan) {
systemstack(func() {
// Trace the span free. (See the note after this function on enabling these events.)
if traceAllocFreeEnabled() {
trace := traceAcquire()
if trace.ok() {
trace.SpanFree(s)
traceRelease(trace)
}
}
lock(&h.lock)
if msanenabled {
// Tell msan that this entire span is no longer in use.
base := unsafe.Pointer(s.base())
bytes := s.npages << gc.PageShift
msanfree(base, bytes)
}
if asanenabled {
// Tell asan that this entire span is no longer in use.
base := unsafe.Pointer(s.base())
bytes := s.npages << gc.PageShift
asanpoison(base, bytes)
}
if valgrindenabled {
base := s.base()
valgrindMempoolFree(unsafe.Pointer(arenaBase(arenaIndex(base))), unsafe.Pointer(base))
}
h.freeSpanLocked(s, spanAllocHeap)
unlock(&h.lock)
})
}
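// The SpanFree events emitted above (and in freeManual below) are recorded
// only when alloc/free tracing is enabled via the traceallocfree GODEBUG.
// A minimal, illustrative sketch of collecting such a trace from user code
// follows; the file name and error handling are assumptions, not part of
// this file:
//
//	// Run the program with GODEBUG=traceallocfree=1 so that span and
//	// object alloc/free events are included in the execution trace.
//	f, err := os.Create("out.trace")
//	if err != nil {
//		log.Fatal(err)
//	}
//	defer f.Close()
//	if err := trace.Start(f); err != nil { // package runtime/trace
//		log.Fatal(err)
//	}
//	defer trace.Stop()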
// freeManual frees a manually-managed span returned by allocManual.
// typ must be the same as the spanAllocType passed to the allocManual that
// allocated s.
//
// This must only be called when gcphase == _GCoff. See mSpanState for
// an explanation.
//
// freeManual must be called on the system stack because it acquires
// the heap lock. See mheap for details.
//
//go:systemstack
func (h *mheap) freeManual(s *mspan, typ spanAllocType) {
// Trace the span free.
if traceAllocFreeEnabled() {
trace := traceAcquire()
if trace.ok() {
trace.SpanFree(s)
traceRelease(trace)
}
}
s.needzero = 1
lock(&h.lock)
if valgrindenabled {
base := s.base()
valgrindMempoolFree(unsafe.Pointer(arenaBase(arenaIndex(base))), unsafe.Pointer(base))
}
h.freeSpanLocked(s, typ)
unlock(&h.lock)
}
func (h *mheap) freeSpanLocked(s *mspan, typ spanAllocType) {
assertLockHeld(&h.lock)
switch s.state.get() {
case mSpanManual:
if s.allocCount != 0 {
throw("mheap.freeSpanLocked - invalid stack free")
}
case mSpanInUse:
if s.isUserArenaChunk {
throw("mheap.freeSpanLocked - invalid free of user arena chunk")
}
if s.allocCount != 0 || s.sweepgen != h.sweepgen {
print("mheap.freeSpanLocked - span ", s, " ptr ", hex(s.base()), " allocCount ", s.allocCount, " sweepgen ", s.sweepgen, "/", h.sweepgen, "\n")
throw("mheap.freeSpanLocked - invalid free")
}
h.pagesInUse.Add(-s.npages)
// Clear in-use bit in arena page bitmap.
arena, pageIdx, pageMask := pageIndexOf(s.base())
atomic.And8(&arena.pageInUse[pageIdx], ^pageMask)
// Clear small heap span bit if necessary.
if gcUsesSpanInlineMarkBits(s.elemsize) {
atomic.And8(&arena.pageUseSpanInlineMarkBits[pageIdx], ^pageMask)
}
default:
throw("mheap.freeSpanLocked - invalid span state")
}
// Update stats.
//
// Mirrors the code in allocSpan.
nbytes := s.npages * pageSize
gcController.heapFree.add(int64(nbytes))
if typ == spanAllocHeap {
gcController.heapInUse.add(-int64(nbytes))
}
// Update consistent stats.
stats := memstats.heapStats.acquire()
switch typ {
case spanAllocHeap:
atomic.Xaddint64(&stats.inHeap, -int64(nbytes))
case spanAllocStack:
atomic.Xaddint64(&stats.inStacks, -int64(nbytes))
case spanAllocWorkBuf:
atomic.Xaddint64(&stats.inWorkBufs, -int64(nbytes))
}
memstats.heapStats.release()
// Mark the space as free.
h.pages.free(s.base(), s.npages)
// Free the span structure. We no longer have a use for it.
s.state.set(mSpanDead)
h.freeMSpanLocked(s)
}
// scavengeAll acquires the heap lock (blocking any additional
// manipulation of the page allocator) and iterates over the whole
// heap, scavenging every free page available.
//
// Must run on the system stack because it acquires the heap lock.
//
//go:systemstack
func (h *mheap) scavengeAll() {
// Disallow malloc or panic while holding the heap lock. We do
// this here because this is a non-mallocgc entry-point to
// the mheap API.
gp := getg()
gp.m.mallocing++
// Force scavenge everything.
released := h.pages.scavenge(^uintptr(0), nil, true)
gp.m.mallocing--
if debug.scavtrace > 0 {
printScavTrace(0, released, true)
}
}
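// runtime_debug_freeOSMemory backs runtime/debug.FreeOSMemory: it forces a
// garbage collection and then scavenges the entire heap, returning as much
// memory to the operating system as possible.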
//go:linkname runtime_debug_freeOSMemory runtime/debug.freeOSMemory
func runtime_debug_freeOSMemory() {
GC()
systemstack(func() { mheap_.scavengeAll() })
}
// Initialize a new span with the given start and npages.
func (span *mspan) init(base uintptr, npages uintptr) {
// span is *not* zeroed.
span.next = nil
span.prev = nil
span.list = nil
span.startAddr = base
span.npages = npages
span.limit = base + npages*gc.PageSize // see go.dev/issue/74288; adjusted later for heap spans
span.allocCount = 0
span.spanclass = 0
span.elemsize = 0
span.speciallock.key = 0
span.specials = nil
span.needzero = 0
span.freeindex = 0
span.freeIndexForScan = 0
span.allocBits = nil
span.gcmarkBits = nil
span.pinnerBits = nil
span.state.set(mSpanDead)
lockInit(&span.speciallock, lockRankMspanSpecial)
}
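// inList reports whether span is on an mSpanList.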
func (span *mspan) inList() bool {
return span.list != nil
}
// mSpanList heads a linked list of spans.
type mSpanList struct {
_ sys.NotInHeap
first *mspan // first span in list, or nil if none
last *mspan // last span in list, or nil if none
}
// Initialize an empty doubly-linked list.
func (list *mSpanList) init() {
list.first = nil
list.last = nil
}
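// remove removes span from list. It throws if span is not on list.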
func (list *mSpanList) remove(span *mspan) {
if span.list != list {
print("runtime: failed mSpanList.remove span.npages=", span.npages,
" span=", span, " prev=", span.prev, " span.list=", span.list, " list=", list, "\n")
throw("mSpanList.remove")
}
if list.first == span {
list.first = span.next
} else {
span.prev.next = span.next
}
if list.last == span {
list.last = span.prev
} else {
span.next.prev = span.prev
}
span.next = nil
span.prev = nil
span.list = nil
}
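// isEmpty reports whether list contains no spans.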
func (list *mSpanList) isEmpty() bool {
return list.first == nil
}
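// insert adds span to the front of list. span must not already be on a list.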
func (list *mSpanList) insert(span *mspan) {
if span.next != nil || span.prev != nil || span.list != nil {
println("runtime: failed mSpanList.insert", span, span.next, span.prev, span.list)
throw("mSpanList.insert")
}
span.next = list.first
if list.first != nil {
// The list contains at least one span; link it in.
// The last span in the list doesn't change.
list.first.prev = span
} else {
// The list contains no spans, so this is also the last span.
list.last = span
}
list.first = span
span.list = list
}
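// insertBack adds span to the end of list. span must not already be on a list.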
func (list *mSpanList) insertBack(span *mspan) {
if span.next != nil || span.prev != nil || span.list != nil {
println("runtime: failed mSpanList.insertBack", span, span.next, span.prev, span.list)
throw("mSpanList.insertBack")
}
span.prev = list.last
if list.last != nil {
// The list contains at least one span.
list.last.next = span
} else {
// The list contains no spans, so this is also the first span.
list.first = span
}
list.last = span
span.list = list
}
// takeAll removes all spans from other and inserts them at the front
// of list.
func (list *mSpanList) takeAll(other *mSpanList) {
if other.isEmpty() {
return
}
// Reparent everything in other to list.
for s := other.first; s != nil; s = s.next {
s.list = list
}
// Concatenate the lists.
if list.isEmpty() {
*list = *other
} else {
// Neither list is empty. Put other before list.
other.last.next = list.first
list.first.prev = other.last
list.first = other.first
}
other.first, other.last = nil, nil
}
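// Kinds of specials. A span's specials list is kept sorted by object offset
// and then by kind, so the relative order of these values matters (see
// specialFindSplicePoint).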
const (
// _KindSpecialTinyBlock indicates that a given allocation is a tiny block.
// Ordered before _KindSpecialFinalizer and _KindSpecialCleanup so that it
// always appears first in the specials list.
// Used only if debug.checkfinalizers != 0.
_KindSpecialTinyBlock = 1
// _KindSpecialFinalizer is for tracking finalizers.
_KindSpecialFinalizer = 2
// _KindSpecialWeakHandle is used for creating weak pointers.
_KindSpecialWeakHandle = 3
// _KindSpecialProfile is for memory profiling.
_KindSpecialProfile = 4
// _KindSpecialReachable is a special used for tracking
// reachability during testing.
_KindSpecialReachable = 5
// _KindSpecialPinCounter is a special used for objects that are pinned
// multiple times
_KindSpecialPinCounter = 6
// _KindSpecialCleanup is for tracking cleanups.
_KindSpecialCleanup = 7
// _KindSpecialCheckFinalizer adds additional context to a finalizer or cleanup.
// Used only if debug.checkfinalizers != 0.
_KindSpecialCheckFinalizer = 8
// _KindSpecialBubble is used to associate objects with synctest bubbles.
_KindSpecialBubble = 9
)
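// A special holds auxiliary data (a finalizer, cleanup, weak handle, profile
// record, etc.) for a heap object. Specials are chained off their span in a
// singly linked list sorted by the object's offset within the span, then by kind.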
type special struct {
_ sys.NotInHeap
next *special // linked list in span
offset uintptr // span offset of object
kind byte // kind of special
}
// spanHasSpecials marks a span as having specials in the arena bitmap.
func spanHasSpecials(s *mspan) {
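// ha.pageSpecials has one bit per page in the arena; arenaPage is the
// index of the span's first page within that arena.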
arenaPage := (s.base() / pageSize) % pagesPerArena
ai := arenaIndex(s.base())
ha := mheap_.arenas[ai.l1()][ai.l2()]
atomic.Or8(&ha.pageSpecials[arenaPage/8], uint8(1)<<(arenaPage%8))
}
// spanHasNoSpecials marks a span as having no specials in the arena bitmap.
func spanHasNoSpecials(s *mspan) {
arenaPage := (s.base() / pageSize) % pagesPerArena
ai := arenaIndex(s.base())
ha := mheap_.arenas[ai.l1()][ai.l2()]
atomic.And8(&ha.pageSpecials[arenaPage/8], ^(uint8(1) << (arenaPage % 8)))
}
// addspecial adds the special record s to the list of special records for
// the object p. All fields of s should be filled in except for
// offset & next, which this routine will fill in.
// Returns true if the special was successfully added, false otherwise.
// (The add fails only if a record with the same p and s.kind
// already exists and force is not set.)
func addspecial(p unsafe.Pointer, s *special, force bool) bool {
span := spanOfHeap(uintptr(p))
if span == nil {
throw("addspecial on invalid pointer")
}
// Ensure that the span is swept.
// Sweeping accesses the specials list w/o locks, so we have
// to synchronize with it. And it's just much safer.
mp := acquirem()
span.ensureSwept()
offset := uintptr(p) - span.base()
kind := s.kind
lock(&span.speciallock)
// Find splice point, check for existing record.
iter, exists := span.specialFindSplicePoint(offset, kind)
if !exists || force {
// Splice in record, fill in offset.
s.offset = offset
s.next = *iter
*iter = s
spanHasSpecials(span)
}
unlock(&span.speciallock)
releasem(mp)
// We're converting p to a uintptr and looking it up, and we
// don't want it to die and get swept while we're doing so.
KeepAlive(p)
return !exists || force // true if we spliced in a new record above
}
// Removes the special record of the given kind for the object p.
// Returns the record if the record existed, nil otherwise.
// The caller must free the result back to the appropriate fixalloc.
func removespecial(p unsafe.Pointer, kind uint8) *special {
span := spanOfHeap(uintptr(p))
if span == nil {
throw("removespecial on invalid pointer")
}
// Ensure that the span is swept.
// Sweeping accesses the specials list w/o locks, so we have
// to synchronize with it. And it's just much safer.
mp := acquirem()
span.ensureSwept()
offset := uintptr(p) - span.base()
var result *special
lock(&span.speciallock)
iter, exists := span.specialFindSplicePoint(offset, kind)
if exists {
s := *iter
*iter = s.next
result = s
}
if span.specials == nil {
spanHasNoSpecials(span)
}
unlock(&span.speciallock)
releasem(mp)
return result
}
// Find a splice point in the sorted list and check for an already existing
// record. Returns a pointer to the next-reference in the list predecessor.
// Returns true if the referenced item is an exact match.
func (span *mspan) specialFindSplicePoint(offset uintptr, kind byte) (**special, bool) {
// Find splice point, check for existing record.
iter := &span.specials
found := false
for {
s := *iter
if s == nil {
break
}
if offset == s.offset && kind == s.kind {
found = true
break
}
if offset < s.offset || (offset == s.offset && kind < s.kind) {
break
}
iter = &s.next
}
return iter, found
}
// The described object has a finalizer set for it.
//
// specialfinalizer is allocated from non-GC'd memory, so any heap
// pointers must be specially handled.
type specialfinalizer struct {
_ sys.NotInHeap
special special
fn *funcval // May be a heap pointer.
nret uintptr
fint *_type // May be a heap pointer, but always live.
ot *ptrtype // May be a heap pointer, but always live.
}
// Adds a finalizer to the object p. Returns true if it succeeded.
func addfinalizer(p unsafe.Pointer, f *funcval, nret uintptr, fint *_type, ot *ptrtype) bool {
lock(&mheap_.speciallock)
s := (*specialfinalizer)(mheap_.specialfinalizeralloc.alloc())
unlock(&mheap_.speciallock)
s.special.kind = _KindSpecialFinalizer
s.fn = f
s.nret = nret
s.fint = fint
s.ot = ot
if addspecial(p, &s.special, false) {
// This is responsible for maintaining the same
// GC-related invariants as markrootSpans in any
// situation where it's possible that markrootSpans
// has already run but mark termination hasn't yet.
if gcphase != _GCoff {
base, span, _ := findObject(uintptr(p), 0, 0)
mp := acquirem()
gcw := &mp.p.ptr().gcw
// Mark everything reachable from the object
// so it's retained for the finalizer.
if !span.spanclass.noscan() {
scanObject(base, gcw)
}
// Mark the finalizer itself, since the
// special isn't part of the GC'd heap.
scanblock(uintptr(unsafe.Pointer(&s.fn)), goarch.PtrSize, &oneptrmask[0], gcw, nil)
releasem(mp)
}
return true
}
// There was an old finalizer
lock(&mheap_.speciallock)
mheap_.specialfinalizeralloc.free(unsafe.Pointer(s))
unlock(&mheap_.speciallock)
return false
}
// Removes the finalizer (if any) from the object p.
func removefinalizer(p unsafe.Pointer) {
s := (*specialfinalizer)(unsafe.Pointer(removespecial(p, _KindSpecialFinalizer)))
if s == nil {
return // there wasn't a finalizer to remove
}
lock(&mheap_.speciallock)
mheap_.specialfinalizeralloc.free(unsafe.Pointer(s))
unlock(&mheap_.speciallock)
}
// The described object has a cleanup set for it.
type specialCleanup struct {
_ sys.NotInHeap
special special
fn *funcval
// Globally unique ID for the cleanup, obtained from mheap_.cleanupID.
id uint64
}
// addCleanup attaches a cleanup function to the object. Multiple
// cleanups are allowed on an object, and even the same pointer.
// A cleanup id is returned which can be used to uniquely identify
// the cleanup.
func addCleanup(p unsafe.Pointer, f *funcval) uint64 {
lock(&mheap_.speciallock)
s := (*specialCleanup)(mheap_.specialCleanupAlloc.alloc())
mheap_.cleanupID++ // Increment first. ID 0 is reserved.
id := mheap_.cleanupID
unlock(&mheap_.speciallock)
s.special.kind = _KindSpecialCleanup
s.fn = f
s.id = id
mp := acquirem()
addspecial(p, &s.special, true)
// This is responsible for maintaining the same
// GC-related invariants as markrootSpans in any
// situation where it's possible that markrootSpans
// has already run but mark termination hasn't yet.
if gcphase != _GCoff {
gcw := &mp.p.ptr().gcw
// Mark the cleanup itself, since the
// special isn't part of the GC'd heap.
scanblock(uintptr(unsafe.Pointer(&s.fn)), goarch.PtrSize, &oneptrmask[0], gcw, nil)
}
releasem(mp)
// Keep f alive. There's a window in this function where it's
// only reachable via the special while the special hasn't been
// added to the specials list yet. This is similar to a bug
// discovered for weak handles, see #70455.
KeepAlive(f)
return id
}
// Always paired with a specialCleanup or specialfinalizer, adds context.
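// The recorded PCs and pointer type give diagnostics such as the
// checkfinalizers debug mode enough context to report where a finalizer
// or cleanup was created and what it refers to.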
type specialCheckFinalizer struct {
_ sys.NotInHeap
special special
cleanupID uint64 // Needed to disambiguate cleanups.
createPC uintptr
funcPC uintptr
ptrType *_type
}
// setFinalizerContext adds a specialCheckFinalizer to ptr. ptr must already have a
// finalizer special attached.
func setFinalizerContext(ptr unsafe.Pointer, ptrType *_type, createPC, funcPC uintptr) {
setCleanupContext(ptr, ptrType, createPC, funcPC, 0)
}
// setCleanupContext adds a specialCheckFinalizer to ptr. ptr must already have a
// finalizer or cleanup special attached. Pass 0 for the cleanupID to indicate
// a finalizer.
func setCleanupContext(ptr unsafe.Pointer, ptrType *_type, createPC, funcPC uintptr, cleanupID uint64) {
lock(&mheap_.speciallock)
s := (*specialCheckFinalizer)(mheap_.specialCheckFinalizerAlloc.alloc())
unlock(&mheap_.speciallock)
s.special.kind = _KindSpecialCheckFinalizer
s.cleanupID = cleanupID
s.createPC = createPC
s.funcPC = funcPC
s.ptrType = ptrType
mp := acquirem()
addspecial(ptr, &s.special, true)
releasem(mp)
KeepAlive(ptr)
}
func getCleanupContext(ptr uintptr, cleanupID uint64) *specialCheckFinalizer {
assertWorldStopped()
span := spanOfHeap(ptr)
if span == nil {
return nil
}
var found *specialCheckFinalizer
offset := ptr - span.base()
iter, exists := span.specialFindSplicePoint(offset, _KindSpecialCheckFinalizer)
if exists {
for {
s := *iter
if s == nil {
// Reached the end of the linked list. Stop searching at this point.
break
}
if offset == s.offset && _KindSpecialCheckFinalizer == s.kind &&
(*specialCheckFinalizer)(unsafe.Pointer(s)).cleanupID == cleanupID {
// The special is a cleanup and contains a matching cleanup id.
*iter = s.next
found = (*specialCheckFinalizer)(unsafe.Pointer(s))
break
}
if offset < s.offset || (offset == s.offset && _KindSpecialCheckFinalizer < s.kind) {
// The special is outside the region specified for that kind of
// special. The specials are sorted by kind.
break
}
// Try the next special.
iter = &s.next
}
}
return found
}
// clearFinalizerContext removes the specialCheckFinalizer for the given pointer, if any.
func clearFinalizerContext(ptr uintptr) {
clearCleanupContext(ptr, 0)
}
// clearCleanupContext removes the specialCheckFinalizer for the given pointer and cleanup ID, if any.
func clearCleanupContext(ptr uintptr, cleanupID uint64) {
	// Remove the specialCheckFinalizer record, if any, attached to the object at ptr.
span := spanOfHeap(ptr)
if span == nil {
return
}
// Ensure that the span is swept.
// Sweeping accesses the specials list w/o locks, so we have
// to synchronize with it. And it's just much safer.
mp := acquirem()
span.ensureSwept()
offset := ptr - span.base()
var found *special
lock(&span.speciallock)
iter, exists := span.specialFindSplicePoint(offset, _KindSpecialCheckFinalizer)
if exists {
for {
s := *iter
if s == nil {
// Reached the end of the linked list. Stop searching at this point.
break
}
if offset == s.offset && _KindSpecialCheckFinalizer == s.kind &&
(*specialCheckFinalizer)(unsafe.Pointer(s)).cleanupID == cleanupID {
// The special is a cleanup and contains a matching cleanup id.
*iter = s.next
found = s
break
}
if offset < s.offset || (offset == s.offset && _KindSpecialCheckFinalizer < s.kind) {
// The special is outside the region specified for that kind of
// special. The specials are sorted by kind.
break
}
// Try the next special.
iter = &s.next
}
}
if span.specials == nil {
spanHasNoSpecials(span)
}
unlock(&span.speciallock)
releasem(mp)
if found == nil {
return
}
lock(&mheap_.speciallock)
mheap_.specialCheckFinalizerAlloc.free(unsafe.Pointer(found))
unlock(&mheap_.speciallock)
}
// Indicates that an allocation is a tiny block.
// Used only if debug.checkfinalizers != 0.
type specialTinyBlock struct {
_ sys.NotInHeap
special special
}
// setTinyBlockContext marks an allocation as a tiny block for diagnostics
// like checkfinalizers.
//
// A tiny block is only marked if it actually contains more than one distinct
// value, since we're using this for debugging.
func setTinyBlockContext(ptr unsafe.Pointer) {
lock(&mheap_.speciallock)
s := (*specialTinyBlock)(mheap_.specialTinyBlockAlloc.alloc())
unlock(&mheap_.speciallock)
s.special.kind = _KindSpecialTinyBlock
mp := acquirem()
addspecial(ptr, &s.special, false)
releasem(mp)
KeepAlive(ptr)
}
// inTinyBlock returns whether ptr is in a tiny alloc block, at one point grouped
// with other distinct values.
func inTinyBlock(ptr uintptr) bool {
assertWorldStopped()
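	// Round ptr down to a maxTinySize boundary to find the base of its tiny
	// block; for example, with maxTinySize = 16, a ptr ending in 0x7 maps to
	// the 0x0 base shared by every value packed into that block.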
ptr = alignDown(ptr, maxTinySize)
span := spanOfHeap(ptr)
if span == nil {
return false
}
offset := ptr - span.base()
_, exists := span.specialFindSplicePoint(offset, _KindSpecialTinyBlock)
return exists
}
// The described object has a weak pointer.
//
// Weak pointers in the GC have the following invariants:
//
// - Strong-to-weak conversions must ensure the strong pointer
// remains live until the weak handle is installed. This ensures
// that creating a weak pointer cannot fail.
//
// - Weak-to-strong conversions require the weakly-referenced
// object to be swept before the conversion may proceed. This
// ensures that weak-to-strong conversions cannot resurrect
// dead objects by sweeping them before that happens.
//
// - Weak handles are unique and canonical for each byte offset into
// an object that a strong pointer may point to, until an object
// becomes unreachable.
//
// - Weak handles contain nil as soon as an object becomes unreachable
// the first time, before a finalizer makes it reachable again. New
// weak handles created after resurrection are newly unique.
//
// specialWeakHandle is allocated from non-GC'd memory, so any heap
// pointers must be specially handled.
type specialWeakHandle struct {
_ sys.NotInHeap
special special
// handle is a reference to the actual weak pointer.
// It is always heap-allocated and must be explicitly kept
// live so long as this special exists.
handle *atomic.Uintptr
}
//go:linkname internal_weak_runtime_registerWeakPointer weak.runtime_registerWeakPointer
func internal_weak_runtime_registerWeakPointer(p unsafe.Pointer) unsafe.Pointer {
return unsafe.Pointer(getOrAddWeakHandle(p))
}
//go:linkname internal_weak_runtime_makeStrongFromWeak weak.runtime_makeStrongFromWeak
func internal_weak_runtime_makeStrongFromWeak(u unsafe.Pointer) unsafe.Pointer {
handle := (*atomic.Uintptr)(u)
// Prevent preemption. We want to make sure that another GC cycle can't start
// and that work.strongFromWeak.block can't change out from under us.
mp := acquirem()
// Yield to the GC if necessary.
if work.strongFromWeak.block {
releasem(mp)
// Try to park and wait for mark termination.
// N.B. gcParkStrongFromWeak calls acquirem before returning.
mp = gcParkStrongFromWeak()
}
p := handle.Load()
if p == 0 {
releasem(mp)
return nil
}
// Be careful. p may or may not refer to valid memory anymore, as it could've been
// swept and released already. It's always safe to ensure a span is swept, though,
// even if it's just some random span.
span := spanOfHeap(p)
if span == nil {
// If it's immortal, then just return the pointer.
//
// Stay non-preemptible so the GC can't see us convert this potentially
// completely bogus value to an unsafe.Pointer.
if isGoPointerWithoutSpan(unsafe.Pointer(p)) {
releasem(mp)
return unsafe.Pointer(p)
}
// It's heap-allocated, so the span probably just got swept and released.
releasem(mp)
return nil
}
// Ensure the span is swept.
span.ensureSwept()
// Now we can trust whatever we get from handle, so make a strong pointer.
//
// Even if we just swept some random span that doesn't contain this object, because
// this object is long dead and its memory has since been reused, we'll just observe nil.
ptr := unsafe.Pointer(handle.Load())
// This is responsible for maintaining the same GC-related
// invariants as the Yuasa part of the write barrier. During
// the mark phase, it's possible that we just created the only
// valid pointer to the object pointed to by ptr. If it's only
// ever referenced from our stack, and our stack is blackened
// already, we could fail to mark it. So, mark it now.
if gcphase != _GCoff {
shade(uintptr(ptr))
}
releasem(mp)
// Explicitly keep ptr alive. This seems unnecessary since we return ptr,
// but let's be explicit since it's important we keep ptr alive across the
// call to shade.
KeepAlive(ptr)
return ptr
}
// gcParkStrongFromWeak puts the current goroutine on the weak->strong queue and parks.
func gcParkStrongFromWeak() *m {
// Prevent preemption as we check strongFromWeak, so it can't change out from under us.
mp := acquirem()
for work.strongFromWeak.block {
lock(&work.strongFromWeak.lock)
releasem(mp) // N.B. Holding the lock prevents preemption.
// Queue ourselves up.
work.strongFromWeak.q.pushBack(getg())
// Park.
goparkunlock(&work.strongFromWeak.lock, waitReasonGCWeakToStrongWait, traceBlockGCWeakToStrongWait, 2)
// Re-acquire the current M since we're going to check the condition again.
mp = acquirem()
// Re-check condition. We may have awoken in the next GC's mark termination phase.
}
return mp
}
// gcWakeAllStrongFromWeak wakes all currently blocked weak->strong
// conversions. This is used at the end of a GC cycle.
//
// work.strongFromWeak.block must be false to prevent woken goroutines
// from immediately going back to sleep.
func gcWakeAllStrongFromWeak() {
lock(&work.strongFromWeak.lock)
list := work.strongFromWeak.q.popList()
injectglist(&list)
unlock(&work.strongFromWeak.lock)
}
// Retrieves or creates a weak pointer handle for the object p.
func getOrAddWeakHandle(p unsafe.Pointer) *atomic.Uintptr {
if debug.sbrk != 0 {
// debug.sbrk never frees memory, so it'll never go nil. However, we do still
// need a weak handle that's specific to p. Use the immortal weak handle map.
// Keep p alive across the call to getOrAdd defensively, though it doesn't
// really matter in this particular case.
handle := mheap_.immortalWeakHandles.getOrAdd(uintptr(p))
KeepAlive(p)
return handle
}
// First try to retrieve without allocating.
if handle := getWeakHandle(p); handle != nil {
// Keep p alive for the duration of the function to ensure
// that it cannot die while we're trying to do this.
KeepAlive(p)
return handle
}
lock(&mheap_.speciallock)
s := (*specialWeakHandle)(mheap_.specialWeakHandleAlloc.alloc())
unlock(&mheap_.speciallock)
handle := new(atomic.Uintptr)
s.special.kind = _KindSpecialWeakHandle
s.handle = handle
handle.Store(uintptr(p))
if addspecial(p, &s.special, false) {
// This is responsible for maintaining the same
// GC-related invariants as markrootSpans in any
// situation where it's possible that markrootSpans
// has already run but mark termination hasn't yet.
if gcphase != _GCoff {
mp := acquirem()
gcw := &mp.p.ptr().gcw
// Mark the weak handle itself, since the
// special isn't part of the GC'd heap.
scanblock(uintptr(unsafe.Pointer(&s.handle)), goarch.PtrSize, &oneptrmask[0], gcw, nil)
releasem(mp)
}
// Keep p alive for the duration of the function to ensure
// that it cannot die while we're trying to do this.
//
// Same for handle, which is only stored in the special.
// There's a window where it might die if we don't keep it
// alive explicitly. Returning it here is probably good enough,
// but let's be defensive and explicit. See #70455.
KeepAlive(p)
KeepAlive(handle)
return handle
}
// There was an existing handle. Free the special
// and try again. We must succeed because we're explicitly
// keeping p live until the end of this function. Either
// we, or someone else, must have succeeded, because we can
// only fail in the event of a race, and p will still be
	// valid no matter how much time we spend here.
lock(&mheap_.speciallock)
mheap_.specialWeakHandleAlloc.free(unsafe.Pointer(s))
unlock(&mheap_.speciallock)
handle = getWeakHandle(p)
if handle == nil {
throw("failed to get or create weak handle")
}
// Keep p alive for the duration of the function to ensure
// that it cannot die while we're trying to do this.
//
// Same for handle, just to be defensive.
KeepAlive(p)
KeepAlive(handle)
return handle
}
func getWeakHandle(p unsafe.Pointer) *atomic.Uintptr {
span := spanOfHeap(uintptr(p))
if span == nil {
if isGoPointerWithoutSpan(p) {
return mheap_.immortalWeakHandles.getOrAdd(uintptr(p))
}
throw("getWeakHandle on invalid pointer")
}
// Ensure that the span is swept.
// Sweeping accesses the specials list w/o locks, so we have
// to synchronize with it. And it's just much safer.
mp := acquirem()
span.ensureSwept()
offset := uintptr(p) - span.base()
lock(&span.speciallock)
// Find the existing record and return the handle if one exists.
var handle *atomic.Uintptr
iter, exists := span.specialFindSplicePoint(offset, _KindSpecialWeakHandle)
if exists {
handle = ((*specialWeakHandle)(unsafe.Pointer(*iter))).handle
}
unlock(&span.speciallock)
releasem(mp)
// Keep p alive for the duration of the function to ensure
// that it cannot die while we're trying to do this.
KeepAlive(p)
return handle
}
type immortalWeakHandleMap struct {
root atomic.UnsafePointer // *immortalWeakHandle (can't use generics because it's notinheap)
}
// immortalWeakHandle is a lock-free append-only hash-trie.
//
// Key features:
// - 2-ary trie. Child nodes are indexed by the highest bit (remaining) of the hash of the address.
// - New nodes are placed at the first empty level encountered.
// - When the first child is added to a node, the existing value is not moved into a child.
// This means that we must check the value at each level, not just at the leaf.
// - No deletion or rebalancing.
// - Intentionally devolves into a linked list on hash collisions (the hash bits will all
// get shifted out during iteration, and new nodes will just be appended to the 0th child).
type immortalWeakHandle struct {
_ sys.NotInHeap
	children [2]atomic.UnsafePointer // *immortalWeakHandle (can't use generics because it's notinheap)
ptr uintptr // &ptr is the weak handle
}
// handle returns a canonical weak handle.
func (h *immortalWeakHandle) handle() *atomic.Uintptr {
// N.B. Since we just need an *atomic.Uintptr that never changes, we can trivially
// reference ptr to save on some memory in immortalWeakHandle and avoid extra atomics
// in getOrAdd.
return (*atomic.Uintptr)(unsafe.Pointer(&h.ptr))
}
// getOrAdd introduces p, which must be a pointer to immortal memory (for example, a linker-allocated
// object) and returns a weak handle. The weak handle will never become nil.
func (tab *immortalWeakHandleMap) getOrAdd(p uintptr) *atomic.Uintptr {
var newNode *immortalWeakHandle
m := &tab.root
hash := memhash(abi.NoEscape(unsafe.Pointer(&p)), 0, goarch.PtrSize)
hashIter := hash
for {
n := (*immortalWeakHandle)(m.Load())
if n == nil {
// Try to insert a new map node. We may end up discarding
// this node if we fail to insert because it turns out the
// value is already in the map.
//
// The discard will only happen if two threads race on inserting
// the same value. Both might create nodes, but only one will
// succeed on insertion. If two threads race to insert two
// different values, then both nodes will *always* get inserted,
// because the equality checking below will always fail.
//
// Performance note: contention on insertion is likely to be
// higher for small maps, but since this data structure is
// append-only, either the map stays small because there isn't
// much activity, or the map gets big and races to insert on
// the same node are much less likely.
if newNode == nil {
newNode = (*immortalWeakHandle)(persistentalloc(unsafe.Sizeof(immortalWeakHandle{}), goarch.PtrSize, &memstats.gcMiscSys))
newNode.ptr = p
}
if m.CompareAndSwapNoWB(nil, unsafe.Pointer(newNode)) {
return newNode.handle()
}
// Reload n. Because pointers are only stored once,
// we must have lost the race, and therefore n is not nil
// anymore.
n = (*immortalWeakHandle)(m.Load())
}
if n.ptr == p {
return n.handle()
}
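		// Descend one level: the current top bit of hashIter selects the
		// child, and the shift below exposes the next bit for the next
		// level. Once all hash bits are consumed, hashIter is zero and any
		// colliding entries simply chain through children[0].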
m = &n.children[hashIter>>(8*goarch.PtrSize-1)]
hashIter <<= 1
}
}
// The described object is being heap profiled.
type specialprofile struct {
_ sys.NotInHeap
special special
b *bucket
}
// Set the heap profile bucket associated with addr to b.
func setprofilebucket(p unsafe.Pointer, b *bucket) {
lock(&mheap_.speciallock)
s := (*specialprofile)(mheap_.specialprofilealloc.alloc())
unlock(&mheap_.speciallock)
s.special.kind = _KindSpecialProfile
s.b = b
if !addspecial(p, &s.special, false) {
throw("setprofilebucket: profile already set")
}
}
// specialReachable tracks whether an object is reachable on the next
// GC cycle. This is used by testing.
type specialReachable struct {
special special
done bool
reachable bool
}
// specialPinCounter tracks whether an object is pinned multiple times.
type specialPinCounter struct {
special special
counter uintptr
}
// specialsIter helps iterate over specials lists.
type specialsIter struct {
pprev **special
s *special
}
func newSpecialsIter(span *mspan) specialsIter {
return specialsIter{&span.specials, span.specials}
}
func (i *specialsIter) valid() bool {
return i.s != nil
}
func (i *specialsIter) next() {
i.pprev = &i.s.next
i.s = *i.pprev
}
// unlinkAndNext removes the current special from the list and moves
// the iterator to the next special. It returns the unlinked special.
func (i *specialsIter) unlinkAndNext() *special {
cur := i.s
i.s = cur.next
*i.pprev = i.s
return cur
}
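// Illustrative use of specialsIter (a sketch only: shouldFree is a
// hypothetical predicate, and the real loop lives in the sweeper):
//
//	for iter := newSpecialsIter(span); iter.valid(); {
//		if shouldFree(iter.s) {
//			s := iter.unlinkAndNext()
//			freeSpecial(s, p, size)
//		} else {
//			iter.next()
//		}
//	}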
// freeSpecial performs any cleanup on special s and deallocates it.
// s must already be unlinked from the specials list.
func freeSpecial(s *special, p unsafe.Pointer, size uintptr) {
switch s.kind {
case _KindSpecialFinalizer:
sf := (*specialfinalizer)(unsafe.Pointer(s))
queuefinalizer(p, sf.fn, sf.nret, sf.fint, sf.ot)
lock(&mheap_.speciallock)
mheap_.specialfinalizeralloc.free(unsafe.Pointer(sf))
unlock(&mheap_.speciallock)
case _KindSpecialWeakHandle:
sw := (*specialWeakHandle)(unsafe.Pointer(s))
sw.handle.Store(0)
lock(&mheap_.speciallock)
mheap_.specialWeakHandleAlloc.free(unsafe.Pointer(s))
unlock(&mheap_.speciallock)
case _KindSpecialProfile:
sp := (*specialprofile)(unsafe.Pointer(s))
mProf_Free(sp.b, size)
lock(&mheap_.speciallock)
mheap_.specialprofilealloc.free(unsafe.Pointer(sp))
unlock(&mheap_.speciallock)
case _KindSpecialReachable:
sp := (*specialReachable)(unsafe.Pointer(s))
sp.done = true
// The creator frees these.
case _KindSpecialPinCounter:
lock(&mheap_.speciallock)
mheap_.specialPinCounterAlloc.free(unsafe.Pointer(s))
unlock(&mheap_.speciallock)
case _KindSpecialCleanup:
sc := (*specialCleanup)(unsafe.Pointer(s))
// Cleanups, unlike finalizers, do not resurrect the objects
// they're attached to, so we only need to pass the cleanup
// function, not the object.
gcCleanups.enqueue(sc.fn)
lock(&mheap_.speciallock)
mheap_.specialCleanupAlloc.free(unsafe.Pointer(sc))
unlock(&mheap_.speciallock)
case _KindSpecialCheckFinalizer:
sc := (*specialCheckFinalizer)(unsafe.Pointer(s))
lock(&mheap_.speciallock)
mheap_.specialCheckFinalizerAlloc.free(unsafe.Pointer(sc))
unlock(&mheap_.speciallock)
case _KindSpecialTinyBlock:
st := (*specialTinyBlock)(unsafe.Pointer(s))
lock(&mheap_.speciallock)
mheap_.specialTinyBlockAlloc.free(unsafe.Pointer(st))
unlock(&mheap_.speciallock)
case _KindSpecialBubble:
st := (*specialBubble)(unsafe.Pointer(s))
lock(&mheap_.speciallock)
mheap_.specialBubbleAlloc.free(unsafe.Pointer(st))
unlock(&mheap_.speciallock)
default:
throw("bad special kind")
panic("not reached")
}
}
// gcBits is an alloc/mark bitmap. This is always used as gcBits.x.
type gcBits struct {
_ sys.NotInHeap
x uint8
}
// bytep returns a pointer to the n'th byte of b.
func (b *gcBits) bytep(n uintptr) *uint8 {
return addb(&b.x, n)
}
// bitp returns a pointer to the byte containing bit n and a mask for
// selecting that bit from *bytep.
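//
// For example, bitp(13) returns a pointer to byte 13/8 = 1 and the mask
// 1<<(13%8) = 0x20, which selects bit 5 of that byte.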
func (b *gcBits) bitp(n uintptr) (bytep *uint8, mask uint8) {
return b.bytep(n / 8), 1 << (n % 8)
}
const gcBitsChunkBytes = uintptr(64 << 10)
const gcBitsHeaderBytes = unsafe.Sizeof(gcBitsHeader{})
type gcBitsHeader struct {
free uintptr // free is the index into bits of the next free byte.
next uintptr // *gcBits triggers recursive type bug. (issue 14620)
}
type gcBitsArena struct {
_ sys.NotInHeap
// gcBitsHeader // side step recursive type bug (issue 14620) by including fields by hand.
free uintptr // free is the index into bits of the next free byte; read/write atomically
next *gcBitsArena
bits [gcBitsChunkBytes - gcBitsHeaderBytes]gcBits
}
var gcBitsArenas struct {
lock mutex
free *gcBitsArena
next *gcBitsArena // Read atomically. Write atomically under lock.
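	// current holds this cycle's mark bits; previous holds the prior
	// cycle's bits, which serve as alloc bits until sweeping completes.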
current *gcBitsArena
previous *gcBitsArena
}
// tryAlloc allocates from b or returns nil if b does not have enough room.
// This is safe to call concurrently.
func (b *gcBitsArena) tryAlloc(bytes uintptr) *gcBits {
if b == nil || atomic.Loaduintptr(&b.free)+bytes > uintptr(len(b.bits)) {
return nil
}
// Try to allocate from this block.
end := atomic.Xadduintptr(&b.free, bytes)
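	// N.B. If the check below fails, the add is not undone, so b.free may
	// end up past len(b.bits). That's harmless: later calls fail the cheap
	// check above, and callers fall back to a locked slow path (see newMarkBits).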
if end > uintptr(len(b.bits)) {
return nil
}
// There was enough room.
start := end - bytes
return &b.bits[start]
}
// newMarkBits returns a pointer to 8 byte aligned bytes
// to be used for a span's mark bits.
func newMarkBits(nelems uintptr) *gcBits {
blocksNeeded := (nelems + 63) / 64
bytesNeeded := blocksNeeded * 8
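	// For example, nelems = 100 needs (100+63)/64 = 2 64-bit blocks,
	// i.e. 16 bytes. Rounding to whole 8-byte words keeps every
	// allocation, and hence the returned pointer, 8-byte aligned.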
// Try directly allocating from the current head arena.
head := (*gcBitsArena)(atomic.Loadp(unsafe.Pointer(&gcBitsArenas.next)))
if p := head.tryAlloc(bytesNeeded); p != nil {
return p
}
// There's not enough room in the head arena. We may need to
// allocate a new arena.
lock(&gcBitsArenas.lock)
// Try the head arena again, since it may have changed. Now
// that we hold the lock, the list head can't change, but its
// free position still can.
if p := gcBitsArenas.next.tryAlloc(bytesNeeded); p != nil {
unlock(&gcBitsArenas.lock)
return p
}
// Allocate a new arena. This may temporarily drop the lock.
fresh := newArenaMayUnlock()
// If newArenaMayUnlock dropped the lock, another thread may
// have put a fresh arena on the "next" list. Try allocating
// from next again.
if p := gcBitsArenas.next.tryAlloc(bytesNeeded); p != nil {
// Put fresh back on the free list.
// TODO: Mark it "already zeroed"
fresh.next = gcBitsArenas.free
gcBitsArenas.free = fresh
unlock(&gcBitsArenas.lock)
return p
}
// Allocate from the fresh arena. We haven't linked it in yet, so
// this cannot race and is guaranteed to succeed.
p := fresh.tryAlloc(bytesNeeded)
if p == nil {
throw("markBits overflow")
}
// Add the fresh arena to the "next" list.
fresh.next = gcBitsArenas.next
atomic.StorepNoWB(unsafe.Pointer(&gcBitsArenas.next), unsafe.Pointer(fresh))
unlock(&gcBitsArenas.lock)
return p
}
// newAllocBits returns a pointer to 8 byte aligned bytes
// to be used for this span's alloc bits.
// newAllocBits provides allocation bits for newly initialized spans.
// For spans that are not newly initialized, the mark bits are
// repurposed as allocation bits when the span is swept.
func newAllocBits(nelems uintptr) *gcBits {
return newMarkBits(nelems)
}
// nextMarkBitArenaEpoch establishes a new epoch for the arenas
// holding the mark bits. The arenas are named relative to the
// current GC cycle which is demarcated by the call to finishweep_m.
//
// All current spans have been swept.
// During that sweep each span allocated room for its gcmarkBits in
// the gcBitsArenas.next block. gcBitsArenas.next becomes gcBitsArenas.current,
// where the GC will mark objects; after each span is swept, these bits
// are used to allocate objects.
// gcBitsArenas.current becomes gcBitsArenas.previous, where the spans'
// gcAllocBits live until all the spans have been swept during this GC cycle.
// Each span's sweep extinguishes its references to gcBitsArenas.previous
// by pointing its gcAllocBits into gcBitsArenas.current.
// gcBitsArenas.previous is then released to the gcBitsArenas.free list.
func nextMarkBitArenaEpoch() {
lock(&gcBitsArenas.lock)
if gcBitsArenas.previous != nil {
if gcBitsArenas.free == nil {
gcBitsArenas.free = gcBitsArenas.previous
} else {
// Find end of previous arenas.
last := gcBitsArenas.previous
for last = gcBitsArenas.previous; last.next != nil; last = last.next {
}
last.next = gcBitsArenas.free
gcBitsArenas.free = gcBitsArenas.previous
}
}
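	// Rotate the epochs: current becomes previous, next becomes current,
	// and next starts out empty (newMarkBits allocates a fresh arena on
	// demand).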
gcBitsArenas.previous = gcBitsArenas.current
gcBitsArenas.current = gcBitsArenas.next
atomic.StorepNoWB(unsafe.Pointer(&gcBitsArenas.next), nil) // newMarkBits calls newArena when needed
unlock(&gcBitsArenas.lock)
}
// newArenaMayUnlock allocates and zeroes a gcBits arena.
// The caller must hold gcBitsArena.lock. This may temporarily release it.
func newArenaMayUnlock() *gcBitsArena {
var result *gcBitsArena
if gcBitsArenas.free == nil {
unlock(&gcBitsArenas.lock)
result = (*gcBitsArena)(sysAlloc(gcBitsChunkBytes, &memstats.gcMiscSys, "gc bits"))
if result == nil {
throw("runtime: cannot allocate memory")
}
lock(&gcBitsArenas.lock)
} else {
result = gcBitsArenas.free
gcBitsArenas.free = gcBitsArenas.free.next
memclrNoHeapPointers(unsafe.Pointer(result), gcBitsChunkBytes)
}
result.next = nil
// If result.bits is not 8 byte aligned adjust index so
// that &result.bits[result.free] is 8 byte aligned.
if unsafe.Offsetof(gcBitsArena{}.bits)&7 == 0 {
result.free = 0
} else {
result.free = 8 - (uintptr(unsafe.Pointer(&result.bits[0])) & 7)
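		// For example, if &result.bits[0] ends in 0x4, free starts at
		// 8-4 = 4 so that &result.bits[free] is 8-byte aligned.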
}
return result
}