Commit graph

414 commits

Rhys Hiltner
9114c51521 Revert "runtime: prepare for extensions to waiting M list"
This reverts commit be0b569caa (CL 585635).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-30 17:57:37 +00:00
Russ Cox
2a7ca156b8 all: document legacy //go:linkname for final round of modules
Add linknames for most modules with ≥50 dependents.
Add linknames for a few other modules that we know
are important but are below 50.

Remove linknames from badlinkname.go that do not merit
inclusion (very small number of dependents).
We can add them back later if the need arises.

Fixes #67401. (For now.)

Change-Id: I1e49fec0292265256044d64b1841d366c4106002
Reviewed-on: https://go-review.googlesource.com/c/go/+/587756
Auto-Submit: Russ Cox <rsc@golang.org>
TryBot-Bypass: Russ Cox <rsc@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-05-29 17:58:53 +00:00
Rhys Hiltner
be0b569caa runtime: prepare for extensions to waiting M list
Move the nextwaitm field into a small struct, in preparation for
additional metadata to track how long Ms need to wait for locks.
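
A minimal sketch of the shape this describes; the type and field names here are assumptions based on the commit summary, not the actual diff:

	// Sketch only: wrap the former m.nextwaitm link in a struct so that
	// future metadata (e.g. wait-start timestamps) can sit alongside it.
	type muintptr uintptr // stand-in for the runtime's untyped M pointer

	type mWaitList struct {
		next muintptr // next M waiting for the same lock
	}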

For #66999

Change-Id: Ib40e43c15cde22f7e35922641107973d99439ecd
Reviewed-on: https://go-review.googlesource.com/c/go/+/585635
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-05-21 17:17:24 +00:00
Austin Clements
6ec291f495 internal/runtime/atomic: fix missing linknames
CL 544455, which added atomic And/Or APIs, raced with CL 585556, which
enabled stricter linkname checking. This caused linkname-related
failures on ARM and MIPS. Fix this by adding the necessary linknames.

We fix one other linkname that got overlooked in CL 585556.

Updates #61395.

Change-Id: I454f0767ce28188e550a61bc39b7e398239bc10e
Reviewed-on: https://go-review.googlesource.com/c/go/+/586516
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Austin Clements <austin@google.com>
2024-05-17 20:08:37 +00:00
Michael Anthony Knyszek
97c13cfb25 runtime: delete pagetrace GOEXPERIMENT
The page tracer's functionality is now captured by the regular execution
tracer as an experimental GODEBUG variable. This is a lot more usable
and maintainable than the page tracer, which is likely to have bitrotted
by this point. There's also no tooling available for the page tracer.

Change-Id: I2408394555e01dde75a522e9a489b7e55cf12c8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/583379
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-05-08 17:48:45 +00:00
Felix Geisendörfer
2141315251 runtime: move profiling pc buffers to m
Move the profiling pc buffers from being stack allocated to a field on the m.

This is motivated by the next patch, which will increase the default
stack depth to 128, which might lead to undesirable stack growth for
goroutines that produce profiling events.

Additionally, this change paves the way to make the stack depth
configurable via GODEBUG.

Change-Id: Ifa407f899188e2c7c0a81de92194fdb627cb4b36
Reviewed-on: https://go-review.googlesource.com/c/go/+/574699
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-08 17:48:38 +00:00
Sabyrzhan Tasbolatov
552faa8927 runtime: reduced struct sizes found via pahole
During my research into pahole with Go structs, I've found a couple of
structs in the runtime package whose sizes can be reduced, as highlighted
by the pahole tool, which detects byte holes and padding.

Overall, there are 80 bytes reduced.
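
The kind of rearrangement pahole motivates can be shown with a toy struct (not one of the actual runtime structs): ordering fields from largest to smallest closes the alignment holes.

	package main

	import (
		"fmt"
		"unsafe"
	)

	// padded wastes space on 64-bit platforms: a 7-byte hole after a,
	// plus tail padding after d.
	type padded struct {
		a bool
		b int64
		c int32
		d bool
	}

	// packed holds the same fields largest-first, closing the holes.
	type packed struct {
		b int64
		c int32
		a bool
		d bool
	}

	func main() {
		fmt.Println(unsafe.Sizeof(padded{}), unsafe.Sizeof(packed{})) // 24 16 on 64-bit
	}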

Change-Id: I398e5ed6f5b199394307741981cb5ad5b875e98f
Reviewed-on: https://go-review.googlesource.com/c/go/+/578795
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Joedian Reid <joedian@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-04-22 22:07:41 +00:00
Michael Anthony Knyszek
2b82a4f488 runtime: track frame pointer while in syscall
Currently the runtime only tracks the PC and SP upon entering a syscall,
but not the FP (BP). This is mainly for historical reasons, and because
the tracer (which uses the frame pointer unwinder) does not need it.

Until it did, of course, in CL 567076, where the tracer tries to take a
stack trace of a goroutine that's in a syscall from afar. It tries to
use gp.sched.bp and lots of things go wrong. It *really* should be using
the equivalent of gp.syscallbp, which doesn't exist before this CL.

This change introduces gp.syscallbp and tracks it. It also introduces
getcallerfp which is nice for simplifying some code. Because we now have
gp.syscallbp, we can also delete the frame skip count computation in
traceLocker.GoSysCall, because it's now the same regardless of whether
frame pointer unwinding is used.
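
A rough sketch of the bookkeeping this adds; syscallbp is named in the message, while the surrounding declarations are assumptions:

	// Sketch only: per-G fields recorded on syscall entry. PC and SP were
	// already tracked; this CL adds the frame pointer (BP) so a tracer
	// unwinding from another thread has a valid starting frame.
	type gSyscallState struct {
		syscallpc uintptr // PC at syscall entry (existing)
		syscallsp uintptr // SP at syscall entry (existing)
		syscallbp uintptr // BP at syscall entry (new)
	}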

Fixes #66889.

Change-Id: Ib6d761c9566055e0a037134138cb0f81be73ecf7
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-nocgo
Reviewed-on: https://go-review.googlesource.com/c/go/+/580255
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-04-19 17:25:00 +00:00
Michael Anthony Knyszek
e995aa95cb runtime: account for _Pgcstop in GC CPU pause time in a fine-grained way
The previous CL, CL 570257, made it so that STW time no longer
overlapped with other CPU time tracking. However, what we lost was
insight into the CPU time spent _stopping_ the world, which can be just
as important. There's pretty much no easy way to measure this
indirectly, so this CL implements a direct measurement: whenever a P
enters _Pgcstop, it writes down what time it did so. stopTheWorld then
accumulates all the time deltas between when it finished stopping the
world and each P's stop time into a total additional pause time. The GC
pause cases then accumulate this number into the metrics.
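
A sketch of that accounting with invented names (the real code lives in the P state machine and stopTheWorldWithSema):

	// Sketch only: each P notes when it entered _Pgcstop; stopTheWorld
	// then sums (stopDone - gcStopTime) across Ps as extra pause time.
	type pStopClock struct {
		gcStopTime int64 // nanotime when this P entered _Pgcstop
	}

	func additionalPauseNS(ps []pStopClock, stopDone int64) int64 {
		var total int64
		for _, p := range ps {
			total += stopDone - p.gcStopTime
		}
		return total
	}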

This should cause minimal additional overhead in stopping the world. GC
STWs already take on the order of 10s to 100s of microseconds. Even for
100 Ps, the extra `nanotime` call per P is only 1500ns of additional CPU
time. This is likely to be much less in actual pause latency, since it
all happens concurrently.

Change-Id: Icf190ffea469cd35ebaf0b2587bf6358648c8554
Reviewed-on: https://go-review.googlesource.com/c/go/+/574215
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
2024-04-08 21:43:16 +00:00
Michael Anthony Knyszek
d6a3d093c3 runtime: take a stack trace during tracing only when we own the stack
Currently, the execution tracer may attempt to take a stack trace of a
goroutine whose stack it does not own. For example, if the goroutine is
in _Grunnable or _Gwaiting. This is easily fixed in all cases by simply
moving the emission of GoStop and GoBlock events to before the
casgstatus happens. The goroutine status is what is used to signal stack
ownership, and the GC may shrink a goroutine's stack if it can acquire
the scan bit.

Although this is easily fixed, the interaction here is very subtle,
because stack ownership is only implicit in the goroutine's scan status.
To make this invariant more maintainable and less error-prone in the
future, this change adds a GODEBUG setting that checks, at the point of
taking a stack trace, whether the caller owns the goroutine. This check
is not quite perfect because there's no way for the stack tracing code
to know that the _Gscan bit was acquired by the caller, so for
simplicity it assumes that it was the caller that acquired the scan bit.
In all other cases however, we can check for ownership precisely. At the
very least, this check is sufficient to catch the issue this change is
fixing.

To make sure this debug check doesn't bitrot, it's always enabled during
trace testing. This new mode has actually caught a few other issues
already, so this change fixes them.

One issue that this debug mode caught was that it's not safe to take a
stack trace of a _Gwaiting goroutine that's being unparked.

Another much bigger issue this debug mode caught was the fact that the
execution tracer could try to take a stack trace of a G that was in
_Gwaiting solely to avoid a deadlock in the GC. The execution tracer
already has a partial list of these cases since they're modeled as the
goroutine just executing as normal in the tracer, but this change takes
the list and makes it more formal. In this specific case, we now prevent
the GC from shrinking the stacks of goroutines in this state if tracing
is enabled. The stack traces from these scenarios are too useful to
discard, but there is indeed a race here between the tracer and any
attempt to shrink the stack by the GC.

Change-Id: I019850dabc8cede202fd6dcc0a4b1f16764209fb
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-linux-amd64-longtest-race
Reviewed-on: https://go-review.googlesource.com/c/go/+/573155
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
2024-04-05 20:50:21 +00:00
qmuntal
1f354a60ff runtime: don't call lockOSThread for every syscall call on Windows
Windows syscall.SyscallN currently calls lockOSThread for every syscall.
This can be expensive and produce unnecessary context switches,
especially when the syscall is called frequently under high contention.

The lockOSThread was necessary to ensure that cgocall wouldn't
reschedule the goroutine to a different M, as the syscall return values
are reported back in the M struct.

This CL instructs cgocall to copy the syscall return values into the
M that will see the caller on return, so the caller no longer needs
to call lockOSThread.

Updates #58336.

Cq-Include-Trybots: luci.golang.try:gotip-windows-arm64,gotip-windows-amd64-longtest
Change-Id: If6644fd111dbacab74e7dcee2afa18ca146735da
Reviewed-on: https://go-review.googlesource.com/c/go/+/562915
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-03-26 03:12:13 +00:00
Andy Pan
4c2b1e0feb runtime: migrate internal/atomic to internal/runtime
For #65355

Change-Id: I65dd090fb99de9b231af2112c5ccb0eb635db2be
Reviewed-on: https://go-review.googlesource.com/c/go/+/560155
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ibrahim Bazoka <ibrahimbazoka729@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
2024-03-25 19:53:03 +00:00
Russ Cox
6135651867 runtime: clean up timer state
The timers had evolved to the point where the state was stored as follows:

	if timer in heap:
	    state has timerHeaped set
	    if heap timer is stale:
	        heap deadline in t.when
	        real deadline in t.nextWhen
	        state has timerNextWhen set
	    else:
	        real deadline in t.when
	        t.nextWhen unset
	else:
	    real deadline in t.when
	    t.nextWhen unset

That made it hard to find the real deadline and just hard to think about everything.
The new state is:

	real deadline in t.when (always)
	if timer in heap:
	    state has timerHeaped set
	    heap deadline in t.whenHeap
	    if heap timer is stale:
	        state has timerModified set
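
In Go terms, the new layout amounts to something like the following sketch (the constant and field names come from the message; everything else is assumed):

	// Sketch only.
	type timerSketch struct {
		when     int64  // real deadline, always valid
		whenHeap int64  // deadline the heap is sorted by, if timerHeaped
		state    uint32 // timerHeaped and timerModified bits
	}

	const (
		timerHeaped   uint32 = 1 << iota // timer is on some P's heap
		timerModified                    // whenHeap is stale; t.when is real
	)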

Separately, the 'state' word itself was being used as a lock
and state bits because the code started with CAS loops,
which we abstracted into the lock/unlock methods step by step.
At this point, we can switch to a real lock, making sure to
publish the one boolean needed by timers fast paths
at each unlock.

All this simplifies various logic considerably.

Change-Id: I35766204f7a26d999206bd56cc0db60ad1b17cbe
Reviewed-on: https://go-review.googlesource.com/c/go/+/570335
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-03-13 17:06:22 +00:00
Russ Cox
3c39b2ed11 runtime: avoid pp.timers.lock in updateTimerPMask
The comment in updateTimerPMask is wrong. It says:

	// Looks like there are no timers, however another P
	// may be adding one at this very moment.
	// Take the lock to synchronize.

This was my incorrect simplification of the original comment
from CL 264477 when I was renaming all the things it mentioned:

	// Looks like there are no timers, however another P may transiently
	// decrement numTimers when handling a timerModified timer in
	// checkTimers. We must take timersLock to serialize with these changes.

updateTimerPMask is being called by pidleput, so the P in question
is not in use. And other P's cannot add to this P.
As the original comment more precisely noted, the problem was
that other P's might be calling timers.check, which updates ts.len
occasionally while ts is locked, and one of those updates might
"leak" an ephemeral len==0 even when the heap is not going to
be empty when the P is finally unlocked. The lock/unlock in
updateTimerPMask synchronizes to avoid that. But this defeats
most of the purpose of using ts.len in the first place.

Instead of requiring that synchronization, we can arrange that
ts.len only ever shows a "publishable" length, meaning the len(ts.heap)
we leave behind during ts.unlock.
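
A sketch of that "publishable length" rule, using standard-library types in place of the runtime's primitives:

	import (
		"sync"
		"sync/atomic"
	)

	// Sketch only: the atomic length is written exactly once, at unlock,
	// so lock-free readers never observe an ephemeral mid-operation zero.
	type timersSketch struct {
		mu   sync.Mutex    // stand-in for the runtime mutex
		heap []int64
		n    atomic.Uint32 // published length
	}

	func (ts *timersSketch) unlock() {
		ts.n.Store(uint32(len(ts.heap))) // publish what we leave behind
		ts.mu.Unlock()
	}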

Having done that, updateTimerPMask can be inlined into pidleput.

The big comment on updateTimerPMask explaining how timerpMask
works is better placed as the doc comment for timerpMask itself,
so move it there.

Change-Id: I5442c9bb7f1473b5fd37c43165429d087012e73f
Reviewed-on: https://go-review.googlesource.com/c/go/+/568336
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
2024-03-08 22:14:44 +00:00
Russ Cox
adc575e64c runtime: move per-P timers state into its own struct
Continuing conversion from C to Go, introduce type timers
encapsulating all timer heap state, with methods for operations.
This should at least be easier to think about, instead of having
these fields strewn through the P struct. It should also be easier
to test.

I am skeptical about the pair of atomic int64 deadlines:
I think there are missed wakeups lurking.
Having the code in an abstracted API should make it easier
to reason through and fix if needed.

[This is one CL in a refactoring stack making very small changes
in each step, so that any subtle bugs that we miss can be more
easily pinpointed to a small change.]

Change-Id: If5ea3e0b946ca14076f44c85cbb4feb9eddb4f95
Reviewed-on: https://go-review.googlesource.com/c/go/+/564132
Reviewed-by: Austin Clements <austin@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
2024-02-29 18:51:47 +00:00
Keith Randall
e0ee5b1afa cmd/compile: move runtime.itab to internal/abi.ITab
Change-Id: I44293452764dc4bc4de8d386153c6402a9cbe409
Reviewed-on: https://go-review.googlesource.com/c/go/+/549435
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2024-02-08 01:38:11 +00:00
John Howard
6db1102605 pagetrace: fix build when experiment is on
Due to a recent change, this experiment does not compile at all. This
change simply fixes the build by passing in the new required parameter.

Change-Id: Idce0e72fa436a7acf4923717913deb3a37847fe2
Reviewed-on: https://go-review.googlesource.com/c/go/+/551415
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-01-03 21:01:56 +00:00
Russ Cox
a9c9cc07ac iter, runtime: add coroutine support
The exported API is only available with GOEXPERIMENT=rangefunc.
This will let Go 1.22 users who want to experiment with rangefuncs
access an efficient implementation of iter.Pull and iter.Pull2.

For #61897.

Change-Id: I6ef5fa8f117567efe4029b7b8b0f4d9b85697fb7
Reviewed-on: https://go-review.googlesource.com/c/go/+/543319
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-12-06 21:33:59 +00:00
Russ Cox
c29444ef39 math/rand, math/rand/v2: use ChaCha8 for global rand
Move ChaCha8 code into internal/chacha8rand and use it to implement
runtime.rand, which is used for the unseeded global source for
both math/rand and math/rand/v2. This also affects the calculation of
the start point for iteration over very very large maps (when the
32-bit fastrand is not big enough).

The benefit is that misuse of the global random number generators
in math/rand and math/rand/v2 in contexts where non-predictable
randomness is important for security reasons is no longer a
security problem, removing a common mistake among programmers
who are unaware of the different kinds of randomness.

The cost is an extra 304 bytes per thread stored in the m struct
plus 2-3ns more per random uint64 due to the more sophisticated
algorithm. Using PCG looks like it would cost about the same,
although I haven't benchmarked that.

Before this, the math/rand and math/rand/v2 global generator
was wyrand (https://github.com/wangyi-fudan/wyhash).
For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson
ALFG was justifiable, since the latter was not any better.
But for math/rand/v2, the global generator really should be
at least as good as one of the well-studied, specific algorithms
provided directly by the package, and it's not.

(Wyrand is still reasonable for scheduling and cache decisions.)

Good randomness does have a cost: about twice wyrand.

Also rationalize the various runtime rand references.
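
For illustration (not part of this CL's diff), the same algorithm is reachable explicitly in math/rand/v2 when a seeded, reproducible stream is wanted:

	package main

	import (
		"fmt"
		"math/rand/v2"
	)

	func main() {
		fmt.Println(rand.Uint64()) // global generator, now ChaCha8-backed

		var seed [32]byte
		copy(seed[:], "an arbitrary 32-byte seed value!")
		r := rand.New(rand.NewChaCha8(seed))
		fmt.Println(r.Uint64()) // reproducible for a fixed seed
	}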

goos: linux
goarch: amd64
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                        │ bbb48afeb7.amd64 │           5cf807d1ea.amd64           │
                        │      sec/op      │    sec/op     vs base                │
ChaCha8-32                     1.862n ± 2%    1.861n ± 2%        ~ (p=0.825 n=20)
PCG_DXSM-32                    1.471n ± 1%    1.460n ± 2%        ~ (p=0.153 n=20)
SourceUint64-32                1.636n ± 2%    1.582n ± 1%   -3.30% (p=0.000 n=20)
GlobalInt64-32                 2.087n ± 1%    3.663n ± 1%  +75.54% (p=0.000 n=20)
GlobalInt64Parallel-32        0.1042n ± 1%   0.2026n ± 1%  +94.48% (p=0.000 n=20)
GlobalUint64-32                2.263n ± 2%    3.724n ± 1%  +64.57% (p=0.000 n=20)
GlobalUint64Parallel-32       0.1019n ± 1%   0.1973n ± 1%  +93.67% (p=0.000 n=20)
Int64-32                       1.771n ± 1%    1.774n ± 1%        ~ (p=0.449 n=20)
Uint64-32                      1.863n ± 2%    1.866n ± 1%        ~ (p=0.364 n=20)
GlobalIntN1000-32              3.134n ± 3%    4.730n ± 2%  +50.95% (p=0.000 n=20)
IntN1000-32                    2.489n ± 1%    2.489n ± 1%        ~ (p=0.683 n=20)
Int64N1000-32                  2.521n ± 1%    2.516n ± 1%        ~ (p=0.394 n=20)
Int64N1e8-32                   2.479n ± 1%    2.478n ± 2%        ~ (p=0.743 n=20)
Int64N1e9-32                   2.530n ± 2%    2.514n ± 2%        ~ (p=0.193 n=20)
Int64N2e9-32                   2.501n ± 1%    2.494n ± 1%        ~ (p=0.616 n=20)
Int64N1e18-32                  3.227n ± 1%    3.205n ± 1%        ~ (p=0.101 n=20)
Int64N2e18-32                  3.647n ± 1%    3.599n ± 1%        ~ (p=0.019 n=20)
Int64N4e18-32                  5.135n ± 1%    5.069n ± 2%        ~ (p=0.034 n=20)
Int32N1000-32                  2.657n ± 1%    2.637n ± 1%        ~ (p=0.180 n=20)
Int32N1e8-32                   2.636n ± 1%    2.636n ± 1%        ~ (p=0.763 n=20)
Int32N1e9-32                   2.660n ± 2%    2.638n ± 1%        ~ (p=0.358 n=20)
Int32N2e9-32                   2.662n ± 2%    2.618n ± 2%        ~ (p=0.064 n=20)
Float32-32                     2.272n ± 2%    2.239n ± 2%        ~ (p=0.194 n=20)
Float64-32                     2.272n ± 1%    2.286n ± 2%        ~ (p=0.763 n=20)
ExpFloat64-32                  3.762n ± 1%    3.744n ± 1%        ~ (p=0.171 n=20)
NormFloat64-32                 3.706n ± 1%    3.655n ± 2%        ~ (p=0.066 n=20)
Perm3-32                       32.93n ± 3%    34.62n ± 1%   +5.13% (p=0.000 n=20)
Perm30-32                      202.9n ± 1%    204.0n ± 1%        ~ (p=0.482 n=20)
Perm30ViaShuffle-32            115.0n ± 1%    114.9n ± 1%        ~ (p=0.358 n=20)
ShuffleOverhead-32             112.8n ± 1%    112.7n ± 1%        ~ (p=0.692 n=20)
Concurrent-32                  2.107n ± 0%    3.725n ± 1%  +76.75% (p=0.000 n=20)

goos: darwin
goarch: arm64
pkg: math/rand/v2
                       │ bbb48afeb7.arm64 │           5cf807d1ea.arm64            │
                       │      sec/op      │    sec/op     vs base                 │
ChaCha8-8                     2.480n ± 0%    2.429n ± 0%    -2.04% (p=0.000 n=20)
PCG_DXSM-8                    2.531n ± 0%    2.530n ± 0%         ~ (p=0.877 n=20)
SourceUint64-8                2.534n ± 0%    2.533n ± 0%         ~ (p=0.732 n=20)
GlobalInt64-8                 2.172n ± 1%    4.794n ± 0%  +120.67% (p=0.000 n=20)
GlobalInt64Parallel-8        0.4320n ± 0%   0.9605n ± 0%  +122.32% (p=0.000 n=20)
GlobalUint64-8                2.182n ± 0%    4.770n ± 0%  +118.58% (p=0.000 n=20)
GlobalUint64Parallel-8       0.4307n ± 0%   0.9583n ± 0%  +122.51% (p=0.000 n=20)
Int64-8                       4.107n ± 0%    4.104n ± 0%         ~ (p=0.416 n=20)
Uint64-8                      4.080n ± 0%    4.080n ± 0%         ~ (p=0.052 n=20)
GlobalIntN1000-8              2.814n ± 2%    5.643n ± 0%  +100.50% (p=0.000 n=20)
IntN1000-8                    4.141n ± 0%    4.139n ± 0%         ~ (p=0.140 n=20)
Int64N1000-8                  4.140n ± 0%    4.140n ± 0%         ~ (p=0.313 n=20)
Int64N1e8-8                   4.140n ± 0%    4.139n ± 0%         ~ (p=0.103 n=20)
Int64N1e9-8                   4.139n ± 0%    4.140n ± 0%         ~ (p=0.761 n=20)
Int64N2e9-8                   4.140n ± 0%    4.140n ± 0%         ~ (p=0.636 n=20)
Int64N1e18-8                  5.266n ± 0%    5.326n ± 1%    +1.14% (p=0.001 n=20)
Int64N2e18-8                  6.052n ± 0%    6.167n ± 0%    +1.90% (p=0.000 n=20)
Int64N4e18-8                  8.826n ± 0%    9.051n ± 0%    +2.55% (p=0.000 n=20)
Int32N1000-8                  4.127n ± 0%    4.132n ± 0%    +0.12% (p=0.000 n=20)
Int32N1e8-8                   4.126n ± 0%    4.131n ± 0%    +0.12% (p=0.000 n=20)
Int32N1e9-8                   4.127n ± 0%    4.132n ± 0%    +0.12% (p=0.000 n=20)
Int32N2e9-8                   4.132n ± 0%    4.131n ± 0%         ~ (p=0.017 n=20)
Float32-8                     4.109n ± 0%    4.105n ± 0%         ~ (p=0.379 n=20)
Float64-8                     4.107n ± 0%    4.106n ± 0%         ~ (p=0.867 n=20)
ExpFloat64-8                  5.339n ± 0%    5.383n ± 0%    +0.82% (p=0.000 n=20)
NormFloat64-8                 5.735n ± 0%    5.737n ± 1%         ~ (p=0.856 n=20)
Perm3-8                       26.65n ± 0%    26.80n ± 1%    +0.58% (p=0.000 n=20)
Perm30-8                      194.8n ± 1%    197.0n ± 0%    +1.18% (p=0.000 n=20)
Perm30ViaShuffle-8            156.6n ± 0%    157.6n ± 1%    +0.61% (p=0.000 n=20)
ShuffleOverhead-8             124.9n ± 0%    125.5n ± 0%    +0.52% (p=0.000 n=20)
Concurrent-8                  2.434n ± 3%    5.066n ± 0%  +108.09% (p=0.000 n=20)

goos: linux
goarch: 386
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                        │ bbb48afeb7.386 │            5cf807d1ea.386             │
                        │     sec/op     │    sec/op     vs base                 │
ChaCha8-32                  11.295n ± 1%    4.748n ± 2%   -57.96% (p=0.000 n=20)
PCG_DXSM-32                  7.693n ± 1%    7.738n ± 2%         ~ (p=0.542 n=20)
SourceUint64-32              7.658n ± 2%    7.622n ± 2%         ~ (p=0.344 n=20)
GlobalInt64-32               3.473n ± 2%    7.526n ± 2%  +116.73% (p=0.000 n=20)
GlobalInt64Parallel-32      0.3198n ± 0%   0.5444n ± 0%   +70.22% (p=0.000 n=20)
GlobalUint64-32              3.612n ± 0%    7.575n ± 1%  +109.69% (p=0.000 n=20)
GlobalUint64Parallel-32     0.3168n ± 0%   0.5403n ± 0%   +70.51% (p=0.000 n=20)
Int64-32                     7.673n ± 2%    7.789n ± 1%         ~ (p=0.122 n=20)
Uint64-32                    7.773n ± 1%    7.827n ± 2%         ~ (p=0.920 n=20)
GlobalIntN1000-32            6.268n ± 1%    9.581n ± 1%   +52.87% (p=0.000 n=20)
IntN1000-32                  10.33n ± 2%    10.45n ± 1%         ~ (p=0.233 n=20)
Int64N1000-32                10.98n ± 2%    11.01n ± 1%         ~ (p=0.401 n=20)
Int64N1e8-32                 11.19n ± 2%    10.97n ± 1%         ~ (p=0.033 n=20)
Int64N1e9-32                 11.06n ± 1%    11.08n ± 1%         ~ (p=0.498 n=20)
Int64N2e9-32                 11.10n ± 1%    11.01n ± 2%         ~ (p=0.995 n=20)
Int64N1e18-32                15.23n ± 2%    15.04n ± 1%         ~ (p=0.973 n=20)
Int64N2e18-32                15.89n ± 1%    15.85n ± 1%         ~ (p=0.409 n=20)
Int64N4e18-32                18.96n ± 2%    19.34n ± 2%         ~ (p=0.048 n=20)
Int32N1000-32                10.46n ± 2%    10.44n ± 2%         ~ (p=0.480 n=20)
Int32N1e8-32                 10.46n ± 2%    10.49n ± 2%         ~ (p=0.951 n=20)
Int32N1e9-32                 10.28n ± 2%    10.26n ± 1%         ~ (p=0.431 n=20)
Int32N2e9-32                 10.50n ± 2%    10.44n ± 2%         ~ (p=0.249 n=20)
Float32-32                   13.80n ± 2%    13.80n ± 2%         ~ (p=0.751 n=20)
Float64-32                   23.55n ± 2%    23.87n ± 0%         ~ (p=0.408 n=20)
ExpFloat64-32                15.36n ± 1%    15.29n ± 2%         ~ (p=0.316 n=20)
NormFloat64-32               13.57n ± 1%    13.79n ± 1%    +1.66% (p=0.005 n=20)
Perm3-32                     45.70n ± 2%    46.99n ± 2%    +2.81% (p=0.001 n=20)
Perm30-32                    399.0n ± 1%    403.8n ± 1%    +1.19% (p=0.006 n=20)
Perm30ViaShuffle-32          349.0n ± 1%    350.4n ± 1%         ~ (p=0.909 n=20)
ShuffleOverhead-32           322.3n ± 1%    323.8n ± 1%         ~ (p=0.410 n=20)
Concurrent-32                3.331n ± 1%    7.312n ± 1%  +119.50% (p=0.000 n=20)

For #61716.

Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2
Reviewed-on: https://go-review.googlesource.com/c/go/+/516860
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-12-05 20:34:30 +00:00
cui fliter
0d018b49e3 all: fix field names
Change-Id: I3ad7a50707486ebdbbd676b3581df6e3ed0fd3a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/543476
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: shuang cui <imcusg@gmail.com>
2023-11-27 15:48:25 +00:00
Rhys Hiltner
450ecbe905 runtime: profile contended lock calls
Add runtime-internal locks to the mutex contention profile.

Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.

Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.

Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.
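
For illustration, the contention data flows through the existing mutex-profile APIs; a typical way to look at it (standard library calls, not part of this diff):

	package main

	import (
		"os"
		"runtime"
		"runtime/pprof"
	)

	func main() {
		// Run with GODEBUG=profileruntimelocks=1 to get real call stacks
		// for runtime-internal locks, per this change.
		runtime.SetMutexProfileFraction(1) // sample every contention event
		defer runtime.SetMutexProfileFraction(0)

		// ... contended workload ...

		pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
	}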

Fixes #57071

This is a roll-forward of https://go.dev/cl/528657, which was reverted
in https://go.dev/cl/543660

Reason for revert: de-flakes tests (reduces dependence on fine-grained
timers, correctly identifies contention on big-endian futex locks,
attempts to measure contention in the semaphore implementation but only
uses that secondary measurement to finish the test early, skips tests on
single-processor systems)

Change-Id: I31389f24283d85e46ad9ba8d4f514cb9add8dfb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/544195
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
2023-11-21 21:02:20 +00:00
Ludi Rehak
ee6b34797b all: add floating point option for ARM targets
This change introduces new options to set the floating point
mode on ARM targets. The GOARM version number can optionally be
followed by ',hardfloat' or ',softfloat' to select whether to
use hardware instructions or software emulation for floating
point computations, respectively. For example,
GOARM=7,softfloat.

Previously, software floating point support was limited to
GOARM=5. With these options, software floating point is now
extended to all ARM versions, including GOARM=6 and 7. This
change also extends hardware floating point to GOARM=5.

GOARM=5 defaults to softfloat and GOARM=6 and 7 default to
hardfloat.

For #61588

Change-Id: I23dc86fbd0733b262004a2ed001e1032cf371e94
Reviewed-on: https://go-review.googlesource.com/c/go/+/514907
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
2023-11-20 17:19:36 +00:00
Matthew Dempsky
468bc94188 Revert "runtime: profile contended lock calls"
This reverts commit go.dev/cl/528657.

Reason for revert: broke a lot of builders.

Change-Id: I70c33062020e997c4df67b3eaa2e886cf0da961e
Reviewed-on: https://go-review.googlesource.com/c/go/+/543660
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-20 13:20:29 +00:00
Rhys Hiltner
0b31a46f1f runtime: profile contended lock calls
Add runtime-internal locks to the mutex contention profile.

Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.

Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.

Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.

Fixes #57071

Change-Id: I3eee0ccbfc20f333b56f20d8725dfd7f3a526b41
Reviewed-on: https://go-review.googlesource.com/c/go/+/528657
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Than McIntosh <thanm@google.com>
2023-11-17 23:04:58 +00:00
Michael Pratt
6ef98ac87c runtime/metrics: add STW stopping and total time metrics
This CL adds four new time histogram metrics:

/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

The "stopping" metrics measure the time taken to start a stop-the-world
pause. i.e., how long it takes stopTheWorldWithSema to stop all Ps.
This can be used to detect STW struggling to preempt Ps.

The "total" metrics measure the total duration of a stop-the-world
pause, from starting to stop-the-world until the world is started again.
This includes the time spent in the "start" phase.

The "gc" metrics are used for GC-related STW pauses. The "other" metrics
are used for all other STW pauses.

All of these metrics start timing in stopTheWorldWithSema only after
successfully acquiring sched.lock, thus excluding lock contention on
sched.lock. The reasoning behind this is that while waiting on
sched.lock the world is not stopped at all (all other Ps can run), so
the impact of this contention is primarily limited to the goroutine
attempting to stop-the-world. Additionally, we already have some
visibility into sched.lock contention via contention profiles (#57071).

/sched/pauses/total/gc:seconds is conceptually equivalent to
/gc/pauses:seconds, so the latter is marked as deprecated and returns
the same histogram as the former.

In the implementation, there are a few minor differences:

* For both mark and sweep termination stops, /gc/pauses:seconds started
  timing prior to calling startTheWorldWithSema, thus including lock
  contention.

These details are minor enough that I do not believe the slight change
in reporting will matter. For mark termination stops, moving the timing
stop into startTheWorldWithSema does have the side effect of requiring
that other GC metric calculations move outside of the STW, as they
depend on the same end time.
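
Once landed, the new histograms are readable with the standard runtime/metrics API, for example:

	package main

	import (
		"fmt"
		"runtime/metrics"
	)

	func main() {
		samples := []metrics.Sample{
			{Name: "/sched/pauses/stopping/gc:seconds"},
			{Name: "/sched/pauses/total/gc:seconds"},
		}
		metrics.Read(samples)
		for _, s := range samples {
			h := s.Value.Float64Histogram()
			fmt.Println(s.Name, "bucket count:", len(h.Buckets))
		}
	}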

Fixes #63340

Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-15 16:49:45 +00:00
Michael Anthony Knyszek
43ffe2a892 runtime: add execution tracer v2 behind GOEXPERIMENT=exectracer2
This change mostly implements the design described in #60773 and
includes a new scalable parser for the new trace format, available in
internal/trace/v2. I'll leave this commit message short because this is
clearly an enormous CL with a lot of detail.

This change does not hook up the new tracer into cmd/trace yet. A
follow-up CL will handle that.

For #60773.

Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-linux-amd64-longtest-race
Change-Id: I5d2aca2cc07580ed3c76a9813ac48ec96b157de0
Reviewed-on: https://go-review.googlesource.com/c/go/+/494187
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-10 15:49:59 +00:00
Michael Anthony Knyszek
ff7cf2d4cd runtime: make it harder to introduce deadlocks with forEachP
Currently any thread that tries to get the attention of all Ps (e.g.
stopTheWorldWithSema and forEachP) ends up in a non-preemptible state
waiting to preempt another thread. Thing is, that other thread might
also be in a non-preemptible state, trying to preempt the first thread,
resulting in a deadlock.

This is a general problem, but in practice it only boils down to one
specific scenario: a thread in GC is blocked trying to preempt a
goroutine to scan its stack while that goroutine is blocked in a
non-preemptible state to get the attention of all Ps.

There's currently a hack in a few places in the runtime to move the
calling goroutine into _Gwaiting before it goes into a non-preemptible
state to preempt other threads. This lets the GC scan its stack because
the goroutine is trivially preemptible. The only restriction is that
forEachP and stopTheWorldWithSema absolutely cannot reference the
calling goroutine's stack. This is generally not necessary, so things
are good.

Anyway, to avoid exposing the details of this hack, this change creates
a safer wrapper around forEachP (and then renames it to forEachP and the
existing one to forEachPInternal) that performs the goroutine status
change, just like stopTheWorld does. We're going to need to use this
hack with forEachP in the new tracer, so this avoids propagating the
hack further and leaves it as an implementation detail.
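
A heavily simplified sketch of that wrapper shape (real signatures, status constants, and locking differ):

	// Sketch only: the safe wrapper parks the calling goroutine in a
	// waiting state so the GC can scan (or shrink) its stack while it
	// blocks preempting every P, then restores it afterwards.
	func forEachP(fn func(p int)) {
		setSelfWaiting() // trivially preemptible; fn must not touch our stack
		forEachPInternal(fn)
		setSelfRunning()
	}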

Change-Id: I51f02e8d8e0a3172334d23787e31abefb8a129ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/533455
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2023-11-09 22:35:07 +00:00
bestgopher
416bc85f61 runtime: fix comments for itab
The function WriteTabs has been renamed WritePluginTable.

Change-Id: I5f04b99b91498c41121f898cb7774334a730d7b4
GitHub-Last-Rev: c98ab3f872
GitHub-Pull-Request: golang/go#63595
Reviewed-on: https://go-review.googlesource.com/c/go/+/535996
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2023-10-18 00:24:39 +00:00
Michael Pratt
2b462646ed runtime: set stackguard1 on extra M g0
[This is an unmodified redo of CL 527056.]

Standard Ms set g0.stackguard1 to the same value as stackguard0 in
mstart0. For consistency, extra Ms should do the same for their g0. Do
this in needm -> callbackUpdateSystemStack.

Background: getg().stackguard1 is used as the stack guard for the stack
growth prologue in functions marked //go:systemstack [1]. User Gs set
stackguard1 to ^uintptr(0) so that the check always fails, calling
morestackc, which throws to report a //go:systemstack function call on a
user stack.

g0 setting stackguard1 is unnecessary for this functionality. 0 would be
sufficient, as g0 is always allowed to call //go:systemstack functions.
However, since we have the check anyway, setting stackguard1 to the
actual stack bound is useful to detect actual stack overflows on g0
(though morestackc doesn't detect this case and would report a
misleading message about user stacks).

[1] cmd/internal/obj calls //go:systemstack functions AttrCFunc. This is
a holdover from when the runtime contained actual C functions. But since
CL 2275, it has simply meant "pretend this is a C function, which would
thus need to use the system stack". Hence the name morestackc. At this
point, this terminology is pretty far removed from reality and should
probably be updated to something more intuitive.

Change-Id: If315677217354465fbbfbd0d406d79be20db0cc3
Reviewed-on: https://go-review.googlesource.com/c/go/+/527716
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-09-19 19:32:22 +00:00
Michael Pratt
c0c4a59816 Revert "runtime: set stackguard1 on extra M g0"
This reverts CL 527056.

CL 525455 breaks darwin, alpine, and android. This CL must be reverted
in order to revert that CL.

For #62440.

Change-Id: I4e1b16e384b475a605e0214ca36c918d50faa22c
Reviewed-on: https://go-review.googlesource.com/c/go/+/527316
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-09-11 16:35:54 +00:00
Michael Pratt
9af74e711a runtime: set stackguard1 on extra M g0
Standard Ms set g0.stackguard1 to the same value as stackguard0 in
mstart0. For consistency, extra Ms should do the same for their g0. Do
this in needm -> callbackUpdateSystemStack.

Background: getg().stackguard1 is used as the stack guard for the stack
growth prologue in functions marked //go:systemstack [1]. User Gs set
stackguard1 to ^uintptr(0) so that the check always fails, calling
morestackc, which throws to report a //go:systemstack function call on a
user stack.

g0 setting stackguard1 is unnecessary for this functionality. 0 would be
sufficient, as g0 is always allowed to call //go:systemstack functions.
However, since we have the check anyway, setting stackguard1 to the
actual stack bound is useful to detect actual stack overflows on g0
(though morestackc doesn't detect this case and would report a
misleading message about user stacks).

[1] cmd/internal/obj calls //go:systemstack functions AttrCFunc. This is
a holdover from when the runtime contained actual C functions. But since
CL 2275, it has simply meant "pretend this is a C function, which would
thus need to use the system stack". Hence the name morestackc. At this
point, this terminology is pretty far removed from reality and should
probably be updated to something more intuitive.

Change-Id: I8d0e5628ce31ac6a189a7d7a4124be85aef89862
Reviewed-on: https://go-review.googlesource.com/c/go/+/527056
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-09-11 14:46:55 +00:00
doujiang24
24b9ef1a73 cmd/cgo: add #cgo noescape/nocallback annotations
When passing pointers to Go objects from Go to C, the cgo command generates _Cgo_use(pN) for the unsafe.Pointer-typed arguments, so that the Go compiler escapes these objects to the heap.

Since the C function may call back into Go, the Go stack might grow or shrink, which means any pointers the C function holds into the Go stack would become invalid.

After adding the #cgo noescape annotation for a C function, the cgo command won't generate _Cgo_use(pN), and the Go compiler won't force the objects to escape to the heap.

After adding the #cgo nocallback annotation for a C function, the C function is declared to never call back into Go; if it does call back into Go, the Go process will crash.
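
A minimal usage sketch of the two annotations (the C function here is illustrative):

	package main

	/*
	#cgo noescape c_fill
	#cgo nocallback c_fill
	static void c_fill(char *buf, int n) {} // must not call back into Go
	*/
	import "C"

	import "unsafe"

	func main() {
		buf := make([]byte, 64)
		// noescape: buf need not be forced to the heap.
		// nocallback: a callback into Go from c_fill would crash.
		C.c_fill((*C.char)(unsafe.Pointer(&buf[0])), C.int(len(buf)))
	}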

Fixes #56378

Change-Id: Ifdca070584e0d349c7b12276270e50089e481f7a
GitHub-Last-Rev: f1a17b08b0
GitHub-Pull-Request: golang/go#60399
Reviewed-on: https://go-review.googlesource.com/c/go/+/497837
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-08-25 17:39:23 +00:00
Austin Clements
4a8373c553 runtime: move pcvalue cache to M
Currently, the pcvalue cache is stack allocated for each operation
that needs to look up a lot of pcvalues. It's not always clear where
to put it, a lot of the time we just pass a nil cache, it doesn't get
reused across operations, and we put a surprising amount of effort
into threading these caches around.

This CL moves it to the M, where it can be long-lived and used by all
pcvalue lookups, and we don't have to carefully thread it across
operations.

This is a re-roll of CL 515276 with a fix for reentrant use of the
pcvalue cache from the signal handler.
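
A sketch of the per-M cache shape plus the reentrancy guard mentioned above (field names are assumptions):

	// Sketch only: long-lived per-M cache; a nonzero inUse lets a signal
	// handler that lands mid-lookup skip the cache rather than corrupt it.
	type pcvalueCacheEnt struct {
		targetpc uintptr
		off      uint32
		val      int32
	}

	type pcvalueCache struct {
		entries [2][8]pcvalueCacheEnt
		inUse   int
	}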

Change-Id: Id94c0c0fb3004d1fda1b196790eebd949c621f28
Reviewed-on: https://go-review.googlesource.com/c/go/+/520063
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Auto-Submit: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-08-21 21:05:54 +00:00
Andy Pan
a7c3de7052 runtime: document maxStack and m.createstack in more details
Change-Id: If93b6cfa5a598a5f4101c879a0cd88a194e4a6aa
Reviewed-on: https://go-review.googlesource.com/c/go/+/518116
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Andy Pan <panjf2000@gmail.com>
2023-08-21 17:10:00 +00:00
Russ Cox
1c00354013 runtime: change mutex profile to count every blocked goroutine
The pprof mutex profile was meant to match the Google C++ (now Abseil)
mutex profiler, originally designed and implemented by Mike Burrows.
When we worked on the Go version, pjw and I missed that C++ counts the
time each thread is blocked, even if multiple threads are blocked on a
mutex. That is, if 100 threads are blocked on the same mutex for the
same 10ms, that still counts as 1000ms of contention in C++. In Go, to
date, /debug/pprof/mutex has counted that as only 10ms of contention.
If 100 goroutines are blocked on one mutex and only 1 goroutine is
blocked on another mutex, we probably do want to see the first mutex
as being more contended, so the Abseil approach is the more useful one.

This CL adopts "contention scales with number of goroutines blocked",
to better match Abseil [1]. However, it still makes sure to attribute the
time to the unlock that caused the backup, not subsequent innocent
unlocks that were affected by the congestion. In this way it still gives
more accurate profiles than Abseil does.

[1] https://github.com/abseil/abseil-cpp/blob/lts_2023_01_25/absl/synchronization/mutex.cc#L2390

Fixes #61015.

Change-Id: I7eb9e706867ffa8c0abb5b26a1b448f6eba49331
Reviewed-on: https://go-review.googlesource.com/c/go/+/506415
Run-TryBot: Russ Cox <rsc@golang.org>
Auto-Submit: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-08-17 01:30:31 +00:00
Russ Cox
390763aed8 cmd/compile, runtime: make room for rangefunc defers
This is subtle, and the compiler and runtime must be in sync.
It is easier to develop the rest of the changes (especially when using
toolstash save/restore) if this change is separated out and done first.

Preparation for proposal #61405. The actual logic in the
compiler will be guarded by a GOEXPERIMENT, but it is
easier not to have GOEXPERIMENT-specific data structures
in the runtime, so just add the field unconditionally.

Change-Id: I7ec7049b99ae98bf0db365d42966baeec56e3774
Reviewed-on: https://go-review.googlesource.com/c/go/+/510539
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2023-08-16 16:20:06 +00:00
Austin Clements
26e0660811 Revert "runtime: move pcvalue cache to M"
This reverts CL 515276.

This broke the longtest builders. For example:
https://build.golang.org/log/351a1a198a6b843b1881c1fb6cdef51f3e413e8b

Change-Id: Ie79067464fe8e226da31721cf127f3efb6011452
Reviewed-on: https://go-review.googlesource.com/c/go/+/516856
Auto-Submit: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-08-07 22:27:53 +00:00
Austin Clements
d367ec6a0e runtime: move pcvalue cache to M
Currently, the pcvalue cache is stack allocated for each operation
that needs to look up a lot of pcvalues. It's not always clear where
to put it, a lot of the time we just pass a nil cache, it doesn't get
reused across operations, and we put a surprising amount of effort
into threading these caches around.

This CL moves it to the M, where it can be long-lived and used by all
pcvalue lookups, and we don't have to carefully thread it across
operations.

Change-Id: I675e583e0daac887c8ef77a402ba792648d96027
Reviewed-on: https://go-review.googlesource.com/c/go/+/515276
Run-TryBot: Austin Clements <austin@google.com>
Auto-Submit: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2023-08-07 19:31:24 +00:00
Matthew Dempsky
9869699c44 runtime: avoid relying on the unwinder in deferreturn
This CL changes deferreturn so that it never needs to invoke the
unwinder. Instead, in the unusual case that we recover into a frame
with pending open-coded defers, we now save the extra state needed to
find them in g.param.

Change-Id: Ied35f6c1063fee5b6044cc37b2bccd3f90682fe6
Reviewed-on: https://go-review.googlesource.com/c/go/+/515856
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
2023-08-07 18:44:12 +00:00
Matthew Dempsky
bb5974e0cb runtime, cmd/compile: optimize open-coded defers
This CL optimizes open-coded defers in two ways:

1. It modifies local variable sorting to place all open-coded defer
closure slots in order, so that rather than requiring the metadata to
contain each offset individually, we just need a single offset to the
first slot.

2. Because the slots are in ascending order and can be directly
indexed, we can get rid of the count of how many defers are in the
frame. Instead, we just find the top set bit in the active defers
bitmask, and load the corresponding closure.
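
The "top set bit" lookup in (2) is cheap; for illustration, with math/bits:

	package main

	import (
		"fmt"
		"math/bits"
	)

	// topDefer returns the index of the most recently armed open-coded
	// defer: the highest set bit of the active-defers bitmask.
	func topDefer(deferBits uint8) int {
		return bits.Len8(deferBits) - 1 // -1 when no defers are active
	}

	func main() {
		fmt.Println(topDefer(0b0000_0101)) // 2: slot 2 runs first
	}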

Change-Id: I6f912295a492211023a9efe12c94a14f449d86ad
Reviewed-on: https://go-review.googlesource.com/c/go/+/516199
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-08-07 18:05:54 +00:00
Matthew Dempsky
0c2abb3233 runtime, cmd/compile: prune unused _defer fields
These fields are no longer needed since go.dev/cl/513837.

Change-Id: I980fc9db998c293e930094bbb87e8c8f1654e39c
Reviewed-on: https://go-review.googlesource.com/c/go/+/516198
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2023-08-07 16:05:16 +00:00
Matthew Dempsky
9eb1d5317b runtime: refactor defer processing
This CL refactors gopanic, Goexit, and deferreturn to share a common
state machine for processing pending defers. The new state machine
removes a lot of redundant code and does overall less work.

It should also make it easier to implement further optimizations
(e.g., TODOs added in this CL).

Change-Id: I71d3cc8878a6f951d8633505424a191536c8e6b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/513837
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-07-31 16:52:06 +00:00
Michael Anthony Knyszek
0adcc5ace8 runtime: cache inner pinner on P
This change caches the *pinner on the P to pool it and reduce the chance
that a new allocation is made. It also makes the *pinner no longer drop
its refs array on unpin, again to avoid reallocation.

The Pinner benchmark results before and after this CL are attached at
the bottom of the commit message.

Note that these results are biased toward the current change because of
the last two benchmark changes. Reusing the pinner in the benchmark
itself achieves similar performance before this change. The benchmark
results thus basically just confirm that this change does cache the
inner pinner in a useful way. Using the previous benchmarks there's
actually a slight regression from the extra check in the cache, however
the long pole is still setPinned itself.
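
For context, the user-facing API whose per-use allocation this removes:

	package main

	import "runtime"

	func main() {
		v := new(int32)
		var p runtime.Pinner
		p.Pin(v) // with this CL, pinner state comes from a P-local cache
		// ... pass the pointer to C, or otherwise rely on v not moving ...
		p.Unpin() // refs array is retained for reuse rather than dropped
	}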

name                                old time/op    new time/op    delta
PinnerPinUnpinBatch-8                 42.2µs ± 2%    41.5µs ± 1%      ~     (p=0.056 n=5+5)
PinnerPinUnpinBatchDouble-8            367µs ± 1%     350µs ± 1%    -4.67%  (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8              108µs ± 0%     102µs ± 1%    -6.22%  (p=0.008 n=5+5)
PinnerPinUnpin-8                       592ns ± 8%      40ns ± 1%   -93.29%  (p=0.008 n=5+5)
PinnerPinUnpinTiny-8                   693ns ± 9%      39ns ± 1%   -94.31%  (p=0.008 n=5+5)
PinnerPinUnpinDouble-8                 843ns ± 5%     124ns ± 3%   -85.24%  (p=0.008 n=5+5)
PinnerPinUnpinParallel-8              1.11µs ± 5%    0.00µs ± 0%   -99.55%  (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8          1.12µs ± 8%    0.00µs ± 1%   -99.55%  (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8        1.79µs ± 4%    0.58µs ± 6%   -67.36%  (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8              5.78ns ± 0%    5.80ns ± 1%      ~     (p=0.548 n=5+5)
PinnerIsPinnedOnUnpinned-8            4.99ns ± 1%    4.98ns ± 0%      ~     (p=0.841 n=5+5)
PinnerIsPinnedOnPinnedParallel-8      0.71ns ± 0%    0.71ns ± 0%      ~     (p=0.175 n=5+5)
PinnerIsPinnedOnUnpinnedParallel-8    0.67ns ± 1%    0.66ns ± 0%      ~     (p=0.167 n=5+5)

name                                old alloc/op   new alloc/op   delta
PinnerPinUnpinBatch-8                 20.1kB ± 0%    20.0kB ± 0%    -0.32%  (p=0.008 n=5+5)
PinnerPinUnpinBatchDouble-8           52.7kB ± 0%    52.7kB ± 0%    -0.12%  (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8             20.1kB ± 0%    20.0kB ± 0%    -0.32%  (p=0.008 n=5+5)
PinnerPinUnpin-8                       64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinTiny-8                   64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinDouble-8                 64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallel-8               64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8           64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8         64.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8               0.00B          0.00B           ~     (all equal)
PinnerIsPinnedOnUnpinned-8             0.00B          0.00B           ~     (all equal)
PinnerIsPinnedOnPinnedParallel-8       0.00B          0.00B           ~     (all equal)
PinnerIsPinnedOnUnpinnedParallel-8     0.00B          0.00B           ~     (all equal)

name                                old allocs/op  new allocs/op  delta
PinnerPinUnpinBatch-8                   9.00 ± 0%      8.00 ± 0%   -11.11%  (p=0.008 n=5+5)
PinnerPinUnpinBatchDouble-8             11.0 ± 0%      10.0 ± 0%    -9.09%  (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8               9.00 ± 0%      8.00 ± 0%   -11.11%  (p=0.008 n=5+5)
PinnerPinUnpin-8                        1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinTiny-8                    1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinDouble-8                  1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallel-8                1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8            1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8          1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8                0.00           0.00           ~     (all equal)
PinnerIsPinnedOnUnpinned-8              0.00           0.00           ~     (all equal)
PinnerIsPinnedOnPinnedParallel-8        0.00           0.00           ~     (all equal)
PinnerIsPinnedOnUnpinnedParallel-8      0.00           0.00           ~     (all equal)

For #46787.

Change-Id: I0cdfad77b189c425868944a4faeff3d5b97417b9
Reviewed-on: https://go-review.googlesource.com/c/go/+/497615
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ansiwen <ansiwen@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
2023-05-24 16:23:08 +00:00
Michael Anthony Knyszek
7c91e1e568 runtime: replace raw traceEv with traceBlockReason in gopark
This change adds traceBlockReason, which leaks fewer implementation
details of the tracer into the runtime. Currently, gopark is called
with an explicit trace event, which spreads knowledge of trace
internals throughout the runtime.

This change will make it easier to change out the trace implementation.
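
A rough, self-contained sketch of the pattern (the names and mapping
below are illustrative, not the runtime's actual definitions): the
block reason is a small enum that only the tracer converts into a raw
trace event, so gopark callers never mention event IDs.

    package main

    import "fmt"

    // traceEv stands in for a raw trace event ID; after this change
    // it is an internal detail of the tracer.
    type traceEv uint8

    const (
        traceEvGoBlock traceEv = iota
        traceEvGoBlockSend
        traceEvGoBlockRecv
    )

    // traceBlockReason is the semantic reason a goroutine blocks;
    // this is what callers like gopark now pass around.
    type traceBlockReason uint8

    const (
        traceBlockGeneric traceBlockReason = iota
        traceBlockChanSend
        traceBlockChanRecv
    )

    // eventType maps a reason to an event ID, inside the tracer only.
    func (r traceBlockReason) eventType() traceEv {
        switch r {
        case traceBlockChanSend:
            return traceEvGoBlockSend
        case traceBlockChanRecv:
            return traceEvGoBlockRecv
        default:
            return traceEvGoBlock
        }
    }

    func main() {
        // A gopark caller now supplies a reason, not a raw event.
        fmt.Println(traceBlockChanRecv.eventType())
    }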

Change-Id: Id633e1704d2c8838c6abd1214d9695537c4ac7db
Reviewed-on: https://go-review.googlesource.com/c/go/+/494185
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
2023-05-19 20:47:25 +00:00
Cherry Mui
c426c87012 runtime/cgo: store M for C-created thread in pthread key
This reapplies CL 485500, with a fix drafted in CL 492987 incorporated.

CL 485500 is reverted due to #60004 and #60007. #60004 is fixed in
CL 492743. #60007 is fixed in CL 492987 (incorporated in this CL).

[Original CL 485500 description]

This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and
CL 485316 incorporated.

CL 481061, by doujiang24 <doujiang24@gmail.com>, speed up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 482975 is a followup fix to a C declaration in testprogcgo.
CL 485315 is a followup fix for x_cgo_getstackbound on Illumos.
CL 485316 is a followup cleanup for ppc64 assembly.

CL 479915 passed the G to _cgo_getstackbound for direct updates to
gp.stack.lo. A G can be reused on a new thread after the previous thread
exited. This could trigger the C TSAN race detector because it couldn't
see the synchronization in Go (lockextra) preventing the same G from
being used on multiple threads at the same time.

We work around this by passing the address of a stack variable to
_cgo_getstackbound rather than the G. The stack is generally unique per
thread, so TSAN won't see the same address from multiple threads. Even
if stacks are reused across threads by pthread, C TSAN should see the
synchronization in the stack allocator.

A regression test is added to misc/cgo/testsanitizers.

[Original CL 481061 description]

This reapplies CL 392854, with the followup fixes in CL 479255,
CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <doujiang24@gmail.com>, speed up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after
the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak
bug of CL 479915.

[Original CL 392854 description]

In a C thread, invoking a Go function requires acquiring an extra M via needm. But needm and dropm are expensive because of their signal-related syscalls.
So we no longer call dropm when returning to C: the extra M stays bound to the C thread until the thread exits, avoiding a needm/dropm pair on every C-to-Go call.
Instead, we call dropm only when the C thread exits, so the extra M does not leak.

When invoking a Go function from C:
Allocate a pthread key using pthread_key_create, only once per shared object, and register a thread-exit-time destructor.
Then store the g0 of the current M into the thread-specific value of that key, only once per C thread, so that the destructor can put the extra M back onto the extra-M list when the C thread exits.

When returning back to C:
Skip dropm in cgocallback when the pthread key has been created, so that the extra M is reused the next time a Go function is invoked from C.
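
A minimal cgo sketch of this pthread-key pattern (simplified and
hypothetical; the printf stands in for returning the extra M to the
extra-M list):

    package main

    /*
    #include <pthread.h>
    #include <stdio.h>

    static pthread_key_t key;
    static pthread_once_t once = PTHREAD_ONCE_INIT;

    // Runs automatically when a thread that set the key exits; the
    // runtime's analogous destructor returns the thread's extra M
    // to the extra-M list.
    static void destructor(void *val) {
        printf("thread exiting, releasing %p\n", val);
    }

    // Done only once per shared object.
    static void create_key(void) {
        pthread_key_create(&key, destructor);
    }

    static void *thread_body(void *arg) {
        pthread_once(&once, create_key);
        // Done only once per C thread: stash the per-thread state
        // (the g0 of the extra M, in the runtime's case) so the
        // destructor can see it at thread exit.
        pthread_setspecific(key, arg);
        return NULL;
    }

    static void run_thread(void) {
        pthread_t t;
        static int state;
        pthread_create(&t, NULL, thread_body, &state);
        pthread_join(&t, NULL);
    }
    */
    import "C"

    func main() {
        C.run_thread()
    }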

This is purely a performance optimization. The old behavior, in which needm and dropm happen on each cgo call, is still correct, and we keep it on systems that support cgo but lack pthreads, like Windows.

This optimization is significant. The exact speedup depends on the OS and CPU, but in general a simple Go function call from a C thread becomes roughly 10x faster.

Some benchmark results for the newly added BenchmarkCGoInCThread:
1. 28x faster, from 3395 ns/op to 121 ns/op, on darwin with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. 6.5x faster, from 1495 ns/op to 230 ns/op, on Linux with an Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M
using needm, which sets m.g0's stack bounds using the SP. We don't
know how big the stack is, so we simply assume 32K. Previously,
when the Go function returns to C, we drop the M, and the next
time C calls into Go, we put a new stack bound on the g0 based on
the current SP. After CL 392854, we don't drop the M, and the next
time C calls into Go, we reuse the same g0, without recomputing
the stack bounds. If the C code uses quite a bit of stack space
before calling into Go, the SP may be well below the 32K stack
bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread
(on some platforms this may still be a guess, as we don't know
exactly where we are in the C stack), but it is probably better
than simply assuming 32K.
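
For illustration, the kind of query involved looks like this (a
Linux/glibc-specific sketch; darwin, for example, uses
pthread_get_stackaddr_np and pthread_get_stacksize_np instead):

    package main

    /*
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    // Ask pthread for the current thread's stack bounds, instead of
    // guessing "SP minus 32K" the way needm otherwise has to.
    static void print_stack_bounds(void) {
        pthread_attr_t attr;
        void *addr;
        size_t size;
        pthread_getattr_np(pthread_self(), &attr);   // glibc-specific
        pthread_attr_getstack(&attr, &addr, &size);  // addr = low bound
        pthread_attr_destroy(&attr);
        printf("stack [%p, %p)\n", addr, (void *)((char *)addr + size));
    }
    */
    import "C"

    func main() { C.print_stack_bounds() }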

[CL 492987 description]

On the first call into Go from a C thread, currently we set the g0
stack's high bound imprecisely based on the SP. With CL 485500, we
keep the M and don't recompute the stack bounds when it calls into
Go again. If the first call is made while the C thread is deep in
its stack, but a subsequent call is made with a shallower stack,
the SP may be above g0.stack.hi.

This is usually okay, as we don't usually check stack.hi. One place
where we do check stack.hi is in the signal handler, in
adjustSignalStack. In particular, C TSAN delivers signals on the
g0 stack (instead of the usual signal stack). If the SP is above
g0.stack.hi, we don't see that it is on the g0 stack, and the
runtime throws.

This CL makes it get an accurate stack upper bound with the
pthread API (on the platforms where it is available).

Also add some debug print for the "handler not on signal stack"
throw.

Fixes #51676.
Fixes #59294.
Fixes #59678.
Fixes #60007.

Change-Id: Ie51c8e81ade34ec81d69fd7bce1fe0039a470776
Reviewed-on: https://go-review.googlesource.com/c/go/+/495855
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
2023-05-17 21:53:11 +00:00
Michael Anthony Knyszek
b7e767b022 runtime: capture per-p trace state in a type
More tightening up of the tracer's interface.

Change-Id: I992141c7f30e5c2d5d77d1fcd6817d35bc6e5f6d
Reviewed-on: https://go-review.googlesource.com/c/go/+/494191
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-05-17 14:46:57 +00:00
Michael Anthony Knyszek
2f35a655e7 runtime: capture per-m trace state in a type
More tightening up of the tracer's interface.

While we're here, clarify why waittraceskip isn't included by explaining
what the wait* fields in the M are really for.

Change-Id: I0e7b4cac79fb77a7a0b3ca6b6cc267668e3610bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/494190
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-05-17 14:45:42 +00:00
Michael Anthony Knyszek
c213c905a2 runtime: capture per-g trace state in a type
More tightening up of the tracer's interface.

This increases the size of each G very slightly, which isn't great, but
we stay within the same size class, so memory use is actually
unchanged.
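
The shape of the refactoring, as a sketch (field and type names are
illustrative, not the runtime's actual ones):

    package main

    // Before: trace-related fields were spread inline through the
    // g struct.
    type gBefore struct {
        // ... scheduler fields ...
        traceseq   uint64  // trace-owned
        tracelastp uintptr // trace-owned
    }

    // After: the tracer's per-g state lives in one value type that
    // the tracer owns, shrinking its interface with the rest of the
    // runtime.
    type gTraceState struct {
        seq   uint64
        lastP uintptr
    }

    type gAfter struct {
        // ... scheduler fields ...
        trace gTraceState
    }

    func main() {}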

Change-Id: I7d1f5798edcf437c212beb1e1a2619eab833aafb
Reviewed-on: https://go-review.googlesource.com/c/go/+/494188
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
2023-05-17 14:45:40 +00:00
cui fliter
57e3189821 all: fix a lot of comments
Fix comments, including duplicated words (such as "is is"), wrong phrases and articles, misspellings, etc.

Change-Id: I8bfea53b9b275e649757cc4bee6a8a026ed9c7a4
Reviewed-on: https://go-review.googlesource.com/c/go/+/493035
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: shuang cui <imcusg@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
2023-05-10 12:59:20 +00:00
Chressie Himpel
72c33a5ef0 Revert "runtime/cgo: store M for C-created thread in pthread key"
This reverts CL 485500.

Reason for revert: This breaks internal tests at Google, see b/280861579 and b/280820455.

Change-Id: I426278d400f7611170918fc07c524cb059b9cc55
Reviewed-on: https://go-review.googlesource.com/c/go/+/492995
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Chressie Himpel <chressie@google.com>
2023-05-05 14:37:29 +00:00