runtime/cgo: store M for C-created thread in pthread key

This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and CL 485316 incorporated. CL 481061, by doujiang24 <doujiang24@gmail.com>, speed up C to Go calls by binding the M to the C thread. See below for its description. CL 482975 is a followup fix to a C declaration in testprogcgo. CL 485315 is a followup fix for x_cgo_getstackbound on Illumos. CL 485316 is a followup cleanup for ppc64 assembly. [Original CL 481061 description] This reapplies CL 392854, with the followup fixes in CL 479255, CL 479915, and CL 481057 incorporated. CL 392854, by doujiang24 <doujiang24@gmail.com>, speed up C to Go calls by binding the M to the C thread. See below for its description. CL 479255 is a followup fix for a small bug in ARM assembly code. CL 479915 is another followup fix to address C to Go calls after the C code uses some stack, but that CL is also buggy. CL 481057, by Michael Knyszek, is a followup fix for a memory leak bug of CL 479915. [Original CL 392854 description] In a C thread, it's necessary to acquire an extra M by using needm while invoking a Go function from C. But, needm and dropm are heavy costs due to the signal-related syscalls. So, we change to not dropm while returning back to C, which means binding the extra M to the C thread until it exits, to avoid needm and dropm on each C to Go call. Instead, we only dropm while the C thread exits, so the extra M won't leak. When invoking a Go function from C: Allocate a pthread variable using pthread_key_create, only once per shared object, and register a thread-exit-time destructor. And store the g0 of the current m into the thread-specified value of the pthread key, only once per C thread, so that the destructor will put the extra M back onto the extra M list while the C thread exits. When returning back to C: Skip dropm in cgocallback, when the pthread variable has been created, so that the extra M will be reused the next time invoke a Go function from C. This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows. This optimization is significant, and the specific value depends on the OS system and CPU, but in general, it can be considered as 10x faster, for a simple Go function call from a C thread. For the newly added BenchmarkCGoInCThread, some benchmark results: 1. it's 28x faster, from 3395 ns/op to 121 ns/op, in darwin OS & Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz 2. it's 6.5x faster, from 1495 ns/op to 230 ns/op, in Linux OS & Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz [CL 479915 description] Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K. Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows. This CL makes needm get a more accurate stack bound from pthread. (In some platforms this may still be a guess as we don't know exactly where we are in the C stack), but it is probably better than simply assuming 32K. [CL 485500 description] CL 479915 passed the G to _cgo_getstackbound for direct updates to gp.stack.lo. A G can be reused on a new thread after the previous thread exited. This could trigger the C TSAN race detector because it couldn't see the synchronization in Go (lockextra) preventing the same G from being used on multiple threads at the same time. We work around this by passing the address of a stack variable to _cgo_getstackbound rather than the G. The stack is generally unique per thread, so TSAN won't see the same address from multiple threads. Even if stacks are reused across threads by pthread, C TSAN should see the synchonization in the stack allocator. A regression test is added to misc/cgo/testsanitizer. Fixes #51676. Fixes #59294. Fixes #59678. Change-Id: Ic62be31a06ee83568215e875a891df37084e08ca Reviewed-on: https://go-review.googlesource.com/c/go/+/485500 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Run-TryBot: Michael Pratt <mpratt@google.com>
2025-12-08 06:10:04 +00:00 · 2023-04-17 14:51:28 -04:00 · 2023-04-17 14:51:28 -04:00 · 7b874619be
commit 7b874619be
parent d816f85f78
46 changed files with 1012 additions and 67 deletions
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@ -209,6 +209,10 @@ func main() {

 	main_init_done = make(chan bool)
 	if iscgo {
+		if _cgo_pthread_key_created == nil {
+			throw("_cgo_pthread_key_created missing")
+		}
+
 		if _cgo_thread_start == nil {
 			throw("_cgo_thread_start missing")
 		}
@ -223,6 +227,13 @@ func main() {
 		if _cgo_notify_runtime_init_done == nil {
 			throw("_cgo_notify_runtime_init_done missing")
 		}
+
+		// Set the x_crosscall2_ptr C function pointer variable point to crosscall2.
+		if set_crosscall2 == nil {
+			throw("set_crosscall2 missing")
+		}
+		set_crosscall2()
+
 		// Start the template thread in case we enter Go from
 		// a C-created thread and need to create a new thread.
 		startTemplateThread()
@ -1880,11 +1891,15 @@ func allocm(pp *p, fn func(), id int64) *m {
 // pressed into service as the scheduling stack and current
 // goroutine for the duration of the cgo callback.
 //
-// When the callback is done with the m, it calls dropm to
-// put the m back on the list.
+// It calls dropm to put the m back on the list,
+// 1. when the callback is done with the m in non-pthread platforms,
+// 2. or when the C thread exiting on pthread platforms.
+//
+// The signal argument indicates whether we're called from a signal
+// handler.
 //
 //go:nosplit
-func needm() {
+func needm(signal bool) {
 	if (iscgo || GOOS == "windows") && !cgoHasExtraM {
 		// Can happen if C/C++ code calls Go from a global ctor.
 		// Can also happen on Windows if a global ctor uses a
@ -1933,16 +1948,36 @@ func needm() {
 	osSetupTLS(mp)

 	// Install g (= m->g0) and set the stack bounds
-	// to match the current stack. We don't actually know
+	// to match the current stack. If we don't actually know
 	// how big the stack is, like we don't know how big any
-	// scheduling stack is, but we assume there's at least 32 kB,
-	// which is more than enough for us.
+	// scheduling stack is, but we assume there's at least 32 kB.
+	// If we can get a more accurate stack bound from pthread,
+	// use that.
 	setg(mp.g0)
 	gp := getg()
 	gp.stack.hi = getcallersp() + 1024
 	gp.stack.lo = getcallersp() - 32*1024
+	if !signal && _cgo_getstackbound != nil {
+		// Don't adjust if called from the signal handler.
+		// We are on the signal stack, not the pthread stack.
+		// (We could get the stack bounds from sigaltstack, but
+		// we're getting out of the signal handler very soon
+		// anyway. Not worth it.)
+		var low uintptr
+		asmcgocall(_cgo_getstackbound, unsafe.Pointer(&low))
+		// getstackbound is an unsupported no-op on Windows.
+		if low != 0 {
+			gp.stack.lo = low
+			// TODO: Also get gp.stack.hi from getstackbound.
+		}
+	}
 	gp.stackguard0 = gp.stack.lo + stackGuard

+	// Should mark we are already in Go now.
+	// Otherwise, we may call needm again when we get a signal, before cgocallbackg1,
+	// which means the extram list may be empty, that will cause a deadlock.
+	mp.isExtraInC = false
+
 	// Initialize this thread to use the m.
 	asminit()
 	minit()
@ -1952,6 +1987,17 @@ func needm() {
 	sched.ngsys.Add(-1)
 }

+// Acquire an extra m and bind it to the C thread when a pthread key has been created.
+//
+//go:nosplit
+func needAndBindM() {
+	needm(false)
+
+	if _cgo_pthread_key_created != nil && *(*uintptr)(_cgo_pthread_key_created) != 0 {
+		cgoBindM()
+	}
+}
+
 // newextram allocates m's and puts them on the extra list.
 // It is called with a working local m, so that it can do things
 // like call schedlock and allocate.
@ -1996,6 +2042,8 @@ func oneNewExtraM() {
 	gp.m = mp
 	mp.curg = gp
 	mp.isextra = true
+	// mark we are in C by default.
+	mp.isExtraInC = true
 	mp.lockedInt++
 	mp.lockedg.set(gp)
 	gp.lockedm.set(mp)
@ -2028,9 +2076,11 @@ func oneNewExtraM() {
 	unlockextra(mp)
 }

+// dropm puts the current m back onto the extra list.
+//
+// 1. On systems without pthreads, like Windows
 // dropm is called when a cgo callback has called needm but is now
 // done with the callback and returning back into the non-Go thread.
-// It puts the current m back onto the extra list.
 //
 // The main expense here is the call to signalstack to release the
 // m's signal stack, and then the call to needm on the next callback
@ -2042,15 +2092,18 @@ func oneNewExtraM() {
 // call. These should typically not be scheduling operations, just a few
 // atomics, so the cost should be small.
 //
-// TODO(rsc): An alternative would be to allocate a dummy pthread per-thread
-// variable using pthread_key_create. Unlike the pthread keys we already use
-// on OS X, this dummy key would never be read by Go code. It would exist
-// only so that we could register at thread-exit-time destructor.
-// That destructor would put the m back onto the extra list.
-// This is purely a performance optimization. The current version,
-// in which dropm happens on each cgo call, is still correct too.
-// We may have to keep the current version on systems with cgo
-// but without pthreads, like Windows.
+// 2. On systems with pthreads
+// dropm is called while a non-Go thread is exiting.
+// We allocate a pthread per-thread variable using pthread_key_create,
+// to register a thread-exit-time destructor.
+// And store the g into a thread-specific value associated with the pthread key,
+// when first return back to C.
+// So that the destructor would invoke dropm while the non-Go thread is exiting.
+// This is much faster since it avoids expensive signal-related syscalls.
+//
+// NOTE: this always runs without a P, so, nowritebarrierrec required.
+//
+//go:nowritebarrierrec
 func dropm() {
 	// Clear m and g, and return m to the extra list.
 	// After the call to setg we can only call nosplit functions
@ -2082,6 +2135,39 @@ func dropm() {
 	msigrestore(sigmask)
 }

+// bindm store the g0 of the current m into a thread-specific value.
+//
+// We allocate a pthread per-thread variable using pthread_key_create,
+// to register a thread-exit-time destructor.
+// We are here setting the thread-specific value of the pthread key, to enable the destructor.
+// So that the pthread_key_destructor would dropm while the C thread is exiting.
+//
+// And the saved g will be used in pthread_key_destructor,
+// since the g stored in the TLS by Go might be cleared in some platforms,
+// before the destructor invoked, so, we restore g by the stored g, before dropm.
+//
+// We store g0 instead of m, to make the assembly code simpler,
+// since we need to restore g0 in runtime.cgocallback.
+//
+// On systems without pthreads, like Windows, bindm shouldn't be used.
+//
+// NOTE: this always runs without a P, so, nowritebarrierrec required.
+//
+//go:nosplit
+//go:nowritebarrierrec
+func cgoBindM() {
+	if GOOS == "windows" || GOOS == "plan9" {
+		fatal("bindm in unexpected GOOS")
+	}
+	g := getg()
+	if g.m.g0 != g {
+		fatal("the current g is not g0")
+	}
+	if _cgo_bindm != nil {
+		asmcgocall(_cgo_bindm, unsafe.Pointer(g))
+	}
+}
+
 // A helper function for EnsureDropM.
 func getm() uintptr {
 	return uintptr(unsafe.Pointer(getg().m))