mirror of https://github.com/golang/go.git (synced 2025-12-08 06:10:04 +00:00)
internal/runtime/gc/scan: avoid memory destination on VPCOMPRESSQ
On AMD Genoa / Zen 4, VPCOMPRESSQ with a memory destination imposes a
severe performance penalty of around an order of magnitude compared to
a register destination.
We can trivially work around this penalty with a register destination
and an additional move to memory.
Benchmark results from:
$ go test -bench=BenchmarkScanSpanPacked/.*/.*/.*/.*/impl=Platform internal/runtime/gc/scan
I've only included the summarized geomean here because there are ~2500
unique test cases.
AMD Genoa (Zen 4):
cpu: AMD EPYC 9B14 96-Core Processor
        │    mem     │              reg               │
        │   sec/op   │   sec/op    vs base            │
geomean      1.039µ       310.1n   -70.16%

        │    mem     │              reg               │
        │    B/s     │    B/s      vs base            │
geomean     2.906Gi      10.99Gi   +278.27%
As expected, we see a massive performance improvement on Genoa.
AMD Turin (Zen 5):
cpu: AMD EPYC 9B45 128-Core Processor
        │    mem     │              reg               │
        │   sec/op   │   sec/op    vs base            │
geomean      231.9n       237.3n   +2.32%

        │    mem     │              reg               │
        │    B/s     │    B/s      vs base            │
geomean     14.79Gi      14.43Gi   -2.50%
On Turin there is a minor regression. This is primarily due to a fairly
large regression (~15%) in very small microbenchmark cases where the
entire memory fits in L1 cache. This regression disappears as memory
access slows down with larger memories. The latter should be more common
in real workloads.
Intel Sapphire Rapids:
cpu: Intel(R) Xeon(R) Platinum 8481C
        │    mem     │              reg               │
        │   sec/op   │   sec/op    vs base            │
geomean      254.9n       246.8n   -3.18%

        │    mem     │              reg               │
        │    B/s     │    B/s      vs base            │
geomean     13.65Gi      14.15Gi   +3.69%
On Sapphire Rapids there is a minor improvement. Here results are fairly
noisy. Most cases are a wash, but some are arbitrarily 20% slower or 20%
faster for unclear reasons.
For #73581.
Change-Id: I6a6a636cfd294a0dcdc4f34c9ece1bc9a6e5e4c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/715362
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
parent 81afd3a59b
commit 041f564b3e
1 changed file with 18 additions and 1 deletion
@@ -86,7 +86,24 @@ loop:
 	// Collect just the pointers from the greyed objects into the scan buffer,
 	// i.e., copy the word indices in the mask from Z1 into contiguous memory.
-	VPCOMPRESSQ Z1, K1, (DI)(DX*8)
+	//
+	// N.B. VPCOMPRESSQ supports a memory destination. Unfortunately, on
+	// AMD Genoa / Zen 4, using VPCOMPRESSQ with a memory destination
+	// imposes a severe performance penalty of around an order of magnitude
+	// compared to a register destination.
+	//
+	// This workaround is unfortunate on other microarchitectures, where a
+	// memory destination is slightly faster than adding an additional move
+	// instruction, but no where near an order of magnitude. It would be
+	// nice to have a Genoa-only variant here.
+	//
+	// AMD Turin / Zen 5 fixes this issue.
+	//
+	// See
+	// https://lemire.me/blog/2025/02/14/avx-512-gotcha-avoid-compressing-words-to-memory-with-amd-zen-4-processors/.
+	VPCOMPRESSQ Z1, K1, Z2
+	VMOVDQU64 Z2, (DI)(DX*8)
 
 	// Advance the scan buffer position by the number of pointers.
 	MOVBQZX 128(AX), CX
 	ADDQ CX, DX