all: REVERSE MERGE dev.simd (7d65463) into master

This commit is a REVERSE MERGE.
It merges dev.simd back into its parent branch, master.
Development of simd will continue only on dev.simd,
and it will be merged into master when necessary.

Merge List:

+ 2025-11-24 7d65463a54 [dev.simd] all: merge master (e704b09) into dev.simd
+ 2025-11-24 afd1721fc5 [dev.simd] all: merge master (02d1f3a) into dev.simd
+ 2025-11-24 a9914886da [dev.simd] internal/buildcfg: don't enable SIMD experiment by default
+ 2025-11-24 61a5a6b016 [dev.simd] simd: add goexperiment tag to generate.go
+ 2025-11-24 f045ed4110 [dev.simd] go/doc/comment: don't include experimental packages in std list
+ 2025-11-24 220d73cc44 [dev.simd] all: merge master (8dd5b13) into dev.simd
+ 2025-11-24 0c69e77343 Revert "[dev.simd] internal/runtime/gc: add simd package based greentea kernels"
+ 2025-11-21 da92168ec8 [dev.simd] internal/runtime/gc: add simd package based greentea kernels
+ 2025-11-21 3fdd183aef [dev.simd] cmd/compile, simd: update conversion API names
+ 2025-11-21 d3a0321dba [dev.simd] cmd/compile: fix incorrect mapping of SHA256MSG2128
+ 2025-11-20 74ebdd28d1 [dev.simd] simd, cmd/compile: add more element types for Select128FromPair
+ 2025-11-20 4d26d66a49 [dev.simd] simd: fix signatures for PermuteConstant* methods
+ 2025-11-20 e3d4645693 [dev.simd] all: merge master (ca37d24) into dev.simd
+ 2025-11-20 95b4ad525f [dev.simd] simd: reorganize internal tests so that simd does not import testing
+ 2025-11-18 3fe246ae0f [dev.simd] simd: make 'go generate' generate everything
+ 2025-11-18 cf45adf140 [dev.simd] simd: move template code generator into _gen
+ 2025-11-18 19b4a30899 [dev.simd] simd/_gen/simdgen: remove outdated asm.yaml.toy
+ 2025-11-18 9461db5c59 [dev.simd] simd: fix comment in file generator
+ 2025-11-18 4004ff3523 [dev.simd] simd: remove FlattenedTranspose from exports
+ 2025-11-18 896f293a25 [dev.simd] cmd/compile, simd: change DotProductQuadruple and add peepholes
+ 2025-11-18 be9c50c6a0 [dev.simd] cmd/compile, simd: change SHA ops names and types
+ 2025-11-17 0978935a99 [dev.simd] cmd/compile, simd: change AES op names and add missing size
+ 2025-11-17 95871e4a00 [dev.simd] cmd/compile, simd: add VPALIGNR
+ 2025-11-17 934dbcea1a [dev.simd] simd: update CPU feature APIs
+ 2025-11-17 e4d9484220 [dev.simd] cmd/compile: fix unstable output
+ 2025-11-13 d7a0c45642 [dev.simd] all: merge master (57362e9) into dev.simd
+ 2025-11-11 86b4fe31d9 [dev.simd] cmd/compile: add masked merging ops and optimizations
+ 2025-11-10 771a1dc216 [dev.simd] cmd/compile: add peepholes for all masked ops and bug fixes
+ 2025-11-10 972732b245 [dev.simd] simd, cmd/compile: remove move from API
+ 2025-11-10 bf77323efa [dev.simd] simd: put unexported methods to another file
+ 2025-11-04 fe040658b2 [dev.simd] simd/_gen: fix sorting ops slices
+ 2025-10-29 e452f4ac7d [dev.simd] cmd/compile: enhance inlining for closure-of-SIMD
+ 2025-10-27 ca1264ac50 [dev.simd] test: add some trickier cases to ternary-boolean simd test
+ 2025-10-24 f6b4711095 [dev.simd] cmd/compile, simd: add rewrite to convert logical expression trees into TERNLOG instructions
+ 2025-10-24 cf7c1a4cbb [dev.simd] cmd/compile, simd: add SHA features
+ 2025-10-24 2b8eded4f4 [dev.simd] simd/_gen: parse SHA features from XED
+ 2025-10-24 c75965b666 [dev.simd] simd: added String() method to SIMD vectors.
+ 2025-10-22 d03634f807 [dev.simd] cmd/compile, simd: add definitions for VPTERNLOG[DQ]
+ 2025-10-20 20b3339542 [dev.simd] simd: add AES feature check
+ 2025-10-14 fc3bc49337 [dev.simd] simd: clean up mask load comments
+ 2025-10-14 416332dba2 [dev.simd] cmd/compile, simd: update DotProd to DotProduct
+ 2025-10-14 647c790143 [dev.simd] cmd/compile: peephole simd mask load/stores from bits
+ 2025-10-14 2e71cf1a2a [dev.simd] cmd/compile, simd: remove mask load and stores
+ 2025-10-13 c4fbf3b4cf [dev.simd] simd/_gen: add mem peephole with feat mismatches
+ 2025-10-13 ba72ee0f30 [dev.simd] cmd/compile: more support for cpufeatures
+ 2025-10-09 be57d94c4c [dev.simd] simd: add emulated Not method
+ 2025-10-07 d2270bccbd [dev.simd] cmd/compile: track which CPU features are in scope
+ 2025-10-03 48756abd3a [dev.simd] cmd/compile: inliner tweaks to favor simd-handling functions
+ 2025-10-03 fb1749a3fe [dev.simd] all: merge master (adce7f1) into dev.simd
+ 2025-09-30 703a5fbaad [dev.simd] cmd/compile, simd: add AES instructions
+ 2025-09-29 1c961c2fb2 [dev.simd] simd: use new data movement instructions to do "fast" transposes
+ 2025-09-26 fe4af1c067 [dev.simd] simd: repair broken comments in generated ops_amd64.go
+ 2025-09-26 ea3b2ecd28 [dev.simd] cmd/compile, simd: add 64-bit select-from-pair methods
+ 2025-09-26 25c36b95d1 [dev.simd] simd, cmd/compile: add 128 bit select-from-pair
+ 2025-09-26 f0e281e693 [dev.simd] cmd/compile: don't require single use for SIMD load/store folding
+ 2025-09-26 b4d1e018a8 [dev.simd] cmd/compile: remove unnecessary code from early simd prototype
+ 2025-09-26 578777bf7c [dev.simd] cmd/compile: make condtion of CanSSA smarter for SIMD fields
+ 2025-09-26 c28b2a0ca1 [dev.simd] simd: generalize select-float32-from-pair
+ 2025-09-25 a693ae1e9a [dev.simd] all: merge master (d70ad4e) into dev.simd
+ 2025-09-25 5a78e1a4a1 [dev.simd] simd, cmd/compile: mark simd vectors uncomparable
+ 2025-09-23 bf00f5dfd6 [dev.simd] simd, cmd/compile: added simd methods for VSHUFP[DS]
+ 2025-09-23 8e60feeb41 [dev.simd] cmd/compile: improve slicemask removal
+ 2025-09-23 2b50ffe172 [dev.simd] cmd/compile: remove stores to unread parameters
+ 2025-09-23 2d8cb80d7c [dev.simd] all: merge master (9b2d39b) into dev.simd
+ 2025-09-22 63a09d6d3d [dev.simd] cmd/compile: fix SIMD const rematerialization condition
+ 2025-09-20 2ca96d218d [dev.simd] cmd/compile: enhance prove to infer bounds in slice len/cap calculations
+ 2025-09-19 c0f031fcc3 [dev.simd] cmd/compile: spill the correct SIMD register for morestack
+ 2025-09-19 58fa1d023e [dev.simd] cmd/compile: enhance the chunked indexing case to include reslicing
+ 2025-09-18 7ae0eb2e80 [dev.simd] cmd/compile: remove Add32x4 generic op
+ 2025-09-18 31b664d40b [dev.simd] cmd/compile: widen index for simd intrinsics jumptable
+ 2025-09-18 e34ad6de42 [dev.simd] cmd/compile: optimize VPTEST for 2-operand cases
+ 2025-09-18 f1e3651c33 [dev.simd] cmd/compile, simd: add VPTEST
+ 2025-09-18 d9751166a6 [dev.simd] cmd/compile: handle rematerialized op for incompatible reg constraint
+ 2025-09-18 4eb5c6e07b [dev.simd] cmd/compile, simd/_gen: add rewrite for const load ops
+ 2025-09-18 443b7aeddb [dev.simd] cmd/compile, simd/_gen: make rewrite rules consistent on CPU Features
+ 2025-09-16 bdd30e25ca [dev.simd] all: merge master (ca0e035) into dev.simd
+ 2025-09-16 0e590a505d [dev.simd] cmd/compile: use the right type for spill slot
+ 2025-09-15 dabe2bb4fb [dev.simd] cmd/compile: fix holes in mask peepholes
+ 2025-09-12 3ec0b25ab7 [dev.simd] cmd/compile, simd/_gen/simdgen: add const load mops
+ 2025-09-12 1e5631d4e0 [dev.simd] cmd/compile: peephole simd load
+ 2025-09-11 48f366d826 [dev.simd] cmd/compile: add memop peephole rules
+ 2025-09-11 9a349f8e72 [dev.simd] all: merge master (cf5e993) into dev.simd
+ 2025-09-11 5a0446d449 [dev.simd] simd/_gen/simdgen, cmd/compile: add memory op machine ops
+ 2025-09-08 c39b2fdd1e [dev.simd] cmd/compile, simd: add VPLZCNT[DQ]
+ 2025-09-07 832c1f76dc [dev.simd] cmd/compile: enhance prove to deal with double-offset IsInBounds checks
+ 2025-09-06 0b323350a5 [dev.simd] simd/_gen/simdgen: merge memory ops
+ 2025-09-06 f42c9261d3 [dev.simd] simd/_gen/simdgen: parse memory operands
+ 2025-09-05 356c48d8e9 [dev.simd] cmd/compile, simd: add ClearAVXUpperBits
+ 2025-09-03 7c8b9115bc [dev.simd] all: merge master (4c4cefc) into dev.simd
+ 2025-09-02 9125351583 [dev.simd] internal/cpu: report AVX1 and 2 as supported on macOS 15 Rosetta 2
+ 2025-09-02 b509516b2e [dev.simd] simd, cmd/compile: add Interleave{Hi,Lo} (VPUNPCK*)
+ 2025-09-02 6890aa2e20 [dev.simd] cmd/compile: add instructions and rewrites for scalar-> vector moves
+ 2025-08-24 5ebe2d05d5 [dev.simd] simd: correct SumAbsDiff documentation
+ 2025-08-22 a5137ec92a [dev.simd] cmd/compile: sample peephole optimization for SIMD broadcast
+ 2025-08-22 83714616aa [dev.simd] cmd/compile: remove VPADDD4
+ 2025-08-22 4a3ea146ae [dev.simd] cmd/compile: correct register mask of some AVX512 ops
+ 2025-08-22 8d874834f1 [dev.simd] cmd/compile: use X15 for zero value in AVX context
+ 2025-08-22 4c311aa38f [dev.simd] cmd/compile: ensure the whole X15 register is zeroed
+ 2025-08-22 baea0c700b [dev.simd] cmd/compile, simd: complete AVX2? u?int shuffles
+ 2025-08-22 fa1e78c9ad [dev.simd] cmd/compile, simd: make Permute 128-bit use AVX VPSHUFB
+ 2025-08-22 bc217d4170 [dev.simd] cmd/compile, simd: add packed saturated u?int conversions
+ 2025-08-22 4fa23b0d29 [dev.simd] cmd/compile, simd: add saturated u?int conversions
+ 2025-08-21 3f6bab5791 [dev.simd] simd: move tests to a subdirectory to declutter "simd"
+ 2025-08-21 aea0a5e8d7 [dev.simd] simd/_gen/unify: improve envSet doc comment
+ 2025-08-21 7fdb1da6b0 [dev.simd] cmd/compile, simd: complete truncating u?int conversions.
+ 2025-08-21 f4c41d9922 [dev.simd] cmd/compile, simd: complete u?int widening conversions
+ 2025-08-21 6af8881adb [dev.simd] simd: reorganize cvt rules
+ 2025-08-21 58cfc2a5f6 [dev.simd] cmd/compile, simd: add VPSADBW
+ 2025-08-21 f7c6fa709e [dev.simd] simd/_gen/unify: fix some missing environments
+ 2025-08-20 7c84e984e6 [dev.simd] cmd/compile: rewrite to elide Slicemask from len==c>0 slicing
+ 2025-08-20 cf31b15635 [dev.simd] simd, cmd/compile: added .Masked() peephole opt for many operations.
+ 2025-08-20 1334285862 [dev.simd] simd: template field name cleanup in genfiles
+ 2025-08-20 af6475df73 [dev.simd] simd: add testing hooks for size-changing conversions
+ 2025-08-20 ede64cf0d8 [dev.simd] simd, cmd/compile: sample peephole optimization for .Masked()
+ 2025-08-20 103b6e39ca [dev.simd] all: merge master (9de69f6) into dev.simd
+ 2025-08-20 728ac3e050 [dev.simd] simd: tweaks to improve test disassembly
+ 2025-08-20 4fce49b86c [dev.simd] simd, cmd/compile: add widening unsigned converts 8->16->32
+ 2025-08-19 0f660d675f [dev.simd] simd: make OpMasked machine ops only
+ 2025-08-19 a034826e26 [dev.simd] simd, cmd/compile: implement ToMask, unexport asMask.
+ 2025-08-18 8ccd6c2034 [dev.simd] simd, cmd/compile: mark BLEND instructions as not-zero-mask
+ 2025-08-18 9a934d5080 [dev.simd] cmd/compile, simd: added methods for "float" GetElem
+ 2025-08-15 7380213a4e [dev.simd] cmd/compile: make move/load/store dependent only on reg and width
+ 2025-08-15 908e3e8166 [dev.simd] cmd/compile: make (most) move/load/store lowering use reg and width only
+ 2025-08-14 9783f86bc8 [dev.simd] cmd/compile: accounts rematerialize ops's output reginfo
+ 2025-08-14 a4ad41708d [dev.simd] all: merge master (924fe98) into dev.simd
+ 2025-08-13 8b90d48d8c [dev.simd] simd/_gen/simdgen: rewrite etetest.sh
+ 2025-08-13 b7c8698549 [dev.simd] simd/_gen: migrate simdgen from x/arch
+ 2025-08-13 257c1356ec [dev.simd] go/types: exclude simd/_gen module from TestStdlib
+ 2025-08-13 858a8d2276 [dev.simd] simd: reorganize/rename generated emulation files
+ 2025-08-13 2080415aa2 [dev.simd] simd: add emulations for missing AVX2 comparisons
+ 2025-08-13 ddb689c7bb [dev.simd] simd, cmd/compile: generated code for Broadcast
+ 2025-08-13 e001300cf2 [dev.simd] cmd/compile: fix LoadReg so it is aware of register target
+ 2025-08-13 d5dea86993 [dev.simd] cmd/compile: fix isIntrinsic for methods; fix fp <-> gp moves
+ 2025-08-13 08ab8e24a3 [dev.simd] cmd/compile: generated code from 'fix generated rules for shifts'
+ 2025-08-11 702ee2d51e [dev.simd] cmd/compile, simd: update generated files
+ 2025-08-11 e33eb1a7a5 [dev.simd] cmd/compile, simd: update generated files
+ 2025-08-11 667add4f1c [dev.simd] cmd/compile, simd: update generated files
+ 2025-08-11 1755c2909d [dev.simd] cmd/compile, simd: update generated files
+ 2025-08-11 2fd49d8f30 [dev.simd] simd: imm doc improve
+ 2025-08-11 ce0e803ab9 [dev.simd] cmd/compile: keep track of multiple rule file names in ssa/_gen
+ 2025-08-11 38b76bf2a3 [dev.simd] cmd/compile, simd: jump table for imm ops
+ 2025-08-08 94d72355f6 [dev.simd] simd: add emulations for bitwise ops and for mask/merge methods
+ 2025-08-07 8eb5f6020e [dev.simd] cmd/compile, simd: API interface fixes
+ 2025-08-07 b226bcc4a9 [dev.simd] cmd/compile, simd: add value conversion ToBits for mask
+ 2025-08-06 5b0ef7fcdc [dev.simd] cmd/compile, simd: add Expand
+ 2025-08-06 d3cf582f8a [dev.simd] cmd/compile, simd: (Set|Get)(Lo|Hi)
+ 2025-08-05 7ca34599ec [dev.simd] simd, cmd/compile: generated files to add 'blend' and 'blendMasked'
+ 2025-08-05 82d056ddd7 [dev.simd] cmd/compile: add ShiftAll immediate variant
+ 2025-08-04 775fb52745 [dev.simd] all: merge master (7a1679d) into dev.simd
+ 2025-08-04 6b9b59e144 [dev.simd] simd, cmd/compile: rename some methods
+ 2025-08-04 d375b95357 [dev.simd] simd: move lots of slice functions and methods to generated code
+ 2025-08-04 3f92aa1eca [dev.simd] cmd/compile, simd: make bitwise logic ops available to all u?int vectors
+ 2025-08-04 c2d775d401 [dev.simd] cmd/compile, simd: change PairDotProdAccumulate to AddDotProd
+ 2025-08-04 2c25f3e846 [dev.simd] cmd/compile, simd: change Shift*AndFillUpperFrom to Shift*Concat
+ 2025-08-01 c25e5c86b2 [dev.simd] cmd/compile: generated code for K-mask-register slice load/stores
+ 2025-08-01 1ac5f3533f [dev.simd] cmd/compile: opcodes and rules and code generation to enable AVX512 masked loads/stores
+ 2025-08-01 f39711a03d [dev.simd] cmd/compile: test for int-to-mask conversion
+ 2025-08-01 08bec02907 [dev.simd] cmd/compile: add register-to-mask moves, other simd glue
+ 2025-08-01 09ff25e350 [dev.simd] simd: add tests for simd conversions to Int32/Uint32.
+ 2025-08-01 a24ffe3379 [dev.simd] simd: modify test generation to make it more flexible
+ 2025-08-01 ec5c20ba5a [dev.simd] cmd/compile: generated simd code to add some conversions
+ 2025-08-01 e62e377ed6 [dev.simd] cmd/compile, simd: generated code from repaired simdgen sort
+ 2025-08-01 761894d4a5 [dev.simd] simd: add partial slice load/store for 32/64-bits on AVX2
+ 2025-08-01 acc1492b7d [dev.simd] cmd/compile: Generated code for AVX2 SIMD masked load/store
+ 2025-08-01 a0b87a7478 [dev.simd] cmd/compile: changes for AVX2 SIMD masked load/store
+ 2025-08-01 88568519b4 [dev.simd] simd: move test generation into Go repo
+ 2025-07-31 6f7a1164e7 [dev.simd] cmd/compile, simd: support store to bits for mask
+ 2025-07-21 41054cdb1c [dev.simd] simd, internal/cpu: support more AVX CPU Feature checks
+ 2025-07-21 957f06c410 [dev.simd] cmd/compile, simd: support load from bits for mask
+ 2025-07-21 f0e9dc0975 [dev.simd] cmd/compile: fix opLen(2|3)Imm8_2I intrinsic function
+ 2025-07-17 03a3887f31 [dev.simd] simd: clean up masked op doc
+ 2025-07-17 c61743e4f0 [dev.simd] cmd/compile, simd: reorder PairDotProdAccumulate
+ 2025-07-15 ef5f6cc921 [dev.simd] cmd/compile: adjust param order for AndNot
+ 2025-07-15 6d10680141 [dev.simd] cmd/compile, simd: add Compress
+ 2025-07-15 17baae72db [dev.simd] simd: default mask param's name to mask
+ 2025-07-15 01f7f57025 [dev.simd] cmd/compile, simd: add variable Permute
+ 2025-07-14 f5f42753ab [dev.simd] cmd/compile, simd: add VDPPS
+ 2025-07-14 08ffd66ab2 [dev.simd] simd: updates CPU Feature in doc
+ 2025-07-14 3f789721d6 [dev.simd] cmd/compile: mark SIMD types non-fat
+ 2025-07-11 b69622b83e [dev.simd] cmd/compile, simd: adjust Shift.* operations
+ 2025-07-11 4993a91ae1 [dev.simd] simd: change imm param name to constant
+ 2025-07-11 bbb6dccd84 [dev.simd] simd: fix documentations
+ 2025-07-11 1440ff7036 [dev.simd] cmd/compile: exclude simd vars from merge local
+ 2025-07-11 ccb43dcec7 [dev.simd] cmd/compile: add VZEROUPPER and VZEROALL inst
+ 2025-07-11 21596f2f75 [dev.simd] all: merge master (88cf0c5) into dev.simd
+ 2025-07-10 ab7f839280 [dev.simd] cmd/compile: fix maskreg/simdreg chaos
+ 2025-07-09 47b07a87a6 [dev.simd] cmd/compile, simd: fix Int64x2 Greater output type to mask
+ 2025-07-09 08cd62e9f5 [dev.simd] cmd/compile: remove X15 from register mask
+ 2025-07-09 9ea33ed538 [dev.simd] cmd/compile: output of simd generator, more ... rewrite rules
+ 2025-07-09 aab8b173a9 [dev.simd] cmd/compile, simd: Int64x2 Greater and Uint* Equal
+ 2025-07-09 8db7f41674 [dev.simd] cmd/compile: use upper registers for AVX512 simd ops
+ 2025-07-09 574854fd86 [dev.simd] runtime: save Z16-Z31 registers in async preempt
+ 2025-07-09 5429328b0c [dev.simd] cmd/compile: change register mask names for simd ops
+ 2025-07-09 029d7ec3e9 [dev.simd] cmd/compile, simd: rename Masked$OP to $(OP)Masked.
+ 2025-07-09 983e81ce57 [dev.simd] simd: rename stubs_amd64.go to ops_amd64.go
+ 2025-07-08 56ca67682b [dev.simd] cmd/compile, simd: remove FP bitwise logic operations.
+ 2025-07-08 0870ed04a3 [dev.simd] cmd/compile: make compares between NaNs all false.
+ 2025-07-08 24f2b8ae2e [dev.simd] simd: {Int,Uint}{8x{16,32},16x{8,16}} subvector loads/stores from slices.
+ 2025-07-08 2bb45cb8a5 [dev.simd] cmd/compile: minor tweak for race detector
+ 2025-07-07 43a61aef56 [dev.simd] cmd/compile: add EXTRACT[IF]128 instructions
+ 2025-07-07 292db9b676 [dev.simd] cmd/compile: add INSERT[IF]128 instructions
+ 2025-07-07 d8fa853b37 [dev.simd] cmd/compile: make regalloc simd aware on copy
+ 2025-07-07 dfd75f82d4 [dev.simd] cmd/compile: output of simdgen with invariant type order
+ 2025-07-04 72c39ef834 [dev.simd] cmd/compile: fix the "always panic" code to actually panic
+ 2025-07-01 1ee72a15a3 [dev.simd] internal/cpu: add GFNI feature check
+ 2025-06-30 0710cce6eb [dev.simd] runtime: remove write barrier in xRegRestore
+ 2025-06-30 59846af331 [dev.simd] cmd/compile, simd: cleanup operations and documentations
+ 2025-06-30 f849225b3b [dev.simd] all: merge master (740857f) into dev.simd
+ 2025-06-30 9eeb1e7a9a [dev.simd] runtime: save AVX2 and AVX-512 state on asynchronous preemption
+ 2025-06-30 426cf36b4d [dev.simd] runtime: save scalar registers off stack in amd64 async preemption
+ 2025-06-30 ead249a2e2 [dev.simd] cmd/compile: reorder operands for some simd operations
+ 2025-06-30 55665e1e37 [dev.simd] cmd/compile: undoes reorder transform in prior commit, changes names
+ 2025-06-26 10c9621936 [dev.simd] cmd/compile, simd: add galois field operations
+ 2025-06-26 e61ebfce56 [dev.simd] cmd/compile, simd: add shift operations
+ 2025-06-26 35b8cf7fed [dev.simd] cmd/compile: tweak sort order in generator
+ 2025-06-26 7fadfa9638 [dev.simd] cmd/compile: add simd VPEXTRA*
+ 2025-06-26 0d8cb89f5c [dev.simd] cmd/compile: support simd(imm,fp) returns gp
+ 2025-06-25 f4a7c124cc [dev.simd] all: merge master (f8ccda2) into dev.simd
+ 2025-06-25 4fda27c0cc [dev.simd] cmd/compile: glue codes for Shift and Rotate
+ 2025-06-24 61c1183342 [dev.simd] simd: add test wrappers
+ 2025-06-23 e32488003d [dev.simd] cmd/compile: make simd regmask naming more like existing conventions
+ 2025-06-23 1fa4bcfcda [dev.simd] simd, cmd/compile: generated code for VPINSR[BWDQ], and test
+ 2025-06-23 dd63b7aa0e [dev.simd] simd: add AVX512 aggregated check
+ 2025-06-23 0cdb2697d1 [dev.simd] simd: add tests for intrinsic used as a func value and via reflection
+ 2025-06-23 88c013d6ff [dev.simd] cmd/compile: generate function body for bodyless intrinsics
+ 2025-06-20 a8669c78f5 [dev.simd] sync: correct the type of runtime_StoreReluintptr
+ 2025-06-20 7c6ac35275 [dev.simd] cmd/compile: add simdFp1gp1fp1Imm8 helper to amd64 code generation
+ 2025-06-20 4150372a5d [dev.simd] cmd/compile: don't treat devel compiler as a released compiler
+ 2025-06-18 1b87d52549 [dev.simd] cmd/compile: add fp1gp1fp1 register mask for AMD64
+ 2025-06-18 1313521f75 [dev.simd] cmd/compile: remove fused mul/add/sub shapes.
+ 2025-06-17 1be5eb2686 [dev.simd] cmd/compile: fix signature error of PairDotProdAccumulate.
+ 2025-06-17 3a4d10bfca [dev.simd] cmd/compile: removed a map iteration from generator; tweaked type order
+ 2025-06-17 21d6573154 [dev.simd] cmd/compile: alphabetize SIMD intrinsics
+ 2025-06-16 ee1d9f3f85 [dev.simd] cmd/compile: reorder stubs
+ 2025-06-13 6c50c8b892 [dev.simd] cmd/compile: move simd helpers into compiler, out of generated code
+ 2025-06-13 7392dfd43e [dev.simd] cmd/compile: generated simd*ops files weren't up to date
+ 2025-06-13 00a8dacbe4 [dev.simd] cmd/compile: remove unused simd intrinsics "helpers"
+ 2025-06-13 b9a548775f cmd/compile: add up-to-date test for generated files
+ 2025-06-13 ca01eab9c7 [dev.simd] cmd/compile: add fused mul add sub ops
+ 2025-06-13 ded6e0ac71 [dev.simd] cmd/compile: add more dot products
+ 2025-06-13 3df41c856e [dev.simd] simd: update documentations
+ 2025-06-13 9ba7db36b5 [dev.simd] cmd/compile: add dot product ops
+ 2025-06-13 34a9cdef87 [dev.simd] cmd/compile: add round simd ops
+ 2025-06-13 5289e0f24e [dev.simd] cmd/compile: updates simd ordering and docs
+ 2025-06-13 c81cb05e3e [dev.simd] cmd/compile: add simdGen prog writer
+ 2025-06-13 9b9af3d638 [dev.simd] internal/cpu: add AVX-512-CD and DQ, and derived "basic AVX-512"
+ 2025-06-13 dfa6c74263 [dev.simd] runtime: eliminate global state in mkpreempt.go
+ 2025-06-10 b2e8ddba3c [dev.simd] all: merge master (773701a) into dev.simd
+ 2025-06-09 884f646966 [dev.simd] cmd/compile: add fp3m1fp1 shape to regalloc
+ 2025-06-09 6bc3505773 [dev.simd] cmd/compile: add fp3fp1 regsiter shape
+ 2025-06-05 2eaa5a0703 [dev.simd] simd: add functions+methods to load-from/store-to slices
+ 2025-06-05 8ecbd59ebb [dev.simd] cmd/compile: generated codes for amd64 SIMD
+ 2025-06-02 baa72c25f1 [dev.simd] all: merge master (711ff94) into dev.simd
+ 2025-05-30 0ff18a9cca [dev.simd] cmd/compile: disable intrinsics test for new simd stuff
+ 2025-05-30 7800f3813c [dev.simd] cmd/compile: flip sense of intrinsics test for SIMD
+ 2025-05-29 eba2430c16 [dev.simd] simd, cmd/compile, go build, go/doc: test tweaks
+ 2025-05-29 71c0e550cd [dev.simd] cmd/dist: disable API check on dev branch
+ 2025-05-29 62e1fccfb9 [dev.simd] internal: delete unused internal/simd directory
+ 2025-05-29 1161228bf1 [dev.simd] cmd/compile: add a fp1m1fp1 register shape to amd64
+ 2025-05-28 fdb067d946 [dev.simd] simd: initialize directory to make it suitable for testing SIMD
+ 2025-05-28 11d2b28bff [dev.simd] cmd/compile: add and fix k register supports
+ 2025-05-28 04b1030ae4 [dev.simd] cmd/compile: adapters for simd
+ 2025-05-27 2ef7106881 [dev.simd] internal/buildcfg: enable SIMD GOEXPERIMENT for amd64
+ 2025-05-22 4d2c71ebf9 [dev.simd] internal/goexperiment: add SIMD goexperiment
+ 2025-05-22 3ac5f2f962 [dev.simd] codereview.cfg: set up dev.simd branch

Change-Id: I60f2cd2ea055384a3788097738c6989630207871
Author: Cherry Mui
Date:   2025-11-24 16:02:01 -05:00
Commit: d4f5650cc5
186 changed files with 146299 additions and 835 deletions

View file

@ -150,12 +150,12 @@ func appendParamTypes(rts []*types.Type, t *types.Type) []*types.Type {
if w == 0 {
return rts
}
if t.IsScalar() || t.IsPtrShaped() {
if t.IsScalar() || t.IsPtrShaped() || t.IsSIMD() {
if t.IsComplex() {
c := types.FloatForComplex(t)
return append(rts, c, c)
} else {
if int(t.Size()) <= types.RegSize {
if int(t.Size()) <= types.RegSize || t.IsSIMD() {
return append(rts, t)
}
// assume 64bit int on 32-bit machine
@ -199,6 +199,9 @@ func appendParamOffsets(offsets []int64, at int64, t *types.Type) ([]int64, int6
if w == 0 {
return offsets, at
}
if t.IsSIMD() {
return append(offsets, at), at + w
}
if t.IsScalar() || t.IsPtrShaped() {
if t.IsComplex() || int(t.Size()) > types.RegSize { // complex and *int64 on 32-bit
s := w / 2
@ -521,11 +524,11 @@ func (state *assignState) allocateRegs(regs []RegIndex, t *types.Type) []RegInde
}
ri := state.rUsed.intRegs
rf := state.rUsed.floatRegs
if t.IsScalar() || t.IsPtrShaped() {
if t.IsScalar() || t.IsPtrShaped() || t.IsSIMD() {
if t.IsComplex() {
regs = append(regs, RegIndex(rf+state.rTotal.intRegs), RegIndex(rf+1+state.rTotal.intRegs))
rf += 2
} else if t.IsFloat() {
} else if t.IsFloat() || t.IsSIMD() {
regs = append(regs, RegIndex(rf+state.rTotal.intRegs))
rf += 1
} else {
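The hunks above make the amd64 ABI treat a SIMD vector as a single register-assigned unit instead of decomposing it into scalar fields. A minimal standalone sketch of that layout rule (not part of the diff; fakeType is an illustrative stand-in for the real *types.Type API):

package main

import "fmt"

// fakeType models only the two properties the sketch needs.
type fakeType struct {
	isSIMD bool
	size   int64
}

// appendParamOffsets mirrors the shape of the change above: a SIMD value
// contributes exactly one offset entry and advances by its full width,
// while other values are (crudely, for illustration) split into
// register-sized chunks.
func appendParamOffsets(offsets []int64, at int64, t fakeType) ([]int64, int64) {
	if t.isSIMD {
		return append(offsets, at), at + t.size
	}
	for off := int64(0); off < t.size; off += 8 {
		offsets = append(offsets, at+off)
	}
	return offsets, at + t.size
}

func main() {
	offs, at := appendParamOffsets(nil, 0, fakeType{isSIMD: true, size: 64}) // a 512-bit vector: one slot
	fmt.Println(offs, at) // [0] 64
	offs, at = appendParamOffsets(nil, 0, fakeType{size: 16}) // a 16-byte non-SIMD value: two slots
	fmt.Println(offs, at) // [0 8] 16
}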

File diff suppressed because it is too large.

View file

@ -18,6 +18,7 @@ import (
"cmd/internal/obj"
"cmd/internal/obj/x86"
"internal/abi"
"internal/buildcfg"
)
// ssaMarkMoves marks any MOVXconst ops that need to avoid clobbering flags.
@ -43,11 +44,23 @@ func ssaMarkMoves(s *ssagen.State, b *ssa.Block) {
}
}
// loadByType returns the load instruction of the given type.
func loadByType(t *types.Type) obj.As {
// Avoid partial register write
if !t.IsFloat() {
switch t.Size() {
func isFPReg(r int16) bool {
return x86.REG_X0 <= r && r <= x86.REG_Z31
}
func isKReg(r int16) bool {
return x86.REG_K0 <= r && r <= x86.REG_K7
}
func isLowFPReg(r int16) bool {
return x86.REG_X0 <= r && r <= x86.REG_X15
}
// loadByRegWidth returns the load instruction of the given register of a given width.
func loadByRegWidth(r int16, width int64) obj.As {
// Avoid partial register write for GPR
if !isFPReg(r) && !isKReg(r) {
switch width {
case 1:
return x86.AMOVBLZX
case 2:
@ -55,20 +68,35 @@ func loadByType(t *types.Type) obj.As {
}
}
// Otherwise, there's no difference between load and store opcodes.
return storeByType(t)
return storeByRegWidth(r, width)
}
// storeByType returns the store instruction of the given type.
func storeByType(t *types.Type) obj.As {
width := t.Size()
if t.IsFloat() {
// storeByRegWidth returns the store instruction of the given register of a given width.
// It's also used for loading const to a reg.
func storeByRegWidth(r int16, width int64) obj.As {
if isFPReg(r) {
switch width {
case 4:
return x86.AMOVSS
case 8:
return x86.AMOVSD
}
case 16:
// int128s are in SSE registers
if isLowFPReg(r) {
return x86.AMOVUPS
} else {
return x86.AVMOVDQU
}
case 32:
return x86.AVMOVDQU
case 64:
return x86.AVMOVDQU64
}
}
if isKReg(r) {
return x86.AKMOVQ
}
// gp
switch width {
case 1:
return x86.AMOVB
@ -78,23 +106,35 @@ func storeByType(t *types.Type) obj.As {
return x86.AMOVL
case 8:
return x86.AMOVQ
case 16:
return x86.AMOVUPS
}
}
panic(fmt.Sprintf("bad store type %v", t))
panic(fmt.Sprintf("bad store reg=%v, width=%d", r, width))
}
// moveByType returns the reg->reg move instruction of the given type.
func moveByType(t *types.Type) obj.As {
if t.IsFloat() {
// moveByRegsWidth returns the reg->reg move instruction of the given dest/src registers of a given width.
func moveByRegsWidth(dest, src int16, width int64) obj.As {
// fp -> fp
if isFPReg(dest) && isFPReg(src) {
// Moving the whole sse2 register is faster
// than moving just the correct low portion of it.
// There is no xmm->xmm move with 1 byte opcode,
// so use movups, which has 2 byte opcode.
if isLowFPReg(dest) && isLowFPReg(src) && width <= 16 {
return x86.AMOVUPS
} else {
switch t.Size() {
}
if width <= 32 {
return x86.AVMOVDQU
}
return x86.AVMOVDQU64
}
// k -> gp, gp -> k, k -> k
if isKReg(dest) || isKReg(src) {
if isFPReg(dest) || isFPReg(src) {
panic(fmt.Sprintf("bad move, src=%v, dest=%v, width=%d", src, dest, width))
}
return x86.AKMOVQ
}
// gp -> fp, fp -> gp, gp -> gp
switch width {
case 1:
// Avoids partial register write
return x86.AMOVL
@ -105,11 +145,18 @@ func moveByType(t *types.Type) obj.As {
case 8:
return x86.AMOVQ
case 16:
return x86.AMOVUPS // int128s are in SSE registers
default:
panic(fmt.Sprintf("bad int register width %d:%v", t.Size(), t))
if isLowFPReg(dest) && isLowFPReg(src) {
// int128s are in SSE registers
return x86.AMOVUPS
} else {
return x86.AVMOVDQU
}
case 32:
return x86.AVMOVDQU
case 64:
return x86.AVMOVDQU64
}
panic(fmt.Sprintf("bad move, src=%v, dest=%v, width=%d", src, dest, width))
}
// opregreg emits instructions for
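To make the new (register class, width) dispatch easier to eyeball, here is a standalone mirror of storeByRegWidth that returns mnemonic strings instead of obj.As opcodes (not part of the diff; the regClass values are illustrative stand-ins for the isFPReg/isKReg/isLowFPReg checks):

package main

import "fmt"

type regClass int

const (
	gpReg    regClass = iota // general-purpose register
	lowFPReg                 // X0..X15, reachable with legacy SSE encodings
	hiFPReg                  // X16..Z31, EVEX-only
	kReg                     // K0..K7 mask registers
)

// storeMnemonic mirrors the decision table of storeByRegWidth above;
// combinations not listed panic in the real compiler code.
func storeMnemonic(c regClass, width int64) string {
	switch c {
	case lowFPReg, hiFPReg:
		switch width {
		case 4:
			return "MOVSS"
		case 8:
			return "MOVSD"
		case 16:
			if c == lowFPReg {
				return "MOVUPS"
			}
			return "VMOVDQU"
		case 32:
			return "VMOVDQU"
		case 64:
			return "VMOVDQU64"
		}
	case kReg:
		return "KMOVQ"
	case gpReg:
		switch width {
		case 1:
			return "MOVB"
		case 2:
			return "MOVW"
		case 4:
			return "MOVL"
		case 8:
			return "MOVQ"
		}
	}
	return "bad store"
}

func main() {
	fmt.Println(storeMnemonic(gpReg, 8))     // MOVQ
	fmt.Println(storeMnemonic(lowFPReg, 16)) // MOVUPS
	fmt.Println(storeMnemonic(hiFPReg, 16))  // VMOVDQU
	fmt.Println(storeMnemonic(hiFPReg, 64))  // VMOVDQU64
	fmt.Println(storeMnemonic(kReg, 8))      // KMOVQ
}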
@ -605,7 +652,7 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
// But this requires a way for regalloc to know that SRC might be
// clobbered by this instruction.
t := v.RegTmp()
opregreg(s, moveByType(v.Type), t, v.Args[1].Reg())
opregreg(s, moveByRegsWidth(t, v.Args[1].Reg(), v.Type.Size()), t, v.Args[1].Reg())
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
@ -777,9 +824,14 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
p.From.Offset = v.AuxInt
p.To.Type = obj.TYPE_REG
p.To.Reg = x
case ssa.OpAMD64MOVSSconst, ssa.OpAMD64MOVSDconst:
x := v.Reg()
p := s.Prog(v.Op.Asm())
if !isFPReg(x) && v.AuxInt == 0 && v.Aux == nil {
opregreg(s, x86.AXORL, x, x)
break
}
p := s.Prog(storeByRegWidth(x, v.Type.Size()))
p.From.Type = obj.TYPE_FCONST
p.From.Val = math.Float64frombits(uint64(v.AuxInt))
p.To.Type = obj.TYPE_REG
@ -1176,27 +1228,39 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
}
x := v.Args[0].Reg()
y := v.Reg()
if v.Type.IsSIMD() {
x = simdOrMaskReg(v.Args[0])
y = simdOrMaskReg(v)
}
if x != y {
opregreg(s, moveByType(v.Type), y, x)
opregreg(s, moveByRegsWidth(y, x, v.Type.Size()), y, x)
}
case ssa.OpLoadReg:
if v.Type.IsFlags() {
v.Fatalf("load flags not implemented: %v", v.LongString())
return
}
p := s.Prog(loadByType(v.Type))
r := v.Reg()
p := s.Prog(loadByRegWidth(r, v.Type.Size()))
ssagen.AddrAuto(&p.From, v.Args[0])
p.To.Type = obj.TYPE_REG
p.To.Reg = v.Reg()
if v.Type.IsSIMD() {
r = simdOrMaskReg(v)
}
p.To.Reg = r
case ssa.OpStoreReg:
if v.Type.IsFlags() {
v.Fatalf("store flags not implemented: %v", v.LongString())
return
}
p := s.Prog(storeByType(v.Type))
r := v.Args[0].Reg()
if v.Type.IsSIMD() {
r = simdOrMaskReg(v.Args[0])
}
p := s.Prog(storeByRegWidth(r, v.Type.Size()))
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[0].Reg()
p.From.Reg = r
ssagen.AddrAuto(&p.To, v)
case ssa.OpAMD64LoweredHasCPUFeature:
p := s.Prog(x86.AMOVBLZX)
@ -1210,8 +1274,14 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
for _, ap := range v.Block.Func.RegArgs {
// Pass the spill/unspill information along to the assembler, offset by size of return PC pushed on stack.
addr := ssagen.SpillSlotAddr(ap, x86.REG_SP, v.Block.Func.Config.PtrSize)
reg := ap.Reg
t := ap.Type
sz := t.Size()
if t.IsSIMD() {
reg = simdRegBySize(reg, sz)
}
s.FuncInfo().AddSpill(
obj.RegSpill{Reg: ap.Reg, Addr: addr, Unspill: loadByType(ap.Type), Spill: storeByType(ap.Type)})
obj.RegSpill{Reg: reg, Addr: addr, Unspill: loadByRegWidth(reg, sz), Spill: storeByRegWidth(reg, sz)})
}
v.Block.Func.RegArgs = nil
ssagen.CheckArgReg(v)
@ -1227,7 +1297,7 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
case ssa.OpAMD64CALLstatic, ssa.OpAMD64CALLtail:
if s.ABI == obj.ABI0 && v.Aux.(*ssa.AuxCall).Fn.ABI() == obj.ABIInternal {
// zeroing X15 when entering ABIInternal from ABI0
opregreg(s, x86.AXORPS, x86.REG_X15, x86.REG_X15)
zeroX15(s)
// set G register from TLS
getgFromTLS(s, x86.REG_R14)
}
@ -1238,7 +1308,7 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
s.Call(v)
if s.ABI == obj.ABIInternal && v.Aux.(*ssa.AuxCall).Fn.ABI() == obj.ABI0 {
// zeroing X15 when entering ABIInternal from ABI0
opregreg(s, x86.AXORPS, x86.REG_X15, x86.REG_X15)
zeroX15(s)
// set G register from TLS
getgFromTLS(s, x86.REG_R14)
}
@ -1643,10 +1713,683 @@ func ssaGenValue(s *ssagen.State, v *ssa.Value) {
p.From.Offset = int64(x)
p.To.Type = obj.TYPE_REG
p.To.Reg = v.Reg()
// SIMD ops
case ssa.OpAMD64VZEROUPPER, ssa.OpAMD64VZEROALL:
s.Prog(v.Op.Asm())
case ssa.OpAMD64Zero128, ssa.OpAMD64Zero256, ssa.OpAMD64Zero512: // no code emitted
case ssa.OpAMD64VMOVSSf2v, ssa.OpAMD64VMOVSDf2v:
// These are for initializing the low 32/64 bits of a SIMD register from a "float".
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[0].Reg()
p.AddRestSourceReg(x86.REG_X15)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
case ssa.OpAMD64VMOVQload, ssa.OpAMD64VMOVDload,
ssa.OpAMD64VMOVSSload, ssa.OpAMD64VMOVSDload:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
case ssa.OpAMD64VMOVSSconst, ssa.OpAMD64VMOVSDconst:
// for loading constants directly into SIMD registers
x := simdReg(v)
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_FCONST
p.From.Val = math.Float64frombits(uint64(v.AuxInt))
p.To.Type = obj.TYPE_REG
p.To.Reg = x
case ssa.OpAMD64VMOVD, ssa.OpAMD64VMOVQ:
// These are for initializing the low 32/64 bits of a SIMD register from an "int".
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[0].Reg()
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
case ssa.OpAMD64VMOVDQUload128, ssa.OpAMD64VMOVDQUload256, ssa.OpAMD64VMOVDQUload512,
ssa.OpAMD64KMOVBload, ssa.OpAMD64KMOVWload, ssa.OpAMD64KMOVDload, ssa.OpAMD64KMOVQload:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdOrMaskReg(v)
case ssa.OpAMD64VMOVDQUstore128, ssa.OpAMD64VMOVDQUstore256, ssa.OpAMD64VMOVDQUstore512,
ssa.OpAMD64KMOVBstore, ssa.OpAMD64KMOVWstore, ssa.OpAMD64KMOVDstore, ssa.OpAMD64KMOVQstore:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdOrMaskReg(v.Args[1])
p.To.Type = obj.TYPE_MEM
p.To.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.To, v)
case ssa.OpAMD64VPMASK32load128, ssa.OpAMD64VPMASK64load128, ssa.OpAMD64VPMASK32load256, ssa.OpAMD64VPMASK64load256:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
p.AddRestSourceReg(simdReg(v.Args[1])) // masking simd reg
case ssa.OpAMD64VPMASK32store128, ssa.OpAMD64VPMASK64store128, ssa.OpAMD64VPMASK32store256, ssa.OpAMD64VPMASK64store256:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.To.Type = obj.TYPE_MEM
p.To.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.To, v)
p.AddRestSourceReg(simdReg(v.Args[1])) // masking simd reg
case ssa.OpAMD64VPMASK64load512, ssa.OpAMD64VPMASK32load512, ssa.OpAMD64VPMASK16load512, ssa.OpAMD64VPMASK8load512:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
p.AddRestSourceReg(v.Args[1].Reg()) // simd mask reg
x86.ParseSuffix(p, "Z") // must be zero if not in mask
case ssa.OpAMD64VPMASK64store512, ssa.OpAMD64VPMASK32store512, ssa.OpAMD64VPMASK16store512, ssa.OpAMD64VPMASK8store512:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.To.Type = obj.TYPE_MEM
p.To.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.To, v)
p.AddRestSourceReg(v.Args[1].Reg()) // simd mask reg
case ssa.OpAMD64VPMOVMToVec8x16,
ssa.OpAMD64VPMOVMToVec8x32,
ssa.OpAMD64VPMOVMToVec8x64,
ssa.OpAMD64VPMOVMToVec16x8,
ssa.OpAMD64VPMOVMToVec16x16,
ssa.OpAMD64VPMOVMToVec16x32,
ssa.OpAMD64VPMOVMToVec32x4,
ssa.OpAMD64VPMOVMToVec32x8,
ssa.OpAMD64VPMOVMToVec32x16,
ssa.OpAMD64VPMOVMToVec64x2,
ssa.OpAMD64VPMOVMToVec64x4,
ssa.OpAMD64VPMOVMToVec64x8:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[0].Reg()
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
case ssa.OpAMD64VPMOVVec8x16ToM,
ssa.OpAMD64VPMOVVec8x32ToM,
ssa.OpAMD64VPMOVVec8x64ToM,
ssa.OpAMD64VPMOVVec16x8ToM,
ssa.OpAMD64VPMOVVec16x16ToM,
ssa.OpAMD64VPMOVVec16x32ToM,
ssa.OpAMD64VPMOVVec32x4ToM,
ssa.OpAMD64VPMOVVec32x8ToM,
ssa.OpAMD64VPMOVVec32x16ToM,
ssa.OpAMD64VPMOVVec64x2ToM,
ssa.OpAMD64VPMOVVec64x4ToM,
ssa.OpAMD64VPMOVVec64x8ToM:
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[0])
p.To.Type = obj.TYPE_REG
p.To.Reg = v.Reg()
case ssa.OpAMD64KMOVQk, ssa.OpAMD64KMOVDk, ssa.OpAMD64KMOVWk, ssa.OpAMD64KMOVBk,
ssa.OpAMD64KMOVQi, ssa.OpAMD64KMOVDi, ssa.OpAMD64KMOVWi, ssa.OpAMD64KMOVBi:
// See also ssa.OpAMD64KMOVQload
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[0].Reg()
p.To.Type = obj.TYPE_REG
p.To.Reg = v.Reg()
case ssa.OpAMD64VPTEST:
// Some instructions setting flags put their second operand into the destination reg.
// See also CMP[BWDQ].
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[0])
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v.Args[1])
default:
if !ssaGenSIMDValue(s, v) {
v.Fatalf("genValue not implemented: %s", v.LongString())
}
}
}
// zeroX15 zeroes the X15 register.
func zeroX15(s *ssagen.State) {
vxorps := func(s *ssagen.State) {
p := s.Prog(x86.AVXORPS)
p.From.Type = obj.TYPE_REG
p.From.Reg = x86.REG_X15
p.AddRestSourceReg(x86.REG_X15)
p.To.Type = obj.TYPE_REG
p.To.Reg = x86.REG_X15
}
if buildcfg.GOAMD64 >= 3 {
vxorps(s)
return
}
// AVX may not be available, check before zeroing the high bits.
p := s.Prog(x86.ACMPB)
p.From.Type = obj.TYPE_MEM
p.From.Name = obj.NAME_EXTERN
p.From.Sym = ir.Syms.X86HasAVX
p.To.Type = obj.TYPE_CONST
p.To.Offset = 1
jmp := s.Prog(x86.AJNE)
jmp.To.Type = obj.TYPE_BRANCH
vxorps(s)
sse := opregreg(s, x86.AXORPS, x86.REG_X15, x86.REG_X15)
jmp.To.SetTarget(sse)
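// Note: when the AVX path is taken, execution falls through into the SSE
// XORPS below as well; that is harmless, since the legacy XORPS only
// rewrites the already-zeroed low 128 bits of X15.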
}
// Example instruction: VRSQRTPS X1, X1
func simdV11(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[0])
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPSUBD X1, X2, X3
func simdV21(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
// Vector register operands follow a right-to-left order.
// e.g. VPSUBD X1, X2, X3 means X3 = X2 - X1.
p.From.Reg = simdReg(v.Args[1])
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// This function accommodates the shifts: the 2nd arg is an XMM register,
// so its register is emitted via v.Args[1].Reg() rather than simdReg.
// Example instruction: VPSLLQ Z1, X1, Z2
func simdVfpv(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
// Vector register operands follow a right-to-left order.
// e.g. VPSUBD X1, X2, X3 means X3 = X2 - X1.
p.From.Reg = v.Args[1].Reg()
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPCMPEQW Z26, Z30, K4
func simdV2k(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[1])
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VPMINUQ X21, X3, K3, X31
func simdV2kv(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[1])
p.AddRestSourceReg(simdReg(v.Args[0]))
// These "simd*" functions assume that any "K" register serving as the
// write-mask or "predicate" of a predicated AVX512 instruction
// sits right at the end of the operand list.
// TODO: verify this assumption.
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPABSB X1, X2, K3 (masking merging)
func simdV2kvResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[1])
// These "simd*" functions assume that any "K" register serving as the
// write-mask or "predicate" of a predicated AVX512 instruction
// sits right at the end of the operand list.
// TODO: verify this assumption.
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// This function accommodates the shifts: the 2nd arg is an XMM register,
// so its register is emitted via v.Args[1].Reg() rather than simdReg.
// Example instruction: VPSLLQ Z1, X1, K1, Z2
func simdVfpkv(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = v.Args[1].Reg()
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPCMPEQW Z26, Z30, K1, K4
func simdV2kk(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[1])
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VPOPCNTB X14, K4, X16
func simdVkv(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[0])
p.AddRestSourceReg(maskReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VROUNDPD $7, X2, X2
func simdV11Imm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VREDUCEPD $126, X1, K3, X31
func simdVkvImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VCMPPS $7, X2, X9, X2
func simdV21Imm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPINSRB $3, DX, X0, X0
func simdVgpvImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(v.Args[1].Reg())
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPCMPD $1, Z1, Z2, K1
func simdV2kImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VPCMPD $1, Z1, Z2, K2, K1
func simdV2kkImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
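// Like simdV2kkImm8 above, but the result goes to a vector register rather than a mask register.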
func simdV2kvImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VFMADD213PD Z2, Z1, Z0
func simdV31ResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.AddRestSourceReg(simdReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
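// Like simdV31ResultInArg0, but with an imm8 operand (used by ops such as the
// VPTERNLOG[DQ] family; compare the load variant below).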
func simdV31ResultInArg0Imm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[2]))
p.AddRestSourceReg(simdReg(v.Args[1]))
// p.AddRestSourceReg(x86.REG_K0)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// simdV31loadResultInArg0Imm8
// Example op (an SSA value rather than a single instruction):
// (VPTERNLOGD128load {sym} [makeValAndOff(int32(int8(c)),off)] x y ptr mem)
func simdV31loadResultInArg0Imm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[2].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(simdReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VFMADD213PD Z2, Z1, K1, Z0
func simdV3kvResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(maskReg(v.Args[3]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
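// Like simdV11Imm8, but the result lands in a general-purpose register
// (e.g. the VPEXTR* element extracts).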
func simdVgpImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = v.Reg()
return p
}
// Currently unused
func simdV31(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Currently unused
func simdV3kv(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[2])
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[3]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VRCP14PS (DI), K6, X22
func simdVkvload(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(maskReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPSLLVD (DX), X7, X18
func simdV21load(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[1].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPDPWSSD (SI), X24, X18
func simdV31loadResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[2].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(simdReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPDPWSSD (SI), X24, K1, X18
func simdV3kvloadResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[2].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(simdReg(v.Args[1]))
p.AddRestSourceReg(maskReg(v.Args[3]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPSLLVD (SI), X1, K1, X2
func simdV2kvload(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[1].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPCMPEQD (SI), X1, K1
func simdV2kload(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[1].Reg()
ssagen.AddAux(&p.From, v)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VCVTTPS2DQ (BX), X2
func simdV11load(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_MEM
p.From.Reg = v.Args[0].Reg()
ssagen.AddAux(&p.From, v)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPSHUFD $7, (BX), X11
func simdV11loadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[0].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPRORD $81, -15(R14), K7, Y1
func simdVkvloadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[0].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(maskReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VPSHLDD $82, 7(SI), Y21, Y3
func simdV21loadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[1].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: VCMPPS $81, -7(DI), Y16, K3
func simdV2kloadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[1].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VCMPPS $81, -7(DI), Y16, K1, K3
func simdV2kkloadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[1].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = maskReg(v)
return p
}
// Example instruction: VGF2P8AFFINEINVQB $64, -17(BP), X31, K3, X26
func simdV2kvloadImm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
sc := v.AuxValAndOff()
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_CONST
p.From.Offset = sc.Val64()
m := obj.Addr{Type: obj.TYPE_MEM, Reg: v.Args[1].Reg()}
ssagen.AddAux2(&m, v, sc.Off64())
p.AddRestSource(m)
p.AddRestSourceReg(simdReg(v.Args[0]))
p.AddRestSourceReg(maskReg(v.Args[2]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: SHA1NEXTE X2, X2
func simdV21ResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Type = obj.TYPE_REG
p.From.Reg = simdReg(v.Args[1])
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: SHA1RNDS4 $1, X2, X2
func simdV21ResultInArg0Imm8(s *ssagen.State, v *ssa.Value) *obj.Prog {
p := s.Prog(v.Op.Asm())
p.From.Offset = int64(v.AuxUInt8())
p.From.Type = obj.TYPE_CONST
p.AddRestSourceReg(simdReg(v.Args[1]))
p.To.Type = obj.TYPE_REG
p.To.Reg = simdReg(v)
return p
}
// Example instruction: SHA256RNDS2 X0, X11, X2
func simdV31x0AtIn2ResultInArg0(s *ssagen.State, v *ssa.Value) *obj.Prog {
return simdV31ResultInArg0(s, v)
}
var blockJump = [...]struct {
asm, invasm obj.As
@ -1732,7 +2475,7 @@ func ssaGenBlock(s *ssagen.State, b, next *ssa.Block) {
}
func loadRegResult(s *ssagen.State, f *ssa.Func, t *types.Type, reg int16, n *ir.Name, off int64) *obj.Prog {
p := s.Prog(loadByType(t))
p := s.Prog(loadByRegWidth(reg, t.Size()))
p.From.Type = obj.TYPE_MEM
p.From.Name = obj.NAME_AUTO
p.From.Sym = n.Linksym()
@ -1743,7 +2486,7 @@ func loadRegResult(s *ssagen.State, f *ssa.Func, t *types.Type, reg int16, n *ir
}
func spillArgReg(pp *objw.Progs, p *obj.Prog, f *ssa.Func, t *types.Type, reg int16, n *ir.Name, off int64) *obj.Prog {
p = pp.Append(p, storeByType(t), obj.TYPE_REG, reg, 0, obj.TYPE_MEM, 0, n.FrameOffset()+off)
p = pp.Append(p, storeByRegWidth(reg, t.Size()), obj.TYPE_REG, reg, 0, obj.TYPE_MEM, 0, n.FrameOffset()+off)
p.To.Name = obj.NAME_PARAM
p.To.Sym = n.Linksym()
p.Pos = p.Pos.WithNotStmt()
@ -1778,3 +2521,58 @@ func move16(s *ssagen.State, src, dst, tmp int16, off int64) {
p.To.Reg = dst
p.To.Offset = off
}
// XXX maybe make this part of v.Reg?
// On the other hand, it is architecture-specific.
func simdReg(v *ssa.Value) int16 {
t := v.Type
if !t.IsSIMD() {
base.Fatalf("simdReg: not a simd type; v=%s, b=b%d, f=%s", v.LongString(), v.Block.ID, v.Block.Func.Name)
}
return simdRegBySize(v.Reg(), t.Size())
}
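// simdRegBySize maps an allocated X register to its Y or Z alias for the given
// size in bytes, e.g. X3 stays X3 at 16 bytes, becomes Y3 at 32 and Z3 at 64.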
func simdRegBySize(reg int16, size int64) int16 {
switch size {
case 16:
return reg
case 32:
return reg + (x86.REG_Y0 - x86.REG_X0)
case 64:
return reg + (x86.REG_Z0 - x86.REG_X0)
}
panic("simdRegBySize: bad size")
}
// XXX k mask
func maskReg(v *ssa.Value) int16 {
t := v.Type
if !t.IsSIMD() {
base.Fatalf("maskReg: not a simd type; v=%s, b=b%d, f=%s", v.LongString(), v.Block.ID, v.Block.Func.Name)
}
switch t.Size() {
case 8:
return v.Reg()
}
panic("unreachable")
}
// XXX k mask + vec
func simdOrMaskReg(v *ssa.Value) int16 {
t := v.Type
if t.Size() <= 8 {
return maskReg(v)
}
return simdReg(v)
}
// XXX this is used for shift operations only.
// regalloc will issue OpCopy with incorrect type, but the assigned
// register should be correct, and this function is merely checking
// the sanity of this part.
func simdCheckRegOnly(v *ssa.Value, regStart, regEnd int16) int16 {
if v.Reg() > regEnd || v.Reg() < regStart {
panic("simdCheckRegOnly: not the desired register")
}
return v.Reg()
}

View file

@ -29,7 +29,7 @@ var (
compilequeue []*ir.Func // functions waiting to be compiled
)
func enqueueFunc(fn *ir.Func) {
func enqueueFunc(fn *ir.Func, symABIs *ssagen.SymABIs) {
if ir.CurFunc != nil {
base.FatalfAt(fn.Pos(), "enqueueFunc %v inside %v", fn, ir.CurFunc)
}
@ -49,6 +49,13 @@ func enqueueFunc(fn *ir.Func) {
}
if len(fn.Body) == 0 {
if ir.IsIntrinsicSym(fn.Sym()) && fn.Sym().Linkname == "" && !symABIs.HasDef(fn.Sym()) {
// Generate the function body for a bodyless intrinsic, in case it
// is used in a non-call context (e.g. as a function pointer).
// We skip functions defined in assembly or having a linkname (which
// could be defined in another package).
ssagen.GenIntrinsicBody(fn)
} else {
// Initialize ABI wrappers if necessary.
ir.InitLSym(fn, false)
types.CalcSize(fn.Type())
@ -66,6 +73,7 @@ func enqueueFunc(fn *ir.Func) {
}
return
}
}
errorsBefore := base.Errors()
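The GenIntrinsicBody path above exists because an intrinsic can be used in a non-call context, where the inlined intrinsic form is not available. A user-level sketch of that situation (not part of the diff), using math/bits.OnesCount64, an existing compiler intrinsic; unlike the bodyless simd intrinsics it already has a written Go body, but the func-value form has to go through a real body either way:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Direct call: the compiler replaces this with the intrinsic instruction sequence.
	fmt.Println(bits.OnesCount64(0b1011)) // 3

	// Non-call context: taking the function as a value needs an actual body to
	// point at, which is what GenIntrinsicBody materializes for bodyless intrinsics.
	f := bits.OnesCount64
	fmt.Println(f(0b1011)) // 3
}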

View file

@ -188,9 +188,9 @@ func Main(archInit func(*ssagen.ArchInfo)) {
ir.EscFmt = escape.Fmt
ir.IsIntrinsicCall = ssagen.IsIntrinsicCall
ir.IsIntrinsicSym = ssagen.IsIntrinsicSym
inline.SSADumpInline = ssagen.DumpInline
ssagen.InitEnv()
ssagen.InitTables()
types.PtrSize = ssagen.Arch.LinkArch.PtrSize
types.RegSize = ssagen.Arch.LinkArch.RegSize
@ -204,6 +204,11 @@ func Main(archInit func(*ssagen.ArchInfo)) {
typecheck.InitRuntime()
rttype.Init()
// Some intrinsics (notably, the simd intrinsics) mention
// types "eagerly", thus ssagen must be initialized AFTER
// the type system is ready.
ssagen.InitTables()
// Parse and typecheck input.
noder.LoadPackage(flag.Args())
@ -309,7 +314,7 @@ func Main(archInit func(*ssagen.ArchInfo)) {
}
if nextFunc < len(typecheck.Target.Funcs) {
enqueueFunc(typecheck.Target.Funcs[nextFunc])
enqueueFunc(typecheck.Target.Funcs[nextFunc], symABIs)
nextFunc++
continue
}

View file

@ -179,6 +179,25 @@ func CanInlineFuncs(funcs []*ir.Func, profile *pgoir.Profile) {
})
}
func simdCreditMultiplier(fn *ir.Func) int32 {
for _, field := range fn.Type().RecvParamsResults() {
if field.Type.IsSIMD() {
return 3
}
}
// Sometimes code uses closures that do not take simd
// parameters to perform repetitive SIMD operations.
// These really need to be inlined, or the anticipated
// awesome SIMD performance will be missed.
for _, v := range fn.ClosureVars {
if v.Type().IsSIMD() {
return 11 // 11 ought to be enough.
}
}
return 1
}
// inlineBudget determines the max budget for function 'fn' prior to
// analyzing the hairiness of the body of 'fn'. We pass in the pgo
// profile if available (which can change the budget), also a
@ -186,9 +205,14 @@ func CanInlineFuncs(funcs []*ir.Func, profile *pgoir.Profile) {
// possibility that a call to the function might have its score
// adjusted downwards. If 'verbose' is set, then print a remark where
// we boost the budget due to PGO.
// Note that inlineCostOk has the final say on whether an inline will
// happen; changes here merely make inlines possible.
func inlineBudget(fn *ir.Func, profile *pgoir.Profile, relaxed bool, verbose bool) int32 {
// Update the budget for profile-guided inlining.
budget := int32(inlineMaxBudget)
budget *= simdCreditMultiplier(fn)
if IsPgoHotFunc(fn, profile) {
budget = inlineHotMaxBudget
if verbose {
@ -202,6 +226,7 @@ func inlineBudget(fn *ir.Func, profile *pgoir.Profile, relaxed bool, verbose boo
// be very liberal here, if the closure is only called once, the budget is large
budget = max(budget, inlineClosureCalledOnceCost)
}
return budget
}
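For a sense of scale, a standalone sketch of the resulting budgets (not part of the diff), assuming the usual inlineMaxBudget of 80; that constant lives elsewhere in this package and its value here is an assumption:

package main

import "fmt"

func main() {
	const inlineMaxBudget = 80 // assumed value of the package-level constant
	fmt.Println("no SIMD involvement:   ", inlineMaxBudget*1)  // 80
	fmt.Println("SIMD params or results:", inlineMaxBudget*3)  // 240
	fmt.Println("closure capturing SIMD:", inlineMaxBudget*11) // 880
}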
@ -263,6 +288,7 @@ func CanInline(fn *ir.Func, profile *pgoir.Profile) {
visitor := hairyVisitor{
curFunc: fn,
debug: isDebugFn(fn),
isBigFunc: IsBigFunc(fn),
budget: budget,
maxBudget: budget,
@ -407,6 +433,7 @@ type hairyVisitor struct {
// This is needed to access the current caller in the doNode function.
curFunc *ir.Func
isBigFunc bool
debug bool
budget int32
maxBudget int32
reason string
@ -416,6 +443,16 @@ type hairyVisitor struct {
profile *pgoir.Profile
}
func isDebugFn(fn *ir.Func) bool {
// if n := fn.Nname; n != nil {
// if n.Sym().Name == "Int32x8.Transpose8" && n.Sym().Pkg.Path == "simd" {
// fmt.Printf("isDebugFn '%s' DOT '%s'\n", n.Sym().Pkg.Path, n.Sym().Name)
// return true
// }
// }
return false
}
func (v *hairyVisitor) tooHairy(fn *ir.Func) bool {
v.do = v.doNode // cache closure
if ir.DoChildren(fn, v.do) {
@ -434,6 +471,9 @@ func (v *hairyVisitor) doNode(n ir.Node) bool {
if n == nil {
return false
}
if v.debug {
fmt.Printf("%v: doNode %v budget is %d\n", ir.Line(n), n.Op(), v.budget)
}
opSwitch:
switch n.Op() {
// Call is okay if inlinable and we have the budget for the body.
@ -551,12 +591,19 @@ opSwitch:
}
if cheap {
if v.debug {
if ir.IsIntrinsicCall(n) {
fmt.Printf("%v: cheap call is also intrinsic, %v\n", ir.Line(n), n)
}
}
break // treat like any other node, that is, cost of 1
}
if ir.IsIntrinsicCall(n) {
// Treat like any other node.
break
if v.debug {
fmt.Printf("%v: intrinsic call, %v\n", ir.Line(n), n)
}
break // Treat like any other node.
}
if callee := inlCallee(v.curFunc, n.Fun, v.profile, false); callee != nil && typecheck.HaveInlineBody(callee) {
@ -583,6 +630,10 @@ opSwitch:
}
}
if v.debug {
fmt.Printf("%v: costly OCALLFUNC %v\n", ir.Line(n), n)
}
// Call cost for non-leaf inlining.
v.budget -= extraCost
@ -592,6 +643,9 @@ opSwitch:
// Things that are too hairy, irrespective of the budget
case ir.OCALL, ir.OCALLINTER:
// Call cost for non-leaf inlining.
if v.debug {
fmt.Printf("%v: costly OCALL %v\n", ir.Line(n), n)
}
v.budget -= v.extraCallCost
case ir.OPANIC:
@ -754,7 +808,7 @@ opSwitch:
v.budget--
// When debugging, don't stop early, to get full cost of inlining this function
if v.budget < 0 && base.Flag.LowerM < 2 && !logopt.Enabled() {
if v.budget < 0 && base.Flag.LowerM < 2 && !logopt.Enabled() && !v.debug {
v.reason = "too expensive"
return true
}
@ -914,6 +968,8 @@ func inlineCostOK(n *ir.CallExpr, caller, callee *ir.Func, bigCaller, closureCal
maxCost = inlineBigFunctionMaxCost
}
simdMaxCost := simdCreditMultiplier(callee) * maxCost
if callee.ClosureParent != nil {
maxCost *= 2 // favor inlining closures
if closureCalledOnce { // really favor inlining the one call to this closure
@ -921,6 +977,8 @@ func inlineCostOK(n *ir.CallExpr, caller, callee *ir.Func, bigCaller, closureCal
}
}
maxCost = max(maxCost, simdMaxCost)
metric := callee.Inl.Cost
if inlheur.Enabled() {
score, ok := inlheur.GetCallSiteScore(caller, n)


@ -1031,6 +1031,9 @@ func StaticCalleeName(n Node) *Name {
// IsIntrinsicCall reports whether the compiler back end will treat the call as an intrinsic operation.
var IsIntrinsicCall = func(*CallExpr) bool { return false }
// IsIntrinsicSym reports whether the compiler back end will treat a call to this symbol as an intrinsic operation.
var IsIntrinsicSym = func(*types.Sym) bool { return false }
// SameSafeExpr checks whether it is safe to reuse one of l and r
// instead of computing both. SameSafeExpr assumes that l and r are
// used in the same statement or expression. In order for it to be
@ -1149,6 +1152,14 @@ func ParamNames(ft *types.Type) []Node {
return args
}
func RecvParamNames(ft *types.Type) []Node {
args := make([]Node, ft.NumRecvs()+ft.NumParams())
for i, f := range ft.RecvParams() {
args[i] = f.Nname.(*Name)
}
return args
}
// MethodSym returns the method symbol representing a method name
// associated with a specific receiver type.
//


@ -53,6 +53,7 @@ type symsStruct struct {
PanicdottypeI *obj.LSym
Panicnildottype *obj.LSym
Panicoverflow *obj.LSym
PanicSimdImm *obj.LSym
Racefuncenter *obj.LSym
Racefuncexit *obj.LSym
Raceread *obj.LSym
@ -76,6 +77,7 @@ type symsStruct struct {
Loong64HasLAM_BH *obj.LSym
Loong64HasLSX *obj.LSym
RISCV64HasZbb *obj.LSym
X86HasAVX *obj.LSym
X86HasFMA *obj.LSym
X86HasPOPCNT *obj.LSym
X86HasSSE41 *obj.LSym


@ -1534,6 +1534,9 @@ func isfat(t *types.Type) bool {
}
return true
case types.TSTRUCT:
if t.IsSIMD() {
return false
}
// Struct with 1 field, check if field is fat
if t.NumFields() == 1 {
return isfat(t.Field(0).Type)


@ -1657,3 +1657,171 @@
// If we don't use the flags any more, just use the standard op.
(Select0 a:(ADD(Q|L)constflags [c] x)) && a.Uses == 1 => (ADD(Q|L)const [c] x)
// SIMD lowering rules
// Mask conversions
// integers to masks
(Cvt16toMask8x16 <t> x) => (VPMOVMToVec8x16 <types.TypeVec128> (KMOVWk <t> x))
(Cvt32toMask8x32 <t> x) => (VPMOVMToVec8x32 <types.TypeVec256> (KMOVDk <t> x))
(Cvt64toMask8x64 <t> x) => (VPMOVMToVec8x64 <types.TypeVec512> (KMOVQk <t> x))
(Cvt8toMask16x8 <t> x) => (VPMOVMToVec16x8 <types.TypeVec128> (KMOVBk <t> x))
(Cvt16toMask16x16 <t> x) => (VPMOVMToVec16x16 <types.TypeVec256> (KMOVWk <t> x))
(Cvt32toMask16x32 <t> x) => (VPMOVMToVec16x32 <types.TypeVec512> (KMOVDk <t> x))
(Cvt8toMask32x4 <t> x) => (VPMOVMToVec32x4 <types.TypeVec128> (KMOVBk <t> x))
(Cvt8toMask32x8 <t> x) => (VPMOVMToVec32x8 <types.TypeVec256> (KMOVBk <t> x))
(Cvt16toMask32x16 <t> x) => (VPMOVMToVec32x16 <types.TypeVec512> (KMOVWk <t> x))
(Cvt8toMask64x2 <t> x) => (VPMOVMToVec64x2 <types.TypeVec128> (KMOVBk <t> x))
(Cvt8toMask64x4 <t> x) => (VPMOVMToVec64x4 <types.TypeVec256> (KMOVBk <t> x))
(Cvt8toMask64x8 <t> x) => (VPMOVMToVec64x8 <types.TypeVec512> (KMOVBk <t> x))
// masks to integers
(CvtMask8x16to16 <t> x) => (KMOVWi <t> (VPMOVVec8x16ToM <types.TypeMask> x))
(CvtMask8x32to32 <t> x) => (KMOVDi <t> (VPMOVVec8x32ToM <types.TypeMask> x))
(CvtMask8x64to64 <t> x) => (KMOVQi <t> (VPMOVVec8x64ToM <types.TypeMask> x))
(CvtMask16x8to8 <t> x) => (KMOVBi <t> (VPMOVVec16x8ToM <types.TypeMask> x))
(CvtMask16x16to16 <t> x) => (KMOVWi <t> (VPMOVVec16x16ToM <types.TypeMask> x))
(CvtMask16x32to32 <t> x) => (KMOVDi <t> (VPMOVVec16x32ToM <types.TypeMask> x))
(CvtMask32x4to8 <t> x) => (KMOVBi <t> (VPMOVVec32x4ToM <types.TypeMask> x))
(CvtMask32x8to8 <t> x) => (KMOVBi <t> (VPMOVVec32x8ToM <types.TypeMask> x))
(CvtMask32x16to16 <t> x) => (KMOVWi <t> (VPMOVVec32x16ToM <types.TypeMask> x))
(CvtMask64x2to8 <t> x) => (KMOVBi <t> (VPMOVVec64x2ToM <types.TypeMask> x))
(CvtMask64x4to8 <t> x) => (KMOVBi <t> (VPMOVVec64x4ToM <types.TypeMask> x))
(CvtMask64x8to8 <t> x) => (KMOVBi <t> (VPMOVVec64x8ToM <types.TypeMask> x))
// optimizations
(MOVBstore [off] {sym} ptr (KMOVBi mask) mem) => (KMOVBstore [off] {sym} ptr mask mem)
(MOVWstore [off] {sym} ptr (KMOVWi mask) mem) => (KMOVWstore [off] {sym} ptr mask mem)
(MOVLstore [off] {sym} ptr (KMOVDi mask) mem) => (KMOVDstore [off] {sym} ptr mask mem)
(MOVQstore [off] {sym} ptr (KMOVQi mask) mem) => (KMOVQstore [off] {sym} ptr mask mem)
(KMOVBk l:(MOVBload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l) => (KMOVBload [off] {sym} ptr mem)
(KMOVWk l:(MOVWload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l) => (KMOVWload [off] {sym} ptr mem)
(KMOVDk l:(MOVLload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l) => (KMOVDload [off] {sym} ptr mem)
(KMOVQk l:(MOVQload [off] {sym} ptr mem)) && canMergeLoad(v, l) && clobber(l) => (KMOVQload [off] {sym} ptr mem)
// SIMD vector loads and stores
(Load <t> ptr mem) && t.Size() == 16 => (VMOVDQUload128 ptr mem)
(Store {t} ptr val mem) && t.Size() == 16 => (VMOVDQUstore128 ptr val mem)
(Load <t> ptr mem) && t.Size() == 32 => (VMOVDQUload256 ptr mem)
(Store {t} ptr val mem) && t.Size() == 32 => (VMOVDQUstore256 ptr val mem)
(Load <t> ptr mem) && t.Size() == 64 => (VMOVDQUload512 ptr mem)
(Store {t} ptr val mem) && t.Size() == 64 => (VMOVDQUstore512 ptr val mem)
// SIMD vector integer-vector-masked loads and stores.
(LoadMasked32 <t> ptr mask mem) && t.Size() == 16 => (VPMASK32load128 ptr mask mem)
(LoadMasked32 <t> ptr mask mem) && t.Size() == 32 => (VPMASK32load256 ptr mask mem)
(LoadMasked64 <t> ptr mask mem) && t.Size() == 16 => (VPMASK64load128 ptr mask mem)
(LoadMasked64 <t> ptr mask mem) && t.Size() == 32 => (VPMASK64load256 ptr mask mem)
(StoreMasked32 {t} ptr mask val mem) && t.Size() == 16 => (VPMASK32store128 ptr mask val mem)
(StoreMasked32 {t} ptr mask val mem) && t.Size() == 32 => (VPMASK32store256 ptr mask val mem)
(StoreMasked64 {t} ptr mask val mem) && t.Size() == 16 => (VPMASK64store128 ptr mask val mem)
(StoreMasked64 {t} ptr mask val mem) && t.Size() == 32 => (VPMASK64store256 ptr mask val mem)
// Misc
(IsZeroVec x) => (SETEQ (VPTEST x x))
// SIMD vector K-masked loads and stores
(LoadMasked64 <t> ptr mask mem) && t.Size() == 64 => (VPMASK64load512 ptr (VPMOVVec64x8ToM <types.TypeMask> mask) mem)
(LoadMasked32 <t> ptr mask mem) && t.Size() == 64 => (VPMASK32load512 ptr (VPMOVVec32x16ToM <types.TypeMask> mask) mem)
(LoadMasked16 <t> ptr mask mem) && t.Size() == 64 => (VPMASK16load512 ptr (VPMOVVec16x32ToM <types.TypeMask> mask) mem)
(LoadMasked8 <t> ptr mask mem) && t.Size() == 64 => (VPMASK8load512 ptr (VPMOVVec8x64ToM <types.TypeMask> mask) mem)
(StoreMasked64 {t} ptr mask val mem) && t.Size() == 64 => (VPMASK64store512 ptr (VPMOVVec64x8ToM <types.TypeMask> mask) val mem)
(StoreMasked32 {t} ptr mask val mem) && t.Size() == 64 => (VPMASK32store512 ptr (VPMOVVec32x16ToM <types.TypeMask> mask) val mem)
(StoreMasked16 {t} ptr mask val mem) && t.Size() == 64 => (VPMASK16store512 ptr (VPMOVVec16x32ToM <types.TypeMask> mask) val mem)
(StoreMasked8 {t} ptr mask val mem) && t.Size() == 64 => (VPMASK8store512 ptr (VPMOVVec8x64ToM <types.TypeMask> mask) val mem)
(ZeroSIMD <t>) && t.Size() == 16 => (Zero128 <t>)
(ZeroSIMD <t>) && t.Size() == 32 => (Zero256 <t>)
(ZeroSIMD <t>) && t.Size() == 64 => (Zero512 <t>)
(VPMOVVec8x16ToM (VPMOVMToVec8x16 x)) => x
(VPMOVVec8x32ToM (VPMOVMToVec8x32 x)) => x
(VPMOVVec8x64ToM (VPMOVMToVec8x64 x)) => x
(VPMOVVec16x8ToM (VPMOVMToVec16x8 x)) => x
(VPMOVVec16x16ToM (VPMOVMToVec16x16 x)) => x
(VPMOVVec16x32ToM (VPMOVMToVec16x32 x)) => x
(VPMOVVec32x4ToM (VPMOVMToVec32x4 x)) => x
(VPMOVVec32x8ToM (VPMOVMToVec32x8 x)) => x
(VPMOVVec32x16ToM (VPMOVMToVec32x16 x)) => x
(VPMOVVec64x2ToM (VPMOVMToVec64x2 x)) => x
(VPMOVVec64x4ToM (VPMOVMToVec64x4 x)) => x
(VPMOVVec64x8ToM (VPMOVMToVec64x8 x)) => x
(VPANDQ512 x (VPMOVMToVec64x8 k)) => (VMOVDQU64Masked512 x k)
(VPANDQ512 x (VPMOVMToVec32x16 k)) => (VMOVDQU32Masked512 x k)
(VPANDQ512 x (VPMOVMToVec16x32 k)) => (VMOVDQU16Masked512 x k)
(VPANDQ512 x (VPMOVMToVec8x64 k)) => (VMOVDQU8Masked512 x k)
(VPANDD512 x (VPMOVMToVec64x8 k)) => (VMOVDQU64Masked512 x k)
(VPANDD512 x (VPMOVMToVec32x16 k)) => (VMOVDQU32Masked512 x k)
(VPANDD512 x (VPMOVMToVec16x32 k)) => (VMOVDQU16Masked512 x k)
(VPANDD512 x (VPMOVMToVec8x64 k)) => (VMOVDQU8Masked512 x k)
(VPAND128 x (VPMOVMToVec8x16 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU8Masked128 x k)
(VPAND128 x (VPMOVMToVec16x8 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU16Masked128 x k)
(VPAND128 x (VPMOVMToVec32x4 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU32Masked128 x k)
(VPAND128 x (VPMOVMToVec64x2 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU64Masked128 x k)
(VPAND256 x (VPMOVMToVec8x32 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU8Masked256 x k)
(VPAND256 x (VPMOVMToVec16x16 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU16Masked256 x k)
(VPAND256 x (VPMOVMToVec32x8 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU32Masked256 x k)
(VPAND256 x (VPMOVMToVec64x4 k)) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (VMOVDQU64Masked256 x k)
// Inserting a 32/64-bit float or int into lane 0 of a zero vector is just MOVS[SD] (floats) or MOV[DQ] (ints)
(VPINSRQ128 [0] (Zero128 <t>) y) && y.Type.IsFloat() => (VMOVSDf2v <types.TypeVec128> y)
(VPINSRD128 [0] (Zero128 <t>) y) && y.Type.IsFloat() => (VMOVSSf2v <types.TypeVec128> y)
(VPINSRQ128 [0] (Zero128 <t>) y) && !y.Type.IsFloat() => (VMOVQ <types.TypeVec128> y)
(VPINSRD128 [0] (Zero128 <t>) y) && !y.Type.IsFloat() => (VMOVD <types.TypeVec128> y)
// These rewrites can skip zero-extending the 8/16-bit inputs because they are
// only used as the input to a broadcast; the potentially "bad" bits are ignored
(VPBROADCASTB(128|256|512) x:(VPINSRB128 [0] (Zero128 <t>) y)) && x.Uses == 1 =>
(VPBROADCASTB(128|256|512) (VMOVQ <types.TypeVec128> y))
(VPBROADCASTW(128|256|512) x:(VPINSRW128 [0] (Zero128 <t>) y)) && x.Uses == 1 =>
(VPBROADCASTW(128|256|512) (VMOVQ <types.TypeVec128> y))
(VMOVQ x:(MOVQload [off] {sym} ptr mem)) && x.Uses == 1 && clobber(x) => @x.Block (VMOVQload <v.Type> [off] {sym} ptr mem)
(VMOVD x:(MOVLload [off] {sym} ptr mem)) && x.Uses == 1 && clobber(x) => @x.Block (VMOVDload <v.Type> [off] {sym} ptr mem)
(VMOVSDf2v x:(MOVSDload [off] {sym} ptr mem)) && x.Uses == 1 && clobber(x) => @x.Block (VMOVSDload <v.Type> [off] {sym} ptr mem)
(VMOVSSf2v x:(MOVSSload [off] {sym} ptr mem)) && x.Uses == 1 && clobber(x) => @x.Block (VMOVSSload <v.Type> [off] {sym} ptr mem)
(VMOVSDf2v x:(MOVSDconst [c] )) => (VMOVSDconst [c] )
(VMOVSSf2v x:(MOVSSconst [c] )) => (VMOVSSconst [c] )
(VMOVDQUload(128|256|512) [off1] {sym} x:(ADDQconst [off2] ptr) mem) && is32Bit(int64(off1)+int64(off2)) => (VMOVDQUload(128|256|512) [off1+off2] {sym} ptr mem)
(VMOVDQUstore(128|256|512) [off1] {sym} x:(ADDQconst [off2] ptr) val mem) && is32Bit(int64(off1)+int64(off2)) => (VMOVDQUstore(128|256|512) [off1+off2] {sym} ptr val mem)
(VMOVDQUload(128|256|512) [off1] {sym1} x:(LEAQ [off2] {sym2} base) mem) && is32Bit(int64(off1)+int64(off2)) && canMergeSym(sym1, sym2) => (VMOVDQUload(128|256|512) [off1+off2] {mergeSym(sym1, sym2)} base mem)
(VMOVDQUstore(128|256|512) [off1] {sym1} x:(LEAQ [off2] {sym2} base) val mem) && is32Bit(int64(off1)+int64(off2)) && canMergeSym(sym1, sym2) => (VMOVDQUstore(128|256|512) [off1+off2] {mergeSym(sym1, sym2)} base val mem)
// 2-op VPTEST optimizations
(SETEQ (VPTEST x:(VPAND(128|256) j k) y)) && x == y && x.Uses == 2 => (SETEQ (VPTEST j k))
(SETEQ (VPTEST x:(VPAND(D|Q)512 j k) y)) && x == y && x.Uses == 2 => (SETEQ (VPTEST j k))
(SETEQ (VPTEST x:(VPANDN(128|256) j k) y)) && x == y && x.Uses == 2 => (SETB (VPTEST k j)) // AndNot has swapped its operand order
(SETEQ (VPTEST x:(VPANDN(D|Q)512 j k) y)) && x == y && x.Uses == 2 => (SETB (VPTEST k j)) // AndNot has swapped its operand order
(EQ (VPTEST x:(VPAND(128|256) j k) y) yes no) && x == y && x.Uses == 2 => (EQ (VPTEST j k) yes no)
(EQ (VPTEST x:(VPAND(D|Q)512 j k) y) yes no) && x == y && x.Uses == 2 => (EQ (VPTEST j k) yes no)
(EQ (VPTEST x:(VPANDN(128|256) j k) y) yes no) && x == y && x.Uses == 2 => (ULT (VPTEST k j) yes no) // AndNot has swapped its operand order
(EQ (VPTEST x:(VPANDN(D|Q)512 j k) y) yes no) && x == y && x.Uses == 2 => (ULT (VPTEST k j) yes no) // AndNot has swapped its operand order
// DotProductQuadruple optimizations
(VPADDD128 (VPDPBUSD128 (Zero128 <t>) x y) z) => (VPDPBUSD128 <t> z x y)
(VPADDD256 (VPDPBUSD256 (Zero256 <t>) x y) z) => (VPDPBUSD256 <t> z x y)
(VPADDD512 (VPDPBUSD512 (Zero512 <t>) x y) z) => (VPDPBUSD512 <t> z x y)
(VPADDD128 (VPDPBUSDS128 (Zero128 <t>) x y) z) => (VPDPBUSDS128 <t> z x y)
(VPADDD256 (VPDPBUSDS256 (Zero256 <t>) x y) z) => (VPDPBUSDS256 <t> z x y)
(VPADDD512 (VPDPBUSDS512 (Zero512 <t>) x y) z) => (VPDPBUSDS512 <t> z x y)
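
The DotProductQuadruple reassociations above rely on VPDPBUSD[S] being an accumulate operation, so a dot product over a zero accumulator followed by a vector add is the same as using that addend as the accumulator. A scalar model of one 32-bit lane, for illustration only (saturation in the S variant is ignored):

	// One lane of VPDPBUSD: acc += sum of 4 unsigned×signed byte products.
	func dpbusdLane(acc int32, a [4]uint8, b [4]int8) int32 {
		for i := 0; i < 4; i++ {
			acc += int32(a[i]) * int32(b[i])
		}
		return acc
	}

	// dpbusdLane(0, a, b) + z == dpbusdLane(z, a, b), which is the rewrite above.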


@ -62,7 +62,33 @@ var regNamesAMD64 = []string{
"X13",
"X14",
"X15", // constant 0 in ABIInternal
"X16",
"X17",
"X18",
"X19",
"X20",
"X21",
"X22",
"X23",
"X24",
"X25",
"X26",
"X27",
"X28",
"X29",
"X30",
"X31",
// TODO: update asyncPreempt for K registers.
// asyncPreempt also needs to store Z0-Z15 properly.
"K0",
"K1",
"K2",
"K3",
"K4",
"K5",
"K6",
"K7",
// If you add registers, update asyncPreempt in runtime
// pseudo-registers
@ -98,16 +124,28 @@ func init() {
gp = buildReg("AX CX DX BX BP SI DI R8 R9 R10 R11 R12 R13 R15")
g = buildReg("g")
fp = buildReg("X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14")
v = buildReg("X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14")
w = buildReg("X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31")
x15 = buildReg("X15")
mask = buildReg("K1 K2 K3 K4 K5 K6 K7")
gpsp = gp | buildReg("SP")
gpspsb = gpsp | buildReg("SB")
gpspsbg = gpspsb | g
callerSave = gp | fp | g // runtime.setg (and anything calling it) may clobber g
vz = v | x15
wz = w | x15
x0 = buildReg("X0")
)
// Common slices of register masks
var (
gponly = []regMask{gp}
fponly = []regMask{fp}
vonly = []regMask{v}
wonly = []regMask{w}
maskonly = []regMask{mask}
vzonly = []regMask{vz}
wzonly = []regMask{wz}
)
// Common regInfo
@ -170,6 +208,67 @@ func init() {
fpstore = regInfo{inputs: []regMask{gpspsb, fp, 0}}
fpstoreidx = regInfo{inputs: []regMask{gpspsb, gpsp, fp, 0}}
// masked loads/stores, vector register or mask register
vloadv = regInfo{inputs: []regMask{gpspsb, v, 0}, outputs: vonly}
vstorev = regInfo{inputs: []regMask{gpspsb, v, v, 0}}
vloadk = regInfo{inputs: []regMask{gpspsb, mask, 0}, outputs: vonly}
vstorek = regInfo{inputs: []regMask{gpspsb, mask, v, 0}}
v11 = regInfo{inputs: vonly, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
v21 = regInfo{inputs: []regMask{v, vz}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
vk = regInfo{inputs: vzonly, outputs: maskonly}
kv = regInfo{inputs: maskonly, outputs: vonly}
v2k = regInfo{inputs: []regMask{vz, vz}, outputs: maskonly}
vkv = regInfo{inputs: []regMask{vz, mask}, outputs: vonly}
v2kv = regInfo{inputs: []regMask{vz, vz, mask}, outputs: vonly}
v2kk = regInfo{inputs: []regMask{vz, vz, mask}, outputs: maskonly}
v31 = regInfo{inputs: []regMask{v, vz, vz}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
v3kv = regInfo{inputs: []regMask{v, vz, vz, mask}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
vgpv = regInfo{inputs: []regMask{vz, gp}, outputs: vonly}
vgp = regInfo{inputs: vonly, outputs: gponly}
vfpv = regInfo{inputs: []regMask{vz, fp}, outputs: vonly}
vfpkv = regInfo{inputs: []regMask{vz, fp, mask}, outputs: vonly}
fpv = regInfo{inputs: []regMask{fp}, outputs: vonly}
gpv = regInfo{inputs: []regMask{gp}, outputs: vonly}
v2flags = regInfo{inputs: []regMask{vz, vz}}
w11 = regInfo{inputs: wonly, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
w21 = regInfo{inputs: []regMask{wz, wz}, outputs: wonly}
wk = regInfo{inputs: wzonly, outputs: maskonly}
kw = regInfo{inputs: maskonly, outputs: wonly}
w2k = regInfo{inputs: []regMask{wz, wz}, outputs: maskonly}
wkw = regInfo{inputs: []regMask{wz, mask}, outputs: wonly}
w2kw = regInfo{inputs: []regMask{w, wz, mask}, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
w2kk = regInfo{inputs: []regMask{wz, wz, mask}, outputs: maskonly}
w31 = regInfo{inputs: []regMask{w, wz, wz}, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
w3kw = regInfo{inputs: []regMask{w, wz, wz, mask}, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
wgpw = regInfo{inputs: []regMask{wz, gp}, outputs: wonly}
wgp = regInfo{inputs: wzonly, outputs: gponly}
wfpw = regInfo{inputs: []regMask{wz, fp}, outputs: wonly}
wfpkw = regInfo{inputs: []regMask{wz, fp, mask}, outputs: wonly}
// These register masks are used only by SIMD; they follow the pattern:
// mem last, k mask second to last (if any), and the address right before the mem and k mask.
wkwload = regInfo{inputs: []regMask{gpspsb, mask, 0}, outputs: wonly}
v21load = regInfo{inputs: []regMask{v, gpspsb, 0}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
v31load = regInfo{inputs: []regMask{v, vz, gpspsb, 0}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
v11load = regInfo{inputs: []regMask{gpspsb, 0}, outputs: vonly}
w21load = regInfo{inputs: []regMask{wz, gpspsb, 0}, outputs: wonly}
w31load = regInfo{inputs: []regMask{w, wz, gpspsb, 0}, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
w2kload = regInfo{inputs: []regMask{wz, gpspsb, 0}, outputs: maskonly}
w2kwload = regInfo{inputs: []regMask{wz, gpspsb, mask, 0}, outputs: wonly}
w11load = regInfo{inputs: []regMask{gpspsb, 0}, outputs: wonly}
w3kwload = regInfo{inputs: []regMask{w, wz, gpspsb, mask, 0}, outputs: wonly} // used in resultInArg0 ops, arg0 must not be x15
w2kkload = regInfo{inputs: []regMask{wz, gpspsb, mask, 0}, outputs: maskonly}
v31x0AtIn2 = regInfo{inputs: []regMask{v, vz, x0}, outputs: vonly} // used in resultInArg0 ops, arg0 must not be x15
kload = regInfo{inputs: []regMask{gpspsb, 0}, outputs: maskonly}
kstore = regInfo{inputs: []regMask{gpspsb, mask, 0}}
gpk = regInfo{inputs: gponly, outputs: maskonly}
kgp = regInfo{inputs: maskonly, outputs: gponly}
x15only = regInfo{inputs: nil, outputs: []regMask{x15}}
prefreg = regInfo{inputs: []regMask{gpspsbg}}
)
@ -1235,6 +1334,118 @@ func init() {
//
// output[i] = (input[i] >> 7) & 1
{name: "PMOVMSKB", argLength: 1, reg: fpgp, asm: "PMOVMSKB"},
// SIMD ops
{name: "VMOVDQUload128", argLength: 2, reg: fpload, asm: "VMOVDQU", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1 = mem
{name: "VMOVDQUstore128", argLength: 3, reg: fpstore, asm: "VMOVDQU", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg1, arg2 = mem
{name: "VMOVDQUload256", argLength: 2, reg: fpload, asm: "VMOVDQU", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1 = mem
{name: "VMOVDQUstore256", argLength: 3, reg: fpstore, asm: "VMOVDQU", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg1, arg2 = mem
{name: "VMOVDQUload512", argLength: 2, reg: fpload, asm: "VMOVDQU64", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1 = mem
{name: "VMOVDQUstore512", argLength: 3, reg: fpstore, asm: "VMOVDQU64", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg1, arg2 = mem
// AVX2 32 and 64-bit element int-vector masked moves.
{name: "VPMASK32load128", argLength: 3, reg: vloadv, asm: "VPMASKMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=integer mask, arg2 = mem
{name: "VPMASK32store128", argLength: 4, reg: vstorev, asm: "VPMASKMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=integer mask, arg3 = mem
{name: "VPMASK64load128", argLength: 3, reg: vloadv, asm: "VPMASKMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=integer mask, arg2 = mem
{name: "VPMASK64store128", argLength: 4, reg: vstorev, asm: "VPMASKMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=integer mask, arg3 = mem
{name: "VPMASK32load256", argLength: 3, reg: vloadv, asm: "VPMASKMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=integer mask, arg2 = mem
{name: "VPMASK32store256", argLength: 4, reg: vstorev, asm: "VPMASKMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=integer mask, arg3 = mem
{name: "VPMASK64load256", argLength: 3, reg: vloadv, asm: "VPMASKMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=integer mask, arg2 = mem
{name: "VPMASK64store256", argLength: 4, reg: vstorev, asm: "VPMASKMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=integer mask, arg3 = mem
// AVX512 8-64-bit element mask-register masked moves
{name: "VPMASK8load512", argLength: 3, reg: vloadk, asm: "VMOVDQU8", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=k mask, arg2 = mem
{name: "VPMASK8store512", argLength: 4, reg: vstorek, asm: "VMOVDQU8", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=k mask, arg3 = mem
{name: "VPMASK16load512", argLength: 3, reg: vloadk, asm: "VMOVDQU16", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=k mask, arg2 = mem
{name: "VPMASK16store512", argLength: 4, reg: vstorek, asm: "VMOVDQU16", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=k mask, arg3 = mem
{name: "VPMASK32load512", argLength: 3, reg: vloadk, asm: "VMOVDQU32", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=k mask, arg2 = mem
{name: "VPMASK32store512", argLength: 4, reg: vstorek, asm: "VMOVDQU32", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=k mask, arg3 = mem
{name: "VPMASK64load512", argLength: 3, reg: vloadk, asm: "VMOVDQU64", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"}, // load from arg0+auxint+aux, arg1=k mask, arg2 = mem
{name: "VPMASK64store512", argLength: 4, reg: vstorek, asm: "VMOVDQU64", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"}, // store, *(arg0+auxint+aux) = arg2, arg1=k mask, arg3 = mem
{name: "VPMOVMToVec8x16", argLength: 1, reg: kv, asm: "VPMOVM2B"},
{name: "VPMOVMToVec8x32", argLength: 1, reg: kv, asm: "VPMOVM2B"},
{name: "VPMOVMToVec8x64", argLength: 1, reg: kw, asm: "VPMOVM2B"},
{name: "VPMOVMToVec16x8", argLength: 1, reg: kv, asm: "VPMOVM2W"},
{name: "VPMOVMToVec16x16", argLength: 1, reg: kv, asm: "VPMOVM2W"},
{name: "VPMOVMToVec16x32", argLength: 1, reg: kw, asm: "VPMOVM2W"},
{name: "VPMOVMToVec32x4", argLength: 1, reg: kv, asm: "VPMOVM2D"},
{name: "VPMOVMToVec32x8", argLength: 1, reg: kv, asm: "VPMOVM2D"},
{name: "VPMOVMToVec32x16", argLength: 1, reg: kw, asm: "VPMOVM2D"},
{name: "VPMOVMToVec64x2", argLength: 1, reg: kv, asm: "VPMOVM2Q"},
{name: "VPMOVMToVec64x4", argLength: 1, reg: kv, asm: "VPMOVM2Q"},
{name: "VPMOVMToVec64x8", argLength: 1, reg: kw, asm: "VPMOVM2Q"},
{name: "VPMOVVec8x16ToM", argLength: 1, reg: vk, asm: "VPMOVB2M"},
{name: "VPMOVVec8x32ToM", argLength: 1, reg: vk, asm: "VPMOVB2M"},
{name: "VPMOVVec8x64ToM", argLength: 1, reg: wk, asm: "VPMOVB2M"},
{name: "VPMOVVec16x8ToM", argLength: 1, reg: vk, asm: "VPMOVW2M"},
{name: "VPMOVVec16x16ToM", argLength: 1, reg: vk, asm: "VPMOVW2M"},
{name: "VPMOVVec16x32ToM", argLength: 1, reg: wk, asm: "VPMOVW2M"},
{name: "VPMOVVec32x4ToM", argLength: 1, reg: vk, asm: "VPMOVD2M"},
{name: "VPMOVVec32x8ToM", argLength: 1, reg: vk, asm: "VPMOVD2M"},
{name: "VPMOVVec32x16ToM", argLength: 1, reg: wk, asm: "VPMOVD2M"},
{name: "VPMOVVec64x2ToM", argLength: 1, reg: vk, asm: "VPMOVQ2M"},
{name: "VPMOVVec64x4ToM", argLength: 1, reg: vk, asm: "VPMOVQ2M"},
{name: "VPMOVVec64x8ToM", argLength: 1, reg: wk, asm: "VPMOVQ2M"},
{name: "Zero128", argLength: 0, reg: x15only, zeroWidth: true, fixedReg: true},
{name: "Zero256", argLength: 0, reg: x15only, zeroWidth: true, fixedReg: true},
{name: "Zero512", argLength: 0, reg: x15only, zeroWidth: true, fixedReg: true},
{name: "VMOVSDf2v", argLength: 1, reg: fpv, asm: "VMOVSD"},
{name: "VMOVSSf2v", argLength: 1, reg: fpv, asm: "VMOVSS"},
{name: "VMOVQ", argLength: 1, reg: gpv, asm: "VMOVQ"},
{name: "VMOVD", argLength: 1, reg: gpv, asm: "VMOVD"},
{name: "VMOVQload", argLength: 2, reg: fpload, asm: "VMOVQ", aux: "SymOff", typ: "UInt64", faultOnNilArg0: true, symEffect: "Read"},
{name: "VMOVDload", argLength: 2, reg: fpload, asm: "VMOVD", aux: "SymOff", typ: "UInt32", faultOnNilArg0: true, symEffect: "Read"},
{name: "VMOVSSload", argLength: 2, reg: fpload, asm: "VMOVSS", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
{name: "VMOVSDload", argLength: 2, reg: fpload, asm: "VMOVSD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
{name: "VMOVSSconst", reg: fp01, asm: "VMOVSS", aux: "Float32", rematerializeable: true},
{name: "VMOVSDconst", reg: fp01, asm: "VMOVSD", aux: "Float64", rematerializeable: true},
{name: "VZEROUPPER", argLength: 1, reg: regInfo{clobbers: v}, asm: "VZEROUPPER"}, // arg=mem, returns mem
{name: "VZEROALL", argLength: 1, reg: regInfo{clobbers: v}, asm: "VZEROALL"}, // arg=mem, returns mem
// KMOVxload: loads masks
// Load (Q=8,D=4,W=2,B=1) bytes from (arg0+auxint+aux), arg1=mem.
// "+auxint+aux" == add auxint and the offset of the symbol in aux (if any) to the effective address
{name: "KMOVBload", argLength: 2, reg: kload, asm: "KMOVB", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
{name: "KMOVWload", argLength: 2, reg: kload, asm: "KMOVW", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
{name: "KMOVDload", argLength: 2, reg: kload, asm: "KMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
{name: "KMOVQload", argLength: 2, reg: kload, asm: "KMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Read"},
// KMOVxstore: stores masks
// Store (Q=8,D=4,W=2,B=1) low bytes of arg1.
// Does *(arg0+auxint+aux) = arg1, arg2=mem.
{name: "KMOVBstore", argLength: 3, reg: kstore, asm: "KMOVB", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"},
{name: "KMOVWstore", argLength: 3, reg: kstore, asm: "KMOVW", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"},
{name: "KMOVDstore", argLength: 3, reg: kstore, asm: "KMOVD", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"},
{name: "KMOVQstore", argLength: 3, reg: kstore, asm: "KMOVQ", aux: "SymOff", faultOnNilArg0: true, symEffect: "Write"},
// Move GP directly to mask register
{name: "KMOVQk", argLength: 1, reg: gpk, asm: "KMOVQ"},
{name: "KMOVDk", argLength: 1, reg: gpk, asm: "KMOVD"},
{name: "KMOVWk", argLength: 1, reg: gpk, asm: "KMOVW"},
{name: "KMOVBk", argLength: 1, reg: gpk, asm: "KMOVB"},
{name: "KMOVQi", argLength: 1, reg: kgp, asm: "KMOVQ"},
{name: "KMOVDi", argLength: 1, reg: kgp, asm: "KMOVD"},
{name: "KMOVWi", argLength: 1, reg: kgp, asm: "KMOVW"},
{name: "KMOVBi", argLength: 1, reg: kgp, asm: "KMOVB"},
// VPTEST
{name: "VPTEST", asm: "VPTEST", argLength: 2, reg: v2flags, clobberFlags: true, typ: "Flags"},
}
var AMD64blocks = []blockData{
@ -1266,14 +1477,17 @@ func init() {
name: "AMD64",
pkg: "cmd/internal/obj/x86",
genfile: "../../amd64/ssa.go",
ops: AMD64ops,
genSIMDfile: "../../amd64/simdssa.go",
ops: append(AMD64ops, simdAMD64Ops(v11, v21, v2k, vkv, v2kv, v2kk, v31, v3kv, vgpv, vgp, vfpv, vfpkv,
w11, w21, w2k, wkw, w2kw, w2kk, w31, w3kw, wgpw, wgp, wfpw, wfpkw, wkwload, v21load, v31load, v11load,
w21load, w31load, w2kload, w2kwload, w11load, w3kwload, w2kkload, v31x0AtIn2)...), // AMD64ops,
blocks: AMD64blocks,
regnames: regNamesAMD64,
ParamIntRegNames: "AX BX CX DI SI R8 R9 R10 R11",
ParamFloatRegNames: "X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14",
gpregmask: gp,
fpregmask: fp,
specialregmask: x15,
specialregmask: mask,
framepointerreg: int8(num["BP"]),
linkreg: -1, // not used
})


@ -941,7 +941,7 @@
// struct operations
(StructSelect [i] x:(StructMake ___)) => x.Args[i]
(Load <t> _ _) && t.IsStruct() && CanSSA(t) => rewriteStructLoad(v)
(Load <t> _ _) && t.IsStruct() && CanSSA(t) && !t.IsSIMD() => rewriteStructLoad(v)
(Store _ (StructMake ___) _) => rewriteStructStore(v)
(StructSelect [i] x:(Load <t> ptr mem)) && !CanSSA(t) =>


@ -375,6 +375,18 @@ var genericOps = []opData{
{name: "Load", argLength: 2}, // Load from arg0. arg1=memory
{name: "Dereference", argLength: 2}, // Load from arg0. arg1=memory. Helper op for arg/result passing, result is an otherwise not-SSA-able "value".
{name: "Store", argLength: 3, typ: "Mem", aux: "Typ"}, // Store arg1 to arg0. arg2=memory, aux=type. Returns memory.
// masked memory operations.
// TODO add 16 and 8
{name: "LoadMasked8", argLength: 3}, // Load from arg0, arg1 = mask of 8-bits, arg2 = memory
{name: "LoadMasked16", argLength: 3}, // Load from arg0, arg1 = mask of 16-bits, arg2 = memory
{name: "LoadMasked32", argLength: 3}, // Load from arg0, arg1 = mask of 32-bits, arg2 = memory
{name: "LoadMasked64", argLength: 3}, // Load from arg0, arg1 = mask of 64-bits, arg2 = memory
{name: "StoreMasked8", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 8-bits, arg3 = memory
{name: "StoreMasked16", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 16-bits, arg3 = memory
{name: "StoreMasked32", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 32-bits, arg3 = memory
{name: "StoreMasked64", argLength: 4, typ: "Mem", aux: "Typ"}, // Store arg2 to arg0, arg1=mask of 64-bits, arg3 = memory
// Normally we require that the source and destination of Move do not overlap.
// There is an exception when we know all the loads will happen before all
// the stores. In that case, overlap is ok. See
@ -666,6 +678,40 @@ var genericOps = []opData{
// Prefetch instruction
{name: "PrefetchCache", argLength: 2, hasSideEffects: true}, // Do prefetch arg0 to cache. arg0=addr, arg1=memory.
{name: "PrefetchCacheStreamed", argLength: 2, hasSideEffects: true}, // Do non-temporal or streamed prefetch arg0 to cache. arg0=addr, arg1=memory.
// SIMD
{name: "ZeroSIMD", argLength: 0}, // zero value of a vector
// Convert integers to masks
{name: "Cvt16toMask8x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt32toMask8x32", argLength: 1}, // arg0 = integer mask value
{name: "Cvt64toMask8x64", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask16x8", argLength: 1}, // arg0 = integer mask value
{name: "Cvt16toMask16x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt32toMask16x32", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask32x4", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask32x8", argLength: 1}, // arg0 = integer mask value
{name: "Cvt16toMask32x16", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x2", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x4", argLength: 1}, // arg0 = integer mask value
{name: "Cvt8toMask64x8", argLength: 1}, // arg0 = integer mask value
// Convert masks to integers
{name: "CvtMask8x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask8x32to32", argLength: 1}, // arg0 = mask
{name: "CvtMask8x64to64", argLength: 1}, // arg0 = mask
{name: "CvtMask16x8to8", argLength: 1}, // arg0 = mask
{name: "CvtMask16x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask16x32to32", argLength: 1}, // arg0 = mask
{name: "CvtMask32x4to8", argLength: 1}, // arg0 = mask
{name: "CvtMask32x8to8", argLength: 1}, // arg0 = mask
{name: "CvtMask32x16to16", argLength: 1}, // arg0 = mask
{name: "CvtMask64x2to8", argLength: 1}, // arg0 = mask
{name: "CvtMask64x4to8", argLength: 1}, // arg0 = mask
{name: "CvtMask64x8to8", argLength: 1}, // arg0 = mask
// Returns true if arg0 is all zero.
{name: "IsZeroVec", argLength: 1},
}
// kind controls successors implicit exit
@ -693,6 +739,7 @@ var genericBlocks = []blockData{
}
func init() {
genericOps = append(genericOps, simdGenericOps()...)
archs = append(archs, arch{
name: "generic",
ops: genericOps,


@ -32,6 +32,7 @@ type arch struct {
name string
pkg string // obj package to import for this arch.
genfile string // source file containing opcode code generation.
genSIMDfile string // source file containing opcode code generation for SIMD.
ops []opData
blocks []blockData
regnames []string
@ -547,6 +548,15 @@ func genOp() {
if err != nil {
log.Fatalf("can't read %s: %v", a.genfile, err)
}
// Append the file of simd operations, too
if a.genSIMDfile != "" {
simdSrc, err := os.ReadFile(a.genSIMDfile)
if err != nil {
log.Fatalf("can't read %s: %v", a.genSIMDfile, err)
}
src = append(src, simdSrc...)
}
seen := make(map[string]bool, len(a.ops))
for _, m := range rxOp.FindAllSubmatch(src, -1) {
seen[string(m[1])] = true


@ -0,0 +1,117 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bufio"
"io"
)
// NamedScanner is a simple struct to pair a name with a Scanner.
type NamedScanner struct {
Name string
Scanner *bufio.Scanner
}
// NamedReader is a simple struct to pair a name with a Reader,
// which will be converted to a Scanner using bufio.NewScanner.
type NamedReader struct {
Name string
Reader io.Reader
}
// MultiScanner scans over multiple bufio.Scanners as if they were a single stream.
// It also keeps track of the name of the current scanner and the line number.
type MultiScanner struct {
scanners []NamedScanner
scannerIdx int
line int
totalLine int
err error
}
// NewMultiScanner creates a new MultiScanner from a slice of NamedScanners.
func NewMultiScanner(scanners []NamedScanner) *MultiScanner {
return &MultiScanner{
scanners: scanners,
scannerIdx: -1, // Start before the first scanner
}
}
// MultiScannerFromReaders creates a new MultiScanner from a slice of NamedReaders.
func MultiScannerFromReaders(readers []NamedReader) *MultiScanner {
var scanners []NamedScanner
for _, r := range readers {
scanners = append(scanners, NamedScanner{
Name: r.Name,
Scanner: bufio.NewScanner(r.Reader),
})
}
return NewMultiScanner(scanners)
}
// Scan advances the scanner to the next token, which will then be
// available through the Text method. It returns false when the scan stops,
// either by reaching the end of the input or an error.
// After Scan returns false, the Err method will return any error that
// occurred during scanning, except that if it was io.EOF, Err
// will return nil.
func (ms *MultiScanner) Scan() bool {
if ms.scannerIdx == -1 {
ms.scannerIdx = 0
}
for ms.scannerIdx < len(ms.scanners) {
current := ms.scanners[ms.scannerIdx]
if current.Scanner.Scan() {
ms.line++
ms.totalLine++
return true
}
if err := current.Scanner.Err(); err != nil {
ms.err = err
return false
}
// Move to the next scanner
ms.scannerIdx++
ms.line = 0
}
return false
}
// Text returns the most recent token generated by a call to Scan.
func (ms *MultiScanner) Text() string {
if ms.scannerIdx < 0 || ms.scannerIdx >= len(ms.scanners) {
return ""
}
return ms.scanners[ms.scannerIdx].Scanner.Text()
}
// Err returns the first non-EOF error that was encountered by the MultiScanner.
func (ms *MultiScanner) Err() error {
return ms.err
}
// Name returns the name of the current scanner.
func (ms *MultiScanner) Name() string {
if ms.scannerIdx < 0 {
return "<before first>"
}
if ms.scannerIdx >= len(ms.scanners) {
return "<after last>"
}
return ms.scanners[ms.scannerIdx].Name
}
// Line returns the current line number within the current scanner.
func (ms *MultiScanner) Line() int {
return ms.line
}
// TotalLine returns the total number of lines scanned across all scanners.
func (ms *MultiScanner) TotalLine() int {
return ms.totalLine
}
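
A minimal usage sketch of the new type (not part of the CL; it assumes it sits next to this file in the _gen package with fmt and strings imported):

	func exampleMultiScanner() {
		ms := MultiScannerFromReaders([]NamedReader{
			{Name: "AMD64.rules", Reader: strings.NewReader("rule one\nrule two\n")},
			{Name: "simdAMD64.rules", Reader: strings.NewReader("simd rule\n")},
		})
		for ms.Scan() {
			// Prints AMD64.rules:1, AMD64.rules:2, then simdAMD64.rules:1.
			fmt.Printf("%s:%d: %s\n", ms.Name(), ms.Line(), ms.Text())
		}
		if err := ms.Err(); err != nil {
			fmt.Println("scan error:", err)
		}
	}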


@ -94,8 +94,11 @@ func genSplitLoadRules(arch arch) { genRulesSuffix(arch, "splitload") }
func genLateLowerRules(arch arch) { genRulesSuffix(arch, "latelower") }
func genRulesSuffix(arch arch, suff string) {
var readers []NamedReader
// Open input file.
text, err := os.Open(arch.name + suff + ".rules")
var text io.Reader
name := arch.name + suff + ".rules"
text, err := os.Open(name)
if err != nil {
if suff == "" {
// All architectures must have a plain rules file.
@ -104,18 +107,28 @@ func genRulesSuffix(arch arch, suff string) {
// Some architectures have bonus rules files that others don't share. That's fine.
return
}
readers = append(readers, NamedReader{name, text})
// Check for file of SIMD rules to add
if suff == "" {
simdname := "simd" + arch.name + ".rules"
simdtext, err := os.Open(simdname)
if err == nil {
readers = append(readers, NamedReader{simdname, simdtext})
}
}
// oprules contains a list of rules for each block and opcode
blockrules := map[string][]Rule{}
oprules := map[string][]Rule{}
// read rule file
scanner := bufio.NewScanner(text)
scanner := MultiScannerFromReaders(readers)
rule := ""
var lineno int
var ruleLineno int // line number of "=>"
for scanner.Scan() {
lineno++
lineno = scanner.Line()
line := scanner.Text()
if i := strings.Index(line, "//"); i >= 0 {
// Remove comments. Note that this isn't string safe, so
@ -142,7 +155,7 @@ func genRulesSuffix(arch arch, suff string) {
break // continuing the line can't help, and it will only make errors worse
}
loc := fmt.Sprintf("%s%s.rules:%d", arch.name, suff, ruleLineno)
loc := fmt.Sprintf("%s:%d", scanner.Name(), ruleLineno)
for _, rule2 := range expandOr(rule) {
r := Rule{Rule: rule2, Loc: loc}
if rawop := strings.Split(rule2, " ")[0][1:]; isBlock(rawop, arch) {
@ -162,7 +175,7 @@ func genRulesSuffix(arch arch, suff string) {
log.Fatalf("scanner failed: %v\n", err)
}
if balance(rule) != 0 {
log.Fatalf("%s.rules:%d: unbalanced rule: %v\n", arch.name, lineno, rule)
log.Fatalf("%s:%d: unbalanced rule: %v\n", scanner.Name(), lineno, rule)
}
// Order all the ops.
@ -862,7 +875,7 @@ func declReserved(name, value string) *Declare {
if !reservedNames[name] {
panic(fmt.Sprintf("declReserved call does not use a reserved name: %q", name))
}
return &Declare{name, exprf(value)}
return &Declare{name, exprf("%s", value)}
}
// breakf constructs a simple "if cond { break }" statement, using exprf for its
@ -889,7 +902,7 @@ func genBlockRewrite(rule Rule, arch arch, data blockData) *RuleRewrite {
if vname == "" {
vname = fmt.Sprintf("v_%v", i)
}
rr.add(declf(rr.Loc, vname, cname))
rr.add(declf(rr.Loc, vname, "%s", cname))
p, op := genMatch0(rr, arch, expr, vname, nil, false) // TODO: pass non-nil cnt?
if op != "" {
check := fmt.Sprintf("%s.Op == %s", cname, op)
@ -904,7 +917,7 @@ func genBlockRewrite(rule Rule, arch arch, data blockData) *RuleRewrite {
}
pos[i] = p
} else {
rr.add(declf(rr.Loc, arg, cname))
rr.add(declf(rr.Loc, arg, "%s", cname))
pos[i] = arg + ".Pos"
}
}
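
The exprf/declf calls above now route pre-built strings through a "%s" verb. Presumably this keeps vet's printf checking happy and, more importantly, prevents any literal '%' inside a generated expression from being misread as a formatting directive; illustratively:

	exprf(cname)       // old: cname is (mis)used as a format string
	exprf("%s", cname) // new: cname is emitted verbatim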

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -18,6 +18,9 @@ type Block struct {
// Source position for block's control operation
Pos src.XPos
// What cpu features (AVXnnn, SVEyyy) are implied to reach/execute this block?
CPUfeatures CPUfeatures
// The kind of block this is.
Kind BlockKind
@ -449,3 +452,57 @@ const (
HotPgoInitial = HotPgo | HotInitial // special case; single block loop, initial block is header block has a flow-in entry, but PGO says it is hot
HotPgoInitialNotFLowIn = HotPgo | HotInitial | HotNotFlowIn // PGO says it is hot, and the loop is rotated so flow enters loop with a branch
)
type CPUfeatures uint32
const (
CPUNone CPUfeatures = 0
CPUAll CPUfeatures = ^CPUfeatures(0)
CPUavx CPUfeatures = 1 << iota
CPUavx2
CPUavxvnni
CPUavx512
CPUbitalg
CPUgfni
CPUvbmi
CPUvbmi2
CPUvpopcntdq
CPUavx512vnni
CPUneon
CPUsve2
)
func (f CPUfeatures) hasFeature(x CPUfeatures) bool {
return f&x == x
}
func (f CPUfeatures) String() string {
if f == CPUNone {
return "none"
}
if f == CPUAll {
return "all"
}
s := ""
foo := func(what string, feat CPUfeatures) {
if feat&f != 0 {
if s != "" {
s += "+"
}
s += what
}
}
foo("avx", CPUavx)
foo("avx2", CPUavx2)
foo("avx512", CPUavx512)
foo("avxvnni", CPUavxvnni)
foo("bitalg", CPUbitalg)
foo("gfni", CPUgfni)
foo("vbmi", CPUvbmi)
foo("vbmi2", CPUvbmi2)
foo("popcntdq", CPUvpopcntdq)
foo("avx512vnni", CPUavx512vnni)
return s
}
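
A small illustration of the bit-set semantics (a sketch inside this package, assuming fmt is imported; not part of the CL):

	func exampleFeatures() {
		f := CPUavx | CPUavx2 | CPUavx512 | CPUavx512vnni
		fmt.Println(f)                                 // avx+avx2+avx512+avx512vnni
		fmt.Println(f.hasFeature(CPUavx2))             // true
		fmt.Println(f.hasFeature(CPUavx512 | CPUgfni)) // false: every requested bit must be set
	}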


@ -150,8 +150,9 @@ func checkFunc(f *Func) {
case auxInt128:
// AuxInt must be zero, so leave canHaveAuxInt set to false.
case auxUInt8:
if v.AuxInt != int64(uint8(v.AuxInt)) {
f.Fatalf("bad uint8 AuxInt value for %v", v)
// Cast to int8 due to the sign-extension requirement on AuxInt; see its comment for details.
if v.AuxInt != int64(int8(v.AuxInt)) {
f.Fatalf("bad uint8 AuxInt value for %v, saw %d but need %d", v, v.AuxInt, int64(int8(v.AuxInt)))
}
canHaveAuxInt = true
case auxFloat32:


@ -488,6 +488,8 @@ var passes = [...]pass{
{name: "writebarrier", fn: writebarrier, required: true}, // expand write barrier ops
{name: "insert resched checks", fn: insertLoopReschedChecks,
disabled: !buildcfg.Experiment.PreemptibleLoops}, // insert resched checks in loops.
{name: "cpufeatures", fn: cpufeatures, required: buildcfg.Experiment.SIMD, disabled: !buildcfg.Experiment.SIMD},
{name: "rewrite tern", fn: rewriteTern, required: false, disabled: !buildcfg.Experiment.SIMD},
{name: "lower", fn: lower, required: true},
{name: "addressing modes", fn: addressingModes, required: false},
{name: "late lower", fn: lateLower, required: true},
@ -596,6 +598,8 @@ var passOrder = [...]constraint{
{"branchelim", "late opt"},
// branchelim is an arch-independent pass.
{"branchelim", "lower"},
// lower needs cpu feature information (for SIMD)
{"cpufeatures", "lower"},
}
func init() {


@ -88,6 +88,10 @@ type Types struct {
Float32Ptr *types.Type
Float64Ptr *types.Type
BytePtrPtr *types.Type
Vec128 *types.Type
Vec256 *types.Type
Vec512 *types.Type
Mask *types.Type
}
// NewTypes creates and populates a Types.
@ -122,6 +126,10 @@ func (t *Types) SetTypPtrs() {
t.Float32Ptr = types.NewPtr(types.Types[types.TFLOAT32])
t.Float64Ptr = types.NewPtr(types.Types[types.TFLOAT64])
t.BytePtrPtr = types.NewPtr(types.NewPtr(types.Types[types.TUINT8]))
t.Vec128 = types.TypeVec128
t.Vec256 = types.TypeVec256
t.Vec512 = types.TypeVec512
t.Mask = types.TypeMask
}
type Logger interface {


@ -0,0 +1,262 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package ssa
import (
"cmd/compile/internal/types"
"cmd/internal/obj"
"fmt"
"internal/goarch"
)
type localEffect struct {
start CPUfeatures // features present at beginning of block
internal CPUfeatures // features implied by execution of block
end [2]CPUfeatures // for BlockIf, features present on outgoing edges
visited bool // On the first iteration this will be false for backedges.
}
func (e localEffect) String() string {
return fmt.Sprintf("visited=%v, start=%v, internal=%v, end[0]=%v, end[1]=%v", e.visited, e.start, e.internal, e.end[0], e.end[1])
}
// ifEffect pattern matches for a BlockIf conditional on a load
// of a field from internal/cpu.X86 and returns the corresponding
// effect.
func ifEffect(b *Block) (features CPUfeatures, taken int) {
// TODO generalize for other architectures.
if b.Kind != BlockIf {
return
}
c := b.Controls[0]
if c.Op == OpNot {
taken = 1
c = c.Args[0]
}
if c.Op != OpLoad {
return
}
offPtr := c.Args[0]
if offPtr.Op != OpOffPtr {
return
}
addr := offPtr.Args[0]
if addr.Op != OpAddr || addr.Args[0].Op != OpSB {
return
}
sym := addr.Aux.(*obj.LSym)
if sym.Name != "internal/cpu.X86" {
return
}
o := offPtr.AuxInt
t := addr.Type
if !t.IsPtr() {
b.Func.Fatalf("The symbol %s is not a pointer, found %v instead", sym.Name, t)
}
t = t.Elem()
if !t.IsStruct() {
b.Func.Fatalf("The referent of symbol %s is not a struct, found %v instead", sym.Name, t)
}
match := ""
for _, f := range t.Fields() {
if o == f.Offset && f.Sym != nil {
match = f.Sym.Name
break
}
}
switch match {
case "HasAVX":
features = CPUavx
case "HasAVXVNNI":
features = CPUavx | CPUavxvnni
case "HasAVX2":
features = CPUavx2 | CPUavx
// Compiler currently treats these all alike.
case "HasAVX512", "HasAVX512F", "HasAVX512CD", "HasAVX512BW",
"HasAVX512DQ", "HasAVX512VL", "HasAVX512VPCLMULQDQ":
features = CPUavx512 | CPUavx2 | CPUavx
case "HasAVX512GFNI":
features = CPUavx512 | CPUgfni | CPUavx2 | CPUavx
case "HasAVX512VNNI":
features = CPUavx512 | CPUavx512vnni | CPUavx2 | CPUavx
case "HasAVX512VBMI":
features = CPUavx512 | CPUvbmi | CPUavx2 | CPUavx
case "HasAVX512VBMI2":
features = CPUavx512 | CPUvbmi2 | CPUavx2 | CPUavx
case "HasAVX512BITALG":
features = CPUavx512 | CPUbitalg | CPUavx2 | CPUavx
case "HasAVX512VPOPCNTDQ":
features = CPUavx512 | CPUvpopcntdq | CPUavx2 | CPUavx
case "HasBMI1":
features = CPUvbmi
case "HasBMI2":
features = CPUvbmi2
// Features that are not currently interesting to the compiler.
case "HasAES", "HasADX", "HasERMS", "HasFSRM", "HasFMA", "HasGFNI", "HasOSXSAVE",
"HasPCLMULQDQ", "HasPOPCNT", "HasRDTSCP", "HasSHA",
"HasSSE3", "HasSSSE3", "HasSSE41", "HasSSE42":
}
if b.Func.pass.debug > 2 {
b.Func.Warnl(b.Pos, "%s, block b%v has features offset %d, match is %s, features is %v", b.Func.Name, b.ID, o, match, features)
}
return
}
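
The SSA shape matched above corresponds to source of roughly the following form. Only code that may import internal/cpu (the runtime, or the simd package's generated feature checks) produces it directly, and the helper names below are placeholders rather than real API:

	import "internal/cpu"

	func add(dst, a, b []int32) {
		if cpu.X86.HasAVX512 {
			// Blocks reachable only through this edge are tagged
			// CPUavx512|CPUavx2|CPUavx by the cpufeatures pass below.
			addAVX512(dst, a, b) // placeholder
			return
		}
		addScalar(dst, a, b) // placeholder; no extra features are assumed here
	}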
func cpufeatures(f *Func) {
arch := f.Config.Ctxt().Arch.Family
// TODO there are other SIMD architectures
if arch != goarch.AMD64 {
return
}
po := f.Postorder()
effects := make([]localEffect, 1+f.NumBlocks(), 1+f.NumBlocks())
features := func(t *types.Type) CPUfeatures {
if t.IsSIMD() {
switch t.Size() {
case 16, 32:
return CPUavx
case 64:
return CPUavx512 | CPUavx2 | CPUavx
}
}
return CPUNone
}
// visit blocks in reverse post order
// when b is visited, all of its predecessors (except for loop back edges)
// will have been visited
for i := len(po) - 1; i >= 0; i-- {
b := po[i]
var feat CPUfeatures
if b == f.Entry {
// Check the types of inputs and outputs, as well as annotations.
// Start with none and union all that is implied by all the types seen.
if f.Type != nil { // a problem for SSA tests
for _, field := range f.Type.RecvParamsResults() {
feat |= features(field.Type)
}
}
} else {
// Start with all and intersect over predecessors
feat = CPUAll
for _, p := range b.Preds {
pb := p.Block()
if !effects[pb.ID].visited {
continue
}
pi := p.Index()
if pb.Kind != BlockIf {
pi = 0
}
feat &= effects[pb.ID].end[pi]
}
}
e := localEffect{start: feat, visited: true}
// Separately capture the internal effects of this block
var internal CPUfeatures
for _, v := range b.Values {
// The rule applied here: if the block contains any
// instruction that would fault when the feature (avx, avx512)
// is not present, then assume that the feature is present
// for all the instructions in the block; a fault is a fault.
t := v.Type
if t.IsResults() {
for i := 0; i < t.NumFields(); i++ {
feat |= features(t.FieldType(i))
}
} else {
internal |= features(v.Type)
}
}
e.internal = internal
feat |= internal
branchEffect, taken := ifEffect(b)
e.end = [2]CPUfeatures{feat, feat}
e.end[taken] |= branchEffect
effects[b.ID] = e
if f.pass.debug > 1 && feat != CPUNone {
f.Warnl(b.Pos, "%s, block b%v has features %v", b.Func.Name, b.ID, feat)
}
b.CPUfeatures = feat
f.maxCPUFeatures |= feat // not necessary to refine this estimate below
}
// If the flow graph is irreducible, things can still change on backedges.
change := true
for change {
change = false
for i := len(po) - 1; i >= 0; i-- {
b := po[i]
if b == f.Entry {
continue // cannot change
}
feat := CPUAll
for _, p := range b.Preds {
pb := p.Block()
pi := p.Index()
if pb.Kind != BlockIf {
pi = 0
}
feat &= effects[pb.ID].end[pi]
}
e := effects[b.ID]
if feat == e.start {
continue
}
e.start = feat
effects[b.ID] = e
// uh-oh, something changed
if f.pass.debug > 1 {
f.Warnl(b.Pos, "%s, block b%v saw predecessor feature change", b.Func.Name, b.ID)
}
feat |= e.internal
if feat == e.end[0]&e.end[1] {
continue
}
branchEffect, taken := ifEffect(b)
e.end = [2]CPUfeatures{feat, feat}
e.end[taken] |= branchEffect
effects[b.ID] = e
b.CPUfeatures = feat
if f.pass.debug > 1 {
f.Warnl(b.Pos, "%s, block b%v has new features %v", b.Func.Name, b.ID, feat)
}
change = true
}
}
if f.pass.debug > 0 {
for _, b := range f.Blocks {
if b.CPUfeatures != CPUNone {
f.Warnl(b.Pos, "%s, block b%v has features %v", b.Func.Name, b.ID, b.CPUfeatures)
}
}
}
}
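
The per-block meet over predecessors is a plain bitwise AND of edge feature sets; for example (values chosen for illustration):

	func exampleMeet() CPUfeatures {
		guarded := CPUavx | CPUavx2 | CPUavx512 // predecessor inside a HasAVX512 branch
		fallback := CPUavx                      // predecessor from an AVX-only path
		return guarded & fallback               // CPUavx: only what every path guarantees
	}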


@ -100,7 +100,7 @@ func decomposeBuiltin(f *Func) {
}
case t.IsFloat():
// floats are never decomposed, even ones bigger than RegSize
case t.Size() > f.Config.RegSize:
case t.Size() > f.Config.RegSize && !t.IsSIMD():
f.Fatalf("undecomposed named type %s %v", name, t)
}
}
@ -135,7 +135,7 @@ func decomposeBuiltinPhi(v *Value) {
decomposeInterfacePhi(v)
case v.Type.IsFloat():
// floats are never decomposed, even ones bigger than RegSize
case v.Type.Size() > v.Block.Func.Config.RegSize:
case v.Type.Size() > v.Block.Func.Config.RegSize && !v.Type.IsSIMD():
v.Fatalf("%v undecomposed type %v", v, v.Type)
}
}
@ -248,7 +248,7 @@ func decomposeUser(f *Func) {
for _, name := range f.Names {
t := name.Type
switch {
case t.IsStruct():
case isStructNotSIMD(t):
newNames = decomposeUserStructInto(f, name, newNames)
case t.IsArray():
newNames = decomposeUserArrayInto(f, name, newNames)
@ -293,7 +293,7 @@ func decomposeUserArrayInto(f *Func, name *LocalSlot, slots []*LocalSlot) []*Loc
if t.Elem().IsArray() {
return decomposeUserArrayInto(f, elemName, slots)
} else if t.Elem().IsStruct() {
} else if isStructNotSIMD(t.Elem()) {
return decomposeUserStructInto(f, elemName, slots)
}
@ -313,7 +313,7 @@ func decomposeUserStructInto(f *Func, name *LocalSlot, slots []*LocalSlot) []*Lo
fnames = append(fnames, fs)
// arrays and structs will be decomposed further, so
// there's no need to record a name
if !fs.Type.IsArray() && !fs.Type.IsStruct() {
if !fs.Type.IsArray() && !isStructNotSIMD(fs.Type) {
slots = maybeAppend(f, slots, fs)
}
}
@ -339,7 +339,7 @@ func decomposeUserStructInto(f *Func, name *LocalSlot, slots []*LocalSlot) []*Lo
// now that this f.NamedValues contains values for the struct
// fields, recurse into nested structs
for i := 0; i < n; i++ {
if name.Type.FieldType(i).IsStruct() {
if isStructNotSIMD(name.Type.FieldType(i)) {
slots = decomposeUserStructInto(f, fnames[i], slots)
delete(f.NamedValues, *fnames[i])
} else if name.Type.FieldType(i).IsArray() {
@ -351,7 +351,7 @@ func decomposeUserStructInto(f *Func, name *LocalSlot, slots []*LocalSlot) []*Lo
}
func decomposeUserPhi(v *Value) {
switch {
case v.Type.IsStruct():
case isStructNotSIMD(v.Type):
decomposeStructPhi(v)
case v.Type.IsArray():
decomposeArrayPhi(v)
@ -458,3 +458,7 @@ func deleteNamedVals(f *Func, toDelete []namedVal) {
}
f.Names = f.Names[:end]
}
func isStructNotSIMD(t *types.Type) bool {
return t.IsStruct() && !t.IsSIMD()
}


@ -396,6 +396,9 @@ func (x *expandState) decomposeAsNecessary(pos src.XPos, b *Block, a, m0 *Value,
return mem
case types.TSTRUCT:
if at.IsSIMD() {
break // XXX
}
for i := 0; i < at.NumFields(); i++ {
et := at.Field(i).Type // might need to read offsets from the fields
e := b.NewValue1I(pos, OpStructSelect, et, int64(i), a)
@ -551,6 +554,9 @@ func (x *expandState) rewriteSelectOrArg(pos src.XPos, b *Block, container, a, m
case types.TSTRUCT:
// Assume ssagen/ssa.go (in buildssa) spills large aggregates so they won't appear here.
if at.IsSIMD() {
break // XXX
}
for i := 0; i < at.NumFields(); i++ {
et := at.Field(i).Type
e := x.rewriteSelectOrArg(pos, b, container, nil, m0, et, rc.next(et))
@ -717,6 +723,9 @@ func (x *expandState) rewriteWideSelectToStores(pos src.XPos, b *Block, containe
case types.TSTRUCT:
// Assume ssagen/ssa.go (in buildssa) spills large aggregates so they won't appear here.
if at.IsSIMD() {
break // XXX
}
for i := 0; i < at.NumFields(); i++ {
et := at.Field(i).Type
m0 = x.rewriteWideSelectToStores(pos, b, container, m0, et, rc.next(et))


@ -41,6 +41,8 @@ type Func struct {
ABISelf *abi.ABIConfig // ABI for function being compiled
ABIDefault *abi.ABIConfig // ABI for rtcall and other no-parsed-signature/pragma functions.
maxCPUFeatures CPUfeatures // union of all the CPU features in all the blocks.
scheduled bool // Values in Blocks are in final order
laidout bool // Blocks are ordered
NoSplit bool // true if function is marked as nosplit. Used by schedule check pass.
@ -632,6 +634,19 @@ func (b *Block) NewValue4(pos src.XPos, op Op, t *types.Type, arg0, arg1, arg2,
return v
}
// NewValue4A returns a new value in the block with four arguments and an aux value.
func (b *Block) NewValue4A(pos src.XPos, op Op, t *types.Type, aux Aux, arg0, arg1, arg2, arg3 *Value) *Value {
v := b.Func.newValue(op, t, b, pos)
v.AuxInt = 0
v.Aux = aux
v.Args = []*Value{arg0, arg1, arg2, arg3}
arg0.Uses++
arg1.Uses++
arg2.Uses++
arg3.Uses++
return v
}
// NewValue4I returns a new value in the block with four arguments and auxint value.
func (b *Block) NewValue4I(pos src.XPos, op Op, t *types.Type, auxint int64, arg0, arg1, arg2, arg3 *Value) *Value {
v := b.Func.newValue(op, t, b, pos)

File diff suppressed because it is too large


@ -931,6 +931,14 @@ func (s *regAllocState) compatRegs(t *types.Type) regMask {
if t.IsTuple() || t.IsFlags() {
return 0
}
if t.IsSIMD() {
if t.Size() > 8 {
return s.f.Config.fpRegMask & s.allocatable
} else {
// K mask
return s.f.Config.gpRegMask & s.allocatable
}
}
if t.IsFloat() || t == types.TypeInt128 {
if t.Kind() == types.TFLOAT32 && s.f.Config.fp32RegMask != 0 {
m = s.f.Config.fp32RegMask
@ -1439,6 +1447,13 @@ func (s *regAllocState) regalloc(f *Func) {
s.sb = v.ID
case OpARM64ZERO, OpLOONG64ZERO, OpMIPS64ZERO:
s.assignReg(s.ZeroIntReg, v, v)
case OpAMD64Zero128, OpAMD64Zero256, OpAMD64Zero512:
regspec := s.regspec(v)
m := regspec.outputs[0].regs
if countRegs(m) != 1 {
f.Fatalf("bad fixed-register op %s", v)
}
s.assignReg(pickReg(m), v, v)
default:
f.Fatalf("unknown fixed-register op %s", v)
}

File diff suppressed because it is too large


@ -12416,11 +12416,11 @@ func rewriteValuegeneric_OpLoad(v *Value) bool {
return true
}
// match: (Load <t> _ _)
// cond: t.IsStruct() && CanSSA(t)
// cond: t.IsStruct() && CanSSA(t) && !t.IsSIMD()
// result: rewriteStructLoad(v)
for {
t := v.Type
if !(t.IsStruct() && CanSSA(t)) {
if !(t.IsStruct() && CanSSA(t) && !t.IsSIMD()) {
break
}
v.copyOf(rewriteStructLoad(v))


@ -0,0 +1,292 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package ssa
import (
"fmt"
"internal/goarch"
"slices"
)
var truthTableValues [3]uint8 = [3]uint8{0b1111_0000, 0b1100_1100, 0b1010_1010}
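
As a standalone sanity check on these constants (a worked example, not part of the CL): treating the three entries as the truth tables of the bare inputs x, y and z, the TERNLOG immediate for any boolean expression is that expression evaluated bitwise over them.

	package main

	import "fmt"

	func main() {
		x, y, z := uint8(0xF0), uint8(0xCC), uint8(0xAA) // truthTableValues
		imm := x & (y | ^z)                              // and(x, or(y, not(z)))
		fmt.Printf("%#x\n", imm)                         // 0xd0, matching computeTT's example below
	}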
func (slop SIMDLogicalOP) String() string {
if slop == sloInterior {
return "leaf"
}
interior := ""
if slop&sloInterior != 0 {
interior = "+interior"
}
switch slop &^ sloInterior {
case sloAnd:
return "and" + interior
case sloXor:
return "xor" + interior
case sloOr:
return "or" + interior
case sloAndNot:
return "andNot" + interior
case sloNot:
return "not" + interior
}
return "wrong"
}
func rewriteTern(f *Func) {
if f.maxCPUFeatures == CPUNone {
return
}
arch := f.Config.Ctxt().Arch.Family
// TODO there are other SIMD architectures
if arch != goarch.AMD64 {
return
}
boolExprTrees := make(map[*Value]SIMDLogicalOP)
// Find logical expression trees, including leaves.
// interior nodes will be marked sloInterior,
// root nodes will not be marked sloInterior,
// leaf nodes are only marked sloInterior.
for _, b := range f.Blocks {
for _, v := range b.Values {
slo := classifyBooleanSIMD(v)
switch slo {
case sloOr,
sloAndNot,
sloXor,
sloAnd:
boolExprTrees[v.Args[1]] |= sloInterior
fallthrough
case sloNot:
boolExprTrees[v.Args[0]] |= sloInterior
boolExprTrees[v] |= slo
}
}
}
// get a canonical sorted set of roots
var roots []*Value
for v, slo := range boolExprTrees {
if f.pass.debug > 1 {
f.Warnl(v.Pos, "%s has SLO %v", v.LongString(), slo)
}
if slo&sloInterior == 0 && v.Block.CPUfeatures.hasFeature(CPUavx512) {
roots = append(roots, v)
}
}
slices.SortFunc(roots, func(u, v *Value) int { return int(u.ID - v.ID) }) // IDs are small enough to not care about overflow.
// This rewrite works by iterating over the root set.
// For each boolean expression, it walks the expression
// bottom up accumulating sets of variables mentioned in
// subexpressions, lazy-greedily finding the largest subexpressions
// of 3 inputs that can be rewritten to use ternary-truth-table instructions.
// rewrite recursively attempts to replace v and v's subexpressions with
// ternary-logic truth-table operations, returning a set of not more than 3
// subexpressions within v that may be combined into a parent's replacement.
// V need not have the CPU features that allow a ternary-logic operation;
// in that case, v will not be rewritten. Replacements also require
// exactly 3 different variable inputs to a boolean expression.
//
// Given the CPU feature and 3 inputs, v is replaced in the following
// cases:
//
// 1) v is a root
// 2) u = NOT(v) and u lacks the CPU feature
// 3) u = OP(v, w) and u lacks the CPU feature
// 4) u = OP(v, w) and u has more than 3 variable inputs.
var rewrite func(v *Value) [3]*Value
// computeTT returns the truth table for a boolean expression
// over the variables in vars, where vars[0] varies slowest in
// the truth table and vars[2] varies fastest.
// e.g. computeTT( "and(x, or(y, not(z)))", {x,y,z} ) returns
// (bit 0 first) 0 0 0 0 1 0 1 1 = (reversed) 1101_0000 = 0xD0
// x: 0 0 0 0 1 1 1 1
// y: 0 0 1 1 0 0 1 1
// z: 0 1 0 1 0 1 0 1
var computeTT func(v *Value, vars [3]*Value) uint8
// combine merges two sets of variables into one, reporting ok when the
// merged set contains 3 or fewer elements. Combine
// ensures that the sets of Values never contain duplicates.
// (Duplicates would create less-efficient code, not incorrect code.)
combine := func(a, b [3]*Value) ([3]*Value, bool) {
var c [3]*Value
i := 0
for _, v := range a {
if v == nil {
break
}
c[i] = v
i++
}
bloop:
for _, v := range b {
if v == nil {
break
}
for _, u := range a {
if v == u {
continue bloop
}
}
if i == 3 {
return [3]*Value{}, false
}
c[i] = v
i++
}
return c, true
}
computeTT = func(v *Value, vars [3]*Value) uint8 {
i := 0
for ; i < len(vars); i++ {
if vars[i] == v {
return truthTableValues[i]
}
}
slo := boolExprTrees[v] &^ sloInterior
a := computeTT(v.Args[0], vars)
switch slo {
case sloNot:
return ^a
case sloAnd:
return a & computeTT(v.Args[1], vars)
case sloXor:
return a ^ computeTT(v.Args[1], vars)
case sloOr:
return a | computeTT(v.Args[1], vars)
case sloAndNot:
return a & ^computeTT(v.Args[1], vars)
}
panic("switch should have covered all cases, or unknown var in logical expression")
}
replace := func(a0 *Value, vars0 [3]*Value) {
imm := computeTT(a0, vars0)
op := ternOpForLogical(a0.Op)
if op == a0.Op {
panic(fmt.Errorf("should have mapped away from input op, a0 is %s", a0.LongString()))
}
if f.pass.debug > 0 {
f.Warnl(a0.Pos, "Rewriting %s into %v of 0b%b %v %v %v", a0.LongString(), op, imm,
vars0[0], vars0[1], vars0[2])
}
a0.reset(op)
a0.SetArgs3(vars0[0], vars0[1], vars0[2])
a0.AuxInt = int64(int8(imm))
}
// addOne adds a single value to a set that is not full, without
// introducing duplicates. It seems possible that a shared
// subexpression, in tricky combination with blocks lacking the
// AVX512 feature, could lead to such a duplicate insertion.
addOne := func(vars [3]*Value, v *Value) [3]*Value {
if vars[2] != nil {
panic("rewriteTern.addOne, vars[2] should be nil")
}
if v == vars[0] || v == vars[1] {
return vars
}
if vars[1] == nil {
vars[1] = v
} else {
vars[2] = v
}
return vars
}
rewrite = func(v *Value) [3]*Value {
slo := boolExprTrees[v]
if slo == sloInterior { // leaf node, i.e., a "variable"
return [3]*Value{v, nil, nil}
}
var vars [3]*Value
hasFeature := v.Block.CPUfeatures.hasFeature(CPUavx512)
if slo&sloNot == sloNot {
vars = rewrite(v.Args[0])
if !hasFeature {
if vars[2] != nil {
replace(v.Args[0], vars)
return [3]*Value{v, nil, nil}
}
return vars
}
} else {
var ok bool
a0, a1 := v.Args[0], v.Args[1]
vars0 := rewrite(a0)
vars1 := rewrite(a1)
vars, ok = combine(vars0, vars1)
if f.pass.debug > 1 {
f.Warnl(a0.Pos, "combine(%v, %v) -> %v, %v", vars0, vars1, vars, ok)
}
if !(ok && v.Block.CPUfeatures.hasFeature(CPUavx512)) {
// too many variables, or cannot rewrite current values.
// rewrite one or both subtrees if possible
if vars0[2] != nil && a0.Block.CPUfeatures.hasFeature(CPUavx512) {
replace(a0, vars0)
}
if vars1[2] != nil && a1.Block.CPUfeatures.hasFeature(CPUavx512) {
replace(a1, vars1)
}
// 3-element var arrays are either rewritten, or unable to be rewritten
// because of the features in effect in their block. Either way, they
// are treated as a "new var" if 3 elements are present.
if vars0[2] == nil {
if vars1[2] == nil {
// both subtrees are 2-element and were not rewritten.
//
// TODO a clever person would look at subtrees of inputs,
// e.g. rewrite
// ((a AND b) XOR b) XOR (d XOR (c AND d))
// to (((a AND b) XOR b) XOR d) XOR (c AND d)
// to v = TERNLOG(truthtable, a, b, d) XOR (c AND d)
// and return the variable set {v, c, d}
//
// But for now, just restart with a0 and a1.
return [3]*Value{a0, a1, nil}
} else {
// a1 (maybe) rewrote, a0 has room for another var
vars = addOne(vars0, a1)
}
} else if vars1[2] == nil {
// a0 (maybe) rewrote, a1 has room for another var
vars = addOne(vars1, a0)
} else if !ok {
// both (maybe) rewrote
// a0 and a1 are different because otherwise their variable
// sets would have combined "ok".
return [3]*Value{a0, a1, nil}
}
// continue with either the vars from "ok" or the updated set of vars.
}
}
// if root and 3 vars and hasFeature, rewrite.
if slo&sloInterior == 0 && vars[2] != nil && hasFeature {
replace(v, vars)
return [3]*Value{v, nil, nil}
}
return vars
}
for _, v := range roots {
if f.pass.debug > 1 {
f.Warnl(v.Pos, "SLO root %s", v.LongString())
}
rewrite(v)
}
}
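As a reading aid, and not part of the change above: the immediate that replace passes to the TERNLOG op is plain 8-bit truth-table arithmetic over the three canonical tables in truthTableValues. A minimal standalone sketch of that arithmetic:

package main

import "fmt"

func main() {
	// The three canonical truth tables (truthTableValues): vars[0] varies
	// slowest, vars[2] fastest.
	x, y, z := uint8(0xF0), uint8(0xCC), uint8(0xAA)
	// What computeTT would produce for the expression (x AND y) OR (NOT z).
	imm := (x & y) | ^z
	fmt.Printf("0b%08b\n", imm) // 0b11010101 = 0xD5, the VPTERNLOG immediate
}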


@ -21,7 +21,7 @@ func TestSizeof(t *testing.T) {
_64bit uintptr // size on 64bit platforms
}{
{Value{}, 72, 112},
{Block{}, 164, 304},
{Block{}, 168, 312},
{LocalSlot{}, 28, 40},
{valState{}, 28, 40},
}


@ -0,0 +1,160 @@
// Code generated by 'go run genfiles.go'; DO NOT EDIT.
package ssa
type SIMDLogicalOP uint8
const (
// boolean simd operations, for reducing expressions to VPTERNLOG* instructions
// sloInterior is set for non-root nodes in logical-op expression trees.
// the operations are even-numbered.
sloInterior SIMDLogicalOP = 1
sloNone SIMDLogicalOP = 2 * iota
sloAnd
sloOr
sloAndNot
sloXor
sloNot
)
func classifyBooleanSIMD(v *Value) SIMDLogicalOP {
switch v.Op {
case OpAndInt8x16, OpAndInt16x8, OpAndInt32x4, OpAndInt64x2, OpAndInt8x32, OpAndInt16x16, OpAndInt32x8, OpAndInt64x4, OpAndInt8x64, OpAndInt16x32, OpAndInt32x16, OpAndInt64x8:
return sloAnd
case OpOrInt8x16, OpOrInt16x8, OpOrInt32x4, OpOrInt64x2, OpOrInt8x32, OpOrInt16x16, OpOrInt32x8, OpOrInt64x4, OpOrInt8x64, OpOrInt16x32, OpOrInt32x16, OpOrInt64x8:
return sloOr
case OpAndNotInt8x16, OpAndNotInt16x8, OpAndNotInt32x4, OpAndNotInt64x2, OpAndNotInt8x32, OpAndNotInt16x16, OpAndNotInt32x8, OpAndNotInt64x4, OpAndNotInt8x64, OpAndNotInt16x32, OpAndNotInt32x16, OpAndNotInt64x8:
return sloAndNot
case OpXorInt8x16:
if y := v.Args[1]; y.Op == OpEqualInt8x16 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt16x8:
if y := v.Args[1]; y.Op == OpEqualInt16x8 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt32x4:
if y := v.Args[1]; y.Op == OpEqualInt32x4 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt64x2:
if y := v.Args[1]; y.Op == OpEqualInt64x2 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt8x32:
if y := v.Args[1]; y.Op == OpEqualInt8x32 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt16x16:
if y := v.Args[1]; y.Op == OpEqualInt16x16 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt32x8:
if y := v.Args[1]; y.Op == OpEqualInt32x8 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt64x4:
if y := v.Args[1]; y.Op == OpEqualInt64x4 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt8x64:
if y := v.Args[1]; y.Op == OpEqualInt8x64 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt16x32:
if y := v.Args[1]; y.Op == OpEqualInt16x32 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt32x16:
if y := v.Args[1]; y.Op == OpEqualInt32x16 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
case OpXorInt64x8:
if y := v.Args[1]; y.Op == OpEqualInt64x8 &&
y.Args[0] == y.Args[1] {
return sloNot
}
return sloXor
}
return sloNone
}
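An editor's note on the Xor special case above, not part of the generated file: classifying Xor(x, Equal(y, y)) as a NOT relies on an integer lane compared equal to itself yielding all ones (PCMPEQ-style), and on x XOR all-ones being the bitwise complement of x. Per 8-bit lane, in scalar form:

// notViaXorEqual shows, for one 8-bit lane, why Xor(x, Equal(y, y)) is a NOT.
func notViaXorEqual(x uint8) uint8 {
	allOnes := uint8(0xFF) // what a lane compared equal to itself produces
	return x ^ allOnes     // identical to ^x
}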
func ternOpForLogical(op Op) Op {
switch op {
case OpAndInt8x16, OpOrInt8x16, OpXorInt8x16, OpAndNotInt8x16:
return OpternInt32x4
case OpAndUint8x16, OpOrUint8x16, OpXorUint8x16, OpAndNotUint8x16:
return OpternUint32x4
case OpAndInt16x8, OpOrInt16x8, OpXorInt16x8, OpAndNotInt16x8:
return OpternInt32x4
case OpAndUint16x8, OpOrUint16x8, OpXorUint16x8, OpAndNotUint16x8:
return OpternUint32x4
case OpAndInt32x4, OpOrInt32x4, OpXorInt32x4, OpAndNotInt32x4:
return OpternInt32x4
case OpAndUint32x4, OpOrUint32x4, OpXorUint32x4, OpAndNotUint32x4:
return OpternUint32x4
case OpAndInt64x2, OpOrInt64x2, OpXorInt64x2, OpAndNotInt64x2:
return OpternInt64x2
case OpAndUint64x2, OpOrUint64x2, OpXorUint64x2, OpAndNotUint64x2:
return OpternUint64x2
case OpAndInt8x32, OpOrInt8x32, OpXorInt8x32, OpAndNotInt8x32:
return OpternInt32x8
case OpAndUint8x32, OpOrUint8x32, OpXorUint8x32, OpAndNotUint8x32:
return OpternUint32x8
case OpAndInt16x16, OpOrInt16x16, OpXorInt16x16, OpAndNotInt16x16:
return OpternInt32x8
case OpAndUint16x16, OpOrUint16x16, OpXorUint16x16, OpAndNotUint16x16:
return OpternUint32x8
case OpAndInt32x8, OpOrInt32x8, OpXorInt32x8, OpAndNotInt32x8:
return OpternInt32x8
case OpAndUint32x8, OpOrUint32x8, OpXorUint32x8, OpAndNotUint32x8:
return OpternUint32x8
case OpAndInt64x4, OpOrInt64x4, OpXorInt64x4, OpAndNotInt64x4:
return OpternInt64x4
case OpAndUint64x4, OpOrUint64x4, OpXorUint64x4, OpAndNotUint64x4:
return OpternUint64x4
case OpAndInt8x64, OpOrInt8x64, OpXorInt8x64, OpAndNotInt8x64:
return OpternInt32x16
case OpAndUint8x64, OpOrUint8x64, OpXorUint8x64, OpAndNotUint8x64:
return OpternUint32x16
case OpAndInt16x32, OpOrInt16x32, OpXorInt16x32, OpAndNotInt16x32:
return OpternInt32x16
case OpAndUint16x32, OpOrUint16x32, OpXorUint16x32, OpAndNotUint16x32:
return OpternUint32x16
case OpAndInt32x16, OpOrInt32x16, OpXorInt32x16, OpAndNotInt32x16:
return OpternInt32x16
case OpAndUint32x16, OpOrUint32x16, OpXorUint32x16, OpAndNotUint32x16:
return OpternUint32x16
case OpAndInt64x8, OpOrInt64x8, OpXorInt64x8, OpAndNotInt64x8:
return OpternInt64x8
case OpAndUint64x8, OpOrUint64x8, OpXorUint64x8, OpAndNotUint64x8:
return OpternUint64x8
}
return op
}
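Why every element width collapses onto 32- or 64-bit lanes in the table above (illustration only, not part of the generated file): And, Or, Xor and AndNot are pure bitwise operations, so reinterpreting the lane width does not change the result; only the overall vector width has to match. A scalar sketch of that width-independence:

// word assembles four bytes into a little-endian uint32.
func word(b [4]byte) uint32 {
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

// andBytesEqualsAndWords reports whether AND-ing four byte lanes and AND-ing
// the same bits viewed as a single 32-bit lane give identical results
// (they always do).
func andBytesEqualsAndWords(a, b [4]byte) bool {
	var c [4]byte
	for i := range c {
		c[i] = a[i] & b[i]
	}
	return word(c) == word(a)&word(b)
}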


@ -9,6 +9,7 @@ import (
"cmd/compile/internal/types"
"cmd/internal/src"
"fmt"
"internal/buildcfg"
"math"
"sort"
"strings"
@ -612,12 +613,18 @@ func AutoVar(v *Value) (*ir.Name, int64) {
// CanSSA reports whether values of type t can be represented as a Value.
func CanSSA(t *types.Type) bool {
types.CalcSize(t)
if t.Size() > int64(4*types.PtrSize) {
if t.IsSIMD() {
return true
}
sizeLimit := int64(MaxStruct * types.PtrSize)
if t.Size() > sizeLimit {
// 4*Widthptr is an arbitrary constant. We want it
// to be at least 3*Widthptr so slices can be registerized.
// Too big and we'll introduce too much register pressure.
if !buildcfg.Experiment.SIMD {
return false
}
}
switch t.Kind() {
case types.TARRAY:
// We can't do larger arrays because dynamic indexing is
@ -636,7 +643,17 @@ func CanSSA(t *types.Type) bool {
return false
}
}
// Special check for SIMD. If the composite type
// contains SIMD vectors, we can return true
// if it passes the checks below.
if !buildcfg.Experiment.SIMD {
return true
}
if t.Size() <= sizeLimit {
return true
}
i, f := t.Registers()
return i+f <= MaxStruct
default:
return true
}


@ -99,6 +99,18 @@ func (s *SymABIs) ReadSymABIs(file string) {
}
}
// HasDef returns whether the given symbol has an assembly definition.
func (s *SymABIs) HasDef(sym *types.Sym) bool {
symName := sym.Linkname
if symName == "" {
symName = sym.Pkg.Prefix + "." + sym.Name
}
symName = s.canonicalize(symName)
_, hasDefABI := s.defs[symName]
return hasDefABI
}
// GenABIWrappers applies ABI information to Funcs and generates ABI
// wrapper functions where necessary.
func (s *SymABIs) GenABIWrappers() {


@ -12,6 +12,7 @@ import (
"cmd/compile/internal/base"
"cmd/compile/internal/ir"
"cmd/compile/internal/ssa"
"cmd/compile/internal/typecheck"
"cmd/compile/internal/types"
"cmd/internal/sys"
)
@ -1632,6 +1633,495 @@ func initIntrinsics(cfg *intrinsicBuildConfig) {
return s.newValue1(ssa.OpCvtBoolToUint8, types.Types[types.TUINT8], args[0])
},
all...)
if buildcfg.Experiment.SIMD {
// Only enable these intrinsics if the SIMD experiment is on.
simdIntrinsics(addF)
addF("simd", "ClearAVXUpperBits",
func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
s.vars[memVar] = s.newValue1(ssa.OpAMD64VZEROUPPER, types.TypeMem, s.mem())
return nil
},
sys.AMD64)
addF(simdPackage, "Int8x16.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int16x8.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int32x4.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int64x2.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint8x16.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint16x8.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint32x4.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint64x2.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int8x32.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int16x16.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int32x8.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Int64x4.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint8x32.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint16x16.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint32x8.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
addF(simdPackage, "Uint64x4.IsZero", opLen1(ssa.OpIsZeroVec, types.Types[types.TBOOL]), sys.AMD64)
sfp4 := func(method string, hwop ssa.Op, vectype *types.Type) {
addF("simd", method,
func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
x, a, b, c, d, y := args[0], args[1], args[2], args[3], args[4], args[5]
if a.Op == ssa.OpConst8 && b.Op == ssa.OpConst8 && c.Op == ssa.OpConst8 && d.Op == ssa.OpConst8 {
return select4FromPair(x, a, b, c, d, y, s, hwop, vectype)
} else {
return s.callResult(n, callNormal)
}
},
sys.AMD64)
}
sfp4("Int32x4.SelectFromPair", ssa.OpconcatSelectedConstantInt32x4, types.TypeVec128)
sfp4("Uint32x4.SelectFromPair", ssa.OpconcatSelectedConstantUint32x4, types.TypeVec128)
sfp4("Float32x4.SelectFromPair", ssa.OpconcatSelectedConstantFloat32x4, types.TypeVec128)
sfp4("Int32x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedInt32x8, types.TypeVec256)
sfp4("Uint32x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedUint32x8, types.TypeVec256)
sfp4("Float32x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedFloat32x8, types.TypeVec256)
sfp4("Int32x16.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedInt32x16, types.TypeVec512)
sfp4("Uint32x16.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedUint32x16, types.TypeVec512)
sfp4("Float32x16.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedFloat32x16, types.TypeVec512)
sfp2 := func(method string, hwop ssa.Op, vectype *types.Type, cscimm func(i, j uint8) int64) {
addF("simd", method,
func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
x, a, b, y := args[0], args[1], args[2], args[3]
if a.Op == ssa.OpConst8 && b.Op == ssa.OpConst8 {
return select2FromPair(x, a, b, y, s, hwop, vectype, cscimm)
} else {
return s.callResult(n, callNormal)
}
},
sys.AMD64)
}
sfp2("Uint64x2.SelectFromPair", ssa.OpconcatSelectedConstantUint64x2, types.TypeVec128, cscimm2)
sfp2("Int64x2.SelectFromPair", ssa.OpconcatSelectedConstantInt64x2, types.TypeVec128, cscimm2)
sfp2("Float64x2.SelectFromPair", ssa.OpconcatSelectedConstantFloat64x2, types.TypeVec128, cscimm2)
sfp2("Uint64x4.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedUint64x4, types.TypeVec256, cscimm2g2)
sfp2("Int64x4.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedInt64x4, types.TypeVec256, cscimm2g2)
sfp2("Float64x4.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedFloat64x4, types.TypeVec256, cscimm2g2)
sfp2("Uint64x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedUint64x8, types.TypeVec512, cscimm2g4)
sfp2("Int64x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedInt64x8, types.TypeVec512, cscimm2g4)
sfp2("Float64x8.SelectFromPairGrouped", ssa.OpconcatSelectedConstantGroupedFloat64x8, types.TypeVec512, cscimm2g4)
}
}
func cscimm4(a, b, c, d uint8) int64 {
return se(a + b<<2 + c<<4 + d<<6)
}
func cscimm2(a, b uint8) int64 {
return se(a + b<<1)
}
func cscimm2g2(a, b uint8) int64 {
g := cscimm2(a, b)
return int64(int8(g + g<<2))
}
func cscimm2g4(a, b uint8) int64 {
g := cscimm2g2(a, b)
return int64(int8(g + g<<4))
}
const (
_LLLL = iota
_HLLL
_LHLL
_HHLL
_LLHL
_HLHL
_LHHL
_HHHL
_LLLH
_HLLH
_LHLH
_HHLH
_LLHH
_HLHH
_LHHH
_HHHH
)
const (
_LL = iota
_HL
_LH
_HH
)
func select2FromPair(x, _a, _b, y *ssa.Value, s *state, op ssa.Op, t *types.Type, csc func(a, b uint8) int64) *ssa.Value {
a, b := uint8(_a.AuxInt8()), uint8(_b.AuxInt8())
pattern := (a&2)>>1 + (b & 2)
a, b = a&1, b&1
switch pattern {
case _LL:
return s.newValue2I(op, t, csc(a, b), x, x)
case _HH:
return s.newValue2I(op, t, csc(a, b), y, y)
case _LH:
return s.newValue2I(op, t, csc(a, b), x, y)
case _HL:
return s.newValue2I(op, t, csc(a, b), y, x)
}
panic("The preceding switch should have been exhaustive")
}
func select4FromPair(x, _a, _b, _c, _d, y *ssa.Value, s *state, op ssa.Op, t *types.Type) *ssa.Value {
a, b, c, d := uint8(_a.AuxInt8()), uint8(_b.AuxInt8()), uint8(_c.AuxInt8()), uint8(_d.AuxInt8())
pattern := a>>2 + (b&4)>>1 + (c & 4) + (d&4)<<1
a, b, c, d = a&3, b&3, c&3, d&3
switch pattern {
case _LLLL:
// TODO DETECT 0,1,2,3, 0,0,0,0
return s.newValue2I(op, t, cscimm4(a, b, c, d), x, x)
case _HHHH:
// TODO DETECT 0,1,2,3, 0,0,0,0
return s.newValue2I(op, t, cscimm4(a, b, c, d), y, y)
case _LLHH:
return s.newValue2I(op, t, cscimm4(a, b, c, d), x, y)
case _HHLL:
return s.newValue2I(op, t, cscimm4(a, b, c, d), y, x)
case _HLLL:
z := s.newValue2I(op, t, cscimm4(a, a, b, b), y, x)
return s.newValue2I(op, t, cscimm4(0, 2, c, d), z, x)
case _LHLL:
z := s.newValue2I(op, t, cscimm4(a, a, b, b), x, y)
return s.newValue2I(op, t, cscimm4(0, 2, c, d), z, x)
case _HLHH:
z := s.newValue2I(op, t, cscimm4(a, a, b, b), y, x)
return s.newValue2I(op, t, cscimm4(0, 2, c, d), z, y)
case _LHHH:
z := s.newValue2I(op, t, cscimm4(a, a, b, b), x, y)
return s.newValue2I(op, t, cscimm4(0, 2, c, d), z, y)
case _LLLH:
z := s.newValue2I(op, t, cscimm4(c, c, d, d), x, y)
return s.newValue2I(op, t, cscimm4(a, b, 0, 2), x, z)
case _LLHL:
z := s.newValue2I(op, t, cscimm4(c, c, d, d), y, x)
return s.newValue2I(op, t, cscimm4(a, b, 0, 2), x, z)
case _HHLH:
z := s.newValue2I(op, t, cscimm4(c, c, d, d), x, y)
return s.newValue2I(op, t, cscimm4(a, b, 0, 2), y, z)
case _HHHL:
z := s.newValue2I(op, t, cscimm4(c, c, d, d), y, x)
return s.newValue2I(op, t, cscimm4(a, b, 0, 2), y, z)
case _LHLH:
z := s.newValue2I(op, t, cscimm4(a, c, b, d), x, y)
return s.newValue2I(op, t, se(0b11_01_10_00), z, z)
case _HLHL:
z := s.newValue2I(op, t, cscimm4(b, d, a, c), x, y)
return s.newValue2I(op, t, se(0b01_11_00_10), z, z)
case _HLLH:
z := s.newValue2I(op, t, cscimm4(b, c, a, d), x, y)
return s.newValue2I(op, t, se(0b11_01_00_10), z, z)
case _LHHL:
z := s.newValue2I(op, t, cscimm4(a, d, b, c), x, y)
return s.newValue2I(op, t, se(0b01_11_10_00), z, z)
}
panic("The preceding switch should have been exhaustive")
}
// se smears the not-really-a-sign bit of a uint8 to conform to the conventions
// for representing AuxInt in ssa.
func se(x uint8) int64 {
return int64(int8(x))
}
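A worked example of the immediate arithmetic above, as an editor's illustration; the lane numbering (0-3 from the first operand, 4-7 from the second) is inferred from the pattern constants and the intrinsic wiring earlier in this hunk. Selecting source lanes 1, 3, 4, 6 falls into the _LLHH pattern, so a single shuffle suffices:

// cscimm4Sketch repeats the packing done by cscimm4: four 2-bit lane
// selectors, with the lowest bits choosing the first result lane.
func cscimm4Sketch(a, b, c, d uint8) int64 {
	return int64(int8(a + b<<2 + c<<4 + d<<6))
}

// For source lanes (1, 3, 4, 6): pattern = _LLHH, the masked selectors are
// (1, 3, 0, 2), and the single emitted op receives
//	cscimm4Sketch(1, 3, 0, 2) == se(0b10_00_11_01) == -115
// with x as its first operand and y as its second.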
func opLen1(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue1(op, t, args[0])
}
}
func opLen2(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue2(op, t, args[0], args[1])
}
}
func opLen2_21(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue2(op, t, args[1], args[0])
}
}
func opLen3(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue3(op, t, args[0], args[1], args[2])
}
}
var ssaVecBySize = map[int64]*types.Type{
16: types.TypeVec128,
32: types.TypeVec256,
64: types.TypeVec512,
}
func opLen3_31Zero3(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if t, ok := ssaVecBySize[args[1].Type.Size()]; !ok {
panic("unknown simd vector size")
} else {
return s.newValue3(op, t, s.newValue0(ssa.OpZeroSIMD, t), args[1], args[0])
}
}
}
func opLen3_21(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue3(op, t, args[1], args[0], args[2])
}
}
func opLen3_231(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue3(op, t, args[2], args[0], args[1])
}
}
func opLen4(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue4(op, t, args[0], args[1], args[2], args[3])
}
}
func opLen4_231(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue4(op, t, args[2], args[0], args[1], args[3])
}
}
func opLen4_31(op ssa.Op, t *types.Type) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue4(op, t, args[2], args[1], args[0], args[3])
}
}
func immJumpTable(s *state, idx *ssa.Value, intrinsicCall *ir.CallExpr, genOp func(*state, int)) *ssa.Value {
// Make blocks we'll need.
bEnd := s.f.NewBlock(ssa.BlockPlain)
if !idx.Type.IsKind(types.TUINT8) {
panic("immJumpTable expects uint8 value")
}
// We will exhaust 0-255, so no need to check the bounds.
t := types.Types[types.TUINTPTR]
idx = s.conv(nil, idx, idx.Type, t)
b := s.curBlock
b.Kind = ssa.BlockJumpTable
b.Pos = intrinsicCall.Pos()
if base.Flag.Cfg.SpectreIndex {
// Potential Spectre vulnerability hardening?
idx = s.newValue2(ssa.OpSpectreSliceIndex, t, idx, s.uintptrConstant(255))
}
b.SetControl(idx)
targets := [256]*ssa.Block{}
for i := range 256 {
t := s.f.NewBlock(ssa.BlockPlain)
targets[i] = t
b.AddEdgeTo(t)
}
s.endBlock()
for i, t := range targets {
s.startBlock(t)
genOp(s, i)
if t.Kind != ssa.BlockExit {
t.AddEdgeTo(bEnd)
}
s.endBlock()
}
s.startBlock(bEnd)
ret := s.variable(intrinsicCall, intrinsicCall.Type())
return ret
}
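For orientation, not part of the diff: immJumpTable is the fallback the opLen*Imm8 helpers below use when the immediate is not a compile-time constant; it builds a dense 256-way BlockJumpTable so that each arm still sees a constant immediate. In ordinary Go the shape is roughly the following (4 arms instead of 256, and withConst standing in for building the SSA value with a constant AuxInt):

// dispatchImm sketches the jump-table idea: a run-time immediate selects one
// of several arms, each of which uses a distinct compile-time constant.
func dispatchImm(imm uint8, withConst func(constImm uint8) uint64) uint64 {
	switch imm & 3 { // the real table covers every uint8 value 0..255
	case 0:
		return withConst(0)
	case 1:
		return withConst(1)
	case 2:
		return withConst(2)
	default:
		return withConst(3)
	}
}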
func opLen1Imm8(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 {
return s.newValue1I(op, t, args[1].AuxInt<<int64(offset), args[0])
}
return immJumpTable(s, args[1], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue1I(op, t, int64(int8(idx<<offset)), args[0])
})
}
}
func opLen2Imm8(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 {
return s.newValue2I(op, t, args[1].AuxInt<<int64(offset), args[0], args[2])
}
return immJumpTable(s, args[1], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue2I(op, t, int64(int8(idx<<offset)), args[0], args[2])
})
}
}
func opLen3Imm8(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 {
return s.newValue3I(op, t, args[1].AuxInt<<int64(offset), args[0], args[2], args[3])
}
return immJumpTable(s, args[1], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue3I(op, t, int64(int8(idx<<offset)), args[0], args[2], args[3])
})
}
}
func opLen2Imm8_2I(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[2].Op == ssa.OpConst8 {
return s.newValue2I(op, t, args[2].AuxInt<<int64(offset), args[0], args[1])
}
return immJumpTable(s, args[2], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue2I(op, t, int64(int8(idx<<offset)), args[0], args[1])
})
}
}
// Two immediates instead of just 1. Offset is ignored, so it is a _ parameter instead.
func opLen2Imm8_II(op ssa.Op, t *types.Type, _ int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 && args[2].Op == ssa.OpConst8 && args[1].AuxInt & ^3 == 0 && args[2].AuxInt & ^3 == 0 {
i1, i2 := args[1].AuxInt, args[2].AuxInt
return s.newValue2I(op, t, int64(int8(i1+i2<<4)), args[0], args[3])
}
four := s.constInt64(types.Types[types.TUINT8], 4)
shifted := s.newValue2(ssa.OpLsh8x8, types.Types[types.TUINT8], args[2], four)
combined := s.newValue2(ssa.OpAdd8, types.Types[types.TUINT8], args[1], shifted)
return immJumpTable(s, combined, n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
// TODO for "zeroing" values, panic instead.
if idx & ^(3+3<<4) == 0 {
s.vars[n] = sNew.newValue2I(op, t, int64(int8(idx)), args[0], args[3])
} else {
sNew.rtcall(ir.Syms.PanicSimdImm, false, nil)
}
})
}
}
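A small worked example of the two-immediate packing above (editor's illustration): the two 2-bit immediates share a single AuxInt byte, the first in the low nibble and the second shifted into the high nibble.

// packTwoImm mirrors the constant path of opLen2Imm8_II: i1 occupies bits 0-1
// and i2 occupies bits 4-5, e.g. packTwoImm(2, 3) == 0x32.
func packTwoImm(i1, i2 uint8) int8 {
	return int8(i1 + i2<<4)
}

// The guard idx & ^(3+3<<4) == 0 accepts exactly such packed values:
// 0x32 & ^0x33 == 0 takes the fast path, while a value like 0x41 fails the
// check and reaches the panicSimdImm call in the jump-table fallback.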
// The assembler requires the imm value of a SHA1RNDS4 instruction to be one of 0,1,2,3...
func opLen2Imm8_SHA1RNDS4(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 {
return s.newValue2I(op, t, (args[1].AuxInt<<int64(offset))&0b11, args[0], args[2])
}
return immJumpTable(s, args[1], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue2I(op, t, int64(int8(idx<<offset))&0b11, args[0], args[2])
})
}
}
func opLen3Imm8_2I(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[2].Op == ssa.OpConst8 {
return s.newValue3I(op, t, args[2].AuxInt<<int64(offset), args[0], args[1], args[3])
}
return immJumpTable(s, args[2], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue3I(op, t, int64(int8(idx<<offset)), args[0], args[1], args[3])
})
}
}
func opLen4Imm8(op ssa.Op, t *types.Type, offset int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
if args[1].Op == ssa.OpConst8 {
return s.newValue4I(op, t, args[1].AuxInt<<int64(offset), args[0], args[2], args[3], args[4])
}
return immJumpTable(s, args[1], n, func(sNew *state, idx int) {
// Encode as int8 due to requirement of AuxInt, check its comment for details.
s.vars[n] = sNew.newValue4I(op, t, int64(int8(idx<<offset)), args[0], args[2], args[3], args[4])
})
}
}
func simdLoad() func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue2(ssa.OpLoad, n.Type(), args[0], s.mem())
}
}
func simdStore() func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
s.store(args[0].Type, args[1], args[0])
return nil
}
}
var cvtVToMaskOpcodes = map[int]map[int]ssa.Op{
8: {16: ssa.OpCvt16toMask8x16, 32: ssa.OpCvt32toMask8x32, 64: ssa.OpCvt64toMask8x64},
16: {8: ssa.OpCvt8toMask16x8, 16: ssa.OpCvt16toMask16x16, 32: ssa.OpCvt32toMask16x32},
32: {4: ssa.OpCvt8toMask32x4, 8: ssa.OpCvt8toMask32x8, 16: ssa.OpCvt16toMask32x16},
64: {2: ssa.OpCvt8toMask64x2, 4: ssa.OpCvt8toMask64x4, 8: ssa.OpCvt8toMask64x8},
}
var cvtMaskToVOpcodes = map[int]map[int]ssa.Op{
8: {16: ssa.OpCvtMask8x16to16, 32: ssa.OpCvtMask8x32to32, 64: ssa.OpCvtMask8x64to64},
16: {8: ssa.OpCvtMask16x8to8, 16: ssa.OpCvtMask16x16to16, 32: ssa.OpCvtMask16x32to32},
32: {4: ssa.OpCvtMask32x4to8, 8: ssa.OpCvtMask32x8to8, 16: ssa.OpCvtMask32x16to16},
64: {2: ssa.OpCvtMask64x2to8, 4: ssa.OpCvtMask64x4to8, 8: ssa.OpCvtMask64x8to8},
}
func simdCvtVToMask(elemBits, lanes int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
op := cvtVToMaskOpcodes[elemBits][lanes]
if op == 0 {
panic(fmt.Sprintf("Unknown mask shape: Mask%dx%d", elemBits, lanes))
}
return s.newValue1(op, types.TypeMask, args[0])
}
}
func simdCvtMaskToV(elemBits, lanes int) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
op := cvtMaskToVOpcodes[elemBits][lanes]
if op == 0 {
panic(fmt.Sprintf("Unknown mask shape: Mask%dx%d", elemBits, lanes))
}
return s.newValue1(op, n.Type(), args[0])
}
}
func simdMaskedLoad(op ssa.Op) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return s.newValue3(op, n.Type(), args[0], args[1], s.mem())
}
}
func simdMaskedStore(op ssa.Op) func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
return func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value {
s.vars[memVar] = s.newValue4A(op, types.TypeMem, args[0].Type, args[1], args[2], args[0], s.mem())
return nil
}
}
// findIntrinsic returns a function which builds the SSA equivalent of the
@ -1657,7 +2147,8 @@ func findIntrinsic(sym *types.Sym) intrinsicBuilder {
fn := sym.Name
if ssa.IntrinsicsDisable {
if pkg == "internal/runtime/sys" && (fn == "GetCallerPC" || fn == "GrtCallerSP" || fn == "GetClosurePtr") {
if pkg == "internal/runtime/sys" && (fn == "GetCallerPC" || fn == "GrtCallerSP" || fn == "GetClosurePtr") ||
pkg == "internal/simd" || pkg == "simd" { // TODO after simd has been moved to package simd, remove internal/simd
// These runtime functions don't have definitions, must be intrinsics.
} else {
return nil
@ -1672,7 +2163,74 @@ func IsIntrinsicCall(n *ir.CallExpr) bool {
}
name, ok := n.Fun.(*ir.Name)
if !ok {
if n.Fun.Op() == ir.OMETHEXPR {
if meth := ir.MethodExprName(n.Fun); meth != nil {
if fn := meth.Func; fn != nil {
return IsIntrinsicSym(fn.Sym())
}
}
}
return false
}
return findIntrinsic(name.Sym()) != nil
return IsIntrinsicSym(name.Sym())
}
func IsIntrinsicSym(sym *types.Sym) bool {
return findIntrinsic(sym) != nil
}
// GenIntrinsicBody generates the function body for a bodyless intrinsic.
// This is used when the intrinsic is used in a non-call context, e.g.
// as a function pointer, or (for a method) being referenced from the type
// descriptor.
//
// The compiler already recognizes a call to fn as an intrinsic and can
// directly generate code for it. So we just fill in the body with a call
// to fn.
func GenIntrinsicBody(fn *ir.Func) {
if ir.CurFunc != nil {
base.FatalfAt(fn.Pos(), "enqueueFunc %v inside %v", fn, ir.CurFunc)
}
if base.Flag.LowerR != 0 {
fmt.Println("generate intrinsic for", ir.FuncName(fn))
}
pos := fn.Pos()
ft := fn.Type()
var ret ir.Node
// For a method, it usually starts with an ODOTMETH (pre-typecheck) or
// OMETHEXPR (post-typecheck) referencing the method symbol without the
// receiver type, and Walk rewrites it to a call directly to the
// type-qualified method symbol, moving the receiver to an argument.
// Here fn already has the type-qualified method symbol, and it is hard
// to get the unqualified symbol. So we just generate the post-Walk form
// and mark it typechecked and Walked.
call := ir.NewCallExpr(pos, ir.OCALLFUNC, fn.Nname, nil)
call.Args = ir.RecvParamNames(ft)
call.IsDDD = ft.IsVariadic()
typecheck.Exprs(call.Args)
call.SetTypecheck(1)
call.SetWalked(true)
ret = call
if ft.NumResults() > 0 {
if ft.NumResults() == 1 {
call.SetType(ft.Result(0).Type)
} else {
call.SetType(ft.ResultsTuple())
}
n := ir.NewReturnStmt(base.Pos, nil)
n.Results = []ir.Node{call}
ret = n
}
fn.Body.Append(ret)
if base.Flag.LowerR != 0 {
ir.DumpList("generate intrinsic body", fn.Body)
}
ir.CurFunc = fn
typecheck.Stmts(fn.Body)
ir.CurFunc = nil // we know CurFunc is nil at entry
}


@ -16,6 +16,9 @@ import (
var updateIntrinsics = flag.Bool("update", false, "Print an updated intrinsics table")
// TODO turn on after SIMD is stable. The time burned keeping this test happy during SIMD development has already well exceeded any plausible benefit.
var simd = flag.Bool("simd", false, "Also check SIMD intrinsics; for now, it is noisy and not helpful")
type testIntrinsicKey struct {
archName string
pkg string
@ -1403,13 +1406,13 @@ func TestIntrinsics(t *testing.T) {
gotIntrinsics[testIntrinsicKey{ik.arch.Name, ik.pkg, ik.fn}] = struct{}{}
}
for ik, _ := range gotIntrinsics {
if _, found := wantIntrinsics[ik]; !found {
if _, found := wantIntrinsics[ik]; !found && (ik.pkg != "simd" || *simd) {
t.Errorf("Got unwanted intrinsic %v %v.%v", ik.archName, ik.pkg, ik.fn)
}
}
for ik, _ := range wantIntrinsics {
if _, found := gotIntrinsics[ik]; !found {
if _, found := gotIntrinsics[ik]; !found && (ik.pkg != "simd" || *simd) {
t.Errorf("Want missing intrinsic %v %v.%v", ik.archName, ik.pkg, ik.fn)
}
}

File diff suppressed because it is too large.


@ -156,6 +156,7 @@ func InitConfig() {
ir.Syms.Panicnildottype = typecheck.LookupRuntimeFunc("panicnildottype")
ir.Syms.Panicoverflow = typecheck.LookupRuntimeFunc("panicoverflow")
ir.Syms.Panicshift = typecheck.LookupRuntimeFunc("panicshift")
ir.Syms.PanicSimdImm = typecheck.LookupRuntimeFunc("panicSimdImm")
ir.Syms.Racefuncenter = typecheck.LookupRuntimeFunc("racefuncenter")
ir.Syms.Racefuncexit = typecheck.LookupRuntimeFunc("racefuncexit")
ir.Syms.Raceread = typecheck.LookupRuntimeFunc("raceread")
@ -165,9 +166,10 @@ func InitConfig() {
ir.Syms.TypeAssert = typecheck.LookupRuntimeFunc("typeAssert")
ir.Syms.WBZero = typecheck.LookupRuntimeFunc("wbZero")
ir.Syms.WBMove = typecheck.LookupRuntimeFunc("wbMove")
ir.Syms.X86HasAVX = typecheck.LookupRuntimeVar("x86HasAVX") // bool
ir.Syms.X86HasFMA = typecheck.LookupRuntimeVar("x86HasFMA") // bool
ir.Syms.X86HasPOPCNT = typecheck.LookupRuntimeVar("x86HasPOPCNT") // bool
ir.Syms.X86HasSSE41 = typecheck.LookupRuntimeVar("x86HasSSE41") // bool
ir.Syms.X86HasFMA = typecheck.LookupRuntimeVar("x86HasFMA") // bool
ir.Syms.ARMHasVFPv4 = typecheck.LookupRuntimeVar("armHasVFPv4") // bool
ir.Syms.ARM64HasATOMICS = typecheck.LookupRuntimeVar("arm64HasATOMICS") // bool
ir.Syms.Loong64HasLAMCAS = typecheck.LookupRuntimeVar("loong64HasLAMCAS") // bool
@ -600,6 +602,9 @@ func buildssa(fn *ir.Func, worker int, isPgoHot bool) *ssa.Func {
// TODO figure out exactly what's unused, don't spill it. Make liveness fine-grained, also.
for _, p := range params.InParams() {
typs, offs := p.RegisterTypesAndOffsets()
if len(offs) < len(typs) {
s.Fatalf("len(offs)=%d < len(typs)=%d, params=\n%s", len(offs), len(typs), params)
}
for i, t := range typs {
o := offs[i] // offset within parameter
fo := p.FrameOffset(params) // offset of parameter in frame
@ -1333,6 +1338,11 @@ func (s *state) newValue4(op ssa.Op, t *types.Type, arg0, arg1, arg2, arg3 *ssa.
return s.curBlock.NewValue4(s.peekPos(), op, t, arg0, arg1, arg2, arg3)
}
// newValue4A adds a new value with four arguments and an aux value to the current block.
func (s *state) newValue4A(op ssa.Op, t *types.Type, aux ssa.Aux, arg0, arg1, arg2, arg3 *ssa.Value) *ssa.Value {
return s.curBlock.NewValue4A(s.peekPos(), op, t, aux, arg0, arg1, arg2, arg3)
}
// newValue4I adds a new value with four arguments and an auxint value to the current block.
func (s *state) newValue4I(op ssa.Op, t *types.Type, aux int64, arg0, arg1, arg2, arg3 *ssa.Value) *ssa.Value {
return s.curBlock.NewValue4I(s.peekPos(), op, t, aux, arg0, arg1, arg2, arg3)
@ -1462,7 +1472,7 @@ func (s *state) instrument(t *types.Type, addr *ssa.Value, kind instrumentKind)
// If it is instrumenting for MSAN or ASAN and t is a struct type, it instruments
// operation for each field, instead of for the whole struct.
func (s *state) instrumentFields(t *types.Type, addr *ssa.Value, kind instrumentKind) {
if !(base.Flag.MSan || base.Flag.ASan) || !t.IsStruct() {
if !(base.Flag.MSan || base.Flag.ASan) || !isStructNotSIMD(t) {
s.instrument(t, addr, kind)
return
}
@ -4585,7 +4595,7 @@ func (s *state) zeroVal(t *types.Type) *ssa.Value {
return s.constInterface(t)
case t.IsSlice():
return s.constSlice(t)
case t.IsStruct():
case isStructNotSIMD(t):
n := t.NumFields()
v := s.entryNewValue0(ssa.OpStructMake, t)
for i := 0; i < n; i++ {
@ -4599,6 +4609,8 @@ func (s *state) zeroVal(t *types.Type) *ssa.Value {
case 1:
return s.entryNewValue1(ssa.OpArrayMake1, t, s.zeroVal(t.Elem()))
}
case t.IsSIMD():
return s.newValue0(ssa.OpZeroSIMD, t)
}
s.Fatalf("zero for type %v not implemented", t)
return nil
@ -5578,7 +5590,7 @@ func (s *state) storeType(t *types.Type, left, right *ssa.Value, skip skipMask,
// do *left = right for all scalar (non-pointer) parts of t.
func (s *state) storeTypeScalars(t *types.Type, left, right *ssa.Value, skip skipMask) {
switch {
case t.IsBoolean() || t.IsInteger() || t.IsFloat() || t.IsComplex():
case t.IsBoolean() || t.IsInteger() || t.IsFloat() || t.IsComplex() || t.IsSIMD():
s.store(t, left, right)
case t.IsPtrShaped():
if t.IsPtr() && t.Elem().NotInHeap() {
@ -5607,7 +5619,7 @@ func (s *state) storeTypeScalars(t *types.Type, left, right *ssa.Value, skip ski
// itab field doesn't need a write barrier (even though it is a pointer).
itab := s.newValue1(ssa.OpITab, s.f.Config.Types.BytePtr, right)
s.store(types.Types[types.TUINTPTR], left, itab)
case t.IsStruct():
case isStructNotSIMD(t):
n := t.NumFields()
for i := 0; i < n; i++ {
ft := t.FieldType(i)
@ -5644,7 +5656,7 @@ func (s *state) storeTypePtrs(t *types.Type, left, right *ssa.Value) {
idata := s.newValue1(ssa.OpIData, s.f.Config.Types.BytePtr, right)
idataAddr := s.newValue1I(ssa.OpOffPtr, s.f.Config.Types.BytePtrPtr, s.config.PtrSize, left)
s.store(s.f.Config.Types.BytePtr, idataAddr, idata)
case t.IsStruct():
case isStructNotSIMD(t):
n := t.NumFields()
for i := 0; i < n; i++ {
ft := t.FieldType(i)
@ -6757,7 +6769,7 @@ func EmitArgInfo(f *ir.Func, abiInfo *abi.ABIParamResultInfo) *obj.LSym {
uintptrTyp := types.Types[types.TUINTPTR]
isAggregate := func(t *types.Type) bool {
return t.IsStruct() || t.IsArray() || t.IsComplex() || t.IsInterface() || t.IsString() || t.IsSlice()
return isStructNotSIMD(t) || t.IsArray() || t.IsComplex() || t.IsInterface() || t.IsString() || t.IsSlice()
}
wOff := 0
@ -6817,7 +6829,7 @@ func EmitArgInfo(f *ir.Func, abiInfo *abi.ABIParamResultInfo) *obj.LSym {
}
baseOffset += t.Elem().Size()
}
case t.IsStruct():
case isStructNotSIMD(t):
if t.NumFields() == 0 {
n++ // {} counts as a component
break
@ -7837,7 +7849,7 @@ func (s *State) UseArgs(n int64) {
// fieldIdx finds the index of the field referred to by the ODOT node n.
func fieldIdx(n *ir.SelectorExpr) int {
t := n.X.Type()
if !t.IsStruct() {
if !isStructNotSIMD(t) {
panic("ODOT's LHS is not a struct")
}
@ -8045,4 +8057,8 @@ func SpillSlotAddr(spill ssa.Spill, baseReg int16, extraOffset int64) obj.Addr {
}
}
func isStructNotSIMD(t *types.Type) bool {
return t.IsStruct() && !t.IsSIMD()
}
var BoundsCheckFunc [ssa.BoundsKindCount]*obj.LSym


@ -0,0 +1,41 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package test
import (
"cmd/compile/internal/ssa"
"cmd/compile/internal/types"
"internal/buildcfg"
"testing"
)
// This file contains tests for ssa values, types and their utility functions.
func TestCanSSA(t *testing.T) {
i64 := types.Types[types.TINT64]
v128 := types.TypeVec128
s1 := mkstruct(i64, mkstruct(i64, i64, i64, i64))
if ssa.CanSSA(s1) {
// Test size check for struct.
t.Errorf("CanSSA(%v) returned true, expected false", s1)
}
a1 := types.NewArray(s1, 1)
if ssa.CanSSA(a1) {
// Test size check for array.
t.Errorf("CanSSA(%v) returned true, expected false", a1)
}
if buildcfg.Experiment.SIMD {
s2 := mkstruct(v128, v128, v128, v128)
if !ssa.CanSSA(s2) {
// Test size check for SIMD struct special case.
t.Errorf("CanSSA(%v) returned false, expected true", s2)
}
a2 := types.NewArray(s2, 1)
if !ssa.CanSSA(a2) {
// Test size check for SIMD array special case.
t.Errorf("CanSSA(%v) returned false, expected true", a2)
}
}
}


@ -292,9 +292,10 @@ func libfuzzerHookEqualFold(string, string, uint)
func addCovMeta(p unsafe.Pointer, len uint32, hash [16]byte, pkpath string, pkgId int, cmode uint8, cgran uint8) uint32
// architecture variants
var x86HasAVX bool
var x86HasFMA bool
var x86HasPOPCNT bool
var x86HasSSE41 bool
var x86HasFMA bool
var armHasVFPv4 bool
var arm64HasATOMICS bool
var loong64HasLAMCAS bool


@ -239,9 +239,10 @@ var runtimeDecls = [...]struct {
{"libfuzzerHookStrCmp", funcTag, 163},
{"libfuzzerHookEqualFold", funcTag, 163},
{"addCovMeta", funcTag, 165},
{"x86HasAVX", varTag, 6},
{"x86HasFMA", varTag, 6},
{"x86HasPOPCNT", varTag, 6},
{"x86HasSSE41", varTag, 6},
{"x86HasFMA", varTag, 6},
{"armHasVFPv4", varTag, 6},
{"arm64HasATOMICS", varTag, 6},
{"loong64HasLAMCAS", varTag, 6},


@ -10,6 +10,7 @@ import (
"cmd/compile/internal/base"
"cmd/internal/src"
"internal/buildcfg"
"internal/types/errors"
)
@ -452,6 +453,31 @@ func CalcSize(t *Type) {
ResumeCheckSize()
}
// simdify marks a type as "SIMD", either as a tag field,
// or as having the SIMD attribute. The tag field is a marker
// type used to identify a struct that is not really a struct.
// A SIMD type is allocated to a vector register (on amd64,
// xmm, ymm, or zmm). The fields of a SIMD type are ignored
// by the compiler except for the space that they reserve.
func simdify(st *Type, isTag bool) {
st.align = 8
st.alg = ANOALG // not comparable with ==
st.intRegs = 0
st.isSIMD = true
if isTag {
st.width = 0
st.isSIMDTag = true
st.floatRegs = 0
} else {
st.floatRegs = 1
}
// if st.Sym() != nil {
// base.Warn("Simdify %s, %v, %d", st.Sym().Name, isTag, st.width)
// } else {
// base.Warn("Simdify %v, %v, %d", st, isTag, st.width)
// }
}
// CalcStructSize calculates the size of t,
// filling in t.width, t.align, t.intRegs, and t.floatRegs,
// even if size calculation is otherwise disabled.
@ -464,10 +490,27 @@ func CalcStructSize(t *Type) {
switch {
case sym.Name == "align64" && isAtomicStdPkg(sym.Pkg):
maxAlign = 8
case buildcfg.Experiment.SIMD && (sym.Pkg.Path == "internal/simd" || sym.Pkg.Path == "simd") && len(t.Fields()) >= 1:
// This gates the experiment -- without it, no user-visible types can be "simd".
// The SSA-visible SIMD types remain.
// TODO after simd has been moved to package simd, remove internal/simd.
switch sym.Name {
case "v128":
simdify(t, true)
return
case "v256":
simdify(t, true)
return
case "v512":
simdify(t, true)
return
}
}
}
fields := t.Fields()
size := calcStructOffset(t, fields, 0)
// For non-zero-sized structs which end in a zero-sized field, we
@ -540,6 +583,11 @@ func CalcStructSize(t *Type) {
break
}
}
if len(t.Fields()) >= 1 && t.Fields()[0].Type.isSIMDTag {
// this catches `type Foo simd.Whatever` -- Foo is also SIMD.
simdify(t, false)
}
}
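To make the name matching and the tag propagation above concrete, a hedged sketch of the kind of declarations they target; the field layout of v128 and every name other than v128/v256/v512 are assumptions for illustration, not taken from this diff:

package simdsketch // stand-in only; the real declarations live in package simd

// v128 plays the role of the marker ("tag") struct matched by name above;
// CalcStructSize runs simdify(t, true) on it, giving it width 0.
type v128 struct{ _ [16]byte }

// Int32x4 plays the role of a user-visible vector type: its first field has
// the tag type, so the t.Fields()[0].Type.isSIMDTag check applies
// simdify(t, false) and the whole value is allocated to a vector register.
type Int32x4 struct{ _ v128 }

// Foo is also SIMD, as the `type Foo simd.Whatever` comment notes: a defined
// type with the same underlying struct keeps the tag-typed first field.
type Foo Int32x4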
// CalcArraySize calculates the size of t,


@ -202,6 +202,7 @@ type Type struct {
flags bitset8
alg AlgKind // valid if Align > 0
isSIMDTag, isSIMD bool // tag is the marker type, isSIMD means has marker type
// size of prefix of object that contains all pointers. valid if Align > 0.
// Note that for pointers, this is always PtrSize even if the element type
@ -594,6 +595,12 @@ func newSSA(name string) *Type {
return t
}
func newSIMD(name string) *Type {
t := newSSA(name)
t.isSIMD = true
return t
}
// NewMap returns a new map Type with key type k and element (aka value) type v.
func NewMap(k, v *Type) *Type {
t := newType(TMAP)
@ -982,17 +989,16 @@ func (t *Type) ArgWidth() int64 {
return t.extra.(*Func).Argwid
}
// Size returns the width of t in bytes.
func (t *Type) Size() int64 {
if t.kind == TSSA {
if t == TypeInt128 {
return 16
}
return 0
return t.width
}
CalcSize(t)
return t.width
}
// Alignment returns the alignment of t in bytes.
func (t *Type) Alignment() int64 {
CalcSize(t)
return int64(t.align)
@ -1598,12 +1604,26 @@ var (
TypeFlags = newSSA("flags")
TypeVoid = newSSA("void")
TypeInt128 = newSSA("int128")
TypeVec128 = newSIMD("vec128")
TypeVec256 = newSIMD("vec256")
TypeVec512 = newSIMD("vec512")
TypeMask = newSIMD("mask") // not a vector, not 100% sure what this should be.
TypeResultMem = newResults([]*Type{TypeMem})
)
func init() {
TypeInt128.width = 16
TypeInt128.align = 8
TypeVec128.width = 16
TypeVec128.align = 8
TypeVec256.width = 32
TypeVec256.align = 8
TypeVec512.width = 64
TypeVec512.align = 8
TypeMask.width = 8 // This will depend on the architecture; spilling will be "interesting".
TypeMask.align = 8
}
// NewNamed returns a new named type for the given type name. obj should be an
@ -1963,3 +1983,7 @@ var SimType [NTYPE]Kind
// Fake package for shape types (see typecheck.Shapify()).
var ShapePkg = NewPkg("go.shape", "go.shape")
func (t *Type) IsSIMD() bool {
return t.isSIMD
}


@ -361,6 +361,8 @@ var excluded = map[string]bool{
"builtin": true,
"cmd/compile/internal/ssa/_gen": true,
"runtime/_mkmalloc": true,
"simd/_gen/simdgen": true,
"simd/_gen/unify": true,
}
// printPackageMu synchronizes the printing of type-checked package files in


@ -956,7 +956,9 @@ func (t *tester) registerTests() {
// which is darwin,linux,windows/amd64 and darwin/arm64.
//
// The same logic applies to the release notes that correspond to each api/next file.
if goos == "darwin" || ((goos == "linux" || goos == "windows") && goarch == "amd64") {
//
// TODO: remove the exclusion of goexperiment simd right before dev.simd branch is merged to master.
if goos == "darwin" || ((goos == "linux" || goos == "windows") && (goarch == "amd64" && !strings.Contains(goexperiment, "simd"))) {
t.registerTest("API release note check", &goTest{variant: "check", pkg: "cmd/relnote", testFlags: []string{"-check"}})
t.registerTest("API check", &goTest{variant: "check", pkg: "cmd/api", timeout: 5 * time.Minute, testFlags: []string{"-check"}})
}


@ -236,7 +236,7 @@ func progedit(ctxt *obj.Link, p *obj.Prog, newprog obj.ProgAlloc) {
// Rewrite float constants to values stored in memory.
switch p.As {
// Convert AMOVSS $(0), Xx to AXORPS Xx, Xx
case AMOVSS:
case AMOVSS, AVMOVSS:
if p.From.Type == obj.TYPE_FCONST {
// f == 0 can't be used here due to -0, so use Float64bits
if f := p.From.Val.(float64); math.Float64bits(f) == 0 {
@ -272,7 +272,7 @@ func progedit(ctxt *obj.Link, p *obj.Prog, newprog obj.ProgAlloc) {
p.From.Offset = 0
}
case AMOVSD:
case AMOVSD, AVMOVSD:
// Convert AMOVSD $(0), Xx to AXORPS Xx, Xx
if p.From.Type == obj.TYPE_FCONST {
// f == 0 can't be used here due to -0, so use Float64bits


@ -67,7 +67,7 @@ var (
// dirs are the directories to look for *.go files in.
// TODO(bradfitz): just use all directories?
dirs = []string{".", "ken", "chan", "interface", "internal/runtime/sys", "syntax", "dwarf", "fixedbugs", "codegen", "abi", "typeparam", "typeparam/mdempsky", "arenas"}
dirs = []string{".", "ken", "chan", "interface", "internal/runtime/sys", "syntax", "dwarf", "fixedbugs", "codegen", "abi", "typeparam", "typeparam/mdempsky", "arenas", "simd"}
)
// Test is the main entrypoint that runs tests in the GOROOT/test directory.


@ -54,6 +54,7 @@ var depsRules = `
internal/goexperiment,
internal/goos,
internal/goversion,
internal/itoa,
internal/nettrace,
internal/platform,
internal/profilerecord,
@ -71,6 +72,8 @@ var depsRules = `
internal/byteorder, internal/cpu, internal/goarch < internal/chacha8rand;
internal/goarch, math/bits < internal/strconv;
internal/cpu, internal/strconv < simd;
# RUNTIME is the core runtime group of packages, all of them very light-weight.
internal/abi,
internal/chacha8rand,
@ -80,6 +83,7 @@ var depsRules = `
internal/godebugs,
internal/goexperiment,
internal/goos,
internal/itoa,
internal/profilerecord,
internal/strconv,
internal/trace/tracev2,
@ -697,6 +701,9 @@ var depsRules = `
FMT, DEBUG, flag, runtime/trace, internal/sysinfo, math/rand
< testing;
testing, math
< simd/internal/test_helpers;
log/slog, testing
< testing/slogtest;


@ -19,6 +19,6 @@ echo "// Copyright 2022 The Go Authors. All rights reserved.
package comment
var stdPkgs = []string{"
go list std | grep -v / | sort | sed 's/.*/"&",/'
GOEXPERIMENT=none go list std | grep -v / | sort | sed 's/.*/"&",/'
echo "}"
) | gofmt >std.go.tmp && mv std.go.tmp std.go


@ -13,7 +13,9 @@ import (
)
func TestStd(t *testing.T) {
out, err := testenv.Command(t, testenv.GoToolPath(t), "list", "std").CombinedOutput()
cmd := testenv.Command(t, testenv.GoToolPath(t), "list", "std")
cmd.Env = append(cmd.Environ(), "GOEXPERIMENT=none")
out, err := cmd.CombinedOutput()
if err != nil {
t.Fatalf("%v\n%s", err, out)
}


@ -361,6 +361,8 @@ var excluded = map[string]bool{
"builtin": true,
"cmd/compile/internal/ssa/_gen": true,
"runtime/_mkmalloc": true,
"simd/_gen/simdgen": true,
"simd/_gen/unify": true,
}
// printPackageMu synchronizes the printing of type-checked package files in


@ -88,8 +88,6 @@ func ParseGOEXPERIMENT(goos, goarch, goexp string) (*ExperimentFlags, error) {
SizeSpecializedMalloc: true,
GreenTeaGC: true,
}
// Start with the statically enabled set of experiments.
flags := &ExperimentFlags{
Flags: baseline,
baseline: baseline,


@ -25,17 +25,22 @@ var X86 struct {
HasAES bool
HasADX bool
HasAVX bool
HasAVXVNNI bool
HasAVX2 bool
HasAVX512 bool // Virtual feature: F+CD+BW+DQ+VL
HasAVX512F bool
HasAVX512CD bool
HasAVX512BITALG bool
HasAVX512BW bool
HasAVX512DQ bool
HasAVX512VL bool
HasAVX512VPCLMULQDQ bool
HasAVX512GFNI bool
HasAVX512VAES bool
HasAVX512VNNI bool
HasAVX512VBMI bool
HasAVX512VBMI2 bool
HasAVX512BITALG bool
HasAVX512VPOPCNTDQ bool
HasAVX512VPCLMULQDQ bool
HasBMI1 bool
HasBMI2 bool
HasERMS bool


@ -6,8 +6,6 @@
package cpu
import _ "unsafe" // for linkname
func osInit() {
// macOS 12 moved these to the hw.optional.arm tree, but as of Go 1.24 we
// still support macOS 11. See [Determine Encryption Capabilities].
@ -29,24 +27,3 @@ func osInit() {
ARM64.HasSHA1 = true
ARM64.HasSHA2 = true
}
//go:noescape
func getsysctlbyname(name []byte) (int32, int32)
// sysctlEnabled should be an internal detail,
// but widely used packages access it using linkname.
// Notable members of the hall of shame include:
// - github.com/bytedance/gopkg
// - github.com/songzhibin97/gkit
//
// Do not remove or change the type signature.
// See go.dev/issue/67401.
//
//go:linkname sysctlEnabled
func sysctlEnabled(name []byte) bool {
ret, value := getsysctlbyname(name)
if ret < 0 {
return false
}
return value > 0
}


@ -0,0 +1,72 @@
// Copyright 2020 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//go:build darwin && !ios
package cpu
import _ "unsafe" // for linkname
// Pushed from runtime.
//
//go:noescape
func sysctlbynameInt32(name []byte) (int32, int32)
// Pushed from runtime.
//
//go:noescape
func sysctlbynameBytes(name, out []byte) int32
// sysctlEnabled should be an internal detail,
// but widely used packages access it using linkname.
// Notable members of the hall of shame include:
// - github.com/bytedance/gopkg
// - github.com/songzhibin97/gkit
//
// Do not remove or change the type signature.
// See go.dev/issue/67401.
//
//go:linkname sysctlEnabled
func sysctlEnabled(name []byte) bool {
ret, value := sysctlbynameInt32(name)
if ret < 0 {
return false
}
return value > 0
}
// darwinKernelVersionCheck reports if Darwin kernel version is at
// least major.minor.patch.
//
// Code borrowed from x/sys/cpu.
func darwinKernelVersionCheck(major, minor, patch int) bool {
var release [256]byte
ret := sysctlbynameBytes([]byte("kern.osrelease\x00"), release[:])
if ret < 0 {
return false
}
var mmp [3]int
c := 0
Loop:
for _, b := range release[:] {
switch {
case b >= '0' && b <= '9':
mmp[c] = 10*mmp[c] + int(b-'0')
case b == '.':
c++
if c > 2 {
return false
}
case b == 0:
break Loop
default:
return false
}
}
if c != 2 {
return false
}
return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
}
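A quick reading of the final comparison, not part of the diff: it is a lexicographic compare of the parsed {major, minor, patch} triple against the requested version.

// versionAtLeast restates the return expression above for a parsed triple.
func versionAtLeast(mmp [3]int, major, minor, patch int) bool {
	return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
}

// With kern.osrelease "24.1.0" (the Darwin 24 / macOS 15 case used by the
// Rosetta check later in this diff), versionAtLeast([3]int{24, 1, 0}, 24, 0, 0)
// is true; with "23.6.0" it is false.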


@ -18,11 +18,21 @@ func xgetbv() (eax, edx uint32)
func getGOAMD64level() int32
const (
// Bits returned in ECX for CPUID EAX=0x1 ECX=0x0
// eax bits
cpuid_AVXVNNI = 1 << 4
// ecx bits
cpuid_SSE3 = 1 << 0
cpuid_PCLMULQDQ = 1 << 1
cpuid_AVX512VBMI = 1 << 1
cpuid_AVX512VBMI2 = 1 << 6
cpuid_SSSE3 = 1 << 9
cpuid_AVX512GFNI = 1 << 8
cpuid_AVX512VAES = 1 << 9
cpuid_AVX512VNNI = 1 << 11
cpuid_AVX512BITALG = 1 << 12
cpuid_FMA = 1 << 12
cpuid_AVX512VPOPCNTDQ = 1 << 14
cpuid_SSE41 = 1 << 19
cpuid_SSE42 = 1 << 20
cpuid_POPCNT = 1 << 23
@ -105,6 +115,7 @@ func doinit() {
maxID, _, _, _ := cpuid(0, 0)
if maxID < 1 {
osInit()
return
}
@ -149,10 +160,11 @@ func doinit() {
X86.HasAVX = isSet(ecx1, cpuid_AVX) && osSupportsAVX
if maxID < 7 {
osInit()
return
}
_, ebx7, ecx7, edx7 := cpuid(7, 0)
eax7, ebx7, ecx7, edx7 := cpuid(7, 0)
X86.HasBMI1 = isSet(ebx7, cpuid_BMI1)
X86.HasAVX2 = isSet(ebx7, cpuid_AVX2) && osSupportsAVX
X86.HasBMI2 = isSet(ebx7, cpuid_BMI2)
@ -166,6 +178,13 @@ func doinit() {
X86.HasAVX512BW = isSet(ebx7, cpuid_AVX512BW)
X86.HasAVX512DQ = isSet(ebx7, cpuid_AVX512DQ)
X86.HasAVX512VL = isSet(ebx7, cpuid_AVX512VL)
X86.HasAVX512GFNI = isSet(ecx7, cpuid_AVX512GFNI)
X86.HasAVX512BITALG = isSet(ecx7, cpuid_AVX512BITALG)
X86.HasAVX512VPOPCNTDQ = isSet(ecx7, cpuid_AVX512VPOPCNTDQ)
X86.HasAVX512VBMI = isSet(ecx7, cpuid_AVX512VBMI)
X86.HasAVX512VBMI2 = isSet(ecx7, cpuid_AVX512VBMI2)
X86.HasAVX512VAES = isSet(ecx7, cpuid_AVX512VAES)
X86.HasAVX512VNNI = isSet(ecx7, cpuid_AVX512VNNI)
X86.HasAVX512VPCLMULQDQ = isSet(ecx7, cpuid_AVX512VPCLMULQDQ)
X86.HasAVX512VBMI = isSet(ecx7, cpuid_AVX512_VBMI)
X86.HasAVX512VBMI2 = isSet(ecx7, cpuid_AVX512_VBMI2)
@ -179,6 +198,7 @@ func doinit() {
maxExtendedInformation, _, _, _ = cpuid(0x80000000, 0)
if maxExtendedInformation < 0x80000001 {
osInit()
return
}
@ -195,6 +215,15 @@ func doinit() {
// included in AVX10.1.
X86.HasAVX512 = X86.HasAVX512F && X86.HasAVX512CD && X86.HasAVX512BW && X86.HasAVX512DQ && X86.HasAVX512VL
}
if eax7 >= 1 {
eax71, _, _, _ := cpuid(7, 1)
if X86.HasAVX {
X86.HasAVXVNNI = isSet(eax71, cpuid_AVXVNNI)
}
}
osInit()
}
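A hedged sketch of how the combined flag is typically consumed; internal/cpu is only importable from inside the standard library, and the package and function names here are illustrative:

package mypkg // hypothetical std-internal package; internal/cpu is not importable elsewhere

import "internal/cpu"

// useAVX512 reports whether the AVX-512 baseline (F+CD+BW+DQ+VL) may be used,
// mirroring the combined X86.HasAVX512 flag computed in doinit above.
func useAVX512() bool {
	return cpu.X86.HasAVX512
}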
func isSet(hwc uint32, value uint32) bool {

View file

@ -0,0 +1,23 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//go:build (386 || amd64) && darwin && !ios
package cpu
func osInit() {
if isRosetta() && darwinKernelVersionCheck(24, 0, 0) {
// Apparently, on macOS 15 (Darwin kernel version 24) or newer,
// Rosetta 2 supports AVX1 and 2. However, neither CPUID nor
// sysctl says it has AVX. Detect this situation here and report
// AVX1 and 2 as supported.
// TODO: check if any other feature is actually supported.
X86.HasAVX = true
X86.HasAVX2 = true
}
}
func isRosetta() bool {
return sysctlEnabled([]byte("sysctl.proc_translated\x00"))
}
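Outside the runtime, a comparable Rosetta 2 check can be written against golang.org/x/sys/unix; this is a sketch assuming sysctl.proc_translated behaves as described in the comment above (on systems without Rosetta the sysctl is absent and the call returns an error):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isRosetta reports whether this amd64 process is running under Rosetta 2
// translation on an Apple Silicon Mac.
func isRosetta() bool {
	v, err := unix.SysctlUint32("sysctl.proc_translated")
	return err == nil && v == 1
}

func main() {
	fmt.Println("translated:", isRosetta())
}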

View file

@ -0,0 +1,9 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//go:build (386 || amd64) && (!darwin || ios)
package cpu
func osInit() {}

View file

@ -0,0 +1,8 @@
// Code generated by mkconsts.go. DO NOT EDIT.
//go:build !goexperiment.simd
package goexperiment
const SIMD = false
const SIMDInt = 0

View file

@ -0,0 +1,8 @@
// Code generated by mkconsts.go. DO NOT EDIT.
//go:build goexperiment.simd
package goexperiment
const SIMD = true
const SIMDInt = 1
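A brief sketch of how such generated goexperiment constants are consumed; the build-tag form matches the //go:build goexperiment.simd lines used elsewhere in this merge, and the package name below is illustrative:

//go:build goexperiment.simd

// Package example is a hypothetical std-internal package that only builds
// when GOEXPERIMENT=simd is enabled.
package example

import "internal/goexperiment"

// In a file guarded by the build tag, goexperiment.SIMD is the constant true;
// unguarded code can instead branch on the constant directly.
var enabled = goexperiment.SIMD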

View file

@ -121,4 +121,8 @@ type Flags struct {
// GoroutineLeakProfile enables the collection of goroutine leak profiles.
GoroutineLeakProfile bool
// SIMD enables the simd package and the compiler's handling
// of SIMD intrinsics.
SIMD bool
}

View file

@ -1049,6 +1049,9 @@ needm:
// there's no need to handle that. Clear R14 so that there's
// a bad value in there, in case needm tries to use it.
XORPS X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
XORQ R14, R14
MOVQ $runtime·needAndBindM<ABIInternal>(SB), AX
CALL AX
@ -1746,6 +1749,9 @@ TEXT ·sigpanic0(SB),NOSPLIT,$0-0
get_tls(R14)
MOVQ g(R14), R14
XORPS X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
JMP ·sigpanic<ABIInternal>(SB)
// gcWriteBarrier informs the GC about heap pointer writes.

View file

@ -28,9 +28,10 @@ const (
var (
// Set in runtime.cpuinit.
// TODO: deprecate these; use internal/cpu directly.
x86HasAVX bool
x86HasFMA bool
x86HasPOPCNT bool
x86HasSSE41 bool
x86HasFMA bool
armHasVFPv4 bool

View file

@ -0,0 +1,19 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package runtime_test
import (
"runtime"
"testing"
)
func TestHasAVX(t *testing.T) {
t.Parallel()
output := runTestProg(t, "testprog", "CheckAVX")
ok := output == "OK\n"
if *runtime.X86HasAVX != ok {
t.Fatalf("x86HasAVX: %v, CheckAVX got:\n%s", *runtime.X86HasAVX, output)
}
}

View file

@ -1978,6 +1978,8 @@ func TraceStack(gp *G, tab *TraceStackTable) {
traceStack(0, gp, (*traceStackTable)(tab))
}
var X86HasAVX = &x86HasAVX
var DebugDecorateMappings = &debug.decoratemappings
func SetVMANameSupported() bool { return setVMANameSupported() }

View file

@ -402,7 +402,7 @@ func genAMD64(g *gen) {
// Create layouts for X, Y, and Z registers.
const (
numXRegs = 16
numZRegs = 16 // TODO: If we start using upper registers, change to 32
numZRegs = 32
numKRegs = 8
)
lZRegs := layout{sp: xReg} // Non-GP registers

View file

@ -162,11 +162,22 @@ func sysctlbynameInt32(name []byte) (int32, int32) {
return ret, out
}
//go:linkname internal_cpu_getsysctlbyname internal/cpu.getsysctlbyname
func internal_cpu_getsysctlbyname(name []byte) (int32, int32) {
func sysctlbynameBytes(name, out []byte) int32 {
nout := uintptr(len(out))
ret := sysctlbyname(&name[0], &out[0], &nout, nil, 0)
return ret
}
//go:linkname internal_cpu_sysctlbynameInt32 internal/cpu.sysctlbynameInt32
func internal_cpu_sysctlbynameInt32(name []byte) (int32, int32) {
return sysctlbynameInt32(name)
}
//go:linkname internal_cpu_sysctlbynameBytes internal/cpu.sysctlbynameBytes
func internal_cpu_sysctlbynameBytes(name, out []byte) int32 {
return sysctlbynameBytes(name, out)
}
const (
_CTL_HW = 6
_HW_NCPU = 3

View file

@ -341,6 +341,13 @@ func panicmemAddr(addr uintptr) {
panic(errorAddressString{msg: "invalid memory address or nil pointer dereference", addr: addr})
}
var simdImmError = error(errorString("out-of-range immediate for simd intrinsic"))
func panicSimdImm() {
panicCheck2("simd immediate error")
panic(simdImmError)
}
// Create a new deferred function fn, which has no arguments and results.
// The compiler turns a defer statement into a call to this.
func deferproc(fn func()) {

View file

@ -19,6 +19,22 @@ type xRegs struct {
Z13 [64]byte
Z14 [64]byte
Z15 [64]byte
Z16 [64]byte
Z17 [64]byte
Z18 [64]byte
Z19 [64]byte
Z20 [64]byte
Z21 [64]byte
Z22 [64]byte
Z23 [64]byte
Z24 [64]byte
Z25 [64]byte
Z26 [64]byte
Z27 [64]byte
Z28 [64]byte
Z29 [64]byte
Z30 [64]byte
Z31 [64]byte
K0 uint64
K1 uint64
K2 uint64

View file

@ -95,14 +95,30 @@ saveAVX512:
VMOVDQU64 Z13, 832(AX)
VMOVDQU64 Z14, 896(AX)
VMOVDQU64 Z15, 960(AX)
KMOVQ K0, 1024(AX)
KMOVQ K1, 1032(AX)
KMOVQ K2, 1040(AX)
KMOVQ K3, 1048(AX)
KMOVQ K4, 1056(AX)
KMOVQ K5, 1064(AX)
KMOVQ K6, 1072(AX)
KMOVQ K7, 1080(AX)
VMOVDQU64 Z16, 1024(AX)
VMOVDQU64 Z17, 1088(AX)
VMOVDQU64 Z18, 1152(AX)
VMOVDQU64 Z19, 1216(AX)
VMOVDQU64 Z20, 1280(AX)
VMOVDQU64 Z21, 1344(AX)
VMOVDQU64 Z22, 1408(AX)
VMOVDQU64 Z23, 1472(AX)
VMOVDQU64 Z24, 1536(AX)
VMOVDQU64 Z25, 1600(AX)
VMOVDQU64 Z26, 1664(AX)
VMOVDQU64 Z27, 1728(AX)
VMOVDQU64 Z28, 1792(AX)
VMOVDQU64 Z29, 1856(AX)
VMOVDQU64 Z30, 1920(AX)
VMOVDQU64 Z31, 1984(AX)
KMOVQ K0, 2048(AX)
KMOVQ K1, 2056(AX)
KMOVQ K2, 2064(AX)
KMOVQ K3, 2072(AX)
KMOVQ K4, 2080(AX)
KMOVQ K5, 2088(AX)
KMOVQ K6, 2096(AX)
KMOVQ K7, 2104(AX)
JMP preempt
preempt:
CALL ·asyncPreempt2(SB)
@ -153,14 +169,30 @@ restoreAVX2:
VMOVDQU 0(AX), Y0
JMP restoreGPs
restoreAVX512:
KMOVQ 1080(AX), K7
KMOVQ 1072(AX), K6
KMOVQ 1064(AX), K5
KMOVQ 1056(AX), K4
KMOVQ 1048(AX), K3
KMOVQ 1040(AX), K2
KMOVQ 1032(AX), K1
KMOVQ 1024(AX), K0
KMOVQ 2104(AX), K7
KMOVQ 2096(AX), K6
KMOVQ 2088(AX), K5
KMOVQ 2080(AX), K4
KMOVQ 2072(AX), K3
KMOVQ 2064(AX), K2
KMOVQ 2056(AX), K1
KMOVQ 2048(AX), K0
VMOVDQU64 1984(AX), Z31
VMOVDQU64 1920(AX), Z30
VMOVDQU64 1856(AX), Z29
VMOVDQU64 1792(AX), Z28
VMOVDQU64 1728(AX), Z27
VMOVDQU64 1664(AX), Z26
VMOVDQU64 1600(AX), Z25
VMOVDQU64 1536(AX), Z24
VMOVDQU64 1472(AX), Z23
VMOVDQU64 1408(AX), Z22
VMOVDQU64 1344(AX), Z21
VMOVDQU64 1280(AX), Z20
VMOVDQU64 1216(AX), Z19
VMOVDQU64 1152(AX), Z18
VMOVDQU64 1088(AX), Z17
VMOVDQU64 1024(AX), Z16
VMOVDQU64 960(AX), Z15
VMOVDQU64 896(AX), Z14
VMOVDQU64 832(AX), Z13
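The new offsets follow from the widened xRegs layout: each Z slot is 64 bytes, so Z16 starts at 16*64 = 1024, Z31 at 31*64 = 1984, and the eight 8-byte K slots follow at 32*64 = 2048. A tiny sketch of that arithmetic (names illustrative):

package main

import "fmt"

const zregBytes = 64 // each Zn slot in xRegs is [64]byte

func zOff(n int) int { return n * zregBytes }      // byte offset of Zn, n in 0..31
func kOff(k int) int { return 32*zregBytes + 8*k } // K0..K7 follow the 32 Z slots

func main() {
	fmt.Println(zOff(16), zOff(31), kOff(0), kOff(7)) // 1024 1984 2048 2104
}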

View file

@ -763,9 +763,10 @@ func cpuinit(env string) {
// to guard execution of instructions that can not be assumed to be always supported.
switch GOARCH {
case "386", "amd64":
x86HasAVX = cpu.X86.HasAVX
x86HasFMA = cpu.X86.HasFMA
x86HasPOPCNT = cpu.X86.HasPOPCNT
x86HasSSE41 = cpu.X86.HasSSE41
x86HasFMA = cpu.X86.HasFMA
case "arm":
armHasVFPv4 = cpu.ARM.HasVFPv4

View file

@ -456,6 +456,9 @@ call:
// Back to Go world, set special registers.
// The g register (R14) is preserved in C.
XORPS X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
RET
// C->Go callback thunk that allows to call runtime·racesymbolize from C code.

View file

@ -177,6 +177,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -228,6 +228,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -265,6 +265,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking
@ -290,6 +293,9 @@ TEXT runtime·sigprofNonGoWrapper<>(SB),NOSPLIT|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -340,6 +340,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking
@ -365,6 +368,9 @@ TEXT runtime·sigprofNonGoWrapper<>(SB),NOSPLIT|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -310,6 +310,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -64,6 +64,9 @@ TEXT runtime·sigtramp(SB),NOSPLIT|TOPFRAME|NOFRAME,$0
get_tls(R12)
MOVQ g(R12), R14
PXOR X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
// Reserve space for spill slots.
NOP SP // disable vet stack checking

View file

@ -32,6 +32,9 @@ TEXT sigtramp<>(SB),NOSPLIT,$0-0
// R14 is cleared in case there's a non-zero value in there
// if called from a non-go thread.
XORPS X15, X15
CMPB internalcpu·X86+const_offsetX86HasAVX(SB), $1
JNE 2(PC)
VXORPS X15, X15, X15
XORQ R14, R14
get_tls(AX)

View file

@ -0,0 +1,18 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import "fmt"
func init() {
register("CheckAVX", CheckAVX)
}
func CheckAVX() {
checkAVX()
fmt.Println("OK")
}
func checkAVX()

View file

@ -0,0 +1,9 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
#include "textflag.h"
TEXT ·checkAVX(SB), NOSPLIT|NOFRAME, $0-0
VXORPS X1, X2, X3
RET

src/simd/_gen/go.mod
View file

@ -0,0 +1,8 @@
module simd/_gen
go 1.24
require (
golang.org/x/arch v0.20.0
gopkg.in/yaml.v3 v3.0.1
)

src/simd/_gen/go.sum
View file

@ -0,0 +1,6 @@
golang.org/x/arch v0.20.0 h1:dx1zTU0MAE98U+TQ8BLl7XsJbgze2WnNKF/8tGp/Q6c=
golang.org/x/arch v0.20.0/go.mod h1:bdwinDaKcfZUGpH09BB7ZmOfhalA8lQdzl62l8gGWsk=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

src/simd/_gen/main.go
View file

@ -0,0 +1,149 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Run all SIMD-related code generators.
package main
import (
"flag"
"fmt"
"os"
"os/exec"
"path/filepath"
"strings"
)
const defaultXedPath = "$XEDPATH" + string(filepath.ListSeparator) + "./simdgen/xeddata" + string(filepath.ListSeparator) + "$HOME/xed/obj/dgen"
var (
flagTmplgen = flag.Bool("tmplgen", true, "run tmplgen generator")
flagSimdgen = flag.Bool("simdgen", true, "run simdgen generator")
flagN = flag.Bool("n", false, "dry run")
flagXedPath = flag.String("xedPath", defaultXedPath, "load XED datafile from `path`, which must be the XED obj/dgen directory")
)
var goRoot string
func main() {
flag.Parse()
if flag.NArg() > 0 {
flag.Usage()
os.Exit(1)
}
if *flagXedPath == defaultXedPath {
// In general we want the shell to do variable expansion, but for the
// default value we don't get that, so do it ourselves.
*flagXedPath = os.ExpandEnv(defaultXedPath)
}
var err error
goRoot, err = resolveGOROOT()
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
if *flagTmplgen {
doTmplgen()
}
if *flagSimdgen {
doSimdgen()
}
}
func doTmplgen() {
goRun("-C", "tmplgen", ".")
}
func doSimdgen() {
xedPath, err := resolveXEDPath(*flagXedPath)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
// Regenerate the XED-derived SIMD files
goRun("-C", "simdgen", ".", "-o", "godefs", "-goroot", goRoot, "-xedPath", prettyPath("./simdgen", xedPath), "go.yaml", "types.yaml", "categories.yaml")
// simdgen produces SSA rule files, so update the SSA files
goRun("-C", prettyPath(".", filepath.Join(goRoot, "src", "cmd", "compile", "internal", "ssa", "_gen")), ".")
}
func resolveXEDPath(pathList string) (xedPath string, err error) {
for _, path := range filepath.SplitList(pathList) {
if path == "" {
// Probably an unknown shell variable. Ignore.
continue
}
if _, err := os.Stat(filepath.Join(path, "all-dec-instructions.txt")); err == nil {
return filepath.Abs(path)
}
}
return "", fmt.Errorf("set $XEDPATH or -xedPath to the XED obj/dgen directory")
}
func resolveGOROOT() (goRoot string, err error) {
cmd := exec.Command("go", "env", "GOROOT")
cmd.Stderr = os.Stderr
out, err := cmd.Output()
if err != nil {
return "", fmt.Errorf("%s: %s", cmd, err)
}
goRoot = strings.TrimSuffix(string(out), "\n")
return goRoot, nil
}
func goRun(args ...string) {
exe := filepath.Join(goRoot, "bin", "go")
cmd := exec.Command(exe, append([]string{"run"}, args...)...)
run(cmd)
}
func run(cmd *exec.Cmd) {
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
fmt.Fprintf(os.Stderr, "%s\n", cmdString(cmd))
if *flagN {
return
}
if err := cmd.Run(); err != nil {
fmt.Fprintf(os.Stderr, "%s failed: %s\n", cmd, err)
}
}
func prettyPath(base, path string) string {
base, err := filepath.Abs(base)
if err != nil {
return path
}
p, err := filepath.Rel(base, path)
if err != nil {
return path
}
return p
}
func cmdString(cmd *exec.Cmd) string {
// TODO: Shell quoting?
// TODO: Environment.
var buf strings.Builder
cmdPath, err := exec.LookPath(filepath.Base(cmd.Path))
if err == nil && cmdPath == cmd.Path {
cmdPath = filepath.Base(cmdPath)
} else {
cmdPath = prettyPath(".", cmd.Path)
}
buf.WriteString(cmdPath)
for _, arg := range cmd.Args[1:] {
buf.WriteByte(' ')
buf.WriteString(arg)
}
return buf.String()
}

src/simd/_gen/simdgen/.gitignore
View file

@ -0,0 +1,3 @@
testdata/*
.gemini/*
.gemini*

View file

@ -0,0 +1 @@
!import ops/*/categories.yaml

View file

@ -0,0 +1,48 @@
#!/bin/bash
# This is an end-to-end test of Go SIMD. It updates all generated
# files in this repo and then runs several tests.
XEDDATA="${XEDDATA:-xeddata}"
if [[ ! -d "$XEDDATA" ]]; then
echo >&2 "Must either set \$XEDDATA or symlink xeddata/ to the XED obj/dgen directory."
exit 1
fi
which go >/dev/null || exit 1
goroot="$(go env GOROOT)"
if [[ ! ../../../.. -ef "$goroot" ]]; then
# We might be able to make this work but it's SO CONFUSING.
echo >&2 "go command in path has GOROOT $goroot"
exit 1
fi
if [[ $(go env GOEXPERIMENT) != simd ]]; then
echo >&2 "GOEXPERIMENT=$(go env GOEXPERIMENT), expected simd"
exit 1
fi
set -ex
# Regenerate SIMD files
go run . -o godefs -goroot "$goroot" -xedPath "$XEDDATA" go.yaml types.yaml categories.yaml
# Regenerate SSA files from SIMD rules
go run -C "$goroot"/src/cmd/compile/internal/ssa/_gen .
# Rebuild compiler
cd "$goroot"/src
go install cmd/compile
# Tests
GOARCH=amd64 go run -C simd/testdata .
GOARCH=amd64 go test -v simd
go test go/doc go/build
go test cmd/api -v -check -run ^TestCheck$
go test cmd/compile/internal/ssagen -simd=0
# Check tests without the GOEXPERIMENT
GOEXPERIMENT= go test go/doc go/build
GOEXPERIMENT= go test cmd/api -v -check -run ^TestCheck$
GOEXPERIMENT= go test cmd/compile/internal/ssagen -simd=0
# TODO: Add some tests of SIMD itself

View file

@ -0,0 +1,73 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"fmt"
"sort"
)
const simdGenericOpsTmpl = `
package main
func simdGenericOps() []opData {
return []opData{
{{- range .Ops }}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, commutative: {{.Comm}}},
{{- end }}
{{- range .OpsImm }}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, commutative: {{.Comm}}, aux: "UInt8"},
{{- end }}
}
}
`
// writeSIMDGenericOps generates the generic op definitions and returns them in a
// buffer to be written to the SSA generator's generic ops file.
func writeSIMDGenericOps(ops []Operation) *bytes.Buffer {
t := templateOf(simdGenericOpsTmpl, "simdgenericOps")
buffer := new(bytes.Buffer)
buffer.WriteString(generatedHeader)
type genericOpsData struct {
OpName string
OpInLen int
Comm bool
}
type opData struct {
Ops []genericOpsData
OpsImm []genericOpsData
}
var opsData opData
for _, op := range ops {
if op.NoGenericOps != nil && *op.NoGenericOps == "true" {
continue
}
if op.SkipMaskedMethod() {
continue
}
_, _, _, immType, gOp := op.shape()
gOpData := genericOpsData{gOp.GenericName(), len(gOp.In), op.Commutative}
if immType == VarImm || immType == ConstVarImm {
opsData.OpsImm = append(opsData.OpsImm, gOpData)
} else {
opsData.Ops = append(opsData.Ops, gOpData)
}
}
sort.Slice(opsData.Ops, func(i, j int) bool {
return compareNatural(opsData.Ops[i].OpName, opsData.Ops[j].OpName) < 0
})
sort.Slice(opsData.OpsImm, func(i, j int) bool {
return compareNatural(opsData.OpsImm[i].OpName, opsData.OpsImm[j].OpName) < 0
})
err := t.Execute(buffer, opsData)
if err != nil {
panic(fmt.Errorf("failed to execute template: %w", err))
}
return buffer
}

View file

@ -0,0 +1,156 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"fmt"
"slices"
)
const simdIntrinsicsTmpl = `
{{define "header"}}
package ssagen
import (
"cmd/compile/internal/ir"
"cmd/compile/internal/ssa"
"cmd/compile/internal/types"
"cmd/internal/sys"
)
const simdPackage = "` + simdPackage + `"
func simdIntrinsics(addF func(pkg, fn string, b intrinsicBuilder, archFamilies ...sys.ArchFamily)) {
{{end}}
{{define "op1"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen1(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op2"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen2(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op2_21"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen2_21(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op2_21Type1"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen2_21(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op3"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen3(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op3_21"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen3_21(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op3_21Type1"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen3_21(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op3_231Type1"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen3_231(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op3_31Zero3"}} addF(simdPackage, "{{(index .In 2).Go}}.{{.Go}}", opLen3_31Zero3(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op4"}} addF(simdPackage, "{{(index .In 0).Go}}.{{.Go}}", opLen4(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op4_231Type1"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen4_231(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op4_31"}} addF(simdPackage, "{{(index .In 2).Go}}.{{.Go}}", opLen4_31(ssa.Op{{.GenericName}}, {{.SSAType}}), sys.AMD64)
{{end}}
{{define "op1Imm8"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen1Imm8(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op2Imm8"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen2Imm8(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op2Imm8_2I"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen2Imm8_2I(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op2Imm8_II"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen2Imm8_II(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op2Imm8_SHA1RNDS4"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen2Imm8_SHA1RNDS4(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op3Imm8"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen3Imm8(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op3Imm8_2I"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen3Imm8_2I(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "op4Imm8"}} addF(simdPackage, "{{(index .In 1).Go}}.{{.Go}}", opLen4Imm8(ssa.Op{{.GenericName}}, {{.SSAType}}, {{(index .In 0).ImmOffset}}), sys.AMD64)
{{end}}
{{define "vectorConversion"}} addF(simdPackage, "{{.Tsrc.Name}}.As{{.Tdst.Name}}", func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value { return args[0] }, sys.AMD64)
{{end}}
{{define "loadStore"}} addF(simdPackage, "Load{{.Name}}", simdLoad(), sys.AMD64)
addF(simdPackage, "{{.Name}}.Store", simdStore(), sys.AMD64)
{{end}}
{{define "maskedLoadStore"}} addF(simdPackage, "LoadMasked{{.Name}}", simdMaskedLoad(ssa.OpLoadMasked{{.ElemBits}}), sys.AMD64)
addF(simdPackage, "{{.Name}}.StoreMasked", simdMaskedStore(ssa.OpStoreMasked{{.ElemBits}}), sys.AMD64)
{{end}}
{{define "mask"}} addF(simdPackage, "{{.Name}}.As{{.VectorCounterpart}}", func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value { return args[0] }, sys.AMD64)
addF(simdPackage, "{{.VectorCounterpart}}.asMask", func(s *state, n *ir.CallExpr, args []*ssa.Value) *ssa.Value { return args[0] }, sys.AMD64)
addF(simdPackage, "{{.Name}}.And", opLen2(ssa.OpAnd{{.ReshapedVectorWithAndOr}}, types.TypeVec{{.Size}}), sys.AMD64)
addF(simdPackage, "{{.Name}}.Or", opLen2(ssa.OpOr{{.ReshapedVectorWithAndOr}}, types.TypeVec{{.Size}}), sys.AMD64)
addF(simdPackage, "{{.Name}}FromBits", simdCvtVToMask({{.ElemBits}}, {{.Lanes}}), sys.AMD64)
addF(simdPackage, "{{.Name}}.ToBits", simdCvtMaskToV({{.ElemBits}}, {{.Lanes}}), sys.AMD64)
{{end}}
{{define "footer"}}}
{{end}}
`
// writeSIMDIntrinsics generates the intrinsic mappings and returns them in a
// buffer to be written to simdintrinsics.go.
func writeSIMDIntrinsics(ops []Operation, typeMap simdTypeMap) *bytes.Buffer {
t := templateOf(simdIntrinsicsTmpl, "simdintrinsics")
buffer := new(bytes.Buffer)
buffer.WriteString(generatedHeader)
if err := t.ExecuteTemplate(buffer, "header", nil); err != nil {
panic(fmt.Errorf("failed to execute header template: %w", err))
}
slices.SortFunc(ops, compareOperations)
for _, op := range ops {
if op.NoTypes != nil && *op.NoTypes == "true" {
continue
}
if op.SkipMaskedMethod() {
continue
}
if s, op, err := classifyOp(op); err == nil {
if err := t.ExecuteTemplate(buffer, s, op); err != nil {
panic(fmt.Errorf("failed to execute template %s for op %s: %w", s, op.Go, err))
}
} else {
panic(fmt.Errorf("failed to classify op %v: %w", op.Go, err))
}
}
for _, conv := range vConvertFromTypeMap(typeMap) {
if err := t.ExecuteTemplate(buffer, "vectorConversion", conv); err != nil {
panic(fmt.Errorf("failed to execute vectorConversion template: %w", err))
}
}
for _, typ := range typesFromTypeMap(typeMap) {
if typ.Type != "mask" {
if err := t.ExecuteTemplate(buffer, "loadStore", typ); err != nil {
panic(fmt.Errorf("failed to execute loadStore template: %w", err))
}
}
}
for _, typ := range typesFromTypeMap(typeMap) {
if typ.MaskedLoadStoreFilter() {
if err := t.ExecuteTemplate(buffer, "maskedLoadStore", typ); err != nil {
panic(fmt.Errorf("failed to execute maskedLoadStore template: %w", err))
}
}
}
for _, mask := range masksFromTypeMap(typeMap) {
if err := t.ExecuteTemplate(buffer, "mask", mask); err != nil {
panic(fmt.Errorf("failed to execute mask template: %w", err))
}
}
if err := t.ExecuteTemplate(buffer, "footer", nil); err != nil {
panic(fmt.Errorf("failed to execute footer template: %w", err))
}
return buffer
}

View file

@ -0,0 +1,256 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"fmt"
"log"
"sort"
"strings"
)
const simdMachineOpsTmpl = `
package main
func simdAMD64Ops(v11, v21, v2k, vkv, v2kv, v2kk, v31, v3kv, vgpv, vgp, vfpv, vfpkv, w11, w21, w2k, wkw, w2kw, w2kk, w31, w3kw, wgpw, wgp, wfpw, wfpkw,
wkwload, v21load, v31load, v11load, w21load, w31load, w2kload, w2kwload, w11load, w3kwload, w2kkload, v31x0AtIn2 regInfo) []opData {
return []opData{
{{- range .OpsData }}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", commutative: {{.Comm}}, typ: "{{.Type}}", resultInArg0: {{.ResultInArg0}}},
{{- end }}
{{- range .OpsDataImm }}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", aux: "UInt8", commutative: {{.Comm}}, typ: "{{.Type}}", resultInArg0: {{.ResultInArg0}}},
{{- end }}
{{- range .OpsDataLoad}}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", commutative: {{.Comm}}, typ: "{{.Type}}", aux: "SymOff", symEffect: "Read", resultInArg0: {{.ResultInArg0}}},
{{- end}}
{{- range .OpsDataImmLoad}}
{name: "{{.OpName}}", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", commutative: {{.Comm}}, typ: "{{.Type}}", aux: "SymValAndOff", symEffect: "Read", resultInArg0: {{.ResultInArg0}}},
{{- end}}
{{- range .OpsDataMerging }}
{name: "{{.OpName}}Merging", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", commutative: false, typ: "{{.Type}}", resultInArg0: true},
{{- end }}
{{- range .OpsDataImmMerging }}
{name: "{{.OpName}}Merging", argLength: {{.OpInLen}}, reg: {{.RegInfo}}, asm: "{{.Asm}}", aux: "UInt8", commutative: false, typ: "{{.Type}}", resultInArg0: true},
{{- end }}
}
}
`
// writeSIMDMachineOps generates the machine ops and returns them in a
// buffer to be written to simdAMD64ops.go.
func writeSIMDMachineOps(ops []Operation) *bytes.Buffer {
t := templateOf(simdMachineOpsTmpl, "simdAMD64Ops")
buffer := new(bytes.Buffer)
buffer.WriteString(generatedHeader)
type opData struct {
OpName string
Asm string
OpInLen int
RegInfo string
Comm bool
Type string
ResultInArg0 bool
}
type machineOpsData struct {
OpsData []opData
OpsDataImm []opData
OpsDataLoad []opData
OpsDataImmLoad []opData
OpsDataMerging []opData
OpsDataImmMerging []opData
}
regInfoSet := map[string]bool{
"v11": true, "v21": true, "v2k": true, "v2kv": true, "v2kk": true, "vkv": true, "v31": true, "v3kv": true, "vgpv": true, "vgp": true, "vfpv": true, "vfpkv": true,
"w11": true, "w21": true, "w2k": true, "w2kw": true, "w2kk": true, "wkw": true, "w31": true, "w3kw": true, "wgpw": true, "wgp": true, "wfpw": true, "wfpkw": true,
"wkwload": true, "v21load": true, "v31load": true, "v11load": true, "w21load": true, "w31load": true, "w2kload": true, "w2kwload": true, "w11load": true,
"w3kwload": true, "w2kkload": true, "v31x0AtIn2": true}
opsData := make([]opData, 0)
opsDataImm := make([]opData, 0)
opsDataLoad := make([]opData, 0)
opsDataImmLoad := make([]opData, 0)
opsDataMerging := make([]opData, 0)
opsDataImmMerging := make([]opData, 0)
// Determine the "best" version of an instruction to use
best := make(map[string]Operation)
var mOpOrder []string
countOverrides := func(s []Operand) int {
a := 0
for _, o := range s {
if o.OverwriteBase != nil {
a++
}
}
return a
}
for _, op := range ops {
_, _, maskType, _, gOp := op.shape()
asm := machineOpName(maskType, gOp)
other, ok := best[asm]
if !ok {
best[asm] = op
mOpOrder = append(mOpOrder, asm)
continue
}
// see if "op" is better than "other"
if countOverrides(op.In)+countOverrides(op.Out) < countOverrides(other.In)+countOverrides(other.Out) {
best[asm] = op
}
}
regInfoErrs := make([]error, 0)
regInfoMissing := make(map[string]bool, 0)
for _, asm := range mOpOrder {
op := best[asm]
shapeIn, shapeOut, maskType, _, gOp := op.shape()
// TODO: all our masked operations are currently zeroing; we need to generate machine ops
// with merging masks, perhaps by copying one here with a "Merging" name suffix. The
// rewrite rules will need them.
makeRegInfo := func(op Operation, mem memShape) (string, error) {
regInfo, err := op.regShape(mem)
if err != nil {
panic(err)
}
regInfo, err = rewriteVecAsScalarRegInfo(op, regInfo)
if err != nil {
if mem == NoMem || mem == InvalidMem {
panic(err)
}
return "", err
}
if regInfo == "v01load" {
regInfo = "vload"
}
// Makes AVX512 operations use upper registers
if strings.Contains(op.CPUFeature, "AVX512") {
regInfo = strings.ReplaceAll(regInfo, "v", "w")
}
if _, ok := regInfoSet[regInfo]; !ok {
regInfoErrs = append(regInfoErrs, fmt.Errorf("unsupported register constraint, please update the template and AMD64Ops.go: %s. Op is %s", regInfo, op))
regInfoMissing[regInfo] = true
}
return regInfo, nil
}
regInfo, err := makeRegInfo(op, NoMem)
if err != nil {
panic(err)
}
var outType string
if shapeOut == OneVregOut || shapeOut == OneVregOutAtIn || gOp.Out[0].OverwriteClass != nil {
// If class overwrite is happening, that's not really a mask but a vreg.
outType = fmt.Sprintf("Vec%d", *gOp.Out[0].Bits)
} else if shapeOut == OneGregOut {
outType = gOp.GoType() // this is a straight Go type, not a VecNNN type
} else if shapeOut == OneKmaskOut {
outType = "Mask"
} else {
panic(fmt.Errorf("simdgen does not recognize this output shape: %d", shapeOut))
}
resultInArg0 := false
if shapeOut == OneVregOutAtIn {
resultInArg0 = true
}
var memOpData *opData
regInfoMerging := regInfo
hasMerging := false
if op.MemFeatures != nil && *op.MemFeatures == "vbcst" {
// Right now we only have vbcst case
// Make a full vec memory variant.
opMem := rewriteLastVregToMem(op)
regInfo, err := makeRegInfo(opMem, VregMemIn)
if err != nil {
// Just skip it if the error is non-nil;
// it could be triggered by [checkVecAsScalar].
// TODO: make [checkVecAsScalar] aware of mem ops.
if *Verbose {
log.Printf("Seen error: %e", err)
}
} else {
memOpData = &opData{asm + "load", gOp.Asm, len(gOp.In) + 1, regInfo, false, outType, resultInArg0}
}
}
hasMerging = gOp.hasMaskedMerging(maskType, shapeOut)
if hasMerging && !resultInArg0 {
// We have to copy the slice here because the sort will be visible through other
// aliases when no reslicing is happening.
newIn := make([]Operand, len(op.In), len(op.In)+1)
copy(newIn, op.In)
op.In = newIn
op.In = append(op.In, op.Out[0])
op.sortOperand()
regInfoMerging, err = makeRegInfo(op, NoMem)
if err != nil {
panic(err)
}
}
if shapeIn == OneImmIn || shapeIn == OneKmaskImmIn {
opsDataImm = append(opsDataImm, opData{asm, gOp.Asm, len(gOp.In), regInfo, gOp.Commutative, outType, resultInArg0})
if memOpData != nil {
if *op.MemFeatures != "vbcst" {
panic("simdgen only knows vbcst for mem ops for now")
}
opsDataImmLoad = append(opsDataImmLoad, *memOpData)
}
if hasMerging {
mergingLen := len(gOp.In)
if !resultInArg0 {
mergingLen++
}
opsDataImmMerging = append(opsDataImmMerging, opData{asm, gOp.Asm, mergingLen, regInfoMerging, gOp.Commutative, outType, resultInArg0})
}
} else {
opsData = append(opsData, opData{asm, gOp.Asm, len(gOp.In), regInfo, gOp.Commutative, outType, resultInArg0})
if memOpData != nil {
if *op.MemFeatures != "vbcst" {
panic("simdgen only knows vbcst for mem ops for now")
}
opsDataLoad = append(opsDataLoad, *memOpData)
}
if hasMerging {
mergingLen := len(gOp.In)
if !resultInArg0 {
mergingLen++
}
opsDataMerging = append(opsDataMerging, opData{asm, gOp.Asm, mergingLen, regInfoMerging, gOp.Commutative, outType, resultInArg0})
}
}
}
if len(regInfoErrs) != 0 {
for _, e := range regInfoErrs {
log.Printf("Errors: %e\n", e)
}
panic(fmt.Errorf("these regInfo unseen: %v", regInfoMissing))
}
sort.Slice(opsData, func(i, j int) bool {
return compareNatural(opsData[i].OpName, opsData[j].OpName) < 0
})
sort.Slice(opsDataImm, func(i, j int) bool {
return compareNatural(opsDataImm[i].OpName, opsDataImm[j].OpName) < 0
})
sort.Slice(opsDataLoad, func(i, j int) bool {
return compareNatural(opsDataLoad[i].OpName, opsDataLoad[j].OpName) < 0
})
sort.Slice(opsDataImmLoad, func(i, j int) bool {
return compareNatural(opsDataImmLoad[i].OpName, opsDataImmLoad[j].OpName) < 0
})
sort.Slice(opsDataMerging, func(i, j int) bool {
return compareNatural(opsDataMerging[i].OpName, opsDataMerging[j].OpName) < 0
})
sort.Slice(opsDataImmMerging, func(i, j int) bool {
return compareNatural(opsDataImmMerging[i].OpName, opsDataImmMerging[j].OpName) < 0
})
err := t.Execute(buffer, machineOpsData{opsData, opsDataImm, opsDataLoad, opsDataImmLoad,
opsDataMerging, opsDataImmMerging})
if err != nil {
panic(fmt.Errorf("failed to execute template: %w", err))
}
return buffer
}

View file

@ -0,0 +1,658 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"cmp"
"fmt"
"maps"
"slices"
"sort"
"strings"
"unicode"
)
type simdType struct {
Name string // The go type name of this simd type, for example Int32x4.
Lanes int // The number of elements in this vector/mask.
Base string // The element's type, like for Int32x4 it will be int32.
Fields string // The struct fields, already formatted for emission.
Type string // Either "mask" or "vreg"
VectorCounterpart string // For mask use only: [simdType.Name] with "Mask" replaced by "Int".
ReshapedVectorWithAndOr string // For mask use only: vector AND and OR are only available in shapes with element width 32.
Size int // The size of the vector type in bits
}
func (x simdType) ElemBits() int {
return x.Size / x.Lanes
}
// LanesContainer returns the smallest int/uint bit size that is
// large enough to hold one bit for each lane. E.g., Mask32x4
// is 4 lanes, and a uint8 is the smallest uint that has 4 bits.
func (x simdType) LanesContainer() int {
if x.Lanes > 64 {
panic("too many lanes")
}
if x.Lanes > 32 {
return 64
}
if x.Lanes > 16 {
return 32
}
if x.Lanes > 8 {
return 16
}
return 8
}
// MaskedLoadStoreFilter encodes which simd types currently
// get masked loads/stores generated; it is used in two places,
// which forces coordination.
func (x simdType) MaskedLoadStoreFilter() bool {
return x.Size == 512 || x.ElemBits() >= 32 && x.Type != "mask"
}
func (x simdType) IntelSizeSuffix() string {
switch x.ElemBits() {
case 8:
return "B"
case 16:
return "W"
case 32:
return "D"
case 64:
return "Q"
}
panic("oops")
}
func (x simdType) MaskedLoadDoc() string {
if x.Size == 512 || x.ElemBits() < 32 {
return fmt.Sprintf("// Asm: VMOVDQU%d.Z, CPU Feature: AVX512", x.ElemBits())
} else {
return fmt.Sprintf("// Asm: VMASKMOV%s, CPU Feature: AVX2", x.IntelSizeSuffix())
}
}
func (x simdType) MaskedStoreDoc() string {
if x.Size == 512 || x.ElemBits() < 32 {
return fmt.Sprintf("// Asm: VMOVDQU%d, CPU Feature: AVX512", x.ElemBits())
} else {
return fmt.Sprintf("// Asm: VMASKMOV%s, CPU Feature: AVX2", x.IntelSizeSuffix())
}
}
func compareSimdTypes(x, y simdType) int {
// "vreg" then "mask"
if c := -compareNatural(x.Type, y.Type); c != 0 {
return c
}
// want "flo" < "int" < "uin" (and then 8 < 16 < 32 < 64),
// not "int16" < "int32" < "int64" < "int8")
// so limit comparison to first 3 bytes in string.
if c := compareNatural(x.Base[:3], y.Base[:3]); c != 0 {
return c
}
// base type size, 8 < 16 < 32 < 64
if c := x.ElemBits() - y.ElemBits(); c != 0 {
return c
}
// vector size last
return x.Size - y.Size
}
type simdTypeMap map[int][]simdType
type simdTypePair struct {
Tsrc simdType
Tdst simdType
}
func compareSimdTypePairs(x, y simdTypePair) int {
c := compareSimdTypes(x.Tsrc, y.Tsrc)
if c != 0 {
return c
}
return compareSimdTypes(x.Tdst, y.Tdst)
}
const simdPackageHeader = generatedHeader + `
//go:build goexperiment.simd
package simd
`
const simdTypesTemplates = `
{{define "sizeTmpl"}}
// v{{.}} is a tag type that tells the compiler that this is really {{.}}-bit SIMD
type v{{.}} struct {
_{{.}} [0]func() // uncomparable
}
{{end}}
{{define "typeTmpl"}}
// {{.Name}} is a {{.Size}}-bit SIMD vector of {{.Lanes}} {{.Base}}
type {{.Name}} struct {
{{.Fields}}
}
{{end}}
`
const simdFeaturesTemplate = `
import "internal/cpu"
type X86Features struct {}
var X86 X86Features
{{range .}}
{{- if eq .Feature "AVX512"}}
// {{.Feature}} returns whether the CPU supports the AVX512F+CD+BW+DQ+VL features.
//
// These five CPU features are bundled together, and no use of AVX-512
// is allowed unless all of these features are supported together.
// Nearly every CPU that has shipped with any support for AVX-512 has
// supported all five of these features.
{{- else -}}
// {{.Feature}} returns whether the CPU supports the {{.Feature}} feature.
{{- end}}
//
// {{.Feature}} is defined on all GOARCHes, but will only return true on
// GOARCH {{.GoArch}}.
func (X86Features) {{.Feature}}() bool {
return cpu.X86.Has{{.Feature}}
}
{{end}}
`
const simdLoadStoreTemplate = `
// Len returns the number of elements in a {{.Name}}
func (x {{.Name}}) Len() int { return {{.Lanes}} }
// Load{{.Name}} loads a {{.Name}} from an array
//
//go:noescape
func Load{{.Name}}(y *[{{.Lanes}}]{{.Base}}) {{.Name}}
// Store stores a {{.Name}} to an array
//
//go:noescape
func (x {{.Name}}) Store(y *[{{.Lanes}}]{{.Base}})
`
const simdMaskFromValTemplate = `
// {{.Name}}FromBits constructs a {{.Name}} from a bitmap value, where 1 means set for the indexed element, 0 means unset.
{{- if ne .Lanes .LanesContainer}}
// Only the lower {{.Lanes}} bits of y are used.
{{- end}}
//
// Asm: KMOV{{.IntelSizeSuffix}}, CPU Feature: AVX512
func {{.Name}}FromBits(y uint{{.LanesContainer}}) {{.Name}}
// ToBits constructs a bitmap from a {{.Name}}, where 1 means set for the indexed element, 0 means unset.
{{- if ne .Lanes .LanesContainer}}
// Only the lower {{.Lanes}} bits of the result are used.
{{- end}}
//
// Asm: KMOV{{.IntelSizeSuffix}}, CPU Feature: AVX512
func (x {{.Name}}) ToBits() uint{{.LanesContainer}}
`
const simdMaskedLoadStoreTemplate = `
// LoadMasked{{.Name}} loads a {{.Name}} from an array,
// at those elements enabled by mask
//
{{.MaskedLoadDoc}}
//
//go:noescape
func LoadMasked{{.Name}}(y *[{{.Lanes}}]{{.Base}}, mask Mask{{.ElemBits}}x{{.Lanes}}) {{.Name}}
// StoreMasked stores a {{.Name}} to an array,
// at those elements enabled by mask
//
{{.MaskedStoreDoc}}
//
//go:noescape
func (x {{.Name}}) StoreMasked(y *[{{.Lanes}}]{{.Base}}, mask Mask{{.ElemBits}}x{{.Lanes}})
`
const simdStubsTmpl = `
{{define "op1"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}() {{.GoType}}
{{end}}
{{define "op2"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}({{.Op1NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op2_21"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op0NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op2_21Type1"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op0NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op3"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}({{.Op1NameAndType "y"}}, {{.Op2NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op3_31Zero3"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op2NameAndType "x"}}) {{.Go}}({{.Op1NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op3_21"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op0NameAndType "y"}}, {{.Op2NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op3_21Type1"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op0NameAndType "y"}}, {{.Op2NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op3_231Type1"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op2NameAndType "y"}}, {{.Op0NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op2VecAsScalar"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}(y uint{{(index .In 1).TreatLikeAScalarOfSize}}) {{(index .Out 0).Go}}
{{end}}
{{define "op3VecAsScalar"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}(y uint{{(index .In 1).TreatLikeAScalarOfSize}}, {{.Op2NameAndType "z"}}) {{(index .Out 0).Go}}
{{end}}
{{define "op4"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op0NameAndType "x"}}) {{.Go}}({{.Op1NameAndType "y"}}, {{.Op2NameAndType "z"}}, {{.Op3NameAndType "u"}}) {{.GoType}}
{{end}}
{{define "op4_231Type1"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op2NameAndType "y"}}, {{.Op0NameAndType "z"}}, {{.Op3NameAndType "u"}}) {{.GoType}}
{{end}}
{{define "op4_31"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op2NameAndType "x"}}) {{.Go}}({{.Op1NameAndType "y"}}, {{.Op0NameAndType "z"}}, {{.Op3NameAndType "u"}}) {{.GoType}}
{{end}}
{{define "op1Imm8"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8) {{.GoType}}
{{end}}
{{define "op2Imm8"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8, {{.Op2NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op2Imm8_2I"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op2NameAndType "y"}}, {{.ImmName}} uint8) {{.GoType}}
{{end}}
{{define "op2Imm8_II"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} result in better performance when they are constants, non-constant values will be translated into a jump table.
// {{.ImmName}} should be between 0 and 3, inclusive; other values may result in a runtime panic.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8, {{.Op2NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op2Imm8_SHA1RNDS4"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8, {{.Op2NameAndType "y"}}) {{.GoType}}
{{end}}
{{define "op3Imm8"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8, {{.Op2NameAndType "y"}}, {{.Op3NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op3Imm8_2I"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.Op2NameAndType "y"}}, {{.ImmName}} uint8, {{.Op3NameAndType "z"}}) {{.GoType}}
{{end}}
{{define "op4Imm8"}}
{{if .Documentation}}{{.Documentation}}
//{{end}}
// {{.ImmName}} results in better performance when it's a constant, a non-constant value will be translated into a jump table.
//
// Asm: {{.Asm}}, CPU Feature: {{.CPUFeature}}
func ({{.Op1NameAndType "x"}}) {{.Go}}({{.ImmName}} uint8, {{.Op2NameAndType "y"}}, {{.Op3NameAndType "z"}}, {{.Op4NameAndType "u"}}) {{.GoType}}
{{end}}
{{define "vectorConversion"}}
// As{{.Tdst.Name}} converts from {{.Tsrc.Name}} to {{.Tdst.Name}}
func (from {{.Tsrc.Name}}) As{{.Tdst.Name}}() (to {{.Tdst.Name}})
{{end}}
{{define "mask"}}
// As{{.VectorCounterpart}} converts from {{.Name}} to {{.VectorCounterpart}}
func (from {{.Name}}) As{{.VectorCounterpart}}() (to {{.VectorCounterpart}})
// asMask converts from {{.VectorCounterpart}} to {{.Name}}
func (from {{.VectorCounterpart}}) asMask() (to {{.Name}})
func (x {{.Name}}) And(y {{.Name}}) {{.Name}}
func (x {{.Name}}) Or(y {{.Name}}) {{.Name}}
{{end}}
`
// parseSIMDTypes groups Go simd types by their vector sizes, and returns
// a map keyed by vector size whose values are the simd types of that size.
func parseSIMDTypes(ops []Operation) simdTypeMap {
// TODO: maybe instead of going over ops, go over types.yaml.
ret := map[int][]simdType{}
seen := map[string]struct{}{}
processArg := func(arg Operand) {
if arg.Class == "immediate" || arg.Class == "greg" {
// Immediates and general registers are not encoded as vector types.
return
}
if _, ok := seen[*arg.Go]; ok {
return
}
seen[*arg.Go] = struct{}{}
lanes := *arg.Lanes
base := fmt.Sprintf("%s%d", *arg.Base, *arg.ElemBits)
tagFieldNameS := fmt.Sprintf("%sx%d", base, lanes)
tagFieldS := fmt.Sprintf("%s v%d", tagFieldNameS, *arg.Bits)
valFieldS := fmt.Sprintf("vals%s[%d]%s", strings.Repeat(" ", len(tagFieldNameS)-3), lanes, base)
fields := fmt.Sprintf("\t%s\n\t%s", tagFieldS, valFieldS)
if arg.Class == "mask" {
vectorCounterpart := strings.ReplaceAll(*arg.Go, "Mask", "Int")
reshapedVectorWithAndOr := fmt.Sprintf("Int32x%d", *arg.Bits/32)
ret[*arg.Bits] = append(ret[*arg.Bits], simdType{*arg.Go, lanes, base, fields, arg.Class, vectorCounterpart, reshapedVectorWithAndOr, *arg.Bits})
// In case the vector counterpart of a mask is not present, put its vector counterpart typedef into the map as well.
if _, ok := seen[vectorCounterpart]; !ok {
seen[vectorCounterpart] = struct{}{}
ret[*arg.Bits] = append(ret[*arg.Bits], simdType{vectorCounterpart, lanes, base, fields, "vreg", "", "", *arg.Bits})
}
} else {
ret[*arg.Bits] = append(ret[*arg.Bits], simdType{*arg.Go, lanes, base, fields, arg.Class, "", "", *arg.Bits})
}
}
for _, op := range ops {
for _, arg := range op.In {
processArg(arg)
}
for _, arg := range op.Out {
processArg(arg)
}
}
return ret
}
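For reference, roughly what sizeTmpl and typeTmpl above expand to for a 256-bit vector of int32; spacing is approximate and the exact generated file is not reproduced here:

//go:build goexperiment.simd

package simd

// v256 is a tag type that tells the compiler that this is really 256-bit SIMD
type v256 struct {
	_256 [0]func() // uncomparable
}

// Int32x8 is a 256-bit SIMD vector of 8 int32
type Int32x8 struct {
	int32x8 v256
	vals    [8]int32
}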
func vConvertFromTypeMap(typeMap simdTypeMap) []simdTypePair {
v := []simdTypePair{}
for _, ts := range typeMap {
for i, tsrc := range ts {
for j, tdst := range ts {
if i != j && tsrc.Type == tdst.Type && tsrc.Type == "vreg" &&
tsrc.Lanes > 1 && tdst.Lanes > 1 {
v = append(v, simdTypePair{tsrc, tdst})
}
}
}
}
slices.SortFunc(v, compareSimdTypePairs)
return v
}
func masksFromTypeMap(typeMap simdTypeMap) []simdType {
m := []simdType{}
for _, ts := range typeMap {
for _, tsrc := range ts {
if tsrc.Type == "mask" {
m = append(m, tsrc)
}
}
}
slices.SortFunc(m, compareSimdTypes)
return m
}
func typesFromTypeMap(typeMap simdTypeMap) []simdType {
m := []simdType{}
for _, ts := range typeMap {
for _, tsrc := range ts {
if tsrc.Lanes > 1 {
m = append(m, tsrc)
}
}
}
slices.SortFunc(m, compareSimdTypes)
return m
}
// writeSIMDTypes generates the simd vector types into a bytes.Buffer
func writeSIMDTypes(typeMap simdTypeMap) *bytes.Buffer {
t := templateOf(simdTypesTemplates, "types_amd64")
loadStore := templateOf(simdLoadStoreTemplate, "loadstore_amd64")
maskedLoadStore := templateOf(simdMaskedLoadStoreTemplate, "maskedloadstore_amd64")
maskFromVal := templateOf(simdMaskFromValTemplate, "maskFromVal_amd64")
buffer := new(bytes.Buffer)
buffer.WriteString(simdPackageHeader)
sizes := make([]int, 0, len(typeMap))
for size, types := range typeMap {
slices.SortFunc(types, compareSimdTypes)
sizes = append(sizes, size)
}
sort.Ints(sizes)
for _, size := range sizes {
if size <= 64 {
// these are scalar
continue
}
if err := t.ExecuteTemplate(buffer, "sizeTmpl", size); err != nil {
panic(fmt.Errorf("failed to execute size template for size %d: %w", size, err))
}
for _, typeDef := range typeMap[size] {
if typeDef.Lanes == 1 {
continue
}
if err := t.ExecuteTemplate(buffer, "typeTmpl", typeDef); err != nil {
panic(fmt.Errorf("failed to execute type template for type %s: %w", typeDef.Name, err))
}
if typeDef.Type != "mask" {
if err := loadStore.ExecuteTemplate(buffer, "loadstore_amd64", typeDef); err != nil {
panic(fmt.Errorf("failed to execute loadstore template for type %s: %w", typeDef.Name, err))
}
// restrict to AVX2 masked loads/stores first.
if typeDef.MaskedLoadStoreFilter() {
if err := maskedLoadStore.ExecuteTemplate(buffer, "maskedloadstore_amd64", typeDef); err != nil {
panic(fmt.Errorf("failed to execute maskedloadstore template for type %s: %w", typeDef.Name, err))
}
}
} else {
if err := maskFromVal.ExecuteTemplate(buffer, "maskFromVal_amd64", typeDef); err != nil {
panic(fmt.Errorf("failed to execute maskFromVal template for type %s: %w", typeDef.Name, err))
}
}
}
}
return buffer
}
func writeSIMDFeatures(ops []Operation) *bytes.Buffer {
// Gather all features
type featureKey struct {
GoArch string
Feature string
}
featureSet := make(map[featureKey]struct{})
for _, op := range ops {
// Generate a feature check for each independent feature in a
// composite feature.
for feature := range strings.SplitSeq(op.CPUFeature, ",") {
feature = strings.TrimSpace(feature)
featureSet[featureKey{op.GoArch, feature}] = struct{}{}
}
}
features := slices.SortedFunc(maps.Keys(featureSet), func(a, b featureKey) int {
if c := cmp.Compare(a.GoArch, b.GoArch); c != 0 {
return c
}
return compareNatural(a.Feature, b.Feature)
})
// If we ever have the same feature name on more than one GOARCH, we'll have
// to be more careful about this.
t := templateOf(simdFeaturesTemplate, "features")
buffer := new(bytes.Buffer)
buffer.WriteString(simdPackageHeader)
if err := t.Execute(buffer, features); err != nil {
panic(fmt.Errorf("failed to execute features template: %w", err))
}
return buffer
}
// writeSIMDStubs returns two bytes.Buffers containing the declarations for the public
// and internal-use vector intrinsics.
func writeSIMDStubs(ops []Operation, typeMap simdTypeMap) (f, fI *bytes.Buffer) {
t := templateOf(simdStubsTmpl, "simdStubs")
f = new(bytes.Buffer)
fI = new(bytes.Buffer)
f.WriteString(simdPackageHeader)
fI.WriteString(simdPackageHeader)
slices.SortFunc(ops, compareOperations)
for i, op := range ops {
if op.NoTypes != nil && *op.NoTypes == "true" {
continue
}
if op.SkipMaskedMethod() {
continue
}
idxVecAsScalar, err := checkVecAsScalar(op)
if err != nil {
panic(err)
}
if s, op, err := classifyOp(op); err == nil {
if idxVecAsScalar != -1 {
if s == "op2" || s == "op3" {
s += "VecAsScalar"
} else {
panic(fmt.Errorf("simdgen only supports op2 or op3 with TreatLikeAScalarOfSize"))
}
}
if i == 0 || op.Go != ops[i-1].Go {
if unicode.IsUpper([]rune(op.Go)[0]) {
fmt.Fprintf(f, "\n/* %s */\n", op.Go)
} else {
fmt.Fprintf(fI, "\n/* %s */\n", op.Go)
}
}
if unicode.IsUpper([]rune(op.Go)[0]) {
if err := t.ExecuteTemplate(f, s, op); err != nil {
panic(fmt.Errorf("failed to execute template %s for op %v: %w", s, op, err))
}
} else {
if err := t.ExecuteTemplate(fI, s, op); err != nil {
panic(fmt.Errorf("failed to execute template %s for op %v: %w", s, op, err))
}
}
} else {
panic(fmt.Errorf("failed to classify op %v: %w", op.Go, err))
}
}
vectorConversions := vConvertFromTypeMap(typeMap)
for _, conv := range vectorConversions {
if err := t.ExecuteTemplate(f, "vectorConversion", conv); err != nil {
panic(fmt.Errorf("failed to execute vectorConversion template: %w", err))
}
}
masks := masksFromTypeMap(typeMap)
for _, mask := range masks {
if err := t.ExecuteTemplate(f, "mask", mask); err != nil {
panic(fmt.Errorf("failed to execute mask template for mask %s: %w", mask.Name, err))
}
}
return
}

View file

@ -0,0 +1,397 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"fmt"
"slices"
"strings"
"text/template"
)
type tplRuleData struct {
tplName string // e.g. "sftimm"
GoOp string // e.g. "ShiftAllLeft"
GoType string // e.g. "Uint32x8"
Args string // e.g. "x y"
Asm string // e.g. "VPSLLD256"
ArgsOut string // e.g. "x y"
MaskInConvert string // e.g. "VPMOVVec32x8ToM"
MaskOutConvert string // e.g. "VPMOVMToVec32x8"
ElementSize int // e.g. 32
Size int // e.g. 128
ArgsLoadAddr string // [Args] with its last vreg arg replaced by a concrete "(VMOVDQUload* ptr mem)"; may contain a mask.
ArgsAddr string // [Args] with its last vreg arg replaced by "ptr"; may contain a mask, and has "mem" appended at the end.
FeatCheck string // e.g. "v.Block.CPUfeatures.hasFeature(CPUavx512)" -- for a ssa/_gen rules file.
}
var (
ruleTemplates = template.Must(template.New("simdRules").Parse(`
{{define "pureVreg"}}({{.GoOp}}{{.GoType}} {{.Args}}) => ({{.Asm}} {{.ArgsOut}})
{{end}}
{{define "maskIn"}}({{.GoOp}}{{.GoType}} {{.Args}} mask) => ({{.Asm}} {{.ArgsOut}} ({{.MaskInConvert}} <types.TypeMask> mask))
{{end}}
{{define "maskOut"}}({{.GoOp}}{{.GoType}} {{.Args}}) => ({{.MaskOutConvert}} ({{.Asm}} {{.ArgsOut}}))
{{end}}
{{define "maskInMaskOut"}}({{.GoOp}}{{.GoType}} {{.Args}} mask) => ({{.MaskOutConvert}} ({{.Asm}} {{.ArgsOut}} ({{.MaskInConvert}} <types.TypeMask> mask)))
{{end}}
{{define "sftimm"}}({{.Asm}} x (MOVQconst [c])) => ({{.Asm}}const [uint8(c)] x)
{{end}}
{{define "masksftimm"}}({{.Asm}} x (MOVQconst [c]) mask) => ({{.Asm}}const [uint8(c)] x mask)
{{end}}
{{define "vregMem"}}({{.Asm}} {{.ArgsLoadAddr}}) && canMergeLoad(v, l) && clobber(l) => ({{.Asm}}load {{.ArgsAddr}})
{{end}}
{{define "vregMemFeatCheck"}}({{.Asm}} {{.ArgsLoadAddr}}) && {{.FeatCheck}} && canMergeLoad(v, l) && clobber(l)=> ({{.Asm}}load {{.ArgsAddr}})
{{end}}
`))
)
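To make the rule shapes concrete, here is a small self-contained sketch that renders just the pureVreg template with the example values from the tplRuleData field comments (ShiftAllLeft/Uint32x8/VPSLLD256 come from those comments, not from actual generator output):

package main

import (
	"os"
	"text/template"
)

func main() {
	// Same shape as the "pureVreg" rule template above.
	tmpl := template.Must(template.New("pureVreg").Parse(
		"({{.GoOp}}{{.GoType}} {{.Args}}) => ({{.Asm}} {{.ArgsOut}})\n"))
	data := struct {
		GoOp, GoType, Args, Asm, ArgsOut string
	}{"ShiftAllLeft", "Uint32x8", "x y", "VPSLLD256", "x y"}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
	// Prints: (ShiftAllLeftUint32x8 x y) => (VPSLLD256 x y)
}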
func (d tplRuleData) MaskOptimization(asmCheck map[string]bool) string {
asmNoMask := d.Asm
if i := strings.Index(asmNoMask, "Masked"); i == -1 {
return ""
}
asmNoMask = strings.ReplaceAll(asmNoMask, "Masked", "")
if asmCheck[asmNoMask] == false {
return ""
}
for _, nope := range []string{"VMOVDQU", "VPCOMPRESS", "VCOMPRESS", "VPEXPAND", "VEXPAND", "VPBLENDM", "VMOVUP"} {
if strings.HasPrefix(asmNoMask, nope) {
return ""
}
}
size := asmNoMask[len(asmNoMask)-3:]
if strings.HasSuffix(asmNoMask, "const") {
sufLen := len("128const")
size = asmNoMask[len(asmNoMask)-sufLen:][:3]
}
switch size {
case "128", "256", "512":
default:
panic("Unexpected operation size on " + d.Asm)
}
switch d.ElementSize {
case 8, 16, 32, 64:
default:
panic(fmt.Errorf("Unexpected operation width %d on %v", d.ElementSize, d.Asm))
}
return fmt.Sprintf("(VMOVDQU%dMasked%s (%s %s) mask) => (%s %s mask)\n", d.ElementSize, size, asmNoMask, d.Args, d.Asm, d.Args)
}
// SSA rewrite rules need to appear in a most-to-least-specific order. This works for that.
var tmplOrder = map[string]int{
"masksftimm": 0,
"sftimm": 1,
"maskInMaskOut": 2,
"maskOut": 3,
"maskIn": 4,
"pureVreg": 5,
"vregMem": 6,
}
func compareTplRuleData(x, y tplRuleData) int {
if c := compareNatural(x.GoOp, y.GoOp); c != 0 {
return c
}
if c := compareNatural(x.GoType, y.GoType); c != 0 {
return c
}
if c := compareNatural(x.Args, y.Args); c != 0 {
return c
}
if x.tplName == y.tplName {
return 0
}
xo, xok := tmplOrder[x.tplName]
yo, yok := tmplOrder[y.tplName]
if !xok {
panic(fmt.Errorf("Unexpected template name %s, please add to tmplOrder", x.tplName))
}
if !yok {
panic(fmt.Errorf("Unexpected template name %s, please add to tmplOrder", y.tplName))
}
return xo - yo
}
// writeSIMDRules generates the lowering and rewrite rules for ssa and writes it to simdAMD64.rules
// within the specified directory.
func writeSIMDRules(ops []Operation) *bytes.Buffer {
buffer := new(bytes.Buffer)
buffer.WriteString(generatedHeader + "\n")
// asm -> masked merging rules
maskedMergeOpts := make(map[string]string)
s2n := map[int]string{8: "B", 16: "W", 32: "D", 64: "Q"}
asmCheck := map[string]bool{}
var allData []tplRuleData
var optData []tplRuleData // for mask peephole optimizations, and other misc
var memOptData []tplRuleData // for memory peephole optimizations
memOpSeen := make(map[string]bool)
for _, opr := range ops {
opInShape, opOutShape, maskType, immType, gOp := opr.shape()
asm := machineOpName(maskType, gOp)
vregInCnt := len(gOp.In)
if maskType == OneMask {
vregInCnt--
}
data := tplRuleData{
GoOp: gOp.Go,
Asm: asm,
}
if vregInCnt == 1 {
data.Args = "x"
data.ArgsOut = data.Args
} else if vregInCnt == 2 {
data.Args = "x y"
data.ArgsOut = data.Args
} else if vregInCnt == 3 {
data.Args = "x y z"
data.ArgsOut = data.Args
} else {
panic(fmt.Errorf("simdgen does not support more than 3 vreg in inputs"))
}
if immType == ConstImm {
data.ArgsOut = fmt.Sprintf("[%s] %s", *opr.In[0].Const, data.ArgsOut)
} else if immType == VarImm {
data.Args = fmt.Sprintf("[a] %s", data.Args)
data.ArgsOut = fmt.Sprintf("[a] %s", data.ArgsOut)
} else if immType == ConstVarImm {
data.Args = fmt.Sprintf("[a] %s", data.Args)
data.ArgsOut = fmt.Sprintf("[a+%s] %s", *opr.In[0].Const, data.ArgsOut)
}
goType := func(op Operation) string {
if op.OperandOrder != nil {
switch *op.OperandOrder {
case "21Type1", "231Type1":
// Permute uses operand[1] for method receiver.
return *op.In[1].Go
}
}
return *op.In[0].Go
}
var tplName string
// If class overwrite is happening, that's not really a mask but a vreg.
if opOutShape == OneVregOut || opOutShape == OneVregOutAtIn || gOp.Out[0].OverwriteClass != nil {
switch opInShape {
case OneImmIn:
tplName = "pureVreg"
data.GoType = goType(gOp)
case PureVregIn:
tplName = "pureVreg"
data.GoType = goType(gOp)
case OneKmaskImmIn:
fallthrough
case OneKmaskIn:
tplName = "maskIn"
data.GoType = goType(gOp)
rearIdx := len(gOp.In) - 1
// Mask is at the end.
width := *gOp.In[rearIdx].ElemBits
data.MaskInConvert = fmt.Sprintf("VPMOVVec%dx%dToM", width, *gOp.In[rearIdx].Lanes)
data.ElementSize = width
case PureKmaskIn:
panic(fmt.Errorf("simdgen does not support pure k mask instructions, they should be generated by compiler optimizations"))
}
} else if opOutShape == OneGregOut {
tplName = "pureVreg" // TODO this will be wrong
data.GoType = goType(gOp)
} else {
// OneKmaskOut case
data.MaskOutConvert = fmt.Sprintf("VPMOVMToVec%dx%d", *gOp.Out[0].ElemBits, *gOp.In[0].Lanes)
switch opInShape {
case OneImmIn:
fallthrough
case PureVregIn:
tplName = "maskOut"
data.GoType = goType(gOp)
case OneKmaskImmIn:
fallthrough
case OneKmaskIn:
tplName = "maskInMaskOut"
data.GoType = goType(gOp)
rearIdx := len(gOp.In) - 1
data.MaskInConvert = fmt.Sprintf("VPMOVVec%dx%dToM", *gOp.In[rearIdx].ElemBits, *gOp.In[rearIdx].Lanes)
case PureKmaskIn:
panic(fmt.Errorf("simdgen does not support pure k mask instructions, they should be generated by compiler optimizations"))
}
}
if gOp.SpecialLower != nil {
if *gOp.SpecialLower == "sftimm" {
if data.GoType[0] == 'I' {
// Only do these for signed types; for unsigned it would be a duplicate rewrite.
sftImmData := data
if tplName == "maskIn" {
sftImmData.tplName = "masksftimm"
} else {
sftImmData.tplName = "sftimm"
}
allData = append(allData, sftImmData)
asmCheck[sftImmData.Asm+"const"] = true
}
} else {
panic("simdgen sees unknwon special lower " + *gOp.SpecialLower + ", maybe implement it?")
}
}
if gOp.MemFeatures != nil && *gOp.MemFeatures == "vbcst" {
// sanity check
selected := true
for _, a := range gOp.In {
if a.TreatLikeAScalarOfSize != nil {
selected = false
break
}
}
if _, ok := memOpSeen[data.Asm]; ok {
selected = false
}
if selected {
memOpSeen[data.Asm] = true
lastVreg := gOp.In[vregInCnt-1]
// sanity check
if lastVreg.Class != "vreg" {
panic(fmt.Errorf("simdgen expects vbcst replaced operand to be a vreg, but %v found", lastVreg))
}
memOpData := data
// Remove the last vreg from the arg and change it to a load.
origArgs := data.Args[:len(data.Args)-1]
// Prepare imm args.
immArg := ""
immArgCombineOff := " [off] "
if immType != NoImm && immType != InvalidImm {
_, after, found := strings.Cut(origArgs, "]")
if found {
origArgs = after
}
immArg = "[c] "
immArgCombineOff = " [makeValAndOff(int32(int8(c)),off)] "
}
memOpData.ArgsLoadAddr = immArg + origArgs + fmt.Sprintf("l:(VMOVDQUload%d {sym} [off] ptr mem)", *lastVreg.Bits)
// Remove the last vreg from the arg and change it to "ptr".
memOpData.ArgsAddr = "{sym}" + immArgCombineOff + origArgs + "ptr"
if maskType == OneMask {
memOpData.ArgsAddr += " mask"
memOpData.ArgsLoadAddr += " mask"
}
memOpData.ArgsAddr += " mem"
if gOp.MemFeaturesData != nil {
_, feat2 := getVbcstData(*gOp.MemFeaturesData)
knownFeatChecks := map[string]string{
"AVX": "v.Block.CPUfeatures.hasFeature(CPUavx)",
"AVX2": "v.Block.CPUfeatures.hasFeature(CPUavx2)",
"AVX512": "v.Block.CPUfeatures.hasFeature(CPUavx512)",
}
memOpData.FeatCheck = knownFeatChecks[feat2]
memOpData.tplName = "vregMemFeatCheck"
} else {
memOpData.tplName = "vregMem"
}
memOptData = append(memOptData, memOpData)
asmCheck[memOpData.Asm+"load"] = true
}
}
// Generate the masked merging optimization rules
if gOp.hasMaskedMerging(maskType, opOutShape) {
// TODO: handle customized operand order and special lower.
maskElem := gOp.In[len(gOp.In)-1]
if maskElem.Bits == nil {
panic("mask has no bits")
}
if maskElem.ElemBits == nil {
panic("mask has no elemBits")
}
if maskElem.Lanes == nil {
panic("mask has no lanes")
}
switch *maskElem.Bits {
case 128, 256:
// VPBLENDVB cases.
noMaskName := machineOpName(NoMask, gOp)
ruleExisting, ok := maskedMergeOpts[noMaskName]
rule := fmt.Sprintf("(VPBLENDVB%d dst (%s %s) mask) && v.Block.CPUfeatures.hasFeature(CPUavx512) => (%sMerging dst %s (VPMOVVec%dx%dToM <types.TypeMask> mask))\n",
*maskElem.Bits, noMaskName, data.Args, data.Asm, data.Args, *maskElem.ElemBits, *maskElem.Lanes)
if ok && ruleExisting != rule {
panic("multiple masked merge rules for one op")
} else {
maskedMergeOpts[noMaskName] = rule
}
case 512:
// VPBLENDM[BWDQ] cases.
noMaskName := machineOpName(NoMask, gOp)
ruleExisting, ok := maskedMergeOpts[noMaskName]
rule := fmt.Sprintf("(VPBLENDM%sMasked%d dst (%s %s) mask) => (%sMerging dst %s mask)\n",
s2n[*maskElem.ElemBits], *maskElem.Bits, noMaskName, data.Args, data.Asm, data.Args)
if ok && ruleExisting != rule {
panic("multiple masked merge rules for one op")
} else {
maskedMergeOpts[noMaskName] = rule
}
}
}
if tplName == "pureVreg" && data.Args == data.ArgsOut {
data.Args = "..."
data.ArgsOut = "..."
}
data.tplName = tplName
if opr.NoGenericOps != nil && *opr.NoGenericOps == "true" ||
opr.SkipMaskedMethod() {
optData = append(optData, data)
continue
}
allData = append(allData, data)
asmCheck[data.Asm] = true
}
slices.SortFunc(allData, compareTplRuleData)
for _, data := range allData {
if err := ruleTemplates.ExecuteTemplate(buffer, data.tplName, data); err != nil {
panic(fmt.Errorf("failed to execute template %s for %s: %w", data.tplName, data.GoOp+data.GoType, err))
}
}
seen := make(map[string]bool)
for _, data := range optData {
if data.tplName == "maskIn" {
rule := data.MaskOptimization(asmCheck)
if seen[rule] {
continue
}
seen[rule] = true
buffer.WriteString(rule)
}
}
maskedMergeOptsRules := []string{}
for asm, rule := range maskedMergeOpts {
if !asmCheck[asm] {
continue
}
maskedMergeOptsRules = append(maskedMergeOptsRules, rule)
}
slices.Sort(maskedMergeOptsRules)
for _, rule := range maskedMergeOptsRules {
buffer.WriteString(rule)
}
for _, data := range memOptData {
if err := ruleTemplates.ExecuteTemplate(buffer, data.tplName, data); err != nil {
panic(fmt.Errorf("failed to execute template %s for %s: %w", data.tplName, data.Asm, err))
}
}
return buffer
}


@@ -0,0 +1,236 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bytes"
"fmt"
"log"
"strings"
"text/template"
)
var (
ssaTemplates = template.Must(template.New("simdSSA").Parse(`
{{define "header"}}// Code generated by x/arch/internal/simdgen using 'go run . -xedPath $XED_PATH -o godefs -goroot $GOROOT go.yaml types.yaml categories.yaml'; DO NOT EDIT.
package amd64
import (
"cmd/compile/internal/ssa"
"cmd/compile/internal/ssagen"
"cmd/internal/obj"
"cmd/internal/obj/x86"
)
func ssaGenSIMDValue(s *ssagen.State, v *ssa.Value) bool {
var p *obj.Prog
switch v.Op {{"{"}}{{end}}
{{define "case"}}
case {{.Cases}}:
p = {{.Helper}}(s, v)
{{end}}
{{define "footer"}}
default:
// Unknown reg shape
return false
}
{{end}}
{{define "zeroing"}}
// Masked operations are always compiled with zeroing.
switch v.Op {
case {{.}}:
x86.ParseSuffix(p, "Z")
}
{{end}}
{{define "ending"}}
return true
}
{{end}}`))
)
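// Illustrative rendering (a sketch): a "case" entry with
// Cases="ssa.OpAMD64VPADDD256" and Helper="simdV21" expands to roughly
// case ssa.OpAMD64VPADDD256:
// p = simdV21(s, v)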
type tplSSAData struct {
Cases string
Helper string
}
// writeSIMDSSA generates the ssa to prog lowering codes and writes it to simdssa.go
// within the specified directory.
func writeSIMDSSA(ops []Operation) *bytes.Buffer {
var ZeroingMask []string
regInfoKeys := []string{
"v11",
"v21",
"v2k",
"v2kv",
"v2kk",
"vkv",
"v31",
"v3kv",
"v11Imm8",
"vkvImm8",
"v21Imm8",
"v2kImm8",
"v2kkImm8",
"v31ResultInArg0",
"v3kvResultInArg0",
"vfpv",
"vfpkv",
"vgpvImm8",
"vgpImm8",
"v2kvImm8",
"vkvload",
"v21load",
"v31loadResultInArg0",
"v3kvloadResultInArg0",
"v2kvload",
"v2kload",
"v11load",
"v11loadImm8",
"vkvloadImm8",
"v21loadImm8",
"v2kloadImm8",
"v2kkloadImm8",
"v2kvloadImm8",
"v31ResultInArg0Imm8",
"v31loadResultInArg0Imm8",
"v21ResultInArg0",
"v21ResultInArg0Imm8",
"v31x0AtIn2ResultInArg0",
"v2kvResultInArg0",
}
regInfoSet := map[string][]string{}
for _, key := range regInfoKeys {
regInfoSet[key] = []string{}
}
seen := map[string]struct{}{}
allUnseen := make(map[string][]Operation)
allUnseenCaseStr := make(map[string][]string)
classifyOp := func(op Operation, maskType maskShape, shapeIn inShape, shapeOut outShape, caseStr string, mem memShape) error {
regShape, err := op.regShape(mem)
if err != nil {
return err
}
if regShape == "v01load" {
regShape = "vload"
}
if shapeOut == OneVregOutAtIn {
regShape += "ResultInArg0"
}
if shapeIn == OneImmIn || shapeIn == OneKmaskImmIn {
regShape += "Imm8"
}
regShape, err = rewriteVecAsScalarRegInfo(op, regShape)
if err != nil {
return err
}
if _, ok := regInfoSet[regShape]; !ok {
allUnseen[regShape] = append(allUnseen[regShape], op)
allUnseenCaseStr[regShape] = append(allUnseenCaseStr[regShape], caseStr)
}
regInfoSet[regShape] = append(regInfoSet[regShape], caseStr)
if mem == NoMem && op.hasMaskedMerging(maskType, shapeOut) {
regShapeMerging := regShape
if shapeOut != OneVregOutAtIn {
// We have to copy the slice here because the sort would otherwise be visible
// through other aliases when no reslicing happens.
newIn := make([]Operand, len(op.In), len(op.In)+1)
copy(newIn, op.In)
op.In = newIn
op.In = append(op.In, op.Out[0])
op.sortOperand()
regShapeMerging, err = op.regShape(mem)
regShapeMerging += "ResultInArg0"
}
if err != nil {
return err
}
if _, ok := regInfoSet[regShapeMerging]; !ok {
allUnseen[regShapeMerging] = append(allUnseen[regShapeMerging], op)
allUnseenCaseStr[regShapeMerging] = append(allUnseenCaseStr[regShapeMerging], caseStr+"Merging")
}
regInfoSet[regShapeMerging] = append(regInfoSet[regShapeMerging], caseStr+"Merging")
}
return nil
}
for _, op := range ops {
shapeIn, shapeOut, maskType, _, gOp := op.shape()
asm := machineOpName(maskType, gOp)
if _, ok := seen[asm]; ok {
continue
}
seen[asm] = struct{}{}
caseStr := fmt.Sprintf("ssa.OpAMD64%s", asm)
isZeroMasking := false
if shapeIn == OneKmaskIn || shapeIn == OneKmaskImmIn {
if gOp.Zeroing == nil || *gOp.Zeroing {
ZeroingMask = append(ZeroingMask, caseStr)
isZeroMasking = true
}
}
if err := classifyOp(op, maskType, shapeIn, shapeOut, caseStr, NoMem); err != nil {
panic(err)
}
if op.MemFeatures != nil && *op.MemFeatures == "vbcst" {
// Make a full vec memory variant
op = rewriteLastVregToMem(op)
// Ignore the error; it could be triggered by [checkVecAsScalar].
// TODO: make [checkVecAsScalar] aware of mem ops.
if err := classifyOp(op, maskType, shapeIn, shapeOut, caseStr+"load", VregMemIn); err != nil {
if *Verbose {
log.Printf("Seen error: %e", err)
}
} else if isZeroMasking {
ZeroingMask = append(ZeroingMask, caseStr+"load")
}
}
}
if len(allUnseen) != 0 {
allKeys := make([]string, 0)
for k := range allUnseen {
allKeys = append(allKeys, k)
}
panic(fmt.Errorf("unsupported register constraint for prog, please update gen_simdssa.go and amd64/ssa.go: %+v\nAll keys: %v\n, cases: %v\n", allUnseen, allKeys, allUnseenCaseStr))
}
buffer := new(bytes.Buffer)
if err := ssaTemplates.ExecuteTemplate(buffer, "header", nil); err != nil {
panic(fmt.Errorf("failed to execute header template: %w", err))
}
for _, regShape := range regInfoKeys {
// Stable traversal of regInfoSet
cases := regInfoSet[regShape]
if len(cases) == 0 {
continue
}
data := tplSSAData{
Cases: strings.Join(cases, ",\n\t\t"),
Helper: "simd" + capitalizeFirst(regShape),
}
if err := ssaTemplates.ExecuteTemplate(buffer, "case", data); err != nil {
panic(fmt.Errorf("failed to execute case template for %s: %w", regShape, err))
}
}
if err := ssaTemplates.ExecuteTemplate(buffer, "footer", nil); err != nil {
panic(fmt.Errorf("failed to execute footer template: %w", err))
}
if len(ZeroingMask) != 0 {
if err := ssaTemplates.ExecuteTemplate(buffer, "zeroing", strings.Join(ZeroingMask, ",\n\t\t")); err != nil {
panic(fmt.Errorf("failed to execute footer template: %w", err))
}
}
if err := ssaTemplates.ExecuteTemplate(buffer, "ending", nil); err != nil {
panic(fmt.Errorf("failed to execute footer template: %w", err))
}
return buffer
}


@@ -0,0 +1,830 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"bufio"
"bytes"
"fmt"
"go/format"
"log"
"os"
"path/filepath"
"reflect"
"slices"
"sort"
"strings"
"text/template"
"unicode"
)
func templateOf(temp, name string) *template.Template {
t, err := template.New(name).Parse(temp)
if err != nil {
panic(fmt.Errorf("failed to parse template %s: %w", name, err))
}
return t
}
func createPath(goroot string, file string) (*os.File, error) {
fp := filepath.Join(goroot, file)
dir := filepath.Dir(fp)
err := os.MkdirAll(dir, 0755)
if err != nil {
return nil, fmt.Errorf("failed to create directory %s: %w", dir, err)
}
f, err := os.Create(fp)
if err != nil {
return nil, fmt.Errorf("failed to create file %s: %w", fp, err)
}
return f, nil
}
func formatWriteAndClose(out *bytes.Buffer, goroot string, file string) {
b, err := format.Source(out.Bytes())
if err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
fmt.Fprintf(os.Stderr, "%s\n", numberLines(out.Bytes()))
fmt.Fprintf(os.Stderr, "%v\n", err)
panic(err)
} else {
writeAndClose(b, goroot, file)
}
}
func writeAndClose(b []byte, goroot string, file string) {
ofile, err := createPath(goroot, file)
if err != nil {
panic(err)
}
ofile.Write(b)
ofile.Close()
}
// numberLines takes a slice of bytes, and returns a string where each line
// is numbered, starting from 1.
func numberLines(data []byte) string {
var buf bytes.Buffer
r := bytes.NewReader(data)
s := bufio.NewScanner(r)
for i := 1; s.Scan(); i++ {
fmt.Fprintf(&buf, "%d: %s\n", i, s.Text())
}
return buf.String()
}
type inShape uint8
type outShape uint8
type maskShape uint8
type immShape uint8
type memShape uint8
const (
InvalidIn inShape = iota
PureVregIn // vector register input only
OneKmaskIn // vector and kmask input
OneImmIn // vector and immediate input
OneKmaskImmIn // vector, kmask, and immediate inputs
PureKmaskIn // only mask inputs.
)
const (
InvalidOut outShape = iota
NoOut // no output
OneVregOut // (one) vector register output
OneGregOut // (one) general register output
OneKmaskOut // mask output
OneVregOutAtIn // the first input is also the output
)
const (
InvalidMask maskShape = iota
NoMask // no mask
OneMask // with mask (K1 to K7)
AllMasks // a K mask instruction (K0-K7)
)
const (
InvalidImm immShape = iota
NoImm // no immediate
ConstImm // const only immediate
VarImm // pure imm argument provided by the users
ConstVarImm // a combination of user arg and const
)
const (
InvalidMem memShape = iota
NoMem
VregMemIn // The instruction contains a mem input which is loading a vreg.
)
// opShape returns the several integers describing the shape of the operation,
// and modified versions of the op:
//
// opNoImm is op with its inputs excluding the const imm.
//
// This function does not modify op.
func (op *Operation) shape() (shapeIn inShape, shapeOut outShape, maskType maskShape, immType immShape,
opNoImm Operation) {
if len(op.Out) > 1 {
panic(fmt.Errorf("simdgen only supports 1 output: %s", op))
}
var outputReg int
if len(op.Out) == 1 {
outputReg = op.Out[0].AsmPos
if op.Out[0].Class == "vreg" {
shapeOut = OneVregOut
} else if op.Out[0].Class == "greg" {
shapeOut = OneGregOut
} else if op.Out[0].Class == "mask" {
shapeOut = OneKmaskOut
} else {
panic(fmt.Errorf("simdgen only supports output of class vreg or mask: %s", op))
}
} else {
shapeOut = NoOut
// TODO: are these only Load/Stores?
// We manually support two Load and Store ops; are those enough?
panic(fmt.Errorf("simdgen only supports 1 output: %s", op))
}
hasImm := false
maskCount := 0
hasVreg := false
for _, in := range op.In {
if in.AsmPos == outputReg {
if shapeOut != OneVregOutAtIn && in.AsmPos == 0 && in.Class == "vreg" {
shapeOut = OneVregOutAtIn
} else {
panic(fmt.Errorf("simdgen only support output and input sharing the same position case of \"the first input is vreg and the only output\": %s", op))
}
}
if in.Class == "immediate" {
// A manual check on the XED data found that AMD64 SIMD instructions have at
// most 1 immediate, so we don't need to check that here.
if *in.Bits != 8 {
panic(fmt.Errorf("simdgen only supports immediates of 8 bits: %s", op))
}
hasImm = true
} else if in.Class == "mask" {
maskCount++
} else {
hasVreg = true
}
}
opNoImm = *op
removeImm := func(o *Operation) {
o.In = o.In[1:]
}
if hasImm {
removeImm(&opNoImm)
if op.In[0].Const != nil {
if op.In[0].ImmOffset != nil {
immType = ConstVarImm
} else {
immType = ConstImm
}
} else if op.In[0].ImmOffset != nil {
immType = VarImm
} else {
panic(fmt.Errorf("simdgen requires imm to have at least one of ImmOffset or Const set: %s", op))
}
} else {
immType = NoImm
}
if maskCount == 0 {
maskType = NoMask
} else {
maskType = OneMask
}
checkPureMask := func() bool {
if hasImm {
panic(fmt.Errorf("simdgen does not support immediates in pure mask operations: %s", op))
}
if hasVreg {
panic(fmt.Errorf("simdgen does not support more than 1 masks in non-pure mask operations: %s", op))
}
return false
}
if !hasImm && maskCount == 0 {
shapeIn = PureVregIn
} else if !hasImm && maskCount > 0 {
if maskCount == 1 {
shapeIn = OneKmaskIn
} else {
if checkPureMask() {
return
}
shapeIn = PureKmaskIn
maskType = AllMasks
}
} else if hasImm && maskCount == 0 {
shapeIn = OneImmIn
} else {
if maskCount == 1 {
shapeIn = OneKmaskImmIn
} else {
checkPureMask()
return
}
}
return
}
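// Illustrative classification (a sketch for a typical AVX-512 masked add such as
// VPADDD with two vreg inputs, one k-mask input, and one separate vreg output):
// shape() reports shapeIn=OneKmaskIn, shapeOut=OneVregOut, maskType=OneMask,
// and immType=NoImm, and opNoImm is the operation unchanged.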
// regShape returns a string representation of the register shape.
func (op *Operation) regShape(mem memShape) (string, error) {
_, _, _, _, gOp := op.shape()
var regInfo, fixedName string
var vRegInCnt, gRegInCnt, kMaskInCnt, vRegOutCnt, gRegOutCnt, kMaskOutCnt, memInCnt, memOutCnt int
for i, in := range gOp.In {
switch in.Class {
case "vreg":
vRegInCnt++
case "greg":
gRegInCnt++
case "mask":
kMaskInCnt++
case "memory":
if mem != VregMemIn {
panic("simdgen only knows VregMemIn in regShape")
}
memInCnt++
vRegInCnt++
}
if in.FixedReg != nil {
fixedName = fmt.Sprintf("%sAtIn%d", *in.FixedReg, i)
}
}
for i, out := range gOp.Out {
// If class overwrite is happening, that's not really a mask but a vreg.
if out.Class == "vreg" || out.OverwriteClass != nil {
vRegOutCnt++
} else if out.Class == "greg" {
gRegOutCnt++
} else if out.Class == "mask" {
kMaskOutCnt++
} else if out.Class == "memory" {
if mem != VregMemIn {
panic("simdgen only knows VregMemIn in regShape")
}
vRegOutCnt++
memOutCnt++
}
if out.FixedReg != nil {
fixedName = fmt.Sprintf("%sAtIn%d", *out.FixedReg, i)
}
}
var inRegs, inMasks, outRegs, outMasks string
rmAbbrev := func(s string, i int) string {
if i == 0 {
return ""
}
if i == 1 {
return s
}
return fmt.Sprintf("%s%d", s, i)
}
inRegs = rmAbbrev("v", vRegInCnt)
inRegs += rmAbbrev("gp", gRegInCnt)
inMasks = rmAbbrev("k", kMaskInCnt)
outRegs = rmAbbrev("v", vRegOutCnt)
outRegs += rmAbbrev("gp", gRegOutCnt)
outMasks = rmAbbrev("k", kMaskOutCnt)
if kMaskInCnt == 0 && kMaskOutCnt == 0 && gRegInCnt == 0 && gRegOutCnt == 0 {
// For pure v we can abbreviate it as v%d%d.
regInfo = fmt.Sprintf("v%d%d", vRegInCnt, vRegOutCnt)
} else if kMaskInCnt == 0 && kMaskOutCnt == 0 {
regInfo = fmt.Sprintf("%s%s", inRegs, outRegs)
} else {
regInfo = fmt.Sprintf("%s%s%s%s", inRegs, inMasks, outRegs, outMasks)
}
if memInCnt > 0 {
if memInCnt == 1 {
regInfo += "load"
} else {
panic("simdgen does not understand more than 1 mem op as of now")
}
}
if memOutCnt > 0 {
panic("simdgen does not understand memory as output as of now")
}
regInfo += fixedName
return regInfo, nil
}
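// Illustrative shapes (assuming typical operand sets): two vreg inputs and one
// vreg output yield "v21"; adding a k-mask input yields "v2kv"; turning the last
// vreg input of the "v21" case into a memory operand yields "v21load".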
// sortOperand sorts op.In by putting immediates first, then vregs, and masks last.
// TODO: verify that this is a safe assumption about the prog structure.
// From observation, in the assembly encoding immediates always come first and
// masks always come last, with vregs in between.
func (op *Operation) sortOperand() {
priority := map[string]int{"immediate": 0, "vreg": 1, "greg": 1, "mask": 2}
sort.SliceStable(op.In, func(i, j int) bool {
pi := priority[op.In[i].Class]
pj := priority[op.In[j].Class]
if pi != pj {
return pi < pj
}
return op.In[i].AsmPos < op.In[j].AsmPos
})
}
// goNormalType returns the Go type name for the result of an Op that
// does not return a vector, i.e., that returns a result in a general
// register. Currently there's only one family of Ops in Go's simd library
// that does this (GetElem), and so this is specialized to work for that,
// but the problem (mismatch between hardware register width and Go type
// width) seems likely to recur if there are any other cases.
func (op Operation) goNormalType() string {
if op.Go == "GetElem" {
// GetElem returns an element of the vector into a general register
// but as far as the hardware is concerned, that result is either 32
// or 64 bits wide, no matter what the vector element width is.
// This is not "wrong" but it is not the right answer for Go source code.
// To get the Go type right, combine the base type ("int", "uint", "float"),
// with the input vector element width in bits (8,16,32,64).
at := 0 // proper value of at depends on whether immediate was stripped or not
if op.In[at].Class == "immediate" {
at++
}
return fmt.Sprintf("%s%d", *op.Out[0].Base, *op.In[at].ElemBits)
}
panic(fmt.Errorf("Implement goNormalType for %v", op))
}
// SSAType returns the string for the type reference in SSA generation,
// for example in the intrinsics generating template.
func (op Operation) SSAType() string {
if op.Out[0].Class == "greg" {
return fmt.Sprintf("types.Types[types.T%s]", strings.ToUpper(op.goNormalType()))
}
return fmt.Sprintf("types.TypeVec%d", *op.Out[0].Bits)
}
// GoType returns the Go type returned by this operation (relative to the simd package),
// for example "int32" or "Int8x16". This is used in a template.
func (op Operation) GoType() string {
if op.Out[0].Class == "greg" {
return op.goNormalType()
}
return *op.Out[0].Go
}
// ImmName returns the name to use for an operation's immediate operand.
// This can be overridden in the yaml with "name" on an operand,
// otherwise, for now, "constant"
func (op Operation) ImmName() string {
return op.Op0Name("constant")
}
func (o Operand) OpName(s string) string {
if n := o.Name; n != nil {
return *n
}
if o.Class == "mask" {
return "mask"
}
return s
}
func (o Operand) OpNameAndType(s string) string {
return o.OpName(s) + " " + *o.Go
}
// GoExported returns [Go] with first character capitalized.
func (op Operation) GoExported() string {
return capitalizeFirst(op.Go)
}
// DocumentationExported returns [Documentation] with method name capitalized.
func (op Operation) DocumentationExported() string {
return strings.ReplaceAll(op.Documentation, op.Go, op.GoExported())
}
// Op0Name returns the name to use for the 0 operand,
// if any is present, otherwise the parameter is used.
func (op Operation) Op0Name(s string) string {
return op.In[0].OpName(s)
}
// Op1Name returns the name to use for the 1 operand,
// if any is present, otherwise the parameter is used.
func (op Operation) Op1Name(s string) string {
return op.In[1].OpName(s)
}
// Op2Name returns the name to use for the 2 operand,
// if any is present, otherwise the parameter is used.
func (op Operation) Op2Name(s string) string {
return op.In[2].OpName(s)
}
// Op3Name returns the name to use for the 3 operand,
// if any is present, otherwise the parameter is used.
func (op Operation) Op3Name(s string) string {
return op.In[3].OpName(s)
}
// Op0NameAndType returns the name and type to use for
// the 0 operand, if a name is provided, otherwise
// the parameter value is used as the default.
func (op Operation) Op0NameAndType(s string) string {
return op.In[0].OpNameAndType(s)
}
// Op1NameAndType returns the name and type to use for
// the 1 operand, if a name is provided, otherwise
// the parameter value is used as the default.
func (op Operation) Op1NameAndType(s string) string {
return op.In[1].OpNameAndType(s)
}
// Op2NameAndType returns the name and type to use for
// the 2 operand, if a name is provided, otherwise
// the parameter value is used as the default.
func (op Operation) Op2NameAndType(s string) string {
return op.In[2].OpNameAndType(s)
}
// Op3NameAndType returns the name and type to use for
// the 3 operand, if a name is provided, otherwise
// the parameter value is used as the default.
func (op Operation) Op3NameAndType(s string) string {
return op.In[3].OpNameAndType(s)
}
// Op4NameAndType returns the name and type to use for
// the 4 operand, if a name is provided, otherwise
// the parameter value is used as the default.
func (op Operation) Op4NameAndType(s string) string {
return op.In[4].OpNameAndType(s)
}
var immClasses []string = []string{"BAD0Imm", "BAD1Imm", "op1Imm8", "op2Imm8", "op3Imm8", "op4Imm8"}
var classes []string = []string{"BAD0", "op1", "op2", "op3", "op4"}
// classifyOp returns a classification string, modified operation, and perhaps error based
// on the stub and intrinsic shape for the operation.
// The classification string is in the regular expression set "op[1234](Imm8)?(_<order>)?"
// where the "<order>" suffix is optionally attached to the Operation in its input yaml.
// The classification string is used to select a template or a clause of a template
// for intrinsics declarations and the ssagen intrinsics glue code in the compiler.
func classifyOp(op Operation) (string, Operation, error) {
_, _, _, immType, gOp := op.shape()
var class string
if immType == VarImm || immType == ConstVarImm {
switch l := len(op.In); l {
case 1:
return "", op, fmt.Errorf("simdgen does not recognize this operation of only immediate input: %s", op)
case 2, 3, 4, 5:
class = immClasses[l]
default:
return "", op, fmt.Errorf("simdgen does not recognize this operation of input length %d: %s", len(op.In), op)
}
if order := op.OperandOrder; order != nil {
class += "_" + *order
}
return class, op, nil
} else {
switch l := len(gOp.In); l {
case 1, 2, 3, 4:
class = classes[l]
default:
return "", op, fmt.Errorf("simdgen does not recognize this operation of input length %d: %s", len(op.In), op)
}
if order := op.OperandOrder; order != nil {
class += "_" + *order
}
return class, gOp, nil
}
}
func checkVecAsScalar(op Operation) (idx int, err error) {
idx = -1
sSize := 0
for i, o := range op.In {
if o.TreatLikeAScalarOfSize != nil {
if idx == -1 {
idx = i
sSize = *o.TreatLikeAScalarOfSize
} else {
err = fmt.Errorf("simdgen only supports one TreatLikeAScalarOfSize in the arg list: %s", op)
return
}
}
}
if idx >= 0 {
if sSize != 8 && sSize != 16 && sSize != 32 && sSize != 64 {
err = fmt.Errorf("simdgen does not recognize this uint size: %d, %s", sSize, op)
return
}
}
return
}
func rewriteVecAsScalarRegInfo(op Operation, regInfo string) (string, error) {
idx, err := checkVecAsScalar(op)
if err != nil {
return "", err
}
if idx != -1 {
if regInfo == "v21" {
regInfo = "vfpv"
} else if regInfo == "v2kv" {
regInfo = "vfpkv"
} else if regInfo == "v31" {
regInfo = "v2fpv"
} else if regInfo == "v3kv" {
regInfo = "v2fpkv"
} else {
return "", fmt.Errorf("simdgen does not recognize uses of treatLikeAScalarOfSize with op regShape %s in op: %s", regInfo, op)
}
}
return regInfo, nil
}
func rewriteLastVregToMem(op Operation) Operation {
newIn := make([]Operand, len(op.In))
lastVregIdx := -1
for i := range len(op.In) {
newIn[i] = op.In[i]
if op.In[i].Class == "vreg" {
lastVregIdx = i
}
}
// vbcst operations always place their mem operand in the last vreg position.
if lastVregIdx == -1 {
panic("simdgen cannot find one vreg in the mem op vreg original")
}
newIn[lastVregIdx].Class = "memory"
op.In = newIn
return op
}
// dedup deduplicates operations at the whole-structure level.
func dedup(ops []Operation) (deduped []Operation) {
for _, op := range ops {
seen := false
for _, dop := range deduped {
if reflect.DeepEqual(op, dop) {
seen = true
break
}
}
if !seen {
deduped = append(deduped, op)
}
}
return
}
func (op Operation) GenericName() string {
if op.OperandOrder != nil {
switch *op.OperandOrder {
case "21Type1", "231Type1":
// Permute uses operand[1] for method receiver.
return op.Go + *op.In[1].Go
}
}
if op.In[0].Class == "immediate" {
return op.Go + *op.In[1].Go
}
return op.Go + *op.In[0].Go
}
// dedupGodef deduplicates operations at the [Op.Go]+[*Op.In[0].Go] level.
// Deduplication picks the least advanced architecture that satisfies the requirement;
// AVX512 is least preferred.
// If FlagReportDup is set, it instead reports the duplicates to the console.
func dedupGodef(ops []Operation) ([]Operation, error) {
seen := map[string][]Operation{}
for _, op := range ops {
_, _, _, _, gOp := op.shape()
gN := gOp.GenericName()
seen[gN] = append(seen[gN], op)
}
if *FlagReportDup {
for gName, dup := range seen {
if len(dup) > 1 {
log.Printf("Duplicate for %s:\n", gName)
for _, op := range dup {
log.Printf("%s\n", op)
}
}
}
return ops, nil
}
isAVX512 := func(op Operation) bool {
return strings.Contains(op.CPUFeature, "AVX512")
}
deduped := []Operation{}
for _, dup := range seen {
if len(dup) > 1 {
slices.SortFunc(dup, func(i, j Operation) int {
// Put non-AVX512 candidates at the beginning
if !isAVX512(i) && isAVX512(j) {
return -1
}
if isAVX512(i) && !isAVX512(j) {
return 1
}
if i.CPUFeature != j.CPUFeature {
return strings.Compare(i.CPUFeature, j.CPUFeature)
}
// Weirdly, Intel sometimes has duplicated definitions for the same instruction,
// which confuses the XED mem-op merge logic: [MemFeature] is attached to an
// instruction only once, so for essentially duplicated instructions only one
// copy will have the proper [MemFeature] set. We have to make this sort
// deterministic with respect to [MemFeature].
if i.MemFeatures != nil && j.MemFeatures == nil {
return -1
}
if i.MemFeatures == nil && j.MemFeatures != nil {
return 1
}
// Their order does not matter anymore, at least for now.
return 0
})
}
deduped = append(deduped, dup[0])
}
slices.SortFunc(deduped, compareOperations)
return deduped, nil
}
// Copy op.ConstImm to op.In[0].Const
// This is a hack to reduce the size of defs we need for const imm operations.
func copyConstImm(ops []Operation) error {
for _, op := range ops {
if op.ConstImm == nil {
continue
}
_, _, _, immType, _ := op.shape()
if immType == ConstImm || immType == ConstVarImm {
op.In[0].Const = op.ConstImm
}
// Otherwise, just don't port it - e.g. {VPCMP[BWDQ] imm=0} and {VPCMPEQ[BWDQ]} are
// the same operation "Equal"; [dedupGodef] should be able to distinguish them.
}
return nil
}
func capitalizeFirst(s string) string {
if s == "" {
return ""
}
// Convert the string to a slice of runes to handle multi-byte characters correctly.
r := []rune(s)
r[0] = unicode.ToUpper(r[0])
return string(r)
}
// overwrite corrects some errors due to:
//   - the XED data being wrong
//   - Go's SIMD API requirements, for example AVX2 compares should also produce masks.
// This rewrite has strict constraints; please see the error messages.
// These constraints are also exploited in [writeSIMDRules], [writeSIMDMachineOps]
// and [writeSIMDSSA], so please be careful when updating them.
func overwrite(ops []Operation) error {
hasClassOverwrite := false
overwrite := func(op []Operand, idx int, o Operation) error {
if op[idx].OverwriteElementBits != nil {
if op[idx].ElemBits == nil {
panic(fmt.Errorf("ElemBits is nil at operand %d of %v", idx, o))
}
*op[idx].ElemBits = *op[idx].OverwriteElementBits
*op[idx].Lanes = *op[idx].Bits / *op[idx].ElemBits
*op[idx].Go = fmt.Sprintf("%s%dx%d", capitalizeFirst(*op[idx].Base), *op[idx].ElemBits, *op[idx].Lanes)
}
if op[idx].OverwriteClass != nil {
if op[idx].OverwriteBase == nil {
panic(fmt.Errorf("simdgen: [OverwriteClass] must be set together with [OverwriteBase]: %s", op[idx]))
}
oBase := *op[idx].OverwriteBase
oClass := *op[idx].OverwriteClass
if oClass != "mask" {
panic(fmt.Errorf("simdgen: [Class] overwrite only supports overwritting to mask: %s", op[idx]))
}
if oBase != "int" {
panic(fmt.Errorf("simdgen: [Class] overwrite must set [OverwriteBase] to int: %s", op[idx]))
}
if op[idx].Class != "vreg" {
panic(fmt.Errorf("simdgen: [Class] overwrite must be overwriting [Class] from vreg: %s", op[idx]))
}
hasClassOverwrite = true
*op[idx].Base = oBase
op[idx].Class = oClass
*op[idx].Go = fmt.Sprintf("Mask%dx%d", *op[idx].ElemBits, *op[idx].Lanes)
} else if op[idx].OverwriteBase != nil {
oBase := *op[idx].OverwriteBase
*op[idx].Go = strings.ReplaceAll(*op[idx].Go, capitalizeFirst(*op[idx].Base), capitalizeFirst(oBase))
if op[idx].Class == "greg" {
*op[idx].Go = strings.ReplaceAll(*op[idx].Go, *op[idx].Base, oBase)
}
*op[idx].Base = oBase
}
return nil
}
for i, o := range ops {
hasClassOverwrite = false
for j := range ops[i].In {
if err := overwrite(ops[i].In, j, o); err != nil {
return err
}
if hasClassOverwrite {
return fmt.Errorf("simdgen does not support [OverwriteClass] in inputs: %s", ops[i])
}
}
for j := range ops[i].Out {
if err := overwrite(ops[i].Out, j, o); err != nil {
return err
}
}
if hasClassOverwrite {
for _, in := range ops[i].In {
if in.Class == "mask" {
return fmt.Errorf("simdgen only supports [OverwriteClass] for operations without mask inputs")
}
}
}
}
return nil
}
// reportXEDInconsistency reports potential XED inconsistencies.
// We can add more fields to [Operation] to enable more checks and implement it here.
// Supported checks:
// [NameAndSizeCheck]: NAME[BWDQ] should set the elemBits accordingly.
// This check is useful to find inconsistencies, then we can add overwrite fields to
// those defs to correct them manually.
func reportXEDInconsistency(ops []Operation) error {
for _, o := range ops {
if o.NameAndSizeCheck != nil {
suffixSizeMap := map[byte]int{'B': 8, 'W': 16, 'D': 32, 'Q': 64}
checkOperand := func(opr Operand) error {
if opr.ElemBits == nil {
return fmt.Errorf("simdgen expects elemBits to be set when performing NameAndSizeCheck")
}
if v, ok := suffixSizeMap[o.Asm[len(o.Asm)-1]]; !ok {
return fmt.Errorf("simdgen expects asm to end with [BWDQ] when performing NameAndSizeCheck")
} else {
if v != *opr.ElemBits {
return fmt.Errorf("simdgen finds NameAndSizeCheck inconsistency in def: %s", o)
}
}
return nil
}
for _, in := range o.In {
if in.Class != "vreg" && in.Class != "mask" {
continue
}
if in.TreatLikeAScalarOfSize != nil {
// This is an irregular operand, don't check it.
continue
}
if err := checkOperand(in); err != nil {
return err
}
}
for _, out := range o.Out {
if err := checkOperand(out); err != nil {
return err
}
}
}
}
return nil
}
func (o *Operation) hasMaskedMerging(maskType maskShape, outType outShape) bool {
// BLEND and VMOVDQU are not user-facing ops so we should filter them out.
return o.OperandOrder == nil && o.SpecialLower == nil && maskType == OneMask && outType == OneVregOut &&
len(o.InVariant) == 1 && !strings.Contains(o.Asm, "BLEND") && !strings.Contains(o.Asm, "VMOVDQU")
}
func getVbcstData(s string) (feat1Match, feat2Match string) {
_, err := fmt.Sscanf(s, "feat1=%[^;];feat2=%s", &feat1Match, &feat2Match)
if err != nil {
panic(err)
}
return
}
func (o Operation) String() string {
return pprints(o)
}
func (op Operand) String() string {
return pprints(op)
}


@@ -0,0 +1 @@
!import ops/*/go.yaml


@@ -0,0 +1,438 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package main
import (
"fmt"
"log"
"regexp"
"slices"
"strconv"
"strings"
"unicode"
"simd/_gen/unify"
)
type Operation struct {
rawOperation
// Go is the Go method name of this operation.
//
// It is derived from the raw Go method name by adding optional suffixes.
// Currently, "Masked" is the only suffix.
Go string
// Documentation is the doc string for this API.
//
// It is computed from the raw documentation:
//
// - "NAME" is replaced by the Go method name.
//
// - For masked operation, a sentence about masking is added.
Documentation string
// In is the sequence of parameters to the Go method.
//
// For masked operations, this will have the mask operand appended.
In []Operand
}
// rawOperation is the unifier representation of an [Operation]. It is
// translated into a more parsed form after unifier decoding.
type rawOperation struct {
Go string // Base Go method name
GoArch string // GOARCH for this definition
Asm string // Assembly mnemonic
OperandOrder *string // optional Operand order for better Go declarations
// Optional tag to indicate this operation is paired with special generic->machine ssa lowering rules.
// Should be paired with special templates in gen_simdrules.go
SpecialLower *string
In []Operand // Parameters
InVariant []Operand // Optional parameters
Out []Operand // Results
MemFeatures *string // The memory operand feature this operation supports
MemFeaturesData *string // Additional data associated with MemFeatures
Commutative bool // Commutativity
CPUFeature string // CPUID/Has* feature name
Zeroing *bool // nil => use asm suffix ".Z"; false => do not use asm suffix ".Z"
Documentation *string // Documentation will be appended to the stubs comments.
AddDoc *string // Additional doc to be appended.
// ConstImm is a hack to reduce the size of defs the user writes for const-immediate operations.
// If present, it will be copied to [In[0].Const].
ConstImm *string
// NameAndSizeCheck is used to check [BWDQ] maps to (8|16|32|64) elemBits.
NameAndSizeCheck *bool
// If non-nil, all generation in gen_simdTypes.go and gen_intrinsics will be skipped.
NoTypes *string
// If non-nil, all generation in gen_simdGenericOps and gen_simdrules will be skipped.
NoGenericOps *string
// If non-nil, this string will be attached to the machine ssa op name. E.g. "const"
SSAVariant *string
// If true, do not emit method declarations, generic ops, or intrinsics for masked variants;
// DO emit the architecture-specific opcodes and optimizations.
HideMaskMethods *bool
}
func (o *Operation) IsMasked() bool {
if len(o.InVariant) == 0 {
return false
}
if len(o.InVariant) == 1 && o.InVariant[0].Class == "mask" {
return true
}
panic(fmt.Errorf("unknown inVariant"))
}
func (o *Operation) SkipMaskedMethod() bool {
if o.HideMaskMethods == nil {
return false
}
if *o.HideMaskMethods && o.IsMasked() {
return true
}
return false
}
var reForName = regexp.MustCompile(`\bNAME\b`)
func (o *Operation) DecodeUnified(v *unify.Value) error {
if err := v.Decode(&o.rawOperation); err != nil {
return err
}
isMasked := o.IsMasked()
// Compute full Go method name.
o.Go = o.rawOperation.Go
if isMasked {
o.Go += "Masked"
}
// Compute doc string.
if o.rawOperation.Documentation != nil {
o.Documentation = *o.rawOperation.Documentation
} else {
o.Documentation = "// UNDOCUMENTED"
}
o.Documentation = reForName.ReplaceAllString(o.Documentation, o.Go)
if isMasked {
o.Documentation += "\n//\n// This operation is applied selectively under a write mask."
// Suppress generic op and method declaration for exported methods, if a mask is present.
if unicode.IsUpper([]rune(o.Go)[0]) {
trueVal := "true"
o.NoGenericOps = &trueVal
o.NoTypes = &trueVal
}
}
if o.rawOperation.AddDoc != nil {
o.Documentation += "\n" + reForName.ReplaceAllString(*o.rawOperation.AddDoc, o.Go)
}
o.In = append(o.rawOperation.In, o.rawOperation.InVariant...)
return nil
}
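// For example (illustrative): a raw def with go: Add and a single mask inVariant
// decodes with Go="AddMasked", gets the write-mask sentence appended to its
// documentation, and has the mask operand appended to In; if the resulting method
// name is exported, NoGenericOps and NoTypes are also forced to "true".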
func (o *Operation) VectorWidth() int {
out := o.Out[0]
if out.Class == "vreg" {
return *out.Bits
} else if out.Class == "greg" || out.Class == "mask" {
for i := range o.In {
if o.In[i].Class == "vreg" {
return *o.In[i].Bits
}
}
}
panic(fmt.Errorf("Figure out what the vector width is for %v and implement it", *o))
}
// Right now simdgen computes the machine op name for most instructions
// as $Name$OutputSize; under this naming, some instructions are "overloaded".
// For example:
// (Uint16x8) ConvertToInt8
// (Uint16x16) ConvertToInt8
// are both VPMOVWB128.
// To make them distinguishable we need to append the input size as well.
// TODO: document them well in the generated code.
var demotingConvertOps = map[string]bool{
"VPMOVQD128": true, "VPMOVSQD128": true, "VPMOVUSQD128": true, "VPMOVQW128": true, "VPMOVSQW128": true,
"VPMOVUSQW128": true, "VPMOVDW128": true, "VPMOVSDW128": true, "VPMOVUSDW128": true, "VPMOVQB128": true,
"VPMOVSQB128": true, "VPMOVUSQB128": true, "VPMOVDB128": true, "VPMOVSDB128": true, "VPMOVUSDB128": true,
"VPMOVWB128": true, "VPMOVSWB128": true, "VPMOVUSWB128": true,
"VPMOVQDMasked128": true, "VPMOVSQDMasked128": true, "VPMOVUSQDMasked128": true, "VPMOVQWMasked128": true, "VPMOVSQWMasked128": true,
"VPMOVUSQWMasked128": true, "VPMOVDWMasked128": true, "VPMOVSDWMasked128": true, "VPMOVUSDWMasked128": true, "VPMOVQBMasked128": true,
"VPMOVSQBMasked128": true, "VPMOVUSQBMasked128": true, "VPMOVDBMasked128": true, "VPMOVSDBMasked128": true, "VPMOVUSDBMasked128": true,
"VPMOVWBMasked128": true, "VPMOVSWBMasked128": true, "VPMOVUSWBMasked128": true,
}
func machineOpName(maskType maskShape, gOp Operation) string {
asm := gOp.Asm
if maskType == OneMask {
asm += "Masked"
}
asm = fmt.Sprintf("%s%d", asm, gOp.VectorWidth())
if gOp.SSAVariant != nil {
asm += *gOp.SSAVariant
}
if demotingConvertOps[asm] {
// Need to append the size of the source as well.
// TODO: should be "%sto%d".
asm = fmt.Sprintf("%s_%d", asm, *gOp.In[0].Bits)
}
return asm
}
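// For example (illustrative): a 256-bit VPADDD with a mask operand becomes
// "VPADDDMasked256", and the demoting conversion VPMOVWB with a 256-bit source
// becomes "VPMOVWB128_256" so it is distinguishable from the 128-bit-source form.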
func compareStringPointers(x, y *string) int {
if x != nil && y != nil {
return compareNatural(*x, *y)
}
if x == nil && y == nil {
return 0
}
if x == nil {
return -1
}
return 1
}
func compareIntPointers(x, y *int) int {
if x != nil && y != nil {
return *x - *y
}
if x == nil && y == nil {
return 0
}
if x == nil {
return -1
}
return 1
}
func compareOperations(x, y Operation) int {
if c := compareNatural(x.Go, y.Go); c != 0 {
return c
}
xIn, yIn := x.In, y.In
if len(xIn) > len(yIn) && xIn[len(xIn)-1].Class == "mask" {
xIn = xIn[:len(xIn)-1]
} else if len(xIn) < len(yIn) && yIn[len(yIn)-1].Class == "mask" {
yIn = yIn[:len(yIn)-1]
}
if len(xIn) < len(yIn) {
return -1
}
if len(xIn) > len(yIn) {
return 1
}
if len(x.Out) < len(y.Out) {
return -1
}
if len(x.Out) > len(y.Out) {
return 1
}
for i := range xIn {
ox, oy := &xIn[i], &yIn[i]
if c := compareOperands(ox, oy); c != 0 {
return c
}
}
return 0
}
func compareOperands(x, y *Operand) int {
if c := compareNatural(x.Class, y.Class); c != 0 {
return c
}
if x.Class == "immediate" {
return compareStringPointers(x.ImmOffset, y.ImmOffset)
} else {
if c := compareStringPointers(x.Base, y.Base); c != 0 {
return c
}
if c := compareIntPointers(x.ElemBits, y.ElemBits); c != 0 {
return c
}
if c := compareIntPointers(x.Bits, y.Bits); c != 0 {
return c
}
return 0
}
}
type Operand struct {
Class string // One of "mask", "immediate", "vreg", "greg", and "memory"
Go *string // Go type of this operand
AsmPos int // Position of this operand in the assembly instruction
Base *string // Base Go type ("int", "uint", "float")
ElemBits *int // Element bit width
Bits *int // Total vector bit width
Const *string // Optional constant value for immediates.
// Optional immediate arg offsets. If this field is non-nil,
// This operand will be an immediate operand:
// The compiler will right-shift the user-passed value by ImmOffset and set it as the AuxInt
// field of the operation.
ImmOffset *string
Name *string // optional name in the Go intrinsic declaration
Lanes *int // *Lanes equals Bits/ElemBits except for scalars, when *Lanes == 1
// TreatLikeAScalarOfSize means only the lower $TreatLikeAScalarOfSize bits of the vector
// is used, so at the API level we can make it just a scalar value of this size; Then we
// can overwrite it to a vector of the right size during intrinsics stage.
TreatLikeAScalarOfSize *int
// If non-nil, it means the [Class] field is overwritten here, right now this is used to
// overwrite the results of AVX2 compares to masks.
OverwriteClass *string
// If non-nil, it means the [Base] field is overwritten here. This field exists solely
// because Intel's XED data is inconsistent, e.g. VANDNP[SD] marks its operands as int.
OverwriteBase *string
// If non-nil, it means the [ElemBits] field is overwritten. This field exists solely
// because Intel's XED data is inconsistent, e.g. AVX512 VPMADDUBSW marks its operands'
// elemBits as 16 when it should be 8.
OverwriteElementBits *int
// FixedReg is the name of the fixed register, if any.
FixedReg *string
}
// isDigit returns true if the byte is an ASCII digit.
func isDigit(b byte) bool {
return b >= '0' && b <= '9'
}
// compareNatural performs a "natural sort" comparison of two strings.
// It compares non-digit sections lexicographically and digit sections
// numerically. In the case of string-unequal "equal" strings like
// "a01b" and "a1b", strings.Compare breaks the tie.
//
// It returns:
//
// -1 if s1 < s2
// 0 if s1 == s2
// +1 if s1 > s2
func compareNatural(s1, s2 string) int {
i, j := 0, 0
len1, len2 := len(s1), len(s2)
for i < len1 && j < len2 {
// Find a non-digit segment or a number segment in both strings.
if isDigit(s1[i]) && isDigit(s2[j]) {
// Number segment comparison.
numStart1 := i
for i < len1 && isDigit(s1[i]) {
i++
}
num1, _ := strconv.Atoi(s1[numStart1:i])
numStart2 := j
for j < len2 && isDigit(s2[j]) {
j++
}
num2, _ := strconv.Atoi(s2[numStart2:j])
if num1 < num2 {
return -1
}
if num1 > num2 {
return 1
}
// If numbers are equal, continue to the next segment.
} else {
// Non-digit comparison.
if s1[i] < s2[j] {
return -1
}
if s1[i] > s2[j] {
return 1
}
i++
j++
}
}
// deal with a01b vs a1b; there needs to be an order.
return strings.Compare(s1, s2)
}
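// Illustrative behavior (a sketch, not part of this change):
// compareNatural("Int8x16", "Int8x32") // < 0: 16 sorts before 32 numerically
// compareNatural("x10", "x9")          // > 0: 10 > 9, even though '1' < '9' lexically
// compareNatural("a1b", "a01b")        // != 0: strings.Compare breaks the tie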
const generatedHeader = `// Code generated by x/arch/internal/simdgen using 'go run . -xedPath $XED_PATH -o godefs -goroot $GOROOT go.yaml types.yaml categories.yaml'; DO NOT EDIT.
`
func writeGoDefs(path string, cl unify.Closure) error {
// TODO: Merge operations with the same signature but multiple
// implementations (e.g., SSE vs AVX)
var ops []Operation
for def := range cl.All() {
var op Operation
if !def.Exact() {
continue
}
if err := def.Decode(&op); err != nil {
log.Println(err.Error())
log.Println(def)
continue
}
// TODO: verify that this is safe.
op.sortOperand()
ops = append(ops, op)
}
slices.SortFunc(ops, compareOperations)
// The parsed XED data might contain duplicates, like
// 512 bits VPADDP.
deduped := dedup(ops)
slices.SortFunc(deduped, compareOperations)
if *Verbose {
log.Printf("dedup len: %d\n", len(ops))
}
var err error
if err = overwrite(deduped); err != nil {
return err
}
if *Verbose {
log.Printf("dedup len: %d\n", len(deduped))
}
if *Verbose {
log.Printf("dedup len: %d\n", len(deduped))
}
if !*FlagNoDedup {
// TODO: This can hide mistakes in the API definitions, especially when
// multiple patterns result in the same API unintentionally. Make it stricter.
if deduped, err = dedupGodef(deduped); err != nil {
return err
}
}
if *Verbose {
log.Printf("dedup len: %d\n", len(deduped))
}
if !*FlagNoConstImmPorting {
if err = copyConstImm(deduped); err != nil {
return err
}
}
if *Verbose {
log.Printf("dedup len: %d\n", len(deduped))
}
reportXEDInconsistency(deduped)
typeMap := parseSIMDTypes(deduped)
formatWriteAndClose(writeSIMDTypes(typeMap), path, "src/"+simdPackage+"/types_amd64.go")
formatWriteAndClose(writeSIMDFeatures(deduped), path, "src/"+simdPackage+"/cpu.go")
f, fI := writeSIMDStubs(deduped, typeMap)
formatWriteAndClose(f, path, "src/"+simdPackage+"/ops_amd64.go")
formatWriteAndClose(fI, path, "src/"+simdPackage+"/ops_internal_amd64.go")
formatWriteAndClose(writeSIMDIntrinsics(deduped, typeMap), path, "src/cmd/compile/internal/ssagen/simdintrinsics.go")
formatWriteAndClose(writeSIMDGenericOps(deduped), path, "src/cmd/compile/internal/ssa/_gen/simdgenericOps.go")
formatWriteAndClose(writeSIMDMachineOps(deduped), path, "src/cmd/compile/internal/ssa/_gen/simdAMD64ops.go")
formatWriteAndClose(writeSIMDSSA(deduped), path, "src/cmd/compile/internal/amd64/simdssa.go")
writeAndClose(writeSIMDRules(deduped).Bytes(), path, "src/cmd/compile/internal/ssa/_gen/simdAMD64.rules")
return nil
}


@@ -0,0 +1,281 @@
// Copyright 2025 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// simdgen is an experiment in generating Go <-> asm SIMD mappings.
//
// Usage: simdgen [-xedPath=path] [-q=query] input.yaml...
//
// If -xedPath is provided, one of the inputs is a sum of op-code definitions
// generated from the Intel XED data at path.
//
// If input YAML files are provided, each file is read as an input value. See
// [unify.Closure.UnmarshalYAML] or "go doc unify.Closure.UnmarshalYAML" for the
// format of these files.
//
// TODO: Example definitions and values.
//
// The command unifies across all of the inputs and prints all possible results
// of this unification.
//
// If the -q flag is provided, its string value is parsed as a value and treated
// as another input to unification. This is intended as a way to "query" the
// result, typically by narrowing it down to a small subset of results.
//
// Typical usage:
//
// go run . -xedPath $XEDPATH *.yaml
//
// To see just the definitions generated from XED, run:
//
// go run . -xedPath $XEDPATH
//
// (This works because if there's only one input, there's nothing to unify it
// with, so the result is simply itself.)
//
// To see just the definitions for VPADDQ:
//
// go run . -xedPath $XEDPATH -q '{asm: VPADDQ}'
//
// simdgen can also generate Go definitions of the SIMD mappings.
// To generate Go files into the Go root, run:
//
// go run . -xedPath $XEDPATH -o godefs -goroot $PATH/TO/go go.yaml categories.yaml types.yaml
//
// types.yaml is already written; it specifies the shapes of vectors.
// categories.yaml and go.yaml contain definitions that unify with types.yaml and the XED
// data; you can find an example in ops/AddSub/.
//
// When generating Go definitions, simdgen does 3 "magics":
//   - It splits masked operations (with the op's [Masked] field set) into const and non-const:
//   - One is a normal masked operation, the original.
//   - The other has its mask operand's [Const] field set to "K0".
//   - This way the user does not need to provide a separate "K0"-masked operation def.
//
//   - It deduplicates intrinsic names that have duplicates:
//   - If two operations share the same signature and one is AVX512 while the other
//     predates AVX512, the non-AVX512 one will be selected.
//   - This happens often when an operation is defined both before and after AVX512.
//     This way the user does not need to provide a separate "K0" operation for the
//     AVX512 counterpart.
//
//   - It copies the op's [ConstImm] field to its immediate operand's [Const] field.
//   - This way the user does not need to provide verbose op definitions that differ
//     only in the const immediate field. This is useful to reduce the verbosity of
//     compares with immediate control predicates.
//
// These 3 magics can be disabled with the -nosplitmask, -nodedup or
// -noconstimmporting flags.
//
// simdgen right now only supports amd64; -arch=$OTHERARCH will trigger a fatal error.
package main
// Big TODOs:
//
// - This can produce duplicates, which can also lead to less efficient
// environment merging. Add hashing and use it for deduplication. Be careful
// about how this shows up in debug traces, since it could make things
// confusing if we don't show it happening.
//
// - Do I need Closure, Value, and Domain? It feels like I should only need two
// types.
import (
"cmp"
"flag"
"fmt"
"log"
"maps"
"os"
"path/filepath"
"runtime/pprof"
"slices"
"strings"
"simd/_gen/unify"
"gopkg.in/yaml.v3"
)
var (
xedPath = flag.String("xedPath", "", "load XED datafiles from `path`")
flagQ = flag.String("q", "", "query: read `def` as another input (skips final validation)")
flagO = flag.String("o", "yaml", "output type: yaml, godefs (generate definitions into a Go source tree)")
flagGoDefRoot = flag.String("goroot", ".", "the path to the Go dev directory that will receive the generated files")
FlagNoDedup = flag.Bool("nodedup", false, "disable deduplicating godefs of 2 qualifying operations from different extensions")
FlagNoConstImmPorting = flag.Bool("noconstimmporting", false, "disable const immediate porting from op to imm operand")
FlagArch = flag.String("arch", "amd64", "the target architecture")
Verbose = flag.Bool("v", false, "verbose")
flagDebugXED = flag.Bool("debug-xed", false, "show XED instructions")
flagDebugUnify = flag.Bool("debug-unify", false, "print unification trace")
flagDebugHTML = flag.String("debug-html", "", "write unification trace to `file.html`")
FlagReportDup = flag.Bool("reportdup", false, "report the duplicate godefs")
flagCPUProfile = flag.String("cpuprofile", "", "write CPU profile to `file`")
flagMemProfile = flag.String("memprofile", "", "write memory profile to `file`")
)
const simdPackage = "simd"
func main() {
flag.Parse()
if *flagCPUProfile != "" {
f, err := os.Create(*flagCPUProfile)
if err != nil {
log.Fatalf("-cpuprofile: %s", err)
}
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
}
if *flagMemProfile != "" {
f, err := os.Create(*flagMemProfile)
if err != nil {
log.Fatalf("-memprofile: %s", err)
}
defer func() {
pprof.WriteHeapProfile(f)
f.Close()
}()
}
var inputs []unify.Closure
if *FlagArch != "amd64" {
log.Fatalf("simdgen only supports amd64")
}
// Load XED into a defs set.
if *xedPath != "" {
xedDefs := loadXED(*xedPath)
inputs = append(inputs, unify.NewSum(xedDefs...))
}
// Load query.
if *flagQ != "" {
r := strings.NewReader(*flagQ)
def, err := unify.Read(r, "<query>", unify.ReadOpts{})
if err != nil {
log.Fatalf("parsing -q: %s", err)
}
inputs = append(inputs, def)
}
// Load defs files.
must := make(map[*unify.Value]struct{})
for _, path := range flag.Args() {
defs, err := unify.ReadFile(path, unify.ReadOpts{})
if err != nil {
log.Fatal(err)
}
inputs = append(inputs, defs)
if filepath.Base(path) == "go.yaml" {
// These must all be used in the final result
for def := range defs.Summands() {
must[def] = struct{}{}
}
}
}
// Prepare for unification
if *flagDebugUnify {
unify.Debug.UnifyLog = os.Stderr
}
if *flagDebugHTML != "" {
f, err := os.Create(*flagDebugHTML)
if err != nil {
log.Fatal(err)
}
unify.Debug.HTML = f
defer f.Close()
}
// Unify!
unified, err := unify.Unify(inputs...)
if err != nil {
log.Fatal(err)
}
// Validate results.
//
// Don't validate if this is a command-line query because that tends to
// eliminate lots of required defs and is used in cases where maybe defs
// aren't enumerable anyway.
if *flagQ == "" && len(must) > 0 {
validate(unified, must)
}
// Print results.
switch *flagO {
case "yaml":
// Produce a result that looks like encoding a slice, but stream it.
fmt.Println("!sum")
var val1 [1]*unify.Value
for val := range unified.All() {
val1[0] = val
// We have to make a new encoder each time or it'll print a document
// separator between each object.
enc := yaml.NewEncoder(os.Stdout)
if err := enc.Encode(val1); err != nil {
log.Fatal(err)
}
enc.Close()
}
case "godefs":
if err := writeGoDefs(*flagGoDefRoot, unified); err != nil {
log.Fatalf("Failed writing godefs: %+v", err)
}
default:
log.Fatalf("unknown -o output type %q", *flagO)
}
if !*Verbose && *xedPath != "" {
if operandRemarks == 0 {
fmt.Fprintf(os.Stderr, "XED decoding generated no errors, which is unusual.\n")
} else {
fmt.Fprintf(os.Stderr, "XED decoding generated %d \"errors\" which is not cause for alarm, use -v for details.\n", operandRemarks)
}
}
}

func validate(cl unify.Closure, required map[*unify.Value]struct{}) {
// Validate that:
// 1. All final defs are exact
// 2. All required defs are used
for def := range cl.All() {
if _, ok := def.Domain.(unify.Def); !ok {
fmt.Fprintf(os.Stderr, "%s: expected Def, got %T\n", def.PosString(), def.Domain)
continue
}
if !def.Exact() {
fmt.Fprintf(os.Stderr, "%s: def not reduced to an exact value, why is %s:\n", def.PosString(), def.WhyNotExact())
fmt.Fprintf(os.Stderr, "\t%s\n", strings.ReplaceAll(def.String(), "\n", "\n\t"))
}
for root := range def.Provenance() {
delete(required, root)
}
}
// Report unused defs
unused := slices.SortedFunc(maps.Keys(required),
func(a, b *unify.Value) int {
return cmp.Or(
cmp.Compare(a.Pos().Path, b.Pos().Path),
cmp.Compare(a.Pos().Line, b.Pos().Line),
)
})
for _, def := range unused {
// TODO: Can we say anything more actionable? This is always a problem
// with unification: if it fails, it's very hard to point a finger at
// any particular reason. We could go back and try unifying this again
// with each subset of the inputs (starting with individual inputs) to
// at least say "it doesn't unify with anything in x.yaml". That's a lot
// of work, but if we have trouble debugging unification failure it may
// be worth it.
fmt.Fprintf(os.Stderr, "%s: def required, but did not unify (%v)\n",
def.PosString(), def)
}
}


@@ -0,0 +1,37 @@
!sum
- go: Add
commutative: true
documentation: !string |-
// NAME adds corresponding elements of two vectors.
- go: AddSaturated
commutative: true
documentation: !string |-
// NAME adds corresponding elements of two vectors with saturation.
- go: Sub
commutative: false
documentation: !string |-
// NAME subtracts corresponding elements of two vectors.
- go: SubSaturated
commutative: false
documentation: !string |-
// NAME subtracts corresponding elements of two vectors with saturation.
- go: AddPairs
commutative: false
documentation: !string |-
// NAME horizontally adds adjacent pairs of elements.
// For x = [x0, x1, x2, x3, ...] and y = [y0, y1, y2, y3, ...], the result is [y0+y1, y2+y3, ..., x0+x1, x2+x3, ...].
- go: SubPairs
commutative: false
documentation: !string |-
// NAME horizontally subtracts adjacent pairs of elements.
// For x = [x0, x1, x2, x3, ...] and y = [y0, y1, y2, y3, ...], the result is [y0-y1, y2-y3, ..., x0-x1, x2-x3, ...].
- go: AddPairsSaturated
commutative: false
documentation: !string |-
// NAME horizontally adds adjacent pairs of elements with saturation.
// For x = [x0, x1, x2, x3, ...] and y = [y0, y1, y2, y3, ...], the result is [y0+y1, y2+y3, ..., x0+x1, x2+x3, ...].
- go: SubPairsSaturated
commutative: false
documentation: !string |-
// NAME horizontally subtracts adjacent pairs of elements with saturation.
// For x = [x0, x1, x2, x3, ...] and y = [y0, y1, y2, y3, ...], the result is [y0-y1, y2-y3, ..., x0-x1, x2-x3, ...].
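
As a rough illustration of how these defs are consumed: the godefs backend substitutes the concrete method name for NAME in the documentation template above. A minimal sketch of the kind of declaration that could result for the Add entry follows; the receiver type Int32x4 and the exact signature are assumptions made here for illustration, not taken from the generated sources.

	// Hypothetical sketch only; the generated code, build tags, and any
	// CPU-feature annotations may differ.

	// Add adds corresponding elements of two vectors.
	func (x Int32x4) Add(y Int32x4) Int32x4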

(Some files were not shown because too many files have changed in this diff.)