This vector (V) version cannot beat the Zbb implementation, and it is
unlikely that a meaningful real-world CPU design would support V but
not Zbb. The best loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.
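For reference, that rewrite boils down to the classic shift-and-mask
byte swap applied per element. A scalar C sketch of the same operation
mix (with V, each line maps to the corresponding vector shift, AND or
OR instruction):

    #include <stdint.h>

    /* Classic byte swap as 4 shifts, 2 ANDs and 3 ORs; with RVV the
     * same sequence applies lane-wise to a whole vector register. */
    static uint32_t bswap32_shift_mask(uint32_t x)
    {
        return  (x << 24)               /* byte 0 -> byte 3 */
              | ((x & 0x0000ff00) << 8) /* byte 1 -> byte 2 */
              | ((x >> 8) & 0x0000ff00) /* byte 2 -> byte 1 */
              |  (x >> 24);             /* byte 3 -> byte 0 */
    }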
A genuinely faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced earlier than Zba.
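A minimal sketch of the general idea behind the fix (not the actual
FFmpeg change): gate on each extension's own feature-test macro rather
than inferring one extension from another. The standard RISC-V C-API
macros __riscv_zba, __riscv_zbb and __riscv_v are defined independently
by the toolchain:

    /* Hypothetical guard: check each extension separately instead of
     * assuming that Zbb or V implies Zba. */
    #if defined(__riscv_zbb) && defined(__riscv_zba)
    /* May use REV8 *and* SH1ADD/SH2ADD/SH3ADD address generation. */
    #elif defined(__riscv_zbb)
    /* REV8 is fine, but plain ADD/SLLI must replace SHxADD. */
    #endif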
Simply using the Zbb REV8 instruction in a simple loop gives some
significant savings:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0
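For illustration, the simple loop is essentially the following (a C
sketch, not the committed assembly; GCC and Clang lower
__builtin_bswap32 to REV8 plus a shift when targeting RV64 with Zbb):

    #include <stddef.h>
    #include <stdint.h>

    /* One 32-bit sample per iteration; each __builtin_bswap32 becomes
     * a REV8 (which reverses all 8 bytes of the 64-bit register)
     * followed by a 32-bit shift. */
    static void bswap_buf_simple(uint32_t *dst, const uint32_t *src,
                                 size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = __builtin_bswap32(src[i]);
    }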
But we can also use the 64-bit REV8 as a pseudo-SIMD instruction:
handling two samples per iteration costs just one additional shift and
one fewer load, effectively doubling the bandwidth. Consequently, this
patch is useful even if the compile-time target has Zbb enabled for C
code:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0 (this patch)
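In C terms, the trick looks roughly like this (a sketch assuming an
even sample count; the actual patch is hand-written assembly):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Two 32-bit samples per iteration: one 64-bit load, one REV8
     * (which byte-swaps both words but also exchanges their
     * positions), and one rotate by 32 (Zbb RORI) to restore the
     * word order. */
    static void bswap_buf_pairwise(uint32_t *dst, const uint32_t *src,
                                   size_t len)
    {
        for (size_t i = 0; i < len; i += 2) {
            uint64_t v;
            memcpy(&v, src + i, sizeof (v));
            v = __builtin_bswap64(v);  /* single REV8 */
            v = (v << 32) | (v >> 32); /* the one additional shift */
            memcpy(dst + i, &v, sizeof (v));
        }
    }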
On the other hand, this approach fails miserably for bswap16_buf, as
the ratio of shifts to stores becomes unfavorable compared to naïve C:
bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7
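The problem is easy to see in a sketch: after REV8 the four 16-bit
lanes come out in reversed order, and undoing that costs several extra
shift and mask operations per 64-bit store, whereas naïve C needs only
a couple of operations per sample. A hypothetical illustration, not
necessarily what the rejected code did:

    #include <stdint.h>

    /* Four 16-bit samples at once: REV8 swaps the bytes but also
     * reverses the lane order, which takes two more shuffle steps
     * to undo, hence too many shifts per store. */
    static uint64_t bswap16_x4(uint64_t v)
    {
        v = __builtin_bswap64(v);               /* REV8 */
        v = ((v & 0x0000ffff0000ffffULL) << 16) /* swap adjacent */
          | ((v >> 16) & 0x0000ffff0000ffffULL);/* halfwords */
        return (v << 32) | (v >> 32);           /* swap 32-bit halves */
    }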
Unrolling to process 128 bits (4 samples) at a time actually worsens
performance somewhat:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5