This vector (V) version cannot beat the Zbb implementation, and it is
unlikely that a meaningful real-world CPU design would support V but
not Zbb. The best loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.
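For reference, that rewrite boils down to the classic shift-and-mask
byte swap applied per element. A scalar C sketch of the same operation
mix (with V, each line maps to the corresponding vector shift, AND or
OR instruction):

    #include <stdint.h>

    /* Classic byte swap as 4 shifts, 2 ANDs and 3 ORs; with RVV the
     * same sequence applies lane-wise to a whole vector register. */
    static uint32_t bswap32_shift_mask(uint32_t x)
    {
        return  (x << 24)               /* byte 0 -> byte 3 */
              | ((x & 0x0000ff00) << 8) /* byte 1 -> byte 2 */
              | ((x >> 8) & 0x0000ff00) /* byte 2 -> byte 1 */
              |  (x >> 24);             /* byte 3 -> byte 0 */
    }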
A genuinely faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced earlier than Zba.
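A minimal sketch of the general idea behind the fix (not the actual
FFmpeg change): gate on each extension's own feature-test macro rather
than inferring one extension from another. The standard RISC-V C-API
macros __riscv_zba, __riscv_zbb and __riscv_v are defined independently
by the toolchain:

    /* Hypothetical guard: check each extension separately instead of
     * assuming that Zbb or V implies Zba. */
    #if defined(__riscv_zbb) && defined(__riscv_zba)
    /* May use REV8 *and* SH1ADD/SH2ADD/SH3ADD address generation. */
    #elif defined(__riscv_zbb)
    /* REV8 is fine, but plain ADD/SLLI must replace SHxADD. */
    #endif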
Simply using the Zbb REV8 instruction in a simple loop gives some
significant savings:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0
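For illustration, the simple loop is essentially the following (a C
sketch, not the committed assembly; GCC and Clang lower
__builtin_bswap32 to REV8 plus a shift when targeting RV64 with Zbb):

    #include <stddef.h>
    #include <stdint.h>

    /* One 32-bit sample per iteration; each __builtin_bswap32 becomes
     * a REV8 (which reverses all 8 bytes of the 64-bit register)
     * followed by a 32-bit shift. */
    static void bswap_buf_simple(uint32_t *dst, const uint32_t *src,
                                 size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = __builtin_bswap32(src[i]);
    }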
But we can also use the 64-bit REV8 as a pseudo-SIMD instruction:
handling two samples per iteration costs just one additional shift and
one fewer load, effectively doubling the bandwidth. Consequently, this
patch is useful even if the compile-time target has Zbb enabled for C
code:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0 (this patch)
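In C terms, the trick looks roughly like this (a sketch assuming an
even sample count; the actual patch is hand-written assembly):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Two 32-bit samples per iteration: one 64-bit load, one REV8
     * (which byte-swaps both words but also exchanges their
     * positions), and one rotate by 32 (Zbb RORI) to restore the
     * word order. */
    static void bswap_buf_pairwise(uint32_t *dst, const uint32_t *src,
                                   size_t len)
    {
        for (size_t i = 0; i < len; i += 2) {
            uint64_t v;
            memcpy(&v, src + i, sizeof (v));
            v = __builtin_bswap64(v);  /* single REV8 */
            v = (v << 32) | (v >> 32); /* the one additional shift */
            memcpy(dst + i, &v, sizeof (v));
        }
    }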
On the other hand, this approach fails miserably for bswap16_buf, as
the ratio of shifts to stores becomes unfavorable compared to naïve C:
bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7
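The problem is easy to see in a sketch: after REV8 the four 16-bit
lanes come out in reversed order, and undoing that costs several extra
shift and mask operations per 64-bit store, whereas naïve C needs only
a couple of operations per sample. A hypothetical illustration, not
necessarily what the rejected code did:

    #include <stdint.h>

    /* Four 16-bit samples at once: REV8 swaps the bytes but also
     * reverses the lane order, which takes two more shuffle steps
     * to undo, hence too many shifts per store. */
    static uint64_t bswap16_x4(uint64_t v)
    {
        v = __builtin_bswap64(v);               /* REV8 */
        v = ((v & 0x0000ffff0000ffffULL) << 16) /* swap adjacent */
          | ((v >> 16) & 0x0000ffff0000ffffULL);/* halfwords */
        return (v << 32) | (v >> 32);           /* swap 32-bit halves */
    }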
Unrolling to process 128 bits (4 samples) at a time actually worsens
performance somewhat:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5