ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2025-12-08 06:09:50 +00:00

Author	SHA1	Message	Date
Rémi Denis-Courmont	616fdeaea3	lavc/riscv: depend on RVB and simplify accordingly. There is no known (real) hardware with V and without the complete B extension. B was indeed required in the RISC-V application profile from 2022, earlier than V. There should not be any relevant hardware in the future either. In practice, different R-V Vector optimisations in FFmpeg already depend on every constituent of the B extension anyhow, so it would not work well.	2024-08-05 21:16:26 +03:00
Rémi Denis-Courmont	121fb846b9	lavc/vp7dsp: add R-V V vp7_idct_dc_add4uv. This is almost the same story as vp7_idct_dc_add4y. We just have to use strided loads of 2 64-bit elements to account for the different data layout in memory. T-Head C908: vp7_idct_dc_add4uv_c: 7.5; vp7_idct_dc_add4uv_rvv_i64: 2.0; vp8_idct_dc_add4uv_c: 6.2; vp8_idct_dc_add4uv_rvv_i32: 2.2 (before); vp8_idct_dc_add4uv_rvv_i64: 2.0. SpacemiT X60: vp7_idct_dc_add4uv_c: 6.7; vp7_idct_dc_add4uv_rvv_i64: 2.2; vp8_idct_dc_add4uv_c: 5.7; vp8_idct_dc_add4uv_rvv_i32: 2.5 (before); vp8_idct_dc_add4uv_rvv_i64: 2.0	2024-06-04 17:42:07 +03:00
Rémi Denis-Courmont	4e120fbbbd	lavc/vp8dsp: add R-V V vp7_idct_dc_add4y. As with idct_dc_add, most of the code is shared with, and replaces, the previous VP8 function. To improve performance, we break down the 16x4 matrix into 4 rows, rather than 4 squares. Thus strided loads and stores are avoided, and the 4 DC calculations are vectored. Unfortunately this requires a vector gather to splat the DC values, but overall this is still a win for performance. T-Head C908: vp7_idct_dc_add4y_c: 7.2; vp7_idct_dc_add4y_rvv_i32: 2.2; vp8_idct_dc_add4y_c: 6.2; vp8_idct_dc_add4y_rvv_i32: 2.2 (before); vp8_idct_dc_add4y_rvv_i32: 1.7. SpacemiT X60: vp7_idct_dc_add4y_c: 6.2; vp7_idct_dc_add4y_rvv_i32: 2.0; vp8_idct_dc_add4y_c: 5.5; vp8_idct_dc_add4y_rvv_i32: 2.5 (before); vp8_idct_dc_add4y_rvv_i32: 1.7. I also tried to provision the DC values using indexed loads. It ends up slower overall, especially for VP7, as we then have to compute 16 DCs instead of just 4.	2024-06-04 17:40:41 +03:00
Rémi Denis-Courmont	30797e4ff6	lavc/vp8dsp: add R-V V vp7_idct_dc_add. This just computes the DC coefficient and hands over to code shared with VP8. Accordingly the bulk of changes are just rewriting the VP8 code to share. Nothing to write home about: vp7_idct_dc_add_c: 1.7; vp7_idct_dc_add_rvv_i32: 1.2	2024-06-04 17:40:36 +03:00
Rémi Denis-Courmont	fa3b153cb1	lavc/vp7dsp: R-V V vp7_idct_add. Most of the code is shared with DC, thanks to minor earlier changes. vp7_idct_add_c: 5.2; vp7_idct_add_rvv_i32: 2.5	2024-05-29 16:57:02 +03:00
Rémi Denis-Courmont	4a0e629b6f	lavc/vp7dsp: revector ff_vp7_dc_wht_rvv. This prepares for some code reuse.	2024-05-29 16:57:02 +03:00
Rémi Denis-Courmont	fd39997f72	lavc/vp7dsp: add R-V V vp7_luma_dc_wht. This works out a bit more favourably than VP8's due to additional multiplications that can be vectored and a hardware-supported fixed-point rounding mode. vp7_luma_dc_wht_c: 3.2; vp7_luma_dc_wht_rvv_i64: 2.0	2024-05-29 16:57:02 +03:00

7 commits
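
The vp7_idct_dc_add and vp7_idct_dc_add4y commits above describe two ideas: VP7 and VP8 differ only in how the DC value is derived, and the 16x4 luma strip can be walked as 4 rows of 16 pixels rather than 4 squares of 4x4. The following is a rough scalar C sketch of both, not the RISC-V assembly from these commits. The helper names (clip_u8, vp8_dc, vp7_dc, idct_dc_add4y) and the function-pointer parameter are hypothetical, and the DC formulas are written from memory of the C reference code in libavcodec/vp8dsp.c, so they should be checked against the tree before being relied upon.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range (hypothetical stand-in for av_clip_uint8). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Assumed VP8 DC-only transform: a rounded shift.
 * Like FFmpeg, this relies on arithmetic right shift of negative ints. */
static int vp8_dc(int16_t coeff)
{
    return (coeff + 4) >> 3;
}

/* Assumed VP7 DC-only transform: two fixed-point multiplications
 * by 23170, followed by a rounded shift. */
static int vp7_dc(int16_t coeff)
{
    return (23170 * ((23170 * coeff) >> 14) + 0x20000) >> 18;
}

/*
 * Add the DC values of 4 adjacent 4x4 blocks to a 16x4 strip of pixels.
 * The strip is processed row by row: each row is 16 contiguous bytes,
 * and dc[x >> 2] picks the DC of the block that column x belongs to.
 * That lookup is the scalar analogue of the gather used to splat the
 * DC values in the vector version.
 */
static void idct_dc_add4y(uint8_t *dst, int16_t block[4][16],
                          ptrdiff_t stride, int (*dc_of)(int16_t))
{
    int dc[4];

    /* Only 4 DC values are computed, one per block. */
    for (int i = 0; i < 4; i++) {
        dc[i] = dc_of(block[i][0]);
        block[i][0] = 0; /* the coefficient is consumed */
    }

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = clip_u8(dst[x] + dc[x >> 2]);
        dst += stride;
    }
}
```

Passing vp7_dc instead of vp8_dc as the last argument gives the VP7 variant; everything after the DC computation is shared, which mirrors the code sharing the commit messages describe.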