Commit graph

7 commits

Author SHA1 Message Date
Rémi Denis-Courmont
616fdeaea3 lavc/riscv: depend on RVB and simplify accordingly
There is no known (real) hardware with V and without the complete B
extension. B was indeed required in the RISC-V application profile from
2022, earlier than V. There should not be any relevant hardware in the
future either.

In practice, different R-V Vector optimisations in FFmpeg already depend on
every constituent of the B extension anyhow, so it would not work well.
2024-08-05 21:16:26 +03:00
Rémi Denis-Courmont
121fb846b9 lavc/vp7dsp: add R-V V vp7_idct_dc_add4uv
This is almost the same story as vp7_idct_add4y. We just have to use
strided loads of 2 64-bit elements to account for the different data
layout in memory.

T-Head C908:
vp7_idct_dc_add4uv_c:       7.5
vp7_idct_dc_add4uv_rvv_i64: 2.0
vp8_idct_dc_add4uv_c:       6.2
vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0

SpacemiT X60:
vp7_idct_dc_add4uv_c:       6.7
vp7_idct_dc_add4uv_rvv_i64: 2.2
vp8_idct_dc_add4uv_c:       5.7
vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
2024-06-04 17:42:07 +03:00
Rémi Denis-Courmont
4e120fbbbd lavc/vp8dsp: add R-V V vp7_idct_dc_add4y
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectored.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:

T-Head C908:
vp7_idct_dc_add4y_c:       7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c:       6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7

SpacemiT X60:
vp7_idct_dc_add4y_c:       6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c:       5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7

I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DC's
instead of just 4.
2024-06-04 17:40:41 +03:00
Rémi Denis-Courmont
30797e4ff6 lavc/vp8dsp: add R-V V vp7_idct_dc_add
This just computes the direct coefficient and hands over to code shared
with VP8. Accordingly the bulk of changes are just rewriting the VP8
code to share.

Nothing to write home about:
vp7_idct_dc_add_c:       1.7
vp7_idct_dc_add_rvv_i32: 1.2
2024-06-04 17:40:36 +03:00
Rémi Denis-Courmont
fa3b153cb1 lavc/vp7dsp: R-V V vp7_idct_add
Most of the code is shared with DC, thanks to minor earlier changes.

vp7_idct_add_c:       5.2
vp7_idct_add_rvv_i32: 2.5
2024-05-29 16:57:02 +03:00
Rémi Denis-Courmont
4a0e629b6f lavc/vp7dsp: revector ff_vp7_dc_wht_rvv
This prepares for some code reuse.
2024-05-29 16:57:02 +03:00
Rémi Denis-Courmont
fd39997f72 lavc/vp7dsp: add R-V V vp7_luma_dc_wht
This works out a bit more favourably than VP8's due to:
- additional multiplications that can be vectored,
- hardware-supported fixed-point rounding mode.

vp7_luma_dc_wht_c:       3.2
vp7_luma_dc_wht_rvv_i64: 2.0
2024-05-29 16:57:02 +03:00