Commit graph

26 commits

Author SHA1 Message Date
Rémi Denis-Courmont
bd226fdd74 lavc/h264dsp: R-V V intra loop filter
As with the inter loop filter, performance metrics seem to be biased in
favour of the C implementation because checkasm inputs almost always
fall in the no-op case.

h264_h_loop_filter_chroma_intra_8bpp_c:                 82.8 ( 1.00x)
h264_h_loop_filter_chroma_intra_8bpp_rvv_i32:           72.6 ( 1.14x)
h264_h_loop_filter_chroma_mbaff_intra_8bpp_c:           41.1 ( 1.00x)
h264_h_loop_filter_chroma_mbaff_intra_8bpp_rvv_i32:     72.6 ( 0.57x)
h264_h_loop_filter_luma_intra_8bpp_c:                  166.1 ( 1.00x)
h264_h_loop_filter_luma_intra_8bpp_rvv_i32:            395.4 ( 0.42x)
h264_h_loop_filter_luma_mbaff_intra_8bpp_c:             93.3 ( 1.00x)
h264_h_loop_filter_luma_mbaff_intra_8bpp_rvv_i32:      395.4 ( 0.24x)
h264_v_loop_filter_chroma_intra_8bpp_c:                134.8 ( 1.00x)
h264_v_loop_filter_chroma_intra_8bpp_rvv_i32:           51.6 ( 2.61x)
h264_v_loop_filter_luma_intra_8bpp_c:                  468.1 ( 1.00x)
h264_v_loop_filter_luma_intra_8bpp_rvv_i32:            134.8 ( 3.47x)
2024-12-17 09:00:28 +02:00
Rémi Denis-Courmont
6611bf5484 lavc/h264dsp: optimise R-V V biweight for shorter heights
T-Head C908:
h264_biweight2_8_c:                                    313.7 ( 1.00x)
h264_biweight2_8_rvv_i32:              before          239.5 ( 1.23x)
h264_biweight2_8_rvv_i32:              after            72.7 ( 4.31x)
h264_biweight4_8_c:                                    582.0 ( 1.00x)
h264_biweight4_8_rvv_i32:              before          471.0 ( 1.16x)
h264_biweight4_8_rvv_i32:              after            91.5 ( 6.36x)
h264_biweight8_8_c:                                   1110.0 ( 1.00x)
h264_biweight8_8_rvv_i32:              before          943.3 ( 1.10x)
h264_biweight8_8_rvv_i64:              after           147.0 ( 7.55x)

SpacemiT X60:
h264_biweight2_8_c:                                    311.4 ( 1.00x)
h264_biweight2_8_rvv_i32:              before          363.1 ( 0.83x)
h264_biweight2_8_rvv_i32:              after           103.1 ( 3.02x)
h264_biweight4_8_c:                                    571.9 ( 1.00x)
h264_biweight4_8_rvv_i32:              before          717.4 ( 0.78x)
h264_biweight4_8_rvv_i32:              after            71.8 ( 7.96x)
h264_biweight8_8_c:                                   1103.1 ( 1.00x)
h264_biweight8_8_rvv_i32:              before         1415.2 ( 0.76x)
h264_biweight8_8_rvv_i64:              ater             92.8 (11.88x)
2024-09-24 20:04:51 +03:00
Rémi Denis-Courmont
459a1512f1 lavc/h264dsp: unroll R-V V weight16
As VLSE128.V does not exist, we have no other way to deal with latency.

T-Head C908:
h264_weight16_8_c:                                     989.4 ( 1.00x)
h264_weight16_8_rvv_i32:                               193.2 ( 5.12x)

SpacemiT X60:
h264_weight16_8_c:                                     874.1 ( 1.00x)
h264_weight16_8_rvv_i32:                               196.9 ( 4.44x)
2024-09-24 20:04:51 +03:00
Rémi Denis-Courmont
4936bb2508 lavc/h264dsp: optimise R-V V weight for shorter heights
The height is a power of two of up to 16 rows. The current code was
optimised for large sample counts.

T-Head C908:
h264_weight2_8_c:                                      211.7 ( 1.00x)
h264_weight2_8_rvv_i32:                   before       184.0 ( 1.15x)
h264_weight2_8_rvv_i32:                   after         54.2 ( 3.90x)
h264_weight4_8_c:                                      285.7 ( 1.00x)
h264_weight4_8_rvv_i32:                   before       341.2 ( 0.86x)
h264_weight4_8_rvv_i32:                   after         82.2 ( 3.47x)
h264_weight8_8_c:                                      498.7 ( 1.00x)
h264_weight8_8_rvv_i32:                   before       683.7 ( 0.73x)
h264_weight8_8_rvv_i64:                   after        128.5 ( 3.95x)
h264_weight16_8_c:                                     878.2 ( 1.00x)
h264_weight16_8_rvv_i32:                  unchanged    239.5 ( 3.67x)

SpacemiT X60:
h264_weight2_8_c:                                      207.2 ( 1.00x)
h264_weight2_8_rvv_i32:                   before       259.6 ( 0.80x)
h264_weight2_8_rvv_i32:                   after         82.2 ( 2.52x)
h264_weight4_8_c:                                      290.8 ( 1.00x)
h264_weight4_8_rvv_i32:                   before       509.6 ( 0.57x)
h264_weight4_8_rvv_i32:                   after         61.5 ( 4.73x)
h264_weight8_8_c:                                      498.8 ( 1.00x)
h264_weight8_8_rvv_i32:                   before      1019.8 ( 0.49x)
h264_weight8_8_rvv_i64:                   after         71.8 ( 6.95x)
h264_weight16_8_c:                                     874.0 ( 1.00x)
h264_weight16_8_rvv_i32:                  unchanged    249.0 ( 3.51x)
2024-09-24 20:04:51 +03:00
Rémi Denis-Courmont
7d1dda4892 lavc/h264dsp: R-V V loop_filter_chroma
T-Head C908:
h264_v_loop_filter_chroma_8bpp_c:      137.4
h264_v_loop_filter_chroma_8bpp_rvv_i32: 54.2
2024-09-01 10:58:48 +03:00
Rémi Denis-Courmont
616fdeaea3 lavc/riscv: depend on RVB and simplify accordingly
There is no known (real) hardware with V and without the complete B
extension. B was indeed required in the RISC-V application profile from
2022, earlier than V. There should not be any relevant hardware in the
future either.

In practice, different R-V Vector optimisations in FFmpeg already depend on
every constituent of the B extension anyhow, so it would not work well.
2024-08-05 21:16:26 +03:00
Rémi Denis-Courmont
4edfc11a28 lavc/h264dsp: R-V V idct4_add8 (all depths)
These are really just wrappers for idct4_add16intra functions, which are in
turn mostly wrappers for idct4_add and idct4_dc_add functions.

For benchmarks refer to the later two sets.
2024-08-05 21:16:26 +03:00
Rémi Denis-Courmont
b0b3bea10b lavc/h264dsp: use saturing add/sub for R-V V 8-bit DC add
T-Head C908 (cycles):
h264_idct4_dc_add_8bpp_c:      109.2
h264_idct4_dc_add_8bpp_rvv_i32: 34.5 (before)
h264_idct4_dc_add_8bpp_rvv_i32: 25.5 (after)
h264_idct8_dc_add_8bpp_c:      418.7
h264_idct8_dc_add_8bpp_rvv_i64: 69.5 (before)
h264_idct8_dc_add_8bpp_rvv_i64: 33.5 (after)
2024-07-28 21:24:12 +03:00
Rémi Denis-Courmont
b62586e310 lavc/h264dsp: use RISC-V B extension
This saves one register and one instruction per transform.
add16 and add16intra thus become stack-less.
2024-07-25 23:10:01 +03:00
Rémi Denis-Courmont
483fd732ab lavc/h264dsp: R-V V high-depth idct_add{,intra}16, idct8_add4
As with 8-bit, this tends to be faster, but results are all over the
place due to the variable distribution of non-zero coefficients.
2024-07-18 20:37:09 +03:00
J. Dekker
fa5a605542 avcodec/riscv: add h264 dc idct rvv
checkasm: bench runs 131072 (1 << 17)
h264_idct4_add_dc_8bpp_c: 1.5
h264_idct4_add_dc_8bpp_rvv_i64: 0.7
h264_idct4_add_dc_9bpp_c: 1.5
h264_idct4_add_dc_9bpp_rvv_i64: 0.7
h264_idct4_add_dc_10bpp_c: 1.5
h264_idct4_add_dc_10bpp_rvv_i64: 0.7
h264_idct4_add_dc_12bpp_c: 1.2
h264_idct4_add_dc_12bpp_rvv_i64: 0.7
h264_idct4_add_dc_14bpp_c: 1.2
h264_idct4_add_dc_14bpp_rvv_i64: 0.7
h264_idct8_add_dc_8bpp_c: 5.2
h264_idct8_add_dc_8bpp_rvv_i64: 1.5
h264_idct8_add_dc_9bpp_c: 5.5
h264_idct8_add_dc_9bpp_rvv_i64: 1.2
h264_idct8_add_dc_10bpp_c: 5.5
h264_idct8_add_dc_10bpp_rvv_i64: 1.2
h264_idct8_add_dc_12bpp_c: 4.2
h264_idct8_add_dc_12bpp_rvv_i64: 1.2
h264_idct8_add_dc_14bpp_c: 4.2
h264_idct8_add_dc_14bpp_rvv_i64: 1.2

Signed-off-by: J. Dekker <jdek@itanimul.li>
2024-07-18 02:47:30 +02:00
Rémi Denis-Courmont
3002310b70 lavc/h264dsp: R-V V high-depth add_pixels8
T-Head C908 (cycles);
h264_add_pixels8_9bpp_c:        270.5
h264_add_pixels8_9bpp_rvv_i32:  164.2
h264_add_pixels8_10bpp_c:       270.5
h264_add_pixels8_10bpp_rvv_i32: 164.2
h264_add_pixels8_12bpp_c:       270.5
h264_add_pixels8_12bpp_rvv_i32: 164.2
h264_add_pixels8_14bpp_c:       270.5
h264_add_pixels8_14bpp_rvv_i32: 164.2
2024-07-16 17:25:40 +03:00
Rémi Denis-Courmont
7744c08240 lavc/h264dsp: R-V V add_pixels4 and 8-bit add_pixels8
T-Head C908 (cycles):
h264_add_pixels4_8bpp_c:        93.5
h264_add_pixels4_8bpp_rvv_i32:  39.5
h264_add_pixels4_9bpp_c:        87.5
h264_add_pixels4_9bpp_rvv_i64:  50.5
h264_add_pixels4_10bpp_c:       87.5
h264_add_pixels4_10bpp_rvv_i64: 50.5
h264_add_pixels4_12bpp_c:       87.5
h264_add_pixels4_12bpp_rvv_i64: 50.5
h264_add_pixels4_14bpp_c:       87.5
h264_add_pixels4_14bpp_rvv_i64: 50.5
h264_add_pixels8_8bpp_c:       265.2
h264_add_pixels8_8bpp_rvv_i64:  84.5
2024-07-16 17:25:40 +03:00
Rémi Denis-Courmont
c654e37254 lavc/h264dsp: R-V V high-depth h264_idct8_add
Unlike the 8-bit version, we need two iterations to process this within
128-bit vectors. This adds some extra complexity for pointer arithmetic
and counting down which is unnecessary in the 8-bit variant.

Accordingly the gain relative to C are just slight better than half as
good with 128-bit vectors as with 256-bit ones.

T-Head C908 (2 iterations):
h264_idct8_add_9bpp_c:       17.5
h264_idct8_add_9bpp_rvv_i32: 10.0
h264_idct8_add_10bpp_c:      17.5
h264_idct8_add_10bpp_rvv_i32: 9.7
h264_idct8_add_12bpp_c:      17.7
h264_idct8_add_12bpp_rvv_i32: 9.7
h264_idct8_add_14bpp_c:      17.7
h264_idct8_add_14bpp_rvv_i32: 9.7

SpacemiT X60 (single iteration):
h264_idct8_add_9bpp_c:       15.2
h264_idct8_add_9bpp_rvv_i32:  5.0
h264_idct8_add_10bpp_c:      15.2
h264_idct8_add_10bpp_rvv_i32: 5.0
h264_idct8_add_12bpp_c:      14.7
h264_idct8_add_12bpp_rvv_i32: 5.0
h264_idct8_add_14bpp_c:      14.7
h264_idct8_add_14bpp_rvv_i32: 4.7
2024-07-14 21:06:50 +03:00
Rémi Denis-Courmont
4e0e872881 lavc/h264dsp: R-V V high-depth h264_idct_add
T-Head C908 (cycles):
h264_idct4_add_9bpp_c:        248.2
h264_idct4_add_9bpp_rvv_i32:  128.7
h264_idct4_add_10bpp_c:       256.7
h264_idct4_add_10bpp_rvv_i32: 128.7
h264_idct4_add_12bpp_c:       252.5
h264_idct4_add_12bpp_rvv_i32: 129.7
h264_idct4_add_14bpp_c:       258.0
h264_idct4_add_14bpp_rvv_i32: 129.7
2024-07-14 11:39:35 +03:00
Rémi Denis-Courmont
f1ed351d3b lavc/h264dsp: R-V V 8-bit h264_biweight_pixels
T-Head C908:
h264_biweight2_8_c:        58.0
h264_biweight2_8_rvv_i32:  11.2
h264_biweight4_8_c:       106.0
h264_biweight4_8_rvv_i32:  22.7
h264_biweight8_8_c:       205.7
h264_biweight8_8_rvv_i32:  50.0
h264_biweight16_8_c:      403.5
h264_biweight16_8_rvv_i32: 83.2

SpacemiT X60:
h264_weight2_8_c:          48.2
h264_weight2_8_rvv_i32:     8.2
h264_weight4_8_c:          90.5
h264_weight4_8_rvv_i32:    16.5
h264_weight8_8_c:         175.2
h264_weight8_8_rvv_i32:    38.0
h264_weight16_8_c:        342.2
h264_weight16_8_rvv_i32:   66.0
2024-07-09 18:03:30 +03:00
Rémi Denis-Courmont
3606e592ea lavc/h264dsp: R-V V 8-bit h264_weight_pixels
There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised processing one (fixed-size) row at a time.

Unsurprisingly, the generic one works out better with smaller widths.
With larger widths, the gains from filling vectors are outweighed by
the extra cost of strided loads and stores. In other words, memory
accesses become the bottleneck.

T-Head C908:
h264_weight2_8_c:        54.5
h264_weight2_8_rvv_i32:  13.7
h264_weight4_8_c:       101.7
h264_weight4_8_rvv_i32:  27.5
h264_weight8_8_c:       197.0
h264_weight8_8_rvv_i32:  75.5
h264_weight16_8_c:      385.0
h264_weight16_8_rvv_i32: 74.2

SpacemiT X60:
h264_weight2_8_c:        48.5
h264_weight2_8_rvv_i32:   8.2
h264_weight4_8_c:        90.7
h264_weight4_8_rvv_i32:  16.5
h264_weight8_8_c:       175.0
h264_weight8_8_rvv_i32:  37.7
h264_weight16_8_c:      342.2
h264_weight16_8_rvv_i32: 66.0
2024-07-09 18:03:29 +03:00
Rémi Denis-Courmont
f9d1230224 lavc/h264dsp: R-V V 8-bit h264_idct8_add
T-Head C908 (cycles):
h264_idct8_add_8bpp_c:      1072.0
h264_idct8_add_8bpp_rvv_i32: 318.5
2024-07-07 09:34:32 +03:00
Rémi Denis-Courmont
f447189b0c lavc/h264dsp: R-V V 8-bit h264_idct_add
T-Head C908 (cycles):
h264_idct4_add_8bpp_c:      271.5
h264_idct4_add_8bpp_rvv_i32: 91.5
2024-07-05 20:06:22 +03:00
Rémi Denis-Courmont
e0eff64ed1 lavc/h264dsp: R-V V 8-bit h264_idct8_add4 2024-07-05 18:56:03 +03:00
Rémi Denis-Courmont
d1f0c1fbf8 lavc/h264dsp: R-V V 8-bit h264_idct_add16intra 2024-07-05 18:56:03 +03:00
Rémi Denis-Courmont
30475c95ba lavc/h264dsp: R-V V 8-bit h264_idct_add16
While this *tends* to be faster than plain C, the performance numbers
are all over the place, presuambly due to the conditional character of
the main loop.

Some additional micro-optimisations should be feasible after the
underlying h264_idct_add and h264_idct_dc_add functions are also
implemented. Then it will no longer be necesseray to stricly abide by
the C ABI.
2024-07-05 18:56:02 +03:00
Rémi Denis-Courmont
e2af5904f0 lavc/h264dsp: R-V V 8-bit MBAFF loop filter
Performance is (unfortunately) the same as with non-MBAFF, since the
hardware under test does not short-circuit vector tail calculations.
(IMO, a generic solution or work-around should be agreed on, rather
than bespoke approaches all over the place.)
2024-07-04 19:57:42 +03:00
Rémi Denis-Courmont
5a6e333fc7 lavc/h264dsp: R-V V 8-bit luma loop filter
T-Head C908 (cycles):
h264_h_loop_filter_luma_8bpp_c:       297.5
h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2
h264_v_loop_filter_luma_8bpp_c:       862.7
h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7

Performance in the horizontal scenario seems worse than scalar. x86
SSE2 and AVX optimisations are similarly affected. This is presumably
caused by unlucky inputs from checkasm, such that the C code
short-circuits almost all filter calculations.
2024-07-04 19:57:42 +03:00
Rémi Denis-Courmont
fa47299516 lavc/startcode: add R-V V startcode_find_candidate 2024-05-19 10:03:49 +03:00
Rémi Denis-Courmont
4ad5b9c8db lavc/startcode: add R-V Zbb startcode_find_candidate
The main loop processes 8 bytes in 5 instructions.
For comparison, the optimal plain strnlen() requires 4 instructions per
byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code
involves 5 instructions per byte (8x worse). Actual benchmarks may be
slightly less favourable due to latency from ORC.B to BNE.
2024-05-19 10:03:49 +03:00