As VLSE128.V does not exist, we have no other way to deal with latency.
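(For context: RVV 1.0 only defines strided loads for 8-, 16-, 32- and
64-bit elements, so there is no single instruction that gathers one
whole 16-byte row per element. Purely for illustration, this is the
plain-C equivalent of what a hypothetical VLSE128.V would do; the
function name is made up and none of this is the actual code:)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One strided "gather" of height whole 16-byte rows in one go -
     * this is what a 128-bit-element strided load would amount to. */
    static void gather_rows16(uint8_t *dst, const uint8_t *src,
                              ptrdiff_t stride, int height)
    {
        for (int y = 0; y < height; y++)   /* one element per row */
            memcpy(dst + 16 * y, src + y * stride, 16);
    }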
T-Head C908:
h264_weight16_8_c: 989.4 ( 1.00x)
h264_weight16_8_rvv_i32: 193.2 ( 5.12x)
SpacemiT X60:
h264_weight16_8_c: 874.1 ( 1.00x)
h264_weight16_8_rvv_i32: 196.9 ( 4.44x)
The height is a power of two, up to 16 rows. The current code was
optimised for large sample counts.
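For reference, the operation these kernels implement is, schematically,
the following (a simplified 8-bit sketch in the spirit of the C
reference; the real code folds the rounding and offset differently and
also handles higher bit depths, and the names below are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    /* Explicit weighting: pix = clip(((pix * w + rnd) >> d) + o) */
    static void weight_block(uint8_t *block, ptrdiff_t stride,
                             int width, int height,
                             int log2_denom, int weight, int offset)
    {
        int round = log2_denom ? 1 << (log2_denom - 1) : 0;

        for (int y = 0; y < height; y++, block += stride)
            for (int x = 0; x < width; x++)
                block[x] = clip_u8(((block[x] * weight + round)
                                    >> log2_denom) + offset);
    }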
T-Head C908:
h264_weight2_8_c: 211.7 ( 1.00x)
h264_weight2_8_rvv_i32: before 184.0 ( 1.15x)
h264_weight2_8_rvv_i32: after 54.2 ( 3.90x)
h264_weight4_8_c: 285.7 ( 1.00x)
h264_weight4_8_rvv_i32: before 341.2 ( 0.86x)
h264_weight4_8_rvv_i32: after 82.2 ( 3.47x)
h264_weight8_8_c: 498.7 ( 1.00x)
h264_weight8_8_rvv_i32: before 683.7 ( 0.73x)
h264_weight8_8_rvv_i64: after 128.5 ( 3.95x)
h264_weight16_8_c: 878.2 ( 1.00x)
h264_weight16_8_rvv_i32: unchanged 239.5 ( 3.67x)
SpacemiT X60:
h264_weight2_8_c: 207.2 ( 1.00x)
h264_weight2_8_rvv_i32: before 259.6 ( 0.80x)
h264_weight2_8_rvv_i32: after 82.2 ( 2.52x)
h264_weight4_8_c: 290.8 ( 1.00x)
h264_weight4_8_rvv_i32: before 509.6 ( 0.57x)
h264_weight4_8_rvv_i32: after 61.5 ( 4.73x)
h264_weight8_8_c: 498.8 ( 1.00x)
h264_weight8_8_rvv_i32: before 1019.8 ( 0.49x)
h264_weight8_8_rvv_i64: after 71.8 ( 6.95x)
h264_weight16_8_c: 874.0 ( 1.00x)
h264_weight16_8_rvv_i32: unchanged 249.0 ( 3.51x)
There is no known (real) hardware with V but without the complete B
extension. B was already required by the 2022 RISC-V application
profile (RVA22), earlier than V, so there should not be any relevant
hardware without it in the future either.
In practice, various RISC-V Vector optimisations in FFmpeg already
depend on every constituent of the B extension anyhow, so V without B
would not work well anyway.
These are really just wrappers for the idct4_add16intra functions,
which are in turn mostly wrappers for the idct4_add and idct4_dc_add
functions. For benchmarks, refer to the latter two sets.
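Schematically, the dispatch being wrapped looks like this (simplified
sketch only: the scan-order indexing of the non-zero-count array is
glossed over and the names are illustrative, not the actual libavcodec
ones):

    #include <stddef.h>
    #include <stdint.h>

    /* per-4x4-block primitives this dispatches to (prototypes only) */
    void idct4_add(uint8_t *dst, int16_t *block, ptrdiff_t stride);
    void idct4_dc_add(uint8_t *dst, int16_t *block, ptrdiff_t stride);

    static void idct4_add16intra_sketch(uint8_t *dst, const int *block_offset,
                                        int16_t *block, ptrdiff_t stride,
                                        const uint8_t *nonzero_count)
    {
        for (int i = 0; i < 16; i++) {
            if (nonzero_count[i])           /* AC coefficients present */
                idct4_add(dst + block_offset[i], block + i * 16, stride);
            else if (block[i * 16])         /* only the DC coefficient */
                idct4_dc_add(dst + block_offset[i], block + i * 16, stride);
        }
    }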
Unlike the 8-bit version, this needs two iterations to fit within
128-bit vectors. That adds some extra complexity for pointer arithmetic
and down-counting which is unnecessary in the 8-bit variant.
Accordingly, the gains over C with 128-bit vectors are only slightly
better than half of those with 256-bit vectors.
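(Rough arithmetic, assuming the block is held as 32-bit intermediates
at high bit depth: 8x8x32 = 2048 bits, twice the 1024 bits of the
16-bit, 8-bit-depth case. The C908's 128-bit vectors hold at most
8x128 = 1024 bits per register group, hence two passes; the X60's
256-bit vectors cover the whole block in one.)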
T-Head C908 (2 iterations):
h264_idct8_add_9bpp_c: 17.5
h264_idct8_add_9bpp_rvv_i32: 10.0
h264_idct8_add_10bpp_c: 17.5
h264_idct8_add_10bpp_rvv_i32: 9.7
h264_idct8_add_12bpp_c: 17.7
h264_idct8_add_12bpp_rvv_i32: 9.7
h264_idct8_add_14bpp_c: 17.7
h264_idct8_add_14bpp_rvv_i32: 9.7
SpacemiT X60 (single iteration):
h264_idct8_add_9bpp_c: 15.2
h264_idct8_add_9bpp_rvv_i32: 5.0
h264_idct8_add_10bpp_c: 15.2
h264_idct8_add_10bpp_rvv_i32: 5.0
h264_idct8_add_12bpp_c: 14.7
h264_idct8_add_12bpp_rvv_i32: 5.0
h264_idct8_add_14bpp_c: 14.7
h264_idct8_add_14bpp_rvv_i32: 4.7
There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised one processing one (fixed-size) row at a time.
Unsurprisingly, the generic one works out better with smaller widths.
With larger widths, the gains from filling vectors are outweighed by
the extra cost of strided loads and stores. In other words, memory
accesses become the bottleneck.
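Schematically, with plain C standing in for the vector code (the inner
loop is the one a single vector operation covers; weight_pixel stands
for the per-sample weighting and is only a prototype here):

    #include <stddef.h>
    #include <stdint.h>

    void weight_pixel(uint8_t *p);  /* per-sample weighting, prototype only */

    /* Generic, scalable variant: two columns per pass, vl = height.
     * Narrow blocks still fill the vectors, but every load/store is
     * strided. (Widths here are even, so the pairing is safe.) */
    static void weight_columnwise(uint8_t *block, ptrdiff_t stride,
                                  int width, int height)
    {
        for (int x = 0; x < width; x += 2)
            for (int y = 0; y < height; y++) {  /* vectorised, strided */
                weight_pixel(&block[y * stride + x]);
                weight_pixel(&block[y * stride + x + 1]);
            }
    }

    /* Specialised variant: one fixed-size row per pass, vl = width.
     * Wide rows fill the vectors with unit-stride accesses only. */
    static void weight_rowwise(uint8_t *block, ptrdiff_t stride,
                               int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)     /* vectorised, contiguous */
                weight_pixel(&block[y * stride + x]);
    }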
T-Head C908:
h264_weight2_8_c: 54.5
h264_weight2_8_rvv_i32: 13.7
h264_weight4_8_c: 101.7
h264_weight4_8_rvv_i32: 27.5
h264_weight8_8_c: 197.0
h264_weight8_8_rvv_i32: 75.5
h264_weight16_8_c: 385.0
h264_weight16_8_rvv_i32: 74.2
SpacemiT X60:
h264_weight2_8_c: 48.5
h264_weight2_8_rvv_i32: 8.2
h264_weight4_8_c: 90.7
h264_weight4_8_rvv_i32: 16.5
h264_weight8_8_c: 175.0
h264_weight8_8_rvv_i32: 37.7
h264_weight16_8_c: 342.2
h264_weight16_8_rvv_i32: 66.0
While this *tends* to be faster than plain C, the performance numbers
are all over the place, presumably due to the conditional character of
the main loop.
Some additional micro-optimisations should be feasible once the
underlying h264_idct_add and h264_idct_dc_add functions are also
implemented. Then it will no longer be necessary to strictly abide by
the C ABI.
Performance is (unfortunately) the same as with non-MBAFF, since the
hardware under test does not short-circuit vector tail calculations.
(IMO, a generic solution or work-around should be agreed on, rather
than bespoke approaches all over the place.)
T-Head C908 (cycles):
h264_h_loop_filter_luma_8bpp_c: 297.5
h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2
h264_v_loop_filter_luma_8bpp_c: 862.7
h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7
Performance in the horizontal scenario seems worse than scalar. x86
SSE2 and AVX optimisations are similarly affected. This is presumably
caused by unlucky inputs from checkasm, such that the C code
short-circuits almost all filter calculations.
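(The per-pixel gate in the scalar code is essentially the standard
H.264 threshold test sketched below; when it fails, the C code skips
all further arithmetic for that pixel, whereas the vector code computes
every lane regardless. Simplified sketch, not the actual libavcodec
code:)

    #include <stdlib.h>

    /* p1 p0 | q0 q1 are the samples on either side of the edge,
     * alpha/beta the slice-level thresholds. */
    static int needs_filtering(int p1, int p0, int q0, int q1,
                               int alpha, int beta)
    {
        return abs(p0 - q0) < alpha &&
               abs(p1 - p0) < beta  &&
               abs(q1 - q0) < beta;
    }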
The main loop processes 8 bytes in 5 instructions.
For comparison, the optimal plain strnlen() requires 4 instructions per
byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code
involves 5 instructions per byte (8x worse). Actual benchmarks may be
slightly less favourable due to latency from ORC.B to BNE.
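(For illustration: ORC.B turns every non-zero byte into 0xff and every
zero byte into 0x00, so a single 64-bit compare against all-ones tests
8 bytes at once. The plain-C stand-in below uses the classic SWAR
zero-byte test in place of ORC.B and is not the actual implementation:)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static size_t strnlen_words(const char *s, size_t n)
    {
        size_t i = 0;

        /* main loop: one load and one "any zero byte?" test per 8 bytes */
        for (; i + 8 <= n; i += 8) {
            uint64_t w;
            memcpy(&w, s + i, 8);
            if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL)
                break;  /* a NUL is somewhere in this word */
        }
        /* byte-wise tail, also locates the NUL within the flagged word */
        while (i < n && s[i])
            i++;
        return i;
    }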