ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-02-18 13:30:22 +00:00

Author	SHA1	Message	Date
sunyuechi	b3f7440298	lavc/hevc: R-V V put_pixels(pow2) k230 banana_f3 put_hevc_pel_pixels4_8_c: 61.6 ( 1.00x) 69.5 ( 1.00x) put_hevc_pel_pixels4_8_rvv_i32: 24.6 ( 2.50x) 28.0 ( 2.48x) put_hevc_pel_pixels8_8_c: 209.8 ( 1.00x) 215.5 ( 1.00x) put_hevc_pel_pixels8_8_rvv_i32: 52.6 ( 3.99x) 38.2 ( 5.64x) put_hevc_pel_pixels16_8_c: 839.4 ( 1.00x) 830.0 ( 1.00x) put_hevc_pel_pixels16_8_rvv_i32: 126.6 ( 6.63x) 90.5 ( 9.17x) put_hevc_pel_pixels32_8_c: 3246.6 ( 1.00x) 3246.7 ( 1.00x) put_hevc_pel_pixels32_8_rvv_i32: 311.6 (10.42x) 257.0 (12.63x) put_hevc_pel_pixels64_8_c: 12894.6 ( 1.00x) 12892.7 ( 1.00x) put_hevc_pel_pixels64_8_rvv_i32: 1135.8 (11.35x) 778.0 (16.57x)	2024-12-17 09:21:20 +08:00
Niklas Haas	2f77ecc6bc	avcodec/riscv: add h264 qpel Benched on K230 for VLEN 128, SpaceMIT for VLEN 256. Variants for 4 width have no speedup for VLEN 256 vs VLEN 128 on available hardware, so were disabled. C RVV128 C RVV256 avg_h264_qpel_4_mc00_8 33.9 33.6 (1.01x) avg_h264_qpel_4_mc01_8 218.8 89.1 (2.46x) avg_h264_qpel_4_mc02_8 218.8 79.8 (2.74x) avg_h264_qpel_4_mc03_8 218.8 89.1 (2.46x) avg_h264_qpel_4_mc10_8 172.3 126.1 (1.37x) avg_h264_qpel_4_mc11_8 339.1 190.8 (1.78x) avg_h264_qpel_4_mc12_8 533.6 357.6 (1.49x) avg_h264_qpel_4_mc13_8 348.4 190.8 (1.83x) avg_h264_qpel_4_mc20_8 144.8 116.8 (1.24x) avg_h264_qpel_4_mc21_8 478.1 385.6 (1.24x) avg_h264_qpel_4_mc22_8 348.4 283.6 (1.23x) avg_h264_qpel_4_mc23_8 478.1 394.6 (1.21x) avg_h264_qpel_4_mc30_8 172.6 126.1 (1.37x) avg_h264_qpel_4_mc31_8 339.4 191.1 (1.78x) avg_h264_qpel_4_mc32_8 542.9 357.6 (1.52x) avg_h264_qpel_4_mc33_8 339.4 191.1 (1.78x) avg_h264_qpel_8_mc00_8 116.8 42.9 (2.72x) 123.6 50.6 (2.44x) avg_h264_qpel_8_mc01_8 774.4 163.1 (4.75x) 779.8 165.1 (4.72x) avg_h264_qpel_8_mc02_8 774.4 154.1 (5.03x) 779.8 144.3 (5.40x) avg_h264_qpel_8_mc03_8 774.4 163.3 (4.74x) 779.8 165.3 (4.72x) avg_h264_qpel_8_mc10_8 617.1 237.3 (2.60x) 613.1 227.6 (2.69x) avg_h264_qpel_8_mc11_8 1209.3 376.4 (3.21x) 1206.8 363.1 (3.32x) avg_h264_qpel_8_mc12_8 1913.3 598.6 (3.20x) 1894.3 561.1 (3.38x) avg_h264_qpel_8_mc13_8 1218.6 376.4 (3.24x) 1217.1 363.1 (3.35x) avg_h264_qpel_8_mc20_8 524.4 228.1 (2.30x) 519.3 227.6 (2.28x) avg_h264_qpel_8_mc21_8 1709.6 681.9 (2.51x) 1707.1 644.3 (2.65x) avg_h264_qpel_8_mc22_8 1274.3 459.6 (2.77x) 1279.8 436.1 (2.93x) avg_h264_qpel_8_mc23_8 1700.3 672.6 (2.53x) 1706.8 644.6 (2.65x) avg_h264_qpel_8_mc30_8 607.6 246.6 (2.46x) 623.6 238.1 (2.62x) avg_h264_qpel_8_mc31_8 1209.6 376.4 (3.21x) 1206.8 363.1 (3.32x) avg_h264_qpel_8_mc32_8 1904.1 607.9 (3.13x) 1894.3 571.3 (3.32x) avg_h264_qpel_8_mc33_8 1209.6 376.1 (3.22x) 1206.8 363.1 (3.32x) avg_h264_qpel_16_mc00_8 431.9 89.1 (4.85x) 436.1 71.3 (6.12x) 
avg_h264_qpel_16_mc01_8 2894.6 376.1 (7.70x) 2842.3 300.6 (9.46x) avg_h264_qpel_16_mc02_8 2987.3 348.4 (8.57x) 2967.3 290.1 (10.23x) avg_h264_qpel_16_mc03_8 2885.3 376.4 (7.67x) 2842.3 300.6 (9.46x) avg_h264_qpel_16_mc10_8 2404.1 524.4 (4.58x) 2404.8 456.8 (5.26x) avg_h264_qpel_16_mc11_8 4709.4 811.6 (5.80x) 4675.6 706.8 (6.62x) avg_h264_qpel_16_mc12_8 7477.9 1274.3 (5.87x) 7436.1 1061.1 (7.01x) avg_h264_qpel_16_mc13_8 4718.6 820.6 (5.75x) 4655.1 706.8 (6.59x) avg_h264_qpel_16_mc20_8 2052.1 487.1 (4.21x) 2071.3 446.3 (4.64x) avg_h264_qpel_16_mc21_8 7440.6 1422.6 (5.23x) 6727.8 1217.3 (5.53x) avg_h264_qpel_16_mc22_8 5051.9 950.4 (5.32x) 5071.6 790.3 (6.42x) avg_h264_qpel_16_mc23_8 6764.9 1422.3 (4.76x) 6748.6 1217.3 (5.54x) avg_h264_qpel_16_mc30_8 2413.1 524.4 (4.60x) 2415.1 467.3 (5.17x) avg_h264_qpel_16_mc31_8 4681.6 839.1 (5.58x) 4675.6 727.6 (6.43x) avg_h264_qpel_16_mc32_8 8579.6 1292.8 (6.64x) 7436.3 1071.3 (6.94x) avg_h264_qpel_16_mc33_8 5375.9 829.9 (6.48x) 4665.3 717.3 (6.50x) put_h264_qpel_4_mc00_8 24.4 24.4 (1.00x) put_h264_qpel_4_mc01_8 987.4 79.8 (12.37x) put_h264_qpel_4_mc02_8 190.8 79.8 (2.39x) put_h264_qpel_4_mc03_8 209.6 89.1 (2.35x) put_h264_qpel_4_mc10_8 163.3 117.1 (1.39x) put_h264_qpel_4_mc11_8 339.4 181.6 (1.87x) put_h264_qpel_4_mc12_8 533.6 348.4 (1.53x) put_h264_qpel_4_mc13_8 339.4 190.8 (1.78x) put_h264_qpel_4_mc20_8 126.3 116.8 (1.08x) put_h264_qpel_4_mc21_8 468.9 376.1 (1.25x) put_h264_qpel_4_mc22_8 330.1 274.4 (1.20x) put_h264_qpel_4_mc23_8 468.9 376.1 (1.25x) put_h264_qpel_4_mc30_8 163.3 126.3 (1.29x) put_h264_qpel_4_mc31_8 339.1 191.1 (1.77x) put_h264_qpel_4_mc32_8 533.6 348.4 (1.53x) put_h264_qpel_4_mc33_8 339.4 181.8 (1.87x) put_h264_qpel_8_mc00_8 98.6 33.6 (2.93x) 92.3 40.1 (2.30x) put_h264_qpel_8_mc01_8 737.1 153.8 (4.79x) 738.1 144.3 (5.12x) put_h264_qpel_8_mc02_8 663.1 135.3 (4.90x) 665.1 134.1 (4.96x) put_h264_qpel_8_mc03_8 737.4 154.1 (4.79x) 1508.8 144.3 (10.46x) put_h264_qpel_8_mc10_8 598.4 237.1 (2.52x) 592.3 227.6 (2.60x) 
put_h264_qpel_8_mc11_8 1172.3 357.9 (3.28x) 1175.6 342.3 (3.43x) put_h264_qpel_8_mc12_8 1867.1 589.1 (3.17x) 1863.1 561.1 (3.32x) put_h264_qpel_8_mc13_8 1172.6 366.9 (3.20x) 1175.6 352.8 (3.33x) put_h264_qpel_8_mc20_8 450.4 218.8 (2.06x) 446.3 206.8 (2.16x) put_h264_qpel_8_mc21_8 1672.3 663.1 (2.52x) 1675.6 633.8 (2.64x) put_h264_qpel_8_mc22_8 1144.6 1200.1 (0.95x) 1144.3 425.6 (2.69x) put_h264_qpel_8_mc23_8 1672.6 672.4 (2.49x) 1665.3 634.1 (2.63x) put_h264_qpel_8_mc30_8 598.6 237.3 (2.52x) 613.1 227.6 (2.69x) put_h264_qpel_8_mc31_8 1172.3 376.1 (3.12x) 1175.6 352.6 (3.33x) put_h264_qpel_8_mc32_8 1857.8 598.6 (3.10x) 1863.1 561.1 (3.32x) put_h264_qpel_8_mc33_8 1172.3 376.1 (3.12x) 1175.6 352.8 (3.33x) put_h264_qpel_16_mc00_8 320.6 61.4 (5.22x) 321.3 60.8 (5.28x) put_h264_qpel_16_mc01_8 2774.3 339.1 (8.18x) 2759.1 279.8 (9.86x) put_h264_qpel_16_mc02_8 2589.1 320.6 (8.08x) 2571.6 269.3 (9.55x) put_h264_qpel_16_mc03_8 2774.3 339.4 (8.17x) 2738.1 290.1 (9.44x) put_h264_qpel_16_mc10_8 2274.3 487.4 (4.67x) 2290.1 436.1 (5.25x) put_h264_qpel_16_mc11_8 5237.1 792.9 (6.60x) 4529.8 685.8 (6.61x) put_h264_qpel_16_mc12_8 7357.6 1255.8 (5.86x) 7352.8 1040.1 (7.07x) put_h264_qpel_16_mc13_8 4579.9 792.9 (5.78x) 4571.6 686.1 (6.66x) put_h264_qpel_16_mc20_8 1802.1 459.6 (3.92x) 1800.6 425.6 (4.23x) put_h264_qpel_16_mc21_8 6644.6 2246.6 (2.96x) 6644.3 1196.6 (5.55x) put_h264_qpel_16_mc22_8 4589.1 913.4 (5.02x) 4592.3 769.3 (5.97x) put_h264_qpel_16_mc23_8 6644.6 1394.6 (4.76x) 6634.1 1196.6 (5.54x) put_h264_qpel_16_mc30_8 2274.3 496.6 (4.58x) 2290.1 456.8 (5.01x) put_h264_qpel_16_mc31_8 5255.6 802.1 (6.55x) 4550.8 706.8 (6.44x) put_h264_qpel_16_mc32_8 7376.1 1265.1 (5.83x) 7352.8 1050.6 (7.00x) put_h264_qpel_16_mc33_8 4579.9 802.1 (5.71x) 4561.1 696.3 (6.55x) Signed-off-by: Niklas Haas <git@haasn.dev> Signed-off-by: J. Dekker <jdek@itanimul.li>	2024-09-28 18:35:35 +02:00
Rémi Denis-Courmont	63d016aea5	lavc/mpegvideoencdsp: R-V V pix_sum T-Head C908: pix_sum_c: 332.2 pix_sum_rvv_i64: 91.2 SpacemiT X60: pix_sum_c: 321.2 pix_sum_rvv_i64: 60.9	2024-08-19 22:41:13 +03:00
Rémi Denis-Courmont	2f083fd581	lavc/audiodsp: drop R-V F vector_clipf This is now firmly slower than C. SiFive-U74 (cycles): audiodsp.vector_clipf_c: 31.2 audiodsp.vector_clipf_rvf: 39.5	2024-08-01 19:29:40 +03:00
Rémi Denis-Courmont	952b426f3b	lavc/bswapdsp: add RV Zvbb bswap16 and bswap32	2024-08-01 18:43:04 +03:00
Rémi Denis-Courmont	262168b04e	lavc/videodsp: RISC-V zicbop prefetch There are currently no ways to run-time detect the CPU capability, so we take it for granted (in the worst case, it will execute NOPs).	2024-07-30 18:41:51 +03:00
Rémi Denis-Courmont	7b24f96c87	lavc/vp9dsp: remove R-V I intra functions At this point, they are identical to the C code, except for instruction ordering. In fact, they are typically slower or no faster than the C code.	2024-07-29 21:16:41 +03:00
Rémi Denis-Courmont	7744c08240	lavc/h264dsp: R-V V add_pixels4 and 8-bit add_pixels8 T-Head C908 (cycles): h264_add_pixels4_8bpp_c: 93.5 h264_add_pixels4_8bpp_rvv_i32: 39.5 h264_add_pixels4_9bpp_c: 87.5 h264_add_pixels4_9bpp_rvv_i64: 50.5 h264_add_pixels4_10bpp_c: 87.5 h264_add_pixels4_10bpp_rvv_i64: 50.5 h264_add_pixels4_12bpp_c: 87.5 h264_add_pixels4_12bpp_rvv_i64: 50.5 h264_add_pixels4_14bpp_c: 87.5 h264_add_pixels4_14bpp_rvv_i64: 50.5 h264_add_pixels8_8bpp_c: 265.2 h264_add_pixels8_8bpp_rvv_i64: 84.5	2024-07-16 17:25:40 +03:00
Rémi Denis-Courmont	30475c95ba	lavc/h264dsp: R-V V 8-bit h264_idct_add16 While this tends to be faster than plain C, the performance numbers are all over the place, presumably due to the conditional character of the main loop. Some additional micro-optimisations should be feasible after the underlying h264_idct_add and h264_idct_dc_add functions are also implemented. Then it will no longer be necessary to strictly abide by the C ABI.	2024-07-05 18:56:02 +03:00
Rémi Denis-Courmont	5a6e333fc7	lavc/h264dsp: R-V V 8-bit luma loop filter T-Head C908 (cycles): h264_h_loop_filter_luma_8bpp_c: 297.5 h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2 h264_v_loop_filter_luma_8bpp_c: 862.7 h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7 Performance in the horizontal scenario seems worse than scalar. x86 SSE2 and AVX optimisations are similarly affected. This is presumably caused by unlucky inputs from checkasm, such that the C code short-circuits almost all filter calculations.	2024-07-04 19:57:42 +03:00
Rémi Denis-Courmont	378d1b06c3	riscv: probe for Zbb extension at load time Due to hysterical raisins, most RISC-V Linux distributions target a RV64GC baseline excluding the Bit-manipulation ISA extensions, most notably: - Zba: address generation extension and - Zbb: basic bit manipulation extension. Most CPUs that would make sense to run FFmpeg on support Zba and Zbb (including the current FATE runner), so it makes sense to optimise for them. In fact a large chunk of existing assembler optimisations relies on Zba and/or Zbb. Since we cannot patch shared library code, the next best thing is to carry a flag initialised at load-time and check it on an as-needed basis. This results in 3 instructions of overhead on isolated use, e.g.: 1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported) LBU rd, %pcrel_lo(1b)(rd) BEQZ rd, non_Zbb_fallback_code // Zbb code here The C compiler will typically load the flag ahead of time to reduce latency, and can also keep it around if Zbb is used multiple times in a single optimisation scope. For this to work, the flag symbol must be hidden; otherwise the optimisation degrades with a GOT look-up to support interposition: 1: AUIPC rd, GOT_OFFSET_HI LD rd, GOT_OFFSET_LO(rd) LBU rd, (rd) BEQZ rd, non_Zbb_fallback_code // Zbb code here This patch adds code to provision the flag in libraries using bit manipulation functions from libavutil: byte-swap, bit-weight and counting leading or trailing zeroes.	2024-06-11 20:12:37 +03:00
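The load-time flag described in commit 378d1b06c3 can be sketched in C roughly as follows. Only the symbol name `ff_rv_zbb_supported` comes from the commit message; `probe_zbb()`, `init_zbb_flag()` and `popcount64_sketch()` are hypothetical names used for illustration, the probe is a compile-time stand-in rather than real run-time detection, and the attributes assume GCC or Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hidden visibility allows a PC-relative load of the flag instead of a
 * GOT look-up (assumption: GCC/Clang attribute syntax). */
__attribute__((visibility("hidden"))) bool ff_rv_zbb_supported;

/* Hypothetical probe: a real implementation would query the CPU at run
 * time; this stand-in only reflects the compile-time target. */
static bool probe_zbb(void)
{
#if defined(__riscv_zbb)
    return true;
#else
    return false;
#endif
}

/* Runs when the library (or program) is loaded, before any user of the flag. */
__attribute__((constructor)) static void init_zbb_flag(void)
{
    ff_rv_zbb_supported = probe_zbb();
}

/* Bit-weight with a flag check: one path the compiler may lower to a Zbb
 * instruction, and a plain C fallback. */
static int popcount64_sketch(uint64_t x)
{
    if (ff_rv_zbb_supported)
        return __builtin_popcountll(x); /* may become CPOP when Zbb is enabled */

    int n = 0;
    for (; x; x &= x - 1) /* clear the lowest set bit on each iteration */
        n++;
    return n;
}
```

Both paths return the same value, so callers do not need to know which one ran.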
Rémi Denis-Courmont	fd39997f72	lavc/vp7dsp: add R-V V vp7_luma_dc_wht This works out a bit more favourably than VP8's due to: - additional multiplications that can be vectored, - hardware-supported fixed-point rounding mode. vp7_luma_dc_wht_c: 3.2 vp7_luma_dc_wht_rvv_i64: 2.0	2024-05-29 16:57:02 +03:00
Rémi Denis-Courmont	910d281b21	lavc/h263dsp: R-V V {h,v}_loop_filter Since the horizontal and vertical filters are identical except for a transposition, this uses a common subprocedure with an ad-hoc ABI. To preserve return-address stack prediction, a link register has to be used (cf. the "Control Transfer Instructions" from the RISC-V ISA Manual). The alternate/temporary link register T0 is used here, so that the normal RA is preserved (something Arm cannot do!). To load the strength value based on `qscale`, the shortest possible and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic LLA; ADD; LBU sequence would add one more instruction since LLA is a convenience alias for AUIPC; ADDI. To ensure that this trick works, relocation relaxation is disabled. To implement the two signed divisions by a power of two toward zero: (x / (1 << SHIFT)) the code relies on the small range of integers involved, computing: (x + (x >> (16 - SHIFT))) >> SHIFT rather than the more general: (x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT Thus one ANDI instruction is avoided. T-Head C908: h263dsp.h_loop_filter_c: 228.2 h263dsp.h_loop_filter_rvv_i32: 144.0 h263dsp.v_loop_filter_c: 242.7 h263dsp.v_loop_filter_rvv_i32: 114.0 (C is probably worse in real use due to less predictable branches.)	2024-05-22 19:15:39 +03:00
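The division trick from commit 910d281b21 can be checked with a small C sketch. It assumes 16-bit elements, that the inner shift by (16 - SHIFT) is a logical shift, and that the final shift is arithmetic (implementation-defined in C, but arithmetic on mainstream compilers). The function names are hypothetical. The short form is only valid while the top SHIFT bits of x all equal the sign bit, i.e. for x in [-2^(16-SHIFT), 2^(16-SHIFT) - 1].

```c
#include <assert.h>
#include <stdint.h>

/* General form: add (2^shift - 1) to negative inputs only, then shift. */
static int16_t div_pow2_general(int16_t x, unsigned shift)
{
    assert(shift >= 1 && shift <= 15);
    int bias = (x >> 15) & ((1 << shift) - 1); /* sign mask ANDed with 2^shift - 1 */
    return (int16_t)((x + bias) >> shift);
}

/* Short form: for small |x| the top `shift` bits are copies of the sign
 * bit, so a logical shift already yields 0 or 2^shift - 1 without the AND. */
static int16_t div_pow2_small_range(int16_t x, unsigned shift)
{
    assert(shift >= 1 && shift <= 15);
    int bias = (uint16_t)x >> (16 - shift); /* logical shift of the 16-bit pattern */
    return (int16_t)((x + bias) >> shift);
}
```

For example, with SHIFT = 2 and x = -5, the bias is 3 in both forms and the result is -1, matching C's truncating `-5 / 4`.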
sunyuechi	0c1304ae11	lavc/vp9dsp: R-V V mc avg C908: vp9_avg4_8bpp_c: 1.2 vp9_avg4_8bpp_rvv_i64: 1.0 vp9_avg8_8bpp_c: 3.7 vp9_avg8_8bpp_rvv_i64: 1.5 vp9_avg16_8bpp_c: 14.7 vp9_avg16_8bpp_rvv_i64: 3.5 vp9_avg32_8bpp_c: 57.7 vp9_avg32_8bpp_rvv_i64: 10.0 vp9_avg64_8bpp_c: 229.0 vp9_avg64_8bpp_rvv_i64: 31.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-21 21:28:14 +03:00
Rémi Denis-Courmont	fa47299516	lavc/startcode: add R-V V startcode_find_candidate	2024-05-19 10:03:49 +03:00
Rémi Denis-Courmont	4ad5b9c8db	lavc/startcode: add R-V Zbb startcode_find_candidate The main loop processes 8 bytes in 5 instructions. For comparison, the optimal plain strnlen() requires 4 instructions per byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code involves 5 instructions per byte (8x worse). Actual benchmarks may be slightly less favourable due to latency from ORC.B to BNE.	2024-05-19 10:03:49 +03:00
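As a rough C illustration of the idea in commit 4ad5b9c8db: Zbb's ORC.B turns every non-zero byte of a register into 0xFF and every zero byte into 0x00, so a word contains a zero byte exactly when the ORC.B result is not all-ones. The sketch below emulates ORC.B in portable C and assumes the candidate search amounts to finding the first zero byte; `orc_b_emulated()` and `find_zero_byte_sketch()` are hypothetical names, and the emulation is of course far slower than the single instruction.

```c
#include <stdint.h>
#include <string.h>

/* Portable emulation of ORC.B: each non-zero byte becomes 0xFF. */
static uint64_t orc_b_emulated(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((x >> (8 * i)) & 0xFF)
            r |= (uint64_t)0xFF << (8 * i);
    return r;
}

/* Returns the index of the first zero byte, or `size` if there is none. */
static int find_zero_byte_sketch(const uint8_t *buf, int size)
{
    int i = 0;

    /* Word loop: skip 8 bytes at a time while no byte in the word is zero. */
    for (; i + 8 <= size; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof(w)); /* unaligned-safe load */
        if (orc_b_emulated(w) != UINT64_MAX)
            break;
    }

    /* Byte loop: locate the exact position, and handle the tail. */
    for (; i < size; i++)
        if (buf[i] == 0)
            break;
    return i;
}
```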
sunyuechi	d4083ecb7c	lavc/vc1dsp: R-V V mspel_pixels C908 X60 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c : 14.7 13.2 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32 : 2.5 2.2 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c : 3.7 3.5 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i64 : 1.0 1.2 vc1dsp.put_vc1_mspel_pixels_tab[0][0]_c : 9.0 8.0 vc1dsp.put_vc1_mspel_pixels_tab[0][0]_rvi : 1.0 1.0 vc1dsp.put_vc1_mspel_pixels_tab[1][0]_c : 2.5 2.2 vc1dsp.put_vc1_mspel_pixels_tab[1][0]_rvi : 0.5 0.5 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-16 17:08:18 +03:00
sunyuechi	b82d9f55d1	lavc/vp9dsp: R-V mc copy C908: vp9_put4_8bpp_c: 0.7 vp9_put4_8bpp_rvi: 0.5 vp9_put8_8bpp_c: 2.5 vp9_put8_8bpp_rvi: 0.5 vp9_put16_8bpp_c: 16.7 vp9_put16_8bpp_rvi: 1.5 vp9_put32_8bpp_c: 37.2 vp9_put32_8bpp_rvi: 5.7 vp9_put64_8bpp_c: 107.5 vp9_put64_8bpp_rvi: 21.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-15 19:52:28 +03:00
sunyuechi	aa9dbd91cf	lavc/vp9dsp: R-V ipred vert C908: vp9_vert_8x8_8bpp_c: 22.0 vp9_vert_8x8_8bpp_rvi: 15.7 vp9_vert_16x16_8bpp_c: 71.2 vp9_vert_16x16_8bpp_rvi: 39.0 vp9_vert_32x32_8bpp_c: 300.2 vp9_vert_32x32_8bpp_rvi: 135.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-15 19:52:25 +03:00
Rémi Denis-Courmont	0d9591841b	lavc/ac3dsp: add R-V Zvbb extract_exponents	2024-05-11 11:38:49 +03:00
sunyuechi	0b8e5e5a00	lavc/vp8dsp: R-V put_vp8_pixels C908: vp8_put_pixels4_c: 78.0 vp8_put_pixels4_rvi: 33.7 vp8_put_pixels8_c: 278.0 vp8_put_pixels8_rvi: 55.0 vp8_put_pixels16_c: 999.0 vp8_put_pixels16_rvi: 86.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-10 18:41:13 +03:00
sunyuechi	5bc3b7f513	lavc/rv40dsp: R-V V chroma_mc This is similar to h264, but here we use manual_avg instead of vaaddu because rv40's OP differs from h264. If we use vaaddu, rv40 would need to repeatedly switch between vxrm=0 and vxrm=2, and switching vxrm is very slow. C908: avg_chroma_mc4_c: 2330.0 avg_chroma_mc4_rvv_i32: 602.7 avg_chroma_mc8_c: 1211.0 avg_chroma_mc8_rvv_i32: 602.7 put_chroma_mc4_c: 1825.0 put_chroma_mc4_rvv_i32: 414.7 put_chroma_mc8_c: 932.0 put_chroma_mc8_rvv_i32: 414.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-05-03 18:00:53 +03:00
sunyuechi	c3a96f97f8	lavc/vp9dsp: R-V V ipred dc C908: vp9_dc_8x8_8bpp_c: 46.0 vp9_dc_8x8_8bpp_rvv_i64: 41.0 vp9_dc_16x16_8bpp_c: 109.2 vp9_dc_16x16_8bpp_rvv_i32: 72.7 vp9_dc_32x32_8bpp_c: 365.2 vp9_dc_32x32_8bpp_rvv_i32: 165.5 vp9_dc_127_8x8_8bpp_c: 23.0 vp9_dc_127_8x8_8bpp_rvv_i64: 22.0 vp9_dc_127_16x16_8bpp_c: 70.2 vp9_dc_127_16x16_8bpp_rvv_i32: 50.2 vp9_dc_127_32x32_8bpp_c: 295.2 vp9_dc_127_32x32_8bpp_rvv_i32: 136.7 vp9_dc_128_8x8_8bpp_c: 23.0 vp9_dc_128_8x8_8bpp_rvv_i64: 22.0 vp9_dc_128_16x16_8bpp_c: 70.2 vp9_dc_128_16x16_8bpp_rvv_i32: 50.2 vp9_dc_128_32x32_8bpp_c: 295.2 vp9_dc_128_32x32_8bpp_rvv_i32: 136.7 vp9_dc_129_8x8_8bpp_c: 23.0 vp9_dc_129_8x8_8bpp_rvv_i64: 22.0 vp9_dc_129_16x16_8bpp_c: 70.2 vp9_dc_129_16x16_8bpp_rvv_i32: 50.2 vp9_dc_129_32x32_8bpp_c: 295.2 vp9_dc_129_32x32_8bpp_rvv_i32: 136.7 vp9_dc_left_8x8_8bpp_c: 38.0 vp9_dc_left_8x8_8bpp_rvv_i64: 36.0 vp9_dc_left_16x16_8bpp_c: 93.2 vp9_dc_left_16x16_8bpp_rvv_i32: 67.7 vp9_dc_left_32x32_8bpp_c: 333.2 vp9_dc_left_32x32_8bpp_rvv_i32: 158.5 vp9_dc_top_8x8_8bpp_c: 38.7 vp9_dc_top_8x8_8bpp_rvv_i64: 36.0 vp9_dc_top_16x16_8bpp_c: 93.2 vp9_dc_top_16x16_8bpp_rvv_i32: 67.7 vp9_dc_top_32x32_8bpp_c: 333.2 vp9_dc_top_32x32_8bpp_rvv_i32: 156.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-04-29 20:46:05 +03:00
sunyuechi	b41e115dde	lavc/me_cmp: R-V V pix_abs C908: pix_abs_0_0_c: 534.0 pix_abs_0_0_rvv_i32: 136.2 pix_abs_1_0_c: 287.7 pix_abs_1_0_rvv_i32: 125.2 sad_0_c: 534.0 sad_0_rvv_i32: 136.2 sad_1_c: 287.7 sad_1_rvv_i32: 125.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-02-21 20:08:25 +02:00
sunyuechi	c12053cefc	lavc/vp8dsp: R-V V vp8_idct_dc_add c908: vp8_idct_dc_add_c: 102.2 vp8_idct_dc_add_rvv_i32: 42.0 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-02-17 14:45:49 +02:00
sunyuechi	ee08974f90	lavc/rv34dsp: R-V V rv34_inv_transform_dc C908: rv34_inv_transform_dc_c: 35.5 rv34_inv_transform_dc_rvv_i32: 27.0 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-02-17 14:33:35 +02:00
sunyuechi	0748d2bbc7	lavc/blockdsp: R-V V clear_block C908: blockdsp.clear_block_c: 47.2 blockdsp.clear_block_rvv_i64: 28.5 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-02-12 22:00:03 +02:00
sunyuechi	8e23ebe6f9	lavc/svq1enc: R-V V ssd_int8_vs_int16 C908 ssd_int8_vs_int16_c: 207.7 ssd_int8_vs_int16_rvv_i32: 14.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2024-01-17 17:49:54 +02:00
sunyuechi	864174dd00	lavc/takdsp: R-V V decorrelate_ls C908: decorrelate_ls_c: 69.7 decorrelate_ls_rvv_i32: 27.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2023-12-21 22:42:34 +02:00
sunyuechi	98596f90f4	lavc/aacencdsp: R-V V abs_pow34 C908: abs_pow34_c: 535.5 abs_pow34_rvv_f32: 337.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2023-12-11 18:42:07 +02:00
Rémi Denis-Courmont	272d0c164d	lavc/lpc: R-V V apply_welch_window apply_welch_window_even_c: 617.5 apply_welch_window_even_rvv_f64: 235.0 apply_welch_window_odd_c: 709.0 apply_welch_window_odd_rvv_f64: 256.5	2023-12-11 18:17:43 +02:00
Rémi Denis-Courmont	b3825bbe45	riscv: test for assembler support This should fix the build on LLVM 16 and earlier, at the cost of turning all non-RVV optimisations off.	2023-12-08 17:21:09 +02:00
sunyuechi	0b9d009b4a	lavc/vc1dsp: R-V V inv_trans C908: vc1dsp.vc1_inv_trans_4x4_dc_c: 125.7 vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5 vc1dsp.vc1_inv_trans_4x8_dc_c: 230.7 vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5 vc1dsp.vc1_inv_trans_8x4_dc_c: 228.7 vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5 vc1dsp.vc1_inv_trans_8x8_dc_c: 476.5 vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2023-12-08 17:20:48 +02:00
sunyuechi	8bdb663062	lavc/ac3dsp: R-V V float_to_fixed24 c910 float_to_fixed24_c: 2207.2 float_to_fixed24_rvv_f32: 696.2 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2023-12-06 16:04:22 +02:00
Rémi Denis-Courmont	0fa421c8f1	lavc/llvidencdsp: add R-V V diff_bytes diff_bytes_c: 163.0 diff_bytes_rvv_i32: 52.7	2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont	fbc7adba67	lavc/llviddsp: R-V V add_bytes add_bytes_c: 2077.2 add_bytes_rvv_i32: 105.0	2023-11-18 22:07:14 +02:00
Rémi Denis-Courmont	636ae0e0bc	lavc/flacdsp: R-V V packed decorrelate_{l,r}s flac_decorrelate_ms_16_c: 457.2 flac_decorrelate_ms_16_rvv_i32: 203.0 flac_decorrelate_ms_32_c: 457.2 flac_decorrelate_ms_32_rvv_i32: 203.5 flac_decorrelate_rs_16_c: 456.2 flac_decorrelate_rs_16_rvv_i32: 207.0 flac_decorrelate_rs_32_c: 456.2 flac_decorrelate_rs_32_rvv_i32: 210.5	2023-11-17 23:59:22 +02:00
Rémi Denis-Courmont	45d0eb3f70	lavc/llauddsp: R-V V scalarproduct_and_madd_int16 scalarproduct_and_madd_int16_c: 10355.7 scalarproduct_and_madd_int16_rvv_i32: 1480.0	2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont	86bee42473	lavc/sbrdsp: R-V V sum64x5 sum64x5_c: 385.0 sum64x5_rvv_f32: 116.0	2023-11-01 22:53:26 +02:00
Rémi Denis-Courmont	73dea2bb91	lavc/jpeg2000dsp: R-V V ict_float jpeg2000_ict_float_c: 3112.2 jpeg2000_ict_float_rvv_f32: 1225.0	2023-11-01 18:52:55 +02:00
Rémi Denis-Courmont	424c8ceb08	lavc/huffyuvdsp: R-V V add_int16 add_int16_128_c: 2390.5 add_int16_128_rvv_i32: 832.0 add_int16_rnd_width_c: 2390.2 add_int16_rnd_width_rvv_i32: 832.5	2023-10-31 21:33:25 +02:00
Rémi Denis-Courmont	4aea0da230	lavc/utvideodsp: R-V V restore_rgb_planes restore_rgb_planes_c: 133065.7 restore_rgb_planes_rvv_i32: 33317.2	2023-10-31 21:33:25 +02:00
Rémi Denis-Courmont	3c6516330f	lavc/exrdsp: R-V V reorder_pixels	2023-10-09 19:52:51 +03:00
Rémi Denis-Courmont	89c10d8d20	lavc/ac3: add R-V Zbb extract_exponents	2023-10-05 18:13:00 +03:00
Rémi Denis-Courmont	9bc5676e40	lavc/g722dsp: add RISC-V V DSP function	2023-08-24 21:07:18 +03:00
Arnie Chang	c5508f60c2	lavc/h264chroma: RISC-V V add motion compensation for 8x8 chroma blocks Optimize the put and avg filtering for 8x8 chroma blocks Signed-off-by: Arnie Chang <arnie.chang@sifive.com>	2023-05-30 17:15:05 +02:00
Rémi Denis-Courmont	8009581912	lavc/opusdsp: RISC-V V (128-bit) postfilter This is implemented for a vector size of 128-bit. Since the scalar product in the inner loop covers 5 samples or 160 bits, we need a group multiplier of 2. To avoid reconfiguring the vector type, the outer loop, which loads multiple input samples, sticks to the same multiplier. Consequently, the outer loop loads 8 samples per iteration. This is safe since the minimum period of the CELT codec is 15 samples. The same code would also work, albeit needlessly inefficiently, with a vector length of 256 bits. A proper implementation will follow instead.	2022-10-10 02:22:10 +02:00
Rémi Denis-Courmont	d7528af4df	lavc/bswapdsp: RISC-V V bswap_buf	2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont	f0ef11ea83	lavc/bswapdsp: RISC-V B bswap_buf Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 771.0 But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 341.0 (this patch) On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C: bswap16_buf_c: 1542.0 bswap16_buf_rvb_b: 1803.7 Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 408.5	2022-10-05 08:26:19 +02:00
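The pseudo-SIMD use of a 64-bit byte reversal in commit f0ef11ea83 can be sketched in portable C. Reversing all 8 bytes of a pair of 32-bit words swaps the bytes inside each word but also exchanges the two words, so the high half is stored first. `rev8_64()` stands in for the REV8 instruction and `bswap32_buf_sketch()` is a hypothetical name; the pair is assembled arithmetically here so the sketch does not depend on host endianness, whereas the assembler version would use a single 64-bit load.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the 64-bit REV8 instruction (full byte reversal). */
static uint64_t rev8_64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
}

/* Byte-swap each 32-bit word, two words per 64-bit reversal. */
static void bswap32_buf_sketch(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t i = 0;

    for (; i + 2 <= len; i += 2) {
        /* First word in the low half, second word in the high half. */
        uint64_t pair = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);

        pair = rev8_64(pair);
        dst[i]     = (uint32_t)(pair >> 32); /* swapped first word (the extra shift) */
        dst[i + 1] = (uint32_t)pair;         /* swapped second word */
    }

    if (i < len) /* odd tail: a lone word ends up in the high half */
        dst[i] = (uint32_t)(rev8_64(src[i]) >> 32);
}
```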
Rémi Denis-Courmont	64ab577954	lavc/alacdsp: RISC-V V decorrelate_stereo To avoid data dependencies, this does the following unroll, which requires one extra but probably free addition: coeff = (b * left_weight) >> decorr_shift; b += a; a -= coeff; b -= coeff; swap(a, b);	2022-10-05 06:51:11 +02:00
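To see why the reordering in commit 64ab577954 is equivalent, the sketch below compares a dependent formulation with the reordered one from the commit message. The dependent version is an assumption about what the straightforward scalar loop looks like, not a copy of the FFmpeg source; both function names are hypothetical, overflow handling is omitted, and the right shift of negative values is assumed to be arithmetic.

```c
#include <stdint.h>

/* Dependent chain: the update of b has to wait for the new a. */
static void decorrelate_dependent(int32_t *ch0, int32_t *ch1, int n,
                                  int decorr_shift, int left_weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = ch0[i], b = ch1[i];

        a -= (b * left_weight) >> decorr_shift;
        b += a;
        ch0[i] = b;
        ch1[i] = a;
    }
}

/* Reordered as in the commit message: b += a no longer waits for coeff,
 * at the cost of one extra subtraction. */
static void decorrelate_reordered(int32_t *ch0, int32_t *ch1, int n,
                                  int decorr_shift, int left_weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = ch0[i], b = ch1[i];
        int32_t coeff = (b * left_weight) >> decorr_shift;

        b += a;     /* independent of coeff */
        a -= coeff;
        b -= coeff; /* the extra addition mentioned in the message */
        ch0[i] = b; /* swap(a, b) done by storing crosswise */
        ch1[i] = a;
    }
}
```

With a = 10, b = 4, a weight of 3 and a shift of 1, both versions produce 8 in the first channel and 4 in the second.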


58 commits