ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-04-21 01:40:23 +00:00

Author	SHA1	Message	Date
Martin Storsjö	74cfcd1c69	aarch64/vvc: Fix DCE undefined references with MSVC This fixes compiling with MSVC for aarch64 after `510999f6b0`. While MSVC does do dead code elimination for function references within e.g. "if (0)", it doesn't do that for functions referenced within a static function, even if that static function itself ends up not used. A reproduction example: void missing(void); void (*func_ptr)(void); static void wrapper(void) { missing(); } void init(int cpu_flags) { if (0) { func_ptr = wrapper; } } If "wrapper" is entirely unreferenced, then MSVC doesn't produce any reference to the symbol "missing". Also, if we do "func_ptr = missing;" then the reference to missing also is eliminated. But for the case of referencing the function in a static function, even if the reference to the static function can be eliminated, then MSVC does keep the reference to the symbol.	2026-03-05 11:57:40 +02:00
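The reproduction example quoted in the commit message above can be laid out as a small C file. This is only a sketch under one stated assumption: `missing()` is given an empty body here, which the original example deliberately leaves out, so that the file links with any compiler. Removing that body gives the situation the message describes, where MSVC reportedly still emits a reference to `missing` through the unused static `wrapper`.

```c
#include <stddef.h>

/* In the original example missing() is only declared, never defined.
 * The empty body is an addition for illustration so this sketch links. */
void missing(void) { }

void (*func_ptr)(void) = NULL;

/* Static wrapper: per the commit message, MSVC keeps the reference to
 * missing() made here even when wrapper() itself is eliminated. */
static void wrapper(void)
{
    missing();
}

void init(int cpu_flags)
{
    (void)cpu_flags;
    /* Stands in for a feature check that is false at compile time. */
    if (0) {
        func_ptr = wrapper;
    }
}
```

Calling `init()` never assigns `func_ptr`, so on compilers that eliminate the dead branch and the unused wrapper, no reference to `missing` needs to survive.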
Zuoqiang He	1fc7464cf7	libavcodec/huffyuvdsp: Add NEON optimization for the add_int16 function Benchmark Results (1024 iterations, Raspberry Pi 5 - Cortex-A76): add_int16_128_c: 914.0 ( 1.00x) add_int16_128_neon: 516.9 ( 1.77x) add_int16_rnd_width_c: 914.0 ( 1.00x) add_int16_rnd_width_neon: 517.5 ( 1.77x) Co-Authored-By: Martin Storsjö <martin@martin.st>	2026-03-04 22:31:19 +00:00
Georgii Zagoruiko	510999f6b0	aarch64/vvc: sme2 optimisation of alf_filter_luma() 8/10/12 bit Apple M4: vvc_alf_filter_luma_8x8_8_c: 347.3 ( 1.00x) vvc_alf_filter_luma_8x8_8_neon: 138.7 ( 2.50x) vvc_alf_filter_luma_8x8_8_sme2: 134.5 ( 2.58x) vvc_alf_filter_luma_8x8_10_c: 299.8 ( 1.00x) vvc_alf_filter_luma_8x8_10_neon: 129.8 ( 2.31x) vvc_alf_filter_luma_8x8_10_sme2: 128.6 ( 2.33x) vvc_alf_filter_luma_8x8_12_c: 293.0 ( 1.00x) vvc_alf_filter_luma_8x8_12_neon: 126.8 ( 2.31x) vvc_alf_filter_luma_8x8_12_sme2: 126.3 ( 2.32x) vvc_alf_filter_luma_16x16_8_c: 1386.1 ( 1.00x) vvc_alf_filter_luma_16x16_8_neon: 560.3 ( 2.47x) vvc_alf_filter_luma_16x16_8_sme2: 540.1 ( 2.57x) vvc_alf_filter_luma_16x16_10_c: 1200.3 ( 1.00x) vvc_alf_filter_luma_16x16_10_neon: 515.6 ( 2.33x) vvc_alf_filter_luma_16x16_10_sme2: 531.3 ( 2.26x) vvc_alf_filter_luma_16x16_12_c: 1223.8 ( 1.00x) vvc_alf_filter_luma_16x16_12_neon: 510.7 ( 2.40x) vvc_alf_filter_luma_16x16_12_sme2: 524.9 ( 2.33x) vvc_alf_filter_luma_32x32_8_c: 5488.8 ( 1.00x) vvc_alf_filter_luma_32x32_8_neon: 2233.4 ( 2.46x) vvc_alf_filter_luma_32x32_8_sme2: 1093.6 ( 5.02x) vvc_alf_filter_luma_32x32_10_c: 4738.0 ( 1.00x) vvc_alf_filter_luma_32x32_10_neon: 2057.5 ( 2.30x) vvc_alf_filter_luma_32x32_10_sme2: 1053.6 ( 4.50x) vvc_alf_filter_luma_32x32_12_c: 4808.3 ( 1.00x) vvc_alf_filter_luma_32x32_12_neon: 1981.2 ( 2.43x) vvc_alf_filter_luma_32x32_12_sme2: 1047.7 ( 4.59x) vvc_alf_filter_luma_64x64_8_c: 22116.8 ( 1.00x) vvc_alf_filter_luma_64x64_8_neon: 8951.0 ( 2.47x) vvc_alf_filter_luma_64x64_8_sme2: 4225.2 ( 5.23x) vvc_alf_filter_luma_64x64_10_c: 19072.8 ( 1.00x) vvc_alf_filter_luma_64x64_10_neon: 8448.1 ( 2.26x) vvc_alf_filter_luma_64x64_10_sme2: 4225.8 ( 4.51x) vvc_alf_filter_luma_64x64_12_c: 19312.6 ( 1.00x) vvc_alf_filter_luma_64x64_12_neon: 8270.9 ( 2.34x) vvc_alf_filter_luma_64x64_12_sme2: 4245.4 ( 4.55x) vvc_alf_filter_luma_128x128_8_c: 88530.5 ( 1.00x) vvc_alf_filter_luma_128x128_8_neon: 35686.3 ( 2.48x) vvc_alf_filter_luma_128x128_8_sme2: 
16961.2 ( 5.22x) vvc_alf_filter_luma_128x128_10_c: 76904.9 ( 1.00x) vvc_alf_filter_luma_128x128_10_neon: 32439.5 ( 2.37x) vvc_alf_filter_luma_128x128_10_sme2: 16845.6 ( 4.57x) vvc_alf_filter_luma_128x128_12_c: 77363.3 ( 1.00x) vvc_alf_filter_luma_128x128_12_neon: 32907.5 ( 2.35x) vvc_alf_filter_luma_128x128_12_sme2: 17018.1 ( 4.55x)	2026-03-04 23:52:58 +02:00
Georgii Zagoruiko	90431417cb	aarch64/vvc: Optimisations of put_luma_hv() functions for 10/12-bit Apple M2: put_luma_hv_10_4x4_c: 36.3 ( 1.00x) put_luma_hv_10_8x8_c: 82.9 ( 1.00x) put_luma_hv_10_8x8_neon: 34.9 ( 2.37x) put_luma_hv_10_16x16_c: 239.2 ( 1.00x) put_luma_hv_10_16x16_neon: 119.0 ( 2.01x) put_luma_hv_10_32x32_c: 900.3 ( 1.00x) put_luma_hv_10_32x32_neon: 429.3 ( 2.10x) put_luma_hv_10_64x64_c: 2984.7 ( 1.00x) put_luma_hv_10_64x64_neon: 1736.2 ( 1.72x) put_luma_hv_10_128x128_c: 11194.2 ( 1.00x) put_luma_hv_10_128x128_neon: 6357.3 ( 1.76x) put_luma_hv_12_4x4_c: 35.9 ( 1.00x) put_luma_hv_12_8x8_c: 82.6 ( 1.00x) put_luma_hv_12_8x8_neon: 34.3 ( 2.41x) put_luma_hv_12_16x16_c: 240.2 ( 1.00x) put_luma_hv_12_16x16_neon: 115.3 ( 2.08x) put_luma_hv_12_32x32_c: 787.7 ( 1.00x) put_luma_hv_12_32x32_neon: 414.2 ( 1.90x) put_luma_hv_12_64x64_c: 3058.4 ( 1.00x) put_luma_hv_12_64x64_neon: 1592.3 ( 1.92x) put_luma_hv_12_128x128_c: 11350.8 ( 1.00x) put_luma_hv_12_128x128_neon: 6378.3 ( 1.78x) RPi4: put_luma_hv_10_4x4_c: 637.8 ( 1.00x) put_luma_hv_10_8x8_c: 1044.9 ( 1.00x) put_luma_hv_10_8x8_neon: 483.7 ( 2.16x) put_luma_hv_10_16x16_c: 3098.0 ( 1.00x) put_luma_hv_10_16x16_neon: 1603.1 ( 1.93x) put_luma_hv_10_32x32_c: 10054.8 ( 1.00x) put_luma_hv_10_32x32_neon: 5843.6 ( 1.72x) put_luma_hv_10_64x64_c: 40506.2 ( 1.00x) put_luma_hv_10_64x64_neon: 24384.0 ( 1.66x) put_luma_hv_10_128x128_c: 130604.2 ( 1.00x) put_luma_hv_10_128x128_neon: 99746.6 ( 1.31x) put_luma_hv_12_4x4_c: 638.2 ( 1.00x) put_luma_hv_12_8x8_c: 1074.6 ( 1.00x) put_luma_hv_12_8x8_neon: 482.6 ( 2.23x) put_luma_hv_12_16x16_c: 3094.0 ( 1.00x) put_luma_hv_12_16x16_neon: 1602.5 ( 1.93x) put_luma_hv_12_32x32_c: 10034.4 ( 1.00x) put_luma_hv_12_32x32_neon: 5843.3 ( 1.72x) put_luma_hv_12_64x64_c: 40447.5 ( 1.00x) put_luma_hv_12_64x64_neon: 24377.2 ( 1.66x) put_luma_hv_12_128x128_c: 130610.4 ( 1.00x) put_luma_hv_12_128x128_neon: 99765.8 ( 1.31x)	2026-03-04 12:53:16 +00:00
Jun Zhao	7e7d69632d	lavc/hevc: optimize qpel H-pass for width>=16 with byte-domain widening multiply Rewrite ff_hevc_put_hevc_qpel_h16_8_neon and h32 to use byte-domain widening multiply (umull/umlal/umlsl via calc_qpelb/calc_qpelb2 macros) instead of the previous int16-domain approach (uxtl + mul/mla). The byte-domain approach eliminates the uxtl expansion step and halves the ext stride (1 byte vs 2 bytes per tap), reducing per-row instruction count from ~32 to ~23. The functions are also inlined, removing bl/ret call overhead. This benefits all HV-path callers (hv/uni_hv/bi_hv/uni_w_hv/bi_w_hv) at widths 16/32/48/64. checkasm benchmarks on Apple M4 (5-run average): H-pass standalone (NEON): h16: 34.0 -> 24.4 cycles (1.39x speedup) h32: 132.0 -> 95.0 cycles (1.39x speedup) h64: 521.8 -> 373.9 cycles (1.40x speedup) HV compound paths geometric mean speedup (NEON, width >= 16): qpel_hv: 1.144x (4 functions) qpel_bi_hv: 1.158x (4 functions) qpel_uni_hv: 1.188x (4 functions) qpel_uni_w_hv: 1.158x (3 functions) Overall: 1.162x (15 functions) VVC qpel h16/h32 are separated into self-contained functions retaining the int16-domain approach, as VVC filters have arbitrary coefficients incompatible with the hardcoded sign pattern in calc_qpelb. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-03 12:04:14 +00:00
Jun Zhao	23b7005d98	lavc/vvc: remove duplicate 'mov mx, x30' in VVC qpel h16/h32 The VVC qpel h16 and h32 functions had a redundant 'mov mx, x30' instruction. The first one was placed before vvc_load_filter had finished using mx (the filter pointer argument), making it a dead store immediately overwritten by the second 'mov mx, x30'. Remove the first instance and reorder so that 'sub src, src, #3' comes before 'mov mx, x30', ensuring the filter pointer in mx is fully consumed by vvc_load_filter before being overwritten with the link register. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-03 12:04:14 +00:00
Andreas Rheinhardt	dc65dcec22	avcodec/vvc/inter: Combine offsets early For bi-predicted weighted averages, only the sum of the two offsets is ever used, so add the two early. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-02-25 12:08:33 +01:00
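As a rough illustration of why only the sum matters: in the explicit weighted bi-prediction formula used by HEVC and VVC, the two offsets appear only as `o0 + o1`. The sketch below is not the FFmpeg code. The function and parameter names are hypothetical, clipping to the pixel range is omitted, and an arithmetic right shift for negative values is assumed.

```c
/* Weighted bi-prediction with the two offsets passed separately. */
static int bi_weighted_separate(int p0, int p1, int w0, int w1,
                                int o0, int o1, int log2_wd)
{
    /* Multiplication instead of << avoids shifting a negative value. */
    int rounding = (o0 + o1 + 1) * (1 << log2_wd);
    return (p0 * w0 + p1 * w1 + rounding) >> (log2_wd + 1);
}

/* Same computation with the offsets summed once by the caller. */
static int bi_weighted_combined(int p0, int p1, int w0, int w1,
                                int offset_sum, int log2_wd)
{
    int rounding = (offset_sum + 1) * (1 << log2_wd);
    return (p0 * w0 + p1 * w1 + rounding) >> (log2_wd + 1);
}
```

Both functions return the same value whenever `offset_sum == o0 + o1`, so the per-sample code needs only one offset parameter.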
Jun Zhao	27dd2f1c70	lavc/hevc: fix missing # in ldrsw immediate offset The ldrsw instruction requires immediate offset with # prefix. This fixes the syntax error introduced in commit `26752368f0` (aarch64/h26x: Add put_hevc_pel_bi_w_pixels) where the load_bi_w_pixels_param macro was added. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-02-05 09:13:22 +08:00
Zhao Zhili	e250854ecf	aarch64/h264pred: disable inefficient functions These assembly optimizations have been identified as "performance regressions." Due to advancements in modern CPU micro-architectures and compiler optimization the C implementations now consistently outperform these handwritten routines. Test Name A55-clang M1 A76-gcc-14 A510-clang A715-clang X3-clang -------------------------------------------------------------------------------------------------------------------- pred8x8_dc_8_neon 55.9 ( 0.79x)! 0.2 ( 0.31x)! 35.7 ( 0.63x)! 98.3 ( 0.37x)! 35.9 ( 0.45x)! 33.6 ( 0.38x)! pred8x8_dc_10_neon 57.0 ( 1.04x) 0.3 ( 0.36x)! 35.9 ( 0.94x)! 98.2 ( 0.53x)! 35.8 ( 0.58x)! 33.2 ( 0.50x)! pred8x8_dc_128_8_neon 26.0 ( 0.69x)! 0.1 ( 0.43x)! 15.3 ( 0.73x)! 46.4 ( 0.36x)! 10.6 ( 0.48x)! 10.3 ( 1.09x) pred8x8_dc_128_10_neon 25.3 ( 0.99x)! 0.1 ( 0.42x)! 19.3 ( 0.48x)! 44.5 ( 0.42x)! 10.0 ( 0.61x)! 11.0 ( 1.00x) pred8x8_left_dc_8_neon 46.9 ( 0.72x)! 0.2 ( 0.26x)! 30.2 ( 0.49x)! 71.4 ( 0.39x)! 29.8 ( 0.35x)! 26.5 ( 0.44x)! pred8x8_left_dc_10_neon 45.4 ( 0.82x)! 0.2 ( 0.29x)! 28.1 ( 0.67x)! 70.2 ( 0.47x)! 30.0 ( 0.38x)! 26.5 ( 0.43x)! pred16x16_dc_8_neon 74.4 ( 1.34x) 0.3 ( 0.62x)! 44.7 ( 0.89x)! 128.0 ( 0.79x)! 48.5 ( 0.67x)! 39.4 ( 0.71x)! pred16x16_dc_128_8_neon 37.9 ( 0.79x)! 0.1 ( 0.60x)! 20.1 ( 0.80x)! 41.8 ( 0.46x)! 16.2 ( 0.81x)! 12.8 ( 0.95x)! pred16x16_left_dc_8_neon 69.9 ( 1.19x) 0.3 ( 0.46x)! 49.6 ( 0.54x)! 116.8 ( 0.62x)! 52.8 ( 0.45x)! 44.2 ( 0.51x)! pred8x8_hori_8_neon 30.6 ( 1.39x) 0.1 ( 0.45x)! 19.4 ( 0.81x)! 71.0 ( 0.50x)! 15.9 ( 0.55x)! 12.2 ( 0.94x)! pred8x8_hori_10_neon* 29.3 ( 1.82x) 0.1 ( 0.59x)! 18.5 ( 1.56x) 68.9 ( 0.64x)! 15.8 ( 0.62x)! 11.8 ( 0.97x)! pred8x8_top_dc_8_neon 35.8 ( 0.96x)! 0.1 ( 0.59x)! 16.8 ( 0.81x)! 58.9 ( 0.44x)! 11.3 ( 0.89x)! 11.4 ( 0.99x)! pred8x8_top_dc_10_neon 37.4 ( 1.24x) 0.1 ( 0.92x)! 20.4 ( 0.81x)! 59.5 ( 0.69x)! 10.5 ( 1.48x) 11.8 ( 1.02x) pred8x8_vertical_8_neon 18.3 ( 1.08x) 0.1 ( 0.54x)! 12.8 ( 0.89x)! 
37.2 ( 0.40x)! 8.3 ( 0.77x)! 11.2 ( 1.00x) pred8x8_vertical_10_neon 19.0 ( 1.24x) 0.1 ( 0.55x)! 15.3 ( 0.62x)! 39.7 ( 0.50x)! 8.2 ( 0.91x)! 11.1 ( 0.99x)! - pred8x8_horizontal_10 also underperforms on new architectures, but useful on A55 and A76. Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2026-02-04 09:06:37 +00:00
Zhao Zhili	f54841d375	avcodec/aarch64: add pngdsp Test Name A55-gcc-11 M1-clang A76-gcc-12 A510-clang X3-clang ------------------------------------------------------------------------------------------------------------------- add_bytes_l2_4096_neon 1807.2 ( 2.01x) 1.6 ( 1.94x) 333.0 ( 6.35x) 1058.2 ( 2.34x) 214.3 ( 1.99x) add_paeth_prediction_3_neon 33036.1 ( 2.41x) 145.1 ( 1.66x) 20443.3 ( 1.97x) 35225.1 ( 1.23x) 19420.8 ( 1.05x) add_paeth_prediction_4_neon 24368.6 ( 3.26x) 106.7 ( 2.01x) 15163.8 ( 2.77x) 26454.7 ( 1.62x) 14319.0 ( 1.35x) add_paeth_prediction_6_neon 17900.6 ( 4.44x) 72.0 ( 2.70x) 10214.3 ( 4.20x) 18296.9 ( 2.27x) 9693.1 ( 1.97x) add_paeth_prediction_8_neon 12615.4 ( 6.31x) 54.1 ( 2.58x) 7706.0 ( 5.45x) 13733.3 ( 2.94x) 7272.6 ( 2.63x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2026-02-04 12:05:35 +08:00
Martin Storsjö	f74c551eaa	aarch64: Fix indentation of a few instructions This file is exempt from the indent checker script, as there are a few other bits in it that the script wants to reformat into slightly worse form, or which might not warrant being reformatted. But these instructions should indeed be indented this way.	2026-01-30 05:21:27 +00:00
Andreas Rheinhardt	bf4d5037b4	avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names These names are a remnant of dsputil when all the DSP functions from all codecs were part of DSPcontext. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Sean McGovern <gseanmcg@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:25 +01:00
Jun Zhao	8966101fa6	lavc/hevc: add aarch64 neon for 12-bit dequant Implement NEON optimization for HEVC dequant at 12-bit depth. For 12-bit: shift = 15 - 12 - log2_size = 3 - log2_size. When shift is negative, we use shl (shift left) instead of srshr. Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_12_c: 9.9 ( 1.00x) hevc_dequant_4x4_12_neon: 5.7 ( 1.74x) hevc_dequant_8x8_12_c: 1.7 ( 1.00x) hevc_dequant_8x8_12_neon: 1.3 ( 1.30x) hevc_dequant_16x16_12_c: 131.1 ( 1.00x) hevc_dequant_16x16_12_neon: 7.9 (16.52x) hevc_dequant_32x32_12_c: 69.7 ( 1.00x) hevc_dequant_32x32_12_neon: 28.4 ( 2.46x) Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
Jun Zhao	ce89d974c8	lavc/hevc: add aarch64 neon for 10-bit dequant Implement NEON optimization for HEVC dequant at 10-bit depth. For 10-bit: shift = 15 - 10 - log2_size = 5 - log2_size Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_10_c: 16.6 ( 1.00x) hevc_dequant_4x4_10_neon: 7.4 ( 2.23x) hevc_dequant_8x8_10_c: 39.7 ( 1.00x) hevc_dequant_8x8_10_neon: 7.5 ( 5.28x) hevc_dequant_16x16_10_c: 168.7 ( 1.00x) hevc_dequant_16x16_10_neon: 10.2 (16.56x) hevc_dequant_32x32_10_c: 1.9 ( 1.00x) hevc_dequant_32x32_10_neon: 1.9 ( 1.01x) Note: 32x32 shift=0 is identity transform (no-op), so NEON has no advantage over C which is also optimized away by the compiler. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
Jun Zhao	0886e50c6b	lavc/hevc: add aarch64 neon for 8-bit dequant Implement NEON optimization for HEVC dequant at 8-bit depth. The NEON implementation uses srshr (Signed Rounding Shift Right) which does both the add with offset and right shift in a single instruction. Optimization details: - 4x4 (16 coeffs): Single load-process-store sequence - 8x8 (64 coeffs): Fully unrolled, no loop overhead - 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency - 32x32 (1024 coeffs): Pipelined with all available NEON registers Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_8_c: 11.3 ( 1.00x) hevc_dequant_4x4_8_neon: 6.3 ( 1.78x) hevc_dequant_8x8_8_c: 33.9 ( 1.00x) hevc_dequant_8x8_8_neon: 6.6 ( 5.11x) hevc_dequant_16x16_8_c: 153.8 ( 1.00x) hevc_dequant_16x16_8_neon: 9.0 (17.02x) hevc_dequant_32x32_8_c: 78.1 ( 1.00x) hevc_dequant_32x32_8_neon: 31.9 ( 2.45x) Note on Performance Anomaly: The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8) is due to Clang auto-vectorizing only for sizes >= 32x32. Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2) Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
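The shift rule shared by the three dequant commits above (`shift = 15 - bit_depth - log2_size`) can be written as a scalar sketch. This is an illustration, not the FFmpeg C reference or the NEON code. The function name `dequant_sketch` is hypothetical, and an arithmetic right shift for negative coefficients is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of HEVC dequant for an N x N block, N = 1 << log2_size. */
static void dequant_sketch(int16_t *coeffs, int log2_size, int bit_depth)
{
    int shift = 15 - bit_depth - log2_size;
    int count = 1 << (2 * log2_size);

    assert(log2_size >= 2 && log2_size <= 5);

    if (shift > 0) {
        /* Add half the divisor, then shift right: the rounding shift that
         * srshr performs in a single instruction. */
        int offset = 1 << (shift - 1);
        for (int i = 0; i < count; i++)
            coeffs[i] = (int16_t)((coeffs[i] + offset) >> shift);
    } else if (shift < 0) {
        /* Negative shift: scale up instead (shl in the NEON version). */
        for (int i = 0; i < count; i++)
            coeffs[i] = (int16_t)(coeffs[i] * (1 << -shift));
    }
    /* shift == 0 (e.g. 32x32 at 10 bits): the block is left unchanged. */
}
```

For example, an 8-bit 4x4 block gives `shift = 5`, a 10-bit 32x32 block gives `shift = 0`, and a 12-bit 16x16 block gives `shift = -1`.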
Georgii Zagoruiko	8acdffa22c	aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit RPi4 (auto-vectorisation is on) put_luma_v_10_4x4_c: 303.3 ( 1.00x) put_luma_v_10_4x4_neon: 55.7 ( 5.45x) put_luma_v_10_8x8_c: 1106.7 ( 1.00x) put_luma_v_10_8x8_neon: 163.8 ( 6.76x) put_luma_v_10_16x16_c: 2242.1 ( 1.00x) put_luma_v_10_16x16_neon: 672.7 ( 3.33x) put_luma_v_10_32x32_c: 7057.3 ( 1.00x) put_luma_v_10_32x32_neon: 2731.3 ( 2.58x) put_luma_v_10_64x64_c: 25699.8 ( 1.00x) put_luma_v_10_64x64_neon: 12145.6 ( 2.12x) put_luma_v_10_128x128_c: 90694.6 ( 1.00x) put_luma_v_10_128x128_neon: 44862.4 ( 2.02x) put_luma_v_12_4x4_c: 304.4 ( 1.00x) put_luma_v_12_4x4_neon: 55.6 ( 5.47x) put_luma_v_12_8x8_c: 1107.4 ( 1.00x) put_luma_v_12_8x8_neon: 164.7 ( 6.72x) put_luma_v_12_16x16_c: 2235.8 ( 1.00x) put_luma_v_12_16x16_neon: 672.5 ( 3.32x) put_luma_v_12_32x32_c: 7049.2 ( 1.00x) put_luma_v_12_32x32_neon: 2731.6 ( 2.58x) put_luma_v_12_64x64_c: 25706.5 ( 1.00x) put_luma_v_12_64x64_neon: 12145.0 ( 2.12x) put_luma_v_12_128x128_c: 90672.5 ( 1.00x) put_luma_v_12_128x128_neon: 44857.1 ( 2.02x) Apple M4 (auto-vectorisation is on): put_luma_v_10_4x4_c: 25.6 ( 1.00x) put_luma_v_10_4x4_neon: 3.1 ( 8.18x) put_luma_v_10_8x8_c: 34.7 ( 1.00x) put_luma_v_10_8x8_neon: 10.5 ( 3.32x) put_luma_v_10_16x16_c: 103.9 ( 1.00x) put_luma_v_10_16x16_neon: 42.3 ( 2.45x) put_luma_v_10_32x32_c: 399.7 ( 1.00x) put_luma_v_10_32x32_neon: 161.8 ( 2.47x) put_luma_v_10_64x64_c: 1276.7 ( 1.00x) put_luma_v_10_64x64_neon: 840.1 ( 1.52x) put_luma_v_10_128x128_c: 4981.3 ( 1.00x) put_luma_v_10_128x128_neon: 3008.0 ( 1.66x) put_luma_v_12_4x4_c: 23.6 ( 1.00x) put_luma_v_12_4x4_neon: 2.0 (11.84x) put_luma_v_12_8x8_c: 31.8 ( 1.00x) put_luma_v_12_8x8_neon: 12.4 ( 2.55x) put_luma_v_12_16x16_c: 100.8 ( 1.00x) put_luma_v_12_16x16_neon: 44.9 ( 2.25x) put_luma_v_12_32x32_c: 331.1 ( 1.00x) put_luma_v_12_32x32_neon: 175.2 ( 1.89x) put_luma_v_12_64x64_c: 1227.1 ( 1.00x) put_luma_v_12_64x64_neon: 712.7 ( 1.72x) 
put_luma_v_12_128x128_c: 5149.1 ( 1.00x) put_luma_v_12_128x128_neon: 2809.3 ( 1.83x)	2026-01-08 17:35:55 +00:00
Zhao Zhili	840183d823	aarch64/hpeldsp_neon: fix out-of-bounds read Fix #21141 The performance improved a little bit. On A76: Before After put_pixels_tab[0][1]_neon: 32.4 ( 3.91x) 31.6 ( 3.99x) put_pixels_tab[0][3]_neon: 88.0 ( 4.50x) 74.6 ( 5.31x) put_pixels_tab[1][1]_neon: 33.5 ( 2.52x) 31.2 ( 2.71x) put_pixels_tab[1][3]_neon: 30.5 ( 3.61x) 21.7 ( 5.08x) On A55: Before After put_pixels_tab[0][1]_neon: 175.2 ( 2.41x) 138.7 ( 3.04x) put_pixels_tab[0][3]_neon: 334.3 ( 2.71x) 296.1 ( 3.07x) put_pixels_tab[1][1]_neon: 168.3 ( 1.78x) 94.1 ( 3.19x) put_pixels_tab[1][3]_neon: 112.3 ( 2.20x) 90.0 ( 2.74x)	2026-01-04 03:22:55 +00:00
Georgii Zagoruiko	f790de2a87	aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit RPi4 (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 282.8 ( 1.00x) put_luma_h_10_8x8_c: 1069.5 ( 1.00x) put_luma_h_10_8x8_neon: 207.5 ( 5.15x) put_luma_h_10_16x16_c: 1999.6 ( 1.00x) put_luma_h_10_16x16_neon: 777.5 ( 2.57x) put_luma_h_10_32x32_c: 6612.9 ( 1.00x) put_luma_h_10_32x32_neon: 3201.6 ( 2.07x) put_luma_h_10_64x64_c: 25059.0 ( 1.00x) put_luma_h_10_64x64_neon: 13623.5 ( 1.84x) put_luma_h_10_128x128_c: 91310.1 ( 1.00x) put_luma_h_10_128x128_neon: 50358.3 ( 1.81x) put_luma_h_12_4x4_c: 282.1 ( 1.00x) put_luma_h_12_8x8_c: 1068.4 ( 1.00x) put_luma_h_12_8x8_neon: 207.7 ( 5.14x) put_luma_h_12_16x16_c: 1998.0 ( 1.00x) put_luma_h_12_16x16_neon: 777.5 ( 2.57x) put_luma_h_12_32x32_c: 6612.0 ( 1.00x) put_luma_h_12_32x32_neon: 3201.6 ( 2.07x) put_luma_h_12_64x64_c: 25036.8 ( 1.00x) put_luma_h_12_64x64_neon: 13595.1 ( 1.84x) put_luma_h_12_128x128_c: 91305.8 ( 1.00x) put_luma_h_12_128x128_neon: 50359.7 ( 1.81x) Apple M2 Air (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 0.3 ( 1.00x) put_luma_h_10_8x8_c: 1.0 ( 1.00x) put_luma_h_10_8x8_neon: 0.4 ( 2.59x) put_luma_h_10_16x16_c: 2.9 ( 1.00x) put_luma_h_10_16x16_neon: 1.4 ( 2.01x) put_luma_h_10_32x32_c: 9.4 ( 1.00x) put_luma_h_10_32x32_neon: 5.8 ( 1.62x) put_luma_h_10_64x64_c: 35.6 ( 1.00x) put_luma_h_10_64x64_neon: 23.6 ( 1.51x) put_luma_h_10_128x128_c: 131.1 ( 1.00x) put_luma_h_10_128x128_neon: 92.6 ( 1.42x) put_luma_h_12_4x4_c: 0.3 ( 1.00x) put_luma_h_12_8x8_c: 1.0 ( 1.00x) put_luma_h_12_8x8_neon: 0.4 ( 2.58x) put_luma_h_12_16x16_c: 2.9 ( 1.00x) put_luma_h_12_16x16_neon: 1.4 ( 2.00x) put_luma_h_12_32x32_c: 9.4 ( 1.00x) put_luma_h_12_32x32_neon: 5.8 ( 1.61x) put_luma_h_12_64x64_c: 35.3 ( 1.00x) put_luma_h_12_64x64_neon: 23.3 ( 1.52x) put_luma_h_12_128x128_c: 131.2 ( 1.00x) put_luma_h_12_128x128_neon: 92.4 ( 1.42x)	2025-11-24 21:22:55 +00:00
Kacper Michajłow	9ad20839fb	avcodec/pixblockdsp: be consistent about restrict use in ff_{get,diff}_pixels Suppresses warnings about function pointer mismatch. Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2025-10-25 01:01:15 +02:00
Bin Peng	3115c0c0e6	lavc/aarch64: Fix addp overflow in ff_pred16x16_plane_neon_10 The mismatch between neon and C functions can be reproduced using the following bitstream and command line on aarch64 devices. wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264 ./ffmpeg -cpuflags 0 -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_ref ./ffmpeg -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon Signed-off-by: Bin Peng <pengbin@visionular.com>	2025-10-24 15:32:35 +00:00
Krzysztof Pyrkosz	03c054d43c	avcodec/aarch64/vvc: Implement dmvr_v_8 A72 dmvr_v_8_12x20_neon: 207.0 ( 4.15x) dmvr_v_8_20x12_neon: 170.4 ( 4.37x) dmvr_v_8_20x20_neon: 273.4 ( 4.58x) A53 dmvr_v_8_12x20_neon: 450.6 ( 4.21x) dmvr_v_8_20x12_neon: 342.8 ( 3.70x) dmvr_v_8_20x20_neon: 550.9 ( 3.79x)	2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz	56a638d836	avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon Before and after: A53: apply_bdof_8_16x8_neon: 2733.1 ( 4.88x) apply_bdof_8_16x16_neon: 5458.6 ( 4.86x) apply_bdof_10_16x8_neon: 2789.8 ( 4.64x) apply_bdof_10_16x16_neon: 5523.8 ( 4.68x) apply_bdof_12_16x8_neon: 2792.8 ( 4.58x) apply_bdof_12_16x16_neon: 5519.5 ( 4.63x) apply_bdof_8_16x8_neon: 2571.8 ( 5.12x) apply_bdof_8_16x16_neon: 5173.3 ( 5.12x) apply_bdof_10_16x8_neon: 2635.1 ( 4.87x) apply_bdof_10_16x16_neon: 5243.0 ( 4.89x) apply_bdof_12_16x8_neon: 2613.0 ( 4.89x) apply_bdof_12_16x16_neon: 5231.7 ( 4.90x) A78: apply_bdof_8_16x8_neon: 565.3 ( 8.43x) apply_bdof_8_16x16_neon: 1109.5 ( 8.60x) apply_bdof_10_16x8_neon: 568.2 ( 7.92x) apply_bdof_10_16x16_neon: 1114.1 ( 8.08x) apply_bdof_12_16x8_neon: 570.2 ( 7.87x) apply_bdof_12_16x16_neon: 1116.3 ( 8.03x) apply_bdof_8_16x8_neon: 541.4 ( 8.81x) apply_bdof_8_16x16_neon: 1065.9 ( 8.97x) apply_bdof_10_16x8_neon: 543.2 ( 8.32x) apply_bdof_10_16x16_neon: 1071.5 ( 8.39x) apply_bdof_12_16x8_neon: 544.2 ( 8.25x) apply_bdof_12_16x16_neon: 1074.1 ( 8.37x)	2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz	f1a155d975	avcodec/aarch64/vvc: Optimize dmvr_hv_10 Before and after on A53: dmvr_hv_10_12x20_neon: 1838.2 ( 3.02x) dmvr_hv_10_20x12_neon: 1330.2 ( 1.83x) dmvr_hv_10_20x20_neon: 2148.2 ( 1.85x) dmvr_hv_12_12x20_neon: 1839.2 ( 3.02x) dmvr_hv_12_20x12_neon: 1330.6 ( 1.83x) dmvr_hv_12_20x20_neon: 2147.2 ( 1.85x) dmvr_hv_10_12x20_neon: 1755.0 ( 3.17x) dmvr_hv_10_20x12_neon: 1165.8 ( 2.09x) dmvr_hv_10_20x20_neon: 1876.1 ( 2.12x) dmvr_hv_12_12x20_neon: 1754.4 ( 3.17x) dmvr_hv_12_20x12_neon: 1167.8 ( 2.09x) dmvr_hv_12_20x20_neon: 1878.8 ( 2.12x)	2025-09-21 19:39:27 +00:00
Georgii Zagoruiko	4fbacb3944	avcodec/aarch64/vvc: Optimised version of classify function. Macbook Air (M2): vvc_alf_classify_8x8_8_c: 2.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 1.0 ( 2.47x) vvc_alf_classify_8x8_10_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_10_neon: 0.9 ( 2.98x) vvc_alf_classify_8x8_12_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_12_neon: 0.9 ( 2.97x) vvc_alf_classify_16x16_8_c: 7.3 ( 1.00x) vvc_alf_classify_16x16_8_neon: 3.4 ( 2.12x) vvc_alf_classify_16x16_10_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_10_neon: 2.9 ( 1.47x) vvc_alf_classify_16x16_12_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_12_neon: 3.0 ( 1.44x) vvc_alf_classify_32x32_8_c: 13.7 ( 1.00x) vvc_alf_classify_32x32_8_neon: 10.7 ( 1.29x) vvc_alf_classify_32x32_10_c: 12.3 ( 1.00x) vvc_alf_classify_32x32_10_neon: 8.7 ( 1.42x) vvc_alf_classify_32x32_12_c: 12.2 ( 1.00x) vvc_alf_classify_32x32_12_neon: 8.7 ( 1.40x) vvc_alf_classify_64x64_8_c: 45.8 ( 1.00x) vvc_alf_classify_64x64_8_neon: 37.1 ( 1.23x) vvc_alf_classify_64x64_10_c: 41.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 32.8 ( 1.26x) vvc_alf_classify_64x64_12_c: 41.4 ( 1.00x) vvc_alf_classify_64x64_12_neon: 32.4 ( 1.28x) vvc_alf_classify_128x128_8_c: 163.7 ( 1.00x) vvc_alf_classify_128x128_8_neon: 138.3 ( 1.18x) vvc_alf_classify_128x128_10_c: 149.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 120.3 ( 1.24x) vvc_alf_classify_128x128_12_c: 148.7 ( 1.00x) vvc_alf_classify_128x128_12_neon: 119.4 ( 1.25x) RPi4 (Cortex-A72): vvc_alf_classify_8x8_8_c: 1251.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 700.7 ( 1.79x) vvc_alf_classify_8x8_10_c: 1141.9 ( 1.00x) vvc_alf_classify_8x8_10_neon: 659.7 ( 1.73x) vvc_alf_classify_8x8_12_c: 1075.8 ( 1.00x) vvc_alf_classify_8x8_12_neon: 658.7 ( 1.63x) vvc_alf_classify_16x16_8_c: 3574.1 ( 1.00x) vvc_alf_classify_16x16_8_neon: 1849.8 ( 1.93x) vvc_alf_classify_16x16_10_c: 3270.0 ( 1.00x) vvc_alf_classify_16x16_10_neon: 1786.1 ( 1.83x) vvc_alf_classify_16x16_12_c: 3271.7 ( 1.00x) vvc_alf_classify_16x16_12_neon: 1785.5 ( 1.83x) 
vvc_alf_classify_32x32_8_c: 12451.9 ( 1.00x) vvc_alf_classify_32x32_8_neon: 5984.3 ( 2.08x) vvc_alf_classify_32x32_10_c: 11428.9 ( 1.00x) vvc_alf_classify_32x32_10_neon: 5756.3 ( 1.99x) vvc_alf_classify_32x32_12_c: 11252.8 ( 1.00x) vvc_alf_classify_32x32_12_neon: 5755.7 ( 1.96x) vvc_alf_classify_64x64_8_c: 47625.5 ( 1.00x) vvc_alf_classify_64x64_8_neon: 21071.9 ( 2.26x) vvc_alf_classify_64x64_10_c: 44576.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 21544.7 ( 2.07x) vvc_alf_classify_64x64_12_c: 44600.5 ( 1.00x) vvc_alf_classify_64x64_12_neon: 21491.2 ( 2.08x) vvc_alf_classify_128x128_8_c: 192143.3 ( 1.00x) vvc_alf_classify_128x128_8_neon: 82387.6 ( 2.33x) vvc_alf_classify_128x128_10_c: 177583.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 81628.8 ( 2.18x) vvc_alf_classify_128x128_12_c: 177582.2 ( 1.00x) vvc_alf_classify_128x128_12_neon: 81625.1 ( 2.18x)	2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz	de25cb4603	avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x Before and after: A53: apply_bdof_8_8x16_neon: 3320.5 ( 4.02x) apply_bdof_10_8x16_neon: 3317.8 ( 3.90x) apply_bdof_12_8x16_neon: 3303.6 ( 3.91x) apply_bdof_8_8x16_neon: 3168.1 ( 4.23x) apply_bdof_10_8x16_neon: 3127.8 ( 4.13x) apply_bdof_12_8x16_neon: 3119.3 ( 4.18x) A72: apply_bdof_8_8x16_neon: 1827.4 ( 5.02x) apply_bdof_10_8x16_neon: 1838.5 ( 4.89x) apply_bdof_12_8x16_neon: 1841.1 ( 4.83x) apply_bdof_8_8x16_neon: 1691.6 ( 5.46x) apply_bdof_10_8x16_neon: 1695.9 ( 5.23x) apply_bdof_12_8x16_neon: 1695.4 ( 5.29x) A78 apply_bdof_8_8x16_neon: 648.9 ( 7.43x) apply_bdof_10_8x16_neon: 646.1 ( 7.04x) apply_bdof_12_8x16_neon: 643.8 ( 7.04x) apply_bdof_8_8x16_neon: 603.2 ( 7.97x) apply_bdof_10_8x16_neon: 604.1 ( 7.52x) apply_bdof_12_8x16_neon: 604.5 ( 7.52x)	2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz	7b21bde34c	avcodec/aarch64/vvc: Implemented dmvr_h_10 A78: dmvr_h_10_12x20_neon: 82.2 ( 6.49x) dmvr_h_10_20x12_neon: 69.9 ( 3.66x) dmvr_h_10_20x20_neon: 112.5 ( 3.74x) dmvr_h_12_12x20_neon: 81.4 ( 6.51x) dmvr_h_12_20x12_neon: 69.2 ( 3.74x) dmvr_h_12_20x20_neon: 110.2 ( 3.85x) A72: dmvr_h_10_12x20_neon: 234.1 ( 4.67x) dmvr_h_10_20x12_neon: 221.4 ( 3.48x) dmvr_h_10_20x20_neon: 356.9 ( 3.59x) dmvr_h_12_12x20_neon: 234.1 ( 4.67x) dmvr_h_12_20x12_neon: 221.5 ( 3.53x) dmvr_h_12_20x20_neon: 357.0 ( 3.64x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	189e841cfd	avcodec/aarch64/vvc: Implement dmvr_h_8 A78: dmvr_h_8_12x20_neon: 76.6 ( 4.31x) dmvr_h_8_20x12_neon: 65.8 ( 3.49x) dmvr_h_8_20x20_neon: 106.6 ( 3.62x) A72: dmvr_h_8_12x20_neon: 190.6 ( 4.40x) dmvr_h_8_20x12_neon: 171.1 ( 4.31x) dmvr_h_8_20x20_neon: 275.1 ( 4.50x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	fb4407797e	Replace uxtl with umull in dmvr_hv_8 Before and after on A78: dmvr_hv_8_12x20_neon: 205.3 ( 5.21x) dmvr_hv_8_20x12_neon: 171.8 ( 3.15x) dmvr_hv_8_20x20_neon: 282.7 ( 3.11x) dmvr_hv_8_12x20_neon: 172.7 ( 5.58x) dmvr_hv_8_20x12_neon: 133.3 ( 3.36x) dmvr_hv_8_20x20_neon: 214.6 ( 3.40x)	2025-09-05 07:20:15 +00:00
Zhao Zhili	6ce02bcc3a	avcodec/aarch64/vvc: Optimize apply_bdof Before this patch, prof_grad_filter calculated gh[0], gh[1], gv[0], gv[1] and saved them to the stack. derive_bdof_vx_vy loaded them from the stack and calculated gh[0] + gh[1], gv[0] + gv[1]. apply_bdof_min_block loaded them from the stack and calculated gh[0] - gh[1], gv[0] - gv[1]. This patch adds bdof_grad_filter, which calculates gh[0] + gh[1], gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1] and saves them to the stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the results directly. prof_grad_filter is kept for reuse by other functions in the future. Benchmark on rpi5 with gcc 12 Before After -------------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7431.4 ( 1.00x) \| 7371.7 ( 1.00x) apply_bdof_8_8x16_neon: \| 1175.4 ( 6.32x) \| 1036.3 ( 7.11x) apply_bdof_8_16x8_c: \| 7182.2 ( 1.00x) \| 7201.1 ( 1.00x) apply_bdof_8_16x8_neon: \| 1021.7 ( 7.03x) \| 879.9 ( 8.18x) apply_bdof_8_16x16_c: \| 14577.1 ( 1.00x) \| 14589.3 ( 1.00x) apply_bdof_8_16x16_neon: \| 2012.8 ( 7.24x) \| 1743.3 ( 8.37x) apply_bdof_10_8x16_c: \| 7292.4 ( 1.00x) \| 7308.5 ( 1.00x) apply_bdof_10_8x16_neon: \| 1156.3 ( 6.31x) \| 1045.3 ( 6.99x) apply_bdof_10_16x8_c: \| 7112.4 ( 1.00x) \| 7214.4 ( 1.00x) apply_bdof_10_16x8_neon: \| 1007.6 ( 7.06x) \| 904.8 ( 7.97x) apply_bdof_10_16x16_c: \| 14363.3 ( 1.00x) \| 14476.4 ( 1.00x) apply_bdof_10_16x16_neon: \| 1986.9 ( 7.23x) \| 1783.1 ( 8.12x) apply_bdof_12_8x16_c: \| 7433.3 ( 1.00x) \| 7374.7 ( 1.00x) apply_bdof_12_8x16_neon: \| 1155.9 ( 6.43x) \| 1040.8 ( 7.09x) apply_bdof_12_16x8_c: \| 7171.1 ( 1.00x) \| 7376.3 ( 1.00x) apply_bdof_12_16x8_neon: \| 1010.8 ( 7.09x) \| 899.4 ( 8.20x) apply_bdof_12_16x16_c: \| 14515.5 ( 1.00x) \| 14731.5 ( 1.00x) apply_bdof_12_16x16_neon: \| 1988.4 ( 7.30x) \| 1785.2 ( 8.25x)	2025-09-03 06:55:37 +00:00
Zhao Zhili	2e92417603	avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy Implement line tricks and pixel tricks. See comments in inter.S for details. Benchmark on rpi5 with gcc 12 Before After ----------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7375.5 ( 1.00x) \| 7473.8 ( 1.00x) apply_bdof_8_8x16_neon: \| 1875.1 ( 3.93x) \| 1135.8 ( 6.58x) apply_bdof_8_16x8_c: \| 7273.9 ( 1.00x) \| 7204.0 ( 1.00x) apply_bdof_8_16x8_neon: \| 1738.2 ( 4.18x) \| 1013.0 ( 7.11x) apply_bdof_8_16x16_c: \| 14744.9 ( 1.00x) \| 14712.6 ( 1.00x) apply_bdof_8_16x16_neon: \| 3446.7 ( 4.28x) \| 1997.7 ( 7.36x) apply_bdof_10_8x16_c: \| 7352.4 ( 1.00x) \| 7485.7 ( 1.00x) apply_bdof_10_8x16_neon: \| 1861.0 ( 3.95x) \| 1134.1 ( 6.60x) apply_bdof_10_16x8_c: \| 7330.5 ( 1.00x) \| 7232.8 ( 1.00x) apply_bdof_10_16x8_neon: \| 1747.2 ( 4.20x) \| 1002.6 ( 7.21x) apply_bdof_10_16x16_c: \| 14522.4 ( 1.00x) \| 14664.8 ( 1.00x) apply_bdof_10_16x16_neon: \| 3490.5 ( 4.16x) \| 1978.4 ( 7.41x) apply_bdof_12_8x16_c: \| 7389.0 ( 1.00x) \| 7380.1 ( 1.00x) apply_bdof_12_8x16_neon: \| 1861.3 ( 3.97x) \| 1134.0 ( 6.51x) apply_bdof_12_16x8_c: \| 7283.1 ( 1.00x) \| 7336.9 ( 1.00x) apply_bdof_12_16x8_neon: \| 1749.1 ( 4.16x) \| 1002.3 ( 7.32x) apply_bdof_12_16x16_c: \| 14580.7 ( 1.00x) \| 14502.7 ( 1.00x) apply_bdof_12_16x16_neon: \| 3472.9 ( 4.20x) \| 1978.3 ( 7.33x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-09-03 06:55:37 +00:00
Timo Rothenpieler	262d41c804	all: fix typos found by codespell	2025-08-03 13:48:47 +02:00
Andreas Rheinhardt	9b409ea1e6	configure: Factor mpegvideoencdsp out of mpegvideoenc This will allow relaxing the dependency on mpegvideoenc for several codecs. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-06-21 22:08:52 +02:00
Andreas Rheinhardt	20ddada2a3	avcodec/pixblockdsp: Improve 8 vs 16 bit check Before this commit, the input in get_pixels and get_pixels_unaligned has been treated inconsistently: - The generic code treated 9, 10, 12 and 14 bits as 16bit input (these bits correspond to what FFmpeg's dsputils supported), everything with <= 8 bits as 8 bit and everything else as 8 bit when used via AVDCT (which exposes these functions and purports to support up to 14 bits). - AARCH64, ARM, PPC, RISC-V and x86 ignore this AVDCT special case. - RISC-V also ignored the restriction to 9, 10, 12 and 14 for its 16bit check and treated everything > 8 bits as 16bit. - The mmi MIPS code treats everything as 8 bit when used via AVDCT (this is certainly broken); otherwise it checks for <= 8 bits. The msa MIPS code behaves like the generic code. This commit changes this to treat 9..16 bits as 16 bit input, everything else as 8 bit (the former because it makes sense, the latter to preserve the behaviour for external users). Note: The only internal user of AVDCT (the spp filter) always uses 8, 9 or 10 bits. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-05-31 01:25:27 +02:00
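The unified rule described in the commit above (9..16 bits use the 16-bit path, everything else the 8-bit path) amounts to a single range check. A minimal sketch follows. The helper name and its parameter are hypothetical and not taken from the FFmpeg sources.

```c
/* Hypothetical helper: returns nonzero when the input should be handled by
 * the 16-bit get_pixels path, following the rule stated in the commit. */
static int uses_16bit_path(int bits_per_raw_sample)
{
    return bits_per_raw_sample > 8 && bits_per_raw_sample <= 16;
}
```

Values outside 9..16, including unset or out-of-range ones, fall back to the 8-bit path, which matches the stated goal of preserving behaviour for external users.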
Zhao Zhili	26752368f0	aarch64/h26x: Add put_hevc_pel_bi_w_pixels On rpi5 (A76): put_hevc_pel_bi_w_pixels4_8_c: 90.0 ( 1.00x) put_hevc_pel_bi_w_pixels4_8_neon: 34.1 ( 2.64x) put_hevc_pel_bi_w_pixels6_8_c: 188.3 ( 1.00x) put_hevc_pel_bi_w_pixels6_8_neon: 73.5 ( 2.56x) put_hevc_pel_bi_w_pixels8_8_c: 327.1 ( 1.00x) put_hevc_pel_bi_w_pixels8_8_neon: 75.8 ( 4.32x) put_hevc_pel_bi_w_pixels12_8_c: 728.8 ( 1.00x) put_hevc_pel_bi_w_pixels12_8_neon: 186.1 ( 3.92x) put_hevc_pel_bi_w_pixels16_8_c: 1288.1 ( 1.00x) put_hevc_pel_bi_w_pixels16_8_neon: 268.5 ( 4.80x) put_hevc_pel_bi_w_pixels24_8_c: 2855.5 ( 1.00x) put_hevc_pel_bi_w_pixels24_8_neon: 723.8 ( 3.95x) put_hevc_pel_bi_w_pixels32_8_c: 5095.3 ( 1.00x) put_hevc_pel_bi_w_pixels32_8_neon: 1165.0 ( 4.37x) put_hevc_pel_bi_w_pixels48_8_c: 11521.5 ( 1.00x) put_hevc_pel_bi_w_pixels48_8_neon: 2856.0 ( 4.03x) put_hevc_pel_bi_w_pixels64_8_c: 21020.5 ( 1.00x) put_hevc_pel_bi_w_pixels64_8_neon: 4699.1 ( 4.47x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-04-29 15:24:14 +08:00
Zhao Zhili	39786f8cd5	aarch64/h26x: optimize sao_band_filter int8_t[] is enough for offset_table of 8 bit streams. On rpi5: Before After hevc_sao_band_8_8_c: 252.3 ( 1.00x) 252.3 ( 1.00x) hevc_sao_band_8_8_neon: 95.8 ( 2.63x) 61.0 ( 4.57x) hevc_sao_band_16_8_c: 875.2 ( 1.00x) 864.9 ( 1.00x) hevc_sao_band_16_8_neon: 317.5 ( 2.76x) 150.0 ( 6.26x) hevc_sao_band_32_8_c: 3853.5 ( 1.00x) 3871.6 ( 1.00x) hevc_sao_band_32_8_neon: 1222.3 ( 3.15x) 550.6 ( 7.39) hevc_sao_band_48_8_c: 8203.6 ( 1.00x) 8182.6 ( 1.00x) hevc_sao_band_48_8_neon: 2685.7 ( 3.05x) 1185.8 ( 7.36x) hevc_sao_band_64_8_c: 14023.0 ( 1.00x) 14038.9 ( 1.00x) hevc_sao_band_64_8_neon: 4783.2 ( 2.93x) 2078.4 ( 7.15x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-04-29 15:11:45 +08:00
Andreas Rheinhardt	a064d34a32	avcodec/mpegvideoenc: Add MPVEncContext Many of the fields of MpegEncContext (which is also used by decoders) are actually only used by encoders. Therefore this commit adds a new encoder-only structure and moves all of the encoder-only fields to it except for those which require more explicit synchronisation between the main slice context and the other slice contexts. This synchronisation is currently mainly provided by ff_update_thread_context() which simply copies most of the main slice context over the other slice contexts. Fields which are moved to the new MPVEncContext no longer participate in this (which is desired, because it is horrible and for the fields b) below wasteful) which means that some fields can only be moved when explicit synchronisation code is added in later commits. More explicitly, this commit moves the following fields: a) Fields not copied by ff_update_duplicate_context(): dct_error_sum and dct_count; the former does not need synchronisation, the latter is synchronised in merge_context_after_encode(). b) Fields which do not change after initialisation (these fields could also be put into MPVMainEncContext at the cost of an indirection to access them): lambda_table, adaptive_quant, {luma,chroma}_elim_threshold, new_pic, fdsp, mpvencdsp, pdsp, {p,b_forw,b_back,b_bidir_forw,b_bidir_back,b_direct,b_field}_mv_table, [pb]_field_select_table, mb_{type,var,mean}, mc_mb_var, {min,max}_qcoeff, {inter,intra}_quant_bias, ac_esc_length, the *_vlc_length fields, the q_{intra,inter,chroma_intra}_matrix{,16}, dct_offset, mb_info, mjpeg_ctx, rtp_mode, rtp_payload_size, encode_mb, all function pointers, mpv_flags, quantizer_noise_shaping, frame_reconstruction_bitfield, error_rate and intra_penalty. 
c) Fields which are already (re)set explicitly: The PutBitContexts pb, tex_pb, pb2; dquant, skipdct, encoding_error, the statistics fields {mv,i_tex,p_tex,misc,last}_bits and i_count; last_mv_dir, esc_pos (reset when writing the header). d) Fields which are only used by encoders not supporting slice threading for which synchronisation doesn't matter: esc3_level_length and the remaining mb_info fields. e) coded_score: This field is only really used when FF_MPV_FLAG_CBP_RD is set (which implies trellis) and even then it is only used for non-intra blocks. For these blocks dct_quantize_trellis_c() either sets coded_score[n] or returns a last_non_zero value of -1 in which case coded_score will be reset in encode_mb_internal(). Therefore no old values are ever used. The MotionEstContext has not been moved yet. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-03-26 04:08:33 +01:00
Krzysztof Pyrkosz	f9b8f30680	avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} This patch replaces integer widening with halving addition, and multi-step "emulated" rounding shift with a single asm instruction doing exactly that. Benchmarks before and after: A78 avg_8_64x64_neon: 2686.2 ( 6.12x) avg_8_128x128_neon: 10734.2 ( 5.88x) avg_10_64x64_neon: 2536.8 ( 5.40x) avg_10_128x128_neon: 10079.0 ( 5.22x) avg_12_64x64_neon: 2548.2 ( 5.38x) avg_12_128x128_neon: 10133.8 ( 5.19x) avg_8_64x64_neon: 897.8 (18.26x) avg_8_128x128_neon: 3608.5 (17.37x) avg_10_32x32_neon: 444.2 ( 8.51x) avg_10_64x64_neon: 1711.8 ( 8.00x) avg_12_64x64_neon: 1706.2 ( 8.02x) avg_12_128x128_neon: 7010.0 ( 7.46x) A72 avg_8_64x64_neon: 5823.4 ( 3.88x) avg_8_128x128_neon: 17430.5 ( 4.73x) avg_10_64x64_neon: 5228.1 ( 3.71x) avg_10_128x128_neon: 16722.2 ( 4.17x) avg_12_64x64_neon: 5379.1 ( 3.51x) avg_12_128x128_neon: 16715.7 ( 4.17x) avg_8_64x64_neon: 2006.5 (10.61x) avg_8_128x128_neon: 9158.7 ( 8.96x) avg_10_64x64_neon: 3357.7 ( 5.60x) avg_10_128x128_neon: 12411.7 ( 5.56x) avg_12_64x64_neon: 3317.5 ( 5.67x) avg_12_128x128_neon: 12358.5 ( 5.58x) A53 avg_8_64x64_neon: 8327.8 ( 5.18x) avg_8_128x128_neon: 31631.3 ( 5.34x) avg_10_64x64_neon: 8783.5 ( 4.98x) avg_10_128x128_neon: 32617.0 ( 5.25x) avg_12_64x64_neon: 8686.0 ( 5.06x) avg_12_128x128_neon: 32487.5 ( 5.25x) avg_8_64x64_neon: 6032.3 ( 7.17x) avg_8_128x128_neon: 22008.5 ( 7.69x) avg_10_64x64_neon: 7738.0 ( 5.68x) avg_10_128x128_neon: 27813.8 ( 6.14x) avg_12_64x64_neon: 7844.5 ( 5.60x) avg_12_128x128_neon: 26999.5 ( 6.34x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-07 15:51:20 +02:00
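The idea in commit f9b8f30680 (a halving addition plus one rounding shift instead of widening, adding an offset and shifting) can be checked with a scalar model. This is a sketch under assumptions, not the NEON code from the patch: the function names are made up for illustration, clipping to the pixel range is left out, and the shift values 7, 5 and 3 for 8, 10 and 12 bit are assumed rather than taken from the commit.

```c
#include <stdint.h>

/* Portable floor (arithmetic) right shift for signed values. */
int32_t asr(int32_t x, int n)
{
    return x >= 0 ? x >> n : ~(~x >> n);
}

/* Reference form: widen, add the rounding offset, then shift. */
int32_t avg_widening(int16_t a, int16_t b, int shift)
{
    return asr((int32_t)a + b + (1 << (shift - 1)), shift);
}

/* Halving add, then a single rounding shift by (shift - 1).
 * Requires shift >= 2. */
int32_t avg_halving(int16_t a, int16_t b, int shift)
{
    int32_t half = asr((int32_t)a + b, 1);
    return asr(half + (1 << (shift - 2)), shift - 1);
}

/* Compare both forms over a grid of 16-bit inputs and
 * return the number of mismatches found. */
int count_mismatches(int shift)
{
    int mismatches = 0;
    for (int a = -32768; a <= 32767; a += 37)
        for (int b = -32768; b <= 32767; b += 41)
            mismatches += avg_widening((int16_t)a, (int16_t)b, shift) !=
                          avg_halving((int16_t)a, (int16_t)b, shift);
    return mismatches;
}
```

The two forms agree because nested floor divisions by 2 and by 2^(shift-1) equal one floor division by 2^shift, so the halved sum never needs more than 16 bits.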
Zhao Zhili	3e9777dc75	aarch64/hevcdsp_idct_neon: Add implementation for idct dc 12 Reduce binary size at the same time. The performance compared to clang -O3 is the same. Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-03-04 17:01:58 +08:00
Zhao Zhili	5977bff569	aarch64/hevcdsp_idct_neon: Optimize idct dc clang does better than the assembly code before the patch, especially for small size: hevc_idct_4x4_dc_8_c: 11.2 ( 1.00x) hevc_idct_4x4_dc_8_neon: 15.5 ( 0.73x) hevc_idct_4x4_dc_10_c: 12.0 ( 1.00x) hevc_idct_4x4_dc_10_neon: 15.2 ( 0.79x) hevc_idct_8x8_dc_8_c: 13.2 ( 1.00x) hevc_idct_8x8_dc_8_neon: 18.2 ( 0.73x) hevc_idct_8x8_dc_10_c: 13.5 ( 1.00x) hevc_idct_8x8_dc_10_neon: 17.2 ( 0.78x) hevc_idct_16x16_dc_8_c: 41.8 ( 1.00x) hevc_idct_16x16_dc_8_neon: 37.8 ( 1.11x) hevc_idct_16x16_dc_10_c: 41.8 ( 1.00x) hevc_idct_16x16_dc_10_neon: 37.8 ( 1.11x) hevc_idct_32x32_dc_8_c: 130.2 ( 1.00x) hevc_idct_32x32_dc_8_neon: 132.2 ( 0.98x) hevc_idct_32x32_dc_10_c: 130.2 ( 1.00x) hevc_idct_32x32_dc_10_neon: 132.2 ( 0.98x) This patch basically clones what the compiler does, so the performance is the same. Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-03-04 17:01:58 +08:00
Krzysztof Pyrkosz	71a91485fa	avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr This patch replaces blocks of instructions performing rounding and widening shifts with one-liners achieving the same result. Before and after on A78 dmvr_8_12x20_neon: 86.2 ( 6.90x) dmvr_8_20x12_neon: 94.8 ( 5.93x) dmvr_8_20x20_neon: 141.5 ( 6.50x) dmvr_12_12x20_neon: 158.0 ( 3.76x) dmvr_12_20x12_neon: 151.2 ( 3.73x) dmvr_12_20x20_neon: 247.2 ( 3.71x) dmvr_hv_8_12x20_neon: 423.2 ( 3.75x) dmvr_hv_8_20x12_neon: 434.0 ( 3.69x) dmvr_hv_8_20x20_neon: 706.0 ( 3.69x) dmvr_8_12x20_neon: 77.2 ( 7.70x) dmvr_8_20x12_neon: 66.5 ( 8.49x) dmvr_8_20x20_neon: 92.2 ( 9.90x) dmvr_12_12x20_neon: 80.2 ( 7.38x) dmvr_12_20x12_neon: 58.2 ( 9.59x) dmvr_12_20x20_neon: 90.0 (10.15x) dmvr_hv_8_12x20_neon: 369.0 ( 4.34x) dmvr_hv_8_20x12_neon: 355.8 ( 4.49x) dmvr_hv_8_20x20_neon: 574.2 ( 4.51x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-04 10:35:31 +02:00
Krzysztof Pyrkosz	e8d4c55987	avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only a^2, b^2 and 2ab in each iteration and derive the latter parts from these three at the end. Before and after: A78 ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x) ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x) A72 ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x) ac3_sum_square_bufferfly_int32_neon: 527.3 ( 1.92x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-02 01:17:53 +02:00
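The algebra behind commit e8d4c55987 is that the sums of (a+b)^2 and (a-b)^2 follow from the sums of a^2, b^2 and 2ab, so the loop needs three accumulators instead of four. A minimal scalar sketch, assuming 64-bit accumulators and hypothetical function names (this is not the NEON routine from the patch):

```c
#include <stdint.h>

/* Direct form: four multiply-accumulates per coefficient pair. */
void sum_square_butterfly_direct(int64_t sum[4], const int32_t *a,
                                 const int32_t *b, int len)
{
    sum[0] = sum[1] = sum[2] = sum[3] = 0;
    for (int i = 0; i < len; i++) {
        int64_t x = a[i], y = b[i];
        sum[0] += x * x;
        sum[1] += y * y;
        sum[2] += (x + y) * (x + y);
        sum[3] += (x - y) * (x - y);
    }
}

/* Derived form: accumulate a^2, b^2 and 2ab in the loop, then
 * derive the (a+b)^2 and (a-b)^2 sums once at the end. */
void sum_square_butterfly_derived(int64_t sum[4], const int32_t *a,
                                  const int32_t *b, int len)
{
    int64_t aa = 0, bb = 0, ab2 = 0;
    for (int i = 0; i < len; i++) {
        int64_t x = a[i], y = b[i];
        aa  += x * x;
        bb  += y * y;
        ab2 += 2 * x * y;
    }
    sum[0] = aa;
    sum[1] = bb;
    sum[2] = aa + bb + ab2;  /* sum of (a+b)^2 */
    sum[3] = aa + bb - ab2;  /* sum of (a-b)^2 */
}
```

Both functions fill `sum[0..3]` with the same values as long as the accumulators do not overflow.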
Krzysztof Pyrkosz	9fb97215df	avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon This change removes one extra floating point operation and simplifies load operations at the beginning of the loop by using dedicated register for each of the 5 pointers and interleaving it with calculations. The first case seems to be a bit slower, but the performance increase is substantial in the other two. A78 before: postfilter_15_neon: 1684.8 ( 4.23x) postfilter_512_neon: 1395.5 ( 5.10x) postfilter_1022_neon: 1357.0 ( 5.25x) After: postfilter_15_neon: 1742.2 ( 4.09x) postfilter_512_neon: 1169.8 ( 6.09x) postfilter_1022_neon: 1160.0 ( 6.12x) A72 before: postfilter_15_neon: 3144.8 ( 2.39x) postfilter_512_neon: 3141.2 ( 2.39x) postfilter_1022_neon: 3230.0 ( 2.33x) After: postfilter_15_neon: 2847.8 ( 2.64x) postfilter_512_neon: 2877.8 ( 2.61x) postfilter_1022_neon: 2837.2 ( 2.65x) x13s before: postfilter_15_neon: 1615.4 ( 2.61x) postfilter_512_neon: 963.1 ( 4.39x) postfilter_1022_neon: 963.6 ( 4.39x) After: postfilter_15_neon: 1749.6 ( 2.41x) postfilter_512_neon: 707.1 ( 5.97x) postfilter_1022_neon: 706.1 ( 5.99x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-10 14:55:16 +02:00
Krzysztof Pyrkosz	83e4b068d9	avcodec/aarch64/aacencdsp: NEON implementation This patch supplies handwritten NEON code for AAC. The benchmarks below were collected by invoking these two commands on each of my boards, A78, A72 and Thinkpad x13s: 1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12 2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is speed without the patch, second, with) - A78 abs_pow34_c: 4161.5 ( 1.00x) abs_pow34_neon: 3586.2 ( 1.16x) quant_bands_signed_c: 5548.0 ( 1.00x) quant_bands_signed_neon: 1126.8 ( 4.92x) quant_bands_unsigned_c: 3979.2 ( 1.00x) quant_bands_unsigned_neon: 800.2 ( 4.97x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=71.6x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=82.3x - A72 abs_pow34_c: 15362.2 ( 1.00x) abs_pow34_neon: 15382.5 ( 1.00x) quant_bands_signed_c: 9926.5 ( 1.00x) quant_bands_signed_neon: 2467.8 ( 4.02x) quant_bands_unsigned_c: 5469.8 ( 1.00x) quant_bands_unsigned_neon: 2089.5 ( 2.62x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=34.3x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=37.8 - x13s abs_pow34_c: 2413.4 ( 1.00x) abs_pow34_neon: 1796.2 ( 1.34x) quant_bands_signed_c: 2968.9 ( 1.00x) quant_bands_signed_neon: 675.6 ( 4.39x) quant_bands_unsigned_c: 2311.9 ( 1.00x) quant_bands_unsigned_neon: 477.1 ( 4.85x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 135x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 159x Signed-off-by: Martin Storsjö <martin@martin.st>	2025-01-28 10:44:40 +02:00
Janne Grunau	430c38f698	aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter This reduces the amount the horizontal filters read beyond the filter width to a consistent 1 pixel. The data is not used so this is usually not noticeable. It becomes a problem when the application allocates frame buffers only for the aligned picture size and the end of it is at a page boundary. This happens for picture sizes which are a multiple of the page size like 1280x640. The frame buffer allocation is most likely done via mmap + MAP_ANONYMOUS, so the start and end of the buffer are page aligned and the previous and next pages are not necessarily mapped. Under these conditions, as seen with Firefox, a read beyond the end of the buffer results in a segfault. After the over-read is reduced to a single pixel it's reasonable to use VP9's emulated edge motion compensation for this. Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185 Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net> Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2025-01-03 17:53:46 -05:00
Zhao Zhili	952508ae05	aarch64/vvc: Add apply_bdof Test on rpi 5 with gcc 12: apply_bdof_8_8x16_c: 7315.2 ( 1.00x) apply_bdof_8_8x16_neon: 1876.8 ( 3.90x) apply_bdof_8_16x8_c: 7170.5 ( 1.00x) apply_bdof_8_16x8_neon: 1752.8 ( 4.09x) apply_bdof_8_16x16_c: 14695.2 ( 1.00x) apply_bdof_8_16x16_neon: 3490.5 ( 4.21x) apply_bdof_10_8x16_c: 7371.5 ( 1.00x) apply_bdof_10_8x16_neon: 1863.8 ( 3.96x) apply_bdof_10_16x8_c: 7172.0 ( 1.00x) apply_bdof_10_16x8_neon: 1766.0 ( 4.06x) apply_bdof_10_16x16_c: 14551.5 ( 1.00x) apply_bdof_10_16x16_neon: 3576.0 ( 4.07x) apply_bdof_12_8x16_c: 7236.5 ( 1.00x) apply_bdof_12_8x16_neon: 1863.8 ( 3.88x) apply_bdof_12_16x8_c: 7316.5 ( 1.00x) apply_bdof_12_16x8_neon: 1758.8 ( 4.16x) apply_bdof_12_16x16_c: 14691.2 ( 1.00x) apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)	2024-12-21 11:54:44 +08:00
Martin Storsjö	2bb00ef59c	aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions Explicitly use ldur for unaligned offsets; newer versions of armasm64 implicitly convert ldr to ldur as necessary, but older versions require it explicitly written out. This fixes these build errors: ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) : error A2518: operand 2: Memory offset must be aligned ldr s5, [x1, #1] ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) : error A2518: operand 2: Memory offset must be aligned ldr d7, [x1, #2] Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-18 13:45:09 +02:00
Bin Peng	72a3656e84	lavc/aarch64: Fix ff_pred16x16_plane_neon_10 Fix test failure on aarch64: ./tests/checkasm/checkasm --test=h264pred 367840 Signed-off-by: Peng Bin <pengbin@visionular.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-17 14:50:29 +02:00
Bin Peng	decc9e643c	lavc/aarch64: Fix ff_pred8x8_plane_neon_10 Fix test failure on aarch64: ./tests/checkasm/checkasm --test=h264pred 479612 The mismatch between neon and C functions can also be reproduced using the following bitstream and command line. wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264 ./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref ./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon Signed-off-by: Bin Peng <pengbin@visionular.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-17 14:50:29 +02:00
Zhao Zhili	40feba5f77	aarch64/vvc: Fix clip in alf Fix test failure: ./tests/checkasm/checkasm --test=vvc_alf 3607569773	2024-12-10 21:00:47 +08:00
Zhao Zhili	91436638de	aarch64/vvc: Use faster clip operation Replace sqxtn+smin+smax by sqxtun+umin.	2024-12-10 21:00:47 +08:00
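The clip change in commit 91436638de (sqxtn+smin+smax replaced by sqxtun+umin) can be modelled in scalar C. The sketch below is an illustration under assumptions: the helper names are hypothetical, each helper stands in for one step of the instruction sequence, and the two forms are only expected to agree when the upper bound fits in a signed 16-bit value, as pixel maxima for 8, 10 and 12 bit do.

```c
#include <stdint.h>

/* Models a signed saturating narrow from 32 to 16 bits. */
int16_t sat_narrow_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Models a signed-to-unsigned saturating narrow from 32 to 16 bits:
 * negative inputs become 0, large inputs become UINT16_MAX. */
uint16_t sat_narrow_u16(int32_t v)
{
    if (v > UINT16_MAX) return UINT16_MAX;
    if (v < 0) return 0;
    return (uint16_t)v;
}

/* Three steps: signed narrow, clamp to max, clamp to zero. */
uint16_t clip_three_step(int32_t v, int16_t max)
{
    int16_t n = sat_narrow_s16(v);
    if (n > max) n = max;
    if (n < 0) n = 0;
    return (uint16_t)n;
}

/* Two steps: the unsigned narrow already clamps at zero,
 * so only the upper bound is left to apply. */
uint16_t clip_two_step(int32_t v, uint16_t max)
{
    uint16_t n = sat_narrow_u16(v);
    return n < max ? n : max;
}
```

The saving comes from the narrowing step doing the lower clamp for free, which removes one operation per vector.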

481 commits