ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-06-04 14:40:26 +00:00

Author	SHA1	Message	Date
Jeongkeun Kim	4ea59d5665	avcodec/aarch64: add NEON DCA LFE FIR filter functions Port lfe_fir0_float and lfe_fir1_float to AArch64 NEON. These polyphase FIR interpolation filters have an x86 SSE/AVX path but no AArch64 equivalent, falling back to scalar C. The inner loop computes two dot products per output pair. Precomputing a reversed LFE sample vector before the inner loop avoids per-iteration shuffle overhead. Benchmarks on AWS Graviton3 (Neoverse V1, c7g.xlarge): lfe_fir0_float: C 5902.0 cycles -> NEON 2135.0 cycles (2.77x) lfe_fir1_float: C 2836.3 cycles -> NEON 1527.8 cycles (1.86x) Measured with: taskset -c 0 ./tests/checkasm/checkasm --test=dcadsp --bench, 3-run average, Ubuntu 22.04 (kernel 6.8.0-1052-aws), perf_event_paranoid=0. Signed-off-by: Jeongkeun Kim <variety0724@gmail.com>	2026-04-27 20:13:23 +00:00
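The "two dot products per output pair" structure mentioned in 4ea59d5665 can be pictured with a scalar sketch. This is only an illustration under stated assumptions: the function name `lfe_fir_block`, the small parameter values used to exercise it, and the exact coefficient indexing (one output reading the table forward, its partner reading it backward) are hypothetical and are not taken from the commit. The `rev[]` array plays the role of the reversed LFE sample vector that is built once before the inner loop.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TAPS 8

/* Produce `factor` interpolated outputs for one LFE input position.
 * cur points at the newest LFE sample; cur[-k] are older samples.
 * coeff holds (factor / 2) * ntaps coefficients (assumed layout). */
static void lfe_fir_block(float *out, const int32_t *cur,
                          const float *coeff, int factor, int ntaps)
{
    float rev[MAX_TAPS];
    int half = factor / 2;
    int last = half * ntaps - 1;

    assert(ntaps <= MAX_TAPS);

    /* Reverse the sample history once, outside the hot loop. */
    for (int k = 0; k < ntaps; k++)
        rev[k] = (float)cur[-k];

    for (int j = 0; j < half; j++) {
        float a = 0.0f, b = 0.0f;
        /* Two dot products over the same samples: one walks the
         * coefficient table forward, the other backward. */
        for (int k = 0; k < ntaps; k++) {
            a += coeff[j * ntaps + k] * rev[k];
            b += coeff[last - j * ntaps - k] * rev[k];
        }
        out[j]        = a;
        out[half + j] = b;
    }
}
```

Because both dot products consume the same `rev[]` values, a vector implementation can load the samples once and reuse them for both accumulators.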
Georgii Zagoruiko	1ced59326a	aarch64/vvc: Optimisations of put_chroma_hv() functions for 10/12-bit Apple M4: put_chroma_hv_10_2x2_c: 9.1 ( 1.00x) put_chroma_hv_10_4x4_c: 20.1 ( 1.00x) put_chroma_hv_10_8x8_c: 35.6 ( 1.00x) put_chroma_hv_10_8x8_neon: 15.4 ( 2.31x) put_chroma_hv_10_16x16_c: 113.7 ( 1.00x) put_chroma_hv_10_16x16_neon: 57.0 ( 1.99x) put_chroma_hv_10_32x32_c: 406.9 ( 1.00x) put_chroma_hv_10_32x32_neon: 225.7 ( 1.80x) put_chroma_hv_10_64x64_c: 1498.8 ( 1.00x) put_chroma_hv_10_64x64_neon: 876.2 ( 1.71x) put_chroma_hv_10_128x128_c: 5757.0 ( 1.00x) put_chroma_hv_10_128x128_neon: 3446.6 ( 1.67x) put_chroma_hv_12_2x2_c: 9.9 ( 1.00x) put_chroma_hv_12_4x4_c: 19.2 ( 1.00x) put_chroma_hv_12_8x8_c: 36.1 ( 1.00x) put_chroma_hv_12_8x8_neon: 17.9 ( 2.02x) put_chroma_hv_12_16x16_c: 112.2 ( 1.00x) put_chroma_hv_12_16x16_neon: 55.6 ( 2.02x) put_chroma_hv_12_32x32_c: 416.6 ( 1.00x) put_chroma_hv_12_32x32_neon: 224.3 ( 1.86x) put_chroma_hv_12_64x64_c: 1464.8 ( 1.00x) put_chroma_hv_12_64x64_neon: 860.1 ( 1.70x) put_chroma_hv_12_128x128_c: 5776.8 ( 1.00x) put_chroma_hv_12_128x128_neon: 3445.2 ( 1.68x) RPi5: put_chroma_hv_10_2x2_c: 118.5 ( 1.00x) put_chroma_hv_10_4x4_c: 190.6 ( 1.00x) put_chroma_hv_10_8x8_c: 303.1 ( 1.00x) put_chroma_hv_10_8x8_neon: 172.6 ( 1.76x) put_chroma_hv_10_16x16_c: 1036.1 ( 1.00x) put_chroma_hv_10_16x16_neon: 626.7 ( 1.65x) put_chroma_hv_10_32x32_c: 3624.4 ( 1.00x) put_chroma_hv_10_32x32_neon: 2386.9 ( 1.52x) put_chroma_hv_10_64x64_c: 13612.1 ( 1.00x) put_chroma_hv_10_64x64_neon: 9314.8 ( 1.46x) put_chroma_hv_10_128x128_c: 52975.4 ( 1.00x) put_chroma_hv_10_128x128_neon: 37083.5 ( 1.43x) put_chroma_hv_12_2x2_c: 118.6 ( 1.00x) put_chroma_hv_12_4x4_c: 188.1 ( 1.00x) put_chroma_hv_12_8x8_c: 303.4 ( 1.00x) put_chroma_hv_12_8x8_neon: 176.7 ( 1.72x) put_chroma_hv_12_16x16_c: 1037.9 ( 1.00x) put_chroma_hv_12_16x16_neon: 626.5 ( 1.66x) put_chroma_hv_12_32x32_c: 3629.0 ( 1.00x) put_chroma_hv_12_32x32_neon: 2386.6 ( 1.52x) put_chroma_hv_12_64x64_c: 13649.0 ( 
1.00x) put_chroma_hv_12_64x64_neon: 9313.6 ( 1.47x) put_chroma_hv_12_128x128_c: 52978.0 ( 1.00x) put_chroma_hv_12_128x128_neon: 37101.2 ( 1.43x)	2026-04-27 20:10:57 +00:00
Jun Zhao	75838b9c89	lavc/hevc: add aarch64 NEON for reference sample filtering 3-tap [1,2,1]>>2: shared implementation body across size-specialized entry points (8x8/16x16/32x32) to reduce code size. Fold the 3-tap kernel into uhadd + urhadd: uhadd gives floor((prev+next)/2), then urhadd rounds with curr to produce (prev + 2*curr + next + 2) >> 2 on 16 bytes in-place (no widen/narrow needed). Overlap-last technique for tail avoids partial stores. Caller pads input arrays by 16 bytes to guarantee safe over-read. Strong smoothing (32x32): preloaded weight tables, interleaved umull/umlal pairs (two 16-byte blocks at a time) to hide rshrn-to-store latency, with paired st1 for 32-byte writes. checkasm --bench --runs=15 (Apple M4, average of 3 trials): ref_filter_3tap_8x8_8_neon: 4.1x ref_filter_3tap_16x16_8_neon: 3.3x ref_filter_3tap_32x32_8_neon: 2.5x ref_filter_strong_8_neon: 1.9x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-04-21 07:50:49 +00:00
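The uhadd + urhadd folding described in 75838b9c89 can be checked in scalar C. The sketch below is not the NEON code from the commit; the helper names are hypothetical stand-ins for the per-lane behaviour of the two instructions. It compares the folded form with the wide-precision formula (prev + 2*curr + next + 2) >> 2 for every 8-bit input triple.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating halving add: per-lane behaviour of uhadd. */
static uint8_t halving_add(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);
}

/* Rounding halving add: per-lane behaviour of urhadd. */
static uint8_t rounding_halving_add(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Reference [1,2,1]>>2 kernel computed in int precision. */
static uint8_t filter_3tap_ref(uint8_t prev, uint8_t curr, uint8_t next)
{
    return (uint8_t)((prev + 2 * curr + next + 2) >> 2);
}

/* Folded kernel: every intermediate value fits in 8 bits. */
static uint8_t filter_3tap_folded(uint8_t prev, uint8_t curr, uint8_t next)
{
    return rounding_halving_add(halving_add(prev, next), curr);
}

/* Assert that both forms agree for all 2^24 input triples. */
static void check_filter_3tap_identity(void)
{
    for (int p = 0; p < 256; p++)
        for (int c = 0; c < 256; c++)
            for (int n = 0; n < 256; n++)
                assert(filter_3tap_ref((uint8_t)p, (uint8_t)c, (uint8_t)n) ==
                       filter_3tap_folded((uint8_t)p, (uint8_t)c, (uint8_t)n));
}
```

The identity holds because floor((floor(s/2) + c + 1) / 2) equals floor((s + 2c + 2) / 4) for any non-negative integers s and c, so the truncation in the first step never changes the final result.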
Jun Zhao	89c21b5ab7	lavc/hevc: add aarch64 NEON for Planar prediction Add NEON-optimized implementation for HEVC intra Planar prediction at 8-bit depth, supporting all block sizes (4x4 to 32x32). Planar prediction implements bilinear interpolation using an incremental base update: base_{y+1}[x] = base_y[x] - (top[x] - left[N]), reducing per-row computation from 4 multiply-adds to 1 subtract + 1 multiply. Uses rshrn for rounded narrowing shifts, eliminating manual rounding bias. All left[y] values are broadcast in the NEON domain, avoiding GP-to-NEON transfers. 4x4 interleaves row computations across 4 rows to break dependencies. 16x16 uses v19-v22 for persistent base/decrement vectors, avoiding callee-saved register spills. 32x32 processes 8 rows per loop iteration (4 iterations total) to reduce code size while maintaining full NEON utilization. Speedup over C on Apple M4 (checkasm --bench): 4x4: 2.25x 8x8: 6.40x 16x16: 9.72x 32x32: 3.21x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-30 14:32:10 +00:00
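The incremental base update in 89c21b5ab7 can be illustrated with a scalar comparison against the direct bilinear formula. This is a sketch, not the commit's code: the function names are hypothetical, the direct formula is the usual HEVC planar definition (assumed here, with `top[size]` as the top-right and `left[size]` as the bottom-left reference sample), and the rounding term is folded into `base[]` for simplicity, whereas the commit message says the NEON code rounds with rshrn instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SIZE 32

/* Direct form: four multiply-adds per predicted sample. */
static void planar_direct(uint8_t *dst, int stride, const uint8_t *top,
                          const uint8_t *left, int log2_size)
{
    int size = 1 << log2_size;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = (uint8_t)(
                ((size - 1 - x) * left[y] + (x + 1) * top[size] +
                 (size - 1 - y) * top[x]  + (y + 1) * left[size] +
                 size) >> (log2_size + 1));
}

/* Incremental form: base[x] holds every term independent of left[y];
 * stepping down one row subtracts (top[x] - left[size]) from it. */
static void planar_incremental(uint8_t *dst, int stride, const uint8_t *top,
                               const uint8_t *left, int log2_size)
{
    int size = 1 << log2_size;
    int base[MAX_SIZE], dec[MAX_SIZE];

    assert(size <= MAX_SIZE);
    for (int x = 0; x < size; x++) {
        base[x] = (size - 1) * top[x] + left[size] +
                  (x + 1) * top[size] + size;
        dec[x]  = top[x] - left[size];
    }
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++) {
            /* One multiply for the left[y] term, one subtract for base. */
            dst[y * stride + x] = (uint8_t)(
                (base[x] + (size - 1 - x) * left[y]) >> (log2_size + 1));
            base[x] -= dec[x];
        }
}

/* Returns 1 if both forms match on a deterministic test pattern. */
static int planar_forms_agree(int log2_size)
{
    int size = 1 << log2_size;
    uint8_t top[MAX_SIZE + 1], left[MAX_SIZE + 1];
    uint8_t a[MAX_SIZE * MAX_SIZE], b[MAX_SIZE * MAX_SIZE];

    for (int i = 0; i <= size; i++) {
        top[i]  = (uint8_t)(i * 37 + 11);
        left[i] = (uint8_t)(255 - i * 29);
    }
    planar_direct(a, size, top, left, log2_size);
    planar_incremental(b, size, top, left, log2_size);
    return memcmp(a, b, (size_t)size * size) == 0;
}
```

Calling `planar_forms_agree()` with `log2_size` from 2 to 5 covers the 4x4 to 32x32 block sizes named in the commit.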
Jun Zhao	60b372c934	lavc/hevc: add aarch64 NEON for DC prediction Add NEON-optimized implementation for HEVC intra DC prediction at 8-bit depth, supporting all block sizes (4x4 to 32x32). DC prediction computes the average of top and left reference samples using uaddlv, with urshr for rounded division. For luma blocks smaller than 32x32, edge smoothing is applied: the first row and column are blended toward the reference using (ref[i] + 3*dc + 2) >> 2 computed entirely in the NEON domain. Fill stores use pre-computed address patterns to break dependency chains. Also adds the aarch64 initialization framework (Makefile, pred.c/pred.h hooks, hevcpred_init_aarch64.c). Speedup over C on Apple M4 (checkasm --bench): 4x4: 2.28x 8x8: 3.14x 16x16: 3.29x 32x32: 3.02x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-30 14:32:10 +00:00
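A scalar sketch of the DC prediction steps described in 60b372c934 follows. It is an illustration only: the function name `pred_dc_sketch` and its parameters are hypothetical, and the corner sample formula is an assumption taken from the usual HEVC definition, since the commit message only states the (ref[i] + 3*dc + 2) >> 2 blend for the first row and column.

```c
#include <assert.h>
#include <stdint.h>

/* Predict a size x size block from `size` top and `size` left samples. */
static void pred_dc_sketch(uint8_t *dst, int stride, const uint8_t *top,
                           const uint8_t *left, int log2_size, int is_luma)
{
    int size = 1 << log2_size;
    int sum  = size;            /* rounding term for the division below */
    int dc;

    assert(log2_size >= 2 && log2_size <= 5);

    /* Average of all top and left reference samples, rounded. */
    for (int i = 0; i < size; i++)
        sum += top[i] + left[i];
    dc = sum >> (log2_size + 1);

    /* Fill the whole block with the DC value. */
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = (uint8_t)dc;

    /* Edge smoothing for luma blocks smaller than 32x32. */
    if (is_luma && size < 32) {
        /* Corner sample: assumed formula, not stated in the commit. */
        dst[0] = (uint8_t)((left[0] + 2 * dc + top[0] + 2) >> 2);
        for (int x = 1; x < size; x++)
            dst[x] = (uint8_t)((top[x] + 3 * dc + 2) >> 2);
        for (int y = 1; y < size; y++)
            dst[y * stride] = (uint8_t)((left[y] + 3 * dc + 2) >> 2);
    }
}
```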
Georgii Zagoruiko	1c385023aa	aarch64/vvc: Optimisations of put_chroma_v() functions for 10/12-bit Apple M4: put_chroma_v_10_2x2_c: 5.8 ( 1.00x) put_chroma_v_10_4x4_c: 9.0 ( 1.00x) put_chroma_v_10_4x4_neon: 1.7 ( 5.29x) put_chroma_v_10_8x8_c: 22.1 ( 1.00x) put_chroma_v_10_8x8_neon: 5.8 ( 3.79x) put_chroma_v_10_16x16_c: 56.3 ( 1.00x) put_chroma_v_10_16x16_neon: 21.2 ( 2.66x) put_chroma_v_10_32x32_c: 181.6 ( 1.00x) put_chroma_v_10_32x32_neon: 86.9 ( 2.09x) put_chroma_v_10_64x64_c: 680.3 ( 1.00x) put_chroma_v_10_64x64_neon: 337.4 ( 2.02x) put_chroma_v_10_128x128_c: 2567.3 ( 1.00x) put_chroma_v_10_128x128_neon: 1374.8 ( 1.87x) put_chroma_v_12_2x2_c: 6.4 ( 1.00x) put_chroma_v_12_4x4_c: 8.2 ( 1.00x) put_chroma_v_12_4x4_neon: 1.5 ( 5.56x) put_chroma_v_12_8x8_c: 18.9 ( 1.00x) put_chroma_v_12_8x8_neon: 5.7 ( 3.29x) put_chroma_v_12_16x16_c: 52.6 ( 1.00x) put_chroma_v_12_16x16_neon: 19.9 ( 2.65x) put_chroma_v_12_32x32_c: 185.7 ( 1.00x) put_chroma_v_12_32x32_neon: 81.9 ( 2.27x) put_chroma_v_12_64x64_c: 661.8 ( 1.00x) put_chroma_v_12_64x64_neon: 342.1 ( 1.93x) put_chroma_v_12_128x128_c: 2547.8 ( 1.00x) put_chroma_v_12_128x128_neon: 1368.0 ( 1.86x) RPi4: put_chroma_v_10_2x2_c: 64.8 ( 1.00x) put_chroma_v_10_4x4_c: 157.2 ( 1.00x) put_chroma_v_10_4x4_neon: 39.7 ( 3.96x) put_chroma_v_10_8x8_c: 562.1 ( 1.00x) put_chroma_v_10_8x8_neon: 98.8 ( 5.69x) put_chroma_v_10_16x16_c: 1170.7 ( 1.00x) put_chroma_v_10_16x16_neon: 380.7 ( 3.07x) put_chroma_v_10_32x32_c: 3696.6 ( 1.00x) put_chroma_v_10_32x32_neon: 1723.8 ( 2.14x) put_chroma_v_10_64x64_c: 13170.9 ( 1.00x) put_chroma_v_10_64x64_neon: 7284.1 ( 1.81x) put_chroma_v_10_128x128_c: 46068.3 ( 1.00x) put_chroma_v_10_128x128_neon: 27219.5 ( 1.69x) put_chroma_v_12_2x2_c: 63.8 ( 1.00x) put_chroma_v_12_4x4_c: 156.5 ( 1.00x) put_chroma_v_12_4x4_neon: 39.3 ( 3.98x) put_chroma_v_12_8x8_c: 560.9 ( 1.00x) put_chroma_v_12_8x8_neon: 98.7 ( 5.68x) put_chroma_v_12_16x16_c: 1169.9 ( 1.00x) put_chroma_v_12_16x16_neon: 380.8 ( 3.07x) put_chroma_v_12_32x32_c: 
3693.9 ( 1.00x) put_chroma_v_12_32x32_neon: 1728.4 ( 2.14x) put_chroma_v_12_64x64_c: 13170.9 ( 1.00x) put_chroma_v_12_64x64_neon: 7284.9 ( 1.81x) put_chroma_v_12_128x128_c: 46068.0 ( 1.00x) put_chroma_v_12_128x128_neon: 27224.6 ( 1.69x)	2026-03-27 13:42:50 +00:00
Martin Storsjö	f72f692afa	aarch64: Add PAC sign/validation of the link register Whenever the link register is stored on the stack, sign it before storing it and validate at a symmetrical point (with the stack at the same level as when it was signed). These macros only have an effect if built with PAC enabled (e.g. through -mbranch-protection=standard), otherwise they don't generate any extra instructions. None of these cases were present when PAC support was added in `248986a0db` in 2022. Without these changes, PAC still had an effect in the compiler-generated code and in the existing cases where these macros were used - but this change makes it apply to the remaining cases of the link register on the stack.	2026-03-20 13:16:06 +02:00
Martin Storsjö	dbf7354d98	aarch64/inter_sme2: Remove needless backup/restore of x29/x30 The sme_entry/sme_exit macros already take care of backing up/restoring these registers. Additionally, as long as no function calls are made within the function, x30 doesn't need to be backed up at all.	2026-03-20 13:16:06 +02:00
Martin Storsjö	1f7ed8a78d	aarch64: hevcdsp: Make returns match the call site For cases when returning early without updating any pixels, we previously returned to the return address in the caller's scope, bypassing one function entirely. While this may seem like a neat optimization, it makes the return stack predictor mispredict the returns - which potentially can cost more performance than it gains. Secondly, if the armv9.3 feature GCS (Guarded Control Stack) is enabled, then returns _must_ match the expected value; this feature is being enabled across Linux distributions, and by fixing the hevc assembly, we can enable the security feature on ffmpeg as well.	2026-03-17 20:37:53 +00:00
Jun Zhao	254b92ec8a	lavc/hevc: reorder aarch64 NEON pel function assignments Group assignments by filter family (qpel, epel), variant (base, uni, bi, uni_w, bi_w) and direction (pixels, h, v, hv). Add NEON8_FNASSIGN_QPEL_H macro to replace repeated manual qpel horizontal assignments. No functional change. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-13 21:43:37 +00:00
Jun Zhao	489d36b5e1	lavc/hevc: add aarch64 NEON for epel uni horizontal filter Add NEON-optimized implementations for HEVC EPEL uni-directional horizontal interpolation (put_hevc_epel_uni_h) at 8-bit depth. These functions perform horizontal 4-tap EPEL filtering with output directly to uint8_t pixels (no weighting): - 4-tap horizontal EPEL filter - Output: (filter_result + 32) >> 6, clipped to [0, 255] Supports all block widths: 4, 6, 8, 12, 16, 24, 32, 48, 64. Performance results on Apple M4: ./tests/checkasm/checkasm --test=hevc_pel --bench put_hevc_epel_uni_h4_8_neon: 2.26x put_hevc_epel_uni_h6_8_neon: 2.71x put_hevc_epel_uni_h8_8_neon: 4.40x put_hevc_epel_uni_h12_8_neon: 3.60x put_hevc_epel_uni_h16_8_neon: 3.00x put_hevc_epel_uni_h24_8_neon: 3.72x put_hevc_epel_uni_h32_8_neon: 3.14x put_hevc_epel_uni_h48_8_neon: 3.16x put_hevc_epel_uni_h64_8_neon: 3.15x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-13 21:43:37 +00:00
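The output stage named in 489d36b5e1, a 4-tap horizontal filter followed by (filter_result + 32) >> 6 and a clip to [0, 255], can be written as a scalar sketch. The function names are hypothetical, and the tap alignment (one sample to the left of the output position, two to the right) is an assumption about the usual EPEL layout rather than something the commit message states. Any coefficient set passed in is up to the caller; a set such as {-4, 36, 36, -4} sums to 64, which is what the final shift by 6 normalises.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Filter one row. src must be readable from src[-1] to src[width + 1]. */
static void epel_uni_h_sketch(uint8_t *dst, const uint8_t *src, int width,
                              const int8_t filter[4])
{
    assert(width > 0);
    for (int x = 0; x < width; x++) {
        int sum = filter[0] * src[x - 1] + filter[1] * src[x] +
                  filter[2] * src[x + 1] + filter[3] * src[x + 2];
        /* Round, scale back by the filter gain of 64, then clip.
         * A negative sum relies on arithmetic right shift; the clip
         * maps any negative result to 0 either way. */
        dst[x] = clip_u8((sum + 32) >> 6);
    }
}
```

Across a step edge the negative outer taps produce overshoot and undershoot, which is why the clip is needed even though the input is already 8-bit.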
Jun Zhao	f5e6cca935	lavc/hevc: add aarch64 NEON for qpel uni-weighted HV filter Add NEON-optimized implementations for HEVC QPEL uni-directional weighted HV interpolation (put_hevc_qpel_uni_w_hv) at 8-bit depth, for block widths 6, 12, 24, and 48. These functions perform horizontal then vertical 8-tap QPEL filtering with weighting (wx, ox, denom) and output to uint8_t. Previously only widths 4, 8, 16, 32, 64 were implemented; this completes coverage for all standard HEVC block widths. Performance results on Apple M4: ./tests/checkasm/checkasm --test=hevc_pel --bench put_hevc_qpel_uni_w_hv6_8_neon: 3.11x put_hevc_qpel_uni_w_hv12_8_neon: 3.19x put_hevc_qpel_uni_w_hv24_8_neon: 2.26x put_hevc_qpel_uni_w_hv48_8_neon: 1.80x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-13 21:43:37 +00:00
Jun Zhao	fe41ff7413	lavc/hevc: add aarch64 NEON for qpel uni-weighted vertical filter Add NEON-optimized implementations for HEVC QPEL uni-weighted vertical interpolation (put_hevc_qpel_uni_w_v) at 8-bit depth. These functions perform weighted uni-directional prediction with vertical QPEL filtering: - 8-tap vertical QPEL filter - Weighted prediction: (filter_result * wx + offset) >> shift Previously only sizes 4, 8, 16, 64 were optimized. This patch adds optimized implementations for all remaining sizes: 6, 12, 24, 32, 48. Performance results on Apple M4: ./tests/checkasm/checkasm --test=hevc_pel --bench put_hevc_qpel_uni_w_v6_8_neon: 3.40x put_hevc_qpel_uni_w_v12_8_neon: 3.24x put_hevc_qpel_uni_w_v24_8_neon: 3.06x put_hevc_qpel_uni_w_v32_8_neon: 2.66x put_hevc_qpel_uni_w_v48_8_neon: 2.67x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-13 21:43:37 +00:00
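The weighted prediction step quoted in fe41ff7413, (filter_result * wx + offset) >> shift, can be sketched for a single 8-bit output sample. Several details below are assumptions rather than statements from the commit: the function name, the derivation shift = denom + 14 - 8 with offset = 1 << (shift - 1), the addition of `ox` after the shift, and the tap alignment from three rows above to four rows below the output position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One output sample of a uni-weighted vertical 8-tap filter.
 * src must be readable from src[-3 * stride] to src[4 * stride]. */
static uint8_t qpel_uni_w_v_sample(const uint8_t *src, ptrdiff_t stride,
                                   const int8_t filter[8],
                                   int denom, int wx, int ox)
{
    int shift  = denom + 14 - 8;   /* assumed 8-bit derivation */
    int offset = 1 << (shift - 1); /* rounding term */
    int sum    = 0;

    assert(denom >= 0);
    for (int k = 0; k < 8; k++)
        sum += filter[k] * src[(k - 3) * stride];

    /* Weight, round, scale, then add the offset ox (assumed order). */
    return clip_u8(((sum * wx + offset) >> shift) + ox);
}
```

With a filter whose taps sum to 64 and wx equal to 1 << denom, a flat input is returned unchanged, which makes a convenient sanity check.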
Jun Zhao	32df0352b7	lavc/hevc: move subs earlier in qpel uni-weighted NEON loops Move the subs instruction before the store macro in the 8x-unrolled loops of qpel_uni_w_v4/v8/v16/v64 and qpel_uni_w_hv4/hv8/hv16, so that many NEON instructions from the store macro separate it from the conditional branch. This gives the CPU pipeline time to resolve the condition flags before the branch decision. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-13 21:43:37 +00:00
Georgii Zagoruiko	c1be2107c9	aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit RPi4: put_chroma_h_10_2x2_c: 63.4 ( 1.00x) put_chroma_h_10_4x4_c: 151.4 ( 1.00x) put_chroma_h_10_8x8_c: 555.1 ( 1.00x) put_chroma_h_10_8x8_neon: 113.9 ( 4.88x) put_chroma_h_10_16x16_c: 1068.5 ( 1.00x) put_chroma_h_10_16x16_neon: 439.4 ( 2.43x) put_chroma_h_10_32x32_c: 3432.6 ( 1.00x) put_chroma_h_10_32x32_neon: 1878.3 ( 1.83x) put_chroma_h_10_64x64_c: 12872.2 ( 1.00x) put_chroma_h_10_64x64_neon: 7868.2 ( 1.64x) put_chroma_h_10_128x128_c: 45612.2 ( 1.00x) put_chroma_h_10_128x128_neon: 28742.1 ( 1.59x) put_chroma_h_12_2x2_c: 63.7 ( 1.00x) put_chroma_h_12_4x4_c: 151.5 ( 1.00x) put_chroma_h_12_8x8_c: 555.2 ( 1.00x) put_chroma_h_12_8x8_neon: 114.2 ( 4.86x) put_chroma_h_12_16x16_c: 1068.1 ( 1.00x) put_chroma_h_12_16x16_neon: 438.8 ( 2.43x) put_chroma_h_12_32x32_c: 3419.7 ( 1.00x) put_chroma_h_12_32x32_neon: 1878.7 ( 1.82x) put_chroma_h_12_64x64_c: 12862.2 ( 1.00x) put_chroma_h_12_64x64_neon: 7868.2 ( 1.63x) put_chroma_h_12_128x128_c: 45613.5 ( 1.00x) put_chroma_h_12_128x128_neon: 28743.3 ( 1.59x) Apple M4: put_chroma_h_10_2x2_c: 2.5 ( 1.00x) put_chroma_h_10_4x4_c: 6.5 ( 1.00x) put_chroma_h_10_8x8_c: 17.8 ( 1.00x) put_chroma_h_10_8x8_neon: 6.8 ( 2.60x) put_chroma_h_10_16x16_c: 53.3 ( 1.00x) put_chroma_h_10_16x16_neon: 30.4 ( 1.75x) put_chroma_h_10_32x32_c: 181.8 ( 1.00x) put_chroma_h_10_32x32_neon: 116.2 ( 1.56x) put_chroma_h_10_64x64_c: 684.2 ( 1.00x) put_chroma_h_10_64x64_neon: 470.3 ( 1.45x) put_chroma_h_10_128x128_c: 2567.6 ( 1.00x) put_chroma_h_10_128x128_neon: 1879.3 ( 1.37x) put_chroma_h_12_2x2_c: 1.9 ( 1.00x) put_chroma_h_12_4x4_c: 7.0 ( 1.00x) put_chroma_h_12_8x8_c: 16.8 ( 1.00x) put_chroma_h_12_8x8_neon: 7.9 ( 2.12x) put_chroma_h_12_16x16_c: 55.0 ( 1.00x) put_chroma_h_12_16x16_neon: 29.0 ( 1.90x) put_chroma_h_12_32x32_c: 182.5 ( 1.00x) put_chroma_h_12_32x32_neon: 116.9 ( 1.56x) put_chroma_h_12_64x64_c: 666.8 ( 1.00x) put_chroma_h_12_64x64_neon: 474.5 ( 1.41x) 
put_chroma_h_12_128x128_c: 2588.1 ( 1.00x) put_chroma_h_12_128x128_neon: 1912.2 ( 1.35x)	2026-03-10 12:48:54 +00:00
Martin Storsjö	74cfcd1c69	aarch64/vvc: Fix DCE undefined references with MSVC This fixes compiling with MSVC for aarch64 after `510999f6b0`. While MSVC does do dead code elimination for function references within e.g. "if (0)", it doesn't do that for functions referenced within a static function, even if that static function itself ends up not used. A reproduction example: void missing(void); void (*func_ptr)(void); static void wrapper(void) { missing(); } void init(int cpu_flags) { if (0) { func_ptr = wrapper; } } If "wrapper" is entirely unreferenced, then MSVC doesn't produce any reference to the symbol "missing". Also, if we do "func_ptr = missing;" then the reference to missing also is eliminated. But for the case of referencing the function in a static function, even if the reference to the static function can be eliminated, then MSVC does keep the reference to the symbol.	2026-03-05 11:57:40 +02:00
Zuoqiang He	1fc7464cf7	libavcodec/huffyuvdsp: Add NEON optimization for the add_int16 function Benchmark Results (1024 iterations, Raspberry Pi 5 - Cortex-A76): add_int16_128_c: 914.0 ( 1.00x) add_int16_128_neon: 516.9 ( 1.77x) add_int16_rnd_width_c: 914.0 ( 1.00x) add_int16_rnd_width_neon: 517.5 ( 1.77x) Co-Authored-By: Martin Storsjö <martin@martin.st>	2026-03-04 22:31:19 +00:00
Georgii Zagoruiko	510999f6b0	aarch64/vvc: sme2 optimisation of alf_filter_luma() 8/10/12 bit Apple M4: vvc_alf_filter_luma_8x8_8_c: 347.3 ( 1.00x) vvc_alf_filter_luma_8x8_8_neon: 138.7 ( 2.50x) vvc_alf_filter_luma_8x8_8_sme2: 134.5 ( 2.58x) vvc_alf_filter_luma_8x8_10_c: 299.8 ( 1.00x) vvc_alf_filter_luma_8x8_10_neon: 129.8 ( 2.31x) vvc_alf_filter_luma_8x8_10_sme2: 128.6 ( 2.33x) vvc_alf_filter_luma_8x8_12_c: 293.0 ( 1.00x) vvc_alf_filter_luma_8x8_12_neon: 126.8 ( 2.31x) vvc_alf_filter_luma_8x8_12_sme2: 126.3 ( 2.32x) vvc_alf_filter_luma_16x16_8_c: 1386.1 ( 1.00x) vvc_alf_filter_luma_16x16_8_neon: 560.3 ( 2.47x) vvc_alf_filter_luma_16x16_8_sme2: 540.1 ( 2.57x) vvc_alf_filter_luma_16x16_10_c: 1200.3 ( 1.00x) vvc_alf_filter_luma_16x16_10_neon: 515.6 ( 2.33x) vvc_alf_filter_luma_16x16_10_sme2: 531.3 ( 2.26x) vvc_alf_filter_luma_16x16_12_c: 1223.8 ( 1.00x) vvc_alf_filter_luma_16x16_12_neon: 510.7 ( 2.40x) vvc_alf_filter_luma_16x16_12_sme2: 524.9 ( 2.33x) vvc_alf_filter_luma_32x32_8_c: 5488.8 ( 1.00x) vvc_alf_filter_luma_32x32_8_neon: 2233.4 ( 2.46x) vvc_alf_filter_luma_32x32_8_sme2: 1093.6 ( 5.02x) vvc_alf_filter_luma_32x32_10_c: 4738.0 ( 1.00x) vvc_alf_filter_luma_32x32_10_neon: 2057.5 ( 2.30x) vvc_alf_filter_luma_32x32_10_sme2: 1053.6 ( 4.50x) vvc_alf_filter_luma_32x32_12_c: 4808.3 ( 1.00x) vvc_alf_filter_luma_32x32_12_neon: 1981.2 ( 2.43x) vvc_alf_filter_luma_32x32_12_sme2: 1047.7 ( 4.59x) vvc_alf_filter_luma_64x64_8_c: 22116.8 ( 1.00x) vvc_alf_filter_luma_64x64_8_neon: 8951.0 ( 2.47x) vvc_alf_filter_luma_64x64_8_sme2: 4225.2 ( 5.23x) vvc_alf_filter_luma_64x64_10_c: 19072.8 ( 1.00x) vvc_alf_filter_luma_64x64_10_neon: 8448.1 ( 2.26x) vvc_alf_filter_luma_64x64_10_sme2: 4225.8 ( 4.51x) vvc_alf_filter_luma_64x64_12_c: 19312.6 ( 1.00x) vvc_alf_filter_luma_64x64_12_neon: 8270.9 ( 2.34x) vvc_alf_filter_luma_64x64_12_sme2: 4245.4 ( 4.55x) vvc_alf_filter_luma_128x128_8_c: 88530.5 ( 1.00x) vvc_alf_filter_luma_128x128_8_neon: 35686.3 ( 2.48x) vvc_alf_filter_luma_128x128_8_sme2: 
16961.2 ( 5.22x) vvc_alf_filter_luma_128x128_10_c: 76904.9 ( 1.00x) vvc_alf_filter_luma_128x128_10_neon: 32439.5 ( 2.37x) vvc_alf_filter_luma_128x128_10_sme2: 16845.6 ( 4.57x) vvc_alf_filter_luma_128x128_12_c: 77363.3 ( 1.00x) vvc_alf_filter_luma_128x128_12_neon: 32907.5 ( 2.35x) vvc_alf_filter_luma_128x128_12_sme2: 17018.1 ( 4.55x)	2026-03-04 23:52:58 +02:00
Georgii Zagoruiko	90431417cb	aarch64/vvc: Optimisations of put_luma_hv() functions for 10/12-bit Apple M2: put_luma_hv_10_4x4_c: 36.3 ( 1.00x) put_luma_hv_10_8x8_c: 82.9 ( 1.00x) put_luma_hv_10_8x8_neon: 34.9 ( 2.37x) put_luma_hv_10_16x16_c: 239.2 ( 1.00x) put_luma_hv_10_16x16_neon: 119.0 ( 2.01x) put_luma_hv_10_32x32_c: 900.3 ( 1.00x) put_luma_hv_10_32x32_neon: 429.3 ( 2.10x) put_luma_hv_10_64x64_c: 2984.7 ( 1.00x) put_luma_hv_10_64x64_neon: 1736.2 ( 1.72x) put_luma_hv_10_128x128_c: 11194.2 ( 1.00x) put_luma_hv_10_128x128_neon: 6357.3 ( 1.76x) put_luma_hv_12_4x4_c: 35.9 ( 1.00x) put_luma_hv_12_8x8_c: 82.6 ( 1.00x) put_luma_hv_12_8x8_neon: 34.3 ( 2.41x) put_luma_hv_12_16x16_c: 240.2 ( 1.00x) put_luma_hv_12_16x16_neon: 115.3 ( 2.08x) put_luma_hv_12_32x32_c: 787.7 ( 1.00x) put_luma_hv_12_32x32_neon: 414.2 ( 1.90x) put_luma_hv_12_64x64_c: 3058.4 ( 1.00x) put_luma_hv_12_64x64_neon: 1592.3 ( 1.92x) put_luma_hv_12_128x128_c: 11350.8 ( 1.00x) put_luma_hv_12_128x128_neon: 6378.3 ( 1.78x) RPi4: put_luma_hv_10_4x4_c: 637.8 ( 1.00x) put_luma_hv_10_8x8_c: 1044.9 ( 1.00x) put_luma_hv_10_8x8_neon: 483.7 ( 2.16x) put_luma_hv_10_16x16_c: 3098.0 ( 1.00x) put_luma_hv_10_16x16_neon: 1603.1 ( 1.93x) put_luma_hv_10_32x32_c: 10054.8 ( 1.00x) put_luma_hv_10_32x32_neon: 5843.6 ( 1.72x) put_luma_hv_10_64x64_c: 40506.2 ( 1.00x) put_luma_hv_10_64x64_neon: 24384.0 ( 1.66x) put_luma_hv_10_128x128_c: 130604.2 ( 1.00x) put_luma_hv_10_128x128_neon: 99746.6 ( 1.31x) put_luma_hv_12_4x4_c: 638.2 ( 1.00x) put_luma_hv_12_8x8_c: 1074.6 ( 1.00x) put_luma_hv_12_8x8_neon: 482.6 ( 2.23x) put_luma_hv_12_16x16_c: 3094.0 ( 1.00x) put_luma_hv_12_16x16_neon: 1602.5 ( 1.93x) put_luma_hv_12_32x32_c: 10034.4 ( 1.00x) put_luma_hv_12_32x32_neon: 5843.3 ( 1.72x) put_luma_hv_12_64x64_c: 40447.5 ( 1.00x) put_luma_hv_12_64x64_neon: 24377.2 ( 1.66x) put_luma_hv_12_128x128_c: 130610.4 ( 1.00x) put_luma_hv_12_128x128_neon: 99765.8 ( 1.31x)	2026-03-04 12:53:16 +00:00
Jun Zhao	7e7d69632d	lavc/hevc: optimize qpel H-pass for width>=16 with byte-domain widening multiply Rewrite ff_hevc_put_hevc_qpel_h16_8_neon and h32 to use byte-domain widening multiply (umull/umlal/umlsl via calc_qpelb/calc_qpelb2 macros) instead of the previous int16-domain approach (uxtl + mul/mla). The byte-domain approach eliminates the uxtl expansion step and halves the ext stride (1 byte vs 2 bytes per tap), reducing per-row instruction count from ~32 to ~23. The functions are also inlined, removing bl/ret call overhead. This benefits all HV-path callers (hv/uni_hv/bi_hv/uni_w_hv/bi_w_hv) at widths 16/32/48/64. checkasm benchmarks on Apple M4 (5-run average): H-pass standalone (NEON): h16: 34.0 -> 24.4 cycles (1.39x speedup) h32: 132.0 -> 95.0 cycles (1.39x speedup) h64: 521.8 -> 373.9 cycles (1.40x speedup) HV compound paths geometric mean speedup (NEON, width >= 16): qpel_hv: 1.144x (4 functions) qpel_bi_hv: 1.158x (4 functions) qpel_uni_hv: 1.188x (4 functions) qpel_uni_w_hv: 1.158x (3 functions) Overall: 1.162x (15 functions) VVC qpel h16/h32 are separated into self-contained functions retaining the int16-domain approach, as VVC filters have arbitrary coefficients incompatible with the hardcoded sign pattern in calc_qpelb. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-03 12:04:14 +00:00
Jun Zhao	23b7005d98	lavc/vvc: remove duplicate 'mov mx, x30' in VVC qpel h16/h32 The VVC qpel h16 and h32 functions had a redundant 'mov mx, x30' instruction. The first one was placed before vvc_load_filter had finished using mx (the filter pointer argument), making it a dead store immediately overwritten by the second 'mov mx, x30'. Remove the first instance and reorder so that 'sub src, src, #3' comes before 'mov mx, x30', ensuring the filter pointer in mx is fully consumed by vvc_load_filter before being overwritten with the link register. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-03-03 12:04:14 +00:00
Andreas Rheinhardt	dc65dcec22	avcodec/vvc/inter: Combine offsets early For bi-predicted weighted averages, only the sum of the two offsets is ever used, so add the two early. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-02-25 12:08:33 +01:00
Jun Zhao	27dd2f1c70	lavc/hevc: fix missing # in ldrsw immediate offset The ldrsw instruction requires immediate offset with # prefix. This fixes the syntax error introduced in commit `26752368f0` (aarch64/h26x: Add put_hevc_pel_bi_w_pixels) where the load_bi_w_pixels_param macro was added. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-02-05 09:13:22 +08:00
Zhao Zhili	e250854ecf	aarch64/h264pred: disable inefficient functions These assembly optimizations have been identified as "performance regressions." Due to advancements in modern CPU micro-architectures and compiler optimization the C implementations now consistently outperform these handwritten routines. Test Name A55-clang M1 A76-gcc-14 A510-clang A715-clang X3-clang -------------------------------------------------------------------------------------------------------------------- pred8x8_dc_8_neon 55.9 ( 0.79x)! 0.2 ( 0.31x)! 35.7 ( 0.63x)! 98.3 ( 0.37x)! 35.9 ( 0.45x)! 33.6 ( 0.38x)! pred8x8_dc_10_neon 57.0 ( 1.04x) 0.3 ( 0.36x)! 35.9 ( 0.94x)! 98.2 ( 0.53x)! 35.8 ( 0.58x)! 33.2 ( 0.50x)! pred8x8_dc_128_8_neon 26.0 ( 0.69x)! 0.1 ( 0.43x)! 15.3 ( 0.73x)! 46.4 ( 0.36x)! 10.6 ( 0.48x)! 10.3 ( 1.09x) pred8x8_dc_128_10_neon 25.3 ( 0.99x)! 0.1 ( 0.42x)! 19.3 ( 0.48x)! 44.5 ( 0.42x)! 10.0 ( 0.61x)! 11.0 ( 1.00x) pred8x8_left_dc_8_neon 46.9 ( 0.72x)! 0.2 ( 0.26x)! 30.2 ( 0.49x)! 71.4 ( 0.39x)! 29.8 ( 0.35x)! 26.5 ( 0.44x)! pred8x8_left_dc_10_neon 45.4 ( 0.82x)! 0.2 ( 0.29x)! 28.1 ( 0.67x)! 70.2 ( 0.47x)! 30.0 ( 0.38x)! 26.5 ( 0.43x)! pred16x16_dc_8_neon 74.4 ( 1.34x) 0.3 ( 0.62x)! 44.7 ( 0.89x)! 128.0 ( 0.79x)! 48.5 ( 0.67x)! 39.4 ( 0.71x)! pred16x16_dc_128_8_neon 37.9 ( 0.79x)! 0.1 ( 0.60x)! 20.1 ( 0.80x)! 41.8 ( 0.46x)! 16.2 ( 0.81x)! 12.8 ( 0.95x)! pred16x16_left_dc_8_neon 69.9 ( 1.19x) 0.3 ( 0.46x)! 49.6 ( 0.54x)! 116.8 ( 0.62x)! 52.8 ( 0.45x)! 44.2 ( 0.51x)! pred8x8_hori_8_neon 30.6 ( 1.39x) 0.1 ( 0.45x)! 19.4 ( 0.81x)! 71.0 ( 0.50x)! 15.9 ( 0.55x)! 12.2 ( 0.94x)! pred8x8_hori_10_neon* 29.3 ( 1.82x) 0.1 ( 0.59x)! 18.5 ( 1.56x) 68.9 ( 0.64x)! 15.8 ( 0.62x)! 11.8 ( 0.97x)! pred8x8_top_dc_8_neon 35.8 ( 0.96x)! 0.1 ( 0.59x)! 16.8 ( 0.81x)! 58.9 ( 0.44x)! 11.3 ( 0.89x)! 11.4 ( 0.99x)! pred8x8_top_dc_10_neon 37.4 ( 1.24x) 0.1 ( 0.92x)! 20.4 ( 0.81x)! 59.5 ( 0.69x)! 10.5 ( 1.48x) 11.8 ( 1.02x) pred8x8_vertical_8_neon 18.3 ( 1.08x) 0.1 ( 0.54x)! 12.8 ( 0.89x)! 
37.2 ( 0.40x)! 8.3 ( 0.77x)! 11.2 ( 1.00x) pred8x8_vertical_10_neon 19.0 ( 1.24x) 0.1 ( 0.55x)! 15.3 ( 0.62x)! 39.7 ( 0.50x)! 8.2 ( 0.91x)! 11.1 ( 0.99x)! - pred8x8_horizontal_10 also underperforms on new architectures, but useful on A55 and A76. Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2026-02-04 09:06:37 +00:00
Zhao Zhili	f54841d375	avcodec/aarch64: add pngdsp Test Name A55-gcc-11 M1-clang A76-gcc-12 A510-clang X3-clang ------------------------------------------------------------------------------------------------------------------- add_bytes_l2_4096_neon 1807.2 ( 2.01x) 1.6 ( 1.94x) 333.0 ( 6.35x) 1058.2 ( 2.34x) 214.3 ( 1.99x) add_paeth_prediction_3_neon 33036.1 ( 2.41x) 145.1 ( 1.66x) 20443.3 ( 1.97x) 35225.1 ( 1.23x) 19420.8 ( 1.05x) add_paeth_prediction_4_neon 24368.6 ( 3.26x) 106.7 ( 2.01x) 15163.8 ( 2.77x) 26454.7 ( 1.62x) 14319.0 ( 1.35x) add_paeth_prediction_6_neon 17900.6 ( 4.44x) 72.0 ( 2.70x) 10214.3 ( 4.20x) 18296.9 ( 2.27x) 9693.1 ( 1.97x) add_paeth_prediction_8_neon 12615.4 ( 6.31x) 54.1 ( 2.58x) 7706.0 ( 5.45x) 13733.3 ( 2.94x) 7272.6 ( 2.63x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2026-02-04 12:05:35 +08:00
Martin Storsjö	f74c551eaa	aarch64: Fix indentation of a few instructions This file is exempt from the indent checker script, as there are a few other bits in it that the script wants to reformat into slightly worse form, or which might not warrant being reformatted. But these instructions should indeed be indented this way.	2026-01-30 05:21:27 +00:00
Andreas Rheinhardt	bf4d5037b4	avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names These names are a remnant of dsputil when all the DSP functions from all codecs were part of DSPcontext. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Sean McGovern <gseanmcg@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:25 +01:00
Jun Zhao	8966101fa6	lavc/hevc: add aarch64 neon for 12-bit dequant Implement NEON optimization for HEVC dequant at 12-bit depth. For 12-bit: shift = 15 - 12 - log2_size = 3 - log2_size. When shift is negative, we use shl (shift left) instead of srshr. Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_12_c: 9.9 ( 1.00x) hevc_dequant_4x4_12_neon: 5.7 ( 1.74x) hevc_dequant_8x8_12_c: 1.7 ( 1.00x) hevc_dequant_8x8_12_neon: 1.3 ( 1.30x) hevc_dequant_16x16_12_c: 131.1 ( 1.00x) hevc_dequant_16x16_12_neon: 7.9 (16.52x) hevc_dequant_32x32_12_c: 69.7 ( 1.00x) hevc_dequant_32x32_12_neon: 28.4 ( 2.46x) Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
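The shift rule used by the dequant commits, shift = 15 - bit_depth - log2_size with a left shift when the result is negative, can be sketched in scalar C. The function name and the runtime `bit_depth` parameter are hypothetical conveniences for illustration. The rounding right shift in the first branch mirrors what the commit messages say srshr does in one instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Dequantise a (1 << log2_size) square block of coefficients in place. */
static void dequant_sketch(int16_t *coeffs, int log2_size, int bit_depth)
{
    int shift = 15 - bit_depth - log2_size;
    int count = 1 << (2 * log2_size);

    assert(log2_size >= 2 && log2_size <= 5);

    if (shift > 0) {
        /* Rounding right shift: add half of the divisor, then shift.
         * Negative inputs rely on arithmetic right shift. */
        int offset = 1 << (shift - 1);
        for (int i = 0; i < count; i++)
            coeffs[i] = (int16_t)((coeffs[i] + offset) >> shift);
    } else {
        /* shift <= 0: scale up instead (identity when shift == 0). */
        for (int i = 0; i < count; i++)
            coeffs[i] = (int16_t)(coeffs[i] * (1 << -shift));
    }
}
```

For 12-bit 32x32 blocks the shift is 3 - 5 = -2, so coefficients are multiplied by 4. For 10-bit 32x32 blocks the shift is 0 and the block is left unchanged.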
Jun Zhao	ce89d974c8	lavc/hevc: add aarch64 neon for 10-bit dequant Implement NEON optimization for HEVC dequant at 10-bit depth. For 10-bit: shift = 15 - 10 - log2_size = 5 - log2_size Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_10_c: 16.6 ( 1.00x) hevc_dequant_4x4_10_neon: 7.4 ( 2.23x) hevc_dequant_8x8_10_c: 39.7 ( 1.00x) hevc_dequant_8x8_10_neon: 7.5 ( 5.28x) hevc_dequant_16x16_10_c: 168.7 ( 1.00x) hevc_dequant_16x16_10_neon: 10.2 (16.56x) hevc_dequant_32x32_10_c: 1.9 ( 1.00x) hevc_dequant_32x32_10_neon: 1.9 ( 1.01x) Note: 32x32 shift=0 is identity transform (no-op), so NEON has no advantage over C which is also optimized away by the compiler. Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
Jun Zhao	0886e50c6b	lavc/hevc: add aarch64 neon for 8-bit dequant Implement NEON optimization for HEVC dequant at 8-bit depth. The NEON implementation uses srshr (Signed Rounding Shift Right) which does both the add with offset and right shift in a single instruction. Optimization details: - 4x4 (16 coeffs): Single load-process-store sequence - 8x8 (64 coeffs): Fully unrolled, no loop overhead - 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency - 32x32 (1024 coeffs): Pipelined with all available NEON registers Performance benchmark on Apple M4: ./tests/checkasm/checkasm --test=hevc_dequant --bench hevc_dequant_4x4_8_c: 11.3 ( 1.00x) hevc_dequant_4x4_8_neon: 6.3 ( 1.78x) hevc_dequant_8x8_8_c: 33.9 ( 1.00x) hevc_dequant_8x8_8_neon: 6.6 ( 5.11x) hevc_dequant_16x16_8_c: 153.8 ( 1.00x) hevc_dequant_16x16_8_neon: 9.0 (17.02x) hevc_dequant_32x32_8_c: 78.1 ( 1.00x) hevc_dequant_32x32_8_neon: 31.9 ( 2.45x) Note on Performance Anomaly: The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8) is due to Clang auto-vectorizing only for sizes >= 32x32. Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2) Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-25 06:55:26 +00:00
Georgii Zagoruiko	8acdffa22c	aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit RPi4 (auto-vectorisation is on) put_luma_v_10_4x4_c: 303.3 ( 1.00x) put_luma_v_10_4x4_neon: 55.7 ( 5.45x) put_luma_v_10_8x8_c: 1106.7 ( 1.00x) put_luma_v_10_8x8_neon: 163.8 ( 6.76x) put_luma_v_10_16x16_c: 2242.1 ( 1.00x) put_luma_v_10_16x16_neon: 672.7 ( 3.33x) put_luma_v_10_32x32_c: 7057.3 ( 1.00x) put_luma_v_10_32x32_neon: 2731.3 ( 2.58x) put_luma_v_10_64x64_c: 25699.8 ( 1.00x) put_luma_v_10_64x64_neon: 12145.6 ( 2.12x) put_luma_v_10_128x128_c: 90694.6 ( 1.00x) put_luma_v_10_128x128_neon: 44862.4 ( 2.02x) put_luma_v_12_4x4_c: 304.4 ( 1.00x) put_luma_v_12_4x4_neon: 55.6 ( 5.47x) put_luma_v_12_8x8_c: 1107.4 ( 1.00x) put_luma_v_12_8x8_neon: 164.7 ( 6.72x) put_luma_v_12_16x16_c: 2235.8 ( 1.00x) put_luma_v_12_16x16_neon: 672.5 ( 3.32x) put_luma_v_12_32x32_c: 7049.2 ( 1.00x) put_luma_v_12_32x32_neon: 2731.6 ( 2.58x) put_luma_v_12_64x64_c: 25706.5 ( 1.00x) put_luma_v_12_64x64_neon: 12145.0 ( 2.12x) put_luma_v_12_128x128_c: 90672.5 ( 1.00x) put_luma_v_12_128x128_neon: 44857.1 ( 2.02x) Apple M4 (auto-vectorisation is on): put_luma_v_10_4x4_c: 25.6 ( 1.00x) put_luma_v_10_4x4_neon: 3.1 ( 8.18x) put_luma_v_10_8x8_c: 34.7 ( 1.00x) put_luma_v_10_8x8_neon: 10.5 ( 3.32x) put_luma_v_10_16x16_c: 103.9 ( 1.00x) put_luma_v_10_16x16_neon: 42.3 ( 2.45x) put_luma_v_10_32x32_c: 399.7 ( 1.00x) put_luma_v_10_32x32_neon: 161.8 ( 2.47x) put_luma_v_10_64x64_c: 1276.7 ( 1.00x) put_luma_v_10_64x64_neon: 840.1 ( 1.52x) put_luma_v_10_128x128_c: 4981.3 ( 1.00x) put_luma_v_10_128x128_neon: 3008.0 ( 1.66x) put_luma_v_12_4x4_c: 23.6 ( 1.00x) put_luma_v_12_4x4_neon: 2.0 (11.84x) put_luma_v_12_8x8_c: 31.8 ( 1.00x) put_luma_v_12_8x8_neon: 12.4 ( 2.55x) put_luma_v_12_16x16_c: 100.8 ( 1.00x) put_luma_v_12_16x16_neon: 44.9 ( 2.25x) put_luma_v_12_32x32_c: 331.1 ( 1.00x) put_luma_v_12_32x32_neon: 175.2 ( 1.89x) put_luma_v_12_64x64_c: 1227.1 ( 1.00x) put_luma_v_12_64x64_neon: 712.7 ( 1.72x) 
put_luma_v_12_128x128_c: 5149.1 ( 1.00x) put_luma_v_12_128x128_neon: 2809.3 ( 1.83x)	2026-01-08 17:35:55 +00:00
Zhao Zhili	840183d823	aarch64/hpeldsp_neon: fix out-of-bounds read Fix #21141 The performance improved a little bit. On A76: Before After put_pixels_tab[0][1]_neon: 32.4 ( 3.91x) 31.6 ( 3.99x) put_pixels_tab[0][3]_neon: 88.0 ( 4.50x) 74.6 ( 5.31x) put_pixels_tab[1][1]_neon: 33.5 ( 2.52x) 31.2 ( 2.71x) put_pixels_tab[1][3]_neon: 30.5 ( 3.61x) 21.7 ( 5.08x) On A55: Before After put_pixels_tab[0][1]_neon: 175.2 ( 2.41x) 138.7 ( 3.04x) put_pixels_tab[0][3]_neon: 334.3 ( 2.71x) 296.1 ( 3.07x) put_pixels_tab[1][1]_neon: 168.3 ( 1.78x) 94.1 ( 3.19x) put_pixels_tab[1][3]_neon: 112.3 ( 2.20x) 90.0 ( 2.74x)	2026-01-04 03:22:55 +00:00
Georgii Zagoruiko	f790de2a87	aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit RPi4 (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 282.8 ( 1.00x) put_luma_h_10_8x8_c: 1069.5 ( 1.00x) put_luma_h_10_8x8_neon: 207.5 ( 5.15x) put_luma_h_10_16x16_c: 1999.6 ( 1.00x) put_luma_h_10_16x16_neon: 777.5 ( 2.57x) put_luma_h_10_32x32_c: 6612.9 ( 1.00x) put_luma_h_10_32x32_neon: 3201.6 ( 2.07x) put_luma_h_10_64x64_c: 25059.0 ( 1.00x) put_luma_h_10_64x64_neon: 13623.5 ( 1.84x) put_luma_h_10_128x128_c: 91310.1 ( 1.00x) put_luma_h_10_128x128_neon: 50358.3 ( 1.81x) put_luma_h_12_4x4_c: 282.1 ( 1.00x) put_luma_h_12_8x8_c: 1068.4 ( 1.00x) put_luma_h_12_8x8_neon: 207.7 ( 5.14x) put_luma_h_12_16x16_c: 1998.0 ( 1.00x) put_luma_h_12_16x16_neon: 777.5 ( 2.57x) put_luma_h_12_32x32_c: 6612.0 ( 1.00x) put_luma_h_12_32x32_neon: 3201.6 ( 2.07x) put_luma_h_12_64x64_c: 25036.8 ( 1.00x) put_luma_h_12_64x64_neon: 13595.1 ( 1.84x) put_luma_h_12_128x128_c: 91305.8 ( 1.00x) put_luma_h_12_128x128_neon: 50359.7 ( 1.81x) Apple M2 Air (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 0.3 ( 1.00x) put_luma_h_10_8x8_c: 1.0 ( 1.00x) put_luma_h_10_8x8_neon: 0.4 ( 2.59x) put_luma_h_10_16x16_c: 2.9 ( 1.00x) put_luma_h_10_16x16_neon: 1.4 ( 2.01x) put_luma_h_10_32x32_c: 9.4 ( 1.00x) put_luma_h_10_32x32_neon: 5.8 ( 1.62x) put_luma_h_10_64x64_c: 35.6 ( 1.00x) put_luma_h_10_64x64_neon: 23.6 ( 1.51x) put_luma_h_10_128x128_c: 131.1 ( 1.00x) put_luma_h_10_128x128_neon: 92.6 ( 1.42x) put_luma_h_12_4x4_c: 0.3 ( 1.00x) put_luma_h_12_8x8_c: 1.0 ( 1.00x) put_luma_h_12_8x8_neon: 0.4 ( 2.58x) put_luma_h_12_16x16_c: 2.9 ( 1.00x) put_luma_h_12_16x16_neon: 1.4 ( 2.00x) put_luma_h_12_32x32_c: 9.4 ( 1.00x) put_luma_h_12_32x32_neon: 5.8 ( 1.61x) put_luma_h_12_64x64_c: 35.3 ( 1.00x) put_luma_h_12_64x64_neon: 23.3 ( 1.52x) put_luma_h_12_128x128_c: 131.2 ( 1.00x) put_luma_h_12_128x128_neon: 92.4 ( 1.42x)	2025-11-24 21:22:55 +00:00
Kacper Michajłow	9ad20839fb	avcodec/pixblockdsp: be consistent about restrict use in ff_{get,diff}_pixels Suppresses warnings about function pointer mismatch. Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2025-10-25 01:01:15 +02:00
Bin Peng	3115c0c0e6	lavc/aarch64: Fix addp overflow in ff_pred16x16_plane_neon_10 The mismatch between neon and C functions can be reproduced using the following bitstream and command line on aarch64 devices. wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264 ./ffmpeg -cpuflags 0 -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_ref ./ffmpeg -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon Signed-off-by: Bin Peng <pengbin@visionular.com>	2025-10-24 15:32:35 +00:00
Krzysztof Pyrkosz	03c054d43c	avcodec/aarch64/vvc: Implement dmvr_v_8 A72 dmvr_v_8_12x20_neon: 207.0 ( 4.15x) dmvr_v_8_20x12_neon: 170.4 ( 4.37x) dmvr_v_8_20x20_neon: 273.4 ( 4.58x) A53 dmvr_v_8_12x20_neon: 450.6 ( 4.21x) dmvr_v_8_20x12_neon: 342.8 ( 3.70x) dmvr_v_8_20x20_neon: 550.9 ( 3.79x)	2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz	56a638d836	avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon Before and after: A53: apply_bdof_8_16x8_neon: 2733.1 ( 4.88x) apply_bdof_8_16x16_neon: 5458.6 ( 4.86x) apply_bdof_10_16x8_neon: 2789.8 ( 4.64x) apply_bdof_10_16x16_neon: 5523.8 ( 4.68x) apply_bdof_12_16x8_neon: 2792.8 ( 4.58x) apply_bdof_12_16x16_neon: 5519.5 ( 4.63x) apply_bdof_8_16x8_neon: 2571.8 ( 5.12x) apply_bdof_8_16x16_neon: 5173.3 ( 5.12x) apply_bdof_10_16x8_neon: 2635.1 ( 4.87x) apply_bdof_10_16x16_neon: 5243.0 ( 4.89x) apply_bdof_12_16x8_neon: 2613.0 ( 4.89x) apply_bdof_12_16x16_neon: 5231.7 ( 4.90x) A78: apply_bdof_8_16x8_neon: 565.3 ( 8.43x) apply_bdof_8_16x16_neon: 1109.5 ( 8.60x) apply_bdof_10_16x8_neon: 568.2 ( 7.92x) apply_bdof_10_16x16_neon: 1114.1 ( 8.08x) apply_bdof_12_16x8_neon: 570.2 ( 7.87x) apply_bdof_12_16x16_neon: 1116.3 ( 8.03x) apply_bdof_8_16x8_neon: 541.4 ( 8.81x) apply_bdof_8_16x16_neon: 1065.9 ( 8.97x) apply_bdof_10_16x8_neon: 543.2 ( 8.32x) apply_bdof_10_16x16_neon: 1071.5 ( 8.39x) apply_bdof_12_16x8_neon: 544.2 ( 8.25x) apply_bdof_12_16x16_neon: 1074.1 ( 8.37x)	2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz	f1a155d975	avcodec/aarch64/vvc: Optimize dmvr_hv_10 Before and after on A53: dmvr_hv_10_12x20_neon: 1838.2 ( 3.02x) dmvr_hv_10_20x12_neon: 1330.2 ( 1.83x) dmvr_hv_10_20x20_neon: 2148.2 ( 1.85x) dmvr_hv_12_12x20_neon: 1839.2 ( 3.02x) dmvr_hv_12_20x12_neon: 1330.6 ( 1.83x) dmvr_hv_12_20x20_neon: 2147.2 ( 1.85x) dmvr_hv_10_12x20_neon: 1755.0 ( 3.17x) dmvr_hv_10_20x12_neon: 1165.8 ( 2.09x) dmvr_hv_10_20x20_neon: 1876.1 ( 2.12x) dmvr_hv_12_12x20_neon: 1754.4 ( 3.17x) dmvr_hv_12_20x12_neon: 1167.8 ( 2.09x) dmvr_hv_12_20x20_neon: 1878.8 ( 2.12x)	2025-09-21 19:39:27 +00:00
Georgii Zagoruiko	4fbacb3944	avcodec/aarch64/vvc: Optimised version of classify function. Macbook Air (M2): vvc_alf_classify_8x8_8_c: 2.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 1.0 ( 2.47x) vvc_alf_classify_8x8_10_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_10_neon: 0.9 ( 2.98x) vvc_alf_classify_8x8_12_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_12_neon: 0.9 ( 2.97x) vvc_alf_classify_16x16_8_c: 7.3 ( 1.00x) vvc_alf_classify_16x16_8_neon: 3.4 ( 2.12x) vvc_alf_classify_16x16_10_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_10_neon: 2.9 ( 1.47x) vvc_alf_classify_16x16_12_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_12_neon: 3.0 ( 1.44x) vvc_alf_classify_32x32_8_c: 13.7 ( 1.00x) vvc_alf_classify_32x32_8_neon: 10.7 ( 1.29x) vvc_alf_classify_32x32_10_c: 12.3 ( 1.00x) vvc_alf_classify_32x32_10_neon: 8.7 ( 1.42x) vvc_alf_classify_32x32_12_c: 12.2 ( 1.00x) vvc_alf_classify_32x32_12_neon: 8.7 ( 1.40x) vvc_alf_classify_64x64_8_c: 45.8 ( 1.00x) vvc_alf_classify_64x64_8_neon: 37.1 ( 1.23x) vvc_alf_classify_64x64_10_c: 41.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 32.8 ( 1.26x) vvc_alf_classify_64x64_12_c: 41.4 ( 1.00x) vvc_alf_classify_64x64_12_neon: 32.4 ( 1.28x) vvc_alf_classify_128x128_8_c: 163.7 ( 1.00x) vvc_alf_classify_128x128_8_neon: 138.3 ( 1.18x) vvc_alf_classify_128x128_10_c: 149.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 120.3 ( 1.24x) vvc_alf_classify_128x128_12_c: 148.7 ( 1.00x) vvc_alf_classify_128x128_12_neon: 119.4 ( 1.25x) RPi4 (Cortex-A72): vvc_alf_classify_8x8_8_c: 1251.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 700.7 ( 1.79x) vvc_alf_classify_8x8_10_c: 1141.9 ( 1.00x) vvc_alf_classify_8x8_10_neon: 659.7 ( 1.73x) vvc_alf_classify_8x8_12_c: 1075.8 ( 1.00x) vvc_alf_classify_8x8_12_neon: 658.7 ( 1.63x) vvc_alf_classify_16x16_8_c: 3574.1 ( 1.00x) vvc_alf_classify_16x16_8_neon: 1849.8 ( 1.93x) vvc_alf_classify_16x16_10_c: 3270.0 ( 1.00x) vvc_alf_classify_16x16_10_neon: 1786.1 ( 1.83x) vvc_alf_classify_16x16_12_c: 3271.7 ( 1.00x) vvc_alf_classify_16x16_12_neon: 1785.5 ( 1.83x) 
vvc_alf_classify_32x32_8_c: 12451.9 ( 1.00x) vvc_alf_classify_32x32_8_neon: 5984.3 ( 2.08x) vvc_alf_classify_32x32_10_c: 11428.9 ( 1.00x) vvc_alf_classify_32x32_10_neon: 5756.3 ( 1.99x) vvc_alf_classify_32x32_12_c: 11252.8 ( 1.00x) vvc_alf_classify_32x32_12_neon: 5755.7 ( 1.96x) vvc_alf_classify_64x64_8_c: 47625.5 ( 1.00x) vvc_alf_classify_64x64_8_neon: 21071.9 ( 2.26x) vvc_alf_classify_64x64_10_c: 44576.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 21544.7 ( 2.07x) vvc_alf_classify_64x64_12_c: 44600.5 ( 1.00x) vvc_alf_classify_64x64_12_neon: 21491.2 ( 2.08x) vvc_alf_classify_128x128_8_c: 192143.3 ( 1.00x) vvc_alf_classify_128x128_8_neon: 82387.6 ( 2.33x) vvc_alf_classify_128x128_10_c: 177583.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 81628.8 ( 2.18x) vvc_alf_classify_128x128_12_c: 177582.2 ( 1.00x) vvc_alf_classify_128x128_12_neon: 81625.1 ( 2.18x)	2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz	de25cb4603	avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x Before and after: A53: apply_bdof_8_8x16_neon: 3320.5 ( 4.02x) apply_bdof_10_8x16_neon: 3317.8 ( 3.90x) apply_bdof_12_8x16_neon: 3303.6 ( 3.91x) apply_bdof_8_8x16_neon: 3168.1 ( 4.23x) apply_bdof_10_8x16_neon: 3127.8 ( 4.13x) apply_bdof_12_8x16_neon: 3119.3 ( 4.18x) A72: apply_bdof_8_8x16_neon: 1827.4 ( 5.02x) apply_bdof_10_8x16_neon: 1838.5 ( 4.89x) apply_bdof_12_8x16_neon: 1841.1 ( 4.83x) apply_bdof_8_8x16_neon: 1691.6 ( 5.46x) apply_bdof_10_8x16_neon: 1695.9 ( 5.23x) apply_bdof_12_8x16_neon: 1695.4 ( 5.29x) A78 apply_bdof_8_8x16_neon: 648.9 ( 7.43x) apply_bdof_10_8x16_neon: 646.1 ( 7.04x) apply_bdof_12_8x16_neon: 643.8 ( 7.04x) apply_bdof_8_8x16_neon: 603.2 ( 7.97x) apply_bdof_10_8x16_neon: 604.1 ( 7.52x) apply_bdof_12_8x16_neon: 604.5 ( 7.52x)	2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz	7b21bde34c	avcodec/aarch64/vvc: Implemented dmvr_h_10 A78: dmvr_h_10_12x20_neon: 82.2 ( 6.49x) dmvr_h_10_20x12_neon: 69.9 ( 3.66x) dmvr_h_10_20x20_neon: 112.5 ( 3.74x) dmvr_h_12_12x20_neon: 81.4 ( 6.51x) dmvr_h_12_20x12_neon: 69.2 ( 3.74x) dmvr_h_12_20x20_neon: 110.2 ( 3.85x) A72: dmvr_h_10_12x20_neon: 234.1 ( 4.67x) dmvr_h_10_20x12_neon: 221.4 ( 3.48x) dmvr_h_10_20x20_neon: 356.9 ( 3.59x) dmvr_h_12_12x20_neon: 234.1 ( 4.67x) dmvr_h_12_20x12_neon: 221.5 ( 3.53x) dmvr_h_12_20x20_neon: 357.0 ( 3.64x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	189e841cfd	avcodec/aarch64/vvc: Implement dmvr_h_8 A78: dmvr_h_8_12x20_neon: 76.6 ( 4.31x) dmvr_h_8_20x12_neon: 65.8 ( 3.49x) dmvr_h_8_20x20_neon: 106.6 ( 3.62x) A72: dmvr_h_8_12x20_neon: 190.6 ( 4.40x) dmvr_h_8_20x12_neon: 171.1 ( 4.31x) dmvr_h_8_20x20_neon: 275.1 ( 4.50x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	fb4407797e	Replace uxtl with umull in dmvr_hv_8 Before and after on A78: dmvr_hv_8_12x20_neon: 205.3 ( 5.21x) dmvr_hv_8_20x12_neon: 171.8 ( 3.15x) dmvr_hv_8_20x20_neon: 282.7 ( 3.11x) dmvr_hv_8_12x20_neon: 172.7 ( 5.58x) dmvr_hv_8_20x12_neon: 133.3 ( 3.36x) dmvr_hv_8_20x20_neon: 214.6 ( 3.40x)	2025-09-05 07:20:15 +00:00
Zhao Zhili	6ce02bcc3a	avcodec/aarch64/vvc: Optimize apply_bdof Before this patch, prof_grad_filter calculate gh[0], gh[1], gv[0], gv[1] and save them to stack. derive_bdof_vx_vy load them from stack and calculate gh[0] + gh[1], gv[0] + gv[1]. apply_bdof_min_block load them from stack and calculate gh[0] - gh[1], gv[0] - gv[1] This patch add bdof_grad_filter, which calculate gh[0] + gh[1], gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1], and save them to stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the results directly. prof_grad_filter is kept for reuse by other functions in the future. Benchmark on rpi5 with gcc 12 Before After -------------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7431.4 ( 1.00x) \| 7371.7 ( 1.00x) apply_bdof_8_8x16_neon: \| 1175.4 ( 6.32x) \| 1036.3 ( 7.11x) apply_bdof_8_16x8_c: \| 7182.2 ( 1.00x) \| 7201.1 ( 1.00x) apply_bdof_8_16x8_neon: \| 1021.7 ( 7.03x) \| 879.9 ( 8.18x) apply_bdof_8_16x16_c: \| 14577.1 ( 1.00x) \| 14589.3 ( 1.00x) apply_bdof_8_16x16_neon: \| 2012.8 ( 7.24x) \| 1743.3 ( 8.37x) apply_bdof_10_8x16_c: \| 7292.4 ( 1.00x) \| 7308.5 ( 1.00x) apply_bdof_10_8x16_neon: \| 1156.3 ( 6.31x) \| 1045.3 ( 6.99x) apply_bdof_10_16x8_c: \| 7112.4 ( 1.00x) \| 7214.4 ( 1.00x) apply_bdof_10_16x8_neon: \| 1007.6 ( 7.06x) \| 904.8 ( 7.97x) apply_bdof_10_16x16_c: \| 14363.3 ( 1.00x) \| 14476.4 ( 1.00x) apply_bdof_10_16x16_neon: \| 1986.9 ( 7.23x) \| 1783.1 ( 8.12x) apply_bdof_12_8x16_c: \| 7433.3 ( 1.00x) \| 7374.7 ( 1.00x) apply_bdof_12_8x16_neon: \| 1155.9 ( 6.43x) \| 1040.8 ( 7.09x) apply_bdof_12_16x8_c: \| 7171.1 ( 1.00x) \| 7376.3 ( 1.00x) apply_bdof_12_16x8_neon: \| 1010.8 ( 7.09x) \| 899.4 ( 8.20x) apply_bdof_12_16x16_c: \| 14515.5 ( 1.00x) \| 14731.5 ( 1.00x) apply_bdof_12_16x16_neon: \| 1988.4 ( 7.30x) \| 1785.2 ( 8.25x)	2025-09-03 06:55:37 +00:00
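The idea in 6ce02bcc3a above is to store the sums and differences of the gradient pairs once, so that later stages read them directly instead of each recomputing them from the raw values. A minimal scalar sketch of that precomputation follows; the function name `grad_sum_diff` and the flat array layout are assumptions for illustration and do not reflect the actual NEON code in inter.S:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: for two gradient arrays g0 and g1, store
 * g0[i] + g1[i] and g0[i] - g1[i] once, so that consumers which need
 * the sum (e.g. a vx/vy derivation) and consumers which need the
 * difference (e.g. the final block application) can load them as-is. */
static void grad_sum_diff(const int16_t *g0, const int16_t *g1,
                          int16_t *sum, int16_t *diff, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        sum[i]  = (int16_t)(g0[i] + g1[i]);
        diff[i] = (int16_t)(g0[i] - g1[i]);
    }
}
```

The trade-off is the same amount of stack traffic for the stored values, but the add and subtract are done once rather than in each consumer.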
Zhao Zhili	2e92417603	avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy Implement line tricks and pixel tricks. See comments in inter.S for details. Benchmark on rpi5 with gcc 12 Before After ----------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7375.5 ( 1.00x) \| 7473.8 ( 1.00x) apply_bdof_8_8x16_neon: \| 1875.1 ( 3.93x) \| 1135.8 ( 6.58x) apply_bdof_8_16x8_c: \| 7273.9 ( 1.00x) \| 7204.0 ( 1.00x) apply_bdof_8_16x8_neon: \| 1738.2 ( 4.18x) \| 1013.0 ( 7.11x) apply_bdof_8_16x16_c: \| 14744.9 ( 1.00x) \| 14712.6 ( 1.00x) apply_bdof_8_16x16_neon: \| 3446.7 ( 4.28x) \| 1997.7 ( 7.36x) apply_bdof_10_8x16_c: \| 7352.4 ( 1.00x) \| 7485.7 ( 1.00x) apply_bdof_10_8x16_neon: \| 1861.0 ( 3.95x) \| 1134.1 ( 6.60x) apply_bdof_10_16x8_c: \| 7330.5 ( 1.00x) \| 7232.8 ( 1.00x) apply_bdof_10_16x8_neon: \| 1747.2 ( 4.20x) \| 1002.6 ( 7.21x) apply_bdof_10_16x16_c: \| 14522.4 ( 1.00x) \| 14664.8 ( 1.00x) apply_bdof_10_16x16_neon: \| 3490.5 ( 4.16x) \| 1978.4 ( 7.41x) apply_bdof_12_8x16_c: \| 7389.0 ( 1.00x) \| 7380.1 ( 1.00x) apply_bdof_12_8x16_neon: \| 1861.3 ( 3.97x) \| 1134.0 ( 6.51x) apply_bdof_12_16x8_c: \| 7283.1 ( 1.00x) \| 7336.9 ( 1.00x) apply_bdof_12_16x8_neon: \| 1749.1 ( 4.16x) \| 1002.3 ( 7.32x) apply_bdof_12_16x16_c: \| 14580.7 ( 1.00x) \| 14502.7 ( 1.00x) apply_bdof_12_16x16_neon: \| 3472.9 ( 4.20x) \| 1978.3 ( 7.33x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-09-03 06:55:37 +00:00
Timo Rothenpieler	262d41c804	all: fix typos found by codespell	2025-08-03 13:48:47 +02:00
Andreas Rheinhardt	9b409ea1e6	configure: Factor mpegvideoencdsp out of mpegvideoenc This will allow to relax the dependency on mpegvideoenc for several codecs. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-06-21 22:08:52 +02:00
Andreas Rheinhardt	20ddada2a3	avcodec/pixblockdsp: Improve 8 vs 16 bit check Before this commit, the input in get_pixels and get_pixels_unaligned has been treated inconsistently: - The generic code treated 9, 10, 12 and 14 bits as 16bit input (these bits correspond to what FFmpeg's dsputils supported), everything with <= 8 bits as 8 bit and everything else as 8 bit when used via AVDCT (which exposes these functions and purports to support up to 14 bits). - AARCH64, ARM, PPC and RISC-V, x86 ignore this AVDCT special case. - RISC-V also ignored the restriction to 9, 10, 12 and 14 for its 16bit check and treated everything > 8 bits as 16bit. - The mmi MIPS code treats everything as 8 bit when used via AVDCT (this is certainly broken); otherwise it checks for <= 8 bits. The msa MIPS code behaves like the generic code. This commit changes this to treat 9..16 bits as 16 bit input, everything else as 8 bit (the former because it makes sense, the latter to preserve the behaviour for external users). The only internal user of AVDCT (the spp filter) always uses 8, 9 or 10 bits. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-05-31 01:25:27 +02:00
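The rule that 20ddada2a3 settles on (9..16 bits is treated as 16-bit input, everything else as 8-bit) can be written as a single predicate. This is a minimal sketch only; the function name `is_high_bit_depth` and its plain `int` parameter are hypothetical and not taken from the commit:

```c
/* Hypothetical predicate for the rule described in the commit message:
 * returns 1 when the sample depth should use the 16-bit code path
 * (9..16 bits), and 0 for everything else, which falls back to the
 * 8-bit path to preserve behaviour for external users. */
static int is_high_bit_depth(int bits_per_raw_sample)
{
    return bits_per_raw_sample > 8 && bits_per_raw_sample <= 16;
}
```

Under this rule a depth of 11 or 16 selects the 16-bit path, while 0, 8, or 17 selects the 8-bit path.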
Zhao Zhili	26752368f0	aarch64/h26x: Add put_hevc_pel_bi_w_pixels On rpi5 (A76): put_hevc_pel_bi_w_pixels4_8_c: 90.0 ( 1.00x) put_hevc_pel_bi_w_pixels4_8_neon: 34.1 ( 2.64x) put_hevc_pel_bi_w_pixels6_8_c: 188.3 ( 1.00x) put_hevc_pel_bi_w_pixels6_8_neon: 73.5 ( 2.56x) put_hevc_pel_bi_w_pixels8_8_c: 327.1 ( 1.00x) put_hevc_pel_bi_w_pixels8_8_neon: 75.8 ( 4.32x) put_hevc_pel_bi_w_pixels12_8_c: 728.8 ( 1.00x) put_hevc_pel_bi_w_pixels12_8_neon: 186.1 ( 3.92x) put_hevc_pel_bi_w_pixels16_8_c: 1288.1 ( 1.00x) put_hevc_pel_bi_w_pixels16_8_neon: 268.5 ( 4.80x) put_hevc_pel_bi_w_pixels24_8_c: 2855.5 ( 1.00x) put_hevc_pel_bi_w_pixels24_8_neon: 723.8 ( 3.95x) put_hevc_pel_bi_w_pixels32_8_c: 5095.3 ( 1.00x) put_hevc_pel_bi_w_pixels32_8_neon: 1165.0 ( 4.37x) put_hevc_pel_bi_w_pixels48_8_c: 11521.5 ( 1.00x) put_hevc_pel_bi_w_pixels48_8_neon: 2856.0 ( 4.03x) put_hevc_pel_bi_w_pixels64_8_c: 21020.5 ( 1.00x) put_hevc_pel_bi_w_pixels64_8_neon: 4699.1 ( 4.47x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-04-29 15:24:14 +08:00
Zhao Zhili	39786f8cd5	aarch64/h26x: optimize sao_band_filter int8_t[] is enough for offset_table of 8 bit streams. On rpi5: Before After hevc_sao_band_8_8_c: 252.3 ( 1.00x) 252.3 ( 1.00x) hevc_sao_band_8_8_neon: 95.8 ( 2.63x) 61.0 ( 4.57x) hevc_sao_band_16_8_c: 875.2 ( 1.00x) 864.9 ( 1.00x) hevc_sao_band_16_8_neon: 317.5 ( 2.76x) 150.0 ( 6.26x) hevc_sao_band_32_8_c: 3853.5 ( 1.00x) 3871.6 ( 1.00x) hevc_sao_band_32_8_neon: 1222.3 ( 3.15x) 550.6 ( 7.39x) hevc_sao_band_48_8_c: 8203.6 ( 1.00x) 8182.6 ( 1.00x) hevc_sao_band_48_8_neon: 2685.7 ( 3.05x) 1185.8 ( 7.36x) hevc_sao_band_64_8_c: 14023.0 ( 1.00x) 14038.9 ( 1.00x) hevc_sao_band_64_8_neon: 4783.2 ( 2.93x) 2078.4 ( 7.15x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-04-29 15:11:45 +08:00

496 commits