Commit graph

496 commits

Author SHA1 Message Date
Jeongkeun Kim
4ea59d5665 avcodec/aarch64: add NEON DCA LFE FIR filter functions
Port lfe_fir0_float and lfe_fir1_float to AArch64 NEON. These polyphase
FIR interpolation filters have an x86 SSE/AVX path but no AArch64
equivalent, falling back to scalar C.

The inner loop computes two dot products per output pair. Precomputing a
reversed LFE sample vector before the inner loop avoids per-iteration
shuffle overhead.

Benchmarks on AWS Graviton3 (Neoverse V1, c7g.xlarge):
  lfe_fir0_float: C 5902.0 cycles -> NEON 2135.0 cycles (2.77x)
  lfe_fir1_float: C 2836.3 cycles -> NEON 1527.8 cycles (1.86x)
Measured with: taskset -c 0 ./tests/checkasm/checkasm --test=dcadsp --bench,
3-run average, Ubuntu 22.04 (kernel 6.8.0-1052-aws), perf_event_paranoid=0.

Signed-off-by: Jeongkeun Kim <variety0724@gmail.com>
2026-04-27 20:13:23 +00:00
Georgii Zagoruiko
1ced59326a aarch64/vvc: Optimisations of put_chroma_hv() functions for 10/12-bit
Apple M4:
put_chroma_hv_10_2x2_c:                                  9.1 ( 1.00x)
put_chroma_hv_10_4x4_c:                                 20.1 ( 1.00x)
put_chroma_hv_10_8x8_c:                                 35.6 ( 1.00x)
put_chroma_hv_10_8x8_neon:                              15.4 ( 2.31x)
put_chroma_hv_10_16x16_c:                              113.7 ( 1.00x)
put_chroma_hv_10_16x16_neon:                            57.0 ( 1.99x)
put_chroma_hv_10_32x32_c:                              406.9 ( 1.00x)
put_chroma_hv_10_32x32_neon:                           225.7 ( 1.80x)
put_chroma_hv_10_64x64_c:                             1498.8 ( 1.00x)
put_chroma_hv_10_64x64_neon:                           876.2 ( 1.71x)
put_chroma_hv_10_128x128_c:                           5757.0 ( 1.00x)
put_chroma_hv_10_128x128_neon:                        3446.6 ( 1.67x)
put_chroma_hv_12_2x2_c:                                  9.9 ( 1.00x)
put_chroma_hv_12_4x4_c:                                 19.2 ( 1.00x)
put_chroma_hv_12_8x8_c:                                 36.1 ( 1.00x)
put_chroma_hv_12_8x8_neon:                              17.9 ( 2.02x)
put_chroma_hv_12_16x16_c:                              112.2 ( 1.00x)
put_chroma_hv_12_16x16_neon:                            55.6 ( 2.02x)
put_chroma_hv_12_32x32_c:                              416.6 ( 1.00x)
put_chroma_hv_12_32x32_neon:                           224.3 ( 1.86x)
put_chroma_hv_12_64x64_c:                             1464.8 ( 1.00x)
put_chroma_hv_12_64x64_neon:                           860.1 ( 1.70x)
put_chroma_hv_12_128x128_c:                           5776.8 ( 1.00x)
put_chroma_hv_12_128x128_neon:                        3445.2 ( 1.68x)

RPi5:
put_chroma_hv_10_2x2_c:                                118.5 ( 1.00x)
put_chroma_hv_10_4x4_c:                                190.6 ( 1.00x)
put_chroma_hv_10_8x8_c:                                303.1 ( 1.00x)
put_chroma_hv_10_8x8_neon:                             172.6 ( 1.76x)
put_chroma_hv_10_16x16_c:                             1036.1 ( 1.00x)
put_chroma_hv_10_16x16_neon:                           626.7 ( 1.65x)
put_chroma_hv_10_32x32_c:                             3624.4 ( 1.00x)
put_chroma_hv_10_32x32_neon:                          2386.9 ( 1.52x)
put_chroma_hv_10_64x64_c:                            13612.1 ( 1.00x)
put_chroma_hv_10_64x64_neon:                          9314.8 ( 1.46x)
put_chroma_hv_10_128x128_c:                          52975.4 ( 1.00x)
put_chroma_hv_10_128x128_neon:                       37083.5 ( 1.43x)
put_chroma_hv_12_2x2_c:                                118.6 ( 1.00x)
put_chroma_hv_12_4x4_c:                                188.1 ( 1.00x)
put_chroma_hv_12_8x8_c:                                303.4 ( 1.00x)
put_chroma_hv_12_8x8_neon:                             176.7 ( 1.72x)
put_chroma_hv_12_16x16_c:                             1037.9 ( 1.00x)
put_chroma_hv_12_16x16_neon:                           626.5 ( 1.66x)
put_chroma_hv_12_32x32_c:                             3629.0 ( 1.00x)
put_chroma_hv_12_32x32_neon:                          2386.6 ( 1.52x)
put_chroma_hv_12_64x64_c:                            13649.0 ( 1.00x)
put_chroma_hv_12_64x64_neon:                          9313.6 ( 1.47x)
put_chroma_hv_12_128x128_c:                          52978.0 ( 1.00x)
put_chroma_hv_12_128x128_neon:                       37101.2 ( 1.43x)
2026-04-27 20:10:57 +00:00
Jun Zhao
75838b9c89 lavc/hevc: add aarch64 NEON for reference sample filtering
3-tap [1,2,1]>>2: shared implementation body across size-specialized
entry points (8x8/16x16/32x32) to reduce code size. Fold the 3-tap
kernel into uhadd + urhadd: uhadd gives floor((prev+next)/2), then
urhadd rounds with curr to produce (prev + 2*curr + next + 2) >> 2
on 16 bytes in-place (no widen/narrow needed). Overlap-last technique
for tail avoids partial stores. Caller pads input arrays by 16 bytes
to guarantee safe over-read.

Strong smoothing (32x32): preloaded weight tables, interleaved
umull/umlal pairs (two 16-byte blocks at a time) to hide
rshrn-to-store latency, with paired st1 for 32-byte writes.

checkasm --bench --runs=15 (Apple M4, average of 3 trials):
  ref_filter_3tap_8x8_8_neon:    4.1x
  ref_filter_3tap_16x16_8_neon:  3.3x
  ref_filter_3tap_32x32_8_neon:  2.5x
  ref_filter_strong_8_neon:      1.9x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-04-21 07:50:49 +00:00
Jun Zhao
89c21b5ab7 lavc/hevc: add aarch64 NEON for Planar prediction
Add NEON-optimized implementation for HEVC intra Planar prediction at
8-bit depth, supporting all block sizes (4x4 to 32x32).

Planar prediction implements bilinear interpolation using an incremental
base update: base_{y+1}[x] = base_y[x] - (top[x] - left[N]), reducing
per-row computation from 4 multiply-adds to 1 subtract + 1 multiply.
Uses rshrn for rounded narrowing shifts, eliminating manual rounding
bias. All left[y] values are broadcast in the NEON domain, avoiding
GP-to-NEON transfers.

4x4 interleaves row computations across 4 rows to break dependencies.
16x16 uses v19-v22 for persistent base/decrement vectors, avoiding
callee-saved register spills. 32x32 processes 8 rows per loop iteration
(4 iterations total) to reduce code size while maintaining full NEON
utilization.

Speedup over C on Apple M4 (checkasm --bench):

    4x4: 2.25x    8x8: 6.40x    16x16: 9.72x    32x32: 3.21x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-30 14:32:10 +00:00
Jun Zhao
60b372c934 lavc/hevc: add aarch64 NEON for DC prediction
Add NEON-optimized implementation for HEVC intra DC prediction at 8-bit
depth, supporting all block sizes (4x4 to 32x32).

DC prediction computes the average of top and left reference samples
using uaddlv, with urshr for rounded division. For luma blocks smaller
than 32x32, edge smoothing is applied: the first row and column are
blended toward the reference using (ref[i] + 3*dc + 2) >> 2 computed
entirely in the NEON domain. Fill stores use pre-computed address
patterns to break dependency chains.

Also adds the aarch64 initialization framework (Makefile, pred.c/pred.h
hooks, hevcpred_init_aarch64.c).

Speedup over C on Apple M4 (checkasm --bench):

    4x4: 2.28x    8x8: 3.14x    16x16: 3.29x    32x32: 3.02x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-30 14:32:10 +00:00
Georgii Zagoruiko
1c385023aa aarch64/vvc: Optimisations of put_chroma_v() functions for 10/12-bit
Apple M4:
put_chroma_v_10_2x2_c:                                   5.8 ( 1.00x)
put_chroma_v_10_4x4_c:                                   9.0 ( 1.00x)
put_chroma_v_10_4x4_neon:                                1.7 ( 5.29x)
put_chroma_v_10_8x8_c:                                  22.1 ( 1.00x)
put_chroma_v_10_8x8_neon:                                5.8 ( 3.79x)
put_chroma_v_10_16x16_c:                                56.3 ( 1.00x)
put_chroma_v_10_16x16_neon:                             21.2 ( 2.66x)
put_chroma_v_10_32x32_c:                               181.6 ( 1.00x)
put_chroma_v_10_32x32_neon:                             86.9 ( 2.09x)
put_chroma_v_10_64x64_c:                               680.3 ( 1.00x)
put_chroma_v_10_64x64_neon:                            337.4 ( 2.02x)
put_chroma_v_10_128x128_c:                            2567.3 ( 1.00x)
put_chroma_v_10_128x128_neon:                         1374.8 ( 1.87x)
put_chroma_v_12_2x2_c:                                   6.4 ( 1.00x)
put_chroma_v_12_4x4_c:                                   8.2 ( 1.00x)
put_chroma_v_12_4x4_neon:                                1.5 ( 5.56x)
put_chroma_v_12_8x8_c:                                  18.9 ( 1.00x)
put_chroma_v_12_8x8_neon:                                5.7 ( 3.29x)
put_chroma_v_12_16x16_c:                                52.6 ( 1.00x)
put_chroma_v_12_16x16_neon:                             19.9 ( 2.65x)
put_chroma_v_12_32x32_c:                               185.7 ( 1.00x)
put_chroma_v_12_32x32_neon:                             81.9 ( 2.27x)
put_chroma_v_12_64x64_c:                               661.8 ( 1.00x)
put_chroma_v_12_64x64_neon:                            342.1 ( 1.93x)
put_chroma_v_12_128x128_c:                            2547.8 ( 1.00x)
put_chroma_v_12_128x128_neon:                         1368.0 ( 1.86x)

RPi4:
put_chroma_v_10_2x2_c:                                  64.8 ( 1.00x)
put_chroma_v_10_4x4_c:                                 157.2 ( 1.00x)
put_chroma_v_10_4x4_neon:                               39.7 ( 3.96x)
put_chroma_v_10_8x8_c:                                 562.1 ( 1.00x)
put_chroma_v_10_8x8_neon:                               98.8 ( 5.69x)
put_chroma_v_10_16x16_c:                              1170.7 ( 1.00x)
put_chroma_v_10_16x16_neon:                            380.7 ( 3.07x)
put_chroma_v_10_32x32_c:                              3696.6 ( 1.00x)
put_chroma_v_10_32x32_neon:                           1723.8 ( 2.14x)
put_chroma_v_10_64x64_c:                             13170.9 ( 1.00x)
put_chroma_v_10_64x64_neon:                           7284.1 ( 1.81x)
put_chroma_v_10_128x128_c:                           46068.3 ( 1.00x)
put_chroma_v_10_128x128_neon:                        27219.5 ( 1.69x)
put_chroma_v_12_2x2_c:                                  63.8 ( 1.00x)
put_chroma_v_12_4x4_c:                                 156.5 ( 1.00x)
put_chroma_v_12_4x4_neon:                               39.3 ( 3.98x)
put_chroma_v_12_8x8_c:                                 560.9 ( 1.00x)
put_chroma_v_12_8x8_neon:                               98.7 ( 5.68x)
put_chroma_v_12_16x16_c:                              1169.9 ( 1.00x)
put_chroma_v_12_16x16_neon:                            380.8 ( 3.07x)
put_chroma_v_12_32x32_c:                              3693.9 ( 1.00x)
put_chroma_v_12_32x32_neon:                           1728.4 ( 2.14x)
put_chroma_v_12_64x64_c:                             13170.9 ( 1.00x)
put_chroma_v_12_64x64_neon:                           7284.9 ( 1.81x)
put_chroma_v_12_128x128_c:                           46068.0 ( 1.00x)
put_chroma_v_12_128x128_neon:                        27224.6 ( 1.69x)
2026-03-27 13:42:50 +00:00
Martin Storsjö
f72f692afa aarch64: Add PAC sign/validation of the link register
Whenever the link register is stored on the stack, sign it
before storing it and validate at a symmetrical point (with the
stack at the same level as when it was signed).

These macros only have an effect if built with PAC enabled (e.g.
through -mbranch-protection=standard), otherwise they don't
generate any extra instructions.

None of these cases were present when PAC support was added
in 248986a0db in 2022.

Without these changes, PAC still had an effect in the compiler
generated code and in the existing cases where we these macros were
used - but make it apply to the remaining cases of link register
on the stack.
2026-03-20 13:16:06 +02:00
Martin Storsjö
dbf7354d98 aarch64/inter_sme2: Remove needless backup/restore of x29/x30
The sme_entry/sme_exit macros already take care of backing up/restoring
these registers. Additionally, as long as no function calls are
made within the function, x30 doesn't need to be backed up at all.
2026-03-20 13:16:06 +02:00
Martin Storsjö
1f7ed8a78d aarch64: hevcdsp: Make returns match the call site
For cases when returning early without updating any pixels, we
previously returned to return address in the caller's scope,
bypassing one function entirely. While this may seem like a neat
optimization, it makes the return stack predictor mispredict
the returns - which potentially can cost more performance than
it gains.

Secondly, if the armv9.3 feature GCS (Guarded Control Stack) is
enabled, then returns _must_ match the expected value; this feature
is being enabled across linux distributions, and by fixing the
hevc assembly, we can enable the security feature on ffmpeg as well.
2026-03-17 20:37:53 +00:00
Jun Zhao
254b92ec8a lavc/hevc: reorder aarch64 NEON pel function assignments
Group assignments by filter family (qpel, epel), variant
(base, uni, bi, uni_w, bi_w) and direction (pixels, h, v, hv).
Add NEON8_FNASSIGN_QPEL_H macro to replace repeated manual
qpel horizontal assignments.

No functional change.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-13 21:43:37 +00:00
Jun Zhao
489d36b5e1 lavc/hevc: add aarch64 NEON for epel uni horizontal filter
Add NEON-optimized implementations for HEVC EPEL uni-directional
horizontal interpolation (put_hevc_epel_uni_h) at 8-bit depth.

These functions perform horizontal 4-tap EPEL filtering with
output directly to uint8_t pixels (no weighting):
- 4-tap horizontal EPEL filter
- Output: (filter_result + 32) >> 6, clipped to [0, 255]

Supports all block widths: 4, 6, 8, 12, 16, 24, 32, 48, 64.

Performance results on Apple M4:
./tests/checkasm/checkasm --test=hevc_pel --bench

put_hevc_epel_uni_h4_8_neon:   2.26x
put_hevc_epel_uni_h6_8_neon:   2.71x
put_hevc_epel_uni_h8_8_neon:   4.40x
put_hevc_epel_uni_h12_8_neon:  3.60x
put_hevc_epel_uni_h16_8_neon:  3.00x
put_hevc_epel_uni_h24_8_neon:  3.72x
put_hevc_epel_uni_h32_8_neon:  3.14x
put_hevc_epel_uni_h48_8_neon:  3.16x
put_hevc_epel_uni_h64_8_neon:  3.15x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-13 21:43:37 +00:00
Jun Zhao
f5e6cca935 lavc/hevc: add aarch64 NEON for qpel uni-weighted HV filter
Add NEON-optimized implementations for HEVC QPEL uni-directional
weighted HV interpolation (put_hevc_qpel_uni_w_hv) at 8-bit depth,
for block widths 6, 12, 24, and 48.

These functions perform horizontal then vertical 8-tap QPEL filtering
with weighting (wx, ox, denom) and output to uint8_t. Previously
only widths 4, 8, 16, 32, 64 were implemented; this completes
coverage for all standard HEVC block widths.

Performance results on Apple M4:
./tests/checkasm/checkasm --test=hevc_pel --bench

put_hevc_qpel_uni_w_hv6_8_neon:   3.11x
put_hevc_qpel_uni_w_hv12_8_neon:  3.19x
put_hevc_qpel_uni_w_hv24_8_neon:  2.26x
put_hevc_qpel_uni_w_hv48_8_neon:  1.80x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-13 21:43:37 +00:00
Jun Zhao
fe41ff7413 lavc/hevc: add aarch64 NEON for qpel uni-weighted vertical filter
Add NEON-optimized implementations for HEVC QPEL uni-weighted
vertical interpolation (put_hevc_qpel_uni_w_v) at 8-bit depth.

These functions perform weighted uni-directional prediction with
vertical QPEL filtering:
- 8-tap vertical QPEL filter
- Weighted prediction: (filter_result * wx + offset) >> shift

Previously only sizes 4, 8, 16, 64 were optimized. This patch adds
optimized implementations for all remaining sizes: 6, 12, 24, 32, 48.

Performance results on Apple M4:
./tests/checkasm/checkasm --test=hevc_pel --bench

put_hevc_qpel_uni_w_v6_8_neon:   3.40x
put_hevc_qpel_uni_w_v12_8_neon:  3.24x
put_hevc_qpel_uni_w_v24_8_neon:  3.06x
put_hevc_qpel_uni_w_v32_8_neon:  2.66x
put_hevc_qpel_uni_w_v48_8_neon:  2.67x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-13 21:43:37 +00:00
Jun Zhao
32df0352b7 lavc/hevc: move subs earlier in qpel uni-weighted NEON loops
Move the subs instruction before the store macro in the 8x-unrolled
loops of qpel_uni_w_v4/v8/v16/v64 and qpel_uni_w_hv4/hv8/hv16, so
that many NEON instructions from the store macro separate it from the
conditional branch. This gives the CPU pipeline time to resolve the
condition flags before the branch decision.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-13 21:43:37 +00:00
Georgii Zagoruiko
c1be2107c9 aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit
RPi4:
put_chroma_h_10_2x2_c:                                  63.4 ( 1.00x)
put_chroma_h_10_4x4_c:                                 151.4 ( 1.00x)
put_chroma_h_10_8x8_c:                                 555.1 ( 1.00x)
put_chroma_h_10_8x8_neon:                              113.9 ( 4.88x)
put_chroma_h_10_16x16_c:                              1068.5 ( 1.00x)
put_chroma_h_10_16x16_neon:                            439.4 ( 2.43x)
put_chroma_h_10_32x32_c:                              3432.6 ( 1.00x)
put_chroma_h_10_32x32_neon:                           1878.3 ( 1.83x)
put_chroma_h_10_64x64_c:                             12872.2 ( 1.00x)
put_chroma_h_10_64x64_neon:                           7868.2 ( 1.64x)
put_chroma_h_10_128x128_c:                           45612.2 ( 1.00x)
put_chroma_h_10_128x128_neon:                        28742.1 ( 1.59x)
put_chroma_h_12_2x2_c:                                  63.7 ( 1.00x)
put_chroma_h_12_4x4_c:                                 151.5 ( 1.00x)
put_chroma_h_12_8x8_c:                                 555.2 ( 1.00x)
put_chroma_h_12_8x8_neon:                              114.2 ( 4.86x)
put_chroma_h_12_16x16_c:                              1068.1 ( 1.00x)
put_chroma_h_12_16x16_neon:                            438.8 ( 2.43x)
put_chroma_h_12_32x32_c:                              3419.7 ( 1.00x)
put_chroma_h_12_32x32_neon:                           1878.7 ( 1.82x)
put_chroma_h_12_64x64_c:                             12862.2 ( 1.00x)
put_chroma_h_12_64x64_neon:                           7868.2 ( 1.63x)
put_chroma_h_12_128x128_c:                           45613.5 ( 1.00x)
put_chroma_h_12_128x128_neon:                        28743.3 ( 1.59x)

Apple M4:
put_chroma_h_10_2x2_c:                                   2.5 ( 1.00x)
put_chroma_h_10_4x4_c:                                   6.5 ( 1.00x)
put_chroma_h_10_8x8_c:                                  17.8 ( 1.00x)
put_chroma_h_10_8x8_neon:                                6.8 ( 2.60x)
put_chroma_h_10_16x16_c:                                53.3 ( 1.00x)
put_chroma_h_10_16x16_neon:                             30.4 ( 1.75x)
put_chroma_h_10_32x32_c:                               181.8 ( 1.00x)
put_chroma_h_10_32x32_neon:                            116.2 ( 1.56x)
put_chroma_h_10_64x64_c:                               684.2 ( 1.00x)
put_chroma_h_10_64x64_neon:                            470.3 ( 1.45x)
put_chroma_h_10_128x128_c:                            2567.6 ( 1.00x)
put_chroma_h_10_128x128_neon:                         1879.3 ( 1.37x)
put_chroma_h_12_2x2_c:                                   1.9 ( 1.00x)
put_chroma_h_12_4x4_c:                                   7.0 ( 1.00x)
put_chroma_h_12_8x8_c:                                  16.8 ( 1.00x)
put_chroma_h_12_8x8_neon:                                7.9 ( 2.12x)
put_chroma_h_12_16x16_c:                                55.0 ( 1.00x)
put_chroma_h_12_16x16_neon:                             29.0 ( 1.90x)
put_chroma_h_12_32x32_c:                               182.5 ( 1.00x)
put_chroma_h_12_32x32_neon:                            116.9 ( 1.56x)
put_chroma_h_12_64x64_c:                               666.8 ( 1.00x)
put_chroma_h_12_64x64_neon:                            474.5 ( 1.41x)
put_chroma_h_12_128x128_c:                            2588.1 ( 1.00x)
put_chroma_h_12_128x128_neon:                         1912.2 ( 1.35x)
2026-03-10 12:48:54 +00:00
Martin Storsjö
74cfcd1c69 aarch64/vvc: Fix DCE undefined references with MSVC
This fixes compiling with MSVC for aarch64 after
510999f6b0.

While MSVC does do dead code elimintation for function references
within e.g. "if (0)", it doesn't do that for functions referenced
within a static function, even if that static function itself ends
up not used.

A reproduction example:

    void missing(void);
    void (*func_ptr)(void);

    static void wrapper(void) {
        missing();
    }

    void init(int cpu_flags) {
        if (0) {
            func_ptr = wrapper;
        }
    }

If "wrapper" is entirely unreferenced, then MSVC doesn't produce
any reference to the symbol "missing". Also, if we do
"func_ptr = missing;" then the reference to missing also is
eliminated. But for the case of referencing the function in a
static function, even if the reference to the static function can
be eliminated, then MSVC does keep the reference to the symbol.
2026-03-05 11:57:40 +02:00
Zuoqiang He
1fc7464cf7 libavcodec/huffyuvdsp: Add NEON optimization for the add_int16 function
Benchmark Results (1024 iterations, Raspberry Pi 5 - Cortex-A76):
add_int16_128_c:                                       914.0 ( 1.00x)
add_int16_128_neon:                                    516.9 ( 1.77x)
add_int16_rnd_width_c:                                 914.0 ( 1.00x)
add_int16_rnd_width_neon:                              517.5 ( 1.77x)

Co-Authored-By: Martin Storsjö <martin@martin.st>
2026-03-04 22:31:19 +00:00
Georgii Zagoruiko
510999f6b0 aarch64/vvc: sme2 optimisation of alf_filter_luma() 8/10/12 bit
Apple M4:
vvc_alf_filter_luma_8x8_8_c:                           347.3 ( 1.00x)
vvc_alf_filter_luma_8x8_8_neon:                        138.7 ( 2.50x)
vvc_alf_filter_luma_8x8_8_sme2:                        134.5 ( 2.58x)
vvc_alf_filter_luma_8x8_10_c:                          299.8 ( 1.00x)
vvc_alf_filter_luma_8x8_10_neon:                       129.8 ( 2.31x)
vvc_alf_filter_luma_8x8_10_sme2:                       128.6 ( 2.33x)
vvc_alf_filter_luma_8x8_12_c:                          293.0 ( 1.00x)
vvc_alf_filter_luma_8x8_12_neon:                       126.8 ( 2.31x)
vvc_alf_filter_luma_8x8_12_sme2:                       126.3 ( 2.32x)
vvc_alf_filter_luma_16x16_8_c:                        1386.1 ( 1.00x)
vvc_alf_filter_luma_16x16_8_neon:                      560.3 ( 2.47x)
vvc_alf_filter_luma_16x16_8_sme2:                      540.1 ( 2.57x)
vvc_alf_filter_luma_16x16_10_c:                       1200.3 ( 1.00x)
vvc_alf_filter_luma_16x16_10_neon:                     515.6 ( 2.33x)
vvc_alf_filter_luma_16x16_10_sme2:                     531.3 ( 2.26x)
vvc_alf_filter_luma_16x16_12_c:                       1223.8 ( 1.00x)
vvc_alf_filter_luma_16x16_12_neon:                     510.7 ( 2.40x)
vvc_alf_filter_luma_16x16_12_sme2:                     524.9 ( 2.33x)
vvc_alf_filter_luma_32x32_8_c:                        5488.8 ( 1.00x)
vvc_alf_filter_luma_32x32_8_neon:                     2233.4 ( 2.46x)
vvc_alf_filter_luma_32x32_8_sme2:                     1093.6 ( 5.02x)
vvc_alf_filter_luma_32x32_10_c:                       4738.0 ( 1.00x)
vvc_alf_filter_luma_32x32_10_neon:                    2057.5 ( 2.30x)
vvc_alf_filter_luma_32x32_10_sme2:                    1053.6 ( 4.50x)
vvc_alf_filter_luma_32x32_12_c:                       4808.3 ( 1.00x)
vvc_alf_filter_luma_32x32_12_neon:                    1981.2 ( 2.43x)
vvc_alf_filter_luma_32x32_12_sme2:                    1047.7 ( 4.59x)
vvc_alf_filter_luma_64x64_8_c:                       22116.8 ( 1.00x)
vvc_alf_filter_luma_64x64_8_neon:                     8951.0 ( 2.47x)
vvc_alf_filter_luma_64x64_8_sme2:                     4225.2 ( 5.23x)
vvc_alf_filter_luma_64x64_10_c:                      19072.8 ( 1.00x)
vvc_alf_filter_luma_64x64_10_neon:                    8448.1 ( 2.26x)
vvc_alf_filter_luma_64x64_10_sme2:                    4225.8 ( 4.51x)
vvc_alf_filter_luma_64x64_12_c:                      19312.6 ( 1.00x)
vvc_alf_filter_luma_64x64_12_neon:                    8270.9 ( 2.34x)
vvc_alf_filter_luma_64x64_12_sme2:                    4245.4 ( 4.55x)
vvc_alf_filter_luma_128x128_8_c:                     88530.5 ( 1.00x)
vvc_alf_filter_luma_128x128_8_neon:                  35686.3 ( 2.48x)
vvc_alf_filter_luma_128x128_8_sme2:                  16961.2 ( 5.22x)
vvc_alf_filter_luma_128x128_10_c:                    76904.9 ( 1.00x)
vvc_alf_filter_luma_128x128_10_neon:                 32439.5 ( 2.37x)
vvc_alf_filter_luma_128x128_10_sme2:                 16845.6 ( 4.57x)
vvc_alf_filter_luma_128x128_12_c:                    77363.3 ( 1.00x)
vvc_alf_filter_luma_128x128_12_neon:                 32907.5 ( 2.35x)
vvc_alf_filter_luma_128x128_12_sme2:                 17018.1 ( 4.55x)
2026-03-04 23:52:58 +02:00
Georgii Zagoruiko
90431417cb aarch64/vvc: Optimisations of put_luma_hv() functions for 10/12-bit
Apple M2:
put_luma_hv_10_4x4_c:                                   36.3 ( 1.00x)
put_luma_hv_10_8x8_c:                                   82.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                                34.9 ( 2.37x)
put_luma_hv_10_16x16_c:                                239.2 ( 1.00x)
put_luma_hv_10_16x16_neon:                             119.0 ( 2.01x)
put_luma_hv_10_32x32_c:                                900.3 ( 1.00x)
put_luma_hv_10_32x32_neon:                             429.3 ( 2.10x)
put_luma_hv_10_64x64_c:                               2984.7 ( 1.00x)
put_luma_hv_10_64x64_neon:                            1736.2 ( 1.72x)
put_luma_hv_10_128x128_c:                            11194.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                          6357.3 ( 1.76x)
put_luma_hv_12_4x4_c:                                   35.9 ( 1.00x)
put_luma_hv_12_8x8_c:                                   82.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                                34.3 ( 2.41x)
put_luma_hv_12_16x16_c:                                240.2 ( 1.00x)
put_luma_hv_12_16x16_neon:                             115.3 ( 2.08x)
put_luma_hv_12_32x32_c:                                787.7 ( 1.00x)
put_luma_hv_12_32x32_neon:                             414.2 ( 1.90x)
put_luma_hv_12_64x64_c:                               3058.4 ( 1.00x)
put_luma_hv_12_64x64_neon:                            1592.3 ( 1.92x)
put_luma_hv_12_128x128_c:                            11350.8 ( 1.00x)
put_luma_hv_12_128x128_neon:                          6378.3 ( 1.78x)

RPi4:
put_luma_hv_10_4x4_c:                                  637.8 ( 1.00x)
put_luma_hv_10_8x8_c:                                 1044.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                               483.7 ( 2.16x)
put_luma_hv_10_16x16_c:                               3098.0 ( 1.00x)
put_luma_hv_10_16x16_neon:                            1603.1 ( 1.93x)
put_luma_hv_10_32x32_c:                              10054.8 ( 1.00x)
put_luma_hv_10_32x32_neon:                            5843.6 ( 1.72x)
put_luma_hv_10_64x64_c:                              40506.2 ( 1.00x)
put_luma_hv_10_64x64_neon:                           24384.0 ( 1.66x)
put_luma_hv_10_128x128_c:                           130604.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                         99746.6 ( 1.31x)
put_luma_hv_12_4x4_c:                                  638.2 ( 1.00x)
put_luma_hv_12_8x8_c:                                 1074.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                               482.6 ( 2.23x)
put_luma_hv_12_16x16_c:                               3094.0 ( 1.00x)
put_luma_hv_12_16x16_neon:                            1602.5 ( 1.93x)
put_luma_hv_12_32x32_c:                              10034.4 ( 1.00x)
put_luma_hv_12_32x32_neon:                            5843.3 ( 1.72x)
put_luma_hv_12_64x64_c:                              40447.5 ( 1.00x)
put_luma_hv_12_64x64_neon:                           24377.2 ( 1.66x)
put_luma_hv_12_128x128_c:                           130610.4 ( 1.00x)
put_luma_hv_12_128x128_neon:                         99765.8 ( 1.31x)
2026-03-04 12:53:16 +00:00
Jun Zhao
7e7d69632d lavc/hevc: optimize qpel H-pass for width>=16 with byte-domain widening multiply
Rewrite ff_hevc_put_hevc_qpel_h16_8_neon and h32 to use byte-domain
widening multiply (umull/umlal/umlsl via calc_qpelb/calc_qpelb2 macros)
instead of the previous int16-domain approach (uxtl + mul/mla).

The byte-domain approach eliminates the uxtl expansion step and halves
the ext stride (1 byte vs 2 bytes per tap), reducing per-row instruction
count from ~32 to ~23. The functions are also inlined, removing bl/ret
call overhead.

This benefits all HV-path callers (hv/uni_hv/bi_hv/uni_w_hv/bi_w_hv)
at widths 16/32/48/64.

checkasm benchmarks on Apple M4 (5-run average):

  H-pass standalone (NEON):
    h16:  34.0 -> 24.4 cycles (1.39x speedup)
    h32: 132.0 -> 95.0 cycles (1.39x speedup)
    h64: 521.8 -> 373.9 cycles (1.40x speedup)

  HV compound paths geometric mean speedup (NEON, width >= 16):
    qpel_hv:      1.144x (4 functions)
    qpel_bi_hv:   1.158x (4 functions)
    qpel_uni_hv:  1.188x (4 functions)
    qpel_uni_w_hv: 1.158x (3 functions)
    Overall:       1.162x (15 functions)

VVC qpel h16/h32 are separated into self-contained functions retaining
the int16-domain approach, as VVC filters have arbitrary coefficients
incompatible with the hardcoded sign pattern in calc_qpelb.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-03 12:04:14 +00:00
Jun Zhao
23b7005d98 lavc/vvc: remove duplicate 'mov mx, x30' in VVC qpel h16/h32
The VVC qpel h16 and h32 functions had a redundant 'mov mx, x30'
instruction. The first one was placed before vvc_load_filter had
finished using mx (the filter pointer argument), making it a dead
store immediately overwritten by the second 'mov mx, x30'.

Remove the first instance and reorder so that 'sub src, src, #3'
comes before 'mov mx, x30', ensuring the filter pointer in mx is
fully consumed by vvc_load_filter before being overwritten with the
link register.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-03-03 12:04:14 +00:00
Andreas Rheinhardt
dc65dcec22 avcodec/vvc/inter: Combine offsets early
For bi-predicted weighted averages, only the sum
of the two offsets is ever used, so add the two early.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-25 12:08:33 +01:00
Jun Zhao
27dd2f1c70 lavc/hevc: fix missing # in ldrsw immediate offset
The ldrsw instruction requires immediate offset with # prefix.
This fixes the syntax error introduced in commit 26752368f0
(aarch64/h26x: Add put_hevc_pel_bi_w_pixels) where the
load_bi_w_pixels_param macro was added.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-02-05 09:13:22 +08:00
Zhao Zhili
e250854ecf aarch64/h264pred: disable inefficient functions
These assembly optimizations have been identified as "performance
regressions." Due to advancements in modern CPU micro-architectures
and compiler optimization the C implementations now consistently
outperform these handwritten routines.

Test Name          	 A55-clang       M1             A76-gcc-14      A510-clang      A715-clang      X3-clang
--------------------------------------------------------------------------------------------------------------------
pred8x8_dc_8_neon        55.9 ( 0.79x)!  0.2 ( 0.31x)!  35.7 ( 0.63x)!  98.3 ( 0.37x)!  35.9 ( 0.45x)!  33.6 ( 0.38x)!
pred8x8_dc_10_neon       57.0 ( 1.04x)   0.3 ( 0.36x)!  35.9 ( 0.94x)!  98.2 ( 0.53x)!  35.8 ( 0.58x)!  33.2 ( 0.50x)!
pred8x8_dc_128_8_neon    26.0 ( 0.69x)!  0.1 ( 0.43x)!  15.3 ( 0.73x)!  46.4 ( 0.36x)!  10.6 ( 0.48x)!  10.3 ( 1.09x)
pred8x8_dc_128_10_neon   25.3 ( 0.99x)!  0.1 ( 0.42x)!  19.3 ( 0.48x)!  44.5 ( 0.42x)!  10.0 ( 0.61x)!  11.0 ( 1.00x)
pred8x8_left_dc_8_neon   46.9 ( 0.72x)!  0.2 ( 0.26x)!  30.2 ( 0.49x)!  71.4 ( 0.39x)!  29.8 ( 0.35x)!  26.5 ( 0.44x)!
pred8x8_left_dc_10_neon  45.4 ( 0.82x)!  0.2 ( 0.29x)!  28.1 ( 0.67x)!  70.2 ( 0.47x)!  30.0 ( 0.38x)!  26.5 ( 0.43x)!
pred16x16_dc_8_neon      74.4 ( 1.34x)   0.3 ( 0.62x)!  44.7 ( 0.89x)!  128.0 ( 0.79x)! 48.5 ( 0.67x)!  39.4 ( 0.71x)!
pred16x16_dc_128_8_neon  37.9 ( 0.79x)!  0.1 ( 0.60x)!  20.1 ( 0.80x)!  41.8 ( 0.46x)!  16.2 ( 0.81x)!  12.8 ( 0.95x)!
pred16x16_left_dc_8_neon 69.9 ( 1.19x)   0.3 ( 0.46x)!  49.6 ( 0.54x)!  116.8 ( 0.62x)! 52.8 ( 0.45x)!  44.2 ( 0.51x)!
pred8x8_hori_8_neon      30.6 ( 1.39x)   0.1 ( 0.45x)!  19.4 ( 0.81x)!  71.0 ( 0.50x)!  15.9 ( 0.55x)!  12.2 ( 0.94x)!
pred8x8_hori_10_neon*    29.3 ( 1.82x)   0.1 ( 0.59x)!  18.5 ( 1.56x)   68.9 ( 0.64x)!  15.8 ( 0.62x)!  11.8 ( 0.97x)!
pred8x8_top_dc_8_neon    35.8 ( 0.96x)!  0.1 ( 0.59x)!  16.8 ( 0.81x)!  58.9 ( 0.44x)!  11.3 ( 0.89x)!  11.4 ( 0.99x)!
pred8x8_top_dc_10_neon   37.4 ( 1.24x)   0.1 ( 0.92x)!  20.4 ( 0.81x)!  59.5 ( 0.69x)!  10.5 ( 1.48x)   11.8 ( 1.02x)
pred8x8_vertical_8_neon  18.3 ( 1.08x)   0.1 ( 0.54x)!  12.8 ( 0.89x)!  37.2 ( 0.40x)!   8.3 ( 0.77x)!  11.2 ( 1.00x)
pred8x8_vertical_10_neon 19.0 ( 1.24x)   0.1 ( 0.55x)!  15.3 ( 0.62x)!  39.7 ( 0.50x)!   8.2 ( 0.91x)!  11.1 ( 0.99x)!

- pred8x8_horizontal_10 also underperforms on new architectures, but useful on A55 and A76.

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2026-02-04 09:06:37 +00:00
Zhao Zhili
f54841d375 avcodec/aarch64: add pngdsp
Test Name                    A55-gcc-11        M1-clang           A76-gcc-12          A510-clang        X3-clang
-------------------------------------------------------------------------------------------------------------------
add_bytes_l2_4096_neon        1807.2 ( 2.01x)    1.6 ( 1.94x)    333.0 ( 6.35x)   1058.2 ( 2.34x)    214.3 ( 1.99x)
add_paeth_prediction_3_neon  33036.1 ( 2.41x)  145.1 ( 1.66x)  20443.3 ( 1.97x)  35225.1 ( 1.23x)  19420.8 ( 1.05x)
add_paeth_prediction_4_neon  24368.6 ( 3.26x)  106.7 ( 2.01x)  15163.8 ( 2.77x)  26454.7 ( 1.62x)  14319.0 ( 1.35x)
add_paeth_prediction_6_neon  17900.6 ( 4.44x)   72.0 ( 2.70x)  10214.3 ( 4.20x)  18296.9 ( 2.27x)   9693.1 ( 1.97x)
add_paeth_prediction_8_neon  12615.4 ( 6.31x)   54.1 ( 2.58x)   7706.0 ( 5.45x)  13733.3 ( 2.94x)   7272.6 ( 2.63x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2026-02-04 12:05:35 +08:00
Martin Storsjö
f74c551eaa aarch64: Fix indentation of a few instructions
This file is excempt from the indent checker script, as there
are a few other bits in it that the script wants to reformat
into slightly worse form, or which might not warrant being
reformatted.

But these instructions should indeed be indented this way.
2026-01-30 05:21:27 +00:00
Andreas Rheinhardt
bf4d5037b4 avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names
These names are a remnant of dsputil when all the DSP functions
from all codecs were part of DSPcontext.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:25 +01:00
Jun Zhao
8966101fa6 lavc/hevc: add aarch64 neon for 12-bit dequant
Implement NEON optimization for HEVC dequant at 12-bit depth.

For 12-bit: shift = 15 - 12 - log2_size = 3 - log2_size. When shift
is negative, we use shl (shift left) instead of srshr.

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_12_c:                                   9.9 ( 1.00x)
hevc_dequant_4x4_12_neon:                                5.7 ( 1.74x)

hevc_dequant_8x8_12_c:                                   1.7 ( 1.00x)
hevc_dequant_8x8_12_neon:                                1.3 ( 1.30x)

hevc_dequant_16x16_12_c:                               131.1 ( 1.00x)
hevc_dequant_16x16_12_neon:                              7.9 (16.52x)

hevc_dequant_32x32_12_c:                                69.7 ( 1.00x)
hevc_dequant_32x32_12_neon:                             28.4 ( 2.46x)

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Jun Zhao
ce89d974c8 lavc/hevc: add aarch64 neon for 10-bit dequant
Implement NEON optimization for HEVC dequant at 10-bit depth.

For 10-bit: shift = 15 - 10 - log2_size = 5 - log2_size

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_10_c:                                  16.6 ( 1.00x)
hevc_dequant_4x4_10_neon:                                7.4 ( 2.23x)

hevc_dequant_8x8_10_c:                                  39.7 ( 1.00x)
hevc_dequant_8x8_10_neon:                                7.5 ( 5.28x)

hevc_dequant_16x16_10_c:                               168.7 ( 1.00x)
hevc_dequant_16x16_10_neon:                             10.2 (16.56x)

hevc_dequant_32x32_10_c:                                 1.9 ( 1.00x)
hevc_dequant_32x32_10_neon:                              1.9 ( 1.01x)

Note: 32x32 shift=0 is identity transform (no-op), so NEON has no
advantage over C which is also optimized away by the compiler.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Jun Zhao
0886e50c6b lavc/hevc: add aarch64 neon for 8-bit dequant
Implement NEON optimization for HEVC dequant at 8-bit depth.

The NEON implementation uses srshr (Signed Rounding Shift Right) which
does both the add with offset and right shift in a single instruction.

Optimization details:
- 4x4 (16 coeffs): Single load-process-store sequence
- 8x8 (64 coeffs): Fully unrolled, no loop overhead
- 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency
- 32x32 (1024 coeffs): Pipelined with all available NEON registers

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_8_c:                                   11.3 ( 1.00x)
hevc_dequant_4x4_8_neon:                                 6.3 ( 1.78x)

hevc_dequant_8x8_8_c:                                   33.9 ( 1.00x)
hevc_dequant_8x8_8_neon:                                 6.6 ( 5.11x)

hevc_dequant_16x16_8_c:                                153.8 ( 1.00x)
hevc_dequant_16x16_8_neon:                               9.0 (17.02x)

hevc_dequant_32x32_8_c:                                 78.1 ( 1.00x)
hevc_dequant_32x32_8_neon:                              31.9 ( 2.45x)

Note on Performance Anomaly:
The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8)
is due to Clang auto-vectorizing only for sizes >= 32x32.
Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2)

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Georgii Zagoruiko
8acdffa22c aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit
RPi4 (auto-vectorisation is on)
put_luma_v_10_4x4_c:                                   303.3 ( 1.00x)
put_luma_v_10_4x4_neon:                                 55.7 ( 5.45x)
put_luma_v_10_8x8_c:                                  1106.7 ( 1.00x)
put_luma_v_10_8x8_neon:                                163.8 ( 6.76x)
put_luma_v_10_16x16_c:                                2242.1 ( 1.00x)
put_luma_v_10_16x16_neon:                              672.7 ( 3.33x)
put_luma_v_10_32x32_c:                                7057.3 ( 1.00x)
put_luma_v_10_32x32_neon:                             2731.3 ( 2.58x)
put_luma_v_10_64x64_c:                               25699.8 ( 1.00x)
put_luma_v_10_64x64_neon:                            12145.6 ( 2.12x)
put_luma_v_10_128x128_c:                             90694.6 ( 1.00x)
put_luma_v_10_128x128_neon:                          44862.4 ( 2.02x)
put_luma_v_12_4x4_c:                                   304.4 ( 1.00x)
put_luma_v_12_4x4_neon:                                 55.6 ( 5.47x)
put_luma_v_12_8x8_c:                                  1107.4 ( 1.00x)
put_luma_v_12_8x8_neon:                                164.7 ( 6.72x)
put_luma_v_12_16x16_c:                                2235.8 ( 1.00x)
put_luma_v_12_16x16_neon:                              672.5 ( 3.32x)
put_luma_v_12_32x32_c:                                7049.2 ( 1.00x)
put_luma_v_12_32x32_neon:                             2731.6 ( 2.58x)
put_luma_v_12_64x64_c:                               25706.5 ( 1.00x)
put_luma_v_12_64x64_neon:                            12145.0 ( 2.12x)
put_luma_v_12_128x128_c:                             90672.5 ( 1.00x)
put_luma_v_12_128x128_neon:                          44857.1 ( 2.02x)

Apple M4 (auto-vectorisation is on):
put_luma_v_10_4x4_c:                                    25.6 ( 1.00x)
put_luma_v_10_4x4_neon:                                  3.1 ( 8.18x)
put_luma_v_10_8x8_c:                                    34.7 ( 1.00x)
put_luma_v_10_8x8_neon:                                 10.5 ( 3.32x)
put_luma_v_10_16x16_c:                                 103.9 ( 1.00x)
put_luma_v_10_16x16_neon:                               42.3 ( 2.45x)
put_luma_v_10_32x32_c:                                 399.7 ( 1.00x)
put_luma_v_10_32x32_neon:                              161.8 ( 2.47x)
put_luma_v_10_64x64_c:                                1276.7 ( 1.00x)
put_luma_v_10_64x64_neon:                              840.1 ( 1.52x)
put_luma_v_10_128x128_c:                              4981.3 ( 1.00x)
put_luma_v_10_128x128_neon:                           3008.0 ( 1.66x)
put_luma_v_12_4x4_c:                                    23.6 ( 1.00x)
put_luma_v_12_4x4_neon:                                  2.0 (11.84x)
put_luma_v_12_8x8_c:                                    31.8 ( 1.00x)
put_luma_v_12_8x8_neon:                                 12.4 ( 2.55x)
put_luma_v_12_16x16_c:                                 100.8 ( 1.00x)
put_luma_v_12_16x16_neon:                               44.9 ( 2.25x)
put_luma_v_12_32x32_c:                                 331.1 ( 1.00x)
put_luma_v_12_32x32_neon:                              175.2 ( 1.89x)
put_luma_v_12_64x64_c:                                1227.1 ( 1.00x)
put_luma_v_12_64x64_neon:                              712.7 ( 1.72x)
put_luma_v_12_128x128_c:                              5149.1 ( 1.00x)
put_luma_v_12_128x128_neon:                           2809.3 ( 1.83x)
2026-01-08 17:35:55 +00:00
Zhao Zhili
840183d823 aarch64/hpeldsp_neon: fix out-of-bounds read
Fix #21141

The performance improved a little bit.
On A76:
                              Before            After
put_pixels_tab[0][1]_neon:    32.4 ( 3.91x)     31.6 ( 3.99x)
put_pixels_tab[0][3]_neon:    88.0 ( 4.50x)     74.6 ( 5.31x)
put_pixels_tab[1][1]_neon:    33.5 ( 2.52x)     31.2 ( 2.71x)
put_pixels_tab[1][3]_neon:    30.5 ( 3.61x)     21.7 ( 5.08x)

On A55:
                             Before            After
put_pixels_tab[0][1]_neon:   175.2 ( 2.41x)    138.7 ( 3.04x)
put_pixels_tab[0][3]_neon:   334.3 ( 2.71x)    296.1 ( 3.07x)
put_pixels_tab[1][1]_neon:   168.3 ( 1.78x)     94.1 ( 3.19x)
put_pixels_tab[1][3]_neon:   112.3 ( 2.20x)     90.0 ( 2.74x)
2026-01-04 03:22:55 +00:00
Georgii Zagoruiko
f790de2a87 aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit
RPi4 (auto-vectorisation is turned on)
put_luma_h_10_4x4_c:                                   282.8 ( 1.00x)
put_luma_h_10_8x8_c:                                  1069.5 ( 1.00x)
put_luma_h_10_8x8_neon:                                207.5 ( 5.15x)
put_luma_h_10_16x16_c:                                1999.6 ( 1.00x)
put_luma_h_10_16x16_neon:                              777.5 ( 2.57x)
put_luma_h_10_32x32_c:                                6612.9 ( 1.00x)
put_luma_h_10_32x32_neon:                             3201.6 ( 2.07x)
put_luma_h_10_64x64_c:                               25059.0 ( 1.00x)
put_luma_h_10_64x64_neon:                            13623.5 ( 1.84x)
put_luma_h_10_128x128_c:                             91310.1 ( 1.00x)
put_luma_h_10_128x128_neon:                          50358.3 ( 1.81x)
put_luma_h_12_4x4_c:                                   282.1 ( 1.00x)
put_luma_h_12_8x8_c:                                  1068.4 ( 1.00x)
put_luma_h_12_8x8_neon:                                207.7 ( 5.14x)
put_luma_h_12_16x16_c:                                1998.0 ( 1.00x)
put_luma_h_12_16x16_neon:                              777.5 ( 2.57x)
put_luma_h_12_32x32_c:                                6612.0 ( 1.00x)
put_luma_h_12_32x32_neon:                             3201.6 ( 2.07x)
put_luma_h_12_64x64_c:                               25036.8 ( 1.00x)
put_luma_h_12_64x64_neon:                            13595.1 ( 1.84x)
put_luma_h_12_128x128_c:                             91305.8 ( 1.00x)
put_luma_h_12_128x128_neon:                          50359.7 ( 1.81x)

Apple M2 Air (auto-vectorisation is turned on)
put_luma_h_10_4x4_c:                                     0.3 ( 1.00x)
put_luma_h_10_8x8_c:                                     1.0 ( 1.00x)
put_luma_h_10_8x8_neon:                                  0.4 ( 2.59x)
put_luma_h_10_16x16_c:                                   2.9 ( 1.00x)
put_luma_h_10_16x16_neon:                                1.4 ( 2.01x)
put_luma_h_10_32x32_c:                                   9.4 ( 1.00x)
put_luma_h_10_32x32_neon:                                5.8 ( 1.62x)
put_luma_h_10_64x64_c:                                  35.6 ( 1.00x)
put_luma_h_10_64x64_neon:                               23.6 ( 1.51x)
put_luma_h_10_128x128_c:                               131.1 ( 1.00x)
put_luma_h_10_128x128_neon:                             92.6 ( 1.42x)
put_luma_h_12_4x4_c:                                     0.3 ( 1.00x)
put_luma_h_12_8x8_c:                                     1.0 ( 1.00x)
put_luma_h_12_8x8_neon:                                  0.4 ( 2.58x)
put_luma_h_12_16x16_c:                                   2.9 ( 1.00x)
put_luma_h_12_16x16_neon:                                1.4 ( 2.00x)
put_luma_h_12_32x32_c:                                   9.4 ( 1.00x)
put_luma_h_12_32x32_neon:                                5.8 ( 1.61x)
put_luma_h_12_64x64_c:                                  35.3 ( 1.00x)
put_luma_h_12_64x64_neon:                               23.3 ( 1.52x)
put_luma_h_12_128x128_c:                               131.2 ( 1.00x)
put_luma_h_12_128x128_neon:                             92.4 ( 1.42x)
2025-11-24 21:22:55 +00:00
Kacper Michajłow
9ad20839fb
avcodec/pixblockdsp: be consistent about restrict use in ff_{get,diff}_pixels
Suppresses warnings about function pointer mismatch.

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-10-25 01:01:15 +02:00
Bin Peng
3115c0c0e6 lavc/aarch64: Fix addp overflow in ff_pred16x16_plane_neon_10
The mismatch between neon and C functions can be reproduced
using the following bitstream and command line on aarch64 devices.

wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264
 ./ffmpeg -cpuflags 0  -threads 1 -i replay_intra_pred_16x16.h264  -f framemd5 -y md5_ref
 ./ffmpeg              -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon

Signed-off-by: Bin Peng <pengbin@visionular.com>
2025-10-24 15:32:35 +00:00
Krzysztof Pyrkosz
03c054d43c avcodec/aarch64/vvc: Implement dmvr_v_8
A72
dmvr_v_8_12x20_neon:                                   207.0 ( 4.15x)
dmvr_v_8_20x12_neon:                                   170.4 ( 4.37x)
dmvr_v_8_20x20_neon:                                   273.4 ( 4.58x)

A53
dmvr_v_8_12x20_neon:                                   450.6 ( 4.21x)
dmvr_v_8_20x12_neon:                                   342.8 ( 3.70x)
dmvr_v_8_20x20_neon:                                   550.9 ( 3.79x)
2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz
56a638d836 avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon
Before and after:
A53:
apply_bdof_8_16x8_neon:                               2733.1 ( 4.88x)
apply_bdof_8_16x16_neon:                              5458.6 ( 4.86x)
apply_bdof_10_16x8_neon:                              2789.8 ( 4.64x)
apply_bdof_10_16x16_neon:                             5523.8 ( 4.68x)
apply_bdof_12_16x8_neon:                              2792.8 ( 4.58x)
apply_bdof_12_16x16_neon:                             5519.5 ( 4.63x)

apply_bdof_8_16x8_neon:                               2571.8 ( 5.12x)
apply_bdof_8_16x16_neon:                              5173.3 ( 5.12x)
apply_bdof_10_16x8_neon:                              2635.1 ( 4.87x)
apply_bdof_10_16x16_neon:                             5243.0 ( 4.89x)
apply_bdof_12_16x8_neon:                              2613.0 ( 4.89x)
apply_bdof_12_16x16_neon:                             5231.7 ( 4.90x)

A78:
apply_bdof_8_16x8_neon:                                565.3 ( 8.43x)
apply_bdof_8_16x16_neon:                              1109.5 ( 8.60x)
apply_bdof_10_16x8_neon:                               568.2 ( 7.92x)
apply_bdof_10_16x16_neon:                             1114.1 ( 8.08x)
apply_bdof_12_16x8_neon:                               570.2 ( 7.87x)
apply_bdof_12_16x16_neon:                             1116.3 ( 8.03x)

apply_bdof_8_16x8_neon:                                541.4 ( 8.81x)
apply_bdof_8_16x16_neon:                              1065.9 ( 8.97x)
apply_bdof_10_16x8_neon:                               543.2 ( 8.32x)
apply_bdof_10_16x16_neon:                             1071.5 ( 8.39x)
apply_bdof_12_16x8_neon:                               544.2 ( 8.25x)
apply_bdof_12_16x16_neon:                             1074.1 ( 8.37x)
2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz
f1a155d975 avcodec/aarch64/vvc: Optimize dmvr_hv_10
Before and after on A53:
dmvr_hv_10_12x20_neon:                                1838.2 ( 3.02x)
dmvr_hv_10_20x12_neon:                                1330.2 ( 1.83x)
dmvr_hv_10_20x20_neon:                                2148.2 ( 1.85x)
dmvr_hv_12_12x20_neon:                                1839.2 ( 3.02x)
dmvr_hv_12_20x12_neon:                                1330.6 ( 1.83x)
dmvr_hv_12_20x20_neon:                                2147.2 ( 1.85x)

dmvr_hv_10_12x20_neon:                                1755.0 ( 3.17x)
dmvr_hv_10_20x12_neon:                                1165.8 ( 2.09x)
dmvr_hv_10_20x20_neon:                                1876.1 ( 2.12x)
dmvr_hv_12_12x20_neon:                                1754.4 ( 3.17x)
dmvr_hv_12_20x12_neon:                                1167.8 ( 2.09x)
dmvr_hv_12_20x20_neon:                                1878.8 ( 2.12x)
2025-09-21 19:39:27 +00:00
Georgii Zagoruiko
4fbacb3944 avcodec/aarch64/vvc: Optimised version of classify function.
Macbook Air (M2):
    vvc_alf_classify_8x8_8_c:                                2.6 ( 1.00x)
    vvc_alf_classify_8x8_8_neon:                             1.0 ( 2.47x)
    vvc_alf_classify_8x8_10_c:                               2.7 ( 1.00x)
    vvc_alf_classify_8x8_10_neon:                            0.9 ( 2.98x)
    vvc_alf_classify_8x8_12_c:                               2.7 ( 1.00x)
    vvc_alf_classify_8x8_12_neon:                            0.9 ( 2.97x)
    vvc_alf_classify_16x16_8_c:                              7.3 ( 1.00x)
    vvc_alf_classify_16x16_8_neon:                           3.4 ( 2.12x)
    vvc_alf_classify_16x16_10_c:                             4.3 ( 1.00x)
    vvc_alf_classify_16x16_10_neon:                          2.9 ( 1.47x)
    vvc_alf_classify_16x16_12_c:                             4.3 ( 1.00x)
    vvc_alf_classify_16x16_12_neon:                          3.0 ( 1.44x)
    vvc_alf_classify_32x32_8_c:                             13.7 ( 1.00x)
    vvc_alf_classify_32x32_8_neon:                          10.7 ( 1.29x)
    vvc_alf_classify_32x32_10_c:                            12.3 ( 1.00x)
    vvc_alf_classify_32x32_10_neon:                          8.7 ( 1.42x)
    vvc_alf_classify_32x32_12_c:                            12.2 ( 1.00x)
    vvc_alf_classify_32x32_12_neon:                          8.7 ( 1.40x)
    vvc_alf_classify_64x64_8_c:                             45.8 ( 1.00x)
    vvc_alf_classify_64x64_8_neon:                          37.1 ( 1.23x)
    vvc_alf_classify_64x64_10_c:                            41.3 ( 1.00x)
    vvc_alf_classify_64x64_10_neon:                         32.8 ( 1.26x)
    vvc_alf_classify_64x64_12_c:                            41.4 ( 1.00x)
    vvc_alf_classify_64x64_12_neon:                         32.4 ( 1.28x)
    vvc_alf_classify_128x128_8_c:                          163.7 ( 1.00x)
    vvc_alf_classify_128x128_8_neon:                       138.3 ( 1.18x)
    vvc_alf_classify_128x128_10_c:                         149.1 ( 1.00x)
    vvc_alf_classify_128x128_10_neon:                      120.3 ( 1.24x)
    vvc_alf_classify_128x128_12_c:                         148.7 ( 1.00x)
    vvc_alf_classify_128x128_12_neon:                      119.4 ( 1.25x)

    RPi4 (Cortex-A72):
    vvc_alf_classify_8x8_8_c:                             1251.6 ( 1.00x)
    vvc_alf_classify_8x8_8_neon:                           700.7 ( 1.79x)
    vvc_alf_classify_8x8_10_c:                            1141.9 ( 1.00x)
    vvc_alf_classify_8x8_10_neon:                          659.7 ( 1.73x)
    vvc_alf_classify_8x8_12_c:                            1075.8 ( 1.00x)
    vvc_alf_classify_8x8_12_neon:                          658.7 ( 1.63x)
    vvc_alf_classify_16x16_8_c:                           3574.1 ( 1.00x)
    vvc_alf_classify_16x16_8_neon:                        1849.8 ( 1.93x)
    vvc_alf_classify_16x16_10_c:                          3270.0 ( 1.00x)
    vvc_alf_classify_16x16_10_neon:                       1786.1 ( 1.83x)
    vvc_alf_classify_16x16_12_c:                          3271.7 ( 1.00x)
    vvc_alf_classify_16x16_12_neon:                       1785.5 ( 1.83x)
    vvc_alf_classify_32x32_8_c:                          12451.9 ( 1.00x)
    vvc_alf_classify_32x32_8_neon:                        5984.3 ( 2.08x)
    vvc_alf_classify_32x32_10_c:                         11428.9 ( 1.00x)
    vvc_alf_classify_32x32_10_neon:                       5756.3 ( 1.99x)
    vvc_alf_classify_32x32_12_c:                         11252.8 ( 1.00x)
    vvc_alf_classify_32x32_12_neon:                       5755.7 ( 1.96x)
    vvc_alf_classify_64x64_8_c:                          47625.5 ( 1.00x)
    vvc_alf_classify_64x64_8_neon:                       21071.9 ( 2.26x)
    vvc_alf_classify_64x64_10_c:                         44576.3 ( 1.00x)
    vvc_alf_classify_64x64_10_neon:                      21544.7 ( 2.07x)
    vvc_alf_classify_64x64_12_c:                         44600.5 ( 1.00x)
    vvc_alf_classify_64x64_12_neon:                      21491.2 ( 2.08x)
    vvc_alf_classify_128x128_8_c:                       192143.3 ( 1.00x)
    vvc_alf_classify_128x128_8_neon:                     82387.6 ( 2.33x)
    vvc_alf_classify_128x128_10_c:                      177583.1 ( 1.00x)
    vvc_alf_classify_128x128_10_neon:                    81628.8 ( 2.18x)
    vvc_alf_classify_128x128_12_c:                      177582.2 ( 1.00x)
    vvc_alf_classify_128x128_12_neon:                    81625.1 ( 2.18x)
2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz
de25cb4603 avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x
Before and after:
A53:
apply_bdof_8_8x16_neon:                               3320.5 ( 4.02x)
apply_bdof_10_8x16_neon:                              3317.8 ( 3.90x)
apply_bdof_12_8x16_neon:                              3303.6 ( 3.91x)

apply_bdof_8_8x16_neon:                               3168.1 ( 4.23x)
apply_bdof_10_8x16_neon:                              3127.8 ( 4.13x)
apply_bdof_12_8x16_neon:                              3119.3 ( 4.18x)

A72:
apply_bdof_8_8x16_neon:                               1827.4 ( 5.02x)
apply_bdof_10_8x16_neon:                              1838.5 ( 4.89x)
apply_bdof_12_8x16_neon:                              1841.1 ( 4.83x)

apply_bdof_8_8x16_neon:                               1691.6 ( 5.46x)
apply_bdof_10_8x16_neon:                              1695.9 ( 5.23x)
apply_bdof_12_8x16_neon:                              1695.4 ( 5.29x)

A78
apply_bdof_8_8x16_neon:                                648.9 ( 7.43x)
apply_bdof_10_8x16_neon:                               646.1 ( 7.04x)
apply_bdof_12_8x16_neon:                               643.8 ( 7.04x)

apply_bdof_8_8x16_neon:                                603.2 ( 7.97x)
apply_bdof_10_8x16_neon:                               604.1 ( 7.52x)
apply_bdof_12_8x16_neon:                               604.5 ( 7.52x)
2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz
7b21bde34c avcodec/aarch64/vvc: Implemented dmvr_h_10
A78:
dmvr_h_10_12x20_neon:                                   82.2 ( 6.49x)
dmvr_h_10_20x12_neon:                                   69.9 ( 3.66x)
dmvr_h_10_20x20_neon:                                  112.5 ( 3.74x)
dmvr_h_12_12x20_neon:                                   81.4 ( 6.51x)
dmvr_h_12_20x12_neon:                                   69.2 ( 3.74x)
dmvr_h_12_20x20_neon:                                  110.2 ( 3.85x)

A72:
dmvr_h_10_12x20_neon:                                  234.1 ( 4.67x)
dmvr_h_10_20x12_neon:                                  221.4 ( 3.48x)
dmvr_h_10_20x20_neon:                                  356.9 ( 3.59x)
dmvr_h_12_12x20_neon:                                  234.1 ( 4.67x)
dmvr_h_12_20x12_neon:                                  221.5 ( 3.53x)
dmvr_h_12_20x20_neon:                                  357.0 ( 3.64x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
189e841cfd avcodec/aarch64/vvc: Implement dmvr_h_8
A78:
dmvr_h_8_12x20_neon:                                    76.6 ( 4.31x)
dmvr_h_8_20x12_neon:                                    65.8 ( 3.49x)
dmvr_h_8_20x20_neon:                                   106.6 ( 3.62x)

A72:
dmvr_h_8_12x20_neon:                                   190.6 ( 4.40x)
dmvr_h_8_20x12_neon:                                   171.1 ( 4.31x)
dmvr_h_8_20x20_neon:                                   275.1 ( 4.50x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
fb4407797e Replace uxtl with umull in dmvr_hv_8
Before and after on A78:
dmvr_hv_8_12x20_neon:                                  205.3 ( 5.21x)
dmvr_hv_8_20x12_neon:                                  171.8 ( 3.15x)
dmvr_hv_8_20x20_neon:                                  282.7 ( 3.11x)

dmvr_hv_8_12x20_neon:                                  172.7 ( 5.58x)
dmvr_hv_8_20x12_neon:                                  133.3 ( 3.36x)
dmvr_hv_8_20x20_neon:                                  214.6 ( 3.40x)
2025-09-05 07:20:15 +00:00
Zhao Zhili
6ce02bcc3a avcodec/aarch64/vvc: Optimize apply_bdof
Before this patch, prof_grad_filter calculate
gh[0], gh[1], gv[0], gv[1] and save them to stack.

derive_bdof_vx_vy load them from stack and calculate
gh[0] + gh[1], gv[0] + gv[1].

apply_bdof_min_block load them from stack and calculate
gh[0] - gh[1], gv[0] - gv[1]

This patch add bdof_grad_filter, which calculate gh[0] + gh[1],
gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1], and save them to
stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the
results directly.

prof_grad_filter is kept for reuse by other functions in the future.

Benchmark on rpi5 with gcc 12
                               Before               After
--------------------------------------------------------------------
apply_bdof_8_8x16_c:       |   7431.4 ( 1.00x)   |   7371.7 ( 1.00x)
apply_bdof_8_8x16_neon:    |   1175.4 ( 6.32x)   |   1036.3 ( 7.11x)
apply_bdof_8_16x8_c:       |   7182.2 ( 1.00x)   |   7201.1 ( 1.00x)
apply_bdof_8_16x8_neon:    |   1021.7 ( 7.03x)   |    879.9 ( 8.18x)
apply_bdof_8_16x16_c:      |  14577.1 ( 1.00x)   |  14589.3 ( 1.00x)
apply_bdof_8_16x16_neon:   |   2012.8 ( 7.24x)   |   1743.3 ( 8.37x)
apply_bdof_10_8x16_c:      |   7292.4 ( 1.00x)   |   7308.5 ( 1.00x)
apply_bdof_10_8x16_neon:   |   1156.3 ( 6.31x)   |   1045.3 ( 6.99x)
apply_bdof_10_16x8_c:      |   7112.4 ( 1.00x)   |   7214.4 ( 1.00x)
apply_bdof_10_16x8_neon:   |   1007.6 ( 7.06x)   |    904.8 ( 7.97x)
apply_bdof_10_16x16_c:     |  14363.3 ( 1.00x)   |  14476.4 ( 1.00x)
apply_bdof_10_16x16_neon:  |   1986.9 ( 7.23x)   |   1783.1 ( 8.12x)
apply_bdof_12_8x16_c:      |   7433.3 ( 1.00x)   |   7374.7 ( 1.00x)
apply_bdof_12_8x16_neon:   |   1155.9 ( 6.43x)   |   1040.8 ( 7.09x)
apply_bdof_12_16x8_c:      |   7171.1 ( 1.00x)   |   7376.3 ( 1.00x)
apply_bdof_12_16x8_neon:   |   1010.8 ( 7.09x)   |    899.4 ( 8.20x)
apply_bdof_12_16x16_c:     |  14515.5 ( 1.00x)   |  14731.5 ( 1.00x)
apply_bdof_12_16x16_neon:  |   1988.4 ( 7.30x)   |   1785.2 ( 8.25x)
2025-09-03 06:55:37 +00:00
Zhao Zhili
2e92417603 avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy
Implement line tricks and pixel tricks. See comments in inter.S
for details.

Benchmark on rpi5 with gcc 12
                               Before             After
-----------------------------------------------------------------
apply_bdof_8_8x16_c:       |   7375.5 ( 1.00x) |  7473.8 ( 1.00x)
apply_bdof_8_8x16_neon:    |   1875.1 ( 3.93x) |  1135.8 ( 6.58x)
apply_bdof_8_16x8_c:       |   7273.9 ( 1.00x) |  7204.0 ( 1.00x)
apply_bdof_8_16x8_neon:    |   1738.2 ( 4.18x) |  1013.0 ( 7.11x)
apply_bdof_8_16x16_c:      |  14744.9 ( 1.00x) | 14712.6 ( 1.00x)
apply_bdof_8_16x16_neon:   |   3446.7 ( 4.28x) |  1997.7 ( 7.36x)
apply_bdof_10_8x16_c:      |   7352.4 ( 1.00x) |  7485.7 ( 1.00x)
apply_bdof_10_8x16_neon:   |   1861.0 ( 3.95x) |  1134.1 ( 6.60x)
apply_bdof_10_16x8_c:      |   7330.5 ( 1.00x) |  7232.8 ( 1.00x)
apply_bdof_10_16x8_neon:   |   1747.2 ( 4.20x) |  1002.6 ( 7.21x)
apply_bdof_10_16x16_c:     |  14522.4 ( 1.00x) | 14664.8 ( 1.00x)
apply_bdof_10_16x16_neon:  |   3490.5 ( 4.16x) |  1978.4 ( 7.41x)
apply_bdof_12_8x16_c:      |   7389.0 ( 1.00x) |  7380.1 ( 1.00x)
apply_bdof_12_8x16_neon:   |   1861.3 ( 3.97x) |  1134.0 ( 6.51x)
apply_bdof_12_16x8_c:      |   7283.1 ( 1.00x) |  7336.9 ( 1.00x)
apply_bdof_12_16x8_neon:   |   1749.1 ( 4.16x) |  1002.3 ( 7.32x)
apply_bdof_12_16x16_c:     |  14580.7 ( 1.00x) | 14502.7 ( 1.00x)
apply_bdof_12_16x16_neon:  |   3472.9 ( 4.20x) |  1978.3 ( 7.33x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-09-03 06:55:37 +00:00
Timo Rothenpieler
262d41c804 all: fix typos found by codespell 2025-08-03 13:48:47 +02:00
Andreas Rheinhardt
9b409ea1e6 configure: Factor mpegvideoencdsp out of mpegvideoenc
This will allow to relax the dependency on mpegvideoenc
for several codecs.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-06-21 22:08:52 +02:00
Andreas Rheinhardt
20ddada2a3 avcodec/pixblockdsp: Improve 8 vs 16 bit check
Before this commit, the input in get_pixels and get_pixels_unaligned
has been treated inconsistenly:
- The generic code treated 9, 10, 12 and 14 bits as 16bit input
(these bits correspond to what FFmpeg's dsputils supported),
everything with <= 8 bits as 8 bit and everything else as 8 bit
when used via AVDCT (which exposes these functions and purports
to support up to 14 bits).
- AARCH64, ARM, PPC and RISC-V, x86 ignore this AVDCT special case.
- RISC-V also ignored the restriction to 9, 10, 12 and 14 for its
16bit check and treated everything > 8 bits as 16bit.
- The mmi MIPS code treats everything as 8 bit when used via
AVDCT (this is certainly broken); otherwise it checks for <= 8 bits.
The msa MIPS code behaves like the generic code.

This commit changes this to treat 9..16 bits as 16 bit input,
everything else as 8 bit (the former because it makes sense,
the latter to preserve the behaviour for external users*).

*: The only internal user of AVDCT (the spp filter) always
uses 8, 9 or 10 bits.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-05-31 01:25:27 +02:00
Zhao Zhili
26752368f0 aarch64/h26x: Add put_hevc_pel_bi_w_pixels
On rpi5 (A76):

put_hevc_pel_bi_w_pixels4_8_c:                          90.0 ( 1.00x)
put_hevc_pel_bi_w_pixels4_8_neon:                       34.1 ( 2.64x)
put_hevc_pel_bi_w_pixels6_8_c:                         188.3 ( 1.00x)
put_hevc_pel_bi_w_pixels6_8_neon:                       73.5 ( 2.56x)
put_hevc_pel_bi_w_pixels8_8_c:                         327.1 ( 1.00x)
put_hevc_pel_bi_w_pixels8_8_neon:                       75.8 ( 4.32x)
put_hevc_pel_bi_w_pixels12_8_c:                        728.8 ( 1.00x)
put_hevc_pel_bi_w_pixels12_8_neon:                     186.1 ( 3.92x)
put_hevc_pel_bi_w_pixels16_8_c:                       1288.1 ( 1.00x)
put_hevc_pel_bi_w_pixels16_8_neon:                     268.5 ( 4.80x)
put_hevc_pel_bi_w_pixels24_8_c:                       2855.5 ( 1.00x)
put_hevc_pel_bi_w_pixels24_8_neon:                     723.8 ( 3.95x)
put_hevc_pel_bi_w_pixels32_8_c:                       5095.3 ( 1.00x)
put_hevc_pel_bi_w_pixels32_8_neon:                    1165.0 ( 4.37x)
put_hevc_pel_bi_w_pixels48_8_c:                      11521.5 ( 1.00x)
put_hevc_pel_bi_w_pixels48_8_neon:                    2856.0 ( 4.03x)
put_hevc_pel_bi_w_pixels64_8_c:                      21020.5 ( 1.00x)
put_hevc_pel_bi_w_pixels64_8_neon:                    4699.1 ( 4.47x)

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:24:14 +08:00
Zhao Zhili
39786f8cd5 aarch64/h26x: optimize sao_band_filter
int8_t[] is enough for offset_table of 8 bit streams.

On rpi5:
                             Before               After
hevc_sao_band_8_8_c:          252.3 ( 1.00x)     252.3 ( 1.00x)
hevc_sao_band_8_8_neon:        95.8 ( 2.63x)      61.0 ( 4.57x)
hevc_sao_band_16_8_c:         875.2 ( 1.00x)     864.9 ( 1.00x)
hevc_sao_band_16_8_neon:      317.5 ( 2.76x)     150.0 ( 6.26x)
hevc_sao_band_32_8_c:        3853.5 ( 1.00x)    3871.6 ( 1.00x)
hevc_sao_band_32_8_neon:     1222.3 ( 3.15x)     550.6 ( 7.39)
hevc_sao_band_48_8_c:        8203.6 ( 1.00x)    8182.6 ( 1.00x)
hevc_sao_band_48_8_neon:     2685.7 ( 3.05x)    1185.8 ( 7.36x)
hevc_sao_band_64_8_c:       14023.0 ( 1.00x)   14038.9 ( 1.00x)
hevc_sao_band_64_8_neon:     4783.2 ( 2.93x)    2078.4 ( 7.15x)

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:11:45 +08:00