Commit graph

474 commits

Author SHA1 Message Date
Jun Zhao
27dd2f1c70 lavc/hevc: fix missing # in ldrsw immediate offset
The ldrsw instruction requires immediate offset with # prefix.
This fixes the syntax error introduced in commit 26752368f0
(aarch64/h26x: Add put_hevc_pel_bi_w_pixels) where the
load_bi_w_pixels_param macro was added.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-02-05 09:13:22 +08:00
Zhao Zhili
e250854ecf aarch64/h264pred: disable inefficient functions
These assembly optimizations have been identified as performance
regressions. Due to advances in modern CPU micro-architectures
and compiler optimization, the C implementations now consistently
outperform these handwritten routines.

Test Name                A55-clang       M1              A76-gcc-14      A510-clang      A715-clang      X3-clang
--------------------------------------------------------------------------------------------------------------------
pred8x8_dc_8_neon        55.9 ( 0.79x)!  0.2 ( 0.31x)!  35.7 ( 0.63x)!  98.3 ( 0.37x)!  35.9 ( 0.45x)!  33.6 ( 0.38x)!
pred8x8_dc_10_neon       57.0 ( 1.04x)   0.3 ( 0.36x)!  35.9 ( 0.94x)!  98.2 ( 0.53x)!  35.8 ( 0.58x)!  33.2 ( 0.50x)!
pred8x8_dc_128_8_neon    26.0 ( 0.69x)!  0.1 ( 0.43x)!  15.3 ( 0.73x)!  46.4 ( 0.36x)!  10.6 ( 0.48x)!  10.3 ( 1.09x)
pred8x8_dc_128_10_neon   25.3 ( 0.99x)!  0.1 ( 0.42x)!  19.3 ( 0.48x)!  44.5 ( 0.42x)!  10.0 ( 0.61x)!  11.0 ( 1.00x)
pred8x8_left_dc_8_neon   46.9 ( 0.72x)!  0.2 ( 0.26x)!  30.2 ( 0.49x)!  71.4 ( 0.39x)!  29.8 ( 0.35x)!  26.5 ( 0.44x)!
pred8x8_left_dc_10_neon  45.4 ( 0.82x)!  0.2 ( 0.29x)!  28.1 ( 0.67x)!  70.2 ( 0.47x)!  30.0 ( 0.38x)!  26.5 ( 0.43x)!
pred16x16_dc_8_neon      74.4 ( 1.34x)   0.3 ( 0.62x)!  44.7 ( 0.89x)!  128.0 ( 0.79x)! 48.5 ( 0.67x)!  39.4 ( 0.71x)!
pred16x16_dc_128_8_neon  37.9 ( 0.79x)!  0.1 ( 0.60x)!  20.1 ( 0.80x)!  41.8 ( 0.46x)!  16.2 ( 0.81x)!  12.8 ( 0.95x)!
pred16x16_left_dc_8_neon 69.9 ( 1.19x)   0.3 ( 0.46x)!  49.6 ( 0.54x)!  116.8 ( 0.62x)! 52.8 ( 0.45x)!  44.2 ( 0.51x)!
pred8x8_hori_8_neon      30.6 ( 1.39x)   0.1 ( 0.45x)!  19.4 ( 0.81x)!  71.0 ( 0.50x)!  15.9 ( 0.55x)!  12.2 ( 0.94x)!
pred8x8_hori_10_neon*    29.3 ( 1.82x)   0.1 ( 0.59x)!  18.5 ( 1.56x)   68.9 ( 0.64x)!  15.8 ( 0.62x)!  11.8 ( 0.97x)!
pred8x8_top_dc_8_neon    35.8 ( 0.96x)!  0.1 ( 0.59x)!  16.8 ( 0.81x)!  58.9 ( 0.44x)!  11.3 ( 0.89x)!  11.4 ( 0.99x)!
pred8x8_top_dc_10_neon   37.4 ( 1.24x)   0.1 ( 0.92x)!  20.4 ( 0.81x)!  59.5 ( 0.69x)!  10.5 ( 1.48x)   11.8 ( 1.02x)
pred8x8_vertical_8_neon  18.3 ( 1.08x)   0.1 ( 0.54x)!  12.8 ( 0.89x)!  37.2 ( 0.40x)!   8.3 ( 0.77x)!  11.2 ( 1.00x)
pred8x8_vertical_10_neon 19.0 ( 1.24x)   0.1 ( 0.55x)!  15.3 ( 0.62x)!  39.7 ( 0.50x)!   8.2 ( 0.91x)!  11.1 ( 0.99x)!

* pred8x8_horizontal_10 also underperforms on newer architectures, but is still useful on A55 and A76.

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2026-02-04 09:06:37 +00:00
Zhao Zhili
f54841d375 avcodec/aarch64: add pngdsp
Test Name                    A55-gcc-11        M1-clang           A76-gcc-12          A510-clang        X3-clang
-------------------------------------------------------------------------------------------------------------------
add_bytes_l2_4096_neon        1807.2 ( 2.01x)    1.6 ( 1.94x)    333.0 ( 6.35x)   1058.2 ( 2.34x)    214.3 ( 1.99x)
add_paeth_prediction_3_neon  33036.1 ( 2.41x)  145.1 ( 1.66x)  20443.3 ( 1.97x)  35225.1 ( 1.23x)  19420.8 ( 1.05x)
add_paeth_prediction_4_neon  24368.6 ( 3.26x)  106.7 ( 2.01x)  15163.8 ( 2.77x)  26454.7 ( 1.62x)  14319.0 ( 1.35x)
add_paeth_prediction_6_neon  17900.6 ( 4.44x)   72.0 ( 2.70x)  10214.3 ( 4.20x)  18296.9 ( 2.27x)   9693.1 ( 1.97x)
add_paeth_prediction_8_neon  12615.4 ( 6.31x)   54.1 ( 2.58x)   7706.0 ( 5.45x)  13733.3 ( 2.94x)   7272.6 ( 2.63x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2026-02-04 12:05:35 +08:00
Martin Storsjö
f74c551eaa aarch64: Fix indentation of a few instructions
This file is exempt from the indent checker script, as there
are a few other bits in it that the script wants to reformat
into slightly worse form, or which might not warrant being
reformatted.

But these instructions should indeed be indented this way.
2026-01-30 05:21:27 +00:00
Andreas Rheinhardt
bf4d5037b4 avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names
These names are a remnant of dsputil when all the DSP functions
from all codecs were part of DSPcontext.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:25 +01:00
Jun Zhao
8966101fa6 lavc/hevc: add aarch64 neon for 12-bit dequant
Implement NEON optimization for HEVC dequant at 12-bit depth.

For 12-bit: shift = 15 - 12 - log2_size = 3 - log2_size. When shift
is negative, we use shl (shift left) instead of srshr.
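
The shift rule above can be sketched in scalar C as follows. This is a minimal illustration, not the NEON code from the commit: the helper names dequant_shift and dequant_coeff are hypothetical, and the rounding behaviour for positive shifts assumes the usual add-half-then-shift form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift amount from the formula in the text,
 * shift = 15 - bit_depth - log2_size. */
static int dequant_shift(int bit_depth, int log2_size)
{
    return 15 - bit_depth - log2_size;
}

/* Hypothetical helper: apply the shift to one coefficient.
 * Positive shift: rounding right shift (the srshr case).
 * Negative shift: left shift by -shift (the shl case).
 * Zero shift: the coefficient is returned unchanged. */
static int16_t dequant_coeff(int16_t c, int shift)
{
    assert(shift > -16 && shift < 16);
    if (shift > 0) {
        int offset = 1 << (shift - 1);
        /* Assumes arithmetic right shift of negative values,
         * as provided by GCC and Clang. */
        return (int16_t)((c + offset) >> shift);
    }
    if (shift < 0) {
        /* Multiply instead of shifting a possibly negative value;
         * the result is truncated to 16 bits without saturation. */
        return (int16_t)(uint16_t)(c * (1 << -shift));
    }
    return c;
}
```

With a bit depth of 12 this gives shifts of 1, 0, -1 and -2 for 4x4, 8x8, 16x16 and 32x32 blocks, so only the two largest sizes take the left-shift path.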

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_12_c:                                   9.9 ( 1.00x)
hevc_dequant_4x4_12_neon:                                5.7 ( 1.74x)

hevc_dequant_8x8_12_c:                                   1.7 ( 1.00x)
hevc_dequant_8x8_12_neon:                                1.3 ( 1.30x)

hevc_dequant_16x16_12_c:                               131.1 ( 1.00x)
hevc_dequant_16x16_12_neon:                              7.9 (16.52x)

hevc_dequant_32x32_12_c:                                69.7 ( 1.00x)
hevc_dequant_32x32_12_neon:                             28.4 ( 2.46x)

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Jun Zhao
ce89d974c8 lavc/hevc: add aarch64 neon for 10-bit dequant
Implement NEON optimization for HEVC dequant at 10-bit depth.

For 10-bit: shift = 15 - 10 - log2_size = 5 - log2_size

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_10_c:                                  16.6 ( 1.00x)
hevc_dequant_4x4_10_neon:                                7.4 ( 2.23x)

hevc_dequant_8x8_10_c:                                  39.7 ( 1.00x)
hevc_dequant_8x8_10_neon:                                7.5 ( 5.28x)

hevc_dequant_16x16_10_c:                               168.7 ( 1.00x)
hevc_dequant_16x16_10_neon:                             10.2 (16.56x)

hevc_dequant_32x32_10_c:                                 1.9 ( 1.00x)
hevc_dequant_32x32_10_neon:                              1.9 ( 1.01x)

Note: 32x32 shift=0 is identity transform (no-op), so NEON has no
advantage over C which is also optimized away by the compiler.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Jun Zhao
0886e50c6b lavc/hevc: add aarch64 neon for 8-bit dequant
Implement NEON optimization for HEVC dequant at 8-bit depth.

The NEON implementation uses srshr (Signed Rounding Shift Right) which
does both the add with offset and right shift in a single instruction.
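
As a scalar illustration of that equivalence, the sketch below models one 16-bit srshr lane and compares it with the explicit add-offset-then-shift form. The function names are hypothetical and the code is only a model of the per-lane arithmetic, not the NEON implementation from this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Two-step form: add the rounding offset, then shift right. */
static int16_t add_offset_then_shift(int16_t x, int shift)
{
    assert(shift >= 1 && shift <= 15);
    int offset = 1 << (shift - 1);
    /* x is promoted to int, so the addition cannot overflow.
     * Assumes arithmetic right shift of negative values (GCC, Clang). */
    return (int16_t)((x + offset) >> shift);
}

/* Scalar model of one srshr lane: floor(x / 2^shift) plus the
 * highest bit that was shifted out, i.e. round half toward +inf. */
static int16_t srshr_model(int16_t x, int shift)
{
    assert(shift >= 1 && shift <= 15);
    int q = x >> shift;                      /* floor(x / 2^shift) */
    int round_bit = (x >> (shift - 1)) & 1;  /* last bit shifted out */
    return (int16_t)(q + round_bit);
}
```

Both functions return the same value for every 16-bit input and every shift from 1 to 15, which is why the offset addition needs no separate instruction.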

Optimization details:
- 4x4 (16 coeffs): Single load-process-store sequence
- 8x8 (64 coeffs): Fully unrolled, no loop overhead
- 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency
- 32x32 (1024 coeffs): Pipelined with all available NEON registers

Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_8_c:                                   11.3 ( 1.00x)
hevc_dequant_4x4_8_neon:                                 6.3 ( 1.78x)

hevc_dequant_8x8_8_c:                                   33.9 ( 1.00x)
hevc_dequant_8x8_8_neon:                                 6.6 ( 5.11x)

hevc_dequant_16x16_8_c:                                153.8 ( 1.00x)
hevc_dequant_16x16_8_neon:                               9.0 (17.02x)

hevc_dequant_32x32_8_c:                                 78.1 ( 1.00x)
hevc_dequant_32x32_8_neon:                              31.9 ( 2.45x)

Note on Performance Anomaly:
The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8)
is due to Clang auto-vectorizing only for sizes >= 32x32.
Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2)

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-01-25 06:55:26 +00:00
Georgii Zagoruiko
8acdffa22c aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit
RPi4 (auto-vectorisation is on)
put_luma_v_10_4x4_c:                                   303.3 ( 1.00x)
put_luma_v_10_4x4_neon:                                 55.7 ( 5.45x)
put_luma_v_10_8x8_c:                                  1106.7 ( 1.00x)
put_luma_v_10_8x8_neon:                                163.8 ( 6.76x)
put_luma_v_10_16x16_c:                                2242.1 ( 1.00x)
put_luma_v_10_16x16_neon:                              672.7 ( 3.33x)
put_luma_v_10_32x32_c:                                7057.3 ( 1.00x)
put_luma_v_10_32x32_neon:                             2731.3 ( 2.58x)
put_luma_v_10_64x64_c:                               25699.8 ( 1.00x)
put_luma_v_10_64x64_neon:                            12145.6 ( 2.12x)
put_luma_v_10_128x128_c:                             90694.6 ( 1.00x)
put_luma_v_10_128x128_neon:                          44862.4 ( 2.02x)
put_luma_v_12_4x4_c:                                   304.4 ( 1.00x)
put_luma_v_12_4x4_neon:                                 55.6 ( 5.47x)
put_luma_v_12_8x8_c:                                  1107.4 ( 1.00x)
put_luma_v_12_8x8_neon:                                164.7 ( 6.72x)
put_luma_v_12_16x16_c:                                2235.8 ( 1.00x)
put_luma_v_12_16x16_neon:                              672.5 ( 3.32x)
put_luma_v_12_32x32_c:                                7049.2 ( 1.00x)
put_luma_v_12_32x32_neon:                             2731.6 ( 2.58x)
put_luma_v_12_64x64_c:                               25706.5 ( 1.00x)
put_luma_v_12_64x64_neon:                            12145.0 ( 2.12x)
put_luma_v_12_128x128_c:                             90672.5 ( 1.00x)
put_luma_v_12_128x128_neon:                          44857.1 ( 2.02x)

Apple M4 (auto-vectorisation is on):
put_luma_v_10_4x4_c:                                    25.6 ( 1.00x)
put_luma_v_10_4x4_neon:                                  3.1 ( 8.18x)
put_luma_v_10_8x8_c:                                    34.7 ( 1.00x)
put_luma_v_10_8x8_neon:                                 10.5 ( 3.32x)
put_luma_v_10_16x16_c:                                 103.9 ( 1.00x)
put_luma_v_10_16x16_neon:                               42.3 ( 2.45x)
put_luma_v_10_32x32_c:                                 399.7 ( 1.00x)
put_luma_v_10_32x32_neon:                              161.8 ( 2.47x)
put_luma_v_10_64x64_c:                                1276.7 ( 1.00x)
put_luma_v_10_64x64_neon:                              840.1 ( 1.52x)
put_luma_v_10_128x128_c:                              4981.3 ( 1.00x)
put_luma_v_10_128x128_neon:                           3008.0 ( 1.66x)
put_luma_v_12_4x4_c:                                    23.6 ( 1.00x)
put_luma_v_12_4x4_neon:                                  2.0 (11.84x)
put_luma_v_12_8x8_c:                                    31.8 ( 1.00x)
put_luma_v_12_8x8_neon:                                 12.4 ( 2.55x)
put_luma_v_12_16x16_c:                                 100.8 ( 1.00x)
put_luma_v_12_16x16_neon:                               44.9 ( 2.25x)
put_luma_v_12_32x32_c:                                 331.1 ( 1.00x)
put_luma_v_12_32x32_neon:                              175.2 ( 1.89x)
put_luma_v_12_64x64_c:                                1227.1 ( 1.00x)
put_luma_v_12_64x64_neon:                              712.7 ( 1.72x)
put_luma_v_12_128x128_c:                              5149.1 ( 1.00x)
put_luma_v_12_128x128_neon:                           2809.3 ( 1.83x)
2026-01-08 17:35:55 +00:00
Zhao Zhili
840183d823 aarch64/hpeldsp_neon: fix out-of-bounds read
Fix #21141

The performance improved a little bit.
On A76:
                              Before            After
put_pixels_tab[0][1]_neon:    32.4 ( 3.91x)     31.6 ( 3.99x)
put_pixels_tab[0][3]_neon:    88.0 ( 4.50x)     74.6 ( 5.31x)
put_pixels_tab[1][1]_neon:    33.5 ( 2.52x)     31.2 ( 2.71x)
put_pixels_tab[1][3]_neon:    30.5 ( 3.61x)     21.7 ( 5.08x)

On A55:
                             Before            After
put_pixels_tab[0][1]_neon:   175.2 ( 2.41x)    138.7 ( 3.04x)
put_pixels_tab[0][3]_neon:   334.3 ( 2.71x)    296.1 ( 3.07x)
put_pixels_tab[1][1]_neon:   168.3 ( 1.78x)     94.1 ( 3.19x)
put_pixels_tab[1][3]_neon:   112.3 ( 2.20x)     90.0 ( 2.74x)
2026-01-04 03:22:55 +00:00
Georgii Zagoruiko
f790de2a87 aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit
RPi4 (auto-vectorisation is turned on)
put_luma_h_10_4x4_c:                                   282.8 ( 1.00x)
put_luma_h_10_8x8_c:                                  1069.5 ( 1.00x)
put_luma_h_10_8x8_neon:                                207.5 ( 5.15x)
put_luma_h_10_16x16_c:                                1999.6 ( 1.00x)
put_luma_h_10_16x16_neon:                              777.5 ( 2.57x)
put_luma_h_10_32x32_c:                                6612.9 ( 1.00x)
put_luma_h_10_32x32_neon:                             3201.6 ( 2.07x)
put_luma_h_10_64x64_c:                               25059.0 ( 1.00x)
put_luma_h_10_64x64_neon:                            13623.5 ( 1.84x)
put_luma_h_10_128x128_c:                             91310.1 ( 1.00x)
put_luma_h_10_128x128_neon:                          50358.3 ( 1.81x)
put_luma_h_12_4x4_c:                                   282.1 ( 1.00x)
put_luma_h_12_8x8_c:                                  1068.4 ( 1.00x)
put_luma_h_12_8x8_neon:                                207.7 ( 5.14x)
put_luma_h_12_16x16_c:                                1998.0 ( 1.00x)
put_luma_h_12_16x16_neon:                              777.5 ( 2.57x)
put_luma_h_12_32x32_c:                                6612.0 ( 1.00x)
put_luma_h_12_32x32_neon:                             3201.6 ( 2.07x)
put_luma_h_12_64x64_c:                               25036.8 ( 1.00x)
put_luma_h_12_64x64_neon:                            13595.1 ( 1.84x)
put_luma_h_12_128x128_c:                             91305.8 ( 1.00x)
put_luma_h_12_128x128_neon:                          50359.7 ( 1.81x)

Apple M2 Air (auto-vectorisation is turned on)
put_luma_h_10_4x4_c:                                     0.3 ( 1.00x)
put_luma_h_10_8x8_c:                                     1.0 ( 1.00x)
put_luma_h_10_8x8_neon:                                  0.4 ( 2.59x)
put_luma_h_10_16x16_c:                                   2.9 ( 1.00x)
put_luma_h_10_16x16_neon:                                1.4 ( 2.01x)
put_luma_h_10_32x32_c:                                   9.4 ( 1.00x)
put_luma_h_10_32x32_neon:                                5.8 ( 1.62x)
put_luma_h_10_64x64_c:                                  35.6 ( 1.00x)
put_luma_h_10_64x64_neon:                               23.6 ( 1.51x)
put_luma_h_10_128x128_c:                               131.1 ( 1.00x)
put_luma_h_10_128x128_neon:                             92.6 ( 1.42x)
put_luma_h_12_4x4_c:                                     0.3 ( 1.00x)
put_luma_h_12_8x8_c:                                     1.0 ( 1.00x)
put_luma_h_12_8x8_neon:                                  0.4 ( 2.58x)
put_luma_h_12_16x16_c:                                   2.9 ( 1.00x)
put_luma_h_12_16x16_neon:                                1.4 ( 2.00x)
put_luma_h_12_32x32_c:                                   9.4 ( 1.00x)
put_luma_h_12_32x32_neon:                                5.8 ( 1.61x)
put_luma_h_12_64x64_c:                                  35.3 ( 1.00x)
put_luma_h_12_64x64_neon:                               23.3 ( 1.52x)
put_luma_h_12_128x128_c:                               131.2 ( 1.00x)
put_luma_h_12_128x128_neon:                             92.4 ( 1.42x)
2025-11-24 21:22:55 +00:00
Kacper Michajłow
9ad20839fb avcodec/pixblockdsp: be consistent about restrict use in ff_{get,diff}_pixels
Suppresses warnings about function pointer mismatch.

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-10-25 01:01:15 +02:00
Bin Peng
3115c0c0e6 lavc/aarch64: Fix addp overflow in ff_pred16x16_plane_neon_10
The mismatch between neon and C functions can be reproduced
using the following bitstream and command line on aarch64 devices.

wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264
 ./ffmpeg -cpuflags 0  -threads 1 -i replay_intra_pred_16x16.h264  -f framemd5 -y md5_ref
 ./ffmpeg              -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon

Signed-off-by: Bin Peng <pengbin@visionular.com>
2025-10-24 15:32:35 +00:00
Krzysztof Pyrkosz
03c054d43c avcodec/aarch64/vvc: Implement dmvr_v_8
A72
dmvr_v_8_12x20_neon:                                   207.0 ( 4.15x)
dmvr_v_8_20x12_neon:                                   170.4 ( 4.37x)
dmvr_v_8_20x20_neon:                                   273.4 ( 4.58x)

A53
dmvr_v_8_12x20_neon:                                   450.6 ( 4.21x)
dmvr_v_8_20x12_neon:                                   342.8 ( 3.70x)
dmvr_v_8_20x20_neon:                                   550.9 ( 3.79x)
2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz
56a638d836 avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon
Before and after:
A53:
apply_bdof_8_16x8_neon:                               2733.1 ( 4.88x)
apply_bdof_8_16x16_neon:                              5458.6 ( 4.86x)
apply_bdof_10_16x8_neon:                              2789.8 ( 4.64x)
apply_bdof_10_16x16_neon:                             5523.8 ( 4.68x)
apply_bdof_12_16x8_neon:                              2792.8 ( 4.58x)
apply_bdof_12_16x16_neon:                             5519.5 ( 4.63x)

apply_bdof_8_16x8_neon:                               2571.8 ( 5.12x)
apply_bdof_8_16x16_neon:                              5173.3 ( 5.12x)
apply_bdof_10_16x8_neon:                              2635.1 ( 4.87x)
apply_bdof_10_16x16_neon:                             5243.0 ( 4.89x)
apply_bdof_12_16x8_neon:                              2613.0 ( 4.89x)
apply_bdof_12_16x16_neon:                             5231.7 ( 4.90x)

A78:
apply_bdof_8_16x8_neon:                                565.3 ( 8.43x)
apply_bdof_8_16x16_neon:                              1109.5 ( 8.60x)
apply_bdof_10_16x8_neon:                               568.2 ( 7.92x)
apply_bdof_10_16x16_neon:                             1114.1 ( 8.08x)
apply_bdof_12_16x8_neon:                               570.2 ( 7.87x)
apply_bdof_12_16x16_neon:                             1116.3 ( 8.03x)

apply_bdof_8_16x8_neon:                                541.4 ( 8.81x)
apply_bdof_8_16x16_neon:                              1065.9 ( 8.97x)
apply_bdof_10_16x8_neon:                               543.2 ( 8.32x)
apply_bdof_10_16x16_neon:                             1071.5 ( 8.39x)
apply_bdof_12_16x8_neon:                               544.2 ( 8.25x)
apply_bdof_12_16x16_neon:                             1074.1 ( 8.37x)
2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz
f1a155d975 avcodec/aarch64/vvc: Optimize dmvr_hv_10
Before and after on A53:
dmvr_hv_10_12x20_neon:                                1838.2 ( 3.02x)
dmvr_hv_10_20x12_neon:                                1330.2 ( 1.83x)
dmvr_hv_10_20x20_neon:                                2148.2 ( 1.85x)
dmvr_hv_12_12x20_neon:                                1839.2 ( 3.02x)
dmvr_hv_12_20x12_neon:                                1330.6 ( 1.83x)
dmvr_hv_12_20x20_neon:                                2147.2 ( 1.85x)

dmvr_hv_10_12x20_neon:                                1755.0 ( 3.17x)
dmvr_hv_10_20x12_neon:                                1165.8 ( 2.09x)
dmvr_hv_10_20x20_neon:                                1876.1 ( 2.12x)
dmvr_hv_12_12x20_neon:                                1754.4 ( 3.17x)
dmvr_hv_12_20x12_neon:                                1167.8 ( 2.09x)
dmvr_hv_12_20x20_neon:                                1878.8 ( 2.12x)
2025-09-21 19:39:27 +00:00
Georgii Zagoruiko
4fbacb3944 avcodec/aarch64/vvc: Optimised version of classify function.
Macbook Air (M2):
    vvc_alf_classify_8x8_8_c:                                2.6 ( 1.00x)
    vvc_alf_classify_8x8_8_neon:                             1.0 ( 2.47x)
    vvc_alf_classify_8x8_10_c:                               2.7 ( 1.00x)
    vvc_alf_classify_8x8_10_neon:                            0.9 ( 2.98x)
    vvc_alf_classify_8x8_12_c:                               2.7 ( 1.00x)
    vvc_alf_classify_8x8_12_neon:                            0.9 ( 2.97x)
    vvc_alf_classify_16x16_8_c:                              7.3 ( 1.00x)
    vvc_alf_classify_16x16_8_neon:                           3.4 ( 2.12x)
    vvc_alf_classify_16x16_10_c:                             4.3 ( 1.00x)
    vvc_alf_classify_16x16_10_neon:                          2.9 ( 1.47x)
    vvc_alf_classify_16x16_12_c:                             4.3 ( 1.00x)
    vvc_alf_classify_16x16_12_neon:                          3.0 ( 1.44x)
    vvc_alf_classify_32x32_8_c:                             13.7 ( 1.00x)
    vvc_alf_classify_32x32_8_neon:                          10.7 ( 1.29x)
    vvc_alf_classify_32x32_10_c:                            12.3 ( 1.00x)
    vvc_alf_classify_32x32_10_neon:                          8.7 ( 1.42x)
    vvc_alf_classify_32x32_12_c:                            12.2 ( 1.00x)
    vvc_alf_classify_32x32_12_neon:                          8.7 ( 1.40x)
    vvc_alf_classify_64x64_8_c:                             45.8 ( 1.00x)
    vvc_alf_classify_64x64_8_neon:                          37.1 ( 1.23x)
    vvc_alf_classify_64x64_10_c:                            41.3 ( 1.00x)
    vvc_alf_classify_64x64_10_neon:                         32.8 ( 1.26x)
    vvc_alf_classify_64x64_12_c:                            41.4 ( 1.00x)
    vvc_alf_classify_64x64_12_neon:                         32.4 ( 1.28x)
    vvc_alf_classify_128x128_8_c:                          163.7 ( 1.00x)
    vvc_alf_classify_128x128_8_neon:                       138.3 ( 1.18x)
    vvc_alf_classify_128x128_10_c:                         149.1 ( 1.00x)
    vvc_alf_classify_128x128_10_neon:                      120.3 ( 1.24x)
    vvc_alf_classify_128x128_12_c:                         148.7 ( 1.00x)
    vvc_alf_classify_128x128_12_neon:                      119.4 ( 1.25x)

    RPi4 (Cortex-A72):
    vvc_alf_classify_8x8_8_c:                             1251.6 ( 1.00x)
    vvc_alf_classify_8x8_8_neon:                           700.7 ( 1.79x)
    vvc_alf_classify_8x8_10_c:                            1141.9 ( 1.00x)
    vvc_alf_classify_8x8_10_neon:                          659.7 ( 1.73x)
    vvc_alf_classify_8x8_12_c:                            1075.8 ( 1.00x)
    vvc_alf_classify_8x8_12_neon:                          658.7 ( 1.63x)
    vvc_alf_classify_16x16_8_c:                           3574.1 ( 1.00x)
    vvc_alf_classify_16x16_8_neon:                        1849.8 ( 1.93x)
    vvc_alf_classify_16x16_10_c:                          3270.0 ( 1.00x)
    vvc_alf_classify_16x16_10_neon:                       1786.1 ( 1.83x)
    vvc_alf_classify_16x16_12_c:                          3271.7 ( 1.00x)
    vvc_alf_classify_16x16_12_neon:                       1785.5 ( 1.83x)
    vvc_alf_classify_32x32_8_c:                          12451.9 ( 1.00x)
    vvc_alf_classify_32x32_8_neon:                        5984.3 ( 2.08x)
    vvc_alf_classify_32x32_10_c:                         11428.9 ( 1.00x)
    vvc_alf_classify_32x32_10_neon:                       5756.3 ( 1.99x)
    vvc_alf_classify_32x32_12_c:                         11252.8 ( 1.00x)
    vvc_alf_classify_32x32_12_neon:                       5755.7 ( 1.96x)
    vvc_alf_classify_64x64_8_c:                          47625.5 ( 1.00x)
    vvc_alf_classify_64x64_8_neon:                       21071.9 ( 2.26x)
    vvc_alf_classify_64x64_10_c:                         44576.3 ( 1.00x)
    vvc_alf_classify_64x64_10_neon:                      21544.7 ( 2.07x)
    vvc_alf_classify_64x64_12_c:                         44600.5 ( 1.00x)
    vvc_alf_classify_64x64_12_neon:                      21491.2 ( 2.08x)
    vvc_alf_classify_128x128_8_c:                       192143.3 ( 1.00x)
    vvc_alf_classify_128x128_8_neon:                     82387.6 ( 2.33x)
    vvc_alf_classify_128x128_10_c:                      177583.1 ( 1.00x)
    vvc_alf_classify_128x128_10_neon:                    81628.8 ( 2.18x)
    vvc_alf_classify_128x128_12_c:                      177582.2 ( 1.00x)
    vvc_alf_classify_128x128_12_neon:                    81625.1 ( 2.18x)
2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz
de25cb4603 avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x
Before and after:
A53:
apply_bdof_8_8x16_neon:                               3320.5 ( 4.02x)
apply_bdof_10_8x16_neon:                              3317.8 ( 3.90x)
apply_bdof_12_8x16_neon:                              3303.6 ( 3.91x)

apply_bdof_8_8x16_neon:                               3168.1 ( 4.23x)
apply_bdof_10_8x16_neon:                              3127.8 ( 4.13x)
apply_bdof_12_8x16_neon:                              3119.3 ( 4.18x)

A72:
apply_bdof_8_8x16_neon:                               1827.4 ( 5.02x)
apply_bdof_10_8x16_neon:                              1838.5 ( 4.89x)
apply_bdof_12_8x16_neon:                              1841.1 ( 4.83x)

apply_bdof_8_8x16_neon:                               1691.6 ( 5.46x)
apply_bdof_10_8x16_neon:                              1695.9 ( 5.23x)
apply_bdof_12_8x16_neon:                              1695.4 ( 5.29x)

A78
apply_bdof_8_8x16_neon:                                648.9 ( 7.43x)
apply_bdof_10_8x16_neon:                               646.1 ( 7.04x)
apply_bdof_12_8x16_neon:                               643.8 ( 7.04x)

apply_bdof_8_8x16_neon:                                603.2 ( 7.97x)
apply_bdof_10_8x16_neon:                               604.1 ( 7.52x)
apply_bdof_12_8x16_neon:                               604.5 ( 7.52x)
2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz
7b21bde34c avcodec/aarch64/vvc: Implemented dmvr_h_10
A78:
dmvr_h_10_12x20_neon:                                   82.2 ( 6.49x)
dmvr_h_10_20x12_neon:                                   69.9 ( 3.66x)
dmvr_h_10_20x20_neon:                                  112.5 ( 3.74x)
dmvr_h_12_12x20_neon:                                   81.4 ( 6.51x)
dmvr_h_12_20x12_neon:                                   69.2 ( 3.74x)
dmvr_h_12_20x20_neon:                                  110.2 ( 3.85x)

A72:
dmvr_h_10_12x20_neon:                                  234.1 ( 4.67x)
dmvr_h_10_20x12_neon:                                  221.4 ( 3.48x)
dmvr_h_10_20x20_neon:                                  356.9 ( 3.59x)
dmvr_h_12_12x20_neon:                                  234.1 ( 4.67x)
dmvr_h_12_20x12_neon:                                  221.5 ( 3.53x)
dmvr_h_12_20x20_neon:                                  357.0 ( 3.64x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
189e841cfd avcodec/aarch64/vvc: Implement dmvr_h_8
A78:
dmvr_h_8_12x20_neon:                                    76.6 ( 4.31x)
dmvr_h_8_20x12_neon:                                    65.8 ( 3.49x)
dmvr_h_8_20x20_neon:                                   106.6 ( 3.62x)

A72:
dmvr_h_8_12x20_neon:                                   190.6 ( 4.40x)
dmvr_h_8_20x12_neon:                                   171.1 ( 4.31x)
dmvr_h_8_20x20_neon:                                   275.1 ( 4.50x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
fb4407797e Replace uxtl with umull in dmvr_hv_8
Before and after on A78:
dmvr_hv_8_12x20_neon:                                  205.3 ( 5.21x)
dmvr_hv_8_20x12_neon:                                  171.8 ( 3.15x)
dmvr_hv_8_20x20_neon:                                  282.7 ( 3.11x)

dmvr_hv_8_12x20_neon:                                  172.7 ( 5.58x)
dmvr_hv_8_20x12_neon:                                  133.3 ( 3.36x)
dmvr_hv_8_20x20_neon:                                  214.6 ( 3.40x)
2025-09-05 07:20:15 +00:00
Zhao Zhili
6ce02bcc3a avcodec/aarch64/vvc: Optimize apply_bdof
Before this patch, prof_grad_filter calculated
gh[0], gh[1], gv[0], gv[1] and saved them to the stack.

derive_bdof_vx_vy loaded them from the stack and calculated
gh[0] + gh[1], gv[0] + gv[1].

apply_bdof_min_block loaded them from the stack and calculated
gh[0] - gh[1], gv[0] - gv[1].

This patch adds bdof_grad_filter, which calculates gh[0] + gh[1],
gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1], and saves them to the
stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the
results directly.
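
The data-flow change can be sketched in C roughly as below. This is only an illustration under simplifying assumptions: the function name, the flat array layout and the 16-bit element type are hypothetical, and the real code is NEON assembly that also computes the gradients themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: combine two gradient planes once, so later
 * stages read the sum and difference instead of recomputing them
 * from the individual planes. */
static void grad_sum_diff(const int16_t *g0, const int16_t *g1,
                          int16_t *sum, int16_t *diff, size_t n)
{
    assert(g0 && g1 && sum && diff);
    for (size_t i = 0; i < n; i++) {
        sum[i]  = (int16_t)(g0[i] + g1[i]);  /* e.g. gh[0] + gh[1] */
        diff[i] = (int16_t)(g0[i] - g1[i]);  /* e.g. gh[0] - gh[1] */
    }
}
```

In this sketch the helper would be called once for the horizontal planes and once for the vertical ones; the sum feeds the vx/vy derivation and the difference feeds the final block update.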

prof_grad_filter is kept for reuse by other functions in the future.

Benchmark on rpi5 with gcc 12
                               Before               After
--------------------------------------------------------------------
apply_bdof_8_8x16_c:       |   7431.4 ( 1.00x)   |   7371.7 ( 1.00x)
apply_bdof_8_8x16_neon:    |   1175.4 ( 6.32x)   |   1036.3 ( 7.11x)
apply_bdof_8_16x8_c:       |   7182.2 ( 1.00x)   |   7201.1 ( 1.00x)
apply_bdof_8_16x8_neon:    |   1021.7 ( 7.03x)   |    879.9 ( 8.18x)
apply_bdof_8_16x16_c:      |  14577.1 ( 1.00x)   |  14589.3 ( 1.00x)
apply_bdof_8_16x16_neon:   |   2012.8 ( 7.24x)   |   1743.3 ( 8.37x)
apply_bdof_10_8x16_c:      |   7292.4 ( 1.00x)   |   7308.5 ( 1.00x)
apply_bdof_10_8x16_neon:   |   1156.3 ( 6.31x)   |   1045.3 ( 6.99x)
apply_bdof_10_16x8_c:      |   7112.4 ( 1.00x)   |   7214.4 ( 1.00x)
apply_bdof_10_16x8_neon:   |   1007.6 ( 7.06x)   |    904.8 ( 7.97x)
apply_bdof_10_16x16_c:     |  14363.3 ( 1.00x)   |  14476.4 ( 1.00x)
apply_bdof_10_16x16_neon:  |   1986.9 ( 7.23x)   |   1783.1 ( 8.12x)
apply_bdof_12_8x16_c:      |   7433.3 ( 1.00x)   |   7374.7 ( 1.00x)
apply_bdof_12_8x16_neon:   |   1155.9 ( 6.43x)   |   1040.8 ( 7.09x)
apply_bdof_12_16x8_c:      |   7171.1 ( 1.00x)   |   7376.3 ( 1.00x)
apply_bdof_12_16x8_neon:   |   1010.8 ( 7.09x)   |    899.4 ( 8.20x)
apply_bdof_12_16x16_c:     |  14515.5 ( 1.00x)   |  14731.5 ( 1.00x)
apply_bdof_12_16x16_neon:  |   1988.4 ( 7.30x)   |   1785.2 ( 8.25x)
2025-09-03 06:55:37 +00:00
Zhao Zhili
2e92417603 avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy
Implement line tricks and pixel tricks. See comments in inter.S
for details.

Benchmark on rpi5 with gcc 12
                               Before             After
-----------------------------------------------------------------
apply_bdof_8_8x16_c:       |   7375.5 ( 1.00x) |  7473.8 ( 1.00x)
apply_bdof_8_8x16_neon:    |   1875.1 ( 3.93x) |  1135.8 ( 6.58x)
apply_bdof_8_16x8_c:       |   7273.9 ( 1.00x) |  7204.0 ( 1.00x)
apply_bdof_8_16x8_neon:    |   1738.2 ( 4.18x) |  1013.0 ( 7.11x)
apply_bdof_8_16x16_c:      |  14744.9 ( 1.00x) | 14712.6 ( 1.00x)
apply_bdof_8_16x16_neon:   |   3446.7 ( 4.28x) |  1997.7 ( 7.36x)
apply_bdof_10_8x16_c:      |   7352.4 ( 1.00x) |  7485.7 ( 1.00x)
apply_bdof_10_8x16_neon:   |   1861.0 ( 3.95x) |  1134.1 ( 6.60x)
apply_bdof_10_16x8_c:      |   7330.5 ( 1.00x) |  7232.8 ( 1.00x)
apply_bdof_10_16x8_neon:   |   1747.2 ( 4.20x) |  1002.6 ( 7.21x)
apply_bdof_10_16x16_c:     |  14522.4 ( 1.00x) | 14664.8 ( 1.00x)
apply_bdof_10_16x16_neon:  |   3490.5 ( 4.16x) |  1978.4 ( 7.41x)
apply_bdof_12_8x16_c:      |   7389.0 ( 1.00x) |  7380.1 ( 1.00x)
apply_bdof_12_8x16_neon:   |   1861.3 ( 3.97x) |  1134.0 ( 6.51x)
apply_bdof_12_16x8_c:      |   7283.1 ( 1.00x) |  7336.9 ( 1.00x)
apply_bdof_12_16x8_neon:   |   1749.1 ( 4.16x) |  1002.3 ( 7.32x)
apply_bdof_12_16x16_c:     |  14580.7 ( 1.00x) | 14502.7 ( 1.00x)
apply_bdof_12_16x16_neon:  |   3472.9 ( 4.20x) |  1978.3 ( 7.33x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-09-03 06:55:37 +00:00
Timo Rothenpieler
262d41c804 all: fix typos found by codespell
2025-08-03 13:48:47 +02:00
Andreas Rheinhardt
9b409ea1e6 configure: Factor mpegvideoencdsp out of mpegvideoenc
This will allow relaxing the dependency on mpegvideoenc
for several codecs.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-06-21 22:08:52 +02:00
Andreas Rheinhardt
20ddada2a3 avcodec/pixblockdsp: Improve 8 vs 16 bit check
Before this commit, the input in get_pixels and get_pixels_unaligned
was treated inconsistently:
- The generic code treated 9, 10, 12 and 14 bits as 16bit input
(these bits correspond to what FFmpeg's dsputils supported),
everything with <= 8 bits as 8 bit and everything else as 8 bit
when used via AVDCT (which exposes these functions and purports
to support up to 14 bits).
- AARCH64, ARM, PPC, RISC-V and x86 ignore this AVDCT special case.
- RISC-V also ignored the restriction to 9, 10, 12 and 14 for its
16bit check and treated everything > 8 bits as 16bit.
- The mmi MIPS code treats everything as 8 bit when used via
AVDCT (this is certainly broken); otherwise it checks for <= 8 bits.
The msa MIPS code behaves like the generic code.

This commit changes this to treat 9..16 bits as 16 bit input,
everything else as 8 bit (the former because it makes sense,
the latter to preserve the behaviour for external users*).

*: The only internal user of AVDCT (the spp filter) always
uses 8, 9 or 10 bits.
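
The unified rule can be sketched in C as follows. This is only an
illustration of the check described above; is_16bit_input is a
hypothetical helper name and not a function from the FFmpeg tree.

```c
#include <stdbool.h>

/* Hypothetical helper illustrating the rule described above:
 * 9..16 bits per sample are handled as 16 bit input, everything else
 * (including out-of-range values from external AVDCT users) as 8 bit. */
static bool is_16bit_input(int bits_per_raw_sample)
{
    return bits_per_raw_sample >= 9 && bits_per_raw_sample <= 16;
}
```

With this rule 8 and 17 both select the 8 bit path, while 9, 10, 12, 14
and 16 select the 16 bit path.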

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-05-31 01:25:27 +02:00
Zhao Zhili
26752368f0 aarch64/h26x: Add put_hevc_pel_bi_w_pixels
On rpi5 (A76):

put_hevc_pel_bi_w_pixels4_8_c:                          90.0 ( 1.00x)
put_hevc_pel_bi_w_pixels4_8_neon:                       34.1 ( 2.64x)
put_hevc_pel_bi_w_pixels6_8_c:                         188.3 ( 1.00x)
put_hevc_pel_bi_w_pixels6_8_neon:                       73.5 ( 2.56x)
put_hevc_pel_bi_w_pixels8_8_c:                         327.1 ( 1.00x)
put_hevc_pel_bi_w_pixels8_8_neon:                       75.8 ( 4.32x)
put_hevc_pel_bi_w_pixels12_8_c:                        728.8 ( 1.00x)
put_hevc_pel_bi_w_pixels12_8_neon:                     186.1 ( 3.92x)
put_hevc_pel_bi_w_pixels16_8_c:                       1288.1 ( 1.00x)
put_hevc_pel_bi_w_pixels16_8_neon:                     268.5 ( 4.80x)
put_hevc_pel_bi_w_pixels24_8_c:                       2855.5 ( 1.00x)
put_hevc_pel_bi_w_pixels24_8_neon:                     723.8 ( 3.95x)
put_hevc_pel_bi_w_pixels32_8_c:                       5095.3 ( 1.00x)
put_hevc_pel_bi_w_pixels32_8_neon:                    1165.0 ( 4.37x)
put_hevc_pel_bi_w_pixels48_8_c:                      11521.5 ( 1.00x)
put_hevc_pel_bi_w_pixels48_8_neon:                    2856.0 ( 4.03x)
put_hevc_pel_bi_w_pixels64_8_c:                      21020.5 ( 1.00x)
put_hevc_pel_bi_w_pixels64_8_neon:                    4699.1 ( 4.47x)

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:24:14 +08:00
Zhao Zhili
39786f8cd5 aarch64/h26x: optimize sao_band_filter
int8_t[] is enough for offset_table of 8 bit streams.

On rpi5:
                             Before               After
hevc_sao_band_8_8_c:          252.3 ( 1.00x)     252.3 ( 1.00x)
hevc_sao_band_8_8_neon:        95.8 ( 2.63x)      61.0 ( 4.57x)
hevc_sao_band_16_8_c:         875.2 ( 1.00x)     864.9 ( 1.00x)
hevc_sao_band_16_8_neon:      317.5 ( 2.76x)     150.0 ( 6.26x)
hevc_sao_band_32_8_c:        3853.5 ( 1.00x)    3871.6 ( 1.00x)
hevc_sao_band_32_8_neon:     1222.3 ( 3.15x)     550.6 ( 7.39x)
hevc_sao_band_48_8_c:        8203.6 ( 1.00x)    8182.6 ( 1.00x)
hevc_sao_band_48_8_neon:     2685.7 ( 3.05x)    1185.8 ( 7.36x)
hevc_sao_band_64_8_c:       14023.0 ( 1.00x)   14038.9 ( 1.00x)
hevc_sao_band_64_8_neon:     4783.2 ( 2.93x)    2078.4 ( 7.15x)

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:11:45 +08:00
Andreas Rheinhardt
a064d34a32 avcodec/mpegvideoenc: Add MPVEncContext
Many of the fields of MpegEncContext (which is also used by decoders)
are actually only used by encoders. Therefore this commit adds
a new encoder-only structure and moves all of the encoder-only
fields to it except for those which require more explicit
synchronisation between the main slice context and the other
slice contexts. This synchronisation is currently mainly provided
by ff_update_thread_context() which simply copies most of
the main slice context over the other slice contexts. Fields
which are moved to the new MPVEncContext no longer participate
in this (which is desired, because it is horrible and for the
fields b) below wasteful) which means that some fields can only
be moved when explicit synchronisation code is added in later commits.

More explicitly, this commit moves the following fields:
a) Fields not copied by ff_update_duplicate_context():
dct_error_sum and dct_count; the former does not need synchronisation,
the latter is synchronised in merge_context_after_encode().
b) Fields which do not change after initialisation (these fields
could also be put into MPVMainEncContext at the cost of
an indirection to access them): lambda_table, adaptive_quant,
{luma,chroma}_elim_threshold, new_pic, fdsp, mpvencdsp, pdsp,
{p,b_forw,b_back,b_bidir_forw,b_bidir_back,b_direct,b_field}_mv_table,
[pb]_field_select_table, mb_{type,var,mean}, mc_mb_var, {min,max}_qcoeff,
{inter,intra}_quant_bias, ac_esc_length, the *_vlc_length fields,
the q_{intra,inter,chroma_intra}_matrix{,16}, dct_offset, mb_info,
mjpeg_ctx, rtp_mode, rtp_payload_size, encode_mb, all function
pointers, mpv_flags, quantizer_noise_shaping,
frame_reconstruction_bitfield, error_rate and intra_penalty.
c) Fields which are already (re)set explicitly: The PutBitContexts
pb, tex_pb, pb2; dquant, skipdct, encoding_error, the statistics
fields {mv,i_tex,p_tex,misc,last}_bits and i_count; last_mv_dir,
esc_pos (reset when writing the header).
d) Fields which are only used by encoders not supporting slice
threading for which synchronisation doesn't matter: esc3_level_length
and the remaining mb_info fields.
e) coded_score: This field is only really used when FF_MPV_FLAG_CBP_RD
is set (which implies trellis) and even then it is only used for
non-intra blocks. For these blocks dct_quantize_trellis_c() either
sets coded_score[n] or returns a last_non_zero value of -1
in which case coded_score will be reset in encode_mb_internal().
Therefore no old values are ever used.

The MotionEstContext has not been moved yet.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-26 04:08:33 +01:00
Krzysztof Pyrkosz
f9b8f30680 avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that.
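
Why the two forms agree can be checked with a scalar model. The sketch
below is not the NEON code from the patch; the function names are
hypothetical, the final clip to the pixel range is left out, and it
assumes that >> on negative values is an arithmetic shift (true for the
usual compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Reference form: widen to 32 bit, add the rounding offset, then shift. */
static int avg_widening(int16_t a, int16_t b, int shift)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    return (sum + (1 << (shift - 1))) >> shift;
}

/* Halving form: floor((a + b) / 2) first, then a single rounding
 * shift right by shift - 1. Requires shift >= 2. */
static int avg_halving(int16_t a, int16_t b, int shift)
{
    assert(shift >= 2);
    /* models a signed halving addition */
    int32_t half = ((int32_t)a + (int32_t)b) >> 1;
    /* models a rounding shift right by (shift - 1) */
    return (half + (1 << (shift - 2))) >> (shift - 1);
}
```

Both functions compute floor((a + b + 2^(shift-1)) / 2^shift). The bit
dropped by the halving step never affects the result, because the
rounding offset that is added afterwards is an integer.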

Benchmarks before and after:
A78
avg_8_64x64_neon:                                     2686.2 ( 6.12x)
avg_8_128x128_neon:                                  10734.2 ( 5.88x)
avg_10_64x64_neon:                                    2536.8 ( 5.40x)
avg_10_128x128_neon:                                 10079.0 ( 5.22x)
avg_12_64x64_neon:                                    2548.2 ( 5.38x)
avg_12_128x128_neon:                                 10133.8 ( 5.19x)

avg_8_64x64_neon:                                      897.8 (18.26x)
avg_8_128x128_neon:                                   3608.5 (17.37x)
avg_10_32x32_neon:                                     444.2 ( 8.51x)
avg_10_64x64_neon:                                    1711.8 ( 8.00x)
avg_12_64x64_neon:                                    1706.2 ( 8.02x)
avg_12_128x128_neon:                                  7010.0 ( 7.46x)

A72
avg_8_64x64_neon:                                     5823.4 ( 3.88x)
avg_8_128x128_neon:                                  17430.5 ( 4.73x)
avg_10_64x64_neon:                                    5228.1 ( 3.71x)
avg_10_128x128_neon:                                 16722.2 ( 4.17x)
avg_12_64x64_neon:                                    5379.1 ( 3.51x)
avg_12_128x128_neon:                                 16715.7 ( 4.17x)

avg_8_64x64_neon:                                     2006.5 (10.61x)
avg_8_128x128_neon:                                   9158.7 ( 8.96x)
avg_10_64x64_neon:                                    3357.7 ( 5.60x)
avg_10_128x128_neon:                                 12411.7 ( 5.56x)
avg_12_64x64_neon:                                    3317.5 ( 5.67x)
avg_12_128x128_neon:                                 12358.5 ( 5.58x)

A53
avg_8_64x64_neon:                                     8327.8 ( 5.18x)
avg_8_128x128_neon:                                  31631.3 ( 5.34x)
avg_10_64x64_neon:                                    8783.5 ( 4.98x)
avg_10_128x128_neon:                                 32617.0 ( 5.25x)
avg_12_64x64_neon:                                    8686.0 ( 5.06x)
avg_12_128x128_neon:                                 32487.5 ( 5.25x)

avg_8_64x64_neon:                                     6032.3 ( 7.17x)
avg_8_128x128_neon:                                  22008.5 ( 7.69x)
avg_10_64x64_neon:                                    7738.0 ( 5.68x)
avg_10_128x128_neon:                                 27813.8 ( 6.14x)
avg_12_64x64_neon:                                    7844.5 ( 5.60x)
avg_12_128x128_neon:                                 26999.5 ( 6.34x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-07 15:51:20 +02:00
Zhao Zhili
3e9777dc75 aarch64/hevcdsp_idct_neon: Add implementation for idct dc 12
Reduce binary size at the same time. The performance compared to clang -O3
is the same.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-03-04 17:01:58 +08:00
Zhao Zhili
5977bff569 aarch64/hevcdsp_idct_neon: Optimize idct dc
clang does better than the assembly code before the patch, especially
for small size:

hevc_idct_4x4_dc_8_c:                                   11.2 ( 1.00x)
hevc_idct_4x4_dc_8_neon:                                15.5 ( 0.73x)
hevc_idct_4x4_dc_10_c:                                  12.0 ( 1.00x)
hevc_idct_4x4_dc_10_neon:                               15.2 ( 0.79x)
hevc_idct_8x8_dc_8_c:                                   13.2 ( 1.00x)
hevc_idct_8x8_dc_8_neon:                                18.2 ( 0.73x)
hevc_idct_8x8_dc_10_c:                                  13.5 ( 1.00x)
hevc_idct_8x8_dc_10_neon:                               17.2 ( 0.78x)
hevc_idct_16x16_dc_8_c:                                 41.8 ( 1.00x)
hevc_idct_16x16_dc_8_neon:                              37.8 ( 1.11x)
hevc_idct_16x16_dc_10_c:                                41.8 ( 1.00x)
hevc_idct_16x16_dc_10_neon:                             37.8 ( 1.11x)
hevc_idct_32x32_dc_8_c:                                130.2 ( 1.00x)
hevc_idct_32x32_dc_8_neon:                             132.2 ( 0.98x)
hevc_idct_32x32_dc_10_c:                               130.2 ( 1.00x)
hevc_idct_32x32_dc_10_neon:                            132.2 ( 0.98x)

This patch basically clones what the compiler does, so the performance
is the same.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-03-04 17:01:58 +08:00
Krzysztof Pyrkosz
71a91485fa avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr
This patch replaces blocks of instructions performing rounding and
widening shifts with one-liners achieving the same result.

Before and after on A78
dmvr_8_12x20_neon:                                      86.2 ( 6.90x)
dmvr_8_20x12_neon:                                      94.8 ( 5.93x)
dmvr_8_20x20_neon:                                     141.5 ( 6.50x)
dmvr_12_12x20_neon:                                    158.0 ( 3.76x)
dmvr_12_20x12_neon:                                    151.2 ( 3.73x)
dmvr_12_20x20_neon:                                    247.2 ( 3.71x)
dmvr_hv_8_12x20_neon:                                  423.2 ( 3.75x)
dmvr_hv_8_20x12_neon:                                  434.0 ( 3.69x)
dmvr_hv_8_20x20_neon:                                  706.0 ( 3.69x)

dmvr_8_12x20_neon:                                      77.2 ( 7.70x)
dmvr_8_20x12_neon:                                      66.5 ( 8.49x)
dmvr_8_20x20_neon:                                      92.2 ( 9.90x)
dmvr_12_12x20_neon:                                     80.2 ( 7.38x)
dmvr_12_20x12_neon:                                     58.2 ( 9.59x)
dmvr_12_20x20_neon:                                     90.0 (10.15x)
dmvr_hv_8_12x20_neon:                                  369.0 ( 4.34x)
dmvr_hv_8_20x12_neon:                                  355.8 ( 4.49x)
dmvr_hv_8_20x20_neon:                                  574.2 ( 4.51x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:35:31 +02:00
Krzysztof Pyrkosz
e8d4c55987 avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
these three at the end.
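
The identity behind this is (a+b)^2 = a^2 + b^2 + 2*a*b and
(a-b)^2 = a^2 + b^2 - 2*a*b. A scalar sketch, assuming inputs small
enough that the 64 bit accumulators do not overflow; the function name
and the output layout are hypothetical and need not match the real DSP
function:

```c
#include <stdint.h>

/* Hypothetical scalar sketch of the reduced-multiply scheme.
 * out[0] = sum a^2, out[1] = sum b^2,
 * out[2] = sum (a+b)^2, out[3] = sum (a-b)^2 */
static void sum_square_butterfly_sketch(int64_t out[4], const int32_t *a,
                                        const int32_t *b, int len)
{
    int64_t aa = 0, bb = 0, ab2 = 0;

    for (int i = 0; i < len; i++) {
        aa  += (int64_t)a[i] * a[i];        /* a^2   */
        bb  += (int64_t)b[i] * b[i];        /* b^2   */
        ab2 += 2 * ((int64_t)a[i] * b[i]);  /* 2*a*b */
    }
    /* Derive the sum and difference terms once, after the loop. */
    out[0] = aa;
    out[1] = bb;
    out[2] = aa + bb + ab2;
    out[3] = aa + bb - ab2;
}
```

The loop body needs three multiplications per element instead of four,
and the two derived sums cost only a few additions at the end.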

Before and after:

A78
ac3_sum_square_bufferfly_int32_neon:                   484.8 ( 2.00x)
ac3_sum_square_bufferfly_int32_neon:                   468.2 ( 2.08x)

A72
ac3_sum_square_bufferfly_int32_neon:                   793.6 ( 1.26x)
ac3_sum_square_bufferfly_int32_neon:                   527.3 ( 1.92x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:53 +02:00
Krzysztof Pyrkosz
9fb97215df avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon
This change removes one extra floating point operation and simplifies
load operations at the beginning of the loop by using a dedicated register
for each of the 5 pointers and interleaving the loads with calculations. The
first case seems to be a bit slower, but the performance increase is
substantial in the other two.

A78 before:
postfilter_15_neon:                                   1684.8 ( 4.23x)
postfilter_512_neon:                                  1395.5 ( 5.10x)
postfilter_1022_neon:                                 1357.0 ( 5.25x)

After:
postfilter_15_neon:                                   1742.2 ( 4.09x)
postfilter_512_neon:                                  1169.8 ( 6.09x)
postfilter_1022_neon:                                 1160.0 ( 6.12x)

A72 before:
postfilter_15_neon:                                   3144.8 ( 2.39x)
postfilter_512_neon:                                  3141.2 ( 2.39x)
postfilter_1022_neon:                                 3230.0 ( 2.33x)

After:
postfilter_15_neon:                                   2847.8 ( 2.64x)
postfilter_512_neon:                                  2877.8 ( 2.61x)
postfilter_1022_neon:                                 2837.2 ( 2.65x)

x13s before:
postfilter_15_neon:                                   1615.4 ( 2.61x)
postfilter_512_neon:                                   963.1 ( 4.39x)
postfilter_1022_neon:                                  963.6 ( 4.39x)

After:
postfilter_15_neon:                                   1749.6 ( 2.41x)
postfilter_512_neon:                                   707.1 ( 5.97x)
postfilter_1022_neon:                                  706.1 ( 5.99x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-10 14:55:16 +02:00
Krzysztof Pyrkosz
83e4b068d9 avcodec/aarch64/aacencdsp: NEON implementation
This patch supplies handwritten NEON code for AAC.

The benchmarks below were collected by invoking these two commands on
each of my boards, A78, A72 and Thinkpad x13s:
1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is
speed without the patch, second, with)

- A78
abs_pow34_c:                                          4161.5 ( 1.00x)
abs_pow34_neon:                                       3586.2 ( 1.16x)
quant_bands_signed_c:                                 5548.0 ( 1.00x)
quant_bands_signed_neon:                              1126.8 ( 4.92x)
quant_bands_unsigned_c:                               3979.2 ( 1.00x)
quant_bands_unsigned_neon:                             800.2 ( 4.97x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=71.6x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=82.3x

- A72
abs_pow34_c:                                         15362.2 ( 1.00x)
abs_pow34_neon:                                      15382.5 ( 1.00x)
quant_bands_signed_c:                                 9926.5 ( 1.00x)
quant_bands_signed_neon:                              2467.8 ( 4.02x)
quant_bands_unsigned_c:                               5469.8 ( 1.00x)
quant_bands_unsigned_neon:                            2089.5 ( 2.62x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=34.3x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=37.8x

- x13s
abs_pow34_c:                                          2413.4 ( 1.00x)
abs_pow34_neon:                                       1796.2 ( 1.34x)
quant_bands_signed_c:                                 2968.9 ( 1.00x)
quant_bands_signed_neon:                               675.6 ( 4.39x)
quant_bands_unsigned_c:                               2311.9 ( 1.00x)
quant_bands_unsigned_neon:                             477.1 ( 4.85x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 135x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 159x

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-01-28 10:44:40 +02:00
Janne Grunau
430c38f698 aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter
This reduces the amount the horizontal filters read beyond the filter
width to a consistent 1 pixel. The data is not used so this is usually
not noticeable. It becomes a problem when the application allocates
frame buffers only for the aligned picture size and the end of it is at
a page boundary. This happens for picture sizes which are a multiple of
the page size like 1280x640. The frame buffer allocation is most likely
done via mmap + MAP_ANONYMOUS, so the start and end of the
buffer are page aligned and the previous and next page are not
necessarily mapped.
Under these conditions, as seen in Firefox, a read beyond the end of the
buffer results in a segfault.
After the over-read is reduced to a single pixel it's reasonable to use
VP9's emulated edge motion compensation for this.
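
The page size arithmetic can be checked with a small sketch. It assumes
a tightly packed 8 bit plane (one byte per pixel, stride equal to the
width), a page-aligned start and 4096 byte pages; the function name is
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when a plane of width * height bytes ends exactly on a
 * page boundary, assuming the buffer itself starts page aligned. */
static bool plane_ends_on_page_boundary(size_t width, size_t height,
                                        size_t page_size)
{
    return (width * height) % page_size == 0;
}
```

For 1280x640 the plane is 819200 bytes, exactly 200 pages of 4096
bytes, so the first byte past the end belongs to the next page, which
may be unmapped. A 1920x1080 plane ends 1024 bytes into a page and
leaves slack after the last pixel.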

Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185
Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2025-01-03 17:53:46 -05:00
Zhao Zhili
952508ae05 aarch64/vvc: Add apply_bdof
Test on rpi 5 with gcc 12:

apply_bdof_8_8x16_c:                                  7315.2 ( 1.00x)
apply_bdof_8_8x16_neon:                               1876.8 ( 3.90x)
apply_bdof_8_16x8_c:                                  7170.5 ( 1.00x)
apply_bdof_8_16x8_neon:                               1752.8 ( 4.09x)
apply_bdof_8_16x16_c:                                14695.2 ( 1.00x)
apply_bdof_8_16x16_neon:                              3490.5 ( 4.21x)
apply_bdof_10_8x16_c:                                 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon:                              1863.8 ( 3.96x)
apply_bdof_10_16x8_c:                                 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon:                              1766.0 ( 4.06x)
apply_bdof_10_16x16_c:                               14551.5 ( 1.00x)
apply_bdof_10_16x16_neon:                             3576.0 ( 4.07x)
apply_bdof_12_8x16_c:                                 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon:                              1863.8 ( 3.88x)
apply_bdof_12_16x8_c:                                 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon:                              1758.8 ( 4.16x)
apply_bdof_12_16x16_c:                               14691.2 ( 1.00x)
apply_bdof_12_16x16_neon:                             3480.5 ( 4.22x)
2024-12-21 11:54:44 +08:00
Martin Storsjö
2bb00ef59c aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
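
The encoding rule behind this can be sketched in C. The helper names
are hypothetical; the ranges are those of the A64 immediate forms: the
scaled unsigned-offset ldr takes a multiple of the access size with a
scaled value of 0..4095, and ldur takes a signed byte offset of
-256..255.

```c
#include <stdbool.h>

/* ldr with scaled unsigned offset: the byte offset must be a
 * non-negative multiple of the access size, scaled value <= 4095. */
static bool fits_ldr_scaled(int offset, int access_size)
{
    return offset >= 0 && offset % access_size == 0 &&
           offset / access_size <= 4095;
}

/* ldur: unscaled signed 9 bit byte offset. */
static bool fits_ldur(int offset)
{
    return offset >= -256 && offset <= 255;
}
```

A 4 byte load at offset 1 (the s5 case below) and an 8 byte load at
offset 2 (the d7 case) both fail the scaled check and fit ldur.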

This fixes these build errors:

    ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
     error A2518: operand 2: Memory offset must be aligned
            ldr             s5, [x1, #1]
    ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
     error A2518: operand 2: Memory offset must be aligned
            ldr             d7, [x1, #2]

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-18 13:45:09 +02:00
Bin Peng
72a3656e84 lavc/aarch64: Fix ff_pred16x16_plane_neon_10
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840

Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Bin Peng
decc9e643c lavc/aarch64: Fix ff_pred8x8_plane_neon_10
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612

The mismatch between neon and C functions can also be reproduced using the following bitstream and command line.

wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
 ./ffmpeg -cpuflags 0  -threads 1 -i intra8x8pred_10bit.264  -f framemd5 -y md5_ref
 ./ffmpeg              -threads 1 -i intra8x8pred_10bit.264  -f framemd5 -y md5_neon

Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Zhao Zhili
40feba5f77 aarch64/vvc: Fix clip in alf
Fix test failure:
./tests/checkasm/checkasm --test=vvc_alf 3607569773
2024-12-10 21:00:47 +08:00
Zhao Zhili
91436638de aarch64/vvc: Use faster clip operation
Replace sqxtn+smin+smax by sqxtun+umin.
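
A scalar model of why the shorter sequence gives the same result. It
assumes a 32 bit to 16 bit narrowing and a maximum pixel value of at
most 32767; the function names are hypothetical and the lane widths in
the actual assembly may differ.

```c
#include <stdint.h>

/* Models sqxtn + smin + smax: signed saturating narrow, then clamp
 * to [0, max_pixel] with two signed operations. */
static int16_t clip_signed_path(int32_t v, int16_t max_pixel)
{
    int32_t n = v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v;
    if (n > max_pixel)
        n = max_pixel;  /* smin */
    if (n < 0)
        n = 0;          /* smax */
    return (int16_t)n;
}

/* Models sqxtun + umin: the signed to unsigned saturating narrow
 * already clamps negative values to 0, so one unsigned min is left. */
static uint16_t clip_unsigned_path(int32_t v, uint16_t max_pixel)
{
    uint32_t n = v < 0 ? 0 : v > UINT16_MAX ? UINT16_MAX : (uint32_t)v;
    if (n > max_pixel)
        n = max_pixel;  /* umin */
    return (uint16_t)n;
}
```

The lower clamp is folded into the narrowing step, which saves one
instruction per vector.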
2024-12-10 21:00:47 +08:00
Zhao Zhili
bfed5f6b7d aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma 2024-12-10 21:00:47 +08:00
Zhao Zhili
5988a2729b aarch64/vvc: Add dmvr
dmvr_8_12x20_c:                                          1.5 ( 1.00x)
dmvr_8_12x20_neon:                                       0.2 ( 6.56x)
dmvr_8_20x12_c:                                          1.0 ( 1.00x)
dmvr_8_20x12_neon:                                       0.2 ( 4.33x)
dmvr_8_20x20_c:                                          1.7 ( 1.00x)
dmvr_8_20x20_neon:                                       0.5 ( 3.63x)
dmvr_12_12x20_c:                                         2.2 ( 1.00x)
dmvr_12_12x20_neon:                                      0.5 ( 4.68x)
dmvr_12_20x12_c:                                         2.0 ( 1.00x)
dmvr_12_20x12_neon:                                      0.5 ( 4.16x)
dmvr_12_20x20_c:                                         3.7 ( 1.00x)
dmvr_12_20x20_neon:                                      0.7 ( 5.14x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
bcd65ebd8f aarch64/vvc: Add dmvr_hv
dmvr_hv_8_12x20_c:                                       8.0 ( 1.00x)
dmvr_hv_8_12x20_neon:                                    1.2 ( 6.62x)
dmvr_hv_8_20x12_c:                                       8.0 ( 1.00x)
dmvr_hv_8_20x12_neon:                                    0.9 ( 8.37x)
dmvr_hv_8_20x20_c:                                      12.9 ( 1.00x)
dmvr_hv_8_20x20_neon:                                    1.7 ( 7.62x)
dmvr_hv_10_12x20_c:                                      7.0 ( 1.00x)
dmvr_hv_10_12x20_neon:                                   1.7 ( 4.09x)
dmvr_hv_10_20x12_c:                                      7.0 ( 1.00x)
dmvr_hv_10_20x12_neon:                                   1.7 ( 4.09x)
dmvr_hv_10_20x20_c:                                     11.2 ( 1.00x)
dmvr_hv_10_20x20_neon:                                   2.7 ( 4.15x)
dmvr_hv_12_12x20_c:                                      6.5 ( 1.00x)
dmvr_hv_12_12x20_neon:                                   1.7 ( 3.79x)
dmvr_hv_12_20x12_c:                                      6.5 ( 1.00x)
dmvr_hv_12_20x12_neon:                                   1.7 ( 3.79x)
dmvr_hv_12_20x20_c:                                     10.2 ( 1.00x)
dmvr_hv_12_20x20_neon:                                   2.2 ( 4.64x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
0ba9e8d0d4 aarch64/vvc: Add w_avg
w_avg_8_2x2_c:                                           0.0 ( 0.00x)
w_avg_8_2x2_neon:                                        0.0 ( 0.00x)
w_avg_8_4x4_c:                                           0.2 ( 1.00x)
w_avg_8_4x4_neon:                                        0.0 ( 0.00x)
w_avg_8_8x8_c:                                           1.2 ( 1.00x)
w_avg_8_8x8_neon:                                        0.2 ( 5.00x)
w_avg_8_16x16_c:                                         4.2 ( 1.00x)
w_avg_8_16x16_neon:                                      0.8 ( 5.67x)
w_avg_8_32x32_c:                                        16.2 ( 1.00x)
w_avg_8_32x32_neon:                                      2.5 ( 6.50x)
w_avg_8_64x64_c:                                        64.5 ( 1.00x)
w_avg_8_64x64_neon:                                      9.0 ( 7.17x)
w_avg_8_128x128_c:                                     269.5 ( 1.00x)
w_avg_8_128x128_neon:                                   35.5 ( 7.59x)
w_avg_10_2x2_c:                                          0.2 ( 1.00x)
w_avg_10_2x2_neon:                                       0.2 ( 1.00x)
w_avg_10_4x4_c:                                          0.2 ( 1.00x)
w_avg_10_4x4_neon:                                       0.2 ( 1.00x)
w_avg_10_8x8_c:                                          1.0 ( 1.00x)
w_avg_10_8x8_neon:                                       0.2 ( 4.00x)
w_avg_10_16x16_c:                                        4.2 ( 1.00x)
w_avg_10_16x16_neon:                                     0.8 ( 5.67x)
w_avg_10_32x32_c:                                       16.2 ( 1.00x)
w_avg_10_32x32_neon:                                     2.5 ( 6.50x)
w_avg_10_64x64_c:                                       66.2 ( 1.00x)
w_avg_10_64x64_neon:                                    10.0 ( 6.62x)
w_avg_10_128x128_c:                                    277.8 ( 1.00x)
w_avg_10_128x128_neon:                                  39.8 ( 6.99x)
w_avg_12_2x2_c:                                          0.0 ( 0.00x)
w_avg_12_2x2_neon:                                       0.2 ( 0.00x)
w_avg_12_4x4_c:                                          0.2 ( 1.00x)
w_avg_12_4x4_neon:                                       0.0 ( 0.00x)
w_avg_12_8x8_c:                                          1.2 ( 1.00x)
w_avg_12_8x8_neon:                                       0.5 ( 2.50x)
w_avg_12_16x16_c:                                        4.8 ( 1.00x)
w_avg_12_16x16_neon:                                     0.8 ( 6.33x)
w_avg_12_32x32_c:                                       17.0 ( 1.00x)
w_avg_12_32x32_neon:                                     2.8 ( 6.18x)
w_avg_12_64x64_c:                                       64.0 ( 1.00x)
w_avg_12_64x64_neon:                                    10.0 ( 6.40x)
w_avg_12_128x128_c:                                    269.2 ( 1.00x)
w_avg_12_128x128_neon:                                  42.0 ( 6.41x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Martin Storsjö
a3ec1f8c6c aarch64: h26x: Fix the indentation of one function
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-09-26 13:42:11 +03:00
Zhao Zhili
3f84d1d1fb aarch64/vvc: Add avg
avg_8_2x2_c:                                             0.2 ( 1.00x)
avg_8_2x2_neon:                                          0.2 ( 1.00x)
avg_8_4x4_c:                                             0.2 ( 1.00x)
avg_8_4x4_neon:                                          0.2 ( 1.00x)
avg_8_8x8_c:                                             0.9 ( 1.00x)
avg_8_8x8_neon:                                          0.2 ( 5.29x)
avg_8_16x16_c:                                           3.7 ( 1.00x)
avg_8_16x16_neon:                                        0.7 ( 5.44x)
avg_8_32x32_c:                                          14.9 ( 1.00x)
avg_8_32x32_neon:                                        1.7 ( 8.91x)
avg_8_64x64_c:                                          59.7 ( 1.00x)
avg_8_64x64_neon:                                        6.9 ( 8.62x)
avg_8_128x128_c:                                       254.7 ( 1.00x)
avg_8_128x128_neon:                                     26.9 ( 9.46x)
avg_10_2x2_c:                                            0.2 ( 1.00x)
avg_10_2x2_neon:                                         0.2 ( 1.00x)
avg_10_4x4_c:                                            0.2 ( 1.00x)
avg_10_4x4_neon:                                         0.2 ( 1.00x)
avg_10_8x8_c:                                            0.9 ( 1.00x)
avg_10_8x8_neon:                                         0.2 ( 5.29x)
avg_10_16x16_c:                                          3.4 ( 1.00x)
avg_10_16x16_neon:                                       0.4 ( 8.06x)
avg_10_32x32_c:                                         13.9 ( 1.00x)
avg_10_32x32_neon:                                       1.9 ( 7.23x)
avg_10_64x64_c:                                         54.2 ( 1.00x)
avg_10_64x64_neon:                                       8.4 ( 6.43x)
avg_10_128x128_c:                                      232.4 ( 1.00x)
avg_10_128x128_neon:                                    30.9 ( 7.52x)
avg_12_2x2_c:                                            0.0 ( 0.00x)
avg_12_2x2_neon:                                         0.2 ( 0.00x)
avg_12_4x4_c:                                            0.4 ( 1.00x)
avg_12_4x4_neon:                                         0.2 ( 2.43x)
avg_12_8x8_c:                                            0.7 ( 1.00x)
avg_12_8x8_neon:                                         0.2 ( 3.86x)
avg_12_16x16_c:                                          3.7 ( 1.00x)
avg_12_16x16_neon:                                       0.4 ( 8.65x)
avg_12_32x32_c:                                         13.7 ( 1.00x)
avg_12_32x32_neon:                                       2.2 ( 6.29x)
avg_12_64x64_c:                                         53.9 ( 1.00x)
avg_12_64x64_neon:                                       7.7 ( 7.03x)
avg_12_128x128_c:                                      270.9 ( 1.00x)
avg_12_128x128_neon:                                    30.4 ( 8.90x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
1be5a2374f aarch64/vvc: Add put_epel_hv
On Apple M1:

put_chroma_hv_8_4x4_c:                                   1.7 ( 1.00x)
put_chroma_hv_8_4x4_neon:                                0.2 ( 7.67x)
put_chroma_hv_8_8x8_c:                                   5.5 ( 1.00x)
put_chroma_hv_8_8x8_neon:                                0.5 (11.53x)
put_chroma_hv_8_16x16_c:                                18.5 ( 1.00x)
put_chroma_hv_8_16x16_neon:                              1.5 (12.53x)
put_chroma_hv_8_32x32_c:                                72.5 ( 1.00x)
put_chroma_hv_8_32x32_neon:                              4.7 (15.34x)
put_chroma_hv_8_64x64_c:                               274.0 ( 1.00x)
put_chroma_hv_8_64x64_neon:                             18.5 (14.83x)
put_chroma_hv_8_128x128_c:                            1058.7 ( 1.00x)
put_chroma_hv_8_128x128_neon:                           75.2 (14.07x)

On Android Pixel 8 Pro:

put_chroma_hv_8_4x4_c:                                   1.2 ( 1.00x)
put_chroma_hv_8_4x4_neon:                                0.0 ( 0.00x)
put_chroma_hv_8_4x4_i8mm:                                0.2 ( 5.00x)
put_chroma_hv_8_8x8_c:                                   4.0 ( 1.00x)
put_chroma_hv_8_8x8_neon:                                0.5 ( 8.00x)
put_chroma_hv_8_8x8_i8mm:                                0.5 ( 8.00x)
put_chroma_hv_8_16x16_c:                                15.2 ( 1.00x)
put_chroma_hv_8_16x16_neon:                              2.5 ( 6.10x)
put_chroma_hv_8_16x16_i8mm:                              2.2 ( 6.78x)
put_chroma_hv_8_32x32_c:                                61.0 ( 1.00x)
put_chroma_hv_8_32x32_neon:                              9.8 ( 6.26x)
put_chroma_hv_8_32x32_i8mm:                              8.5 ( 7.18x)
put_chroma_hv_8_64x64_c:                               229.5 ( 1.00x)
put_chroma_hv_8_64x64_neon:                             38.5 ( 5.96x)
put_chroma_hv_8_64x64_i8mm:                             34.0 ( 6.75x)
put_chroma_hv_8_128x128_c:                             919.8 ( 1.00x)
put_chroma_hv_8_128x128_neon:                          154.5 ( 5.95x)
put_chroma_hv_8_128x128_i8mm:                          140.0 ( 6.57x)
2024-09-14 16:36:34 +08:00