Commit graph

439 commits

Author SHA1 Message Date
Krzysztof Pyrkosz
83e4b068d9 avcodec/aarch64/aacencdsp: NEON implementation
This patch supplies handwritten NEON code for AAC.

The benchmarks below were collected by invoking these two commands on
each of my boards, A78, A72 and Thinkpad x13s:
1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is
speed without the patch, second, with)

- A78
abs_pow34_c:                                          4161.5 ( 1.00x)
abs_pow34_neon:                                       3586.2 ( 1.16x)
quant_bands_signed_c:                                 5548.0 ( 1.00x)
quant_bands_signed_neon:                              1126.8 ( 4.92x)
quant_bands_unsigned_c:                               3979.2 ( 1.00x)
quant_bands_unsigned_neon:                             800.2 ( 4.97x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=71.6x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=82.3x

- A72
abs_pow34_c:                                         15362.2 ( 1.00x)
abs_pow34_neon:                                      15382.5 ( 1.00x)
quant_bands_signed_c:                                 9926.5 ( 1.00x)
quant_bands_signed_neon:                              2467.8 ( 4.02x)
quant_bands_unsigned_c:                               5469.8 ( 1.00x)
quant_bands_unsigned_neon:                            2089.5 ( 2.62x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=34.3x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=37.8

- x13s
abs_pow34_c:                                          2413.4 ( 1.00x)
abs_pow34_neon:                                       1796.2 ( 1.34x)
quant_bands_signed_c:                                 2968.9 ( 1.00x)
quant_bands_signed_neon:                               675.6 ( 4.39x)
quant_bands_unsigned_c:                               2311.9 ( 1.00x)
quant_bands_unsigned_neon:                             477.1 ( 4.85x)

size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 135x
size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 159x

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-01-28 10:44:40 +02:00
Janne Grunau
430c38f698 aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter
This reduces the amount the horizontal filters read beyond the filter
width to a consistent 1 pixel. The data is not used so this is usually
not noticeable. It becomes a problem when the application allocates
frame buffers only for the aligned picture size and the end of it is at
a page boundary. This happens for picture sizes which are a multiple of
the page size like 1280x640. The frame buffer allocation is based on
its most likely done via mmap + MAP_ANONYMOUS so start and end of the
buffer are page aligned and the previous and next page are not
necessarily mapped.
Under these conditions like seen by Firefox a read beyond the end of the
buffer results in a segfault.
After the over-read is reduced to a single pixel it's reasonable to use
VP9's emulated edge motion compensation for this.

Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185
Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2025-01-03 17:53:46 -05:00
Zhao Zhili
952508ae05 aarch64/vvc: Add apply_bdof
Test on rpi 5 with gcc 12:

apply_bdof_8_8x16_c:                                  7315.2 ( 1.00x)
apply_bdof_8_8x16_neon:                               1876.8 ( 3.90x)
apply_bdof_8_16x8_c:                                  7170.5 ( 1.00x)
apply_bdof_8_16x8_neon:                               1752.8 ( 4.09x)
apply_bdof_8_16x16_c:                                14695.2 ( 1.00x)
apply_bdof_8_16x16_neon:                              3490.5 ( 4.21x)
apply_bdof_10_8x16_c:                                 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon:                              1863.8 ( 3.96x)
apply_bdof_10_16x8_c:                                 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon:                              1766.0 ( 4.06x)
apply_bdof_10_16x16_c:                               14551.5 ( 1.00x)
apply_bdof_10_16x16_neon:                             3576.0 ( 4.07x)
apply_bdof_12_8x16_c:                                 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon:                              1863.8 ( 3.88x)
apply_bdof_12_16x8_c:                                 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon:                              1758.8 ( 4.16x)
apply_bdof_12_16x16_c:                               14691.2 ( 1.00x)
apply_bdof_12_16x16_neon:                             3480.5 ( 4.22x)
2024-12-21 11:54:44 +08:00
Martin Storsjö
2bb00ef59c aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.

This fixes these build errors:

    ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
     error A2518: operand 2: Memory offset must be aligned
            ldr             s5, [x1, #1]
    ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
     error A2518: operand 2: Memory offset must be aligned
            ldr             d7, [x1, #2]

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-18 13:45:09 +02:00
Bin Peng
72a3656e84 lavc/aarch64: Fix ff_pred16x16_plane_neon_10
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840

Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Bin Peng
decc9e643c lavc/aarch64: Fix ff_pred8x8_plane_neon_10
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612

The mismatch between neon and C functions can also be reproduced using the following bitstream and command line.

wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
 ./ffmpeg -cpuflags 0  -threads 1 -i intra8x8pred_10bit.264  -f framemd5 -y md5_ref
 ./ffmpeg              -threads 1 -i intra8x8pred_10bit.264  -f framemd5 -y md5_neon

Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Zhao Zhili
40feba5f77 aarch64/vvc: Fix clip in alf
Fix test failure:
./tests/checkasm/checkasm --test=vvc_alf 3607569773
2024-12-10 21:00:47 +08:00
Zhao Zhili
91436638de aarch64/vvc: Use faster clip operation
Replace sqxtn+smin+smax by sqxtun+umin.
2024-12-10 21:00:47 +08:00
Zhao Zhili
bfed5f6b7d aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma 2024-12-10 21:00:47 +08:00
Zhao Zhili
5988a2729b aarch64/vvc: Add dmvr
dmvr_8_12x20_c:                                          1.5 ( 1.00x)
dmvr_8_12x20_neon:                                       0.2 ( 6.56x)
dmvr_8_20x12_c:                                          1.0 ( 1.00x)
dmvr_8_20x12_neon:                                       0.2 ( 4.33x)
dmvr_8_20x20_c:                                          1.7 ( 1.00x)
dmvr_8_20x20_neon:                                       0.5 ( 3.63x)
dmvr_12_12x20_c:                                         2.2 ( 1.00x)
dmvr_12_12x20_neon:                                      0.5 ( 4.68x)
dmvr_12_20x12_c:                                         2.0 ( 1.00x)
dmvr_12_20x12_neon:                                      0.5 ( 4.16x)
dmvr_12_20x20_c:                                         3.7 ( 1.00x)
dmvr_12_20x20_neon:                                      0.7 ( 5.14x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
bcd65ebd8f aarch64/vvc: Add dmvr_hv
dmvr_hv_8_12x20_c:                                       8.0 ( 1.00x)
dmvr_hv_8_12x20_neon:                                    1.2 ( 6.62x)
dmvr_hv_8_20x12_c:                                       8.0 ( 1.00x)
dmvr_hv_8_20x12_neon:                                    0.9 ( 8.37x)
dmvr_hv_8_20x20_c:                                      12.9 ( 1.00x)
dmvr_hv_8_20x20_neon:                                    1.7 ( 7.62x)
dmvr_hv_10_12x20_c:                                      7.0 ( 1.00x)
dmvr_hv_10_12x20_neon:                                   1.7 ( 4.09x)
dmvr_hv_10_20x12_c:                                      7.0 ( 1.00x)
dmvr_hv_10_20x12_neon:                                   1.7 ( 4.09x)
dmvr_hv_10_20x20_c:                                     11.2 ( 1.00x)
dmvr_hv_10_20x20_neon:                                   2.7 ( 4.15x)
dmvr_hv_12_12x20_c:                                      6.5 ( 1.00x)
dmvr_hv_12_12x20_neon:                                   1.7 ( 3.79x)
dmvr_hv_12_20x12_c:                                      6.5 ( 1.00x)
dmvr_hv_12_20x12_neon:                                   1.7 ( 3.79x)
dmvr_hv_12_20x20_c:                                     10.2 ( 1.00x)
dmvr_hv_12_20x20_neon:                                   2.2 ( 4.64x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
0ba9e8d0d4 aarch64/vvc: Add w_avg
w_avg_8_2x2_c:                                           0.0 ( 0.00x)
w_avg_8_2x2_neon:                                        0.0 ( 0.00x)
w_avg_8_4x4_c:                                           0.2 ( 1.00x)
w_avg_8_4x4_neon:                                        0.0 ( 0.00x)
w_avg_8_8x8_c:                                           1.2 ( 1.00x)
w_avg_8_8x8_neon:                                        0.2 ( 5.00x)
w_avg_8_16x16_c:                                         4.2 ( 1.00x)
w_avg_8_16x16_neon:                                      0.8 ( 5.67x)
w_avg_8_32x32_c:                                        16.2 ( 1.00x)
w_avg_8_32x32_neon:                                      2.5 ( 6.50x)
w_avg_8_64x64_c:                                        64.5 ( 1.00x)
w_avg_8_64x64_neon:                                      9.0 ( 7.17x)
w_avg_8_128x128_c:                                     269.5 ( 1.00x)
w_avg_8_128x128_neon:                                   35.5 ( 7.59x)
w_avg_10_2x2_c:                                          0.2 ( 1.00x)
w_avg_10_2x2_neon:                                       0.2 ( 1.00x)
w_avg_10_4x4_c:                                          0.2 ( 1.00x)
w_avg_10_4x4_neon:                                       0.2 ( 1.00x)
w_avg_10_8x8_c:                                          1.0 ( 1.00x)
w_avg_10_8x8_neon:                                       0.2 ( 4.00x)
w_avg_10_16x16_c:                                        4.2 ( 1.00x)
w_avg_10_16x16_neon:                                     0.8 ( 5.67x)
w_avg_10_32x32_c:                                       16.2 ( 1.00x)
w_avg_10_32x32_neon:                                     2.5 ( 6.50x)
w_avg_10_64x64_c:                                       66.2 ( 1.00x)
w_avg_10_64x64_neon:                                    10.0 ( 6.62x)
w_avg_10_128x128_c:                                    277.8 ( 1.00x)
w_avg_10_128x128_neon:                                  39.8 ( 6.99x)
w_avg_12_2x2_c:                                          0.0 ( 0.00x)
w_avg_12_2x2_neon:                                       0.2 ( 0.00x)
w_avg_12_4x4_c:                                          0.2 ( 1.00x)
w_avg_12_4x4_neon:                                       0.0 ( 0.00x)
w_avg_12_8x8_c:                                          1.2 ( 1.00x)
w_avg_12_8x8_neon:                                       0.5 ( 2.50x)
w_avg_12_16x16_c:                                        4.8 ( 1.00x)
w_avg_12_16x16_neon:                                     0.8 ( 6.33x)
w_avg_12_32x32_c:                                       17.0 ( 1.00x)
w_avg_12_32x32_neon:                                     2.8 ( 6.18x)
w_avg_12_64x64_c:                                       64.0 ( 1.00x)
w_avg_12_64x64_neon:                                    10.0 ( 6.40x)
w_avg_12_128x128_c:                                    269.2 ( 1.00x)
w_avg_12_128x128_neon:                                  42.0 ( 6.41x)

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Martin Storsjö
a3ec1f8c6c aarch64: h26x: Fix the indentation of one function
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-09-26 13:42:11 +03:00
Zhao Zhili
3f84d1d1fb aarch64/vvc: Add avg
avg_8_2x2_c:                                             0.2 ( 1.00x)
avg_8_2x2_neon:                                          0.2 ( 1.00x)
avg_8_4x4_c:                                             0.2 ( 1.00x)
avg_8_4x4_neon:                                          0.2 ( 1.00x)
avg_8_8x8_c:                                             0.9 ( 1.00x)
avg_8_8x8_neon:                                          0.2 ( 5.29x)
avg_8_16x16_c:                                           3.7 ( 1.00x)
avg_8_16x16_neon:                                        0.7 ( 5.44x)
avg_8_32x32_c:                                          14.9 ( 1.00x)
avg_8_32x32_neon:                                        1.7 ( 8.91x)
avg_8_64x64_c:                                          59.7 ( 1.00x)
avg_8_64x64_neon:                                        6.9 ( 8.62x)
avg_8_128x128_c:                                       254.7 ( 1.00x)
avg_8_128x128_neon:                                     26.9 ( 9.46x)
avg_10_2x2_c:                                            0.2 ( 1.00x)
avg_10_2x2_neon:                                         0.2 ( 1.00x)
avg_10_4x4_c:                                            0.2 ( 1.00x)
avg_10_4x4_neon:                                         0.2 ( 1.00x)
avg_10_8x8_c:                                            0.9 ( 1.00x)
avg_10_8x8_neon:                                         0.2 ( 5.29x)
avg_10_16x16_c:                                          3.4 ( 1.00x)
avg_10_16x16_neon:                                       0.4 ( 8.06x)
avg_10_32x32_c:                                         13.9 ( 1.00x)
avg_10_32x32_neon:                                       1.9 ( 7.23x)
avg_10_64x64_c:                                         54.2 ( 1.00x)
avg_10_64x64_neon:                                       8.4 ( 6.43x)
avg_10_128x128_c:                                      232.4 ( 1.00x)
avg_10_128x128_neon:                                    30.9 ( 7.52x)
avg_12_2x2_c:                                            0.0 ( 0.00x)
avg_12_2x2_neon:                                         0.2 ( 0.00x)
avg_12_4x4_c:                                            0.4 ( 1.00x)
avg_12_4x4_neon:                                         0.2 ( 2.43x)
avg_12_8x8_c:                                            0.7 ( 1.00x)
avg_12_8x8_neon:                                         0.2 ( 3.86x)
avg_12_16x16_c:                                          3.7 ( 1.00x)
avg_12_16x16_neon:                                       0.4 ( 8.65x)
avg_12_32x32_c:                                         13.7 ( 1.00x)
avg_12_32x32_neon:                                       2.2 ( 6.29x)
avg_12_64x64_c:                                         53.9 ( 1.00x)
avg_12_64x64_neon:                                       7.7 ( 7.03x)
avg_12_128x128_c:                                      270.9 ( 1.00x)
avg_12_128x128_neon:                                    30.4 ( 8.90x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
1be5a2374f aarch64/vvc: Add put_epel_hv
On Apple M1:

put_chroma_hv_8_4x4_c:                                   1.7 ( 1.00x)
put_chroma_hv_8_4x4_neon:                                0.2 ( 7.67x)
put_chroma_hv_8_8x8_c:                                   5.5 ( 1.00x)
put_chroma_hv_8_8x8_neon:                                0.5 (11.53x)
put_chroma_hv_8_16x16_c:                                18.5 ( 1.00x)
put_chroma_hv_8_16x16_neon:                              1.5 (12.53x)
put_chroma_hv_8_32x32_c:                                72.5 ( 1.00x)
put_chroma_hv_8_32x32_neon:                              4.7 (15.34x)
put_chroma_hv_8_64x64_c:                               274.0 ( 1.00x)
put_chroma_hv_8_64x64_neon:                             18.5 (14.83x)
put_chroma_hv_8_128x128_c:                            1058.7 ( 1.00x)
put_chroma_hv_8_128x128_neon:                           75.2 (14.07x)

On Android Pixel 8 Pro:

put_chroma_hv_8_4x4_c:                                   1.2 ( 1.00x)
put_chroma_hv_8_4x4_neon:                                0.0 ( 0.00x)
put_chroma_hv_8_4x4_i8mm:                                0.2 ( 5.00x)
put_chroma_hv_8_8x8_c:                                   4.0 ( 1.00x)
put_chroma_hv_8_8x8_neon:                                0.5 ( 8.00x)
put_chroma_hv_8_8x8_i8mm:                                0.5 ( 8.00x)
put_chroma_hv_8_16x16_c:                                15.2 ( 1.00x)
put_chroma_hv_8_16x16_neon:                              2.5 ( 6.10x)
put_chroma_hv_8_16x16_i8mm:                              2.2 ( 6.78x)
put_chroma_hv_8_32x32_c:                                61.0 ( 1.00x)
put_chroma_hv_8_32x32_neon:                              9.8 ( 6.26x)
put_chroma_hv_8_32x32_i8mm:                              8.5 ( 7.18x)
put_chroma_hv_8_64x64_c:                               229.5 ( 1.00x)
put_chroma_hv_8_64x64_neon:                             38.5 ( 5.96x)
put_chroma_hv_8_64x64_i8mm:                             34.0 ( 6.75x)
put_chroma_hv_8_128x128_c:                             919.8 ( 1.00x)
put_chroma_hv_8_128x128_neon:                          154.5 ( 5.95x)
put_chroma_hv_8_128x128_i8mm:                          140.0 ( 6.57x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
0dcf204e5d aarch64/vvc: Add put_epel_h i8mm
put_chroma_h_8_4x4_c:                                    0.4 ( 1.00x)
put_chroma_h_8_4x4_neon:                                 0.0 ( 0.00x)
put_chroma_h_8_4x4_i8mm:                                 0.1 ( 2.67x)
put_chroma_h_8_8x8_c:                                    1.6 ( 1.00x)
put_chroma_h_8_8x8_neon:                                 0.1 (11.00x)
put_chroma_h_8_8x8_i8mm:                                 0.1 (11.00x)
put_chroma_h_8_16x16_c:                                  6.9 ( 1.00x)
put_chroma_h_8_16x16_neon:                               1.1 ( 6.00x)
put_chroma_h_8_16x16_i8mm:                               0.7 (10.62x)
put_chroma_h_8_32x32_c:                                 27.6 ( 1.00x)
put_chroma_h_8_32x32_neon:                               4.7 ( 5.95x)
put_chroma_h_8_32x32_i8mm:                               4.4 ( 6.28x)
put_chroma_h_8_64x64_c:                                116.2 ( 1.00x)
put_chroma_h_8_64x64_neon:                              19.1 ( 6.07x)
put_chroma_h_8_64x64_i8mm:                              17.1 ( 6.77x)
put_chroma_h_8_128x128_c:                              466.6 ( 1.00x)
put_chroma_h_8_128x128_neon:                            81.4 ( 5.73x)
put_chroma_h_8_128x128_i8mm:                            71.7 ( 6.51x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
41a1885f7a aarch64/vvc: Add put_epel_h
put_chroma_h_8_4x4_c:                                    0.2 ( 1.00x)
put_chroma_h_8_4x4_neon:                                 0.2 ( 1.00x)
put_chroma_h_8_8x8_c:                                    0.8 ( 1.00x)
put_chroma_h_8_8x8_neon:                                 0.2 ( 3.00x)
put_chroma_h_8_16x16_c:                                  3.8 ( 1.00x)
put_chroma_h_8_16x16_neon:                               0.8 ( 5.00x)
put_chroma_h_8_32x32_c:                                 12.5 ( 1.00x)
put_chroma_h_8_32x32_neon:                               2.2 ( 5.56x)
put_chroma_h_8_64x64_c:                                 47.0 ( 1.00x)
put_chroma_h_8_64x64_neon:                               8.8 ( 5.37x)
put_chroma_h_8_128x128_c:                              200.2 ( 1.00x)
put_chroma_h_8_128x128_neon:                            31.8 ( 6.31x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
260e1b4b62 aarch64/vvc: Add sad
sad_8x16_c:                                              0.8 ( 1.00x)
sad_8x16_neon:                                           0.2 ( 3.00x)
sad_16x8_c:                                              0.5 ( 1.00x)
sad_16x8_neon:                                           0.2 ( 2.00x)
sad_16x16_c:                                             1.5 ( 1.00x)
sad_16x16_neon:                                          0.2 ( 6.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
5ac6925803 aarch64/vvc: Add put_qpel_hv
With Apple M1 (no i8mm):

put_luma_hv_8_4x4_c:                                     2.2 ( 1.00x)
put_luma_hv_8_4x4_neon:                                  0.8 ( 3.00x)
put_luma_hv_8_8x8_c:                                     7.0 ( 1.00x)
put_luma_hv_8_8x8_neon:                                  0.8 ( 9.33x)
put_luma_hv_8_16x16_c:                                  22.8 ( 1.00x)
put_luma_hv_8_16x16_neon:                                2.5 ( 9.10x)
put_luma_hv_8_32x32_c:                                  84.8 ( 1.00x)
put_luma_hv_8_32x32_neon:                                9.5 ( 8.92x)
put_luma_hv_8_64x64_c:                                 333.0 ( 1.00x)
put_luma_hv_8_64x64_neon:                               35.5 ( 9.38x)
put_luma_hv_8_128x128_c:                              1294.5 ( 1.00x)
put_luma_hv_8_128x128_neon:                            137.8 ( 9.40x)

With Pixel 8 Pro:

put_luma_hv_8_4x4_c:                                     5.0 ( 1.00x)
put_luma_hv_8_4x4_neon:                                  0.8 ( 6.67x)
put_luma_hv_8_4x4_i8mm:                                  0.2 (20.00x)
put_luma_hv_8_8x8_c:                                    13.2 ( 1.00x)
put_luma_hv_8_8x8_neon:                                  1.2 (10.60x)
put_luma_hv_8_8x8_i8mm:                                  1.2 (10.60x)
put_luma_hv_8_16x16_c:                                  44.2 ( 1.00x)
put_luma_hv_8_16x16_neon:                                4.5 ( 9.83x)
put_luma_hv_8_16x16_i8mm:                                4.2 (10.41x)
put_luma_hv_8_32x32_c:                                 160.8 ( 1.00x)
put_luma_hv_8_32x32_neon:                               17.5 ( 9.19x)
put_luma_hv_8_32x32_i8mm:                               16.0 (10.05x)
put_luma_hv_8_64x64_c:                                 611.2 ( 1.00x)
put_luma_hv_8_64x64_neon:                               68.0 ( 8.99x)
put_luma_hv_8_64x64_i8mm:                               62.2 ( 9.82x)
put_luma_hv_8_128x128_c:                              2384.8 ( 1.00x)
put_luma_hv_8_128x128_neon:                            268.8 ( 8.87x)
put_luma_hv_8_128x128_i8mm:                            245.8 ( 9.70x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
a0b52afd32 aarch64/vvc: Add put_qpel_vx
put_luma_v_8_4x4_c:                                      1.0 ( 1.00x)
put_luma_v_8_4x4_neon:                                   0.0 ( 0.00x)
put_luma_v_8_8x8_c:                                      3.5 ( 1.00x)
put_luma_v_8_8x8_neon:                                   0.5 ( 7.00x)
put_luma_v_8_16x16_c:                                   13.8 ( 1.00x)
put_luma_v_8_16x16_neon:                                 1.2 (11.00x)
put_luma_v_8_32x32_c:                                   54.2 ( 1.00x)
put_luma_v_8_32x32_neon:                                 5.0 (10.85x)
put_luma_v_8_64x64_c:                                  217.5 ( 1.00x)
put_luma_v_8_64x64_neon:                                18.8 (11.60x)
put_luma_v_8_128x128_c:                                886.2 ( 1.00x)
put_luma_v_8_128x128_neon:                              74.0 (11.98x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
b051bc7cb8 aarch64/h26x: Remove duplicate b.eq instruction
b.eq is added by calc_all after each calc.
2024-09-14 16:36:34 +08:00
Zhao Zhili
9f6c8eb412 aarch64/vvc: Add put_qpel_hx i8mm
Benchmark on Android pixel 8 with -fno-vectorize

put_luma_h_8_4x4_c:                                      0.2 ( 1.00x)
put_luma_h_8_4x4_neon:                                   0.2 ( 1.00x)
put_luma_h_8_4x4_i8mm:                                   0.0 ( 0.00x)
put_luma_h_8_8x8_c:                                      1.5 ( 1.00x)
put_luma_h_8_8x8_neon:                                   0.5 ( 3.00x)
put_luma_h_8_8x8_i8mm:                                   0.5 ( 3.00x)
put_luma_h_8_16x16_c:                                    6.2 ( 1.00x)
put_luma_h_8_16x16_neon:                                 2.0 ( 3.12x)
put_luma_h_8_16x16_i8mm:                                 1.5 ( 4.17x)
put_luma_h_8_32x32_c:                                   25.5 ( 1.00x)
put_luma_h_8_32x32_neon:                                 9.0 ( 2.83x)
put_luma_h_8_32x32_i8mm:                                 6.8 ( 3.78x)
put_luma_h_8_64x64_c:                                   99.8 ( 1.00x)
put_luma_h_8_64x64_neon:                                35.2 ( 2.83x)
put_luma_h_8_64x64_i8mm:                                27.2 ( 3.66x)
put_luma_h_8_128x128_c:                                422.0 ( 1.00x)
put_luma_h_8_128x128_neon:                             138.5 ( 3.05x)
put_luma_h_8_128x128_i8mm:                             109.2 ( 3.86x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
25448d1716 aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w
put_luma_pixels_8_4x4_c:                                 0.2 ( 1.00x)
put_luma_pixels_8_4x4_neon:                              0.2 ( 1.00x)
put_luma_pixels_8_8x8_c:                                 0.7 ( 1.00x)
put_luma_pixels_8_8x8_neon:                              0.2 ( 3.22x)
put_luma_pixels_8_16x16_c:                               2.2 ( 1.00x)
put_luma_pixels_8_16x16_neon:                            0.2 ( 9.89x)
put_luma_pixels_8_32x32_c:                               8.2 ( 1.00x)
put_luma_pixels_8_32x32_neon:                            1.2 ( 6.71x)
put_luma_pixels_8_64x64_c:                              33.7 ( 1.00x)
put_luma_pixels_8_64x64_neon:                            2.5 (13.63x)
put_luma_pixels_8_128x128_c:                           145.5 ( 1.00x)
put_luma_pixels_8_128x128_neon:                         10.2 (14.23x)
put_uni_pixels_luma_8_4x4_c:                             0.5 ( 1.00x)
put_uni_pixels_luma_8_4x4_neon:                          0.0 ( 0.00x)
put_uni_pixels_luma_8_8x8_c:                             0.5 ( 1.00x)
put_uni_pixels_luma_8_8x8_neon:                          0.2 ( 2.11x)
put_uni_pixels_luma_8_16x16_c:                           1.2 ( 1.00x)
put_uni_pixels_luma_8_16x16_neon:                        0.2 ( 5.44x)
put_uni_pixels_luma_8_32x32_c:                           3.0 ( 1.00x)
put_uni_pixels_luma_8_32x32_neon:                        0.5 ( 6.26x)
put_uni_pixels_luma_8_64x64_c:                           3.0 ( 1.00x)
put_uni_pixels_luma_8_64x64_neon:                        1.7 ( 1.72x)
put_uni_pixels_luma_8_128x128_c:                         6.5 ( 1.00x)
put_uni_pixels_luma_8_128x128_neon:                      6.5 ( 1.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
20f2bf5530 aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_*
Just share hevc implementation.

checkasm --test=vvc_mc --benchmark:

put_luma_h_8_4x4_c:                                      0.2 ( 1.00x)
put_luma_h_8_4x4_neon:                                   0.2 ( 1.00x)
put_luma_h_8_8x8_c:                                      1.0 ( 1.00x)
put_luma_h_8_8x8_neon:                                   0.2 ( 4.33x)
put_luma_h_8_16x16_c:                                    3.2 ( 1.00x)
put_luma_h_8_16x16_neon:                                 1.2 ( 2.63x)
put_luma_h_8_32x32_c:                                   13.7 ( 1.00x)
put_luma_h_8_32x32_neon:                                 4.0 ( 3.45x)
put_luma_h_8_64x64_c:                                   48.2 ( 1.00x)
put_luma_h_8_64x64_neon:                                15.7 ( 3.07x)
put_luma_h_8_128x128_c:                                203.5 ( 1.00x)
put_luma_h_8_128x128_neon:                              62.0 ( 3.28x)
put_uni_h_luma_8_4x4_c:                                  0.2 ( 1.00x)
put_uni_h_luma_8_4x4_neon:                               0.2 ( 1.00x)
put_uni_h_luma_8_8x8_c:                                  1.5 ( 1.00x)
put_uni_h_luma_8_8x8_neon:                               0.2 ( 6.56x)
put_uni_h_luma_8_16x16_c:                                5.7 ( 1.00x)
put_uni_h_luma_8_16x16_neon:                             1.2 ( 4.67x)
put_uni_h_luma_8_32x32_c:                               24.0 ( 1.00x)
put_uni_h_luma_8_32x32_neon:                             4.7 ( 5.07x)
put_uni_h_luma_8_64x64_c:                               90.0 ( 1.00x)
put_uni_h_luma_8_64x64_neon:                            17.0 ( 5.30x)
put_uni_h_luma_8_128x128_c:                            357.7 ( 1.00x)
put_uni_h_luma_8_128x128_neon:                          67.5 ( 5.30x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
46f07ce7d1 aarch64/hevc: Move epel/qpel to h26x directory
So vvc can reuse the implementation.
2024-09-14 16:36:34 +08:00
Zhao Zhili
8beafb5656 aarch64/hevc: Simplify function prototypes by macro 2024-09-14 16:36:34 +08:00
Anton Khirnov
3f9ca51015 lavc/opus*: move to opus/ subdir 2024-09-02 11:56:53 +02:00
Ramiro Polla
6aafe61285 avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t 2024-09-01 13:42:30 +02:00
Zhao Zhili
4c0372281b aarch64/vvc: Bind h26x/sao filter implementation to vvc
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:50 +08:00
Zhao Zhili
8cc10298a7 aarch64/hevc: Move sao to h26x directory
So vvc can reuse the implementation.

Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:43 +08:00
Ramiro Polla
8c203ea7c7 avcodec/aarch64/mpegvideoencdsp: add dotprod implementation for pix_norm1
A55             A76
pix_norm1_c:        484.3           235.2
pix_norm1_neon:     193.8 ( 2.50x)   44.7 ( 5.26x)
pix_norm1_dotprod:   91.8 ( 5.28x)   21.2 (11.09x)
2024-08-26 12:49:04 +02:00
Ramiro Polla
9f68a3712e avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
A55             A76
pix_norm1_c:     478.2           234.2
pix_norm1_neon:  188.2 ( 2.54x)   41.2 ( 5.68x)
pix_sum_c:       304.2           244.0
pix_sum_neon:     77.2 ( 3.94x)   21.5 (11.35x)
2024-08-26 12:48:31 +02:00
Ramiro Polla
5c1c0325cd avcodec/aarch64/me_cmp: add dotprod implementations of sse16 and vsse_intra16
checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
sse_0_c: 241.5
sse_0_neon: 37.2
sse_0_dotprod: 22.2
vsse_4_c: 148.7
vsse_4_neon: 31.0
vsse_4_dotprod: 15.7
2024-08-17 15:31:48 +02:00
Martin Storsjö
4acb9b7d10 aarch64: vvc: Fix unnecessary extra spaces
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 16:04:28 +03:00
Martin Storsjö
99598629e8 aarch64: vvc: Consistently use # for immediate constants
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:37 +03:00
Martin Storsjö
400843151d aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older
Use the "ldur" instruction explicitly, instead of having the
assembler implicitly convert "ldr" instructions to "ldur".

This fixes build errors like these:

libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned
        ldr             q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned
        ldr             q24, [x2, #24]
libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned
        ldr             q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned
        ldr             q24, [x2, #24]

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:33 +03:00
Zhao Zhili
2d4ef304c9 avcodec/vvc: Add aarch64 neon optimization for ALF
vvc_alf_filter_chroma_4x4_8_c: 3.0
vvc_alf_filter_chroma_4x4_8_neon: 1.0
vvc_alf_filter_chroma_4x4_10_c: 2.7
vvc_alf_filter_chroma_4x4_10_neon: 1.0
vvc_alf_filter_chroma_4x4_12_c: 2.7
vvc_alf_filter_chroma_4x4_12_neon: 1.0
vvc_alf_filter_chroma_8x8_8_c: 10.2
vvc_alf_filter_chroma_8x8_8_neon: 3.0
vvc_alf_filter_chroma_8x8_10_c: 10.0
vvc_alf_filter_chroma_8x8_10_neon: 2.5
vvc_alf_filter_chroma_8x8_12_c: 10.0
vvc_alf_filter_chroma_8x8_12_neon: 2.5
vvc_alf_filter_chroma_16x16_8_c: 41.7
vvc_alf_filter_chroma_16x16_8_neon: 11.2
vvc_alf_filter_chroma_16x16_10_c: 39.0
vvc_alf_filter_chroma_16x16_10_neon: 10.0
vvc_alf_filter_chroma_16x16_12_c: 40.2
vvc_alf_filter_chroma_16x16_12_neon: 10.2
vvc_alf_filter_chroma_32x32_8_c: 162.0
vvc_alf_filter_chroma_32x32_8_neon: 45.0
vvc_alf_filter_chroma_32x32_10_c: 155.5
vvc_alf_filter_chroma_32x32_10_neon: 39.5
vvc_alf_filter_chroma_32x32_12_c: 155.5
vvc_alf_filter_chroma_32x32_12_neon: 40.0
vvc_alf_filter_chroma_64x64_8_c: 646.0
vvc_alf_filter_chroma_64x64_8_neon: 175.5
vvc_alf_filter_chroma_64x64_10_c: 708.2
vvc_alf_filter_chroma_64x64_10_neon: 166.7
vvc_alf_filter_chroma_64x64_12_c: 619.2
vvc_alf_filter_chroma_64x64_12_neon: 157.2
vvc_alf_filter_chroma_128x128_8_c: 2611.5
vvc_alf_filter_chroma_128x128_8_neon: 698.2
vvc_alf_filter_chroma_128x128_10_c: 2470.0
vvc_alf_filter_chroma_128x128_10_neon: 616.0
vvc_alf_filter_chroma_128x128_12_c: 2531.5
vvc_alf_filter_chroma_128x128_12_neon: 620.2
vvc_alf_filter_luma_8x8_8_c: 25.2
vvc_alf_filter_luma_8x8_8_neon: 4.2
vvc_alf_filter_luma_8x8_10_c: 18.5
vvc_alf_filter_luma_8x8_10_neon: 4.0
vvc_alf_filter_luma_8x8_12_c: 19.0
vvc_alf_filter_luma_8x8_12_neon: 4.0
vvc_alf_filter_luma_16x16_8_c: 106.5
vvc_alf_filter_luma_16x16_8_neon: 16.2
vvc_alf_filter_luma_16x16_10_c: 75.2
vvc_alf_filter_luma_16x16_10_neon: 14.7
vvc_alf_filter_luma_16x16_12_c: 79.7
vvc_alf_filter_luma_16x16_12_neon: 14.7
vvc_alf_filter_luma_32x32_8_c: 400.5
vvc_alf_filter_luma_32x32_8_neon: 63.2
vvc_alf_filter_luma_32x32_10_c: 299.2
vvc_alf_filter_luma_32x32_10_neon: 57.7
vvc_alf_filter_luma_32x32_12_c: 299.2
vvc_alf_filter_luma_32x32_12_neon: 57.7
vvc_alf_filter_luma_64x64_8_c: 1602.5
vvc_alf_filter_luma_64x64_8_neon: 251.7
vvc_alf_filter_luma_64x64_10_c: 1197.0
vvc_alf_filter_luma_64x64_10_neon: 235.5
vvc_alf_filter_luma_64x64_12_c: 1220.2
vvc_alf_filter_luma_64x64_12_neon: 235.7
vvc_alf_filter_luma_128x128_8_c: 6570.2
vvc_alf_filter_luma_128x128_8_neon: 1007.7
vvc_alf_filter_luma_128x128_10_c: 4822.7
vvc_alf_filter_luma_128x128_10_neon: 936.2
vvc_alf_filter_luma_128x128_12_c: 4791.2
vvc_alf_filter_luma_128x128_12_neon: 938.5

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-22 21:09:56 +08:00
Anton Khirnov
e4601cc339 lavc/hevc*: move to hevc/ subdir 2024-06-04 11:46:27 +02:00
Ramiro Polla
d4d09c8e42 lavc/aarch64/fdct: add neon-optimized fdct for aarch64
The code is imported from libjpeg-turbo-3.0.1. The neon registers used
have been changed to avoid modifying v8-v15.

Reviewed-by: Martin Storsjö <martin@martin.st>
2024-05-13 14:54:10 +02:00
Ramiro Polla
27f6211c74 lavc/aarch64: fix include for cpu.h 2024-05-13 14:50:38 +02:00
Lynne
134dba9544
opusdsp: add ability to modify deemphasis constant
xHE-AAC relies on the same postfilter mechanism
that Opus uses to improve clarity (albeit with a steeper
deemphasis filter).

The code to apply it is identical, it's still just a
simple IIR low-pass filter. This commit makes it possible
to use alternative constants.
2024-04-27 11:12:07 +02:00
Martin Storsjö
359b6a7f8a aarch64/ac3dsp: simplify the end of ff_ac3_sum_square_butterfly_float_neon
Before:                           Cortex A53     A72     A78      M1
ac3_sum_square_bufferfly_float_neon:  1005.7   516.5   224.5   194.0
After:
ac3_sum_square_bufferfly_float_neon:   981.7   504.5   223.2   189.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2024-04-09 16:50:49 +02:00
Geoff Hill
ee1bc723de avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
42e88f18f3 avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
69cb34f885 avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
6f6bd10531 avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
b69486ea18 avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:28 +03:00
Martin Storsjö
f872b19714 aarch64: hevc: Produce plain neon versions of qpel_bi_hv
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.

AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00
Martin Storsjö
d21b9a0411 aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00
Martin Storsjö
5ab138673b aarch64: hevc: Produce plain neon versions of qpel_uni_hv
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.

AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00