Krzysztof Pyrkosz
83e4b068d9
avcodec/aarch64/aacencdsp: NEON implementation
...
This patch supplies handwritten NEON code for AAC.
The benchmarks below were collected by invoking these two commands on
each of my boards, A78, A72 and Thinkpad x13s:
1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is
speed without the patch, second, with)
- A78
abs_pow34_c: 4161.5 ( 1.00x)
abs_pow34_neon: 3586.2 ( 1.16x)
quant_bands_signed_c: 5548.0 ( 1.00x)
quant_bands_signed_neon: 1126.8 ( 4.92x)
quant_bands_unsigned_c: 3979.2 ( 1.00x)
quant_bands_unsigned_neon: 800.2 ( 4.97x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=71.6x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=82.3x
- A72
abs_pow34_c: 15362.2 ( 1.00x)
abs_pow34_neon: 15382.5 ( 1.00x)
quant_bands_signed_c: 9926.5 ( 1.00x)
quant_bands_signed_neon: 2467.8 ( 4.02x)
quant_bands_unsigned_c: 5469.8 ( 1.00x)
quant_bands_unsigned_neon: 2089.5 ( 2.62x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=34.3x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=37.8
- x13s
abs_pow34_c: 2413.4 ( 1.00x)
abs_pow34_neon: 1796.2 ( 1.34x)
quant_bands_signed_c: 2968.9 ( 1.00x)
quant_bands_signed_neon: 675.6 ( 4.39x)
quant_bands_unsigned_c: 2311.9 ( 1.00x)
quant_bands_unsigned_neon: 477.1 ( 4.85x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 135x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 159x
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-01-28 10:44:40 +02:00
Janne Grunau
430c38f698
aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter
...
This reduces the amount the horizontal filters read beyond the filter
width to a consistent 1 pixel. The data is not used so this is usually
not noticeable. It becomes a problem when the application allocates
frame buffers only for the aligned picture size and the end of it is at
a page boundary. This happens for picture sizes which are a multiple of
the page size like 1280x640. The frame buffer allocation is based on
its most likely done via mmap + MAP_ANONYMOUS so start and end of the
buffer are page aligned and the previous and next page are not
necessarily mapped.
Under these conditions like seen by Firefox a read beyond the end of the
buffer results in a segfault.
After the over-read is reduced to a single pixel it's reasonable to use
VP9's emulated edge motion compensation for this.
Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185
Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2025-01-03 17:53:46 -05:00
Zhao Zhili
952508ae05
aarch64/vvc: Add apply_bdof
...
Test on rpi 5 with gcc 12:
apply_bdof_8_8x16_c: 7315.2 ( 1.00x)
apply_bdof_8_8x16_neon: 1876.8 ( 3.90x)
apply_bdof_8_16x8_c: 7170.5 ( 1.00x)
apply_bdof_8_16x8_neon: 1752.8 ( 4.09x)
apply_bdof_8_16x16_c: 14695.2 ( 1.00x)
apply_bdof_8_16x16_neon: 3490.5 ( 4.21x)
apply_bdof_10_8x16_c: 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon: 1863.8 ( 3.96x)
apply_bdof_10_16x8_c: 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon: 1766.0 ( 4.06x)
apply_bdof_10_16x16_c: 14551.5 ( 1.00x)
apply_bdof_10_16x16_neon: 3576.0 ( 4.07x)
apply_bdof_12_8x16_c: 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon: 1863.8 ( 3.88x)
apply_bdof_12_16x8_c: 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon: 1758.8 ( 4.16x)
apply_bdof_12_16x16_c: 14691.2 ( 1.00x)
apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)
2024-12-21 11:54:44 +08:00
Martin Storsjö
2bb00ef59c
aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions
...
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
This fixes these build errors:
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
error A2518: operand 2: Memory offset must be aligned
ldr s5, [x1, #1 ]
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
error A2518: operand 2: Memory offset must be aligned
ldr d7, [x1, #2 ]
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-18 13:45:09 +02:00
Bin Peng
72a3656e84
lavc/aarch64: Fix ff_pred16x16_plane_neon_10
...
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840
Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Bin Peng
decc9e643c
lavc/aarch64: Fix ff_pred8x8_plane_neon_10
...
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612
The mismatch between neon and C functions can also be reproduced using the following bitstream and command line.
wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Zhao Zhili
40feba5f77
aarch64/vvc: Fix clip in alf
...
Fix test failure:
./tests/checkasm/checkasm --test=vvc_alf 3607569773
2024-12-10 21:00:47 +08:00
Zhao Zhili
91436638de
aarch64/vvc: Use faster clip operation
...
Replace sqxtn+smin+smax by sqxtun+umin.
2024-12-10 21:00:47 +08:00
Zhao Zhili
bfed5f6b7d
aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma
2024-12-10 21:00:47 +08:00
Zhao Zhili
5988a2729b
aarch64/vvc: Add dmvr
...
dmvr_8_12x20_c: 1.5 ( 1.00x)
dmvr_8_12x20_neon: 0.2 ( 6.56x)
dmvr_8_20x12_c: 1.0 ( 1.00x)
dmvr_8_20x12_neon: 0.2 ( 4.33x)
dmvr_8_20x20_c: 1.7 ( 1.00x)
dmvr_8_20x20_neon: 0.5 ( 3.63x)
dmvr_12_12x20_c: 2.2 ( 1.00x)
dmvr_12_12x20_neon: 0.5 ( 4.68x)
dmvr_12_20x12_c: 2.0 ( 1.00x)
dmvr_12_20x12_neon: 0.5 ( 4.16x)
dmvr_12_20x20_c: 3.7 ( 1.00x)
dmvr_12_20x20_neon: 0.7 ( 5.14x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
bcd65ebd8f
aarch64/vvc: Add dmvr_hv
...
dmvr_hv_8_12x20_c: 8.0 ( 1.00x)
dmvr_hv_8_12x20_neon: 1.2 ( 6.62x)
dmvr_hv_8_20x12_c: 8.0 ( 1.00x)
dmvr_hv_8_20x12_neon: 0.9 ( 8.37x)
dmvr_hv_8_20x20_c: 12.9 ( 1.00x)
dmvr_hv_8_20x20_neon: 1.7 ( 7.62x)
dmvr_hv_10_12x20_c: 7.0 ( 1.00x)
dmvr_hv_10_12x20_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x12_c: 7.0 ( 1.00x)
dmvr_hv_10_20x12_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x20_c: 11.2 ( 1.00x)
dmvr_hv_10_20x20_neon: 2.7 ( 4.15x)
dmvr_hv_12_12x20_c: 6.5 ( 1.00x)
dmvr_hv_12_12x20_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x12_c: 6.5 ( 1.00x)
dmvr_hv_12_20x12_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x20_c: 10.2 ( 1.00x)
dmvr_hv_12_20x20_neon: 2.2 ( 4.64x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
0ba9e8d0d4
aarch64/vvc: Add w_avg
...
w_avg_8_2x2_c: 0.0 ( 0.00x)
w_avg_8_2x2_neon: 0.0 ( 0.00x)
w_avg_8_4x4_c: 0.2 ( 1.00x)
w_avg_8_4x4_neon: 0.0 ( 0.00x)
w_avg_8_8x8_c: 1.2 ( 1.00x)
w_avg_8_8x8_neon: 0.2 ( 5.00x)
w_avg_8_16x16_c: 4.2 ( 1.00x)
w_avg_8_16x16_neon: 0.8 ( 5.67x)
w_avg_8_32x32_c: 16.2 ( 1.00x)
w_avg_8_32x32_neon: 2.5 ( 6.50x)
w_avg_8_64x64_c: 64.5 ( 1.00x)
w_avg_8_64x64_neon: 9.0 ( 7.17x)
w_avg_8_128x128_c: 269.5 ( 1.00x)
w_avg_8_128x128_neon: 35.5 ( 7.59x)
w_avg_10_2x2_c: 0.2 ( 1.00x)
w_avg_10_2x2_neon: 0.2 ( 1.00x)
w_avg_10_4x4_c: 0.2 ( 1.00x)
w_avg_10_4x4_neon: 0.2 ( 1.00x)
w_avg_10_8x8_c: 1.0 ( 1.00x)
w_avg_10_8x8_neon: 0.2 ( 4.00x)
w_avg_10_16x16_c: 4.2 ( 1.00x)
w_avg_10_16x16_neon: 0.8 ( 5.67x)
w_avg_10_32x32_c: 16.2 ( 1.00x)
w_avg_10_32x32_neon: 2.5 ( 6.50x)
w_avg_10_64x64_c: 66.2 ( 1.00x)
w_avg_10_64x64_neon: 10.0 ( 6.62x)
w_avg_10_128x128_c: 277.8 ( 1.00x)
w_avg_10_128x128_neon: 39.8 ( 6.99x)
w_avg_12_2x2_c: 0.0 ( 0.00x)
w_avg_12_2x2_neon: 0.2 ( 0.00x)
w_avg_12_4x4_c: 0.2 ( 1.00x)
w_avg_12_4x4_neon: 0.0 ( 0.00x)
w_avg_12_8x8_c: 1.2 ( 1.00x)
w_avg_12_8x8_neon: 0.5 ( 2.50x)
w_avg_12_16x16_c: 4.8 ( 1.00x)
w_avg_12_16x16_neon: 0.8 ( 6.33x)
w_avg_12_32x32_c: 17.0 ( 1.00x)
w_avg_12_32x32_neon: 2.8 ( 6.18x)
w_avg_12_64x64_c: 64.0 ( 1.00x)
w_avg_12_64x64_neon: 10.0 ( 6.40x)
w_avg_12_128x128_c: 269.2 ( 1.00x)
w_avg_12_128x128_neon: 42.0 ( 6.41x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Martin Storsjö
a3ec1f8c6c
aarch64: h26x: Fix the indentation of one function
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-09-26 13:42:11 +03:00
Zhao Zhili
3f84d1d1fb
aarch64/vvc: Add avg
...
avg_8_2x2_c: 0.2 ( 1.00x)
avg_8_2x2_neon: 0.2 ( 1.00x)
avg_8_4x4_c: 0.2 ( 1.00x)
avg_8_4x4_neon: 0.2 ( 1.00x)
avg_8_8x8_c: 0.9 ( 1.00x)
avg_8_8x8_neon: 0.2 ( 5.29x)
avg_8_16x16_c: 3.7 ( 1.00x)
avg_8_16x16_neon: 0.7 ( 5.44x)
avg_8_32x32_c: 14.9 ( 1.00x)
avg_8_32x32_neon: 1.7 ( 8.91x)
avg_8_64x64_c: 59.7 ( 1.00x)
avg_8_64x64_neon: 6.9 ( 8.62x)
avg_8_128x128_c: 254.7 ( 1.00x)
avg_8_128x128_neon: 26.9 ( 9.46x)
avg_10_2x2_c: 0.2 ( 1.00x)
avg_10_2x2_neon: 0.2 ( 1.00x)
avg_10_4x4_c: 0.2 ( 1.00x)
avg_10_4x4_neon: 0.2 ( 1.00x)
avg_10_8x8_c: 0.9 ( 1.00x)
avg_10_8x8_neon: 0.2 ( 5.29x)
avg_10_16x16_c: 3.4 ( 1.00x)
avg_10_16x16_neon: 0.4 ( 8.06x)
avg_10_32x32_c: 13.9 ( 1.00x)
avg_10_32x32_neon: 1.9 ( 7.23x)
avg_10_64x64_c: 54.2 ( 1.00x)
avg_10_64x64_neon: 8.4 ( 6.43x)
avg_10_128x128_c: 232.4 ( 1.00x)
avg_10_128x128_neon: 30.9 ( 7.52x)
avg_12_2x2_c: 0.0 ( 0.00x)
avg_12_2x2_neon: 0.2 ( 0.00x)
avg_12_4x4_c: 0.4 ( 1.00x)
avg_12_4x4_neon: 0.2 ( 2.43x)
avg_12_8x8_c: 0.7 ( 1.00x)
avg_12_8x8_neon: 0.2 ( 3.86x)
avg_12_16x16_c: 3.7 ( 1.00x)
avg_12_16x16_neon: 0.4 ( 8.65x)
avg_12_32x32_c: 13.7 ( 1.00x)
avg_12_32x32_neon: 2.2 ( 6.29x)
avg_12_64x64_c: 53.9 ( 1.00x)
avg_12_64x64_neon: 7.7 ( 7.03x)
avg_12_128x128_c: 270.9 ( 1.00x)
avg_12_128x128_neon: 30.4 ( 8.90x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
1be5a2374f
aarch64/vvc: Add put_epel_hv
...
On Apple M1:
put_chroma_hv_8_4x4_c: 1.7 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.2 ( 7.67x)
put_chroma_hv_8_8x8_c: 5.5 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 (11.53x)
put_chroma_hv_8_16x16_c: 18.5 ( 1.00x)
put_chroma_hv_8_16x16_neon: 1.5 (12.53x)
put_chroma_hv_8_32x32_c: 72.5 ( 1.00x)
put_chroma_hv_8_32x32_neon: 4.7 (15.34x)
put_chroma_hv_8_64x64_c: 274.0 ( 1.00x)
put_chroma_hv_8_64x64_neon: 18.5 (14.83x)
put_chroma_hv_8_128x128_c: 1058.7 ( 1.00x)
put_chroma_hv_8_128x128_neon: 75.2 (14.07x)
On Android Pixel 8 Pro:
put_chroma_hv_8_4x4_c: 1.2 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_hv_8_4x4_i8mm: 0.2 ( 5.00x)
put_chroma_hv_8_8x8_c: 4.0 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 ( 8.00x)
put_chroma_hv_8_8x8_i8mm: 0.5 ( 8.00x)
put_chroma_hv_8_16x16_c: 15.2 ( 1.00x)
put_chroma_hv_8_16x16_neon: 2.5 ( 6.10x)
put_chroma_hv_8_16x16_i8mm: 2.2 ( 6.78x)
put_chroma_hv_8_32x32_c: 61.0 ( 1.00x)
put_chroma_hv_8_32x32_neon: 9.8 ( 6.26x)
put_chroma_hv_8_32x32_i8mm: 8.5 ( 7.18x)
put_chroma_hv_8_64x64_c: 229.5 ( 1.00x)
put_chroma_hv_8_64x64_neon: 38.5 ( 5.96x)
put_chroma_hv_8_64x64_i8mm: 34.0 ( 6.75x)
put_chroma_hv_8_128x128_c: 919.8 ( 1.00x)
put_chroma_hv_8_128x128_neon: 154.5 ( 5.95x)
put_chroma_hv_8_128x128_i8mm: 140.0 ( 6.57x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
0dcf204e5d
aarch64/vvc: Add put_epel_h i8mm
...
put_chroma_h_8_4x4_c: 0.4 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_h_8_4x4_i8mm: 0.1 ( 2.67x)
put_chroma_h_8_8x8_c: 1.6 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.1 (11.00x)
put_chroma_h_8_8x8_i8mm: 0.1 (11.00x)
put_chroma_h_8_16x16_c: 6.9 ( 1.00x)
put_chroma_h_8_16x16_neon: 1.1 ( 6.00x)
put_chroma_h_8_16x16_i8mm: 0.7 (10.62x)
put_chroma_h_8_32x32_c: 27.6 ( 1.00x)
put_chroma_h_8_32x32_neon: 4.7 ( 5.95x)
put_chroma_h_8_32x32_i8mm: 4.4 ( 6.28x)
put_chroma_h_8_64x64_c: 116.2 ( 1.00x)
put_chroma_h_8_64x64_neon: 19.1 ( 6.07x)
put_chroma_h_8_64x64_i8mm: 17.1 ( 6.77x)
put_chroma_h_8_128x128_c: 466.6 ( 1.00x)
put_chroma_h_8_128x128_neon: 81.4 ( 5.73x)
put_chroma_h_8_128x128_i8mm: 71.7 ( 6.51x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
41a1885f7a
aarch64/vvc: Add put_epel_h
...
put_chroma_h_8_4x4_c: 0.2 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.2 ( 1.00x)
put_chroma_h_8_8x8_c: 0.8 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.2 ( 3.00x)
put_chroma_h_8_16x16_c: 3.8 ( 1.00x)
put_chroma_h_8_16x16_neon: 0.8 ( 5.00x)
put_chroma_h_8_32x32_c: 12.5 ( 1.00x)
put_chroma_h_8_32x32_neon: 2.2 ( 5.56x)
put_chroma_h_8_64x64_c: 47.0 ( 1.00x)
put_chroma_h_8_64x64_neon: 8.8 ( 5.37x)
put_chroma_h_8_128x128_c: 200.2 ( 1.00x)
put_chroma_h_8_128x128_neon: 31.8 ( 6.31x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
260e1b4b62
aarch64/vvc: Add sad
...
sad_8x16_c: 0.8 ( 1.00x)
sad_8x16_neon: 0.2 ( 3.00x)
sad_16x8_c: 0.5 ( 1.00x)
sad_16x8_neon: 0.2 ( 2.00x)
sad_16x16_c: 1.5 ( 1.00x)
sad_16x16_neon: 0.2 ( 6.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
5ac6925803
aarch64/vvc: Add put_qpel_hv
...
With Apple M1 (no i8mm):
put_luma_hv_8_4x4_c: 2.2 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 3.00x)
put_luma_hv_8_8x8_c: 7.0 ( 1.00x)
put_luma_hv_8_8x8_neon: 0.8 ( 9.33x)
put_luma_hv_8_16x16_c: 22.8 ( 1.00x)
put_luma_hv_8_16x16_neon: 2.5 ( 9.10x)
put_luma_hv_8_32x32_c: 84.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 9.5 ( 8.92x)
put_luma_hv_8_64x64_c: 333.0 ( 1.00x)
put_luma_hv_8_64x64_neon: 35.5 ( 9.38x)
put_luma_hv_8_128x128_c: 1294.5 ( 1.00x)
put_luma_hv_8_128x128_neon: 137.8 ( 9.40x)
With Pixel 8 Pro:
put_luma_hv_8_4x4_c: 5.0 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 6.67x)
put_luma_hv_8_4x4_i8mm: 0.2 (20.00x)
put_luma_hv_8_8x8_c: 13.2 ( 1.00x)
put_luma_hv_8_8x8_neon: 1.2 (10.60x)
put_luma_hv_8_8x8_i8mm: 1.2 (10.60x)
put_luma_hv_8_16x16_c: 44.2 ( 1.00x)
put_luma_hv_8_16x16_neon: 4.5 ( 9.83x)
put_luma_hv_8_16x16_i8mm: 4.2 (10.41x)
put_luma_hv_8_32x32_c: 160.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 17.5 ( 9.19x)
put_luma_hv_8_32x32_i8mm: 16.0 (10.05x)
put_luma_hv_8_64x64_c: 611.2 ( 1.00x)
put_luma_hv_8_64x64_neon: 68.0 ( 8.99x)
put_luma_hv_8_64x64_i8mm: 62.2 ( 9.82x)
put_luma_hv_8_128x128_c: 2384.8 ( 1.00x)
put_luma_hv_8_128x128_neon: 268.8 ( 8.87x)
put_luma_hv_8_128x128_i8mm: 245.8 ( 9.70x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
a0b52afd32
aarch64/vvc: Add put_qpel_vx
...
put_luma_v_8_4x4_c: 1.0 ( 1.00x)
put_luma_v_8_4x4_neon: 0.0 ( 0.00x)
put_luma_v_8_8x8_c: 3.5 ( 1.00x)
put_luma_v_8_8x8_neon: 0.5 ( 7.00x)
put_luma_v_8_16x16_c: 13.8 ( 1.00x)
put_luma_v_8_16x16_neon: 1.2 (11.00x)
put_luma_v_8_32x32_c: 54.2 ( 1.00x)
put_luma_v_8_32x32_neon: 5.0 (10.85x)
put_luma_v_8_64x64_c: 217.5 ( 1.00x)
put_luma_v_8_64x64_neon: 18.8 (11.60x)
put_luma_v_8_128x128_c: 886.2 ( 1.00x)
put_luma_v_8_128x128_neon: 74.0 (11.98x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
b051bc7cb8
aarch64/h26x: Remove duplicate b.eq instruction
...
b.eq is added by calc_all after each calc.
2024-09-14 16:36:34 +08:00
Zhao Zhili
9f6c8eb412
aarch64/vvc: Add put_qpel_hx i8mm
...
Benchmark on Android pixel 8 with -fno-vectorize
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_4x4_i8mm: 0.0 ( 0.00x)
put_luma_h_8_8x8_c: 1.5 ( 1.00x)
put_luma_h_8_8x8_neon: 0.5 ( 3.00x)
put_luma_h_8_8x8_i8mm: 0.5 ( 3.00x)
put_luma_h_8_16x16_c: 6.2 ( 1.00x)
put_luma_h_8_16x16_neon: 2.0 ( 3.12x)
put_luma_h_8_16x16_i8mm: 1.5 ( 4.17x)
put_luma_h_8_32x32_c: 25.5 ( 1.00x)
put_luma_h_8_32x32_neon: 9.0 ( 2.83x)
put_luma_h_8_32x32_i8mm: 6.8 ( 3.78x)
put_luma_h_8_64x64_c: 99.8 ( 1.00x)
put_luma_h_8_64x64_neon: 35.2 ( 2.83x)
put_luma_h_8_64x64_i8mm: 27.2 ( 3.66x)
put_luma_h_8_128x128_c: 422.0 ( 1.00x)
put_luma_h_8_128x128_neon: 138.5 ( 3.05x)
put_luma_h_8_128x128_i8mm: 109.2 ( 3.86x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
25448d1716
aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w
...
put_luma_pixels_8_4x4_c: 0.2 ( 1.00x)
put_luma_pixels_8_4x4_neon: 0.2 ( 1.00x)
put_luma_pixels_8_8x8_c: 0.7 ( 1.00x)
put_luma_pixels_8_8x8_neon: 0.2 ( 3.22x)
put_luma_pixels_8_16x16_c: 2.2 ( 1.00x)
put_luma_pixels_8_16x16_neon: 0.2 ( 9.89x)
put_luma_pixels_8_32x32_c: 8.2 ( 1.00x)
put_luma_pixels_8_32x32_neon: 1.2 ( 6.71x)
put_luma_pixels_8_64x64_c: 33.7 ( 1.00x)
put_luma_pixels_8_64x64_neon: 2.5 (13.63x)
put_luma_pixels_8_128x128_c: 145.5 ( 1.00x)
put_luma_pixels_8_128x128_neon: 10.2 (14.23x)
put_uni_pixels_luma_8_4x4_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_4x4_neon: 0.0 ( 0.00x)
put_uni_pixels_luma_8_8x8_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_8x8_neon: 0.2 ( 2.11x)
put_uni_pixels_luma_8_16x16_c: 1.2 ( 1.00x)
put_uni_pixels_luma_8_16x16_neon: 0.2 ( 5.44x)
put_uni_pixels_luma_8_32x32_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_32x32_neon: 0.5 ( 6.26x)
put_uni_pixels_luma_8_64x64_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_64x64_neon: 1.7 ( 1.72x)
put_uni_pixels_luma_8_128x128_c: 6.5 ( 1.00x)
put_uni_pixels_luma_8_128x128_neon: 6.5 ( 1.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
20f2bf5530
aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_*
...
Just share hevc implementation.
checkasm --test=vvc_mc --benchmark:
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_8x8_c: 1.0 ( 1.00x)
put_luma_h_8_8x8_neon: 0.2 ( 4.33x)
put_luma_h_8_16x16_c: 3.2 ( 1.00x)
put_luma_h_8_16x16_neon: 1.2 ( 2.63x)
put_luma_h_8_32x32_c: 13.7 ( 1.00x)
put_luma_h_8_32x32_neon: 4.0 ( 3.45x)
put_luma_h_8_64x64_c: 48.2 ( 1.00x)
put_luma_h_8_64x64_neon: 15.7 ( 3.07x)
put_luma_h_8_128x128_c: 203.5 ( 1.00x)
put_luma_h_8_128x128_neon: 62.0 ( 3.28x)
put_uni_h_luma_8_4x4_c: 0.2 ( 1.00x)
put_uni_h_luma_8_4x4_neon: 0.2 ( 1.00x)
put_uni_h_luma_8_8x8_c: 1.5 ( 1.00x)
put_uni_h_luma_8_8x8_neon: 0.2 ( 6.56x)
put_uni_h_luma_8_16x16_c: 5.7 ( 1.00x)
put_uni_h_luma_8_16x16_neon: 1.2 ( 4.67x)
put_uni_h_luma_8_32x32_c: 24.0 ( 1.00x)
put_uni_h_luma_8_32x32_neon: 4.7 ( 5.07x)
put_uni_h_luma_8_64x64_c: 90.0 ( 1.00x)
put_uni_h_luma_8_64x64_neon: 17.0 ( 5.30x)
put_uni_h_luma_8_128x128_c: 357.7 ( 1.00x)
put_uni_h_luma_8_128x128_neon: 67.5 ( 5.30x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
46f07ce7d1
aarch64/hevc: Move epel/qpel to h26x directory
...
So vvc can reuse the implementation.
2024-09-14 16:36:34 +08:00
Zhao Zhili
8beafb5656
aarch64/hevc: Simplify function prototypes by macro
2024-09-14 16:36:34 +08:00
Anton Khirnov
3f9ca51015
lavc/opus*: move to opus/ subdir
2024-09-02 11:56:53 +02:00
Ramiro Polla
6aafe61285
avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t
2024-09-01 13:42:30 +02:00
Zhao Zhili
4c0372281b
aarch64/vvc: Bind h26x/sao filter implementation to vvc
...
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:50 +08:00
Zhao Zhili
8cc10298a7
aarch64/hevc: Move sao to h26x directory
...
So vvc can reuse the implementation.
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:43 +08:00
Ramiro Polla
8c203ea7c7
avcodec/aarch64/mpegvideoencdsp: add dotprod implementation for pix_norm1
...
A55 A76
pix_norm1_c: 484.3 235.2
pix_norm1_neon: 193.8 ( 2.50x) 44.7 ( 5.26x)
pix_norm1_dotprod: 91.8 ( 5.28x) 21.2 (11.09x)
2024-08-26 12:49:04 +02:00
Ramiro Polla
9f68a3712e
avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
...
A55 A76
pix_norm1_c: 478.2 234.2
pix_norm1_neon: 188.2 ( 2.54x) 41.2 ( 5.68x)
pix_sum_c: 304.2 244.0
pix_sum_neon: 77.2 ( 3.94x) 21.5 (11.35x)
2024-08-26 12:48:31 +02:00
Ramiro Polla
5c1c0325cd
avcodec/aarch64/me_cmp: add dotprod implementations of sse16 and vsse_intra16
...
checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
sse_0_c: 241.5
sse_0_neon: 37.2
sse_0_dotprod: 22.2
vsse_4_c: 148.7
vsse_4_neon: 31.0
vsse_4_dotprod: 15.7
2024-08-17 15:31:48 +02:00
Martin Storsjö
4acb9b7d10
aarch64: vvc: Fix unnecessary extra spaces
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 16:04:28 +03:00
Martin Storsjö
99598629e8
aarch64: vvc: Consistently use # for immediate constants
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:37 +03:00
Martin Storsjö
400843151d
aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older
...
Use the "ldur" instruction explicitly, instead of having the
assembler implicitly convert "ldr" instructions to "ldur".
This fixes build errors like these:
libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24 ]
libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24 ]
libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24 ]
libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24 ]
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:33 +03:00
Zhao Zhili
2d4ef304c9
avcodec/vvc: Add aarch64 neon optimization for ALF
...
vvc_alf_filter_chroma_4x4_8_c: 3.0
vvc_alf_filter_chroma_4x4_8_neon: 1.0
vvc_alf_filter_chroma_4x4_10_c: 2.7
vvc_alf_filter_chroma_4x4_10_neon: 1.0
vvc_alf_filter_chroma_4x4_12_c: 2.7
vvc_alf_filter_chroma_4x4_12_neon: 1.0
vvc_alf_filter_chroma_8x8_8_c: 10.2
vvc_alf_filter_chroma_8x8_8_neon: 3.0
vvc_alf_filter_chroma_8x8_10_c: 10.0
vvc_alf_filter_chroma_8x8_10_neon: 2.5
vvc_alf_filter_chroma_8x8_12_c: 10.0
vvc_alf_filter_chroma_8x8_12_neon: 2.5
vvc_alf_filter_chroma_16x16_8_c: 41.7
vvc_alf_filter_chroma_16x16_8_neon: 11.2
vvc_alf_filter_chroma_16x16_10_c: 39.0
vvc_alf_filter_chroma_16x16_10_neon: 10.0
vvc_alf_filter_chroma_16x16_12_c: 40.2
vvc_alf_filter_chroma_16x16_12_neon: 10.2
vvc_alf_filter_chroma_32x32_8_c: 162.0
vvc_alf_filter_chroma_32x32_8_neon: 45.0
vvc_alf_filter_chroma_32x32_10_c: 155.5
vvc_alf_filter_chroma_32x32_10_neon: 39.5
vvc_alf_filter_chroma_32x32_12_c: 155.5
vvc_alf_filter_chroma_32x32_12_neon: 40.0
vvc_alf_filter_chroma_64x64_8_c: 646.0
vvc_alf_filter_chroma_64x64_8_neon: 175.5
vvc_alf_filter_chroma_64x64_10_c: 708.2
vvc_alf_filter_chroma_64x64_10_neon: 166.7
vvc_alf_filter_chroma_64x64_12_c: 619.2
vvc_alf_filter_chroma_64x64_12_neon: 157.2
vvc_alf_filter_chroma_128x128_8_c: 2611.5
vvc_alf_filter_chroma_128x128_8_neon: 698.2
vvc_alf_filter_chroma_128x128_10_c: 2470.0
vvc_alf_filter_chroma_128x128_10_neon: 616.0
vvc_alf_filter_chroma_128x128_12_c: 2531.5
vvc_alf_filter_chroma_128x128_12_neon: 620.2
vvc_alf_filter_luma_8x8_8_c: 25.2
vvc_alf_filter_luma_8x8_8_neon: 4.2
vvc_alf_filter_luma_8x8_10_c: 18.5
vvc_alf_filter_luma_8x8_10_neon: 4.0
vvc_alf_filter_luma_8x8_12_c: 19.0
vvc_alf_filter_luma_8x8_12_neon: 4.0
vvc_alf_filter_luma_16x16_8_c: 106.5
vvc_alf_filter_luma_16x16_8_neon: 16.2
vvc_alf_filter_luma_16x16_10_c: 75.2
vvc_alf_filter_luma_16x16_10_neon: 14.7
vvc_alf_filter_luma_16x16_12_c: 79.7
vvc_alf_filter_luma_16x16_12_neon: 14.7
vvc_alf_filter_luma_32x32_8_c: 400.5
vvc_alf_filter_luma_32x32_8_neon: 63.2
vvc_alf_filter_luma_32x32_10_c: 299.2
vvc_alf_filter_luma_32x32_10_neon: 57.7
vvc_alf_filter_luma_32x32_12_c: 299.2
vvc_alf_filter_luma_32x32_12_neon: 57.7
vvc_alf_filter_luma_64x64_8_c: 1602.5
vvc_alf_filter_luma_64x64_8_neon: 251.7
vvc_alf_filter_luma_64x64_10_c: 1197.0
vvc_alf_filter_luma_64x64_10_neon: 235.5
vvc_alf_filter_luma_64x64_12_c: 1220.2
vvc_alf_filter_luma_64x64_12_neon: 235.7
vvc_alf_filter_luma_128x128_8_c: 6570.2
vvc_alf_filter_luma_128x128_8_neon: 1007.7
vvc_alf_filter_luma_128x128_10_c: 4822.7
vvc_alf_filter_luma_128x128_10_neon: 936.2
vvc_alf_filter_luma_128x128_12_c: 4791.2
vvc_alf_filter_luma_128x128_12_neon: 938.5
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-22 21:09:56 +08:00
Anton Khirnov
e4601cc339
lavc/hevc*: move to hevc/ subdir
2024-06-04 11:46:27 +02:00
Ramiro Polla
d4d09c8e42
lavc/aarch64/fdct: add neon-optimized fdct for aarch64
...
The code is imported from libjpeg-turbo-3.0.1. The neon registers used
have been changed to avoid modifying v8-v15.
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-05-13 14:54:10 +02:00
Ramiro Polla
27f6211c74
lavc/aarch64: fix include for cpu.h
2024-05-13 14:50:38 +02:00
Lynne
134dba9544
opusdsp: add ability to modify deemphasis constant
...
xHE-AAC relies on the same postfilter mechanism
that Opus uses to improve clarity (albeit with a steeper
deemphasis filter).
The code to apply it is identical, it's still just a
simple IIR low-pass filter. This commit makes it possible
to use alternative constants.
2024-04-27 11:12:07 +02:00
Martin Storsjö
359b6a7f8a
aarch64/ac3dsp: simplify the end of ff_ac3_sum_square_butterfly_float_neon
...
Before: Cortex A53 A72 A78 M1
ac3_sum_square_bufferfly_float_neon: 1005.7 516.5 224.5 194.0
After:
ac3_sum_square_bufferfly_float_neon: 981.7 504.5 223.2 189.5
Signed-off-by: J. Dekker <jdek@itanimul.li>
2024-04-09 16:50:49 +02:00
Geoff Hill
ee1bc723de
avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON
...
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
42e88f18f3
avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON
...
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
69cb34f885
avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON
...
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
6f6bd10531
avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON
...
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:40 +03:00
Geoff Hill
b69486ea18
avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON
...
Signed-off-by: Geoff Hill <geoff@geoffhill.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-04-08 13:36:28 +03:00
Martin Storsjö
f872b19714
aarch64: hevc: Produce plain neon versions of qpel_bi_hv
...
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.
AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00
Martin Storsjö
d21b9a0411
aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
...
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00
Martin Storsjö
5ab138673b
aarch64: hevc: Produce plain neon versions of qpel_uni_hv
...
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.
AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:05:55 +02:00