ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-06-27 19:31:27 +00:00

Author	SHA1	Message	Date
Krzysztof Pyrkosz	83e4b068d9	avcodec/aarch64/aacencdsp: NEON implementation This patch supplies handwritten NEON code for AAC. The benchmarks below were collected by invoking these two commands on each of my boards, A78, A72 and Thinkpad x13s: 1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12 2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is speed without the patch, second, with) - A78 abs_pow34_c: 4161.5 ( 1.00x) abs_pow34_neon: 3586.2 ( 1.16x) quant_bands_signed_c: 5548.0 ( 1.00x) quant_bands_signed_neon: 1126.8 ( 4.92x) quant_bands_unsigned_c: 3979.2 ( 1.00x) quant_bands_unsigned_neon: 800.2 ( 4.97x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=71.6x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=82.3x - A72 abs_pow34_c: 15362.2 ( 1.00x) abs_pow34_neon: 15382.5 ( 1.00x) quant_bands_signed_c: 9926.5 ( 1.00x) quant_bands_signed_neon: 2467.8 ( 4.02x) quant_bands_unsigned_c: 5469.8 ( 1.00x) quant_bands_unsigned_neon: 2089.5 ( 2.62x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=34.3x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=37.8 - x13s abs_pow34_c: 2413.4 ( 1.00x) abs_pow34_neon: 1796.2 ( 1.34x) quant_bands_signed_c: 2968.9 ( 1.00x) quant_bands_signed_neon: 675.6 ( 4.39x) quant_bands_unsigned_c: 2311.9 ( 1.00x) quant_bands_unsigned_neon: 477.1 ( 4.85x) size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 135x size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 159x Signed-off-by: Martin Storsjö <martin@martin.st>	2025-01-28 10:44:40 +02:00
Janne Grunau	430c38f698	aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter This reduces the amount the horizontal filters read beyond the filter width to a consistent 1 pixel. The data is not used so this is usually not noticeable. It becomes a problem when the application allocates frame buffers only for the aligned picture size and the end of it is at a page boundary. This happens for picture sizes which are a multiple of the page size like 1280x640. The frame buffer allocation is based on its most likely done via mmap + MAP_ANONYMOUS so start and end of the buffer are page aligned and the previous and next page are not necessarily mapped. Under these conditions like seen by Firefox a read beyond the end of the buffer results in a segfault. After the over-read is reduced to a single pixel it's reasonable to use VP9's emulated edge motion compensation for this. Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185 Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net> Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2025-01-03 17:53:46 -05:00
Zhao Zhili	952508ae05	aarch64/vvc: Add apply_bdof Test on rpi 5 with gcc 12: apply_bdof_8_8x16_c: 7315.2 ( 1.00x) apply_bdof_8_8x16_neon: 1876.8 ( 3.90x) apply_bdof_8_16x8_c: 7170.5 ( 1.00x) apply_bdof_8_16x8_neon: 1752.8 ( 4.09x) apply_bdof_8_16x16_c: 14695.2 ( 1.00x) apply_bdof_8_16x16_neon: 3490.5 ( 4.21x) apply_bdof_10_8x16_c: 7371.5 ( 1.00x) apply_bdof_10_8x16_neon: 1863.8 ( 3.96x) apply_bdof_10_16x8_c: 7172.0 ( 1.00x) apply_bdof_10_16x8_neon: 1766.0 ( 4.06x) apply_bdof_10_16x16_c: 14551.5 ( 1.00x) apply_bdof_10_16x16_neon: 3576.0 ( 4.07x) apply_bdof_12_8x16_c: 7236.5 ( 1.00x) apply_bdof_12_8x16_neon: 1863.8 ( 3.88x) apply_bdof_12_16x8_c: 7316.5 ( 1.00x) apply_bdof_12_16x8_neon: 1758.8 ( 4.16x) apply_bdof_12_16x16_c: 14691.2 ( 1.00x) apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)	2024-12-21 11:54:44 +08:00
Martin Storsjö	2bb00ef59c	aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions Explicitly use ldur for unaligned offsets; newer versions of armasm64 implicitly convert ldr to ldur as necessary, but older versions require it explicitly written out. This fixes these build errors: ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) : error A2518: operand 2: Memory offset must be aligned ldr s5, [x1, #1] ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) : error A2518: operand 2: Memory offset must be aligned ldr d7, [x1, #2] Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-18 13:45:09 +02:00
Bin Peng	72a3656e84	lavc/aarch64: Fix ff_pred16x16_plane_neon_10 Fix test failure on aarch64: ./tests/checkasm/checkasm --test=h264pred 367840 Signed-off-by: Peng Bin <pengbin@visionular.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-17 14:50:29 +02:00
Bin Peng	decc9e643c	lavc/aarch64: Fix ff_pred8x8_plane_neon_10 Fix test failure on aarch64: ./tests/checkasm/checkasm --test=h264pred 479612 The mismatch between neon and C functions can also be reproduced using the following bitstream and command line. wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264 ./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref ./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon Signed-off-by: Bin Peng <pengbin@visionular.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-17 14:50:29 +02:00
Zhao Zhili	40feba5f77	aarch64/vvc: Fix clip in alf Fix test failure: ./tests/checkasm/checkasm --test=vvc_alf 3607569773	2024-12-10 21:00:47 +08:00
Zhao Zhili	91436638de	aarch64/vvc: Use faster clip operation Replace sqxtn+smin+smax by sqxtun+umin.	2024-12-10 21:00:47 +08:00
Zhao Zhili	bfed5f6b7d	aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma	2024-12-10 21:00:47 +08:00
Zhao Zhili	5988a2729b	aarch64/vvc: Add dmvr dmvr_8_12x20_c: 1.5 ( 1.00x) dmvr_8_12x20_neon: 0.2 ( 6.56x) dmvr_8_20x12_c: 1.0 ( 1.00x) dmvr_8_20x12_neon: 0.2 ( 4.33x) dmvr_8_20x20_c: 1.7 ( 1.00x) dmvr_8_20x20_neon: 0.5 ( 3.63x) dmvr_12_12x20_c: 2.2 ( 1.00x) dmvr_12_12x20_neon: 0.5 ( 4.68x) dmvr_12_20x12_c: 2.0 ( 1.00x) dmvr_12_20x12_neon: 0.5 ( 4.16x) dmvr_12_20x20_c: 3.7 ( 1.00x) dmvr_12_20x20_neon: 0.7 ( 5.14x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Zhao Zhili	bcd65ebd8f	aarch64/vvc: Add dmvr_hv dmvr_hv_8_12x20_c: 8.0 ( 1.00x) dmvr_hv_8_12x20_neon: 1.2 ( 6.62x) dmvr_hv_8_20x12_c: 8.0 ( 1.00x) dmvr_hv_8_20x12_neon: 0.9 ( 8.37x) dmvr_hv_8_20x20_c: 12.9 ( 1.00x) dmvr_hv_8_20x20_neon: 1.7 ( 7.62x) dmvr_hv_10_12x20_c: 7.0 ( 1.00x) dmvr_hv_10_12x20_neon: 1.7 ( 4.09x) dmvr_hv_10_20x12_c: 7.0 ( 1.00x) dmvr_hv_10_20x12_neon: 1.7 ( 4.09x) dmvr_hv_10_20x20_c: 11.2 ( 1.00x) dmvr_hv_10_20x20_neon: 2.7 ( 4.15x) dmvr_hv_12_12x20_c: 6.5 ( 1.00x) dmvr_hv_12_12x20_neon: 1.7 ( 3.79x) dmvr_hv_12_20x12_c: 6.5 ( 1.00x) dmvr_hv_12_20x12_neon: 1.7 ( 3.79x) dmvr_hv_12_20x20_c: 10.2 ( 1.00x) dmvr_hv_12_20x20_neon: 2.2 ( 4.64x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Zhao Zhili	0ba9e8d0d4	aarch64/vvc: Add w_avg w_avg_8_2x2_c: 0.0 ( 0.00x) w_avg_8_2x2_neon: 0.0 ( 0.00x) w_avg_8_4x4_c: 0.2 ( 1.00x) w_avg_8_4x4_neon: 0.0 ( 0.00x) w_avg_8_8x8_c: 1.2 ( 1.00x) w_avg_8_8x8_neon: 0.2 ( 5.00x) w_avg_8_16x16_c: 4.2 ( 1.00x) w_avg_8_16x16_neon: 0.8 ( 5.67x) w_avg_8_32x32_c: 16.2 ( 1.00x) w_avg_8_32x32_neon: 2.5 ( 6.50x) w_avg_8_64x64_c: 64.5 ( 1.00x) w_avg_8_64x64_neon: 9.0 ( 7.17x) w_avg_8_128x128_c: 269.5 ( 1.00x) w_avg_8_128x128_neon: 35.5 ( 7.59x) w_avg_10_2x2_c: 0.2 ( 1.00x) w_avg_10_2x2_neon: 0.2 ( 1.00x) w_avg_10_4x4_c: 0.2 ( 1.00x) w_avg_10_4x4_neon: 0.2 ( 1.00x) w_avg_10_8x8_c: 1.0 ( 1.00x) w_avg_10_8x8_neon: 0.2 ( 4.00x) w_avg_10_16x16_c: 4.2 ( 1.00x) w_avg_10_16x16_neon: 0.8 ( 5.67x) w_avg_10_32x32_c: 16.2 ( 1.00x) w_avg_10_32x32_neon: 2.5 ( 6.50x) w_avg_10_64x64_c: 66.2 ( 1.00x) w_avg_10_64x64_neon: 10.0 ( 6.62x) w_avg_10_128x128_c: 277.8 ( 1.00x) w_avg_10_128x128_neon: 39.8 ( 6.99x) w_avg_12_2x2_c: 0.0 ( 0.00x) w_avg_12_2x2_neon: 0.2 ( 0.00x) w_avg_12_4x4_c: 0.2 ( 1.00x) w_avg_12_4x4_neon: 0.0 ( 0.00x) w_avg_12_8x8_c: 1.2 ( 1.00x) w_avg_12_8x8_neon: 0.5 ( 2.50x) w_avg_12_16x16_c: 4.8 ( 1.00x) w_avg_12_16x16_neon: 0.8 ( 6.33x) w_avg_12_32x32_c: 17.0 ( 1.00x) w_avg_12_32x32_neon: 2.8 ( 6.18x) w_avg_12_64x64_c: 64.0 ( 1.00x) w_avg_12_64x64_neon: 10.0 ( 6.40x) w_avg_12_128x128_c: 269.2 ( 1.00x) w_avg_12_128x128_neon: 42.0 ( 6.41x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Martin Storsjö	a3ec1f8c6c	aarch64: h26x: Fix the indentation of one function Signed-off-by: Martin Storsjö <martin@martin.st>	2024-09-26 13:42:11 +03:00
Zhao Zhili	3f84d1d1fb	aarch64/vvc: Add avg avg_8_2x2_c: 0.2 ( 1.00x) avg_8_2x2_neon: 0.2 ( 1.00x) avg_8_4x4_c: 0.2 ( 1.00x) avg_8_4x4_neon: 0.2 ( 1.00x) avg_8_8x8_c: 0.9 ( 1.00x) avg_8_8x8_neon: 0.2 ( 5.29x) avg_8_16x16_c: 3.7 ( 1.00x) avg_8_16x16_neon: 0.7 ( 5.44x) avg_8_32x32_c: 14.9 ( 1.00x) avg_8_32x32_neon: 1.7 ( 8.91x) avg_8_64x64_c: 59.7 ( 1.00x) avg_8_64x64_neon: 6.9 ( 8.62x) avg_8_128x128_c: 254.7 ( 1.00x) avg_8_128x128_neon: 26.9 ( 9.46x) avg_10_2x2_c: 0.2 ( 1.00x) avg_10_2x2_neon: 0.2 ( 1.00x) avg_10_4x4_c: 0.2 ( 1.00x) avg_10_4x4_neon: 0.2 ( 1.00x) avg_10_8x8_c: 0.9 ( 1.00x) avg_10_8x8_neon: 0.2 ( 5.29x) avg_10_16x16_c: 3.4 ( 1.00x) avg_10_16x16_neon: 0.4 ( 8.06x) avg_10_32x32_c: 13.9 ( 1.00x) avg_10_32x32_neon: 1.9 ( 7.23x) avg_10_64x64_c: 54.2 ( 1.00x) avg_10_64x64_neon: 8.4 ( 6.43x) avg_10_128x128_c: 232.4 ( 1.00x) avg_10_128x128_neon: 30.9 ( 7.52x) avg_12_2x2_c: 0.0 ( 0.00x) avg_12_2x2_neon: 0.2 ( 0.00x) avg_12_4x4_c: 0.4 ( 1.00x) avg_12_4x4_neon: 0.2 ( 2.43x) avg_12_8x8_c: 0.7 ( 1.00x) avg_12_8x8_neon: 0.2 ( 3.86x) avg_12_16x16_c: 3.7 ( 1.00x) avg_12_16x16_neon: 0.4 ( 8.65x) avg_12_32x32_c: 13.7 ( 1.00x) avg_12_32x32_neon: 2.2 ( 6.29x) avg_12_64x64_c: 53.9 ( 1.00x) avg_12_64x64_neon: 7.7 ( 7.03x) avg_12_128x128_c: 270.9 ( 1.00x) avg_12_128x128_neon: 30.4 ( 8.90x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	1be5a2374f	aarch64/vvc: Add put_epel_hv On Apple M1: put_chroma_hv_8_4x4_c: 1.7 ( 1.00x) put_chroma_hv_8_4x4_neon: 0.2 ( 7.67x) put_chroma_hv_8_8x8_c: 5.5 ( 1.00x) put_chroma_hv_8_8x8_neon: 0.5 (11.53x) put_chroma_hv_8_16x16_c: 18.5 ( 1.00x) put_chroma_hv_8_16x16_neon: 1.5 (12.53x) put_chroma_hv_8_32x32_c: 72.5 ( 1.00x) put_chroma_hv_8_32x32_neon: 4.7 (15.34x) put_chroma_hv_8_64x64_c: 274.0 ( 1.00x) put_chroma_hv_8_64x64_neon: 18.5 (14.83x) put_chroma_hv_8_128x128_c: 1058.7 ( 1.00x) put_chroma_hv_8_128x128_neon: 75.2 (14.07x) On Android Pixel 8 Pro: put_chroma_hv_8_4x4_c: 1.2 ( 1.00x) put_chroma_hv_8_4x4_neon: 0.0 ( 0.00x) put_chroma_hv_8_4x4_i8mm: 0.2 ( 5.00x) put_chroma_hv_8_8x8_c: 4.0 ( 1.00x) put_chroma_hv_8_8x8_neon: 0.5 ( 8.00x) put_chroma_hv_8_8x8_i8mm: 0.5 ( 8.00x) put_chroma_hv_8_16x16_c: 15.2 ( 1.00x) put_chroma_hv_8_16x16_neon: 2.5 ( 6.10x) put_chroma_hv_8_16x16_i8mm: 2.2 ( 6.78x) put_chroma_hv_8_32x32_c: 61.0 ( 1.00x) put_chroma_hv_8_32x32_neon: 9.8 ( 6.26x) put_chroma_hv_8_32x32_i8mm: 8.5 ( 7.18x) put_chroma_hv_8_64x64_c: 229.5 ( 1.00x) put_chroma_hv_8_64x64_neon: 38.5 ( 5.96x) put_chroma_hv_8_64x64_i8mm: 34.0 ( 6.75x) put_chroma_hv_8_128x128_c: 919.8 ( 1.00x) put_chroma_hv_8_128x128_neon: 154.5 ( 5.95x) put_chroma_hv_8_128x128_i8mm: 140.0 ( 6.57x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	0dcf204e5d	aarch64/vvc: Add put_epel_h i8mm put_chroma_h_8_4x4_c: 0.4 ( 1.00x) put_chroma_h_8_4x4_neon: 0.0 ( 0.00x) put_chroma_h_8_4x4_i8mm: 0.1 ( 2.67x) put_chroma_h_8_8x8_c: 1.6 ( 1.00x) put_chroma_h_8_8x8_neon: 0.1 (11.00x) put_chroma_h_8_8x8_i8mm: 0.1 (11.00x) put_chroma_h_8_16x16_c: 6.9 ( 1.00x) put_chroma_h_8_16x16_neon: 1.1 ( 6.00x) put_chroma_h_8_16x16_i8mm: 0.7 (10.62x) put_chroma_h_8_32x32_c: 27.6 ( 1.00x) put_chroma_h_8_32x32_neon: 4.7 ( 5.95x) put_chroma_h_8_32x32_i8mm: 4.4 ( 6.28x) put_chroma_h_8_64x64_c: 116.2 ( 1.00x) put_chroma_h_8_64x64_neon: 19.1 ( 6.07x) put_chroma_h_8_64x64_i8mm: 17.1 ( 6.77x) put_chroma_h_8_128x128_c: 466.6 ( 1.00x) put_chroma_h_8_128x128_neon: 81.4 ( 5.73x) put_chroma_h_8_128x128_i8mm: 71.7 ( 6.51x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	41a1885f7a	aarch64/vvc: Add put_epel_h put_chroma_h_8_4x4_c: 0.2 ( 1.00x) put_chroma_h_8_4x4_neon: 0.2 ( 1.00x) put_chroma_h_8_8x8_c: 0.8 ( 1.00x) put_chroma_h_8_8x8_neon: 0.2 ( 3.00x) put_chroma_h_8_16x16_c: 3.8 ( 1.00x) put_chroma_h_8_16x16_neon: 0.8 ( 5.00x) put_chroma_h_8_32x32_c: 12.5 ( 1.00x) put_chroma_h_8_32x32_neon: 2.2 ( 5.56x) put_chroma_h_8_64x64_c: 47.0 ( 1.00x) put_chroma_h_8_64x64_neon: 8.8 ( 5.37x) put_chroma_h_8_128x128_c: 200.2 ( 1.00x) put_chroma_h_8_128x128_neon: 31.8 ( 6.31x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	260e1b4b62	aarch64/vvc: Add sad sad_8x16_c: 0.8 ( 1.00x) sad_8x16_neon: 0.2 ( 3.00x) sad_16x8_c: 0.5 ( 1.00x) sad_16x8_neon: 0.2 ( 2.00x) sad_16x16_c: 1.5 ( 1.00x) sad_16x16_neon: 0.2 ( 6.00x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	5ac6925803	aarch64/vvc: Add put_qpel_hv With Apple M1 (no i8mm): put_luma_hv_8_4x4_c: 2.2 ( 1.00x) put_luma_hv_8_4x4_neon: 0.8 ( 3.00x) put_luma_hv_8_8x8_c: 7.0 ( 1.00x) put_luma_hv_8_8x8_neon: 0.8 ( 9.33x) put_luma_hv_8_16x16_c: 22.8 ( 1.00x) put_luma_hv_8_16x16_neon: 2.5 ( 9.10x) put_luma_hv_8_32x32_c: 84.8 ( 1.00x) put_luma_hv_8_32x32_neon: 9.5 ( 8.92x) put_luma_hv_8_64x64_c: 333.0 ( 1.00x) put_luma_hv_8_64x64_neon: 35.5 ( 9.38x) put_luma_hv_8_128x128_c: 1294.5 ( 1.00x) put_luma_hv_8_128x128_neon: 137.8 ( 9.40x) With Pixel 8 Pro: put_luma_hv_8_4x4_c: 5.0 ( 1.00x) put_luma_hv_8_4x4_neon: 0.8 ( 6.67x) put_luma_hv_8_4x4_i8mm: 0.2 (20.00x) put_luma_hv_8_8x8_c: 13.2 ( 1.00x) put_luma_hv_8_8x8_neon: 1.2 (10.60x) put_luma_hv_8_8x8_i8mm: 1.2 (10.60x) put_luma_hv_8_16x16_c: 44.2 ( 1.00x) put_luma_hv_8_16x16_neon: 4.5 ( 9.83x) put_luma_hv_8_16x16_i8mm: 4.2 (10.41x) put_luma_hv_8_32x32_c: 160.8 ( 1.00x) put_luma_hv_8_32x32_neon: 17.5 ( 9.19x) put_luma_hv_8_32x32_i8mm: 16.0 (10.05x) put_luma_hv_8_64x64_c: 611.2 ( 1.00x) put_luma_hv_8_64x64_neon: 68.0 ( 8.99x) put_luma_hv_8_64x64_i8mm: 62.2 ( 9.82x) put_luma_hv_8_128x128_c: 2384.8 ( 1.00x) put_luma_hv_8_128x128_neon: 268.8 ( 8.87x) put_luma_hv_8_128x128_i8mm: 245.8 ( 9.70x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	a0b52afd32	aarch64/vvc: Add put_qpel_vx put_luma_v_8_4x4_c: 1.0 ( 1.00x) put_luma_v_8_4x4_neon: 0.0 ( 0.00x) put_luma_v_8_8x8_c: 3.5 ( 1.00x) put_luma_v_8_8x8_neon: 0.5 ( 7.00x) put_luma_v_8_16x16_c: 13.8 ( 1.00x) put_luma_v_8_16x16_neon: 1.2 (11.00x) put_luma_v_8_32x32_c: 54.2 ( 1.00x) put_luma_v_8_32x32_neon: 5.0 (10.85x) put_luma_v_8_64x64_c: 217.5 ( 1.00x) put_luma_v_8_64x64_neon: 18.8 (11.60x) put_luma_v_8_128x128_c: 886.2 ( 1.00x) put_luma_v_8_128x128_neon: 74.0 (11.98x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	b051bc7cb8	aarch64/h26x: Remove duplicate b.eq instruction b.eq is added by calc_all after each calc.	2024-09-14 16:36:34 +08:00
Zhao Zhili	9f6c8eb412	aarch64/vvc: Add put_qpel_hx i8mm Benchmark on Android pixel 8 with -fno-vectorize put_luma_h_8_4x4_c: 0.2 ( 1.00x) put_luma_h_8_4x4_neon: 0.2 ( 1.00x) put_luma_h_8_4x4_i8mm: 0.0 ( 0.00x) put_luma_h_8_8x8_c: 1.5 ( 1.00x) put_luma_h_8_8x8_neon: 0.5 ( 3.00x) put_luma_h_8_8x8_i8mm: 0.5 ( 3.00x) put_luma_h_8_16x16_c: 6.2 ( 1.00x) put_luma_h_8_16x16_neon: 2.0 ( 3.12x) put_luma_h_8_16x16_i8mm: 1.5 ( 4.17x) put_luma_h_8_32x32_c: 25.5 ( 1.00x) put_luma_h_8_32x32_neon: 9.0 ( 2.83x) put_luma_h_8_32x32_i8mm: 6.8 ( 3.78x) put_luma_h_8_64x64_c: 99.8 ( 1.00x) put_luma_h_8_64x64_neon: 35.2 ( 2.83x) put_luma_h_8_64x64_i8mm: 27.2 ( 3.66x) put_luma_h_8_128x128_c: 422.0 ( 1.00x) put_luma_h_8_128x128_neon: 138.5 ( 3.05x) put_luma_h_8_128x128_i8mm: 109.2 ( 3.86x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	25448d1716	aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w put_luma_pixels_8_4x4_c: 0.2 ( 1.00x) put_luma_pixels_8_4x4_neon: 0.2 ( 1.00x) put_luma_pixels_8_8x8_c: 0.7 ( 1.00x) put_luma_pixels_8_8x8_neon: 0.2 ( 3.22x) put_luma_pixels_8_16x16_c: 2.2 ( 1.00x) put_luma_pixels_8_16x16_neon: 0.2 ( 9.89x) put_luma_pixels_8_32x32_c: 8.2 ( 1.00x) put_luma_pixels_8_32x32_neon: 1.2 ( 6.71x) put_luma_pixels_8_64x64_c: 33.7 ( 1.00x) put_luma_pixels_8_64x64_neon: 2.5 (13.63x) put_luma_pixels_8_128x128_c: 145.5 ( 1.00x) put_luma_pixels_8_128x128_neon: 10.2 (14.23x) put_uni_pixels_luma_8_4x4_c: 0.5 ( 1.00x) put_uni_pixels_luma_8_4x4_neon: 0.0 ( 0.00x) put_uni_pixels_luma_8_8x8_c: 0.5 ( 1.00x) put_uni_pixels_luma_8_8x8_neon: 0.2 ( 2.11x) put_uni_pixels_luma_8_16x16_c: 1.2 ( 1.00x) put_uni_pixels_luma_8_16x16_neon: 0.2 ( 5.44x) put_uni_pixels_luma_8_32x32_c: 3.0 ( 1.00x) put_uni_pixels_luma_8_32x32_neon: 0.5 ( 6.26x) put_uni_pixels_luma_8_64x64_c: 3.0 ( 1.00x) put_uni_pixels_luma_8_64x64_neon: 1.7 ( 1.72x) put_uni_pixels_luma_8_128x128_c: 6.5 ( 1.00x) put_uni_pixels_luma_8_128x128_neon: 6.5 ( 1.00x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	20f2bf5530	aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_* Just share hevc implementation. checkasm --test=vvc_mc --benchmark: put_luma_h_8_4x4_c: 0.2 ( 1.00x) put_luma_h_8_4x4_neon: 0.2 ( 1.00x) put_luma_h_8_8x8_c: 1.0 ( 1.00x) put_luma_h_8_8x8_neon: 0.2 ( 4.33x) put_luma_h_8_16x16_c: 3.2 ( 1.00x) put_luma_h_8_16x16_neon: 1.2 ( 2.63x) put_luma_h_8_32x32_c: 13.7 ( 1.00x) put_luma_h_8_32x32_neon: 4.0 ( 3.45x) put_luma_h_8_64x64_c: 48.2 ( 1.00x) put_luma_h_8_64x64_neon: 15.7 ( 3.07x) put_luma_h_8_128x128_c: 203.5 ( 1.00x) put_luma_h_8_128x128_neon: 62.0 ( 3.28x) put_uni_h_luma_8_4x4_c: 0.2 ( 1.00x) put_uni_h_luma_8_4x4_neon: 0.2 ( 1.00x) put_uni_h_luma_8_8x8_c: 1.5 ( 1.00x) put_uni_h_luma_8_8x8_neon: 0.2 ( 6.56x) put_uni_h_luma_8_16x16_c: 5.7 ( 1.00x) put_uni_h_luma_8_16x16_neon: 1.2 ( 4.67x) put_uni_h_luma_8_32x32_c: 24.0 ( 1.00x) put_uni_h_luma_8_32x32_neon: 4.7 ( 5.07x) put_uni_h_luma_8_64x64_c: 90.0 ( 1.00x) put_uni_h_luma_8_64x64_neon: 17.0 ( 5.30x) put_uni_h_luma_8_128x128_c: 357.7 ( 1.00x) put_uni_h_luma_8_128x128_neon: 67.5 ( 5.30x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	46f07ce7d1	aarch64/hevc: Move epel/qpel to h26x directory So vvc can reuse the implementation.	2024-09-14 16:36:34 +08:00
Zhao Zhili	8beafb5656	aarch64/hevc: Simplify function prototypes by macro	2024-09-14 16:36:34 +08:00
Anton Khirnov	3f9ca51015	lavc/opus*: move to opus/ subdir	2024-09-02 11:56:53 +02:00
Ramiro Polla	6aafe61285	avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t	2024-09-01 13:42:30 +02:00
Zhao Zhili	4c0372281b	aarch64/vvc: Bind h26x/sao filter implementation to vvc Reviewed-by: Martin Storsjö <martin@martin.st>	2024-08-31 16:07:50 +08:00
Zhao Zhili	8cc10298a7	aarch64/hevc: Move sao to h26x directory So vvc can reuse the implementation. Reviewed-by: Martin Storsjö <martin@martin.st>	2024-08-31 16:07:43 +08:00
Ramiro Polla	8c203ea7c7	avcodec/aarch64/mpegvideoencdsp: add dotprod implementation for pix_norm1 A55 A76 pix_norm1_c: 484.3 235.2 pix_norm1_neon: 193.8 ( 2.50x) 44.7 ( 5.26x) pix_norm1_dotprod: 91.8 ( 5.28x) 21.2 (11.09x)	2024-08-26 12:49:04 +02:00
Ramiro Polla	9f68a3712e	avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1 A55 A76 pix_norm1_c: 478.2 234.2 pix_norm1_neon: 188.2 ( 2.54x) 41.2 ( 5.68x) pix_sum_c: 304.2 244.0 pix_sum_neon: 77.2 ( 3.94x) 21.5 (11.35x)	2024-08-26 12:48:31 +02:00
Ramiro Polla	5c1c0325cd	avcodec/aarch64/me_cmp: add dotprod implementations of sse16 and vsse_intra16 checkasm --bench for Raspberry Pi 5 Model B Rev 1.0: sse_0_c: 241.5 sse_0_neon: 37.2 sse_0_dotprod: 22.2 vsse_4_c: 148.7 vsse_4_neon: 31.0 vsse_4_dotprod: 15.7	2024-08-17 15:31:48 +02:00
Martin Storsjö	4acb9b7d10	aarch64: vvc: Fix unnecessary extra spaces Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 16:04:28 +03:00
Martin Storsjö	99598629e8	aarch64: vvc: Consistently use # for immediate constants Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 15:24:37 +03:00
Martin Storsjö	400843151d	aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older Use the "ldur" instruction explicitly, instead of having the assembler implicitly convert "ldr" instructions to "ldur". This fixes build errors like these: libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned ldr q22, [x3, #24] libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned ldr q24, [x2, #24] libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned ldr q22, [x3, #24] libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned ldr q24, [x2, #24] Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 15:24:33 +03:00
Zhao Zhili	2d4ef304c9	avcodec/vvc: Add aarch64 neon optimization for ALF vvc_alf_filter_chroma_4x4_8_c: 3.0 vvc_alf_filter_chroma_4x4_8_neon: 1.0 vvc_alf_filter_chroma_4x4_10_c: 2.7 vvc_alf_filter_chroma_4x4_10_neon: 1.0 vvc_alf_filter_chroma_4x4_12_c: 2.7 vvc_alf_filter_chroma_4x4_12_neon: 1.0 vvc_alf_filter_chroma_8x8_8_c: 10.2 vvc_alf_filter_chroma_8x8_8_neon: 3.0 vvc_alf_filter_chroma_8x8_10_c: 10.0 vvc_alf_filter_chroma_8x8_10_neon: 2.5 vvc_alf_filter_chroma_8x8_12_c: 10.0 vvc_alf_filter_chroma_8x8_12_neon: 2.5 vvc_alf_filter_chroma_16x16_8_c: 41.7 vvc_alf_filter_chroma_16x16_8_neon: 11.2 vvc_alf_filter_chroma_16x16_10_c: 39.0 vvc_alf_filter_chroma_16x16_10_neon: 10.0 vvc_alf_filter_chroma_16x16_12_c: 40.2 vvc_alf_filter_chroma_16x16_12_neon: 10.2 vvc_alf_filter_chroma_32x32_8_c: 162.0 vvc_alf_filter_chroma_32x32_8_neon: 45.0 vvc_alf_filter_chroma_32x32_10_c: 155.5 vvc_alf_filter_chroma_32x32_10_neon: 39.5 vvc_alf_filter_chroma_32x32_12_c: 155.5 vvc_alf_filter_chroma_32x32_12_neon: 40.0 vvc_alf_filter_chroma_64x64_8_c: 646.0 vvc_alf_filter_chroma_64x64_8_neon: 175.5 vvc_alf_filter_chroma_64x64_10_c: 708.2 vvc_alf_filter_chroma_64x64_10_neon: 166.7 vvc_alf_filter_chroma_64x64_12_c: 619.2 vvc_alf_filter_chroma_64x64_12_neon: 157.2 vvc_alf_filter_chroma_128x128_8_c: 2611.5 vvc_alf_filter_chroma_128x128_8_neon: 698.2 vvc_alf_filter_chroma_128x128_10_c: 2470.0 vvc_alf_filter_chroma_128x128_10_neon: 616.0 vvc_alf_filter_chroma_128x128_12_c: 2531.5 vvc_alf_filter_chroma_128x128_12_neon: 620.2 vvc_alf_filter_luma_8x8_8_c: 25.2 vvc_alf_filter_luma_8x8_8_neon: 4.2 vvc_alf_filter_luma_8x8_10_c: 18.5 vvc_alf_filter_luma_8x8_10_neon: 4.0 vvc_alf_filter_luma_8x8_12_c: 19.0 vvc_alf_filter_luma_8x8_12_neon: 4.0 vvc_alf_filter_luma_16x16_8_c: 106.5 vvc_alf_filter_luma_16x16_8_neon: 16.2 vvc_alf_filter_luma_16x16_10_c: 75.2 vvc_alf_filter_luma_16x16_10_neon: 14.7 vvc_alf_filter_luma_16x16_12_c: 79.7 vvc_alf_filter_luma_16x16_12_neon: 14.7 vvc_alf_filter_luma_32x32_8_c: 400.5 vvc_alf_filter_luma_32x32_8_neon: 63.2 vvc_alf_filter_luma_32x32_10_c: 299.2 vvc_alf_filter_luma_32x32_10_neon: 57.7 vvc_alf_filter_luma_32x32_12_c: 299.2 vvc_alf_filter_luma_32x32_12_neon: 57.7 vvc_alf_filter_luma_64x64_8_c: 1602.5 vvc_alf_filter_luma_64x64_8_neon: 251.7 vvc_alf_filter_luma_64x64_10_c: 1197.0 vvc_alf_filter_luma_64x64_10_neon: 235.5 vvc_alf_filter_luma_64x64_12_c: 1220.2 vvc_alf_filter_luma_64x64_12_neon: 235.7 vvc_alf_filter_luma_128x128_8_c: 6570.2 vvc_alf_filter_luma_128x128_8_neon: 1007.7 vvc_alf_filter_luma_128x128_10_c: 4822.7 vvc_alf_filter_luma_128x128_10_neon: 936.2 vvc_alf_filter_luma_128x128_12_c: 4791.2 vvc_alf_filter_luma_128x128_12_neon: 938.5 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-22 21:09:56 +08:00
Anton Khirnov	e4601cc339	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
Ramiro Polla	d4d09c8e42	lavc/aarch64/fdct: add neon-optimized fdct for aarch64 The code is imported from libjpeg-turbo-3.0.1. The neon registers used have been changed to avoid modifying v8-v15. Reviewed-by: Martin Storsjö <martin@martin.st>	2024-05-13 14:54:10 +02:00
Ramiro Polla	27f6211c74	lavc/aarch64: fix include for cpu.h	2024-05-13 14:50:38 +02:00
Lynne	134dba9544	opusdsp: add ability to modify deemphasis constant xHE-AAC relies on the same postfilter mechanism that Opus uses to improve clarity (albeit with a steeper deemphasis filter). The code to apply it is identical, it's still just a simple IIR low-pass filter. This commit makes it possible to use alternative constants.	2024-04-27 11:12:07 +02:00
Martin Storsjö	359b6a7f8a	aarch64/ac3dsp: simplify the end of ff_ac3_sum_square_butterfly_float_neon Before: Cortex A53 A72 A78 M1 ac3_sum_square_bufferfly_float_neon: 1005.7 516.5 224.5 194.0 After: ac3_sum_square_bufferfly_float_neon: 981.7 504.5 223.2 189.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2024-04-09 16:50:49 +02:00
Geoff Hill	ee1bc723de	avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	42e88f18f3	avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	69cb34f885	avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	6f6bd10531	avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	b69486ea18	avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:28 +03:00
Martin Storsjö	f872b19714	aarch64: hevc: Produce plain neon versions of qpel_bi_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_bi_hv4_8_c: 385.7 put_hevc_qpel_bi_hv4_8_neon: 131.0 put_hevc_qpel_bi_hv4_8_i8mm: 92.2 put_hevc_qpel_bi_hv6_8_c: 701.0 put_hevc_qpel_bi_hv6_8_neon: 239.5 put_hevc_qpel_bi_hv6_8_i8mm: 191.0 put_hevc_qpel_bi_hv8_8_c: 1162.0 put_hevc_qpel_bi_hv8_8_neon: 228.0 put_hevc_qpel_bi_hv8_8_i8mm: 225.2 put_hevc_qpel_bi_hv12_8_c: 2305.0 put_hevc_qpel_bi_hv12_8_neon: 558.0 put_hevc_qpel_bi_hv12_8_i8mm: 483.2 put_hevc_qpel_bi_hv16_8_c: 3965.2 put_hevc_qpel_bi_hv16_8_neon: 732.7 put_hevc_qpel_bi_hv16_8_i8mm: 656.5 put_hevc_qpel_bi_hv24_8_c: 8709.7 put_hevc_qpel_bi_hv24_8_neon: 1555.2 put_hevc_qpel_bi_hv24_8_i8mm: 1448.7 put_hevc_qpel_bi_hv32_8_c: 14818.0 put_hevc_qpel_bi_hv32_8_neon: 2763.7 put_hevc_qpel_bi_hv32_8_i8mm: 2468.0 put_hevc_qpel_bi_hv48_8_c: 32855.5 put_hevc_qpel_bi_hv48_8_neon: 6107.2 put_hevc_qpel_bi_hv48_8_i8mm: 5452.7 put_hevc_qpel_bi_hv64_8_c: 57591.5 put_hevc_qpel_bi_hv64_8_neon: 10660.2 put_hevc_qpel_bi_hv64_8_i8mm: 9580.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	d21b9a0411	aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. AWS Graviton 3: put_hevc_qpel_uni_w_hv4_8_c: 422.2 put_hevc_qpel_uni_w_hv4_8_neon: 140.7 put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7 put_hevc_qpel_uni_w_hv8_8_c: 1208.0 put_hevc_qpel_uni_w_hv8_8_neon: 268.2 put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5 put_hevc_qpel_uni_w_hv16_8_c: 4297.2 put_hevc_qpel_uni_w_hv16_8_neon: 802.2 put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2 put_hevc_qpel_uni_w_hv32_8_c: 15518.5 put_hevc_qpel_uni_w_hv32_8_neon: 3085.2 put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2 put_hevc_qpel_uni_w_hv64_8_c: 57254.5 put_hevc_qpel_uni_w_hv64_8_neon: 11787.5 put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	5ab138673b	aarch64: hevc: Produce plain neon versions of qpel_uni_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_uni_hv4_8_c: 384.2 put_hevc_qpel_uni_hv4_8_neon: 127.5 put_hevc_qpel_uni_hv4_8_i8mm: 85.5 put_hevc_qpel_uni_hv6_8_c: 705.5 put_hevc_qpel_uni_hv6_8_neon: 224.5 put_hevc_qpel_uni_hv6_8_i8mm: 176.2 put_hevc_qpel_uni_hv8_8_c: 1136.5 put_hevc_qpel_uni_hv8_8_neon: 216.5 put_hevc_qpel_uni_hv8_8_i8mm: 214.0 put_hevc_qpel_uni_hv12_8_c: 2259.5 put_hevc_qpel_uni_hv12_8_neon: 498.5 put_hevc_qpel_uni_hv12_8_i8mm: 410.7 put_hevc_qpel_uni_hv16_8_c: 3824.7 put_hevc_qpel_uni_hv16_8_neon: 670.0 put_hevc_qpel_uni_hv16_8_i8mm: 603.7 put_hevc_qpel_uni_hv24_8_c: 8113.5 put_hevc_qpel_uni_hv24_8_neon: 1474.7 put_hevc_qpel_uni_hv24_8_i8mm: 1351.5 put_hevc_qpel_uni_hv32_8_c: 14744.5 put_hevc_qpel_uni_hv32_8_neon: 2599.7 put_hevc_qpel_uni_hv32_8_i8mm: 2266.0 put_hevc_qpel_uni_hv48_8_c: 32800.0 put_hevc_qpel_uni_hv48_8_neon: 5650.0 put_hevc_qpel_uni_hv48_8_i8mm: 5011.7 put_hevc_qpel_uni_hv64_8_c: 57856.2 put_hevc_qpel_uni_hv64_8_neon: 9863.5 put_hevc_qpel_uni_hv64_8_i8mm: 8767.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00

1 2 3 4 5 ...

439 commits