ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-02-18 21:40:22 +00:00

Krzysztof Pyrkosz c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines

The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.

The 3210 variant can be implemented using rev32, but surprisingly it is
slower than the generic TBL on A78, but much faster on A72.

There may be some room for improvement. Possibly instead of handling
last 8 and then 4 bytes separately, we can load these 4 into {v0.s}[2]
and process along with the last 8 bytes.

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:       75.5 ( 1.00x)
shuffle_bytes_0321_neon:    26.5 ( 2.85x)
shuffle_bytes_1203_c:      136.2 ( 1.00x)
shuffle_bytes_1203_neon:    27.2 ( 5.00x)
shuffle_bytes_1230_c:      135.5 ( 1.00x)
shuffle_bytes_1230_neon:    28.0 ( 4.84x)
shuffle_bytes_2013_c:      138.8 ( 1.00x)
shuffle_bytes_2013_neon:    22.0 ( 6.31x)
shuffle_bytes_2103_c:       76.5 ( 1.00x)
shuffle_bytes_2103_neon:    20.5 ( 3.73x)
shuffle_bytes_2130_c:      137.5 ( 1.00x)
shuffle_bytes_2130_neon:    28.0 ( 4.91x)
shuffle_bytes_3012_c:      138.2 ( 1.00x)
shuffle_bytes_3012_neon:    21.5 ( 6.43x)
shuffle_bytes_3102_c:      138.2 ( 1.00x)
shuffle_bytes_3102_neon:    27.2 ( 5.07x)
shuffle_bytes_3210_c:      138.0 ( 1.00x)
shuffle_bytes_3210_neon:    22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:      139.0 ( 1.00x)
shuffle_bytes_3210_neon:    28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:      120.0 ( 1.00x)
shuffle_bytes_0321_neon:    36.0 ( 3.33x)
shuffle_bytes_1203_c:      188.2 ( 1.00x)
shuffle_bytes_1203_neon:    37.8 ( 4.99x)
shuffle_bytes_1230_c:      195.0 ( 1.00x)
shuffle_bytes_1230_neon:    36.0 ( 5.42x)
shuffle_bytes_2013_c:      195.8 ( 1.00x)
shuffle_bytes_2013_neon:    43.5 ( 4.50x)
shuffle_bytes_2103_c:      117.2 ( 1.00x)
shuffle_bytes_2103_neon:    53.5 ( 2.19x)
shuffle_bytes_2130_c:      203.2 ( 1.00x)
shuffle_bytes_2130_neon:    37.8 ( 5.38x)
shuffle_bytes_3012_c:      183.8 ( 1.00x)
shuffle_bytes_3012_neon:    46.8 ( 3.93x)
shuffle_bytes_3102_c:      180.8 ( 1.00x)
shuffle_bytes_3102_neon:    37.8 ( 4.79x)
shuffle_bytes_3210_c:      195.8 ( 1.00x)
shuffle_bytes_3210_neon:    37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:      194.8 ( 1.00x)
shuffle_bytes_3210_neon:    30.8 ( 6.33x)

- x13s:
shuffle_bytes_0321_c:       49.4 ( 1.00x)
shuffle_bytes_0321_neon:    18.1 ( 2.72x)
shuffle_bytes_1203_c:       98.4 ( 1.00x)
shuffle_bytes_1203_neon:    18.4 ( 5.35x)
shuffle_bytes_1230_c:       97.4 ( 1.00x)
shuffle_bytes_1230_neon:    19.1 ( 5.09x)
shuffle_bytes_2013_c:      101.4 ( 1.00x)
shuffle_bytes_2013_neon:    16.9 ( 6.01x)
shuffle_bytes_2103_c:       53.9 ( 1.00x)
shuffle_bytes_2103_neon:    13.9 ( 3.88x)
shuffle_bytes_2130_c:      100.9 ( 1.00x)
shuffle_bytes_2130_neon:    19.1 ( 5.27x)
shuffle_bytes_3012_c:       97.4 ( 1.00x)
shuffle_bytes_3012_neon:    17.1 ( 5.69x)
shuffle_bytes_3102_c:      100.9 ( 1.00x)
shuffle_bytes_3102_neon:    19.1 ( 5.27x)
shuffle_bytes_3210_c:      100.6 ( 1.00x)
shuffle_bytes_3210_neon:    16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:      100.6 ( 1.00x)
shuffle_bytes_3210_neon:    18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>		2025-02-07 12:54:55 +02:00
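
The table-driven approach described in the commit message can be illustrated with a portable C model. This is only a sketch, not the committed AArch64 assembly: `tbl16`, `make_shuffle_table`, and `shuffle_bytes_model` are hypothetical names, the TBL instruction is emulated with a scalar loop, and the tail is handled 4 bytes at a time rather than with the specialized 8-byte and 4-byte blocks the commit mentions. It assumes the naming convention in which a variant such as 2103 means `dst[j] = src[perm[j]]` within each 4-byte group, and that `src` and `dst` do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one TBL lookup over a single 16-byte table register:
 * each index selects a byte of `table`; out-of-range indices yield 0. */
static void tbl16(uint8_t out[16], const uint8_t table[16],
                  const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Expand a 4-entry permutation (e.g. {2,1,0,3} for the "2103" variant)
 * into 16 byte indices covering four consecutive 4-byte groups. */
static void make_shuffle_table(uint8_t idx[16], const uint8_t perm[4])
{
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)((i & ~3) + perm[i & 3]);
}

/* Hypothetical helper: dst[4k + j] = src[4k + perm[j]] for len bytes.
 * src and dst must not overlap; len must be a multiple of 4. */
static void shuffle_bytes_model(const uint8_t *src, uint8_t *dst,
                                size_t len, const uint8_t perm[4])
{
    uint8_t idx[16];
    size_t i = 0;

    assert(len % 4 == 0);
    make_shuffle_table(idx, perm);

    /* Main loop: one table lookup per 16 input bytes. */
    for (; i + 16 <= len; i += 16)
        tbl16(dst + i, src + i, idx);

    /* Tail: remaining 4-byte groups, handled without the table. */
    for (; i < len; i += 4)
        for (int j = 0; j < 4; j++)
            dst[i + j] = src[i + perm[j]];
}
```

Because the index table is built once per permutation, every variant shares the same inner loop. That fits the commit's observation that a special-cased instruction such as rev32 for the 3210 variant is an optional optimization whose benefit depends on the core.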
..
aarch64	checkasm: aarch64: Check for stack overflows	2020-05-15 21:22:36 +03:00
arm	checkasm: arm: Check for stack overflows	2020-05-15 21:22:34 +03:00
riscv	checkasm/riscv: preserve T1 whilst calling...	2024-08-01 18:44:01 +03:00
x86	x86: replace explicit REP_RETs with RETs	2023-02-01 04:23:55 +01:00
.gitignore	Split global .gitignore file into per-directory files	2016-05-13 14:55:56 +02:00
aacencdsp.c	x86/aacencdsp: add AVX version of quantize_bands	2024-06-09 12:29:49 -03:00
aacpsdsp.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
ac3dsp.c	checkasm: Increase the tolerance for ac3_sum_square_butterfly_float	2024-07-24 12:10:33 +03:00
af_afir.c	checkasm: test for dcmul_add	2023-11-27 17:55:24 +02:00
alacdsp.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
audiodsp.c	checkasm/audiodsp: Be strict about MMX	2022-10-11 14:18:54 +02:00
av_tx.c	avutil/common: Don't auto-include mem.h	2024-03-31 00:08:43 +01:00
blockdsp.c	checkasm/blockdsp: use smallest allowed aligned buffers for fill_block_tab tests	2024-05-08 21:13:23 -03:00
bswapdsp.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
checkasm.c	checkasm: Print benchmarks of C-only functions	2024-12-11 10:51:15 +02:00
checkasm.h	checkasm: vvc: Use checkasm_check for printing failing output	2024-12-10 11:26:09 +02:00
diracdsp.c	tests/checkasm/diracdsp: fix alignment for src and ombc_weight buffers	2024-11-19 12:32:49 -03:00
exrdsp.c	tests/checkasm: Improve included headers	2024-03-02 02:54:12 +01:00
fdctdsp.c	checkasm: add test for fdct	2024-05-11 10:28:59 +02:00
fixed_dsp.c	configure: Remove av_restrict	2024-03-15 12:51:15 +01:00
flacdsp.c	checkasm/flacdsp: add a test for lpc33	2024-05-24 09:23:00 -03:00
float_dsp.c	checkasm/float_dsp: add double-precision scalar product	2024-05-31 22:22:43 +03:00
fmtconvert.c	avcodec/fmtconvert: Remove unused AVCodecContext parameter	2022-09-21 20:26:40 +02:00
g722dsp.c	checkasm: add a g722dsp test	2017-07-13 17:00:19 -03:00
h263dsp.c	checkasm: add h263dsp.{h,v}_loop_filter	2024-05-27 22:42:07 +03:00
h264chroma.c	checkasm: Fix h264chroma test name	2024-05-11 11:36:20 +03:00
h264dsp.c	checkasm/h264dsp: test TX bypass	2024-07-21 22:36:48 +03:00
h264pred.c	tests/checkasm: Improve included headers	2024-03-02 02:54:12 +01:00
h264qpel.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
hevc_add_res.c	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
hevc_deblock.c	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
hevc_idct.c	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
hevc_pel.c	checkasm: vvc: Use checkasm_check for printing failing output	2024-12-10 11:26:09 +02:00
hevc_sao.c	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
huffyuvdsp.c	tests/checkasm/huffyuvdsp: Use correct function pointer type	2024-05-17 13:29:34 +02:00
idctdsp.c	avcodec/idctdsp: Avoid inclusion of avcodec.h	2023-09-11 00:26:34 +02:00
jpeg2000dsp.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
llauddsp.c	tests/checkasm/llauddsp: Avoid UB integer overflow	2024-05-17 13:16:58 +02:00
lls.c	checkasm: lls: Use relative tolerances rather than absolute ones	2024-10-09 15:52:56 +03:00
llviddsp.c	tests/checkasm: Fix build error when enable linux perf on Android	2024-06-11 01:11:46 +08:00
llviddspenc.c	tests/checkasm/llvidencdsp: Don't use declare_func_emms	2023-09-04 11:04:45 +02:00
lpc.c	checkasm/lpc: use fixed length to bench apply_welch_window	2024-05-31 17:06:08 -03:00
Makefile	checkasm/diracdsp: test add_dirac_obmc	2024-11-15 13:44:53 -05:00
motion.c	avcodec/me_cmp: Zero MECmpContext in ff_me_cmp_init()	2024-06-20 18:58:38 +02:00
mpegvideoencdsp.c	avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t	2024-09-01 13:42:30 +02:00
opusdsp.c	lavc/opus*: move to opus/ subdir	2024-09-02 11:56:53 +02:00
pixblockdsp.c	configure: Remove av_restrict	2024-03-15 12:51:15 +01:00
rv34dsp.c	checkasm/rv34dsp: add rv34_idct_dc_add test	2024-02-17 14:33:35 +02:00
rv40dsp.c	checkasm/rv40dsp: cover more cases	2024-12-10 11:24:45 -05:00
sbrdsp.c	checkasm: test the noise case of sbrdsp.hf_apply_noise	2023-11-13 18:34:29 +02:00
svq1enc.c	tests/checkasm/svq1enc: Use proper range for input	2024-05-09 13:40:18 +02:00
sw_gbrp.c	swscale: eliminate redundant SwsInternal accesses	2024-11-25 10:59:52 +01:00
sw_range_convert.c	swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats	2024-12-05 21:10:29 +01:00
sw_rgb.c	swscale/aarch64/rgb2rgb: Implemented NEON shuf routines	2025-02-07 12:54:55 +02:00
sw_scale.c	checkasm/sw_scale: add test for yuv2nv12cX	2024-12-23 11:20:58 +01:00
sw_yuv2rgb.c	swscale: rename SwsContext to SwsInternal	2024-10-24 22:50:00 +02:00
sw_yuv2yuv.c	swscale: rename SwsContext to SwsInternal	2024-10-24 22:50:00 +02:00
synth_filter.c	dca_core: convert to lavu/tx	2022-11-06 14:39:36 +01:00
takdsp.c	avcodec/takdsp: fix const correctness	2023-12-22 09:28:04 -03:00
utvideodsp.c	tests/checkasm: Improve included headers	2024-03-02 02:54:12 +01:00
v210dec.c	checkasm/v210dec: add extra space to the destination arrays	2022-12-21 00:36:49 +01:00
v210enc.c	checkasm/v210enc: test the entire width of 10-bit planar input arrays	2022-12-01 18:19:03 +01:00
vc1dsp.c	checkasm: vc1dsp: Align buffers sufficiently for the mspel tests	2024-04-30 23:13:47 +03:00
vf_blend.c	tests/checkasm/vf_blend: Update function type	2024-05-17 13:35:33 +02:00
vf_bwdif.c	tests/checkasm/vf_bwdif: Use correct function pointer type	2024-05-17 13:31:37 +02:00
vf_colorspace.c	tests/checkasm/vf_colorspace: Use correct function pointer type	2024-05-17 13:31:23 +02:00
vf_convolution.c	libavfilter/x86/vf_convolution: add sobel filter optimization and unit test with intel AVX512 VNNI	2022-11-14 10:04:16 +08:00
vf_eq.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
vf_gblur.c	avutil/common: Don't auto-include mem.h	2024-03-31 00:08:43 +01:00
vf_hflip.c	avfilter/vf_hflip: Move ff_hflip_init into a header	2022-05-06 05:19:50 +02:00
vf_nlmeans.c	avutil/common: Don't auto-include mem.h	2024-03-31 00:08:43 +01:00
vf_threshold.c	avfilter/vf_threshold: Move ff_threshold_init into a header	2022-05-06 05:19:50 +02:00
videodsp.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
vorbisdsp.c	configure: Remove av_restrict	2024-03-15 12:51:15 +01:00
vp8dsp.c	checkasm/vp8dsp: add VP7 tests	2024-05-30 18:30:52 +03:00
vp9dsp.c	avutil/internal: Don't auto-include emms.h	2023-09-04 11:04:45 +02:00
vvc_alf.c	checkasm: vvc: Use checkasm_check for printing failing output	2024-12-10 11:26:09 +02:00
vvc_mc.c	checkasm: add vvc_bdof test	2024-08-31 14:08:54 +08:00