ffmpeg/tests/checkasm
Krzysztof Pyrkosz c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
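
As a rough illustration of the table-driven idea above, here is a portable C sketch, not the NEON code from this commit. The names `build_shuf_table`, `tbl16_model` and `shuffle_bytes_model` are hypothetical. The scalar loop in `tbl16_model` stands in for a single-register TBL lookup, and the tail loop is an assumed simplification of the specialized block at the end of the routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: expand a 4-byte permutation, e.g. {0, 3, 2, 1} for
 * the 0321 variant, into a 16-entry index table covering one 16-byte
 * block. This models the pre-generated table handed to TBL. */
static void build_shuf_table(uint8_t table[16], const uint8_t perm[4])
{
    for (int i = 0; i < 16; i++)
        table[i] = (uint8_t)((i & ~3) + perm[i & 3]);
}

/* Scalar stand-in for a single-register TBL: out[i] = in[table[i]].
 * All indices produced above are < 16, so no out-of-range case arises. */
static void tbl16_model(uint8_t out[16], const uint8_t in[16],
                        const uint8_t table[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[table[i]];
}

/* Shuffle `size` bytes (assumed to be a multiple of 4): full 16-byte
 * blocks go through the table lookup, the remainder is handled 4 bytes
 * at a time. */
static void shuffle_bytes_model(const uint8_t *src, uint8_t *dst,
                                size_t size, const uint8_t perm[4])
{
    uint8_t table[16], in[16], out[16];
    size_t i = 0;

    assert(size % 4 == 0);
    build_shuf_table(table, perm);

    for (; i + 16 <= size; i += 16) {
        memcpy(in, src + i, 16);
        tbl16_model(out, in, table);
        memcpy(dst + i, out, 16);
    }
    for (; i < size; i += 4)
        for (int j = 0; j < 4; j++)
            dst[i + j] = src[i + perm[j]];
}
```

The same table serves every 16-byte block, which is why it can be generated once up front; only the four-entry permutation differs between the shuffle variants.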

The 3210 variant can also be implemented using rev32. Surprisingly, that
is slower than the generic TBL version on A78, yet much faster on A72.
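
To see why rev32 can stand in for the 3210 shuffle, the sketch below models it in portable C: reversing the bytes inside each 32-bit word gives the same result as the 3210 permutation. The function names are hypothetical, and this is a scalar model rather than the NEON implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit value. */
static uint32_t bswap32_model(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Scalar model of rev32 applied across a buffer: each 32-bit word has
 * its bytes reversed. Loading and storing through memcpy makes the
 * result independent of host endianness. */
static void rev32_model(const uint8_t *src, uint8_t *dst, size_t size)
{
    assert(size % 4 == 0);
    for (size_t i = 0; i < size; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        w = bswap32_model(w);
        memcpy(dst + i, &w, 4);
    }
}

/* Reference 3210 shuffle: dst[i + k] = src[i + 3 - k]. */
static void shuffle_3210_ref(const uint8_t *src, uint8_t *dst, size_t size)
{
    assert(size % 4 == 0);
    for (size_t i = 0; i < size; i += 4)
        for (int k = 0; k < 4; k++)
            dst[i + k] = src[i + 3 - k];
}
```

The two functions produce identical output for any input whose length is a multiple of 4. Which form runs faster on a given core is an empirical question, as the measurements below show.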

There may be some room for improvement. Possibly, instead of handling
the last 8 and then 4 bytes separately, we could load those 4 bytes into
{v0.s}[2] and process them along with the last 8 bytes.

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:                                   75.5 ( 1.00x)
shuffle_bytes_0321_neon:                                26.5 ( 2.85x)
shuffle_bytes_1203_c:                                  136.2 ( 1.00x)
shuffle_bytes_1203_neon:                                27.2 ( 5.00x)
shuffle_bytes_1230_c:                                  135.5 ( 1.00x)
shuffle_bytes_1230_neon:                                28.0 ( 4.84x)
shuffle_bytes_2013_c:                                  138.8 ( 1.00x)
shuffle_bytes_2013_neon:                                22.0 ( 6.31x)
shuffle_bytes_2103_c:                                   76.5 ( 1.00x)
shuffle_bytes_2103_neon:                                20.5 ( 3.73x)
shuffle_bytes_2130_c:                                  137.5 ( 1.00x)
shuffle_bytes_2130_neon:                                28.0 ( 4.91x)
shuffle_bytes_3012_c:                                  138.2 ( 1.00x)
shuffle_bytes_3012_neon:                                21.5 ( 6.43x)
shuffle_bytes_3102_c:                                  138.2 ( 1.00x)
shuffle_bytes_3102_neon:                                27.2 ( 5.07x)
shuffle_bytes_3210_c:                                  138.0 ( 1.00x)
shuffle_bytes_3210_neon:                                22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  139.0 ( 1.00x)
shuffle_bytes_3210_neon:                                28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:                                  120.0 ( 1.00x)
shuffle_bytes_0321_neon:                                36.0 ( 3.33x)
shuffle_bytes_1203_c:                                  188.2 ( 1.00x)
shuffle_bytes_1203_neon:                                37.8 ( 4.99x)
shuffle_bytes_1230_c:                                  195.0 ( 1.00x)
shuffle_bytes_1230_neon:                                36.0 ( 5.42x)
shuffle_bytes_2013_c:                                  195.8 ( 1.00x)
shuffle_bytes_2013_neon:                                43.5 ( 4.50x)
shuffle_bytes_2103_c:                                  117.2 ( 1.00x)
shuffle_bytes_2103_neon:                                53.5 ( 2.19x)
shuffle_bytes_2130_c:                                  203.2 ( 1.00x)
shuffle_bytes_2130_neon:                                37.8 ( 5.38x)
shuffle_bytes_3012_c:                                  183.8 ( 1.00x)
shuffle_bytes_3012_neon:                                46.8 ( 3.93x)
shuffle_bytes_3102_c:                                  180.8 ( 1.00x)
shuffle_bytes_3102_neon:                                37.8 ( 4.79x)
shuffle_bytes_3210_c:                                  195.8 ( 1.00x)
shuffle_bytes_3210_neon:                                37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  194.8 ( 1.00x)
shuffle_bytes_3210_neon:                                30.8 ( 6.33x)

- x13s
shuffle_bytes_0321_c:                                   49.4 ( 1.00x)
shuffle_bytes_0321_neon:                                18.1 ( 2.72x)
shuffle_bytes_1203_c:                                   98.4 ( 1.00x)
shuffle_bytes_1203_neon:                                18.4 ( 5.35x)
shuffle_bytes_1230_c:                                   97.4 ( 1.00x)
shuffle_bytes_1230_neon:                                19.1 ( 5.09x)
shuffle_bytes_2013_c:                                  101.4 ( 1.00x)
shuffle_bytes_2013_neon:                                16.9 ( 6.01x)
shuffle_bytes_2103_c:                                   53.9 ( 1.00x)
shuffle_bytes_2103_neon:                                13.9 ( 3.88x)
shuffle_bytes_2130_c:                                  100.9 ( 1.00x)
shuffle_bytes_2130_neon:                                19.1 ( 5.27x)
shuffle_bytes_3012_c:                                   97.4 ( 1.00x)
shuffle_bytes_3012_neon:                                17.1 ( 5.69x)
shuffle_bytes_3102_c:                                  100.9 ( 1.00x)
shuffle_bytes_3102_neon:                                19.1 ( 5.27x)
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
aarch64 checkasm: aarch64: Check for stack overflows 2020-05-15 21:22:36 +03:00
arm checkasm: arm: Check for stack overflows 2020-05-15 21:22:34 +03:00
riscv checkasm/riscv: preserve T1 whilst calling... 2024-08-01 18:44:01 +03:00
x86 x86: replace explicit REP_RETs with RETs 2023-02-01 04:23:55 +01:00
.gitignore Split global .gitignore file into per-directory files 2016-05-13 14:55:56 +02:00
aacencdsp.c x86/aacencdsp: add AVX version of quantize_bands 2024-06-09 12:29:49 -03:00
aacpsdsp.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
ac3dsp.c checkasm: Increase the tolerance for ac3_sum_square_butterfly_float 2024-07-24 12:10:33 +03:00
af_afir.c checkasm: test for dcmul_add 2023-11-27 17:55:24 +02:00
alacdsp.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
audiodsp.c checkasm/audiodsp: Be strict about MMX 2022-10-11 14:18:54 +02:00
av_tx.c avutil/common: Don't auto-include mem.h 2024-03-31 00:08:43 +01:00
blockdsp.c checkasm/blockdsp: use smallest allowed aligned buffers for fill_block_tab tests 2024-05-08 21:13:23 -03:00
bswapdsp.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
checkasm.c checkasm: Print benchmarks of C-only functions 2024-12-11 10:51:15 +02:00
checkasm.h checkasm: vvc: Use checkasm_check for printing failing output 2024-12-10 11:26:09 +02:00
diracdsp.c tests/checkasm/diracdsp: fix alignment for src and ombc_weight buffers 2024-11-19 12:32:49 -03:00
exrdsp.c tests/checkasm: Improve included headers 2024-03-02 02:54:12 +01:00
fdctdsp.c checkasm: add test for fdct 2024-05-11 10:28:59 +02:00
fixed_dsp.c configure: Remove av_restrict 2024-03-15 12:51:15 +01:00
flacdsp.c checkasm/flacdsp: add a test for lpc33 2024-05-24 09:23:00 -03:00
float_dsp.c checkasm/float_dsp: add double-precision scalar product 2024-05-31 22:22:43 +03:00
fmtconvert.c avcodec/fmtconvert: Remove unused AVCodecContext parameter 2022-09-21 20:26:40 +02:00
g722dsp.c checkasm: add a g722dsp test 2017-07-13 17:00:19 -03:00
h263dsp.c checkasm: add h263dsp.{h,v}_loop_filter 2024-05-27 22:42:07 +03:00
h264chroma.c checkasm: Fix h264chroma test name 2024-05-11 11:36:20 +03:00
h264dsp.c checkasm/h264dsp: test TX bypass 2024-07-21 22:36:48 +03:00
h264pred.c tests/checkasm: Improve included headers 2024-03-02 02:54:12 +01:00
h264qpel.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
hevc_add_res.c lavc/hevc*: move to hevc/ subdir 2024-06-04 11:46:27 +02:00
hevc_deblock.c lavc/hevc*: move to hevc/ subdir 2024-06-04 11:46:27 +02:00
hevc_idct.c lavc/hevc*: move to hevc/ subdir 2024-06-04 11:46:27 +02:00
hevc_pel.c checkasm: vvc: Use checkasm_check for printing failing output 2024-12-10 11:26:09 +02:00
hevc_sao.c lavc/hevc*: move to hevc/ subdir 2024-06-04 11:46:27 +02:00
huffyuvdsp.c tests/checkasm/huffyuvdsp: Use correct function pointer type 2024-05-17 13:29:34 +02:00
idctdsp.c avcodec/idctdsp: Avoid inclusion of avcodec.h 2023-09-11 00:26:34 +02:00
jpeg2000dsp.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
llauddsp.c tests/checkasm/llauddsp: Avoid UB integer overflow 2024-05-17 13:16:58 +02:00
lls.c checkasm: lls: Use relative tolerances rather than absolute ones 2024-10-09 15:52:56 +03:00
llviddsp.c tests/checkasm: Fix build error when enable linux perf on Android 2024-06-11 01:11:46 +08:00
llviddspenc.c tests/checkasm/llvidencdsp: Don't use declare_func_emms 2023-09-04 11:04:45 +02:00
lpc.c checkasm/lpc: use fixed length to bench apply_welch_window 2024-05-31 17:06:08 -03:00
Makefile checkasm/diracdsp: test add_dirac_obmc 2024-11-15 13:44:53 -05:00
motion.c avcodec/me_cmp: Zero MECmpContext in ff_me_cmp_init() 2024-06-20 18:58:38 +02:00
mpegvideoencdsp.c avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t 2024-09-01 13:42:30 +02:00
opusdsp.c lavc/opus*: move to opus/ subdir 2024-09-02 11:56:53 +02:00
pixblockdsp.c configure: Remove av_restrict 2024-03-15 12:51:15 +01:00
rv34dsp.c checkasm/rv34dsp: add rv34_idct_dc_add test 2024-02-17 14:33:35 +02:00
rv40dsp.c checkasm/rv40dsp: cover more cases 2024-12-10 11:24:45 -05:00
sbrdsp.c checkasm: test the noise case of sbrdsp.hf_apply_noise 2023-11-13 18:34:29 +02:00
svq1enc.c tests/checkasm/svq1enc: Use proper range for input 2024-05-09 13:40:18 +02:00
sw_gbrp.c swscale: eliminate redundant SwsInternal accesses 2024-11-25 10:59:52 +01:00
sw_range_convert.c swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats 2024-12-05 21:10:29 +01:00
sw_rgb.c swscale/aarch64/rgb2rgb: Implemented NEON shuf routines 2025-02-07 12:54:55 +02:00
sw_scale.c checkasm/sw_scale: add test for yuv2nv12cX 2024-12-23 11:20:58 +01:00
sw_yuv2rgb.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
sw_yuv2yuv.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
synth_filter.c dca_core: convert to lavu/tx 2022-11-06 14:39:36 +01:00
takdsp.c avcodec/takdsp: fix const correctness 2023-12-22 09:28:04 -03:00
utvideodsp.c tests/checkasm: Improve included headers 2024-03-02 02:54:12 +01:00
v210dec.c checkasm/v210dec: add extra space to the destination arrays 2022-12-21 00:36:49 +01:00
v210enc.c checkasm/v210enc: test the entire width of 10-bit planar input arrays 2022-12-01 18:19:03 +01:00
vc1dsp.c checkasm: vc1dsp: Align buffers sufficiently for the mspel tests 2024-04-30 23:13:47 +03:00
vf_blend.c tests/checkasm/vf_blend: Update function type 2024-05-17 13:35:33 +02:00
vf_bwdif.c tests/checkasm/vf_bwdif: Use correct function pointer type 2024-05-17 13:31:37 +02:00
vf_colorspace.c tests/checkasm/vf_colorspace: Use correct function pointer type 2024-05-17 13:31:23 +02:00
vf_convolution.c libavfilter/x86/vf_convolution: add sobel filter optimization and unit test with intel AVX512 VNNI 2022-11-14 10:04:16 +08:00
vf_eq.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vf_gblur.c avutil/common: Don't auto-include mem.h 2024-03-31 00:08:43 +01:00
vf_hflip.c avfilter/vf_hflip: Move ff_hflip_init into a header 2022-05-06 05:19:50 +02:00
vf_nlmeans.c avutil/common: Don't auto-include mem.h 2024-03-31 00:08:43 +01:00
vf_threshold.c avfilter/vf_threshold: Move ff_threshold_init into a header 2022-05-06 05:19:50 +02:00
videodsp.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vorbisdsp.c configure: Remove av_restrict 2024-03-15 12:51:15 +01:00
vp8dsp.c checkasm/vp8dsp: add VP7 tests 2024-05-30 18:30:52 +03:00
vp9dsp.c avutil/internal: Don't auto-include emms.h 2023-09-04 11:04:45 +02:00
vvc_alf.c checkasm: vvc: Use checkasm_check for printing failing output 2024-12-10 11:26:09 +02:00
vvc_mc.c checkasm: add vvc_bdof test 2024-08-31 14:08:54 +08:00