ffmpeg/libswscale
Krzysztof Pyrkosz c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.

The 3210 variant can be implemented using rev32, but surprisingly it is
slower than the generic TBL on A78, but much faster on A72.

There may be some room for improvement. Possibly instead of handling
last 8 and then 4 bytes separately, we can load these 4 into {v0.s}[2]
and process along with the last 8 bytes.

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:                                   75.5 ( 1.00x)
shuffle_bytes_0321_neon:                                26.5 ( 2.85x)
shuffle_bytes_1203_c:                                  136.2 ( 1.00x)
shuffle_bytes_1203_neon:                                27.2 ( 5.00x)
shuffle_bytes_1230_c:                                  135.5 ( 1.00x)
shuffle_bytes_1230_neon:                                28.0 ( 4.84x)
shuffle_bytes_2013_c:                                  138.8 ( 1.00x)
shuffle_bytes_2013_neon:                                22.0 ( 6.31x)
shuffle_bytes_2103_c:                                   76.5 ( 1.00x)
shuffle_bytes_2103_neon:                                20.5 ( 3.73x)
shuffle_bytes_2130_c:                                  137.5 ( 1.00x)
shuffle_bytes_2130_neon:                                28.0 ( 4.91x)
shuffle_bytes_3012_c:                                  138.2 ( 1.00x)
shuffle_bytes_3012_neon:                                21.5 ( 6.43x)
shuffle_bytes_3102_c:                                  138.2 ( 1.00x)
shuffle_bytes_3102_neon:                                27.2 ( 5.07x)
shuffle_bytes_3210_c:                                  138.0 ( 1.00x)
shuffle_bytes_3210_neon:                                22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  139.0 ( 1.00x)
shuffle_bytes_3210_neon:                                28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:                                  120.0 ( 1.00x)
shuffle_bytes_0321_neon:                                36.0 ( 3.33x)
shuffle_bytes_1203_c:                                  188.2 ( 1.00x)
shuffle_bytes_1203_neon:                                37.8 ( 4.99x)
shuffle_bytes_1230_c:                                  195.0 ( 1.00x)
shuffle_bytes_1230_neon:                                36.0 ( 5.42x)
shuffle_bytes_2013_c:                                  195.8 ( 1.00x)
shuffle_bytes_2013_neon:                                43.5 ( 4.50x)
shuffle_bytes_2103_c:                                  117.2 ( 1.00x)
shuffle_bytes_2103_neon:                                53.5 ( 2.19x)
shuffle_bytes_2130_c:                                  203.2 ( 1.00x)
shuffle_bytes_2130_neon:                                37.8 ( 5.38x)
shuffle_bytes_3012_c:                                  183.8 ( 1.00x)
shuffle_bytes_3012_neon:                                46.8 ( 3.93x)
shuffle_bytes_3102_c:                                  180.8 ( 1.00x)
shuffle_bytes_3102_neon:                                37.8 ( 4.79x)
shuffle_bytes_3210_c:                                  195.8 ( 1.00x)
shuffle_bytes_3210_neon:                                37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  194.8 ( 1.00x)
shuffle_bytes_3210_neon:                                30.8 ( 6.33x)

- x13s:
shuffle_bytes_0321_c:                                   49.4 ( 1.00x)
shuffle_bytes_0321_neon:                                18.1 ( 2.72x)
shuffle_bytes_1203_c:                                   98.4 ( 1.00x)
shuffle_bytes_1203_neon:                                18.4 ( 5.35x)
shuffle_bytes_1230_c:                                   97.4 ( 1.00x)
shuffle_bytes_1230_neon:                                19.1 ( 5.09x)
shuffle_bytes_2013_c:                                  101.4 ( 1.00x)
shuffle_bytes_2013_neon:                                16.9 ( 6.01x)
shuffle_bytes_2103_c:                                   53.9 ( 1.00x)
shuffle_bytes_2103_neon:                                13.9 ( 3.88x)
shuffle_bytes_2130_c:                                  100.9 ( 1.00x)
shuffle_bytes_2130_neon:                                19.1 ( 5.27x)
shuffle_bytes_3012_c:                                   97.4 ( 1.00x)
shuffle_bytes_3012_neon:                                17.1 ( 5.69x)
shuffle_bytes_3102_c:                                  100.9 ( 1.00x)
shuffle_bytes_3102_neon:                                19.1 ( 5.27x)
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
..
aarch64 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines 2025-02-07 12:54:55 +02:00
arm swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00
loongarch loongarch: fixes fate-checkasm-sw_rgb failure 2025-01-15 01:27:36 +01:00
ppc swscale/ppc: disable YUV2RGB AltiVec acceleration 2024-12-02 02:51:39 +01:00
riscv swscale/range_convert: saturate output instead of limiting input 2024-12-05 21:10:29 +01:00
tests tests/swscale: allow nonzero positive return codes from sws_scale_frame() 2024-12-18 17:30:48 +01:00
x86 swscale/x86/rgb2rgb: add AVX512ICL versions of shuffle_bytes 2025-02-03 10:16:44 -03:00
alphablend.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00
bayer_template.c swscale/internal: constify SwsFunc 2024-10-07 19:51:34 +02:00
cms.c swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
cms.h swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
csputils.c swscale/csputils: add internal colorspace math helpers 2024-12-23 12:33:43 +01:00
csputils.h swscale/csputils: add internal colorspace math helpers 2024-12-23 12:33:43 +01:00
gamma.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
graph.c swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
graph.h swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
half2float.c swscale/input: add rgbaf16 input support 2022-08-19 22:09:36 +02:00
hscale.c swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats 2024-12-05 21:10:29 +01:00
hscale_fast_bilinear.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
input.c swscale: 16bit planar float input support 2025-01-21 21:06:14 +01:00
libswscale.v build: Change structure of the linker version script templates 2016-05-29 16:43:11 +02:00
log2_tab.c lsws: duplicate ff_log2_tab 2014-08-12 20:52:21 +02:00
lut3d.c swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
lut3d.h swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
Makefile swscale/lut3d: add 3DLUT dispatch system 2024-12-23 12:33:43 +01:00
options.c swscale/options: add -sws_dither none alias 2024-12-23 12:47:10 +01:00
output.c swscale/output: Fix undefined overflow in yuv2rgba64_full_X_c_template() 2025-01-08 23:23:24 +01:00
rgb2rgb.c swscale/swscale_unscaled: add unscaled x2rgb10le to packed RGB 2024-11-06 17:34:32 -03:00
rgb2rgb.h swscale/swscale_unscaled: add unscaled x2rgb10le to packed RGB 2024-11-06 17:34:32 -03:00
rgb2rgb_template.c swscale/swscale_unscaled: add unscaled conversion for AYUV/VUYA/UYVA 2024-11-02 15:01:31 -03:00
slice.c swscale/slice: fix init of 32 bpc planes 2024-12-16 12:21:55 +01:00
swscale.c swscale/swscale: don't reject scaling when color parameters are not supported but conversion is not required 2025-01-22 12:15:18 -03:00
swscale.h swscale: add ICC intent enum and option 2024-12-23 12:33:43 +01:00
swscale_internal.h swscale: use 16-bit intermediate precision for RGB/XYZ conversion 2024-12-26 20:31:36 +01:00
swscale_unscaled.c swscale: 16bit planar float input support 2025-01-21 21:06:14 +01:00
swscaleres.rc Add Windows resource file support for shared libraries 2013-12-05 23:42:07 +01:00
utils.c swscale: 16bit planar float input support 2025-01-21 21:06:14 +01:00
utils.h swscale/swscale: don't reject scaling when color parameters are not supported but conversion is not required 2025-01-22 12:15:18 -03:00
version.c lib*/version: Use static_assert for static asserts 2024-03-31 00:08:42 +01:00
version.h swscale: add ICC intent enum and option 2024-12-23 12:33:43 +01:00
version_major.h libs: bump major version for all libraries 2024-03-07 11:29:43 -03:00
vscale.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00
yuv2rgb.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00