ffmpeg/libswscale
Shreesh Adiga 26f2f03e0d swscale/x86/rgb2rgb: optimize AVX2 version of uyvytoyuv422
Currently, the SIMD loop of the AVX2 version of uyvytoyuv422 does the following:
4 vinsertq to interleave the vector lanes while loading from memory.
4 vperm2i128 inside 4 RSHIFT_COPY calls to achieve the desired layout.
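
For orientation, the conversion this SIMD loop performs can be sketched in scalar C. This is only an illustration for a single line of pixels: the helper name and signature below are hypothetical and are not the libswscale prototype, which also handles strides and multiple rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the libswscale signature: split one line of
 * packed UYVY (U0 Y0 V0 Y1 U1 Y2 V1 Y3 ...) into planar Y, U and V.
 * width is the number of luma samples and is assumed to be even. */
static void uyvy_line_to_planar(const uint8_t *src, uint8_t *y,
                                uint8_t *u, uint8_t *v, size_t width)
{
    assert(width % 2 == 0);
    for (size_t i = 0; i < width / 2; i++) {
        u[i]         = src[4 * i + 0]; /* one U per pixel pair */
        y[2 * i]     = src[4 * i + 1]; /* luma of the first pixel */
        v[i]         = src[4 * i + 2]; /* one V per pixel pair */
        y[2 * i + 1] = src[4 * i + 3]; /* luma of the second pixel */
    }
}
```

A SIMD version does the same deinterleave on many bytes at once, which is where the lane-ordering instructions listed above come in.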

This patch replaces the above 8 instructions with 2 vpermq and
2 vpermd with a vector register, similar to the AVX512ICL version.
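
As background on why such permutes are needed: most AVX2 pack and shuffle instructions work within each 128-bit lane, so their results often have to be reordered across lanes afterwards. A scalar model of vpermq (the instruction behind _mm256_permute4x64_epi64) is sketched below. The helper name is hypothetical, and the immediate 0xD8 used in the note after the code is a common lane-fixing idiom chosen for illustration; it is not taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpermq with an immediate: each 2-bit field of imm
 * selects which source quadword lands in the matching output slot.
 * dst and src must not overlap. */
static void vpermq_model(uint64_t dst[4], const uint64_t src[4], unsigned imm)
{
    assert(imm <= 0xFF);
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm >> (2 * i)) & 3];
}
```

With imm = 0xD8 the selectors are 0, 2, 1, 3, so an in-lane result ordered as [A_lo, B_lo, A_hi, B_hi] becomes [A_lo, A_hi, B_lo, B_hi]. vpermd works the same way on eight 32-bit elements, except that the selectors come from a vector register instead of an immediate.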

Observed the following numbers on various microarchitectures:

On AMD Zen3 laptop:
Before:
uyvytoyuv422_c:                                      51979.7 ( 1.00x)
uyvytoyuv422_sse2:                                    5410.5 ( 9.61x)
uyvytoyuv422_avx:                                     4642.7 (11.20x)
uyvytoyuv422_avx2:                                    4249.0 (12.23x)

After:
uyvytoyuv422_c:                                      51659.8 ( 1.00x)
uyvytoyuv422_sse2:                                    5420.8 ( 9.53x)
uyvytoyuv422_avx:                                     4651.2 (11.11x)
uyvytoyuv422_avx2:                                    3953.8 (13.07x)

On Intel MacBook Pro 2019:
Before:
uyvytoyuv422_c:                                     185014.4 ( 1.00x)
uyvytoyuv422_sse2:                                   22800.4 ( 8.11x)
uyvytoyuv422_avx:                                    19796.9 ( 9.35x)
uyvytoyuv422_avx2:                                   13141.9 (14.08x)

After:
uyvytoyuv422_c:                                     185093.4 ( 1.00x)
uyvytoyuv422_sse2:                                   22795.4 ( 8.12x)
uyvytoyuv422_avx:                                    19791.9 ( 9.35x)
uyvytoyuv422_avx2:                                   12043.1 (15.37x)

On AMD Zen4 desktop:
Before:
uyvytoyuv422_c:                                      29105.0 ( 1.00x)
uyvytoyuv422_sse2:                                    3888.0 ( 7.49x)
uyvytoyuv422_avx:                                     3374.2 ( 8.63x)
uyvytoyuv422_avx2:                                    2649.8 (10.98x)
uyvytoyuv422_avx512icl:                               1615.0 (18.02x)

After:
uyvytoyuv422_c:                                      29093.4 ( 1.00x)
uyvytoyuv422_sse2:                                    3874.4 ( 7.51x)
uyvytoyuv422_avx:                                     3371.6 ( 8.63x)
uyvytoyuv422_avx2:                                    2174.6 (13.38x)
uyvytoyuv422_avx512icl:                               1625.1 (17.90x)
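
The gain of the patched AVX2 kernel over the old one can be read off the avx2 rows above by dividing the "Before" timing by the "After" timing. A minimal sketch, with a hypothetical helper name, assuming lower timings mean faster code:

```c
#include <assert.h>

/* Hypothetical helper: relative gain of a patched kernel given benchmark
 * timings before and after the change (lower is faster).
 * A result of 1.10 means the patched kernel is about 10% faster. */
static double relative_gain(double before, double after)
{
    assert(before > 0.0 && after > 0.0);
    return before / after;
}
```

Applied to the avx2 rows, relative_gain(4249.0, 3953.8) is about 1.07 on Zen3, relative_gain(13141.9, 12043.1) about 1.09 on the Intel MacBook Pro, and relative_gain(2649.8, 2174.6) about 1.22 on Zen4.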

Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
2025-03-23 15:25:48 +00:00
aarch64 swscale: aarch64: Simplify the assignment of lumToYV12 2025-03-10 14:03:58 +02:00
arm libswscale/arm/swscale_unscaled: Fix function prototype 2025-03-02 01:10:38 +02:00
loongarch loongarch: fixes fate-checkasm-sw_rgb failure 2025-01-15 01:27:36 +01:00
ppc avutil/libm: Only include intfloat.h when needed 2025-03-22 03:35:28 +01:00
riscv swscale/range_convert: saturate output instead of limiting input 2024-12-05 21:10:29 +01:00
tests swscale/tests/swscale: Fix potential buffer overflow 2025-03-21 04:30:09 +01:00
x86 swscale/x86/rgb2rgb: optimize AVX2 version of uyvytoyuv422 2025-03-23 15:25:48 +00:00
alphablend.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00
bayer_template.c swscale/internal: constify SwsFunc 2024-10-07 19:51:34 +02:00
cms.c swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
cms.h swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
csputils.c swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
csputils.h swscale/csputils: add internal colorspace math helpers 2024-12-23 12:33:43 +01:00
format.c avutil/csp: Improve enum range comparisons 2025-03-21 04:30:09 +01:00
format.h swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
gamma.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
graph.c swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
graph.h swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
half2float.c swscale/input: add rgbaf16 input support 2022-08-19 22:09:36 +02:00
hscale.c swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats 2024-12-05 21:10:29 +01:00
hscale_fast_bilinear.c swscale: rename SwsContext to SwsInternal 2024-10-24 22:50:00 +02:00
input.c avutil/libm: Only include intfloat.h when needed 2025-03-22 03:35:28 +01:00
libswscale.v build: Change structure of the linker version script templates 2016-05-29 16:43:11 +02:00
log2_tab.c lsws: duplicate ff_log2_tab 2014-08-12 20:52:21 +02:00
lut3d.c swscale/cms,graph,lut3d: Use ff_-prefix, don't export internal functions 2025-01-12 15:41:39 +01:00
lut3d.h swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
Makefile avutil: only duplicate hal2float and float2half in shared builds 2025-03-18 17:21:23 -03:00
options.c swscale/options: add -sws_dither none alias 2024-12-23 12:47:10 +01:00
output.c avutil/libm: Only include intfloat.h when needed 2025-03-22 03:35:28 +01:00
rgb2rgb.c swscale/swscale_unscaled: add unscaled x2rgb10le to packed RGB 2024-11-06 17:34:32 -03:00
rgb2rgb.h swscale/swscale_unscaled: add unscaled x2rgb10le to packed RGB 2024-11-06 17:34:32 -03:00
rgb2rgb_template.c swscale/swscale_unscaled: add unscaled conversion for AYUV/VUYA/UYVA 2024-11-02 15:01:31 -03:00
slice.c swscale/slice: fix init of 32 bpc planes 2024-12-16 12:21:55 +01:00
swscale.c swscale: fix gray -> grayf32 SIGFPE 2025-03-17 11:40:05 +01:00
swscale.h swscale: add ICC intent enum and option 2024-12-23 12:33:43 +01:00
swscale_internal.h swscale: use 16-bit intermediate precision for RGB/XYZ conversion 2024-12-26 20:31:36 +01:00
swscale_unscaled.c swscale/swscale_unscaled: avoid nv12 <-> nv21 bug 2025-03-17 11:40:05 +01:00
swscaleres.rc Add Windows resource file support for shared libraries 2013-12-05 23:42:07 +01:00
utils.c swscale/utils: split off format code into new file 2025-03-14 19:50:44 +01:00
version.c lib*/version: Use static_assert for static asserts 2024-03-31 00:08:42 +01:00
version.h swscale/output: add support for NV20 2025-03-19 09:34:05 -03:00
version_major.h libs: bump major version for all libraries 2024-03-07 11:29:43 -03:00
vscale.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00
yuv2rgb.c swscale/internal: group user-facing options together 2024-11-21 12:49:56 +01:00