Andreas Rheinhardt
35fcdb2132
swscale/x86/rgb2rgb: Deduplicate ASM constants
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-04-13 22:49:21 +02:00
Michael Niedermayer
d16a058dbc
swscale/swscale: Do not crash on floats
...
Fixes: shift exponent 32 is too large for 32-bit type 'unsigned int'
Fixes: division by zero
Fixes: 391981061/clusterfuzz-testcase-minimized-ffmpeg_SWS_fuzzer-6691017763389440
Fixes: 392929028/clusterfuzz-testcase-minimized-ffmpeg_SWS_fuzzer-5142088307507200
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-04-10 03:01:32 +02:00
Michael Niedermayer
ce538ef97a
swscale/output: Fix integer overflow in yuv2gbrp_full_X_c()
...
Fixes: signed integer overflow: 1966895953 + 210305024 cannot be represented in type 'int'
Fixes: 391921975/clusterfuzz-testcase-minimized-ffmpeg_SWS_fuzzer-5916798905548800
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-04-10 03:01:32 +02:00
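The overflow reported for ce538ef97a above (1966895953 + 210305024) can be reproduced in isolation. The following is a minimal sketch of one common way to avoid the undefined behaviour, namely doing the addition in unsigned arithmetic; the helper name is hypothetical, and whether the actual fix in yuv2gbrp_full_X_c() takes this form is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper, not part of swscale: adds two 32-bit values in
 * unsigned arithmetic. Unsigned addition wraps modulo 2^32 and is well
 * defined, whereas the same sum computed on plain int overflows, which
 * is undefined behaviour in C. */
static uint32_t demo_add_wrapping(int32_t a, int32_t b)
{
    return (uint32_t) a + (uint32_t) b;
}
```

With the operands from the fuzzer report, the unsigned sum is 2177200977, which exceeds INT32_MAX (2147483647) and therefore cannot be represented in a signed 32-bit int.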
Andreas Rheinhardt
435be31ef5
swscale/csputils: Remove unused ff_sws_matrix3x3_rmul()
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-04-03 06:04:57 +02:00
Andreas Rheinhardt
4da84d5c2b
swscale/swscale_unscaled: Actually use X2->RGBA64 conversions
...
The conversion functions were added in e7382b4d01, yet they were never
really enabled. Found via -ffunction-sections and --gc-sections.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-31 21:45:20 +02:00
Niklas Haas
3e32dc8b08
tests/swscale: allow setting log verbosity
...
Helpful for debugging the new swscale code, since it dumps the
operations list in verbose logging mode.
2025-03-31 12:19:26 +02:00
Niklas Haas
92a57f1cfd
tests/swscale: constrain reference SSIM for low bit depth formats
...
Sometimes, the reference SSIM is significantly higher than the
SSIM level expected for the test. This is the case when the source format
has a much lower bit depth than the destination format. In this case, the fact
that legacy swscale does not accurately preserve the source dither pattern
gives it an unfair advantage in a direct comparison, leading to false
positives.
For example, a conversion like rgb4 -> rgb565 should be lossless, but swscale
low passes / downscales the input chroma, throwing away massive amounts of
detail. This gives it a higher SSIM score since the lowpassed result removes
some of the dither noise that was present in the source.
2025-03-31 12:19:26 +02:00
Niklas Haas
8fc9808f18
tests/swscale: calculate theoretical expected SSIM
...
We can calculate with some confidence the theoretical expected SSIM
from an "ideal" conversion, by computing the reference SSIM level
for an image dithered with uniformly distributed quantization noise.
This gives us an additional safety net to check for regressions even in
the absence of a reference to compare against.
2025-03-31 12:19:26 +02:00
Niklas Haas
9549daa996
tests/swscale: remove stray whitespace in scanf format
2025-03-31 12:19:24 +02:00
Niklas Haas
a22faeb992
tests/swscale: check supported inputs for legacy swscale separately
...
The new code path supports more formats, so we can't test them all
against the legacy implementation.
2025-03-31 12:19:08 +02:00
Niklas Haas
e1736d0d0b
tests/swscale: print performance stats on exit
2025-03-31 12:19:08 +02:00
Niklas Haas
6c12b1535a
tests/swscale: switch from MSE to SSIM
...
And bias it towards Y. This is much better at ignoring errors due to differing
dither patterns, and rewards algorithms that lower luma noise at the cost of
higher chroma noise.
The (0.8, 0.1, 0.1) weights for YCbCr are taken from the paper:
"Understanding SSIM" by Jim Nilsson and Tomas Akenine-Möller
(https://arxiv.org/abs/2006.13846)
2025-03-31 12:19:07 +02:00
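The luma bias described in 6c12b1535a above can be sketched as a weighted combination of per-plane SSIM scores. Only the (0.8, 0.1, 0.1) weights come from the commit message; the function name and the plain weighted sum are assumptions made for illustration, not the test's actual code.

```c
/* Hypothetical helper: combines per-plane SSIM scores (Y, Cb, Cr) into a
 * single score using the (0.8, 0.1, 0.1) weights quoted in the commit. */
static double demo_weighted_ssim(const double ssim[3])
{
    static const double weights[3] = { 0.8, 0.1, 0.1 }; /* Y, Cb, Cr */
    double sum = 0.0;

    for (int i = 0; i < 3; i++)
        sum += weights[i] * ssim[i];
    return sum;
}
```

Because Y carries 80% of the weight, a conversion that trades chroma noise for lower luma noise scores higher under this metric than one that does the opposite.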
Niklas Haas
1707e81073
tests/swscale: use yuva444p as reference
...
Instead of the lossy yuva420p. This does change the results compared to the
status quo, but is more reflective of the actual strength of a conversion,
since it will faithfully measure the round-trip error from subsampling and
upsampling.
2025-03-31 12:18:35 +02:00
Niklas Haas
f438f3f8cd
tests/swscale: print speedup numbers in color
2025-03-31 12:18:35 +02:00
Niklas Haas
995986e512
tests/swscale: allow testing only unscaled convertors
...
I need this to be able to test the new unscaled conversion code more quickly.
We reorder the flags to make 0 the first entry, so we don't set any
flags when performing unscaled tests.
2025-03-31 12:18:35 +02:00
Niklas Haas
d467ceaa9b
tests/swscale: use hex format for flags values
2025-03-31 12:18:11 +02:00
Niklas Haas
0e2742a693
tests/swscale: allow choosing specific flags and dither mode
...
So I can quickly iterate on the new swscale code.
2025-03-31 12:16:10 +02:00
James Almer
b338d1b35b
libs: bump major version for all libraries
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-28 14:44:34 -03:00
Shreesh Adiga
26f2f03e0d
swscale/x86/rgb2rgb: optimize AVX2 version of uyvytoyuv422
...
Currently the AVX2 version of uyvytoyuv422 in the SIMD loop does the following:
4 vinsertq to have interleaving of the vector lanes during load from memory.
4 vperm2i128 inside 4 RSHIFT_COPY calls to achieve the desired layout.
This patch replaces the above 8 instructions with 2 vpermq and
2 vpermd with a vector register similar to AVX512ICL version.
Observed the following numbers on various microarchitectures:
On AMD Zen3 laptop:
Before:
uyvytoyuv422_c: 51979.7 ( 1.00x)
uyvytoyuv422_sse2: 5410.5 ( 9.61x)
uyvytoyuv422_avx: 4642.7 (11.20x)
uyvytoyuv422_avx2: 4249.0 (12.23x)
After:
uyvytoyuv422_c: 51659.8 ( 1.00x)
uyvytoyuv422_sse2: 5420.8 ( 9.53x)
uyvytoyuv422_avx: 4651.2 (11.11x)
uyvytoyuv422_avx2: 3953.8 (13.07x)
On Intel Macbook Pro 2019:
Before:
uyvytoyuv422_c: 185014.4 ( 1.00x)
uyvytoyuv422_sse2: 22800.4 ( 8.11x)
uyvytoyuv422_avx: 19796.9 ( 9.35x)
uyvytoyuv422_avx2: 13141.9 (14.08x)
After:
uyvytoyuv422_c: 185093.4 ( 1.00x)
uyvytoyuv422_sse2: 22795.4 ( 8.12x)
uyvytoyuv422_avx: 19791.9 ( 9.35x)
uyvytoyuv422_avx2: 12043.1 (15.37x)
On AMD Zen4 desktop:
Before:
uyvytoyuv422_c: 29105.0 ( 1.00x)
uyvytoyuv422_sse2: 3888.0 ( 7.49x)
uyvytoyuv422_avx: 3374.2 ( 8.63x)
uyvytoyuv422_avx2: 2649.8 (10.98x)
uyvytoyuv422_avx512icl: 1615.0 (18.02x)
After:
uyvytoyuv422_c: 29093.4 ( 1.00x)
uyvytoyuv422_sse2: 3874.4 ( 7.51x)
uyvytoyuv422_avx: 3371.6 ( 8.63x)
uyvytoyuv422_avx2: 2174.6 (13.38x)
uyvytoyuv422_avx512icl: 1625.1 (17.90x)
Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
2025-03-23 15:25:48 +00:00
Andreas Rheinhardt
c94143350f
avutil/libm: Only include intfloat.h when needed
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-22 03:35:28 +01:00
Andreas Rheinhardt
65154ba994
swscale/tests/swscale: Fix potential buffer overflow
...
The field width in a %s directive gives the amount of characters
to read from the input and not the size of the receiving buffer;
the latter must of course also have space for the trailing \0,
which was forgotten here. The commit adds it (and fixes a
-Wfortify-source warning from Clang).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-21 04:30:09 +01:00
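The pitfall fixed in 65154ba994 above can be shown in a few lines. This is a minimal sketch with a hypothetical buffer width and function name, not the code from the test: a "%8s" directive may store up to 8 characters and then appends the terminating \0, so the destination needs 9 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical field width, chosen only for illustration. It must match
 * the number written in the format string below. */
#define DEMO_NAME_WIDTH 8

/* Parses the first whitespace-delimited token of `line` into `out`.
 * `out` must hold DEMO_NAME_WIDTH + 1 bytes: the field width limits the
 * characters read, and sscanf() then writes one more byte for the '\0'.
 * Returns the token length, or -1 if no token was found. */
static int demo_parse_name(const char *line, char out[DEMO_NAME_WIDTH + 1])
{
    if (sscanf(line, "%8s", out) != 1)
        return -1;
    return (int) strlen(out);
}
```

Declaring the destination as char[8] with the same format string would let sscanf() write one byte past the end whenever the token has 8 or more characters.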
Andreas Rheinhardt
dff498fddf
avutil/csp: Improve enum range comparisons
...
The underlying integer type of an enumeration is
implementation-defined (see C11, 6.7.2.2 (4)); GCC defaults
to unsigned if there are no negative values like for all enums
from pixfmt.h except enum AVPixelFormat.
This means that tests like "if (csp >= AVCOL_SPC_NB)" for
invalid colorspaces need not work as expected (namely if
enum AVColorSpace is signed). It also means that testing
for such an enum variable to be >= 0 may be tautologically
true. Clang emits a -Wtautological-unsigned-enum-zero-compare
warning for this.
Fix both of these issues by casting to unsigned.
Also do the same in libswscale/format.c.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-21 04:30:09 +01:00
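The cast described in dff498fddf above can be sketched as follows. The enum and function are hypothetical stand-ins (not the definitions from pixfmt.h); the point is that one unsigned comparison rejects both negative and too-large values regardless of the enum's underlying type.

```c
/* Hypothetical stand-in for an enum such as AVColorSpace. */
enum DemoColorSpace {
    DEMO_SPC_RGB,
    DEMO_SPC_BT709,
    DEMO_SPC_NB          /* number of valid entries, not a valid value */
};

/* Returns nonzero if csp is a valid entry. If the underlying type is
 * signed, a plain "csp < DEMO_SPC_NB" would accept negative values, and
 * if it is unsigned, an added "csp >= 0" test would be tautological.
 * Casting to unsigned maps negative values to very large ones, so a
 * single comparison covers both cases. */
static int demo_csp_is_valid(enum DemoColorSpace csp)
{
    return (unsigned) csp < (unsigned) DEMO_SPC_NB;
}
```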
James Almer
b8dc875249
swscale/output: add support for NV20
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-19 09:34:05 -03:00
James Almer
2f856b488b
swscale/input: add support for NV20
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-19 09:31:29 -03:00
James Almer
bf22c4cc3e
avutil: only duplicate half2float and float2half in shared builds
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-18 17:21:23 -03:00
Niklas Haas
5b9356f18e
swscale/swscale_unscaled: avoid nv12 <-> nv21 bug
...
This is not handled by the planar copy wrapper, so exclude it.
2025-03-17 11:40:05 +01:00
Niklas Haas
8ab40ca984
swscale: fix gray -> grayf32 SIGFPE
...
swscale internals don't distinguish between 16-bit and higher bit depth
output formats when it comes to the choice of intermediate
representation.
Clamping this value both prevents a SIGFPE and also aligns the check
with reality.
2025-03-17 11:40:05 +01:00
James Almer
63fa1f52b9
swscale/swscale_unscaled: make the fast planar copy path work with more formats
...
dst_depth - src_depth where the result is 6 or 7 in a high bd path means this
is only executed for 16 -> 10 and 16 -> 9.
This patch makes this path general, supporting arbitrary formats as long as
dst_depth > src_depth > 8.
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-15 18:43:18 -03:00
James Almer
819dec697a
swscale/swscale_unscaled: account for semi planar formats with data in the msb
...
Fixes fate failures introduced by recent tests that exercise the faulty code.
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-15 18:43:18 -03:00
Niklas Haas
ae84aa775f
swscale/utils: split off format code into new file
...
utils.c is getting quite crowded, and I need a new place to dump a lot of
format handling code for the ongoing rewrite. Rather than bloating this file
even more, start splitting format handling helpers off into a new file.
This also renames the existing utils.h header, which didn't really contain
anything except the SwsFormat definition anyway (the prototypes for what should
have been in utils.h are all still in the legacy swscale_internal.h).
2025-03-14 19:50:44 +01:00
James Almer
228713ef5d
swscale/input: add support for UYYVYY411
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-13 15:00:05 -03:00
James Almer
468577d1a5
swscale/input: add support for YAF16 and YAF32
...
Signed-off-by: James Almer <jamrial@gmail.com>
2025-03-10 10:15:42 -03:00
Martin Storsjö
73f4668ef8
swscale: aarch64: Simplify the assignment of lumToYV12
...
We normally don't need else statements here; the common pattern
is to assign lower level SIMD implementations first, then
conditionally reassign higher level ones afterwards, if supported.
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-10 14:03:58 +02:00
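The assignment pattern described in 73f4668ef8 above can be sketched as follows. The flag bits, function names and selector are hypothetical and only illustrate the control flow; they are not the aarch64 init code.

```c
/* Hypothetical CPU feature bits, for illustration only. */
#define DEMO_CPU_NEON    (1 << 0)
#define DEMO_CPU_DOTPROD (1 << 1)

typedef int (*demo_fn)(void);

/* Stand-ins for three implementations of the same routine. */
static int demo_impl_c(void)       { return 0; }
static int demo_impl_neon(void)    { return 1; }
static int demo_impl_dotprod(void) { return 2; }

/* Assign the lowest-level implementation first, then conditionally
 * reassign with higher-level ones. No else branches are needed because
 * later assignments simply override earlier ones. */
static demo_fn demo_select(int cpu_flags)
{
    demo_fn fn = demo_impl_c;

    if (cpu_flags & DEMO_CPU_NEON)
        fn = demo_impl_neon;
    if (cpu_flags & DEMO_CPU_DOTPROD)
        fn = demo_impl_dotprod;
    return fn;
}
```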
Brad Smith
30a8641465
lsws/ppc/yuv2rgb_altivec: Fix build in non-VSX environments with Clang
...
Add a check for the existence of the vec_xl() function. Clang provides
the function even with VSX not enabled.
2025-03-06 14:21:38 +01:00
Krzysztof Pyrkosz
d765e5f043
swscale/aarch64: dotprod implementation of rgba32_to_Y
...
The idea is to split the 16 bit coefficients into lower and upper half,
invoke udot for the lower half, shift by 8, and follow by udot for the
upper half.
Benchmark on A78:
bgra_to_y_128_c: 682.0 ( 1.00x)
bgra_to_y_128_neon: 181.2 ( 3.76x)
bgra_to_y_128_dotprod: 117.8 ( 5.79x)
bgra_to_y_1080_c: 5742.5 ( 1.00x)
bgra_to_y_1080_neon: 1472.5 ( 3.90x)
bgra_to_y_1080_dotprod: 906.5 ( 6.33x)
bgra_to_y_1920_c: 10194.0 ( 1.00x)
bgra_to_y_1920_neon: 2589.8 ( 3.94x)
bgra_to_y_1920_dotprod: 1573.8 ( 6.48x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:16:44 +02:00
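The coefficient split described in d765e5f043 above can be modelled in scalar C. This is a sketch of the arithmetic only: the function names are hypothetical, the 4-element dot product stands in for udot, and the choice of a right shift between the two steps is an assumption about the assembly. With that assumption the result is exact, because the total is lo_sum + 256 * hi_sum and therefore (lo_sum >> 8) + hi_sum equals total >> 8.

```c
#include <stdint.h>

/* Scalar stand-in for a 4-element unsigned 8-bit dot product. */
static uint32_t demo_dot4_u8(const uint8_t a[4], const uint8_t b[4])
{
    uint32_t sum = 0;

    for (int i = 0; i < 4; i++)
        sum += (uint32_t) a[i] * b[i];
    return sum;
}

/* Reference: full 16-bit coefficients, result shifted right by 8. */
static uint32_t demo_dot4_ref(const uint8_t px[4], const uint16_t coef[4])
{
    uint32_t sum = 0;

    for (int i = 0; i < 4; i++)
        sum += (uint32_t) px[i] * coef[i];
    return sum >> 8;
}

/* Same value using only 8-bit dot products: low halves first, shift the
 * accumulator by 8, then accumulate the high halves. */
static uint32_t demo_dot4_split(const uint8_t px[4], const uint16_t coef[4])
{
    uint8_t lo[4], hi[4];
    uint32_t acc;

    for (int i = 0; i < 4; i++) {
        lo[i] = (uint8_t) (coef[i] & 0xff);
        hi[i] = (uint8_t) (coef[i] >> 8);
    }
    acc   = demo_dot4_u8(px, lo);
    acc >>= 8;
    acc  += demo_dot4_u8(px, hi);
    return acc;
}
```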
Krzysztof Pyrkosz
38929b824b
swscale/aarch64: Refactor hscale_16_to_15__fs_4
...
This patch removes the use of stack for temporary state and replaces
interleaved ld4 loads with ld1.
Before/after:
A78
hscale_16_to_15__fs_4_dstW_8_neon: 86.8 ( 1.72x)
hscale_16_to_15__fs_4_dstW_24_neon: 147.5 ( 2.73x)
hscale_16_to_15__fs_4_dstW_128_neon: 614.0 ( 3.14x)
hscale_16_to_15__fs_4_dstW_144_neon: 680.5 ( 3.18x)
hscale_16_to_15__fs_4_dstW_256_neon: 1193.2 ( 3.19x)
hscale_16_to_15__fs_4_dstW_512_neon: 2305.0 ( 3.27x)
hscale_16_to_15__fs_4_dstW_8_neon: 86.0 ( 1.74x)
hscale_16_to_15__fs_4_dstW_24_neon: 106.8 ( 3.78x)
hscale_16_to_15__fs_4_dstW_128_neon: 404.0 ( 4.81x)
hscale_16_to_15__fs_4_dstW_144_neon: 451.8 ( 4.80x)
hscale_16_to_15__fs_4_dstW_256_neon: 760.5 ( 5.06x)
hscale_16_to_15__fs_4_dstW_512_neon: 1520.0 ( 5.01x)
A72
hscale_16_to_15__fs_4_dstW_8_neon: 156.8 ( 1.52x)
hscale_16_to_15__fs_4_dstW_24_neon: 217.8 ( 2.52x)
hscale_16_to_15__fs_4_dstW_128_neon: 906.8 ( 2.90x)
hscale_16_to_15__fs_4_dstW_144_neon: 1014.5 ( 2.91x)
hscale_16_to_15__fs_4_dstW_256_neon: 1751.5 ( 2.96x)
hscale_16_to_15__fs_4_dstW_512_neon: 3469.3 ( 2.97x)
hscale_16_to_15__fs_4_dstW_8_neon: 151.2 ( 1.54x)
hscale_16_to_15__fs_4_dstW_24_neon: 173.4 ( 3.15x)
hscale_16_to_15__fs_4_dstW_128_neon: 660.0 ( 3.98x)
hscale_16_to_15__fs_4_dstW_144_neon: 735.7 ( 4.00x)
hscale_16_to_15__fs_4_dstW_256_neon: 1273.5 ( 4.09x)
hscale_16_to_15__fs_4_dstW_512_neon: 2488.2 ( 4.16x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:29 +02:00
Adam Lackorzynski
76b1810017
libswscale/arm/swscale_unscaled: Fix function prototype
...
Constify dstStride argument of rgbx_to_nv12_neon_16_wrapper to be
compatible with other functions as used in function assignment.
Signed-off-by: Adam Lackorzynski <adam@l4re.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:10:38 +02:00
Martin Storsjö
b137347278
aarch64: Fix a few misindented lines
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-28 23:23:09 +02:00
Shreesh Adiga
e18f87ed9f
swscale/x86/rgb2rgb: add AVX512ICL version of uyvytoyuv422
...
The scalar loop is replaced with masked AVX512 instructions.
For extracting the Y from UYVY, vperm2b is used instead of
various AND and packuswb.
Instead of loading the vectors with interleaved lanes as done
in AVX2 version, normal load is used. At the end of packuswb,
for U and V, an extra permute operation is done to get the
required layout.
AMD 7950x Zen 4 benchmark data:
uyvytoyuv422_c: 29105.0 ( 1.00x)
uyvytoyuv422_sse2: 3888.0 ( 7.49x)
uyvytoyuv422_avx: 3374.2 ( 8.63x)
uyvytoyuv422_avx2: 2649.8 (10.98x)
uyvytoyuv422_avx512icl: 1615.0 (18.02x)
Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2025-02-18 12:43:57 -03:00
Krzysztof Pyrkosz
b92577405b
swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}
...
A78:
uyvytoyuv420_neon: 6112.5 ( 6.96x)
uyvytoyuv422_neon: 6696.0 ( 6.32x)
yuyvtoyuv420_neon: 6113.0 ( 6.95x)
yuyvtoyuv422_neon: 6695.2 ( 6.31x)
A72:
uyvytoyuv420_neon: 9512.1 ( 6.09x)
uyvytoyuv422_neon: 9766.8 ( 6.32x)
yuyvtoyuv420_neon: 9639.1 ( 6.00x)
yuyvtoyuv422_neon: 9779.0 ( 6.03x)
A53:
uyvytoyuv420_neon: 12720.1 ( 9.10x)
uyvytoyuv422_neon: 14282.9 ( 6.71x)
yuyvtoyuv420_neon: 12637.4 ( 9.15x)
yuyvtoyuv422_neon: 14127.6 ( 6.77x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 11:39:42 +02:00
Krzysztof Pyrkosz
64107e22f5
swscale/aarch64/rgb24toyv12: skip early right shift by 2
...
It's a minor improvement that shaves off 5-8% from the execution time.
Instead of shifting by 2 right away and by 7 soon after, shift by 9 one
time.
Times before and after:
A78:
rgb24toyv12_16_200_neon: 5366.8 ( 3.62x)
rgb24toyv12_128_60_neon: 13574.0 ( 3.34x)
rgb24toyv12_512_16_neon: 14463.8 ( 3.33x)
rgb24toyv12_1920_4_neon: 13508.2 ( 3.34x)
rgb24toyv12_1920_4_negstride_neon: 13525.0 ( 3.34x)
rgb24toyv12_16_200_neon: 5293.8 ( 3.66x)
rgb24toyv12_128_60_neon: 12955.0 ( 3.50x)
rgb24toyv12_512_16_neon: 13784.0 ( 3.50x)
rgb24toyv12_1920_4_neon: 12900.8 ( 3.49x)
rgb24toyv12_1920_4_negstride_neon: 12902.8 ( 3.49x)
A72:
rgb24toyv12_16_200_neon: 9695.8 ( 2.50x)
rgb24toyv12_128_60_neon: 20286.6 ( 2.70x)
rgb24toyv12_512_16_neon: 22276.6 ( 2.57x)
rgb24toyv12_1920_4_neon: 19154.1 ( 2.77x)
rgb24toyv12_1920_4_negstride_neon: 19055.1 ( 2.78x)
rgb24toyv12_16_200_neon: 9214.8 ( 2.65x)
rgb24toyv12_128_60_neon: 20731.5 ( 2.65x)
rgb24toyv12_512_16_neon: 21145.0 ( 2.70x)
rgb24toyv12_1920_4_neon: 17586.5 ( 2.99x)
rgb24toyv12_1920_4_negstride_neon: 17571.0 ( 2.98x)
A53:
rgb24toyv12_16_200_neon: 12880.4 ( 3.76x)
rgb24toyv12_128_60_neon: 27776.3 ( 3.94x)
rgb24toyv12_512_16_neon: 29411.3 ( 3.94x)
rgb24toyv12_1920_4_neon: 27253.1 ( 3.98x)
rgb24toyv12_1920_4_negstride_neon: 27474.3 ( 3.95x)
rgb24toyv12_16_200_neon: 12196.3 ( 3.95x)
rgb24toyv12_128_60_neon: 26943.1 ( 4.07x)
rgb24toyv12_512_16_neon: 28642.3 ( 4.07x)
rgb24toyv12_1920_4_neon: 26676.6 ( 4.08x)
rgb24toyv12_1920_4_negstride_neon: 26713.8 ( 4.07x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 10:49:41 +02:00
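The shift merge described in 64107e22f5 above relies on a simple identity. The sketch below uses hypothetical helper names and assumes unsigned values with nothing happening between the two shifts; the surrounding operations in the actual NEON routine are not modelled.

```c
#include <stdint.h>

/* Two-step variant: shift right by 2, then by 7. */
static uint32_t demo_shift_two_steps(uint32_t x)
{
    return (x >> 2) >> 7;
}

/* One-step variant: a single shift by 9. For unsigned values both shifts
 * discard low bits, so the result is identical and one instruction is
 * saved. */
static uint32_t demo_shift_one_step(uint32_t x)
{
    return x >> 9;
}
```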
James Almer
268d0b6527
swscale/graph: copy scaler_params to the legacy subpass context
...
Fixes ticket #11448.
Signed-off-by: James Almer <jamrial@gmail.com>
2025-02-07 13:17:37 -03:00
Krzysztof Pyrkosz
c85a748979
swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
...
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
The 3210 variant can be implemented using rev32, but surprisingly it is
slower than the generic TBL on A78, but much faster on A72.
There may be some room for improvement. Possibly instead of handling
the last 8 and then 4 bytes separately, we can load these 4 into {v0.s}[2]
and process them along with the last 8 bytes.
Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf
- A78
shuffle_bytes_0321_c: 75.5 ( 1.00x)
shuffle_bytes_0321_neon: 26.5 ( 2.85x)
shuffle_bytes_1203_c: 136.2 ( 1.00x)
shuffle_bytes_1203_neon: 27.2 ( 5.00x)
shuffle_bytes_1230_c: 135.5 ( 1.00x)
shuffle_bytes_1230_neon: 28.0 ( 4.84x)
shuffle_bytes_2013_c: 138.8 ( 1.00x)
shuffle_bytes_2013_neon: 22.0 ( 6.31x)
shuffle_bytes_2103_c: 76.5 ( 1.00x)
shuffle_bytes_2103_neon: 20.5 ( 3.73x)
shuffle_bytes_2130_c: 137.5 ( 1.00x)
shuffle_bytes_2130_neon: 28.0 ( 4.91x)
shuffle_bytes_3012_c: 138.2 ( 1.00x)
shuffle_bytes_3012_neon: 21.5 ( 6.43x)
shuffle_bytes_3102_c: 138.2 ( 1.00x)
shuffle_bytes_3102_neon: 27.2 ( 5.07x)
shuffle_bytes_3210_c: 138.0 ( 1.00x)
shuffle_bytes_3210_neon: 22.0 ( 6.27x)
shuf3210 using rev32
shuffle_bytes_3210_c: 139.0 ( 1.00x)
shuffle_bytes_3210_neon: 28.5 ( 4.88x)
- A72
shuffle_bytes_0321_c: 120.0 ( 1.00x)
shuffle_bytes_0321_neon: 36.0 ( 3.33x)
shuffle_bytes_1203_c: 188.2 ( 1.00x)
shuffle_bytes_1203_neon: 37.8 ( 4.99x)
shuffle_bytes_1230_c: 195.0 ( 1.00x)
shuffle_bytes_1230_neon: 36.0 ( 5.42x)
shuffle_bytes_2013_c: 195.8 ( 1.00x)
shuffle_bytes_2013_neon: 43.5 ( 4.50x)
shuffle_bytes_2103_c: 117.2 ( 1.00x)
shuffle_bytes_2103_neon: 53.5 ( 2.19x)
shuffle_bytes_2130_c: 203.2 ( 1.00x)
shuffle_bytes_2130_neon: 37.8 ( 5.38x)
shuffle_bytes_3012_c: 183.8 ( 1.00x)
shuffle_bytes_3012_neon: 46.8 ( 3.93x)
shuffle_bytes_3102_c: 180.8 ( 1.00x)
shuffle_bytes_3102_neon: 37.8 ( 4.79x)
shuffle_bytes_3210_c: 195.8 ( 1.00x)
shuffle_bytes_3210_neon: 37.8 ( 5.19x)
shuf3210 using rev32
shuffle_bytes_3210_c: 194.8 ( 1.00x)
shuffle_bytes_3210_neon: 30.8 ( 6.33x)
- x13s:
shuffle_bytes_0321_c: 49.4 ( 1.00x)
shuffle_bytes_0321_neon: 18.1 ( 2.72x)
shuffle_bytes_1203_c: 98.4 ( 1.00x)
shuffle_bytes_1203_neon: 18.4 ( 5.35x)
shuffle_bytes_1230_c: 97.4 ( 1.00x)
shuffle_bytes_1230_neon: 19.1 ( 5.09x)
shuffle_bytes_2013_c: 101.4 ( 1.00x)
shuffle_bytes_2013_neon: 16.9 ( 6.01x)
shuffle_bytes_2103_c: 53.9 ( 1.00x)
shuffle_bytes_2103_neon: 13.9 ( 3.88x)
shuffle_bytes_2130_c: 100.9 ( 1.00x)
shuffle_bytes_2130_neon: 19.1 ( 5.27x)
shuffle_bytes_3012_c: 97.4 ( 1.00x)
shuffle_bytes_3012_neon: 17.1 ( 5.69x)
shuffle_bytes_3102_c: 100.9 ( 1.00x)
shuffle_bytes_3102_neon: 19.1 ( 5.27x)
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 16.9 ( 5.96x)
shuf3210 using rev32
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 18.6 ( 5.40x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
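The table-driven approach described in c85a748979 above can be modelled in scalar C. This is a sketch under stated assumptions: the function and table names are hypothetical, the size is assumed to be a multiple of 4 bytes, the naming convention dst[i + 0] = src[i + 2] for the 2103 variant is assumed, and the inner loops stand in for what TBL does in one instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Index table for the 2103 pattern, expanded to one 16-byte block.
 * Each group of four entries picks source bytes 2, 1, 0, 3 of that group. */
static const uint8_t demo_tbl_2103[16] = {
     2,  1,  0,  3,   6,  5,  4,  7,
    10,  9,  8, 11,  14, 13, 12, 15,
};

/* Shuffles `size` bytes (assumed to be a multiple of 4) from src to dst.
 * The main loop consumes 16 bytes per iteration through the table; the
 * tail loop handles any remaining 4-byte elements with the first four
 * table entries. */
static void demo_shuffle_bytes_2103(const uint8_t *src, uint8_t *dst,
                                    size_t size)
{
    size_t i = 0;

    for (; i + 16 <= size; i += 16)
        for (int j = 0; j < 16; j++)
            dst[i + j] = src[i + demo_tbl_2103[j]];

    for (; i + 4 <= size; i += 4)
        for (int j = 0; j < 4; j++)
            dst[i + j] = src[i + demo_tbl_2103[j]];
}
```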
Krzysztof Pyrkosz
e25a19fc7c
swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon
...
The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1
A78 before:
yuv2yuv1_0_512_accurate_c: 2039.5 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 385.5 ( 5.29x)
yuv2yuv1_0_512_approximate_c: 2110.5 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 385.5 ( 5.47x)
yuv2yuv1_3_512_accurate_c: 2061.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 381.2 ( 5.41x)
yuv2yuv1_3_512_approximate_c: 2099.2 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 381.2 ( 5.51x)
yuv2yuv1_8_512_accurate_c: 2054.2 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 385.5 ( 5.33x)
yuv2yuv1_8_512_approximate_c: 2112.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 385.5 ( 5.48x)
yuv2yuv1_11_512_accurate_c: 2036.0 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 381.2 ( 5.34x)
yuv2yuv1_11_512_approximate_c: 2115.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 381.2 ( 5.55x)
yuv2yuv1_16_512_accurate_c: 2066.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 385.5 ( 5.36x)
yuv2yuv1_16_512_approximate_c: 2100.8 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 385.5 ( 5.45x)
yuv2yuv1_19_512_accurate_c: 2059.8 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 381.2 ( 5.40x)
yuv2yuv1_19_512_approximate_c: 2102.8 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 381.2 ( 5.52x)
After:
yuv2yuv1_0_512_accurate_c: 2206.0 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 139.2 (15.84x)
yuv2yuv1_0_512_approximate_c: 2050.0 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 139.2 (14.72x)
yuv2yuv1_3_512_accurate_c: 2205.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 138.0 (15.98x)
yuv2yuv1_3_512_approximate_c: 2052.5 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 138.0 (14.87x)
yuv2yuv1_8_512_accurate_c: 2171.0 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 139.2 (15.59x)
yuv2yuv1_8_512_approximate_c: 2064.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 139.2 (14.82x)
yuv2yuv1_11_512_accurate_c: 2164.8 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 138.0 (15.69x)
yuv2yuv1_11_512_approximate_c: 2048.8 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 138.0 (14.85x)
yuv2yuv1_16_512_accurate_c: 2154.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 139.2 (15.47x)
yuv2yuv1_16_512_approximate_c: 2047.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 139.2 (14.70x)
yuv2yuv1_19_512_accurate_c: 2144.5 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 138.0 (15.54x)
yuv2yuv1_19_512_approximate_c: 2046.0 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 138.0 (14.83x)
A72 before:
yuv2yuv1_0_512_accurate_c: 3779.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 527.8 ( 7.16x)
yuv2yuv1_0_512_approximate_c: 4128.2 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 528.2 ( 7.81x)
yuv2yuv1_3_512_accurate_c: 3836.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 527.0 ( 7.28x)
yuv2yuv1_3_512_approximate_c: 3991.0 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 526.8 ( 7.58x)
yuv2yuv1_8_512_accurate_c: 3732.8 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 525.5 ( 7.10x)
yuv2yuv1_8_512_approximate_c: 4060.0 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 527.0 ( 7.70x)
yuv2yuv1_11_512_accurate_c: 3836.2 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 530.0 ( 7.24x)
yuv2yuv1_11_512_approximate_c: 4014.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 530.0 ( 7.57x)
yuv2yuv1_16_512_accurate_c: 3726.2 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 525.5 ( 7.09x)
yuv2yuv1_16_512_approximate_c: 4114.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 526.2 ( 7.82x)
yuv2yuv1_19_512_accurate_c: 3812.2 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 530.0 ( 7.19x)
yuv2yuv1_19_512_approximate_c: 4012.2 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 530.0 ( 7.57x)
After:
yuv2yuv1_0_512_accurate_c: 3716.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 215.1 (17.28x)
yuv2yuv1_0_512_approximate_c: 3877.8 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 222.8 (17.40x)
yuv2yuv1_3_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 217.8 (17.06x)
yuv2yuv1_3_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 220.3 (17.25x)
yuv2yuv1_8_512_accurate_c: 3716.6 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 213.8 (17.38x)
yuv2yuv1_8_512_approximate_c: 3831.8 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 218.1 (17.57x)
yuv2yuv1_11_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 219.1 (16.97x)
yuv2yuv1_11_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 216.1 (17.59x)
yuv2yuv1_16_512_accurate_c: 3716.6 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 213.6 (17.40x)
yuv2yuv1_16_512_approximate_c: 3831.6 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 215.1 (17.82x)
yuv2yuv1_19_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 223.8 (16.61x)
yuv2yuv1_19_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 219.1 (17.35x)
x13s before:
yuv2yuv1_0_512_accurate_c: 1435.1 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 221.1 ( 6.49x)
yuv2yuv1_0_512_approximate_c: 1405.4 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 219.1 ( 6.41x)
yuv2yuv1_3_512_accurate_c: 1418.6 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 215.9 ( 6.57x)
yuv2yuv1_3_512_approximate_c: 1405.9 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 224.1 ( 6.27x)
yuv2yuv1_8_512_accurate_c: 1433.9 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 218.6 ( 6.56x)
yuv2yuv1_8_512_approximate_c: 1412.9 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 218.9 ( 6.46x)
yuv2yuv1_11_512_accurate_c: 1449.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 217.6 ( 6.66x)
yuv2yuv1_11_512_approximate_c: 1410.9 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 221.1 ( 6.38x)
yuv2yuv1_16_512_accurate_c: 1402.1 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 214.6 ( 6.53x)
yuv2yuv1_16_512_approximate_c: 1422.4 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 222.9 ( 6.38x)
yuv2yuv1_19_512_accurate_c: 1421.6 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 217.4 ( 6.54x)
yuv2yuv1_19_512_approximate_c: 1421.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 221.4 ( 6.42x)
After:
yuv2yuv1_0_512_accurate_c: 1413.6 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 80.6 (17.53x)
yuv2yuv1_0_512_approximate_c: 1455.6 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 80.6 (18.05x)
yuv2yuv1_3_512_accurate_c: 1429.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 77.4 (18.47x)
yuv2yuv1_3_512_approximate_c: 1462.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 80.6 (18.14x)
yuv2yuv1_8_512_accurate_c: 1425.4 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 77.9 (18.30x)
yuv2yuv1_8_512_approximate_c: 1436.6 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 80.9 (17.76x)
yuv2yuv1_11_512_accurate_c: 1429.4 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 76.1 (18.78x)
yuv2yuv1_11_512_approximate_c: 1447.1 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 78.4 (18.46x)
yuv2yuv1_16_512_accurate_c: 1439.9 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 77.6 (18.55x)
yuv2yuv1_16_512_approximate_c: 1422.1 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 78.1 (18.20x)
yuv2yuv1_19_512_accurate_c: 1447.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 78.1 (18.52x)
yuv2yuv1_19_512_approximate_c: 1474.4 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 78.1 (18.87x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:05:06 +02:00
Shreesh Adiga
59f9dbaa31
swscale/x86/rgb2rgb: add AVX512ICL versions of shuffle_bytes
...
On an AMD 7950x Zen 4:
shuffle_bytes_0321_c: 56.5 ( 1.00x)
shuffle_bytes_0321_ssse3: 15.2 ( 3.70x)
shuffle_bytes_0321_avx2: 10.2 ( 5.51x)
shuffle_bytes_0321_avx512icl: 9.2 ( 6.11x)
shuffle_bytes_1230_c: 84.5 ( 1.00x)
shuffle_bytes_1230_ssse3: 14.2 ( 5.93x)
shuffle_bytes_1230_avx2: 15.2 ( 5.54x)
shuffle_bytes_1230_avx512icl: 11.2 ( 7.51x)
shuffle_bytes_2103_c: 48.5 ( 1.00x)
shuffle_bytes_2103_ssse3: 21.2 ( 2.28x)
shuffle_bytes_2103_avx2: 13.8 ( 3.53x)
shuffle_bytes_2103_avx512icl: 9.2 ( 5.24x)
shuffle_bytes_3012_c: 84.5 ( 1.00x)
shuffle_bytes_3012_ssse3: 14.2 ( 5.93x)
shuffle_bytes_3012_avx2: 16.2 ( 5.20x)
shuffle_bytes_3012_avx512icl: 10.2 ( 8.24x)
shuffle_bytes_3210_c: 89.2 ( 1.00x)
shuffle_bytes_3210_ssse3: 24.2 ( 3.68x)
shuffle_bytes_3210_avx2: 16.2 ( 5.49x)
shuffle_bytes_3210_avx512icl: 9.2 ( 9.65x)
Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
2025-02-03 10:16:44 -03:00
Andreas Rheinhardt
4afe61ea6c
swscale/x86/swscale: Make M24 variables static
...
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-02-02 17:03:13 +01:00
Andreas Rheinhardt
3797e9239e
swscale/x86/swscale: Move some constants to rgb2rgb.c
...
ff_w1111 and ff_bgr2(Y|UV)Offset are only used there
(and only on x86-32 since caaec2ea95).
Also make them static.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-02-02 17:00:07 +01:00
James Almer
e20ee9f9ae
swscale/swscale: don't reject scaling when color parameters are not supported but conversion is not required
...
Values in csp, prim, trc, etc, are irrelevant if there's no conversion needed.
Reviewed-by: Niklas Haas <ffmpeg@haasn.xyz>
Signed-off-by: James Almer <jamrial@gmail.com>
2025-01-22 12:15:18 -03:00
James Almer
abdc20727c
swscale/swscale: combine the input/output checks in sws_frame_setup()
...
Cosmetic change in preparation for the next commit.
Signed-off-by: James Almer <jamrial@gmail.com>
2025-01-22 12:14:57 -03:00
Michael Niedermayer
665b0cf3bf
swscale: 16bit planar float input support
...
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-01-21 21:06:14 +01:00