25 commits

Author SHA1 Message Date
Krzysztof Pyrkosz
c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
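
As a rough illustration of the table-driven approach described above, here is a scalar C model. It is only a sketch: tbl16() imitates a single-register TBL lookup, while make_table() and shuffle_bytes_model() are hypothetical names that do not come from the patch. It also assumes that variant "2103" means dst[0] = src[2], dst[1] = src[1], dst[2] = src[0], dst[3] = src[3].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one AArch64 TBL lookup on a single 16-byte table register:
 * out[i] = src[idx[i]], or 0 when idx[i] is out of range. */
static void tbl16(uint8_t out[16], const uint8_t src[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? src[idx[i]] : 0;
}

/* Hypothetical helper: expand a 4-entry per-pixel order (e.g. {2,1,0,3},
 * assumed to correspond to the 2103 variant) into 16 indices for 4 pixels. */
static void make_table(uint8_t table[16], const uint8_t order[4])
{
    for (int i = 0; i < 16; i++)
        table[i] = (uint8_t)((i & ~3) + order[i & 3]);
}

/* Shuffle src_size bytes (assumed to be a multiple of 4): 16 bytes per
 * iteration through the table, then a scalar tail for what is left. */
static void shuffle_bytes_model(const uint8_t *src, uint8_t *dst,
                                size_t src_size, const uint8_t order[4])
{
    uint8_t table[16];
    size_t i = 0;

    assert(src_size % 4 == 0);
    make_table(table, order);

    for (; i + 16 <= src_size; i += 16)
        tbl16(dst + i, src + i, table);
    for (; i < src_size; i += 4)
        for (int j = 0; j < 4; j++)
            dst[i + j] = src[i + order[j]];
}
```

The actual NEON routine handles the tail with dedicated blocks rather than a scalar loop; the model only shows why one pre-generated index table per variant is enough.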

The 3210 variant can be implemented using rev32. Surprisingly, that is
slower than the generic TBL on A78, yet much faster on A72.

There may be some room for improvement. Possibly, instead of handling the
last 8 and then the last 4 bytes separately, we can load these 4 into
{v0.s}[2] and process them along with the last 8 bytes.

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:                                   75.5 ( 1.00x)
shuffle_bytes_0321_neon:                                26.5 ( 2.85x)
shuffle_bytes_1203_c:                                  136.2 ( 1.00x)
shuffle_bytes_1203_neon:                                27.2 ( 5.00x)
shuffle_bytes_1230_c:                                  135.5 ( 1.00x)
shuffle_bytes_1230_neon:                                28.0 ( 4.84x)
shuffle_bytes_2013_c:                                  138.8 ( 1.00x)
shuffle_bytes_2013_neon:                                22.0 ( 6.31x)
shuffle_bytes_2103_c:                                   76.5 ( 1.00x)
shuffle_bytes_2103_neon:                                20.5 ( 3.73x)
shuffle_bytes_2130_c:                                  137.5 ( 1.00x)
shuffle_bytes_2130_neon:                                28.0 ( 4.91x)
shuffle_bytes_3012_c:                                  138.2 ( 1.00x)
shuffle_bytes_3012_neon:                                21.5 ( 6.43x)
shuffle_bytes_3102_c:                                  138.2 ( 1.00x)
shuffle_bytes_3102_neon:                                27.2 ( 5.07x)
shuffle_bytes_3210_c:                                  138.0 ( 1.00x)
shuffle_bytes_3210_neon:                                22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  139.0 ( 1.00x)
shuffle_bytes_3210_neon:                                28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:                                  120.0 ( 1.00x)
shuffle_bytes_0321_neon:                                36.0 ( 3.33x)
shuffle_bytes_1203_c:                                  188.2 ( 1.00x)
shuffle_bytes_1203_neon:                                37.8 ( 4.99x)
shuffle_bytes_1230_c:                                  195.0 ( 1.00x)
shuffle_bytes_1230_neon:                                36.0 ( 5.42x)
shuffle_bytes_2013_c:                                  195.8 ( 1.00x)
shuffle_bytes_2013_neon:                                43.5 ( 4.50x)
shuffle_bytes_2103_c:                                  117.2 ( 1.00x)
shuffle_bytes_2103_neon:                                53.5 ( 2.19x)
shuffle_bytes_2130_c:                                  203.2 ( 1.00x)
shuffle_bytes_2130_neon:                                37.8 ( 5.38x)
shuffle_bytes_3012_c:                                  183.8 ( 1.00x)
shuffle_bytes_3012_neon:                                46.8 ( 3.93x)
shuffle_bytes_3102_c:                                  180.8 ( 1.00x)
shuffle_bytes_3102_neon:                                37.8 ( 4.79x)
shuffle_bytes_3210_c:                                  195.8 ( 1.00x)
shuffle_bytes_3210_neon:                                37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  194.8 ( 1.00x)
shuffle_bytes_3210_neon:                                30.8 ( 6.33x)

- x13s:
shuffle_bytes_0321_c:                                   49.4 ( 1.00x)
shuffle_bytes_0321_neon:                                18.1 ( 2.72x)
shuffle_bytes_1203_c:                                   98.4 ( 1.00x)
shuffle_bytes_1203_neon:                                18.4 ( 5.35x)
shuffle_bytes_1230_c:                                   97.4 ( 1.00x)
shuffle_bytes_1230_neon:                                19.1 ( 5.09x)
shuffle_bytes_2013_c:                                  101.4 ( 1.00x)
shuffle_bytes_2013_neon:                                16.9 ( 6.01x)
shuffle_bytes_2103_c:                                   53.9 ( 1.00x)
shuffle_bytes_2103_neon:                                13.9 ( 3.88x)
shuffle_bytes_2130_c:                                  100.9 ( 1.00x)
shuffle_bytes_2130_neon:                                19.1 ( 5.27x)
shuffle_bytes_3012_c:                                   97.4 ( 1.00x)
shuffle_bytes_3012_neon:                                17.1 ( 5.69x)
shuffle_bytes_3102_c:                                  100.9 ( 1.00x)
shuffle_bytes_3102_neon:                                19.1 ( 5.27x)
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
James Almer
7a16bfa7c9 tests/checkasm/sw_rgb: increase plane array buffers
Fixes stack-buffer-overflow errors running under asan.

Reviewed-by: Marvin Scholz <epirat07@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2025-01-28 15:26:00 -03:00
Andreas Rheinhardt
5a72266d49 tests/checkasm/sw_rgb: Fix leaks
Also use loop-scope for variables where appropriate.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-01-12 15:41:40 +01:00
James Almer
658a645e18 tests/checkasm/sw_rgb: remove bogus value truncation in check_yuv2packed1()
Fixes out of array accesses.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-12-31 11:53:18 -03:00
Niklas Haas
a9ae2cc14d checkasm/sw_rgb: add alpToYV12 check
Mirroring lumToYV12 and chrToYV12.

Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
2024-12-23 11:20:59 +01:00
Niklas Haas
c601bb8df5 checkasm/sw_rgb: add tests for yuv2packed{1,2,X}
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
2024-12-23 11:20:58 +01:00
Niklas Haas
6a91a165fd swscale: eliminate redundant SwsInternal accesses
This is a purely cosmetic commit aimed at replacing accesses to
SwsInternal.opts by direct access to SwsContext wherever convenient.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-11-25 10:59:52 +01:00
Niklas Haas
2d077f9acd swscale/internal: group user-facing options together
This is a preliminary step to separating these into a new struct. This
commit contains no functional changes, it is a pure search-and-replace.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-11-21 12:49:56 +01:00
Niklas Haas
67adb30322 swscale: rename SwsContext to SwsInternal
And preserve the public SwsContext as separate name. The motivation here
is that I want to turn SwsContext into a public struct, while keeping the
internal implementation hidden. Additionally, I also want to be able to
use multiple internal implementations, e.g. for GPU devices.

This commit does not include any functional changes. For the most part, it is
a simple rename. The only complications arise from the public facing API
functions, which preserve their current type (and hence require an additional
unwrapping step internally), and the checkasm test framework, which directly
accesses SwsInternal.

For consistency, the affected functions that need to maintain a distinction
have generally been changed to refer to the SwsContext as *sws, and the
SwsInternal as *c.

In an upcoming commit, I will provide a backing definition for the public
SwsContext, and update `sws_internal()` to dereference the internal struct
instead of merely casting it.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-24 22:50:00 +02:00
James Almer
e1d1ba4cbc tests/checkasm/sw_rgb: don't write random data past the end of the buffer
Should fix fate-checkasm-sw_rgb under gcc-ubsan.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2024-10-17 13:08:39 +02:00
Martin Storsjö
157ce21939 checkasm/sw_rgb: Revert test additions from e18b46d95f
The unaligned width test cases fail on i386; we have an assembly
function of rgb24toyv12 which is enabled only within
"#if ARCH_X86_32 && HAVE_7REGS", which seems to fail these new
test cases for unaligned widths.

As that assembly function has existed for a long time in that form,
the issue probably isn't very recent, thus skip testing these cases
for now.

Once the assembly function has been fixed, these test cases can
be readded.

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-09-26 13:16:56 +03:00
Zhao Zhili
e18b46d95f swscale/aarch64: Fix rgb24toyv12 only works with aligned width
Since c0666d8b, rgb24toyv12 is broken for width non-aligned to 16.
Add a simple wrapper to handle the non-aligned part.
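
The wrapper idea can be sketched generically in C as below. This is not the FFmpeg wrapper: luma_row_c(), luma_row_fast16() and luma_row() are hypothetical names, the luma formula is purely illustrative, and a single row stands in for the full rgb24toyv12 conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Reference path: handles any width. The formula is illustrative only. */
static void luma_row_c(const uint8_t *rgb, uint8_t *y, int width)
{
    for (int x = 0; x < width; x++)
        y[x] = (uint8_t)((rgb[3 * x] + 2 * rgb[3 * x + 1] + rgb[3 * x + 2]) >> 2);
}

/* Stand-in for an optimized routine that is only valid when width is a
 * multiple of 16. Here it simply defers to the reference code. */
static void luma_row_fast16(const uint8_t *rgb, uint8_t *y, int width)
{
    assert(width % 16 == 0);
    luma_row_c(rgb, y, width);
}

/* Wrapper: fast path on the aligned prefix, reference path on the rest. */
static void luma_row(const uint8_t *rgb, uint8_t *y, int width)
{
    int aligned = width & ~15;

    if (aligned)
        luma_row_fast16(rgb, y, aligned);
    if (width > aligned)
        luma_row_c(rgb + 3 * aligned, y + aligned, width - aligned);
}
```

With width 24, for example, the first 16 pixels go through the fast path and the remaining 8 through the reference path.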

Co-authored-by: johzzy <hellojinqiang@gmail.com>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-09-24 10:24:14 +08:00
Ramiro Polla
e0cc06184c checkasm/sw_rgb: add rgb24toyv12 tests
2024-09-06 23:06:35 +02:00
Ramiro Polla
c08bb33e41 checkasm/sw_rgb: add deinterleaveBytes
2024-09-06 23:05:06 +02:00
James Almer
287d139b77 checkasm/sw_rgb: fix alignment of buffers for rgb_to_yuv tests
src is apparently not guaranteed to be >8 byte aligned, but align to 16
nonetheless as the x86 asm will do unaligned loads anyway.
dst is guaranteed to be 32 byte aligned for the Y plane, but 16 byte for UV.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-09 14:12:51 -03:00
James Almer
6743c2fc6a checkasm/sw_rgb: test rgb32/rgb32_1 to yuv
Test all four pixel formats, but only bench the two native endian ones for a
given target.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-09 12:29:49 -03:00
Andreas Rheinhardt
fca796ac3b tests/checkasm/sw_rgb: Be more strict about clobbering MMX state
The MMXEXT versions of the rgb2rgb functions tested here
always emit emms on their own. Therefore one can use
a stricter test to ensure that it stays that way.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-06-09 12:03:47 +02:00
Zhao Zhili
47ba87551c checkasm/sw_rgb: test rgb24/bgr24 to yuv
Line width 8 is meant to test a corner case where performance
doesn't matter. Width 1080 is another case that is not aligned
to 16.

Width 1920 is meant for benchmarking (together with the --runs option).

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-05 15:22:49 -03:00
Andreas Rheinhardt
4608f7cc6a Remove unnecessary mem.h inclusions
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-07-22 14:47:57 +02:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Jun Zhao
7f76f20fa0 checkasm: sw_rgb: Fix mixed declaration and code
Fix mixed declaration and code.

Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2020-06-01 23:28:07 +08:00
Martin Storsjö
eba1ebd9bf checkasm: sw_rgb: Add a test for interleaveBytes
Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:38:01 +03:00
Jun Zhao
b30575bc98 checkasm/sw_rgb: fix the function declaration warning
Fix the warning "function declaration isn’t a prototype". In C,
int foo() and int foo(void) are different declarations: int foo()
accepts an arbitrary number of arguments, while int foo(void) accepts
none.
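
A small sketch of the difference, using made-up function names (old_style, with_prototype, demo) rather than anything from checkasm. It describes C before C23; from C23 on, an empty parameter list means the same as (void).

```c
#include <assert.h>

/* Before C23, an empty parameter list declares a function without a
 * prototype, so the compiler does not check the arguments of calls. */
int old_style();

/* A (void) parameter list is a prototype: the function takes no arguments. */
int with_prototype(void);

int old_style() { return 1; }
int with_prototype(void) { return 2; }

static void demo(void)
{
    assert(old_style() == 1);
    assert(with_prototype() == 2);
    /* with_prototype(42); would be rejected at compile time.
     * old_style(42); is accepted by pre-C23 compilers, which is the kind
     * of mistake the strict-prototypes warning is meant to catch. */
}
```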

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
2018-05-10 19:28:51 +08:00
Martin Vignali
07a566e7d6 swscale/swscale_unscaled : add X86_64 (SSE2 and AVX) for uyvyto422
and checkasm test
2018-04-22 19:15:32 +02:00
Martin Vignali
a9a7ed4f27 checkasm/swscale : add test for rgb shuffle_bytes func
2018-03-24 20:22:12 +01:00