Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12, nv21}
to {rgb565le, bgr565le, rgb555le, bgr555le}.
The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS-64
requires d8-d15 to be callee-saved, declare_func now wraps a
stp d8, d9 / ldp d8, d9 around 16bpp paths only (gated by .ifc on the
output format). Pattern matches libswscale/aarch64/hscale.S.
yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
mirroring how yuva420p -> rgb24/bgr24 already work in tree.
Speedup vs C at width=1920 on Apple M1 (checkasm --bench):
| input | rgb565le | bgr565le | rgb555le | bgr555le |
|----------|----------|----------|----------|----------|
| yuv420p | 3.69x | 3.68x | 3.28x | 3.31x |
| yuv422p | 4.70x | 4.70x | 4.32x | 4.35x |
| yuva420p | 3.67x | 3.66x | 3.32x | 3.27x |
NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are the
same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
run at the same ~50.5 cycles.
This only handles the little endian forms of the 16 bit RGB formats.
Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
checkasm regression (7657/7657) on Apple M1.
Signed-off-by: DROOdotFOO <drew@axol.io>
The code made the fundamental assumption that over-read into the padding
bytes is okay to do; because the most that can happen is that those pixel
values end up corrupted, which doesn't affect any adjacent pixels.
However, this is not true for SWS_OP_FILTER_H, because this operation
fundamentally mixes together horizontal pixels. Normally, this was fine,
because the filter weights for those pixels are set to 0, and 0 * x = 0.
However, that is not true for floating point inputs, which can contain
Infinity; and 0 * Infinity = NaN, thus corrupting the entire pixel.
Solve it by specifically preventing over-read when it would be unsafe.
Signed-off-by: Niklas Haas <git@haasn.dev>
The issue is that while Vulkan already does the decomposition for us,
swscale assumes that the pixels will be in bitstream order, rather than
in their decomposed form.
This is valid for all packed formats for which these instructions are
issued (XV30 and X2RGB10).
This allows us to support the formats in Vulkan.
Sponsored-by: Sovereign Tech Fund
This ensures that the ops printing path goes through the same code as the
actual ops dispatch backend, including all sub-passes etc.
Signed-off-by: Niklas Haas <git@haasn.dev>
Allows the uops macro generation code to not actually compile any passes.
More generally, this could be used to e.g. test if an op list is supported by
a backend without actually creating the passes.
The `bool first` change is needed because the `input == prev` check no longer
works if we don't actually compiled any passes.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will be used eventually when I rewrite checkasm/sw_ops to re-use the
code in ops_dispatch.c instead of hand-rolling the execution layer.
Signed-off-by: Niklas Haas <git@haasn.dev>
This function actually lives in ops_dispatch.c, and doesn't really make
sense in ops.h anymore. We should also move some stuff out of ops_internal.h,
which doesn't depend on any external ops stuff, here.
This allows the backend/compilation-related stuff to co-exist more nicely.
Signed-off-by: Niklas Haas <git@haasn.dev>
Using the configured scaler from the SwsContext implicitly. This does affect
the output of libswscale/tests/sws_ops.c, which now prints about 4x as much
data (taking roughly 4x as long, but still within a second on my machine).
We can make this process a lot faster by forcing SWS_SCALE_POINT as the
scaler, which skips calculating any actual filter weights in favor of
generating a trivial 1-tap filter.
Signed-off-by: Niklas Haas <git@haasn.dev>
The only difference here is an extra ff_sws_add_filters() call, which is
a no-op because src w/h = dst w/h = 16.
Signed-off-by: Niklas Haas <git@haasn.dev>
This no longer accesses prev/next as a result of the `unused` removal, so
the signature can be simplified to just take the op directly.
Signed-off-by: Niklas Haas <git@haasn.dev>
We have other op types that skip checking the data even in non-flexible mode,
so there is a precedent for just leaving away `flexible` for such kernels.
Signed-off-by: Niklas Haas <git@haasn.dev>
Mirroring the precedent established by the other SwsOp-generating functions.
This allows us to re-use it for the uops macro generator.
Signed-off-by: Niklas Haas <git@haasn.dev>
The fix from 5fa2a65c11 introduced a regression for non-native-endian
formats (such as rgb565be on a little-endian system).
Reproducible with:
$ ./libswscale/tests/swscale -unscaled 1 -src rgb565be -dst rgb24
Also:
$ ./ffmpeg_g -i /opt/samples/jpegls/128.jls -vf "scale=size=512x512,format=rgb24,scale=flags=neighbor,format=rgb565be" -f rawvideo -vframes 1 -y rgb565be.raw
$ magick -size 512x512 -endian MSB RGB565:rgb565be.raw output.png
$ ./ffplay_g output.png
(note: don't use ffmpeg to convert from rgb565be.raw to output for the
test above since it will perform the same bug and cancel out the error)
When running with "-v 0", the test parameters were not being printed,
which made it hard to track down which conversion the error referred
to.
Now the test parameters are logged with av_log() when a loss error
happens.
The -p, -flags, and -unscaled options all affected the decision to
select a subsample of the tests to run. When specifying -p 0.1, about
57% of the tests would run instead of the expect 10%.
This commit fixes this by separating -p from -flags and -unscaled.
box() and triangle() have well-defined, trivially verifiable numerical
inverses.
We could actually pre-compute and hard-code the numerical inverse of all
non-parametric kernels, but I'm a bit reluctant to do this as I have plans to
adjust the value of SWS_MAX_REDUCE_CUTOFF based on the desired bit depth of the
output, which makes a hard-coding approach unfeasible.
(It would also be a brittle solution that may break whenever we extend the
scaler configuration API, as well as making it harder to add new filters)
Signed-off-by: Niklas Haas <git@haasn.dev>
Obviously noone should pass negative values, they make no sense, but better to
explicitly check
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Some of these files aligned instructions to 4/24 columns, while
we commonly indent arm/aarch64 assembly to 8/24 columns.
Some of these files also used a different alignment for the
operands.