In yuv420_gbrp_ssse3, the boundary safeguard check "h_size * 3 >
FFABS(dstStride[0])" was erroneously set based on probably packed RGB24
formats (where each pixel spans 3 bytes per row).
For GBRP (planar GBR), each plane contains only 1 component per pixel
per row, meaning dstStride[0] corresponds exactly to width.
Multiplying h_size by 3 mistakenly triggers the condition for normal
widths, decreasing h_size by 8. This leaves the rightmost 8 pixels
of every row completely uninitialized (black).
Fix this by checking "h_size > FFABS(dstStride[0])" instead.
How to Reproduce the error:
1. Generate buggy and fixed outputs as PNGs using the 600x600 pipeline:
buggy output without the fix
$ ffmpeg -f lavfi -i color=c=red:size=600x600:rate=1 \
-vf format=yuv420p,format=gbrp \
-frames:v 1 -y buggy_red_600.png
fixed output with the fix
$ ffmpeg -f lavfi -i color=c=red:size=600x600:rate=1 \
-vf format=yuv420p,format=gbrp \
-frames:v 1 -y fixed_red_600.png
2. Verify buggy_red_600.png in an image viewer:
A strict, 8-pixel wide vertical black stripe (columns 592 to 599) is
clearly visible running top-to-bottom down the rightmost edge of the image.
3. Verify fixed_red_600.png in an image viewer as well:
The output renders a perfect, uniform, fully solid red square across
the entire 600x600 canvas, indicating the boundary bug is successfully resolved.
This writes 4 bytes but in SSE4 mode only produces 2 bytes per vector. We
can avoid over-writing by using the appropriately sized register.
Reproducible by:
make libswscale/tests/swscale
libswscale/tests/swscale -dst monob -unscaled 1 -flags unstable -align_src 1 -align_dst 1
Signed-off-by: Niklas Haas <git@haasn.dev>
These loops were both assuming that `h` lines need to be copied; but this
varies. First of all, for plane subsampling; but more importantly, when
vertically scaling, the input line count may be substantially lower than the
actual line count.
This fixes an out-of-bounds read/write when vertically upscaling with a tail
buffer.
Verifiable via e.g.:
make libswscale/tests/swscale
valgrind -- libswscale/tests/swscale -s 63x63 -src yuv444p -dst rgb24 \
-flags unstable -align_src 1 -align_dst 1
(As well as the SSIM scores, which drop from ~e-5 to ~e-3 without this fix)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
The issue is the legacy path does not support hardware frames, so falling
back means erroring with ENOTSUP, which would fail the tests.
Sponsored-by: Sovereign Tech Fund
The six .ifc cascades that gate 16bpp behavior in yuv2rgb_neon.S
(linesize padding in three load_args macros, d8/d9 save/restore,
main-loop pack dispatch) all branch on the same four output formats.
Aggregate the predicate into four GAS .set symbols emitted once per
declare_func via a new set_rgb16_predicates macro:
rgb16 - 1 for *565le and *555le outputs; 0 otherwise
r_first - 1 for rgb*le (R high); 0 for bgr*le (B high)
gshift - 2 for 565, 3 for 555 (passed as pack_rgb16's g_shr)
hshift - 11 for 565, 10 for 555 (passed as pack_rgb16's high_shl)
Call sites become a flat ".if rgb16" gate (five places) plus a 2-way
".if r_first" inside ".if rgb16" for the pack dispatch (one place).
.if/.endif count drops from 46 to 33; -88/+49 lines net.
Pure source-level refactor: the full object disassembly is byte-for-byte
identical to the pre-refactor build (MD5 2a6ac497cabc81849e0c80ec0fde0550
on Apple M1, clang). checkasm --test=sw_yuv2rgb 110/110, full checkasm
7657/7657.
Signed-off-by: DROOdotFOO <drew@axol.io>
The formats added in e93de9948d keep the values in the most significant
bits of the uint16_t, and packed30togbra10() and gbr16ptopacked30()
weren't taking into consideration the shift field from AVComponentDescriptor.
Reproducible with:
$ ./libswscale/tests/swscale -unscaled 1 -src gbrp10msbbe -dst x2rgb10le
$ ./libswscale/tests/swscale -unscaled 1 -src x2rgb10le -dst gbrp10msbbe
Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12, nv21}
to {rgb565le, bgr565le, rgb555le, bgr555le}.
The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS-64
requires d8-d15 to be callee-saved, declare_func now wraps a
stp d8, d9 / ldp d8, d9 around 16bpp paths only (gated by .ifc on the
output format). Pattern matches libswscale/aarch64/hscale.S.
yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
mirroring how yuva420p -> rgb24/bgr24 already work in tree.
Speedup vs C at width=1920 on Apple M1 (checkasm --bench):
| input | rgb565le | bgr565le | rgb555le | bgr555le |
|----------|----------|----------|----------|----------|
| yuv420p | 3.69x | 3.68x | 3.28x | 3.31x |
| yuv422p | 4.70x | 4.70x | 4.32x | 4.35x |
| yuva420p | 3.67x | 3.66x | 3.32x | 3.27x |
NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are the
same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
run at the same ~50.5 cycles.
This only handles the little endian forms of the 16 bit RGB formats.
Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
checkasm regression (7657/7657) on Apple M1.
Signed-off-by: DROOdotFOO <drew@axol.io>
The code made the fundamental assumption that over-read into the padding
bytes is okay to do; because the most that can happen is that those pixel
values end up corrupted, which doesn't affect any adjacent pixels.
However, this is not true for SWS_OP_FILTER_H, because this operation
fundamentally mixes together horizontal pixels. Normally, this was fine,
because the filter weights for those pixels are set to 0, and 0 * x = 0.
However, that is not true for floating point inputs, which can contain
Infinity; and 0 * Infinity = NaN, thus corrupting the entire pixel.
Solve it by specifically preventing over-read when it would be unsafe.
Signed-off-by: Niklas Haas <git@haasn.dev>
The issue is that while Vulkan already does the decomposition for us,
swscale assumes that the pixels will be in bitstream order, rather than
in their decomposed form.
This is valid for all packed formats for which these instructions are
issued (XV30 and X2RGB10).
This allows us to support the formats in Vulkan.
Sponsored-by: Sovereign Tech Fund
This ensures that the ops printing path goes through the same code as the
actual ops dispatch backend, including all sub-passes etc.
Signed-off-by: Niklas Haas <git@haasn.dev>
Allows the uops macro generation code to not actually compile any passes.
More generally, this could be used to e.g. test if an op list is supported by
a backend without actually creating the passes.
The `bool first` change is needed because the `input == prev` check no longer
works if we don't actually compiled any passes.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will be used eventually when I rewrite checkasm/sw_ops to re-use the
code in ops_dispatch.c instead of hand-rolling the execution layer.
Signed-off-by: Niklas Haas <git@haasn.dev>
This function actually lives in ops_dispatch.c, and doesn't really make
sense in ops.h anymore. We should also move some stuff out of ops_internal.h,
which doesn't depend on any external ops stuff, here.
This allows the backend/compilation-related stuff to co-exist more nicely.
Signed-off-by: Niklas Haas <git@haasn.dev>
Using the configured scaler from the SwsContext implicitly. This does affect
the output of libswscale/tests/sws_ops.c, which now prints about 4x as much
data (taking roughly 4x as long, but still within a second on my machine).
We can make this process a lot faster by forcing SWS_SCALE_POINT as the
scaler, which skips calculating any actual filter weights in favor of
generating a trivial 1-tap filter.
Signed-off-by: Niklas Haas <git@haasn.dev>
The only difference here is an extra ff_sws_add_filters() call, which is
a no-op because src w/h = dst w/h = 16.
Signed-off-by: Niklas Haas <git@haasn.dev>
This no longer accesses prev/next as a result of the `unused` removal, so
the signature can be simplified to just take the op directly.
Signed-off-by: Niklas Haas <git@haasn.dev>
We have other op types that skip checking the data even in non-flexible mode,
so there is a precedent for just leaving away `flexible` for such kernels.
Signed-off-by: Niklas Haas <git@haasn.dev>
Mirroring the precedent established by the other SwsOp-generating functions.
This allows us to re-use it for the uops macro generator.
Signed-off-by: Niklas Haas <git@haasn.dev>
The fix from 5fa2a65c11 introduced a regression for non-native-endian
formats (such as rgb565be on a little-endian system).
Reproducible with:
$ ./libswscale/tests/swscale -unscaled 1 -src rgb565be -dst rgb24
Also:
$ ./ffmpeg_g -i /opt/samples/jpegls/128.jls -vf "scale=size=512x512,format=rgb24,scale=flags=neighbor,format=rgb565be" -f rawvideo -vframes 1 -y rgb565be.raw
$ magick -size 512x512 -endian MSB RGB565:rgb565be.raw output.png
$ ./ffplay_g output.png
(note: don't use ffmpeg to convert from rgb565be.raw to output for the
test above since it will perform the same bug and cancel out the error)
When running with "-v 0", the test parameters were not being printed,
which made it hard to track down which conversion the error referred
to.
Now the test parameters are logged with av_log() when a loss error
happens.
The -p, -flags, and -unscaled options all affected the decision to
select a subsample of the tests to run. When specifying -p 0.1, about
57% of the tests would run instead of the expect 10%.
This commit fixes this by separating -p from -flags and -unscaled.
box() and triangle() have well-defined, trivially verifiable numerical
inverses.
We could actually pre-compute and hard-code the numerical inverse of all
non-parametric kernels, but I'm a bit reluctant to do this as I have plans to
adjust the value of SWS_MAX_REDUCE_CUTOFF based on the desired bit depth of the
output, which makes a hard-coding approach unfeasible.
(It would also be a brittle solution that may break whenever we extend the
scaler configuration API, as well as making it harder to add new filters)
Signed-off-by: Niklas Haas <git@haasn.dev>