The issue is that .flush gets called asynchronously, and modifies the
video session state while its being used for decoding. This did not
result in issues since all known vendors do not keep important state
there, but its not compliant with the specs.
Its not necessary to flush the decoder at all when seeking,
so simply don't.
Fixes#20487
perm gets incremented in the loop in such a manner that
it already has the value it is set to here except for
the first loop iteration.
Reviewed-by: Marton Balint <cus@passwd.hu>
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The DPX Vulkan unpack shader computes a word offset as
uint off = (line_off + pix_off >> 5);
Due to GLSL operator precedence this is evaluated as
line_off + (pix_off >> 5) rather than (line_off + pix_off) >> 5.
Since line_off is in bits while off is a 32-bit word index,
scanlines beyond y=0 use an inflated offset and the shader reads
past the end of the DPX slice buffer.
Parenthesize the expression so that the sum is shifted as intended:
uint off = (line_off + pix_off) >> 5;
This corrects the unpacked data and removes the CRC mismatch
observed between the software and Vulkan DPX decoders for
mispacked 12-bit DPX samples. The GPU OOB read itself is only
observable indirectly via this corruption since it occurs inside
the shader.
Repro on x86_64 with Vulkan/llvmpipe (531ce713a0):
./configure --cc=clang --disable-optimizations --disable-stripping \
--enable-debug=3 --disable-doc --disable-ffplay \
--enable-vulkan --enable-libshaderc \
--enable-hwaccel=dpx_vulkan \
--extra-cflags='-fsanitize=address -fno-omit-frame-pointer' \
--extra-ldflags='-fsanitize=address' && make
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json
PoC: packed 12-bit DPX with the packing flag cleared so the unpack
shader runs (4x64 gbrp12le), e.g. poc12_packed0.dpx.
Software decode:
./ffmpeg -v error -i poc12_packed0.dpx -f framecrc -
-> 0, ..., 1536, 0x26cf81c2
Vulkan hwaccel decode:
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json \
./ffmpeg -v error -init_hw_device vulkan \
-hwaccel vulkan -hwaccel_output_format vulkan \
-i poc12_packed0.dpx \
-vf hwdownload,format=gbrp12le -f framecrc -
-> 0, ..., 1536, 0x71e10a51
The only difference between the two runs is the Vulkan unpack
shader, and the stable CRC mismatch indicates that it is reading
past the intended DPX slice region.
Regression since: 531ce713a0
Found-by: Pwno
Required for the following commit, where a parsing function expects the buffer
to include the country code bytes.
Signed-off-by: James Almer <jamrial@gmail.com>
ff_vk_video_common_init() calls ff_vk_video_common_uninit() on failure
which leaves dangling object handles. Those get freed again when the
destructor of FFVulkanDecodeShared calls ff_vk_video_common_uninit()
again.
Signed-off-by: Cameron Gutman <aicommander@gmail.com>
vc1_inv_trans_8x4_altivec() is supposed to process a block
of 8x4 words, yet it read and processed eight lines. This led
to ASAN failures (see [1]) that this commit intends to fix.
It should also lead to performance improvements, but I don't have
real hardware to bench it.
[1]: https://fate.ffmpeg.org/report.cgi?time=20251207214004&slot=ppc64-linux-gcc-14.3-asan
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
We already parse the EXIF side data to extract the orientation, so we
should add it to the output file as an EXIF box.
Signed-off-by: Leo Izen <leo.izen@gmail.com>
Before this commit, we ignore the display matrix side data if any EXIF
side data is present, even if that side data contains no orientation
tag. This allows us to calculate the orientation from the display
matrix sidedata first, if present. Ideally the decoder will have
removed the orientation tag upon decoding and attached the data as
display matrix side data instead, so this makes our orientation code
respect this behavior.
Signed-off-by: Leo Izen <leo.izen@gmail.com>
Four rows of four bytes fit into one xmm register; therefore
one can arrange the rows as follows (A,B,C: first, second, third etc.
row)
xmm0: ABABABAB BCBCBCBC
xmm1: CDCDCDCD DEDEDEDE
xmm2: EFEFEFEF FGFGFGFG
xmm3: GHGHGHGH HIHIHIHI
and use four pmaddubsw to calculate two rows in parallel. The history
fits into four registers, making this possible even on 32bit systems.
Old benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x)
New benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x)
Old benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x)
vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x)
New benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x)
vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x)
Loading the constants into registers did not turn out to be advantageous
here (not to mention Win64, where this would necessitate saving
and restoring ever more register); probably because there are only two
loop iterations.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes a heap-buffer-overflow in `libavcodec/dpx.c` triggered by a stale
`unpadded_10bit` flag in the `DPXDecContext`. This flag, set for 10-bit
unpadded frames, persisted across `decode_frame` calls. If a subsequent
frame was 16-bit, the stale flag caused incorrect buffer size
validation, allowing truncated buffers to pass checks designed for
smaller 10-bit packed data. This led to an out-of-bounds read in
`av_image_copy_plane` during 16-bit decoding.
The fix explicitly resets `dpx->unpadded_10bit = 0` at the start of
`decode_frame` to ensure correct validation for each frame.
Fixes: https://issues.oss-fuzz.com/issues/464471792
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: out of array read
Fixes: 464471792/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_DPX_DEC_fuzzer-5275522210004992
This does not improve performance with current hardware due to the poor
performance of segmented accesses. Performance should be slightly better
with expensive or near-future hardware that I don't have, however it is
still limited by two other factors:
- There are only 4 elements.
- The final stores are necessarily indexed and hit multiple cache lines,
thus as slow as scalar.
SpacemiT X60:
dct_unquantize_h263_inter_c: 417.8 ( 1.00x)
dct_unquantize_h263_inter_rvv_i32: 66.0 ( 6.33x)
dct_unquantize_h263_intra_c: 140.2 ( 1.00x)
dct_unquantize_h263_intra_rvv_i32: 67.7 ( 2.07x)
Note that the C benchmarks are not stable, depending heavily on the
number of coefficients picked by the RNG. The R-V V benchmarks are
however very stable and generally better than C's.
The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
It's a name that communicates its functionality in a better way.
Since the function was introduced very recently, we can safely rename it.
Signed-off-by: James Almer <jamrial@gmail.com>
Normally, this function tries to make sure all threads are saturated with
work to do before returning any frames; and will continue requesting packets
until that is the case.
However, this significantly slows down initial decoding latency when only
requesting a single frame (to e.g. configure the filter graph), and also
wastes a lot of unnecessary memory in the event that the user does not intend
to decode more frames until later.
By introducing a new `flags` paramater and a new flag
`AV_CODEC_RECEIVE_FRAME_FLAG_SYNCHRONOUS` to go along with it, we can allow
users to temporarily bypass this logic.
h->context_initialized is zero after flush, which triggers call to
ff_get_format unconditionally. ff_get_format can be heavy with
ff_hwaccel_uninit and hwaccel_init. For example, it takes 20 ms on
macOS with videotoolbox. ff_get_format should not be called if
nothing changed. ff_get_format is guarantee to be called at the
first time and when video information changed with
(must_reinit || needs_reinit).
Fix#20760.
The width 16 epel functions never use four taps in any direction*,
so don't build said functions. Saves 4352B of .text and 89B of
.text.unlikely here.
*: mx and my in vp8_mc_luma() are always even.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>