We already parse the EXIF side data to extract the orientation, so we
should add it to the output file as an EXIF box.
Signed-off-by: Leo Izen <leo.izen@gmail.com>
Before this commit, we ignore the display matrix side data if any EXIF
side data is present, even if that side data contains no orientation
tag. This commit instead calculates the orientation from the display
matrix side data first, if present. Ideally the decoder will have
removed the orientation tag upon decoding and attached the data as
display matrix side data instead, so this makes our orientation code
respect this behavior.
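
A minimal sketch of the precedence described above. The struct and the
function name are hypothetical and only model the decision; this is not
the encoder's actual code.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the side data attached to a frame. */
typedef struct FrameInfo {
    bool has_display_matrix;  /* display matrix side data is attached */
    int  matrix_orientation;  /* EXIF-style orientation (1..8) derived from it */
    bool has_exif;            /* EXIF side data is attached */
    int  exif_orientation;    /* orientation tag value, 0 if the tag is absent */
} FrameInfo;

/* Prefer the display matrix, fall back to the EXIF tag, default to 1 (upright). */
static int pick_orientation(const FrameInfo *f)
{
    if (f->has_display_matrix)
        return f->matrix_orientation;
    if (f->has_exif && f->exif_orientation >= 1 && f->exif_orientation <= 8)
        return f->exif_orientation;
    return 1;
}
```

With this ordering, EXIF side data that lacks an orientation tag no longer
hides a display matrix that is present.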
Signed-off-by: Leo Izen <leo.izen@gmail.com>
Four rows of four bytes fit into one xmm register; therefore
one can arrange the rows as follows (A, B, C: first, second, third
row, etc.):
xmm0: ABABABAB BCBCBCBC
xmm1: CDCDCDCD DEDEDEDE
xmm2: EFEFEFEF FGFGFGFG
xmm3: GHGHGHGH HIHIHIHI
and use four pmaddubsw to calculate two rows in parallel. The history
fits into four registers, making this possible even on 32-bit systems.
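
The register layout above can be modelled in scalar C. This is a sketch
under assumptions, not the assembly: the helper names are made up, the
coefficients passed in are up to the caller, the accumulation is done in
32 bits for simplicity, and rounding, shifting and clipping are left out.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmaddubsw:
 * u8 * s8 + u8 * s8 with signed 16-bit saturation. */
static int16_t maddubs_lane(uint8_t a0, int8_t c0, uint8_t a1, int8_t c1)
{
    int32_t v = a0 * c0 + a1 * c1;
    if (v >  32767) v =  32767;
    if (v < -32768) v = -32768;
    return (int16_t)v;
}

/* Build one "register": the low 8 bytes interleave rows r and r+1,
 * the high 8 bytes interleave rows r+1 and r+2 (e.g. ABABABAB BCBCBCBC). */
static void interleave(uint8_t reg[16], uint8_t rows[][4], int r)
{
    for (int x = 0; x < 4; x++) {
        reg[2 * x]         = rows[r][x];
        reg[2 * x + 1]     = rows[r + 1][x];
        reg[8 + 2 * x]     = rows[r + 1][x];
        reg[8 + 2 * x + 1] = rows[r + 2][x];
    }
}

/* Two 4-pixel output rows of an 8-tap vertical filter from nine input rows.
 * out[0..3] is output row 0, out[4..7] is output row 1 (unrounded sums). */
static void filter_two_rows(int32_t out[8], uint8_t rows[9][4],
                            const int8_t coef[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = 0;
    for (int k = 0; k < 4; k++) {          /* one pmaddubsw per register */
        uint8_t reg[16];
        interleave(reg, rows, 2 * k);
        for (int lane = 0; lane < 8; lane++)
            out[lane] += maddubs_lane(reg[2 * lane],     coef[2 * k],
                                      reg[2 * lane + 1], coef[2 * k + 1]);
    }
}
```

Each register is multiplied by one coefficient pair, so the low half
contributes to the first output row and the high half to the second.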
Old benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x)
New benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x)
Old benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x)
vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x)
New benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x)
vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x)
Loading the constants into registers did not turn out to be advantageous
here (not to mention Win64, where this would necessitate saving
and restoring even more registers), probably because there are only two
loop iterations.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes a heap-buffer-overflow in `libavcodec/dpx.c` triggered by a stale
`unpadded_10bit` flag in the `DPXDecContext`. This flag, set for 10-bit
unpadded frames, persisted across `decode_frame` calls. If a subsequent
frame was 16-bit, the stale flag caused incorrect buffer size
validation, allowing truncated buffers to pass checks designed for
smaller 10-bit packed data. This led to an out-of-bounds read in
`av_image_copy_plane` during 16-bit decoding.
The fix explicitly resets `dpx->unpadded_10bit = 0` at the start of
`decode_frame` to ensure correct validation for each frame.
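
A toy model of the bug class and the fix. Everything here is hypothetical
(one component per pixel, made-up signature); it is not the size
calculation used in `libavcodec/dpx.c`.

```c
#include <stddef.h>

typedef struct DecCtx {
    int unpadded_10bit;  /* lives in the context, so it survives between calls */
} DecCtx;

/* Returns 0 if buf_size is large enough for the frame, -1 otherwise. */
static int decode_frame(DecCtx *ctx, int bits, int packed,
                        int width, int height, size_t buf_size)
{
    size_t need;

    ctx->unpadded_10bit = 0;  /* the fix: reset per-frame state up front */
    if (bits == 10 && packed)
        ctx->unpadded_10bit = 1;

    if (ctx->unpadded_10bit)
        need = ((size_t)width * height * 10 + 7) / 8;  /* tightly packed bits */
    else
        need = (size_t)width * height * (size_t)((bits + 7) / 8);

    return buf_size < need ? -1 : 0;
}
```

Without the reset, a 16-bit frame following a 10-bit unpadded frame would
be validated against the smaller packed size in this model.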
Fixes: https://issues.oss-fuzz.com/issues/464471792
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: out of array read
Fixes: 464471792/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_DPX_DEC_fuzzer-5275522210004992
This does not improve performance on current hardware due to the poor
performance of segmented accesses. Performance should be slightly better
on expensive or near-future hardware that I don't have; however, it is
still limited by two other factors:
- There are only 4 elements.
- The final stores are necessarily indexed and hit multiple cache lines,
thus as slow as scalar.
SpacemiT X60:
dct_unquantize_h263_inter_c: 417.8 ( 1.00x)
dct_unquantize_h263_inter_rvv_i32: 66.0 ( 6.33x)
dct_unquantize_h263_intra_c: 140.2 ( 1.00x)
dct_unquantize_h263_intra_rvv_i32: 67.7 ( 2.07x)
Note that the C benchmarks are not stable, depending heavily on the
number of coefficients picked by the RNG. The R-V V benchmarks are
however very stable and generally better than C's.
The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
It's a name that communicates its functionality in a better way.
Since the function was introduced very recently, we can safely rename it.
Signed-off-by: James Almer <jamrial@gmail.com>
Normally, this function tries to make sure all threads are saturated with
work before returning any frames, and will continue requesting packets
until that is the case.
However, this significantly increases initial decoding latency when only
requesting a single frame (e.g. to configure the filter graph), and also
wastes a lot of memory in the event that the user does not intend
to decode more frames until later.
By introducing a new `flags` parameter and a new flag
`AV_CODEC_RECEIVE_FRAME_FLAG_SYNCHRONOUS` to go along with it, we can allow
users to temporarily bypass this logic.
h->context_initialized is zero after a flush, which unconditionally
triggers a call to ff_get_format. ff_get_format can be expensive because
of ff_hwaccel_uninit and hwaccel_init; for example, it takes 20 ms on
macOS with videotoolbox. ff_get_format should not be called if
nothing changed. It is still guaranteed to be called the first time
and whenever the video information changes, via
(must_reinit || needs_reinit).
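
A simplified sketch of the intended behavior. The struct, the function
names and the call counter are hypothetical stand-ins; the real decoder
tracks far more state than width and height.

```c
typedef struct Dec {
    int context_initialized;  /* cleared by a flush */
    int width, height;        /* last negotiated video parameters */
    int get_format_calls;     /* counts the expensive negotiation */
} Dec;

static void dec_flush(Dec *d)
{
    d->context_initialized = 0;
}

static void start_frame(Dec *d, int width, int height)
{
    /* Renegotiate only when the parameters changed, not merely because
     * context_initialized was cleared by a flush. */
    int needs_reinit = d->width != width || d->height != height;

    if (needs_reinit) {
        d->width  = width;
        d->height = height;
        d->get_format_calls++;  /* stand-in for the costly ff_get_format() */
    }
    d->context_initialized = 1;
}
```

In this model the first frame always renegotiates because the stored
dimensions start at zero, and a flush followed by an unchanged stream
does not.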
Fix #20760.
The width 16 epel functions never use four taps in any direction*,
so don't build said functions. Saves 4352B of .text and 89B of
.text.unlikely here.
*: mx and my in vp8_mc_luma() are always even.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.
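
The bound can be checked with a few lines of C. This is a sketch under
assumptions: the helper names are made up, and the filter
0, -6, 123, 12, -1, 0 and the bilinear weights 5, 3 used in the note below
are illustrative inputs assumed to be representative of VP8, not values
taken from this commit.

```c
#include <stdint.h>

/* Largest possible sum(p[i] * c[i]) for 8-bit pixels p[i] in 0..255:
 * every positive coefficient meets 255, every negative one meets 0. */
static int worst_case_max(const int *c, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (c[i] > 0)
            sum += 255 * c[i];
    return sum;
}

/* Smallest possible sum: every negative coefficient meets 255. */
static int worst_case_min(const int *c, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (c[i] < 0)
            sum += 255 * c[i];
    return sum;
}

/* True if no choice of pixels can push the sum out of int16_t range. */
static int fits_int16(const int *c, int n)
{
    return worst_case_max(c, n) <= INT16_MAX &&
           worst_case_min(c, n) >= INT16_MIN;
}
```

With only the central coefficient 123 in the sum, the maximum is
255 * 123 = 31365 and fits into 16 bits; with both 123 and 12 it is
255 * 135 = 34425 and does not. Bilinear weights summing to 8 give at
most 255 * 8 = 2040.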
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Doubling the register width allows using only one pshufb and pmaddubsw.
Old benchmarks:
vp8_put_epel4_h4_c: 82.8 ( 1.00x)
vp8_put_epel4_h4_ssse3: 13.9 ( 5.96x)
New benchmarks:
vp8_put_epel4_h4_c: 82.7 ( 1.00x)
vp8_put_epel4_h4_ssse3: 11.7 ( 7.08x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Switching to xmm registers allows processing two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).
Old benchmarks:
vp8_put_epel4_v4_c: 96.8 ( 1.00x)
vp8_put_epel4_v4_ssse3: 28.2 ( 3.43x)
New benchmarks:
vp8_put_epel4_v4_c: 95.1 ( 1.00x)
vp8_put_epel4_v4_ssse3: 22.8 ( 4.17x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Switching to xmm registers allows processing two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).
Old benchmarks:
vp8_put_epel4_v6_c: 132.8 ( 1.00x)
vp8_put_epel4_v6_ssse3: 34.3 ( 3.87x)
New benchmarks:
vp8_put_epel4_v6_c: 131.5 ( 1.00x)
vp8_put_epel4_v6_ssse3: 27.1 ( 4.86x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is a register available. No change in benchmarks here.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Use GPRs on x64 and xmm registers otherwise (using GPRs reduces code size).
This avoids clobbering the floating point state and therefore no longer
breaks the ABI.
No change in benchmarks here.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
A heap-use-after-free vulnerability was identified in
`libavcodec/aac/aacdec.c`. When `che_configure` freed a
`ChannelElement` (`ac->che[type][id]`), it failed to clear all
references to it in `ac->tag_che_map`. `ac->tag_che_map` caches
pointers to `ChannelElement`s and can contain cross-type mappings (e.g.,
a `TYPE_SCE` tag mapping to a `TYPE_LFE` element).
In a USAC stream reconfiguration scenario, an LFE element was freed, but
a stale pointer remained in `ac->tag_che_map`. Subsequent calls to
`ff_aac_get_che` returned this dangling pointer, leading to a crash in
`decode_usac_core_coder`.
This commit fixes the issue by iterating over the entire
`ac->tag_che_map` in `che_configure` and clearing any entries that point
to the `ChannelElement` about to be freed, ensuring no dangling pointers
remain.
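
A minimal model of the fix. The types, array sizes and the function name
are hypothetical; only the idea of scanning the whole cache before freeing
is taken from the description above.

```c
#include <stdlib.h>

#define NB_TYPES 4
#define NB_IDS   16

typedef struct Element { int dummy; } Element;

typedef struct Decoder {
    Element *che[NB_TYPES][NB_IDS];          /* owning pointers */
    Element *tag_che_map[NB_TYPES][NB_IDS];  /* non-owning cache, may cross types */
} Decoder;

/* Free che[type][id] after dropping every cached reference to it. */
static void free_element(Decoder *ac, int type, int id)
{
    Element *e = ac->che[type][id];

    if (!e)
        return;
    /* Scan the entire map: the cached entry is not necessarily at [type][id]. */
    for (int t = 0; t < NB_TYPES; t++)
        for (int i = 0; i < NB_IDS; i++)
            if (ac->tag_che_map[t][i] == e)
                ac->tag_che_map[t][i] = NULL;
    free(e);
    ac->che[type][id] = NULL;
}
```

Clearing only `tag_che_map[type][id]` would miss a cross-type entry, which
is the case the reconfiguration scenario above runs into.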
Fixes: https://issues.oss-fuzz.com/issues/440220467