The return value of read_diff_float_data() was previously ignored,
allowing decode to continue silently with partially transformed samples
on malformed floating ALS input. Check and propagate the error.
All failure paths in read_diff_float_data() already return
AVERROR_INVALIDDATA, so the caller fix is sufficient without
any normalization inside the function.
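The caller-side pattern can be sketched as follows (the stub and the error value are stand-ins, not the actual decoder code):

```c
/* Placeholder negative error code standing in for AVERROR_INVALIDDATA. */
#define ERR_INVALIDDATA (-1)

/* Hypothetical stand-in for read_diff_float_data(): returns 0 on success,
 * a negative error code on malformed floating ALS input. */
static int read_diff_float_data_stub(int malformed)
{
    return malformed ? ERR_INVALIDDATA : 0;
}

/* Caller pattern after the fix: check the return value and propagate it
 * instead of silently continuing with partially transformed samples. */
static int decode_float_block(int malformed)
{
    int ret = read_diff_float_data_stub(malformed);
    if (ret < 0)
        return ret;
    /* ... continue decoding fully transformed samples ... */
    return 0;
}
```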
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
Halves the number of pmaddwd instructions and improves performance a lot:
sbc_analyze_4_c: 55.7 ( 1.00x)
sbc_analyze_4_mmx: 7.0 ( 7.94x)
sbc_analyze_4_sse2: 4.3 (12.93x)
sbc_analyze_8_c: 131.1 ( 1.00x)
sbc_analyze_8_mmx: 22.4 ( 5.84x)
sbc_analyze_8_sse2: 10.7 (12.25x)
It also saves 224B of .text and makes it possible to remove the emms_c()
from sbcenc.c (notice that ff_sbc_calc_scalefactors_mmx()
issues emms on its own, so it already abides by the ABI).
Hint: A pshufd could be avoided per function if the constants
were reordered.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Check in init whether the parameters are valid.
This can be triggered with
ffmpeg -i tests/data/asynth-44100-2.wav -c sbc -sbc_delay 0.001 \
-b:a 100k -f null -
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
sbc_calc_scalefactors uses an int32_t [16/*max blocks*/][2/*max
channels*/][8/*max subbands*/] array. The MMX version of this code
treats the two inner arrays as one flat [2*8] array
and processes subbands*channels entries of it. But when
subbands is < 8 and there are two channels, the entries
to process are not contiguous: one has to process
0..subbands-1 and 8..7+subbands, yet the code processed
0..2*subbands-1.
This commit fixes this by processing entries 0..7+subbands
if there are two channels.
Before this commit, the following command line triggered an
av_assert2() in put_bits():
ffmpeg_g -i tests/data/asynth-44100-2.wav -c sbc -b:a 200k \
-sbc_delay 0.003 -f null -
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
One buffer is encoder-only, the other decoder-only.
Also move crc_ctx before the buffers (into padding).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Choose the first non-hwaccel format rather than the last one. This
matches the logic in ffmpeg CLI and selects YUVA rather than YUV for
HEVC with alpha.
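The selection loop amounts to the following sketch (the format-list layout and the hwaccel test are hypothetical simplifications, not the actual AVCodecContext API):

```c
#include <stddef.h>

/* Hypothetical convention for this sketch only: formats >= 100 are
 * hwaccel formats; the list is terminated by -1. */
static int is_hwaccel_fmt(int fmt)
{
    return fmt >= 100;
}

/* Select the first non-hwaccel entry rather than the last one. */
static int choose_sw_format(const int *fmts)
{
    for (size_t i = 0; fmts[i] != -1; i++)
        if (!is_hwaccel_fmt(fmts[i]))
            return fmts[i];
    return -1;
}
```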
It only tests MMX (me_cmp does not have pure MMX functions any more)
and MMXEXT and is therefore x86-only. Furthermore, checkasm is superior
in every regard.
Removing it also fixes a build failure (there is no dependency of this
tool on me_cmp).
Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The expression (exif_len & ~SIZE_MAX) is always 0 for size_t,
making the overflow guard permanently dead code.
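A minimal demonstration of why the guard is dead (helper name is mine):

```c
#include <stdint.h>
#include <stddef.h>

/* ~SIZE_MAX has all value bits of size_t cleared, so for any size_t
 * value (exif_len & ~SIZE_MAX) == 0 and a guard built on it can never
 * fire. */
static int dead_guard(size_t exif_len)
{
    return (exif_len & ~SIZE_MAX) != 0; /* always false */
}
```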
Reported-by: Guanni Qu <qguanni@gmail.com>
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
Replace abs() with FFABSU() to avoid undefined behavior when
raw_samples[c][i] == INT_MIN. Per libavutil/common.h, FFABS()
has the same INT_MIN UB as abs(); FFABSU() is the correct
helper as it casts to unsigned before negation.
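The difference can be sketched as follows (the macro body here mirrors the idea of the libavutil helper; the wrapper function is mine):

```c
#include <limits.h>

/* Cast to unsigned before negating, so INT_MIN is handled without
 * signed overflow; the result type is unsigned. This follows the
 * FFABSU() approach from libavutil/common.h. */
#define ABS_U(a) ((a) <= 0 ? -(unsigned)(a) : (unsigned)(a))

static unsigned abs_u(int v)
{
    return ABS_U(v);
}
```

abs(INT_MIN) is undefined behavior because -INT_MIN is not representable in int; the unsigned negation is well defined (modulo 2^N).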
Reported-by: Guanni Qu <qguanni@gmail.com>
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
This reduces sizeof(VVCLocalContext) from 4580576B to
3408032B here.
Reviewed-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
And move the big buffers to the end. This reduces codesize
as offset+displacement addressing modes are either unavailable
or require more bytes if the displacement is too large. E.g. this
saves 5952B on x64 here and 3008B on AArch64. This change should
also improve data locality.
Reviewed-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
6 is an undefined value for payload_size_type. For those, 7 is used to signal
a custom_byte_size syntax element.
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: signed integer overflow
Fixes: out of array access
Fixes: dvdsub_int_overflow_mixed_ps.mpg
Found-by: Quang Luong of Calif.io in collaboration with OpenAI Codex
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
It doesn't hurt to keep track of filtered_size:
The end result will be ignored if extradata is not removed
from the bitstream.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Changes compared to the current version include:
1. We no longer use a dummy PutByteContext on the first pass
for checking whether there is extradata in the NALU. Instead
the first pass no longer writes anything to any PutByteContext
at all; the size information is passed via additional int*
parameters. (This no longer discards const when initializing
the dummy PutByteContext, fixing a compiler warning.)
2. We actually error out on invalid data in the first pass,
ensuring that the second pass never fails.
3. The first pass is used to get the exact sizes of both
the extradata and the filtered data. This obviates the need
for reallocating the buffers later on. (It also means
that the extradata side data will have been allocated with
av_malloc (ensuring proper alignment) instead of av_realloc().)
4. The second pass now writes both extradata and (if written)
the filtered data instead of parsing the NALUs twice.
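The two-pass shape described above can be sketched with a deliberately simplified classifier (bytes below 0x80 stand in for "extradata NALUs"; all names are hypothetical, not the bsf's API):

```c
#include <stdlib.h>

/* Pass 1: only measure (this is where validation would fail). */
static void measure(const unsigned char *in, size_t len,
                    size_t *extradata_size, size_t *filtered_size)
{
    *extradata_size = *filtered_size = 0;
    for (size_t i = 0; i < len; i++)
        *(in[i] < 0x80 ? extradata_size : filtered_size) += 1;
}

/* Pass 2: write into exactly-sized buffers; can no longer fail on the
 * data itself, so no reallocation is ever needed. */
static int split(const unsigned char *in, size_t len,
                 unsigned char **extradata, size_t *extradata_size,
                 unsigned char **filtered, size_t *filtered_size)
{
    size_t e = 0, f = 0;
    measure(in, len, extradata_size, filtered_size);
    *extradata = malloc(*extradata_size + 1); /* +1 keeps size 0 valid */
    *filtered  = malloc(*filtered_size + 1);
    if (!*extradata || !*filtered) {
        free(*extradata);
        free(*filtered);
        return -1;
    }
    for (size_t i = 0; i < len; i++)
        if (in[i] < 0x80)
            (*extradata)[e++] = in[i];
        else
            (*filtered)[f++] = in[i];
    return 0;
}

static int demo(void)
{
    const unsigned char in[] = { 0x10, 0x90, 0x20, 0xA0 };
    unsigned char *ed, *ft;
    size_t eds, fts;
    int ok;
    if (split(in, sizeof(in), &ed, &eds, &ft, &fts) < 0)
        return -1;
    ok = eds == 2 && fts == 2 && ed[0] == 0x10 && ed[1] == 0x20 &&
         ft[0] == 0x90 && ft[1] == 0xA0;
    free(ed);
    free(ft);
    return ok ? 0 : -1;
}
```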
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Whenever the link register is stored on the stack, sign it
before storing it and validate at a symmetrical point (with the
stack at the same level as when it was signed).
These macros only have an effect if built with PAC enabled (e.g.
through -mbranch-protection=standard), otherwise they don't
generate any extra instructions.
None of these cases were present when PAC support was added
in 248986a0db in 2022.
Without these changes, PAC already had an effect in the
compiler generated code and in the existing cases where
these macros were used - this commit makes it apply to the
remaining cases of the link register on the stack.
The sme_entry/sme_exit macros already take care of backing up/restoring
these registers. Additionally, as long as no function calls are
made within the function, x30 doesn't need to be backed up at all.
Possible now that this function is no longer MMX.
Old benchmarks:
gmc_edge_emulation_c: 782.3 ( 1.00x)
gmc_edge_emulation_ssse3: 220.3 ( 3.55x)
New benchmarks:
gmc_edge_emulation_c: 770.9 ( 1.00x)
gmc_edge_emulation_ssse3: 111.0 ( 6.94x)
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It beats MMX by a lot, because eight words have to be processed at a time.
Also notice that the MMX code expects registers to be preserved
between separate inline assembly blocks which is not guaranteed;
the new code meanwhile does not presume this.
Benchmarks:
gmc_c: 817.8 ( 1.00x)
gmc_mmx: 210.7 ( 3.88x)
gmc_ssse3: 80.7 (10.14x)
The MMX version has been removed.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
edge_emu_mc allows using different src and dst strides,
so one can replace the outsized edge emu buffer with
one that is much smaller and nevertheless big enough
for all our needs; it also avoids having to check
whether the buffer is actually big enough.
This also improves performance (if the compiler uses
stack probing). Old benchmarks:
gmc_c: 814.5 ( 1.00x)
gmc_mmx: 243.7 ( 3.34x)
New benchmarks:
gmc_c: 813.8 ( 1.00x)
gmc_mmx: 213.5 ( 3.81x)
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
MPEG-4 GMC uses the following motion prediction scheme:
For output pixel (x,y), the reference pixel at fractional
coordinates (ox+dxx*x+dxy*y,oy+dyx*x+dyy*y) is used as prediction;
the latter is calculated via bilinear interpolation. The coefficients
here are fixed-point values with 16+shift fractional bits
where shift is sprite_warping_accuracy+1. For the weights,
only the shift most significant fractional bits are used.
shift can be at most four*.
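The fixed-point split described above can be sketched like this (function names are mine, not the codec's):

```c
#include <stdint.h>

/* Coefficients are fixed-point with 16 + shift fractional bits,
 * shift = sprite_warping_accuracy + 1. */
static int64_t gmc_coord(int o, int dxx, int dxy, int x, int y)
{
    return o + (int64_t)dxx * x + (int64_t)dxy * y;
}

/* Integer (fpel) part of the reference coordinate. */
static int gmc_fpel(int64_t c, int shift)
{
    return (int)(c >> (16 + shift));
}

/* Bilinear weight: only the `shift` most significant fractional bits.
 * Shifting away just the 4 LSBs (as the MMX code did) matches this only
 * when shift == 4; for smaller shift, fpel bits leak into the weight. */
static int gmc_weight(int64_t c, int shift)
{
    return (int)((c >> 16) & ((1 << shift) - 1));
}
```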
The x86 MMX gmc implementation performs these calculations
using 16-bit words. To do so, it restricts itself to the case
in which the four least significant bits of dxx,dxy,dyx,dyy
are zero and shifts these bits away. Yet in case shift is
less than four, the 16 bits retained also contain at least
one bit that actually belongs to the fpel component
(which is already taken into account by using the correct
pixels for interpolation).
(This has been uncovered by a to-be-added checkasm test.
I don't know whether there are actual files in the wild
using sprite_warping_accuracy 0-2.)
*: It is always four when encoding with xvid and GMC.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
When encoding VP9 with a YUV pixel format (e.g. yuv420p) and
AVCOL_SPC_RGB colorspace metadata, libvpxenc unconditionally set
VPX_CS_SRGB. This produced a spec-violating bitstream: Profile 0
(4:2:0) with sRGB colorspace, which is only valid for Profile 1/3
(4:4:4). The resulting file is undecodable.
Fix this by setting ctx->vpx_cs to VPX_CS_SRGB in set_pix_fmt()
for 4:4:4 YUV formats when AVCOL_SPC_RGB is set, matching the
existing GBRP path. This covers the legitimate case of RGB data in
YUV444 containers (e.g. H.264 High 4:4:4 with identity matrix).
With this change, any AVCOL_SPC_RGB that reaches the switch in
set_colorspace() is guaranteed to be a subsampled format where
sRGB is invalid. Return an error so the user can fix their
pipeline rather than silently producing incorrect output.
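Condensed, the decision amounts to the following sketch (the return values stand in for VPX_CS_SRGB and an AVERROR; this is not the actual libvpxenc code):

```c
/* With AVCOL_SPC_RGB metadata: sRGB may only be signalled for 4:4:4
 * pixel formats (VP9 profiles 1/3). hshift/vshift are the chroma
 * subsampling shifts of the pixel format. */
static int vpx_cs_for_rgb_metadata(int hshift, int vshift)
{
    if (hshift == 0 && vshift == 0)
        return 1;  /* stand-in for VPX_CS_SRGB: 4:4:4, legal */
    return -1;     /* subsampled: error out instead of writing sRGB */
}
```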
To reproduce:
ffmpeg -f lavfi -i testsrc=s=64x64:d=1:r=1 \
-c:v libvpx-vp9 -pix_fmt yuv420p -colorspace rgb bad.webm
ffprobe bad.webm
# -> "vp9 (Profile 0), none(pc, gbr/...), 64x64"
ffmpeg -i bad.webm -f null -
# -> 0 frames decoded, error
See also:
https://issues.webmproject.org/487307225
Signed-off-by: Guangyu Sun <gsun@roblox.com>
Signed-off-by: James Zern <jzern@google.com>
For cases when returning early without updating any pixels, we
previously returned to the return address in the caller's scope,
bypassing one function entirely. While this may seem like a neat
optimization, it makes the return stack predictor mispredict
the returns - which potentially can cost more performance than
it gains.
Secondly, if the armv9.3 feature GCS (Guarded Control Stack) is
enabled, then returns _must_ match the expected value; this feature
is being enabled across Linux distributions, and by fixing the
hevc assembly, we can enable the security feature for FFmpeg as well.
Cap ulNumDecodeSurfaces to 32 and ulNumOutputSurfaces to 64 to prevent
cuvidCreateDecoder from failing with CUDA_ERROR_INVALID_VALUE when
initial_pool_size exceeds the hardware limits.
Also cap the decoder index pool (dpb_size) to 32 so that indices
handed out via av_refstruct_pool_get stay within the valid range
for cuvidDecodePicture's CurrPicIdx.
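The capping is a plain clamp; a sketch with the limits from this commit (the helper name is mine):

```c
/* Hardware limits described above: at most 32 decode surfaces and
 * 64 output surfaces. */
#define MAX_DECODE_SURFACES 32
#define MAX_OUTPUT_SURFACES 64

static int clamp_surfaces(int requested, int cap)
{
    return requested > cap ? cap : requested;
}
```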
When unsafe_output is enabled, stop holding idx_ref in the unmap
callback. Since cuvidMapVideoFrame copies decoded data into an
independent output mapping slot, the decode surface index can safely
be reused as soon as the DPB releases it, without waiting for the
downstream consumer to release the mapped frame. This decouples the
decode surface index lifetime (max 32) from the output mapping slot
lifetime (max 64), eliminating the "No decoder surfaces left" error
that occurred when downstream components like nvenc held too many
frames.
Signed-off-by: Diego de Souza <ddesouza@nvidia.com>
Fixes ticket #22420.
When the first decoded frame is type 1, xan_decode_frame_type1()
reads y_buffer as prior-frame state before any data has been
written to it. Since y_buffer is allocated with av_malloc(), this
may propagate uninitialized heap data into the decoded luma
output. Allocate y_buffer with av_mallocz() instead.