For pre-AVX2, vpbroadcastw is emulated via a load, followed
by two shuffles. Yet given that one always wants to splat
multiple pairs of coefficients which are adjacent in memory,
one can do better than that: Load all of them at once, perform
a punpcklwd with itself and use one pshufd per register.
In case one has to sign-extend the coefficients, too,
one can replace the punpcklwd with one pmovsxbw (instead of one
per register) and use pshufd directly afterwards.
This saved 4816B of .text here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
8 tap motion compensation functions with both vertical and horizontal
components are under severe register pressure, so that the filter
coefficients have to be put on the stack. Before this commit,
this meant that coefficients for use with pmaddubsw and pmaddwd
were always created. Yet this is completely unnecessary, as
every such register is only used for exactly one purpose and
it is known at compile time which one it is (only 8bit horizontal
filters are used with pmaddubsw), so only prepare that one.
This also allows to half the amount of stack used.
This saves 2432B of .text here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It has already been checked before that we are only dealing
with high bitdepth here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Since ba793127c4,
the x86 mpeg4videodsp code uses ff_emulated_edge_mc_sse2()
instead of ff_emulated_edge_mc_8. This leads to linker errors
when x86asm is disabled. Fix this by also falling back to ff_gmc_c()
in case edge emulation is needed with external SSE2 being unavailable.
An alternative is to go back to ff_emulated_edge_mc_8(), but this
would readd the uglyness to videodsp for a niche case.
Reported-by: James Almer <jamrial@gmail.com>
Reviewed-by: Hendrik Leppkes <h.leppkes@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Leave the existing one for non decoder-specific, post processing usage.
With this, scenarios like nvdec decoding can work algonside lcevc enhancement application.
Signed-off-by: James Almer <jamrial@gmail.com>
The return value of read_diff_float_data() was previously ignored,
allowing decode to continue silently with partially transformed samples
on malformed floating ALS input. Check and propagate the error.
All failure paths in read_diff_float_data() already return
AVERROR_INVALIDDATA, so the caller fix is sufficient without
any normalization inside the function.
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
Halfs the amount of pmaddwd and improves performance a lot:
sbc_analyze_4_c: 55.7 ( 1.00x)
sbc_analyze_4_mmx: 7.0 ( 7.94x)
sbc_analyze_4_sse2: 4.3 (12.93x)
sbc_analyze_8_c: 131.1 ( 1.00x)
sbc_analyze_8_mmx: 22.4 ( 5.84x)
sbc_analyze_8_sse2: 10.7 (12.25x)
It also saves 224B of .text and allows to remove the emms_c()
from sbcenc.c (notice that ff_sbc_calc_scalefactors_mmx()
issues emms on its own, so it already abides by the ABI).
Hint: A pshufd could be avoided per function if the constants
were reordered.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Check in init whether the parameters are valid.
This can be triggered with
ffmpeg -i tests/data/asynth-44100-2.wav -c sbc -sbc_delay 0.001 \
-b:a 100k -f null -
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
sbc_calc_scalefactors uses an int32_t [16/*max blocks*/][2/*max
channels*/][8/*max subbands*/] array. The MMX version of this code
treats the two inner arrays as one [2*8] array to process
and it processes subbands*channels of them. But when subbands
is < 8 and channels is two, the entries to process are not
contiguous: One has to process 0..subbands-1 and 8..7+subbands,
yet the code processed 0..2*subbands-1.
This commit fixes this by processing entries 0..7+subbands
if there are two channels.
Before this commit, the following command line triggered an
av_assert2() in put_bits():
ffmpeg_g -i tests/data/asynth-44100-2.wav -c sbc -b:a 200k \
-sbc_delay 0.003 -f null -
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
One buffer is encoder-only, the other decoder-only.
Also move crc_ctx before the buffers (into padding).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Choose the first non-hwaccel format rather than the last one. This
matches the logic in ffmpeg CLI and selects YUVA rather than YUV for
HEVC with alpha.
It only tests MMX (me_cmp does not have pure MMX functions any more)
and MMXEXT and is therefore x86-only. Furthermore, checkasm is superior
in every regard.
Removing it also fixes a build failure (there is no dependency of this
tool on me_cmp).
Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The expression (exif_len & ~SIZE_MAX) is always 0 for size_t,
making the overflow guard permanently dead code.
Reported-by: Guanni Qu <qguanni@gmail.com>
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
Replace abs() with FFABSU() to avoid undefined behavior when
raw_samples[c][i] == INT_MIN. Per libavutil/common.h, FFABS()
has the same INT_MIN UB as abs(); FFABSU() is the correct
helper as it casts to unsigned before negation.
Reported-by: Guanni Qu <qguanni@gmail.com>
Signed-off-by: Priyanshu Thapliyal <priyanshuthapliyal2005@gmail.com>
This reduces sizeof(VVCLocalContext) from 4580576B to
3408032B here.
Reviewed-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
And move the big buffers to the end. This reduces codesize
as offset+displacement addressing modes are either unavailable
or require more bytes of displacement is too large. E.g. this
saves 5952B on x64 here and 3008B on AArch64. This change should
also improve data locality.
Reviewed-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
6 is an undefined value for payload_size_type. For those, 7 is used to signal
a custom_byte_size synxtax element.
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: signed integer overflow
Fixes: out of array access
Fixes: dvdsub_int_overflow_mixed_ps.mpg
Found-by: Quang Luong of Calif.io in collaboration with OpenAI Codex
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
It doesn't hurt to keep track of filtered_size:
The end result will be ignored if extradata is not removed
from the bitstream.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Changes compared to the current version include:
1. We no longer use a dummy PutByteContext on the first pass
for checking whether there is extradata in the NALU. Instead
the first pass no longer writes anything to any PutByteContext
at all; the size information is passed via additional int*
parameters. (This no longer discards const when initializing
the dummy PutByteContext, fixing a compiler warning.)
2. We actually error out on invalid data in the first pass,
ensuring that the second pass never fails.
3. The first pass is used to get the exact sizes of both
the extradata and the filtered data. This obviates the need
for reallocating the buffers lateron. (It also means
that the extradata side data will have been allocated with
av_malloc (ensuring proper alignment) instead of av_realloc().)
4. The second pass now writes both extradata and (if written)
the filtered data instead of parsing the NALUs twice.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Whenever the link register is stored on the stack, sign it
before storing it and validate at a symmetrical point (with the
stack at the same level as when it was signed).
These macros only have an effect if built with PAC enabled (e.g.
through -mbranch-protection=standard), otherwise they don't
generate any extra instructions.
None of these cases were present when PAC support was added
in 248986a0db in 2022.
Without these changes, PAC still had an effect in the compiler
generated code and in the existing cases where we these macros were
used - but make it apply to the remaining cases of link register
on the stack.