The CELT psychoacoustic path was effectively broken: analysis could use
the wrong queued audio and stale scratch samples, and raw band scores were
folded into the frame bit budget, where they could overflow instead of
only driving alloc_boost.
On top of that, c3aea7628c changed avctx->frame_size from fixed
120-sample steps to a configuration-derived value, while the CELT input
and psy paths still treated queue entries as 120-sample steps. That could
misalign psy analysis, read before a short overlap frame, stall silent
flushes, poison rate control with zero-bit silent frames, and overrun the
range coder on EOF or short tails.
This commit fixes these cases by using avctx->frame_size for psy step
accounting, aligning bufqueue analysis with actual audio, padding short
overlaps, and avoiding invalid bit-budget updates for silent or EOF
packets. This lets CELT produce valid packets again.
This is a departure from the conventional idea of decoders always outputting
data as fast as possible. Instead, this allows decoders to be throttled in the
same way filter graphs can be.
This comes into play when e.g. a demuxer is feeding into two decoders, but
only one of the two decoders is actually currently needed (e.g. due to
A/V misalignment). In that case, what typically happens is that the unneeded
decoder alse decodes all frames, and then piles them up on the "buffersrc"
filter's downstream link (growing indefinitely).
Another issue this solves manifests when e.g. a single demuxer is feeding many
decoders that all try to feed frames to the same filter graph. In this case,
all decoders run as fast as posssible, leading to lock contention on the
filter graph input queue; resulting in (again) many frames piling up on the
buffersrc (or downstream filters) for the unneeded inputs that are not actually
the bottleneck, while the input that's actually undersatisfied can end up
starved for CPU time, possibly for long enough to exhaust memory limits. The
normal rate limiting fails to apply in this scenario because all decoders share
a single demuxer, and are hence rate-limited only by the demuxer speed; whereas
the demuxer is not choked because from the PoV of the scheduler, the filter
graph is simply not getting enough frames.
In a more general sense, there's a philosophical argument to be made here.
Since a decoder is typically also a decompressor, it produces more data than
it consumes. So, it a sense, it's acting like a type of producer also - in
the same way that a filter graph can produce more input that outputs.
Solve all of these issues by allowing decoders to be output-choked, which
gives the scheduler control over when decoders are allowed to output frames.
This does mean we have to add some sort of internal packet queue, because the
decoder thread may need to continue *accepting* upstream packets from the
demuxer (or else we risk stalling the demuxer), but defer the actual decoding
by placing them inside an internal "overflow" queue.
This effectively simulates a sort of "filter graph"-type semantics but
for the decoder queue.
This overflow logic is fairly self-contained inside `sch_dec_receive`, though
it is quite nontrivial. I have added as much documentation as is hopefully
needed to understand the logic.
Importantly, we cannot simply unlimit the decoder input thread queue because
the demuxer relies on backpressure from the decoder to rate limit itself. (Note
that demuxers may only be active if there is at least one downstream decoder
that is alse active, so we always have at least one decoder providing
backpressure)
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
When a filter is choked, but upstream threads are trying to write to its input,
this can result in the filter's input queue getting stuck. Normally, the
unchoke_downstream() logic would prevent this from happening, since the
filter would itself get unchoked as a result of upstream decoders receiving
pressure from the demuxer.
However, upcoming changes to this logic will require weakening this upstream
unchoking logic, so preventing the deadlock in a more elegant way helps with
making the code more robust.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
Exactly what it says on the tin. There is some ambiguity as to whether this
should also prevent reading from *choked*, as opposed to empty queue, but
I think it makes sense to consider them equivalent, as I struggle to think
of a use case where it would be beneficial to allow draining a queue that
was explicitly choked by the upstream (to e.g. prevent further reads).
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
It was never reliable to detect if a filtergraph have sources, because a filter
can act as a source only after some time, for example the loop filter.
So it is better to remove the source detection entirely and always give the
scheduler an oppurtunity to stop processing.
Fixes ticket #11604.
Signed-off-by: Marton Balint <cus@passwd.hu>
Signed-off-by: Niklas Haas <git@haasn.dev>
schedule_update_locked() is supposed to be a no-op when `sch->terminate`
is 1. However, there is a TOCTOU error here, where a different thread may
currently be executing schedule_update_locked(), having successfully passed
the sch->terminate check but without actually updating the choke status.
This does not matter for the current code, but will matter with the following
commit, where it creates the theoretical possibility of a race where sch_stop()
is trying to choke the demuxers (and unchoke the decoders) while
schedule_update_locked() is simultaneously trying to choke the decoders,
leading to a deadlock if the last decoder is left choked and unable to
propagate EOF downstream.
The cleanest solution is to just take the scheduler lock while updating the
choke status here. This ensures that any other schedule_update_locked() calls
will have completed.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of awkwardly looping over the type, just split this up into
multiple loops. The loss in complexity seems worth the loss in conciseness
to me, and more importantly, this allows us to easily add more waiter types.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
Fixes: ada-4-poc.ty
change is based on the suggested fix
Found-by: Claude and Ada Logics. This issue was found by Anthropic from using agents to study security of open source projects
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Before this change, the decoder was forcing downmixing everything to a max of
six channels.
Layouts 6.1(back), 7.1(wide), 7.1 and 5.1.2 (Channel Configurations 11, 7, 12,
and 14 respectively, as well as the equivalent PCE version) should be supported
now.
Signed-off-by: James Almer <jamrial@gmail.com>
- Prevent integer overflow when summing header lengths; add bounds check.
- Re-initialize priv->vp with the new stream's extradata once all chained
stream headers are collected.
Signed-off-by: Dale Curtis <dalecurtis@chromium.org>
Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12, nv21}
to {rgb565le, bgr565le, rgb555le, bgr555le}.
The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS-64
requires d8-d15 to be callee-saved, declare_func now wraps a
stp d8, d9 / ldp d8, d9 around 16bpp paths only (gated by .ifc on the
output format). Pattern matches libswscale/aarch64/hscale.S.
yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
mirroring how yuva420p -> rgb24/bgr24 already work in tree.
Speedup vs C at width=1920 on Apple M1 (checkasm --bench):
| input | rgb565le | bgr565le | rgb555le | bgr555le |
|----------|----------|----------|----------|----------|
| yuv420p | 3.69x | 3.68x | 3.28x | 3.31x |
| yuv422p | 4.70x | 4.70x | 4.32x | 4.35x |
| yuva420p | 3.67x | 3.66x | 3.32x | 3.27x |
NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are the
same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
run at the same ~50.5 cycles.
This only handles the little endian forms of the 16 bit RGB formats.
Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
checkasm regression (7657/7657) on Apple M1.
Signed-off-by: DROOdotFOO <drew@axol.io>
The previous chroma stride formula (width >> log2_chroma_w) is correct
for planar yuv but wrong for semi-planar nv12/nv21, where the UV plane
is interleaved at width bytes per row (width/2 UV pairs of 2 bytes
each). Use av_image_get_linesize() so the test feeds a valid stride to
libswscale regardless of input format; for the existing planar suites
the value is unchanged.
With the stride fixed, add nv12 and nv21 to check_yuv2rgb() so the
upcoming NEON 16bpp paths get bench coverage. ff_get_unscaled_swscale
does not wire a C yuv2rgb fast path for these inputs, so the suites
report bench-only (no correctness reference); they still run clobber
detection and cycle counts.
Signed-off-by: DROOdotFOO <drew@axol.io>
libcelt, which it depends on, was not updated in a very long time and is
considered deprecated, as Opus exists which has a CELT mode. Therefore
remove standalone CELT decoding support.
It was already broken since b8604a9761,
11 years ago, and no one noticed and complained.
celt_header() reads a uint32 `extra_headers` field from the CELT identification
header and stores `1 + extra_headers` into the signed int extra_headers_left.
With extra_headers = 0x7FFFFFFE this becomes INT_MAX and the OGG parser
consumes every subsequent page as a CELT "extra header" without ever reaching
audio data, hanging on any streaming input. A value of 0xFFFFFFFE wraps the
signed addition negative, with the same family of consequences.
Reject any extra_headers count above a small fixed cap (16, well above any
real CELT-over-Ogg stream).
Verified with the audit PoC (a crafted file plus an infinite-page FIFO):
without the patch, ffmpeg consumes pages forever; with the patch it logs
"Too many CELT extra headers (...)" and exits in ~70 ms with
AVERROR_INVALIDDATA.
Reported by Franciszek Kalinowski (isec.pl / striga.ai) and Bartosz Smigielski.
When extradata_size == 0, ff_rtp_send_aac() does `size -= 7` to skip the
ADTS header without checking size >= 7. A short packet makes size negative,
and the value is later passed to memcpy() as size_t, reading past the buffer
end. Bail out instead.
The vulnerable branch is not reached when using the built-in AAC encoder
(which always emits extradata), but an application that feeds raw
ADTS-stripped AAC packets through the libavformat RTP muxer can hit it. The
fix is a one-line lower-bound check and compiles/runs cleanly; see audit
PoC for the static analysis and reachable-by-API write-up.
Reported by Franciszek Kalinowski (isec.pl / striga.ai) and Bartosz Smigielski.
The code made the fundamental assumption that over-read into the padding
bytes is okay to do; because the most that can happen is that those pixel
values end up corrupted, which doesn't affect any adjacent pixels.
However, this is not true for SWS_OP_FILTER_H, because this operation
fundamentally mixes together horizontal pixels. Normally, this was fine,
because the filter weights for those pixels are set to 0, and 0 * x = 0.
However, that is not true for floating point inputs, which can contain
Infinity; and 0 * Infinity = NaN, thus corrupting the entire pixel.
Solve it by specifically preventing over-read when it would be unsafe.
Signed-off-by: Niklas Haas <git@haasn.dev>
And stop writing 7.1 as 7.1(wide) (channel conf 7). Lets not create any more
non-spec compliant files that the native decoder needs to work around with now
that we can use PCE configuration for it, getting rid of the ambiguity.
Signed-off-by: James Almer <jamrial@gmail.com>
Many of the entries were downright wrong, like mistagging LFE elements as
SCE, as well as trying to match the native channel ordering in the PCE
by placing CPE elements before SCE ones in some cases (like with FRONT
elements), which is not spec compliant and results in unparseable streams.
Remove the three layouts that define top channels. It's not clear how they
should be signaled in PCE.
Signed-off-by: James Almer <jamrial@gmail.com>