This commit deduplicates the wrappers around the fpel functions
for copying whole blocks (i.e. height equaling width). It does
this in a manner which avoids having to push/pop function arguments
when the calling convention forces one to pass them on the stack
(as in 32bit systems).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is not based on the MMXEXT one, because the latter is quite
suboptimal: Motion vector types mc01 and mc03 (vertical motion vectors
with a remainder of one quarter or three quarters) use different neighboring
lines for interpolation: mc01 uses two lines above and two lines below,
mc03 one line above and three lines below. The MMXEXT code uses
a common macro for all of them and therefore reads six lines
before it processes them (even reading lines which are not used
at all), leading to severe register pressure.
Another difference from the old code is that the positive and negative
parts of the sum are accumulated separately and
the subtraction is performed with unsigned saturation, so
that one can avoid biasing the sum.
The fact that the mc01 and mc03 filter coefficients are mirrors
of each other has been exploited to reduce mc01 to mc03.
But of course the most important difference between
this code and the MMXEXT one is that XMM registers allow processing
eight words at a time, ideal for 8x8 subblocks,
whereas the MMXEXT code processes them in 4x8 or 4x16 blocks.
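For illustration, a rough scalar C model of this accumulation scheme
(the actual code does the same on eight 16-bit words per XMM register,
using psubusw for the saturating subtraction; the helper name is made
up and the tap set (-1, -2, 96, 42, -7)/128 is the one quoted for this
filter further down in this series):

    /* s0..s4 are the five neighboring source samples (0..255). */
    static inline uint8_t cavs_v_filter_pixel(int s0, int s1, int s2,
                                              int s3, int s4)
    {
        unsigned pos = 96 * s2 + 42 * s3 + 64;    /* positive taps + rounding */
        unsigned neg = s0 + 2 * s1 + 7 * s4;      /* negative taps, kept positive */
        unsigned sum = pos > neg ? pos - neg : 0; /* unsigned saturating subtract */
        sum >>= 7;                                /* fits 16 bit, no bias needed */
        return sum > 255 ? 255 : sum;             /* final pack/clip to 8 bit */
    }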
Benchmarks:
avg_cavs_qpel_pixels_tab[0][4]_c: 917.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][4]_mmxext: 222.0 ( 4.13x)
avg_cavs_qpel_pixels_tab[0][4]_sse2: 89.0 (10.31x)
avg_cavs_qpel_pixels_tab[0][12]_c: 885.7 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][12]_mmxext: 223.2 ( 3.97x)
avg_cavs_qpel_pixels_tab[0][12]_sse2: 88.5 (10.01x)
avg_cavs_qpel_pixels_tab[1][4]_c: 222.4 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][4]_mmxext: 57.2 ( 3.89x)
avg_cavs_qpel_pixels_tab[1][4]_sse2: 23.3 ( 9.55x)
avg_cavs_qpel_pixels_tab[1][12]_c: 216.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][12]_mmxext: 57.4 ( 3.76x)
avg_cavs_qpel_pixels_tab[1][12]_sse2: 22.6 ( 9.56x)
put_cavs_qpel_pixels_tab[0][4]_c: 750.9 ( 1.00x)
put_cavs_qpel_pixels_tab[0][4]_mmxext: 210.4 ( 3.57x)
put_cavs_qpel_pixels_tab[0][4]_sse2: 84.2 ( 8.92x)
put_cavs_qpel_pixels_tab[0][12]_c: 731.6 ( 1.00x)
put_cavs_qpel_pixels_tab[0][12]_mmxext: 210.7 ( 3.47x)
put_cavs_qpel_pixels_tab[0][12]_sse2: 84.1 ( 8.70x)
put_cavs_qpel_pixels_tab[1][4]_c: 191.7 ( 1.00x)
put_cavs_qpel_pixels_tab[1][4]_mmxext: 53.8 ( 3.56x)
put_cavs_qpel_pixels_tab[1][4]_sse2: 24.5 ( 7.83x)
put_cavs_qpel_pixels_tab[1][12]_c: 179.1 ( 1.00x)
put_cavs_qpel_pixels_tab[1][12]_mmxext: 53.9 ( 3.32x)
put_cavs_qpel_pixels_tab[1][12]_sse2: 24.0 ( 7.47x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Basically a direct port of the MMXEXT one. The main difference
is of course that one can process eight pixels (unpacked to words)
at a time, leading to speedups.
avg_cavs_qpel_pixels_tab[0][2]_c: 700.1 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][2]_mmxext: 158.1 ( 4.43x)
avg_cavs_qpel_pixels_tab[0][2]_sse2: 86.0 ( 8.14x)
avg_cavs_qpel_pixels_tab[1][2]_c: 171.9 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][2]_mmxext: 39.4 ( 4.36x)
avg_cavs_qpel_pixels_tab[1][2]_sse2: 21.7 ( 7.92x)
put_cavs_qpel_pixels_tab[0][2]_c: 525.7 ( 1.00x)
put_cavs_qpel_pixels_tab[0][2]_mmxext: 148.5 ( 3.54x)
put_cavs_qpel_pixels_tab[0][2]_sse2: 75.2 ( 6.99x)
put_cavs_qpel_pixels_tab[1][2]_c: 129.5 ( 1.00x)
put_cavs_qpel_pixels_tab[1][2]_mmxext: 36.7 ( 3.53x)
put_cavs_qpel_pixels_tab[1][2]_sse2: 19.0 ( 6.81x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The prediction involves terms of the form
(-1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64) >> 7,
where the s values are in the range of 0..255.
The sum can have values in the range -2550..35190, which
does not fit into a signed 16bit integer. The code uses
an arithmetic right shift, which does not yield the correct
result for values >= 2^15; such values should be clipped
to 255, yet are clipped to 0 instead.
Fix this by biasing the values by 4096 so that the range
becomes nonnegative, then use a logical right shift and subtract 32.
bunny.mp4 from the FATE suite can be used to reproduce the problem.
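A scalar sketch of the fixed computation (the real code operates on
vectors of words; the helper name is illustrative only):

    /* s0..s4 are 0..255.  Biasing by 4096 keeps the sum nonnegative and
     * below 2^16, so a logical right shift is safe; the bias contributes
     * 4096 >> 7 == 32, which is subtracted afterwards. */
    static inline uint8_t predict_pixel(int s0, int s1, int s2, int s3, int s4)
    {
        int sum = -1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64 + 4096;
        int val = (sum >> 7) - 32;   /* sum >= 0 here, so the shift is logical */
        return val < 0 ? 0 : val > 255 ? 255 : val;  /* clip to 0..255 */
    }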
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The implementation hardcodes access to 3 channels, so we need to check that
at least 3 channels are actually present.
Fixes: out of array access
Fixes: BIGSLEEP-445394503-crash.exr
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Without rounding them up there are too few dc coeffs for the blocks.
We do not know if this way of handling odd dimensions is correct, as we have
no such DWA sample; we therefore ask the user for a sample if they
encounter such a file.
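A sketch of the rounding presumably meant here (assuming the 8x8 DCT
blocks that DWA uses; the variable names are illustrative, not the
decoder's actual ones):

    /* One dc coefficient per 8x8 block; a partial block at a dimension
     * that is not a multiple of 8 still needs a full block's worth. */
    int blocks_w = (width  + 7) / 8;
    int blocks_h = (height + 7) / 8;
    int dc_count = blocks_w * blocks_h;   /* per channel */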
Fixes: out of array access
Fixes: BIGSLEEP-445392027-crash.exr
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: out of array read
Fixes: dwa_uncompress.py.crash.exr
The code reads from the ac data even if ac_size is 0, so that case
is not implemented; instead we ask for a sample and error out cleanly.
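Schematically, the kind of early-out added (ac_size and the context
pointer are placeholders for the decoder's actual variables;
avpriv_request_sample() and AVERROR_PATCHWELCOME are the usual idioms
for this):

    if (!ac_size) {
        /* The unpacking below reads from the ac data unconditionally,
         * so reject this case cleanly instead of reading out of array. */
        avpriv_request_sample(s->avctx, "DWA data with ac_size == 0");
        return AVERROR_PATCHWELCOME;
    }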
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
1. Remove the OP parameter from the QPEL_H264* macros. These are
remnants of inline assembly and were forgotten in
610e00b359.
2. Pass the instruction set extension for the shift5 function
explicitly in the macro instead of using magic #defines.
3. Likewise, avoid magic #defines for (8|16)_v_lowpass_ssse3.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Every caller calls it three times in a loop, with slightly
modified arguments. So it makes sense to move the loop
into the callee.
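Schematically (foo() and its arguments are hypothetical; only the shape
of the change matters):

    /* Before: every caller repeats the loop. */
    for (int i = 0; i < 3; i++)
        foo(dst + i * step, src + i * step, stride);

    /* After: one call; the callee iterates internally. */
    foo(dst, src, stride);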
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Blocksize 2 is Snow-only, so move all the code pertaining
to it to snow.c. Also make the put array in H264QpelContext
smaller -- it only needs three sets of 16 function pointers.
This continues 6eb8bc4217
and b0c91c2fba.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
None of the other registers need to be preserved at this time,
so six XMM registers are always enough. Forgotten in
fa9ea5113b.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids having to sign-extend the strides in the assembly
(it is also more correct given that qpel_mc_func
already uses ptrdiff_t).
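For context, qpel_mc_func is declared with a ptrdiff_t stride (sketch
from memory), so the helpers can simply follow suit:

    typedef void (*qpel_mc_func)(uint8_t *dst, const uint8_t *src,
                                 ptrdiff_t stride);

    /* An 'int stride' argument arrives as a 32-bit value on x86-64 and
     * would need sign extension before it can be used for addressing;
     * a ptrdiff_t stride arrives full-width. */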
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The horizontal 10bit MC SSE2 functions are currently duplicated:
They exist both in ordinary form as well as with a "sse2_cache64"
suffix. A comment in ff_h264qpel_init_x86() indicates that this
is due to older processors not liking accesses that cross cache
lines, yet these functions are identical to the non-cache64
functions (apart from the unavoidable changes in the rip-offset).
The only difference between these functions and the ordinary ones
is that the cache64 ones are created via a special form of the
INIT_XMM macro: "INIT_XMM sse2, cache64". This affects the name
and apparently defines cpuflags_cache64, yet nothing checks for
this, so both versions are identical. So remove the cache64 ones
and treat the remaining ones like ordinary SSE2 functions.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3()
currently is almost the disjoint union of the codepaths
for sizes 8 and 16. This size is a compile-time constant
at every callsite. So split the function and avoid
the runtime branch.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
tmpstride is unused. This also allows removing said parameter
from lots of functions in h264_qpel.c.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They are constant since the size 16 version is no longer emulated
via the size 8 version.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is only used by h264_qpel.c, and only with a height of four
(which is unrolled), yet it uses a loop in order to handle
any multiple of four as height. Remove the loop and the height
parameter and move the function to h264_qpel_8bit.asm.
This leads to a bit of code duplication, but this is simpler
than all the %if checks necessary to achieve the same outcome
in fpel.asm.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The repetition count is always one since
2cf9e733c6.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
avg_h264_qpel only supports 16x16, 8x8 and 4x4 blocksizes,
so it is currently unnecessarily large.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The 2x2 put functions are only used by Snow and Snow uses
only the eight bit versions. The rest is dead code. Disabling
it saved 41277B here.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It only needs it for some x86 fpel functions; instead
add a direct dependency for that.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Affected standalone builds of the VC-1 parser.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This can be easily achieved by moving code only used by the MPEG-4
decoder behind #if CONFIG_MPEG4_DECODER.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_avg_pixels{4,8,16}_l2_mmxext() are always called with height
equal to their blocksize. And ff_{put,avg}_pixels4_l2_mmxext()
are furthermore always called with both strides being equal.
So remove these redundant function parameters.
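Schematically (prototypes simplified, not the exact upstream ones):

    /* Before: h is passed although every caller uses h == blocksize;
     * the 4x4 variants additionally always get two equal strides. */
    void ff_avg_pixels8_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int dstStride,
                                  int src1Stride, int h);
    /* After: the redundant parameters are gone. */
    void ff_avg_pixels8_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int dstStride,
                                  int src1Stride);
    void ff_put_pixels4_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int stride);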
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is more correct given that qpel_mc_func already uses ptrdiff_t;
it also allows avoiding movsxdifnidn.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The ff_avg_pixels{4,8,16}_l2_mmxext() functions are only ever
used in the last step (the one that actually writes to the dst buffer)
where the number of lines to process is always equal to the
dimensions of the block, whereas ff_put_pixels{8,16}_mmxext()
are also used in intermediate calculations where the number of
lines can be 9 or 17.
The code in qpel.asm uses common macros for both and processes
more than one line per loop iteration; it therefore checks
whether the number of lines is odd and treats such a line separately.
Yet this special handling is only needed for the put functions,
not the avg functions, so it has been %if'ed away for the latter.
The check is also not needed for ff_put_pixels4_l2_mmxext(), which
is only used by H.264, where the number of lines is always four. Because
ff_{avg,put}_pixels4_l2_mmxext() process four lines in a single loop
iteration, not only the odd-height handling but the whole loop
could be removed.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
liblc3 supports arbitrary strides, so one can simply use a stride
of zero to make it read the same zero value again and again.
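A generic illustration (not the liblc3 API itself; frame_samples is a
placeholder): with a stride of zero an indexed read never advances, so
a single zero sample stands in for a whole frame of silence:

    static const int16_t silence = 0;   /* one zero sample */
    const int16_t *pcm = &silence;
    const int stride = 0;

    for (int i = 0; i < frame_samples; i++) {
        int16_t s = pcm[i * stride];    /* always reads pcm[0] == 0 */
        /* ... sample would be handed to the encoder here ... */
    }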
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The code optimizes throughput by letting the encoder work on frame N
until frame N+1 is ready for submission, but this hurts low-delay uses
by delaying output by one frame. Don't delay output beyond what is
necessary when AV_CODEC_FLAG_LOW_DELAY is used.
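A sketch of the intended behaviour (AV_CODEC_FLAG_LOW_DELAY is the real
flag; drain_packet() and the in_flight counter are placeholders for the
encoder's actual state):

    /* Called after a frame has been submitted to the encoder. */
    if (avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
        /* Low delay: drain the just-submitted frame's output right away. */
        return drain_packet(ctx, pkt);
    }
    /* Throughput mode: keep one frame in flight and only drain once the
     * next frame has been queued. */
    if (ctx->in_flight > 1)
        return drain_packet(ctx, pkt);
    return AVERROR(EAGAIN);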
Signed-off-by: Cameron Gutman <aicommander@gmail.com>