(It would even be possible to avoid a clobbering m10 in
MASKED_COPY and the mask register (%3) in MASKED_COPY2
when VEX encoding is in use.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The result of using pmaddwd with the coefficients 1,-1,...,1,-1
is just the negative of using pmaddwd with the coefficients
-1,1,...,-1,1, so avoid one pmaddwd.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoid incrementing lowq and highq inside the loop by using
complex addressing modes, avoiding to undo said modification
at the end of the horizontal loop.
For inputq, modify istrideq outside of the loop so that
it is only modified once at the end of the horizontal loop.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Sign extend the integer arguments directly from the stack
instead of loading qwords, followed by sign-extending the
lower half.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Up until now, cfhdencdsp used constants consisting
of -1, 1, ...,-1,1 words and 1, -1,...,1,-1 words
for use as constants in pmaddwd. But one can use
the same constants if one shuffles the words in
a dword the opposite order. Similarly for some other
constants. This also allowed to avoid a register in
chfdenc_vert_filter.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These names are a remnant of dsputil when all the DSP functions
from all codecs were part of DSPcontext.
Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Now that the SSE2/AVX functions are no longer restricted
to those systems having an aligned stack, the MMXEXT functions
are always overridden (except for ancient systems without
SSE2), so remove them.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Allocate it via cglobal as usual. This makes the SSE2/AVX functions
available when HAVE_ALIGNED_STACK is false; it also avoids
modifying rsp unnecessarily in the deblock_h_luma_intra_10 functions
on Win64.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They are a remnant of the MMX functions (which processed
only eight pixels at a time, so that it was called twice
via a wrapper; the actual MMX function had "v8" in its name
instead of simply v) which have been removed in commit
4618f36a24.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Forgotten in 4618f36a24.
Also remove a PASS8ROWS wrapper that seems to have been always
unused.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For AVX2, movdqu is as fast as movdqa when used on aligned addresses,
so don't instantiate aligned/unaligned versions.
(The check was btw overtly strict: The AVX2 code only uses 16 byte
stores, so it would be enough for dst to be 16-byte aligned.)
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These functions are only used on Conroe (they are overwritten
by SSSE3 functions using xmm registers if the SSSE3SLOW is not set)
which is very old (introduced in 2006), so remove them.
Btw: The checkasm test (which uses declare_func and not
declare_func_emms since cd8a33bcce)
would fail on a Conroe, yet no one ever reported any such failure.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
sub_median_pred_mmxext() calculates a predictor from the left, top
and topleft pixel values. The topleft values need to be initialized
differently for the first loop initialization than for the others
in order to avoid reading ptr[-1]. So it has been initialized before
the loop and then read again at the end of the loop, so that the last
value read was never used. Yet this can lead to reads beyond the end
of the buffer, e.g. with
ffmpeg -cpuflags mmx+mmxext -f lavfi -i "color=size=64x4,format=yuv420p" \
-vf vflip -c:v ffvhuff -pred median -frames 1 -f null -
Fix this by not reading the value at the end of the loop.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
sub_median_pred_mmxext() calculates a predictor from the left, top
and topleft pixel values. The left value is simply read via
ptr[-1], although this is not guaranteed to be inside the buffer
in case of negative strides. This happens e.g. with
ffmpeg -i fate-suite/mpeg2/dvd_single_frame.vob -vf vflip \
-c:v magicyuv -pred median -f null -
Fix this by reading the first value like the topleft value.
Also change the documentation of sub_median_pred to reflect this
change (and the one from 791b5954bc).
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Four rows of four bytes fit into one xmm register; therefore
one can arrange the rows as follows (A,B,C: first, second, third etc.
row)
xmm0: ABABABAB BCBCBCBC
xmm1: CDCDCDCD DEDEDEDE
xmm2: EFEFEFEF FGFGFGFG
xmm3: GHGHGHGH HIHIHIHI
and use four pmaddubsw to calculate two rows in parallel. The history
fits into four registers, making this possible even on 32bit systems.
Old benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x)
New benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x)
Old benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x)
vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x)
New benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x)
vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x)
Loading the constants into registers did not turn out to be advantageous
here (not to mention Win64, where this would necessitate saving
and restoring ever more register); probably because there are only two
loop iterations.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Doubling the register width allows to use only one pshufb and pmaddubsw.
Old benchmarks:
vp8_put_epel4_h4_c: 82.8 ( 1.00x)
vp8_put_epel4_h4_ssse3: 13.9 ( 5.96x)
New benchmarks:
vp8_put_epel4_h4_c: 82.7 ( 1.00x)
vp8_put_epel4_h4_ssse3: 11.7 ( 7.08x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Switching to xmm registers allows to process two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).
Old benchmarks:
vp8_put_epel4_v4_c: 96.8 ( 1.00x)
vp8_put_epel4_v4_ssse3: 28.2 ( 3.43x)
New benchmarks:
vp8_put_epel4_v4_c: 95.1 ( 1.00x)
vp8_put_epel4_v4_ssse3: 22.8 ( 4.17x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>