Four rows of four bytes fit into one xmm register; therefore
one can arrange the rows as follows (A,B,C: first, second, third etc.
row)
xmm0: ABABABAB BCBCBCBC
xmm1: CDCDCDCD DEDEDEDE
xmm2: EFEFEFEF FGFGFGFG
xmm3: GHGHGHGH HIHIHIHI
and use four pmaddubsw to calculate two rows in parallel. The history
fits into four registers, making this possible even on 32bit systems.
Old benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x)
New benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x)
vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x)
Old benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x)
vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x)
New benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x)
vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x)
Loading the constants into registers did not turn out to be advantageous
here (not to mention Win64, where this would necessitate saving
and restoring ever more register); probably because there are only two
loop iterations.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Doubling the register width allows to use only one pshufb and pmaddubsw.
Old benchmarks:
vp8_put_epel4_h4_c: 82.8 ( 1.00x)
vp8_put_epel4_h4_ssse3: 13.9 ( 5.96x)
New benchmarks:
vp8_put_epel4_h4_c: 82.7 ( 1.00x)
vp8_put_epel4_h4_ssse3: 11.7 ( 7.08x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Switching to xmm registers allows to process two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).
Old benchmarks:
vp8_put_epel4_v4_c: 96.8 ( 1.00x)
vp8_put_epel4_v4_ssse3: 28.2 ( 3.43x)
New benchmarks:
vp8_put_epel4_v4_c: 95.1 ( 1.00x)
vp8_put_epel4_v4_ssse3: 22.8 ( 4.17x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Switching to xmm registers allows to process two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).
Old benchmarks:
vp8_put_epel4_v6_c: 132.8 ( 1.00x)
vp8_put_epel4_v6_ssse3: 34.3 ( 3.87x)
New benchmarks:
vp8_put_epel4_v6_c: 131.5 ( 1.00x)
vp8_put_epel4_v6_ssse3: 27.1 ( 4.86x)
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is a register available. No change in benchmarks here.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Use GPRs on x64 and xmm registers else (using GPRs reduces codesize).
This avoids clobbering the floating point state and therefore no longer
breaks the ABI.
No change in benchmarks here.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Currently several inline ASM blocks used a value as
an input and rax as clobber register. The input value
was just moved into the register which then served as loop
counter. This is wasteful, as one can just use the value's
register directly as loop counter.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Unquantizing involves calculating
(block[j] * qscale * quant_matrix[j]) / 16
where / rounds towards zero. Arithmetic right shifts
naturally round towards -inf, so the earlier code
calculated the absolute value first, then used a right-shift
and then negated the result if necessary.
This commit uses a different procedure: It biases the product
for negative values of block[j] by 0xf. The combination of
this and the arithmetic right shift is the same as rounding
towards zero.
Furthermore, a write-only store to mm7 has been removed.
Benchmarks:
dct_unquantize_mpeg2_intra_c: 214.3 ( 1.00x)
dct_unquantize_mpeg2_intra_mmx (old): 43.0 ( 4.98x)
dct_unquantize_mpeg2_intra_mmx (new): 28.4 ( 7.56x)
(The bitexact flag and the test for correctness have beem removed
from checkasm for the benchmarks.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The H.263 unquantize functions modified an input parameter.
(And they did so since this code was added in
7f3f5ec87b. I am surprised
that this didn't cause issues, particularly with the intra function.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
To do so, simply add these init files to X86ASM-OBJS instead of OBJS
in the Makefile. The former is already used for the actual assembly
files, but using them for the C init files just works, because the build
system uses file extensions to derive whether it is a C or a NASM file.
This avoids compiling unused function stubs and also reduces our
reliance on DCE: We don't add %if checks to the asm files except
for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4
functions will be available. It also allows to remove HAVE_X86ASM checks
in these init files.
Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_h264_luma_dc_dequant_idct_sse2() does not pass checkasm for certain
seeds, because the input to packssdw no longer fits into an int16_t,
leading to saturation, where the C code just truncates. I don't know
whether the spec contains provisions that ensure that valid input
must not exceed 16 bit or whether the such inputs (even if invalid)
can be triggered by the actual code and not only the test.
This commit adapts the behavior of the function to the C reference code
to fix the test. packssdw is avoided, instead the lower words are
directly transfered to GPRs to be written out. This has unfortunately
led to a slight performance regression here (14.5 vs 15.1 cycles).
Fixes issue #20835.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
pw_1 is currently loaded in both codepaths. Generate it earlier instead.
Gives tiny speedups (15 vs 14.5 cycles) and reduces codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is ABI compliant and gives a tiny speedup here (and is 16B smaller).
Old benchmarks:
h264_luma_dc_dequant_idct_8_c: 33.2 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2: 16.0 ( 2.07x)
New benchmarks:
h264_luma_dc_dequant_idct_8_c: 33.0 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2: 15.0 ( 2.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only exported (i.e. cglobal) functions need it; stride is already
sign-extended when it reaches any of the internal functions used here,
so don't sign-extend again.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Store the intermediate values as words, clipped to the 0..255 range
instead.
Old benchmarks:
filter_diag4_c: 353.4 ( 1.00x)
filter_diag4_sse2: 57.5 ( 6.15x)
New benchmarks:
filter_diag4_c: 350.6 ( 1.00x)
filter_diag4_sse2: 55.1 ( 6.36x)
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only the two middle coefficients are so huge that overflow can happen.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For most systems (particularly all x64), the stack is already
guaranteed to be sufficiently aligned. So just use x86inc's
stack feature which does the right thing.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They don't have anything in common since
160ebe0a8d.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The internal functions for calculating the hadamard difference
of two 8x8 blocks have no epilogue on UNIX64, so one can avoid
the call altogether by placing the 8x8 function so that it directly
falls into the internal function.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The SSE2 and SSSE3 functions are now available everywhere,
making the MMXEXT functions irrelevant.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>