x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from ff_sbr_qmf_deint_bfly_sse are truely ancient 32bit x86s
it is removed.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* qatar/master:
Consistently use "cpu_flags" as variable/parameter name for CPU flags
Conflicts:
libavcodec/x86/dsputil_init.c
libavcodec/x86/h264dsp_init.c
libavcodec/x86/hpeldsp_init.c
libavcodec/x86/motion_est.c
libavcodec/x86/mpegvideo.c
libavcodec/x86/proresdsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Sandybridge: 47 cycles
Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
233 to 105 cycles on Arrandale and Win64.
Replacing the multiplication by s_m[m] by a pand and a pxor with
appropriate vectors is slower. Unrolling is a 15 cycles win.
A SSE version was 4 cycles slower.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
From 312 to 89/68 (sse/sse2) cycles on Arrandale and Win64.
Sandybridge: 68/47 cycles.
Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Timing on Arrandale:
C SSE
Win32: 57 44
Win64: 47 38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Start and end index are multiple of 2, therefore guaranteeing aligned access.
Also, this allows to generate 4 floats per loop, keeping the alignment all
along.
Timing:
- 32 bits: 326c -> 172c
- 64 bits: 323c -> 156c
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.
Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.
Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C /32bits: 82c (unrolled)/102c
C /64bits: 69c (unrolled)/82c
SSE/32bits: 42c
SSE/64bits: 31c
Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>