Commit graph

15 commits

Author SHA1 Message Date
Lynne
bbe95f7353
x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
James Almer
48615f0a78 x86/aacpsdsp: add ps_hybrid_analysis_fma3
This replace the sse3 version, which was not really faster than the sse one.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-22 13:27:43 -03:00
James Almer
2bcf86d53d x86/aacpsdsp: precompute constant factors
Inspired by the optimization done to the C version by Rémi Denis-Courmont.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-22 13:27:43 -03:00
Clément Bœsch
b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
James Almer
8bb59e6742 x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
About 2x faster than the c version.
2017-06-18 22:34:22 -03:00
James Almer
e229df9478 x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
About 2x faster than the c version.
2017-06-18 22:33:27 -03:00
James Almer
623d217ed1 avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-15 23:49:40 -03:00
James Almer
497a4b554c x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
2017-06-07 13:53:51 -03:00
James Almer
933dd62288 x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
~2% faster.
2017-06-04 23:29:56 -03:00
James Almer
be3809a521 x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
Move the unpacking outside of the loop. 5% to 10% faster.

Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-03 12:39:43 -03:00
James Almer
b5a0971ff0 x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
About 2x faster than the c version.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-02 11:06:24 -03:00
James Almer
ede4ec1f8f x86/aacpsdsp: optimize add_squares loop
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-14 12:41:23 -03:00
James Almer
82dbfccaf0 x86/aacdec: use HADDPS macro
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-08 14:18:18 -03:00
Henrik Gramner
f0b7882ceb x86inc: Drop SECTION_TEXT macro
The .text section is already 16-byte aligned by default on all supported
platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.
2015-08-04 20:13:09 +02:00
James Almer
9dcaae70f2 x86/aacpsdsp: add SSE and SSE3 optimized functions
Between 1.5 and 2.5 times faster

Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2015-07-30 19:01:15 -03:00