Lynne
bbe95f7353
x86: replace explicit REP_RETs with RETs
...
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011
(16 and 12 years ago, respectively).
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
James Almer
48615f0a78
x86/aacpsdsp: add ps_hybrid_analysis_fma3
...
This replaces the SSE3 version, which was not really faster than the SSE one.
Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-22 13:27:43 -03:00
James Almer
2bcf86d53d
x86/aacpsdsp: precompute constant factors
...
Inspired by the optimization done to the C version by Rémi Denis-Courmont.
Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-22 13:27:43 -03:00
Clément Bœsch
b12a36170b
lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis
2017-06-28 12:22:39 +02:00
James Almer
8bb59e6742
x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
...
About 2x faster than the C version.
2017-06-18 22:34:22 -03:00
James Almer
e229df9478
x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
...
About 2x faster than the C version.
2017-06-18 22:33:27 -03:00
James Almer
623d217ed1
avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
...
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-15 23:49:40 -03:00
James Almer
497a4b554c
x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
...
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
2017-06-07 13:53:51 -03:00
James Almer
933dd62288
x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
...
~2% faster.
2017-06-04 23:29:56 -03:00
James Almer
be3809a521
x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
...
Move the unpacking outside of the loop. 5% to 10% faster.
Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-03 12:39:43 -03:00
James Almer
b5a0971ff0
x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
...
About 2x faster than the C version.
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-02 11:06:24 -03:00
James Almer
ede4ec1f8f
x86/aacpsdsp: optimize add_squares loop
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-14 12:41:23 -03:00
James Almer
82dbfccaf0
x86/aacdec: use HADDPS macro
...
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-08 14:18:18 -03:00
Henrik Gramner
f0b7882ceb
x86inc: Drop SECTION_TEXT macro
...
The .text section is already 16-byte aligned by default on all supported
platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.
2015-08-04 20:13:09 +02:00
James Almer
9dcaae70f2
x86/aacpsdsp: add SSE and SSE3 optimized functions
...
Between 1.5 and 2.5 times faster
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2015-07-30 19:01:15 -03:00