To do so, simply add these init files to X86ASM-OBJS instead of OBJS
in the Makefile. The former is already used for the actual assembly
files, but using it for the C init files just works, because the build
system uses the file extension to decide whether a file is C or NASM.
This avoids compiling unused function stubs and also reduces our
reliance on DCE: We don't add %if checks to the asm files except
for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4
functions will be available. It also allows removing the HAVE_X86ASM
checks in these init files.
Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The approach of this ASM routine is to process two channels at a time using
AVX instructions. There is no point in doing this if there is only
a single channel, in which case the scalar loop is the better choice.
Fixes a performance regression when filtering mono audio on certain CPUs,
notably the Intel N100.
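The channel-count check described above can be sketched roughly as follows.
This is an illustration only: filter_fn, filter_scalar, filter_pairwise and
select_filter are hypothetical names, the gain operation is a placeholder for
the real filter, and filter_pairwise is plain C standing in for the AVX routine.

```c
#include <stddef.h>

/* Hypothetical signature: interleaved float samples, filtered in place. */
typedef void (*filter_fn)(float *samples, size_t nb_samples,
                          int channels, float gain);

/* Scalar loop: one sample at a time, any channel count. */
static void filter_scalar(float *samples, size_t nb_samples,
                          int channels, float gain)
{
    for (size_t i = 0; i < nb_samples * (size_t)channels; i++)
        samples[i] *= gain;
}

/* Stand-in for the SIMD routine: walks the channels in pairs and
 * handles a leftover odd channel on its own. */
static void filter_pairwise(float *samples, size_t nb_samples,
                            int channels, float gain)
{
    for (size_t n = 0; n < nb_samples; n++) {
        float *s = samples + n * (size_t)channels;
        int ch = 0;
        for (; ch + 1 < channels; ch += 2) {
            s[ch]     *= gain;
            s[ch + 1] *= gain;
        }
        if (ch < channels)
            s[ch] *= gain;
    }
}

/* Only pick the pairwise routine when there is a pair to process;
 * mono input keeps the scalar loop. */
static filter_fn select_filter(int channels, int have_avx)
{
    return (have_avx && channels >= 2) ? filter_pairwise : filter_scalar;
}
```

With this selection, mono input never reaches the two-channel routine, even on
CPUs that report AVX support.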
It gains a lot because it has to operate on eight words;
it also saves 608B of .text here.
Old benchmarks:
column_fidct_c: 3365.7 ( 1.00x)
column_fidct_mmx: 1784.6 ( 1.89x)
New benchmarks:
column_fidct_c: 3361.5 ( 1.00x)
column_fidct_sse2: 801.1 ( 4.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids some shift instructions and also gives us more headroom
in the registers. In fact, I have proven to myself that everything
that is supposed to fit into 16 bits now actually does so.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It currently is not, because the shortcut mode uses different rounding
than the C code (as well as the non-shortcut code).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This fixes an ABI violation, as mul_thrmat did not issue emms.
It seems that this ABI violation could reach the user, namely
if ff_get_video_buffer() fails. Notice that ff_get_video_buffer()
itself could fail because of this, namely if the allocator uses
floating point registers.
On x64 (where GCC already used SSE2 in the C version):
mul_thrmat_c: 4.4 ( 1.00x)
mul_thrmat_mmx: 8.6 ( 0.52x)
mul_thrmat_sse2: 4.4 ( 1.00x)
On 32-bit (where SSE2 is not known to be available):
mul_thrmat_c: 56.0 ( 1.00x)
mul_thrmat_sse2: 6.0 ( 9.40x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is in preparation for adding checkasm tests; without it,
checkasm would pull all of libavfilter in.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids having to fix up ABI violations via emms_c and
also leads to a 73% speedup for the line noise average version
here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The diff and var functions benefit from psadbw; comb benefits from wider
registers, which avoid reloading values and reduce the number
of loads from 48 to 10. Performance increased by 117% (the loop
in compute_metric() has been timed); codesize decreased by 144B.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This allows removing an emms_c from the filter. It also gives a
25% speedup here (when timing the calls to store_slice using
START/STOP_TIMER).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
SSSE3 is already quite old (introduced in 2006 by Intel, in 2011 by AMD),
so the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 version
of filter_line.
This commit therefore removes the overridden MMXEXT version
(which didn't abide by the ABI), which allows us to remove
an emms_c() from vf_gradfun.c, so that users with SSSE3 no longer
pay a price for the mere existence of an MMXEXT version.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This avoids pulling in the entire libavfilter when using the DSP functions
from checkasm.
The rest of the struct is not needed outside vf_idet.c and was moved there.
It is more in line with our naming conventions.
Reviewed-by: Martin Storsjö <martin@martin.st>
Reviewed-by: Niklas Haas <ffmpeg@haasn.dev>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This wrapping logic still considered any nonzero return from the ASM function
to be the overall result, but this is not true since the addition of
FF_ALPHA_TRANSPARENT.
Fix it by returning early only if FF_ALPHA_STRAIGHT is detected.
Fixes: 9b8b78a815
See-Also: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20301#issuecomment-4802
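A rough sketch of the corrected wrapping logic, under stated assumptions: only
the names FF_ALPHA_TRANSPARENT and FF_ALPHA_STRAIGHT come from this message.
Their numeric values, SKETCH_ALPHA_UNDETERMINED, detect_fn, detect_c,
detect_wrapped and the classification rule inside detect_c are hypothetical
placeholders, not the filter's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the sketch relies on STRAIGHT > TRANSPARENT > 0. */
enum {
    SKETCH_ALPHA_UNDETERMINED = 0, /* hypothetical name */
    FF_ALPHA_TRANSPARENT      = 1,
    FF_ALPHA_STRAIGHT         = 2,
};

typedef int (*detect_fn)(const uint8_t *color, const uint8_t *alpha, size_t len);

/* Placeholder C detector (assumed rule): a color sample above its alpha
 * counts as straight alpha, any non-opaque alpha counts as transparent. */
static int detect_c(const uint8_t *color, const uint8_t *alpha, size_t len)
{
    int ret = SKETCH_ALPHA_UNDETERMINED;
    for (size_t i = 0; i < len; i++) {
        if (color[i] > alpha[i])
            return FF_ALPHA_STRAIGHT;
        if (alpha[i] != 0xFF)
            ret = FF_ALPHA_TRANSPARENT;
    }
    return ret;
}

/* Run the block-based routine on the aligned head and the C code on the
 * tail. block must be nonzero. Only FF_ALPHA_STRAIGHT is final; a
 * FF_ALPHA_TRANSPARENT result from the head must not skip the tail. */
static int detect_wrapped(detect_fn simd, const uint8_t *color,
                          const uint8_t *alpha, size_t len, size_t block)
{
    size_t head = len - len % block;
    int ret = head ? simd(color, alpha, head) : SKETCH_ALPHA_UNDETERMINED;
    if (ret == FF_ALPHA_STRAIGHT)
        return ret;
    int tail = detect_c(color + head, alpha + head, len - head);
    return tail > ret ? tail : ret;
}
```

Returning on any nonzero head result would report FF_ALPHA_TRANSPARENT for a
line whose tail contains the only evidence of straight alpha.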
It can be useful to know whether the alpha plane consists entirely of opaque
pixels; if it does, it can e.g. safely be stripped.
This only requires a very minor modification to the AVX2 routines, adding
an extra AND on the read alpha value with the reference alpha value, and a
single extra cheap test per line.
detect_alpha_8_full_c: 2849.1 ( 1.00x)
detect_alpha_8_full_avx2: 260.3 (10.95x)
detect_alpha_8_full_avx512icl: 130.2 (21.87x)
detect_alpha_8_limited_c: 8349.2 ( 1.00x)
detect_alpha_8_limited_avx2: 756.6 (11.04x)
detect_alpha_8_limited_avx512icl: 364.2 (22.93x)
detect_alpha_16_full_c: 1652.8 ( 1.00x)
detect_alpha_16_full_avx2: 236.5 ( 6.99x)
detect_alpha_16_full_avx512icl: 134.6 (12.28x)
detect_alpha_16_limited_c: 5263.1 ( 1.00x)
detect_alpha_16_limited_avx2: 797.4 ( 6.60x)
detect_alpha_16_limited_avx512icl: 400.3 (13.15x)
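One way an AND-based opacity check with a single test per line can work is
sketched below in scalar C. This is not the AVX2 code itself: line_is_opaque
and plane_is_opaque are hypothetical names, and the sketch assumes that
alpha_max has the form 2^n - 1 and that no sample exceeds it.

```c
#include <stddef.h>
#include <stdint.h>

/* AND every alpha sample into an accumulator seeded with the reference
 * value. The accumulator still equals alpha_max at the end only if every
 * sample had all of those bits set, i.e. every sample was fully opaque. */
static int line_is_opaque(const uint8_t *alpha, size_t width, uint8_t alpha_max)
{
    uint8_t acc = alpha_max;
    for (size_t x = 0; x < width; x++)
        acc &= alpha[x];
    return acc == alpha_max; /* the single test per line */
}

/* Check a whole plane line by line, stopping at the first non-opaque line. */
static int plane_is_opaque(const uint8_t *alpha, ptrdiff_t stride,
                           size_t width, size_t height, uint8_t alpha_max)
{
    for (size_t y = 0; y < height; y++) {
        if (!line_is_opaque(alpha + (ptrdiff_t)y * stride, width, alpha_max))
            return 0;
    }
    return 1;
}
```

The per-sample cost is one extra AND; the comparison happens once per line
rather than once per pixel.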
I also tried replacing some of the instructions by more elaborate ones
using masks, but I found no performance gain significant enough to be worth
maintaining two code paths, so this implementation merely replaces the AVX2
implementation by drop-in AVX512 equivalents.
bwdif8_c: 6362.2 ( 1.00x)
bwdif8_sse2: 1004.9 ( 6.33x)
bwdif8_ssse3: 946.0 ( 6.73x)
bwdif8_avx2: 477.9 (13.31x)
bwdif8_avx512: 273.3 (23.28x)
bwdif10_c: 6341.5 ( 1.00x)
bwdif10_sse2: 872.4 ( 7.27x)
bwdif10_ssse3: 803.4 ( 7.89x)
bwdif10_avx2: 416.7 (15.22x)
bwdif10_avx512: 224.3 (28.27x)
Realtime test at 3840x2160 yuv420p:
avx2: frame=20000 fps=3370 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed=67.4x elapsed=0:00:05.93
avx512: frame=20000 fps=5077 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed= 102x elapsed=0:00:03.93
The use of this function is gated behind avx512icl so that it doesn't
downclock on Skylake.
For detect_range, the usage of vpbroadcast{b,w} requires the AVX512BW extension, and for
detect_alpha we don't want ZMM instructions downclocking old CPUs.
Signed-off-by: James Almer <jamrial@gmail.com>
Requested by a user. Even with autovectorization enabled, the compiler
does a rather poor job of optimizing this function, because it is not
able to take advantage of the pmaxub + pcmpeqb trick for counting the number
of pixels less than or equal to a threshold.
blackdetect8_c: 4625.0 ( 1.00x)
blackdetect8_avx2: 155.1 (29.83x)
blackdetect16_c: 2529.4 ( 1.00x)
blackdetect16_avx2: 163.6 (15.46x)
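For reference, the pmaxub + pcmpeqb trick mentioned above rests on the identity
max(x, t) == t exactly when x <= t, which sidesteps the lack of an unsigned
byte compare. The scalar model below applies it one byte lane at a time; the
helper names are hypothetical and this is a sketch, not the AVX2 routine.

```c
#include <stddef.h>
#include <stdint.h>

/* One byte lane of pmaxub: unsigned maximum. */
static uint8_t max_u8(uint8_t a, uint8_t b)
{
    return a > b ? a : b;
}

/* One byte lane of pcmpeqb: all ones if equal, zero otherwise. */
static uint8_t cmpeq_u8(uint8_t a, uint8_t b)
{
    return a == b ? 0xFF : 0x00;
}

/* Count the samples that are less than or equal to threshold. */
static unsigned count_le_threshold(const uint8_t *src, size_t len,
                                   uint8_t threshold)
{
    unsigned count = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t mask = cmpeq_u8(max_u8(src[i], threshold), threshold);
        count += mask & 1; /* each all-ones mask counts as one pixel */
    }
    return count;
}
```

In a vector version the masks would be accumulated for a whole register at
once; how the real routine sums them is not stated here.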
Since psadbw only exists for 8-bit inputs, we have to emulate it for 16-bit
inputs. The simplest sequence is to use a normal subtraction, which is safe
as long as the inputs do not exceed 32767 - so limit this implementation
to 15-bit inputs and below.
For 16-bit inputs, we could in theory instead use a pminw / pmaxw to ensure
the resulting difference does not overflow, but this is slower, and also
breaks the subsequent use of pmaddwd, so I opted to skip 16-bit SIMD for
now.
scene_sad10_c: 114175.6 ( 1.00x)
scene_sad10_avx2: 9617.7 (11.87x)
scene_sad10_avx512: 5208.8 (21.92x)
scene_sad12_c: 114537.8 ( 1.00x)
scene_sad12_avx2: 9614.0 (11.91x)
scene_sad12_avx512: 5186.3 (22.08x)
scene_sad14_c: 114113.9 ( 1.00x)
scene_sad14_avx2: 9612.9 (11.87x)
scene_sad14_avx512: 5186.0 (22.00x)
scene_sad15_c: 114108.9 ( 1.00x)
scene_sad15_avx2: 9612.3 (11.87x)
scene_sad15_avx512: 5186.4 (22.00x)
scene_sad16_c: 114136.0 ( 1.00x)
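The 15-bit limit can be illustrated with a scalar model of a 16-bit wrapping
subtraction followed by an absolute value. The function names are hypothetical,
and the absolute-value step is an assumption about how the sign is removed;
the sketch only shows why inputs above 32767 break the plain subtraction.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference SAD using arithmetic wide enough to never overflow. */
static uint64_t sad16_ref(const uint16_t *a, const uint16_t *b, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i] > b[i] ? (uint64_t)(a[i] - b[i]) : (uint64_t)(b[i] - a[i]);
    return sum;
}

/* SAD where each difference is first truncated to a signed 16-bit lane,
 * modelling a 16-bit wrapping subtraction. */
static uint64_t sad16_sub(const uint16_t *a, const uint16_t *b, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t w = (uint16_t)(a[i] - b[i]);            /* wraps modulo 2^16 */
        int d = w >= 0x8000 ? (int)w - 0x10000 : (int)w; /* signed view */
        sum += (uint64_t)(d < 0 ? -d : d);
    }
    return sum;
}
```

With inputs of at most 32767 every difference lies in [-32767, 32767], so both
functions agree. With full 16-bit inputs, 65535 - 0 wraps to -1 and contributes
1 instead of 65535.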
Trivial to add, but a lot faster (on my machine).
scene_sad8_c: 114476.4 ( 1.00x)
scene_sad8_sse2: 8644.3 (13.24x)
scene_sad8_avx2: 4520.1 (25.33x)
scene_sad8_avx512: 3153.0 (36.31x)
Processes two channels in parallel, using 128-bit XMM registers.
In theory, we could go up to YMM registers to process 4 channels, but this is
not a gain except for relatively high channel counts (e.g. 7.1), and also
complicates the sample load/store operations considerably.
I decided to only add an AVX variant, since the C code is not slow
enough to justify a separate function just for ancient CPUs.
The MMX requantize functions have the MMX permutation
(i.e. FF_IDCT_PERM_SIMPLE) hardcoded and therefore
check for the used permutation (namely via a CRC).
Yet this is very ugly and could even lead to misdetection;
furthermore, the permutation used here has been de facto impossible
on x64 since d7246ea9f2 and definitely impossible since
bfb28b5ce8, making this code dead on x64.
So remove it.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>