ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-04-18 16:40:23 +00:00

Author	SHA1	Message	Date
Jun Zhao	91ae6d10ab	lavfi/nlmeans: add aarch64 neon for compute_weights_line Implement NEON optimization for compute_weights_line. Also update the function signature to use ptrdiff_t for stack arguments (max_meaningful_diff, startx, endx). This is done to unify the stack layout between Apple platforms (which pack 32-bit stack arguments tightly) and the generic AAPCS64 ABI (which requires 8-byte stack slots for 32-bit arguments). Using ptrdiff_t ensures 8-byte slots are used on all AArch64 platforms, avoiding ABI mismatches with the assembly implementation. The x86 AVX2 prototype is updated to match the new signature. Performance benchmark (AArch64) in MacOS M4: ./tests/checkasm/checkasm --test=vf_nlmeans --bench compute_weights_line_c: 151.1 ( 1.00x) compute_weights_line_neon: 62.6 ( 2.42x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-09 16:10:10 +00:00
Ruikai Peng	cc43670268	avfilter/x86/vf_noise: Use unaligned access Regression since: `3ba570de8b` (port from MMX to SSE2). The SSE2 inline asm in libavfilter/x86/vf_noise.c (line_noise_sse2 and line_noise_avg_sse2) uses aligned loads/stores (movdqa, movntdq) but never checks pointer alignment. When the filter reuses an input frame (common path when av_frame_is_writable() is true), it may receive misaligned data from upstream filters that adjust frame->data[i] in place, notably vf_crop: - vf_crop adjusts plane pointers by arbitrary byte offsets (frame->data[plane] += ...), so an x offset of 1 on 8-bit formats produces a 1‑byte misalignment. - The noise filter then calls the SSE2 path directly on those pointers without realigning or falling back. Repro on x86_64/SSE2 (current HEAD at that commit): ./ffmpeg -v error -f lavfi -i testsrc=s=320x240:rate=1 \ -vf "format=yuv420p,crop=w=319:x=1:h=240:exact=1,noise=alls=50" \ -frames:v 1 -f null - This crashes with SIGSEGV at the aligned load in line_noise_sse2 (movdqa (%r9,%rax),%xmm0; effective address misaligned by 1 byte). Impact: denial of service via crafted filtergraphs (e.g., crop + noise). Applies to planar 8-bit formats where upstream filters can shift data pointers without reallocating. Found-by: Pwno OSS Team	2025-12-12 19:25:21 +00:00
Andreas Rheinhardt	7356981bec	avfilter/x86/Makefile: Only compile ASM init files when X86ASM is enabled To do so, simply add these init files to X86ASM-OBJS instead of OBJS in the Makefile. The former is already used for the actual assembly files, but using them for the C init files just works, because the build system uses file extensions to derive whether it is a C or a NASM file. This avoids compiling unused function stubs and also reduces our reliance on DCE: We don't add %if checks to the asm files except for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4 functions will be available. It also allows to remove HAVE_X86ASM checks in these init files. Reviewed-by: Kacper Michajłow <kasper93@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 22:20:13 +01:00
Piotr Pawlowski	372dab2a4d	All: Removed reliance on compiler performing dead code elimination, changed various macro constant checks from if() to #if	2025-11-28 19:52:51 +01:00
Niklas Haas	f3346ca6f7	avfilter/x86/f_ebur128: only use filter_channels_avx for >= 2 channels The approach of this ASM routine is to process two channels at a time using AVX instructions. Obviously, there is no point in doing this if there is only a single channel; in which case the scalar loop would be better. Fixes a performance regression when filtering mono audio on certain CPUs, notably e.g. the Intel N100.	2025-11-25 22:13:57 +00:00
Andreas Rheinhardt	c0648b2004	avfilter/x86/vf_spp: Fix comment Forgotten in `dcb28ed860`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	06b0dae51b	avfilter/vf_fsppdsp: Constify Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	3cd452cbf1	avfilter/x86/vf_fspp: Avoid stack on x64 Possible due to the amount of registers. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	ddd74276f8	avfilter/x86/vf_fspp: Port ff_column_fidct_mmx() to SSE2 It gains a lot because it has to operate on eight words; it also saves 608B of .text here. Old benchmarks: column_fidct_c: 3365.7 ( 1.00x) column_fidct_mmx: 1784.6 ( 1.89x) New benchmarks: column_fidct_c: 3361.5 ( 1.00x) column_fidct_sse2: 801.1 ( 4.20x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	63493bf0e0	avfilter/x86/vf_fspp: Put shifts into constants This avoids some shift instructions and also gives us more headroom in the registers. In fact, I have proven to myself that everything that is supposed to fit into 16bits now actually does so. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	66af18d06a	avfilter/x86/vf_fspp: Make ff_column_fidct_mmx() bitexact It currently is not, because the shortcut mode uses different rounding than the C code (as well as the non-shortcut code). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	ff85a20b7d	avfilter/x86/vf_fspp: Port store_slice to SSE2 Old benchmarks: store_slice_c: 2798.3 ( 1.00x) store_slice_mmx: 950.2 ( 2.94x) store_slice2_c: 3811.7 ( 1.00x) store_slice2_mmx: 682.3 ( 5.59x) New benchmarks: store_slice_c: 2797.2 ( 1.00x) store_slice_sse2: 543.5 ( 5.15x) store_slice2_c: 3817.0 ( 1.00x) store_slice2_sse2: 408.2 ( 9.35x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	52ba2ac7bd	avfilter/x86/vf_fspp: Port mul_thrmat to SSE2 This fixes an ABI violation, as mul_thrmat did not issue emms. It seems that this ABI violation could reach the user, namely if ff_get_video_buffer() fails. Notice that ff_get_video_buffer() itself could fail because of this, namely if the allocator uses floating point registers. On x64 (where GCC already used SSE2 in the C version) mul_thrmat_c: 4.4 ( 1.00x) mul_thrmat_mmx: 8.6 ( 0.52x) mul_thrmat_sse2: 4.4 ( 1.00x) On 32bit (where SSE2 is not known to be available): mul_thrmat_c: 56.0 ( 1.00x) mul_thrmat_sse2: 6.0 ( 9.40x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	9f4d5d818d	avfilter/x86/vf_fspp: Don't duplicate dither table Reuse the one from vf_fsppdsp.c; also don't overalign said table too much. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	9b34088c4d	avfilter/vf_fspp: Add DSPCtx, move DSP functions to file of their own This is in preparation for adding checkasm tests; without it, checkasm would pull all of libavfilter in. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	3ba570de8b	avfilter/x86/vf_noise: Port line_noise funcs to SSE2 This avoids having to fix up ABI violations via emms_c and also leads to a 73% speedup for the line noise average version here. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-16 19:09:45 +02:00
Andreas Rheinhardt	adfec0f52e	avfilter/x86/vf_noise: Make line_noise_avg_mmx() match C function Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-16 18:41:19 +02:00
Andreas Rheinhardt	74a3c1ddb6	avfilter/x86/vf_pullup: Port pullup functions to SSE2, SSSE3 The diff and var functions benefit from psadbw, comb from wider registers which allows to avoid reloading values, reducing the number of loads from 48 to 10. Performance increased by 117% (the loop in compute_metric() has been timed); codesize decreased by 144B. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-15 19:43:37 +02:00
Andreas Rheinhardt	dcb28ed860	avfilter/x86/vf_spp: Port store_slice to SSE2 This allows to remove an emms_c from the filter. It also gives 25% speedup here (when timing the calls to store_slice using START/STOP_TIMER). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-15 19:43:37 +02:00
Andreas Rheinhardt	4fc05c28f4	avfilter/x86/vf_gradfun: Remove MMXEXT func overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 version of filter_line. This commit therefore removes the overridden MMXEXT version (which didn't abide by the ABI) which allows us to remove an emms_c() from vf_gradfun.c, so that users with SSSE3 no longer pay a price for the mere existence of an MMXEXT version. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:35 +02:00
Niklas Haas	843920d5d6	avfilter/x86/vf_idetdsp: add AVX2 and AVX512 implementations The only thing that changes slightly is the horizontal sum at the end.	2025-09-21 11:02:41 +00:00
Niklas Haas	4c067d0778	avfilter/x86/vf_idetdsp: generalize 8-bit macro This is mostly compatible with AVX as well, so turn it into a macro.	2025-09-21 11:02:41 +00:00
Niklas Haas	326abf359f	avfilter/vf_idetdsp: use consistent uint8_t pointer type Even for 16-bit DSP functions. Instead, cast the pointer inside the function.	2025-09-21 11:02:41 +00:00
Niklas Haas	60dbcc5321	avfilter/vf_idetdsp: pass actual bit depth More informative and IMO cleaner; some implementations may want to differentiate by exact bit depth or support 32 bit down the line.	2025-09-21 11:02:41 +00:00
Niklas Haas	5830743363	avfilter/vf_idet: separate DSP parts To avoid pulling in the entire libavfilter when using the DSP functions from checkasm. The rest of the struct is not needed outside vf_idet.c and was moved there.	2025-09-21 11:02:41 +00:00
Andreas Rheinhardt	a35c91dc14	avfilter/vf_colordetect: Rename header to vf_colordetectdsp.h It is more in line with our naming conventions. Reviewed-by: Martin Storsjö <martin@martin.st> Reviewed-by: Niklas Haas <ffmpeg@haasn.dev> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-16 18:22:24 +02:00
Niklas Haas	ba8aa0e7b3	avfilter/x86/vf_overlay: simplify function signature No reason to pass all the variables again, if we're already passing the context.	2025-09-02 17:06:25 +02:00
Niklas Haas	6d6bbdaab0	avfilter/vf_overlay: rename variables for clarity `is_straight`, `alpha_mode` etc. are more consistently named to refer to either the main image, or the overlay.	2025-09-02 17:06:25 +02:00
Niklas Haas	6f3eddbedd	avfilter/vf_overlay: configure alpha mode on the link And use the link-tagged value instead of the hard-coded parameter.	2025-09-02 17:06:25 +02:00
Niklas Haas	f07c12d806	avfilter/x86/vf_colordetect: fix alpha detect tail handling This wrapping logic still considered any nonzero return from the ASM function to be the overall result, but this is not true since the addition of FF_ALPHA_TRANSPARENT. Fix it by only early returning if FF_ALPHA_STRAIGHT is detected. Fixes: `9b8b78a815` See-Also: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20301#issuecomment-4802	2025-09-01 15:33:43 +00:00
Niklas Haas	9b8b78a815	avfilter/vf_colordetect: detect fully opaque alpha planes It can be useful to know if the alpha plane consists of fully opaque pixels or not, in which case it can e.g. safely be stripped. This only requires a very minor modification to the AVX2 routines, adding an extra AND on the read alpha value with the reference alpha value, and a single extra cheap test per line. detect_alpha_8_full_c: 2849.1 ( 1.00x) detect_alpha_8_full_avx2: 260.3 (10.95x) detect_alpha_8_full_avx512icl: 130.2 (21.87x) detect_alpha_8_limited_c: 8349.2 ( 1.00x) detect_alpha_8_limited_avx2: 756.6 (11.04x) detect_alpha_8_limited_avx512icl: 364.2 (22.93x) detect_alpha_16_full_c: 1652.8 ( 1.00x) detect_alpha_16_full_avx2: 236.5 ( 6.99x) detect_alpha_16_full_avx512icl: 134.6 (12.28x) detect_alpha_16_limited_c: 5263.1 ( 1.00x) detect_alpha_16_limited_avx2: 797.4 ( 6.60x) detect_alpha_16_limited_avx512icl: 400.3 (13.15x)	2025-08-18 18:50:00 +00:00
Niklas Haas	c96ccd78fc	avfilter/vf_colordetect: rename p, q, k variables for clarity Purely cosmetic. Motivated in part because I want to depend on the assumption that P represents the maximum alpha channel value.	2025-08-18 18:50:00 +00:00
James Almer	3f58c9df14	avfilter/x86/vf_bwdif: use the correct preprocessor check Signed-off-by: James Almer <jamrial@gmail.com>	2025-08-03 19:26:18 -03:00
Niklas Haas	7f00e24d70	vf_bwdif: add AVX512 implementation I also tried replacing some of the instructions by more elaborate ones using masks, but I found no performance gain significant enough to be worth maintaining two code paths, so this implementation merely replaces the AVX2 implementation by drop-in AVX512 equivalents. bwdif8_c: 6362.2 ( 1.00x) bwdif8_sse2: 1004.9 ( 6.33x) bwdif8_ssse3: 946.0 ( 6.73x) bwdif8_avx2: 477.9 (13.31x) bwdif8_avx512: 273.3 (23.28x) bwdif10_c: 6341.5 ( 1.00x) bwdif10_sse2: 872.4 ( 7.27x) bwdif10_ssse3: 803.4 ( 7.89x) bwdif10_avx2: 416.7 (15.22x) bwdif10_avx512: 224.3 (28.27x) Realtime test at 3840x2160 yuv420p: avx2: frame=20000 fps=3370 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed=67.4x elapsed=0:00:05.93 avx512: frame=20000 fps=5077 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed= 102x elapsed=0:00:03.93 The use of this function is gated behind avx512icl so that it doesn't downclock on Skylake.	2025-08-03 22:13:51 +00:00
Timo Rothenpieler	262d41c804	all: fix typos found by codespell	2025-08-03 13:48:47 +02:00
James Almer	a01dc3aa27	avfilter/x86/vf_colordetect: add missing preprocessor checks Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 18:03:22 -03:00
James Almer	c62813a057	avfilter/x86/vf_colordetect: make the AVX512 functions run only on ICL targets or newer For detect_range, the usage of vpbroadcast{b,w} requires the AVX512BW extension, and for detect_alpha we don't want ZMM instructions downclocking old CPUs. Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 17:25:28 -03:00
James Almer	70fc4e5909	avfilter/x86/vf_colordetect_init: don't enable ASM functions on targets where it's known they will be slower Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:58:51 -03:00
James Almer	fdca209f1f	avfilter/x86/vf_colordetect: don't use rax to return a 32bit integer Fixes compilation on x86_32 targets Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:58:36 -03:00
James Almer	14f4478354	avfilter/x86/vf_colordetect: fix use of AVX512 instruction in AVX2 function on non Unix64 targets Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:52:46 -03:00
Niklas Haas	8b647b3f8a	avfilter/vf_colordetect: add x86 SIMD implementation alphadetect8_full_c: 5658.2 ( 1.00x) alphadetect8_full_avx2: 215.1 (26.31x) alphadetect8_full_avx512: 133.5 (42.40x) alphadetect8_limited_c: 7391.5 ( 1.00x) alphadetect8_limited_avx2: 649.3 (11.38x) alphadetect8_limited_avx512: 330.5 (22.36x) alphadetect16_full_c: 3027.4 ( 1.00x) alphadetect16_full_avx2: 209.4 (14.46x) alphadetect16_full_avx512: 141.4 (21.41x) alphadetect16_limited_c: 3880.9 ( 1.00x) alphadetect16_limited_avx2: 734.9 ( 5.28x) alphadetect16_limited_avx512: 349.2 (11.11x) rangedetect8_c: 5854.2 ( 1.00x) rangedetect8_avx2: 138.9 (42.15x) rangedetect8_avx512: 106.2 (55.12x) rangedetect16_c: 4122.0 ( 1.00x) rangedetect16_avx2: 138.6 (29.74x) rangedetect16_avx512: 104.1 (39.60x)	2025-07-21 18:10:25 +02:00
James Almer	85f2911891	avfilter/x86/vf_blackdetect: add missing preprocessor check Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-18 15:17:02 -03:00
James Almer	ee4ff3f706	avfilter/x86/vf_blackdetect_init: don't enable the ASM functions on targets where it's known they will be slower Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-18 13:05:44 -03:00
James Almer	f263192f0e	avfilter/x86/vf_blackdetect: don't use rax to return a 32bit integer Fixes compilation on x86_32. Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-18 13:05:44 -03:00
Niklas Haas	75cd42c48a	avfilter/vf_blackdetect: add AVX2 SIMD version Requested by a user. Even with autovectorization enabled, the compiler performs a quite poor job of optimizing this function, due to not being able to take advantage of the pmaxub + pcmpeqb trick for counting the number of pixels less than or equal-to a threshold. blackdetect8_c: 4625.0 ( 1.00x) blackdetect8_avx2: 155.1 (29.83x) blackdetect16_c: 2529.4 ( 1.00x) blackdetect16_avx2: 163.6 (15.46x)	2025-07-18 10:47:31 +02:00
Niklas Haas	e44a1aaeec	avfilter/x86/scene_sad: add high bit depth AVX2/AVX512 version Since psadbw only exists for 8-bits, we have to emulate it for 16-bit inputs. The simplest sequence is to use a normal subtraction, which is safe as long as the inputs do not exceed 32767 - so limit this implementation to 15-bit inputs and below. For 16-bit inputs, we could in theory instead use a pminw / pmaxw to ensure the resulting difference does not overflow, but this is slower, and also breaks the subsequent use of pmaddwd, so I opted to skip 16-bit SIMD for now. scene_sad10_c: 114175.6 ( 1.00x) scene_sad10_avx2: 9617.7 (11.87x) scene_sad10_avx512: 5208.8 (21.92x) scene_sad12_c: 114537.8 ( 1.00x) scene_sad12_avx2: 9614.0 (11.91x) scene_sad12_avx512: 5186.3 (22.08x) scene_sad14_c: 114113.9 ( 1.00x) scene_sad14_avx2: 9612.9 (11.87x) scene_sad14_avx512: 5186.0 (22.00x) scene_sad15_c: 114108.9 ( 1.00x) scene_sad15_avx2: 9612.3 (11.87x) scene_sad15_avx512: 5186.4 (22.00x) scene_sad16_c: 114136.0 ( 1.00x)	2025-07-17 12:26:06 +02:00
Niklas Haas	91f2d146d4	avfilter/x86/scene_sad: add AVX512 implementation Trivial to add, but a lot faster (on my machine). scene_sad8_c: 114476.4 ( 1.00x) scene_sad8_sse2: 8644.3 (13.24x) scene_sad8_avx2: 4520.1 (25.33x) scene_sad8_avx512: 3153.0 (36.31x)	2025-07-17 12:26:06 +02:00
Niklas Haas	dc61b74c1d	avfilter/scene_sad: pass true depth to ff_scene_sad_get_fn() I need to be able to distinguish between 10/12/14 and 16 bit depths, for overflow reasons.	2025-07-17 12:26:05 +02:00
James Almer	dbe94e1110	avfilter/x86/f_ebur128: replace AVX2 instruction with AVX equivalent Using vpbroadcastq in an AVX function will result in SIGILL errors on pre Haswell/Zen processors. Signed-off-by: James Almer <jamrial@gmail.com>	2025-06-22 09:31:44 -03:00
Niklas Haas	daef348574	avfilter/x86/f_ebur128: implement AVX peak calculation Stereo only, for simplicity. Slightly faster than the C code.	2025-06-21 17:28:39 +02:00
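
Note on `75cd42c48a` (blackdetect AVX2): the "pmaxub + pcmpeqb trick" rests on the identity that, for unsigned bytes, x <= t holds exactly when max(x, t) == t. One unsigned maximum followed by one equality compare therefore marks every pixel at or below the threshold. The scalar C sketch below only models that identity one byte at a time. It is not the FFmpeg implementation, and the name `count_le_threshold` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the per-byte pmaxub + pcmpeqb sequence.
 * Hypothetical helper for illustration only, not FFmpeg code. */
static unsigned count_le_threshold(const uint8_t *src, size_t n,
                                   uint8_t threshold)
{
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        /* pmaxub: unsigned byte maximum of pixel and threshold */
        uint8_t m = src[i] > threshold ? src[i] : threshold;
        /* pcmpeqb: equal to the threshold means src[i] <= threshold */
        count += (m == threshold);
    }
    return count;
}
```

A SIMD version applies the same two steps to a whole register of bytes at once, which is what lets it avoid a per-pixel branch.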

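Note on `e44a1aaeec` (high bit depth scene_sad): the message limits the SIMD path to 15-bit inputs because a plain 16-bit subtraction is only safe while both inputs stay at or below 32767. The C sketch below models one 16-bit lane to show why. It assumes the difference is then made absolute before summing; the log does not show the actual instruction sequence. Both function names are hypothetical, and the narrowing cast relies on the wrap-around behaviour that GCC and Clang document for out-of-range signed conversions.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference SAD: widen to int before subtracting, correct for any depth. */
static uint64_t sad_ref(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint64_t)(d < 0 ? -d : d);
    }
    return sum;
}

/* Model of 16-bit lane arithmetic: the difference is truncated to a
 * signed 16-bit value, as a word-sized subtraction would do.
 * Hypothetical illustration only, not FFmpeg code. */
static uint64_t sad_lane16(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Out-of-range conversion is implementation-defined in C;
         * GCC and Clang reduce it modulo 2^16. */
        int16_t d = (int16_t)(uint16_t)(a[i] - b[i]);
        int ad = d < 0 ? -(int)d : (int)d;
        sum += (uint64_t)ad;
    }
    return sum;
}
```

With inputs up to 32767 the difference lies in [-32767, 32767] and fits a signed word, so both functions agree. With full 16-bit inputs (for example 65535 against 0) the truncated difference no longer matches the true one.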
389 commits
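
Note on `cc43670268` (vf_noise unaligned access): `movdqa` requires a 16-byte-aligned address, while `movdqu` does not. The C sketch below is a simplified model of how an in-place crop offset turns an aligned plane pointer into a misaligned one. The helpers `is_aligned` and `crop_plane` are hypothetical, assume an 8-bit plane (one byte per sample), and are not taken from FFmpeg.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is a multiple of alignment (a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Simplified model of a crop that adjusts the plane pointer in place
 * instead of copying: skip y rows, then x bytes. */
static const uint8_t *crop_plane(const uint8_t *data, ptrdiff_t linesize,
                                 int x, int y)
{
    return data + y * linesize + x;
}
```

With a 16-byte-aligned plane and a line size that is a multiple of 16, an x offset of 1 yields a pointer for which `is_aligned(p, 16)` is false, which matches the 1-byte misalignment described in the commit message. A consumer has to either check for this and fall back, or use unaligned loads and stores throughout.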