ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-06-04 22:50:24 +00:00

Author	SHA1	Message	Date
Andreas Rheinhardt	7971953d29	avfilter/x86/vf_pp7: Port ff_pp7_dctB_mmx to SSE2 Unfortunately a bit slower than the MMX version due to the impossibility to use memory operands in paddw. The situation would reverse if ff_dctB_mmx() would have to issue emms. dctB_c: 3.7 ( 1.00x) dctB_mmx: 3.3 ( 1.13x) dctB_sse2: 3.6 ( 1.03x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-05-15 20:29:29 +02:00
Andreas Rheinhardt	fc9e63474f	avfilter/vf_pp7dsp: Add restrict Makes GCC optimize the scalar codepath away. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-05-15 20:29:29 +02:00
Andreas Rheinhardt	617a9afeb4	avfilter/vf_pp7: Add proper PP7DSPContext This is in preparation for checkasm tests for dctB. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-05-15 20:29:29 +02:00
Andreas Rheinhardt	0a1faa7202	avfilter/vf_pp7: Constify Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-05-15 20:29:29 +02:00
Andreas Rheinhardt	0c4c9c66bd	avfilter/x86/vf_atadenoise: Don't load args unnecessarily These args will be read directly from the stack into xmm register, so loading them into GPRs is unnecessary. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-04-06 11:28:49 +02:00
Andreas Rheinhardt	9fdd7e23e3	avfilter/x86/vf_atadenoise: Avoid load Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-03-31 16:49:51 +02:00
Kacper Michajłow	2b1d8ba3ec	avfilter/x86/vf_atadenoise: move %if ARCH_X86_64 after x86util include This is consistent pattern with other files. Also is needed for next commit to always include x86util.asm Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2026-03-29 22:22:29 +02:00
Kacper Michajłow	2b8ca0f3c5	avfilter/x86/avf_showcqt: add missing section declaration Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2026-03-29 22:22:29 +02:00
Andreas Rheinhardt	eb5ac9fee7	avfilter/x86/vf_idetdsp: Avoid (v)movdqa Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-03-29 01:05:23 +01:00
Jun Zhao	91ae6d10ab	lavfi/nlmeans: add aarch64 neon for compute_weights_line Implement NEON optimization for compute_weights_line. Also update the function signature to use ptrdiff_t for stack arguments (max_meaningful_diff, startx, endx). This is done to unify the stack layout between Apple platforms (which pack 32-bit stack arguments tightly) and the generic AAPCS64 ABI (which requires 8-byte stack slots for 32-bit arguments). Using ptrdiff_t ensures 8-byte slots are used on all AArch64 platforms, avoiding ABI mismatches with the assembly implementation. The x86 AVX2 prototype is updated to match the new signature. Performance benchmark (AArch64) in MacOS M4: ./tests/checkasm/checkasm --test=vf_nlmeans --bench compute_weights_line_c: 151.1 ( 1.00x) compute_weights_line_neon: 62.6 ( 2.42x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-01-09 16:10:10 +00:00
Ruikai Peng	cc43670268	avfilter/x86/vf_noise: Use unaligned access Regression since: `3ba570de8b` (port from MMX to SSE2). The SSE2 inline asm in libavfilter/x86/vf_noise.c (line_noise_sse2 and line_noise_avg_sse2) uses aligned loads/stores (movdqa, movntdq) but never checks pointer alignment. When the filter reuses an input frame (common path when av_frame_is_writable() is true), it may receive misaligned data from upstream filters that adjust frame->data[i] in place, notably vf_crop: - vf_crop adjusts plane pointers by arbitrary byte offsets (frame->data[plane] += ...), so an x offset of 1 on 8-bit formats produces a 1‑byte misalignment. - The noise filter then calls the SSE2 path directly on those pointers without realigning or falling back. Repro on x86_64/SSE2 (current HEAD at that commit): `./ffmpeg -v error -f lavfi -i testsrc=s=320x240:rate=1 -vf "format=yuv420p,crop=w=319:x=1:h=240:exact=1,noise=alls=50" -frames:v 1 -f null -` This crashes with SIGSEGV at the aligned load in line_noise_sse2 (movdqa (%r9,%rax),%xmm0; effective address misaligned by 1 byte). Impact: denial of service via crafted filtergraphs (e.g., crop + noise). Applies to planar 8-bit formats where upstream filters can shift data pointers without reallocating. Found-by: Pwno OSS Team	2025-12-12 19:25:21 +00:00
Andreas Rheinhardt	7356981bec	avfilter/x86/Makefile: Only compile ASM init files when X86ASM is enabled To do so, simply add these init files to X86ASM-OBJS instead of OBJS in the Makefile. The former is already used for the actual assembly files, but using them for the C init files just works, because the build system uses file extensions to derive whether it is a C or a NASM file. This avoids compiling unused function stubs and also reduces our reliance on DCE: We don't add %if checks to the asm files except for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4 functions will be available. It also allows to remove HAVE_X86ASM checks in these init files. Reviewed-by: Kacper Michajłow <kasper93@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 22:20:13 +01:00
Piotr Pawlowski	372dab2a4d	All: Removed reliance on compiler performing dead code elimination, changed various macro constant checks from if() to #if	2025-11-28 19:52:51 +01:00
Niklas Haas	f3346ca6f7	avfilter/x86/f_ebur128: only use filter_channels_avx for >= 2 channels The approach of this ASM routine is to process two channels at a time using AVX instructions. Obviously, there is no point in doing this if there is only a single channel; in which case the scalar loop would be better. Fixes a performance regression when filtering mono audio on certain CPUs, notably e.g. the Intel N100.	2025-11-25 22:13:57 +00:00
Andreas Rheinhardt	c0648b2004	avfilter/x86/vf_spp: Fix comment Forgotten in `dcb28ed860`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	06b0dae51b	avfilter/vf_fsppdsp: Constify Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	3cd452cbf1	avfilter/x86/vf_fspp: Avoid stack on x64 Possible due to the amount of registers. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:12 +01:00
Andreas Rheinhardt	ddd74276f8	avfilter/x86/vf_fspp: Port ff_column_fidct_mmx() to SSE2 It gains a lot because it has to operate on eight words; it also saves 608B of .text here. Old benchmarks: column_fidct_c: 3365.7 ( 1.00x) column_fidct_mmx: 1784.6 ( 1.89x) New benchmarks: column_fidct_c: 3361.5 ( 1.00x) column_fidct_sse2: 801.1 ( 4.20x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	63493bf0e0	avfilter/x86/vf_fspp: Put shifts into constants This avoids some shift instructions and also gives us more headroom in the registers. In fact, I have proven to myself that everything that is supposed to fit into 16bits now actually does so. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	66af18d06a	avfilter/x86/vf_fspp: Make ff_column_fidct_mmx() bitexact It currently is not, because the shortcut mode uses different rounding than the C code (as well as the non-shortcut code). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 12:18:11 +01:00
Andreas Rheinhardt	ff85a20b7d	avfilter/x86/vf_fspp: Port store_slice to SSE2 Old benchmarks: store_slice_c: 2798.3 ( 1.00x) store_slice_mmx: 950.2 ( 2.94x) store_slice2_c: 3811.7 ( 1.00x) store_slice2_mmx: 682.3 ( 5.59x) New benchmarks: store_slice_c: 2797.2 ( 1.00x) store_slice_sse2: 543.5 ( 5.15x) store_slice2_c: 3817.0 ( 1.00x) store_slice2_sse2: 408.2 ( 9.35x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	52ba2ac7bd	avfilter/x86/vf_fspp: Port mul_thrmat to SSE2 This fixes an ABI violation, as mul_thrmat did not issue emms. It seems that this ABI violation could reach the user, namely if ff_get_video_buffer() fails. Notice that ff_get_video_buffer() itself could fail because of this, namely if the allocator uses floating point registers. On x64 (where GCC already used SSE2 in the C version) mul_thrmat_c: 4.4 ( 1.00x) mul_thrmat_mmx: 8.6 ( 0.52x) mul_thrmat_sse2: 4.4 ( 1.00x) On 32bit (where SSE2 is not known to be available): mul_thrmat_c: 56.0 ( 1.00x) mul_thrmat_sse2: 6.0 ( 9.40x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	9f4d5d818d	avfilter/x86/vf_fspp: Don't duplicate dither table Reuse the one from vf_fsppdsp.c; also don't overalign said table too much. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	9b34088c4d	avfilter/vf_fspp: Add DSPCtx, move DSP functions to file of their own This is in preparation for adding checkasm tests; without it, checkasm would pull all of libavfilter in. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-17 11:28:04 +01:00
Andreas Rheinhardt	3ba570de8b	avfilter/x86/vf_noise: Port line_noise funcs to SSE2 This avoids having to fix up ABI violations via emms_c and also leads to a 73% speedup for the line noise average version here. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-16 19:09:45 +02:00
Andreas Rheinhardt	adfec0f52e	avfilter/x86/vf_noise: Make line_noise_avg_mmx() match C function Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-16 18:41:19 +02:00
Andreas Rheinhardt	74a3c1ddb6	avfilter/x86/vf_pullup: Port pullup functions to SSE2, SSSE3 The diff and var functions benefit from psadbw, comb from wider registers which allows to avoid reloading values, reducing the number of loads from 48 to 10. Performance increased by 117% (the loop in compute_metric() has been timed); codesize decreased by 144B. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-15 19:43:37 +02:00
Andreas Rheinhardt	dcb28ed860	avfilter/x86/vf_spp: Port store_slice to SSE2 This allows to remove an emms_c from the filter. It also gives 25% speedup here (when timing the calls to store_slice using START/STOP_TIMER). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-15 19:43:37 +02:00
Andreas Rheinhardt	4fc05c28f4	avfilter/x86/vf_gradfun: Remove MMXEXT func overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 version of filter_line. This commit therefore removes the overridden MMXEXT version (which didn't abide by the ABI) which allows us to remove an emms_c() from vf_gradfun.c, so that users with SSSE3 no longer pay a price for the mere existence of an MMXEXT version. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:35 +02:00
Niklas Haas	843920d5d6	avfilter/x86/vf_idetdsp: add AVX2 and AVX512 implementations The only thing that changes slightly is the horizontal sum at the end.	2025-09-21 11:02:41 +00:00
Niklas Haas	4c067d0778	avfilter/x86/vf_idetdsp: generalize 8-bit macro This is mostly compatible with AVX as well, so turn it into a macro.	2025-09-21 11:02:41 +00:00
Niklas Haas	326abf359f	avfilter/vf_idetdsp: use consistent uint8_t pointer type Even for 16-bit DSP functions. Instead, cast the pointer inside the function.	2025-09-21 11:02:41 +00:00
Niklas Haas	60dbcc5321	avfilter/vf_idetdsp: pass actual bit depth More informative and IMO cleaner; some implementations may want to differentiate by exact bit depth or support 32 bit down the line.	2025-09-21 11:02:41 +00:00
Niklas Haas	5830743363	avfilter/vf_idet: separate DSP parts To avoid pulling in the entire libavfilter when using the DSP functions from checkasm. The rest of the struct is not needed outside vf_idet.c and was moved there.	2025-09-21 11:02:41 +00:00
Andreas Rheinhardt	a35c91dc14	avfilter/vf_colordetect: Rename header to vf_colordetectdsp.h It is more in line with our naming conventions. Reviewed-by: Martin Storsjö <martin@martin.st> Reviewed-by: Niklas Haas <ffmpeg@haasn.dev> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-16 18:22:24 +02:00
Niklas Haas	ba8aa0e7b3	avfilter/x86/vf_overlay: simplify function signature No reason to pass all the variables again, if we're already passing the context.	2025-09-02 17:06:25 +02:00
Niklas Haas	6d6bbdaab0	avfilter/vf_overlay: rename variables for clarity `is_straight`, `alpha_mode` etc. are more consistently named to refer to either the main image, or the overlay.	2025-09-02 17:06:25 +02:00
Niklas Haas	6f3eddbedd	avfilter/vf_overlay: configure alpha mode on the link And use the link-tagged value instead of the hard-coded parameter.	2025-09-02 17:06:25 +02:00
Niklas Haas	f07c12d806	avfilter/x86/vf_colordetect: fix alpha detect tail handling This wrapping logic still considered any nonzero return from the ASM function to be the overall result, but this is not true since the addition of FF_ALPHA_TRANSPARENT. Fix it by only early returning if FF_ALPHA_STRAIGHT is detected. Fixes: `9b8b78a815` See-Also: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20301#issuecomment-4802	2025-09-01 15:33:43 +00:00
Niklas Haas	9b8b78a815	avfilter/vf_colordetect: detect fully opaque alpha planes It can be useful to know if the alpha plane consists of fully opaque pixels or not, in which case it can e.g. safely be stripped. This only requires a very minor modification to the AVX2 routines, adding an extra AND on the read alpha value with the reference alpha value, and a single extra cheap test per line. detect_alpha_8_full_c: 2849.1 ( 1.00x) detect_alpha_8_full_avx2: 260.3 (10.95x) detect_alpha_8_full_avx512icl: 130.2 (21.87x) detect_alpha_8_limited_c: 8349.2 ( 1.00x) detect_alpha_8_limited_avx2: 756.6 (11.04x) detect_alpha_8_limited_avx512icl: 364.2 (22.93x) detect_alpha_16_full_c: 1652.8 ( 1.00x) detect_alpha_16_full_avx2: 236.5 ( 6.99x) detect_alpha_16_full_avx512icl: 134.6 (12.28x) detect_alpha_16_limited_c: 5263.1 ( 1.00x) detect_alpha_16_limited_avx2: 797.4 ( 6.60x) detect_alpha_16_limited_avx512icl: 400.3 (13.15x)	2025-08-18 18:50:00 +00:00
Niklas Haas	c96ccd78fc	avfilter/vf_colordetect: rename p, q, k variables for clarity Purely cosmetic. Motivated in part because I want to depend on the assumption that P represents the maximum alpha channel value.	2025-08-18 18:50:00 +00:00
James Almer	3f58c9df14	avfilter/x86/vf_bwdif: use the correct preprocessor check Signed-off-by: James Almer <jamrial@gmail.com>	2025-08-03 19:26:18 -03:00
Niklas Haas	7f00e24d70	vf_bwdif: add AVX512 implementation I also tried replacing some of the instructions by more elaborate ones using masks, but I found no performance gain significant enough to be worth maintaining two code paths, so this implementation merely replaces the AVX2 implementation by drop-in AVX512 equivalents. bwdif8_c: 6362.2 ( 1.00x) bwdif8_sse2: 1004.9 ( 6.33x) bwdif8_ssse3: 946.0 ( 6.73x) bwdif8_avx2: 477.9 (13.31x) bwdif8_avx512: 273.3 (23.28x) bwdif10_c: 6341.5 ( 1.00x) bwdif10_sse2: 872.4 ( 7.27x) bwdif10_ssse3: 803.4 ( 7.89x) bwdif10_avx2: 416.7 (15.22x) bwdif10_avx512: 224.3 (28.27x) Realtime test at 3840x2160 yuv420p: avx2: frame=20000 fps=3370 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed=67.4x elapsed=0:00:05.93 avx512: frame=20000 fps=5077 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed= 102x elapsed=0:00:03.93 The use of this function is gated behind avx512icl so that it doesn't downclock on Skylake.	2025-08-03 22:13:51 +00:00
Timo Rothenpieler	262d41c804	all: fix typos found by codespell	2025-08-03 13:48:47 +02:00
James Almer	a01dc3aa27	avfilter/x86/vf_colordetect: add missing preprocessor checks Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 18:03:22 -03:00
James Almer	c62813a057	avfilter/x86/vf_colordetect: make the AVX512 functions run only on ICL targets or newer For detect_range, the usage of vpbroadcast{b,w} requires the AVX512BW extension, and for detect_alpha we don't want ZMM instructions downclocking old CPUs. Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 17:25:28 -03:00
James Almer	70fc4e5909	avfilter/x86/vf_colordetect_init: don't enable ASM functions on targets where it's known they will be slower Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:58:51 -03:00
James Almer	fdca209f1f	avfilter/x86/vf_colordetect: don't use rax to return a 32bit integer Fixes compilation on x86_32 targets Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:58:36 -03:00
James Almer	14f4478354	avfilter/x86/vf_colordetect: fix use of AVX512 instruction in AVX2 function on non Unix64 targets Signed-off-by: James Almer <jamrial@gmail.com>	2025-07-21 16:52:46 -03:00
Niklas Haas	8b647b3f8a	avfilter/vf_colordetect: add x86 SIMD implementation alphadetect8_full_c: 5658.2 ( 1.00x) alphadetect8_full_avx2: 215.1 (26.31x) alphadetect8_full_avx512: 133.5 (42.40x) alphadetect8_limited_c: 7391.5 ( 1.00x) alphadetect8_limited_avx2: 649.3 (11.38x) alphadetect8_limited_avx512: 330.5 (22.36x) alphadetect16_full_c: 3027.4 ( 1.00x) alphadetect16_full_avx2: 209.4 (14.46x) alphadetect16_full_avx512: 141.4 (21.41x) alphadetect16_limited_c: 3880.9 ( 1.00x) alphadetect16_limited_avx2: 734.9 ( 5.28x) alphadetect16_limited_avx512: 349.2 (11.11x) rangedetect8_c: 5854.2 ( 1.00x) rangedetect8_avx2: 138.9 (42.15x) rangedetect8_avx512: 106.2 (55.12x) rangedetect16_c: 4122.0 ( 1.00x) rangedetect16_avx2: 138.6 (29.74x) rangedetect16_avx512: 104.1 (39.60x)	2025-07-21 18:10:25 +02:00


398 commits