ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-04-18 16:40:23 +00:00

Author	SHA1	Message	Date
Andreas Rheinhardt	3a7c09eb39	avcodec/x86/mpegvideoencdsp_init: Port draw_edges to SSSE3 Benchmarks: draw_edges_8_1724_4_c: 2672.2 ( 1.00x) draw_edges_8_1724_4_mmx: 3191.5 ( 0.84x) draw_edges_8_1724_4_ssse3: 2179.6 ( 1.23x) draw_edges_8_1724_8_c: 2852.3 ( 1.00x) draw_edges_8_1724_8_mmx: 3683.0 ( 0.77x) draw_edges_8_1724_8_ssse3: 2225.7 ( 1.28x) draw_edges_8_1724_16_c: 4169.4 ( 1.00x) draw_edges_8_1724_16_mmx: 4665.9 ( 0.89x) draw_edges_8_1724_16_ssse3: 2765.8 ( 1.51x) draw_edges_128_407_4_c: 1126.6 ( 1.00x) draw_edges_128_407_4_mmx: 943.9 ( 1.19x) draw_edges_128_407_4_ssse3: 925.7 ( 1.22x) draw_edges_128_407_8_c: 1208.8 ( 1.00x) draw_edges_128_407_8_mmx: 1119.1 ( 1.08x) draw_edges_128_407_8_ssse3: 997.8 ( 1.21x) draw_edges_128_407_16_c: 1352.4 ( 1.00x) draw_edges_128_407_16_mmx: 1368.7 ( 0.99x) draw_edges_128_407_16_ssse3: 1148.3 ( 1.18x) draw_edges_1080_31_4_c: 228.5 ( 1.00x) draw_edges_1080_31_4_mmx: 240.8 ( 0.95x) draw_edges_1080_31_4_ssse3: 226.7 ( 1.01x) draw_edges_1080_31_8_c: 411.1 ( 1.00x) draw_edges_1080_31_8_mmx: 432.9 ( 0.95x) draw_edges_1080_31_8_ssse3: 403.2 ( 1.02x) draw_edges_1080_31_16_c: 1121.2 ( 1.00x) draw_edges_1080_31_16_mmx: 1124.9 ( 1.00x) draw_edges_1080_31_16_ssse3: 1125.4 ( 1.00x) draw_edges_1920_4_4_c: 310.8 ( 1.00x) draw_edges_1920_4_4_mmx: 311.6 ( 1.00x) draw_edges_1920_4_4_ssse3: 311.6 ( 1.00x) draw_edges_1920_4_4_negstride_c: 307.0 ( 1.00x) draw_edges_1920_4_4_negstride_mmx: 306.7 ( 1.00x) draw_edges_1920_4_4_negstride_ssse3: 306.7 ( 1.00x) draw_edges_1920_4_8_c: 724.2 ( 1.00x) draw_edges_1920_4_8_mmx: 724.9 ( 1.00x) draw_edges_1920_4_8_ssse3: 717.3 ( 1.01x) draw_edges_1920_4_8_negstride_c: 719.2 ( 1.00x) draw_edges_1920_4_8_negstride_mmx: 717.1 ( 1.00x) draw_edges_1920_4_8_negstride_ssse3: 710.9 ( 1.01x) draw_edges_1920_4_16_c: 1752.9 ( 1.00x) draw_edges_1920_4_16_mmx: 1754.6 ( 1.00x) draw_edges_1920_4_16_ssse3: 1751.1 ( 1.00x) draw_edges_1920_4_16_negstride_c: 1783.2 ( 1.00x) draw_edges_1920_4_16_negstride_mmx: 1778.2 ( 1.00x) 
draw_edges_1920_4_16_negstride_ssse3: 1768.3 ( 1.01x) Reviewed-by: Michael Niedermayer <michael@niedermayer.cc> Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-02-11 11:15:06 +01:00
Andreas Rheinhardt	436b74b725	avcodec/x86/hevc/dequant: Add SSSE3 dequant ASM function hevc_dequant_4x4_8_c (GCC): 20.2 ( 1.00x) hevc_dequant_4x4_8_c (Clang): 21.7 ( 1.00x) hevc_dequant_4x4_8_ssse3: 5.8 ( 3.51x) hevc_dequant_8x8_8_c (GCC): 32.9 ( 1.00x) hevc_dequant_8x8_8_c (Clang): 78.7 ( 1.00x) hevc_dequant_8x8_8_ssse3: 6.8 ( 4.83x) hevc_dequant_16x16_8_c (GCC): 105.1 ( 1.00x) hevc_dequant_16x16_8_c (Clang): 151.1 ( 1.00x) hevc_dequant_16x16_8_ssse3: 19.3 ( 5.45x) hevc_dequant_32x32_8_c (GCC): 415.7 ( 1.00x) hevc_dequant_32x32_8_c (Clang): 602.3 ( 1.00x) hevc_dequant_32x32_8_ssse3: 78.2 ( 5.32x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 12:25:33 +01:00
Andreas Rheinhardt	2729c52988	avcodec/x86/hevc/deblock: Reduce usage of GPRs Don't use two GPRs to store two words from xmm registers; shuffle these words so that they are fit into one GPR. This reduces the amount of GPRs used and leads to tiny speedups here. Also avoid rex prefixes whenever possible (for lines that needed to be modified anyway). Old benchmarks: hevc_h_loop_filter_luma8_skip_c: 23.8 ( 1.00x) hevc_h_loop_filter_luma8_skip_sse2: 8.5 ( 2.80x) hevc_h_loop_filter_luma8_skip_ssse3: 7.2 ( 3.29x) hevc_h_loop_filter_luma8_skip_avx: 6.4 ( 3.71x) hevc_h_loop_filter_luma8_strong_c: 150.4 ( 1.00x) hevc_h_loop_filter_luma8_strong_sse2: 34.4 ( 4.37x) hevc_h_loop_filter_luma8_strong_ssse3: 34.5 ( 4.36x) hevc_h_loop_filter_luma8_strong_avx: 32.3 ( 4.65x) hevc_h_loop_filter_luma8_weak_c: 103.2 ( 1.00x) hevc_h_loop_filter_luma8_weak_sse2: 34.5 ( 2.99x) hevc_h_loop_filter_luma8_weak_ssse3: 7.3 (14.22x) hevc_h_loop_filter_luma8_weak_avx: 32.4 ( 3.18x) hevc_h_loop_filter_luma10_skip_c: 23.5 ( 1.00x) hevc_h_loop_filter_luma10_skip_sse2: 6.6 ( 3.58x) hevc_h_loop_filter_luma10_skip_ssse3: 6.1 ( 3.86x) hevc_h_loop_filter_luma10_skip_avx: 5.4 ( 4.34x) hevc_h_loop_filter_luma10_strong_c: 161.8 ( 1.00x) hevc_h_loop_filter_luma10_strong_sse2: 32.2 ( 5.03x) hevc_h_loop_filter_luma10_strong_ssse3: 30.4 ( 5.33x) hevc_h_loop_filter_luma10_strong_avx: 30.3 ( 5.33x) hevc_h_loop_filter_luma10_weak_c: 23.5 ( 1.00x) hevc_h_loop_filter_luma10_weak_sse2: 6.6 ( 3.58x) hevc_h_loop_filter_luma10_weak_ssse3: 6.1 ( 3.85x) hevc_h_loop_filter_luma10_weak_avx: 5.4 ( 4.35x) hevc_h_loop_filter_luma12_skip_c: 18.8 ( 1.00x) hevc_h_loop_filter_luma12_skip_sse2: 6.6 ( 2.87x) hevc_h_loop_filter_luma12_skip_ssse3: 6.1 ( 3.08x) hevc_h_loop_filter_luma12_skip_avx: 6.2 ( 3.06x) hevc_h_loop_filter_luma12_strong_c: 159.0 ( 1.00x) hevc_h_loop_filter_luma12_strong_sse2: 36.3 ( 4.38x) hevc_h_loop_filter_luma12_strong_ssse3: 36.1 ( 4.40x) hevc_h_loop_filter_luma12_strong_avx: 33.5 ( 4.75x) 
hevc_h_loop_filter_luma12_weak_c: 40.1 ( 1.00x) hevc_h_loop_filter_luma12_weak_sse2: 35.5 ( 1.13x) hevc_h_loop_filter_luma12_weak_ssse3: 36.1 ( 1.11x) hevc_h_loop_filter_luma12_weak_avx: 6.2 ( 6.52x) hevc_v_loop_filter_luma8_skip_c: 25.5 ( 1.00x) hevc_v_loop_filter_luma8_skip_sse2: 10.6 ( 2.40x) hevc_v_loop_filter_luma8_skip_ssse3: 11.4 ( 2.24x) hevc_v_loop_filter_luma8_skip_avx: 8.3 ( 3.07x) hevc_v_loop_filter_luma8_strong_c: 146.8 ( 1.00x) hevc_v_loop_filter_luma8_strong_sse2: 43.9 ( 3.35x) hevc_v_loop_filter_luma8_strong_ssse3: 43.7 ( 3.36x) hevc_v_loop_filter_luma8_strong_avx: 42.3 ( 3.47x) hevc_v_loop_filter_luma8_weak_c: 25.5 ( 1.00x) hevc_v_loop_filter_luma8_weak_sse2: 10.6 ( 2.40x) hevc_v_loop_filter_luma8_weak_ssse3: 44.0 ( 0.58x) hevc_v_loop_filter_luma8_weak_avx: 8.3 ( 3.09x) hevc_v_loop_filter_luma10_skip_c: 20.0 ( 1.00x) hevc_v_loop_filter_luma10_skip_sse2: 11.3 ( 1.77x) hevc_v_loop_filter_luma10_skip_ssse3: 11.0 ( 1.82x) hevc_v_loop_filter_luma10_skip_avx: 9.3 ( 2.15x) hevc_v_loop_filter_luma10_strong_c: 193.5 ( 1.00x) hevc_v_loop_filter_luma10_strong_sse2: 46.1 ( 4.19x) hevc_v_loop_filter_luma10_strong_ssse3: 44.2 ( 4.38x) hevc_v_loop_filter_luma10_strong_avx: 44.4 ( 4.35x) hevc_v_loop_filter_luma10_weak_c: 90.3 ( 1.00x) hevc_v_loop_filter_luma10_weak_sse2: 46.3 ( 1.95x) hevc_v_loop_filter_luma10_weak_ssse3: 10.8 ( 8.37x) hevc_v_loop_filter_luma10_weak_avx: 44.4 ( 2.03x) hevc_v_loop_filter_luma12_skip_c: 16.8 ( 1.00x) hevc_v_loop_filter_luma12_skip_sse2: 11.8 ( 1.42x) hevc_v_loop_filter_luma12_skip_ssse3: 11.7 ( 1.43x) hevc_v_loop_filter_luma12_skip_avx: 8.7 ( 1.93x) hevc_v_loop_filter_luma12_strong_c: 159.3 ( 1.00x) hevc_v_loop_filter_luma12_strong_sse2: 45.3 ( 3.52x) hevc_v_loop_filter_luma12_strong_ssse3: 60.3 ( 2.64x) hevc_v_loop_filter_luma12_strong_avx: 44.1 ( 3.61x) hevc_v_loop_filter_luma12_weak_c: 63.6 ( 1.00x) hevc_v_loop_filter_luma12_weak_sse2: 45.3 ( 1.40x) hevc_v_loop_filter_luma12_weak_ssse3: 11.7 ( 5.41x) 
hevc_v_loop_filter_luma12_weak_avx: 43.9 ( 1.45x) New benchmarks: hevc_h_loop_filter_luma8_skip_c: 24.2 ( 1.00x) hevc_h_loop_filter_luma8_skip_sse2: 8.6 ( 2.82x) hevc_h_loop_filter_luma8_skip_ssse3: 7.0 ( 3.46x) hevc_h_loop_filter_luma8_skip_avx: 6.8 ( 3.54x) hevc_h_loop_filter_luma8_strong_c: 150.4 ( 1.00x) hevc_h_loop_filter_luma8_strong_sse2: 33.3 ( 4.52x) hevc_h_loop_filter_luma8_strong_ssse3: 32.7 ( 4.61x) hevc_h_loop_filter_luma8_strong_avx: 32.7 ( 4.60x) hevc_h_loop_filter_luma8_weak_c: 104.0 ( 1.00x) hevc_h_loop_filter_luma8_weak_sse2: 33.2 ( 3.13x) hevc_h_loop_filter_luma8_weak_ssse3: 7.0 (14.91x) hevc_h_loop_filter_luma8_weak_avx: 31.3 ( 3.32x) hevc_h_loop_filter_luma10_skip_c: 19.2 ( 1.00x) hevc_h_loop_filter_luma10_skip_sse2: 6.2 ( 3.08x) hevc_h_loop_filter_luma10_skip_ssse3: 6.2 ( 3.08x) hevc_h_loop_filter_luma10_skip_avx: 5.0 ( 3.85x) hevc_h_loop_filter_luma10_strong_c: 159.8 ( 1.00x) hevc_h_loop_filter_luma10_strong_sse2: 30.0 ( 5.32x) hevc_h_loop_filter_luma10_strong_ssse3: 29.2 ( 5.48x) hevc_h_loop_filter_luma10_strong_avx: 28.6 ( 5.58x) hevc_h_loop_filter_luma10_weak_c: 19.2 ( 1.00x) hevc_h_loop_filter_luma10_weak_sse2: 6.2 ( 3.09x) hevc_h_loop_filter_luma10_weak_ssse3: 6.2 ( 3.09x) hevc_h_loop_filter_luma10_weak_avx: 5.0 ( 3.88x) hevc_h_loop_filter_luma12_skip_c: 18.7 ( 1.00x) hevc_h_loop_filter_luma12_skip_sse2: 6.2 ( 3.00x) hevc_h_loop_filter_luma12_skip_ssse3: 5.7 ( 3.27x) hevc_h_loop_filter_luma12_skip_avx: 5.2 ( 3.61x) hevc_h_loop_filter_luma12_strong_c: 160.2 ( 1.00x) hevc_h_loop_filter_luma12_strong_sse2: 34.2 ( 4.68x) hevc_h_loop_filter_luma12_strong_ssse3: 29.3 ( 5.48x) hevc_h_loop_filter_luma12_strong_avx: 31.4 ( 5.10x) hevc_h_loop_filter_luma12_weak_c: 40.2 ( 1.00x) hevc_h_loop_filter_luma12_weak_sse2: 35.2 ( 1.14x) hevc_h_loop_filter_luma12_weak_ssse3: 29.3 ( 1.37x) hevc_h_loop_filter_luma12_weak_avx: 5.0 ( 8.09x) hevc_v_loop_filter_luma8_skip_c: 25.6 ( 1.00x) hevc_v_loop_filter_luma8_skip_sse2: 10.2 ( 2.52x) 
hevc_v_loop_filter_luma8_skip_ssse3: 10.5 ( 2.45x) hevc_v_loop_filter_luma8_skip_avx: 8.2 ( 3.11x) hevc_v_loop_filter_luma8_strong_c: 147.1 ( 1.00x) hevc_v_loop_filter_luma8_strong_sse2: 42.6 ( 3.45x) hevc_v_loop_filter_luma8_strong_ssse3: 42.4 ( 3.47x) hevc_v_loop_filter_luma8_strong_avx: 40.1 ( 3.67x) hevc_v_loop_filter_luma8_weak_c: 25.6 ( 1.00x) hevc_v_loop_filter_luma8_weak_sse2: 10.6 ( 2.42x) hevc_v_loop_filter_luma8_weak_ssse3: 42.7 ( 0.60x) hevc_v_loop_filter_luma8_weak_avx: 8.2 ( 3.11x) hevc_v_loop_filter_luma10_skip_c: 16.7 ( 1.00x) hevc_v_loop_filter_luma10_skip_sse2: 11.0 ( 1.52x) hevc_v_loop_filter_luma10_skip_ssse3: 10.5 ( 1.59x) hevc_v_loop_filter_luma10_skip_avx: 9.6 ( 1.74x) hevc_v_loop_filter_luma10_strong_c: 190.0 ( 1.00x) hevc_v_loop_filter_luma10_strong_sse2: 44.8 ( 4.24x) hevc_v_loop_filter_luma10_strong_ssse3: 42.3 ( 4.49x) hevc_v_loop_filter_luma10_strong_avx: 42.5 ( 4.47x) hevc_v_loop_filter_luma10_weak_c: 88.3 ( 1.00x) hevc_v_loop_filter_luma10_weak_sse2: 45.7 ( 1.93x) hevc_v_loop_filter_luma10_weak_ssse3: 10.5 ( 8.40x) hevc_v_loop_filter_luma10_weak_avx: 42.4 ( 2.09x) hevc_v_loop_filter_luma12_skip_c: 16.7 ( 1.00x) hevc_v_loop_filter_luma12_skip_sse2: 11.7 ( 1.42x) hevc_v_loop_filter_luma12_skip_ssse3: 10.5 ( 1.59x) hevc_v_loop_filter_luma12_skip_avx: 8.8 ( 1.90x) hevc_v_loop_filter_luma12_strong_c: 159.4 ( 1.00x) hevc_v_loop_filter_luma12_strong_sse2: 45.2 ( 3.53x) hevc_v_loop_filter_luma12_strong_ssse3: 59.3 ( 2.69x) hevc_v_loop_filter_luma12_strong_avx: 41.7 ( 3.82x) hevc_v_loop_filter_luma12_weak_c: 63.3 ( 1.00x) hevc_v_loop_filter_luma12_weak_sse2: 44.9 ( 1.41x) hevc_v_loop_filter_luma12_weak_ssse3: 10.5 ( 6.02x) hevc_v_loop_filter_luma12_weak_avx: 41.7 ( 1.52x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 11:54:57 +01:00
Andreas Rheinhardt	0843252229	avcodec/x86/hevc/deblock: avoid unused GPR r12 is unused, so use it instead of r13 to reduce the amount of push/pops. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 11:54:57 +01:00
Andreas Rheinhardt	0aad8b860a	avcodec/x86/hevc/deblock: Avoid vmovdqa (It would even be possible to avoid clobbering m10 in MASKED_COPY and the mask register (%3) in MASKED_COPY2 when VEX encoding is in use.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 11:54:57 +01:00
Andreas Rheinhardt	c940128fff	avcodec/x86/vp9lpf: Avoid vmovdqa Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 11:54:57 +01:00
Andreas Rheinhardt	c898ddb8fe	avcodec/x86/cfhddsp: Reduce number of xmm registers used Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:33:40 +01:00
Andreas Rheinhardt	848c3ca772	avcodec/x86/cfhddsp: Avoid pmaddwd The result of using pmaddwd with the coefficients 1,-1,...,1,-1 is just the negative of using pmaddwd with the coefficients -1,1,...,-1,1, so avoid one pmaddwd. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:33:37 +01:00
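The identity behind this change can be checked with a scalar model. The sketch below is not FFmpeg code: `pmaddwd_model` and `negation_holds` are hypothetical names, and the model assumes the usual pmaddwd semantics (signed 16-bit words multiplied pairwise, adjacent products added into 32-bit results).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pmaddwd on one 128-bit register: 8 words in, 4 dwords out. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i] + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Returns 1 if multiplying by 1,-1,... gives the negative of multiplying by -1,1,... */
static int negation_holds(const int16_t x[8])
{
    static const int16_t plus_minus[8] = {  1, -1,  1, -1,  1, -1,  1, -1 };
    static const int16_t minus_plus[8] = { -1,  1, -1,  1, -1,  1, -1,  1 };
    int32_t p[4], m[4];

    pmaddwd_model(x, plus_minus, p);
    pmaddwd_model(x, minus_plus, m);
    for (int i = 0; i < 4; i++)
        if (p[i] != -m[i])  /* |p[i]| <= 65535, so the negation cannot overflow */
            return 0;
    return 1;
}
```

Since one result is just the negation of the other, a single pmaddwd followed by a subtraction instead of an addition (or vice versa) can stand in for the second multiply.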
Andreas Rheinhardt	6224445753	avcodec/x86/cfhdencdsp: Avoid += x, -= x Avoid incrementing lowq and highq inside the loop by using complex addressing modes, so that said modification need not be undone at the end of the horizontal loop. For inputq, modify istrideq outside of the loop so that it is only modified once at the end of the horizontal loop. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:33:34 +01:00
Andreas Rheinhardt	7dd6487800	avcodec/x86/cfhdencdsp: Don't load twice Sign extend the integer arguments directly from the stack instead of loading qwords, followed by sign-extending the lower half. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:33:30 +01:00
Andreas Rheinhardt	91c7710412	avcodec/x86/cfhdencdsp: Avoid unnecessary constants Up until now, cfhdencdsp used both -1,1,...,-1,1 words and 1,-1,...,1,-1 words as constants for pmaddwd. But one can use the same constants for both if one shuffles the words in each dword into the opposite order. Similarly for some other constants. This also made it possible to avoid a register in chfdenc_vert_filter. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:33:23 +01:00
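The shuffle trick described in this message can be illustrated with a scalar model. This is a sketch under assumptions, not the code from the commit: `pmaddwd_ref`, `swap_words_in_dwords` and `one_constant_suffices` are hypothetical names, and which shuffle instruction the assembly actually uses is not stated in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pmaddwd: 8 signed words in, 4 signed dwords out. */
static void pmaddwd_ref(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i] + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Swap the two words inside every dword of the input. */
static void swap_words_in_dwords(const int16_t in[8], int16_t out[8])
{
    for (int i = 0; i < 8; i += 2) {
        out[i]     = in[i + 1];
        out[i + 1] = in[i];
    }
}

/* Returns 1 if pmaddwd(swap(x), {-1,1,...}) equals pmaddwd(x, {1,-1,...}),
 * i.e. if the {-1,1,...} table alone is enough once the data is shuffled. */
static int one_constant_suffices(const int16_t x[8])
{
    static const int16_t plus_minus[8] = {  1, -1,  1, -1,  1, -1,  1, -1 };
    static const int16_t minus_plus[8] = { -1,  1, -1,  1, -1,  1, -1,  1 };
    int16_t swapped[8];
    int32_t direct[4], shuffled[4];

    assert(x);
    pmaddwd_ref(x, plus_minus, direct);
    swap_words_in_dwords(x, swapped);
    pmaddwd_ref(swapped, minus_plus, shuffled);
    for (int i = 0; i < 4; i++)
        if (direct[i] != shuffled[i])
            return 0;
    return 1;
}
```

When the data has to be shuffled anyway, choosing the shuffle order costs nothing extra, so the second constant table becomes unnecessary.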
Andreas Rheinhardt	cd3d8116fb	avcodec/x86/cfhdencdsp: Avoid load of -1 It can be easily generated at runtime. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-29 01:32:57 +01:00
Andreas Rheinhardt	bf4d5037b4	avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names These names are a remnant of dsputil when all the DSP functions from all codecs were part of DSPcontext. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Sean McGovern <gseanmcg@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:25 +01:00
Andreas Rheinhardt	489aaf4e1c	avcodec/x86/h264_deblock: Don't sign-extend stride Unnecessary (and wrong) since `d5d699ab6e`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	db66e057eb	avcodec/x86/h264_deblock: Avoid reload Old benchmarks: h264_h_loop_filter_luma_8bpp_c: 60.0 ( 1.00x) h264_h_loop_filter_luma_8bpp_sse2: 65.4 ( 0.92x) h264_h_loop_filter_luma_8bpp_avx: 65.3 ( 0.92x) New benchmarks: h264_h_loop_filter_luma_8bpp_c: 60.4 ( 1.00x) h264_h_loop_filter_luma_8bpp_sse2: 62.0 ( 0.97x) h264_h_loop_filter_luma_8bpp_avx: 61.7 ( 0.98x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	8428a412bc	avcodec/x86/h264_deblock: Avoid MMX in deblock_h_luma_8 Old benchmarks: h264_h_loop_filter_luma_8bpp_c: 59.9 ( 1.00x) h264_h_loop_filter_luma_8bpp_sse2: 67.9 ( 0.88x) h264_h_loop_filter_luma_8bpp_avx: 67.4 ( 0.89x) New benchmarks: h264_h_loop_filter_luma_8bpp_c: 60.0 ( 1.00x) h264_h_loop_filter_luma_8bpp_sse2: 65.4 ( 0.92x) h264_h_loop_filter_luma_8bpp_avx: 65.3 ( 0.92x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	9882973935	avcodec/x86/h264_deblock: Avoid reloading constant No change in benchmarks. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	eaaf45fd79	avcodec/x86/h264_deblock_10bit: Simplify r0+4*r1 Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	aab0946eae	avcodec/x86/h264_deblock_10bit: Remove mmxext functions Now that the SSE2/AVX functions are no longer restricted to those systems having an aligned stack, the MMXEXT functions are always overridden (except for ancient systems without SSE2), so remove them. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	dbdf514c17	avcodec/x86/h264_deblock_10bit: Remove custom stack allocation code Allocate it via cglobal as usual. This makes the SSE2/AVX functions available when HAVE_ALIGNED_STACK is false; it also avoids modifying rsp unnecessarily in the deblock_h_luma_intra_10 functions on Win64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	b1140d3c98	avcodec/x86/h264_deblock: Remove obsolete macro parameters They are a remnant of the MMX functions (which processed only eight pixels at a time, so that it was called twice via a wrapper; the actual MMX function had "v8" in its name instead of simply v) which have been removed in commit `4618f36a24`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	899475326b	avcodec/x86/h264_deblock: Simplify splatting Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	a22149ab3d	avcodec/x86/h264_deblock: Remove always-false branches These functions are always called with alpha and beta > 0. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	982244818b	avcodec/x86/h264_deblock: Remove unused macros Forgotten in `4618f36a24`. Also remove a PASS8ROWS wrapper that seems to have been always unused. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-25 22:53:21 +01:00
Andreas Rheinhardt	685011003f	avcodec/x86/pngdsp: Remove MMXEXT function overridden by SSSE3 Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-10 22:47:22 +01:00
Andreas Rheinhardt	31daa7cd87	avcodec/pngdsp: Use proper prefix ff_add_png->ff_png_add Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-10 22:47:22 +01:00
Andreas Rheinhardt	5f15c067fe	avcodec/pngdsp: Constify Also constify ff_png_filter_row(). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-01-10 22:47:22 +01:00
Andreas Rheinhardt	6177af5acc	avcodec/x86/lossless_videodsp: Avoid unnecessary reg push,pop Happens on Win64. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-19 20:56:09 +01:00
Andreas Rheinhardt	9314d5cae8	avcodec/x86/lossless_videodsp: Avoid aligned/unaligned versions For AVX2, movdqu is as fast as movdqa when used on aligned addresses, so don't instantiate aligned/unaligned versions. (The check was btw overly strict: The AVX2 code only uses 16 byte stores, so it would be enough for dst to be 16-byte aligned.) Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-19 20:55:53 +01:00
Andreas Rheinhardt	6368d2baae	avcodec/x86/lossless_videodsp: Don't store in eight byte chunks Use movu (movdqu) instead of movq+movhps. Old benchmarks: add_left_pred_int16_c: 2265.5 ( 1.00x) add_left_pred_int16_ssse3: 595.4 ( 3.81x) add_left_pred_rnd_acc_c: 1255.0 ( 1.00x) add_left_pred_rnd_acc_ssse3: 326.2 ( 3.85x) add_left_pred_rnd_acc_avx2: 279.0 ( 4.50x) add_left_pred_zero_c: 1249.5 ( 1.00x) add_left_pred_zero_ssse3: 326.1 ( 3.83x) add_left_pred_zero_avx2: 277.0 ( 4.51x) New benchmarks: add_left_pred_int16_c: 2266.9 ( 1.00x) add_left_pred_int16_ssse3: 509.9 ( 4.45x) add_left_pred_rnd_acc_c: 1251.4 ( 1.00x) add_left_pred_rnd_acc_ssse3: 282.6 ( 4.43x) add_left_pred_rnd_acc_avx2: 208.9 ( 5.99x) add_left_pred_zero_c: 1253.7 ( 1.00x) add_left_pred_zero_ssse3: 280.0 ( 4.48x) add_left_pred_zero_avx2: 206.8 ( 6.06x) The checkasm test has been modified to use an unaligned destination for this test. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-19 20:55:37 +01:00
Andreas Rheinhardt	a6b8939e1e	avcodec/x86/lossless_videodsp: Remove SSSE3 functions using MMX regs These functions are only used on Conroe (they are overwritten by SSSE3 functions using xmm registers if the SSSE3SLOW flag is not set) which is very old (introduced in 2006), so remove them. Btw: The checkasm test (which uses declare_func and not declare_func_emms since `cd8a33bcce`) would fail on a Conroe, yet no one ever reported any such failure. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-19 20:54:44 +01:00
Andreas Rheinhardt	f96829b5bf	avcodec/x86/lossless_videoencdsp_init: Remove pointless av_unused Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:46 +01:00
Andreas Rheinhardt	abe6ba17fa	avcodec/x86/lossless_videoencdsp: Port sub_median_pred to NASM Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:43 +01:00
Andreas Rheinhardt	9ba33cc198	avcodec/x86/lossless_videoencdsp_init: Avoid special-casing first pixel Old benchmarks: sub_median_pred_c: 404.1 ( 1.00x) sub_median_pred_sse2: 20.5 (19.67x) New benchmarks: sub_median_pred_c: 408.5 ( 1.00x) sub_median_pred_sse2: 19.2 (21.27x) Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:40 +01:00
Andreas Rheinhardt	3a3e7080f1	avcodec/x86/lossless_videoencdsp_init: Port sub_median_pred to SSE2 Old benchmarks: sub_median_pred_c: 405.7 ( 1.00x) sub_median_pred_mmxext: 35.1 (11.57x) New benchmarks: sub_median_pred_c: 404.1 ( 1.00x) sub_median_pred_sse2: 20.5 (19.67x) Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:35 +01:00
Andreas Rheinhardt	3144652588	avcodec/x86/lossless_videoencdsp_init: Don't read too often sub_median_pred_mmxext() calculates a predictor from the left, top and topleft pixel values. The topleft values need to be initialized differently for the first loop iteration than for the others in order to avoid reading ptr[-1]. So it has been initialized before the loop and then read again at the end of the loop, so that the last value read was never used. Yet this can lead to reads beyond the end of the buffer, e.g. with `ffmpeg -cpuflags mmx+mmxext -f lavfi -i "color=size=64x4,format=yuv420p" -vf vflip -c:v ffvhuff -pred median -frames 1 -f null -`. Fix this by not reading the value at the end of the loop. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:29 +01:00
Andreas Rheinhardt	2b9aea7756	avcodec/x86/lossless_videoencdsp_init: Don't read from before the buffer sub_median_pred_mmxext() calculates a predictor from the left, top and topleft pixel values. The left value is simply read via ptr[-1], although this is not guaranteed to be inside the buffer in case of negative strides. This happens e.g. with `ffmpeg -i fate-suite/mpeg2/dvd_single_frame.vob -vf vflip -c:v magicyuv -pred median -f null -`. Fix this by reading the first value like the topleft value. Also change the documentation of sub_median_pred to reflect this change (and the one from `791b5954bc`). Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-14 10:16:25 +01:00
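The two out-of-bounds fixes above both concern how the left and topleft neighbours are obtained. A scalar sketch of a median predictor that takes both as explicit seeds, and therefore never indexes before or past the row buffers, could look as follows. The names `median3` and `sub_median_pred_sketch` and the parameter layout are assumptions made for illustration; this is not the assembly from the commits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Median of three integers. */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) b = c;                        /* b = min(max, c) */
    return a > b ? a : b;
}

/* Writes cur[i] minus its median prediction into dst[i] for i in [0, w).
 * The left and topleft neighbours of the first pixel come from *left and
 * *left_top, so only indices 0..w-1 of top and cur are ever read. */
static void sub_median_pred_sketch(uint8_t *dst, const uint8_t *top,
                                   const uint8_t *cur, size_t w,
                                   int *left, int *left_top)
{
    assert(dst && top && cur && left && left_top);
    int l = *left, lt = *left_top;

    for (size_t i = 0; i < w; i++) {
        const int pred = median3(l, top[i], (l + top[i] - lt) & 0xFF);
        lt     = top[i];   /* becomes topleft of the next pixel */
        l      = cur[i];   /* becomes left of the next pixel */
        dst[i] = (uint8_t)(l - pred);
    }
    *left     = l;
    *left_top = lt;
}
```

Because the neighbours are carried in local variables and updated from values already loaded in the current iteration, no extra load is needed at the end of the loop, which is the kind of trailing read the first of the two commits removes.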
Andreas Rheinhardt	dc843cdd9a	avcodec/x86/vp9mc: Reindent after the previous commit Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:35:07 +01:00
Andreas Rheinhardt	65e71b0837	avcodec/x86/vp9mc: Deduplicate coefficient tables Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:35:01 +01:00
Andreas Rheinhardt	38e2174ce4	avcodec/x86/vp9mc: Avoid MMX regs in width 4 hor 8tap funcs Using wider registers (and pshufb) makes it possible to halve the number of pmaddubsw used. It is also ABI compliant (no more missing emms). Old benchmarks: vp9_avg_8tap_smooth_4h_8bpp_c: 97.6 ( 1.00x) vp9_avg_8tap_smooth_4h_8bpp_ssse3: 15.0 ( 6.52x) vp9_avg_8tap_smooth_4hv_8bpp_c: 342.9 ( 1.00x) vp9_avg_8tap_smooth_4hv_8bpp_ssse3: 54.0 ( 6.35x) vp9_put_8tap_smooth_4h_8bpp_c: 94.9 ( 1.00x) vp9_put_8tap_smooth_4h_8bpp_ssse3: 14.2 ( 6.67x) vp9_put_8tap_smooth_4hv_8bpp_c: 325.9 ( 1.00x) vp9_put_8tap_smooth_4hv_8bpp_ssse3: 52.5 ( 6.20x) New benchmarks: vp9_avg_8tap_smooth_4h_8bpp_c: 97.6 ( 1.00x) vp9_avg_8tap_smooth_4h_8bpp_ssse3: 10.8 ( 9.08x) vp9_avg_8tap_smooth_4hv_8bpp_c: 342.4 ( 1.00x) vp9_avg_8tap_smooth_4hv_8bpp_ssse3: 38.8 ( 8.82x) vp9_put_8tap_smooth_4h_8bpp_c: 94.7 ( 1.00x) vp9_put_8tap_smooth_4h_8bpp_ssse3: 9.7 ( 9.75x) vp9_put_8tap_smooth_4hv_8bpp_c: 321.7 ( 1.00x) vp9_put_8tap_smooth_4hv_8bpp_ssse3: 37.0 ( 8.69x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:34:35 +01:00
Andreas Rheinhardt	dd5dc254ff	avcodec/x86/vp9mc: Avoid reloads, MMX regs in width 4 vert 8tap func Four rows of four bytes fit into one xmm register; therefore one can arrange the rows as follows (A,B,C: first, second, third etc. row) xmm0: ABABABAB BCBCBCBC xmm1: CDCDCDCD DEDEDEDE xmm2: EFEFEFEF FGFGFGFG xmm3: GHGHGHGH HIHIHIHI and use four pmaddubsw to calculate two rows in parallel. The history fits into four registers, making this possible even on 32bit systems. Old benchmarks (Unix 64): vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x) vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x) New benchmarks (Unix 64): vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x) vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x) Old benchmarks (x86-32): vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x) vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x) New benchmarks (x86-32): vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x) vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x) Loading the constants into registers did not turn out to be advantageous here (not to mention Win64, where this would necessitate saving and restoring ever more registers); probably because there are only two loop iterations. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:31:59 +01:00
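The register layout described in this message can be modelled in scalar C to see why four pmaddubsw yield two output rows. The following is a sketch, not the assembly from the commit: all function names are hypothetical, the rounding and shift that follow the multiply-accumulate are left out, and the four partial results are summed in plain `int` rather than with whatever word additions the real code uses.

```c
#include <assert.h>
#include <stdint.h>

static int16_t sat16(int v)
{
    return v > INT16_MAX ? INT16_MAX : v < INT16_MIN ? INT16_MIN : (int16_t)v;
}

/* Scalar model of pmaddubsw: unsigned bytes of a times signed bytes of b,
 * adjacent products added with signed saturation into 8 words. */
static void pmaddubsw_model(const uint8_t a[16], const int8_t b[16], int16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = sat16(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]);
}

/* Build one register "r0 r1 r0 r1 ... | r1 r2 r1 r2 ..." from three rows. */
static void interleave_rows(const uint8_t r0[4], const uint8_t r1[4],
                            const uint8_t r2[4], uint8_t reg[16])
{
    for (int x = 0; x < 4; x++) {
        reg[2 * x]         = r0[x];
        reg[2 * x + 1]     = r1[x];
        reg[8 + 2 * x]     = r1[x];
        reg[8 + 2 * x + 1] = r2[x];
    }
}

/* rows: nine input rows A..I of four pixels; taps: the eight coefficients.
 * out[0..3] receives the unrounded sums of the first output row,
 * out[4..7] those of the second output row. */
static void filter_two_rows(const uint8_t rows[9][4], const int8_t taps[8], int out[8])
{
    assert(rows && taps && out);
    for (int i = 0; i < 8; i++)
        out[i] = 0;
    for (int k = 0; k < 4; k++) {            /* one pmaddubsw per tap pair */
        uint8_t reg[16];
        int8_t  coef[16];
        int16_t prod[8];

        interleave_rows(rows[2 * k], rows[2 * k + 1], rows[2 * k + 2], reg);
        for (int i = 0; i < 16; i += 2) {
            coef[i]     = taps[2 * k];
            coef[i + 1] = taps[2 * k + 1];
        }
        pmaddubsw_model(reg, coef, prod);
        for (int i = 0; i < 8; i++)
            out[i] += prod[i];
    }
}
```

Each of the four registers pairs two consecutive rows for the first output row in its low half and the next two for the second output row in its high half, so the low halves accumulate taps 0..7 over rows A..H and the high halves the same taps over rows B..I.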
Andreas Rheinhardt	36204fbc3c	avcodec/vp9itxfm{,_16bpp}: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:51 +01:00
Andreas Rheinhardt	ea37f49aed	avcodec/vp9intrapred: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:44 +01:00
Andreas Rheinhardt	6e418af810	avcodec/vp9mc: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:05 +01:00
Kacper Michajłow	5b5d51cbc1	avcodec/x86/h264_idct: fix version check for NASM 3 and newer Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2025-12-08 17:43:29 +00:00
Andreas Rheinhardt	050c80a526	avcodec/x86/vp8dsp: Don't use saturated addition when unnecessary For the epel functions, there can be no overflow as long as the sum contains only one of the two large central coefficients; for bilinear functions, there can be no overflow whatsoever. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
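The overflow argument in this message is a worst-case bound on sums of coefficient times 8-bit pixel. A small sketch of that arithmetic follows; `worst_case_range` and `fits_int16` are hypothetical helpers, the six-tap set in the usage note is VP8's half-pel filter used purely as an illustration, and the bilinear pair is an assumed weight pair summing to 128. How the assembly actually groups its partial sums is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Worst-case range of sum(coef[i] * pixel[i]) for pixels in 0..255:
 * positive coefficients push the maximum up, negative ones the minimum down. */
static void worst_case_range(const int *coef, size_t n, long *lo, long *hi)
{
    assert(coef && lo && hi);
    *lo = *hi = 0;
    for (size_t i = 0; i < n; i++) {
        if (coef[i] > 0)
            *hi += 255L * coef[i];
        else
            *lo += 255L * coef[i];
    }
}

/* Returns 1 if every possible sum fits into a signed 16-bit word,
 * i.e. if a plain (non-saturating) word addition is safe. */
static int fits_int16(const int *coef, size_t n)
{
    long lo, hi;

    worst_case_range(coef, n, &lo, &hi);
    return lo >= INT16_MIN && hi <= INT16_MAX;
}
```

For example, `fits_int16` accepts the pair {64, 64} (maximum 32640) and the partial set {3, -16, 77} containing a single large tap (range -4080 to 20400), but rejects the full set {3, -16, 77, 77, -16, 3}, whose maximum of 40800 exceeds 32767. That matches the message's distinction between sums with one and with both central coefficients.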
Andreas Rheinhardt	575e9e9c08	avcodec/x86/vp8dsp: Reduce number of coefficient tables By changing the permutations used in the epel8_h{4,6} case we can simply reuse the coefficient tables from the vertical epel filters. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	99fb257f58	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h6_ssse3 Doubling the register width made it possible to avoid a pshufb and a pmaddubsw. Old benchmarks: vp8_put_epel4_h6_c: 115.9 ( 1.00x) vp8_put_epel4_h6_ssse3: 20.2 ( 5.74x) vp8_put_epel4_h6v4_c: 276.3 ( 1.00x) vp8_put_epel4_h6v4_ssse3: 58.6 ( 4.71x) vp8_put_epel4_h6v6_c: 363.6 ( 1.00x) vp8_put_epel4_h6v6_ssse3: 62.5 ( 5.82x) New benchmarks: vp8_put_epel4_h6_c: 116.4 ( 1.00x) vp8_put_epel4_h6_ssse3: 16.0 ( 7.29x) vp8_put_epel4_h6v4_c: 280.9 ( 1.00x) vp8_put_epel4_h6v4_ssse3: 44.3 ( 6.33x) vp8_put_epel4_h6v6_c: 365.6 ( 1.00x) vp8_put_epel4_h6v6_ssse3: 53.1 ( 6.89x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	3135bc0d3a	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h4_ssse3 Doubling the register width makes it possible to use only one pshufb and pmaddubsw. Old benchmarks: vp8_put_epel4_h4_c: 82.8 ( 1.00x) vp8_put_epel4_h4_ssse3: 13.9 ( 5.96x) New benchmarks: vp8_put_epel4_h4_c: 82.7 ( 1.00x) vp8_put_epel4_h4_ssse3: 11.7 ( 7.08x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	714cbf1c70	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_v4_ssse3 Switching to xmm registers makes it possible to process two rows in parallel, leading to speedups. It is also ABI compliant (no more missing emms). Old benchmarks: vp8_put_epel4_v4_c: 96.8 ( 1.00x) vp8_put_epel4_v4_ssse3: 28.2 ( 3.43x) New benchmarks: vp8_put_epel4_v4_c: 95.1 ( 1.00x) vp8_put_epel4_v4_ssse3: 22.8 ( 4.17x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00


2920 commits