ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-02-15 15:30:23 +00:00

Author	SHA1	Message	Date
Andreas Rheinhardt	dc843cdd9a	avcodec/x86/vp9mc: Reindent after the previous commit Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:35:07 +01:00
Andreas Rheinhardt	65e71b0837	avcodec/x86/vp9mc: Deduplicate coefficient tables Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:35:01 +01:00
Andreas Rheinhardt	38e2174ce4	avcodec/x86/vp9mc: Avoid MMX regs in width 4 hor 8tap funcs Using wider registers (and pshufb) allows to halve the number of pmaddubsw used. It is also ABI compliant (no more missing emms). Old benchmarks: vp9_avg_8tap_smooth_4h_8bpp_c: 97.6 ( 1.00x) vp9_avg_8tap_smooth_4h_8bpp_ssse3: 15.0 ( 6.52x) vp9_avg_8tap_smooth_4hv_8bpp_c: 342.9 ( 1.00x) vp9_avg_8tap_smooth_4hv_8bpp_ssse3: 54.0 ( 6.35x) vp9_put_8tap_smooth_4h_8bpp_c: 94.9 ( 1.00x) vp9_put_8tap_smooth_4h_8bpp_ssse3: 14.2 ( 6.67x) vp9_put_8tap_smooth_4hv_8bpp_c: 325.9 ( 1.00x) vp9_put_8tap_smooth_4hv_8bpp_ssse3: 52.5 ( 6.20x) New benchmarks: vp9_avg_8tap_smooth_4h_8bpp_c: 97.6 ( 1.00x) vp9_avg_8tap_smooth_4h_8bpp_ssse3: 10.8 ( 9.08x) vp9_avg_8tap_smooth_4hv_8bpp_c: 342.4 ( 1.00x) vp9_avg_8tap_smooth_4hv_8bpp_ssse3: 38.8 ( 8.82x) vp9_put_8tap_smooth_4h_8bpp_c: 94.7 ( 1.00x) vp9_put_8tap_smooth_4h_8bpp_ssse3: 9.7 ( 9.75x) vp9_put_8tap_smooth_4hv_8bpp_c: 321.7 ( 1.00x) vp9_put_8tap_smooth_4hv_8bpp_ssse3: 37.0 ( 8.69x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:34:35 +01:00
Andreas Rheinhardt	dd5dc254ff	avcodec/x86/vp9mc: Avoid reloads, MMX regs in width 4 vert 8tap func Four rows of four bytes fit into one xmm register; therefore one can arrange the rows as follows (A,B,C: first, second, third etc. row) xmm0: ABABABAB BCBCBCBC xmm1: CDCDCDCD DEDEDEDE xmm2: EFEFEFEF FGFGFGFG xmm3: GHGHGHGH HIHIHIHI and use four pmaddubsw to calculate two rows in parallel. The history fits into four registers, making this possible even on 32bit systems. Old benchmarks (Unix 64): vp9_avg_8tap_smooth_4v_8bpp_c: 105.5 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 16.4 ( 6.44x) vp9_put_8tap_smooth_4v_8bpp_c: 99.3 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 15.4 ( 6.44x) New benchmarks (Unix 64): vp9_avg_8tap_smooth_4v_8bpp_c: 105.0 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 11.8 ( 8.90x) vp9_put_8tap_smooth_4v_8bpp_c: 99.7 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 10.7 ( 9.30x) Old benchmarks (x86-32): vp9_avg_8tap_smooth_4v_8bpp_c: 138.2 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.93x) vp9_put_8tap_smooth_4v_8bpp_c: 123.6 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 28.0 ( 4.41x) New benchmarks (x86-32): vp9_avg_8tap_smooth_4v_8bpp_c: 139.0 ( 1.00x) vp9_avg_8tap_smooth_4v_8bpp_ssse3: 20.1 ( 6.92x) vp9_put_8tap_smooth_4v_8bpp_c: 124.5 ( 1.00x) vp9_put_8tap_smooth_4v_8bpp_ssse3: 19.9 ( 6.26x) Loading the constants into registers did not turn out to be advantageous here (not to mention Win64, where this would necessitate saving and restoring ever more registers); probably because there are only two loop iterations. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:31:59 +01:00
Andreas Rheinhardt	36204fbc3c	avcodec/vp9itxfm{,_16bpp}: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:51 +01:00
Andreas Rheinhardt	ea37f49aed	avcodec/vp9intrapred: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:44 +01:00
Andreas Rheinhardt	6e418af810	avcodec/vp9mc: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMXEXT functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-08 19:27:05 +01:00
Kacper Michajłow	5b5d51cbc1	avcodec/x86/h264_idct: fix version check for NASM 3 and newer Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2025-12-08 17:43:29 +00:00
Andreas Rheinhardt	050c80a526	avcodec/x86/vp8dsp: Don't use saturated addition when unnecessary For the epel functions, there can be no overflow as long as the sum contains only one of the two large central coefficients; for bilinear functions, there can be no overflow whatsoever. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	575e9e9c08	avcodec/x86/vp8dsp: Reduce number of coefficient tables By changing the permutations used in the epel8_h{4,6} case we can simply reuse the coefficient tables from the vertical epel filters. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	99fb257f58	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h6_ssse3 Doubling the register width allowed to avoid a pshufb and a pmaddubsw. Old benchmarks: vp8_put_epel4_h6_c: 115.9 ( 1.00x) vp8_put_epel4_h6_ssse3: 20.2 ( 5.74x) vp8_put_epel4_h6v4_c: 276.3 ( 1.00x) vp8_put_epel4_h6v4_ssse3: 58.6 ( 4.71x) vp8_put_epel4_h6v6_c: 363.6 ( 1.00x) vp8_put_epel4_h6v6_ssse3: 62.5 ( 5.82x) New benchmarks: vp8_put_epel4_h6_c: 116.4 ( 1.00x) vp8_put_epel4_h6_ssse3: 16.0 ( 7.29x) vp8_put_epel4_h6v4_c: 280.9 ( 1.00x) vp8_put_epel4_h6v4_ssse3: 44.3 ( 6.33x) vp8_put_epel4_h6v6_c: 365.6 ( 1.00x) vp8_put_epel4_h6v6_ssse3: 53.1 ( 6.89x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	3135bc0d3a	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h4_ssse3 Doubling the register width allows to use only one pshufb and pmaddubsw. Old benchmarks: vp8_put_epel4_h4_c: 82.8 ( 1.00x) vp8_put_epel4_h4_ssse3: 13.9 ( 5.96x) New benchmarks: vp8_put_epel4_h4_c: 82.7 ( 1.00x) vp8_put_epel4_h4_ssse3: 11.7 ( 7.08x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	714cbf1c70	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_v4_ssse3 Switching to xmm registers allows to process two rows in parallel, leading to speedups. It is also ABI compliant (no more missing emms). Old benchmarks: vp8_put_epel4_v4_c: 96.8 ( 1.00x) vp8_put_epel4_v4_ssse3: 28.2 ( 3.43x) New benchmarks: vp8_put_epel4_v4_c: 95.1 ( 1.00x) vp8_put_epel4_v4_ssse3: 22.8 ( 4.17x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	f017806829	avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_v6_ssse3 Switching to xmm registers allows to process two rows in parallel, leading to speedups. It is also ABI compliant (no more missing emms). Old benchmarks: vp8_put_epel4_v6_c: 132.8 ( 1.00x) vp8_put_epel4_v6_ssse3: 34.3 ( 3.87x) New benchmarks: vp8_put_epel4_v6_c: 131.5 ( 1.00x) vp8_put_epel4_v6_ssse3: 27.1 ( 4.86x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	7411998757	avcodec/x86/vp8dsp: Avoid unpacking multiple times Always pair row i with row i+2 for the vertical four-tap filter and row i+3 for the vertical six-tap filter (instead of pairing the first with the sixth, the second with the third and the fourth and the fifth). This allows to unpack each row only once instead of (at most) three times. Old benchmarks: vp8_put_epel4_v4_c: 98.4 ( 1.00x) vp8_put_epel4_v4_ssse3: 28.6 ( 3.44x) vp8_put_epel4_v6_c: 131.6 ( 1.00x) vp8_put_epel4_v6_ssse3: 38.5 ( 3.42x) vp8_put_epel8_v4_c: 362.5 ( 1.00x) vp8_put_epel8_v4_sse2: 63.8 ( 5.68x) vp8_put_epel8_v4_ssse3: 44.4 ( 8.16x) vp8_put_epel8_v6_c: 538.3 ( 1.00x) vp8_put_epel8_v6_sse2: 86.5 ( 6.22x) vp8_put_epel8_v6_ssse3: 57.0 ( 9.44x) vp8_put_epel16_v6_c: 1044.6 ( 1.00x) vp8_put_epel16_v6_sse2: 158.0 ( 6.61x) vp8_put_epel16_v6_ssse3: 106.7 ( 9.79x) New benchmarks: vp8_put_epel4_v4_c: 100.0 ( 1.00x) vp8_put_epel4_v4_ssse3: 28.4 ( 3.52x) vp8_put_epel4_v6_c: 131.7 ( 1.00x) vp8_put_epel4_v6_ssse3: 34.3 ( 3.84x) vp8_put_epel8_v4_c: 364.4 ( 1.00x) vp8_put_epel8_v4_sse2: 63.7 ( 5.72x) vp8_put_epel8_v4_ssse3: 43.3 ( 8.42x) vp8_put_epel8_v6_c: 550.2 ( 1.00x) vp8_put_epel8_v6_sse2: 86.4 ( 6.37x) vp8_put_epel8_v6_ssse3: 52.9 (10.40x) vp8_put_epel16_v6_c: 1052.5 ( 1.00x) vp8_put_epel16_v6_sse2: 158.3 ( 6.65x) vp8_put_epel16_v6_ssse3: 98.9 (10.64x) Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	24cdd4100d	avcodec/x86/vp8dsp_init: Remove unused macro Forgotten in `6a551f1405`. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	76900089fb	avcodec/x86/vp8dsp: Avoid reload Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	86aa1b81ec	avcodec/x86/vp8dsp: Increment src pointer earlier Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	e59ed3470d	avcodec/x86/vp8dsp: Directly use negated stride There is a register available. No change in benchmarks here. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:37 +01:00
Andreas Rheinhardt	8fb6b0c733	avcodec/x86/vp8dsp: Don't use MMX registers in put_vp8_pixels8 Use GPRs on x64 and xmm registers else (using GPRs reduces codesize). This avoids clobbering the floating point state and therefore no longer breaks the ABI. No change in benchmarks here. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:36 +01:00
Andreas Rheinhardt	ed5e0f9c68	avcodec/x86/vp8dsp: Remove MMXEXT functions overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-04 15:17:36 +01:00
Andreas Rheinhardt	c22c2c5e03	avcodec/mpegvideo: Port dct_unquantize_mpeg2_intra_mmx to SSE2 Benefits from wider registers. Benchmarks: dct_unquantize_mpeg2_intra_c: 228.2 ( 1.00x) dct_unquantize_mpeg2_intra_mmx: 28.2 ( 8.10x) dct_unquantize_mpeg2_intra_sse2: 18.4 (12.37x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	6e2153111d	avcodec/x86/mpegvideo: Port dct_unquantize_mpeg2_inter_mmx to SSSE3 Benefits from wider registers, pabsw and psignw. Benchmarks: dct_unquantize_mpeg2_inter_c: 131.2 ( 1.00x) dct_unquantize_mpeg2_inter_mmx: 50.2 ( 2.62x) dct_unquantize_mpeg2_inter_ssse3: 20.5 ( 6.38x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	60084b1369	avcodec/x86/mpegvideo: Port MPEG-1 unquantize functions to SSSE3 Benefits from wider registers and pabsw, psignw. Benchmarks: dct_unquantize_mpeg1_inter_c: 343.0 ( 1.00x) dct_unquantize_mpeg1_inter_mmx: 50.6 ( 6.78x) dct_unquantize_mpeg1_inter_ssse3: 17.2 (19.94x) dct_unquantize_mpeg1_intra_c: 352.1 ( 1.00x) dct_unquantize_mpeg1_intra_mmx: 48.8 ( 7.22x) dct_unquantize_mpeg1_intra_ssse3: 19.5 (18.03x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	1cb987d25b	avcodec/x86/mpegvideo: Port dct_unquantize_h263_{intra,inter}_mmx to SSSE3 It benefits from wider registers and psignw. Benchmarks: dct_unquantize_h263_inter_c: 88.3 ( 1.00x) dct_unquantize_h263_inter_mmx: 24.7 ( 3.58x) dct_unquantize_h263_inter_ssse3: 9.3 ( 9.47x) dct_unquantize_h263_intra_c: 93.7 ( 1.00x) dct_unquantize_h263_intra_mmx: 30.6 ( 3.06x) dct_unquantize_h263_intra_ssse3: 16.5 ( 5.69x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	a9a23925df	avcodec/x86/mpegvideo: Don't duplicate register Currently several inline ASM blocks used a value as an input and rax as clobber register. The input value was just moved into the register which then served as loop counter. This is wasteful, as one can just use the value's register directly as loop counter. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	1fa8ffc1db	avcodec/x86/mpegvideo: Improve unquantizing MPEG-2 intra blocks Unquantizing involves calculating (block[j] * qscale * quant_matrix[j]) / 16 where / rounds towards zero. Arithmetic right shifts naturally round towards -inf, so the earlier code calculated the absolute value first, then used a right-shift and then negated the result if necessary. This commit uses a different procedure: It biases the product for negative values of block[j] by 0xf. The combination of this and the arithmetic right shift is the same as rounding towards zero. Furthermore, a write-only store to mm7 has been removed. Benchmarks: dct_unquantize_mpeg2_intra_c: 214.3 ( 1.00x) dct_unquantize_mpeg2_intra_mmx (old): 43.0 ( 4.98x) dct_unquantize_mpeg2_intra_mmx (new): 28.4 ( 7.56x) (The bitexact flag and the test for correctness have been removed from checkasm for the benchmarks.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	6d56807a06	avcodec/x86/mpegvideo: Use correct inline assembly constraints The H.263 unquantize functions modified an input parameter. (And they did so since this code was added in `7f3f5ec87b`. I am surprised that this didn't cause issues, particularly with the intra function.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:23:43 +01:00
Andreas Rheinhardt	358c569b05	avcodec/mpegvideo_unquantize: Constify MPVContext pointee Also use MPVContext instead of MpegEncContext. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-12-03 10:20:41 +01:00
Andreas Rheinhardt	eccf130fdb	{lib{avcodec,swscale}/x86/,}Makefile: Kill MMX-OBJS Reviewed-by: Kacper Michajłow <kasper93@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 22:20:13 +01:00
Andreas Rheinhardt	ba94177242	avcodec/x86/Makefile: Only compile ASM init files when X86ASM is enabled To do so, simply add these init files to X86ASM-OBJS instead of OBJS in the Makefile. The former is already used for the actual assembly files, but using them for the C init files just works, because the build system uses file extensions to derive whether it is a C or a NASM file. This avoids compiling unused function stubs and also reduces our reliance on DCE: We don't add %if checks to the asm files except for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4 functions will be available. It also allows to remove HAVE_X86ASM checks in these init files. Reviewed-by: Kacper Michajłow <kasper93@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 22:20:13 +01:00
Andreas Rheinhardt	65b4feb782	avcodec/x86/Makefile: Remove redundant WebP decoder->vp8dsp dependencies Redundant since `35b02732b9`. Reviewed-by: Kacper Michajłow <kasper93@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 22:20:13 +01:00
Andreas Rheinhardt	89f984e3d1	avcodec/x86/h264_idct: Fix ff_h264_luma_dc_dequant_idct_sse2 checkasm failures ff_h264_luma_dc_dequant_idct_sse2() does not pass checkasm for certain seeds, because the input to packssdw no longer fits into an int16_t, leading to saturation, where the C code just truncates. I don't know whether the spec contains provisions that ensure that valid input must not exceed 16 bit or whether such inputs (even if invalid) can be triggered by the actual code and not only the test. This commit adapts the behavior of the function to the C reference code to fix the test. packssdw is avoided, instead the lower words are directly transferred to GPRs to be written out. This has unfortunately led to a slight performance regression here (14.5 vs 15.1 cycles). Fixes issue #20835. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	e6ae2802a3	avcodec/x86/h264_idct: Deduplicate generating constant pw_1 is currently loaded in both codepaths. Generate it earlier instead. Gives tiny speedups (15 vs 14.5 cycles) and reduces codesize. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	ada0a81577	avcodec/x86/h264_idct: Don't use MMX registers in ff_h264_luma_dc_dequant_idct_sse2 It is ABI compliant and gives a tiny speedup here (and is 16B smaller). Old benchmarks: h264_luma_dc_dequant_idct_8_c: 33.2 ( 1.00x) h264_luma_dc_dequant_idct_8_sse2: 16.0 ( 2.07x) New benchmarks: h264_luma_dc_dequant_idct_8_c: 33.0 ( 1.00x) h264_luma_dc_dequant_idct_8_sse2: 15.0 ( 2.20x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	012c25bac4	avcodec/x86/h264_idct: Zero with full-width stores Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	b9cbbd9074	avcodec/x86/h264_idct: Use tail call where advantageous It is possible on UNIX64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	01ff05e4bc	avcodec/x86/h264_idct: Avoid call where possible Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	b51cbd4116	avcodec/x86/h264_idct: Remove redundant movsxdifnidn Only exported (i.e. cglobal) functions need it; stride is already sign-extended when it reaches any of the internal functions used here, so don't sign-extend again. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Andreas Rheinhardt	18019f177e	avcodec/x86/h264idct: Remove dead MMX macros Forgotten in `4618f36a24`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-30 00:15:43 +01:00
Piotr Pawlowski	372dab2a4d	All: Removed reliance on compiler performing dead code elimination, changed various macro constant checks from if() to #if	2025-11-28 19:52:51 +01:00
Andreas Rheinhardt	7018ce14df	avcodec/x86/vp6dsp: Avoid packing+unpacking Store the intermediate values as words, clipped to the 0..255 range instead. Old benchmarks: filter_diag4_c: 353.4 ( 1.00x) filter_diag4_sse2: 57.5 ( 6.15x) New benchmarks: filter_diag4_c: 350.6 ( 1.00x) filter_diag4_sse2: 55.1 ( 6.36x) Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:49 +01:00
Andreas Rheinhardt	300cd2c2f2	avcodec/x86/vp6dsp: Avoid saturated addition Only the two middle coefficients are so huge that overflow can happen. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:46 +01:00
Andreas Rheinhardt	dcc101167c	avcodec/x86/vp6dsp: Simplify splatting Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:43 +01:00
Andreas Rheinhardt	111fabf5b4	avcodec/x86/vp6dsp: Don't align the stack manually For most systems (particularly all x64), the stack is already guaranteed to be sufficiently aligned. So just use x86inc's stack feature which does the right thing. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:40 +01:00
Andreas Rheinhardt	363a34a7cb	avcodec/x86/vp6dsp: Fix outdated comment Forgotten in `6cb3ee80b3`. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:37 +01:00
Andreas Rheinhardt	962858169a	avcodec/vp6dsp: Constify source in vp6_filter_diag4 Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:32 +01:00
Andreas Rheinhardt	f397fe86c3	avcodec/vp56dsp: Separate VP5DSP and VP6DSP They don't have anything in common since `160ebe0a8d`. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-27 12:10:29 +01:00
Andreas Rheinhardt	81362b319e	avcodec/x86/me_cmp: Avoid call on UNIX64 The internal functions for calculating the hadamard difference of two 8x8 blocks have no epilogue on UNIX64, so one can avoid the call altogether by placing the 8x8 function so that it directly falls into the internal function. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-26 00:01:09 +00:00
Andreas Rheinhardt	23720df371	avcodec/me_cmp: Remove MMXEXT hadamard diff functions The SSE2 and SSSE3 functions are now available everywhere, making the MMXEXT functions irrelevant. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-11-26 00:01:09 +00:00
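
The rounding argument in commit 1fa8ffc1db (MPEG-2 intra unquantize) can be checked with a small scalar sketch. This is not the FFmpeg code: the function names below are hypothetical, and the sketch works on an already computed product instead of block[j] * qscale * quant_matrix[j]. Since qscale and the matrix entries are positive, the product is negative exactly when block[j] is. It also assumes that >> on a negative signed value is an arithmetic shift, which the C standard leaves implementation-defined but which holds on mainstream compilers.

```c
#include <stdint.h>

/* Reference behaviour: C integer division truncates towards zero. */
static int32_t div16_trunc(int32_t product)
{
    return product / 16;
}

/* Older approach described in the commit message:
 * take the absolute value, shift, then restore the sign. */
static int32_t div16_abs_shift(int32_t product)
{
    int32_t mag = product < 0 ? -product : product;
    int32_t q   = mag >> 4;
    return product < 0 ? -q : q;
}

/* Bias approach: add 0xf to negative products only, then shift.
 * Assumes an arithmetic right shift (rounds towards -inf). */
static int32_t div16_bias_shift(int32_t product)
{
    int32_t bias = product < 0 ? 0xf : 0;
    return (product + bias) >> 4;
}
```

For example, -17 plus the bias is -2, and -2 >> 4 is -1, which matches -17 / 16 in C; without the bias the shift alone would give -2.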

2883 commits
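
The checkasm mismatch described in commit 89f984e3d1 comes down to two ways of narrowing a 32-bit value to 16 bits. The scalar sketch below illustrates the difference per element; the function names are hypothetical and this is not the FFmpeg implementation. The truncating variant assumes two's complement conversion to int16_t, which is implementation-defined before C23 but is what common compilers do.

```c
#include <stdint.h>

/* Per-element behaviour of packssdw: signed saturation to the int16_t range. */
static int16_t pack_saturate(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Truncation: keep only the low 16 bits of the value.
 * Assumes two's complement conversion from uint16_t to int16_t. */
static int16_t pack_truncate(int32_t v)
{
    return (int16_t)(uint16_t)v;
}
```

The two agree for every input that fits into an int16_t and differ only outside that range: 40000 saturates to 32767 but truncates to -25536. That is consistent with the failure showing up only for certain seeds.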