Andreas Rheinhardt
0ddece40c5
avcodec/x86/vvc/alf: Simplify vb_pos comparisons
...
The value of vb_pos at vb_bottom, vb_above is known
at compile time, so one can avoid the modifications
to vb_pos and just compare against immediates.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
1960320112
avcodec/x86/vvc/alf: Avoid pointless wrappers for alf_filter
...
They are completely unnecessary for the 8bit case (which only
handles 8bit) and overly complicated for the 10 and 12bit cases:
All one needs to do is set up the (1<<bpp)-1 vector register
and jmp from (say) the 12bpp function stub to a point inside
the 10bpp function. The way it is done here even allows sharing
the prologue between the two functions.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
467f8d8415
avcodec/x86/vvc/alf: Improve offsetting pointers
...
The pointer offsetting can be combined with an earlier lea
in the loop processing 16 pixels at a time; it is unnecessary
for the tail, because the new values will be overwritten
immediately afterwards anyway.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
cb5f6c055b
avcodec/x86/vvc/alf: Don't modify rsp unnecessarily
...
The vvc_alf_filter functions don't use x86inc's stack management
feature at all; they merely push and pop some regs themselves.
So don't tell x86inc to provide stack space (which in this case
entails aligning the stack).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
38062ebd18
avcodec/x86/vvc/alf: Remove pointless counter, stride
...
Each luma alf block has 2*12 auxiliary coefficients associated
with it that the alf_filter functions consume; the C version
simply increments the pointers.
The x64 dsp function meanwhile does things differently:
The vvc_alf_filter functions have three levels of loops.
The middle layer uses two counters, one of which is
just the horizontal offset xd in the current line. It is only
used for addressing these auxiliary coefficients, and
yet work is needed to translate it into the coefficient
offset, namely a *3 via lea and a *2 scale.
Furthermore, the base pointers of the coefficients are incremented
in the outer loop; the stride used for this is calculated
in the C wrapper functions. Moreover, due to GPR pressure, xd
is reused as loop counter for the innermost loop; the
xd from the middle loop is pushed to the stack.
Apart from the translation from horizontal offset to coefficient
offset all of the above has been done for chroma, too, although
the coefficient pointers don't get modified for them at all.
This commit changes this to just increment the pointers
after reading the relevant coefficients.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
d2e7fe5b19
avcodec/x86/vvc/alf: Improve deriving ac
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
5da3cab645
avcodec/x86/vvc/alf: Avoid broadcast
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
c9da0193ff
avcodec/x86/vvc/alf: Don't use 64bit where unnecessary
...
Reduces codesize (avoids REX prefixes).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
a489a623fb
avcodec/x86/vvc/alf: Use memory sources directly
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
df7885d6c3
avcodec/x86/vvc/alf: Improve writing classify parameters
...
The permutation that was applied before the write macro
is actually only beneficial when there are 16 entries to write,
so move it into the macro that writes 16 entries and optimize
the other macro.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
1bc91eb552
avcodec/x86/vvc/alf: Avoid checking twice
...
Also avoids a vpermq in case width is eight.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
e4a9d54e48
avcodec/x86/vvc/alf: Avoid nonvolatile registers
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
a2d9cd6dcb
avcodec/x86/vvc/alf: Don't calculate twice
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
01a897020e
avcodec/x86/vvc/alf: Use xmm registers where sufficient
...
One always has eight samples when processing the luma remainder,
so xmm registers are sufficient for everything. In fact, this
simplifies loading the luma parameters.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
9cb5280c0e
avcodec/x86/vvc/alf: Improve storing 8bpp
...
When width is known to be 8 (i.e. for luma that is not width 16),
the upper lane is unused, so use an xmm-sized packuswb and avoid
the vpermq altogether. For chroma not known to be 16 (i.e. 4, 8 or
12), defer extracting from the high lane until it is known to be needed.
Also do so via vextracti128 instead of vpermq (also do this for
bpp>8).
Also use vextracti128 and an xmm-sized packuswb in case of width 16
instead of an ymm-sized packuswb followed by vextracti128.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
56a4c15c23
avcodec/x86/vvc/alf: Avoid checking twice
...
Also avoid doing unnecessary work in the width==8 case.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
43cc8f05df
avcodec/x86/vvc/alf: Don't clip for 8bpp
...
packuswb does it already.
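As a scalar sketch of why the explicit clip is redundant (modelling one lane of packuswb, which saturates on its own):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one packuswb lane: a signed 16-bit value is
 * narrowed to unsigned 8-bit with saturation, so a separate clip
 * to [0, 255] before the pack does no additional work. */
static uint8_t packus_lane(int16_t x)
{
    return x < 0 ? 0 : x > 255 ? 255 : (uint8_t)x;
}
```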
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
a8b3b9c26f
avcodec/x86/vvc/alf: Remove unused array
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
94f9ad8061
avcodec/x86/vvc/alf: Use immediate for shift when possible
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
2159e40ab3
avcodec/x86/vvc/of: Avoid jump
...
At the end of the height==8 codepath, a jump to the RET at the end
of the height==16 codepath is performed. Yet the epilogue
is so cheap on Unix64 that this jump is not worthwhile.
On Win64 meanwhile, one can still avoid jumps: for width 16
>8bpp and width 8 8bpp content, a jump is performed
to the end of the height==8 codepath, immediately followed
by a jump to RET. These two jumps can be combined into one.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
2a93d09968
avcodec/x86/vvc/of: Ignore upper lane for width 8
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
9fe9fd95b6
avcodec/x86/vvc/of: Only clip for >8bpp
...
packuswb does it already for 8bpp.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
83694749ad
avcodec/x86/vvc/of,dsp_init: Avoid unnecessary wrappers
...
Write them in assembly instead; this exchanges a call+ret
with a jmp and also avoids passing (1<<bpp)-1 on the stack.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
d6ed5d6e3d
avcodec/x86/vvc/of: Deduplicate writing, save jump
...
Both the 8bpp width 16 and >8bpp width 8 cases write
16 contiguous bytes; deduplicate writing them. In fact,
by putting this block of code at the end of the SAVE macro,
one can even save a jmp for the width 16 8bpp case
(without adversely affecting the other cases).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
e7e19fcb1b
avcodec/x86/vvc/of: Avoid unnecessary jumps
...
For 8bpp width 8 content, an unnecessary jump was performed
for every write: first to the end of the SAVE_8BPC macro,
then to the end of the SAVE macro. This commit avoids these jumps.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
dee361a5bf
avcodec/x86/vvc/of: Avoid initialization, addition for last block
...
When processing the last block, we no longer need to preserve
some registers for the next block, allowing simplifications.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
c6205355b4
avcodec/x86/vvc/of: Avoid initialization, addition for first block
...
Output directly to the desired destination registers instead
of zeroing them, followed by adding the desired values.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
f177672df2
avcodec/x86/vvc/of: Avoid unnecessary additions
...
BDOF_PROF_GRAD just adds some values to m12,m13,
so one can avoid two pxor, paddw by not saving
these registers prematurely.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
561f37c023
avcodec/x86/huffyuvencdsp_init: Remove pointless av_unused
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
d345e902d2
avcodec/x86/huffyuvencdsp: Remove MMX sub_hfyu_median_pred_int16
...
Superseded by SSE2 and AVX2.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
154bcd1054
avcodec/x86/huffyuvencdsp: Add AVX2 sub_hfyu_median_pred_int16
...
This version can also process 16bpp.
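For context, a rough scalar model of the median prediction these SIMD versions implement (the structure paraphrases FFmpeg's C reference; names and the exact masking details are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* median of three ints */
static int mid_pred(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) { b = c; if (a > b) b = a; }  /* median of {a, b, c} */
    return b;
}

/* One step of the median predictor: predict the current sample from
 * the left, top and top-left neighbours, emit the masked residual,
 * and slide the left/left-top state forward. */
static uint16_t median_sub(uint16_t top, uint16_t cur,
                           uint16_t *left, uint16_t *left_top,
                           unsigned mask)
{
    int pred = mid_pred(*left, top, (*left + top - *left_top) & mask);
    *left_top = top;
    *left     = cur;
    return (uint16_t)((cur - pred) & mask);
}
```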
Benchmarks:
sub_hfyu_median_pred_int16_9bpp_c: 12667.7 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_mmxext: 1966.5 ( 6.44x)
sub_hfyu_median_pred_int16_9bpp_sse2: 997.6 (12.70x)
sub_hfyu_median_pred_int16_9bpp_avx2: 474.8 (26.68x)
sub_hfyu_median_pred_int16_9bpp_aligned_c: 12604.6 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_aligned_mmxext: 1964.6 ( 6.42x)
sub_hfyu_median_pred_int16_9bpp_aligned_sse2: 981.9 (12.84x)
sub_hfyu_median_pred_int16_9bpp_aligned_avx2: 462.6 (27.25x)
sub_hfyu_median_pred_int16_16bpp_c: 12592.5 ( 1.00x)
sub_hfyu_median_pred_int16_16bpp_avx2: 465.6 (27.04x)
sub_hfyu_median_pred_int16_16bpp_aligned_c: 12587.5 ( 1.00x)
sub_hfyu_median_pred_int16_16bpp_aligned_avx2: 462.5 (27.22x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
e575c2d496
avcodec/x86/huffyuvencdsp: Add SSE2 sub_hfyu_median_pred_int16
...
Contrary to the MMXEXT version, this version does not overread at all
(the MMXEXT version processes the 2*w input bytes in eight-byte
chunks and overreads by a further six bytes, because it loads
the next left and left-top values at the end of the loop,
i.e. it reads FFALIGN(2*w,8)+6 bytes instead of 2*w).
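The overread arithmetic from the parenthesis can be sketched directly (FFALIGN as in FFmpeg's macro):

```c
#include <assert.h>

#define FFALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Bytes the MMXEXT version reads for a row of w int16 samples:
 * the 2*w payload bytes rounded up to eight-byte chunks, plus six
 * more from loading the next left/left-top values at the end of
 * the final iteration. */
static int mmxext_bytes_read(int w)
{
    return FFALIGN(2 * w, 8) + 6;
}
```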
Benchmarks:
sub_hfyu_median_pred_int16_9bpp_c: 12673.6 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_mmxext: 1947.7 ( 6.51x)
sub_hfyu_median_pred_int16_9bpp_sse2: 993.9 (12.75x)
sub_hfyu_median_pred_int16_9bpp_aligned_c: 12596.1 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_aligned_mmxext: 1956.1 ( 6.44x)
sub_hfyu_median_pred_int16_9bpp_aligned_sse2: 989.4 (12.73x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:03:55 +01:00
Andreas Rheinhardt
6834762d7b
avcodec/huffyuvencdsp: Add width parameter to init
...
This allows using certain wide-register functions only when
there is enough work to do and when a whole register's width
can be read without overreading.
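A hypothetical sketch of what such an init-time selection could look like (the thresholds, names and enum are illustrative, not the actual FFmpeg code):

```c
#include <assert.h>

enum impl { IMPL_C, IMPL_SSE2, IMPL_AVX2 };

/* Illustrative selection: a SIMD version is only installed if one
 * full register load (32 bytes for AVX2, 16 for SSE2) cannot read
 * past the 2*width payload bytes of an int16 row. */
static enum impl pick(int width, int have_avx2, int have_sse2)
{
    if (have_avx2 && 2 * width >= 32)
        return IMPL_AVX2;
    if (have_sse2 && 2 * width >= 16)
        return IMPL_SSE2;
    return IMPL_C;
}
```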
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 11:58:16 +01:00
Andreas Rheinhardt
2268ba89f0
avcodec/huffyuvencdsp: Pass bpp, not AVPixelFormat for init
...
Avoids having to get a pixel format descriptor.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 11:56:57 +01:00
Andreas Rheinhardt
aa483bc422
avcodec/x86/bswapdsp: Avoid aligned vs unaligned codepaths for AVX2
...
For modern CPUs (like those supporting AVX2), loads and stores
using the unaligned versions of instructions are as fast
as the aligned ones when the address is aligned, so remove
the aligned AVX2 version (and the alignment check) and just
use the unaligned one.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:43 +01:00
Andreas Rheinhardt
55afe49dd0
avcodec/x86/bswapdsp: combine shifting, avoid check for AVX2
...
This avoids a check and a shift if >=8 elements are processed;
it adds a check if < 8 elements are processed (which should
be rare).
No change in benchmarks here.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:31 +01:00
Andreas Rheinhardt
3e6fa5153e
avcodec/x86/bswapdsp: Avoid register copies
...
No change in benchmarks here.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:01 +01:00
Andreas Rheinhardt
dc65dcec22
avcodec/vvc/inter: Combine offsets early
...
For bi-predicted weighted averages, only the sum
of the two offsets is ever used, so add the two early.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-25 12:08:33 +01:00
Andreas Rheinhardt
6c1c1720cf
avcodec/x86/vvc/dsp_init: Mark dsp init function as av_cold
...
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:05:12 +01:00
Andreas Rheinhardt
af3f8f5bd2
avcodec/x86/vvc/of: Break dependency chain
...
Don't extract from and update one and the same register
one word at a time; use separate src and dst registers,
so that pextrw and bsr can be done in parallel. Also use movd
instead of pinsrw for the first word.
Old benchmarks:
apply_bdof_8_8x16_c: 3275.2 ( 1.00x)
apply_bdof_8_8x16_avx2: 487.6 ( 6.72x)
apply_bdof_8_16x8_c: 3243.1 ( 1.00x)
apply_bdof_8_16x8_avx2: 284.4 (11.40x)
apply_bdof_8_16x16_c: 6501.8 ( 1.00x)
apply_bdof_8_16x16_avx2: 570.0 (11.41x)
apply_bdof_10_8x16_c: 3286.5 ( 1.00x)
apply_bdof_10_8x16_avx2: 461.7 ( 7.12x)
apply_bdof_10_16x8_c: 3274.5 ( 1.00x)
apply_bdof_10_16x8_avx2: 271.4 (12.06x)
apply_bdof_10_16x16_c: 6590.0 ( 1.00x)
apply_bdof_10_16x16_avx2: 543.9 (12.12x)
apply_bdof_12_8x16_c: 3307.6 ( 1.00x)
apply_bdof_12_8x16_avx2: 462.2 ( 7.16x)
apply_bdof_12_16x8_c: 3287.4 ( 1.00x)
apply_bdof_12_16x8_avx2: 271.8 (12.10x)
apply_bdof_12_16x16_c: 6465.7 ( 1.00x)
apply_bdof_12_16x16_avx2: 543.8 (11.89x)
New benchmarks:
apply_bdof_8_8x16_c: 3255.7 ( 1.00x)
apply_bdof_8_8x16_avx2: 349.3 ( 9.32x)
apply_bdof_8_16x8_c: 3262.5 ( 1.00x)
apply_bdof_8_16x8_avx2: 214.8 (15.19x)
apply_bdof_8_16x16_c: 6471.6 ( 1.00x)
apply_bdof_8_16x16_avx2: 429.8 (15.06x)
apply_bdof_10_8x16_c: 3227.7 ( 1.00x)
apply_bdof_10_8x16_avx2: 321.6 (10.04x)
apply_bdof_10_16x8_c: 3250.2 ( 1.00x)
apply_bdof_10_16x8_avx2: 201.2 (16.16x)
apply_bdof_10_16x16_c: 6476.5 ( 1.00x)
apply_bdof_10_16x16_avx2: 400.9 (16.16x)
apply_bdof_12_8x16_c: 3230.7 ( 1.00x)
apply_bdof_12_8x16_avx2: 321.8 (10.04x)
apply_bdof_12_16x8_c: 3210.5 ( 1.00x)
apply_bdof_12_16x8_avx2: 200.9 (15.98x)
apply_bdof_12_16x16_c: 6474.5 ( 1.00x)
apply_bdof_12_16x16_avx2: 400.2 (16.18x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:05:12 +01:00
Andreas Rheinhardt
19dc7b79a4
avcodec/x86/vvc/of: Unify shuffling
...
One can use the same shuffles for the width 8 and width 16
cases if one also changes the permutation in vpermd (which always
follows pshufb for width 16).
This also allows loading it before checking the width.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:03:22 +01:00
Andreas Rheinhardt
8e82416434
avcodec/x86/vvc/of: Avoid unused register
...
Avoids a push+pop.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:02:20 +01:00
Andreas Rheinhardt
81fb70c833
avcodec/x86/vvc/mc,dsp_init: Avoid pointless wrappers for w_avg
...
They only add overhead (in the form of another function call,
sign-extending some parameters to 64bit (although the upper
bits are not used at all) and rederiving the actual number
of bits from the maximum value (1<<bpp)-1).
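A minimal scalar sketch of what rederiving the bit count from (1<<bpp)-1 amounts to (the actual asm does this differently; this is only a model):

```c
#include <assert.h>

/* Given pixel_max = (1 << bpp) - 1, recover bpp as the number of
 * set bits below the highest one plus one, i.e. the position of
 * the highest set bit plus one. */
static int bpp_from_max(unsigned pixel_max)
{
    int bpp = 0;
    while (pixel_max) {
        bpp++;
        pixel_max >>= 1;
    }
    return bpp;
}
```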
Old benchmarks:
w_avg_8_2x2_c: 16.4 ( 1.00x)
w_avg_8_2x2_avx2: 12.9 ( 1.27x)
w_avg_8_4x4_c: 48.0 ( 1.00x)
w_avg_8_4x4_avx2: 14.9 ( 3.23x)
w_avg_8_8x8_c: 168.2 ( 1.00x)
w_avg_8_8x8_avx2: 22.4 ( 7.49x)
w_avg_8_16x16_c: 396.5 ( 1.00x)
w_avg_8_16x16_avx2: 47.9 ( 8.28x)
w_avg_8_32x32_c: 1466.3 ( 1.00x)
w_avg_8_32x32_avx2: 172.8 ( 8.48x)
w_avg_8_64x64_c: 5629.3 ( 1.00x)
w_avg_8_64x64_avx2: 678.7 ( 8.29x)
w_avg_8_128x128_c: 22122.4 ( 1.00x)
w_avg_8_128x128_avx2: 2743.5 ( 8.06x)
w_avg_10_2x2_c: 18.7 ( 1.00x)
w_avg_10_2x2_avx2: 13.1 ( 1.43x)
w_avg_10_4x4_c: 50.3 ( 1.00x)
w_avg_10_4x4_avx2: 15.9 ( 3.17x)
w_avg_10_8x8_c: 109.3 ( 1.00x)
w_avg_10_8x8_avx2: 20.6 ( 5.30x)
w_avg_10_16x16_c: 395.5 ( 1.00x)
w_avg_10_16x16_avx2: 44.8 ( 8.83x)
w_avg_10_32x32_c: 1534.2 ( 1.00x)
w_avg_10_32x32_avx2: 141.4 (10.85x)
w_avg_10_64x64_c: 6003.6 ( 1.00x)
w_avg_10_64x64_avx2: 557.4 (10.77x)
w_avg_10_128x128_c: 23722.7 ( 1.00x)
w_avg_10_128x128_avx2: 2205.0 (10.76x)
w_avg_12_2x2_c: 18.6 ( 1.00x)
w_avg_12_2x2_avx2: 13.1 ( 1.42x)
w_avg_12_4x4_c: 52.2 ( 1.00x)
w_avg_12_4x4_avx2: 16.1 ( 3.24x)
w_avg_12_8x8_c: 109.2 ( 1.00x)
w_avg_12_8x8_avx2: 20.6 ( 5.29x)
w_avg_12_16x16_c: 396.1 ( 1.00x)
w_avg_12_16x16_avx2: 45.0 ( 8.81x)
w_avg_12_32x32_c: 1532.6 ( 1.00x)
w_avg_12_32x32_avx2: 142.1 (10.79x)
w_avg_12_64x64_c: 6002.2 ( 1.00x)
w_avg_12_64x64_avx2: 557.3 (10.77x)
w_avg_12_128x128_c: 23748.7 ( 1.00x)
w_avg_12_128x128_avx2: 2206.4 (10.76x)
New benchmarks:
w_avg_8_2x2_c: 16.0 ( 1.00x)
w_avg_8_2x2_avx2: 9.3 ( 1.71x)
w_avg_8_4x4_c: 48.4 ( 1.00x)
w_avg_8_4x4_avx2: 12.4 ( 3.91x)
w_avg_8_8x8_c: 168.7 ( 1.00x)
w_avg_8_8x8_avx2: 21.1 ( 8.00x)
w_avg_8_16x16_c: 394.5 ( 1.00x)
w_avg_8_16x16_avx2: 46.2 ( 8.54x)
w_avg_8_32x32_c: 1456.3 ( 1.00x)
w_avg_8_32x32_avx2: 171.8 ( 8.48x)
w_avg_8_64x64_c: 5636.2 ( 1.00x)
w_avg_8_64x64_avx2: 676.9 ( 8.33x)
w_avg_8_128x128_c: 22129.1 ( 1.00x)
w_avg_8_128x128_avx2: 2734.3 ( 8.09x)
w_avg_10_2x2_c: 18.7 ( 1.00x)
w_avg_10_2x2_avx2: 10.3 ( 1.82x)
w_avg_10_4x4_c: 50.8 ( 1.00x)
w_avg_10_4x4_avx2: 13.4 ( 3.79x)
w_avg_10_8x8_c: 109.7 ( 1.00x)
w_avg_10_8x8_avx2: 20.4 ( 5.38x)
w_avg_10_16x16_c: 395.2 ( 1.00x)
w_avg_10_16x16_avx2: 41.7 ( 9.48x)
w_avg_10_32x32_c: 1535.6 ( 1.00x)
w_avg_10_32x32_avx2: 137.9 (11.13x)
w_avg_10_64x64_c: 6002.1 ( 1.00x)
w_avg_10_64x64_avx2: 548.5 (10.94x)
w_avg_10_128x128_c: 23742.7 ( 1.00x)
w_avg_10_128x128_avx2: 2179.8 (10.89x)
w_avg_12_2x2_c: 18.9 ( 1.00x)
w_avg_12_2x2_avx2: 10.3 ( 1.84x)
w_avg_12_4x4_c: 52.4 ( 1.00x)
w_avg_12_4x4_avx2: 13.4 ( 3.91x)
w_avg_12_8x8_c: 109.2 ( 1.00x)
w_avg_12_8x8_avx2: 20.3 ( 5.39x)
w_avg_12_16x16_c: 396.3 ( 1.00x)
w_avg_12_16x16_avx2: 41.7 ( 9.51x)
w_avg_12_32x32_c: 1532.6 ( 1.00x)
w_avg_12_32x32_avx2: 138.6 (11.06x)
w_avg_12_64x64_c: 5996.7 ( 1.00x)
w_avg_12_64x64_avx2: 549.6 (10.91x)
w_avg_12_128x128_c: 23738.0 ( 1.00x)
w_avg_12_128x128_avx2: 2177.2 (10.90x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:01:27 +01:00
Andreas Rheinhardt
ea78402e9c
avcodec/x86/vvc/mc,dsp_init: Avoid pointless wrappers for avg
...
Up until now, there were two averaging assembly functions,
one for eight bit content and one for <=16 bit content;
there were also three C wrappers around these functions,
for 8, 10 and 12 bpp. These wrappers simply forwarded the
maximum permissible value (i.e. (1<<bpp)-1) and promoted
some integer values to ptrdiff_t.
Yet these wrappers are absolutely useless: The assembly functions
rederive the bpp from the maximum and only the integer part
of the promoted ptrdiff_t values is ever used. Of course,
these wrappers also entail an additional call (not a tail call,
because the additional maximum parameter is passed on the stack).
Remove the wrappers and add per-bpp assembly functions instead.
Given that the only difference between 10 and 12 bits is some
constants in registers, the main part of these functions can be
shared (since this code uses a jump table, it can even
be done without adding any additional jump).
Old benchmarks:
avg_8_2x2_c: 11.4 ( 1.00x)
avg_8_2x2_avx2: 7.9 ( 1.44x)
avg_8_4x4_c: 30.7 ( 1.00x)
avg_8_4x4_avx2: 10.4 ( 2.95x)
avg_8_8x8_c: 134.5 ( 1.00x)
avg_8_8x8_avx2: 16.6 ( 8.12x)
avg_8_16x16_c: 255.6 ( 1.00x)
avg_8_16x16_avx2: 28.2 ( 9.07x)
avg_8_32x32_c: 897.7 ( 1.00x)
avg_8_32x32_avx2: 83.9 (10.70x)
avg_8_64x64_c: 3320.0 ( 1.00x)
avg_8_64x64_avx2: 321.1 (10.34x)
avg_8_128x128_c: 12981.8 ( 1.00x)
avg_8_128x128_avx2: 1480.1 ( 8.77x)
avg_10_2x2_c: 12.0 ( 1.00x)
avg_10_2x2_avx2: 8.4 ( 1.43x)
avg_10_4x4_c: 34.9 ( 1.00x)
avg_10_4x4_avx2: 9.8 ( 3.56x)
avg_10_8x8_c: 76.8 ( 1.00x)
avg_10_8x8_avx2: 15.1 ( 5.08x)
avg_10_16x16_c: 256.6 ( 1.00x)
avg_10_16x16_avx2: 25.1 (10.20x)
avg_10_32x32_c: 932.9 ( 1.00x)
avg_10_32x32_avx2: 73.4 (12.72x)
avg_10_64x64_c: 3517.9 ( 1.00x)
avg_10_64x64_avx2: 414.8 ( 8.48x)
avg_10_128x128_c: 13695.3 ( 1.00x)
avg_10_128x128_avx2: 1648.1 ( 8.31x)
avg_12_2x2_c: 13.1 ( 1.00x)
avg_12_2x2_avx2: 8.6 ( 1.53x)
avg_12_4x4_c: 35.4 ( 1.00x)
avg_12_4x4_avx2: 10.1 ( 3.49x)
avg_12_8x8_c: 76.6 ( 1.00x)
avg_12_8x8_avx2: 16.7 ( 4.60x)
avg_12_16x16_c: 256.6 ( 1.00x)
avg_12_16x16_avx2: 25.5 (10.07x)
avg_12_32x32_c: 933.2 ( 1.00x)
avg_12_32x32_avx2: 75.7 (12.34x)
avg_12_64x64_c: 3519.1 ( 1.00x)
avg_12_64x64_avx2: 416.8 ( 8.44x)
avg_12_128x128_c: 13695.1 ( 1.00x)
avg_12_128x128_avx2: 1651.6 ( 8.29x)
New benchmarks:
avg_8_2x2_c: 11.5 ( 1.00x)
avg_8_2x2_avx2: 6.0 ( 1.91x)
avg_8_4x4_c: 29.7 ( 1.00x)
avg_8_4x4_avx2: 8.0 ( 3.72x)
avg_8_8x8_c: 131.4 ( 1.00x)
avg_8_8x8_avx2: 12.2 (10.74x)
avg_8_16x16_c: 254.3 ( 1.00x)
avg_8_16x16_avx2: 24.8 (10.25x)
avg_8_32x32_c: 897.7 ( 1.00x)
avg_8_32x32_avx2: 77.8 (11.54x)
avg_8_64x64_c: 3321.3 ( 1.00x)
avg_8_64x64_avx2: 318.7 (10.42x)
avg_8_128x128_c: 12988.4 ( 1.00x)
avg_8_128x128_avx2: 1430.1 ( 9.08x)
avg_10_2x2_c: 12.1 ( 1.00x)
avg_10_2x2_avx2: 5.7 ( 2.13x)
avg_10_4x4_c: 35.0 ( 1.00x)
avg_10_4x4_avx2: 9.0 ( 3.88x)
avg_10_8x8_c: 77.2 ( 1.00x)
avg_10_8x8_avx2: 12.4 ( 6.24x)
avg_10_16x16_c: 256.2 ( 1.00x)
avg_10_16x16_avx2: 24.3 (10.56x)
avg_10_32x32_c: 932.9 ( 1.00x)
avg_10_32x32_avx2: 71.9 (12.97x)
avg_10_64x64_c: 3516.8 ( 1.00x)
avg_10_64x64_avx2: 414.7 ( 8.48x)
avg_10_128x128_c: 13693.7 ( 1.00x)
avg_10_128x128_avx2: 1609.3 ( 8.51x)
avg_12_2x2_c: 14.1 ( 1.00x)
avg_12_2x2_avx2: 5.7 ( 2.48x)
avg_12_4x4_c: 35.8 ( 1.00x)
avg_12_4x4_avx2: 9.0 ( 3.96x)
avg_12_8x8_c: 76.9 ( 1.00x)
avg_12_8x8_avx2: 12.4 ( 6.22x)
avg_12_16x16_c: 256.5 ( 1.00x)
avg_12_16x16_avx2: 24.4 (10.50x)
avg_12_32x32_c: 934.1 ( 1.00x)
avg_12_32x32_avx2: 72.0 (12.97x)
avg_12_64x64_c: 3518.2 ( 1.00x)
avg_12_64x64_avx2: 414.8 ( 8.48x)
avg_12_128x128_c: 13689.5 ( 1.00x)
avg_12_128x128_avx2: 1611.1 ( 8.50x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:58:33 +01:00
Andreas Rheinhardt
5a60b3f1a6
avcodec/x86/vvc/mc: Remove always-false branches
...
The C versions of the average and weighted average functions
contain "FFMAX(3, 15 - BIT_DEPTH)" and the code here followed
this; yet it is only instantiated for bit depths 8, 10 and 12,
for which the above is just 15 - BIT_DEPTH. So the comparisons
are unnecessary.
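The removed comparison is easy to verify for the three instantiated depths (FFMAX as in FFmpeg's macro):

```c
#include <assert.h>

#define FFMAX(a, b) ((a) > (b) ? (a) : (b))

/* Even at the deepest instantiated bit depth, 15 - 12 == 3, so
 * the FFMAX never kicks in and the shift is always 15 - bit_depth. */
static int avg_shift(int bit_depth)
{
    return FFMAX(3, 15 - bit_depth);
}
```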
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
59f8ff4c18
avcodec/x86/vvc/mc: Remove unused constants
...
Also avoid overaligning .rodata.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
eabf52e787
avcodec/x86/vvc/mc: Avoid unused work
...
The high quadword of these registers is zero for width 2.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
9317fb2b2e
avcodec/x86/vvc/mc: Avoid ymm registers where possible
...
Widths 2 and 4 fit into xmm registers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
caa0ae0cfb
avcodec/x86/vvc/mc: Avoid pextr[dq], v{insert,extract}i128
...
Use mov[dq] or movdqu instead when the least significant parts
are targeted (i.e. when the immediate value is 0x0).
Old benchmarks:
avg_8_2x2_c: 11.3 ( 1.00x)
avg_8_2x2_avx2: 7.5 ( 1.50x)
avg_8_4x4_c: 31.2 ( 1.00x)
avg_8_4x4_avx2: 10.7 ( 2.91x)
avg_8_8x8_c: 133.5 ( 1.00x)
avg_8_8x8_avx2: 21.2 ( 6.30x)
avg_8_16x16_c: 254.7 ( 1.00x)
avg_8_16x16_avx2: 30.1 ( 8.46x)
avg_8_32x32_c: 896.9 ( 1.00x)
avg_8_32x32_avx2: 103.9 ( 8.63x)
avg_8_64x64_c: 3320.7 ( 1.00x)
avg_8_64x64_avx2: 539.4 ( 6.16x)
avg_8_128x128_c: 12991.5 ( 1.00x)
avg_8_128x128_avx2: 1661.3 ( 7.82x)
avg_10_2x2_c: 21.3 ( 1.00x)
avg_10_2x2_avx2: 8.3 ( 2.55x)
avg_10_4x4_c: 34.9 ( 1.00x)
avg_10_4x4_avx2: 10.6 ( 3.28x)
avg_10_8x8_c: 76.3 ( 1.00x)
avg_10_8x8_avx2: 20.2 ( 3.77x)
avg_10_16x16_c: 255.9 ( 1.00x)
avg_10_16x16_avx2: 24.1 (10.60x)
avg_10_32x32_c: 932.4 ( 1.00x)
avg_10_32x32_avx2: 73.3 (12.72x)
avg_10_64x64_c: 3516.4 ( 1.00x)
avg_10_64x64_avx2: 601.7 ( 5.84x)
avg_10_128x128_c: 13690.6 ( 1.00x)
avg_10_128x128_avx2: 1613.2 ( 8.49x)
avg_12_2x2_c: 14.0 ( 1.00x)
avg_12_2x2_avx2: 8.3 ( 1.67x)
avg_12_4x4_c: 35.3 ( 1.00x)
avg_12_4x4_avx2: 10.9 ( 3.26x)
avg_12_8x8_c: 76.5 ( 1.00x)
avg_12_8x8_avx2: 20.3 ( 3.77x)
avg_12_16x16_c: 256.7 ( 1.00x)
avg_12_16x16_avx2: 24.1 (10.63x)
avg_12_32x32_c: 932.5 ( 1.00x)
avg_12_32x32_avx2: 73.3 (12.72x)
avg_12_64x64_c: 3520.5 ( 1.00x)
avg_12_64x64_avx2: 602.6 ( 5.84x)
avg_12_128x128_c: 13689.6 ( 1.00x)
avg_12_128x128_avx2: 1613.1 ( 8.49x)
w_avg_8_2x2_c: 16.7 ( 1.00x)
w_avg_8_2x2_avx2: 13.4 ( 1.25x)
w_avg_8_4x4_c: 44.5 ( 1.00x)
w_avg_8_4x4_avx2: 15.9 ( 2.81x)
w_avg_8_8x8_c: 166.1 ( 1.00x)
w_avg_8_8x8_avx2: 45.7 ( 3.63x)
w_avg_8_16x16_c: 392.9 ( 1.00x)
w_avg_8_16x16_avx2: 57.8 ( 6.80x)
w_avg_8_32x32_c: 1455.5 ( 1.00x)
w_avg_8_32x32_avx2: 215.0 ( 6.77x)
w_avg_8_64x64_c: 5621.8 ( 1.00x)
w_avg_8_64x64_avx2: 875.2 ( 6.42x)
w_avg_8_128x128_c: 22131.3 ( 1.00x)
w_avg_8_128x128_avx2: 3390.1 ( 6.53x)
w_avg_10_2x2_c: 18.0 ( 1.00x)
w_avg_10_2x2_avx2: 14.0 ( 1.28x)
w_avg_10_4x4_c: 53.9 ( 1.00x)
w_avg_10_4x4_avx2: 15.9 ( 3.40x)
w_avg_10_8x8_c: 109.5 ( 1.00x)
w_avg_10_8x8_avx2: 40.4 ( 2.71x)
w_avg_10_16x16_c: 395.7 ( 1.00x)
w_avg_10_16x16_avx2: 44.7 ( 8.86x)
w_avg_10_32x32_c: 1532.7 ( 1.00x)
w_avg_10_32x32_avx2: 142.4 (10.77x)
w_avg_10_64x64_c: 6007.7 ( 1.00x)
w_avg_10_64x64_avx2: 745.5 ( 8.06x)
w_avg_10_128x128_c: 23719.7 ( 1.00x)
w_avg_10_128x128_avx2: 2217.7 (10.70x)
w_avg_12_2x2_c: 18.9 ( 1.00x)
w_avg_12_2x2_avx2: 13.6 ( 1.38x)
w_avg_12_4x4_c: 47.5 ( 1.00x)
w_avg_12_4x4_avx2: 15.9 ( 2.99x)
w_avg_12_8x8_c: 109.3 ( 1.00x)
w_avg_12_8x8_avx2: 40.9 ( 2.67x)
w_avg_12_16x16_c: 395.6 ( 1.00x)
w_avg_12_16x16_avx2: 44.8 ( 8.84x)
w_avg_12_32x32_c: 1531.0 ( 1.00x)
w_avg_12_32x32_avx2: 141.8 (10.80x)
w_avg_12_64x64_c: 6016.7 ( 1.00x)
w_avg_12_64x64_avx2: 732.8 ( 8.21x)
w_avg_12_128x128_c: 23762.2 ( 1.00x)
w_avg_12_128x128_avx2: 2223.4 (10.69x)
New benchmarks:
avg_8_2x2_c: 11.3 ( 1.00x)
avg_8_2x2_avx2: 7.6 ( 1.49x)
avg_8_4x4_c: 31.2 ( 1.00x)
avg_8_4x4_avx2: 10.8 ( 2.89x)
avg_8_8x8_c: 131.6 ( 1.00x)
avg_8_8x8_avx2: 15.6 ( 8.42x)
avg_8_16x16_c: 255.3 ( 1.00x)
avg_8_16x16_avx2: 27.9 ( 9.16x)
avg_8_32x32_c: 897.9 ( 1.00x)
avg_8_32x32_avx2: 81.2 (11.06x)
avg_8_64x64_c: 3320.0 ( 1.00x)
avg_8_64x64_avx2: 335.1 ( 9.91x)
avg_8_128x128_c: 12999.1 ( 1.00x)
avg_8_128x128_avx2: 1456.3 ( 8.93x)
avg_10_2x2_c: 12.0 ( 1.00x)
avg_10_2x2_avx2: 8.6 ( 1.40x)
avg_10_4x4_c: 34.9 ( 1.00x)
avg_10_4x4_avx2: 9.7 ( 3.61x)
avg_10_8x8_c: 76.7 ( 1.00x)
avg_10_8x8_avx2: 16.3 ( 4.69x)
avg_10_16x16_c: 256.3 ( 1.00x)
avg_10_16x16_avx2: 25.2 (10.18x)
avg_10_32x32_c: 932.8 ( 1.00x)
avg_10_32x32_avx2: 73.3 (12.72x)
avg_10_64x64_c: 3518.8 ( 1.00x)
avg_10_64x64_avx2: 416.8 ( 8.44x)
avg_10_128x128_c: 13691.6 ( 1.00x)
avg_10_128x128_avx2: 1612.9 ( 8.49x)
avg_12_2x2_c: 14.1 ( 1.00x)
avg_12_2x2_avx2: 8.7 ( 1.62x)
avg_12_4x4_c: 35.7 ( 1.00x)
avg_12_4x4_avx2: 9.7 ( 3.68x)
avg_12_8x8_c: 77.0 ( 1.00x)
avg_12_8x8_avx2: 16.9 ( 4.57x)
avg_12_16x16_c: 256.2 ( 1.00x)
avg_12_16x16_avx2: 25.7 ( 9.96x)
avg_12_32x32_c: 933.5 ( 1.00x)
avg_12_32x32_avx2: 74.0 (12.62x)
avg_12_64x64_c: 3516.4 ( 1.00x)
avg_12_64x64_avx2: 408.7 ( 8.60x)
avg_12_128x128_c: 13691.6 ( 1.00x)
avg_12_128x128_avx2: 1613.8 ( 8.48x)
w_avg_8_2x2_c: 16.7 ( 1.00x)
w_avg_8_2x2_avx2: 14.0 ( 1.19x)
w_avg_8_4x4_c: 48.2 ( 1.00x)
w_avg_8_4x4_avx2: 16.1 ( 3.00x)
w_avg_8_8x8_c: 168.0 ( 1.00x)
w_avg_8_8x8_avx2: 22.5 ( 7.47x)
w_avg_8_16x16_c: 392.5 ( 1.00x)
w_avg_8_16x16_avx2: 47.9 ( 8.19x)
w_avg_8_32x32_c: 1453.7 ( 1.00x)
w_avg_8_32x32_avx2: 176.1 ( 8.26x)
w_avg_8_64x64_c: 5631.4 ( 1.00x)
w_avg_8_64x64_avx2: 690.8 ( 8.15x)
w_avg_8_128x128_c: 22139.5 ( 1.00x)
w_avg_8_128x128_avx2: 2742.4 ( 8.07x)
w_avg_10_2x2_c: 18.1 ( 1.00x)
w_avg_10_2x2_avx2: 13.8 ( 1.31x)
w_avg_10_4x4_c: 47.0 ( 1.00x)
w_avg_10_4x4_avx2: 16.4 ( 2.87x)
w_avg_10_8x8_c: 110.0 ( 1.00x)
w_avg_10_8x8_avx2: 21.6 ( 5.09x)
w_avg_10_16x16_c: 395.2 ( 1.00x)
w_avg_10_16x16_avx2: 45.4 ( 8.71x)
w_avg_10_32x32_c: 1533.8 ( 1.00x)
w_avg_10_32x32_avx2: 142.6 (10.76x)
w_avg_10_64x64_c: 6004.4 ( 1.00x)
w_avg_10_64x64_avx2: 672.8 ( 8.92x)
w_avg_10_128x128_c: 23748.5 ( 1.00x)
w_avg_10_128x128_avx2: 2198.0 (10.80x)
w_avg_12_2x2_c: 17.2 ( 1.00x)
w_avg_12_2x2_avx2: 13.9 ( 1.24x)
w_avg_12_4x4_c: 51.4 ( 1.00x)
w_avg_12_4x4_avx2: 16.5 ( 3.11x)
w_avg_12_8x8_c: 109.1 ( 1.00x)
w_avg_12_8x8_avx2: 22.0 ( 4.96x)
w_avg_12_16x16_c: 395.9 ( 1.00x)
w_avg_12_16x16_avx2: 44.9 ( 8.81x)
w_avg_12_32x32_c: 1533.5 ( 1.00x)
w_avg_12_32x32_avx2: 142.3 (10.78x)
w_avg_12_64x64_c: 6002.0 ( 1.00x)
w_avg_12_64x64_avx2: 557.5 (10.77x)
w_avg_12_128x128_c: 23749.5 ( 1.00x)
w_avg_12_128x128_avx2: 2202.0 (10.79x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
7bf9c1e3f6
avcodec/x86/vvc/mc: Avoid redundant clipping for 8bit
...
It is already done by packuswb.
Old benchmarks:
avg_8_2x2_c: 11.1 ( 1.00x)
avg_8_2x2_avx2: 8.6 ( 1.28x)
avg_8_4x4_c: 30.0 ( 1.00x)
avg_8_4x4_avx2: 10.8 ( 2.78x)
avg_8_8x8_c: 132.0 ( 1.00x)
avg_8_8x8_avx2: 25.7 ( 5.14x)
avg_8_16x16_c: 254.6 ( 1.00x)
avg_8_16x16_avx2: 33.2 ( 7.67x)
avg_8_32x32_c: 897.5 ( 1.00x)
avg_8_32x32_avx2: 115.6 ( 7.76x)
avg_8_64x64_c: 3316.9 ( 1.00x)
avg_8_64x64_avx2: 626.5 ( 5.29x)
avg_8_128x128_c: 12973.6 ( 1.00x)
avg_8_128x128_avx2: 1914.0 ( 6.78x)
w_avg_8_2x2_c: 16.7 ( 1.00x)
w_avg_8_2x2_avx2: 14.4 ( 1.16x)
w_avg_8_4x4_c: 48.2 ( 1.00x)
w_avg_8_4x4_avx2: 16.5 ( 2.92x)
w_avg_8_8x8_c: 168.1 ( 1.00x)
w_avg_8_8x8_avx2: 49.7 ( 3.38x)
w_avg_8_16x16_c: 392.4 ( 1.00x)
w_avg_8_16x16_avx2: 61.1 ( 6.43x)
w_avg_8_32x32_c: 1455.3 ( 1.00x)
w_avg_8_32x32_avx2: 224.6 ( 6.48x)
w_avg_8_64x64_c: 5632.1 ( 1.00x)
w_avg_8_64x64_avx2: 896.9 ( 6.28x)
w_avg_8_128x128_c: 22136.3 ( 1.00x)
w_avg_8_128x128_avx2: 3626.7 ( 6.10x)
New benchmarks:
avg_8_2x2_c: 12.3 ( 1.00x)
avg_8_2x2_avx2: 8.1 ( 1.52x)
avg_8_4x4_c: 30.3 ( 1.00x)
avg_8_4x4_avx2: 11.3 ( 2.67x)
avg_8_8x8_c: 131.8 ( 1.00x)
avg_8_8x8_avx2: 21.3 ( 6.20x)
avg_8_16x16_c: 255.0 ( 1.00x)
avg_8_16x16_avx2: 30.6 ( 8.33x)
avg_8_32x32_c: 898.5 ( 1.00x)
avg_8_32x32_avx2: 104.9 ( 8.57x)
avg_8_64x64_c: 3317.7 ( 1.00x)
avg_8_64x64_avx2: 540.9 ( 6.13x)
avg_8_128x128_c: 12986.5 ( 1.00x)
avg_8_128x128_avx2: 1663.4 ( 7.81x)
w_avg_8_2x2_c: 16.8 ( 1.00x)
w_avg_8_2x2_avx2: 13.9 ( 1.21x)
w_avg_8_4x4_c: 48.2 ( 1.00x)
w_avg_8_4x4_avx2: 16.2 ( 2.98x)
w_avg_8_8x8_c: 168.6 ( 1.00x)
w_avg_8_8x8_avx2: 46.3 ( 3.64x)
w_avg_8_16x16_c: 392.4 ( 1.00x)
w_avg_8_16x16_avx2: 57.7 ( 6.80x)
w_avg_8_32x32_c: 1454.6 ( 1.00x)
w_avg_8_32x32_avx2: 214.6 ( 6.78x)
w_avg_8_64x64_c: 5638.4 ( 1.00x)
w_avg_8_64x64_avx2: 875.6 ( 6.44x)
w_avg_8_128x128_c: 22133.5 ( 1.00x)
w_avg_8_128x128_avx2: 3334.3 ( 6.64x)
Also saves 550B of .text here. The improvements will likely
be even better on Win64, because this change avoids using two
nonvolatile registers in the weighted average case.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00