Commit graph

2971 commits

Author SHA1 Message Date
Andreas Rheinhardt
0ddece40c5 avcodec/x86/vvc/alf: Simplify vb_pos comparisons
The value of vb_pos at vb_bottom, vb_above is known
at compile-time, so one can avoid the modifications
to vb_pos and just compare against immediates.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
1960320112 avcodec/x86/vvc/alf: Avoid pointless wrappers for alf_filter
They are completely unnecessary for the 8bit case (which only
handles 8bit) and overly complicated for the 10 and 12bit cases:
All one needs to do is set up the (1<<bpp)-1 vector register
and jmp from (say) the 12bpp function stub inside the 10bpp
function. The way it is done here even allows sharing the
prologue between the two functions.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
467f8d8415 avcodec/x86/vvc/alf: Improve offsetting pointers
It can be combined with an earlier lea for the loop
processing 16 pixels at a time; it is unnecessary
for the tail, because the new values will be overwritten
immediately afterwards anyway.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
cb5f6c055b avcodec/x86/vvc/alf: Don't modify rsp unnecessarily
The vvc_alf_filter functions don't use x86inc's stack management
feature at all; they merely push and pop some regs themselves.
So don't tell x86inc to provide stack (which in this case
entails aligning the stack).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
38062ebd18 avcodec/x86/vvc/alf: Remove pointless counter, stride
Each luma alf block has 2*12 auxiliary coefficients associated
with it that the alf_filter functions consume; the C version
simply increments the pointers.

The x64 dsp function meanwhile does things differently:
The vvc_alf_filter functions have three levels of loops.
The middle layer uses two counters, one of which is
just the horizontal offset xd in the current line. It is only
used for addressing these auxiliary coefficients and
yet one needs to perform work to translate from it to
the coefficient offset, namely a *3 via lea and a *2 scale.
Furthermore, the base pointers of the coefficients are incremented
in the outer loop; the stride used for this is calculated
in the C wrapper functions. Furthermore, due to GPR pressure xd
is reused as loop counter for the innermost loop; the
xd from the middle loop is pushed to the stack.

Apart from the translation from horizontal offset to coefficient
offset all of the above has been done for chroma, too, although
the coefficient pointers don't get modified for them at all.

This commit changes this to just increment the pointers
after reading the relevant coefficients.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
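
The change described in 38062ebd18 can be illustrated with a minimal C sketch. It is not the FFmpeg code: the function names, the single coefficient array and the 4-pixel block width are assumptions made for illustration. It contrasts deriving the coefficient position from the horizontal offset xd (a factor of 3 in elements, i.e. 3*2 in bytes for 16-bit coefficients) with simply advancing the pointer after each block.

```c
#include <stdint.h>

#define COEFFS_PER_BLOCK 12 /* per luma block, as stated in the commit message */
#define BLOCK_WIDTH       4 /* assumption for this sketch */

/* Old scheme (hypothetical model): translate the horizontal offset xd into a
 * coefficient offset on every iteration. 12 coefficients per 4 pixels means
 * xd * 3 elements, i.e. xd * 3 * 2 bytes for int16_t. */
static int32_t sum_by_offset(const int16_t *filter, int width)
{
    int32_t sum = 0;
    for (int xd = 0; xd < width; xd += BLOCK_WIDTH) {
        const int16_t *f = filter + xd * 3;
        for (int i = 0; i < COEFFS_PER_BLOCK; i++)
            sum += f[i];
    }
    return sum;
}

/* New scheme (hypothetical model): read the coefficients of the current
 * block, then advance the pointer. No offset translation is needed. */
static int32_t sum_by_increment(const int16_t *filter, int width)
{
    int32_t sum = 0;
    for (int x = 0; x < width; x += BLOCK_WIDTH) {
        for (int i = 0; i < COEFFS_PER_BLOCK; i++)
            sum += filter[i];
        filter += COEFFS_PER_BLOCK;
    }
    return sum;
}
```

Both functions visit the same coefficients in the same order; the second one needs neither the extra counter nor the multiplication.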
Andreas Rheinhardt
d2e7fe5b19 avcodec/x86/vvc/alf: Improve deriving ac
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
5da3cab645 avcodec/x86/vvc/alf: Avoid broadcast
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
c9da0193ff avcodec/x86/vvc/alf: Don't use 64bit where unnecessary
Reduces codesize (avoids REX prefixes).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
a489a623fb avcodec/x86/vvc/alf: Use memory sources directly
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
df7885d6c3 avcodec/x86/vvc/alf: Improve writing classify parameters
The permutation that was applied before the write macro
is actually only beneficial when one has 16 entries to write,
so move it into the macro to write 16 entries and optimize
the other macro.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
1bc91eb552 avcodec/x86/vvc/alf: Avoid checking twice
Also avoids a vpermq in case width is eight.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:42 +01:00
Andreas Rheinhardt
e4a9d54e48 avcodec/x86/vvc/alf: Avoid nonvolatile registers
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
a2d9cd6dcb avcodec/x86/vvc/alf: Don't calculate twice
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
01a897020e avcodec/x86/vvc/alf: Use xmm registers where sufficient
One always has eight samples when processing the luma remainder,
so xmm registers are sufficient for everything. In fact, this
actually simplifies loading the luma parameters.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
9cb5280c0e avcodec/x86/vvc/alf: Improve storing 8bpp
When width is known to be 8 (i.e. for luma that is not width 16),
the upper lane is unused, so use an xmm-sized packuswb and avoid
the vpermq altogether. For chroma not known to be 16 (i.e. 4,8 or
12) defer extracting from the high lane until it is known to be needed.
Also do so via vextracti128 instead of vpermq (also do this for
bpp>8).
Also use vextracti128 and an xmm-sized packuswb in case of width 16
instead of an ymm-sized packuswb followed by vextracti128.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
56a4c15c23 avcodec/x86/vvc/alf: Avoid checking twice
Also avoid doing unnecessary work in the width==8 case.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
43cc8f05df avcodec/x86/vvc/alf: Don't clip for 8bpp
packuswb does it already.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
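
The reasoning behind 43cc8f05df can be checked with a scalar model of one packuswb lane. This is only a sketch in portable C with made-up function names; the real code operates on whole vector registers.

```c
#include <stdint.h>

/* Scalar model of one lane of packuswb: signed 16-bit input,
 * unsigned 8-bit output, with unsigned saturation. */
static uint8_t pack_sat_u8(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Explicit clip to the 8-bit pixel range [0, (1 << 8) - 1]. */
static int16_t clip_pixel8(int16_t v)
{
    const int16_t pixel_max = (1 << 8) - 1;
    return v < 0 ? 0 : v > pixel_max ? pixel_max : v;
}

/* Returns 1 if clipping before packing yields the same byte as packing alone. */
static int clip_is_redundant(int16_t v)
{
    return pack_sat_u8(clip_pixel8(v)) == pack_sat_u8(v);
}
```

Because the saturation range of the pack equals the 8bpp pixel range, the explicit clip never changes the result. For more than 8 bits per pixel the ranges differ, so the clip is still needed there.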
Andreas Rheinhardt
a8b3b9c26f avcodec/x86/vvc/alf: Remove unused array
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
94f9ad8061 avcodec/x86/vvc/alf: Use immediate for shift when possible
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
2159e40ab3 avcodec/x86/vvc/of: Avoid jump
At the end of the height==8 codepath, a jump to RET at the end
of the height==16 codepath is performed. Yet the epilogue
is so cheap on Unix64 that this jump is not worthwhile.
For Win64 meanwhile, one can still avoid jumps, because
for width 16 >8bpp and width 8 8bpp content a jump is performed
to the end of the height==8 position, immediately followed
by a jump to RET. These two jumps can be combined into one.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
2a93d09968 avcodec/x86/vvc/of: Ignore upper lane for width 8
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
9fe9fd95b6 avcodec/x86/vvc/of: Only clip for >8bpp
packuswb does it already for 8bpp.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
83694749ad avcodec/x86/vvc/of,dsp_init: Avoid unnecessary wrappers
Write them in assembly instead; this exchanges a call+ret
with a jmp and also avoids the stack for (1<<bpp)-1.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
d6ed5d6e3d avcodec/x86/vvc/of: Deduplicate writing, save jump
Both the 8bpp width 16 and >8bpp width 8 cases write
16 contiguous bytes; deduplicate writing them. In fact,
by putting this block of code at the end of the SAVE macro,
one can even save a jmp for the width 16 8bpp case
(without adversely affecting the other cases).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
e7e19fcb1b avcodec/x86/vvc/of: Avoid unnecessary jumps
For 8bpp width 8 content, an unnecessary jump was performed
for every write: First to the end of the SAVE_8BPC macro,
then to the end of the SAVE macro. This commit changes this.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
dee361a5bf avcodec/x86/vvc/of: Avoid initialization, addition for last block
When processing the last block, we no longer need to preserve
some registers for the next block, allowing simplifications.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
c6205355b4 avcodec/x86/vvc/of: Avoid initialization, addition for first block
Output directly to the desired destination registers instead
of zeroing them, followed by adding the desired values.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
f177672df2 avcodec/x86/vvc/of: Avoid unnecessary additions
BDOF_PROF_GRAD just adds some values to m12,m13,
so one can avoid two pxor, paddw by not
saving these registers prematurely.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-06 20:02:41 +01:00
Andreas Rheinhardt
561f37c023 avcodec/x86/huffyuvencdsp_init: Remove pointless av_unused
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
d345e902d2 avcodec/x86/huffyuvencdsp: Remove MMX sub_hfyu_median_pred_int16
Superseded by SSE2 and AVX2.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
154bcd1054 avcodec/x86/huffyuvencdsp: Add AVX2 sub_hfyu_median_pred_int16
This version can also process 16bpp.

Benchmarks:
sub_hfyu_median_pred_int16_9bpp_c:                   12667.7 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_mmxext:               1966.5 ( 6.44x)
sub_hfyu_median_pred_int16_9bpp_sse2:                  997.6 (12.70x)
sub_hfyu_median_pred_int16_9bpp_avx2:                  474.8 (26.68x)
sub_hfyu_median_pred_int16_9bpp_aligned_c:           12604.6 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_aligned_mmxext:       1964.6 ( 6.42x)
sub_hfyu_median_pred_int16_9bpp_aligned_sse2:          981.9 (12.84x)
sub_hfyu_median_pred_int16_9bpp_aligned_avx2:          462.6 (27.25x)
sub_hfyu_median_pred_int16_16bpp_c:                  12592.5 ( 1.00x)
sub_hfyu_median_pred_int16_16bpp_avx2:                 465.6 (27.04x)
sub_hfyu_median_pred_int16_16bpp_aligned_c:          12587.5 ( 1.00x)
sub_hfyu_median_pred_int16_16bpp_aligned_avx2:         462.5 (27.22x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:04:14 +01:00
Andreas Rheinhardt
e575c2d496 avcodec/x86/huffyuvencdsp: Add SSE2 sub_hfyu_median_pred_int16
Contrary to the MMXEXT version this version does not overread at all
(the MMXEXT version processes the input of 2*w bytes in eight byte
chunks and overreads by a further six bytes, because it loads
the next left and left top values at the end of the loop,
i.e. it reads FFALIGN(2*w,8)+6 bytes instead of 2*w).

Benchmarks:
sub_hfyu_median_pred_int16_9bpp_c:                   12673.6 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_mmxext:               1947.7 ( 6.51x)
sub_hfyu_median_pred_int16_9bpp_sse2:                  993.9 (12.75x)
sub_hfyu_median_pred_int16_9bpp_aligned_c:           12596.1 ( 1.00x)
sub_hfyu_median_pred_int16_9bpp_aligned_mmxext:       1956.1 ( 6.44x)
sub_hfyu_median_pred_int16_9bpp_aligned_sse2:          989.4 (12.73x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 12:03:55 +01:00
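
The overread figure quoted in e575c2d496 can be worked out with a small C sketch. FFALIGN is written out the way libavutil defines it; the two helper functions are hypothetical and only evaluate the formula from the commit message.

```c
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two),
 * written as in libavutil. */
#define FFALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Bytes touched by the MMXEXT version for a row of w 16-bit samples,
 * following the formula given in the commit message. */
static size_t mmxext_bytes_read(size_t w)
{
    return FFALIGN(2 * w, (size_t)8) + 6;
}

/* Bytes read beyond the 2*w bytes that actually belong to the row. */
static size_t mmxext_overread(size_t w)
{
    return mmxext_bytes_read(w) - 2 * w;
}
```

For w = 5 the row has 10 bytes, the aligned size is 16 and the total read is 22 bytes, i.e. 12 bytes of overread. Even for a row that is already a multiple of eight bytes (w = 4) the extra six bytes remain.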
Andreas Rheinhardt
6834762d7b avcodec/huffyuvencdsp: Add width parameter to init
This allows using certain functions that rely on wide registers only
if there is enough work to do and if one can even read a whole
register width without overreading.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 11:58:16 +01:00
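
The idea behind 6834762d7b could look roughly like the following C sketch. Everything in it is an assumption for illustration: the enum, the function name, the CPU flag parameters and the thresholds (one full register per row) are not taken from FFmpeg.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set of implementations to choose from. */
enum impl { IMPL_C, IMPL_SSE2, IMPL_AVX2 };

/* Pick the widest implementation whose register size (in bytes) does not
 * exceed the size of a row of 16-bit samples, so that a full-register load
 * never reads past the end of the row. */
static enum impl pick_impl(int width, int have_sse2, int have_avx2)
{
    size_t row_bytes = (size_t)width * sizeof(uint16_t);

    if (have_avx2 && row_bytes >= 32)
        return IMPL_AVX2;
    if (have_sse2 && row_bytes >= 16)
        return IMPL_SSE2;
    return IMPL_C;
}
```

With the width known at init time, this decision is made once instead of being checked on every call.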
Andreas Rheinhardt
2268ba89f0 avcodec/huffyuvencdsp: Pass bpp, not AVPixelFormat for init
Avoids having to get a pixel format descriptor.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-01 11:56:57 +01:00
Andreas Rheinhardt
aa483bc422 avcodec/x86/bswapdsp: Avoid aligned vs unaligned codepaths for AVX2
For modern cpus (like those supporting AVX2) loads and stores
using the unaligned versions of instructions are as fast
as aligned ones if the address is aligned, so remove
the aligned AVX2 version (and the alignment check) and just
use the unaligned one.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:43 +01:00
Andreas Rheinhardt
55afe49dd0 avcodec/x86/bswapdsp: combine shifting, avoid check for AVX2
This avoids a check and a shift if >=8 elements are processed;
it adds a check if < 8 elements are processed (which should
be rare).
No change in benchmarks here.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:31 +01:00
Andreas Rheinhardt
3e6fa5153e avcodec/x86/bswapdsp: Avoid register copies
No change in benchmarks here.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-27 18:25:01 +01:00
Andreas Rheinhardt
dc65dcec22 avcodec/vvc/inter: Combine offsets early
For bi-predicted weighted averages, only the sum
of the two offsets is ever used, so add the two early.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-25 12:08:33 +01:00
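
The observation in dc65dcec22 can be sketched in C as follows. The formula is modeled on the usual explicit weighted bi-prediction of HEVC/VVC and is an assumption here; the function and parameter names are hypothetical and the actual FFmpeg code may arrange rounding and shifts differently.

```c
/* Clip v to the valid pixel range [0, pixel_max]. */
static int clip_pixel(int v, int pixel_max)
{
    return v < 0 ? 0 : v > pixel_max ? pixel_max : v;
}

/* Weighted bi-prediction with two separate offsets o0 and o1.
 * The offsets only ever appear as the sum o0 + o1.
 * The multiplication stands in for a left shift to avoid shifting
 * negative values; the right shift assumes a non-negative sum. */
static int w_avg_two_offsets(int s0, int s1, int w0, int w1,
                             int o0, int o1, int log2wd, int pixel_max)
{
    int sum = s0 * w0 + s1 * w1 + (o0 + o1 + 1) * (1 << log2wd);
    return clip_pixel(sum >> (log2wd + 1), pixel_max);
}

/* Same computation with the offsets combined by the caller. */
static int w_avg_combined(int s0, int s1, int w0, int w1,
                          int offset_sum, int log2wd, int pixel_max)
{
    int sum = s0 * w0 + s1 * w1 + (offset_sum + 1) * (1 << log2wd);
    return clip_pixel(sum >> (log2wd + 1), pixel_max);
}
```

Adding the offsets once, before the per-pixel loop, saves a parameter and an addition without changing any result.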
Andreas Rheinhardt
6c1c1720cf avcodec/x86/vvc/dsp_init: Mark dsp init function as av_cold
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:05:12 +01:00
Andreas Rheinhardt
af3f8f5bd2 avcodec/x86/vvc/of: Break dependency chain
Don't extract and update one word of one and the same register
at a time; use separate src and dst registers, so that pextrw
and bsr can be done in parallel. Also use movd instead of pinsrw
for the first word.

Old benchmarks:
apply_bdof_8_8x16_c:                                  3275.2 ( 1.00x)
apply_bdof_8_8x16_avx2:                                487.6 ( 6.72x)
apply_bdof_8_16x8_c:                                  3243.1 ( 1.00x)
apply_bdof_8_16x8_avx2:                                284.4 (11.40x)
apply_bdof_8_16x16_c:                                 6501.8 ( 1.00x)
apply_bdof_8_16x16_avx2:                               570.0 (11.41x)
apply_bdof_10_8x16_c:                                 3286.5 ( 1.00x)
apply_bdof_10_8x16_avx2:                               461.7 ( 7.12x)
apply_bdof_10_16x8_c:                                 3274.5 ( 1.00x)
apply_bdof_10_16x8_avx2:                               271.4 (12.06x)
apply_bdof_10_16x16_c:                                6590.0 ( 1.00x)
apply_bdof_10_16x16_avx2:                              543.9 (12.12x)
apply_bdof_12_8x16_c:                                 3307.6 ( 1.00x)
apply_bdof_12_8x16_avx2:                               462.2 ( 7.16x)
apply_bdof_12_16x8_c:                                 3287.4 ( 1.00x)
apply_bdof_12_16x8_avx2:                               271.8 (12.10x)
apply_bdof_12_16x16_c:                                6465.7 ( 1.00x)
apply_bdof_12_16x16_avx2:                              543.8 (11.89x)

New benchmarks:
apply_bdof_8_8x16_c:                                  3255.7 ( 1.00x)
apply_bdof_8_8x16_avx2:                                349.3 ( 9.32x)
apply_bdof_8_16x8_c:                                  3262.5 ( 1.00x)
apply_bdof_8_16x8_avx2:                                214.8 (15.19x)
apply_bdof_8_16x16_c:                                 6471.6 ( 1.00x)
apply_bdof_8_16x16_avx2:                               429.8 (15.06x)
apply_bdof_10_8x16_c:                                 3227.7 ( 1.00x)
apply_bdof_10_8x16_avx2:                               321.6 (10.04x)
apply_bdof_10_16x8_c:                                 3250.2 ( 1.00x)
apply_bdof_10_16x8_avx2:                               201.2 (16.16x)
apply_bdof_10_16x16_c:                                6476.5 ( 1.00x)
apply_bdof_10_16x16_avx2:                              400.9 (16.16x)
apply_bdof_12_8x16_c:                                 3230.7 ( 1.00x)
apply_bdof_12_8x16_avx2:                               321.8 (10.04x)
apply_bdof_12_16x8_c:                                 3210.5 ( 1.00x)
apply_bdof_12_16x8_avx2:                               200.9 (15.98x)
apply_bdof_12_16x16_c:                                6474.5 ( 1.00x)
apply_bdof_12_16x16_avx2:                              400.2 (16.18x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:05:12 +01:00
Andreas Rheinhardt
19dc7b79a4 avcodec/x86/vvc/of: Unify shuffling
One can use the same shuffles for the width 8 and width 16
case if one also changes the permutation in vpermd (that always
follows pshufb for width 16).

This also allows loading it before checking width.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:03:22 +01:00
Andreas Rheinhardt
8e82416434 avcodec/x86/vvc/of: Avoid unused register
Avoids a push+pop.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:02:20 +01:00
Andreas Rheinhardt
81fb70c833 avcodec/x86/vvc/mc,dsp_init: Avoid pointless wrappers for w_avg
They only add overhead (in form of another function call,
sign-extending some parameters to 64bit (although the upper
bits are not used at all) and rederiving the actual number
of bits (from the maximum value (1<<bpp)-1)).

Old benchmarks:
w_avg_8_2x2_c:                                          16.4 ( 1.00x)
w_avg_8_2x2_avx2:                                       12.9 ( 1.27x)
w_avg_8_4x4_c:                                          48.0 ( 1.00x)
w_avg_8_4x4_avx2:                                       14.9 ( 3.23x)
w_avg_8_8x8_c:                                         168.2 ( 1.00x)
w_avg_8_8x8_avx2:                                       22.4 ( 7.49x)
w_avg_8_16x16_c:                                       396.5 ( 1.00x)
w_avg_8_16x16_avx2:                                     47.9 ( 8.28x)
w_avg_8_32x32_c:                                      1466.3 ( 1.00x)
w_avg_8_32x32_avx2:                                    172.8 ( 8.48x)
w_avg_8_64x64_c:                                      5629.3 ( 1.00x)
w_avg_8_64x64_avx2:                                    678.7 ( 8.29x)
w_avg_8_128x128_c:                                   22122.4 ( 1.00x)
w_avg_8_128x128_avx2:                                 2743.5 ( 8.06x)
w_avg_10_2x2_c:                                         18.7 ( 1.00x)
w_avg_10_2x2_avx2:                                      13.1 ( 1.43x)
w_avg_10_4x4_c:                                         50.3 ( 1.00x)
w_avg_10_4x4_avx2:                                      15.9 ( 3.17x)
w_avg_10_8x8_c:                                        109.3 ( 1.00x)
w_avg_10_8x8_avx2:                                      20.6 ( 5.30x)
w_avg_10_16x16_c:                                      395.5 ( 1.00x)
w_avg_10_16x16_avx2:                                    44.8 ( 8.83x)
w_avg_10_32x32_c:                                     1534.2 ( 1.00x)
w_avg_10_32x32_avx2:                                   141.4 (10.85x)
w_avg_10_64x64_c:                                     6003.6 ( 1.00x)
w_avg_10_64x64_avx2:                                   557.4 (10.77x)
w_avg_10_128x128_c:                                  23722.7 ( 1.00x)
w_avg_10_128x128_avx2:                                2205.0 (10.76x)
w_avg_12_2x2_c:                                         18.6 ( 1.00x)
w_avg_12_2x2_avx2:                                      13.1 ( 1.42x)
w_avg_12_4x4_c:                                         52.2 ( 1.00x)
w_avg_12_4x4_avx2:                                      16.1 ( 3.24x)
w_avg_12_8x8_c:                                        109.2 ( 1.00x)
w_avg_12_8x8_avx2:                                      20.6 ( 5.29x)
w_avg_12_16x16_c:                                      396.1 ( 1.00x)
w_avg_12_16x16_avx2:                                    45.0 ( 8.81x)
w_avg_12_32x32_c:                                     1532.6 ( 1.00x)
w_avg_12_32x32_avx2:                                   142.1 (10.79x)
w_avg_12_64x64_c:                                     6002.2 ( 1.00x)
w_avg_12_64x64_avx2:                                   557.3 (10.77x)
w_avg_12_128x128_c:                                  23748.7 ( 1.00x)
w_avg_12_128x128_avx2:                                2206.4 (10.76x)

New benchmarks:
w_avg_8_2x2_c:                                          16.0 ( 1.00x)
w_avg_8_2x2_avx2:                                        9.3 ( 1.71x)
w_avg_8_4x4_c:                                          48.4 ( 1.00x)
w_avg_8_4x4_avx2:                                       12.4 ( 3.91x)
w_avg_8_8x8_c:                                         168.7 ( 1.00x)
w_avg_8_8x8_avx2:                                       21.1 ( 8.00x)
w_avg_8_16x16_c:                                       394.5 ( 1.00x)
w_avg_8_16x16_avx2:                                     46.2 ( 8.54x)
w_avg_8_32x32_c:                                      1456.3 ( 1.00x)
w_avg_8_32x32_avx2:                                    171.8 ( 8.48x)
w_avg_8_64x64_c:                                      5636.2 ( 1.00x)
w_avg_8_64x64_avx2:                                    676.9 ( 8.33x)
w_avg_8_128x128_c:                                   22129.1 ( 1.00x)
w_avg_8_128x128_avx2:                                 2734.3 ( 8.09x)
w_avg_10_2x2_c:                                         18.7 ( 1.00x)
w_avg_10_2x2_avx2:                                      10.3 ( 1.82x)
w_avg_10_4x4_c:                                         50.8 ( 1.00x)
w_avg_10_4x4_avx2:                                      13.4 ( 3.79x)
w_avg_10_8x8_c:                                        109.7 ( 1.00x)
w_avg_10_8x8_avx2:                                      20.4 ( 5.38x)
w_avg_10_16x16_c:                                      395.2 ( 1.00x)
w_avg_10_16x16_avx2:                                    41.7 ( 9.48x)
w_avg_10_32x32_c:                                     1535.6 ( 1.00x)
w_avg_10_32x32_avx2:                                   137.9 (11.13x)
w_avg_10_64x64_c:                                     6002.1 ( 1.00x)
w_avg_10_64x64_avx2:                                   548.5 (10.94x)
w_avg_10_128x128_c:                                  23742.7 ( 1.00x)
w_avg_10_128x128_avx2:                                2179.8 (10.89x)
w_avg_12_2x2_c:                                         18.9 ( 1.00x)
w_avg_12_2x2_avx2:                                      10.3 ( 1.84x)
w_avg_12_4x4_c:                                         52.4 ( 1.00x)
w_avg_12_4x4_avx2:                                      13.4 ( 3.91x)
w_avg_12_8x8_c:                                        109.2 ( 1.00x)
w_avg_12_8x8_avx2:                                      20.3 ( 5.39x)
w_avg_12_16x16_c:                                      396.3 ( 1.00x)
w_avg_12_16x16_avx2:                                    41.7 ( 9.51x)
w_avg_12_32x32_c:                                     1532.6 ( 1.00x)
w_avg_12_32x32_avx2:                                   138.6 (11.06x)
w_avg_12_64x64_c:                                     5996.7 ( 1.00x)
w_avg_12_64x64_avx2:                                   549.6 (10.91x)
w_avg_12_128x128_c:                                  23738.0 ( 1.00x)
w_avg_12_128x128_avx2:                                2177.2 (10.90x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 01:01:27 +01:00
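
The round trip criticized in 81fb70c833 can be shown with a short C sketch. The function names are hypothetical, and the loop is a portable stand-in; the assembly presumably uses a bit-scan or similar instruction instead.

```c
/* What a wrapper passes down: the maximum pixel value for a bit depth. */
static unsigned pixel_max_from_bpp(int bpp)
{
    return (1u << bpp) - 1;
}

/* What the callee then has to undo: recover bpp from (1 << bpp) - 1
 * by counting the bits of the all-ones value. */
static int bpp_from_pixel_max(unsigned pixel_max)
{
    int bpp = 0;
    while (pixel_max) {
        bpp++;
        pixel_max >>= 1;
    }
    return bpp;
}
```

A per-bpp entry point knows the bit depth at assembly time, so neither conversion has to happen at run time.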
Andreas Rheinhardt
ea78402e9c avcodec/x86/vvc/mc,dsp_init: Avoid pointless wrappers for avg
Up until now, there were two averaging assembly functions,
one for eight bit content and one for <=16 bit content;
there are also three C-wrappers around these functions,
for 8, 10 and 12 bpp. These wrappers simply forward the
maximum permissible value (i.e. (1<<bpp)-1) and promote
some integer values to ptrdiff_t.

Yet these wrappers are absolutely useless: The assembly functions
rederive the bpp from the maximum and only the integer part
of the promoted ptrdiff_t values is ever used. Of course,
these wrappers also entail an additional call (not a tail call,
because the additional maximum parameter is passed on the stack).

Remove the wrappers and add per-bpp assembly functions instead.
Given that the only difference between 10 and 12 bits are some
constants in registers, the main part of these functions can be
shared (given that this code uses a jumptable, it can even
be done without adding any additional jump).

Old benchmarks:
avg_8_2x2_c:                                            11.4 ( 1.00x)
avg_8_2x2_avx2:                                          7.9 ( 1.44x)
avg_8_4x4_c:                                            30.7 ( 1.00x)
avg_8_4x4_avx2:                                         10.4 ( 2.95x)
avg_8_8x8_c:                                           134.5 ( 1.00x)
avg_8_8x8_avx2:                                         16.6 ( 8.12x)
avg_8_16x16_c:                                         255.6 ( 1.00x)
avg_8_16x16_avx2:                                       28.2 ( 9.07x)
avg_8_32x32_c:                                         897.7 ( 1.00x)
avg_8_32x32_avx2:                                       83.9 (10.70x)
avg_8_64x64_c:                                        3320.0 ( 1.00x)
avg_8_64x64_avx2:                                      321.1 (10.34x)
avg_8_128x128_c:                                     12981.8 ( 1.00x)
avg_8_128x128_avx2:                                   1480.1 ( 8.77x)
avg_10_2x2_c:                                           12.0 ( 1.00x)
avg_10_2x2_avx2:                                         8.4 ( 1.43x)
avg_10_4x4_c:                                           34.9 ( 1.00x)
avg_10_4x4_avx2:                                         9.8 ( 3.56x)
avg_10_8x8_c:                                           76.8 ( 1.00x)
avg_10_8x8_avx2:                                        15.1 ( 5.08x)
avg_10_16x16_c:                                        256.6 ( 1.00x)
avg_10_16x16_avx2:                                      25.1 (10.20x)
avg_10_32x32_c:                                        932.9 ( 1.00x)
avg_10_32x32_avx2:                                      73.4 (12.72x)
avg_10_64x64_c:                                       3517.9 ( 1.00x)
avg_10_64x64_avx2:                                     414.8 ( 8.48x)
avg_10_128x128_c:                                    13695.3 ( 1.00x)
avg_10_128x128_avx2:                                  1648.1 ( 8.31x)
avg_12_2x2_c:                                           13.1 ( 1.00x)
avg_12_2x2_avx2:                                         8.6 ( 1.53x)
avg_12_4x4_c:                                           35.4 ( 1.00x)
avg_12_4x4_avx2:                                        10.1 ( 3.49x)
avg_12_8x8_c:                                           76.6 ( 1.00x)
avg_12_8x8_avx2:                                        16.7 ( 4.60x)
avg_12_16x16_c:                                        256.6 ( 1.00x)
avg_12_16x16_avx2:                                      25.5 (10.07x)
avg_12_32x32_c:                                        933.2 ( 1.00x)
avg_12_32x32_avx2:                                      75.7 (12.34x)
avg_12_64x64_c:                                       3519.1 ( 1.00x)
avg_12_64x64_avx2:                                     416.8 ( 8.44x)
avg_12_128x128_c:                                    13695.1 ( 1.00x)
avg_12_128x128_avx2:                                  1651.6 ( 8.29x)

New benchmarks:
avg_8_2x2_c:                                            11.5 ( 1.00x)
avg_8_2x2_avx2:                                          6.0 ( 1.91x)
avg_8_4x4_c:                                            29.7 ( 1.00x)
avg_8_4x4_avx2:                                          8.0 ( 3.72x)
avg_8_8x8_c:                                           131.4 ( 1.00x)
avg_8_8x8_avx2:                                         12.2 (10.74x)
avg_8_16x16_c:                                         254.3 ( 1.00x)
avg_8_16x16_avx2:                                       24.8 (10.25x)
avg_8_32x32_c:                                         897.7 ( 1.00x)
avg_8_32x32_avx2:                                       77.8 (11.54x)
avg_8_64x64_c:                                        3321.3 ( 1.00x)
avg_8_64x64_avx2:                                      318.7 (10.42x)
avg_8_128x128_c:                                     12988.4 ( 1.00x)
avg_8_128x128_avx2:                                   1430.1 ( 9.08x)
avg_10_2x2_c:                                           12.1 ( 1.00x)
avg_10_2x2_avx2:                                         5.7 ( 2.13x)
avg_10_4x4_c:                                           35.0 ( 1.00x)
avg_10_4x4_avx2:                                         9.0 ( 3.88x)
avg_10_8x8_c:                                           77.2 ( 1.00x)
avg_10_8x8_avx2:                                        12.4 ( 6.24x)
avg_10_16x16_c:                                        256.2 ( 1.00x)
avg_10_16x16_avx2:                                      24.3 (10.56x)
avg_10_32x32_c:                                        932.9 ( 1.00x)
avg_10_32x32_avx2:                                      71.9 (12.97x)
avg_10_64x64_c:                                       3516.8 ( 1.00x)
avg_10_64x64_avx2:                                     414.7 ( 8.48x)
avg_10_128x128_c:                                    13693.7 ( 1.00x)
avg_10_128x128_avx2:                                  1609.3 ( 8.51x)
avg_12_2x2_c:                                           14.1 ( 1.00x)
avg_12_2x2_avx2:                                         5.7 ( 2.48x)
avg_12_4x4_c:                                           35.8 ( 1.00x)
avg_12_4x4_avx2:                                         9.0 ( 3.96x)
avg_12_8x8_c:                                           76.9 ( 1.00x)
avg_12_8x8_avx2:                                        12.4 ( 6.22x)
avg_12_16x16_c:                                        256.5 ( 1.00x)
avg_12_16x16_avx2:                                      24.4 (10.50x)
avg_12_32x32_c:                                        934.1 ( 1.00x)
avg_12_32x32_avx2:                                      72.0 (12.97x)
avg_12_64x64_c:                                       3518.2 ( 1.00x)
avg_12_64x64_avx2:                                     414.8 ( 8.48x)
avg_12_128x128_c:                                    13689.5 ( 1.00x)
avg_12_128x128_avx2:                                  1611.1 ( 8.50x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:58:33 +01:00
Andreas Rheinhardt
5a60b3f1a6 avcodec/x86/vvc/mc: Remove always-false branches
The C versions of the average and weighted average functions
contain "FFMAX(3, 15 - BIT_DEPTH)" and the code here followed
this; yet it is only instantiated for bit depths 8, 10 and 12,
for which the above is just 15-BIT_DEPTH. So the comparisons
are unnecessary.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
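
The claim in 5a60b3f1a6 is easy to verify with a few lines of C. FFMAX is written out the way libavutil defines it; the two helper functions are hypothetical names for the generic and the simplified shift.

```c
/* Maximum of two values, written as in libavutil. */
#define FFMAX(a, b) ((a) > (b) ? (a) : (b))

/* Shift as written in the C versions of the (weighted) average functions. */
static int avg_shift_generic(int bit_depth)
{
    return FFMAX(3, 15 - bit_depth);
}

/* Shift without the comparison; only valid while 15 - bit_depth >= 3. */
static int avg_shift_simplified(int bit_depth)
{
    return 15 - bit_depth;
}
```

For bit depths 8, 10 and 12 both functions return 7, 5 and 3 respectively. They would only differ for bit depths above 12, which are not instantiated here.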
Andreas Rheinhardt
59f8ff4c18 avcodec/x86/vvc/mc: Remove unused constants
Also avoid overaligning .rodata.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
eabf52e787 avcodec/x86/vvc/mc: Avoid unused work
The high quadword of these registers is zero for width 2.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
9317fb2b2e avcodec/x86/vvc/mc: Avoid ymm registers where possible
Widths 2 and 4 fit into xmm registers.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
caa0ae0cfb avcodec/x86/vvc/mc: Avoid pextr[dq], v{insert,extract}i128
Use mov[dq], movdqu instead if the least significant parts
are set (i.e. if the immediate value is 0x0).

Old benchmarks:
avg_8_2x2_c:                                            11.3 ( 1.00x)
avg_8_2x2_avx2:                                          7.5 ( 1.50x)
avg_8_4x4_c:                                            31.2 ( 1.00x)
avg_8_4x4_avx2:                                         10.7 ( 2.91x)
avg_8_8x8_c:                                           133.5 ( 1.00x)
avg_8_8x8_avx2:                                         21.2 ( 6.30x)
avg_8_16x16_c:                                         254.7 ( 1.00x)
avg_8_16x16_avx2:                                       30.1 ( 8.46x)
avg_8_32x32_c:                                         896.9 ( 1.00x)
avg_8_32x32_avx2:                                      103.9 ( 8.63x)
avg_8_64x64_c:                                        3320.7 ( 1.00x)
avg_8_64x64_avx2:                                      539.4 ( 6.16x)
avg_8_128x128_c:                                     12991.5 ( 1.00x)
avg_8_128x128_avx2:                                   1661.3 ( 7.82x)
avg_10_2x2_c:                                           21.3 ( 1.00x)
avg_10_2x2_avx2:                                         8.3 ( 2.55x)
avg_10_4x4_c:                                           34.9 ( 1.00x)
avg_10_4x4_avx2:                                        10.6 ( 3.28x)
avg_10_8x8_c:                                           76.3 ( 1.00x)
avg_10_8x8_avx2:                                        20.2 ( 3.77x)
avg_10_16x16_c:                                        255.9 ( 1.00x)
avg_10_16x16_avx2:                                      24.1 (10.60x)
avg_10_32x32_c:                                        932.4 ( 1.00x)
avg_10_32x32_avx2:                                      73.3 (12.72x)
avg_10_64x64_c:                                       3516.4 ( 1.00x)
avg_10_64x64_avx2:                                     601.7 ( 5.84x)
avg_10_128x128_c:                                    13690.6 ( 1.00x)
avg_10_128x128_avx2:                                  1613.2 ( 8.49x)
avg_12_2x2_c:                                           14.0 ( 1.00x)
avg_12_2x2_avx2:                                         8.3 ( 1.67x)
avg_12_4x4_c:                                           35.3 ( 1.00x)
avg_12_4x4_avx2:                                        10.9 ( 3.26x)
avg_12_8x8_c:                                           76.5 ( 1.00x)
avg_12_8x8_avx2:                                        20.3 ( 3.77x)
avg_12_16x16_c:                                        256.7 ( 1.00x)
avg_12_16x16_avx2:                                      24.1 (10.63x)
avg_12_32x32_c:                                        932.5 ( 1.00x)
avg_12_32x32_avx2:                                      73.3 (12.72x)
avg_12_64x64_c:                                       3520.5 ( 1.00x)
avg_12_64x64_avx2:                                     602.6 ( 5.84x)
avg_12_128x128_c:                                    13689.6 ( 1.00x)
avg_12_128x128_avx2:                                  1613.1 ( 8.49x)
w_avg_8_2x2_c:                                          16.7 ( 1.00x)
w_avg_8_2x2_avx2:                                       13.4 ( 1.25x)
w_avg_8_4x4_c:                                          44.5 ( 1.00x)
w_avg_8_4x4_avx2:                                       15.9 ( 2.81x)
w_avg_8_8x8_c:                                         166.1 ( 1.00x)
w_avg_8_8x8_avx2:                                       45.7 ( 3.63x)
w_avg_8_16x16_c:                                       392.9 ( 1.00x)
w_avg_8_16x16_avx2:                                     57.8 ( 6.80x)
w_avg_8_32x32_c:                                      1455.5 ( 1.00x)
w_avg_8_32x32_avx2:                                    215.0 ( 6.77x)
w_avg_8_64x64_c:                                      5621.8 ( 1.00x)
w_avg_8_64x64_avx2:                                    875.2 ( 6.42x)
w_avg_8_128x128_c:                                   22131.3 ( 1.00x)
w_avg_8_128x128_avx2:                                 3390.1 ( 6.53x)
w_avg_10_2x2_c:                                         18.0 ( 1.00x)
w_avg_10_2x2_avx2:                                      14.0 ( 1.28x)
w_avg_10_4x4_c:                                         53.9 ( 1.00x)
w_avg_10_4x4_avx2:                                      15.9 ( 3.40x)
w_avg_10_8x8_c:                                        109.5 ( 1.00x)
w_avg_10_8x8_avx2:                                      40.4 ( 2.71x)
w_avg_10_16x16_c:                                      395.7 ( 1.00x)
w_avg_10_16x16_avx2:                                    44.7 ( 8.86x)
w_avg_10_32x32_c:                                     1532.7 ( 1.00x)
w_avg_10_32x32_avx2:                                   142.4 (10.77x)
w_avg_10_64x64_c:                                     6007.7 ( 1.00x)
w_avg_10_64x64_avx2:                                   745.5 ( 8.06x)
w_avg_10_128x128_c:                                  23719.7 ( 1.00x)
w_avg_10_128x128_avx2:                                2217.7 (10.70x)
w_avg_12_2x2_c:                                         18.9 ( 1.00x)
w_avg_12_2x2_avx2:                                      13.6 ( 1.38x)
w_avg_12_4x4_c:                                         47.5 ( 1.00x)
w_avg_12_4x4_avx2:                                      15.9 ( 2.99x)
w_avg_12_8x8_c:                                        109.3 ( 1.00x)
w_avg_12_8x8_avx2:                                      40.9 ( 2.67x)
w_avg_12_16x16_c:                                      395.6 ( 1.00x)
w_avg_12_16x16_avx2:                                    44.8 ( 8.84x)
w_avg_12_32x32_c:                                     1531.0 ( 1.00x)
w_avg_12_32x32_avx2:                                   141.8 (10.80x)
w_avg_12_64x64_c:                                     6016.7 ( 1.00x)
w_avg_12_64x64_avx2:                                   732.8 ( 8.21x)
w_avg_12_128x128_c:                                  23762.2 ( 1.00x)
w_avg_12_128x128_avx2:                                2223.4 (10.69x)

New benchmarks:
avg_8_2x2_c:                                            11.3 ( 1.00x)
avg_8_2x2_avx2:                                          7.6 ( 1.49x)
avg_8_4x4_c:                                            31.2 ( 1.00x)
avg_8_4x4_avx2:                                         10.8 ( 2.89x)
avg_8_8x8_c:                                           131.6 ( 1.00x)
avg_8_8x8_avx2:                                         15.6 ( 8.42x)
avg_8_16x16_c:                                         255.3 ( 1.00x)
avg_8_16x16_avx2:                                       27.9 ( 9.16x)
avg_8_32x32_c:                                         897.9 ( 1.00x)
avg_8_32x32_avx2:                                       81.2 (11.06x)
avg_8_64x64_c:                                        3320.0 ( 1.00x)
avg_8_64x64_avx2:                                      335.1 ( 9.91x)
avg_8_128x128_c:                                     12999.1 ( 1.00x)
avg_8_128x128_avx2:                                   1456.3 ( 8.93x)
avg_10_2x2_c:                                           12.0 ( 1.00x)
avg_10_2x2_avx2:                                         8.6 ( 1.40x)
avg_10_4x4_c:                                           34.9 ( 1.00x)
avg_10_4x4_avx2:                                         9.7 ( 3.61x)
avg_10_8x8_c:                                           76.7 ( 1.00x)
avg_10_8x8_avx2:                                        16.3 ( 4.69x)
avg_10_16x16_c:                                        256.3 ( 1.00x)
avg_10_16x16_avx2:                                      25.2 (10.18x)
avg_10_32x32_c:                                        932.8 ( 1.00x)
avg_10_32x32_avx2:                                      73.3 (12.72x)
avg_10_64x64_c:                                       3518.8 ( 1.00x)
avg_10_64x64_avx2:                                     416.8 ( 8.44x)
avg_10_128x128_c:                                    13691.6 ( 1.00x)
avg_10_128x128_avx2:                                  1612.9 ( 8.49x)
avg_12_2x2_c:                                           14.1 ( 1.00x)
avg_12_2x2_avx2:                                         8.7 ( 1.62x)
avg_12_4x4_c:                                           35.7 ( 1.00x)
avg_12_4x4_avx2:                                         9.7 ( 3.68x)
avg_12_8x8_c:                                           77.0 ( 1.00x)
avg_12_8x8_avx2:                                        16.9 ( 4.57x)
avg_12_16x16_c:                                        256.2 ( 1.00x)
avg_12_16x16_avx2:                                      25.7 ( 9.96x)
avg_12_32x32_c:                                        933.5 ( 1.00x)
avg_12_32x32_avx2:                                      74.0 (12.62x)
avg_12_64x64_c:                                       3516.4 ( 1.00x)
avg_12_64x64_avx2:                                     408.7 ( 8.60x)
avg_12_128x128_c:                                    13691.6 ( 1.00x)
avg_12_128x128_avx2:                                  1613.8 ( 8.48x)
w_avg_8_2x2_c:                                          16.7 ( 1.00x)
w_avg_8_2x2_avx2:                                       14.0 ( 1.19x)
w_avg_8_4x4_c:                                          48.2 ( 1.00x)
w_avg_8_4x4_avx2:                                       16.1 ( 3.00x)
w_avg_8_8x8_c:                                         168.0 ( 1.00x)
w_avg_8_8x8_avx2:                                       22.5 ( 7.47x)
w_avg_8_16x16_c:                                       392.5 ( 1.00x)
w_avg_8_16x16_avx2:                                     47.9 ( 8.19x)
w_avg_8_32x32_c:                                      1453.7 ( 1.00x)
w_avg_8_32x32_avx2:                                    176.1 ( 8.26x)
w_avg_8_64x64_c:                                      5631.4 ( 1.00x)
w_avg_8_64x64_avx2:                                    690.8 ( 8.15x)
w_avg_8_128x128_c:                                   22139.5 ( 1.00x)
w_avg_8_128x128_avx2:                                 2742.4 ( 8.07x)
w_avg_10_2x2_c:                                         18.1 ( 1.00x)
w_avg_10_2x2_avx2:                                      13.8 ( 1.31x)
w_avg_10_4x4_c:                                         47.0 ( 1.00x)
w_avg_10_4x4_avx2:                                      16.4 ( 2.87x)
w_avg_10_8x8_c:                                        110.0 ( 1.00x)
w_avg_10_8x8_avx2:                                      21.6 ( 5.09x)
w_avg_10_16x16_c:                                      395.2 ( 1.00x)
w_avg_10_16x16_avx2:                                    45.4 ( 8.71x)
w_avg_10_32x32_c:                                     1533.8 ( 1.00x)
w_avg_10_32x32_avx2:                                   142.6 (10.76x)
w_avg_10_64x64_c:                                     6004.4 ( 1.00x)
w_avg_10_64x64_avx2:                                   672.8 ( 8.92x)
w_avg_10_128x128_c:                                  23748.5 ( 1.00x)
w_avg_10_128x128_avx2:                                2198.0 (10.80x)
w_avg_12_2x2_c:                                         17.2 ( 1.00x)
w_avg_12_2x2_avx2:                                      13.9 ( 1.24x)
w_avg_12_4x4_c:                                         51.4 ( 1.00x)
w_avg_12_4x4_avx2:                                      16.5 ( 3.11x)
w_avg_12_8x8_c:                                        109.1 ( 1.00x)
w_avg_12_8x8_avx2:                                      22.0 ( 4.96x)
w_avg_12_16x16_c:                                      395.9 ( 1.00x)
w_avg_12_16x16_avx2:                                    44.9 ( 8.81x)
w_avg_12_32x32_c:                                     1533.5 ( 1.00x)
w_avg_12_32x32_avx2:                                   142.3 (10.78x)
w_avg_12_64x64_c:                                     6002.0 ( 1.00x)
w_avg_12_64x64_avx2:                                   557.5 (10.77x)
w_avg_12_128x128_c:                                  23749.5 ( 1.00x)
w_avg_12_128x128_avx2:                                2202.0 (10.79x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00
Andreas Rheinhardt
7bf9c1e3f6 avcodec/x86/vvc/mc: Avoid redundant clipping for 8bit
It is already done by packuswb.
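
Why the clip is redundant can be illustrated with a scalar model of a
single packuswb lane. This is a sketch, not code from the patch:
packus_lane, clip_u8 and clip_is_redundant are hypothetical names, and the
model covers only the saturating signed 16-bit to unsigned 8-bit conversion
that packuswb performs per element.

```c
#include <stdint.h>

/* Scalar model of one packuswb lane: signed 16-bit input, unsigned
 * 8-bit output, saturated to [0, 255]. */
static uint8_t packus_lane(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Explicit clip to the 8-bit pixel range as a separate step. */
static int16_t clip_u8(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* Returns 1 if clipping before the pack never changes the result
 * for any 16-bit input, 0 otherwise. */
static int clip_is_redundant(void)
{
    for (int v = INT16_MIN; v <= INT16_MAX; v++)
        if (packus_lane(clip_u8((int16_t)v)) != packus_lane((int16_t)v))
            return 0;
    return 1;
}
```

This only holds for 8-bit output, where the pixel range coincides with the
saturation range of the pack; for 10 and 12 bit the maximum is not 255, so
a separate clip is still required there.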

Old benchmarks:
avg_8_2x2_c:                                            11.1 ( 1.00x)
avg_8_2x2_avx2:                                          8.6 ( 1.28x)
avg_8_4x4_c:                                            30.0 ( 1.00x)
avg_8_4x4_avx2:                                         10.8 ( 2.78x)
avg_8_8x8_c:                                           132.0 ( 1.00x)
avg_8_8x8_avx2:                                         25.7 ( 5.14x)
avg_8_16x16_c:                                         254.6 ( 1.00x)
avg_8_16x16_avx2:                                       33.2 ( 7.67x)
avg_8_32x32_c:                                         897.5 ( 1.00x)
avg_8_32x32_avx2:                                      115.6 ( 7.76x)
avg_8_64x64_c:                                        3316.9 ( 1.00x)
avg_8_64x64_avx2:                                      626.5 ( 5.29x)
avg_8_128x128_c:                                     12973.6 ( 1.00x)
avg_8_128x128_avx2:                                   1914.0 ( 6.78x)
w_avg_8_2x2_c:                                          16.7 ( 1.00x)
w_avg_8_2x2_avx2:                                       14.4 ( 1.16x)
w_avg_8_4x4_c:                                          48.2 ( 1.00x)
w_avg_8_4x4_avx2:                                       16.5 ( 2.92x)
w_avg_8_8x8_c:                                         168.1 ( 1.00x)
w_avg_8_8x8_avx2:                                       49.7 ( 3.38x)
w_avg_8_16x16_c:                                       392.4 ( 1.00x)
w_avg_8_16x16_avx2:                                     61.1 ( 6.43x)
w_avg_8_32x32_c:                                      1455.3 ( 1.00x)
w_avg_8_32x32_avx2:                                    224.6 ( 6.48x)
w_avg_8_64x64_c:                                      5632.1 ( 1.00x)
w_avg_8_64x64_avx2:                                    896.9 ( 6.28x)
w_avg_8_128x128_c:                                   22136.3 ( 1.00x)
w_avg_8_128x128_avx2:                                 3626.7 ( 6.10x)

New benchmarks:
avg_8_2x2_c:                                            12.3 ( 1.00x)
avg_8_2x2_avx2:                                          8.1 ( 1.52x)
avg_8_4x4_c:                                            30.3 ( 1.00x)
avg_8_4x4_avx2:                                         11.3 ( 2.67x)
avg_8_8x8_c:                                           131.8 ( 1.00x)
avg_8_8x8_avx2:                                         21.3 ( 6.20x)
avg_8_16x16_c:                                         255.0 ( 1.00x)
avg_8_16x16_avx2:                                       30.6 ( 8.33x)
avg_8_32x32_c:                                         898.5 ( 1.00x)
avg_8_32x32_avx2:                                      104.9 ( 8.57x)
avg_8_64x64_c:                                        3317.7 ( 1.00x)
avg_8_64x64_avx2:                                      540.9 ( 6.13x)
avg_8_128x128_c:                                     12986.5 ( 1.00x)
avg_8_128x128_avx2:                                   1663.4 ( 7.81x)
w_avg_8_2x2_c:                                          16.8 ( 1.00x)
w_avg_8_2x2_avx2:                                       13.9 ( 1.21x)
w_avg_8_4x4_c:                                          48.2 ( 1.00x)
w_avg_8_4x4_avx2:                                       16.2 ( 2.98x)
w_avg_8_8x8_c:                                         168.6 ( 1.00x)
w_avg_8_8x8_avx2:                                       46.3 ( 3.64x)
w_avg_8_16x16_c:                                       392.4 ( 1.00x)
w_avg_8_16x16_avx2:                                     57.7 ( 6.80x)
w_avg_8_32x32_c:                                      1454.6 ( 1.00x)
w_avg_8_32x32_avx2:                                    214.6 ( 6.78x)
w_avg_8_64x64_c:                                      5638.4 ( 1.00x)
w_avg_8_64x64_avx2:                                    875.6 ( 6.44x)
w_avg_8_128x128_c:                                   22133.5 ( 1.00x)
w_avg_8_128x128_avx2:                                 3334.3 ( 6.64x)

Also saves 550B of .text here. The improvements will likely
be even better on Win64, because it avoids using two nonvolatile
registers in the weighted average case.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-22 00:57:56 +01:00