2921 commits

Author SHA1 Message Date
Andreas Rheinhardt
fe0d8cb3e4 avcodec/x86/dirac_dwt: Remove MMX in comment
Forgotten in 5e332fe35c.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-13 09:21:33 +01:00
Andreas Rheinhardt
3a7c09eb39 avcodec/x86/mpegvideoencdsp_init: Port draw_edges to SSSE3
Benchmarks:
draw_edges_8_1724_4_c:                                2672.2 ( 1.00x)
draw_edges_8_1724_4_mmx:                              3191.5 ( 0.84x)
draw_edges_8_1724_4_ssse3:                            2179.6 ( 1.23x)
draw_edges_8_1724_8_c:                                2852.3 ( 1.00x)
draw_edges_8_1724_8_mmx:                              3683.0 ( 0.77x)
draw_edges_8_1724_8_ssse3:                            2225.7 ( 1.28x)
draw_edges_8_1724_16_c:                               4169.4 ( 1.00x)
draw_edges_8_1724_16_mmx:                             4665.9 ( 0.89x)
draw_edges_8_1724_16_ssse3:                           2765.8 ( 1.51x)
draw_edges_128_407_4_c:                               1126.6 ( 1.00x)
draw_edges_128_407_4_mmx:                              943.9 ( 1.19x)
draw_edges_128_407_4_ssse3:                            925.7 ( 1.22x)
draw_edges_128_407_8_c:                               1208.8 ( 1.00x)
draw_edges_128_407_8_mmx:                             1119.1 ( 1.08x)
draw_edges_128_407_8_ssse3:                            997.8 ( 1.21x)
draw_edges_128_407_16_c:                              1352.4 ( 1.00x)
draw_edges_128_407_16_mmx:                            1368.7 ( 0.99x)
draw_edges_128_407_16_ssse3:                          1148.3 ( 1.18x)
draw_edges_1080_31_4_c:                                228.5 ( 1.00x)
draw_edges_1080_31_4_mmx:                              240.8 ( 0.95x)
draw_edges_1080_31_4_ssse3:                            226.7 ( 1.01x)
draw_edges_1080_31_8_c:                                411.1 ( 1.00x)
draw_edges_1080_31_8_mmx:                              432.9 ( 0.95x)
draw_edges_1080_31_8_ssse3:                            403.2 ( 1.02x)
draw_edges_1080_31_16_c:                              1121.2 ( 1.00x)
draw_edges_1080_31_16_mmx:                            1124.9 ( 1.00x)
draw_edges_1080_31_16_ssse3:                          1125.4 ( 1.00x)
draw_edges_1920_4_4_c:                                 310.8 ( 1.00x)
draw_edges_1920_4_4_mmx:                               311.6 ( 1.00x)
draw_edges_1920_4_4_ssse3:                             311.6 ( 1.00x)
draw_edges_1920_4_4_negstride_c:                       307.0 ( 1.00x)
draw_edges_1920_4_4_negstride_mmx:                     306.7 ( 1.00x)
draw_edges_1920_4_4_negstride_ssse3:                   306.7 ( 1.00x)
draw_edges_1920_4_8_c:                                 724.2 ( 1.00x)
draw_edges_1920_4_8_mmx:                               724.9 ( 1.00x)
draw_edges_1920_4_8_ssse3:                             717.3 ( 1.01x)
draw_edges_1920_4_8_negstride_c:                       719.2 ( 1.00x)
draw_edges_1920_4_8_negstride_mmx:                     717.1 ( 1.00x)
draw_edges_1920_4_8_negstride_ssse3:                   710.9 ( 1.01x)
draw_edges_1920_4_16_c:                               1752.9 ( 1.00x)
draw_edges_1920_4_16_mmx:                             1754.6 ( 1.00x)
draw_edges_1920_4_16_ssse3:                           1751.1 ( 1.00x)
draw_edges_1920_4_16_negstride_c:                     1783.2 ( 1.00x)
draw_edges_1920_4_16_negstride_mmx:                   1778.2 ( 1.00x)
draw_edges_1920_4_16_negstride_ssse3:                 1768.3 ( 1.01x)

Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-11 11:15:06 +01:00
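
The factor in parentheses in the benchmark tables above appears to be the timing of the C reference divided by the timing of the variant on that line, so values above 1.00x mean the variant is faster. This is inferred from the numbers, not from checkasm documentation; the helper name `speedup` is hypothetical. A minimal sketch:

```c
#include <assert.h>

/* Speedup of an optimized function relative to the C reference
 * (assumed reading of the "( 1.23x)" column): reference time
 * divided by the time of the optimized variant. */
static inline double speedup(double ref_time, double opt_time)
{
    assert(opt_time > 0.0);
    return ref_time / opt_time;
}
```

For draw_edges_8_1724_4, `speedup(2672.2, 2179.6)` is about 1.226, which matches the listed 1.23x, and `speedup(2672.2, 3191.5)` is about 0.837, which matches the 0.84x of the MMX version.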
Andreas Rheinhardt
436b74b725 avcodec/x86/hevc/dequant: Add SSSE3 dequant ASM function
hevc_dequant_4x4_8_c (GCC):                             20.2 ( 1.00x)
hevc_dequant_4x4_8_c (Clang):                           21.7 ( 1.00x)
hevc_dequant_4x4_8_ssse3:                                5.8 ( 3.51x)
hevc_dequant_8x8_8_c (GCC):                             32.9 ( 1.00x)
hevc_dequant_8x8_8_c (Clang):                           78.7 ( 1.00x)
hevc_dequant_8x8_8_ssse3:                                6.8 ( 4.83x)
hevc_dequant_16x16_8_c (GCC):                          105.1 ( 1.00x)
hevc_dequant_16x16_8_c (Clang):                        151.1 ( 1.00x)
hevc_dequant_16x16_8_ssse3:                             19.3 ( 5.45x)
hevc_dequant_32x32_8_c (GCC):                          415.7 ( 1.00x)
hevc_dequant_32x32_8_c (Clang):                        602.3 ( 1.00x)
hevc_dequant_32x32_8_ssse3:                             78.2 ( 5.32x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 12:25:33 +01:00
Andreas Rheinhardt
2729c52988 avcodec/x86/hevc/deblock: Reduce usage of GPRs
Don't use two GPRs to store two words from xmm registers;
shuffle these words so that they fit into one GPR.
This reduces the number of GPRs used and leads to tiny speedups
here. Also avoid rex prefixes whenever possible (for lines
that needed to be modified anyway).

Old benchmarks:
hevc_h_loop_filter_luma8_skip_c:                        23.8 ( 1.00x)
hevc_h_loop_filter_luma8_skip_sse2:                      8.5 ( 2.80x)
hevc_h_loop_filter_luma8_skip_ssse3:                     7.2 ( 3.29x)
hevc_h_loop_filter_luma8_skip_avx:                       6.4 ( 3.71x)
hevc_h_loop_filter_luma8_strong_c:                     150.4 ( 1.00x)
hevc_h_loop_filter_luma8_strong_sse2:                   34.4 ( 4.37x)
hevc_h_loop_filter_luma8_strong_ssse3:                  34.5 ( 4.36x)
hevc_h_loop_filter_luma8_strong_avx:                    32.3 ( 4.65x)
hevc_h_loop_filter_luma8_weak_c:                       103.2 ( 1.00x)
hevc_h_loop_filter_luma8_weak_sse2:                     34.5 ( 2.99x)
hevc_h_loop_filter_luma8_weak_ssse3:                     7.3 (14.22x)
hevc_h_loop_filter_luma8_weak_avx:                      32.4 ( 3.18x)
hevc_h_loop_filter_luma10_skip_c:                       23.5 ( 1.00x)
hevc_h_loop_filter_luma10_skip_sse2:                     6.6 ( 3.58x)
hevc_h_loop_filter_luma10_skip_ssse3:                    6.1 ( 3.86x)
hevc_h_loop_filter_luma10_skip_avx:                      5.4 ( 4.34x)
hevc_h_loop_filter_luma10_strong_c:                    161.8 ( 1.00x)
hevc_h_loop_filter_luma10_strong_sse2:                  32.2 ( 5.03x)
hevc_h_loop_filter_luma10_strong_ssse3:                 30.4 ( 5.33x)
hevc_h_loop_filter_luma10_strong_avx:                   30.3 ( 5.33x)
hevc_h_loop_filter_luma10_weak_c:                       23.5 ( 1.00x)
hevc_h_loop_filter_luma10_weak_sse2:                     6.6 ( 3.58x)
hevc_h_loop_filter_luma10_weak_ssse3:                    6.1 ( 3.85x)
hevc_h_loop_filter_luma10_weak_avx:                      5.4 ( 4.35x)
hevc_h_loop_filter_luma12_skip_c:                       18.8 ( 1.00x)
hevc_h_loop_filter_luma12_skip_sse2:                     6.6 ( 2.87x)
hevc_h_loop_filter_luma12_skip_ssse3:                    6.1 ( 3.08x)
hevc_h_loop_filter_luma12_skip_avx:                      6.2 ( 3.06x)
hevc_h_loop_filter_luma12_strong_c:                    159.0 ( 1.00x)
hevc_h_loop_filter_luma12_strong_sse2:                  36.3 ( 4.38x)
hevc_h_loop_filter_luma12_strong_ssse3:                 36.1 ( 4.40x)
hevc_h_loop_filter_luma12_strong_avx:                   33.5 ( 4.75x)
hevc_h_loop_filter_luma12_weak_c:                       40.1 ( 1.00x)
hevc_h_loop_filter_luma12_weak_sse2:                    35.5 ( 1.13x)
hevc_h_loop_filter_luma12_weak_ssse3:                   36.1 ( 1.11x)
hevc_h_loop_filter_luma12_weak_avx:                      6.2 ( 6.52x)
hevc_v_loop_filter_luma8_skip_c:                        25.5 ( 1.00x)
hevc_v_loop_filter_luma8_skip_sse2:                     10.6 ( 2.40x)
hevc_v_loop_filter_luma8_skip_ssse3:                    11.4 ( 2.24x)
hevc_v_loop_filter_luma8_skip_avx:                       8.3 ( 3.07x)
hevc_v_loop_filter_luma8_strong_c:                     146.8 ( 1.00x)
hevc_v_loop_filter_luma8_strong_sse2:                   43.9 ( 3.35x)
hevc_v_loop_filter_luma8_strong_ssse3:                  43.7 ( 3.36x)
hevc_v_loop_filter_luma8_strong_avx:                    42.3 ( 3.47x)
hevc_v_loop_filter_luma8_weak_c:                        25.5 ( 1.00x)
hevc_v_loop_filter_luma8_weak_sse2:                     10.6 ( 2.40x)
hevc_v_loop_filter_luma8_weak_ssse3:                    44.0 ( 0.58x)
hevc_v_loop_filter_luma8_weak_avx:                       8.3 ( 3.09x)
hevc_v_loop_filter_luma10_skip_c:                       20.0 ( 1.00x)
hevc_v_loop_filter_luma10_skip_sse2:                    11.3 ( 1.77x)
hevc_v_loop_filter_luma10_skip_ssse3:                   11.0 ( 1.82x)
hevc_v_loop_filter_luma10_skip_avx:                      9.3 ( 2.15x)
hevc_v_loop_filter_luma10_strong_c:                    193.5 ( 1.00x)
hevc_v_loop_filter_luma10_strong_sse2:                  46.1 ( 4.19x)
hevc_v_loop_filter_luma10_strong_ssse3:                 44.2 ( 4.38x)
hevc_v_loop_filter_luma10_strong_avx:                   44.4 ( 4.35x)
hevc_v_loop_filter_luma10_weak_c:                       90.3 ( 1.00x)
hevc_v_loop_filter_luma10_weak_sse2:                    46.3 ( 1.95x)
hevc_v_loop_filter_luma10_weak_ssse3:                   10.8 ( 8.37x)
hevc_v_loop_filter_luma10_weak_avx:                     44.4 ( 2.03x)
hevc_v_loop_filter_luma12_skip_c:                       16.8 ( 1.00x)
hevc_v_loop_filter_luma12_skip_sse2:                    11.8 ( 1.42x)
hevc_v_loop_filter_luma12_skip_ssse3:                   11.7 ( 1.43x)
hevc_v_loop_filter_luma12_skip_avx:                      8.7 ( 1.93x)
hevc_v_loop_filter_luma12_strong_c:                    159.3 ( 1.00x)
hevc_v_loop_filter_luma12_strong_sse2:                  45.3 ( 3.52x)
hevc_v_loop_filter_luma12_strong_ssse3:                 60.3 ( 2.64x)
hevc_v_loop_filter_luma12_strong_avx:                   44.1 ( 3.61x)
hevc_v_loop_filter_luma12_weak_c:                       63.6 ( 1.00x)
hevc_v_loop_filter_luma12_weak_sse2:                    45.3 ( 1.40x)
hevc_v_loop_filter_luma12_weak_ssse3:                   11.7 ( 5.41x)
hevc_v_loop_filter_luma12_weak_avx:                     43.9 ( 1.45x)

New benchmarks:
hevc_h_loop_filter_luma8_skip_c:                        24.2 ( 1.00x)
hevc_h_loop_filter_luma8_skip_sse2:                      8.6 ( 2.82x)
hevc_h_loop_filter_luma8_skip_ssse3:                     7.0 ( 3.46x)
hevc_h_loop_filter_luma8_skip_avx:                       6.8 ( 3.54x)
hevc_h_loop_filter_luma8_strong_c:                     150.4 ( 1.00x)
hevc_h_loop_filter_luma8_strong_sse2:                   33.3 ( 4.52x)
hevc_h_loop_filter_luma8_strong_ssse3:                  32.7 ( 4.61x)
hevc_h_loop_filter_luma8_strong_avx:                    32.7 ( 4.60x)
hevc_h_loop_filter_luma8_weak_c:                       104.0 ( 1.00x)
hevc_h_loop_filter_luma8_weak_sse2:                     33.2 ( 3.13x)
hevc_h_loop_filter_luma8_weak_ssse3:                     7.0 (14.91x)
hevc_h_loop_filter_luma8_weak_avx:                      31.3 ( 3.32x)
hevc_h_loop_filter_luma10_skip_c:                       19.2 ( 1.00x)
hevc_h_loop_filter_luma10_skip_sse2:                     6.2 ( 3.08x)
hevc_h_loop_filter_luma10_skip_ssse3:                    6.2 ( 3.08x)
hevc_h_loop_filter_luma10_skip_avx:                      5.0 ( 3.85x)
hevc_h_loop_filter_luma10_strong_c:                    159.8 ( 1.00x)
hevc_h_loop_filter_luma10_strong_sse2:                  30.0 ( 5.32x)
hevc_h_loop_filter_luma10_strong_ssse3:                 29.2 ( 5.48x)
hevc_h_loop_filter_luma10_strong_avx:                   28.6 ( 5.58x)
hevc_h_loop_filter_luma10_weak_c:                       19.2 ( 1.00x)
hevc_h_loop_filter_luma10_weak_sse2:                     6.2 ( 3.09x)
hevc_h_loop_filter_luma10_weak_ssse3:                    6.2 ( 3.09x)
hevc_h_loop_filter_luma10_weak_avx:                      5.0 ( 3.88x)
hevc_h_loop_filter_luma12_skip_c:                       18.7 ( 1.00x)
hevc_h_loop_filter_luma12_skip_sse2:                     6.2 ( 3.00x)
hevc_h_loop_filter_luma12_skip_ssse3:                    5.7 ( 3.27x)
hevc_h_loop_filter_luma12_skip_avx:                      5.2 ( 3.61x)
hevc_h_loop_filter_luma12_strong_c:                    160.2 ( 1.00x)
hevc_h_loop_filter_luma12_strong_sse2:                  34.2 ( 4.68x)
hevc_h_loop_filter_luma12_strong_ssse3:                 29.3 ( 5.48x)
hevc_h_loop_filter_luma12_strong_avx:                   31.4 ( 5.10x)
hevc_h_loop_filter_luma12_weak_c:                       40.2 ( 1.00x)
hevc_h_loop_filter_luma12_weak_sse2:                    35.2 ( 1.14x)
hevc_h_loop_filter_luma12_weak_ssse3:                   29.3 ( 1.37x)
hevc_h_loop_filter_luma12_weak_avx:                      5.0 ( 8.09x)
hevc_v_loop_filter_luma8_skip_c:                        25.6 ( 1.00x)
hevc_v_loop_filter_luma8_skip_sse2:                     10.2 ( 2.52x)
hevc_v_loop_filter_luma8_skip_ssse3:                    10.5 ( 2.45x)
hevc_v_loop_filter_luma8_skip_avx:                       8.2 ( 3.11x)
hevc_v_loop_filter_luma8_strong_c:                     147.1 ( 1.00x)
hevc_v_loop_filter_luma8_strong_sse2:                   42.6 ( 3.45x)
hevc_v_loop_filter_luma8_strong_ssse3:                  42.4 ( 3.47x)
hevc_v_loop_filter_luma8_strong_avx:                    40.1 ( 3.67x)
hevc_v_loop_filter_luma8_weak_c:                        25.6 ( 1.00x)
hevc_v_loop_filter_luma8_weak_sse2:                     10.6 ( 2.42x)
hevc_v_loop_filter_luma8_weak_ssse3:                    42.7 ( 0.60x)
hevc_v_loop_filter_luma8_weak_avx:                       8.2 ( 3.11x)
hevc_v_loop_filter_luma10_skip_c:                       16.7 ( 1.00x)
hevc_v_loop_filter_luma10_skip_sse2:                    11.0 ( 1.52x)
hevc_v_loop_filter_luma10_skip_ssse3:                   10.5 ( 1.59x)
hevc_v_loop_filter_luma10_skip_avx:                      9.6 ( 1.74x)
hevc_v_loop_filter_luma10_strong_c:                    190.0 ( 1.00x)
hevc_v_loop_filter_luma10_strong_sse2:                  44.8 ( 4.24x)
hevc_v_loop_filter_luma10_strong_ssse3:                 42.3 ( 4.49x)
hevc_v_loop_filter_luma10_strong_avx:                   42.5 ( 4.47x)
hevc_v_loop_filter_luma10_weak_c:                       88.3 ( 1.00x)
hevc_v_loop_filter_luma10_weak_sse2:                    45.7 ( 1.93x)
hevc_v_loop_filter_luma10_weak_ssse3:                   10.5 ( 8.40x)
hevc_v_loop_filter_luma10_weak_avx:                     42.4 ( 2.09x)
hevc_v_loop_filter_luma12_skip_c:                       16.7 ( 1.00x)
hevc_v_loop_filter_luma12_skip_sse2:                    11.7 ( 1.42x)
hevc_v_loop_filter_luma12_skip_ssse3:                   10.5 ( 1.59x)
hevc_v_loop_filter_luma12_skip_avx:                      8.8 ( 1.90x)
hevc_v_loop_filter_luma12_strong_c:                    159.4 ( 1.00x)
hevc_v_loop_filter_luma12_strong_sse2:                  45.2 ( 3.53x)
hevc_v_loop_filter_luma12_strong_ssse3:                 59.3 ( 2.69x)
hevc_v_loop_filter_luma12_strong_avx:                   41.7 ( 3.82x)
hevc_v_loop_filter_luma12_weak_c:                       63.3 ( 1.00x)
hevc_v_loop_filter_luma12_weak_sse2:                    44.9 ( 1.41x)
hevc_v_loop_filter_luma12_weak_ssse3:                   10.5 ( 6.02x)
hevc_v_loop_filter_luma12_weak_avx:                     41.7 ( 1.52x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 11:54:57 +01:00
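
The idea behind 2729c52988 (keeping two 16-bit words in one general-purpose register instead of two) can be illustrated in plain C. This is only a sketch of the packing itself; the actual change shuffles words inside xmm registers before moving them out, and the function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 16-bit words into one 32-bit value:
 * lo occupies bits 0-15, hi occupies bits 16-31. */
static inline uint32_t pack_words(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Recover the individual words from the packed value. */
static inline uint16_t low_word(uint32_t packed)
{
    return (uint16_t)(packed & 0xFFFF);
}

static inline uint16_t high_word(uint32_t packed)
{
    return (uint16_t)(packed >> 16);
}
```

Both words then travel in a single register, and a test such as "are both words zero" becomes one comparison of the packed value against zero.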
Andreas Rheinhardt
0843252229 avcodec/x86/hevc/deblock: avoid unused GPR
r12 is unused, so use it instead of r13 to reduce
the number of push/pops.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 11:54:57 +01:00
Andreas Rheinhardt
0aad8b860a avcodec/x86/hevc/deblock: Avoid vmovdqa
(It would even be possible to avoid clobbering m10 in
MASKED_COPY and the mask register (%3) in MASKED_COPY2
when VEX encoding is in use.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 11:54:57 +01:00
Andreas Rheinhardt
c940128fff avcodec/x86/vp9lpf: Avoid vmovdqa
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 11:54:57 +01:00
Andreas Rheinhardt
c898ddb8fe avcodec/x86/cfhddsp: Reduce number of xmm registers used
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:33:40 +01:00
Andreas Rheinhardt
848c3ca772 avcodec/x86/cfhddsp: Avoid pmaddwd
The result of using pmaddwd with the coefficients 1,-1,...,1,-1
is just the negative of using pmaddwd with the coefficients
-1,1,...,-1,1, so avoid one pmaddwd.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:33:37 +01:00
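
The identity used in 848c3ca772 can be checked with a scalar model of pmaddwd. The sketch below is an illustration only (the names `pmaddwd_model` and `negated_coeffs_agree` are hypothetical, and it is not the FFmpeg code): it multiplies adjacent signed 16-bit pairs, adds each pair of products into a 32-bit lane, and compares the results for the two coefficient patterns.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4 /* a 128-bit pmaddwd produces four 32-bit results */

/* Scalar model of pmaddwd: multiply adjacent signed 16-bit pairs
 * and add each pair of products into one 32-bit lane. */
static inline void pmaddwd_model(int32_t dst[LANES],
                                 const int16_t a[2 * LANES],
                                 const int16_t b[2 * LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = (int32_t)a[2 * i]     * b[2 * i] +
                 (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Returns 1 if the result for coefficients 1,-1,...,1,-1 is the
 * lane-wise negation of the result for -1,1,...,-1,1. */
static inline int negated_coeffs_agree(const int16_t x[2 * LANES])
{
    static const int16_t pos_neg[2 * LANES] = { 1, -1, 1, -1, 1, -1, 1, -1 };
    static const int16_t neg_pos[2 * LANES] = { -1, 1, -1, 1, -1, 1, -1, 1 };
    int32_t p[LANES], n[LANES];

    pmaddwd_model(p, x, pos_neg);
    pmaddwd_model(n, x, neg_pos);
    for (int i = 0; i < LANES; i++)
        if (p[i] != -n[i])
            return 0;
    return 1;
}
```

With coefficients of magnitude 1, each lane is just a difference of two 16-bit inputs, so it fits easily in 32 bits and the negation is exact. One pmaddwd result can therefore be reused with its sign flipped, for example by subtracting where the code previously added.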
Andreas Rheinhardt
6224445753 avcodec/x86/cfhdencdsp: Avoid += x, -= x
Avoid incrementing lowq and highq inside the loop by using
complex addressing modes, so that said modification no longer
has to be undone at the end of the horizontal loop.
For inputq, modify istrideq outside of the loop so that
it is only modified once at the end of the horizontal loop.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:33:34 +01:00
Andreas Rheinhardt
7dd6487800 avcodec/x86/cfhdencdsp: Don't load twice
Sign extend the integer arguments directly from the stack
instead of loading qwords, followed by sign-extending the
lower half.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:33:30 +01:00
Andreas Rheinhardt
91c7710412 avcodec/x86/cfhdencdsp: Avoid unnecessary constants
Up until now, cfhdencdsp used constants consisting
of -1, 1, ...,-1,1 words and 1, -1,...,1,-1 words
for use as constants in pmaddwd. But one can use
the same constants if one shuffles the words in
a dword into the opposite order. Similarly for some other
constants. This also made it possible to avoid a register in
cfhdenc_vert_filter.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:33:23 +01:00
Andreas Rheinhardt
cd3d8116fb avcodec/x86/cfhdencdsp: Avoid load of -1
It can be easily generated at runtime.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 01:32:57 +01:00
Andreas Rheinhardt
bf4d5037b4 avcodec/h264dsp: Remove redundant h264 from H264DSPCtx member names
These names are a remnant of dsputil when all the DSP functions
from all codecs were part of DSPContext.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:25 +01:00
Andreas Rheinhardt
489aaf4e1c avcodec/x86/h264_deblock: Don't sign-extend stride
Unnecessary (and wrong) since d5d699ab6e.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
db66e057eb avcodec/x86/h264_deblock: Avoid reload
Old benchmarks:
h264_h_loop_filter_luma_8bpp_c:                         60.0 ( 1.00x)
h264_h_loop_filter_luma_8bpp_sse2:                      65.4 ( 0.92x)
h264_h_loop_filter_luma_8bpp_avx:                       65.3 ( 0.92x)

New benchmarks:
h264_h_loop_filter_luma_8bpp_c:                         60.4 ( 1.00x)
h264_h_loop_filter_luma_8bpp_sse2:                      62.0 ( 0.97x)
h264_h_loop_filter_luma_8bpp_avx:                       61.7 ( 0.98x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
8428a412bc avcodec/x86/h264_deblock: Avoid MMX in deblock_h_luma_8
Old benchmarks:
h264_h_loop_filter_luma_8bpp_c:                         59.9 ( 1.00x)
h264_h_loop_filter_luma_8bpp_sse2:                      67.9 ( 0.88x)
h264_h_loop_filter_luma_8bpp_avx:                       67.4 ( 0.89x)

New benchmarks:
h264_h_loop_filter_luma_8bpp_c:                         60.0 ( 1.00x)
h264_h_loop_filter_luma_8bpp_sse2:                      65.4 ( 0.92x)
h264_h_loop_filter_luma_8bpp_avx:                       65.3 ( 0.92x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
9882973935 avcodec/x86/h264_deblock: Avoid reloading constant
No change in benchmarks.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
eaaf45fd79 avcodec/x86/h264_deblock_10bit: Simplify r0+4*r1
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
aab0946eae avcodec/x86/h264_deblock_10bit: Remove mmxext functions
Now that the SSE2/AVX functions are no longer restricted
to those systems having an aligned stack, the MMXEXT functions
are always overridden (except for ancient systems without
SSE2), so remove them.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
dbdf514c17 avcodec/x86/h264_deblock_10bit: Remove custom stack allocation code
Allocate it via cglobal as usual. This makes the SSE2/AVX functions
available when HAVE_ALIGNED_STACK is false; it also avoids
modifying rsp unnecessarily in the deblock_h_luma_intra_10 functions
on Win64.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
b1140d3c98 avcodec/x86/h264_deblock: Remove obsolete macro parameters
They are a remnant of the MMX functions (which processed
only eight pixels at a time, so that it was called twice
via a wrapper; the actual MMX function had "v8" in its name
instead of simply v) which have been removed in commit
4618f36a24.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
899475326b avcodec/x86/h264_deblock: Simplify splatting
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
a22149ab3d avcodec/x86/h264_deblock: Remove always-false branches
These functions are always called with alpha and beta > 0.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
982244818b avcodec/x86/h264_deblock: Remove unused macros
Forgotten in 4618f36a24.
Also remove a PASS8ROWS wrapper that seems to have been always
unused.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-25 22:53:21 +01:00
Andreas Rheinhardt
685011003f avcodec/x86/pngdsp: Remove MMXEXT function overridden by SSSE3
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-10 22:47:22 +01:00
Andreas Rheinhardt
31daa7cd87 avcodec/pngdsp: Use proper prefix ff_add_png->ff_png_add
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-10 22:47:22 +01:00
Andreas Rheinhardt
5f15c067fe avcodec/pngdsp: Constify
Also constify ff_png_filter_row().

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-10 22:47:22 +01:00
Andreas Rheinhardt
6177af5acc avcodec/x86/lossless_videodsp: Avoid unnecessary reg push,pop
Happens on Win64.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:56:09 +01:00
Andreas Rheinhardt
9314d5cae8 avcodec/x86/lossless_videodsp: Avoid aligned/unaligned versions
For AVX2, movdqu is as fast as movdqa when used on aligned addresses,
so don't instantiate aligned/unaligned versions.

(The check was btw overly strict: The AVX2 code only uses 16 byte
stores, so it would be enough for dst to be 16-byte aligned.)

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:55:53 +01:00
Andreas Rheinhardt
6368d2baae avcodec/x86/lossless_videodsp: Don't store in eight byte chunks
Use movu (movdqu) instead of movq+movhps.

Old benchmarks:
add_left_pred_int16_c:                                2265.5 ( 1.00x)
add_left_pred_int16_ssse3:                             595.4 ( 3.81x)
add_left_pred_rnd_acc_c:                              1255.0 ( 1.00x)
add_left_pred_rnd_acc_ssse3:                           326.2 ( 3.85x)
add_left_pred_rnd_acc_avx2:                            279.0 ( 4.50x)
add_left_pred_zero_c:                                 1249.5 ( 1.00x)
add_left_pred_zero_ssse3:                              326.1 ( 3.83x)
add_left_pred_zero_avx2:                               277.0 ( 4.51x)

New benchmarks:
add_left_pred_int16_c:                                2266.9 ( 1.00x)
add_left_pred_int16_ssse3:                             509.9 ( 4.45x)
add_left_pred_rnd_acc_c:                              1251.4 ( 1.00x)
add_left_pred_rnd_acc_ssse3:                           282.6 ( 4.43x)
add_left_pred_rnd_acc_avx2:                            208.9 ( 5.99x)
add_left_pred_zero_c:                                 1253.7 ( 1.00x)
add_left_pred_zero_ssse3:                              280.0 ( 4.48x)
add_left_pred_zero_avx2:                               206.8 ( 6.06x)

The checkasm test has been modified to use an unaligned destination
for this test.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:55:37 +01:00
Andreas Rheinhardt
a6b8939e1e avcodec/x86/lossless_videodsp: Remove SSSE3 functions using MMX regs
These functions are only used on Conroe (they are overridden
by SSSE3 functions using xmm registers if the SSSE3SLOW flag is not set)
which is very old (introduced in 2006), so remove them.

Btw: The checkasm test (which uses declare_func and not
declare_func_emms since cd8a33bcce)
would fail on a Conroe, yet no one ever reported any such failure.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:54:44 +01:00
Andreas Rheinhardt
f96829b5bf avcodec/x86/lossless_videoencdsp_init: Remove pointless av_unused
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:46 +01:00
Andreas Rheinhardt
abe6ba17fa avcodec/x86/lossless_videoencdsp: Port sub_median_pred to NASM
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:43 +01:00
Andreas Rheinhardt
9ba33cc198 avcodec/x86/lossless_videoencdsp_init: Avoid special-casing first pixel
Old benchmarks:
sub_median_pred_c:                                     404.1 ( 1.00x)
sub_median_pred_sse2:                                   20.5 (19.67x)

New benchmarks:
sub_median_pred_c:                                     408.5 ( 1.00x)
sub_median_pred_sse2:                                   19.2 (21.27x)

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:40 +01:00
Andreas Rheinhardt
3a3e7080f1 avcodec/x86/lossless_videoencdsp_init: Port sub_median_pred to SSE2
Old benchmarks:
sub_median_pred_c:                                     405.7 ( 1.00x)
sub_median_pred_mmxext:                                 35.1 (11.57x)

New benchmarks:
sub_median_pred_c:                                     404.1 ( 1.00x)
sub_median_pred_sse2:                                   20.5 (19.67x)

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:35 +01:00
Andreas Rheinhardt
3144652588 avcodec/x86/lossless_videoencdsp_init: Don't read too often
sub_median_pred_mmxext() calculates a predictor from the left, top
and topleft pixel values. The topleft values need to be initialized
differently for the first loop iteration than for the others
in order to avoid reading ptr[-1]. So it has been initialized before
the loop and then read again at the end of the loop, so that the last
value read was never used. Yet this can lead to reads beyond the end
of the buffer, e.g. with
ffmpeg -cpuflags mmx+mmxext -f lavfi -i "color=size=64x4,format=yuv420p" \
-vf vflip -c:v ffvhuff -pred median -frames 1 -f null -

Fix this by not reading the value at the end of the loop.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:29 +01:00
Andreas Rheinhardt
2b9aea7756 avcodec/x86/lossless_videoencdsp_init: Don't read from before the buffer
sub_median_pred_mmxext() calculates a predictor from the left, top
and topleft pixel values. The left value is simply read via
ptr[-1], although this is not guaranteed to be inside the buffer
in case of negative strides. This happens e.g. with

ffmpeg -i fate-suite/mpeg2/dvd_single_frame.vob -vf vflip \
       -c:v magicyuv -pred median -f null -

Fix this by reading the first value like the topleft value.
Also change the documentation of sub_median_pred to reflect this
change (and the one from 791b5954bc).

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:25 +01:00
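
The two fixes above (3144652588 and 2b9aea7756) both concern where the left and topleft neighbours of the first pixel come from. The following scalar model is a sketch under assumptions, not FFmpeg's implementation: the names `median3` and `sub_median_pred_model` are hypothetical, and the predictor is assumed to be the median of left, top and left + top - topleft. The point it illustrates is that the neighbours of pixel 0 are passed in as values, so the loop never reads `cur[-1]` or `top[-1]` and never reads past index `w - 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Median of three integers. */
static inline int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; } /* now a <= b */
    if (b > c) b = c;                        /* b = min(b, c) */
    return a > b ? a : b;                    /* max(a, min(b, c)) */
}

/* Hypothetical scalar model: top is the previous row, cur the current
 * row. *left and *left_top carry the neighbours of pixel 0 in and the
 * values for a following call out; only indices 0..w-1 are accessed. */
static inline void sub_median_pred_model(uint8_t *dst, const uint8_t *top,
                                         const uint8_t *cur, size_t w,
                                         int *left, int *left_top)
{
    uint8_t l  = (uint8_t)*left;
    uint8_t lt = (uint8_t)*left_top;

    assert(dst && top && cur && left && left_top);
    for (size_t i = 0; i < w; i++) {
        int pred = median3(l, top[i], (l + top[i] - lt) & 0xFF);
        lt     = top[i];  /* becomes topleft of the next pixel */
        l      = cur[i];  /* becomes left of the next pixel */
        dst[i] = (uint8_t)(l - pred);
    }
    *left     = l;
    *left_top = lt;
}
```

Because the neighbours travel through `left` and `left_top`, the same routine works for rows reached through a negative stride, where the bytes before the start of a row are not guaranteed to belong to the buffer.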
Andreas Rheinhardt
dc843cdd9a avcodec/x86/vp9mc: Reindent after the previous commit
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:35:07 +01:00
Andreas Rheinhardt
65e71b0837 avcodec/x86/vp9mc: Deduplicate coefficient tables
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:35:01 +01:00
Andreas Rheinhardt
38e2174ce4 avcodec/x86/vp9mc: Avoid MMX regs in width 4 hor 8tap funcs
Using wider registers (and pshufb) makes it possible to halve the number of
pmaddubsw used. It is also ABI compliant (no more missing emms).

Old benchmarks:
vp9_avg_8tap_smooth_4h_8bpp_c:                          97.6 ( 1.00x)
vp9_avg_8tap_smooth_4h_8bpp_ssse3:                      15.0 ( 6.52x)
vp9_avg_8tap_smooth_4hv_8bpp_c:                        342.9 ( 1.00x)
vp9_avg_8tap_smooth_4hv_8bpp_ssse3:                     54.0 ( 6.35x)
vp9_put_8tap_smooth_4h_8bpp_c:                          94.9 ( 1.00x)
vp9_put_8tap_smooth_4h_8bpp_ssse3:                      14.2 ( 6.67x)
vp9_put_8tap_smooth_4hv_8bpp_c:                        325.9 ( 1.00x)
vp9_put_8tap_smooth_4hv_8bpp_ssse3:                     52.5 ( 6.20x)

New benchmarks:
vp9_avg_8tap_smooth_4h_8bpp_c:                          97.6 ( 1.00x)
vp9_avg_8tap_smooth_4h_8bpp_ssse3:                      10.8 ( 9.08x)
vp9_avg_8tap_smooth_4hv_8bpp_c:                        342.4 ( 1.00x)
vp9_avg_8tap_smooth_4hv_8bpp_ssse3:                     38.8 ( 8.82x)
vp9_put_8tap_smooth_4h_8bpp_c:                          94.7 ( 1.00x)
vp9_put_8tap_smooth_4h_8bpp_ssse3:                       9.7 ( 9.75x)
vp9_put_8tap_smooth_4hv_8bpp_c:                        321.7 ( 1.00x)
vp9_put_8tap_smooth_4hv_8bpp_ssse3:                     37.0 ( 8.69x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:34:35 +01:00
Andreas Rheinhardt
dd5dc254ff avcodec/x86/vp9mc: Avoid reloads, MMX regs in width 4 vert 8tap func
Four rows of four bytes fit into one xmm register; therefore
one can arrange the rows as follows (A,B,C: first, second, third etc.
row)

xmm0: ABABABAB BCBCBCBC
xmm1: CDCDCDCD DEDEDEDE
xmm2: EFEFEFEF FGFGFGFG
xmm3: GHGHGHGH HIHIHIHI

and use four pmaddubsw to calculate two rows in parallel. The history
fits into four registers, making this possible even on 32bit systems.

Old benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c:                         105.5 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3:                      16.4 ( 6.44x)
vp9_put_8tap_smooth_4v_8bpp_c:                          99.3 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3:                      15.4 ( 6.44x)

New benchmarks (Unix 64):
vp9_avg_8tap_smooth_4v_8bpp_c:                         105.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3:                      11.8 ( 8.90x)
vp9_put_8tap_smooth_4v_8bpp_c:                          99.7 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3:                      10.7 ( 9.30x)

Old benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c:                         138.2 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3:                      28.0 ( 4.93x)
vp9_put_8tap_smooth_4v_8bpp_c:                         123.6 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3:                      28.0 ( 4.41x)

New benchmarks (x86-32):
vp9_avg_8tap_smooth_4v_8bpp_c:                         139.0 ( 1.00x)
vp9_avg_8tap_smooth_4v_8bpp_ssse3:                      20.1 ( 6.92x)
vp9_put_8tap_smooth_4v_8bpp_c:                         124.5 ( 1.00x)
vp9_put_8tap_smooth_4v_8bpp_ssse3:                      19.9 ( 6.26x)

Loading the constants into registers did not turn out to be advantageous
here (not to mention Win64, where this would necessitate saving
and restoring ever more registers), probably because there are only two
loop iterations.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:31:59 +01:00
Andreas Rheinhardt
36204fbc3c avcodec/vp9itxfm{,_16bpp}: Remove MMXEXT functions overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:27:51 +01:00
Andreas Rheinhardt
ea37f49aed avcodec/vp9intrapred: Remove MMXEXT functions overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:27:44 +01:00
Andreas Rheinhardt
6e418af810 avcodec/vp9mc: Remove MMXEXT functions overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMXEXT functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:27:05 +01:00
Kacper Michajłow
5b5d51cbc1 avcodec/x86/h264_idct: fix version check for NASM 3 and newer
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-12-08 17:43:29 +00:00
Andreas Rheinhardt
050c80a526 avcodec/x86/vp8dsp: Don't use saturated addition when unnecessary
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.
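
The bound can be verified numerically. The sketch below is an independent
check, not code from this commit. It assumes the six-tap coefficients for
the seven fractional positions as listed in RFC 6386, 8-bit pixels, signed
16-bit intermediates, and bilinear weights of the form (8 - m, m); the
helper names are hypothetical.

```python
# Six-tap VP8 subpel filter coefficients for the seven fractional positions,
# as listed in RFC 6386 (assumed here; the full-pel entry is omitted).
SIXTAP = [
    (0, -6, 123, 12, -1, 0),
    (2, -11, 108, 36, -8, 1),
    (0, -9, 93, 50, -6, 0),
    (3, -16, 77, 77, -16, 3),
    (0, -6, 50, 93, -9, 0),
    (1, -8, 36, 108, -11, 2),
    (0, -1, 12, 123, -6, 0),
]
INT16_MIN, INT16_MAX = -32768, 32767


def bounds(coeffs):
    """Extreme values of sum(c * p) with every pixel p in 0..255."""
    hi = 255 * sum(c for c in coeffs if c > 0)
    lo = 255 * sum(c for c in coeffs if c < 0)
    return lo, hi


def fits_int16(coeffs):
    """True if no choice of pixels can push the sum outside int16."""
    lo, hi = bounds(coeffs)
    return INT16_MIN <= lo and hi <= INT16_MAX


def without(coeffs, index):
    """Drop one tap from the filter."""
    return [c for i, c in enumerate(coeffs) if i != index]


# Any partial sum holding only one central tap (index 2 or 3) is bounded by
# the sum of all taps with the other central tap removed.
one_central_ok = all(fits_int16(without(f, 2)) and fits_int16(without(f, 3))
                     for f in SIXTAP)
# With both central taps in one sum, some filters can leave the int16 range.
both_central_ok = all(fits_int16(f) for f in SIXTAP)
# Bilinear weights (8 - m, m) for m = 0..7 (assumed form).
bilinear_ok = all(fits_int16((8 - m, m)) for m in range(8))
```

one_central_ok and bilinear_ok are true, while both_central_ok is false,
so a saturating add is only needed where the two halves containing the
large central coefficients are combined.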

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
575e9e9c08 avcodec/x86/vp8dsp: Reduce number of coefficient tables
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
99fb257f58 avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h6_ssse3
Doubling the register width makes it possible to avoid a pshufb and a pmaddubsw.

Old benchmarks:
vp8_put_epel4_h6_c:                                    115.9 ( 1.00x)
vp8_put_epel4_h6_ssse3:                                 20.2 ( 5.74x)
vp8_put_epel4_h6v4_c:                                  276.3 ( 1.00x)
vp8_put_epel4_h6v4_ssse3:                               58.6 ( 4.71x)
vp8_put_epel4_h6v6_c:                                  363.6 ( 1.00x)
vp8_put_epel4_h6v6_ssse3:                               62.5 ( 5.82x)

New benchmarks:
vp8_put_epel4_h6_c:                                    116.4 ( 1.00x)
vp8_put_epel4_h6_ssse3:                                 16.0 ( 7.29x)
vp8_put_epel4_h6v4_c:                                  280.9 ( 1.00x)
vp8_put_epel4_h6v4_ssse3:                               44.3 ( 6.33x)
vp8_put_epel4_h6v6_c:                                  365.6 ( 1.00x)
vp8_put_epel4_h6v6_ssse3:                               53.1 ( 6.89x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
3135bc0d3a avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h4_ssse3
Doubling the register width allows using only one pshufb and one pmaddubsw.
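
To illustrate why a 16-byte register suffices, here is a scalar sketch of
one possible arrangement. It is an assumption for illustration and not
necessarily the shuffle used in this commit: the mask, the reduction and
all names are hypothetical, the taps are the inner four of the RFC 6386
filter (0, -6, 123, 12, -1, 0), and rounding is assumed to be
(sum + 64) >> 7 with clipping to 8 bits.

```python
# Scalar model of a single-pshufb, single-pmaddubsw 4-tap horizontal filter.
# The shuffle mask and the reduction are illustrative assumptions.
TAPS = (-6, 123, 12, -1)

# Byte indices into a 16-byte load that starts at src[-1].
SHUFFLE = [0, 1, 1, 2, 2, 3, 3, 4,   # pairs (src[x-1], src[x]) for x = 0..3
           2, 3, 3, 4, 4, 5, 5, 6]   # pairs (src[x+1], src[x+2]) for x = 0..3
COEFFS = [TAPS[0], TAPS[1]] * 4 + [TAPS[2], TAPS[3]] * 4


def sat16(v):
    """Clamp to the signed 16-bit range, as pmaddubsw does for each pair sum."""
    return max(-32768, min(32767, v))


def pshufb(reg, mask):
    """Emulate pshufb for masks without the zeroing bit set."""
    return [reg[i] for i in mask]


def pmaddubsw(pixels, coeffs):
    """Emulate pmaddubsw: unsigned bytes times signed bytes, adjacent pairs summed."""
    return [sat16(pixels[i] * coeffs[i] + pixels[i + 1] * coeffs[i + 1])
            for i in range(0, 16, 2)]


def round_clip(v):
    """Assumed final step: round, shift by 7 and clip to 8 bits."""
    return max(0, min(255, (v + 64) >> 7))


def epel4_h4(src):
    """src[0] is the pixel left of the first output; at least 7 pixels needed."""
    reg = (list(src) + [0] * 16)[:16]
    words = pmaddubsw(pshufb(reg, SHUFFLE), COEFFS)
    # Low words hold taps 0 and 1, high words taps 2 and 3 of the same pixel.
    return [round_clip(words[x] + words[x + 4]) for x in range(4)]


def reference(src):
    """Plain 4-tap horizontal filter for four output pixels."""
    return [round_clip(sum(TAPS[t] * src[x + t] for t in range(4)))
            for x in range(4)]


src = [10, 200, 30, 255, 0, 90, 180]
out = epel4_h4(src)
```

With 8-byte registers the two tap pairs would need separate shuffles and
multiply-adds; in this model both land in one register and are combined
by adding its high half to its low half.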

Old benchmarks:
vp8_put_epel4_h4_c:                                     82.8 ( 1.00x)
vp8_put_epel4_h4_ssse3:                                 13.9 ( 5.96x)

New benchmarks:
vp8_put_epel4_h4_c:                                     82.7 ( 1.00x)
vp8_put_epel4_h4_ssse3:                                 11.7 ( 7.08x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00