Commit graph

679 commits

Author SHA1 Message Date
Andreas Rheinhardt
bb523a2d3f tests/checkasm/llviddsp: Reindent after the previous commit
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:55:33 +01:00
Andreas Rheinhardt
b2dea09de1 tests/checkasm/llviddsp: Avoid unnecessary initializations, allocs
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-19 20:55:20 +01:00
Martin Storsjö
cf608f6b65 checkasm: Use av_strlcatf for appending SME info after SVE
If we had SVE enabled and formatted info about its vector lengths,
it would be overwritten by the SME info.
2025-12-19 18:42:10 +00:00
Martin Storsjö
477e2dbc39 checkasm: Add a comment about adding new tests to checkasm.mak 2025-12-16 09:47:05 +00:00
Rémi Denis-Courmont
468eca2854 checkasm: test all plane configurations with sub_left_predict
The original code didn't really make sense, never iterating the loop
and never testing non-first plane configurations.
2025-12-15 18:39:53 +02:00
Niklas Haas
960cf3015e swscale/ops: add explicit row offset to SwsDitherOp
To improve decorrelation between components, we offset the dither matrix
slightly for each component. This is currently done by adding a hard-coded
offset of {0, 3, 2, 5} to each of the four components, respectively.

However, this represents a serious challenge when re-ordering SwsDitherOp
past a swizzle, or when splitting an SwsOpList into multiple sub-operations
(e.g. for decoupling luma from subsampled chroma when they are independent).

To fix this on a fundamental level, we have to keep track of the offset per
channel as part of the SwsDitherOp metadata, and respect those values at
runtime.

This commit merely adds the metadata; the update to the underlying backends
will come in a follow-up commit. The FATE change is merely due to the
added offsets in the op list print-out.
2025-12-15 14:31:58 +00:00
Andreas Rheinhardt
4715b9bac2 tests/checkasm/llviddspenc: Rename to llvidencdsp
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:48 +01:00
Andreas Rheinhardt
3a3e7080f1 avcodec/x86/lossless_videoencdsp_init: Port sub_median_pred to SSE2
Old benchmarks:
sub_median_pred_c:                                     405.7 ( 1.00x)
sub_median_pred_mmxext:                                 35.1 (11.57x)

New benchmarks:
sub_median_pred_c:                                     404.1 ( 1.00x)
sub_median_pred_sse2:                                   20.5 (19.67x)

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:35 +01:00
Andreas Rheinhardt
56b8769a1c tests/checkasm/llviddspenc: Add test for sub_median_pred
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-14 10:16:32 +01:00
Georgii Zagoruiko
cdb14bc74d configure: add detection of assembler support for SME
All changes are made during development/testing of SVE/SME for ffmpeg (vvc). Tested on Apple M4
2025-12-09 21:38:38 +00:00
Andreas Rheinhardt
4baa5e638b tests/checkasm/checkasm: Don't test 3dnow
The last 3dnow functions have been removed in commit
5ef613bcb0, so don't test
it in checkasm.

(This will affect only one test, namely scalarproduct_and_madd_int16
from lossless_audiodsp: It does not use an SSSE3 function when
the 3dnow flag is set. So for old AMDs (which advertise support for
3dnow), said SSSE3 function is never tested. Now it will.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-09 03:03:07 +00:00
Andreas Rheinhardt
7bc35b8426 tests/checkasm/vp9dsp: Allow to run only a subset of tests
Make it possible to run only a subset of the VP9 tests
in addition to all of them (via the vp9dsp test). This
reduces noise and speeds up testing.
FATE continues to use vp9dsp.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-08 19:27:56 +01:00
Kacper Michajłow
6a14a93af5 checkasm/sw_xyz2rgb: fix function type
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-12-05 21:55:03 +00:00
Arpad Panyik
a13871ae19 checkasm: Add xyz12Torgb48le test
Add checkasm coverage for the XYZ12LE to RGB48LE path via the
ctx->xyz12Torgb48 hook. Integrate the test into the build and runner,
exercise a variety of widths/heights, compare against the C reference,
and benchmark when width is multiple of 4.

This improves test coverage for the new function pointer in preparation
for architecture-specific implementations in subsequent commits.

Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
2025-12-05 10:28:18 +00:00
Andreas Rheinhardt
4b6e40a298 avcodec/vp8dsp: Don't compile unused functions
The width 16 epel functions never use four taps in any direction*,
so don't build said functions. Saves 4352B of .text and 89B of
.text.unlikely here.

*: mx and my in vp8_mc_luma() are always even.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
e3e3265034 tests/checkasm/mpegvideo_unquantize: Add missing const
Fixes this test under UBSan:
runtime error: call to function dct_unquantize_mpeg1_intra_c through pointer to incorrect function type 'void (*)(struct MpegEncContext *, short *, int, int)'
I don't know how I could forget this.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 14:17:58 +01:00
Andreas Rheinhardt
c22c2c5e03 avcodec/mpegvideo: Port dct_unquantize_mpeg2_intra_mmx to SSE2
Benefits from wider registers.

Benchmarks:
dct_unquantize_mpeg2_intra_c:                          228.2 ( 1.00x)
dct_unquantize_mpeg2_intra_mmx:                         28.2 ( 8.10x)
dct_unquantize_mpeg2_intra_sse2:                        18.4 (12.37x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
581050a175 tests/checkasm: Add mpegvideo unquantize test
This adds a test for the mpegvideo unquantize functions.

It has been written in order to be able to easily bench
these functions. It should be noted that the random input
fed to the tested functions is not necessarily representative
of the stuff actually occuring in the wild. So benchmarks should
be taken with a grain of salt; but comparisons between two functions
that do not depend on branch predictions are valid (the usecase
for this is to port the x86 mmx functions to use xmm registers).

During testing I have found a bug in the arm/aarch64 neon optimizations
when using the LIBMPEG2 permutation (used by FF_IDCT_INT): The code
seems to be based on the presumption that the remainder of the number
of coefficients to process is always <= 4 mod 16. The test therefore
sometimes fails for these arches.

Hint: I am not certain that 16 bits are enough for the intermediate
values of all the computations involved; e.g. both FLV and MPEG-4
escape values can go beyond that after the corresponding
multiplications. The input in this test is nevertheless designed
to fit into 16 bits.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:39 +01:00
Kacper Michajłow
17456c553e tests/checkasm: fix check for 32-bit Windows build
With --disable-asm, ARCH_X86_32 is set to 0, but we still build the
checkasm binary. Update the check so it is config.h agnostic.

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-11-30 22:07:39 +00:00
Andreas Rheinhardt
ada0a81577 avcodec/x86/h264_idct: Don't use MMX registers in ff_h264_luma_dc_dequant_idct_sse2
It is ABI compliant and gives a tiny speedup here (and is 16B smaller).

Old benchmarks:
h264_luma_dc_dequant_idct_8_c:                          33.2 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2:                       16.0 ( 2.07x)

New benchmarks:
h264_luma_dc_dequant_idct_8_c:                          33.0 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2:                       15.0 ( 2.20x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-30 00:15:43 +01:00
Andreas Rheinhardt
aabaab10d2 tests/checkasm: Test VP6DSP
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-27 12:10:34 +01:00
James Almer
191f7e4869 tests/checkasm/sw_ops: fix signed integer related UB when shifting values
Fixes:
src/tests/checkasm/sw_ops.c:441:34: runtime error: shift exponent 32 is too large for 32-bit type 'int'
src/tests/checkasm/sw_ops.c:591:37: runtime error: shift exponent 32 is too large for 32-bit type 'int'

Signed-off-by: James Almer <jamrial@gmail.com>
2025-11-21 18:40:58 +00:00
Andreas Rheinhardt
0d3a88e55f tests/checkasm/mpegvideoencdsp: Test denoise_dct
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-18 20:41:12 +01:00
Andreas Rheinhardt
06b0dae51b avfilter/vf_fsppdsp: Constify
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 12:18:12 +01:00
Andreas Rheinhardt
ddd74276f8 avfilter/x86/vf_fspp: Port ff_column_fidct_mmx() to SSE2
It gains a lot because it has to operate on eight words;
it also saves 608B of .text here.

Old benchmarks:
column_fidct_c:                                       3365.7 ( 1.00x)
column_fidct_mmx:                                     1784.6 ( 1.89x)

New benchmarks:
column_fidct_c:                                       3361.5 ( 1.00x)
column_fidct_sse2:                                     801.1 ( 4.20x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 12:18:11 +01:00
Andreas Rheinhardt
68b11cde82 tests/checkasm/vf_fspp: Add test for column_fidct
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 12:18:11 +01:00
Andreas Rheinhardt
570f8fc6c9 tests/checkasm/vf_fspp: Test store_slice
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 11:28:04 +01:00
Andreas Rheinhardt
52ba2ac7bd avfilter/x86/vf_fspp: Port mul_thrmat to SSE2
This fixes an ABI violation, as mul_thrmat did not issue emms.
It seems that this ABI violation could reach the user, namely
if ff_get_video_buffer() fails. Notice that ff_get_video_buffer()
itself could fail because of this, namely if the allocator uses
floating point registers.

On x64 (where GCC already used SSE2 in the C version)
mul_thrmat_c:                                            4.4 ( 1.00x)
mul_thrmat_mmx:                                          8.6 ( 0.52x)
mul_thrmat_sse2:                                         4.4 ( 1.00x)

On 32bit (where SSE2 is not known to be available):
mul_thrmat_c:                                           56.0 ( 1.00x)
mul_thrmat_sse2:                                         6.0 ( 9.40x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 11:28:04 +01:00
Andreas Rheinhardt
70eb8a76a9 tests/checkasm: Add vf_fspp mul_thrmat test
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-17 11:28:04 +01:00
Andreas Rheinhardt
c6efe1abda avcodec/h264chroma: Move mc1 function to mpegvideo_dec.c
It is only used by mpegvideo decoders (for lowres). It is also only used
for bitdepth == 8, so don't build the bitdepth == 16 function at all any
more.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-01 13:31:57 +01:00
Andreas Rheinhardt
0f105b96a3 avcodec/x86/hevc/idct: Port ff_hevc_idct_4x4_dc_{8,10,12}_mmxext to SSE2
Practically no change in benchmarks (and in codesize).

hevc_idct_4x4_dc_8_c:                                    7.8 ( 1.00x)
hevc_idct_4x4_dc_8_mmxext:                               6.9 ( 1.14x)
hevc_idct_4x4_dc_8_sse2:                                 6.8 ( 1.15x)
hevc_idct_4x4_dc_10_c:                                   7.9 ( 1.00x)
hevc_idct_4x4_dc_10_mmxext:                              6.9 ( 1.16x)
hevc_idct_4x4_dc_10_sse2:                                6.8 ( 1.16x)
hevc_idct_4x4_dc_12_c:                                   7.8 ( 1.00x)
hevc_idct_4x4_dc_12_mmxext:                              7.0 ( 1.13x)
hevc_idct_4x4_dc_12_sse2:                                6.8 ( 1.15x)

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-30 08:56:45 +01:00
Andreas Rheinhardt
f4a87d8ca4 avcodec/x86/mpegvideoencdsp_init: Use xmm registers in SSSE3 functions
Improves performance and no longer breaks the ABI (by forgetting
to call emms).

Old benchmarks:
add_8x8basis_c:                                         43.6 ( 1.00x)
add_8x8basis_ssse3:                                     12.3 ( 3.55x)

New benchmarks:
add_8x8basis_c:                                         43.0 ( 1.00x)
add_8x8basis_ssse3:                                      6.3 ( 6.79x)

Notice that the output of try_8x8basis_ssse3 changes a bit:
Before this commit, it computes certain values and adds the values
for i,i+1,i+4 and i+5 before right shifting them; now it adds
the values for i,i+1,i+8,i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both together
instead of only shifting xmm0 after adding them), but the former
i,i+1 is inherent in using pmaddwd. This is the reason that this
function is not bitexact.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-15 08:55:13 +02:00
Andreas Rheinhardt
ce499ebf96 tests/checkasm/mpegvideoencdsp: Add test for add_8x8basis
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-15 08:55:13 +02:00
Andreas Rheinhardt
31f0749cd4 avcodec/vp3: Optimize alignment check away when possible
Check only on arches that need said check.

(Btw: I do not see how h_loop_filter benefits from alignment
at all and why h_loop_filter_unaligned exists.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-13 18:59:49 +02:00
Andreas Rheinhardt
5823ab347a avcodec/vp3dsp: Remove unused flags parameter from ff_vp3dsp_init()
No longer necessary now that the x86 loop filter functions are
bitexact.

Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-13 18:59:24 +02:00
Andreas Rheinhardt
e3ca57ae8f avcodec/x86/vp3dsp: Port loop filters to SSE2
The old code operated on bytes and did lots of tricks
due to their limited range; it did not completely succeed,
which is why the old versions were not used when bitexact
output was requested.

In contrast, the new version is much simpler: It operates
on signed 16 bit words whose range is more than sufficient.
This means that these functions don't need a check for bitexactness
(and can be used in FATE).

Old benchmarks (for this, the AV_CODEC_FLAG_BITEXACT check has been
removed from checkasm):
h_loop_filter_c:                                        29.8 ( 1.00x)
h_loop_filter_mmxext:                                   32.2 ( 0.93x)
h_loop_filter_unaligned_c:                              29.9 ( 1.00x)
h_loop_filter_unaligned_mmxext:                         31.4 ( 0.95x)
v_loop_filter_c:                                        39.3 ( 1.00x)
v_loop_filter_mmxext:                                   14.2 ( 2.78x)
v_loop_filter_unaligned_c:                              38.9 ( 1.00x)
v_loop_filter_unaligned_mmxext:                         14.3 ( 2.72x)

New benchmarks:
h_loop_filter_c:                                        29.2 ( 1.00x)
h_loop_filter_sse2:                                     28.6 ( 1.02x)
h_loop_filter_unaligned_c:                              29.0 ( 1.00x)
h_loop_filter_unaligned_sse2:                           26.9 ( 1.08x)
v_loop_filter_c:                                        38.3 ( 1.00x)
v_loop_filter_sse2:                                     11.0 ( 3.47x)
v_loop_filter_unaligned_c:                              35.5 ( 1.00x)
v_loop_filter_unaligned_sse2:                           11.2 ( 3.18x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-13 18:58:50 +02:00
Andreas Rheinhardt
5d9a392bce tests/checkasm: Add VP3 loop filter test
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-13 18:58:50 +02:00
Andreas Rheinhardt
54598238e4 tests/checkasm: Add CAVS qpel test
This test already uncovered a bug in the vertical qpel motion
compensation code.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
Andreas Rheinhardt
3e2d9b73c1 avcodec/h264qpel: Move Snow-only code to snow.c
Blocksize 2 is Snow-only, so move all the code pertaining
to it to snow.c. Also make the put array in H264QpelContext
smaller -- it only needs three sets of 16 function pointers.
This continues 6eb8bc4217
and b0c91c2fba.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
James Almer
95850f339e tests/checkasm: add a test for dcadsp
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-05 10:09:04 -03:00
Andreas Rheinhardt
6eb8bc4217 avcodec/h264qpel: Don't build unused 2x2 size funcs for bitdepths > 8
The 2x2 put functions are only used by Snow and Snow uses
only the eight bit versions. The rest is dead code. Disabling
it saved 41277B here.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
8820e2205c tests/checkasm/hpeldsp: Use instruction-set independent height
Otherwise the benchmark numbers are incomparable.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
9a0581fca0 tests/checkasm: Add qpeldsp checkasm
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
ab7d1c64c9 avcodec/x86/h263_loopfilter: Port loop filter to SSE2
Old benchmarks:
h263dsp.h_loop_filter_c:                                41.2 ( 1.00x)
h263dsp.h_loop_filter_mmx:                              39.5 ( 1.04x)
h263dsp.v_loop_filter_c:                                43.5 ( 1.00x)
h263dsp.v_loop_filter_mmx:                              16.9 ( 2.57x)

New benchmarks:
h263dsp.h_loop_filter_c:                                41.6 ( 1.00x)
h263dsp.h_loop_filter_sse2:                             28.2 ( 1.48x)
h263dsp.v_loop_filter_c:                                42.4 ( 1.00x)
h263dsp.v_loop_filter_sse2:                             15.1 ( 2.81x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-03 17:05:46 +00:00
Andreas Rheinhardt
a8a16c15c8 tests/checkasm/llviddsp: Use the same width for each cpuflag
Otherwise the benchmark numbers would be incomparable nonsense.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-03 17:05:46 +00:00
Kacper Michajłow
d6cb0d2c2b ALL: move av_unused to conform with standard requirement
This is required placement by standard [[maybe_unused]] attribute, works
the same for __attribute__((unused)).

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-09-26 16:15:46 +00:00
Andreas Rheinhardt
4e2ef29cba tests/checkasm: Add hpeldsp checkasm
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:02 +02:00
Niklas Haas
00e05bcd68 tests/checkasm: add vf_idet checkasm 2025-09-21 11:02:41 +00:00
Andreas Rheinhardt
a35c91dc14 avfilter/vf_colordetect: Rename header to vf_colordetectdsp.h
It is more in line with our naming conventions.

Reviewed-by: Martin Storsjö <martin@martin.st>
Reviewed-by: Niklas Haas <ffmpeg@haasn.dev>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-16 18:22:24 +02:00
Timo Rothenpieler
0362cb3806 build: link with CXX when -lstdc++ on linker commandline 2025-09-14 11:45:11 +00:00