Commit graph

52830 commits

Author SHA1 Message Date
Andreas Rheinhardt
697da64c8e avcodec/x86/h264_qpel: Port pixel8_l2_shift5 from MMXEXT to SSE2
This abides by the ABI (no missing emms) and yields a tiny
performance improvement here.
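
As a reading aid, here is a scalar sketch of what a put-type l2_shift5 step is assumed to compute for rows of eight pixels: shift the 16-bit intermediate right by five, saturate to eight bits, then take the rounding average with a second 8-bit source. The function name, the element-based strides and the exact rounding are assumptions made for illustration and are not taken from the FFmpeg source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the unsigned 8-bit range, as a saturating pack would do. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical scalar model (not the FFmpeg function): for each of the
 * eight pixels in a row, compute avg(clip(src16 >> 5), src8) with rounding.
 * All strides are counted in elements of the respective buffer.
 */
static void put_pixels8_l2_shift5_ref(uint8_t *dst, const int16_t *src16,
                                      const uint8_t *src8,
                                      ptrdiff_t dst_stride,
                                      ptrdiff_t src16_stride,
                                      ptrdiff_t src8_stride, int h)
{
    assert(h > 0);
    for (int y = 0; y < h; y++) {
        uint8_t       *d   = dst   + y * dst_stride;
        const int16_t *s16 = src16 + y * src16_stride;
        const uint8_t *s8  = src8  + y * src8_stride;

        for (int x = 0; x < 8; x++) {
            /* >> on a negative value is assumed to be an arithmetic shift */
            int a = clip_u8(s16[x] >> 5);
            d[x] = (uint8_t)((a + s8[x] + 1) >> 1);
        }
    }
}
```

Eight 16-bit values fill exactly one xmm register, which is presumably why such a step maps naturally onto SSE2 without touching the MMX state.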

Old benchmarks:
avg_h264_qpel_8_mc12_8_c:                              419.9 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            78.9 ( 5.32x)
avg_h264_qpel_8_mc12_8_ssse3:                           71.7 ( 5.86x)
avg_h264_qpel_8_mc32_8_c:                              429.1 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            76.9 ( 5.58x)
avg_h264_qpel_8_mc32_8_ssse3:                           73.4 ( 5.84x)
put_h264_qpel_8_mc12_8_c:                              424.0 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            78.6 ( 5.40x)
put_h264_qpel_8_mc12_8_ssse3:                           70.6 ( 6.00x)
put_h264_qpel_8_mc32_8_c:                              425.7 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            75.2 ( 5.66x)
put_h264_qpel_8_mc32_8_ssse3:                           70.4 ( 6.05x)

New benchmarks:
avg_h264_qpel_8_mc12_8_c:                              425.7 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            77.5 ( 5.49x)
avg_h264_qpel_8_mc12_8_ssse3:                           69.8 ( 6.10x)
avg_h264_qpel_8_mc32_8_c:                              423.7 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            74.6 ( 5.68x)
avg_h264_qpel_8_mc32_8_ssse3:                           71.9 ( 5.89x)
put_h264_qpel_8_mc12_8_c:                              422.2 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            75.8 ( 5.57x)
put_h264_qpel_8_mc12_8_ssse3:                           67.9 ( 6.22x)
put_h264_qpel_8_mc32_8_c:                              421.8 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            72.6 ( 5.81x)
put_h264_qpel_8_mc32_8_ssse3:                           67.7 ( 6.23x)

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
4ac9162beb avcodec/x86/h264_qpel: Don't use ff_ prefix for static functions
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
cd077e88d1 avcodec/x86/h264_qpel: Add ff_{avg,put}_h264_qpel16_h_lowpass_l2_sse2()
These functions are currently emulated via four calls to the versions
for 8x8 blocks. In fact, the size savings from the simplified calls
in h264_qpel.c (GCC 1344B, Clang 1280B) more than outweigh the size
of the added functions (512B) here.
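
The difference between emulating a 16x16 function and providing a dedicated one can be sketched in plain C. The kernel below is a simple rounding average of two sources and only stands in for the real lowpass code; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative kernel: rounding average of two sources over a w x h block. */
static void avg2_block(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                       ptrdiff_t stride, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            ptrdiff_t i = y * stride + x;
            dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
        }
}

static void avg2_8x8(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                     ptrdiff_t stride)
{
    avg2_block(dst, a, b, stride, 8, 8);
}

/* Emulation: four 8x8 calls, one per quadrant, each with its own offsets. */
static void avg2_16x16_emulated(uint8_t *dst, const uint8_t *a,
                                const uint8_t *b, ptrdiff_t stride)
{
    for (int q = 0; q < 4; q++) {
        ptrdiff_t off = (q >> 1) * 8 * stride + (q & 1) * 8;
        avg2_8x8(dst + off, a + off, b + off, stride);
    }
}

/* Dedicated version: a single call covering the whole block. */
static void avg2_16x16(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                       ptrdiff_t stride)
{
    avg2_block(dst, a, b, stride, 16, 16);
}
```

Both variants produce the same output; the emulated one pays for four calls and four sets of pointer adjustments at every callsite, which is the kind of overhead the size figures above refer to.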

It is also beneficial performance-wise. Old benchmarks:
avg_h264_qpel_16_mc11_8_c:                            1414.1 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          206.2 ( 6.86x)
avg_h264_qpel_16_mc11_8_ssse3:                         177.7 ( 7.96x)
avg_h264_qpel_16_mc13_8_c:                            1417.0 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          207.4 ( 6.83x)
avg_h264_qpel_16_mc13_8_ssse3:                         178.2 ( 7.95x)
avg_h264_qpel_16_mc21_8_c:                            1632.8 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          349.3 ( 4.67x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.3 ( 5.60x)
avg_h264_qpel_16_mc23_8_c:                            1640.2 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          351.3 ( 4.67x)
avg_h264_qpel_16_mc23_8_ssse3:                         290.8 ( 5.64x)
avg_h264_qpel_16_mc31_8_c:                            1411.7 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          203.4 ( 6.94x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.9 ( 7.89x)
avg_h264_qpel_16_mc33_8_c:                            1409.7 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          204.6 ( 6.89x)
avg_h264_qpel_16_mc33_8_ssse3:                         178.1 ( 7.92x)
put_h264_qpel_16_mc11_8_c:                            1391.0 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          197.4 ( 7.05x)
put_h264_qpel_16_mc11_8_ssse3:                         176.1 ( 7.90x)
put_h264_qpel_16_mc13_8_c:                            1395.9 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          196.7 ( 7.10x)
put_h264_qpel_16_mc13_8_ssse3:                         177.7 ( 7.85x)
put_h264_qpel_16_mc21_8_c:                            1609.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          341.1 ( 4.72x)
put_h264_qpel_16_mc21_8_ssse3:                         289.2 ( 5.57x)
put_h264_qpel_16_mc23_8_c:                            1604.0 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          340.9 ( 4.71x)
put_h264_qpel_16_mc23_8_ssse3:                         289.6 ( 5.54x)
put_h264_qpel_16_mc31_8_c:                            1390.2 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          194.6 ( 7.14x)
put_h264_qpel_16_mc31_8_ssse3:                         176.4 ( 7.88x)
put_h264_qpel_16_mc33_8_c:                            1400.4 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          198.5 ( 7.06x)
put_h264_qpel_16_mc33_8_ssse3:                         176.2 ( 7.95x)

New benchmarks:
avg_h264_qpel_16_mc11_8_c:                            1413.3 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          171.8 ( 8.23x)
avg_h264_qpel_16_mc11_8_ssse3:                         173.0 ( 8.17x)
avg_h264_qpel_16_mc13_8_c:                            1423.2 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          172.0 ( 8.27x)
avg_h264_qpel_16_mc13_8_ssse3:                         173.4 ( 8.21x)
avg_h264_qpel_16_mc21_8_c:                            1641.3 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          322.1 ( 5.10x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.3 ( 5.63x)
avg_h264_qpel_16_mc23_8_c:                            1629.1 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          323.0 ( 5.04x)
avg_h264_qpel_16_mc23_8_ssse3:                         293.3 ( 5.55x)
avg_h264_qpel_16_mc31_8_c:                            1409.2 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          172.0 ( 8.19x)
avg_h264_qpel_16_mc31_8_ssse3:                         173.7 ( 8.11x)
avg_h264_qpel_16_mc33_8_c:                            1402.5 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          172.5 ( 8.13x)
avg_h264_qpel_16_mc33_8_ssse3:                         173.6 ( 8.08x)
put_h264_qpel_16_mc11_8_c:                            1393.7 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          170.4 ( 8.18x)
put_h264_qpel_16_mc11_8_ssse3:                         178.2 ( 7.82x)
put_h264_qpel_16_mc13_8_c:                            1398.0 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          170.2 ( 8.21x)
put_h264_qpel_16_mc13_8_ssse3:                         178.6 ( 7.83x)
put_h264_qpel_16_mc21_8_c:                            1619.6 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          320.6 ( 5.05x)
put_h264_qpel_16_mc21_8_ssse3:                         297.2 ( 5.45x)
put_h264_qpel_16_mc23_8_c:                            1617.4 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          320.0 ( 5.05x)
put_h264_qpel_16_mc23_8_ssse3:                         297.4 ( 5.44x)
put_h264_qpel_16_mc31_8_c:                            1389.7 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          169.9 ( 8.18x)
put_h264_qpel_16_mc31_8_ssse3:                         178.1 ( 7.80x)
put_h264_qpel_16_mc33_8_c:                            1394.0 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          170.9 ( 8.16x)
put_h264_qpel_16_mc33_8_ssse3:                         176.9 ( 7.88x)

Notice that the SSSE3 versions of mc21 and mc23 benefit from
an optimized version of hv2_lowpass.

Also notice that there is no SSE2 version of the purely horizontal
motion compensation. This means that src2 is currently always aligned
when calling the SSE2 functions (and that srcStride is always equal
to the block width). This has not been exploited yet.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
4880fa4dca avcodec/x86/h264_qpel_8bit: Remove dead macro
Forgotten in 4011a76494.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
35aaf697e9 avcodec/x86/h264_qpel_8bit: Replace qpel8_h_lowpass_l2 MMXEXT by SSE2
Using xmm registers here is very natural, as it allows operating
on eight words at a time. It also saves 48B here
and does not clobber the MMX state.

Old benchmarks (only tests affected by the modified function are shown):
avg_h264_qpel_8_mc11_8_c:                              352.2 ( 1.00x)
avg_h264_qpel_8_mc11_8_sse2:                            70.4 ( 5.00x)
avg_h264_qpel_8_mc11_8_ssse3:                           53.9 ( 6.53x)
avg_h264_qpel_8_mc13_8_c:                              353.3 ( 1.00x)
avg_h264_qpel_8_mc13_8_sse2:                            72.8 ( 4.86x)
avg_h264_qpel_8_mc13_8_ssse3:                           53.8 ( 6.57x)
avg_h264_qpel_8_mc21_8_c:                              404.0 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                           116.1 ( 3.48x)
avg_h264_qpel_8_mc21_8_ssse3:                           94.3 ( 4.28x)
avg_h264_qpel_8_mc23_8_c:                              398.9 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                           118.6 ( 3.36x)
avg_h264_qpel_8_mc23_8_ssse3:                           94.8 ( 4.21x)
avg_h264_qpel_8_mc31_8_c:                              352.7 ( 1.00x)
avg_h264_qpel_8_mc31_8_sse2:                            71.4 ( 4.94x)
avg_h264_qpel_8_mc31_8_ssse3:                           53.8 ( 6.56x)
avg_h264_qpel_8_mc33_8_c:                              354.0 ( 1.00x)
avg_h264_qpel_8_mc33_8_sse2:                            70.6 ( 5.01x)
avg_h264_qpel_8_mc33_8_ssse3:                           53.7 ( 6.59x)
avg_h264_qpel_16_mc11_8_c:                            1417.0 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          276.9 ( 5.12x)
avg_h264_qpel_16_mc11_8_ssse3:                         178.8 ( 7.92x)
avg_h264_qpel_16_mc13_8_c:                            1427.3 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          277.4 ( 5.14x)
avg_h264_qpel_16_mc13_8_ssse3:                         179.7 ( 7.94x)
avg_h264_qpel_16_mc21_8_c:                            1634.1 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          421.3 ( 3.88x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.2 ( 5.61x)
avg_h264_qpel_16_mc23_8_c:                            1627.0 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          420.8 ( 3.87x)
avg_h264_qpel_16_mc23_8_ssse3:                         291.0 ( 5.59x)
avg_h264_qpel_16_mc31_8_c:                            1418.4 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          278.5 ( 5.09x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.6 ( 7.94x)
avg_h264_qpel_16_mc33_8_c:                            1407.3 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          277.6 ( 5.07x)
avg_h264_qpel_16_mc33_8_ssse3:                         179.9 ( 7.82x)
put_h264_qpel_8_mc11_8_c:                              348.1 ( 1.00x)
put_h264_qpel_8_mc11_8_sse2:                            69.1 ( 5.04x)
put_h264_qpel_8_mc11_8_ssse3:                           53.8 ( 6.47x)
put_h264_qpel_8_mc13_8_c:                              349.3 ( 1.00x)
put_h264_qpel_8_mc13_8_sse2:                            69.7 ( 5.01x)
put_h264_qpel_8_mc13_8_ssse3:                           53.7 ( 6.51x)
put_h264_qpel_8_mc21_8_c:                              398.5 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                           115.0 ( 3.46x)
put_h264_qpel_8_mc21_8_ssse3:                           95.3 ( 4.18x)
put_h264_qpel_8_mc23_8_c:                              399.9 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                           120.8 ( 3.31x)
put_h264_qpel_8_mc23_8_ssse3:                           95.4 ( 4.19x)
put_h264_qpel_8_mc31_8_c:                              350.4 ( 1.00x)
put_h264_qpel_8_mc31_8_sse2:                            69.6 ( 5.03x)
put_h264_qpel_8_mc31_8_ssse3:                           54.2 ( 6.47x)
put_h264_qpel_8_mc33_8_c:                              353.1 ( 1.00x)
put_h264_qpel_8_mc33_8_sse2:                            71.0 ( 4.97x)
put_h264_qpel_8_mc33_8_ssse3:                           54.2 ( 6.51x)
put_h264_qpel_16_mc11_8_c:                            1384.2 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          272.9 ( 5.07x)
put_h264_qpel_16_mc11_8_ssse3:                         178.3 ( 7.76x)
put_h264_qpel_16_mc13_8_c:                            1393.6 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          271.1 ( 5.14x)
put_h264_qpel_16_mc13_8_ssse3:                         178.3 ( 7.82x)
put_h264_qpel_16_mc21_8_c:                            1612.6 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          416.5 ( 3.87x)
put_h264_qpel_16_mc21_8_ssse3:                         289.1 ( 5.58x)
put_h264_qpel_16_mc23_8_c:                            1621.3 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          416.9 ( 3.89x)
put_h264_qpel_16_mc23_8_ssse3:                         289.4 ( 5.60x)
put_h264_qpel_16_mc31_8_c:                            1408.4 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          273.5 ( 5.15x)
put_h264_qpel_16_mc31_8_ssse3:                         176.9 ( 7.96x)
put_h264_qpel_16_mc33_8_c:                            1396.4 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          276.3 ( 5.05x)
put_h264_qpel_16_mc33_8_ssse3:                         176.4 ( 7.92x)

New benchmarks:
avg_h264_qpel_8_mc11_8_c:                              352.1 ( 1.00x)
avg_h264_qpel_8_mc11_8_sse2:                            52.5 ( 6.71x)
avg_h264_qpel_8_mc11_8_ssse3:                           53.9 ( 6.54x)
avg_h264_qpel_8_mc13_8_c:                              350.8 ( 1.00x)
avg_h264_qpel_8_mc13_8_sse2:                            54.7 ( 6.42x)
avg_h264_qpel_8_mc13_8_ssse3:                           54.3 ( 6.46x)
avg_h264_qpel_8_mc21_8_c:                              400.1 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                            98.6 ( 4.06x)
avg_h264_qpel_8_mc21_8_ssse3:                           95.5 ( 4.19x)
avg_h264_qpel_8_mc23_8_c:                              400.4 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                           101.4 ( 3.95x)
avg_h264_qpel_8_mc23_8_ssse3:                           95.9 ( 4.18x)
avg_h264_qpel_8_mc31_8_c:                              352.4 ( 1.00x)
avg_h264_qpel_8_mc31_8_sse2:                            52.9 ( 6.67x)
avg_h264_qpel_8_mc31_8_ssse3:                           54.4 ( 6.48x)
avg_h264_qpel_8_mc33_8_c:                              354.5 ( 1.00x)
avg_h264_qpel_8_mc33_8_sse2:                            52.9 ( 6.70x)
avg_h264_qpel_8_mc33_8_ssse3:                           54.4 ( 6.52x)
avg_h264_qpel_16_mc11_8_c:                            1420.4 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          204.8 ( 6.93x)
avg_h264_qpel_16_mc11_8_ssse3:                         177.9 ( 7.98x)
avg_h264_qpel_16_mc13_8_c:                            1409.8 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          206.4 ( 6.83x)
avg_h264_qpel_16_mc13_8_ssse3:                         178.0 ( 7.92x)
avg_h264_qpel_16_mc21_8_c:                            1634.1 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          349.6 ( 4.67x)
avg_h264_qpel_16_mc21_8_ssse3:                         290.0 ( 5.63x)
avg_h264_qpel_16_mc23_8_c:                            1624.1 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          350.0 ( 4.64x)
avg_h264_qpel_16_mc23_8_ssse3:                         291.9 ( 5.56x)
avg_h264_qpel_16_mc31_8_c:                            1407.2 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          205.8 ( 6.84x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.2 ( 7.90x)
avg_h264_qpel_16_mc33_8_c:                            1400.5 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          206.3 ( 6.79x)
avg_h264_qpel_16_mc33_8_ssse3:                         179.4 ( 7.81x)
put_h264_qpel_8_mc11_8_c:                              349.7 ( 1.00x)
put_h264_qpel_8_mc11_8_sse2:                            50.2 ( 6.96x)
put_h264_qpel_8_mc11_8_ssse3:                           51.3 ( 6.82x)
put_h264_qpel_8_mc13_8_c:                              349.8 ( 1.00x)
put_h264_qpel_8_mc13_8_sse2:                            50.7 ( 6.90x)
put_h264_qpel_8_mc13_8_ssse3:                           51.7 ( 6.76x)
put_h264_qpel_8_mc21_8_c:                              398.0 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                            96.5 ( 4.13x)
put_h264_qpel_8_mc21_8_ssse3:                           92.3 ( 4.31x)
put_h264_qpel_8_mc23_8_c:                              401.4 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                           102.3 ( 3.92x)
put_h264_qpel_8_mc23_8_ssse3:                           92.8 ( 4.32x)
put_h264_qpel_8_mc31_8_c:                              349.4 ( 1.00x)
put_h264_qpel_8_mc31_8_sse2:                            50.8 ( 6.88x)
put_h264_qpel_8_mc31_8_ssse3:                           51.8 ( 6.75x)
put_h264_qpel_8_mc33_8_c:                              351.1 ( 1.00x)
put_h264_qpel_8_mc33_8_sse2:                            52.2 ( 6.73x)
put_h264_qpel_8_mc33_8_ssse3:                           51.7 ( 6.79x)
put_h264_qpel_16_mc11_8_c:                            1391.1 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          196.6 ( 7.07x)
put_h264_qpel_16_mc11_8_ssse3:                         178.2 ( 7.81x)
put_h264_qpel_16_mc13_8_c:                            1385.2 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          195.6 ( 7.08x)
put_h264_qpel_16_mc13_8_ssse3:                         176.6 ( 7.84x)
put_h264_qpel_16_mc21_8_c:                            1607.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          341.0 ( 4.71x)
put_h264_qpel_16_mc21_8_ssse3:                         289.1 ( 5.56x)
put_h264_qpel_16_mc23_8_c:                            1616.7 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          340.8 ( 4.74x)
put_h264_qpel_16_mc23_8_ssse3:                         288.6 ( 5.60x)
put_h264_qpel_16_mc31_8_c:                            1397.6 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          197.3 ( 7.08x)
put_h264_qpel_16_mc31_8_ssse3:                         175.4 ( 7.97x)
put_h264_qpel_16_mc33_8_c:                            1394.3 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          197.7 ( 7.05x)
put_h264_qpel_16_mc33_8_ssse3:                         175.2 ( 7.96x)

As can be seen, the SSE2 version is often neck and neck with the SSSE3
version (which also benefits from a better hv2_lowpass SSSE3
implementation for mc21 and mc23) for eight byte block sizes.
Unsurprisingly, SSSE3 beats SSE2 for 16x16 blocks: For SSE2,
these blocks are processed by calling the 8x8 function four times
whereas SSSE3 has a dedicated function (on x64).
This implementation should also be extendable to an AVX version
for 16x16 blocks.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
fa9ea5113b avcodec/x86/h264_qpel_8bit: Optimize branch away
ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3()
currently is almost the disjoint union of the codepaths
for sizes 8 and 16. This size is a compile-time constant
at every callsite. So split the function and avoid
the runtime branch.
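
A minimal C sketch of the same idea, with a trivial row sum standing in for the real lowpass code (all names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Before: a single entry point that branches on size at run time. */
static unsigned row_sum_8or16(const uint8_t *src, int size)
{
    unsigned sum = 0;
    assert(size == 8 || size == 16);
    if (size == 8) {
        for (int i = 0; i < 8; i++)
            sum += src[i];
    } else {
        /* the size 16 path shares almost nothing with the size 8 path */
        for (int i = 0; i < 16; i += 2)
            sum += src[i] + src[i + 1];
    }
    return sum;
}

/* After: one function per size, so no size check is left at run time. */
static unsigned row_sum_8(const uint8_t *src)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += src[i];
    return sum;
}

static unsigned row_sum_16(const uint8_t *src)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i += 2)
        sum += src[i] + src[i + 1];
    return sum;
}
```

Since every caller already knows the size, it can call the matching function directly and the size argument disappears from the interface.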

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
400203c00c avcodec/x86/h264_qpel: Remove unused parameter from hv2_lowpass funcs
tmpstride is unused. This also allows removing said parameter
from lots of functions in h264_qpel.c.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
b84c818c83 avcodec/x86/h264_qpel: Remove constant parameters from shift5 funcs
They are constant since the size 16 version is no longer emulated
via the size 8 version.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
810bd3e62a avcodec/x86/h264_qpel: Add ff_{avg,put}_pixels16_l2_shift5_sse2
Up until now this function was emulated via two calls
to ff_{avg,put}_pixels8_l2_shift5_mmxext(). Adding a dedicated
function proved beneficial both size wise and performance wise:
The new functions take 192B, yet the simplified calls save
256B with GCC and 320B with Clang here.

This change will also allow further optimizations.

Old benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1735.8 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          300.8 ( 5.77x)
avg_h264_qpel_16_mc12_8_ssse3:                         233.3 ( 7.44x)
avg_h264_qpel_16_mc32_8_c:                            1777.9 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          275.6 ( 6.45x)
avg_h264_qpel_16_mc32_8_ssse3:                         235.7 ( 7.54x)
put_h264_qpel_16_mc12_8_c:                            1808.2 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          267.2 ( 6.77x)
put_h264_qpel_16_mc12_8_ssse3:                         231.9 ( 7.80x)
put_h264_qpel_16_mc32_8_c:                            1766.9 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          272.9 ( 6.47x)
put_h264_qpel_16_mc32_8_ssse3:                         229.5 ( 7.70x)

New benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1742.3 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          240.3 ( 7.25x)
avg_h264_qpel_16_mc12_8_ssse3:                         214.8 ( 8.11x)
avg_h264_qpel_16_mc32_8_c:                            1748.0 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          238.0 ( 7.35x)
avg_h264_qpel_16_mc32_8_ssse3:                         209.2 ( 8.35x)
put_h264_qpel_16_mc12_8_c:                            2014.4 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          243.7 ( 8.27x)
put_h264_qpel_16_mc12_8_ssse3:                         211.5 ( 9.52x)
put_h264_qpel_16_mc32_8_c:                            1800.0 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          238.8 ( 7.54x)
put_h264_qpel_16_mc32_8_ssse3:                         206.7 ( 8.71x)

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
279b6f3cf5 avcodec/fpel: Avoid loop in ff_avg_pixels4_mmxext()
It is only used by h264_qpel.c and only with height four
(which is unrolled) and uses a loop in order to handle
multiples of four as height. Remove the loop and the height
parameter and move the function to h264_qpel_8bit.asm.
This leads to a bit of code duplication, but this is simpler
than all the %if checks necessary to achieve the same outcome
in fpel.asm.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
e340f31b89 avcodec/x86/fpel: Remove redundant repetition
The repetition count is always one since
2cf9e733c6.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
b0c91c2fba avcodec/h264qpel: Make avg_h264_qpel_pixels_tab smaller
avg_h264_qpel only supports 16x16, 8x8 and 4x4 blocksizes,
so it is currently unnecessarily large.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
6eb8bc4217 avcodec/h264qpel: Don't build unused 2x2 size funcs for bitdepths > 8
The 2x2 put functions are only used by Snow and Snow uses
only the eight bit versions. The rest is dead code. Disabling
it saved 41277B here.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
92ae9d1ffc configure: Remove vc1dsp->qpeldsp dependency
It only needs it for some x86 fpel functions; instead
add a direct dependency for that.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
16d5e074dc avcodec/mips/Makefile: Fix VC1DSP build rules
Affected standalone builds of the VC-1 parser.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
0035d99c61 configure: Avoid mpeg4video_parser->{h263,qpel}dsp dependency
This can be easily achieved by moving code only used by the MPEG-4
decoder behind #if CONFIG_MPEG4_DECODER.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
c4c616db53 avcodec/x86/qpel: Move ff_{put,avg}_pixels4_l2_mmxext to h264_qpel
Only used there.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
1e11fdff52 avcodec/x86/qpel{,dsp_init}: Remove constant function parameters
ff_avg_pixels{4,8,16}_l2_mmxext() are always called with height
equal to their blocksize. And ff_{put,avg}_pixels4_l2_mmxext()
are furthermore always called with both strides being equal.
So remove these redundant function parameters.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
52a77128fd avcodec/x86/qpel{dsp,dsp_init}: Use ptrdiff_t for stride
This is more correct given that qpel_mc_func already uses ptrdiff_t;
it also allows avoiding movsxdifnidn.
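
Why the type matters can be seen in a small C sketch: a stride may be negative, and with a pointer-sized type it can be added to an address as-is, whereas a 32-bit int has to be sign-extended first on 64-bit targets. The helper below is hypothetical and only illustrates the calling convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy h rows of w bytes; both strides are in bytes and may be negative. */
static void copy_block(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        /* pointer-sized arithmetic, no widening of the stride needed */
        uint8_t       *d = dst + y * dst_stride;
        const uint8_t *s = src + y * src_stride;

        for (int x = 0; x < w; x++)
            d[x] = s[x];
    }
}
```

Passing a pointer to the last row together with a negative source stride reads the rows bottom-up, which flips the block vertically.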

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
cacf854fe7 avcodec/x86/qpel: Remove always-false branches
The ff_avg_pixels{4,8,16}_l2_mmxext() functions are only ever
used in the last step (the one that actually writes to the dst buffer)
where the number of lines to process is always equal to the
dimensions of the block, whereas ff_put_pixels{8,16}_mmxext()
are also used in intermediate calculations where the number of
lines can be 9 or 17.

The code in qpel.asm uses common macros for both and processes
more than one line per loop iteration; it therefore checks
whether the number of lines is odd and treats the extra line separately;
yet this special handling is only needed for the put functions,
not the avg functions. It has therefore been %if'ed away for these.

The check is also not needed for ff_put_pixels4_l2_mmxext() which
is only used by H.264 which always processes four lines. Because
ff_{avg,put}_pixels4_l2_mmxext() processes four lines in a single loop
iteration, not only the odd-height handling, but the whole loop
could be removed.
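
The odd-height handling described above can be sketched in C. Both helpers process two lines per iteration; only the first one needs the extra branch. The names and the averaging kernel are hypothetical and merely stand in for the macros in qpel.asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two rows of w bytes. */
static void avg_row(uint8_t *dst, const uint8_t *a, const uint8_t *b, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((a[x] + b[x] + 1) >> 1);
}

/* Any height: peel one line when h is odd (e.g. 9 or 17), then go in pairs. */
static void l2_any_height(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                          ptrdiff_t stride, int w, int h)
{
    int y = 0;
    assert(h > 0);
    if (h & 1) {
        avg_row(dst, a, b, w);
        y = 1;
    }
    for (; y < h; y += 2) {
        avg_row(dst + y * stride, a + y * stride, b + y * stride, w);
        avg_row(dst + (y + 1) * stride, a + (y + 1) * stride,
                b + (y + 1) * stride, w);
    }
}

/* Height known to be even (the block size): the odd-line branch is gone. */
static void l2_even_height(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                           ptrdiff_t stride, int w, int h)
{
    assert(h > 0 && (h & 1) == 0);
    for (int y = 0; y < h; y += 2) {
        avg_row(dst + y * stride, a + y * stride, b + y * stride, w);
        avg_row(dst + (y + 1) * stride, a + (y + 1) * stride,
                b + (y + 1) * stride, w);
    }
}
```

When the height is additionally fixed at the number of lines handled per iteration, as for the four-line case mentioned above, the loop itself collapses into straight-line code.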

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:32 +02:00
Andreas Rheinhardt
15a9c8dea3 avcodec/liblc3enc: Avoid allocating buffer to send a zero frame
liblc3 supports arbitrary strides, so one can simply use a stride
of zero to make it read the same zero value again and again.
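
The stride trick can be illustrated without liblc3 itself. The reader below is a hypothetical stand-in for any consumer that fetches samples with a caller-supplied stride; it is not the liblc3 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical consumer: reads n samples that are `stride` elements apart
 * and returns their sum. With stride 0 every read hits the same element.
 */
static int64_t sum_strided(const int16_t *pcm, int stride, int n)
{
    int64_t sum = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sum += pcm[(ptrdiff_t)i * stride];
    return sum;
}
```

A single zero-valued sample passed with a stride of zero therefore behaves like an arbitrarily long buffer of silence, so no frame-sized allocation is needed.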

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 06:07:37 +02:00
Andreas Rheinhardt
ab7d1c64c9 avcodec/x86/h263_loopfilter: Port loop filter to SSE2
Old benchmarks:
h263dsp.h_loop_filter_c:                                41.2 ( 1.00x)
h263dsp.h_loop_filter_mmx:                              39.5 ( 1.04x)
h263dsp.v_loop_filter_c:                                43.5 ( 1.00x)
h263dsp.v_loop_filter_mmx:                              16.9 ( 2.57x)

New benchmarks:
h263dsp.h_loop_filter_c:                                41.6 ( 1.00x)
h263dsp.h_loop_filter_sse2:                             28.2 ( 1.48x)
h263dsp.v_loop_filter_c:                                42.4 ( 1.00x)
h263dsp.v_loop_filter_sse2:                             15.1 ( 2.81x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-03 17:05:46 +00:00
Cameron Gutman
df4587789f avcodec/amfenc: avoid unnecessary output delay in low delay mode
The code optimizes throughput by letting the encoder work on frame N
until frame N+1 is ready for submission, but this hurts low-delay uses
by delaying output by one frame. Don't delay output beyond what is
necessary when AV_CODEC_FLAG_LOW_DELAY is used.

Signed-off-by: Cameron Gutman <aicommander@gmail.com>
2025-10-03 11:05:03 +00:00
Michael Niedermayer
61b6877637 avcodec/mjpegdec: Explain buf_size/width/height check
Suggested-by: Ramiro

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-10-02 12:52:43 +00:00
James Almer
5511641365 avcodec/atrac9dec: use av_zero_extend()
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-01 01:26:19 +00:00
James Almer
7ce3a14496 avcodec/apv_entropy: use av_zero_extend()
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-01 01:26:19 +00:00
James Almer
776ee07990 avcodec/aom_film_grain: use av_zero_extend()
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-01 01:26:19 +00:00
Koushik Dutta via ffmpeg-devel
fd136a4d82 ffv1enc_vulkan: fix empty struct build error on msvc
Signed-off-by: Koushik Dutta <koushd@gmail.com>
2025-09-30 19:36:56 +09:00
James Almer
d975dbd7b7 avcodec/libdav1d: bump minimum supported version to 1.0.0
This allows us to remove old deprecated options.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-09-28 23:53:27 -03:00
Andreas Rheinhardt
635cb4543f avcodec/bsf/ahx_to_mp2: Don't output uninitialized data
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-27 17:57:15 +02:00
Andreas Rheinhardt
0f1f345c37 avcodec/x86/qpeldsp_init: Fix compilation without external assembly
Broken in 2cf9e733c6.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 18:30:53 +02:00
Kacper Michajłow
d6cb0d2c2b ALL: move av_unused to conform with standard requirement
This is the placement required by the standard [[maybe_unused]] attribute;
it works the same for __attribute__((unused)).
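
A short sketch of the placement in question. MY_UNUSED is a hypothetical stand-in and not FFmpeg's av_unused definition; the point is only that putting the marker first is accepted for both spellings, whereas the standard spelling is generally not accepted in the middle of the declaration specifiers.

```c
#include <assert.h>

/* Hypothetical stand-in for an "unused" marker macro. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#define MY_UNUSED [[maybe_unused]]
#elif defined(__GNUC__)
#define MY_UNUSED __attribute__((unused))
#else
#define MY_UNUSED
#endif

/* Marker first, then storage class and type: fine for both spellings. */
MY_UNUSED static int helper(int x)
{
    return x + 1;
}

static int checked_double(int x)
{
    /* Only read by assert(), so it would be unused when NDEBUG is defined. */
    MY_UNUSED int in_range = x > -1000 && x < 1000;
    assert(in_range);
    return 2 * x;
}
```

Writing the marker between the storage class and the type, as in `static MY_UNUSED int helper(int x)`, is the kind of placement that works with the GNU attribute but not with the standard one.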

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-09-26 16:15:46 +00:00
Andreas Rheinhardt
a54d6b1d91 avcodec/x86/rnd_template: Merge into hpeldsp_init.c
It is now only included exactly once.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:58 +02:00
Andreas Rheinhardt
43fe9554cc avcodec/x86/hpeldsp_init: Avoid complicating macro
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:55 +02:00
Andreas Rheinhardt
00e046df13 avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSE2
This affects the {avg,put}_no_rnd_pixels16_{x,y}2 MMX and
(put-only) MMXEXT versions. Removing these functions saved
1184B here.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:53 +02:00
Andreas Rheinhardt
30c4007c65 avcodec/x86/hpeldsp: Add SSE2 avg_no_rnd size 16 versions
These currently only exist as MMX versions.
The added functions occupy 320B here. So far, they are only for
the x2 and y2 (i.e. right and down, not down-right) directions.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:49 +02:00
Andreas Rheinhardt
1e677e6964 avcodec/x86/hpeldsp: Add SSE2 put_no_rnd size 16 versions
These currently only exist as MMX and (not bitexact) MMXEXT versions.
The added functions occupy 288B here. So far, they are only for
the x2 and y2 (i.e. right and down, not down-right) directions.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:46 +02:00
Andreas Rheinhardt
262791b8d8 avcodec/hpeldsp: Make put_no_rnd_pixels_tab smaller
Only the blocksizes 16 and 8 are implemented, yet the motion estimation
code touches the blocksize 4 entries. But really nothing touches
the blocksize 2 entries, so that we can reduce the put_no_rnd_pixels_tab
array size to [3][4].

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:43 +02:00
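
A rough sketch of what the [3][4] layout from 262791b8d8 implies for lookups. All names here are hypothetical. The row order (16, 8, 4 pixels wide) and the meaning of the second index (0 = full-pel, 1 = x2, 2 = y2, 3 = xy2) are assumptions based on the usual hpeldsp table convention, not taken from the commit itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function-pointer type for a block copy routine. */
typedef void (*example_op_pixels_func)(uint8_t *dst, const uint8_t *src,
                                       ptrdiff_t stride, int h);

enum { EXAMPLE_NB_SIZES = 3, EXAMPLE_NB_SUBPEL = 4 };

typedef struct ExampleHpelContext {
    /* Rows for widths 16, 8 and 4; there is no row for width 2 anymore. */
    example_op_pixels_func put_no_rnd_pixels_tab[EXAMPLE_NB_SIZES][EXAMPLE_NB_SUBPEL];
} ExampleHpelContext;

/* Map a block width to its row, or -1 if the table has no such row. */
int example_size_index(int width)
{
    switch (width) {
    case 16: return 0;
    case  8: return 1;
    case  4: return 2;
    default: return -1;
    }
}

/* Bounds-checked lookup; returns NULL for widths or subpel
 * positions that have no slot in the table. */
example_op_pixels_func example_lookup(const ExampleHpelContext *c,
                                      int width, int dxy)
{
    int idx = example_size_index(width);

    if (idx < 0 || dxy < 0 || dxy >= EXAMPLE_NB_SUBPEL)
        return NULL;
    return c->put_no_rnd_pixels_tab[idx][dxy];
}
```

The point of the sketch is only that a width of 2 no longer has a slot, so any caller that indexed row 3 would now be out of bounds.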
Andreas Rheinhardt
c7161befb4 avcodec/x86/h264_qpel: Remove MMX(EXT) funcs overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:39 +02:00
Andreas Rheinhardt
5ef613bcb0 avcodec/x86/mpegvideoencdsp_init: Remove MMX, 3DNOW funcs overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX and 3DNOW functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Also merge the mpegvideoenc_qns_template.c file into the main file.

The 3DNOW functions removed in this commit were the last in the
codebase.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:31 +02:00
Andreas Rheinhardt
6a47ea5f9f avcodec/x86/vvc/sao_10bit: Remove unused functions
Saves 65280B here.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:26 +02:00
Andreas Rheinhardt
918d37d9d1 avcodec/x86/rv40dsp_init: Remove MMX(EXT) funcs overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:23 +02:00
Andreas Rheinhardt
e86f137514 avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:19 +02:00
Andreas Rheinhardt
2cf9e733c6 avcodec/x86/qpeldsp_init: Use SSE2 versions where possible
The mc00 versions (i.e. the qdsp functions with no subpixel
interpolation) are just wrappers around their fpel versions.
There are SSE2 versions of these, yet the qpel code only
uses the MMX(EXT) versions. This commit changes this and
also removes the MMX(EXT) versions.

This also allowed removing ff_avg_pixels16_mmxext and
ff_put_pixels16_mmx.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:16 +02:00
Andreas Rheinhardt
1f9ef6a8dc avcodec/x86/h264_qpel: Remove MMX(EXT) functions overridden by SSE2FAST
CPUs which support SSE2, but not in a fast way (so that
they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient
nowadays (2007 and older), so ignore the distinction between
the two and remove MMX and MMXEXT functions that are now
overridden by SSE2 functions.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:12 +02:00
Andreas Rheinhardt
8a7858dacf avcodec/x86/hpeldsp_init: Remove MMX(EXT) functions overridden by SSE2FAST
CPUs which support SSE2, but not in a fast way (so that
they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient
nowadays (2007 and older), so ignore the distinction between
the two and remove MMX and MMXEXT functions that are now
overridden by SSE2 functions.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:08 +02:00
Andreas Rheinhardt
4d691da5ed avcodec/x86/hpeldsp_init: Remove MMX functions overridden by MMXEXT
Forgotten in a51279bbde because
I only looked for MMX(EXT) functions overridden by SSE2.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:04 +02:00
Andreas Rheinhardt
fcb9e0b5f0 avcodec/hpel{dsp,_template}: Use ptrdiff_t for strides
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:20:56 +02:00
Andreas Rheinhardt
89f2016ece avcodec/hpel_template: Fix unintentional usage of unsigned offsets
The value of sizeof() is of type size_t which means that
an expression like
src1[i * src_stride1 + 4 * sizeof(pixel)]
will use a very large offset if src_stride1 is sufficiently negative.
It works in practice (because it is correct modulo SIZE_MAX + 1),
but UBSan treats it as error:
libavcodec/hpel_template.c:104:1: runtime error: addition of unsigned offset to 0x7ffdfa0391d8 overflowed to 0x7ffdfa0391cc
Fix this by casting sizeof(pixel) to int.

(This has been uncovered by a checkasm test for the hpeldsp
which will be added in a later commit.)

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:20:52 +02:00
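
The effect described in 89f2016ece can be reproduced in isolation. The sketch below is not the hpel_template.c code: the function names are hypothetical, the pixel type is assumed to be 8-bit, and the stride is taken as int. It only computes the index expression with and without the cast of sizeof(pixel) to int.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit pixels */

/* Without the cast: sizeof() has type size_t, so the negative
 * product i * stride is converted to size_t before the addition
 * and the result is a huge unsigned offset. */
size_t example_offset_without_cast(int i, int stride)
{
    return i * stride + 4 * sizeof(pixel);
}

/* With the cast: the whole expression stays in signed arithmetic
 * and yields the small negative offset that was intended. */
ptrdiff_t example_offset_with_cast(int i, int stride)
{
    return i * stride + 4 * (int)sizeof(pixel);
}
```

For i = 1 and stride = -16 the cast version gives -12, which matches the 12-byte backwards step in the UBSan report (0x...d8 to 0x...cc). The version without the cast gives the same value reduced modulo SIZE_MAX + 1, which is why the pointer arithmetic happened to work.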
Andreas Rheinhardt
b316a1bdd1 avcodec/hpeldsp: Fix documentation
This commit fixes two issues in the documentation:
a) The documentation for {put,avg}_pixels_tab only mentions
widths 16 and 8, although it explicitly mentions that there
are four horizontal blocksizes. This part of the patch
basically reverts e5771f4f37.
b) The restrictions on height don't match the reality. While
most users abide by it, some do not:
i) vp56.c copies a 16x12 block.
ii) indeo3 can copy an arbitrary multiple of four lines
for block widths 4, 8 and 16.
iii) SVQ3 can use luma block sizes 16x16, 8x16,
16x8, 8x8, 4x8, 8x4 and 4x4 and the corresponding
8x8, 4x8, 8x4, 4x4, 2x4, 4x2 and 2x2 chroma block sizes.

This implies that for widths 2 and 4 height can be two
and is guaranteed to be at least even. For all other widths,
height can be a multiple of four.

Furthermore, a comment for the SVQ3 blocksizes has been added.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:20:30 +02:00
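
The block-size arithmetic in b316a1bdd1 can be checked with a small sketch. The names are hypothetical, and halving both dimensions to get the chroma block is an assumption (4:2:0 subsampling) that happens to reproduce the chroma sizes listed in the message. The height rule is a literal reading of the message: even heights for widths 2 and 4, multiples of four for all other widths.

```c
#include <stdbool.h>

typedef struct ExampleBlock {
    int w, h;
} ExampleBlock;

/* Luma block sizes listed for SVQ3 in the commit message (width x height). */
const ExampleBlock example_svq3_luma[7] = {
    {16, 16}, {8, 16}, {16, 8}, {8, 8}, {4, 8}, {8, 4}, {4, 4},
};

/* Assumption: chroma planes are subsampled by two in both directions. */
ExampleBlock example_chroma_block(ExampleBlock luma)
{
    ExampleBlock c = { luma.w / 2, luma.h / 2 };
    return c;
}

/* Height restriction as stated in the message: widths 2 and 4 only
 * guarantee an even height, wider blocks a multiple of four. */
bool example_height_allowed(ExampleBlock b)
{
    if (b.w == 2 || b.w == 4)
        return b.h >= 2 && b.h % 2 == 0;
    return b.h >= 4 && b.h % 4 == 0;
}
```

Under these assumptions every listed luma size and its derived chroma size passes the check, as does the 16x12 block mentioned for vp56.c.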