ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-02-11 20:49:59 +00:00

Author	SHA1	Message	Date
Andreas Rheinhardt	697da64c8e	avcodec/x86/h264_qpel: Port pixel8_l2_shift5 from MMXEXT to SSE2 This abides by the ABI (no missing emms) and yields a tiny performance improvement here. Old benchmarks: avg_h264_qpel_8_mc12_8_c: 419.9 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 78.9 ( 5.32x) avg_h264_qpel_8_mc12_8_ssse3: 71.7 ( 5.86x) avg_h264_qpel_8_mc32_8_c: 429.1 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 76.9 ( 5.58x) avg_h264_qpel_8_mc32_8_ssse3: 73.4 ( 5.84x) put_h264_qpel_8_mc12_8_c: 424.0 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 78.6 ( 5.40x) put_h264_qpel_8_mc12_8_ssse3: 70.6 ( 6.00x) put_h264_qpel_8_mc32_8_c: 425.7 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 75.2 ( 5.66x) put_h264_qpel_8_mc32_8_ssse3: 70.4 ( 6.05x) New benchmarks: avg_h264_qpel_8_mc12_8_c: 425.7 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 77.5 ( 5.49x) avg_h264_qpel_8_mc12_8_ssse3: 69.8 ( 6.10x) avg_h264_qpel_8_mc32_8_c: 423.7 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 74.6 ( 5.68x) avg_h264_qpel_8_mc32_8_ssse3: 71.9 ( 5.89x) put_h264_qpel_8_mc12_8_c: 422.2 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 75.8 ( 5.57x) put_h264_qpel_8_mc12_8_ssse3: 67.9 ( 6.22x) put_h264_qpel_8_mc32_8_c: 421.8 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 72.6 ( 5.81x) put_h264_qpel_8_mc32_8_ssse3: 67.7 ( 6.23x) Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	4ac9162beb	avcodec/x86/h264_qpel: Don't use ff_ prefix for static functions Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	cd077e88d1	avcodec/x86/h264_qpel: Add ff_{avg,put}_h264_qpel16_h_lowpass_l2_sse2() These functions are currently emulated via four calls to the versions for 8x8 blocks. In fact, the size savings from the simplified calls in h264_qpel.c (GCC 1344B, Clang 1280B) more than outweigh the size of the added functions (512B) here. It is also beneficial performance-wise. Old benchmarks: avg_h264_qpel_16_mc11_8_c: 1414.1 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 206.2 ( 6.86x) avg_h264_qpel_16_mc11_8_ssse3: 177.7 ( 7.96x) avg_h264_qpel_16_mc13_8_c: 1417.0 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 207.4 ( 6.83x) avg_h264_qpel_16_mc13_8_ssse3: 178.2 ( 7.95x) avg_h264_qpel_16_mc21_8_c: 1632.8 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 349.3 ( 4.67x) avg_h264_qpel_16_mc21_8_ssse3: 291.3 ( 5.60x) avg_h264_qpel_16_mc23_8_c: 1640.2 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 351.3 ( 4.67x) avg_h264_qpel_16_mc23_8_ssse3: 290.8 ( 5.64x) avg_h264_qpel_16_mc31_8_c: 1411.7 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 203.4 ( 6.94x) avg_h264_qpel_16_mc31_8_ssse3: 178.9 ( 7.89x) avg_h264_qpel_16_mc33_8_c: 1409.7 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 204.6 ( 6.89x) avg_h264_qpel_16_mc33_8_ssse3: 178.1 ( 7.92x) put_h264_qpel_16_mc11_8_c: 1391.0 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 197.4 ( 7.05x) put_h264_qpel_16_mc11_8_ssse3: 176.1 ( 7.90x) put_h264_qpel_16_mc13_8_c: 1395.9 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 196.7 ( 7.10x) put_h264_qpel_16_mc13_8_ssse3: 177.7 ( 7.85x) put_h264_qpel_16_mc21_8_c: 1609.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 341.1 ( 4.72x) put_h264_qpel_16_mc21_8_ssse3: 289.2 ( 5.57x) put_h264_qpel_16_mc23_8_c: 1604.0 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 340.9 ( 4.71x) put_h264_qpel_16_mc23_8_ssse3: 289.6 ( 5.54x) put_h264_qpel_16_mc31_8_c: 1390.2 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 194.6 ( 7.14x) put_h264_qpel_16_mc31_8_ssse3: 176.4 ( 7.88x) put_h264_qpel_16_mc33_8_c: 1400.4 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 198.5 ( 7.06x) put_h264_qpel_16_mc33_8_ssse3: 176.2 
( 7.95x) New benchmarks: avg_h264_qpel_16_mc11_8_c: 1413.3 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 171.8 ( 8.23x) avg_h264_qpel_16_mc11_8_ssse3: 173.0 ( 8.17x) avg_h264_qpel_16_mc13_8_c: 1423.2 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 172.0 ( 8.27x) avg_h264_qpel_16_mc13_8_ssse3: 173.4 ( 8.21x) avg_h264_qpel_16_mc21_8_c: 1641.3 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 322.1 ( 5.10x) avg_h264_qpel_16_mc21_8_ssse3: 291.3 ( 5.63x) avg_h264_qpel_16_mc23_8_c: 1629.1 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 323.0 ( 5.04x) avg_h264_qpel_16_mc23_8_ssse3: 293.3 ( 5.55x) avg_h264_qpel_16_mc31_8_c: 1409.2 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 172.0 ( 8.19x) avg_h264_qpel_16_mc31_8_ssse3: 173.7 ( 8.11x) avg_h264_qpel_16_mc33_8_c: 1402.5 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 172.5 ( 8.13x) avg_h264_qpel_16_mc33_8_ssse3: 173.6 ( 8.08x) put_h264_qpel_16_mc11_8_c: 1393.7 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 170.4 ( 8.18x) put_h264_qpel_16_mc11_8_ssse3: 178.2 ( 7.82x) put_h264_qpel_16_mc13_8_c: 1398.0 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 170.2 ( 8.21x) put_h264_qpel_16_mc13_8_ssse3: 178.6 ( 7.83x) put_h264_qpel_16_mc21_8_c: 1619.6 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 320.6 ( 5.05x) put_h264_qpel_16_mc21_8_ssse3: 297.2 ( 5.45x) put_h264_qpel_16_mc23_8_c: 1617.4 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 320.0 ( 5.05x) put_h264_qpel_16_mc23_8_ssse3: 297.4 ( 5.44x) put_h264_qpel_16_mc31_8_c: 1389.7 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 169.9 ( 8.18x) put_h264_qpel_16_mc31_8_ssse3: 178.1 ( 7.80x) put_h264_qpel_16_mc33_8_c: 1394.0 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 170.9 ( 8.16x) put_h264_qpel_16_mc33_8_ssse3: 176.9 ( 7.88x) Notice that the SSSE3 versions of mc21 and mc23 benefit from an optimized version of hv2_lowpass. Also notice that there is no SSE2 version of the purely horizontal motion compensation. This means that src2 is currently always aligned when calling the SSE2 functions (and that srcStride is always equal to the block width). Yet this has not been exploited (yet). 
Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	4880fa4dca	avcodec/x86/h264_qpel_8bit: Remove dead macro Forgotten in `4011a76494`. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	35aaf697e9	avcodec/x86/h264_qpel_8bit: Replace qpel8_h_lowpass_l2 MMXEXT by SSE2 Using xmm registers here is very natural, as it allows to operate on eight words at a time. It also saves 48B here and does not clobber the MMX state. Old benchmarks (only tests affected by the modified function are shown): avg_h264_qpel_8_mc11_8_c: 352.2 ( 1.00x) avg_h264_qpel_8_mc11_8_sse2: 70.4 ( 5.00x) avg_h264_qpel_8_mc11_8_ssse3: 53.9 ( 6.53x) avg_h264_qpel_8_mc13_8_c: 353.3 ( 1.00x) avg_h264_qpel_8_mc13_8_sse2: 72.8 ( 4.86x) avg_h264_qpel_8_mc13_8_ssse3: 53.8 ( 6.57x) avg_h264_qpel_8_mc21_8_c: 404.0 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 116.1 ( 3.48x) avg_h264_qpel_8_mc21_8_ssse3: 94.3 ( 4.28x) avg_h264_qpel_8_mc23_8_c: 398.9 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 118.6 ( 3.36x) avg_h264_qpel_8_mc23_8_ssse3: 94.8 ( 4.21x) avg_h264_qpel_8_mc31_8_c: 352.7 ( 1.00x) avg_h264_qpel_8_mc31_8_sse2: 71.4 ( 4.94x) avg_h264_qpel_8_mc31_8_ssse3: 53.8 ( 6.56x) avg_h264_qpel_8_mc33_8_c: 354.0 ( 1.00x) avg_h264_qpel_8_mc33_8_sse2: 70.6 ( 5.01x) avg_h264_qpel_8_mc33_8_ssse3: 53.7 ( 6.59x) avg_h264_qpel_16_mc11_8_c: 1417.0 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 276.9 ( 5.12x) avg_h264_qpel_16_mc11_8_ssse3: 178.8 ( 7.92x) avg_h264_qpel_16_mc13_8_c: 1427.3 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 277.4 ( 5.14x) avg_h264_qpel_16_mc13_8_ssse3: 179.7 ( 7.94x) avg_h264_qpel_16_mc21_8_c: 1634.1 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 421.3 ( 3.88x) avg_h264_qpel_16_mc21_8_ssse3: 291.2 ( 5.61x) avg_h264_qpel_16_mc23_8_c: 1627.0 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 420.8 ( 3.87x) avg_h264_qpel_16_mc23_8_ssse3: 291.0 ( 5.59x) avg_h264_qpel_16_mc31_8_c: 1418.4 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 278.5 ( 5.09x) avg_h264_qpel_16_mc31_8_ssse3: 178.6 ( 7.94x) avg_h264_qpel_16_mc33_8_c: 1407.3 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 277.6 ( 5.07x) avg_h264_qpel_16_mc33_8_ssse3: 179.9 ( 7.82x) put_h264_qpel_8_mc11_8_c: 348.1 ( 1.00x) put_h264_qpel_8_mc11_8_sse2: 69.1 ( 5.04x) 
put_h264_qpel_8_mc11_8_ssse3: 53.8 ( 6.47x) put_h264_qpel_8_mc13_8_c: 349.3 ( 1.00x) put_h264_qpel_8_mc13_8_sse2: 69.7 ( 5.01x) put_h264_qpel_8_mc13_8_ssse3: 53.7 ( 6.51x) put_h264_qpel_8_mc21_8_c: 398.5 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 115.0 ( 3.46x) put_h264_qpel_8_mc21_8_ssse3: 95.3 ( 4.18x) put_h264_qpel_8_mc23_8_c: 399.9 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 120.8 ( 3.31x) put_h264_qpel_8_mc23_8_ssse3: 95.4 ( 4.19x) put_h264_qpel_8_mc31_8_c: 350.4 ( 1.00x) put_h264_qpel_8_mc31_8_sse2: 69.6 ( 5.03x) put_h264_qpel_8_mc31_8_ssse3: 54.2 ( 6.47x) put_h264_qpel_8_mc33_8_c: 353.1 ( 1.00x) put_h264_qpel_8_mc33_8_sse2: 71.0 ( 4.97x) put_h264_qpel_8_mc33_8_ssse3: 54.2 ( 6.51x) put_h264_qpel_16_mc11_8_c: 1384.2 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 272.9 ( 5.07x) put_h264_qpel_16_mc11_8_ssse3: 178.3 ( 7.76x) put_h264_qpel_16_mc13_8_c: 1393.6 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 271.1 ( 5.14x) put_h264_qpel_16_mc13_8_ssse3: 178.3 ( 7.82x) put_h264_qpel_16_mc21_8_c: 1612.6 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 416.5 ( 3.87x) put_h264_qpel_16_mc21_8_ssse3: 289.1 ( 5.58x) put_h264_qpel_16_mc23_8_c: 1621.3 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 416.9 ( 3.89x) put_h264_qpel_16_mc23_8_ssse3: 289.4 ( 5.60x) put_h264_qpel_16_mc31_8_c: 1408.4 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 273.5 ( 5.15x) put_h264_qpel_16_mc31_8_ssse3: 176.9 ( 7.96x) put_h264_qpel_16_mc33_8_c: 1396.4 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 276.3 ( 5.05x) put_h264_qpel_16_mc33_8_ssse3: 176.4 ( 7.92x) New benchmarks: avg_h264_qpel_8_mc11_8_c: 352.1 ( 1.00x) avg_h264_qpel_8_mc11_8_sse2: 52.5 ( 6.71x) avg_h264_qpel_8_mc11_8_ssse3: 53.9 ( 6.54x) avg_h264_qpel_8_mc13_8_c: 350.8 ( 1.00x) avg_h264_qpel_8_mc13_8_sse2: 54.7 ( 6.42x) avg_h264_qpel_8_mc13_8_ssse3: 54.3 ( 6.46x) avg_h264_qpel_8_mc21_8_c: 400.1 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 98.6 ( 4.06x) avg_h264_qpel_8_mc21_8_ssse3: 95.5 ( 4.19x) avg_h264_qpel_8_mc23_8_c: 400.4 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 101.4 ( 3.95x) 
avg_h264_qpel_8_mc23_8_ssse3: 95.9 ( 4.18x) avg_h264_qpel_8_mc31_8_c: 352.4 ( 1.00x) avg_h264_qpel_8_mc31_8_sse2: 52.9 ( 6.67x) avg_h264_qpel_8_mc31_8_ssse3: 54.4 ( 6.48x) avg_h264_qpel_8_mc33_8_c: 354.5 ( 1.00x) avg_h264_qpel_8_mc33_8_sse2: 52.9 ( 6.70x) avg_h264_qpel_8_mc33_8_ssse3: 54.4 ( 6.52x) avg_h264_qpel_16_mc11_8_c: 1420.4 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 204.8 ( 6.93x) avg_h264_qpel_16_mc11_8_ssse3: 177.9 ( 7.98x) avg_h264_qpel_16_mc13_8_c: 1409.8 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 206.4 ( 6.83x) avg_h264_qpel_16_mc13_8_ssse3: 178.0 ( 7.92x) avg_h264_qpel_16_mc21_8_c: 1634.1 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 349.6 ( 4.67x) avg_h264_qpel_16_mc21_8_ssse3: 290.0 ( 5.63x) avg_h264_qpel_16_mc23_8_c: 1624.1 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 350.0 ( 4.64x) avg_h264_qpel_16_mc23_8_ssse3: 291.9 ( 5.56x) avg_h264_qpel_16_mc31_8_c: 1407.2 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 205.8 ( 6.84x) avg_h264_qpel_16_mc31_8_ssse3: 178.2 ( 7.90x) avg_h264_qpel_16_mc33_8_c: 1400.5 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 206.3 ( 6.79x) avg_h264_qpel_16_mc33_8_ssse3: 179.4 ( 7.81x) put_h264_qpel_8_mc11_8_c: 349.7 ( 1.00x) put_h264_qpel_8_mc11_8_sse2: 50.2 ( 6.96x) put_h264_qpel_8_mc11_8_ssse3: 51.3 ( 6.82x) put_h264_qpel_8_mc13_8_c: 349.8 ( 1.00x) put_h264_qpel_8_mc13_8_sse2: 50.7 ( 6.90x) put_h264_qpel_8_mc13_8_ssse3: 51.7 ( 6.76x) put_h264_qpel_8_mc21_8_c: 398.0 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 96.5 ( 4.13x) put_h264_qpel_8_mc21_8_ssse3: 92.3 ( 4.31x) put_h264_qpel_8_mc23_8_c: 401.4 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 102.3 ( 3.92x) put_h264_qpel_8_mc23_8_ssse3: 92.8 ( 4.32x) put_h264_qpel_8_mc31_8_c: 349.4 ( 1.00x) put_h264_qpel_8_mc31_8_sse2: 50.8 ( 6.88x) put_h264_qpel_8_mc31_8_ssse3: 51.8 ( 6.75x) put_h264_qpel_8_mc33_8_c: 351.1 ( 1.00x) put_h264_qpel_8_mc33_8_sse2: 52.2 ( 6.73x) put_h264_qpel_8_mc33_8_ssse3: 51.7 ( 6.79x) put_h264_qpel_16_mc11_8_c: 1391.1 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 196.6 ( 7.07x) put_h264_qpel_16_mc11_8_ssse3: 178.2 ( 
7.81x) put_h264_qpel_16_mc13_8_c: 1385.2 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 195.6 ( 7.08x) put_h264_qpel_16_mc13_8_ssse3: 176.6 ( 7.84x) put_h264_qpel_16_mc21_8_c: 1607.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 341.0 ( 4.71x) put_h264_qpel_16_mc21_8_ssse3: 289.1 ( 5.56x) put_h264_qpel_16_mc23_8_c: 1616.7 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 340.8 ( 4.74x) put_h264_qpel_16_mc23_8_ssse3: 288.6 ( 5.60x) put_h264_qpel_16_mc31_8_c: 1397.6 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 197.3 ( 7.08x) put_h264_qpel_16_mc31_8_ssse3: 175.4 ( 7.97x) put_h264_qpel_16_mc33_8_c: 1394.3 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 197.7 ( 7.05x) put_h264_qpel_16_mc33_8_ssse3: 175.2 ( 7.96x) As can be seen, the SSE2 version is often neck-to-neck with the SSSE3 version (which also benefits from a better hv2_lowpass SSSE3 implementation for mc21 and mc23) for eight byte block sizes. Unsurprisingly, SSSE3 beats SSE2 for 16x16 blocks: For SSE2, these blocks are processed by calling the 8x8 function four times whereas SSSE3 has a dedicated function (on x64). This implementation should also be extendable to an AVX version for 16x16 blocks. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	fa9ea5113b	avcodec/x86/h264_qpel_8bit: Optimize branch away ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3() currently is almost the disjoint union of the codepaths for sizes 8 and 16. This size is a compile-time constant at every callsite. So split the function and avoid the runtime branch. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
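The split described in `fa9ea5113b` can be sketched in plain C. This is only an illustration under stated assumptions: the real change is in x86 assembly, and the names `sum_block_8or16`, `sum_block_8`, `sum_block_16` and `DEFINE_SUM_BLOCK` are hypothetical and do not exist in FFmpeg. The sketch shows a function that branches on a size every caller already knows, next to per-size functions generated from one macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: a single entry point that branches on the block size at run time. */
static int sum_block_8or16(const uint8_t *src, ptrdiff_t stride, int size)
{
    int sum = 0;
    if (size == 8) {
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++)
                sum += src[y * stride + x];
    } else {
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                sum += src[y * stride + x];
    }
    return sum;
}

/* After: one function per size; SIZE is a constant inside each body,
 * so no runtime check on the size remains. */
#define DEFINE_SUM_BLOCK(SIZE)                                         \
static int sum_block_##SIZE(const uint8_t *src, ptrdiff_t stride)      \
{                                                                      \
    int sum = 0;                                                       \
    for (int y = 0; y < SIZE; y++)                                     \
        for (int x = 0; x < SIZE; x++)                                 \
            sum += src[y * stride + x];                                \
    return sum;                                                        \
}

DEFINE_SUM_BLOCK(8)
DEFINE_SUM_BLOCK(16)
```

A caller that knows its block size calls `sum_block_8()` or `sum_block_16()` directly instead of passing the size as an argument.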
Andreas Rheinhardt	400203c00c	avcodec/x86/h264_qpel: Remove unused parameter from hv2_lowpass funcs tmpstride is unused. This also allows removing said parameter from many functions in h264_qpel.c. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	b84c818c83	avcodec/x86/h264_qpel: Remove constant parameters from shift5 funcs They are constant since the size 16 version is no longer emulated via the size 8 version. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	810bd3e62a	avcodec/x86/h264_qpel: Add ff_{avg,put}_pixels16_l2_shift5_sse2 Up until now this function was emulated via two calls to ff_{avg,put}_pixels8_l2_shift5_mmxext(). Adding a dedicated function proved beneficial both size-wise and performance-wise: The new functions take 192B, yet the simplified calls save 256B with GCC and 320B with Clang here. This change will also allow further optimizations. Old benchmarks: avg_h264_qpel_16_mc12_8_c: 1735.8 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 300.8 ( 5.77x) avg_h264_qpel_16_mc12_8_ssse3: 233.3 ( 7.44x) avg_h264_qpel_16_mc32_8_c: 1777.9 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 275.6 ( 6.45x) avg_h264_qpel_16_mc32_8_ssse3: 235.7 ( 7.54x) put_h264_qpel_16_mc12_8_c: 1808.2 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 267.2 ( 6.77x) put_h264_qpel_16_mc12_8_ssse3: 231.9 ( 7.80x) put_h264_qpel_16_mc32_8_c: 1766.9 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 272.9 ( 6.47x) put_h264_qpel_16_mc32_8_ssse3: 229.5 ( 7.70x) New benchmarks: avg_h264_qpel_16_mc12_8_c: 1742.3 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 240.3 ( 7.25x) avg_h264_qpel_16_mc12_8_ssse3: 214.8 ( 8.11x) avg_h264_qpel_16_mc32_8_c: 1748.0 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 238.0 ( 7.35x) avg_h264_qpel_16_mc32_8_ssse3: 209.2 ( 8.35x) put_h264_qpel_16_mc12_8_c: 2014.4 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 243.7 ( 8.27x) put_h264_qpel_16_mc12_8_ssse3: 211.5 ( 9.52x) put_h264_qpel_16_mc32_8_c: 1800.0 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 238.8 ( 7.54x) put_h264_qpel_16_mc32_8_ssse3: 206.7 ( 8.71x) Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	279b6f3cf5	avcodec/fpel: Avoid loop in ff_avg_pixels4_mmxext() It is only used by h264_qpel.c and only with height four (which is unrolled) and uses a loop in order to handle multiples of four as height. Remove the loop and the height parameter and move the function to h264_qpel_8bit.asm. This leads to a bit of code duplication, but this is simpler than all the %if checks necessary to achieve the same outcome in fpel.asm. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	e340f31b89	avcodec/x86/fpel: Remove redundant repetition The repetition count is always one since `2cf9e733c6`. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	b0c91c2fba	avcodec/h264qpel: Make avg_h264_qpel_pixels_tab smaller avg_h264_qpel only supports 16x16, 8x8 and 4x4 blocksizes, so it is currently unnecessarily large. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	6eb8bc4217	avcodec/h264qpel: Don't build unused 2x2 size funcs for bitdepths > 8 The 2x2 put functions are only used by Snow and Snow uses only the eight bit versions. The rest is dead code. Disabling it saved 41277B here. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	92ae9d1ffc	configure: Remove vc1dsp->qpeldsp dependency It only needs it for some x86 fpel functions; instead add a direct dependency for that. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	16d5e074dc	avcodec/mips/Makefile: Fix VC1DSP build rules Affected standalone builds of the VC-1 parser. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	0035d99c61	configure: Avoid mpeg4video_parser->{h263,qpel}dsp dependency This can be easily achieved by moving code only used by the MPEG-4 decoder behind #if CONFIG_MPEG4_DECODER. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	c4c616db53	avcodec/x86/qpel: Move ff_{put,avg}_pixels4_l2_mmxext to h264_qpel Only used there. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	1e11fdff52	avcodec/x86/qpel{,dsp_init}: Remove constant function parameters ff_avg_pixels{4,8,16}_l2_mmxext() are always called with height equal to their blocksize. And ff_{put,avg}_pixels4_l2_mmxext() are furthermore always called with both strides being equal. So remove these redundant function parameters. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	52a77128fd	avcodec/x86/qpel{dsp,dsp_init}: Use ptrdiff_t for stride This is more correct given that qpel_mc_func already uses ptrdiff_t; it also allows avoiding movsxdifnidn. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	cacf854fe7	avcodec/x86/qpel: Remove always-false branches The ff_avg_pixels{4,8,16}_l2_mmxext() functions are only ever used in the last step (the one that actually writes to the dst buffer) where the number of lines to process is always equal to the dimensions of the block, whereas ff_put_pixels{8,16}_mmxext() are also used in intermediate calculations where the number of lines can be 9 or 17. The code in qpel.asm uses common macros for both and processes more than one line per loop iteration; it therefore checks for whether the number of lines is odd and treats this line separately; yet this special handling is only needed for the put functions, not the avg functions. It has therefore been %if'ed away for these. The check is also not needed for ff_put_pixels4_l2_mmxext() which is only used by H.264 which always processes four lines. Because ff_{avg,put}_pixels4_l2_mmxext() processes four lines in a single loop iteration, not only the odd-height handling, but the whole loop could be removed. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:32 +02:00
Andreas Rheinhardt	15a9c8dea3	avcodec/liblc3enc: Avoid allocating buffer to send a zero frame liblc3 supports arbitrary strides, so one can simply use a stride of zero to make it read the same zero value again and again. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 06:07:37 +02:00
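The zero-stride idea from `15a9c8dea3` can be illustrated with a generic strided reader. This is a sketch under assumptions: `sum_strided` is a hypothetical stand-in for any consumer that steps through its input by a caller-supplied stride, and it is not the liblc3 API. It shows that a stride of zero makes such a consumer read one stored value repeatedly, so a single zero sample can stand in for a whole zero-filled buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical consumer: sums n samples, advancing by `stride` elements
 * between reads. With stride == 0 every read hits pcm[0]. */
static int64_t sum_strided(const int16_t *pcm, int stride, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += pcm[(ptrdiff_t)i * stride];
    return sum;
}
```

Passing the address of a single `int16_t` zero together with a stride of 0 then behaves like passing an all-zero buffer of any length, without allocating one.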
Andreas Rheinhardt	ab7d1c64c9	avcodec/x86/h263_loopfilter: Port loop filter to SSE2 Old benchmarks: h263dsp.h_loop_filter_c: 41.2 ( 1.00x) h263dsp.h_loop_filter_mmx: 39.5 ( 1.04x) h263dsp.v_loop_filter_c: 43.5 ( 1.00x) h263dsp.v_loop_filter_mmx: 16.9 ( 2.57x) New benchmarks: h263dsp.h_loop_filter_c: 41.6 ( 1.00x) h263dsp.h_loop_filter_sse2: 28.2 ( 1.48x) h263dsp.v_loop_filter_c: 42.4 ( 1.00x) h263dsp.v_loop_filter_sse2: 15.1 ( 2.81x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-03 17:05:46 +00:00
Cameron Gutman	df4587789f	avcodec/amfenc: avoid unnecessary output delay in low delay mode The code optimizes throughput by letting the encoder work on frame N until frame N+1 is ready for submission, but this hurts low-delay uses by delaying output by one frame. Don't delay output beyond what is necessary when AV_CODEC_FLAG_LOW_DELAY is used. Signed-off-by: Cameron Gutman <aicommander@gmail.com>	2025-10-03 11:05:03 +00:00
Michael Niedermayer	61b6877637	avcodec/mjpegdec: Explain buf_size/width/height check Suggested-by: Ramiro Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2025-10-02 12:52:43 +00:00
James Almer	5511641365	avcodec/atrac9dec: use av_zero_extend() Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-01 01:26:19 +00:00
James Almer	7ce3a14496	avcodec/apv_entropy: use av_zero_extend() Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-01 01:26:19 +00:00
James Almer	776ee07990	avcodec/aom_film_grain: use av_zero_extend() Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-01 01:26:19 +00:00
Koushik Dutta via ffmpeg-devel	fd136a4d82	ffv1enc_vulkan: fix empty struct build error on msvc Signed-off-by: Koushik Dutta <koushd@gmail.com>	2025-09-30 19:36:56 +09:00
James Almer	d975dbd7b7	avcodec/libdav1d: bump minimum supported version to 1.0.0 This allows us to remove old deprecated options. Signed-off-by: James Almer <jamrial@gmail.com>	2025-09-28 23:53:27 -03:00
Andreas Rheinhardt	635cb4543f	avcodec/bsf/ahx_to_mp2: Don't output uninitialized data Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-27 17:57:15 +02:00
Andreas Rheinhardt	0f1f345c37	avcodec/x86/qpeldsp_init: Fix compilation without external assembly Broken in `2cf9e733c6`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 18:30:53 +02:00
Kacper Michajłow	d6cb0d2c2b	ALL: move av_unused to conform with standard requirement This is the placement required by the standard [[maybe_unused]] attribute; it works the same for __attribute__((unused)). Signed-off-by: Kacper Michajłow <kasper93@gmail.com>	2025-09-26 16:15:46 +00:00
Andreas Rheinhardt	a54d6b1d91	avcodec/x86/rnd_template: Merge into hpeldsp_init.c It is now only included exactly once. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:58 +02:00
Andreas Rheinhardt	43fe9554cc	avcodec/x86/hpeldsp_init: Avoid complicating macro Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:55 +02:00
Andreas Rheinhardt	00e046df13	avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSE2 This affects the {avg,put}_no_rnd_pixels16_{x,y}2 MMX and (put-only) MMXEXT versions. Removing these functions saved 1184B here. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:53 +02:00
Andreas Rheinhardt	30c4007c65	avcodec/x86/hpeldsp: Add SSE2 avg_no_rnd size 16 versions These currently only exist as MMX versions. The added functions occupy 320B here. So far, they are only for the x2 and y2 (i.e. right and down, not down-right) directions. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:49 +02:00
Andreas Rheinhardt	1e677e6964	avcodec/x86/hpeldsp: Add SSE2 put_no_rnd size 16 versions These currently only exist as MMX and (not bitexact) MMXEXT versions. The added functions occupy 288B here. So far, they are only for the x2 and y2 (i.e. right and down, not down-right) directions. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:46 +02:00
Andreas Rheinhardt	262791b8d8	avcodec/hpeldsp: Make put_no_rnd_pixels_tab smaller Only the blocksizes 16 and 8 are implemented, yet the motion estimation code touches the blocksize 4 entries. But really nothing touches the blocksize 2 entries, so that we can reduce the put_no_rnd_pixels_tab array size to [3][4]. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:43 +02:00
Andreas Rheinhardt	c7161befb4	avcodec/x86/h264_qpel: Remove MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:39 +02:00
Andreas Rheinhardt	5ef613bcb0	avcodec/x86/mpegvideoencdsp_init: Remove MMX, 3DNOW funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX and 3DNOW functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Also merge the mpegvideoenc_qns_template.c file into the main file. The 3DNOW functions removed in this commit were the last in the codebase. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:31 +02:00
Andreas Rheinhardt	6a47ea5f9f	avcodec/x86/vvc/sao_10bit: Remove unused functions Saves 65280B here. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:26 +02:00
Andreas Rheinhardt	918d37d9d1	avcodec/x86/rv40dsp_init: Remove MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:23 +02:00
Andreas Rheinhardt	e86f137514	avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:19 +02:00
Andreas Rheinhardt	2cf9e733c6	avcodec/x86/qpeldsp_init: Use SSE2 versions where possible The mc00 versions (i.e. the qdsp functions with no subpixel interpolation) are just wrappers around their fpel versions. There are SSE2 versions of these, yet the qpel code only uses the MMX(EXT) versions. This commit changes this and also removes the MMX(EXT) versions. This also allowed removing ff_avg_pixels16_mmxext and ff_put_pixels16_mmx. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:16 +02:00
Andreas Rheinhardt	1f9ef6a8dc	avcodec/x86/h264_qpel: Remove MMX(EXT) functions overridden by SSE2FAST CPUs which support SSE2, but not in a fast way (so that they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient nowadays (2007 and older), so ignore the distinction between the two and remove MMX and MMXEXT functions that are now overridden by SSE2 functions. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:12 +02:00
Andreas Rheinhardt	8a7858dacf	avcodec/x86/hpeldsp_init: Remove MMX(EXT) functions overridden by SSE2FAST CPUs which support SSE2, but not in a fast way (so that they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient nowadays (2007 and older), so ignore the distinction between the two and remove MMX and MMXEXT functions that are now overridden by SSE2 functions. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:08 +02:00
Andreas Rheinhardt	4d691da5ed	avcodec/x86/hpeldsp_init: Remove MMX functions overridden by MMXEXT Forgotten in `a51279bbde` because I only looked for MMX(EXT) functions overridden by SSE2. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:21:04 +02:00
Andreas Rheinhardt	fcb9e0b5f0	avcodec/hpel{dsp,_template}: Use ptrdiff_t for strides Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:20:56 +02:00
Andreas Rheinhardt	89f2016ece	avcodec/hpel_template: Fix unintentional usage of unsigned offsets The value of sizeof() is of type size_t, which means that an expression like src1[i * src_stride1 + 4 * sizeof(pixel)] will use a very large offset if src_stride1 is sufficiently negative. It works in practice (because it is correct modulo SIZE_MAX + 1), but UBSan treats it as an error: libavcodec/hpel_template.c:104:1: runtime error: addition of unsigned offset to 0x7ffdfa0391d8 overflowed to 0x7ffdfa0391cc Fix this by casting sizeof(pixel) to int. (This has been uncovered by a checkasm test for the hpeldsp which will be added in a later commit.) Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:20:52 +02:00
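The signedness issue in `89f2016ece` can be reproduced with plain integer arithmetic. This is a minimal sketch: the helper names `index_unsigned` and `index_signed` are hypothetical, and the values `i = 1`, `stride = -16` are chosen only so that the signed result is -12, the same 12-byte step backwards visible in the UBSan addresses quoted above. The first helper mirrors the index expression without the `(int)` cast, the second the expression with it.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t pixel;

/* Without the cast: sizeof() yields size_t, so the signed product
 * i * stride is converted to size_t before the addition. A negative
 * product therefore becomes a huge unsigned offset. */
static size_t index_unsigned(int i, int stride)
{
    return i * stride + 4 * sizeof(pixel);
}

/* With the cast: the whole expression stays in signed int arithmetic,
 * so a negative stride yields a small negative offset. */
static ptrdiff_t index_signed(int i, int stride)
{
    return i * stride + 4 * (int)sizeof(pixel);
}
```

For `i = 1` and `stride = -16`, `index_signed()` returns -12, while `index_unsigned()` returns the same bit pattern interpreted as a `size_t`. Adding that to a pointer wraps around to the intended address in practice, which is the overflow UBSan reports.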
Andreas Rheinhardt	b316a1bdd1	avcodec/hpeldsp: Fix documentation This commit fixes two issues in the documentation: a) The documentation for {put,avg}_pixels_tab only mentions widths 16 and 8, although it explicitly mentions that there are four horizontal blocksizes. This part of the patch basically reverts `e5771f4f37`. b) The restrictions on height don't match reality. While most users abide by them, some do not: i) vp56.c copies a 16x12 block. ii) indeo3 can copy an arbitrary multiple of four lines for block widths 4, 8 and 16. iii) SVQ3 can use luma block sizes 16x16, 8x16, 16x8, 8x8, 4x8, 8x4 and 4x4 and the corresponding 8x8, 4x8, 8x4, 4x4, 2x4, 4x2 and 2x2 chroma block sizes. This implies that for widths 2 and 4 height can be two and is guaranteed to be at least even. For all other widths, height can be a multiple of four. Furthermore, a comment for the SVQ3 blocksizes has been added. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-09-26 06:20:30 +02:00

52830 commits