This commit deduplicates the wrappers around the fpel functions
for copying whole blocks (i.e. height equaling width). It does
this in a manner which avoids having to push/pop function arguments
when the calling convention forces one to pass them on the stack
(as in 32bit systems).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is not based on the MMXEXT one, because the latter is quite
suboptimal: Motion vector types mc01 and mc03 (vertical motion vectors
with a remainder of one quarter or three quarters) use different neighboring
lines for interpolation: mc01 uses two lines above and two lines below,
mc03 one line above and three lines below. The MMXEXT code uses
a common macro for all of them and therefore reads six lines
before it processes them (even reading lines which are not used
at all), leading to severe register pressure.
Another difference from the old code is that the positive and negative
parts of the sum are accumulated separately and
the subtraction is performed with unsigned saturation, so
that one can avoid biasing the sum.
The fact that the mc01 and mc03 filter coefficients are mirrors
of each other has been exploited to reduce mc01 to mc03.
But of course the most important difference between
this code and the MMXEXT one is that XMM registers allow processing
eight words at a time, ideal for 8x8 subblocks,
whereas the MMXEXT code processes them in 4x8 or 4x16 blocks.
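For illustration, a rough scalar C model of this accumulation scheme
(the actual code does the same on eight 16-bit words per XMM register,
using psubusw for the saturating subtraction; the helper name is made
up and the tap set (-1, -2, 96, 42, -7)/128 is the one quoted for this
filter further down in this series):

    /* s0..s4 are the five neighboring source samples (0..255). */
    static inline uint8_t cavs_v_filter_pixel(int s0, int s1, int s2,
                                              int s3, int s4)
    {
        unsigned pos = 96 * s2 + 42 * s3 + 64;    /* positive taps + rounding */
        unsigned neg = s0 + 2 * s1 + 7 * s4;      /* negative taps, kept positive */
        unsigned sum = pos > neg ? pos - neg : 0; /* unsigned saturating subtract */
        sum >>= 7;                                /* fits 16 bit, no bias needed */
        return sum > 255 ? 255 : sum;             /* final pack/clip to 8 bit */
    }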
Benchmarks:
avg_cavs_qpel_pixels_tab[0][4]_c: 917.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][4]_mmxext: 222.0 ( 4.13x)
avg_cavs_qpel_pixels_tab[0][4]_sse2: 89.0 (10.31x)
avg_cavs_qpel_pixels_tab[0][12]_c: 885.7 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][12]_mmxext: 223.2 ( 3.97x)
avg_cavs_qpel_pixels_tab[0][12]_sse2: 88.5 (10.01x)
avg_cavs_qpel_pixels_tab[1][4]_c: 222.4 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][4]_mmxext: 57.2 ( 3.89x)
avg_cavs_qpel_pixels_tab[1][4]_sse2: 23.3 ( 9.55x)
avg_cavs_qpel_pixels_tab[1][12]_c: 216.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][12]_mmxext: 57.4 ( 3.76x)
avg_cavs_qpel_pixels_tab[1][12]_sse2: 22.6 ( 9.56x)
put_cavs_qpel_pixels_tab[0][4]_c: 750.9 ( 1.00x)
put_cavs_qpel_pixels_tab[0][4]_mmxext: 210.4 ( 3.57x)
put_cavs_qpel_pixels_tab[0][4]_sse2: 84.2 ( 8.92x)
put_cavs_qpel_pixels_tab[0][12]_c: 731.6 ( 1.00x)
put_cavs_qpel_pixels_tab[0][12]_mmxext: 210.7 ( 3.47x)
put_cavs_qpel_pixels_tab[0][12]_sse2: 84.1 ( 8.70x)
put_cavs_qpel_pixels_tab[1][4]_c: 191.7 ( 1.00x)
put_cavs_qpel_pixels_tab[1][4]_mmxext: 53.8 ( 3.56x)
put_cavs_qpel_pixels_tab[1][4]_sse2: 24.5 ( 7.83x)
put_cavs_qpel_pixels_tab[1][12]_c: 179.1 ( 1.00x)
put_cavs_qpel_pixels_tab[1][12]_mmxext: 53.9 ( 3.32x)
put_cavs_qpel_pixels_tab[1][12]_sse2: 24.0 ( 7.47x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Basically a direct port of the MMXEXT one. The main difference
is of course that one can process eight pixels (unpacked to words)
at a time, leading to speedups.
avg_cavs_qpel_pixels_tab[0][2]_c: 700.1 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][2]_mmxext: 158.1 ( 4.43x)
avg_cavs_qpel_pixels_tab[0][2]_sse2: 86.0 ( 8.14x)
avg_cavs_qpel_pixels_tab[1][2]_c: 171.9 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][2]_mmxext: 39.4 ( 4.36x)
avg_cavs_qpel_pixels_tab[1][2]_sse2: 21.7 ( 7.92x)
put_cavs_qpel_pixels_tab[0][2]_c: 525.7 ( 1.00x)
put_cavs_qpel_pixels_tab[0][2]_mmxext: 148.5 ( 3.54x)
put_cavs_qpel_pixels_tab[0][2]_sse2: 75.2 ( 6.99x)
put_cavs_qpel_pixels_tab[1][2]_c: 129.5 ( 1.00x)
put_cavs_qpel_pixels_tab[1][2]_mmxext: 36.7 ( 3.53x)
put_cavs_qpel_pixels_tab[1][2]_sse2: 19.0 ( 6.81x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The prediction involves terms of the form
(-1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64) >> 7,
where the s values are in the range of 0..255.
The sum can have values in the range -2550..35190, which
does not fit into a signed 16bit integer. The code uses
an arithmetic right shift, which does not yield the correct
result for values >= 2^15; such values should be clipped
to 255, yet are clipped to 0 instead.
Fix this by biasing the values by 4096 so that the range
becomes nonnegative, then use a logical right shift and subtract 32.
bunny.mp4 from the FATE suite can be used to reproduce the problem.
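A scalar sketch of the fixed computation (the real code operates on
vectors of words; the helper name is illustrative only):

    /* s0..s4 are 0..255.  Biasing by 4096 keeps the sum nonnegative and
     * below 2^16, so a logical right shift is safe; the bias contributes
     * 4096 >> 7 == 32, which is subtracted afterwards. */
    static inline uint8_t predict_pixel(int s0, int s1, int s2, int s3, int s4)
    {
        int sum = -1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64 + 4096;
        int val = (sum >> 7) - 32;   /* sum >= 0 here, so the shift is logical */
        return val < 0 ? 0 : val > 255 ? 255 : val;  /* clip to 0..255 */
    }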
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The implementation hardcodes access to 3 channels, so we need to check that
at least 3 channels are actually present.
Fixes: out of array access
Fixes: BIGSLEEP-445394503-crash.exr
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Without rounding them up there are too few dc coeffs for the blocks.
We do not know if this way of handling odd dimensions is correct, as we have
no such DWA sample; we therefore ask the user for a sample if they
encounter such a file.
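A sketch of the rounding presumably meant here (assuming the 8x8 DCT
blocks that DWA uses; the variable names are illustrative, not the
decoder's actual ones):

    /* One dc coefficient per 8x8 block; a partial block at a dimension
     * that is not a multiple of 8 still needs a full block's worth. */
    int blocks_w = (width  + 7) / 8;
    int blocks_h = (height + 7) / 8;
    int dc_count = blocks_w * blocks_h;   /* per channel */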
Fixes: out of array access
Fixes: BIGSLEEP-445392027-crash.exr
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: out of array read
Fixes: dwa_uncompress.py.crash.exr
The code reads from the ac data even if ac_size is 0, so that case
is not implemented; instead we ask for a sample and error out cleanly.
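Schematically, the kind of early-out added (ac_size and the context
pointer are placeholders for the decoder's actual variables;
avpriv_request_sample() and AVERROR_PATCHWELCOME are the usual idioms
for this):

    if (!ac_size) {
        /* The unpacking below reads from the ac data unconditionally,
         * so reject this case cleanly instead of reading out of array. */
        avpriv_request_sample(s->avctx, "DWA data with ac_size == 0");
        return AVERROR_PATCHWELCOME;
    }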
Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
1. Remove the OP parameter from the QPEL_H264* macros. These are
remnants of inline assembly and were forgotten in
610e00b359.
2. Pass the instruction set extension for the shift5 function
explicitly in the macro instead of using magic #defines.
3. Likewise, avoid magic #defines for (8|16)_v_lowpass_ssse3.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Every caller calls it three times in a loop, with slightly
modified arguments. So it makes sense to move the loop
into the callee.
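Schematically (foo() and its arguments are hypothetical; only the shape
of the change matters):

    /* Before: every caller repeats the loop. */
    for (int i = 0; i < 3; i++)
        foo(dst + i * step, src + i * step, stride);

    /* After: one call; the callee iterates internally. */
    foo(dst, src, stride);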
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Blocksize 2 is Snow-only, so move all the code pertaining
to it to snow.c. Also make the put array in H264QpelContext
smaller -- it only needs three sets of 16 function pointers.
This continues 6eb8bc4217
and b0c91c2fba.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
None of the other registers need to be preserved at this time,
so six XMM registers are always enough. Forgotten in
fa9ea5113b.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids having to sign-extend the strides in the assembly
(it is also more correct given that qpel_mc_func
already uses ptrdiff_t).
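For context, qpel_mc_func is declared with a ptrdiff_t stride (sketch
from memory), so the helpers can simply follow suit:

    typedef void (*qpel_mc_func)(uint8_t *dst, const uint8_t *src,
                                 ptrdiff_t stride);

    /* An 'int stride' argument arrives as a 32-bit value on x86-64 and
     * would need sign extension before it can be used for addressing;
     * a ptrdiff_t stride arrives full-width. */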
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The horizontal 10bit MC SSE2 functions are currently duplicated:
They exist both in ordinary form as well as with a "sse2_cache64"
suffix. A comment in ff_h264qpel_init_x86() indicates that this
is due to older processors not liking accesses that cross cache
lines, yet these functions are identical to the non-cache64
functions (apart from the unavoidable changes in the rip-offset).
The only difference between these functions and the ordinary ones
is that the cache64 ones are created via a special form of the
INIT_XMM macro: "INIT_XMM sse2, cache64". This affects the name
and apparently defines cpuflags_cache64, yet nothing checks for
this, so both versions are identical. So remove the cache64 ones
and treat the remaining ones like ordinary SSE2 functions.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3()
currently is almost the disjoint union of the codepaths
for sizes 8 and 16. This size is a compile-time constant
at every callsite. So split the function and avoid
the runtime branch.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
tmpstride is unused. This also allows removing said parameter
from lots of functions in h264_qpel.c.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They are constant since the size 16 version is no longer emulated
via the size 8 version.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is only used by h264_qpel.c, and only with a height of four
(which is unrolled), yet it uses a loop in order to handle
any multiple of four as height. Remove the loop and the height
parameter and move the function to h264_qpel_8bit.asm.
This leads to a bit of code duplication, but this is simpler
than all the %if checks necessary to achieve the same outcome
in fpel.asm.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The repetition count is always one since
2cf9e733c6.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
avg_h264_qpel only supports 16x16, 8x8 and 4x4 blocksizes,
so it is currently unnecessarily large.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The 2x2 put functions are only used by Snow and Snow uses
only the eight bit versions. The rest is dead code. Disabling
it saved 41277B here.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It only needs it for some x86 fpel functions; instead
add a direct dependency for that.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Affected standalone builds of the VC-1 parser.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This can be easily achieved by moving code only used by the MPEG-4
decoder behind #if CONFIG_MPEG4_DECODER.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_avg_pixels{4,8,16}_l2_mmxext() are always called with height
equal to their blocksize. And ff_{put,avg}_pixels4_l2_mmxext()
are furthermore always called with both strides being equal.
So remove these redundant function parameters.
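Schematically (prototypes simplified, not the exact upstream ones):

    /* Before: h is passed although every caller uses h == blocksize;
     * the 4x4 variants additionally always get two equal strides. */
    void ff_avg_pixels8_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int dstStride,
                                  int src1Stride, int h);
    /* After: the redundant parameters are gone. */
    void ff_avg_pixels8_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int dstStride,
                                  int src1Stride);
    void ff_put_pixels4_l2_mmxext(uint8_t *dst, const uint8_t *src1,
                                  const uint8_t *src2, int stride);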
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is more correct given that qpel_mc_func already uses ptrdiff_t;
it also allows avoiding movsxdifnidn.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The ff_avg_pixels{4,8,16}_l2_mmxext() functions are only ever
used in the last step (the one that actually writes to the dst buffer)
where the number of lines to process is always equal to the
dimensions of the block, whereas ff_put_pixels{8,16}_mmxext()
are also used in intermediate calculations where the number of
lines can be 9 or 17.
The code in qpel.asm uses common macros for both and processes
more than one line per loop iteration; it therefore checks
whether the number of lines is odd and treats such a line separately.
Yet this special handling is only needed for the put functions,
not the avg functions, so it has been %if'ed away for the latter.
The check is also not needed for ff_put_pixels4_l2_mmxext(), which
is only used by H.264, where the number of lines is always four. Because
ff_{avg,put}_pixels4_l2_mmxext() process four lines in a single loop
iteration, not only the odd-height handling but the whole loop
could be removed.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
liblc3 supports arbitrary strides, so one can simply use a stride
of zero to make it read the same zero value again and again.
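A generic illustration (not the liblc3 API itself; frame_samples is a
placeholder): with a stride of zero an indexed read never advances, so
a single zero sample stands in for a whole frame of silence:

    static const int16_t silence = 0;   /* one zero sample */
    const int16_t *pcm = &silence;
    const int stride = 0;

    for (int i = 0; i < frame_samples; i++) {
        int16_t s = pcm[i * stride];    /* always reads pcm[0] == 0 */
        /* ... sample would be handed to the encoder here ... */
    }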
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The code optimizes throughput by letting the encoder work on frame N
until frame N+1 is ready for submission, but this hurts low-delay uses
by delaying output by one frame. Don't delay output beyond what is
necessary when AV_CODEC_FLAG_LOW_DELAY is used.
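A sketch of the intended behaviour (AV_CODEC_FLAG_LOW_DELAY is the real
flag; drain_packet() and the in_flight counter are placeholders for the
encoder's actual state):

    /* Called after a frame has been submitted to the encoder. */
    if (avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
        /* Low delay: drain the just-submitted frame's output right away. */
        return drain_packet(ctx, pkt);
    }
    /* Throughput mode: keep one frame in flight and only drain once the
     * next frame has been queued. */
    if (ctx->in_flight > 1)
        return drain_packet(ctx, pkt);
    return AVERROR(EAGAIN);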
Signed-off-by: Cameron Gutman <aicommander@gmail.com>