ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-06-27 19:31:27 +00:00

Martin Storsjö 870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions

Move the loop counter decrement further from the branch instruction, this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases), do the decrement after the load (where the next instruction would stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register, also do the decrement between the two stores (as the second store would need to wait for the updated destination register from the first instruction).

In loops that store twice to two different destination registers, do the decrement before both stores, to do it as soon before the branch as possible.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement noise), but the horizontal prediction functions get a rather notable speedup on the Cortex A53.

Before:                         Cortex A53     A72     A73
pred8x8_dc_8_neon:                    60.7    46.2    39.2
pred8x8_dc_128_8_neon:                30.7    18.0    14.0
pred8x8_horizontal_8_neon:            42.2    29.2    18.5
pred8x8_left_dc_8_neon:               52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:        48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:        52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:        52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:        43.2    27.0    25.5
pred8x8_plane_8_neon:                112.2    86.2    88.2
pred8x8_top_dc_8_neon:                40.7    23.0    21.2
pred8x8_vertical_8_neon:              27.2    15.5    14.0
pred16x16_dc_8_neon:                  91.0    73.2    70.5
pred16x16_dc_128_8_neon:              43.0    34.7    30.7
pred16x16_horizontal_8_neon:          86.0    49.7    44.7
pred16x16_left_dc_8_neon:             87.0    67.2    67.5
pred16x16_plane_8_neon:              236.0   175.7   173.0
pred16x16_top_dc_8_neon:              53.2    39.0    41.7
pred16x16_vertical_8_neon:            41.7    29.7    31.0

After:
pred8x8_dc_8_neon:                    59.0    46.7    42.5
pred8x8_dc_128_8_neon:                28.2    18.0    14.0
pred8x8_horizontal_8_neon:            34.2    29.2    18.5
pred8x8_left_dc_8_neon:               51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:        46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:        55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:        51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:        41.7    27.5    26.0
pred8x8_plane_8_neon:                111.5    86.5    89.5
pred8x8_top_dc_8_neon:                39.0    23.2    21.0
pred8x8_vertical_8_neon:              27.2    16.0    14.0
pred16x16_dc_8_neon:                  85.0    70.2    70.5
pred16x16_dc_128_8_neon:              42.0    30.0    30.7
pred16x16_horizontal_8_neon:          66.5    49.5    42.5
pred16x16_left_dc_8_neon:             81.0    66.5    67.5
pred16x16_plane_8_neon:              235.0   175.7   173.0
pred16x16_top_dc_8_neon:              52.0    39.0    41.7
pred16x16_vertical_8_neon:            40.2    33.2    31.0

Despite this, a number of these functions still are slower than what e.g. GCC 7 generates - this shows the relative speedup of the neon codepaths over the compiler generated ones:

                                Cortex A53     A72     A73
pred8x8_dc_8_neon:                    0.86    0.65    1.04
pred8x8_dc_128_8_neon:                0.59    0.44    0.62
pred8x8_horizontal_8_neon:            1.51    0.58    1.30
pred8x8_left_dc_8_neon:               0.72    0.56    0.89
pred8x8_mad_cow_dc_0l0_8_neon:        0.93    0.93    1.37
pred8x8_mad_cow_dc_0lt_8_neon:        1.37    1.41    1.68
pred8x8_mad_cow_dc_l0t_8_neon:        1.21    1.17    1.32
pred8x8_mad_cow_dc_l00_8_neon:        1.24    1.19    1.60
pred8x8_plane_8_neon:                 3.36    3.58    3.76
pred8x8_top_dc_8_neon:                0.97    0.99    1.43
pred8x8_vertical_8_neon:              0.86    0.78    1.18
pred16x16_dc_8_neon:                  1.20    1.06    1.49
pred16x16_dc_128_8_neon:              0.83    0.95    0.99
pred16x16_horizontal_8_neon:          1.78    0.96    1.59
pred16x16_left_dc_8_neon:             1.06    0.96    1.32
pred16x16_plane_8_neon:               5.78    6.49    7.19
pred16x16_top_dc_8_neon:              1.48    1.53    1.94
pred16x16_vertical_8_neon:            1.39    1.34    1.98

In particular, on Cortex A72, many of these functions are slower than the compiler generated code, while they're more beneficial on e.g. the Cortex A73.

Signed-off-by: Martin Storsjö <martin@martin.st>		2021-04-14 15:23:44 +03:00
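
As a rough check on the "rather notable speedup" for the horizontal prediction functions, the Before/After cycle counts in the commit message can be converted into percentages. The following is a minimal sketch, not part of the commit: the helper names `percent_saved` and `summarize` are invented for illustration, and the numbers are copied from the two horizontal rows of the Before and After tables.

```python
# Cycle counts copied from the commit message, in the order Cortex A53, A72, A73.
CORES = ("A53", "A72", "A73")

BEFORE = {
    "pred8x8_horizontal_8_neon": (42.2, 29.2, 18.5),
    "pred16x16_horizontal_8_neon": (86.0, 49.7, 44.7),
}
AFTER = {
    "pred8x8_horizontal_8_neon": (34.2, 29.2, 18.5),
    "pred16x16_horizontal_8_neon": (66.5, 49.5, 42.5),
}


def percent_saved(before, after):
    """Return the cycles removed as a percentage of the 'before' count."""
    return (before - after) / before * 100.0


def summarize():
    """Map each function name to its per-core percentage of cycles saved."""
    return {
        name: tuple(percent_saved(b, a) for b, a in zip(cycles, AFTER[name]))
        for name, cycles in BEFORE.items()
    }


if __name__ == "__main__":
    for name, saved in summarize().items():
        cells = "  ".join(f"{core}: {pct:5.1f}%" for core, pct in zip(CORES, saved))
        print(f"{name:30s} {cells}")
```

With these inputs the Cortex A53 saves roughly a fifth of the cycles in both horizontal functions, while the A72 and A73 figures change little or not at all. That is consistent with the commit message singling out the A53.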
..
aacpsdsp_init_aarch64.c	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	2017-06-28 12:22:39 +02:00
aacpsdsp_neon.S	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	2017-06-28 12:22:39 +02:00
asm-offsets.h	aarch64/asm-offsets: remove old CELT offsets	2019-05-14 23:41:24 +01:00
cabac.h
fft_init_aarch64.c	Merge commit '`97aec6e75e`'	2016-04-12 15:43:09 +01:00
fft_neon.S
fmtconvert_init.c	Merge commit '`a0fc780a20`'	2016-01-02 11:21:16 +01:00
fmtconvert_neon.S	Merge commit '`a0fc780a20`'	2016-01-02 11:21:16 +01:00
h264chroma_init_aarch64.c	Merge commit '`e4a94d8b36`'	2017-03-21 15:20:45 -03:00
h264cmc_neon.S	Merge commit '`e4a94d8b36`'	2017-03-21 15:20:45 -03:00
h264dsp_init_aarch64.c	Merge commit '`186bd30aa3`'	2019-03-14 16:29:41 -03:00
h264dsp_neon.S	Merge commit '`186bd30aa3`'	2019-03-14 16:29:41 -03:00
h264idct_neon.S	libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S	2019-01-03 20:12:07 +01:00
h264pred_init.c
h264pred_neon.S	aarch64: h264pred: Optimize the inner loop of existing 8 bit functions	2021-04-14 15:23:44 +03:00
h264qpel_init_aarch64.c
h264qpel_neon.S
hevcdsp_idct_neon.S	lavc/aarch64: add HEVC idct_dc NEON	2021-02-18 14:12:01 +01:00
hevcdsp_init_aarch64.c	lavc/aarch64: add HEVC sao_band NEON	2021-02-18 14:12:01 +01:00
hevcdsp_sao_neon.S	lavc/aarch64: add HEVC sao_band NEON	2021-02-18 14:12:01 +01:00
hpeldsp_init_aarch64.c
hpeldsp_neon.S
idct.h	Merge commit '`2ec9fa5ec6`'	2017-03-21 14:29:52 -03:00
idctdsp_init_aarch64.c	lavc/aarch64: Fix compilation with --disable-neon	2020-03-11 14:16:48 +01:00
Makefile	lavc/aarch64: add HEVC sao_band NEON	2021-02-18 14:12:01 +01:00
mdct_neon.S
mpegaudiodsp_init.c	Merge commit '`72a19f4013`'	2017-03-31 14:43:37 -03:00
mpegaudiodsp_neon.S	Merge commit '`732510636e`'	2017-11-11 17:47:10 -03:00
neon.S	Merge commit '`cdb1665f70`'	2016-04-24 12:51:42 +01:00
neontest.c	avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions	2021-02-26 19:26:31 -03:00
opusdsp_init.c	aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis	2019-04-10 01:08:54 +02:00
opusdsp_neon.S	aarch64/opusdsp: do not clobber register v8	2019-08-15 13:29:22 +01:00
pixblockdsp_init_aarch64.c	libavcodec: aarch64: Add a NEON implementation of pixblockdsp	2020-05-15 23:37:55 +03:00
pixblockdsp_neon.S	libavcodec: aarch64: Add a NEON implementation of pixblockdsp	2020-05-15 23:37:55 +03:00
rv40dsp_init_aarch64.c	Merge commit '`e4a94d8b36`'	2017-03-21 15:20:45 -03:00
sbrdsp_init_aarch64.c	lavc/aarch64: add sbrdsp neon implementation	2017-07-03 14:29:22 +02:00
sbrdsp_neon.S	lavc/aarch64/sbrdsp_neon: fix build on old binutils	2018-01-26 02:42:01 -06:00
simple_idct_neon.S	lavc/aarch64/simple_idct: fix build with Xcode 7.2	2017-06-14 23:20:58 +02:00
synth_filter_init.c	avcodec/synth_filter: split off remaining code from dcadec files	2016-01-25 14:57:38 -03:00
synth_filter_neon.S	Merge commit '`2425d7329f`'	2017-04-26 16:28:57 +02:00
vc1dsp_init_aarch64.c	Merge commit '`e4a94d8b36`'	2017-03-21 15:20:45 -03:00
videodsp.S
videodsp_init.c
vorbisdsp_init.c
vorbisdsp_neon.S
vp8dsp.h	Merge commit '`e39a9212ab`'	2019-03-14 16:18:42 -03:00
vp8dsp_init_aarch64.c	Merge commit '`e39a9212ab`'	2019-03-14 16:18:42 -03:00
vp8dsp_neon.S	Merge commit '`7e42d5f0ab`'	2019-03-14 16:22:29 -03:00
vp9dsp_init.h	vp9: re-split the decoder/format/dsp interface header files.	2017-03-28 18:04:26 -04:00
vp9dsp_init_10bpp_aarch64.c	aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC	2017-01-24 22:36:05 +02:00
vp9dsp_init_12bpp_aarch64.c	aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC	2017-01-24 22:36:05 +02:00
vp9dsp_init_16bpp_aarch64_template.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
vp9dsp_init_aarch64.c	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h	2021-01-01 14:11:01 +01:00
vp9itxfm_16bpp_neon.S	aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older	2017-06-21 09:08:14 +03:00
vp9itxfm_neon.S	aarch64: vp9: Fix assembling with Xcode 6.2 and older	2017-06-21 09:08:13 +03:00
vp9lpf_16bpp_neon.S	aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter	2017-01-24 22:36:11 +02:00
vp9lpf_neon.S	aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1	2017-03-11 13:14:50 +02:00
vp9mc_16bpp_neon.S	lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.	2020-03-11 14:16:40 +01:00
vp9mc_aarch64.S	lavc/aarch64: Fix suffix of new file vp9mc_aarch64.	2020-03-11 14:29:22 +01:00
vp9mc_neon.S	lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.	2020-03-11 14:16:40 +01:00