ffmpeg/libavcodec/aarch64
Martin Storsjö 870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions
Move the loop counter decrement further away from the branch
instruction; this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register,
do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first store anyway).

In loops that store twice to two different destination registers,
do the decrement before both stores, so that it happens as early
as possible before the branch.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.

Before:                     Cortex A53     A72     A73
pred8x8_dc_8_neon:                60.7    46.2    39.2
pred8x8_dc_128_8_neon:            30.7    18.0    14.0
pred8x8_horizontal_8_neon:        42.2    29.2    18.5
pred8x8_left_dc_8_neon:           52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:    48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:    52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:    52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:    43.2    27.0    25.5
pred8x8_plane_8_neon:            112.2    86.2    88.2
pred8x8_top_dc_8_neon:            40.7    23.0    21.2
pred8x8_vertical_8_neon:          27.2    15.5    14.0
pred16x16_dc_8_neon:              91.0    73.2    70.5
pred16x16_dc_128_8_neon:          43.0    34.7    30.7
pred16x16_horizontal_8_neon:      86.0    49.7    44.7
pred16x16_left_dc_8_neon:         87.0    67.2    67.5
pred16x16_plane_8_neon:          236.0   175.7   173.0
pred16x16_top_dc_8_neon:          53.2    39.0    41.7
pred16x16_vertical_8_neon:        41.7    29.7    31.0

After:                      Cortex A53     A72     A73
pred8x8_dc_8_neon:                59.0    46.7    42.5
pred8x8_dc_128_8_neon:            28.2    18.0    14.0
pred8x8_horizontal_8_neon:        34.2    29.2    18.5
pred8x8_left_dc_8_neon:           51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:    46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:    55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:    51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:    41.7    27.5    26.0
pred8x8_plane_8_neon:            111.5    86.5    89.5
pred8x8_top_dc_8_neon:            39.0    23.2    21.0
pred8x8_vertical_8_neon:          27.2    16.0    14.0
pred16x16_dc_8_neon:              85.0    70.2    70.5
pred16x16_dc_128_8_neon:          42.0    30.0    30.7
pred16x16_horizontal_8_neon:      66.5    49.5    42.5
pred16x16_left_dc_8_neon:         81.0    66.5    67.5
pred16x16_plane_8_neon:          235.0   175.7   173.0
pred16x16_top_dc_8_neon:          52.0    39.0    41.7
pred16x16_vertical_8_neon:        40.2    33.2    31.0
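
To put a number on the "rather notable" Cortex A53 speedup of the
horizontal prediction functions, the figures from the two tables above
can be compared directly. This is only a small sketch of the arithmetic;
the helper name reduction_percent is made up for illustration, and the
values are copied from the Before/After tables.

```python
# Cortex A53 figures copied from the Before/After tables above.
before = {
    "pred8x8_horizontal_8_neon": 42.2,
    "pred16x16_horizontal_8_neon": 86.0,
}
after = {
    "pred8x8_horizontal_8_neon": 34.2,
    "pred16x16_horizontal_8_neon": 66.5,
}


def reduction_percent(old, new):
    """Return the share of the old cost that was saved, in percent."""
    return 100.0 * (old - new) / old


# Hypothetical helper output: savings per function on the Cortex A53.
savings = {name: reduction_percent(before[name], after[name]) for name in before}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% lower")
```

With the numbers above this comes out at roughly 19% for the 8x8 case
and roughly 23% for the 16x16 case.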

Despite this, a number of these functions are still slower than
what e.g. GCC 7 generates. The table below shows the relative speedup
of the NEON codepaths over the compiler-generated ones (values below
1.00 mean the NEON version is slower):

                           Cortex A53    A72    A73
pred8x8_dc_8_neon:               0.86   0.65   1.04
pred8x8_dc_128_8_neon:           0.59   0.44   0.62
pred8x8_horizontal_8_neon:       1.51   0.58   1.30
pred8x8_left_dc_8_neon:          0.72   0.56   0.89
pred8x8_mad_cow_dc_0l0_8_neon:   0.93   0.93   1.37
pred8x8_mad_cow_dc_0lt_8_neon:   1.37   1.41   1.68
pred8x8_mad_cow_dc_l0t_8_neon:   1.21   1.17   1.32
pred8x8_mad_cow_dc_l00_8_neon:   1.24   1.19   1.60
pred8x8_plane_8_neon:            3.36   3.58   3.76
pred8x8_top_dc_8_neon:           0.97   0.99   1.43
pred8x8_vertical_8_neon:         0.86   0.78   1.18
pred16x16_dc_8_neon:             1.20   1.06   1.49
pred16x16_dc_128_8_neon:         0.83   0.95   0.99
pred16x16_horizontal_8_neon:     1.78   0.96   1.59
pred16x16_left_dc_8_neon:        1.06   0.96   1.32
pred16x16_plane_8_neon:          5.78   6.49   7.19
pred16x16_top_dc_8_neon:         1.48   1.53   1.94
pred16x16_vertical_8_neon:       1.39   1.34   1.98

In particular, on Cortex A72, many of these functions are slower
than the compiler-generated code, while they're more beneficial on
e.g. the Cortex A73.
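
As a rough tally of the table above, the sketch below counts, per core,
how many of the 18 functions have a relative speedup below 1.00. It
assumes that a ratio below 1.00 means the NEON version is slower than
the compiler-generated code; the variable names are made up for
illustration, and the ratios are copied from the table.

```python
# Relative speedups (NEON vs. compiler-generated code), copied from the
# table above. Columns: Cortex A53, A72, A73.
ratios = {
    "pred8x8_dc_8_neon":             (0.86, 0.65, 1.04),
    "pred8x8_dc_128_8_neon":         (0.59, 0.44, 0.62),
    "pred8x8_horizontal_8_neon":     (1.51, 0.58, 1.30),
    "pred8x8_left_dc_8_neon":        (0.72, 0.56, 0.89),
    "pred8x8_mad_cow_dc_0l0_8_neon": (0.93, 0.93, 1.37),
    "pred8x8_mad_cow_dc_0lt_8_neon": (1.37, 1.41, 1.68),
    "pred8x8_mad_cow_dc_l0t_8_neon": (1.21, 1.17, 1.32),
    "pred8x8_mad_cow_dc_l00_8_neon": (1.24, 1.19, 1.60),
    "pred8x8_plane_8_neon":          (3.36, 3.58, 3.76),
    "pred8x8_top_dc_8_neon":         (0.97, 0.99, 1.43),
    "pred8x8_vertical_8_neon":       (0.86, 0.78, 1.18),
    "pred16x16_dc_8_neon":           (1.20, 1.06, 1.49),
    "pred16x16_dc_128_8_neon":       (0.83, 0.95, 0.99),
    "pred16x16_horizontal_8_neon":   (1.78, 0.96, 1.59),
    "pred16x16_left_dc_8_neon":      (1.06, 0.96, 1.32),
    "pred16x16_plane_8_neon":        (5.78, 6.49, 7.19),
    "pred16x16_top_dc_8_neon":       (1.48, 1.53, 1.94),
    "pred16x16_vertical_8_neon":     (1.39, 1.34, 1.98),
}
cores = ("Cortex A53", "Cortex A72", "Cortex A73")

# Assumption: a ratio below 1.00 means the NEON function is slower.
slower = {
    core: sum(1 for row in ratios.values() if row[i] < 1.0)
    for i, core in enumerate(cores)
}

for core in cores:
    print(f"{core}: {slower[core]} of {len(ratios)} functions slower than the compiler")
```

By this count, 7 functions are slower on the Cortex A53, 10 on the
Cortex A72 and 3 on the Cortex A73.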

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-04-14 15:23:44 +03:00
aacpsdsp_init_aarch64.c lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
aacpsdsp_neon.S lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
asm-offsets.h aarch64/asm-offsets: remove old CELT offsets 2019-05-14 23:41:24 +01:00
cabac.h
fft_init_aarch64.c Merge commit '97aec6e75e' 2016-04-12 15:43:09 +01:00
fft_neon.S
fmtconvert_init.c Merge commit 'a0fc780a20' 2016-01-02 11:21:16 +01:00
fmtconvert_neon.S Merge commit 'a0fc780a20' 2016-01-02 11:21:16 +01:00
h264chroma_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
h264cmc_neon.S Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
h264dsp_init_aarch64.c Merge commit '186bd30aa3' 2019-03-14 16:29:41 -03:00
h264dsp_neon.S Merge commit '186bd30aa3' 2019-03-14 16:29:41 -03:00
h264idct_neon.S libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S 2019-01-03 20:12:07 +01:00
h264pred_init.c
h264pred_neon.S aarch64: h264pred: Optimize the inner loop of existing 8 bit functions 2021-04-14 15:23:44 +03:00
h264qpel_init_aarch64.c
h264qpel_neon.S
hevcdsp_idct_neon.S lavc/aarch64: add HEVC idct_dc NEON 2021-02-18 14:12:01 +01:00
hevcdsp_init_aarch64.c lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
hevcdsp_sao_neon.S lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
hpeldsp_init_aarch64.c
hpeldsp_neon.S
idct.h Merge commit '2ec9fa5ec6' 2017-03-21 14:29:52 -03:00
idctdsp_init_aarch64.c lavc/aarch64: Fix compilation with --disable-neon 2020-03-11 14:16:48 +01:00
Makefile lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
mdct_neon.S
mpegaudiodsp_init.c Merge commit '72a19f4013' 2017-03-31 14:43:37 -03:00
mpegaudiodsp_neon.S Merge commit '732510636e' 2017-11-11 17:47:10 -03:00
neon.S Merge commit 'cdb1665f70' 2016-04-24 12:51:42 +01:00
neontest.c avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions 2021-02-26 19:26:31 -03:00
opusdsp_init.c aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis 2019-04-10 01:08:54 +02:00
opusdsp_neon.S aarch64/opusdsp: do not clobber register v8 2019-08-15 13:29:22 +01:00
pixblockdsp_init_aarch64.c libavcodec: aarch64: Add a NEON implementation of pixblockdsp 2020-05-15 23:37:55 +03:00
pixblockdsp_neon.S libavcodec: aarch64: Add a NEON implementation of pixblockdsp 2020-05-15 23:37:55 +03:00
rv40dsp_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
sbrdsp_init_aarch64.c lavc/aarch64: add sbrdsp neon implementation 2017-07-03 14:29:22 +02:00
sbrdsp_neon.S lavc/aarch64/sbrdsp_neon: fix build on old binutils 2018-01-26 02:42:01 -06:00
simple_idct_neon.S lavc/aarch64/simple_idct: fix build with Xcode 7.2 2017-06-14 23:20:58 +02:00
synth_filter_init.c avcodec/synth_filter: split off remaining code from dcadec files 2016-01-25 14:57:38 -03:00
synth_filter_neon.S Merge commit '2425d7329f' 2017-04-26 16:28:57 +02:00
vc1dsp_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
videodsp.S
videodsp_init.c
vorbisdsp_init.c
vorbisdsp_neon.S
vp8dsp.h Merge commit 'e39a9212ab' 2019-03-14 16:18:42 -03:00
vp8dsp_init_aarch64.c Merge commit 'e39a9212ab' 2019-03-14 16:18:42 -03:00
vp8dsp_neon.S Merge commit '7e42d5f0ab' 2019-03-14 16:22:29 -03:00
vp9dsp_init.h vp9: re-split the decoder/format/dsp interface header files. 2017-03-28 18:04:26 -04:00
vp9dsp_init_10bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_12bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_16bpp_aarch64_template.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vp9dsp_init_aarch64.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vp9itxfm_16bpp_neon.S aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older 2017-06-21 09:08:14 +03:00
vp9itxfm_neon.S aarch64: vp9: Fix assembling with Xcode 6.2 and older 2017-06-21 09:08:13 +03:00
vp9lpf_16bpp_neon.S aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter 2017-01-24 22:36:11 +02:00
vp9lpf_neon.S aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 2017-03-11 13:14:50 +02:00
vp9mc_16bpp_neon.S lavc/aarch64: Move non-neon vp9 copy functions out of neon source file. 2020-03-11 14:16:40 +01:00
vp9mc_aarch64.S lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
vp9mc_neon.S lavc/aarch64: Move non-neon vp9 copy functions out of neon source file. 2020-03-11 14:16:40 +01:00