ffmpeg/libavcodec/aarch64
Martin Storsjö fd3bd5c492 aarch64: h264qpel: Do vertical filtering without transposing
This gives rather big speedups on these functions:

Before:
put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5

After:
put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2

I.e. overall a 20%-60% reduction in runtime of these
functions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
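For readers unfamiliar with the filter behind the mc02 numbers, the following is a minimal scalar sketch in C of the H.264 six-tap vertical half-pel interpolation for 8-bit pixels, with taps (1, -5, 20, 20, -5, 1), rounding by 16, and a right shift by 5. It is not the NEON code in h264qpel_neon.S; the names `round_shift5_clip` and `qpel_v_lowpass_ref` are hypothetical and exist only for illustration. The point it tries to show is that each output row is an element-wise weighted sum of six whole input rows, so a vector implementation can plausibly keep rows in registers and never needs a transpose for the vertical pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Round, shift by 5 and clamp to the 8-bit pixel range.
 * Negative sums are clamped before shifting so the result does not
 * depend on how the compiler shifts negative integers. */
static uint8_t round_shift5_clip(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical scalar reference for the vertical half-pel case.
 * src points at the first output row; rows src - 2*src_stride up to
 * src + (h + 2)*src_stride must be readable.
 */
static void qpel_v_lowpass_ref(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *p = src + x;
            /* Six vertically adjacent samples, weighted 1 -5 20 20 -5 1. */
            int sum =       p[-2 * src_stride]
                     -  5 * p[-1 * src_stride]
                     + 20 * p[0]
                     + 20 * p[ 1 * src_stride]
                     -  5 * p[ 2 * src_stride]
                     +      p[ 3 * src_stride];
            dst[x] = round_shift5_clip(sum);
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

Because the inner loop over `x` applies the same six taps to every column, the loop body maps directly onto per-lane vector arithmetic over whole rows; how the actual assembly schedules this is not shown here.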
aacpsdsp_init_aarch64.c Include attributes.h directly 2021-04-19 14:34:10 +02:00
aacpsdsp_neon.S lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
asm-offsets.h aarch64/asm-offsets: remove old CELT offsets 2019-05-14 23:41:24 +01:00
cabac.h
fft_init_aarch64.c
fft_neon.S
fmtconvert_init.c
fmtconvert_neon.S
h264chroma_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
h264cmc_neon.S Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
h264dsp_init_aarch64.c lavc/aarch64: h264, add chroma loop filters for 10bit 2021-08-21 00:06:26 +03:00
h264dsp_neon.S lavc/aarch64: h264, add chroma loop filters for 10bit 2021-08-21 00:06:26 +03:00
h264idct_neon.S libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S 2019-01-03 20:12:07 +01:00
h264pred_init.c lavc/aarch64: add pred functions for 10-bit 2021-08-21 00:06:26 +03:00
h264pred_neon.S lavc/aarch64: add pred functions for 10-bit 2021-08-21 00:06:26 +03:00
h264qpel_init_aarch64.c
h264qpel_neon.S aarch64: h264qpel: Do vertical filtering without transposing 2021-10-18 14:27:58 +03:00
hevcdsp_idct_neon.S aarch64: hevc_idct: Fix overflows in idct_dc 2021-05-22 00:08:03 +03:00
hevcdsp_init_aarch64.c lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
hevcdsp_sao_neon.S lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
hpeldsp_init_aarch64.c
hpeldsp_neon.S
idct.h Merge commit '2ec9fa5ec6' 2017-03-21 14:29:52 -03:00
idctdsp_init_aarch64.c lavc/aarch64: Fix compilation with --disable-neon 2020-03-11 14:16:48 +01:00
Makefile lavc/aarch64: add HEVC sao_band NEON 2021-02-18 14:12:01 +01:00
mdct_neon.S
mpegaudiodsp_init.c Merge commit '72a19f4013' 2017-03-31 14:43:37 -03:00
mpegaudiodsp_neon.S Merge commit '732510636e' 2017-11-11 17:47:10 -03:00
neon.S lavc/aarch64: move transpose_4x8H to neon.S 2021-08-21 00:06:26 +03:00
neontest.c avcodec: Remove deprecated old encode/decode APIs 2021-04-27 10:43:12 -03:00
opusdsp_init.c Include attributes.h directly 2021-04-19 14:34:10 +02:00
opusdsp_neon.S aarch64/opusdsp: do not clobber register v8 2019-08-15 13:29:22 +01:00
pixblockdsp_init_aarch64.c libavcodec: aarch64: Add a NEON implementation of pixblockdsp 2020-05-15 23:37:55 +03:00
pixblockdsp_neon.S libavcodec: aarch64: Add a NEON implementation of pixblockdsp 2020-05-15 23:37:55 +03:00
rv40dsp_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
sbrdsp_init_aarch64.c lavc/aarch64: add sbrdsp neon implementation 2017-07-03 14:29:22 +02:00
sbrdsp_neon.S lavc/aarch64/sbrdsp_neon: fix build on old binutils 2018-01-26 02:42:01 -06:00
simple_idct_neon.S lavc/aarch64/simple_idct: fix build with Xcode 7.2 2017-06-14 23:20:58 +02:00
synth_filter_init.c
synth_filter_neon.S Merge commit '2425d7329f' 2017-04-26 16:28:57 +02:00
vc1dsp_init_aarch64.c Merge commit 'e4a94d8b36' 2017-03-21 15:20:45 -03:00
videodsp.S lavc/aarch64: fix relocation out of range error 2021-09-25 21:55:29 +03:00
videodsp_init.c
vorbisdsp_init.c
vorbisdsp_neon.S
vp8dsp.h Merge commit 'e39a9212ab' 2019-03-14 16:18:42 -03:00
vp8dsp_init_aarch64.c Merge commit 'e39a9212ab' 2019-03-14 16:18:42 -03:00
vp8dsp_neon.S Merge commit '7e42d5f0ab' 2019-03-14 16:22:29 -03:00
vp9dsp_init.h vp9: re-split the decoder/format/dsp interface header files. 2017-03-28 18:04:26 -04:00
vp9dsp_init_10bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_12bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_16bpp_aarch64_template.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vp9dsp_init_aarch64.c lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h 2021-01-01 14:11:01 +01:00
vp9itxfm_16bpp_neon.S aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older 2017-06-21 09:08:14 +03:00
vp9itxfm_neon.S aarch64: vp9: Fix assembling with Xcode 6.2 and older 2017-06-21 09:08:13 +03:00
vp9lpf_16bpp_neon.S lavc/aarch64: move transpose_4x8H to neon.S 2021-08-21 00:06:26 +03:00
vp9lpf_neon.S aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 2017-03-11 13:14:50 +02:00
vp9mc_16bpp_neon.S lavc/aarch64: Move non-neon vp9 copy functions out of neon source file. 2020-03-11 14:16:40 +01:00
vp9mc_aarch64.S lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
vp9mc_neon.S lavc/aarch64: Move non-neon vp9 copy functions out of neon source file. 2020-03-11 14:16:40 +01:00