ffmpeg/libavcodec/arm
Martin Storsjö 5eb5aec475 arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
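
As a rough illustration of the idea above (not the NEON code in
vp9itxfm_neon.S), the sketch below uses a toy 16-point integer transform.
All names (toy_basis, toy_itx, toy_pick_count, toy_variants_agree) and the
basis values are hypothetical, and the selection rule is an assumption made
for the example. It only shows that when the trailing input coefficients
are known to be zero, a variant that never reads them produces the same
output as the full transform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_N 16 /* transform length; a toy stand-in for idct16 */

/* Hypothetical integer basis in the range -5..5.
 * These are NOT the real VP9 IDCT coefficients. */
int toy_basis(size_t k, size_t n)
{
    return (int)((k * n * 7 + k + 3) % 11) - 5;
}

/* Inverse transform that reads only the first `count` input coefficients.
 * With count == TOY_N this is the full transform. */
void toy_itx(const int *in, size_t count, int out[TOY_N])
{
    assert(count <= TOY_N);
    for (size_t n = 0; n < TOY_N; n++) {
        int sum = 0;
        for (size_t k = 0; k < count; k++)
            sum += in[k] * toy_basis(k, n);
        out[n] = sum;
    }
}

/* Pick the smallest variant (quarter, half or full) that still covers
 * the index of the last nonzero input coefficient. */
size_t toy_pick_count(size_t last_nonzero)
{
    if (last_nonzero < TOY_N / 4)
        return TOY_N / 4;
    if (last_nonzero < TOY_N / 2)
        return TOY_N / 2;
    return TOY_N;
}

/* Returns 1 when the reduced variant gives the same output as the full one. */
int toy_variants_agree(const int in[TOY_N])
{
    size_t last = 0;
    int full[TOY_N], reduced[TOY_N];

    for (size_t k = 0; k < TOY_N; k++)
        if (in[k])
            last = k;

    toy_itx(in, TOY_N, full);
    toy_itx(in, toy_pick_count(last), reduced);
    return memcmp(full, reduced, sizeof(full)) == 0;
}
```

The reduced variants do proportionally fewer multiply-accumulates in the
inner loop, which is the kind of saving the commit describes for inputs
with few nonzero coefficients.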

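This gives a substantial speedup for the smaller subpartitions; on
Cortex A7, for example, vp9_inv_dct_dct_16x16_sub4_add_neon drops from
2104.5 to 1203.5 and vp9_inv_dct_dct_32x32_sub4_add_neon from 10865.6
to 7549.9 (see the tables below).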

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:                               Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-09 12:32:00 +02:00
aac.h arm: use HAVE*_INLINE/EXTERNAL macros for conditional compilation 2012-12-07 16:54:03 +00:00
aacpsdsp_init_arm.c aacps: NEON optimisations 2012-05-05 22:04:21 +01:00
aacpsdsp_neon.S ARM: Move asm.S from libavcodec to libavutil 2012-06-08 13:14:38 -04:00
ac3dsp_arm.S ARM: Move asm.S from libavcodec to libavutil 2012-06-08 13:14:38 -04:00
ac3dsp_armv6.S ARM: swap source operands in some add instructions 2012-09-20 17:07:18 +01:00
ac3dsp_init_arm.c dsputil: Move apply_window_int16 to ac3dsp 2013-12-08 17:57:15 +01:00
ac3dsp_neon.S dsputil: Move apply_window_int16 to ac3dsp 2013-12-08 17:57:15 +01:00
apedsp_init_arm.c dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
apedsp_neon.S dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
asm-offsets.h mpegvideo: move the MpegEncContext fields used from arm asm to the beginning 2014-04-29 14:49:42 +02:00
audiodsp_arm.h dsputil: Split audio operations off into a separate context 2014-06-22 06:20:15 -07:00
audiodsp_init_arm.c dsputil: Split audio operations off into a separate context 2014-06-22 06:20:15 -07:00
audiodsp_init_neon.c audiodsp: reorder arguments for vector_clipf 2016-09-22 09:47:52 +02:00
audiodsp_neon.S audiodsp: reorder arguments for vector_clipf 2016-09-22 09:47:52 +02:00
blockdsp_arm.h blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_init_arm.c blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_init_neon.c blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_neon.S dsputil: Split clear_block*/fill_block* off into a separate context 2014-06-18 14:07:23 -07:00
cabac.h arm: get_cabac inline asm 2014-03-09 00:45:34 +01:00
dca.h dcadec: simplify decoding of VQ high frequencies 2014-02-28 13:03:22 +01:00
dcadsp_init_arm.c dca: remove unused decode_hf function and quant_d tables 2015-12-24 13:58:18 +01:00
dcadsp_neon.S dca: remove unused decode_hf function and quant_d tables 2015-12-24 13:58:18 +01:00
dcadsp_vfp.S dcadec: remove scaling in lfe_interpolation_fir 2014-02-28 13:00:47 +01:00
fft_fixed_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
fft_fixed_neon.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
fft_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
fft_neon.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
fft_vfp.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
flacdsp_arm.S flacdsp: arm optimised lpc filter 2012-09-15 23:54:21 +01:00
flacdsp_init_arm.c flacdsp: arm optimised lpc filter 2012-09-15 23:54:21 +01:00
fmtconvert_init_arm.c arm: add ff_int32_to_float_fmul_array8_neon 2015-12-14 16:45:02 +01:00
fmtconvert_neon.S arm: add ff_int32_to_float_fmul_array8_neon 2015-12-14 16:45:02 +01:00
fmtconvert_vfp.S arm: fmtconvert: Split armv6 fmtconvert code off from vfp code 2013-08-29 11:24:14 +02:00
g722dsp_init_arm.c g722: Add ARM NEON implementation for g722_apply_qmf() 2015-02-15 22:47:21 +02:00
g722dsp_neon.S g722: Add ARM NEON implementation for g722_apply_qmf() 2015-02-15 22:47:21 +02:00
h264chroma_init_arm.c h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
h264cmc_neon.S h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
h264dsp_init_arm.c h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
h264dsp_neon.S dsputil: Separate h264 qpel 2013-01-24 10:44:43 +01:00
h264idct_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
h264pred_init_arm.c h264: arm: use intra pred8x8 functions only for chroma_format_idc <= 1 2015-07-18 00:28:49 +02:00
h264pred_neon.S ARM: Move asm.S from libavcodec to libavutil 2012-06-08 13:14:38 -04:00
h264qpel_init_arm.c qpeldsp: Mark source pointer in qpel_mc_func function pointer const 2014-07-25 02:52:54 -07:00
h264qpel_neon.S dsputil: Separate h264 qpel 2013-01-24 10:44:43 +01:00
hpeldsp_arm.h arm: Use full filenames as multiple inclusion guards 2014-01-14 00:04:52 +01:00
hpeldsp_arm.S hpeldsp: arm: Update comments left behind in 25841dfe80 2016-09-29 14:48:03 +02:00
hpeldsp_armv6.S arm: hpeldsp: fix put_pixels8_y2_{,no_rnd_}armv6 2014-03-08 18:31:57 +01:00
hpeldsp_init_arm.c dsputil: Refactor duplicated CALL_2X_PIXELS / PIXELS16 macros 2014-03-22 06:17:29 -07:00
hpeldsp_init_armv6.c arm: hpeldsp: Move half-pel assembly from dsputil to hpeldsp 2013-04-19 23:19:08 +03:00
hpeldsp_init_neon.c arm: hpeldsp: Move half-pel assembly from dsputil to hpeldsp 2013-04-19 23:19:08 +03:00
hpeldsp_neon.S arm: hpeldsp: Move half-pel assembly from dsputil to hpeldsp 2013-04-19 23:19:08 +03:00
idct.h idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_arm.h dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
idctdsp_arm.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_armv6.S dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
idctdsp_init_arm.c idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_init_armv5te.c idct: Move arm-specific declarations to a header in the arm directory 2014-07-20 13:02:17 -07:00
idctdsp_init_armv6.c idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_init_neon.c idct: Move arm-specific declarations to a header in the arm directory 2014-07-20 13:02:17 -07:00
idctdsp_neon.S dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
int_neon.S dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
jrevdct_arm.S Drop DCTELEM typedef 2013-01-22 18:32:56 -08:00
Makefile arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
mathops.h arm: use HAVE*_INLINE/EXTERNAL macros for conditional compilation 2012-12-07 16:54:03 +00:00
mdct_fixed_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
mdct_fixed_neon.S ARM: set Tag_ABI_align_preserved in all asm files 2012-10-02 19:47:56 +01:00
mdct_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
mdct_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
mdct_vfp.S armv6: Accelerate ff_imdct_half for general case (mdct_bits != 6) 2014-07-18 01:34:08 +03:00
me_cmp_armv6.S dsputil: Split motion estimation compare bits off into their own context 2014-07-17 09:07:10 -07:00
me_cmp_init_arm.c motion_est: convert stride to ptrdiff_t 2014-11-24 01:30:10 +00:00
mlpdsp_armv5te.S arm: mlpdsp: handle pic offset calculation in a macro 2014-12-09 22:00:08 +01:00
mlpdsp_armv6.S cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
mlpdsp_init_arm.c truehd: add hand-scheduled ARM asm version of ff_mlp_pack_output. 2014-03-26 19:54:32 +02:00
mpegaudiodsp_fixed_armv6.S ARM: Move asm.S from libavcodec to libavutil 2012-06-08 13:14:38 -04:00
mpegaudiodsp_init_arm.c Add av_cold attributes to arch-specific init functions 2013-02-05 17:01:05 +01:00
mpegvideo_arm.c mpegvideo: cosmetics: Lowercase ugly uppercase MPV_ function name prefixes 2014-08-15 01:26:33 -07:00
mpegvideo_arm.h mpegvideo: cosmetics: Lowercase ugly uppercase MPV_ function name prefixes 2014-08-15 01:26:33 -07:00
mpegvideo_armv5te.c cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
mpegvideo_armv5te_s.S ARM: use standard syntax for all LDRD/STRD instructions 2012-08-01 10:32:24 +01:00
mpegvideo_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
mpegvideoencdsp_armv6.S dsputil: Move pix_sum, pix_norm1, shrink function pointers to mpegvideoenc 2014-07-06 14:26:53 -07:00
mpegvideoencdsp_init_arm.c dsputil: Move pix_sum, pix_norm1, shrink function pointers to mpegvideoenc 2014-07-06 14:26:53 -07:00
neon.S ARM: make some NEON macros reusable 2011-12-02 19:59:18 +00:00
neontest.c lavc: add clobber tests for the new encoding/decoding API 2016-09-28 10:01:52 +02:00
pixblockdsp_armv6.S dsputil: Split off pixel block routines into their own context 2014-07-09 08:05:26 -07:00
pixblockdsp_init_arm.c pixblockdsp: Change type of stride parameters to ptrdiff_t 2016-09-14 14:12:36 +02:00
rdft_init_arm.c rdft: arm: Split RDFT initialization into a separate file 2016-02-26 14:34:58 +01:00
rdft_neon.S ARM: set Tag_ABI_align_preserved in all asm files 2012-10-02 19:47:56 +01:00
rv34dsp_init_arm.c rv34: Drop now unnecessary dsputil dependencies 2013-02-06 11:30:54 +01:00
rv34dsp_neon.S Drop DCTELEM typedef 2013-01-22 18:32:56 -08:00
rv40dsp_init_arm.c qpeldsp: Mark source pointer in qpel_mc_func function pointer const 2014-07-25 02:52:54 -07:00
rv40dsp_neon.S ARM: Move asm.S from libavcodec to libavutil 2012-06-08 13:14:38 -04:00
sbrdsp_init_arm.c ARM: allow runtime masking of CPU features 2012-04-22 12:30:45 +01:00
sbrdsp_neon.S ARM: generate position independent code to access data symbols 2012-07-01 11:25:06 +01:00
simple_idct_arm.S cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
simple_idct_armv5te.S simple_idct: arm: Drop disabled code variant 2016-08-17 12:21:54 +02:00
simple_idct_armv6.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
simple_idct_neon.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
startcode.h h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
startcode_armv6.S h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
synth_filter_neon.S ARM: set Tag_ABI_align_preserved in all asm files 2012-10-02 19:47:56 +01:00
synth_filter_vfp.S arm: cosmetics: Consistently use lowercase for shift operators 2014-07-18 11:17:40 +03:00
vc1dsp.h vc1: arm: Add NEON assembly 2013-12-20 14:53:39 +02:00
vc1dsp_init_arm.c vc-1: Add platform-specific start code search routine to VC1DSPContext. 2014-08-04 22:22:54 +02:00
vc1dsp_init_neon.c h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
vc1dsp_neon.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
videodsp_arm.h lavc: add missing files for arm 2012-12-20 14:07:23 +01:00
videodsp_armv5te.S arm: use a local label instead of the function symbol in ff_prefetch_arm 2015-07-20 23:10:29 +02:00
videodsp_init_arm.c Add av_cold attributes to arch-specific init functions 2013-02-05 17:01:05 +01:00
videodsp_init_armv5te.c Add av_cold attributes to arch-specific init functions 2013-02-05 17:01:05 +01:00
vorbisdsp_init_arm.c Add av_cold attributes to arch-specific init functions 2013-02-05 17:01:05 +01:00
vorbisdsp_neon.S Move vorbis_inverse_coupling from dsputil to vorbisdspcontext. 2013-01-19 22:21:10 -08:00
vp3dsp_init_arm.c vp3: Change type of stride parameters to ptrdiff_t 2016-08-26 11:36:26 +02:00
vp3dsp_neon.S arm: Add a missing # as prefix for an immediate constant 2014-01-07 19:30:13 +02:00
vp6dsp_init_arm.c vp56: Separate VP5 and VP6 dsp initialization 2016-08-26 11:50:22 +02:00
vp6dsp_neon.S vp56: Mark VP6-only optimizations as such. 2013-08-23 14:42:19 +02:00
vp8.h arm: asm decode_block_coeffs_internal is vp8 specific 2014-04-04 10:39:29 +02:00
vp8_armv6.S ARM: swap source operands in some add instructions 2012-09-20 17:07:18 +01:00
vp8dsp.h On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_armv6.S vp8: Update some assembly comments left unchanged in bd66f073fe 2016-08-26 11:36:53 +02:00
vp8dsp_init_arm.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_init_armv6.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_init_neon.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_neon.S arm: Fix a typo in a comment 2016-07-06 22:58:51 +03:00
vp9dsp_init_arm.c arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
vp9itxfm_neon.S arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible 2017-02-09 12:32:00 +02:00
vp9lpf_neon.S arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
vp9mc_neon.S arm: vp9mc: Fix vertical alignment of operands 2017-01-03 14:15:45 +02:00
vp56_arith.h arm: use HAVE*_INLINE/EXTERNAL macros for conditional compilation 2012-12-07 16:54:03 +00:00