Commit graph

2355 commits

Author SHA1 Message Date
Michael Niedermayer
516c213f08 avcodec/x86/vp9dsp_init_16bpp: Fix linking to missing ff_vp9_ipred_dr_32x32_16_avx2() on 32bit
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-06-28 00:31:33 +02:00
Ilia Valiakhmetov
35a5d9715d avcodec/vp9: add 64-bit ipred_dr_32x32_16 avx2 implementation
vp9_diag_downright_32x32_12bpp_c: 429.7
vp9_diag_downright_32x32_12bpp_sse2: 158.9
vp9_diag_downright_32x32_12bpp_ssse3: 144.6
vp9_diag_downright_32x32_12bpp_avx: 141.0
vp9_diag_downright_32x32_12bpp_avx2: 73.8

Almost 50% faster than avx implementation

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-27 16:10:50 -04:00
Paul B Mahol
4ed7c2bbc3 avcodec/utvideodec: add SIMD for restore_rgb_planes
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-06-27 09:54:10 +02:00
Matthieu Bouron
db5bf64b21 lavc/x86: clear r2 higher bits in ff_sbr_sum_square
Suggested-by: James Almer <jamrial@gmail.com>
2017-06-26 09:55:23 +02:00
James Almer
349446e36f x86/mdct15: use three operand form for some instructions
Fixes compilation with old yasm
2017-06-24 01:44:49 -03:00
Rostislav Pehlivanov
e1120b1c54 mdct15: add assembly optimizations for the 15-point FFT
c:    1802 decicycles in fft15,16774635 runs,   2581 skips
avx:   865 decicycles in fft15,16776378 runs,    838 skips

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-06-23 23:45:37 +01:00
Diego Biurrun
fd502f4f5f build: Generalize yasm/nasm-related variable names
None of them are specific to the YASM assembler.

(Cherry-picked from libav commit 39e208f4d4)

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-21 17:00:29 -03:00
James Darnley
8221c71703 avcodec/x86: allow future 8-bit simple idct to use slightly different coefficients 2017-06-20 16:12:25 +02:00
James Darnley
d2597fb0c1 avcodec/x86: modify simple_idct10 macros to add an action paramter 2017-06-20 13:35:01 +02:00
James Darnley
8781330d80 avcodec/x86: cleanup simple_idct10
Use named arguments for the functions so we can remove a define.  The
stride/linesize argument is now ptrdiff_t type so we no longer need to
sign extend the register.
2017-06-20 13:34:38 +02:00
James Darnley
e3db94302c avcodec/x86/mpegenc: support transpose permuation type 2017-06-20 12:12:13 +02:00
James Darnley
fa30a0a548 avcodec/x86/mpegenc: check IDCT permutation type is a valid value 2017-06-20 12:12:13 +02:00
Michael Niedermayer
ae6f6d4e34 avcodec/x86/mpegvideo: Use intra scantable in dct_unquantize_h263_intra_mmx()
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-06-20 00:07:51 +02:00
James Almer
8bb59e6742 x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
About 2x faster than the c version.
2017-06-18 22:34:22 -03:00
James Almer
e229df9478 x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
About 2x faster than the c version.
2017-06-18 22:33:27 -03:00
James Almer
623d217ed1 avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-15 23:49:40 -03:00
James Almer
b3446862bf x86/vorbisdsp: optimize ff_vorbis_inverse_coupling_sse
About 7% faster.
2017-06-15 23:20:05 -03:00
Ronald S. Bultje
d35ff98e27 vp9: fix overwrite in ff_vp9_ipred_dr_16x16_16_avx2.
Fixes trac issue 6459.
2017-06-14 11:37:38 -04:00
Ilia Valiakhmetov
81fc617c12 avcodec/vp9: ipred_dr_16x16_16 avx2 implementation
Signed-off-by: Ilia Valiakhmetov <zakne0ne@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-12 12:40:58 -04:00
James Almer
497a4b554c x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
2017-06-07 13:53:51 -03:00
Ilia Valiakhmetov
73d9a9a6af libavcodec/vp9: ipred_dl_32x32_16 avx2 implementation
vp9_diag_downleft_32x32_8bpp_c: 580.2
vp9_diag_downleft_32x32_8bpp_sse2: 75.6
vp9_diag_downleft_32x32_8bpp_ssse3: 73.7
vp9_diag_downleft_32x32_8bpp_avx: 72.7
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0

~30% faster than avx implementation

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-06 08:05:03 -04:00
James Almer
933dd62288 x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
~2% faster.
2017-06-04 23:29:56 -03:00
James Almer
be3809a521 x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
Move the unpacking outside of the loop. 5% to 10% faster.

Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-03 12:39:43 -03:00
James Almer
b5a0971ff0 x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
About 2x faster than the c version.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-02 11:06:24 -03:00
James Darnley
0dea0114fb avcodec/x86/idctdsp_init: reindent 2017-05-30 13:20:44 +02:00
James Darnley
8e89f6fd37 avcodec/x86: move simple_idct to external assembly 2017-05-30 13:20:42 +02:00
Clément Bœsch
584366a436 lavc/mpegvideoenc: reformat inv_zigzag_direct16 so the zigzag pattern is visible 2017-05-19 11:17:58 +02:00
Clément Bœsch
19bb2cade5 Merge commit 'b4a911c189'
* commit 'b4a911c189':
  mpegvideoenc: make a table const

Merged-by: Clément Bœsch <u@pkh.me>
2017-05-19 11:15:16 +02:00
James Darnley
7aa90b4e94 avcodec/h264: add sse2 versions of previous idct functions
Kaby Lake Pentium:
 - ff_h264_idct_add_8_sse2:    ~1.18x faster than mmxext
 - ff_h264_idct_dc_add_8_sse2: ~1.07x faster than mmxext
2017-05-15 15:00:20 +02:00
James Darnley
27460dfebc avcodec/h264: add avx 8-bit h264_idct_dc_add
Haswell:
 - 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext

Skylake-U:
 - 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
2017-05-15 15:00:19 +02:00
James Darnley
f61d454ca1 avcodec/h264: add avx 8-bit h264_idct_add
Haswell:
 - 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext

Skylake-U:
 - 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
2017-05-15 15:00:17 +02:00
James Darnley
b5325c6711 avcodec/h264: use some 3 operand forms 2017-05-15 15:00:16 +02:00
James Darnley
060ba9e5e3 avcodec/h264: change RETs into REP_RETs where appropriate 2017-05-15 15:00:15 +02:00
Michael Niedermayer
fa8fd0808f avcodec/x86/vc1dsp_init: Fix build failure with --disable-optimizations and clang
compilers doing DCE at -O0 do not necessarily understand "complex" boolean expressions
Build succeeds with this change, this was the only failure

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-04-27 04:25:31 +02:00
Clément Bœsch
5be1440c74 Merge commit '0a35f128f3'
* commit '0a35f128f3':
  cabac: x86: Give optimizations header a more meaningful name

Merged-by: Clément Bœsch <u@pkh.me>
2017-04-08 14:30:13 +02:00
Ronald S. Bultje
83ae7e6350 x86/idctdsp_init: reindent. 2017-04-06 10:03:28 -04:00
Ronald S. Bultje
e0c205677f x86/simple_idct: add explicit sse2 simple_idct_put/add versions.
These use the mmx IDCT, but sse2 put/add_pixels_clamped implementations.
This way we don't need to use the ff_put/add_pixels_clamped function
pointers.
2017-04-06 10:03:28 -04:00
Ronald S. Bultje
2f0591cfa3 cavs: add a sse2 idct implementation.
This makes using the function pointer ff_add_pixels_clamped() unnecessary,
since we always know what the best implementation is at compile-time.
2017-04-06 10:03:28 -04:00
Ronald S. Bultje
c9d98c5649 cavs: convert idct from inline asm to yasm. 2017-04-06 10:03:27 -04:00
Ronald S. Bultje
b51d7d89f8 x86/xvididct: remove use of ff_put/add_pixels_clamped function pointer.
Since there's separate SSE2 implementations of xvid_idct_put/add, this
patch has no practical impact on performance.
2017-04-06 10:03:27 -04:00
James Almer
6171f178e7 x86/hevc_add_res: merge last remaining changes from 3d65359832
See https://lists.libav.org/pipermail/libav-devel/2016-October/079829.html
2017-03-31 20:49:45 -03:00
Clément Bœsch
1ea0df14c3 Merge commit '0361e4dcb4'
* commit '0361e4dcb4':
  h264_qpel: x86: Move function with only one instance out of template macro

Note: warning is present with clang.

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-03-31 09:44:04 +02:00
Ronald S. Bultje
f8c019944d vp9: re-split the decoder/format/dsp interface header files.
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Clément Bœsch
1c9f4b5078 lavc/vp9: split into vp9{block,data,mvs}
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
Michael Niedermayer
73fb40dc87 avcodec/x86/idctdsp: Remove duplicate include
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-03-26 19:17:30 +02:00
James Almer
ac42f08099 x86/hevc_add_res: merge missing changes from 3d65359832
Unrolling the loops triplicates the size of the assembled output
while not generating any gain in performance.
2017-03-24 11:24:18 -03:00
Clément Bœsch
3d65359832 Merge commit '6d5636ad9a'
* commit '6d5636ad9a':
  hevc: x86: Add add_residual() SIMD optimizations

See a6af4bf64d

This merge is only cosmetics (renames, space shuffling, etc).

The functionnal changes in the ASM are *not* merged:
- unrolling with %rep is kept
- ADD_RES_MMX_4_8 is left untouched: this needs investigation

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-24 12:33:25 +01:00
Clément Bœsch
40ac226014 lavc/x86/hevc: rename hevc_res_add to hevc_add_res
This will simplify incoming merge.
2017-03-24 11:45:23 +01:00
James Almer
bac44a5020 Merge commit 'b89804da9b'
* commit 'b89804da9b':
  x86: videodsp: Add parentheses to expression to work around warning

Merged-by: James Almer <jamrial@gmail.com>
2017-03-23 18:35:49 -03:00
James Almer
29db87af52 Merge commit '6be7944ee2'
* commit '6be7944ee2':
  x86: Add missing colons after assembly labels

Merged-by: James Almer <jamrial@gmail.com>
2017-03-23 18:05:27 -03:00