Commit graph

21463 commits

Author SHA1 Message Date
Anton Khirnov
522d850e68 h264_cavlc: check the value of run_before
Section 9.2.3.2 of the spec implies that run_before must not be larger
than zeros_left.

Fixes invalid reads with corrupted files.

CC: libav-stable@libav.org
Bug-Id: 1000
Found-By: Kamil Frankowicz
2017-03-12 20:42:13 +01:00
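
As an illustration of the constraint described in the commit above, here is a minimal sketch of validating each run_before against the remaining zeros_left budget. The function name `check_runs` and its array-based interface are assumptions made for this example; the decoder's real code reads these values from the bitstream.

```c
#include <stddef.h>

/* Hypothetical helper, not the decoder's actual code: walks a list of
 * run_before values and rejects any that exceed the zeros still left. */
int check_runs(const int *run_before, size_t count, int zeros_left)
{
    for (size_t i = 0; i < count; i++) {
        /* A run larger than the remaining zeros can only come from a
         * corrupted stream, so fail instead of indexing out of bounds. */
        if (run_before[i] < 0 || run_before[i] > zeros_left)
            return -1;
        zeros_left -= run_before[i];
    }
    return 0;
}
```

A caller would treat the -1 result as invalid data and stop decoding the block.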
Anton Khirnov
83b2b34d06 h2645_parse: use the bytestream2 API for packet splitting
The code does some nontrivial jumping around in the buffer, so it is
safer to use a checked API rather than do everything manually.

Fixes a bug in nalff parsing, where the length field is currently not
counted in the buffer size check, resulting in possible overreads with
invalid files.

CC: libav-stable@libav.org
Bug-Id: 1002
Found-By: Kamil Frankowicz
2017-03-12 20:42:12 +01:00
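
The bug class described in the commit above (a length field that is not counted in the size check) can be sketched with a small bounds-checked reader. This only imitates the idea of a checked byte-stream API; `ByteReader`, `bytes_left` and `next_unit` are hypothetical names and are not the bytestream2 interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checked reader: every read is validated against `end`. */
typedef struct ByteReader {
    const uint8_t *cur;
    const uint8_t *end;
} ByteReader;

static size_t bytes_left(const ByteReader *r)
{
    return (size_t)(r->end - r->cur);
}

/* Reads one length-prefixed unit. `length_size` is the width in bytes of
 * the big-endian length field (1 to 4). Returns the payload size, or -1
 * if either the length field or the payload does not fit in the buffer. */
int64_t next_unit(ByteReader *r, int length_size, const uint8_t **payload)
{
    uint32_t len = 0;

    /* The length field itself must be inside the buffer. */
    if (length_size < 1 || length_size > 4 ||
        bytes_left(r) < (size_t)length_size)
        return -1;
    for (int i = 0; i < length_size; i++)
        len = (len << 8) | *r->cur++;

    /* The payload is checked against what remains after the length field. */
    if (len > bytes_left(r))
        return -1;
    *payload = r->cur;
    r->cur  += len;
    return (int64_t)len;
}
```

Because the remaining size is recomputed after the length field has been consumed, the field is always accounted for in the check.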
Anton Khirnov
b76f6a76c6 h264dec: initialize field_started to 0 on each decode call
It might be incorrectly set to 1 if the previous call exited with an
error.

Bug-Id: 1019
CC: libav-stable@libav.org
2017-03-12 20:42:12 +01:00
Martin Storsjö
3a0d5e206d arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:30 +02:00
Martin Storsjö
98ee855ae0 arm: vp9itxfm: Template the quarter/half idct32 function
This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

If we are in the half case, we know we will need to do at least the
first three slices; we only need to check eob for the fourth one,
so we can hardcode the value to check against instead of loading it
from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:12 +02:00
Kieran Kunhya
5f794aa165 Add Cineform HD Decoder
Decodes YUV 4:2:2 10-bit and RGB 12-bit files.
Older files with more subbands, skips, Bayer, alpha not supported.

Further fixes and refactorings by Anton Khirnov <anton@khirnov.net>,
Diego Biurrun <diego@biurrun.de>, Vittorio Giovara <vittorio.giovara@gmail.com>

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2017-03-09 18:37:29 +01:00
Konda Raju
f6790b5e10 add initial QP value options
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2017-03-09 17:24:00 +01:00
wm4
8a60bba0ae avcodec: clarify some decoding/encoding API details
Make it clear that there is no timing-dependent behavior. In particular,
there is no state in which both input and output are denied, and where
you have to wait for a while yourself to make progress (apparently some
hardware decoders like to do this).

Avoid wording that makes references to time. It shouldn't be mistaken
for some kind of asynchronous API (like POSIX read() can return EAGAIN
if there is no new input yet). It's a state machine, so try to use
appropriate terms.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2017-03-09 17:07:24 +01:00
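
The state-machine property described in the commit above (input and output are never both denied) can be sketched with a toy one-slot codec. `ToyCodec`, `toy_send` and `toy_receive` are hypothetical and are not the avcodec API; the doubling of the input merely stands in for real work.

```c
#include <errno.h>

/* Toy codec with a single output slot. */
typedef struct ToyCodec {
    int has_item;
    int item;
} ToyCodec;

/* Accepts input only while the output slot is empty. */
int toy_send(ToyCodec *c, int in)
{
    if (c->has_item)
        return -EAGAIN;   /* state: output must be read first */
    c->item     = in * 2;
    c->has_item = 1;
    return 0;
}

/* Produces output only while the output slot is full. */
int toy_receive(ToyCodec *c, int *out)
{
    if (!c->has_item)
        return -EAGAIN;   /* state: new input is required */
    *out        = c->item;
    c->has_item = 0;
    return 0;
}
```

Whichever state the toy codec is in, exactly one of the two calls makes progress, so a caller never has to wait or retry the same call.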
Vittorio Giovara
b44bd7ee7f pixlet: Fix architecture-dependent code and values
The constants used in the decoder used floating point precision,
and this caused different values to be generated on different
architectures. Additionally on big endian machines, the fate test
would output bytes in native order, which is different from the one
hardcoded in the test.

So, eradicate floating point numbers and use fixed point (32.32)
arithmetic everywhere, replacing constants with precomputed integer
values, and force the pixel format output to be the same in the fate
test.

Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
2017-03-06 18:15:02 -05:00
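
A rough sketch of what 32.32 fixed-point scaling with precomputed integer constants can look like follows. The constants, the name `fix_scale` and the truncating rounding are assumptions chosen for illustration; they are not taken from the pixlet decoder.

```c
#include <stdint.h>

#define FRAC_BITS 32

/* 0.5 and 1.5 in 32.32 fixed point, written as integers so that no
 * floating point is evaluated at run time (illustrative values only). */
#define FIX_HALF     ((int64_t)1 << (FRAC_BITS - 1))
#define FIX_ONE_HALF ((int64_t)3 << (FRAC_BITS - 1))

/* Multiplies a 16-bit sample by a 32.32 coefficient and returns the
 * integer part, truncated toward zero. The coefficient magnitude must
 * stay below 2^47 so the 64-bit product cannot overflow. */
int32_t fix_scale(int16_t sample, int64_t coeff)
{
    return (int32_t)((sample * coeff) / ((int64_t)1 << FRAC_BITS));
}
```

Since only integer operations are involved, the result is the same on every architecture.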
Diego Biurrun
6eef263aca x86: Merge align directives into SECTION_RODATA declarations where possible
2017-03-05 14:26:06 +01:00
Ganapathy Kasi
3303f86467 nvenc: Remove qmin and qmax constraints for nvenc vbr
qmin and qmax are not necessary for nvenc vbr.

Also fix the use of 2-pass VBR mode for the slow preset through the ctx->flag NVENC_TWO_PASSES.

Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-03-04 08:23:28 +01:00
Paul B Mahol
aba5b94859 Add Apple Pixlet decoder
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
2017-03-01 11:52:29 -05:00
Diego Biurrun
39e208f4d4 build: Generalize yasm/nasm-related variable names
None of them are specific to the YASM assembler.
2017-03-01 10:18:15 +01:00
Diego Biurrun
fde7ee8710 x86: hevc: Add missing colons after assembly labels
This fixes several warnings of the sort
warning: label alone on a line without a colon might be in error
2017-03-01 09:23:42 +01:00
Michael Niedermayer
d7b2bb5391 h264_sei: Check actual presence of picture timing SEI message
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
2017-02-28 10:32:50 -05:00
Ben Chang
d8f36a6aa3 nvenc: Fix the preset mapping list
The map is a sparse array and does not need an empty element to terminate
it.

The empty element is stored after the last one inserted in the list,
overwriting whichever element was next with zeros.

Bug-Id: 1029

Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-02-28 11:54:02 +01:00
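
The problem described in the commit above follows from how C designated initializers work: a positional initializer is placed right after the most recently designated index. Below is a minimal sketch with hypothetical names (`MapEntry`, `preset_map`, `preset_lookup` and the `MAP_*` identifiers are not from the nvenc wrapper); the problematic terminator is shown only in a comment.

```c
#include <stddef.h>

/* Hypothetical preset identifiers used as array indices. */
enum { MAP_DEFAULT, MAP_SLOW, MAP_FAST, MAP_COUNT };

typedef struct MapEntry {
    int guid;    /* stand-in for the real per-preset data */
    int flags;
} MapEntry;

/* Sparse table indexed by preset; the entries are not listed in index order. */
static const MapEntry preset_map[MAP_COUNT] = {
    [MAP_DEFAULT] = { 100, 0 },
    [MAP_FAST]    = { 102, 0 },
    [MAP_SLOW]    = { 101, 1 },
    /* A trailing positional "{ 0 }" here would initialize index
     * MAP_SLOW + 1, which is MAP_FAST, and replace it with zeros. */
};

/* Bounds-checked lookup; the array size makes a terminator unnecessary. */
const MapEntry *preset_lookup(int preset)
{
    if (preset < 0 || preset >= MAP_COUNT)
        return NULL;
    return &preset_map[preset];
}
```

Sizing the array by the enum and indexing it directly means the lookup never needs to scan for an end marker.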
Anton Khirnov
984736dd9e lavc: make sure not to return EAGAIN from codecs
This error is treated specially by the API.

CC: libav-stable@libav.org
2017-02-25 09:57:44 +01:00
Anton Khirnov
b2788fe934 svq3: fix the slice size check
Currently it incorrectly compares bits with bytes.

Also, move the check right before where it's relevant, so that the
correct number of remaining bits is used.

CC: libav-stable@libav.org
2017-02-25 09:57:43 +01:00
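
A unit mismatch of the kind described in the commit above can be avoided by converting both sides to the same unit before comparing. A minimal sketch, where `slice_fits` and its parameters are hypothetical and not the svq3 decoder's code:

```c
#include <stddef.h>

/* Returns 1 if a slice of `slice_bytes` bytes fits into the `bits_left`
 * bits that remain in the bitstream reader, 0 otherwise. */
int slice_fits(size_t slice_bytes, size_t bits_left)
{
    /* Convert the remaining bits to whole bytes so both sides use the
     * same unit; dividing also avoids overflow from multiplying by 8. */
    return slice_bytes <= bits_left / 8;
}
```

With 32 bits left, a naive `slice_bytes <= bits_left` comparison would accept an 8-byte slice, while the version above rejects it.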
John Stebbins
248dc5c164 h264dec: fix dropped initial SEI recovery point
2017-02-24 08:24:13 -07:00
Martin Storsjö
b8f66c0838 aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:04:34 +02:00
Martin Storsjö
08074c092d arm: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:04:33 +02:00
Martin Storsjö
09eb88a12e aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:04:32 +02:00
Martin Storsjö
de06bdfe6c arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:04:31 +02:00
Martin Storsjö
65aa002d54 aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:03:44 +02:00
Martin Storsjö
402546a172 arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:03:43 +02:00
Martin Storsjö
575e31e931 arm: vp9lpf: Implement the mix2_44 function with one single filter pass
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.

The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.

Before:                          Cortex A7      A8     A9     A53
vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
After:
vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:03:09 +02:00
Martin Storsjö
3bf9c48320 aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:03:00 +02:00
Martin Storsjö
c582cb8537 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-24 00:02:36 +02:00
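
The reasoning in the commit above can be sketched in scalar C: if the threshold E never reaches 255, a sum clamped to 255 compares against E the same way as the exact sum. The edge formula used here (2*|p0-q0| + |p1-q1|/2 <= E) is an assumption about the loop filter condition, and all function names are hypothetical; the real code is NEON assembly.

```c
#include <stdint.h>
#include <stdlib.h>

/* Saturating 8-bit addition, the scalar analogue of a saturating
 * vector add. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Edge test kept entirely within 8 bits. */
int edge_ok_sat(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    uint8_t d0  = (uint8_t)abs(p0 - q0);
    uint8_t d1  = (uint8_t)(abs(p1 - q1) >> 1);
    uint8_t sum = sat_add_u8(sat_add_u8(d0, d0), d1);
    return sum <= E;
}

/* Reference test using wider integers. */
int edge_ok_wide(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    return 2 * abs(p0 - q0) + abs(p1 - q1) / 2 <= E;
}
```

Whenever the exact sum exceeds 255 the saturated sum is 255, which is still above any E up to 193, so both versions agree.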
Diego Biurrun
ed6a891c36 Place attribute_deprecated in the right position for struct declarations
libavcodec/vaapi.h:58:1: warning: attribute 'deprecated' is ignored, place it after "struct" to apply attribute to type declaration [-Wignored-attributes]
2017-02-23 12:23:20 +01:00
Diego Biurrun
00b160af11 nvenc: Fix nvec vs. nvenc typo
2017-02-20 09:50:03 +01:00
Mark Thompson
7cb9296db8 webp: Fix alpha decoding
This was broken by 4e528206bc - the webp
decoder was assuming that it could set the output pixfmt of the vp8
decoder directly, but after that change it no longer could because
ff_get_format() was used instead.  This adds an internal get_format()
callback to webp use of the vp8 decoder to override the pixfmt
appropriately.
2017-02-18 19:53:20 +00:00
Mark Thompson
17aeee5832 vaapi_encode: Discard output buffer if picture submission fails
Previously this was leaking, though it actually hit an assert making
sure that the buffer had already been cleared when freeing the picture.
2017-02-16 20:58:42 +00:00
Martin Storsjö
030de53e9c libopenh264dec: Let the framework use the h264_mp4toannexb bitstream filter
This avoids a lot of boilerplate code within the decoder wrapper itself.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-15 23:05:58 +02:00
Mark Thompson
5dd9a4b88b vaapi: Implement device-only setup
In this case, the user only supplies a device and the frame context
is allocated internally by lavc.
2017-02-13 21:44:43 +00:00
Mark Thompson
44f2eda39f lavc: Add device context field to AVCodecContext
For use by codec implementations which can allocate frames internally.
2017-02-13 20:14:27 +00:00
Martin Storsjö
07b5136c48 aarch64: vp9lpf: Fix broken indentation/vertical alignment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-12 21:57:23 +02:00
Martin Storsjö
b0806088d3 aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 22:54:18 +02:00
Martin Storsjö
e18c39005a arm: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 22:54:01 +02:00
Luca Barbato
9c2d36fcaf dv: Convert to the new bitstream reader
2017-02-11 20:29:44 +01:00
Luca Barbato
ba30b74686 aac: Validate the sbr sample rate before using the value
Avoid a floating point exception.

Bug-Id: 1027
CC: libav-stable@libav.org
2017-02-11 20:23:11 +01:00
Anton Khirnov
f44ec22e09 lavc: use av_cpu_max_align() instead of hardcoding alignment requirements
2017-02-11 11:37:45 +01:00
Martin Storsjö
435cd7bc99 arm: vp9lpf: Use orrs instead of orr+cmp
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:44:04 +02:00
Martin Storsjö
e1f9de86f4 arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see a gain in every case.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:43:59 +02:00
Martin Storsjö
3fcf788fbb aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                           Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:31:58 +02:00
Martin Storsjö
a76bf8cf12 arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                            Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:31:52 +02:00
Martin Storsjö
388e0d2515 aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
No measured speedup on a Cortex A53, but other cores might benefit.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:08:50 +02:00
Martin Storsjö
fea92a4b57 arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
Before:                    Cortex A7      A8     A9     A53
vp9_put_8tap_smooth_4h_neon:   378.1   273.2  340.7   229.5
After:
vp9_put_8tap_smooth_4h_neon:   352.1   222.2  290.5   229.5

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:08:37 +02:00
Martin Storsjö
5e0c2158fb aarch64: vp9mc: Simplify the extmla macro parameters
Fold the field lengths into the macro.

This makes the macro invocations much more readable, when the
lines are shorter.

This also makes it easier to use only half the registers within
the macro.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-11 00:08:29 +02:00
Martin Storsjö
bc25897630 utvideodec: Add a missing include
This include was missing from 77c23704c7; adding it fixes the build.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-02-10 09:31:49 +02:00
Timo Rothenpieler
a52976c0fe nvenc: make gpu indices independent of supported capabilities
Do not allocate a CUDA context for every available gpu.

Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-02-09 23:29:32 +01:00