This fixes compiling with MSVC for aarch64 after
510999f6b0.
While MSVC does do dead code elimintation for function references
within e.g. "if (0)", it doesn't do that for functions referenced
within a static function, even if that static function itself ends
up not used.
A reproduction example:
void missing(void);
void (*func_ptr)(void);
static void wrapper(void) {
missing();
}
void init(int cpu_flags) {
if (0) {
func_ptr = wrapper;
}
}
If "wrapper" is entirely unreferenced, then MSVC doesn't produce
any reference to the symbol "missing". Also, if we do
"func_ptr = missing;" then the reference to missing also is
eliminated. But for the case of referencing the function in a
static function, even if the reference to the static function can
be eliminated, then MSVC does keep the reference to the symbol.
Rewrite ff_hevc_put_hevc_qpel_h16_8_neon and h32 to use byte-domain
widening multiply (umull/umlal/umlsl via calc_qpelb/calc_qpelb2 macros)
instead of the previous int16-domain approach (uxtl + mul/mla).
The byte-domain approach eliminates the uxtl expansion step and halves
the ext stride (1 byte vs 2 bytes per tap), reducing per-row instruction
count from ~32 to ~23. The functions are also inlined, removing bl/ret
call overhead.
This benefits all HV-path callers (hv/uni_hv/bi_hv/uni_w_hv/bi_w_hv)
at widths 16/32/48/64.
checkasm benchmarks on Apple M4 (5-run average):
H-pass standalone (NEON):
h16: 34.0 -> 24.4 cycles (1.39x speedup)
h32: 132.0 -> 95.0 cycles (1.39x speedup)
h64: 521.8 -> 373.9 cycles (1.40x speedup)
HV compound paths geometric mean speedup (NEON, width >= 16):
qpel_hv: 1.144x (4 functions)
qpel_bi_hv: 1.158x (4 functions)
qpel_uni_hv: 1.188x (4 functions)
qpel_uni_w_hv: 1.158x (3 functions)
Overall: 1.162x (15 functions)
VVC qpel h16/h32 are separated into self-contained functions retaining
the int16-domain approach, as VVC filters have arbitrary coefficients
incompatible with the hardcoded sign pattern in calc_qpelb.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
The VVC qpel h16 and h32 functions had a redundant 'mov mx, x30'
instruction. The first one was placed before vvc_load_filter had
finished using mx (the filter pointer argument), making it a dead
store immediately overwritten by the second 'mov mx, x30'.
Remove the first instance and reorder so that 'sub src, src, #3'
comes before 'mov mx, x30', ensuring the filter pointer in mx is
fully consumed by vvc_load_filter before being overwritten with the
link register.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
For bi-predicted weighted averages, only the sum
of the two offsets is ever used, so add the two early.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The ldrsw instruction requires immediate offset with # prefix.
This fixes the syntax error introduced in commit 26752368f0
(aarch64/h26x: Add put_hevc_pel_bi_w_pixels) where the
load_bi_w_pixels_param macro was added.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
This file is excempt from the indent checker script, as there
are a few other bits in it that the script wants to reformat
into slightly worse form, or which might not warrant being
reformatted.
But these instructions should indeed be indented this way.
These names are a remnant of dsputil when all the DSP functions
from all codecs were part of DSPcontext.
Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Implement NEON optimization for HEVC dequant at 8-bit depth.
The NEON implementation uses srshr (Signed Rounding Shift Right) which
does both the add with offset and right shift in a single instruction.
Optimization details:
- 4x4 (16 coeffs): Single load-process-store sequence
- 8x8 (64 coeffs): Fully unrolled, no loop overhead
- 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency
- 32x32 (1024 coeffs): Pipelined with all available NEON registers
Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_8_c: 11.3 ( 1.00x)
hevc_dequant_4x4_8_neon: 6.3 ( 1.78x)
hevc_dequant_8x8_8_c: 33.9 ( 1.00x)
hevc_dequant_8x8_8_neon: 6.6 ( 5.11x)
hevc_dequant_16x16_8_c: 153.8 ( 1.00x)
hevc_dequant_16x16_8_neon: 9.0 (17.02x)
hevc_dequant_32x32_8_c: 78.1 ( 1.00x)
hevc_dequant_32x32_8_neon: 31.9 ( 2.45x)
Note on Performance Anomaly:
The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8)
is due to Clang auto-vectorizing only for sizes >= 32x32.
Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2)
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
The mismatch between neon and C functions can be reproduced
using the following bitstream and command line on aarch64 devices.
wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264
./ffmpeg -cpuflags 0 -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Before this commit, the input in get_pixels and get_pixels_unaligned
has been treated inconsistenly:
- The generic code treated 9, 10, 12 and 14 bits as 16bit input
(these bits correspond to what FFmpeg's dsputils supported),
everything with <= 8 bits as 8 bit and everything else as 8 bit
when used via AVDCT (which exposes these functions and purports
to support up to 14 bits).
- AARCH64, ARM, PPC and RISC-V, x86 ignore this AVDCT special case.
- RISC-V also ignored the restriction to 9, 10, 12 and 14 for its
16bit check and treated everything > 8 bits as 16bit.
- The mmi MIPS code treats everything as 8 bit when used via
AVDCT (this is certainly broken); otherwise it checks for <= 8 bits.
The msa MIPS code behaves like the generic code.
This commit changes this to treat 9..16 bits as 16 bit input,
everything else as 8 bit (the former because it makes sense,
the latter to preserve the behaviour for external users*).
*: The only internal user of AVDCT (the spp filter) always
uses 8, 9 or 10 bits.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Many of the fields of MpegEncContext (which is also used by decoders)
are actually only used by encoders. Therefore this commit adds
a new encoder-only structure and moves all of the encoder-only
fields to it except for those which require more explicit
synchronisation between the main slice context and the other
slice contexts. This synchronisation is currently mainly provided
by ff_update_thread_context() which simply copies most of
the main slice context over the other slice contexts. Fields
which are moved to the new MPVEncContext no longer participate
in this (which is desired, because it is horrible and for the
fields b) below wasteful) which means that some fields can only
be moved when explicit synchronisation code is added in later commits.
More explicitly, this commit moves the following fields:
a) Fields not copied by ff_update_duplicate_context():
dct_error_sum and dct_count; the former does not need synchronisation,
the latter is synchronised in merge_context_after_encode().
b) Fields which do not change after initialisation (these fields
could also be put into MPVMainEncContext at the cost of
an indirection to access them): lambda_table, adaptive_quant,
{luma,chroma}_elim_threshold, new_pic, fdsp, mpvencdsp, pdsp,
{p,b_forw,b_back,b_bidir_forw,b_bidir_back,b_direct,b_field}_mv_table,
[pb]_field_select_table, mb_{type,var,mean}, mc_mb_var, {min,max}_qcoeff,
{inter,intra}_quant_bias, ac_esc_length, the *_vlc_length fields,
the q_{intra,inter,chroma_intra}_matrix{,16}, dct_offset, mb_info,
mjpeg_ctx, rtp_mode, rtp_payload_size, encode_mb, all function
pointers, mpv_flags, quantizer_noise_shaping,
frame_reconstruction_bitfield, error_rate and intra_penalty.
c) Fields which are already (re)set explicitly: The PutBitContexts
pb, tex_pb, pb2; dquant, skipdct, encoding_error, the statistics
fields {mv,i_tex,p_tex,misc,last}_bits and i_count; last_mv_dir,
esc_pos (reset when writing the header).
d) Fields which are only used by encoders not supporting slice
threading for which synchronisation doesn't matter: esc3_level_length
and the remaining mb_info fields.
e) coded_score: This field is only really used when FF_MPV_FLAG_CBP_RD
is set (which implies trellis) and even then it is only used for
non-intra blocks. For these blocks dct_quantize_trellis_c() either
sets coded_score[n] or returns a last_non_zero value of -1
in which case coded_score will be reset in encode_mb_internal().
Therefore no old values are ever used.
The MotionEstContext has not been moved yet.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Reduce binary size at the same time. The performance compared to clang -O3
is the same.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
these three at the end.
Before and after:
A78
ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x)
ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x)
A72
ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x)
ac3_sum_square_bufferfly_int32_neon: 527.3 ( 1.92x)
Signed-off-by: Martin Storsjö <martin@martin.st>
This change removes one extra floating point operation and simplifies
load operations at the beginning of the loop by using dedicated register
for each of the 5 pointers and interleaving it with calculations. The
first case seems to be a bit slower, but the performance increase is
substantial in the other two.
A78 before:
postfilter_15_neon: 1684.8 ( 4.23x)
postfilter_512_neon: 1395.5 ( 5.10x)
postfilter_1022_neon: 1357.0 ( 5.25x)
After:
postfilter_15_neon: 1742.2 ( 4.09x)
postfilter_512_neon: 1169.8 ( 6.09x)
postfilter_1022_neon: 1160.0 ( 6.12x)
A72 before:
postfilter_15_neon: 3144.8 ( 2.39x)
postfilter_512_neon: 3141.2 ( 2.39x)
postfilter_1022_neon: 3230.0 ( 2.33x)
After:
postfilter_15_neon: 2847.8 ( 2.64x)
postfilter_512_neon: 2877.8 ( 2.61x)
postfilter_1022_neon: 2837.2 ( 2.65x)
x13s before:
postfilter_15_neon: 1615.4 ( 2.61x)
postfilter_512_neon: 963.1 ( 4.39x)
postfilter_1022_neon: 963.6 ( 4.39x)
After:
postfilter_15_neon: 1749.6 ( 2.41x)
postfilter_512_neon: 707.1 ( 5.97x)
postfilter_1022_neon: 706.1 ( 5.99x)
Signed-off-by: Martin Storsjö <martin@martin.st>
This reduces the amount the horizontal filters read beyond the filter
width to a consistent 1 pixel. The data is not used so this is usually
not noticeable. It becomes a problem when the application allocates
frame buffers only for the aligned picture size and the end of it is at
a page boundary. This happens for picture sizes which are a multiple of
the page size like 1280x640. The frame buffer allocation is based on
its most likely done via mmap + MAP_ANONYMOUS so start and end of the
buffer are page aligned and the previous and next page are not
necessarily mapped.
Under these conditions like seen by Firefox a read beyond the end of the
buffer results in a segfault.
After the over-read is reduced to a single pixel it's reasonable to use
VP9's emulated edge motion compensation for this.
Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185
Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
This fixes these build errors:
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
error A2518: operand 2: Memory offset must be aligned
ldr s5, [x1, #1]
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
error A2518: operand 2: Memory offset must be aligned
ldr d7, [x1, #2]
Signed-off-by: Martin Storsjö <martin@martin.st>
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840
Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612
The mismatch between neon and C functions can also be reproduced using the following bitstream and command line.
wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>