Anton Khirnov
c8c2dfbc37
lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
...
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Martin Storsjö
7168adedbc
libavcodec: aarch64: Add a NEON implementation of pixblockdsp
...
                          Cortex A53     A72     A73
get_pixels_c:                  140.7    87.7    72.5
get_pixels_neon:                46.0    20.0    19.5
get_pixels_unaligned_c:        140.7    87.7    73.0
get_pixels_unaligned_neon:      49.2    20.2    26.2
diff_pixels_c:                 209.7   133.7   138.7
diff_pixels_neon:               54.2    31.7    23.5
diff_pixels_unaligned_c:       209.7   134.2   139.0
diff_pixels_unaligned_neon:     68.0    27.7    41.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos
34d7c8d942
lavc/aarch64: Remove unneeded file vp9mc_aarch64.c
2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos
951bd25572
lavc/aarch64: Fix suffix of new file vp9mc_aarch64.
2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos
213c796561
lavc/aarch64: Fix compilation with --disable-neon
...
Fixes ticket #8565.
2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos
9a21754904
lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.
...
Fixes part of ticket #8565.
2020-03-11 14:16:40 +01:00
Lynne
aac382e9e5
aarch64/opusdsp: do not clobber register v8
...
Part of v8-v15 (the lower 64 bits of each register) needs to be preserved across calls.
2019-08-15 13:29:22 +01:00
Lynne
f62ee527cb
aarch64/asm-offsets: remove old CELT offsets
...
They're not used and they're incorrect.
2019-05-14 23:41:24 +01:00
Lynne
4d2f62150d
aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
...
153372 UNITS in postfilter_c, 65536 runs, 0 skips
73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup
80591 UNITS in deemphasis_c, 131072 runs, 0 skips
43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x speedup
Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)
Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;
for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
    state = y[3];
    y += 4;
    x += 4;
}
Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
2019-04-10 01:08:54 +02:00
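The unrolling quoted in the opusdsp commit above can be checked against the plain one-pole recurrence y[n] = x[n] + c*y[n-1] that it expands. The following is a minimal sketch, not the FFmpeg code: EMPH_COEFF is an illustrative stand-in for CELT_EMPH_COEFF, and the function names and the test signal are hypothetical.

```c
#include <assert.h>

#define EMPH_COEFF 0.85f /* illustrative stand-in for CELT_EMPH_COEFF (assumption) */

/* Reference: one-pole recurrence y[n] = x[n] + c*y[n-1]; returns the final state. */
static float deemphasis_ref(float *y, const float *x, float state, int len)
{
    for (int i = 0; i < len; i++) {
        state = x[i] + EMPH_COEFF * state;
        y[i]  = state;
    }
    return state;
}

/* Four outputs per iteration, each expressed only in terms of the state
 * carried in from the previous group and the current inputs. */
static float deemphasis_unrolled(float *y, const float *x, float state, int len)
{
    const float c1 = EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    assert(len % 4 == 0);
    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*state;
        y[1] = x[1] + c2*state + c1*x[0];
        y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}

/* Largest absolute difference between both versions on a deterministic
 * test signal of `len` samples (len <= 256, multiple of 4). */
static float deemphasis_max_diff(int len)
{
    float x[256], a[256], b[256], max = 0.0f;

    assert(len > 0 && len <= 256);
    for (int i = 0; i < len; i++)
        x[i] = (float)((i * 37) % 17 - 8) / 8.0f; /* values in [-1, 1] */

    deemphasis_ref(a, x, 0.5f, len);
    deemphasis_unrolled(b, x, 0.5f, len);
    for (int i = 0; i < len; i++) {
        float d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (d > max)
            max = d;
    }
    return max;
}
```

Because the four outputs of a group no longer depend on each other, they can be computed in parallel lanes; only the carried state serializes consecutive groups. The two versions differ only by floating-point rounding, so deemphasis_max_diff() is expected to return a value close to zero rather than exactly zero.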
James Almer
92219ef4ac
Merge commit '186bd30aa3'
...
* commit '186bd30aa3':
h264/arm64: implement missing 4:2:2 chroma loop filter neon functions
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:29:41 -03:00
James Almer
5c363d3e59
Merge commit '7e42d5f0ab'
...
* commit '7e42d5f0ab':
aarch64: vp8: Optimize vp8_idct_add_neon for aarch64
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:22:29 -03:00
James Almer
409e684e79
Merge commit '49f9c4272c'
...
* commit '49f9c4272c':
aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:21:46 -03:00
James Almer
fbd607dd56
Merge commit '37394ef01b'
...
* commit '37394ef01b':
aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:20:05 -03:00
James Almer
34a0a9746b
Merge commit 'e39a9212ab'
...
* commit 'e39a9212ab':
aarch64: vp8: Port bilin functions from arm version
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:18:42 -03:00
James Almer
2ac399d7fa
Merge commit '58d1549227'
...
* commit '58d1549227':
aarch64: vp8: Port epel4 functions from arm version
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:17:33 -03:00
James Almer
c6892f59eb
Merge commit 'cc7ba00c35'
...
* commit 'cc7ba00c35':
aarch64: vp8: Port missing epel8 functions from arm version
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:16:43 -03:00
James Almer
79025da3f2
Merge commit '52c9b0a6c0'
...
* commit '52c9b0a6c0':
aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:14:40 -03:00
James Almer
39278ff0de
Merge commit 'c513fcd7d2'
...
* commit 'c513fcd7d2':
aarch64: vp8: Fix a typo in a comment
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:13:32 -03:00
James Almer
4f9a8d3fe2
Merge commit 'f1011ea28a'
...
* commit 'f1011ea28a':
aarch64: vp8: Reorder the function pointer inits to match the arm original
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:09:11 -03:00
James Almer
398000abcf
Merge commit '85bfaa4949'
...
* commit '85bfaa4949':
aarch64: vp8: Use the proper aarch64 form for conditional branches
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:06:43 -03:00
James Almer
a2ae381b5a
Merge commit '0801853e64'
...
* commit '0801853e64':
libavcodec: vp8 neon optimizations for aarch64
See 833fed5253
Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:05:52 -03:00
Janne Grunau
186bd30aa3
h264/arm64: implement missing 4:2:2 chroma loop filter neon functions
2019-02-27 21:57:05 +01:00
Carl Eugen Hoyos
7e4d3dbe18
lavc/aarch64/h264dsp_init: Only use neon horizontal intra loopfilter for 4:2:0.
2019-02-20 23:56:21 +01:00
James Almer
aa844dc46f
aarch64/h264dsp: change loop filter stride argument to ptrdiff_t
...
This was missed in d5d699ab6e
Signed-off-by: James Almer <jamrial@gmail.com>
2019-02-20 19:38:46 -03:00
James Almer
e4e04dce1f
Merge commit '28a8b5413b'
...
* commit '28a8b5413b':
h264/aarch64: add intra loop filter neon asm
Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:42:01 -03:00
James Almer
4dc1f06f0c
Merge commit '846c3d6aca'
...
* commit '846c3d6aca':
h264/aarch64: optimize neon loop filter
Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:41:03 -03:00
James Almer
5ca7eb36b7
Merge commit 'bb515e3a73'
...
* commit 'bb515e3a73':
h264/aarch64: sign extend int stride in loop filter asm
Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 14:50:37 -03:00
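Why an int stride needs explicit sign extension before it is used as a 64-bit address offset can be illustrated in plain C. This is a sketch of the arithmetic only, under the assumption that the upper half of the register holding a 32-bit argument is not guaranteed to contain a sign extension; the function names are hypothetical and nothing here is taken from the FFmpeg sources.

```c
#include <stdint.h>

/* Offset obtained if the upper 32 bits of the register happen to be zero:
 * a negative stride turns into a large positive offset. */
static int64_t stride_zero_extended(int stride)
{
    return (int64_t)(uint32_t)stride;
}

/* Offset after an explicit sign extension (the effect of an instruction
 * such as sxtw): the negative stride stays negative. */
static int64_t stride_sign_extended(int stride)
{
    return (int64_t)stride;
}
```

For positive strides both functions agree, which is why the problem only surfaces with negative strides. Declaring the argument as ptrdiff_t, as in the h264dsp commit aa844dc46f above, avoids the question because the full 64-bit value is passed.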
Martin Storsjö
c8bc9d1380
aarch64: vp8: Move the vp8dsp makefile entries to the right places
...
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.
These functions are part of a generic DSP subsystem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit b4b27dce95)
2019-02-19 23:43:17 +02:00
Martin Storsjö
fecf75a5c4
aarch64: vp8: Remove superfluous includes
...
This fixes building with MSVC, which lacks unistd.h.
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit ad32f7b126)
2019-02-19 23:42:16 +02:00
Martin Storsjö
7ddfa5e908
aarch64: vp8: Fix assembling with armasm64
...
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 2eeac79936)
2019-02-19 23:42:03 +02:00
Martin Storsjö
c950beb68d
aarch64: vp8: Fix assembling with clang
...
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).
The movrel macro invocations need to pass the offset via a separate
parameter. Mach-O and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a
parenthesized expression).
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 26d7af4c38)
2019-02-19 23:41:47 +02:00
Martin Storsjö
7e42d5f0ab
aarch64: vp8: Optimize vp8_idct_add_neon for aarch64
...
The previous version was a pretty exact translation of the arm
version. This version does do some unnecessary arithmetic (it does
more operations on vectors that are only half filled; it does 4
uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
of packing data together (which could be done for free in the arm
version).
This gives a decent speedup on Cortex A53, a minor speedup on
A72 and a very minor slowdown on Cortex A73.
Before:          Cortex A53     A72     A73
vp8_idct_add_neon:     79.7    67.5    65.0
After:
vp8_idct_add_neon:     67.7    64.8    66.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:28 +02:00
Martin Storsjö
49f9c4272c
aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
...
The original arm version didn't do saturation here. This probably
doesn't make any difference for performance, but reduces the
differences.
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:24 +02:00
Martin Storsjö
37394ef01b
aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2
...
This makes it similar to put_epel16_v6, and gives a large speedup
on Cortex A53, a minor speedup on A72 and a very minor slowdown on
A73.
Before:                 Cortex A53     A72     A73
vp8_put_epel16_h6v6_neon:   2211.4  1586.5  1431.7
After:
vp8_put_epel16_h6v6_neon:   1736.9  1522.0  1448.1
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:21 +02:00
Martin Storsjö
e39a9212ab
aarch64: vp8: Port bilin functions from arm version
...
                       Cortex A53     A72     A73
vp8_put_bilin4_h_c:         303.8   102.2   161.8
vp8_put_bilin4_h_neon:      100.0    40.9    41.2
vp8_put_bilin4_hv_c:        322.8   201.0   305.9
vp8_put_bilin4_hv_neon:     156.8    72.6    77.0
vp8_put_bilin4_v_c:         304.7   101.7   166.5
vp8_put_bilin4_v_neon:       82.7    41.2    33.0
vp8_put_bilin8_h_c:        1192.7   352.5   623.8
vp8_put_bilin8_h_neon:      213.5    70.2    87.8
vp8_put_bilin8_hv_c:       1098.6   769.2  1041.9
vp8_put_bilin8_hv_neon:     324.0   123.5   146.0
vp8_put_bilin8_v_c:        1193.9   350.4   617.7
vp8_put_bilin8_v_neon:      183.9    60.7    64.7
vp8_put_bilin16_h_c:       2353.1   671.2  1223.3
vp8_put_bilin16_h_neon:     261.9   140.7   145.0
vp8_put_bilin16_hv_c:      2453.2  1470.9  2355.2
vp8_put_bilin16_hv_neon:    383.9   196.0   217.0
vp8_put_bilin16_v_c:       2349.3   669.8  1251.2
vp8_put_bilin16_v_neon:     202.9   110.7    96.2
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:14 +02:00
Martin Storsjö
58d1549227
aarch64: vp8: Port epel4 functions from arm version
...
                       Cortex A53     A72     A73
vp8_put_epel4_h4_c:         631.4   291.7   367.8
vp8_put_epel4_h4_neon:      241.0   131.0   155.7
vp8_put_epel4_h4v4_c:       967.5   529.3   667.7
vp8_put_epel4_h4v4_neon:    429.3   241.8   279.7
vp8_put_epel4_h4v6_c:      1374.7   657.5   864.5
vp8_put_epel4_h4v6_neon:    515.5   295.5   334.7
vp8_put_epel4_h6_c:         851.0   421.0   486.0
vp8_put_epel4_h6_neon:      321.5   195.0   217.7
vp8_put_epel4_h6v4_c:      1111.3   621.1   781.2
vp8_put_epel4_h6v4_neon:    539.2   328.0   365.3
vp8_put_epel4_h6v6_c:      1561.3   763.3   999.7
vp8_put_epel4_h6v6_neon:    645.5   401.0   434.7
vp8_put_epel4_v4_c:         663.8   298.3   357.0
vp8_put_epel4_v4_neon:      116.0    81.5    72.5
vp8_put_epel4_v6_c:         870.5   437.0   507.4
vp8_put_epel4_v6_neon:      147.7   108.8    92.0
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:11 +02:00
Martin Storsjö
cc7ba00c35
aarch64: vp8: Port missing epel8 functions from arm version
...
                     Cortex A53     A72     A73
vp8_put_epel8_h4_c:      2594.8  1159.6  1374.8
vp8_put_epel8_h4_neon:    506.4   244.2   314.0
vp8_put_epel8_h6_c:      3445.8  1677.1  1811.3
vp8_put_epel8_h6_neon:    634.4   371.7   433.0
vp8_put_epel8_v4_c:      2614.0  1174.8  1378.0
vp8_put_epel8_v4_neon:    321.0   221.7   235.8
vp8_put_epel8_v6_c:      3635.5  1703.0  2079.2
vp8_put_epel8_v6_neon:    416.9   317.0   295.5
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:08 +02:00
Martin Storsjö
52c9b0a6c0
aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
...
                       Cortex A53     A72     A73
vp8_luma_dc_wht_c:          115.7    75.7    90.7
vp8_luma_dc_wht_neon:        60.7    41.2    45.7
vp8_idct_dc_add4uv_c:       376.1   262.9   282.5
vp8_idct_dc_add4uv_neon:     52.0    29.0    37.0
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:04 +02:00
Martin Storsjö
c513fcd7d2
aarch64: vp8: Fix a typo in a comment
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:00 +02:00
Martin Storsjö
f1011ea28a
aarch64: vp8: Reorder the function pointer inits to match the arm original
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:56 +02:00
Martin Storsjö
b4b27dce95
aarch64: vp8: Move the vp8dsp makefile entries to the right places
...
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.
These functions are part of a generic DSP subsystem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:53 +02:00
Martin Storsjö
ad32f7b126
aarch64: vp8: Remove superfluous includes
...
This fixes building with MSVC, which lacks unistd.h.
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:50 +02:00
Martin Storsjö
85bfaa4949
aarch64: vp8: Use the proper aarch64 form for conditional branches
...
The previous form also does seem to assemble on current tools,
but I think it might fail on some older aarch64 tools.
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:47 +02:00
Martin Storsjö
2eeac79936
aarch64: vp8: Fix assembling with armasm64
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:44 +02:00
Martin Storsjö
26d7af4c38
aarch64: vp8: Fix assembling with clang
...
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).
The movrel macro invocations need to pass the offset via a separate
parameter. Mach-O and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a
parenthesized expression).
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:41 +02:00
Magnus Röös
0801853e64
libavcodec: vp8 neon optimizations for aarch64
...
Partial port of the ARM NEON code to aarch64.
Benchmarks from fate:
benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
- vp8dsp.idct [OK]
- vp8dsp.mc [OK]
- vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8
This contains a fix to include guards by Carl Eugen Hoyos.
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:33 +02:00
Carl Eugen Hoyos
ed20fbcd48
lavc/aarch64/vp8dsp: Fix the include guard.
...
Fixes fate-source.
2019-01-31 22:35:44 +01:00
Magnus Röös
833fed5253
libavcodec: vp8 neon optimizations for aarch64
...
Partial port of the ARM NEON code to aarch64.
Benchmarks from fate:
benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
- vp8dsp.idct [OK]
- vp8dsp.mc [OK]
- vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8
Signed-off-by: Magnus Röös <mla2.roos@gmail.com>
2019-01-31 20:17:51 +01:00
Janne Grunau
28a8b5413b
h264/aarch64: add intra loop filter neon asm
...
Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported
(x264 uses nv12 chroma) and optimized.
Cycle count for checkasm --bench on a Snapdragon 820e:
h264_h_loop_filter_luma_intra_8bpp_c: 60.0
h264_h_loop_filter_luma_intra_8bpp_neon: 54.2
h264_v_loop_filter_luma_intra_8bpp_c: 148.3
h264_v_loop_filter_luma_intra_8bpp_neon: 73.8
h264_h_loop_filter_chroma_intra_8bpp_c: 27.8
h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4
h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8
h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7
h264_v_loop_filter_chroma_intra_8bpp_c: 45.8
h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3
2019-01-26 12:05:10 +01:00
Janne Grunau
846c3d6aca
h264/aarch64: optimize neon loop filter
...
Exit as soon as possible if no filtering will be done.
Improves the checkasm --bench cycle count on a Snapdragon 820e:
h264_h_loop_filter_luma_8bpp_c: 72.4 -> 72.5
h264_h_loop_filter_luma_8bpp_neon: 97.1 -> 56.3
h264_v_loop_filter_luma_8bpp_c: 174.0 -> 173.5
h264_v_loop_filter_luma_8bpp_neon: 62.9 -> 60.9
h264_h_loop_filter_chroma_8bpp_c: 30.2 -> 30.3
h264_h_loop_filter_chroma_8bpp_neon: 51.6 -> 25.7
h264_v_loop_filter_chroma_8bpp_c: 57.3 -> 57.3
h264_v_loop_filter_chroma_8bpp_neon: 28.0 -> 24.0
2019-01-26 12:05:10 +01:00
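The "exit as soon as possible" idea from the loop filter commit above can be sketched in C. The commit message does not spell out the exact exit condition, so this sketch assumes the standard H.264 per-sample thresholds (|p0-q0| < alpha, |p1-p0| < beta, |q1-q0| < beta); the function names, the edge layout and the demo helper are hypothetical and not taken from the FFmpeg assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if at least one of the `n` positions along the edge between
 * rows pix[-stride] (p0) and pix[0] (q0) passes the thresholds, i.e. if
 * running the filter could change anything; returns 0 otherwise. */
static int edge_needs_filtering(const uint8_t *pix, ptrdiff_t stride,
                                int n, int alpha, int beta)
{
    for (int i = 0; i < n; i++) {
        int p1 = pix[i - 2 * stride], p0 = pix[i - stride];
        int q0 = pix[i],              q1 = pix[i + stride];

        if (abs(p0 - q0) < alpha && abs(p1 - p0) < beta && abs(q1 - q0) < beta)
            return 1;
    }
    return 0;
}

/* Hypothetical demo: two rows of value p above the edge, two rows of
 * value q below it, 16 pixels wide. */
static int edge_demo(int p, int q, int alpha, int beta)
{
    uint8_t buf[4 * 16];

    for (int i = 0; i < 16; i++) {
        buf[0 * 16 + i] = buf[1 * 16 + i] = (uint8_t)p;
        buf[2 * 16 + i] = buf[3 * 16 + i] = (uint8_t)q;
    }
    return edge_needs_filtering(buf + 2 * 16, 16, 16, alpha, beta);
}
```

A caller would run the full filter only when the check returns 1. A small step across the edge passes the thresholds, while a large step (a real image edge) or zero thresholds do not, so in those cases the filter can return before doing any arithmetic.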