Mirror of https://git.ffmpeg.org/ffmpeg.git, synced 2026-04-21 01:40:23 +00:00
Add NEON alpha drop/insert using ldp+tbl+stp instead of the ld4/st3 and ld3/st4 structure operations.

Both routines use a 2-register sliding-window tbl with post-indexed addressing. Instruction scheduling targets narrow in-order cores (A55) while remaining neutral on wide out-of-order cores.

Scalar tails use coalesced loads/stores (ldr+strh+lsr+strb for alpha drop, ldrh+ldrb+orr+str for alpha insert) to reduce the per-pixel instruction count. Independent instructions are placed between loads and their dependent operations to fill load-use latency on in-order cores.

checkasm --bench on Apple M3 Max (decicycles, 1920px):

| Function | Decicycles | Speedup |
|---|---|---|
| rgb32tobgr24_c | 114.4 | 1.00x |
| rgb32tobgr24_neon | 64.3 | 1.78x |
| rgb24tobgr32_c | 128.9 | 1.00x |
| rgb24tobgr32_neon | 80.9 | 1.59x |

The C baseline is clang auto-vectorized, so the speedup is over compiler-generated NEON.

Signed-off-by: David Christle <dev@christle.is>

| File | | |
|---|---|---|
| .. | ||
| asm-offsets.h | ||
| hscale.S | ||
| input.S | ||
| Makefile | ||
| output.S | ||
| range_convert_neon.S | ||
| rgb2rgb.c | ||
| rgb2rgb_neon.S | ||
| swscale.c | ||
| swscale_unscaled.c | ||
| swscale_unscaled_neon.S | ||
| xyz2rgb_neon.S | ||
| yuv2rgb_neon.S | ||
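
The coalesced scalar tails named in the commit message (ldr+strh+lsr+strb for alpha drop, ldrh+ldrb+orr+str for alpha insert) can be sketched in portable C. This is only an illustration of the access pattern, not the code in rgb2rgb_neon.S: the function names `alpha_drop_tail` and `alpha_insert_tail` are hypothetical, a little-endian target is assumed, and the inserted alpha value of 0xFF is an assumption as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 4 bytes per pixel in, 3 bytes per pixel out.
 * Assumes little-endian byte order, so the unwanted 4th byte sits in
 * bits 24..31 of the loaded word. */
static void alpha_drop_tail(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        uint32_t px;
        uint16_t lo;
        memcpy(&px, src, 4);            /* one 32-bit load (ldr) */
        lo = (uint16_t)px;
        memcpy(dst, &lo, 2);            /* one 16-bit store (strh) */
        dst[2] = (uint8_t)(px >> 16);   /* shift (lsr) + byte store (strb) */
        src += 4;
        dst += 3;
    }
}

/* Hypothetical helper: 3 bytes per pixel in, 4 bytes per pixel out.
 * The alpha byte is assumed to be 0xFF (opaque). */
static void alpha_insert_tail(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        uint16_t lo;
        uint32_t px;
        memcpy(&lo, src, 2);            /* 16-bit load (ldrh) */
        px = (uint32_t)lo
           | ((uint32_t)src[2] << 16)   /* byte load (ldrb) merged with orr */
           | 0xFF000000u;               /* assumed opaque alpha */
        memcpy(dst, &px, 4);            /* one 32-bit store (str) */
        src += 3;
        dst += 4;
    }
}
```

Compared with a byte-by-byte copy, each pixel costs one load and two stores (drop) or two loads and one store (insert), which is the reduction in per-pixel instruction count the commit message refers to. `memcpy` is used for the multi-byte accesses so the sketch stays well defined for unaligned pointers.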