Mirror of https://git.ffmpeg.org/ffmpeg.git (synced 2026-06-04 22:50:24 +00:00)
3-tap [1,2,1]>>2: shared implementation body across size-specialized entry points (8x8/16x16/32x32) to reduce code size. Fold the 3-tap kernel into uhadd + urhadd: uhadd gives floor((prev+next)/2), then urhadd rounds with curr to produce (prev + 2*curr + next + 2) >> 2 on 16 bytes in-place (no widen/narrow needed). Overlap-last technique for tail avoids partial stores. Caller pads input arrays by 16 bytes to guarantee safe over-read.

Strong smoothing (32x32): preloaded weight tables, interleaved umull/umlal pairs (two 16-byte blocks at a time) to hide rshrn-to-store latency, with paired st1 for 32-byte writes.

checkasm --bench --runs=15 (Apple M4, average of 3 trials):

- ref_filter_3tap_8x8_8_neon: 4.1x
- ref_filter_3tap_16x16_8_neon: 3.3x
- ref_filter_3tap_32x32_8_neon: 2.5x
- ref_filter_strong_8_neon: 1.9x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>

| | | |
|---|---|---|
| .. | ||
| cabac.c | ||
| data.c | ||
| data.h | ||
| dsp.c | ||
| dsp.h | ||
| dsp_template.c | ||
| filter.c | ||
| hevc.h | ||
| hevcdec.c | ||
| hevcdec.h | ||
| Makefile | ||
| mvs.c | ||
| parse.c | ||
| parse.h | ||
| parser.c | ||
| pred.c | ||
| pred.h | ||
| pred_template.c | ||
| ps.c | ||
| ps.h | ||
| ps_enc.c | ||
| refs.c | ||
| sei.c | ||
| sei.h | ||
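
The uhadd + urhadd fold and the overlap-last tail described in the commit message can be illustrated with a scalar C sketch. This is not the NEON code from the commit: `uhadd_u8` and `urhadd_u8` are hypothetical stand-ins that model one 8-bit lane of the corresponding instructions, and `tap3_ref`, `tap3_fold`, `tap3_block`, `tap3_filter` and `BLOCK` are names introduced here for illustration. The sketch assumes an out-of-place filter with at least one full 16-byte block and readable neighbours at `src[-1]` and `src[n]`.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar models of one 8-bit lane of the AArch64 halving adds. */
static uint8_t uhadd_u8(uint8_t a, uint8_t b)  { return (uint8_t)((a + b) >> 1); }
static uint8_t urhadd_u8(uint8_t a, uint8_t b) { return (uint8_t)((a + b + 1) >> 1); }

/* Reference [1,2,1]>>2 kernel with rounding, computed in wider arithmetic. */
static uint8_t tap3_ref(uint8_t prev, uint8_t curr, uint8_t next)
{
    return (uint8_t)((prev + 2 * curr + next + 2) >> 2);
}

/* Folded form: truncating half-add of the neighbours, then a rounding
 * half-add with the centre sample. Every intermediate fits in 8 bits. */
static uint8_t tap3_fold(uint8_t prev, uint8_t curr, uint8_t next)
{
    return urhadd_u8(uhadd_u8(prev, next), curr);
}

#define BLOCK 16 /* bytes handled per step, mirroring one 128-bit register */

/* Hypothetical helper: filter exactly BLOCK samples starting at src. */
static void tap3_block(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = tap3_fold(src[i - 1], src[i], src[i + 1]);
}

/* Filter src[0..n-1] into dst (dst must not alias src).
 * Assumes n >= BLOCK and that src[-1] and src[n] are readable. */
static void tap3_filter(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + BLOCK <= n; i += BLOCK)
        tap3_block(dst + i, src + i);

    /* Overlap-last: instead of a partial store for the remainder, redo
     * one full block that ends exactly at n. The overlapping outputs are
     * recomputed from the same inputs, so they are written unchanged. */
    if (i < n)
        tap3_block(dst + n - BLOCK, src + n - BLOCK);
}
```

The fold is exact because dropping the low bit of prev+next in the first step can never change the result of the final rounding shift, which is why no widening to 16 bits is needed. In this sketch the overlap-last block stays inside the buffer, so the 16-byte caller padding mentioned in the commit message is not exercised here.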