The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
This commit optimizes the Vulkan decoder by splitting up decoding
from iDCT, and merging the few tables needed directly into the shader.
The speedup on Intel is 10x.
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, ie.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha
The implementation consists in two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.
Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT: 178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano: 70 fps
This commit adds a ProRes RAW hardware implementation written in Vulkan.
Both version 0 and version 1 streams are supported.
The implementation is highly parallelized, with 512 invocations dispatched
per every tile, with generally 4k tiles on a 5.8k stream.
Thanks to unlord for the 8-point iDCT.
Benchmark for a generic 5.8k RAW HQ file:
6900XT: 63fps
7900XTX: 84fps
6000 Ada: 120fps
Intel: 9fps
This reduces the intermediate VRAM used for RGB decoding by a
factor of 100x for 6k video.
This also speeds the decoder up by 16% for 4k RGB24 and 31% for 6k video.
This is equivalent to what the software decoder does, but with less pointers.
This patch adds a fully-featured level 3 and 4 decoder for FFv1,
supporting Golomb and all Range coding variants, all pixel formats,
and all features, except for the newly added floating-point formats.
On a 6000 Ada, for 3840x2160 bgr0 content at 50Mbps (standard desktop
recording), it is able to do 400fps.
An Alder Lake with 24 threads can barely do 100fps.
This commit implements a standard, compliant, version 3 and version 4
FFv1 encoder, entirely in Vulkan. The encoder is written in standard
GLSL and requires a Vulkan 1.3 supporting GPU with the BDA extension.
The encoder can use any amount of slices, but nominally, should use
32x32 slices (1024 in total) to maximize parallelism.
All features are supported, as well as all pixel formats.
This includes:
- Rice
- Range coding with a custom quantization table
- PCM encoding
CRC calculation is also massively parallelized on the GPU.
Encoding of unaligned dimensions on subsampled data requires
version 4, or requires oversizing the image to 64-pixel alignment
and cropping out the padding via container flags.
Performance-wise, this makes 1080p real-time screen capture possible
at 60fps on even modest GPUs.