The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
This stores a small buffer in shared memory per decode thread (16 bytes),
which helps reduce the number of memory accesses.
The bitstream buffer is first aligned to a 4 byte boundary, so that the
buffer can be filled with a single memory request.
This allows increased internal precision.
In addition, we can introduce an offset to the DC coefficient
during the second IDCT step, to remove a per-element addition
in the output codepath.
Finally, by processing columns first we can remove the barrier
after loading coefficients.
Signed-off-by: averne <averne381@gmail.com>
It allows us to easily synchronize the software and hardware
decoders, by removing the abstraction the Vulkan layer added by changing
the values written.
The Vulkan spec requires that all accesses to push data are uniform for
all invocations (e.g. can't be based on gl_WorkGroupID or gl_LocalInvocationID).
This commit optimizes the Vulkan decoder by splitting up decoding
from iDCT, and merging the few tables needed directly into the shader.
The speedup on Intel is 10x.
The qScale syntax element has a maximum value of 512, which would overflow the 16-bit store from the VLD shader in extreme cases.
This fixes that edge case by forwarding the element in a storage buffer, and applying the inverse quantization fully in the IDCT shader.
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, ie.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha
The implementation consists in two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.
Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT: 178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano: 70 fps
This commit adds a ProRes RAW hardware implementation written in Vulkan.
Both version 0 and version 1 streams are supported.
The implementation is highly parallelized, with 512 invocations dispatched
per every tile, with generally 4k tiles on a 5.8k stream.
Thanks to unlord for the 8-point iDCT.
Benchmark for a generic 5.8k RAW HQ file:
6900XT: 63fps
7900XTX: 84fps
6000 Ada: 120fps
Intel: 9fps
The issue is that there is an explicit lack of synchronization as only the very
first invocation writes symbols and updates the state, which other invocations
then store.
This commit adds support for compiler hints.
While on AMD these are not used/needed, Nvidia benefits from them, and gives
a sizeable 10% speedup on 4k.
This commit also makes it possible for the encoder to choose a different
quantization table on a per-slice basis, as well as adding this capability
to the decoder.
Also, this commit fully fixes decoding of context=1 encoded files.
This reduces the intermediate VRAM used for RGB decoding by a
factor of 100x for 6k video.
This also speeds the decoder up by 16% for 4k RGB24 and 31% for 6k video.
This is equivalent to what the software decoder does, but with less pointers.