The issue is that there is an explicit lack of synchronization as only the very
first invocation writes symbols and updates the state, which other invocations
then store.
This commit adds support for compiler hints.
While on AMD these are not used/needed, Nvidia benefits from them, and gives
a sizeable 10% speedup on 4k.
This commit also makes it possible for the encoder to choose a different
quantization table on a per-slice basis, as well as adding this capability
to the decoder.
Also, this commit fully fixes decoding of context=1 encoded files.
This reduces the intermediate VRAM used for RGB decoding by a
factor of 100x for 6k video.
This also speeds the decoder up by 16% for 4k RGB24 and 31% for 6k video.
This is equivalent to what the software decoder does, but with less pointers.
This patch adds a fully-featured level 3 and 4 decoder for FFv1,
supporting Golomb and all Range coding variants, all pixel formats,
and all features, except for the newly added floating-point formats.
On a 6000 Ada, for 3840x2160 bgr0 content at 50Mbps (standard desktop
recording), it is able to do 400fps.
An Alder Lake with 24 threads can barely do 100fps.
According to the GL_EXT_buffer_reference spec alignment
"must be a power of two and be greater than or equal to the largest scalar/component type in the block."
This means by using u32vec2 we can drop the requirement alignment from 8 bytes to 4 bytes
and save a pack64 call in reverse8 (though I assume in most ISAs that compiles to nothing)
Allows the vc2 vulkan encoder to function without setting PB_UNALIGNED
If caller wrote a divisible by eight number of bits it would write an extra byte.
Also increment by to_write instead of BUF_BYTES which overly pads the bitstream.
Unlike the software FFv1 encoder, none of our buffers are allocated by
FFmpeg, which supports at most 4GiB large allocations.
For really large sizes, the maximum size of the buffer can exceed 4GiB,
which the software encoder optimistically tries to allocate as 4GiB
in the hopes that the encoder will compress to under that amount.
We can just let Vulkan allocate us a larger buffer, and switch to
64-bit offsets.
This commit implements a standard, compliant, version 3 and version 4
FFv1 encoder, entirely in Vulkan. The encoder is written in standard
GLSL and requires a Vulkan 1.3 supporting GPU with the BDA extension.
The encoder can use any amount of slices, but nominally, should use
32x32 slices (1024 in total) to maximize parallelism.
All features are supported, as well as all pixel formats.
This includes:
- Rice
- Range coding with a custom quantization table
- PCM encoding
CRC calculation is also massively parallelized on the GPU.
Encoding of unaligned dimensions on subsampled data requires
version 4, or requires oversizing the image to 64-pixel alignment
and cropping out the padding via container flags.
Performance-wise, this makes 1080p real-time screen capture possible
at 60fps on even modest GPUs.