Its rarely respected by implementations, its fairly new (1 year old),
and it has a scuffed define (neither glslc nor glslang enable the
"GL_EXT_nontemporal_keyword" define if its enabled, unlike all other extensions).
The shader needs ~3 loads per DCT coeff.
This data was not observed to get efficiently stored
in the upper cached levels, loading it explicitely in
shared memory fixes that.
Also reduce code size by moving the bitstream
initialization outside of the switch/case.
The DPX Vulkan unpack shader computes a word offset as
uint off = (line_off + pix_off >> 5);
Due to GLSL operator precedence this is evaluated as
line_off + (pix_off >> 5) rather than (line_off + pix_off) >> 5.
Since line_off is in bits while off is a 32-bit word index,
scanlines beyond y=0 use an inflated offset and the shader reads
past the end of the DPX slice buffer.
Parenthesize the expression so that the sum is shifted as intended:
uint off = (line_off + pix_off) >> 5;
This corrects the unpacked data and removes the CRC mismatch
observed between the software and Vulkan DPX decoders for
mispacked 12-bit DPX samples. The GPU OOB read itself is only
observable indirectly via this corruption since it occurs inside
the shader.
Repro on x86_64 with Vulkan/llvmpipe (531ce713a0):
./configure --cc=clang --disable-optimizations --disable-stripping \
--enable-debug=3 --disable-doc --disable-ffplay \
--enable-vulkan --enable-libshaderc \
--enable-hwaccel=dpx_vulkan \
--extra-cflags='-fsanitize=address -fno-omit-frame-pointer' \
--extra-ldflags='-fsanitize=address' && make
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json
PoC: packed 12-bit DPX with the packing flag cleared so the unpack
shader runs (4x64 gbrp12le), e.g. poc12_packed0.dpx.
Software decode:
./ffmpeg -v error -i poc12_packed0.dpx -f framecrc -
-> 0, ..., 1536, 0x26cf81c2
Vulkan hwaccel decode:
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json \
./ffmpeg -v error -init_hw_device vulkan \
-hwaccel vulkan -hwaccel_output_format vulkan \
-i poc12_packed0.dpx \
-vf hwdownload,format=gbrp12le -f framecrc -
-> 0, ..., 1536, 0x71e10a51
The only difference between the two runs is the Vulkan unpack
shader, and the stable CRC mismatch indicates that it is reading
past the intended DPX slice region.
Regression since: 531ce713a0
Found-by: Pwno
The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
This stores a small buffer in shared memory per decode thread (16 bytes),
which helps reduce the number of memory accesses.
The bitstream buffer is first aligned to a 4 byte boundary, so that the
buffer can be filled with a single memory request.
This allows increased internal precision.
In addition, we can introduce an offset to the DC coefficient
during the second IDCT step, to remove a per-element addition
in the output codepath.
Finally, by processing columns first we can remove the barrier
after loading coefficients.
Signed-off-by: averne <averne381@gmail.com>
It allows us to easily synchronize the software and hardware
decoders, by removing the abstraction the Vulkan layer added by changing
the values written.
The Vulkan spec requires that all accesses to push data are uniform for
all invocations (e.g. can't be based on gl_WorkGroupID or gl_LocalInvocationID).
This commit optimizes the Vulkan decoder by splitting up decoding
from iDCT, and merging the few tables needed directly into the shader.
The speedup on Intel is 10x.
The qScale syntax element has a maximum value of 512, which would overflow the 16-bit store from the VLD shader in extreme cases.
This fixes that edge case by forwarding the element in a storage buffer, and applying the inverse quantization fully in the IDCT shader.
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, ie.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha
The implementation consists in two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.
Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT: 178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano: 70 fps
This commit adds a ProRes RAW hardware implementation written in Vulkan.
Both version 0 and version 1 streams are supported.
The implementation is highly parallelized, with 512 invocations dispatched
per every tile, with generally 4k tiles on a 5.8k stream.
Thanks to unlord for the 8-point iDCT.
Benchmark for a generic 5.8k RAW HQ file:
6900XT: 63fps
7900XTX: 84fps
6000 Ada: 120fps
Intel: 9fps
The issue is that there is an explicit lack of synchronization as only the very
first invocation writes symbols and updates the state, which other invocations
then store.