Commit graph

84 commits

Author SHA1 Message Date
Lynne
e51c549f6e
vulkan/dpx: drop using the nontemporal extension
It's rarely respected by implementations, it's fairly new (1 year old),
and it has a scuffed define (neither glslc nor glslang enable the
"GL_EXT_nontemporal_keyword" define if it's enabled, unlike all other extensions).
2026-01-14 16:13:22 +01:00
Lynne
f2a55af9a4
vulkan_dpx: switch to compile-time SPIR-V generation 2026-01-12 17:28:43 +01:00
Lynne
0f4667fc11
vulkan_prores_raw: clean up and optimize 2026-01-12 17:28:42 +01:00
Lynne
23ab1b1a66
vulkan/dct: embed DCT scaling values during SPIR-V generation
Instead of relying on rounded off values, use specialization constants
to bake the DCT values into the shader when it's compiled.
2026-01-12 17:28:42 +01:00
Lynne
e27b510da8
vulkan_prores: generate SPIR-V at compile-time 2026-01-12 17:28:42 +01:00
Lynne
026e94e339
vulkan_prores_raw: use compile-time SPIR-V generation 2026-01-12 17:28:42 +01:00
Lynne
f2affdfafb
configure/make: support compile-time SPIR-V generation 2026-01-12 17:28:40 +01:00
Lynne
58bd5ad630
vulkan/prores_raw_idct: use the same prores_idct method for loading coeffs
This saves 2 barriers.
Also implement workbank avoidance.
2025-12-31 15:00:47 +01:00
Lynne
8db6947700
vulkan_prores_raw: reduce zigzag table size
No need for full 32-bits.
2025-12-22 19:46:27 +01:00
Lynne
cfcf52a08c
vulkan: deduplicate shorthand casting defines to common.comp 2025-12-22 19:46:27 +01:00
Lynne
6eced88188
vulkan: merge ProRes and ProRes RAW iDCTs
This cleans up the code a bit, and reduces binary size.
2025-12-22 19:46:26 +01:00
averne
b9078c0939 vulkan/prores: copy constant tables to shared memory
The shader needs ~3 loads per DCT coeff.
This data was not observed to get efficiently stored
in the upper cache levels; loading it explicitly into
shared memory fixes that.

Also reduce code size by moving the bitstream
initialization outside of the switch/case.
2025-12-15 12:29:00 +00:00
averne
a2475d16ed lavc/vulkan/common: allow configurable bitstream caching in shared memory 2025-12-15 12:29:00 +00:00
Lynne
9e8e34d475
vulkan_ffv1: remove unused RCT shader files
The 2 files were made redundant when the RCT was merged into encode/decode.
2025-12-13 22:12:26 +01:00
Lynne
5bb9cd23b7
vulkan_dpx: fix GRAY16BE and big-endian marked 8-bit samples 2025-12-13 21:35:56 +01:00
Lynne
c3291993eb
vulkan_ffv1: use proper rounded divisions for plane width and height
Fixes #20314
2025-12-13 19:12:24 +01:00
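The commit above does not say which rounding it uses. As a minimal sketch, assuming the usual round-up convention for subsampled planes, the difference from a truncating shift looks like this (the helper names `ceil_rshift` and `floor_rshift` are hypothetical, not taken from the shader):

```c
#include <stdint.h>

/* Hypothetical helper: round-up right shift. Gives the number of samples
 * in a plane subsampled by 2^shift that cover `size` full-resolution samples. */
static inline uint32_t ceil_rshift(uint32_t size, uint32_t shift)
{
    return (size + (1u << shift) - 1u) >> shift;
}

/* Truncating variant, shown only for comparison: it drops the last
 * partial sample when `size` is not a multiple of 2^shift. */
static inline uint32_t floor_rshift(uint32_t size, uint32_t shift)
{
    return size >> shift;
}
```

For an odd width such as 1921 with a shift of 1, the round-up variant yields 961 while the truncating one yields 960, so the two only disagree on dimensions that are not a multiple of the subsampling factor.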
Ruikai Peng
c48b8ebbbb avcodec/vulkan: fix DPX unpack offset
The DPX Vulkan unpack shader computes a word offset as

    uint off = (line_off + pix_off >> 5);

Due to GLSL operator precedence this is evaluated as
line_off + (pix_off >> 5) rather than (line_off + pix_off) >> 5.
Since line_off is in bits while off is a 32-bit word index,
scanlines beyond y=0 use an inflated offset and the shader reads
past the end of the DPX slice buffer.

Parenthesize the expression so that the sum is shifted as intended:

    uint off = (line_off + pix_off) >> 5;

This corrects the unpacked data and removes the CRC mismatch
observed between the software and Vulkan DPX decoders for
mispacked 12-bit DPX samples. The GPU OOB read itself is only
observable indirectly via this corruption since it occurs inside
the shader.

Repro on x86_64 with Vulkan/llvmpipe (531ce713a0):

    ./configure --cc=clang --disable-optimizations --disable-stripping \
        --enable-debug=3 --disable-doc --disable-ffplay \
        --enable-vulkan --enable-libshaderc \
        --enable-hwaccel=dpx_vulkan \
        --extra-cflags='-fsanitize=address -fno-omit-frame-pointer' \
        --extra-ldflags='-fsanitize=address' && make

    VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json

PoC: packed 12-bit DPX with the packing flag cleared so the unpack
shader runs (4x64 gbrp12le), e.g. poc12_packed0.dpx.

Software decode:

    ./ffmpeg -v error -i poc12_packed0.dpx -f framecrc -
    -> 0, ..., 1536, 0x26cf81c2

Vulkan hwaccel decode:

    VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json \
    ./ffmpeg -v error -init_hw_device vulkan \
        -hwaccel vulkan -hwaccel_output_format vulkan \
        -i poc12_packed0.dpx \
        -vf hwdownload,format=gbrp12le -f framecrc -
    -> 0, ..., 1536, 0x71e10a51

The only difference between the two runs is the Vulkan unpack
shader, and the stable CRC mismatch indicates that it is reading
past the intended DPX slice region.

Regression since: 531ce713a0
Found-by: Pwno
2025-12-12 20:13:16 +00:00
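The two groupings named in the message above can be compared in plain C. This is only a sketch: both functions use explicit parentheses, so the result does not depend on any language's precedence rules, and the function names and sample values are hypothetical.

```c
#include <stdint.h>

/* Intended grouping: add the two bit offsets first, then convert the
 * sum to a 32-bit word index (divide by 32). */
static inline uint32_t word_index_fixed(uint32_t line_off_bits,
                                        uint32_t pix_off_bits)
{
    return (line_off_bits + pix_off_bits) >> 5;
}

/* Grouping the message attributes to the bug: only pix_off is converted
 * to words, while line_off is added while still expressed in bits. */
static inline uint32_t word_index_inflated(uint32_t line_off_bits,
                                           uint32_t pix_off_bits)
{
    return line_off_bits + (pix_off_bits >> 5);
}
```

With a line offset of 0 both groupings agree, which matches the message's remark that only scanlines beyond y=0 are affected. With a line offset of 160 bits and a pixel offset of 36 bits, the intended grouping gives word 6 while the other gives 161.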
averne
c384b1e803 vulkan/prores: use vkCmdClearColorImage
The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
2025-12-07 18:17:36 +00:00
Lynne
f80addbb07
ffv1enc_vulkan: fix encoding with large contexts
When RGB_LINECACHE == 2, then top2 is not the current line.
2025-12-04 16:53:58 +01:00
Lynne
9b14ea0aa1
vulkan_dpx: fix alignment issue
12-bit images apparently require mod-32 alignment for each line.
Go figure.
2025-12-04 15:08:46 +01:00
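The message above does not state the unit of the mod-32 alignment. As a rough sketch, assuming it means each packed line is padded up to a multiple of 32 bits, the line size could be computed as follows (both helper names are hypothetical):

```c
#include <stdint.h>

/* Round a bit count up to the next multiple of 32.
 * Assumption: "mod-32" in the commit refers to 32 bits. */
static inline uint32_t align_bits_32(uint32_t bits)
{
    return (bits + 31u) & ~31u;
}

/* Hypothetical helper: size in bits of one tightly packed line after
 * padding, for `width` pixels with `components` samples per pixel. */
static inline uint32_t line_size_bits(uint32_t width, uint32_t components,
                                      uint32_t bits_per_sample)
{
    return align_bits_32(width * components * bits_per_sample);
}
```

For a 4-pixel-wide, 3-component, 12-bit line this gives 160 bits rather than the unpadded 144.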
averne
fd2fd3828c libavcodec/vulkan: remove unnecessary member in GetBitContext
The number of remaining bits can be calculated using existing state.
This simplifies calculations and frees up one register.
2025-11-30 19:21:08 +01:00
averne
ef7354d471 libavcodec/vulkan: introduce cached bitstream reader
This stores a small buffer in shared memory per decode thread (16 bytes),
which helps reduce the number of memory accesses.
The bitstream buffer is first aligned to a 4 byte boundary, so that the
buffer can be filled with a single memory request.
2025-11-30 19:21:04 +01:00
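The alignment step described above can be sketched in C. This assumes the reader aligns the start offset down to a 4-byte boundary and then skips the leading bits that precede the real start. The commit does not spell out that detail, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type: where the first aligned fetch begins, and how
 * many bits at the front of that fetch belong to earlier data. */
typedef struct AlignedStart {
    uint32_t aligned_off; /* byte offset rounded down to a multiple of 4 */
    uint32_t skip_bits;   /* bits to discard before the first real bit   */
} AlignedStart;

static inline AlignedStart align_start(uint32_t byte_off)
{
    AlignedStart a;
    a.aligned_off = byte_off & ~3u;       /* round down to 4 bytes    */
    a.skip_bits   = (byte_off & 3u) * 8u; /* 0, 8, 16 or 24 bits      */
    return a;
}
```

Once the start is aligned this way, every later 16-byte refill also begins on a 4-byte boundary, which is what would let each refill be issued as one aligned request.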
averne
1c5bb1b12d vulkan/prores: normalize coefficients during IDCT
This allows increased internal precision.
In addition, we can introduce an offset to the DC coefficient
during the second IDCT step, to remove a per-element addition
in the output codepath.
Finally, by processing columns first we can remove the barrier
after loading coefficients.

Signed-off-by: averne <averne381@gmail.com>
2025-11-29 17:56:28 +01:00
averne
1982add485 vulkan/prores: fix dequantization for 4:2:2 subsampling
Bug introduced in d00f41f due to an oversight.
2025-11-29 17:27:21 +01:00
Lynne
531ce713a0
dpxdec: add a Vulkan hwaccel 2025-11-26 15:16:43 +01:00
Lynne
7af5b5cec3
vulkan_prores_raw: use the native image representation
It allows us to easily synchronize the software and hardware
decoders, by removing the abstraction the Vulkan layer added by changing
the values written.
2025-11-26 15:16:42 +01:00
Lynne
a811a6885a
vulkan_prores_raw: read the header length rather than assuming it's 8
In all known samples, it is equal to 8.
2025-11-26 15:16:42 +01:00
Lynne
0db891366d
vulkan_prores_raw: fix dynamically non-uniform accesses to pushconsts
The Vulkan spec requires that all accesses to push data are uniform for
all invocations (e.g. can't be based on gl_WorkGroupID or gl_LocalInvocationID).
2025-11-26 15:16:41 +01:00
Lynne
bb30a0d0d8
vulkan_prores_raw: split up decoding and DCT
This commit optimizes the Vulkan decoder by splitting up decoding
from iDCT, and merging the few tables needed directly into the shader.

The speedup on Intel is 10x.
2025-11-26 15:16:41 +01:00
Lynne
615b26f1b1
vulkan_ffv1: fix swapped colors for x2bgr10 2025-11-26 15:16:40 +01:00
Lynne
d36d88dcbb
vulkan/common: add reverse2 endian reversal macro 2025-11-26 15:16:39 +01:00
averne
1d84ab331c vulkan/prores: Adopt the same IDCT routine as the prores-raw hwaccel
The added rounding at the final output conforms
to the SMPTE document and reduces the deviation
against the software decoder.
2025-11-25 17:54:56 +00:00
Lynne
9860017495
vulkan/ffv1: use u32vec2 for slice offsets
Simplifies calculations slightly.
2025-11-12 00:37:24 +01:00
averne
d00f41f213 vulkan/prores: forward quantization parameter to the IDCT shader
The qScale syntax element has a maximum value of 512, which would overflow the 16-bit store from the VLD shader in extreme cases.
This fixes that edge case by forwarding the element in a storage buffer, and applying the inverse quantization fully in the IDCT shader.
2025-11-08 22:31:21 +00:00
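The overflow described above can be illustrated with a simplified model that multiplies a coefficient by qScale alone and ignores the quantization matrix. That simplification and the helper names are assumptions, not the shader's actual arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `v` can be stored in a signed 16-bit value without loss. */
static inline bool fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Simplified dequantization step: coefficient times qScale only. */
static inline int32_t scale_coeff(int32_t coeff, int32_t qscale)
{
    return coeff * qscale;
}
```

At the maximum qScale of 512, a coefficient of 63 still fits (32256), but 64 does not (32768), so under this model even modest coefficient magnitudes exceed a 16-bit store.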
Lynne
6720f71247 Revert "vulkan/prores: output LSB-padded data"
This reverts commit 909d71322a.

The issue was elsewhere, not in our code.
2025-11-06 21:46:43 +01:00
averne
909d71322a vulkan/prores: output LSB-padded data
For consistency with existing Vulkan-based hwaccels
2025-10-28 06:12:14 +00:00
Lynne
51843adfe5
vulkan/rangecoder: ifdef out encode and decode chunks
There's little code sharing between them.
2025-10-28 07:11:26 +01:00
averne
98412edfed lavc: add a ProRes Vulkan hwaccel
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, i.e.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha

The implementation consists of two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.

Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT:   178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano:    70 fps
2025-10-25 19:54:13 +00:00
Lynne
75aeffb1c6
lavc: add a ProRes RAW Vulkan hwaccel
This commit adds a ProRes RAW hardware implementation written in Vulkan.
Both version 0 and version 1 streams are supported.
The implementation is highly parallelized, with 512 invocations dispatched
per tile, with generally 4k tiles on a 5.8k stream.

Thanks to unlord for the 8-point iDCT.

Benchmark for a generic 5.8k RAW HQ file:
6900XT: 63fps
7900XTX: 84fps
6000 Ada: 120fps
Intel: 9fps
2025-08-08 18:29:41 +09:00
Lynne
2c3315b04c
lavc/vulkan/common: sign-ify lengths
This makes left_bits return useful data rather than overflowing, and
also saves some 64-bit integer operations, which is still always a plus sadly.
2025-08-05 23:51:21 +09:00
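A short C sketch of why signed lengths help, assuming the remaining-bit count is computed as total minus consumed. The function names are hypothetical and do not reproduce the shader's `left_bits`.

```c
#include <stdint.h>

/* Unsigned variant: reading past the end wraps around to a huge value,
 * so a caller cannot tell "overread" apart from "plenty left". */
static inline uint32_t left_bits_unsigned(uint32_t total_bits,
                                          uint32_t consumed_bits)
{
    return total_bits - consumed_bits;
}

/* Signed variant: an overread shows up as a negative count. */
static inline int32_t left_bits_signed(int32_t total_bits,
                                       int32_t consumed_bits)
{
    return total_bits - consumed_bits;
}
```

With 64 bits in the buffer and 72 consumed, the unsigned version returns 4294967288 while the signed one returns -8, which a caller can check with a simple comparison against zero.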
Timo Rothenpieler
262d41c804 all: fix typos found by codespell 2025-08-03 13:48:47 +02:00
Lynne
3cbe3418b2
vulkan_ffv1: fix golomb coding for non-RGB streams
The run_index is reset on each plane, unlike with RGB, where
it's reset once per slice.
2025-05-27 06:40:33 +09:00
Lynne
c395ad7c2c
vulkan_ffv1: small cleanup for golomb
Split up computation of the offset in the same way that
the range coder version does it.
2025-05-27 06:40:29 +09:00
Lynne
977d1a24bc
vulkan/ffv1: fix sync issue in cached bitstream reader/writer
The issue is that there is an explicit lack of synchronization as only the very
first invocation writes symbols and updates the state, which other invocations
then store.
2025-05-23 05:23:44 +09:00
Lynne
7b45d9c5fd
vulkan_ffv1: pipe through slice decoding status 2025-05-20 19:53:02 +09:00
Lynne
cb8f4b675d
vulkan/ffv1: unify encode and decode get/put primitives
This simply makes a get_rac/put_rac_internal variant that can be
reused.
2025-05-20 19:53:02 +09:00
Lynne
7576410af7
ffv1enc_vulkan: implement RCT search for level >= 4 2025-05-20 19:53:01 +09:00
Lynne
0156680f09
ffv1enc_vulkan: implement the cached EC writer from the decoder
This gives a 35% speedup on AMD and 50% on Nvidia.
2025-05-20 19:53:01 +09:00
Lynne
a24ea37228
vulkan_ffv1: fix PCM + cached symbol reader
writeout_rgb requires that all subgroups are active.
2025-05-20 19:53:01 +09:00
Lynne
8a2d921627
ffv1_common: minor RGB optimization 2025-05-20 19:53:01 +09:00