ff_vk_find_struct returns const void *, so storing it in const void *drm_create_pnext
fixes the initialization warning but then dpb_hwfc->create_pnext = drm_create_pnext
assigns const void * to void *, triggering the same warning at that line. The right
fix is a (void *) cast at the call site, same as done for buf_pnext.
Also restrict the GetPhysicalDeviceImageFormatProperties2 verbose log in
try_export_flags to the DRM modifier path only: when has_mods is false the log
always printed mod[0]=0x0, which is misleading since no DRM modifier is involved.
Signed-off-by: Tymur Boiko <tboiko@nvidia.com>
When mapping Vulkan Video frames to DMA-BUF, synchronize using an exportable
binary semaphore and sync_fd where supported. Submit a lightweight exec that
waits on each plane's timeline semaphore at the current value, signals a
SYNC_FD-exportable binary semaphore, then export with vkGetSemaphoreFdKHR.
Store that binary semaphore in AVVkFrameInternal and reuse it across maps
instead of creating and destroying each time: for
VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT, copy transference means a
successful vkGetSemaphoreFdKHR unsignals the semaphore like a wait, so it can
be signaled again on the next map submit. If export is unavailable, fall back
to vkWaitSemaphores.
Moved drm_sync_sem destroy to vulkan_free_internal
Export dma-buf fds with GetMemoryFdKHR for each populated f->mem[i], iterating
up to the sw_format plane count instead of stopping at the image count, so
multi-memory bindings are not skipped. Describe DRM layers using
max(sw planes, image count) and query subresource layout with the correct
aspect and image index when one VkImage backs multiple planes. Reference the
source hw_frames_ctx on the mapped frame and close dma-buf fds on failure paths.
For DMA-BUF-capable pools, honor VK_EXTERNAL_MEMORY_FEATURE_DEDICATED_ONLY_BIT
from format export queries when binding memory. With DRM modifiers and a
video profile in create_pnext, preserve caller usage and image flags instead of
overwriting them from generic supported_usage probing; use the modifier list
create info when probing export flags for modifier tiling.
Include VK_IMAGE_USAGE_VIDEO_DECODE_DPB_BIT_KHR from the output frames
context's usage together with DST (fixes
VUID-VkVideoBeginCodingInfoKHR-slotIndex-07245) instead of adding DPB usage
only when !is_current.
In ff_vk_decode_add_slice, pass VkVideoProfileListInfoKHR (from the output
frames context's create_pnext) as the pNext argument to
ff_vk_get_pooled_buffer instead of the full create_pnext chain. In
ff_vk_frame_params, set tiling to OPTIMAL only when it is not already
DRM_FORMAT_MODIFIER_EXT. In ff_vk_decode_init, when the output pool's
create_pnext includes VkImageDrmFormatModifierListCreateInfoEXT, initialize the
DPB pool with that modifier-list pNext and DRM_FORMAT_MODIFIER_EXT tiling;
otherwise use VkVideoProfileListInfoKHR and OPTIMAL as before. When
VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR is unset, the output
and DPB pools cannot use different layouts or tiling, so the DPB pool must
match the output pool.
Also fix av_hwframe_map ioctl sync_fd export, multi-planar semaphore handling,
and related failure-path cleanup.
Signed-off-by: Tymur Boiko <tboiko@nvidia.com>
Both FF_VK_EXT_VIDEO_ENCODE_QUEUE and FF_VK_EXT_VIDEO_MAINTENANCE_1 are
required, not only one of them.
Found by VVL.
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
With the limiting of video usages on the image views, we no longer need
a yuv sampler, since no multi-plane image will be created with the
SAMPLED usage bit.
The VK spec forbids using clear commands on YUV images,
so we need to allocate separate per-plane images.
This removes the need for a separate reset shader.
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, ie.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha
The implementation consists in two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.
Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT: 178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano: 70 fps
This commit adds a ProRes RAW hardware implementation written in Vulkan.
Both version 0 and version 1 streams are supported.
The implementation is highly parallelized, with 512 invocations dispatched
per every tile, with generally 4k tiles on a 5.8k stream.
Thanks to unlord for the 8-point iDCT.
Benchmark for a generic 5.8k RAW HQ file:
6900XT: 63fps
7900XTX: 84fps
6000 Ada: 120fps
Intel: 9fps
In filtering, and SDR encoding, we use storage images.
This fixes using Vulkan filters on Intel.
Tested not to break anything on the three major vendors.
This patch adds a fully-featured level 3 and 4 decoder for FFv1,
supporting Golomb and all Range coding variants, all pixel formats,
and all features, except for the newly added floating-point formats.
On a 6000 Ada, for 3840x2160 bgr0 content at 50Mbps (standard desktop
recording), it is able to do 400fps.
An Alder Lake with 24 threads can barely do 100fps.
We queried the decoder whether it was able to decode sucessfully, but
since we operated asynchronously, we weren't able to do anything with
this information but let the user know decoding failed for the previous
frame(s).
Since we parse the slice headers ourselves and we're reasonably sure we
can decode before actually starting to decode, this was rarely triggered
on corrupt data, and hardware's understanding of whether there was an error
or not is vague.
There's also a semantic problem with our use of the queries - if there's
a seek, we flush, but what happens to the queries is vague according to
the spec. Most hardware dealt fine, since queries are nothing more than
GPU memory with integers stored. But with Intel, they seem to be more of
a register to which a driver must keep track of, leading to issues if there's
been a reset (seek) and we query the previous submission before the seek.
Just get rid of them. The query code is still used in encoding.
This fixes seeking with HEVC and AV1 on Intel.
We recently introduced a public field which was a superset
of the queue context we used to have.
Switch to using it entirely.
This also allows us to get rid of the NIH function which was
valid only for video queues.
Originally, the decoder had a single execution pool, with one
execution context per thread. Execution pools were always intended
to be thread-safe, as long as there were enough execution contexts
in the pool to satisfy all threads.
Due to synchronization issues, the threading part was removed at some
point, and, for decoding, each thread had its own execution pool.
Having a single execution pool per context is hacky, not to mention
wasteful.
Most importantly, we *cannot* associate single shaders across multiple
execution pools for a single application. This means that we cannot
use shaders to either apply film grain, or use this framework for
software-defined decoders.
The recent commits added threading capabilities back to the execution
pool, and the number of contexts in each pool was increased. This was
done with the assumption that the execution pool was singular, which
it was not. This led to increased parallelism and number of frames
in flight, which is taxing on memory.
This commit finally restores proper threading behaviour.
The validation layer has isses that are reported and addressed in the
earlier commit.
The old query code never worked properly, and did some hideous
heuristics to read the status bit, and work that into a return
code.
This is all best left to callers to do, which simplifies
our code a lot.
This also fixes minor validation errors regarding calling queries
which are not in their active state.
Vulkan encoding was designed in a very... consolidated way.
You had to know the exact codec and profile that the image was going to
eventually be encoded as at... image creation time. Unfortunately, as good
as our code is, glimpsing into the exact future isn't what its capable of.
video_maintenance1 removed that requirement, which only then made encoding
images practically possible.
layered_dpb only makes sense when dedicated_dpb is set to 1.
For some mysterious reason, some Nvidia drivers stopped indicating
SEPARATE_REFRENCES, but kept the COINCIDE flag, which broke
the code.
The first release of the CTS for AV1 decoding had incorrect
offsets for the OrderHints values.
The CTS will be fixed, and eventually, the drivers will be
updated to the proper spec-conforming behaviour, but we still
need to add a workaround as this will take months.
Only NVIDIA use these values at all, so limit the workaround
to only NVIDIA. Also, other vendors don't tend to provide accurate
CTS information.