Not only does this take into account extreme edge cases where the plane
padding can significantly exceed the actual width/stride, it also correctly
accounts for the filter offsets when scaling, which the previous code
completely ignored.
Simpler, more robust, and more correct. Now valgrind passes for 100% of format
conversions for me, with and without scaling.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a mostly straightforward internal mechanical change that I wanted
to isolate from the following commit to make bisection easier in the case of
regressions.
While the number of tail blocks could theoretically be different for input
vs output memcpy, the extra complexity of handling that mismatch (and
adjusting all of the tail offsets, strides etc.) seems not worth it.
I tested this commit by manually setting `p->tail_blocks` to higher values
and seeing if that still passed the self-check under valgrind.
Signed-off-by: Niklas Haas <git@haasn.dev>
The x86 kernel, for example, assumes that at least one block is processed,
so avoid calling this with an empty width. This is currently only possible
when operating on an unpadded, very small image whose total linesize is less
than a single block.
Signed-off-by: Niklas Haas <git@haasn.dev>
This code had two issues:
1. It was over-allocating bytes for the input offset map case, and
2. It was hard-coding the assumption that there is only a single tail block
We can fix both of these issues by rewriting the way the tail size is derived.
In the non-offset case, and assuming only 1 tail block:
  aligned_w - safe_width
    = num_blocks * block_size - (num_blocks - 1) * block_size
    = block_size
Additionally, the FFMAX(tail_size_in/out) is unnecessary, because:
  tail_size = pass->width - safe_width <= aligned_w - safe_width
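The derivation above can be sanity-checked numerically. This is a standalone Python sketch (all names here are illustrative, not the actual swscale variables):

```python
# Illustrative check of the non-offset tail-size derivation.
# `width` is the pass width in pixels, `block_size` the block size in pixels.

def tail_sizes(width, block_size):
    num_blocks = -(-width // block_size)        # ceil(width / block_size)
    aligned_w = num_blocks * block_size         # width rounded up to blocks
    safe_width = (num_blocks - 1) * block_size  # full blocks only
    tail_size = width - safe_width              # pixels left for the tail
    return aligned_w, safe_width, tail_size

aligned_w, safe_width, tail_size = tail_sizes(1000, 16)
# aligned_w - safe_width collapses to exactly one block_size ...
assert aligned_w - safe_width == 16
# ... and tail_size can never exceed it, so no FFMAX is needed
assert tail_size <= aligned_w - safe_width
```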
In the input offset case, we instead realize that the input kernel already
never over-reads the input due to the filter size adjustment/clamping, so
the only thing we need to ensure is that we allocate extra bytes for the
input over-read.
Signed-off-by: Niklas Haas <git@haasn.dev>
The over_read/write fields are not documented as depending on the subsampling
factor. Actually, they are not documented as depending on the plane at all.
If and when we do actually add support for horizontal subsampling to this
code, it will most likely be by turning all of these key variables into
arrays, which will be an upgrade we get basically for free.
Signed-off-by: Niklas Haas <git@haasn.dev>
This makes it far less likely to accidentally add or remove a +7 bias when
repeating this often-used expression.
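Presumably the expression in question is the usual "round bits up to whole bytes" pattern; a sketch of centralizing it in one helper (hypothetical name, shown in Python for brevity):

```python
# Hypothetical helper: keep the +7 bias in exactly one place instead of
# repeating the open-coded expression at every use site.

def bits_to_bytes(bits):
    return (bits + 7) >> 3    # equivalent to ceil(bits / 8)

assert bits_to_bytes(1) == 1  # a single monow pixel still occupies a byte
assert bits_to_bytes(8) == 1
assert bits_to_bytes(9) == 2
```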
Signed-off-by: Niklas Haas <git@haasn.dev>
This could trigger if e.g. a backend tries to operate on monow formats with
a block size that is not a multiple of 8. In this case, `block_size_in`
would previously be miscomputed (to e.g. 0), which is obviously wrong.
Signed-off-by: Niklas Haas <git@haasn.dev>
As well as weird edge cases like trying to filter `monow` with pixels landing
in the middle of a byte. Realistically, this will never happen; we'd instead
pre-process it into something byte-aligned, and then dispatch a byte-aligned
filter on it.
However, I need to add a check for overflow in any case, so we might as well
add the alignment check at the same time. It's basically free.
Signed-off-by: Niklas Haas <git@haasn.dev>
Prevents valgrind from complaining about operating on uninitialized bytes.
This should be cheap as it's only done once during setup().
Signed-off-by: Niklas Haas <git@haasn.dev>
Because the frame pool API doesn't let us directly access the linesize, we
have to "un-translate" the over_read/write back to the nearest multiple of
the pixel size.
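The "un-translate" step is just rounding a byte count up to a whole number of pixels; a minimal sketch (hypothetical helper, assuming rounding up is the safe direction for padding):

```python
# Round a byte count up to the nearest multiple of the pixel size, so a
# byte-granular over_read is expressed as whole pixels for the frame pool.

def round_up(x, mult):
    return (x + mult - 1) // mult * mult

# e.g. an over_read of 5 bytes with 4-byte pixels must grow to 8 bytes
assert round_up(5, 4) == 8
assert round_up(8, 4) == 8
```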
Signed-off-by: Niklas Haas <git@haasn.dev>
Just define these directly as integer arrays; there's really no point in
having them re-use SwsSwizzleOp; the only place this was ever even remotely
relevant was in the no-op check, which any decent compiler should already
be capable of optimizing into a single 32-bit comparison.
Signed-off-by: Niklas Haas <git@haasn.dev>
First, we try compiling the filter pass as-is, in case any backend decides to
handle the filter as a single pass (e.g. Vulkan, which will want to compile it
using internal temporary buffers and barriers).
If that fails, retry with a chained list of split passes.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is cheap to precompute and can be used as-is for gather-style horizontal
filter implementations.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Rather than dispatching the compiled function for each line of the tail
individually, with a memcpy to a shared buffer in between, this instead copies
the entire tail region into a temporary intermediate buffer, processes it with
a single dispatch call, and then copies the entire result back to the
destination.
The main benefit of this is that it enables scaling, subsampling or other
quirky layouts to continue working, which may require accessing lines adjacent
to the main input.
It also arguably makes the code a bit simpler and easier to follow, but YMMV.
One minor consequence of the change in logic is that we also no longer handle
the last row of an unpadded input buffer separately - instead, if *any* row
needs to be padded, *all* rows in the current slice will be padded. This is
a bit less efficient but much more predictable, and, as discussed, effectively
required for scaling/filtering anyway.
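The gather/process/scatter idea can be sketched in miniature (toy Python, hypothetical names; the real code operates on raw plane bytes and a compiled kernel, not Python lists):

```python
# Toy sketch of the new tail regime: gather each row's tail into a scratch
# buffer padded to a whole block, run the normal kernel once over all rows,
# then scatter the results back into place.

def process_tail(rows, tail_start, block_size, kernel):
    scratch = [row[tail_start:] + [0] * (block_size - (len(row) - tail_start))
               for row in rows]
    scratch = kernel(scratch)                  # single dispatch for all rows
    for row, out in zip(rows, scratch):
        row[tail_start:] = out[:len(row) - tail_start]
    return rows

rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
out = process_tail(rows, tail_start=4, block_size=4,
                   kernel=lambda s: [[x * 2 for x in r] for r in s])
assert out == [[1, 2, 3, 4, 10], [6, 7, 8, 9, 20]]
```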
While we could implement some sort of hybrid regime where we only use the new
logic when scaling is needed, I really don't think this would gain us anything
concrete enough to be worth the effort, especially since the performance is
roughly the same across the board:
16 threads:
yuv444p 1920x1080 -> ayuv 1920x1080: speedup=1.000x slower (input memcpy)
rgb24 1920x1080 -> argb 1920x1080: speedup=1.012x faster (output memcpy)
1 thread:
yuv444p 1920x1080 -> ayuv 1920x1080: speedup=1.062x faster (input memcpy)
rgb24 1920x1080 -> argb 1920x1080: speedup=0.959x slower (output memcpy)
Overall speedup is +/- 1% across the board, well within margin of error.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is more useful for tight loops inside CPU backends, which can implement
this by having a shared path for incrementing to the next line (as normal),
and then a separate path for adding an extra position-dependent,
stride-multiplied line offset after each completed line.
As a free upside, this encoding does not require any separate/special handling
for the exec tail.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
And use it to look up the correct source plane line for each destination
line. Needed for vertical scaling, in which case multiple output lines can
reference the same input line.
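A minimal sketch of such a per-output-line lookup, using nearest-neighbour mapping for simplicity (hypothetical helper; the real filter uses proper filter offsets, not this toy mapping):

```python
# Build a table mapping each destination line to its source line. When
# upscaling vertically, several output lines reference the same input line.

def build_line_map(dst_h, src_h):
    return [min(y * src_h // dst_h, src_h - 1) for y in range(dst_h)]

# Upscaling 2 -> 4 lines: output lines share input lines
assert build_line_map(4, 2) == [0, 0, 1, 1]
```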
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Will make more sense in light of the fact that this may not correspond
to the op list actually sent to the backends, due to subpass splitting.
Signed-off-by: Niklas Haas <git@haasn.dev>
If the block size is somehow less than 8, this may round down, leading to
one byte too few being copied (e.g. for monow/rgb4).
This was never an issue for current backends because they all have block sizes
of 8 or larger, but a future platform may have different requirements.
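The failure mode can be shown with an illustrative byte-count calculation (hypothetical names and values):

```python
# A floor division under-counts the bytes spanned by a block when the block
# is not byte-aligned; rounding up never comes out short.

def block_bytes_floor(block_size, bpp):
    return block_size * bpp // 8          # old behaviour: rounds down

def block_bytes_ceil(block_size, bpp):
    return (block_size * bpp + 7) // 8    # fixed: rounds up

# A 4-pixel monow block (1 bpp) spans 4 bits, i.e. one partial byte
assert block_bytes_floor(4, 1) == 0      # one byte too few copied
assert block_bytes_ceil(4, 1) == 1
```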
Signed-off-by: Niklas Haas <git@haasn.dev>
The `memcpy_in` condition is reversed for negative strides, which require a
memcpy() on the *first* line, not the last. Additionally, the check simply
did not work for negative linesizes, because it compared against a negative
stride.
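A sketch of the corrected direction of the check (hypothetical helper; with a negative stride the plane is stored with its top line last in memory, so the first line is the one that borders the edge of the allocation):

```python
# Which line needs the safe memcpy path depends on the stride's sign:
# positive stride -> last line borders the buffer edge,
# negative stride -> first line does.

def needs_memcpy(y, height, stride):
    if stride < 0:
        return y == 0
    return y == height - 1

assert needs_memcpy(0, 10, -1920) is True
assert needs_memcpy(9, 10, 1920) is True
```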
Signed-off-by: Niklas Haas <git@haasn.dev>
Added in commit 00907e1244 to hack around a problem that was caused by
the Vulkan backend's incorrect use of the ops dispatch code, which was fixed
properly in commit 143cb56501.
This logic never made sense to begin with; it was only meant to disable the
memcpy logic for Vulkan specifically.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Or else this might false-positive when we retry compilation after subpass
splitting.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
More useful than just allowing it to "modify" the ops; in practice this means
the contents will be undefined anyway, so we might as well have this function
take care of freeing it afterwards as well.
Will make things simpler with regards to subpass splitting.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Useful for a handful of reasons, including Vulkan (which depends on external
device resources), but also a change I want to make to the tail handling.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
These were abstraction-violating in the first place. Good riddance.
This partially reverts commit c911295f09.
Signed-off-by: Niklas Haas <git@haasn.dev>
Allows compiled functions to opt out of the ops_dispatch execution harness
altogether and just get dispatched directly as the pass run() function.
Useful in particular for Vulkan.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Now that this function returns a status code and takes care of cleanup on
failure, many call-sites can simply return its result directly.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is arguably more convenient for most downstream users, as will become
more apparent in the next commit.
Also allows this code to re-use a pass_free() helper with the graph uninit.
Signed-off-by: Niklas Haas <git@haasn.dev>
This pattern is just common enough that it IMO makes sense to do so.
This will also make more sense after the following commits.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of once at the start of add_convert_pass(). This makes much
more sense in light of the fact that we want to start e.g. splitting
passes apart.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is already called by compile_backend(), and nothing else in this file
depends on accurate values.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This allows distinguishing between different types of failure, e.g.
AVERROR(EINVAL) on invalid pass dimensions.
Signed-off-by: Niklas Haas <git@haasn.dev>
Makes various pieces of code that expect to get a SWS_OP_READ more robust,
and also lets us generalize to more input op types in the future
(in particular, I am looking ahead towards filter ops).
Signed-off-by: Niklas Haas <git@haasn.dev>
This code is self-contained and logically distinct from the ops-related
helpers in ops.c, so it belongs in its own file.
Purely cosmetic; no functional change.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>