Post by Tomasz Figa[CC Marek]
Post by Daniel VetterPost by Christoph HellwigPost by Daniel VetterJust spend a bit of time reading through the implementations already
merged. Is the struct device *dev parameter actually needed anywhere?
dma-api definitely needs it, because we need that to pick the right iommu.
But for cache management from what I've seen the target device doesn't
matter, all the target specific stuff will be handled by the iommu.
It looks like only mips every uses the dev argument, and even there
the function it passes dev to ignores it. So it could probably be removed.
Whether the cache maintenance operation needs to actually do anything
or not is a function of `dev`. We can have some devices that are
coherent with CPU caches, and some that are not, on the same system.
So the thing is that the gpu driver knows this too. It fairly often can
even decide at runtime (through surface state bits or gpu pte bits)
whether to use coherent or non-coherent transactions. dma-api assuming
that a given device is always coherent or non-coherent is one of the
fundamental mismatches we have.
If you otoh need dev because there's some strange bridge caches you need
to flush (I've never seen that, but who knows), that would be a diffeernt
thing. All the bridge flushing I've seen is attached to the iommu though,
so would be really a surprise if the cache management needs that too.
Post by Tomasz FigaPost by Daniel VetterPost by Christoph HellwigPost by Daniel VetterDropping the dev parameter would make this a perfect fit for coherency
management of buffers used by multiple devices. Right now there's all
kinds of nasty tricks for that use cases needed to avoid redundant
flushes.
Note that one thing I'd like to avoid is exposing these funtions directly
to drivers, as that will get us into all kinds of abuses.
What kind of abuse do you expect? It could very well be that gpu folks
call that "standard use case" ... At least on x86 with the i915 driver
we pretty much rely on architectural guarantees for how cache flushes
work very much. Down to userspace doing the cache flushing for
mappings the kernel has set up.
i915 is a very specific case of a fully contained,
architecture-specific hardware subsystem, where you can just hardcode
all integration details inside the driver, because nobody else would
care.
Nah, it's not fully contained, it's just with your arm eyewear on you
can't see how badly it leaks all over the place. The problem is that GPUs
(in many cases at least) want to decide whether and when to do cache
maintenance. We need a combo of iommu and cache maintenance apis that
allows this, across multiple devices, because the dma-api doesn't cut it.
Aside: The cache maintenance api _must_ do the flush when asked to, even
if it believes no cache maintenance is necessary. Even if the default mode
is coherent for a platform/dev combo, the gpu driver might want to do a
non-coherent transaction.
Post by Tomasz FigaIn ARM world, you can have the same IP blocks licensed by multiple SoC
vendors with different integration details and that often includes the
option of coherency.
My experience is that for soc gpus, you need to retune up/download code
for every soc. Or at least every family of socs.
This is painful. I guess thus far the arm soc gpu drivers we have merged
aren't the ones that needed widely different tuning on different soc
families/ranges. What's worse, it's userspace who will decide whether to
use the coherent or non-coherent paths in many cases (that's at least how
it worked since decades on the big gpus, small gpus just lag a few years
usually). The kerenl only provides the tools for the userspace
opengl/vulkan driver to do all the things.
Trying to hide coherent vs. non-coherent like it's done for everyone else
is probably not going to work for gpus. In fact, hasn't ever worked for
gpus thus far, and unlikely to change I think.
-Daniel
Post by Tomasz FigaPost by Daniel VetterPost by Christoph HellwigSo I'd much prefer if we could have iommu APIs wrapping these that are
specific to actual uses cases that we understand well.
As for the buffer sharing: at least for the DMA API side I want to
move the current buffer sharing users away from dma_alloc_coherent
(and coherent dma_alloc_attrs users) and the remapping done in there
required for non-coherent architectures. Instead I'd like to allocate
plain old pages, and then just dma map them for each device separately,
with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
or last user to unmap. On the iommu side it could probably work
similar.
I think this is what's often done. Except then there's also the issue
of how to get at the cma allocator if your device needs something
contiguous. There's a lot of that still going on in graphics/media.
I suppose one could just expose CMA with the default pool directly.
It's just an allocator, so not sure why it would need any
device-specific information.
There is also the use case of using CMA with device-specific pools of
memory reusable by the system when not used by the device and those
would have to somehow get the pool to allocate from, but I wonder if
struct device is the right way to pass such information. I'd see the
pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
and perhaps a wrapper cma_alloc_default(size, flags) that is just a
simple macro calling cma_alloc(&cma_pool_default, size, flags).
Best regards,
Tomasz
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch