[v3,1/1] drm: msm: Replace dma_map_sg with dma_sync_sg*

Submitted by Vivek Gautam on Nov. 29, 2018, 2:03 p.m.

Details

Message ID 20181129140315.28476-1-vivek.gautam@codeaurora.org
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in DRI devel

Not browsing as part of any series.

Commit Message

Vivek Gautam Nov. 29, 2018, 2:03 p.m.
dma_map_sg() expects a DMA domain. However, the drm devices
have been traditionally using unmanaged iommu domain which
is non-dma type. Using dma mapping APIs with that domain is bad.

Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
to do the cache maintenance.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
Suggested-by: Tomasz Figa <tfiga@chromium.org>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jordan Crouse <jcrouse@codeaurora.org>
Cc: Sean Paul <seanpaul@chromium.org>
---

Changes since v2:
 - Addressed Tomasz's comment to keep DMA_BIDIRECTIONAL dma direction
   flag intact.
 - Updated comment for sg's dma-address assignment as per Tomasz'
   suggestion.

 drivers/gpu/drm/msm/msm_gem.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 00c795ced02c..7048e9fe00c6 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -81,6 +81,8 @@  static struct page **get_pages(struct drm_gem_object *obj)
 		struct drm_device *dev = obj->dev;
 		struct page **p;
 		int npages = obj->size >> PAGE_SHIFT;
+		struct scatterlist *s;
+		int i;
 
 		if (use_pages(obj))
 			p = drm_gem_get_pages(obj);
@@ -104,12 +106,23 @@  static struct page **get_pages(struct drm_gem_object *obj)
 			return ptr;
 		}
 
-		/* For non-cached buffers, ensure the new pages are clean
+		/*
+		 * Some implementations of the DMA mapping ops expect
+		 * physical addresses of the pages to be stored as DMA
+		 * addresses of the sglist entries. To work around it,
+		 * set them here explicitly.
+		 */
+		for_each_sg(msm_obj->sgt->sgl, s, msm_obj->sgt->nents, i)
+			sg_dma_address(s) = sg_phys(s);
+
+		/*
+		 * For non-cached buffers, ensure the new pages are clean
 		 * because display controller, GPU, etc. are not coherent:
 		 */
-		if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-			dma_map_sg(dev->dev, msm_obj->sgt->sgl,
-					msm_obj->sgt->nents, DMA_BIDIRECTIONAL);
+		if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+			dma_sync_sg_for_device(dev->dev, msm_obj->sgt->sgl,
+					       msm_obj->sgt->nents,
+					       DMA_BIDIRECTIONAL);
 	}
 
 	return msm_obj->pages;
@@ -133,14 +146,16 @@  static void put_pages(struct drm_gem_object *obj)
 
 	if (msm_obj->pages) {
 		if (msm_obj->sgt) {
-			/* For non-cached buffers, ensure the new
+			/*
+			 * For non-cached buffers, ensure the new
 			 * pages are clean because display controller,
 			 * GPU, etc. are not coherent:
 			 */
-			if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-				dma_unmap_sg(obj->dev->dev, msm_obj->sgt->sgl,
-					     msm_obj->sgt->nents,
-					     DMA_BIDIRECTIONAL);
+			if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+				dma_sync_sg_for_cpu(obj->dev->dev,
+						    msm_obj->sgt->sgl,
+						    msm_obj->sgt->nents,
+						    DMA_BIDIRECTIONAL);
 
 			sg_free_table(msm_obj->sgt);
 			kfree(msm_obj->sgt);

Comments

On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote:
> [CC Marek]
> 
> On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> > > > Just spend a bit of time reading through the implementations already
> > > > merged. Is the struct device *dev parameter actually needed anywhere?
> > > > dma-api definitely needs it, because we need that to pick the right iommu.
> > > > But for cache management from what I've seen the target device doesn't
> > > > matter, all the target specific stuff will be handled by the iommu.
> > >
> > > It looks like only mips every uses the dev argument, and even there
> > > the function it passes dev to ignores it.  So it could probably be removed.
> > >
> 
> Whether the cache maintenance operation needs to actually do anything
> or not is a function of `dev`. We can have some devices that are
> coherent with CPU caches, and some that are not, on the same system.

So the thing is that the gpu driver knows this too. It fairly often can
even decide at runtime (through surface state bits or gpu pte bits)
whether to use coherent or non-coherent transactions. dma-api assuming
that a given device is always coherent or non-coherent is one of the
fundamental mismatches we have.

If you otoh need dev because there's some strange bridge caches you need
to flush (I've never seen that, but who knows), that would be a diffeernt
thing. All the bridge flushing I've seen is attached to the iommu though,
so would be really a surprise if the cache management needs that too.
 
> > > >
> > > > Dropping the dev parameter would make this a perfect fit for coherency
> > > > management of buffers used by multiple devices. Right now there's all
> > > > kinds of nasty tricks for that use cases needed to avoid redundant
> > > > flushes.
> > >
> > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > to drivers, as that will get us into all kinds of abuses.
> >
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
> 
> i915 is a very specific case of a fully contained,
> architecture-specific hardware subsystem, where you can just hardcode
> all integration details inside the driver, because nobody else would
> care.

Nah, it's not fully contained, it's just with your arm eyewear on you
can't see how badly it leaks all over the place. The problem is that GPUs
(in many cases at least) want to decide whether and when to do cache
maintenance. We need a combo of iommu and cache maintenance apis that
allows this, across multiple devices, because the dma-api doesn't cut it.

Aside: The cache maintenance api _must_ do the flush when asked to, even
if it believes no cache maintenance is necessary. Even if the default mode
is coherent for a platform/dev combo, the gpu driver might want to do a
non-coherent transaction.

> In ARM world, you can have the same IP blocks licensed by multiple SoC
> vendors with different integration details and that often includes the
> option of coherency.

My experience is that for soc gpus, you need to retune up/download code
for every soc. Or at least every family of socs.

This is painful. I guess thus far the arm soc gpu drivers we have merged
aren't the ones that needed widely different tuning on different soc
families/ranges. What's worse, it's userspace who will decide whether to
use the coherent or non-coherent paths in many cases (that's at least how
it worked since decades on the big gpus, small gpus just lag a few years
usually). The kerenl only provides the tools for the userspace
opengl/vulkan driver to do all the things.

Trying to hide coherent vs. non-coherent like it's done for everyone else
is probably not going to work for gpus. In fact, hasn't ever worked for
gpus thus far, and unlikely to change I think.
-Daniel

> > > So I'd much prefer if we could have iommu APIs wrapping these that are
> > > specific to actual uses cases that we understand well.
> > >
> > > As for the buffer sharing: at least for the DMA API side I want to
> > > move the current buffer sharing users away from dma_alloc_coherent
> > > (and coherent dma_alloc_attrs users) and the remapping done in there
> > > required for non-coherent architectures.  Instead I'd like to allocate
> > > plain old pages, and then just dma map them for each device separately,
> > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > > or last user to unmap.  On the iommu side it could probably work
> > > similar.
> >
> > I think this is what's often done. Except then there's also the issue
> > of how to get at the cma allocator if your device needs something
> > contiguous. There's a lot of that still going on in graphics/media.
> 
> I suppose one could just expose CMA with the default pool directly.
> It's just an allocator, so not sure why it would need any
> device-specific information.
> 
> There is also the use case of using CMA with device-specific pools of
> memory reusable by the system when not used by the device and those
> would have to somehow get the pool to allocate from, but I wonder if
> struct device is the right way to pass such information. I'd see the
> pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
> and perhaps a wrapper cma_alloc_default(size, flags) that is just a
> simple macro calling cma_alloc(&cma_pool_default, size, flags).
> 
> Best regards,
> Tomasz
On Thu, Nov 29, 2018 at 08:15:23PM -0500, Rob Clark wrote:
> As far as hiding cache ops behind iommu layer, I guess I'd been
> thinking more of just letting the drivers that want to bypass dma
> layer take things into their own hands.. tbh I think/assume/hope
> drm/msm is more the exception than the rule as far as needing to
> bypass the dma layer.  Or at least I guess the # of drivers having
> problems w/ the dma layer is less than the # of iommu drivers.

So the whole bypass thing has already been a contentious discussion
in the past.  One thing that the API aready enforces is that we pass
a DMA direction, which I want to keep.  The other is that we need
a struct device (or something similiar) that is used to check if the
device is cache coherent or not.  In your MSM case you might know that,
but in general it really is a platform issue that I don't want every
driver to know about.

> (Not sure if that changes my thoughts on this patch, it isn't like
> what this patch replaces isn't also a problematic hack around the
> inability to bypass the dma layer.  In the short term I just want
> *something* that works, I'm totally happy to refactor later when there
> are better options.)

The current patch is simply broken.  You can't just construct your
own S/G list and pass it to the DMA API, and we enforce that very
strictly with dma debug enabled.

So your only options are: a) actually use the DMA API for creating
the mapping, by e.g. using dma_direct_ops ontop of an actual IOMMU
managed in the background, or b) figure out a way to do cache
maintainance for raw IOMMU API drivers.
On Thu, Nov 29, 2018 at 01:57:38PM -0500, Rob Clark wrote:
> On Thu, Nov 29, 2018 at 12:24 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > [CC Marek]
> >
> > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > > to drivers, as that will get us into all kinds of abuses.
> > >
> > > What kind of abuse do you expect? It could very well be that gpu folks
> > > call that "standard use case" ... At least on x86 with the i915 driver
> > > we pretty much rely on architectural guarantees for how cache flushes
> > > work very much. Down to userspace doing the cache flushing for
> > > mappings the kernel has set up.
> >
> > i915 is a very specific case of a fully contained,
> > architecture-specific hardware subsystem, where you can just hardcode
> > all integration details inside the driver, because nobody else would
> > care.
> >
> > In ARM world, you can have the same IP blocks licensed by multiple SoC
> > vendors with different integration details and that often includes the
> > option of coherency.
> 
> fwiw, I believe all the GPU IP blocks that are used across multiple
> SoCs have their own GPU MMU (potentially in addition to an iommu?).
> So the dma-api is a much better fit for them..  drm/msm is a lot
> closer to drm/i915 scenario, so I don't so much care if the solution
> to our unique problem isn't something that would work for other
> drivers ;-)

Right now maybe, but I fully except the entire coherent vs. non-coherent
transactions hilarity that we have on the bigger intel socs since a few
years already to trickle down into smaller (arm based socs) eventually. I
think Apple is already there since a few generations.

So maybe we won't have to fight the iommu side of the dma-api anymore
on these, but we'll still have to fight the cache maintenance side of
dma-api. You can tell the dma-api to not flush, but then you don't have
any other way to actually flush that's acceptable for arch/arm (on x86 we
just run clflush in userspace and call it a day).
-Daniel
On Fri, Nov 30, 2018 at 10:35:27AM +0100, Daniel Vetter wrote:
> > Whether the cache maintenance operation needs to actually do anything
> > or not is a function of `dev`. We can have some devices that are
> > coherent with CPU caches, and some that are not, on the same system.
> 
> So the thing is that the gpu driver knows this too. It fairly often can
> even decide at runtime (through surface state bits or gpu pte bits)
> whether to use coherent or non-coherent transactions. dma-api assuming
> that a given device is always coherent or non-coherent is one of the
> fundamental mismatches we have.
> 
> If you otoh need dev because there's some strange bridge caches you need
> to flush (I've never seen that, but who knows), that would be a diffeernt
> thing. All the bridge flushing I've seen is attached to the iommu though,
> so would be really a surprise if the cache management needs that too.

Strange bridge caches aren't the problem.  Outside of magic components
like SOCs integrated GPUs the issue is that a platform can wire up a
PCIe/AXI/etc bus either so that it is cache coherent, or not cache
coherent (does not snooping).  Drivers need to use the full DMA API
include dma_sync_* to cater for the non-coherent case, which will turn
into no-ops if DMA is coherent.

Now PCIe now has unsnooped transactions, which can be non-coherent even
if the bus would otherwise be coherent.  We have so far very much
ignored those in Linux (at least Linux in general, I know you guys have
some driver-local hacks), but if that use case, or similar ones for GPUs
on SOCs become common we need to find a solution.

One of the major issues here is that architectures that always are
DMA coherent might not even have the cache flushing instructions, or even
if they do we have not wired them up in the DMA code as we didn't need
them.

So what we'd need to support this properly is to do something like:

 - add new arch hook that allows an architecture to say it supports
   optional non-coherent at the arch level, and for a given device
 - wire up arch_sync_dma_for_{device,cpu} for those architectures that
   define it if they don't currently have it (e.g. x86)
 - add a new DMA_ATTR_* flag to opt into cache flushing even if the
   device declares it is otherwise coherent

And I'd really like to see that work driven by an actual user.
On Thu, Nov 29, 2018 at 07:33:34PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 06:09:05PM +0100, Daniel Vetter wrote:
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
> 
> Mostly the usual bypasses of the DMA API because people know better
> (and with that I don't mean low-level IOMMU API users, but "creative"
> direct mappings).

Ah yes. I think that even gpu folks get behind :-) Best if someone
bothered to explicitly cast to dma_addr_t to shut up the tools, but not
fix the bug ...

> > > As for the buffer sharing: at least for the DMA API side I want to
> > > move the current buffer sharing users away from dma_alloc_coherent
> > > (and coherent dma_alloc_attrs users) and the remapping done in there
> > > required for non-coherent architectures.  Instead I'd like to allocate
> > > plain old pages, and then just dma map them for each device separately,
> > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > > or last user to unmap.  On the iommu side it could probably work
> > > similar.
> > 
> > I think this is what's often done. Except then there's also the issue
> > of how to get at the cma allocator if your device needs something
> > contiguous. There's a lot of that still going on in graphics/media.
> 
> Being able to dip into CMA and mayb iommu coalescing if we want to
> get fancy is indeed the only reason for this API.  If we just wanted
> to map pages we could already do that now with just a little bit
> of boilerplate code (and quite a few drivers do - just adding this
> new API will remove tons of code).

Sounds like the future is very bright indeed \o/
-Daniel
On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > >>
> > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > >>>>
> > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > >>>>> compatible strings for devices which should skip the automatic
> > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > >>>>> need to control the domain/context-bank that is allocated by the dma
> > >>>>> api).
> > >>>>
> > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > >>>> which a few drm drivers do, but I don't think this is the problem.
> > >>>
> > >>> I think even with detach, we wouldn't end up with the context-bank
> > >>> that the gpu firmware was hard-coded to expect, and so it would
> > >>> overwrite the incorrect page table address register.  (I could be
> > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > >>> was something along those lines.)
> > >>
> > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > >> to use that context bank for pagetable switching.
> > >>
> > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > >> bank.
> >
> > AIUI, they don't need a specific hardware context, they just need to
> > know which one they're actually using, which the domain abstraction hides.
> >
> > > Wait, if we detach the GPU/display struct device from the default
> > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > allocated domain use the context bank we need? Note that we're already
> >
> > The arm-smmu driver doesn't, but there's no fundamental reason it
> > couldn't. That should just need code to refcount domain users and
> > release hardware contexts for domains with no devices currently attached.
> >
> > Robin.
> >
> > > doing that, except that we're doing it behind the back of the DMA
> > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > ops for the device and doing any mapping operations on the default
> > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > solve the problem?
>
> Thanks Robin.
>
> Still, my point is that the MSM DRM driver attaches the GPU struct
> device to a new domain it allocates using iommu_domain_alloc() and it
> seems to work fine, so I believe it's not the problem we're looking
> into with this patch.

Could we just make the MSM DRM call arch_teardown_dma_ops() and then
arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
with it?

Best regards,
Tomasz
On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > > >>
> > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > > >>>>
> > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > > >>>>> compatible strings for devices which should skip the automatic
> > > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > > >>>>> need to control the domain/context-bank that is allocated by the dma
> > > >>>>> api).
> > > >>>>
> > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > > >>>> which a few drm drivers do, but I don't think this is the problem.
> > > >>>
> > > >>> I think even with detach, we wouldn't end up with the context-bank
> > > >>> that the gpu firmware was hard-coded to expect, and so it would
> > > >>> overwrite the incorrect page table address register.  (I could be
> > > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > > >>> was something along those lines.)
> > > >>
> > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > > >> to use that context bank for pagetable switching.
> > > >>
> > > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > > >> bank.
> > >
> > > AIUI, they don't need a specific hardware context, they just need to
> > > know which one they're actually using, which the domain abstraction hides.
> > >
> > > > Wait, if we detach the GPU/display struct device from the default
> > > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > > allocated domain use the context bank we need? Note that we're already
> > >
> > > The arm-smmu driver doesn't, but there's no fundamental reason it
> > > couldn't. That should just need code to refcount domain users and
> > > release hardware contexts for domains with no devices currently attached.
> > >
> > > Robin.
> > >
> > > > doing that, except that we're doing it behind the back of the DMA
> > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > > ops for the device and doing any mapping operations on the default
> > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > > solve the problem?
> >
> > Thanks Robin.
> >
> > Still, my point is that the MSM DRM driver attaches the GPU struct
> > device to a new domain it allocates using iommu_domain_alloc() and it
> > seems to work fine, so I believe it's not the problem we're looking
> > into with this patch.
>
> Could we just make the MSM DRM call arch_teardown_dma_ops() and then
> arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
> with it?

I don't think those are exported to modules?

I have actually a simpler patch, that adds a small blacklist to check
in of_dma_configure() before calling arch_setup_dma_ops(), which can
replace this patch.  It also solves the problem of dma api allocating
the context bank that he gpu wants to use for context-switching, and
should be a simple thing to backport to stable branches.

I was just spending some time trying to figure out what changed
recently to start causing dma_map_sg() to opps on boot for us, so I
could write a more detailed commit msg.

BR,
-R
On Sat, Dec 1, 2018 at 3:47 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > > >
> > > > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > > > >>
> > > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > >>>>
> > > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > > > >>>>> compatible strings for devices which should skip the automatic
> > > > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > > > >>>>> need to control the domain/context-bank that is allocated by the dma
> > > > >>>>> api).
> > > > >>>>
> > > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > > > >>>> which a few drm drivers do, but I don't think this is the problem.
> > > > >>>
> > > > >>> I think even with detach, we wouldn't end up with the context-bank
> > > > >>> that the gpu firmware was hard-coded to expect, and so it would
> > > > >>> overwrite the incorrect page table address register.  (I could be
> > > > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > > > >>> was something along those lines.)
> > > > >>
> > > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > > > >> to use that context bank for pagetable switching.
> > > > >>
> > > > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > > > >> bank.
> > > >
> > > > AIUI, they don't need a specific hardware context, they just need to
> > > > know which one they're actually using, which the domain abstraction hides.
> > > >
> > > > > Wait, if we detach the GPU/display struct device from the default
> > > > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > > > allocated domain use the context bank we need? Note that we're already
> > > >
> > > > The arm-smmu driver doesn't, but there's no fundamental reason it
> > > > couldn't. That should just need code to refcount domain users and
> > > > release hardware contexts for domains with no devices currently attached.
> > > >
> > > > Robin.
> > > >
> > > > > doing that, except that we're doing it behind the back of the DMA
> > > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > > > ops for the device and doing any mapping operations on the default
> > > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > > > solve the problem?
> > >
> > > Thanks Robin.
> > >
> > > Still, my point is that the MSM DRM driver attaches the GPU struct
> > > device to a new domain it allocates using iommu_domain_alloc() and it
> > > seems to work fine, so I believe it's not the problem we're looking
> > > into with this patch.
> >
> > Could we just make the MSM DRM call arch_teardown_dma_ops() and then
> > arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
> > with it?
>
> I don't think those are exported to modules?
>

Indeed, if we compile MSM DRM as modules, it wouldn't work...

> I have actually a simpler patch, that adds a small blacklist to check
> in of_dma_configure() before calling arch_setup_dma_ops(), which can
> replace this patch.  It also solves the problem of dma api allocating
> the context bank that he gpu wants to use for context-switching, and
> should be a simple thing to backport to stable branches.
>
> I was just spending some time trying to figure out what changed
> recently to start causing dma_map_sg() to opps on boot for us, so I
> could write a more detailed commit msg.

Yeah, that sounds much better, thanks. Reviewed that patch.

Best regards,
Tomasz
On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote:
> > Being able to dip into CMA and maybe iommu coalescing if we want to
> > get fancy is indeed the only reason for this API.  If we just wanted
> > to map pages we could already do that now with just a little bit
> > of boilerplate code (and quite a few drivers do - just adding this
> > new API will remove tons of code).
> 
> Sounds like the future is very bright indeed \o/

So, I spent some time with this and instead of a new API I think
it makes sure we have DMA_ATTR_NON_CONSISTENT consistently available
and with well defined semantics, that is virt_to_page on the return
value works, it is contiguos and we can use dma_sync_single_for_cpu
and dma_sync_single_for_device for ownership tranfers.

Can you graphics folks check if this:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator

is something to work with?  Especially to get rid of the horrible
dma_get_sgtable hacks?
On Thu, Dec 6, 2018 at 8:38 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote:
> > > Being able to dip into CMA and maybe iommu coalescing if we want to
> > > get fancy is indeed the only reason for this API.  If we just wanted
> > > to map pages we could already do that now with just a little bit
> > > of boilerplate code (and quite a few drivers do - just adding this
> > > new API will remove tons of code).
> >
> > Sounds like the future is very bright indeed \o/
>
> So, I spent some time with this and instead of a new API I think
> it makes sure we have DMA_ATTR_NON_CONSISTENT consistently available
> and with well defined semantics, that is virt_to_page on the return
> value works, it is contiguos and we can use dma_sync_single_for_cpu
> and dma_sync_single_for_device for ownership tranfers.
>
> Can you graphics folks check if this:
>
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator
>
> is something to work with?  Especially to get rid of the horrible
> dma_get_sgtable hacks?

I'm not sure I really have strong opinion about this.  For some of the
display-only SoC drm drivers which use the CMA helpers, it might be
useful.

For the drm drivers for real GPUs, we aren't really using the dma api
to allocate in the first place.  (And the direction is towards more
support for userptr allocations)

BR,
-R