[Intel-gfx] intel: Make unsynchronized GTT mappings work on systems with snooping.

Submitted by Kenneth Graunke on March 11, 2017, 1:14 a.m.

Details

Message ID 20170311011432.28962-1-kenneth@whitecape.org
State New
Headers show
Series "intel: Make unsynchronized GTT mappings work on systems with snooping." ( rev: 1 ) in libdrm - Intel

Not browsing as part of any series.

Commit Message

Kenneth Graunke March 11, 2017, 1:14 a.m.
On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
had the surprising behavior of doing a synchronized GTT mapping.
This is obviously not what the user of the API wanted.

Eric left a comment indicating a valid concern: if the CPU and GPU
caches are incoherent, we don't keep track of where the user last
mapped the buffer, and what caches might contain relevant data.

Modern Atom systems still don't have LLC, but they do offer snooping,
which effectively makes the caches coherent.  The kernel appears to
set up the PTE/PPAT to enable snooping for everything where the cache
level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
are marked as uncached.

Any buffers used by scanout should be flagged as non-reusable with
drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
assume that any reusable buffer should be snooped.

This patch enables unsynchronized mappings for reusable buffers
on all Gen6+ hardware (which have either LLC or snooping).

On Broxton, this improves the performance of Unigine Valley 1.0
on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
(same settings) by about 53%.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: mesa-dev@lists.freedesktop.org
---
 intel/intel_bufmgr_gem.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

It looks like Mesa and Beignet are the only callers of this function
(SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)

This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
gnome-shell still works, as does Unigine, and GLBenchmark.

I haven't tested any OpenCL workloads.

Patch hide | download patch | download mbox

diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
index e260f2dc..f53f1fcc 100644
--- a/intel/intel_bufmgr_gem.c
+++ b/intel/intel_bufmgr_gem.c
@@ -1630,9 +1630,7 @@  int
 drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
 {
 	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bo->bufmgr;
-#ifdef HAVE_VALGRIND
 	drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
-#endif
 	int ret;
 
 	/* If the CPU cache isn't coherent with the GTT, then use a
@@ -1641,8 +1639,12 @@  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
 	 * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
 	 * we would potentially corrupt the buffer even when the user
 	 * does reasonable things.
+	 *
+	 * The caches are coherent on LLC platforms or snooping is enabled
+	 * for the BO.  The kernel enables snooping for non-scanout (reusable)
+	 * buffers on modern non-LLC systems.
 	 */
-	if (!bufmgr_gem->has_llc)
+	if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
 		return drm_intel_gem_bo_map_gtt(bo);
 
 	pthread_mutex_lock(&bufmgr_gem->lock);

Comments

On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> had the surprising behavior of doing a synchronized GTT mapping.
> This is obviously not what the user of the API wanted.
> 
> Eric left a comment indicating a valid concern: if the CPU and GPU
> caches are incoherent, we don't keep track of where the user last
> mapped the buffer, and what caches might contain relevant data.

Note this is an issue in libdrm_intel not tracking the cache domain
transitions. Even just a switch between cpu and coherent would solve the
majority of that - the caveat being shared bo where the tracking is
incomplete.
 
> Modern Atom systems still don't have LLC, but they do offer snooping,
> which effectively makes the caches coherent.  The kernel appears to
> set up the PTE/PPAT to enable snooping for everything where the cache
> level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> are marked as uncached.

Byt, bsw beg to differ. I don't have a bxt to know the results of the
igt/kernel tests.
 
> Any buffers used by scanout should be flagged as non-reusable with
> drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> assume that any reusable buffer should be snooped.

Not really, there is no reason why scanout buffers can't be reused.
 
> This patch enables unsynchronized mappings for reusable buffers
> on all Gen6+ hardware (which have either LLC or snooping).
> 
> On Broxton, this improves the performance of Unigine Valley 1.0
> on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> (same settings) by about 53%.

Does anyone have figures for gtt performance on bxt - does it cover over
the same performance penalty from earler atoms? Basically why bother to
enable this over wc mapping (no stalls for a contended, limited
resource) + detiling. (Just note that for detiling Y to WC you need to
use a temporary cacheable page, or rearrange the code to make sure the
reads/writes are in 64 byte chunks.) 

> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  intel/intel_bufmgr_gem.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> It looks like Mesa and Beignet are the only callers of this function
> (SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)
> 
> This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
> gnome-shell still works, as does Unigine, and GLBenchmark.
> 
> I haven't tested any OpenCL workloads.
> 
> diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
> index e260f2dc..f53f1fcc 100644
> --- a/intel/intel_bufmgr_gem.c
> +++ b/intel/intel_bufmgr_gem.c
> @@ -1630,9 +1630,7 @@ int
>  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>  {
>  	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bo->bufmgr;
> -#ifdef HAVE_VALGRIND
>  	drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
> -#endif
>  	int ret;
>  
>  	/* If the CPU cache isn't coherent with the GTT, then use a
> @@ -1641,8 +1639,12 @@ drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>  	 * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
>  	 * we would potentially corrupt the buffer even when the user
>  	 * does reasonable things.
> +	 *
> +	 * The caches are coherent on LLC platforms or snooping is enabled
> +	 * for the BO.  The kernel enables snooping for non-scanout (reusable)
> +	 * buffers on modern non-LLC systems.

gen >= 9; the snoop was reinvented

Not enabled by default on !llc. By default object and PTE are set to
uncached, and mocs default is to follow PTE.

I would just do the unsync map and leave it to the caller to ensure it
is used correctly. Or just use the raw interfaces that leave the domain
tracking to the caller.

>  	 */
> -	if (!bufmgr_gem->has_llc)
> +	if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
>  		return drm_intel_gem_bo_map_gtt(bo);
>  
>  	pthread_mutex_lock(&bufmgr_gem->lock);
> -- 
> 2.12.0
>
On Sun, Mar 12, 2017 at 01:21:12PM +0000, Chris Wilson wrote:
> On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > had the surprising behavior of doing a synchronized GTT mapping.
> > This is obviously not what the user of the API wanted.
> > 
> > Eric left a comment indicating a valid concern: if the CPU and GPU
> > caches are incoherent, we don't keep track of where the user last
> > mapped the buffer, and what caches might contain relevant data.
> 
> Note this is an issue in libdrm_intel not tracking the cache domain
> transitions. Even just a switch between cpu and coherent would solve the
> majority of that - the caveat being shared bo where the tracking is
> incomplete.
>  
> > Modern Atom systems still don't have LLC, but they do offer snooping,
> > which effectively makes the caches coherent.  The kernel appears to
> > set up the PTE/PPAT to enable snooping for everything where the cache
> > level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> > are marked as uncached.
> 
> Byt, bsw beg to differ. I don't have a bxt to know the results of the
> igt/kernel tests.

Just give me a list of the tests to run (and, if any, what patches
to apply and the debugging level you want enabled) and I'll provide
the necessary results.

> > Any buffers used by scanout should be flagged as non-reusable with
> > drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> > assume that any reusable buffer should be snooped.
> 
> Not really, there is no reason why scanout buffers can't be reused.
>  
> > This patch enables unsynchronized mappings for reusable buffers
> > on all Gen6+ hardware (which have either LLC or snooping).
> > 
> > On Broxton, this improves the performance of Unigine Valley 1.0
> > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > (same settings) by about 53%.
> 
> Does anyone have figures for gtt performance on bxt - does it cover over
> the same performance penalty from earler atoms? Basically why bother to
> enable this over wc mapping (no stalls for a contended, limited
> resource) + detiling. (Just note that for detiling Y to WC you need to
> use a temporary cacheable page, or rearrange the code to make sure the
> reads/writes are in 64 byte chunks.) 

Again, I can run any tests you'd like to get numbers from,
just give me a list.


Kind regards, David
On Sunday, March 12, 2017 6:21:12 AM PDT Chris Wilson wrote:
> On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > had the surprising behavior of doing a synchronized GTT mapping.
> > This is obviously not what the user of the API wanted.
> > 
> > Eric left a comment indicating a valid concern: if the CPU and GPU
> > caches are incoherent, we don't keep track of where the user last
> > mapped the buffer, and what caches might contain relevant data.
> 
> Note this is an issue in libdrm_intel not tracking the cache domain
> transitions. Even just a switch between cpu and coherent would solve the
> majority of that - the caveat being shared bo where the tracking is
> incomplete.

Hmm.  Given the API we have today, I suppose we could make libdrm track
that - it just doesn't currently...

>  
> > Modern Atom systems still don't have LLC, but they do offer snooping,
> > which effectively makes the caches coherent.  The kernel appears to
> > set up the PTE/PPAT to enable snooping for everything where the cache
> > level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> > are marked as uncached.
> 
> Byt, bsw beg to differ. I don't have a bxt to know the results of the
> igt/kernel tests.

They do?  I was reading i915_gem_gtt.c, and see in byt_pte_encode:

        if (level != I915_CACHE_NONE)
                pte |= BYT_PTE_SNOOPED_BY_CPU_CACHES;

and also in chv_setup_private_ppat, a whole lot of CHV_PPAT_SNOOP.
It sure looked to me that we were setting up snooping for all non-UC
buffers.

i915_gem_object_is_coherent() also appears to indicate that LLC or
cache_level != I915_CACHE_NONE guarantees coherency.

Am I missing something?

> > Any buffers used by scanout should be flagged as non-reusable with
> > drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> > assume that any reusable buffer should be snooped.
> 
> Not really, there is no reason why scanout buffers can't be reused.

Yes, we could make them reusable.  However, as far as I can tell, libdrm
doesn't consider the cacheability settings and just adds all reusable
BOs back into the cache.

As I understand it, the kernel marks a BO uncached when it's used for
display, and that BO remains uncached forever.  So that would mean that
libdrm's buffer cache would have random uncached buffers in it, which
would be really undesirable.  So I believe libdrm (and its users) mark
all scanout buffers non-reusable.

It would probably be better to check I915_GEM_GET_CACHING !=
I915_CACHING_NONE, but I was hoping to avoid the extra ioctl.

> > This patch enables unsynchronized mappings for reusable buffers
> > on all Gen6+ hardware (which have either LLC or snooping).
> > 
> > On Broxton, this improves the performance of Unigine Valley 1.0
> > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > (same settings) by about 53%.
> 
> Does anyone have figures for gtt performance on bxt - does it cover over
> the same performance penalty from earler atoms? Basically why bother to
> enable this over wc mapping (no stalls for a contended, limited
> resource) + detiling. (Just note that for detiling Y to WC you need to
> use a temporary cacheable page, or rearrange the code to make sure the
> reads/writes are in 64 byte chunks.) 

I'd like to use WC mappings eventually, but to me, fixing the
lack-of-async seems kind of orthogonal to changing GTT vs. WC.
Even if we want to use WC, why not fix this too?

> > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: mesa-dev@lists.freedesktop.org
> > ---
> >  intel/intel_bufmgr_gem.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > It looks like Mesa and Beignet are the only callers of this function
> > (SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)
> > 
> > This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
> > gnome-shell still works, as does Unigine, and GLBenchmark.
> > 
> > I haven't tested any OpenCL workloads.
> > 
> > diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
> > index e260f2dc..f53f1fcc 100644
> > --- a/intel/intel_bufmgr_gem.c
> > +++ b/intel/intel_bufmgr_gem.c
> > @@ -1630,9 +1630,7 @@ int
> >  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
> >  {
> >  	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bo->bufmgr;
> > -#ifdef HAVE_VALGRIND
> >  	drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
> > -#endif
> >  	int ret;
> >  
> >  	/* If the CPU cache isn't coherent with the GTT, then use a
> > @@ -1641,8 +1639,12 @@ drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
> >  	 * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
> >  	 * we would potentially corrupt the buffer even when the user
> >  	 * does reasonable things.
> > +	 *
> > +	 * The caches are coherent on LLC platforms or snooping is enabled
> > +	 * for the BO.  The kernel enables snooping for non-scanout (reusable)
> > +	 * buffers on modern non-LLC systems.
> 
> gen >= 9; the snoop was reinvented

Then I'd really like to know what those BYT and CHV PTE/PPAT bits
mean...

> Not enabled by default on !llc. By default object and PTE are set to
> uncached, and mocs default is to follow PTE.
> 
> I would just do the unsync map and leave it to the caller to ensure it
> is used correctly. Or just use the raw interfaces that leave the domain
> tracking to the caller.

So, drop this altogether and let the caller sort it out?  I suspect that
would be fine, too.  It seems like we want to deprecate (or at least
stop using) these APIs in the long run, in favor of the raw interfaces,
so in the meantime, as long as any changes do the right thing for
current users, it's probably fine.

> >  	 */
> > -	if (!bufmgr_gem->has_llc)
> > +	if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
> >  		return drm_intel_gem_bo_map_gtt(bo);
> >  
> >  	pthread_mutex_lock(&bufmgr_gem->lock);
> 
>
On Sun, Mar 12, 2017 at 03:07:14PM -0700, Kenneth Graunke wrote:
> On Sunday, March 12, 2017 6:21:12 AM PDT Chris Wilson wrote:
> > On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > > had the surprising behavior of doing a synchronized GTT mapping.
> > > This is obviously not what the user of the API wanted.
> > > 
> > > Eric left a comment indicating a valid concern: if the CPU and GPU
> > > caches are incoherent, we don't keep track of where the user last
> > > mapped the buffer, and what caches might contain relevant data.
> > 
> > Note this is an issue in libdrm_intel not tracking the cache domain
> > transitions. Even just a switch between cpu and coherent would solve the
> > majority of that - the caveat being shared bo where the tracking is
> > incomplete.
> 
> Hmm.  Given the API we have today, I suppose we could make libdrm track
> that - it just doesn't currently...
> 
> >  
> > > Modern Atom systems still don't have LLC, but they do offer snooping,
> > > which effectively makes the caches coherent.  The kernel appears to
> > > set up the PTE/PPAT to enable snooping for everything where the cache
> > > level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> > > are marked as uncached.
> > 
> > Byt, bsw beg to differ. I don't have a bxt to know the results of the
> > igt/kernel tests.
> 
> They do?  I was reading i915_gem_gtt.c, and see in byt_pte_encode:
> 
>         if (level != I915_CACHE_NONE)
>                 pte |= BYT_PTE_SNOOPED_BY_CPU_CACHES;
> 
> and also in chv_setup_private_ppat, a whole lot of CHV_PPAT_SNOOP.
> It sure looked to me that we were setting up snooping for all non-UC
> buffers.
> 
> i915_gem_object_is_coherent() also appears to indicate that LLC or
> cache_level != I915_CACHE_NONE guarantees coherency.
> 
> Am I missing something?

On non-llc the default is uncached for all buffers, because it's faster.
Only LLC uses cached by default (to make use of this nice cache), and set
scanout to uncached or write-thru. On byt/bsw you can enable snooping
because that could be faster for gpu->cpu transfers (since it avoids the
clflush while still doing cached reads on the cpu side).

But as Chris said, on all the non-llc machines we've seen the gtt wc is
slows than cpu wc anyway, as long as you can deal with the manual tiling
(and mesa has code for common cases already).

> > > Any buffers used by scanout should be flagged as non-reusable with
> > > drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> > > assume that any reusable buffer should be snooped.
> > 
> > Not really, there is no reason why scanout buffers can't be reused.
> 
> Yes, we could make them reusable.  However, as far as I can tell, libdrm
> doesn't consider the cacheability settings and just adds all reusable
> BOs back into the cache.
> 
> As I understand it, the kernel marks a BO uncached when it's used for
> display, and that BO remains uncached forever.  So that would mean that
> libdrm's buffer cache would have random uncached buffers in it, which
> would be really undesirable.  So I believe libdrm (and its users) mark
> all scanout buffers non-reusable.
> 
> It would probably be better to check I915_GEM_GET_CACHING !=
> I915_CACHING_NONE, but I was hoping to avoid the extra ioctl.

As mentioned, on non-lcc everything is uncached by default. It's how the
platform works.

> > > This patch enables unsynchronized mappings for reusable buffers
> > > on all Gen6+ hardware (which have either LLC or snooping).
> > > 
> > > On Broxton, this improves the performance of Unigine Valley 1.0
> > > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > > (same settings) by about 53%.
> > 
> > Does anyone have figures for gtt performance on bxt - does it cover over
> > the same performance penalty from earler atoms? Basically why bother to
> > enable this over wc mapping (no stalls for a contended, limited
> > resource) + detiling. (Just note that for detiling Y to WC you need to
> > use a temporary cacheable page, or rearrange the code to make sure the
> > reads/writes are in 64 byte chunks.) 
> 
> I'd like to use WC mappings eventually, but to me, fixing the
> lack-of-async seems kind of orthogonal to changing GTT vs. WC.
> Even if we want to use WC, why not fix this too?

Because it's harder :-) Well not harder, in both cases you need to make
sure you have some basic domain tracking in place, and once you have that,
you might as well just opt for the wc mapping and call it a day (since
that is the minor change).
-Daniel

> > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: mesa-dev@lists.freedesktop.org
> > > ---
> > >  intel/intel_bufmgr_gem.c | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > It looks like Mesa and Beignet are the only callers of this function
> > > (SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)
> > > 
> > > This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
> > > gnome-shell still works, as does Unigine, and GLBenchmark.
> > > 
> > > I haven't tested any OpenCL workloads.
> > > 
> > > diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
> > > index e260f2dc..f53f1fcc 100644
> > > --- a/intel/intel_bufmgr_gem.c
> > > +++ b/intel/intel_bufmgr_gem.c
> > > @@ -1630,9 +1630,7 @@ int
> > >  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
> > >  {
> > >  	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bo->bufmgr;
> > > -#ifdef HAVE_VALGRIND
> > >  	drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
> > > -#endif
> > >  	int ret;
> > >  
> > >  	/* If the CPU cache isn't coherent with the GTT, then use a
> > > @@ -1641,8 +1639,12 @@ drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
> > >  	 * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
> > >  	 * we would potentially corrupt the buffer even when the user
> > >  	 * does reasonable things.
> > > +	 *
> > > +	 * The caches are coherent on LLC platforms or snooping is enabled
> > > +	 * for the BO.  The kernel enables snooping for non-scanout (reusable)
> > > +	 * buffers on modern non-LLC systems.
> > 
> > gen >= 9; the snoop was reinvented
> 
> Then I'd really like to know what those BYT and CHV PTE/PPAT bits
> mean...
> 
> > Not enabled by default on !llc. By default object and PTE are set to
> > uncached, and mocs default is to follow PTE.
> > 
> > I would just do the unsync map and leave it to the caller to ensure it
> > is used correctly. Or just use the raw interfaces that leave the domain
> > tracking to the caller.
> 
> So, drop this altogether and let the caller sort it out?  I suspect that
> would be fine, too.  It seems like we want to deprecate (or at least
> stop using) these APIs in the long run, in favor of the raw interfaces,
> so in the meantime, as long as any changes do the right thing for
> current users, it's probably fine.
> 
> > >  	 */
> > > -	if (!bufmgr_gem->has_llc)
> > > +	if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
> > >  		return drm_intel_gem_bo_map_gtt(bo);
> > >  
> > >  	pthread_mutex_lock(&bufmgr_gem->lock);
> > 
> > 
> 



> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Sun, Mar 12, 2017 at 06:19:17PM +0100, David Weinehall wrote:
> On Sun, Mar 12, 2017 at 01:21:12PM +0000, Chris Wilson wrote:
> > On Fri, Mar 10, 2017 at 05:14:32PM -0800, Kenneth Graunke wrote:
> > > On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> > > had the surprising behavior of doing a synchronized GTT mapping.
> > > This is obviously not what the user of the API wanted.
> > > 
> > > Eric left a comment indicating a valid concern: if the CPU and GPU
> > > caches are incoherent, we don't keep track of where the user last
> > > mapped the buffer, and what caches might contain relevant data.
> > 
> > Note this is an issue in libdrm_intel not tracking the cache domain
> > transitions. Even just a switch between cpu and coherent would solve the
> > majority of that - the caveat being shared bo where the tracking is
> > incomplete.
> >  
> > > Modern Atom systems still don't have LLC, but they do offer snooping,
> > > which effectively makes the caches coherent.  The kernel appears to
> > > set up the PTE/PPAT to enable snooping for everything where the cache
> > > level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> > > are marked as uncached.
> > 
> > Byt, bsw beg to differ. I don't have a bxt to know the results of the
> > igt/kernel tests.
> 
> Just give me a list of the tests to run (and, if any, what patches
> to apply and the debugging level you want enabled) and I'll provide
> the necessary results.

The most important result is igt/gem_mmap_gtt/coherency. That tests if a
write through the GTT is immediately visible in the backing storage. (It
should fail...)

To test the proposed used here that GTT + snooping is ok, first requires
disabling the test forbidding GTT + snooping in i915_gem_fault. Then
similar tests to gem_exec_flush or directly from kselftests/coherency
can be used to spot if we need any flushes.

> > > Any buffers used by scanout should be flagged as non-reusable with
> > > drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> > > assume that any reusable buffer should be snooped.
> > 
> > Not really, there is no reason why scanout buffers can't be reused.
> >  
> > > This patch enables unsynchronized mappings for reusable buffers
> > > on all Gen6+ hardware (which have either LLC or snooping).
> > > 
> > > On Broxton, this improves the performance of Unigine Valley 1.0
> > > on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> > > (same settings) by about 53%.
> > 
> > Does anyone have figures for gtt performance on bxt - does it cover over
> > the same performance penalty from earler atoms? Basically why bother to
> > enable this over wc mapping (no stalls for a contended, limited
> > resource) + detiling. (Just note that for detiling Y to WC you need to
> > use a temporary cacheable page, or rearrange the code to make sure the
> > reads/writes are in 64 byte chunks.) 
> 
> Again, I can run any tests you'd like to get numbers from,
> just give me a list.

gem_gtt_speed $obj_size will tell us the relative performance of
untiled/tiled GTT access vs WC/WB.
-Chris
Hi,

On 11.03.2017 03:14, Kenneth Graunke wrote:
> On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
> had the surprising behavior of doing a synchronized GTT mapping.
> This is obviously not what the user of the API wanted.
>
> Eric left a comment indicating a valid concern: if the CPU and GPU
> caches are incoherent, we don't keep track of where the user last
> mapped the buffer, and what caches might contain relevant data.
>
> Modern Atom systems still don't have LLC, but they do offer snooping,
> which effectively makes the caches coherent.  The kernel appears to
> set up the PTE/PPAT to enable snooping for everything where the cache
> level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
> are marked as uncached.
>
> Any buffers used by scanout should be flagged as non-reusable with
> drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
> assume that any reusable buffer should be snooped.
>
> This patch enables unsynchronized mappings for reusable buffers
> on all Gen6+ hardware (which have either LLC or snooping).
>
> On Broxton, this improves the performance of Unigine Valley 1.0
> on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
> (same settings) by about 53%.

I tested it with our normal set of benchmarks.

Using FullHD resolution and "high" quality settings, on Broxton, Valley 
improved by ~11% and Heaven (with tessellation enabled) by 2-3%.

CarChase seemed to improve also by several percents, but everything else 
was within normal variation.

I'll check BYT & BSW too.


	- Eero


> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  intel/intel_bufmgr_gem.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> It looks like Mesa and Beignet are the only callers of this function
> (SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)
>
> This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
> gnome-shell still works, as does Unigine, and GLBenchmark.
>
> I haven't tested any OpenCL workloads.
>
> diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
> index e260f2dc..f53f1fcc 100644
> --- a/intel/intel_bufmgr_gem.c
> +++ b/intel/intel_bufmgr_gem.c
> @@ -1630,9 +1630,7 @@ int
>  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>  {
>  	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bo->bufmgr;
> -#ifdef HAVE_VALGRIND
>  	drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
> -#endif
>  	int ret;
>
>  	/* If the CPU cache isn't coherent with the GTT, then use a
> @@ -1641,8 +1639,12 @@ drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>  	 * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
>  	 * we would potentially corrupt the buffer even when the user
>  	 * does reasonable things.
> +	 *
> +	 * The caches are coherent on LLC platforms or snooping is enabled
> +	 * for the BO.  The kernel enables snooping for non-scanout (reusable)
> +	 * buffers on modern non-LLC systems.
>  	 */
> -	if (!bufmgr_gem->has_llc)
> +	if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
>  		return drm_intel_gem_bo_map_gtt(bo);
>
>  	pthread_mutex_lock(&bufmgr_gem->lock);
>
Hi,

On 14.03.2017 12:48, Eero Tamminen wrote:
> On 11.03.2017 03:14, Kenneth Graunke wrote:
>> On systems without LLC, drm_intel_gem_bo_map_unsynchronized() has
>> had the surprising behavior of doing a synchronized GTT mapping.
>> This is obviously not what the user of the API wanted.
>>
>> Eric left a comment indicating a valid concern: if the CPU and GPU
>> caches are incoherent, we don't keep track of where the user last
>> mapped the buffer, and what caches might contain relevant data.
>>
>> Modern Atom systems still don't have LLC, but they do offer snooping,
>> which effectively makes the caches coherent.  The kernel appears to
>> set up the PTE/PPAT to enable snooping for everything where the cache
>> level is not I915_CACHE_NONE.  As far as I know, only scanout buffers
>> are marked as uncached.
>>
>> Any buffers used by scanout should be flagged as non-reusable with
>> drm_intel_bo_disable_reuse(), prime export, or flink.  So, we can
>> assume that any reusable buffer should be snooped.
>>
>> This patch enables unsynchronized mappings for reusable buffers
>> on all Gen6+ hardware (which have either LLC or snooping).
>>
>> On Broxton, this improves the performance of Unigine Valley 1.0
>> on Low settings at 1280x720 by about 45%, and Unigine Heaven 4.0
>> (same settings) by about 53%.
>
> I tested it with our normal set of benchmarks.
>
> Using FullHD resolution and "high" quality settings, on Broxton, Valley
> improved by ~11% and Heaven (with tessellation enabled) by 2-3%.

BSW: Valley +10%, Heaven +4%.  Blend showed regression, but it could 
still be within variance.

BYT: Valley +5%, Heaven +11%.  Rest of changes were within normal variance.


	- Eero

> CarChase seemed to improve also by several percents, but everything else
> was within normal variation.
>
> I'll check BYT & BSW too.
>
>
>     - Eero
>
>
>> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: mesa-dev@lists.freedesktop.org
>> ---
>>  intel/intel_bufmgr_gem.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> It looks like Mesa and Beignet are the only callers of this function
>> (SNA and Anvil don't use libdrm, UXA and vaapi don't use this function.)
>>
>> This passed our full barrage of Piglit/dEQP/GLCTS/GLESCTS testing.
>> gnome-shell still works, as does Unigine, and GLBenchmark.
>>
>> I haven't tested any OpenCL workloads.
>>
>> diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
>> index e260f2dc..f53f1fcc 100644
>> --- a/intel/intel_bufmgr_gem.c
>> +++ b/intel/intel_bufmgr_gem.c
>> @@ -1630,9 +1630,7 @@ int
>>  drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>>  {
>>      drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *)
>> bo->bufmgr;
>> -#ifdef HAVE_VALGRIND
>>      drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *) bo;
>> -#endif
>>      int ret;
>>
>>      /* If the CPU cache isn't coherent with the GTT, then use a
>> @@ -1641,8 +1639,12 @@
>> drm_intel_gem_bo_map_unsynchronized(drm_intel_bo *bo)
>>       * terms of drm_intel_bo_map vs drm_intel_gem_bo_map_gtt, so
>>       * we would potentially corrupt the buffer even when the user
>>       * does reasonable things.
>> +     *
>> +     * The caches are coherent on LLC platforms or snooping is enabled
>> +     * for the BO.  The kernel enables snooping for non-scanout
>> (reusable)
>> +     * buffers on modern non-LLC systems.
>>       */
>> -    if (!bufmgr_gem->has_llc)
>> +    if (bufmgr_gem->gen < 6 || !bo_gem->reusable)
>>          return drm_intel_gem_bo_map_gtt(bo);
>>
>>      pthread_mutex_lock(&bufmgr_gem->lock);
>>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev