drm/i915/ringbuffer: Delay after EMIT_INVALIDATE for gen4/gen5

Submitted by Chris Wilson on Nov. 5, 2018, 9:43 a.m.

Details

Message ID 20181105094305.5767-1-chris@chris-wilson.co.uk
State Accepted
Series "drm/i915/ringbuffer: Delay after EMIT_INVALIDATE for gen4/gen5"
Commit fb5bbae9b1333d44023713946fdd28db0cd85751

Commit Message

Chris Wilson Nov. 5, 2018, 9:43 a.m.
Exercising the gpu reloc path strenuously revealed an issue where the
updated relocations (from MI_STORE_DWORD_IMM) were not being observed
upon execution. After some experiments with adding pipecontrols (a lot
of pipecontrols (32), as gen4/5 do not have a bit to wait on earlier pipe
controls or even the current one), it was discovered that we merely
needed to delay the EMIT_INVALIDATE by several flushes. It is important
to note that it is the EMIT_INVALIDATE, as opposed to the EMIT_FLUSH, that
needs the delay -- contrary to what one might first expect. That is, the
delay is required for the TLB invalidation to take effect (one presumes,
to purge any CS buffers), rather than being a delay after flushing to
ensure the writes have landed before triggering the invalidation.

Testcase: igt/gem_tiled_fence_blits
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 38 +++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)


diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index b8a7a014d46d..87eebc13c0d8 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -91,6 +91,7 @@  static int
 gen4_render_ring_flush(struct i915_request *rq, u32 mode)
 {
 	u32 cmd, *cs;
+	int i;
 
 	/*
 	 * read/write caches:
@@ -127,12 +128,45 @@  gen4_render_ring_flush(struct i915_request *rq, u32 mode)
 			cmd |= MI_INVALIDATE_ISP;
 	}
 
-	cs = intel_ring_begin(rq, 2);
+	i = 2;
+	if (mode & EMIT_INVALIDATE)
+		i += 20;
+
+	cs = intel_ring_begin(rq, i);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
 	*cs++ = cmd;
-	*cs++ = MI_NOOP;
+
+	/*
+	 * A random delay to let the CS invalidate take effect? Without this
+	 * delay, the GPU relocation path fails as the CS does not see
+	 * the updated contents. Just as important, if we apply the flushes
+	 * to the EMIT_FLUSH branch (i.e. immediately after the relocation
+	 * write and before the invalidate on the next batch), the relocations
 +	 * still fail. This implies that it is a delay following invalidation
+	 * that is required to reset the caches as opposed to a delay to
+	 * ensure the memory is written.
+	 */
+	if (mode & EMIT_INVALIDATE) {
+		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
+		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
+			PIPE_CONTROL_GLOBAL_GTT;
+		*cs++ = 0;
+		*cs++ = 0;
+
+		for (i = 0; i < 12; i++)
+			*cs++ = MI_FLUSH;
+
+		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
+		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
+			PIPE_CONTROL_GLOBAL_GTT;
+		*cs++ = 0;
+		*cs++ = 0;
+	}
+
+	*cs++ = cmd;
+
 	intel_ring_advance(rq, cs);
 
 	return 0;

Comments

Ville Syrjälä Nov. 7, 2018, 3:04 p.m.
On Mon, Nov 05, 2018 at 09:43:05AM +0000, Chris Wilson wrote:
> [commit message and preceding diff snipped]
>  	*cs++ = cmd;
> -	*cs++ = MI_NOOP;
> +
> +	/*
> +	 * A random delay to let the CS invalidate take effect? Without this
> +	 * delay, the GPU relocation path fails as the CS does not see
> +	 * the updated contents. Just as important, if we apply the flushes
> +	 * to the EMIT_FLUSH branch (i.e. immediately after the relocation
> +	 * write and before the invalidate on the next batch), the relocations
> +	 * still fail. This implies that it is a delay following invalidation
> +	 * that is required to reset the caches as opposed to a delay to
> +	 * ensure the memory is written.
> +	 */
> +	if (mode & EMIT_INVALIDATE) {
> +		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
> +		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
> +			PIPE_CONTROL_GLOBAL_GTT;
> +		*cs++ = 0;
> +		*cs++ = 0;
> +
> +		for (i = 0; i < 12; i++)
> +			*cs++ = MI_FLUSH;
> +
> +		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
> +		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
> +			PIPE_CONTROL_GLOBAL_GTT;
> +		*cs++ = 0;
> +		*cs++ = 0;
> +	}

This smells a lot like the snb a/b w/a, except there the spec says to
use 8 STORE_DWORDS. I suppose the choice of a specific command isn't
critical, and it's just a matter of stuffing the pipeline with something
that takes long enough to let the TLB invalidate finish?

Anyways, patch itself seems as reasonable as one might expect for an
issue like this.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

> +
> +	*cs++ = cmd;
> +
>  	intel_ring_advance(rq, cs);
>  
>  	return 0;
> -- 
> 2.19.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Nov. 7, 2018, 3:12 p.m.
Quoting Ville Syrjälä (2018-11-07 15:04:24)
> On Mon, Nov 05, 2018 at 09:43:05AM +0000, Chris Wilson wrote:
> > [patch snipped]
> 
> This smells a lot like the snb a/b w/a, except there the spec says to
> use 8 STORE_DWORDS.

Yeah, the similarity wasn't lost on me, except that w/a is to cover the
coherency aspect of the writes not being flushed. This feels a bit
fishier in that the experiments indicate it's an issue on the
invalidation path as opposed to flushing the writes.

And there's the other w/a of using umpteen pipecontrols to get around
the lack of PIPE_CONTROL_FLUSH.

> I suppose the choice of a specific command isn't
> critical, and it's just a matter of stuffing the pipeline with something
> that takes long enough to let the TLB invalidate finish?

Except that the MI_FLUSHes are more effective in fewer numbers than
PIPE_CONTROLs -- probably because each one translates to a few pipe
controls or something similarly heavy.

> Anyways, patch itself seems as reasonable as one might expect for an
> issue like this.

As nasty as one would expect.

For the record, we are not entirely out of danger. gem_exec_whisper
continues to indicate a problem, but one step at a time. (I haven't yet
found quite what's upsetting it, except that if we do each batch
synchronously and verify each one, it's happy.)
-Chris