drm/i915/execlists: Clear STOP_RING bit on reset

Submitted by Chris Wilson on Sept. 10, 2019, 7:57 a.m.

Details

Message ID 20190910075738.26919-1-chris@chris-wilson.co.uk
State New
Headers show
Series "drm/i915/execlists: Clear STOP_RING bit on reset" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Chris Wilson Sept. 10, 2019, 7:57 a.m.
During reset, we try to ensure no forward progress of the CS prior to
the reset by setting the STOP_RING bit in RING_MI_MODE. Since gen9, this
register is context saved and do we end up in the odd situation where we
save the STOP_RING bit and so try to stop the engine again immediately
upon resume. This is quite unexpected and causes us to complain about an
early CS completion event!

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111514
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c     | 10 ++++++++++
 drivers/gpu/drm/i915/gt/intel_lrc_reg.h |  2 ++
 2 files changed, 12 insertions(+)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index f073aea6a1fe..761f5cbd90d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2368,6 +2368,15 @@  static struct i915_request *active_request(struct i915_request *rq)
 	return active;
 }
 
+static void __execlists_reset_reg_state(const struct intel_context *ce,
+					const struct intel_engine_cs *engine)
+{
+	u32 *regs = ce->lrc_reg_state;
+
+	if (INTEL_GEN(engine->i915) >= 9)
+		regs[GEN9_CTX_RING_MI_MODE + 1] |= _MASKED_BIT_DISABLE(STOP_RING);
+}
+
 static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -2455,6 +2464,7 @@  static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
 	GEM_TRACE("%s replay {head:%04x, tail:%04x\n",
 		  engine->name, ce->ring->head, ce->ring->tail);
 	intel_ring_update_space(ce->ring);
+	__execlists_reset_reg_state(ce, engine);
 	__execlists_update_reg_state(ce, engine);
 	mutex_release(&ce->pin_mutex.dep_map, 0, _THIS_IP_);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc_reg.h b/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
index 68caf8541866..7e773e74a3fe 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
+++ b/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
@@ -39,6 +39,8 @@ 
 #define CTX_R_PWR_CLK_STATE		0x42
 #define CTX_END				0x44
 
+#define GEN9_CTX_RING_MI_MODE		0x54
+
 /* GEN12+ Reg State Context */
 #define GEN12_CTX_BB_PER_CTX_PTR		0x12
 #define GEN12_CTX_LRI_HEADER_3			0x41

Comments

Chris Wilson <chris@chris-wilson.co.uk> writes:

> During reset, we try to ensure no forward progress of the CS prior to
> the reset by setting the STOP_RING bit in RING_MI_MODE. Since gen9, this
> register is context saved and do we end up in the odd situation where we
> save the STOP_RING bit and so try to stop the engine again immediately
> upon resume. This is quite unexpected and causes us to complain about an
> early CS completion event!

The completion event is a product of resume with a stop set?

If my memory serves me well, we have fought this before.

But I have still missing pieces. Why would we not want to
set this for all contexts primed for execution? In gen8 too.

I mean, queuing context with a ring stopped just doesn't
sound right on any gen.

-Mika

>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111514
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c     | 10 ++++++++++
>  drivers/gpu/drm/i915/gt/intel_lrc_reg.h |  2 ++
>  2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index f073aea6a1fe..761f5cbd90d3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2368,6 +2368,15 @@ static struct i915_request *active_request(struct i915_request *rq)
>  	return active;
>  }
>  
> +static void __execlists_reset_reg_state(const struct intel_context *ce,
> +					const struct intel_engine_cs *engine)
> +{
> +	u32 *regs = ce->lrc_reg_state;
> +
> +	if (INTEL_GEN(engine->i915) >= 9)
> +		regs[GEN9_CTX_RING_MI_MODE + 1] |= _MASKED_BIT_DISABLE(STOP_RING);
> +}
> +
>  static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
>  {
>  	struct intel_engine_execlists * const execlists = &engine->execlists;
> @@ -2455,6 +2464,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
>  	GEM_TRACE("%s replay {head:%04x, tail:%04x\n",
>  		  engine->name, ce->ring->head, ce->ring->tail);
>  	intel_ring_update_space(ce->ring);
> +	__execlists_reset_reg_state(ce, engine);
>  	__execlists_update_reg_state(ce, engine);
>  	mutex_release(&ce->pin_mutex.dep_map, 0, _THIS_IP_);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc_reg.h b/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
> index 68caf8541866..7e773e74a3fe 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc_reg.h
> @@ -39,6 +39,8 @@
>  #define CTX_R_PWR_CLK_STATE		0x42
>  #define CTX_END				0x44
>  
> +#define GEN9_CTX_RING_MI_MODE		0x54
> +
>  /* GEN12+ Reg State Context */
>  #define GEN12_CTX_BB_PER_CTX_PTR		0x12
>  #define GEN12_CTX_LRI_HEADER_3			0x41
> -- 
> 2.23.0
Quoting Mika Kuoppala (2019-09-10 10:31:05)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > During reset, we try to ensure no forward progress of the CS prior to
> > the reset by setting the STOP_RING bit in RING_MI_MODE. Since gen9, this
> > register is context saved and do we end up in the odd situation where we
> > save the STOP_RING bit and so try to stop the engine again immediately
> > upon resume. This is quite unexpected and causes us to complain about an
> > early CS completion event!
> 
> The completion event is a product of resume with a stop set?

A completion event is the product of STOP_RING. Whether it is the
completion event that we keep failing on...
 
> If my memory serves me well, we have fought this before.

Exactly. We've reduced the frequency of when we apply the STOP_RING, but
have not eliminated it.
 
> But I have still missing pieces. Why would we not want to
> set this for all contexts primed for execution? In gen8 too.

It's not in the gen8 context, afaict. I searched the context image for an
LRI with the RING_MI_MODE register:
https://patchwork.freedesktop.org/patch/329919/?series=66468&rev=1
 
> I mean, queuing context with a ring stopped just doesn't
> sound right on any gen.

We clear the STOP_RING in the register on resume just in case, and that
is being flagged on Icelake (which I put down to the reset not being very
thorough!). The remaining question was whether we were restoring it from
the context image.
-Chris
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2019-09-10 10:31:05)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > During reset, we try to ensure no forward progress of the CS prior to
>> > the reset by setting the STOP_RING bit in RING_MI_MODE. Since gen9, this
>> > register is context saved and do we end up in the odd situation where we
>> > save the STOP_RING bit and so try to stop the engine again immediately
>> > upon resume. This is quite unexpected and causes us to complain about an
>> > early CS completion event!
>> 
>> The completion event is a product of resume with a stop set?
>
> A completion event is the product of STOP_RING. Whether it is the
> completion event that we keep failing on...
>  
>> If my memory serves me well, we have fought this before.
>
> Exactly. We've reduced the frequency of when we apply the STOP_RING, but
> have not eliminated it.
>  
>> But I have still missing pieces. Why would we not want to
>> set this for all contexts primed for execution? In gen8 too.
>
> It's not in the gen8 context, afaict. I searched the context image for an
> LRI with the RING_MI_MODE register:
> https://patchwork.freedesktop.org/patch/329919/?series=66468&rev=1
>  
>> I mean, queuing context with a ring stopped just doesn't
>> sound right on any gen.
>
> We clear the STOP_RING in the register on resume just in case, and that
> is being flagged on Icelake (which I put down to the reset not being very
> thorough!). The remaining question was whether we were restoring it from
> the context image.

Hmm yeah, I was confused of the sequence of setup. With that cleared
and with the context state being worked on, perhaps we can add
sanity checkers to the queuing path.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Quoting Mika Kuoppala (2019-09-10 10:54:43)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2019-09-10 10:31:05)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > During reset, we try to ensure no forward progress of the CS prior to
> >> > the reset by setting the STOP_RING bit in RING_MI_MODE. Since gen9, this
> >> > register is context saved and do we end up in the odd situation where we
> >> > save the STOP_RING bit and so try to stop the engine again immediately
> >> > upon resume. This is quite unexpected and causes us to complain about an
> >> > early CS completion event!
> >> 
> >> The completion event is a product of resume with a stop set?
> >
> > A completion event is the product of STOP_RING. Whether it is the
> > completion event that we keep failing on...
> >  
> >> If my memory serves me well, we have fought this before.
> >
> > Exactly. We've reduced the frequency of when we apply the STOP_RING, but
> > have not eliminated it.
> >  
> >> But I have still missing pieces. Why would we not want to
> >> set this for all contexts primed for execution? In gen8 too.
> >
> > It's not in the gen8 context, afaict. I searched the context image for an
> > LRI with the RING_MI_MODE register:
> > https://patchwork.freedesktop.org/patch/329919/?series=66468&rev=1
> >  
> >> I mean, queuing context with a ring stopped just doesn't
> >> sound right on any gen.
> >
> > We clear the STOP_RING in the register on resume just in case, and that
> > is being flagged on Icelake (which I put down to the reset not being very
> > thorough!). The remaining question was whether we were restoring it from
> > the context image.
> 
> Hmm yeah, I was confused of the sequence of setup. With that cleared
> and with the context state being worked on, perhaps we can add
> sanity checkers to the queuing path.

Yeah, I think there's definitely some fun we can have here. At the very
least a check that CTX_RING_START == ring->start would be a good
sanitycheck.
 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

As always, the only way to be sure if this changes the mtbf is to let is
soak. One day I may be able to run my own extended testing on icl!
-Chris