[2/3] drm/i915/guc: Add support for reset engine using GuC commands

Submitted by Michel Thierry on Oct. 30, 2017, 6:56 p.m.

Details

Message ID 20171030185616.32836-3-michel.thierry@intel.com
State New
Headers show
Series "GuC based reset engine" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Michel Thierry Oct. 30, 2017, 6:56 p.m.
This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.

In the case of i915 directly submission to ELSP, driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between driver and GuC. i915 is still responsible for detecting
a hang, and when it does it only requests GuC to reset that Engine. GuC
internally manages acquiring forcewake and idling the engine before actually
resetting it.

Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending so after resetting
a engine, pending workloads/requests are resubmitted again.

v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
non-guc funtion names.

v3: Removed debug message about engine restarting from which request,
since the new baseline do it regardless of submission mode. (Chris)

v4: Rebase.

v5: Do not pass unnecessary reporting flags to the fw (Jeff);
tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.c       |  9 +++++++--
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/intel_guc.c      | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h |  1 +
 drivers/gpu/drm/i915/intel_uncore.c   |  5 -----
 5 files changed, 33 insertions(+), 7 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index af745749509c..02fb35744f66 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1984,10 +1984,15 @@  int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
 		goto out;
 	}
 
-	ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
+	if (!engine->i915->guc.execbuf_client)
+		ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
+	else
+		ret = intel_guc_reset_engine(engine);
+
 	if (ret) {
 		/* If we fail here, we expect to fallback to a global reset */
-		DRM_DEBUG_DRIVER("Failed to reset %s, ret=%d\n",
+		DRM_DEBUG_DRIVER("%sFailed to reset %s, ret=%d\n",
+				 (engine->i915->guc.execbuf_client ? "GUC " : ""),
 				 engine->name, ret);
 		goto out;
 	}
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0ed06ca07568..b77fd9720ec2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3326,6 +3326,7 @@  extern int i915_reset_engine(struct intel_engine_cs *engine,
 
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_guc(struct drm_i915_private *dev_priv);
+extern int intel_guc_reset_engine(struct intel_engine_cs *engine);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/intel_guc.c b/drivers/gpu/drm/i915/intel_guc.c
index f74d50fdaeb0..f7b9b6b91499 100644
--- a/drivers/gpu/drm/i915/intel_guc.c
+++ b/drivers/gpu/drm/i915/intel_guc.c
@@ -24,6 +24,7 @@ 
 
 #include "intel_guc.h"
 #include "i915_drv.h"
+#include "i915_guc_submission.h"
 
 static void gen8_guc_raise_irq(struct intel_guc *guc)
 {
@@ -283,6 +284,29 @@  int intel_guc_suspend(struct drm_i915_private *dev_priv)
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
 
+/**
+ * intel_guc_reset_engine() - ask GuC to reset an engine
+ * @engine:	engine to be reset
+ */
+int intel_guc_reset_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	u32 data[7];
+
+	GEM_BUG_ON(!guc->execbuf_client);
+
+	data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+	data[1] = engine->guc_id;
+	data[2] = 0;
+	data[3] = 0;
+	data[4] = 0;
+	data[5] = guc->execbuf_client->stage_id;
+	data[6] = guc_ggtt_offset(guc->shared_data);
+
+	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+}
+
 /**
  * intel_guc_resume() - notify GuC resuming from suspend state
  * @dev_priv:	i915 device private
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index e24dbec2ead8..6a10aa6f04d3 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -574,6 +574,7 @@  struct guc_shared_ctx_data {
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
 	INTEL_GUC_ACTION_REQUEST_PREEMPTION = 0x2,
+	INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3,
 	INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6,
 	INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10,
 	INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20,
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index d629b2e40013..0564d87ad416 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1792,14 +1792,9 @@  bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
-/*
- * When GuC submission is enabled, GuC manages ELSP and can initiate the
- * engine reset too. For now, fall back to full GPU reset if it is enabled.
- */
 bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 {
 	return (dev_priv->info.has_reset_engine &&
-		!dev_priv->guc.execbuf_client &&
 		i915_modparams.reset >= 2);
 }
 

Comments

Quoting Michel Thierry (2017-10-30 18:56:15)
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index af745749509c..02fb35744f66 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1984,10 +1984,15 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
>                 goto out;
>         }
>  
> -       ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
> +       if (!engine->i915->guc.execbuf_client)
> +               ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
> +       else
> +               ret = intel_guc_reset_engine(engine);

While redundant, the interface cries out for
	intel_guc_reset_engine(guc, engine);
even though, in this case we would be using
	intel_guc_reset_engine(&engine->i915->guc, engine);

I would also be tempted to change intel_gpu_reset to
	intel_gpu_reset_engine(engine->i915, engine);

intel_gt_reset_engine?

Just trying to make our language consistent along each path.
-Chris
On 30/10/17 13:58, Chris Wilson wrote:
> Quoting Michel Thierry (2017-10-30 18:56:15)
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index af745749509c..02fb35744f66 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1984,10 +1984,15 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
>>                  goto out;
>>          }
>>   
>> -       ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
>> +       if (!engine->i915->guc.execbuf_client)
>> +               ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
>> +       else
>> +               ret = intel_guc_reset_engine(engine);
> 
> While redundant, the interface cries out for
> 	intel_guc_reset_engine(guc, engine);
> even though, in this case we would be using
> 	intel_guc_reset_engine(&engine->i915->guc, engine);
> 

I'll change the name and pass the guc as parameter.

> I would also be tempted to change intel_gpu_reset to
> 	intel_gpu_reset_engine(engine->i915, engine);
> 
> intel_gt_reset_engine?
> 
And then it becomes a wrapper of intel_gpu_reset? (intel_gpu_reset is 
used for full reset path and in older platforms too).

> Just trying to make our language consistent along each path.
> 

It's true that it would look better as you suggest:

if (!guc_client)
	intel_gt_reset_engine(i915, engine);
else
	intel_guc_reset_engine(guc, engine);

Thanks,

-Michel
Quoting Michel Thierry (2017-10-30 18:56:15)
> This patch adds per engine reset and recovery (TDR) support when GuC is
> used to submit workloads to GPU.
> 
> In the case of i915 directly submission to ELSP, driver manages hang
> detection, recovery and resubmission. With GuC submission these tasks
> are shared between driver and GuC. i915 is still responsible for detecting
> a hang, and when it does it only requests GuC to reset that Engine. GuC
> internally manages acquiring forcewake and idling the engine before actually
> resetting it.
> 
> Once the reset is successful, i915 takes over again and handles resubmission.
> The scheduler in i915 knows which requests are pending so after resetting
> a engine, pending workloads/requests are resubmitted again.
> 
> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> non-guc funtion names.
> 
> v3: Removed debug message about engine restarting from which request,
> since the new baseline do it regardless of submission mode. (Chris)
> 
> v4: Rebase.
> 
> v5: Do not pass unnecessary reporting flags to the fw (Jeff);
> tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.

In your experience, how did our test coverage fare?

Could you use live_hangcheck effectively? (The drv_selftest would need
some hand holding to pass along guc options. But for livetesting we
should probably get to the point of being able to load/unload the guc
interface so that we cover both execlists and guc.) Did you find 
"gem_exec_whisper --r hang*", did you try gem_concurrent_all?
 
> +/**
> + * intel_guc_reset_engine() - ask GuC to reset an engine
> + * @engine:    engine to be reset
> + */
> +int intel_guc_reset_engine(struct intel_engine_cs *engine)
> +{
> +       struct drm_i915_private *dev_priv = engine->i915;
> +       struct intel_guc *guc = &dev_priv->guc;
> +       u32 data[7];
> +
> +       GEM_BUG_ON(!guc->execbuf_client);
> +
> +       data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
> +       data[1] = engine->guc_id;
> +       data[2] = 0;
> +       data[3] = 0;
> +       data[4] = 0;
> +       data[5] = guc->execbuf_client->stage_id;
> +       data[6] = guc_ggtt_offset(guc->shared_data);
> +
> +       return intel_guc_send(guc, data, ARRAY_SIZE(data));

Is this a synchronous action? We expect that following the completion of
the reset routine, we are ready to reinit the hw. The same rule needs to
apply the guc, I think.
-Chris
On 30/10/17 14:09, Chris Wilson wrote:
> Quoting Michel Thierry (2017-10-30 18:56:15)
>> This patch adds per engine reset and recovery (TDR) support when GuC is
>> used to submit workloads to GPU.
>>
>> In the case of i915 directly submission to ELSP, driver manages hang
>> detection, recovery and resubmission. With GuC submission these tasks
>> are shared between driver and GuC. i915 is still responsible for detecting
>> a hang, and when it does it only requests GuC to reset that Engine. GuC
>> internally manages acquiring forcewake and idling the engine before actually
>> resetting it.
>>
>> Once the reset is successful, i915 takes over again and handles resubmission.
>> The scheduler in i915 knows which requests are pending so after resetting
>> a engine, pending workloads/requests are resubmitted again.
>>
>> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
>> non-guc funtion names.
>>
>> v3: Removed debug message about engine restarting from which request,
>> since the new baseline do it regardless of submission mode. (Chris)
>>
>> v4: Rebase.
>>
>> v5: Do not pass unnecessary reporting flags to the fw (Jeff);
>> tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.
> 
> In your experience, how did our test coverage fare?
> 
> Could you use live_hangcheck effectively? (The drv_selftest would need
> some hand holding to pass along guc options. But for livetesting we
> should probably get to the point of being able to load/unload the guc
> interface so that we cover both execlists and guc.) Did you find
> "gem_exec_whisper --r hang*", did you try gem_concurrent_all?
>   

live_hangcheck runs ok with guc (as long as i915_params.h has guc 
submission enabled). Do you see a benefit on adding an option in 
drv_selftest to override the submission mode? I can add it to my list.

You got me in gem_concurrent_all, I forgot to schedule it a few weeks ago.

>> +/**
>> + * intel_guc_reset_engine() - ask GuC to reset an engine
>> + * @engine:    engine to be reset
>> + */
>> +int intel_guc_reset_engine(struct intel_engine_cs *engine)
>> +{
>> +       struct drm_i915_private *dev_priv = engine->i915;
>> +       struct intel_guc *guc = &dev_priv->guc;
>> +       u32 data[7];
>> +
>> +       GEM_BUG_ON(!guc->execbuf_client);
>> +
>> +       data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
>> +       data[1] = engine->guc_id;
>> +       data[2] = 0;
>> +       data[3] = 0;
>> +       data[4] = 0;
>> +       data[5] = guc->execbuf_client->stage_id;
>> +       data[6] = guc_ggtt_offset(guc->shared_data);
>> +
>> +       return intel_guc_send(guc, data, ARRAY_SIZE(data));
> 
> Is this a synchronous action? We expect that following the completion of
> the reset routine, we are ready to reinit the hw. The same rule needs to
> apply the guc, I think.

Right now the action is synchronous, the fw won't reply to the action 
until all the steps are completed. It also is fast enough, I haven't 
seen it time out (which would be promoted to full reset and reload the 
fw). But, do you have a crystal ball?

> -Chris
>
Quoting Michel Thierry (2017-10-31 04:38:30)
> On 30/10/17 14:09, Chris Wilson wrote:
> > Quoting Michel Thierry (2017-10-30 18:56:15)
> >> This patch adds per engine reset and recovery (TDR) support when GuC is
> >> used to submit workloads to GPU.
> >>
> >> In the case of i915 directly submission to ELSP, driver manages hang
> >> detection, recovery and resubmission. With GuC submission these tasks
> >> are shared between driver and GuC. i915 is still responsible for detecting
> >> a hang, and when it does it only requests GuC to reset that Engine. GuC
> >> internally manages acquiring forcewake and idling the engine before actually
> >> resetting it.
> >>
> >> Once the reset is successful, i915 takes over again and handles resubmission.
> >> The scheduler in i915 knows which requests are pending so after resetting
> >> a engine, pending workloads/requests are resubmitted again.
> >>
> >> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> >> non-guc funtion names.
> >>
> >> v3: Removed debug message about engine restarting from which request,
> >> since the new baseline do it regardless of submission mode. (Chris)
> >>
> >> v4: Rebase.
> >>
> >> v5: Do not pass unnecessary reporting flags to the fw (Jeff);
> >> tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.
> > 
> > In your experience, how did our test coverage fare?
> > 
> > Could you use live_hangcheck effectively? (The drv_selftest would need
> > some hand holding to pass along guc options. But for livetesting we
> > should probably get to the point of being able to load/unload the guc
> > interface so that we cover both execlists and guc.) Did you find
> > "gem_exec_whisper --r hang*", did you try gem_concurrent_all?
> >   
> 
> live_hangcheck runs ok with guc (as long as i915_params.h has guc 
> submission enabled). Do you see a benefit on adding an option in 
> drv_selftest to override the submission mode? I can add it to my list.

I'd rather take the path of loading/unloading guc in the kernel to
ensure we have coverage of both on all platforms where that makes sense.
That way the kernel remains in control of the coverage it deems
necessary.

The alternative would be a for_each_param_value() loop in drv_selftests
which doesn't inspired me with forward/backward testing confidence.

For the time being, knowing that you did check live_hangcheck is enough
for me to cross it off the checklist.

The larger question behind "how did our test coverage fare" is did it
find any bugs during development? If not, we need more tests ;)
 
> You got me in gem_concurrent_all, I forgot to schedule it a few weeks ago.
> 
> >> +/**
> >> + * intel_guc_reset_engine() - ask GuC to reset an engine
> >> + * @engine:    engine to be reset
> >> + */
> >> +int intel_guc_reset_engine(struct intel_engine_cs *engine)
> >> +{
> >> +       struct drm_i915_private *dev_priv = engine->i915;
> >> +       struct intel_guc *guc = &dev_priv->guc;
> >> +       u32 data[7];
> >> +
> >> +       GEM_BUG_ON(!guc->execbuf_client);
> >> +
> >> +       data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
> >> +       data[1] = engine->guc_id;
> >> +       data[2] = 0;
> >> +       data[3] = 0;
> >> +       data[4] = 0;
> >> +       data[5] = guc->execbuf_client->stage_id;
> >> +       data[6] = guc_ggtt_offset(guc->shared_data);
> >> +
> >> +       return intel_guc_send(guc, data, ARRAY_SIZE(data));
> > 
> > Is this a synchronous action? We expect that following the completion of
> > the reset routine, we are ready to reinit the hw. The same rule needs to
> > apply the guc, I think.
> 
> Right now the action is synchronous, the fw won't reply to the action 
> until all the steps are completed. It also is fast enough, I haven't 
> seen it time out (which would be promoted to full reset and reload the 
> fw). But, do you have a crystal ball?

Just REQUEST is a red flag that there may be an asynchronous REPLY/ACK.

Aside: we need to start planning/adding fault_injection tests.
intel_guc_send() seems a good choice for an if(should_fail()).
-Chris