drm/i915: Cancel reset-engine if we couldn't find an active request

Submitted by Michel Thierry on May 17, 2017, 8:41 p.m.

Details

Message ID 20170517204134.25388-1-michel.thierry@intel.com
State New
Headers show
Series "Gen8+ engine-reset" ( rev: 8 ) in Intel GFX

Browsing this patch as part of:
"Gen8+ engine-reset" rev 8 in Intel GFX
<< prev patch [20/20] next patch >>

Commit Message

Michel Thierry May 17, 2017, 8:41 p.m.
Before reseting an engine, check if there is an active request, and if
the _hung_ request has completed. In these two cases, the seqno has moved
after hang declaration and we can skip the reset.

Also store the active request so that we only search for it once, this
applies for reset-engine and full reset.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

v3: Capture active requests during reset_prepare and store it the
engine hangcheck obj (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 18 ++++++++++----
 drivers/gpu/drm/i915/i915_drv.h         |  3 ++-
 drivers/gpu/drm/i915/i915_gem.c         | 42 +++++++++++++++++++++++----------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 46 insertions(+), 18 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..771857258292 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1900,15 +1900,19 @@  int i915_reset_engine(struct intel_engine_cs *engine)
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	engine->hangcheck.active_request = i915_gem_reset_prepare_engine(engine);
+	if (!engine->hangcheck.active_request) {
+		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
+		goto canceled;
+	} else if (IS_ERR(engine->hangcheck.active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(engine->hangcheck.active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
 	i915_gem_reset_engine(engine);
@@ -1942,6 +1946,10 @@  int i915_reset_engine(struct intel_engine_cs *engine)
 
 out:
 	return ret;
+
+canceled:
+	i915_gem_reset_finish_engine(engine);
+	return 0;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..6cbfeaa02246 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,7 +3370,8 @@  static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..5ec454dafb9f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,14 @@  static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * Also store the active request so that we only search for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,21 +2829,35 @@  int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+
+		if (request) {
+			if (request->fence.error == -EIO)
+				return ERR_PTR(-EIO); /* Previous reset failed! */
+
+			if (i915_gem_request_completed(request))
+				return NULL; /* request completed, skip it */
+		}
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+
+		engine->hangcheck.active_request = request;
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2930,9 +2946,11 @@  static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
+	struct drm_i915_gem_request *request = engine->hangcheck.active_request;
+
+	if (!request)
+		request = i915_gem_find_active_request(engine);
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index ec16fb6fde62..f850c4b12337 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -121,6 +121,7 @@  struct intel_engine_hangcheck {
 	unsigned long action_timestamp;
 	int deadlock;
 	struct intel_instdone instdone;
+	struct drm_i915_gem_request *active_request;
 	bool stalled;
 };
 

Comments

On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> -		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +
> +		if (request) {
> +			if (request->fence.error == -EIO)
> +				return ERR_PTR(-EIO); /* Previous reset failed! */
> +
> +			if (i915_gem_request_completed(request))
> +				return NULL; /* request completed, skip it */

This check is pointless here. We are just a few cycles since it was
known to be true. Both paths should be doing it just before the actual
reset for symmetry.
-Chris
On 17/05/17 13:52, Chris Wilson wrote:
> On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
>> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>
>>  	if (engine_stalled(engine)) {
>>  		request = i915_gem_find_active_request(engine);
>> -		if (request && request->fence.error == -EIO)
>> -			err = -EIO; /* Previous reset failed! */
>> +
>> +		if (request) {
>> +			if (request->fence.error == -EIO)
>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>> +
>> +			if (i915_gem_request_completed(request))
>> +				return NULL; /* request completed, skip it */
>
> This check is pointless here. We are just a few cycles since it was
> known to be true. Both paths should be doing it just before the actual
> reset for symmetry.

As you said, in gem_reset_request, 'guilty' should check for 
i915_gem_request_completed instead of engine_stalled... but at that 
point it's too late to cancel the reset (intel_gpu_reset has already 
been called).
On Wed, May 17, 2017 at 06:11:06PM -0700, Michel Thierry wrote:
> On 17/05/17 13:52, Chris Wilson wrote:
> >On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
> >>@@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>
> >> 	if (engine_stalled(engine)) {
> >> 		request = i915_gem_find_active_request(engine);
> >>-		if (request && request->fence.error == -EIO)
> >>-			err = -EIO; /* Previous reset failed! */
> >>+
> >>+		if (request) {
> >>+			if (request->fence.error == -EIO)
> >>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>+
> >>+			if (i915_gem_request_completed(request))
> >>+				return NULL; /* request completed, skip it */
> >
> >This check is pointless here. We are just a few cycles since it was
> >known to be true. Both paths should be doing it just before the actual
> >reset for symmetry.
> 
> As you said, in gem_reset_request, 'guilty' should check for
> i915_gem_request_completed instead of engine_stalled... but at that
> point it's too late to cancel the reset (intel_gpu_reset has already
> been called).

Ok. At that point we are just deciding between skipping the request or
replaying it. The motivation behind carrying forward the active_request
was to avoid the repeated searches + engine_stalled() checks (since any
future check can then just confirm the active_request is still
incomplete).
-Chris
On 5/18/2017 12:56 AM, Chris Wilson wrote:
> On Wed, May 17, 2017 at 06:11:06PM -0700, Michel Thierry wrote:
>> On 17/05/17 13:52, Chris Wilson wrote:
>>> On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
>>>> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>>
>>>> 	if (engine_stalled(engine)) {
>>>> 		request = i915_gem_find_active_request(engine);
>>>> -		if (request && request->fence.error == -EIO)
>>>> -			err = -EIO; /* Previous reset failed! */
>>>> +
>>>> +		if (request) {
>>>> +			if (request->fence.error == -EIO)
>>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>>> +
>>>> +			if (i915_gem_request_completed(request))
>>>> +				return NULL; /* request completed, skip it */
>>>
>>> This check is pointless here. We are just a few cycles since it was
>>> known to be true. Both paths should be doing it just before the actual
>>> reset for symmetry.
>>
>> As you said, in gem_reset_request, 'guilty' should check for
>> i915_gem_request_completed instead of engine_stalled... but at that
>> point it's too late to cancel the reset (intel_gpu_reset has already
>> been called).
>
> Ok. At that point we are just deciding between skipping the request or
> replaying it. The motivation behind carrying forward the active_request
> was to avoid the repeated searches + engine_stalled() checks (since any
> future check can then just confirm the active_request is still
> incomplete).

Agreed, we'll still avoid the repeated searches + engine_stalled.
Let me send that.