[2/2] drm/i915/gt: Only unwedge if we can reset first

Submitted by Chris Wilson on Sept. 9, 2019, 10:55 p.m.

Details

Message ID 20190909225536.7037-2-chris@chris-wilson.co.uk
State New
Headers show
Series "Series without cover letter" ( rev: 2 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Chris Wilson Sept. 9, 2019, 10:55 p.m.
Unwedging the GPU requires a successful GPU reset before we restore the
default submission, or else we may see residual context switch events
that we were not expecting.

Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index fe57296b790c..5242496a893a 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -809,6 +809,7 @@  static bool __intel_gt_unset_wedged(struct intel_gt *gt)
 	struct intel_gt_timelines *timelines = &gt->timelines;
 	struct intel_timeline *tl;
 	unsigned long flags;
+	bool ok;
 
 	if (!test_bit(I915_WEDGED, &gt->reset.flags))
 		return true;
@@ -854,7 +855,11 @@  static bool __intel_gt_unset_wedged(struct intel_gt *gt)
 	}
 	spin_unlock_irqrestore(&timelines->lock, flags);
 
-	intel_gt_sanitize(gt, false);
+	ok = false;
+	if (!reset_clobbers_display(gt->i915))
+		ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
+	if (!ok)
+		return false;
 
 	/*
 	 * Undo nop_submit_request. We prevent all new i915 requests from

Comments

On 9/9/19 3:55 PM, Chris Wilson wrote:
> Unwedging the GPU requires a successful GPU reset before we restore the
> default submission, or else we may see residual context switch events
> that we were not expecting.
> 
> Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index fe57296b790c..5242496a893a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -809,6 +809,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
>   	struct intel_gt_timelines *timelines = &gt->timelines;
>   	struct intel_timeline *tl;
>   	unsigned long flags;
> +	bool ok;
>   
>   	if (!test_bit(I915_WEDGED, &gt->reset.flags))
>   		return true;
> @@ -854,7 +855,11 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
>   	}
>   	spin_unlock_irqrestore(&timelines->lock, flags);
>   
> -	intel_gt_sanitize(gt, false);
> +	ok = false;
> +	if (!reset_clobbers_display(gt->i915))
> +		ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;

Of the thing we had in the gt_sanitize, we're ok skipping the 
uc_sanitize() because we take care of that during wedge (from 
intel_uc_reset_prepare), but what about the loop of 
__intel_engine_reset()? Is that safe to skip here?

Apart from that, the patch LGTM. Worth noting that with this change a 
successful reset is required to unwedge even after a suspend/resume 
cycle (in gem_sanitize), which is a good thing IMO.

Daniele

> +	if (!ok)
> +		return false;
>   
>   	/*
>   	 * Undo nop_submit_request. We prevent all new i915 requests from
>
Quoting Daniele Ceraolo Spurio (2019-09-10 01:59:38)
> 
> 
> On 9/9/19 3:55 PM, Chris Wilson wrote:
> > Unwedging the GPU requires a successful GPU reset before we restore the
> > default submission, or else we may see residual context switch events
> > that we were not expecting.
> > 
> > Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_reset.c | 7 ++++++-
> >   1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > index fe57296b790c..5242496a893a 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > @@ -809,6 +809,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
> >       struct intel_gt_timelines *timelines = &gt->timelines;
> >       struct intel_timeline *tl;
> >       unsigned long flags;
> > +     bool ok;
> >   
> >       if (!test_bit(I915_WEDGED, &gt->reset.flags))
> >               return true;
> > @@ -854,7 +855,11 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
> >       }
> >       spin_unlock_irqrestore(&timelines->lock, flags);
> >   
> > -     intel_gt_sanitize(gt, false);
> > +     ok = false;
> > +     if (!reset_clobbers_display(gt->i915))
> > +             ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
> 
> Of the thing we had in the gt_sanitize, we're ok skipping the 
> uc_sanitize() because we take care of that during wedge (from 
> intel_uc_reset_prepare), but what about the loop of 
> __intel_engine_reset()? Is that safe to skip here?

I think yes, because we always follow the unwedge with a GT restart. That
is either via the full reset or the sanitize+restart on resume. Both call
paths will also set the wedged bit if they fail. gem_eio/suspend should
be testing the recovery upon resume path, and even gem_eio/*-stress
should give responsible coverage of the normal recovery via full reset.
 
> Apart from that, the patch LGTM. Worth noting that with this change a 
> successful reset is required to unwedge even after a suspend/resume 
> cycle (in gem_sanitize), which is a good thing IMO.

Hence why relaxing the gpu_clobbers_display is important to retain the
ability to clear wedged across suspend on older devices.
-Chris
Hi Chris,

On Tuesday, September 10, 2019 12:55:36 AM CEST Chris Wilson wrote:
> Unwedging the GPU requires a successful GPU reset before we restore the
> default submission, or else we may see residual context switch events
> that we were not expecting.
> 
> Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/
gt/intel_reset.c
> index fe57296b790c..5242496a893a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -809,6 +809,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
>  	struct intel_gt_timelines *timelines = &gt->timelines;
>  	struct intel_timeline *tl;
>  	unsigned long flags;
> +	bool ok;
>  
>  	if (!test_bit(I915_WEDGED, &gt->reset.flags))
>  		return true;
> @@ -854,7 +855,11 @@ static bool __intel_gt_unset_wedged(struct intel_gt 
*gt)
>  	}
>  	spin_unlock_irqrestore(&timelines->lock, flags);
>  
> -	intel_gt_sanitize(gt, false);
> +	ok = false;
> +	if (!reset_clobbers_display(gt->i915))
> +		ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
> +	if (!ok)
> +		return false;

Before your change, that code was executed inside intel_gt_sanitize(gt, false) 
which unfortunately didn't return any result.  The same outcome could be 
achieved by redefining intel_gt_sanitize() to return that result and saying:

	if (!intel_gt_sanitize(gt, false)
		return false;

Is there any specific reason for intel_gt_sanitize() returning void?

Thanks,
Janusz

>  
>  	/*
>  	 * Undo nop_submit_request. We prevent all new i915 requests from
>
On 9/9/19 11:06 PM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-09-10 01:59:38)
>>
>>
>> On 9/9/19 3:55 PM, Chris Wilson wrote:
>>> Unwedging the GPU requires a successful GPU reset before we restore the
>>> default submission, or else we may see residual context switch events
>>> that we were not expecting.
>>>
>>> Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_reset.c | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> index fe57296b790c..5242496a893a 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> @@ -809,6 +809,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
>>>        struct intel_gt_timelines *timelines = &gt->timelines;
>>>        struct intel_timeline *tl;
>>>        unsigned long flags;
>>> +     bool ok;
>>>    
>>>        if (!test_bit(I915_WEDGED, &gt->reset.flags))
>>>                return true;
>>> @@ -854,7 +855,11 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
>>>        }
>>>        spin_unlock_irqrestore(&timelines->lock, flags);
>>>    
>>> -     intel_gt_sanitize(gt, false);
>>> +     ok = false;
>>> +     if (!reset_clobbers_display(gt->i915))
>>> +             ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
>>
>> Of the thing we had in the gt_sanitize, we're ok skipping the
>> uc_sanitize() because we take care of that during wedge (from
>> intel_uc_reset_prepare), but what about the loop of
>> __intel_engine_reset()? Is that safe to skip here?
> 
> I think yes, because we always follow the unwedge with a GT restart. That
> is either via the full reset or the sanitize+restart on resume. Both call
> paths will also set the wedged bit if they fail. gem_eio/suspend should
> be testing the recovery upon resume path, and even gem_eio/*-stress
> should give responsible coverage of the normal recovery via full reset.
>   
>> Apart from that, the patch LGTM. Worth noting that with this change a
>> successful reset is required to unwedge even after a suspend/resume
>> cycle (in gem_sanitize), which is a good thing IMO.
> 
> Hence why relaxing the gpu_clobbers_display is important to retain the
> ability to clear wedged across suspend on older devices.
> -Chris
> 

Sold!

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele
Quoting Janusz Krzysztofik (2019-09-10 08:39:51)
> Hi Chris,
> 
> On Tuesday, September 10, 2019 12:55:36 AM CEST Chris Wilson wrote:
> > @@ -854,7 +855,11 @@ static bool __intel_gt_unset_wedged(struct intel_gt 
> *gt)
> >       }
> >       spin_unlock_irqrestore(&timelines->lock, flags);
> >  
> > -     intel_gt_sanitize(gt, false);
> > +     ok = false;
> > +     if (!reset_clobbers_display(gt->i915))
> > +             ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
> > +     if (!ok)
> > +             return false;
> 
> Before your change, that code was executed inside intel_gt_sanitize(gt, false) 
> which unfortunately didn't return any result.  The same outcome could be 
> achieved by redefining intel_gt_sanitize() to return that result and saying:
> 
>         if (!intel_gt_sanitize(gt, false)
>                 return false;
> 
> Is there any specific reason for intel_gt_sanitize() returning void?

The intent is that sanitize scrubs the leftover BIOS state, failure is
not an option.  The biggest change with respect to intel_gt_sanitize() is
the game we play with reset_clobbers_display -- we need the reset,
whereas in sanitize, the reset is good to have (but realistically we do
not expect there to be any contexts to scrub and so can take the risk).
-Chris