[1/2] drm/i915: Pull sync_srcu for device reset outside of wedge_mutex

Submitted by Chris Wilson on Feb. 11, 2019, 1:50 p.m.

Details

Message ID 20190211135040.1234-1-chris@chris-wilson.co.uk
State New
Series "Series without cover letter"

Commit Message

Chris Wilson Feb. 11, 2019, 1:50 p.m.
We need to flush our srcu protecting resources about to be clobbered
by the reset, inside of our timer failsafe but outside of the
error->wedge_mutex, so that the failsafe can run in case the
synchronize_srcu() takes too long (hits a shrinker deadlock?).

Fixes: 72eb16df010a ("drm/i915: Serialise resets with wedging")
References: https://bugs.freedesktop.org/show_bug.cgi?id=109605
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reset.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 9494b015185a..c2b7570730c2 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -941,9 +941,6 @@  static int do_reset(struct drm_i915_private *i915, unsigned int stalled_mask)
 {
 	int err, i;
 
-	/* Flush everyone currently using a resource about to be clobbered */
-	synchronize_srcu(&i915->gpu_error.reset_backoff_srcu);
-
 	err = intel_gpu_reset(i915, ALL_ENGINES);
 	for (i = 0; err && i < RESET_MAX_RETRIES; i++) {
 		msleep(10 * (i + 1));
@@ -1140,6 +1137,9 @@  static void i915_reset_device(struct drm_i915_private *i915,
 	i915_wedge_on_timeout(&w, i915, 5 * HZ) {
 		intel_prepare_reset(i915);
 
+		/* Flush everyone using a resource about to be clobbered */
+		synchronize_srcu(&error->reset_backoff_srcu);
+
 		mutex_lock(&error->wedge_mutex);
 		i915_reset(i915, engine_mask, reason);
 		mutex_unlock(&error->wedge_mutex);
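
The ordering argument in the commit message can be illustrated with a small userspace sketch. It is not driver code: `struct fake_error`, `slow_flush()`, `failsafe()` and `failsafe_ran()` are hypothetical names, the 50 ms and 300 ms delays are arbitrary, and the sketch assumes the failsafe has to take the wedge mutex before it can act. Using `pthread_mutex_trylock()` in the failsafe is a simplification so that the outcome is easy to observe.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the driver's error state; not the i915 struct. */
struct fake_error {
	pthread_mutex_t wedge_mutex;
	bool wedged;	/* set by the failsafe if it manages to run */
};

static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/* Stand-in for a synchronize_srcu() call that takes far too long. */
static void slow_flush(void)
{
	sleep_ms(300);
}

/*
 * Failsafe: fires after 50 ms and, by assumption, needs wedge_mutex.
 * trylock is used so the sketch reports "blocked" instead of waiting.
 */
static void *failsafe(void *arg)
{
	struct fake_error *error = arg;

	sleep_ms(50);
	if (pthread_mutex_trylock(&error->wedge_mutex) == 0) {
		error->wedged = true;
		pthread_mutex_unlock(&error->wedge_mutex);
	}
	return NULL;
}

/* Returns true if the failsafe could act while the flush was still stuck. */
static bool failsafe_ran(bool flush_under_mutex)
{
	struct fake_error error = { .wedged = false };
	pthread_t timer;
	bool result;
	int ret;

	ret = pthread_mutex_init(&error.wedge_mutex, NULL);
	assert(ret == 0);
	ret = pthread_create(&timer, NULL, failsafe, &error);
	assert(ret == 0);

	if (flush_under_mutex) {
		/* Old ordering: the slow flush happens with the mutex held. */
		pthread_mutex_lock(&error.wedge_mutex);
		slow_flush();
		pthread_mutex_unlock(&error.wedge_mutex);
	} else {
		/* New ordering: flush first, then take the mutex for the reset. */
		slow_flush();
		pthread_mutex_lock(&error.wedge_mutex);
		pthread_mutex_unlock(&error.wedge_mutex);
	}

	pthread_join(timer, NULL);
	result = error.wedged;
	pthread_mutex_destroy(&error.wedge_mutex);
	return result;
}
```

Calling `failsafe_ran(true)` models the old placement of the flush and `failsafe_ran(false)` the placement after this patch. Build with `-pthread`; the result depends on the sleeps keeping their relative order, so treat it as an illustration rather than a test of the driver.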

Comments

Mika Kuoppala Feb. 11, 2019, 3:09 p.m.
Chris Wilson <chris@chris-wilson.co.uk> writes:

> We need to flush our srcu protecting resources about to be clobbered
> by the reset, inside of our timer failsafe but outside of the
> error->wedge_mutex, so that the failsafe can run in case the
> synchronize_srcu() takes too long (hits a shrinker deadlock?).
>
> Fixes: 72eb16df010a ("drm/i915: Serialise resets with wedging")
> References: https://bugs.freedesktop.org/show_bug.cgi?id=109605
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reset.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 9494b015185a..c2b7570730c2 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -941,9 +941,6 @@ static int do_reset(struct drm_i915_private *i915, unsigned int stalled_mask)
>  {
>  	int err, i;
>  
> -	/* Flush everyone currently using a resource about to be clobbered */
> -	synchronize_srcu(&i915->gpu_error.reset_backoff_srcu);
> -
>  	err = intel_gpu_reset(i915, ALL_ENGINES);
>  	for (i = 0; err && i < RESET_MAX_RETRIES; i++) {
>  		msleep(10 * (i + 1));
> @@ -1140,6 +1137,9 @@ static void i915_reset_device(struct drm_i915_private *i915,
>  	i915_wedge_on_timeout(&w, i915, 5 * HZ) {
>  		intel_prepare_reset(i915);
>  
> +		/* Flush everyone using a resource about to be clobbered */
> +		synchronize_srcu(&error->reset_backoff_srcu);
> +

Do we easily see which one it will be? This one or
the block below to timeout on wedge?

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>  		mutex_lock(&error->wedge_mutex);
>  		i915_reset(i915, engine_mask, reason);
>  		mutex_unlock(&error->wedge_mutex);
> -- 
> 2.20.1
Chris Wilson Feb. 11, 2019, 3:14 p.m.
Quoting Mika Kuoppala (2019-02-11 15:09:48)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > We need to flush our srcu protecting resources about to be clobbered
> > by the reset, inside of our timer failsafe but outside of the
> > error->wedge_mutex, so that the failsafe can run in case the
> > synchronize_srcu() takes too long (hits a shrinker deadlock?).
> >
> > Fixes: 72eb16df010a ("drm/i915: Serialise resets with wedging")
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=109605
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_reset.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> > index 9494b015185a..c2b7570730c2 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -941,9 +941,6 @@ static int do_reset(struct drm_i915_private *i915, unsigned int stalled_mask)
> >  {
> >       int err, i;
> >  
> > -     /* Flush everyone currently using a resource about to be clobbered */
> > -     synchronize_srcu(&i915->gpu_error.reset_backoff_srcu);
> > -
> >       err = intel_gpu_reset(i915, ALL_ENGINES);
> >       for (i = 0; err && i < RESET_MAX_RETRIES; i++) {
> >               msleep(10 * (i + 1));
> > @@ -1140,6 +1137,9 @@ static void i915_reset_device(struct drm_i915_private *i915,
> >       i915_wedge_on_timeout(&w, i915, 5 * HZ) {
> >               intel_prepare_reset(i915);
> >  
> > +             /* Flush everyone using a resource about to be clobbered */
> > +             synchronize_srcu(&error->reset_backoff_srcu);
> > +
> 
> Do we easily see which one it will be? This one or
> the block below to timeout on wedge?

It would be easy to reconstruct if we had all the stack traces, so we
could see which process is stuck where, but we do not. Failing that, my
hunch is that it's sync_srcu taking too long, and by design we know it
can deadlock around an unfortunate shrinker interaction :( But I'm not
entirely convinced we're hitting that.
-Chris