drm/i915: Cancel pending execlist tasklet upon wedging

Submitted by Chris Wilson on June 21, 2017, 12:48 p.m.

Details

Message ID 20170621124804.4529-1-chris@chris-wilson.co.uk
State Accepted
Commit 4ee056f41807858b2eae263e74ae8b81800c0337
Headers show
Series "drm/i915: Cancel pending execlist tasklet upon wedging" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Chris Wilson June 21, 2017, 12:48 p.m.
Highly unlikely, but if the stop_machine() did suspend the tasklet, we
want to make sure that when it wakes it finds there is nothing to do.
Otherwise, it will loudly complain that the ELSP port tracking no longer
matches the hardware, and we will be mightly confused.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
 1 file changed, 7 insertions(+)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1353491c1010..ae3ce1314bd1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3092,6 +3092,13 @@  static void engine_set_wedged(struct intel_engine_cs *engine)
 		engine->execlist_first = NULL;
 
 		spin_unlock_irqrestore(&engine->timeline->lock, flags);
+
+		/* The port is checked prior to scheduling a tasklet, but
+		 * just in case we have suspended the tasklet to do the
+		 * wedging make sure that when it wakes, it decides there
+		 * is no work to do by clearing the irq_posted bit.
+		 */
+		clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
 	}
 }
 

Comments

On 21/06/2017 13:48, Chris Wilson wrote:
> Highly unlikely, but if the stop_machine() did suspend the tasklet, we
> want to make sure that when it wakes it finds there is nothing to do.
> Otherwise, it will loudly complain that the ELSP port tracking no longer
> matches the hardware, and we will be mightly confused.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1353491c1010..ae3ce1314bd1 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3092,6 +3092,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>   		engine->execlist_first = NULL;
>   
>   		spin_unlock_irqrestore(&engine->timeline->lock, flags);
> +
> +		/* The port is checked prior to scheduling a tasklet, but
> +		 * just in case we have suspended the tasklet to do the
> +		 * wedging make sure that when it wakes, it decides there
> +		 * is no work to do by clearing the irq_posted bit.
> +		 */
> +		clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>   	}
>   }
>   
> 

Or tasklet_kill? Actually I am not sure now how the wedging transition 
works for request which have pasted the submit callback, but are not in 
the hardware yet (waiting in the priotree). Can't find at the moment 
that we have anything dealing with those.

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2017-06-21 14:33:59)
> 
> On 21/06/2017 13:48, Chris Wilson wrote:
> > Highly unlikely, but if the stop_machine() did suspend the tasklet, we
> > want to make sure that when it wakes it finds there is nothing to do.
> > Otherwise, it will loudly complain that the ELSP port tracking no longer
> > matches the hardware, and we will be mightly confused.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 1353491c1010..ae3ce1314bd1 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3092,6 +3092,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
> >               engine->execlist_first = NULL;
> >   
> >               spin_unlock_irqrestore(&engine->timeline->lock, flags);
> > +
> > +             /* The port is checked prior to scheduling a tasklet, but
> > +              * just in case we have suspended the tasklet to do the
> > +              * wedging make sure that when it wakes, it decides there
> > +              * is no work to do by clearing the irq_posted bit.
> > +              */
> > +             clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> >       }
> >   }
> >   
> > 
> 
> Or tasklet_kill? Actually I am not sure now how the wedging transition 
> works for request which have pasted the submit callback, but are not in 
> the hardware yet (waiting in the priotree). Can't find at the moment 
> that we have anything dealing with those.

We kill the rbtree and the tree itself doesn't hold a reference to the
request (just a reference is held for the ports, as prior to submission
there is no way we complete the request and so cancel the in-flight
reference count). That's the

	engine->execlist_queue = RB_ROOT;
	engine->execlist_first = NULL;

above.
-Chris
Quoting Tvrtko Ursulin (2017-06-21 14:33:59)
> 
> On 21/06/2017 13:48, Chris Wilson wrote:
> > Highly unlikely, but if the stop_machine() did suspend the tasklet, we
> > want to make sure that when it wakes it finds there is nothing to do.
> > Otherwise, it will loudly complain that the ELSP port tracking no longer
> > matches the hardware, and we will be mightly confused.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 1353491c1010..ae3ce1314bd1 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3092,6 +3092,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
> >               engine->execlist_first = NULL;
> >   
> >               spin_unlock_irqrestore(&engine->timeline->lock, flags);
> > +
> > +             /* The port is checked prior to scheduling a tasklet, but
> > +              * just in case we have suspended the tasklet to do the
> > +              * wedging make sure that when it wakes, it decides there
> > +              * is no work to do by clearing the irq_posted bit.
> > +              */
> > +             clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> >       }
> >   }
> >   
> > 
> 
> Or tasklet_kill?

Hmm, technically not permitted:

void tasklet_kill(struct tasklet_struct *t)
{
        if (in_interrupt())
                pr_notice("Attempt to kill tasklet from interrupt\n");

        while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
                do {
                        yield();
                } while (test_bit(TASKLET_STATE_SCHED, &t->state));
        }
        tasklet_unlock_wait(t);
        clear_bit(TASKLET_STATE_SCHED, &t->state);
}

Even if we don't trigger the in_interrupt() simply due to having
interrupts disabled, then the potential yield() is illegal from inside
stop_machine().
-Chris
On 21/06/2017 14:53, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-06-21 14:33:59)
>>
>> On 21/06/2017 13:48, Chris Wilson wrote:
>>> Highly unlikely, but if the stop_machine() did suspend the tasklet, we
>>> want to make sure that when it wakes it finds there is nothing to do.
>>> Otherwise, it will loudly complain that the ELSP port tracking no longer
>>> matches the hardware, and we will be mightly confused.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 1353491c1010..ae3ce1314bd1 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -3092,6 +3092,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>>>                engine->execlist_first = NULL;
>>>    
>>>                spin_unlock_irqrestore(&engine->timeline->lock, flags);
>>> +
>>> +             /* The port is checked prior to scheduling a tasklet, but
>>> +              * just in case we have suspended the tasklet to do the
>>> +              * wedging make sure that when it wakes, it decides there
>>> +              * is no work to do by clearing the irq_posted bit.
>>> +              */
>>> +             clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>>>        }
>>>    }
>>>    
>>>
>>
>> Or tasklet_kill? Actually I am not sure now how the wedging transition
>> works for request which have pasted the submit callback, but are not in
>> the hardware yet (waiting in the priotree). Can't find at the moment
>> that we have anything dealing with those.
> 
> We kill the rbtree and the tree itself doesn't hold a reference to the
> request (just a reference is held for the ports, as prior to submission
> there is no way we complete the request and so cancel the in-flight
> reference count). That's the
> 
> 	engine->execlist_queue = RB_ROOT;
> 	engine->execlist_first = NULL;
> 
> above.

Yes of course, don't know where I was looking.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2017-06-21 15:06:34)
> 
> On 21/06/2017 14:53, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-06-21 14:33:59)
> >>
> >> On 21/06/2017 13:48, Chris Wilson wrote:
> >>> Highly unlikely, but if the stop_machine() did suspend the tasklet, we
> >>> want to make sure that when it wakes it finds there is nothing to do.
> >>> Otherwise, it will loudly complain that the ELSP port tracking no longer
> >>> matches the hardware, and we will be mightly confused.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
> >>>    1 file changed, 7 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>> index 1353491c1010..ae3ce1314bd1 100644
> >>> --- a/drivers/gpu/drm/i915/i915_gem.c
> >>> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >>> @@ -3092,6 +3092,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
> >>>                engine->execlist_first = NULL;
> >>>    
> >>>                spin_unlock_irqrestore(&engine->timeline->lock, flags);
> >>> +
> >>> +             /* The port is checked prior to scheduling a tasklet, but
> >>> +              * just in case we have suspended the tasklet to do the
> >>> +              * wedging make sure that when it wakes, it decides there
> >>> +              * is no work to do by clearing the irq_posted bit.
> >>> +              */
> >>> +             clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> >>>        }
> >>>    }
> >>>    
> >>>
> >>
> >> Or tasklet_kill? Actually I am not sure now how the wedging transition
> >> works for request which have pasted the submit callback, but are not in
> >> the hardware yet (waiting in the priotree). Can't find at the moment
> >> that we have anything dealing with those.
> > 
> > We kill the rbtree and the tree itself doesn't hold a reference to the
> > request (just a reference is held for the ports, as prior to submission
> > there is no way we complete the request and so cancel the in-flight
> > reference count). That's the
> > 
> >       engine->execlist_queue = RB_ROOT;
> >       engine->execlist_first = NULL;
> > 
> > above.
> 
> Yes of course, don't know where I was looking.
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Discussion is always worthwhile, we've caught many little bugs today
alone just by asking questions. :)

Pushed this little patch as it makes reasoning about
i915_gem_set_wedged() a smidgen easier.
-Chris