[2/4] drm/i915: Let hangcheck score decay faster than loop increment

Submitted by Mika Kuoppala on Nov. 30, 2015, 4:53 p.m.

Details

Message ID 1448902389-12477-2-git-send-email-mika.kuoppala@intel.com
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Mika Kuoppala Nov. 30, 2015, 4:53 p.m.
We decay hangcheck score, instead of setting it to zero,
when seqno moves forward. This was added as mechanism to
detect batches full of invalid waits. But with multiple runs of
very intensive batches (shader tests), the score accumulates
even without wait/kick pairs only by engine being active inside
shader loops multiple times in succession.

Prevent this mechanism to falsely trigger on loops by
decaying faster than we accumulate during active looping.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 6ed6571..3507269 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -3009,6 +3009,7 @@  static void i915_hangcheck_elapsed(struct work_struct *work)
 #define BUSY 1
 #define KICK 5
 #define HUNG 20
+#define BUSY_DECAY (2*BUSY)
 
 	if (!i915.enable_hangcheck)
 		return;
@@ -3084,8 +3085,8 @@  static void i915_hangcheck_elapsed(struct work_struct *work)
 			/* Gradually reduce the count so that we catch DoS
 			 * attempts across multiple batches.
 			 */
-			if (ring->hangcheck.score > 0)
-				ring->hangcheck.score--;
+			if (ring->hangcheck.score > BUSY_DECAY)
+				ring->hangcheck.score -= BUSY_DECAY;
 
 			ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;
 		}

Comments

On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
> We decay hangcheck score, instead of setting it to zero,
> when seqno moves forward. This was added as mechanism to
> detect batches full of invalid waits.

And slow DoS. So no.
-Chris
On Mon, Nov 30, 2015 at 05:18:43PM +0000, Chris Wilson wrote:
> On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
> > We decay hangcheck score, instead of setting it to zero,
> > when seqno moves forward. This was added as mechanism to
> > detect batches full of invalid waits.
> 
> And slow DoS. So no.

Yeah we need to fix the deqp CI issues with tunables. Running a shader for
10s and monopolizing the gpu might be ok in a stress tests, but it's
definitely not ok by default on a normal system.
-Daniel
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
>> We decay hangcheck score, instead of setting it to zero,
>> when seqno moves forward. This was added as mechanism to
>> detect batches full of invalid waits.
>
> And slow DoS. So no.

Point was to decay faster than 'loop detection' (-2) but slower
than kicking (+5) to keep the score decaying even if there is
one 'loop detection' per batch.

The criticism about pipe control (patch 1/4) is warranted as it just
happens to help now when the synchronization happens on ring, and thus
only helps with batches that follow this pattern wrt pipe control.

I will post a patch that tries to achieve more generic way
by inspecting subunit states if head seems to be stuck, so
that substitutes 1/4. And also makes 4/4 even more moot.

There is not much to salvage in this series so please ignore.

Thanks,
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre