[1/4] drm/i915: Teach hangcheck about long operations on rings

Submitted by Mika Kuoppala on Nov. 30, 2015, 4:53 p.m.

Details

Message ID 1448902389-12477-1-git-send-email-mika.kuoppala@intel.com
State New
Series "Series without cover letter" ( rev: 1 ) in Intel GFX

Commit Message

Mika Kuoppala Nov. 30, 2015, 4:53 p.m.
Some operations that happen in the ringbuffer, like flushing,
can take a significant amount of time. After some intense
shader tests, a PIPE_CONTROL with flush can apparently take
longer than our hangcheck tick (1500ms). If this happens
twice in a row, even with subsequent batches, the hangcheck
score decay mechanism can't cope and a hang is declared.

Split the actual head check out into a separate function and,
if the actual head has not moved, check whether it is lingering
inside the ringbuffer as opposed to a batch. If so, treat it as if
it were inside a loop, so that the hangcheck score is only
slightly incremented.

References: https://bugs.freedesktop.org/show_bug.cgi?id=93029
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index e88d692..6ed6571 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2914,11 +2914,11 @@  static void semaphore_clear_deadlocks(struct drm_i915_private *dev_priv)
 }
 
 static enum intel_ring_hangcheck_action
-ring_stuck(struct intel_engine_cs *ring, u64 acthd)
+head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	u32 tmp;
+	u32 head;
 
 	if (acthd != ring->hangcheck.acthd) {
 		if (acthd > ring->hangcheck.max_acthd) {
@@ -2929,6 +2929,30 @@  ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 		return HANGCHECK_ACTIVE_LOOP;
 	}
 
+	head = I915_READ_HEAD(ring) & HEAD_ADDR;
+
+	/* Some operations, like pipe flush, can take a long time.
+	 * Detect if we are inside ringbuffer and treat these as if
+	 * the ring would be busy.
+	 */
+	if (lower_32_bits(acthd) == head)
+		return HANGCHECK_ACTIVE_LOOP;
+
+	return HANGCHECK_HUNG;
+}
+
+static enum intel_ring_hangcheck_action
+ring_stuck(struct intel_engine_cs *ring, u64 acthd)
+{
+	struct drm_device *dev = ring->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	enum intel_ring_hangcheck_action ha;
+	u32 tmp;
+
+	ha = head_stuck(ring, acthd);
+	if (ha != HANGCHECK_HUNG)
+		return ha;
+
 	if (IS_GEN2(dev))
 		return HANGCHECK_HUNG;
 

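The scoring behaviour the commit message describes can be sketched outside the driver. This is a minimal model, not the driver's code: the enum names, the helper functions and all three numeric values (BUSY_SCORE, HUNG_SCORE, SCORE_RING_HUNG) are assumptions chosen for illustration, and only the idea of a per-tick score that decays on progress comes from the patch description.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only: assumptions for this sketch, not taken from the patch. */
enum { BUSY_SCORE = 1, HUNG_SCORE = 20, SCORE_RING_HUNG = 31 };

/* Hypothetical, simplified stand-in for the driver's hangcheck actions. */
enum hangcheck_action {
	ACTION_ACTIVE,      /* progress was made since the last tick */
	ACTION_ACTIVE_LOOP, /* no progress, but treated as busy */
	ACTION_HUNG,        /* no progress and the head looks stuck */
};

/* Apply one hangcheck tick to the running score and return the new score. */
static int hangcheck_tick(int score, enum hangcheck_action action)
{
	switch (action) {
	case ACTION_ACTIVE:
		return score > 0 ? score - 1 : 0; /* decay on progress */
	case ACTION_ACTIVE_LOOP:
		return score + BUSY_SCORE;        /* small penalty */
	case ACTION_HUNG:
		return score + HUNG_SCORE;        /* large penalty */
	}
	return score;
}

/* Return 1 if the sequence of ticks ever reaches the hang threshold. */
static int declares_hang(const enum hangcheck_action *ticks, size_t n)
{
	int score = 0;
	size_t i;

	assert(ticks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		score = hangcheck_tick(score, ticks[i]);
		if (score >= SCORE_RING_HUNG)
			return 1;
	}
	return 0;
}
```

Under these assumed values, the sequence HUNG, ACTIVE, HUNG crosses the threshold because a single tick of decay cannot offset a large penalty, while ACTIVE_LOOP, ACTIVE, ACTIVE_LOOP stays far below it.
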
Comments

On Mon, Nov 30, 2015 at 06:53:06PM +0200, Mika Kuoppala wrote:
> Some operations that happen in ringbuffer, like flushing,
> can take significant amounts of time. After some intense
> shader tests, the PIPE_CONTROL with flush can apparently last
> longer time than what is our hangcheck tick (1500ms). If
> this happens twice in a row, even with subsequent batches,
> the hangcheck score decaying mechanism can't cope and
> hang is declared.
> 
> Strip out actual head checking to a separate function and if
> actual head has not moved, check if it is lingering inside the
> ringbuffer as opposed to batch. If so, treat it as if it would be
> inside loop to only slightly increment the hangcheck score.

The PIPE_CONTROL in the ring after the batch is equivalent to the batch
performing its own PIPE_CONTROL as the last instruction. It does not
make sense to distinguish the two.
-Chris

On 30/11/15 17:11, Chris Wilson wrote:
> On Mon, Nov 30, 2015 at 06:53:06PM +0200, Mika Kuoppala wrote:
>> Some operations that happen in ringbuffer, like flushing,
>> can take significant amounts of time. After some intense
>> shader tests, the PIPE_CONTROL with flush can apparently last
>> longer time than what is our hangcheck tick (1500ms). If
>> this happens twice in a row, even with subsequent batches,
>> the hangcheck score decaying mechanism can't cope and
>> hang is declared.
>>
>> Strip out actual head checking to a separate function and if
>> actual head has not moved, check if it is lingering inside the
>> ringbuffer as opposed to batch. If so, treat it as if it would be
>> inside loop to only slightly increment the hangcheck score.
>
> The PIPE_CONTROL in the ring after the batch, is equivalent to the batch
> performing its own PIPE_CONTROL as the last instruction. It does not
> make sense to distinguish the two.
> -Chris

It's equivalent in terms of outcome, but not when checking what's 
happening. The driver controls insertion of PIPE_CONTROLs in the ring, 
but not in batches. If execution is at the ring level, we know it's 
running instructions that the driver put there, and we know that it 
*will* then progress to the next batch (assuming the hardware's not 
stuck). OTOH if execution is inside a batch then we don't know what 
sequence of instructions it's running, and we can't guarantee that the 
batch will ever terminate. So, a reduced penalty if executing 
driver-supplied code makes sense.

.Dave.
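
The ring-versus-batch distinction discussed above amounts to the comparison the patch performs between the low 32 bits of ACTHD and the masked ring head. A standalone sketch follows, assuming a hypothetical HEAD_ADDR_MASK constant in place of the driver's HEAD_ADDR and plain integers in place of register reads. The function names and the penalty values are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's HEAD_ADDR mask (value assumed). */
#define HEAD_ADDR_MASK 0x001ffffcu

enum exec_location { EXEC_IN_RING, EXEC_IN_BATCH };

/*
 * Classify where a stalled engine is executing: if the low 32 bits of
 * ACTHD equal the masked ring head, execution is at ring level,
 * otherwise it is assumed to be inside a batch.
 */
static enum exec_location classify_stalled_head(uint64_t acthd,
						uint32_t ring_head_reg)
{
	uint32_t head = ring_head_reg & HEAD_ADDR_MASK;

	return (uint32_t)acthd == head ? EXEC_IN_RING : EXEC_IN_BATCH;
}

/* Reduced penalty for driver-supplied ring commands (values assumed). */
static int penalty_for(enum exec_location loc)
{
	assert(loc == EXEC_IN_RING || loc == EXEC_IN_BATCH);
	return loc == EXEC_IN_RING ? 1 : 20;
}
```

For example, classify_stalled_head(0x1000, 0x1003) reports ring-level execution because the low bits of the head register are masked off before the comparison.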

On Mon, Nov 30, 2015 at 06:04:54PM +0000, Dave Gordon wrote:
> On 30/11/15 17:11, Chris Wilson wrote:
> >On Mon, Nov 30, 2015 at 06:53:06PM +0200, Mika Kuoppala wrote:
> >>Some operations that happen in ringbuffer, like flushing,
> >>can take significant amounts of time. After some intense
> >>shader tests, the PIPE_CONTROL with flush can apparently last
> >>longer time than what is our hangcheck tick (1500ms). If
> >>this happens twice in a row, even with subsequent batches,
> >>the hangcheck score decaying mechanism can't cope and
> >>hang is declared.
> >>
> >>Strip out actual head checking to a separate function and if
> >>actual head has not moved, check if it is lingering inside the
> >>ringbuffer as opposed to batch. If so, treat it as if it would be
> >>inside loop to only slightly increment the hangcheck score.
> >
> >The PIPE_CONTROL in the ring after the batch, is equivalent to the batch
> >performing its own PIPE_CONTROL as the last instruction. It does not
> >make sense to distinguish the two.
> >-Chris
> 
> It's equivalent in terms of outcome, but not when checking what's
> happening. The driver controls insertion of PIPE_CONTROLs in the
> ring, but not in batches. If execution is at the ring level, we know
> it's running instructions that the driver put there, and we know
> that it *will* then progress to the next batch (assuming the
> hardware's not stuck). OTOH if execution is inside a batch then we
> don't know what sequence of instructions it's running, and we can't
> guarantee that the batch will ever terminate. So, a reduced penalty
> if executing driver-supplied code makes sense.

Not exactly. If it is executing an infinite loop in the shader, it will
hang indefinitely at whatever pipecontrol comes next. The pipecontrol
following the batch is a forced operation in the user context to ensure
correctness between batches.
-Chris
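
The practical cost of this objection can be estimated with simple arithmetic. The sketch below counts how many hangcheck ticks pass before a permanently stalled head is declared hung, for a given per-tick score increment. The function name is hypothetical, and the increments (20 and 1) and threshold (31) used in the usage note are assumptions; only the 1500ms tick comes from the commit message.

```c
#include <assert.h>

/*
 * Count the ticks needed for a score that grows by `increment` on every
 * tick (no progress, so no decay) to reach `threshold`.
 */
static int ticks_until_hang(int increment, int threshold)
{
	int score = 0;
	int ticks = 0;

	assert(increment > 0 && threshold > 0);
	while (score < threshold) {
		score += increment;
		ticks++;
	}
	return ticks;
}
```

With the assumed values, ticks_until_hang(20, 31) is 2 and ticks_until_hang(1, 31) is 31, so at 1500ms per tick a shader stuck in an infinite loop and stalled at the following PIPE_CONTROL would be detected after roughly 46.5 seconds instead of 3 seconds.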