drm/i915: Re-enable per-engine reset for Broxton

Submitted by Michel Thierry on Aug. 18, 2017, 5:23 p.m.

Details

Message ID 20170818172342.7282-1-michel.thierry@intel.com
State Accepted
Commit 41e61020e821487489526e50b8e2e223342b7b93
Series "drm/i915: Re-enable per-engine reset for Broxton" ( rev: 1 ) in Intel GFX

Commit Message

Michel Thierry Aug. 18, 2017, 5:23 p.m.
The corruption in CSB mmio reads we were seeing has been tracked down to
incorrectly touching forcewake of all domains following an engine reset.
It is still a mystery why we only caught this on Broxton, since it
could happen on any platform.

With that fix already merged, commit 4055dc75d6b5 ("drm/i915: Stop
touching forcewake following a gen6+ engine reset"), let's try to enable
per-engine resets on Broxton one more time.

This reverts commit f188258bde0f ("drm/i915: Disable per-engine reset for
Broxton").

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_pci.c | 1 -
 1 file changed, 1 deletion(-)

Patch

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 6f87f1fe9cef..aa48a40d72eb 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -400,7 +400,6 @@  static const struct intel_device_info intel_broxton_info = {
 	GEN9_LP_FEATURES,
 	.platform = INTEL_BROXTON,
 	.ddb_size = 512,
-	.has_reset_engine = false,
 };
 
 static const struct intel_device_info intel_geminilake_info = {
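
The one-line removal above works because a later designated initializer overrides an earlier one for the same field. A minimal sketch of that mechanism follows, assuming the shared feature macro enables the flag (which the revert implies but the diff does not show). The names `device_info`, `LP_FEATURES`, `choose_reset` and the `reset_kind` enum are hypothetical stand-ins, not the driver's own definitions.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for struct intel_device_info. */
struct device_info {
	const char *name;
	unsigned int ddb_size;
	bool has_reset_engine;
};

/* Hypothetical feature macro: shared defaults for a platform family.
 * Assumption: the real GEN9_LP_FEATURES enables per-engine reset. */
#define LP_FEATURES \
	.has_reset_engine = true

/* Before the patch: the later initializer overrides the macro default.
 * (GCC may warn about this with -Woverride-init; it is still valid C.) */
static const struct device_info info_before = {
	LP_FEATURES,
	.name = "before",
	.ddb_size = 512,
	.has_reset_engine = false,
};

/* After the patch: the override is gone, so the macro default applies. */
static const struct device_info info_after = {
	LP_FEATURES,
	.name = "after",
	.ddb_size = 512,
};

enum reset_kind { RESET_ENGINE, RESET_FULL_GPU };

/* Hypothetical policy: use a per-engine reset only when the platform
 * advertises it, otherwise fall back to a full GPU reset. */
static enum reset_kind choose_reset(const struct device_info *info)
{
	return info->has_reset_engine ? RESET_ENGINE : RESET_FULL_GPU;
}
```

With these definitions, `choose_reset(&info_before)` selects the full GPU reset and `choose_reset(&info_after)` selects the per-engine reset, which is the behavioural change the patch is after.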

Comments

Quoting Michel Thierry (2017-08-18 18:23:42)
> The corruption in CSB mmio reads we were seeing has been tracked down to
> incorrectly touching forcewake of all domains following an engine reset.
> It is still a mystery why we only caught this on Broxton, since it
> could happen on any platform.
> 
> With that fix already merged, commit 4055dc75d6b5 ("drm/i915: Stop
> touching forcewake following a gen6+ engine reset"), let's try to enable
> per-engine resets on Broxton one more time.
> 
> This reverts commit f188258bde0f ("drm/i915: Disable per-engine reset for
> Broxton").
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

My bxt has survived about 72 hours of hang testing, which is far more
than it was able to previously.

Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Quoting Chris Wilson (2017-08-21 15:55:34)
> My bxt has survived about 72 hours of hang testing, which is far more
> than it was able to previously.
> 
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
> Tested-by: Chris Wilson <chris@chris-wilson.co.uk>

Uh oh, seemingly just hit it again...
-Chris
On 05/09/17 06:57, Chris Wilson wrote:
> Uh oh, seemingly just hit it again...

Was it because the CSBs were 0's?

A couple of times I saw a spurious CSB event (0x12 - preempted &
complete) arrive after an already 'complete' event. That was also hitting
the assert because the ctx-id would be 'wrong'. I think we could ignore
the 0x12 event and processing would continue.
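
A minimal sketch of the suggestion above, for illustration only: it decodes a CSB status word and flags a 0x12 event that follows an already 'complete' event. The bit positions are assumed to mirror i915's GEN8_CTX_STATUS_* definitions (they are consistent with 0x12 meaning preempted & complete), and the helper names are hypothetical. This is not a proposed fix; it only shows what "ignore the 0x12 event" would mean in code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit layout mirrors i915's GEN8_CTX_STATUS_* definitions. */
#define CTX_STATUS_IDLE_ACTIVE    (1u << 0)
#define CTX_STATUS_PREEMPTED      (1u << 1)
#define CTX_STATUS_ELEMENT_SWITCH (1u << 2)
#define CTX_STATUS_ACTIVE_IDLE    (1u << 3)
#define CTX_STATUS_COMPLETE       (1u << 4)

/* True if the event is exactly "preempted & complete", i.e. 0x12. */
static bool csb_is_preempted_complete(uint32_t status)
{
	return status == (CTX_STATUS_PREEMPTED | CTX_STATUS_COMPLETE);
}

/* Hypothetical filter: treat a 0x12 event as spurious when the event
 * processed just before it already reported 'complete'. */
static bool csb_should_skip(uint32_t prev_status, uint32_t status)
{
	return (prev_status & CTX_STATUS_COMPLETE) &&
	       csb_is_preempted_complete(status);
}
```

A caller walking the CSB entries in order would test `csb_should_skip(prev, cur)` before handing `cur` to the normal context-switch handling.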
Quoting Michel Thierry (2017-09-06 16:25:06)
> Was it because the CSBs were 0's?
> 
> A couple of times I saw a spurious CSB event (0x12 - preempted &
> complete) arrive after an already 'complete' event. That was also hitting
> the assert because the ctx-id would be 'wrong'. I think we could ignore
> the 0x12 event and processing would continue.

Hmm, that 0x12 event has never triggered the invalid ctx id for me yet
(but that's probably just a matter of workload); it always hits the
too-many-switches.  Sadly we can't just continue on after that, as the hw
is completely out-of-sync with our submissions and the only way to
recover appears to be a gpu reset.

Anyway, I haven't yet dug back into the bang; I have just reaffirmed that
disabling per-engine resets gives me

ickle@broxton:~$ uptime
 16:55:31 up 1 day,  2:01,  2 users,  load average: 3.66, 3.38, 3.3

so far of drv_selftest --r live_hangcheck
-Chris