[CI-ping,11/15] drm/i915: Prevent machine death on Ivybridge context switching

Submitted by Chris Wilson on April 12, 2016, 8:03 p.m.

Details

Message ID 1460491389-8602-11-git-send-email-chris@chris-wilson.co.uk
State New
Headers show
Series "Series without cover letter" ( rev: 3 2 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Chris Wilson April 12, 2016, 8:03 p.m.
Two concurrent writes into the same register cacheline has the chance of
killing the machine on Ivybridge and other gen7. This includes LRI
emitted from the command parser.  The MI_SET_CONTEXT itself serves as
serialising barrier and prevents the pair of register writes in the first
packet from triggering the fault.  However, if a second switch-context
immediately occurs then we may have two adjacent blocks of LRI to the
same registers which may then trigger the hang. To counteract this we
need to insert a delay after the second register write using SRM.

This is easiest to reproduce with something like
igt/gem_ctx_switch/interruptible that triggers back-to-back context
switches (with no operations in between them in the command stream,
which requires the execbuf operation to be interrupted after the
MI_SET_CONTEXT) but can be observed sporadically elsewhere when running
interruptible igt. No reports from the wild though, so it must be of low
enough frequency that no one has correlated the random machine freezes
with i915.ko

The issue was introduced with
commit 2c550183476dfa25641309ae9a28d30feed14379 [v3.19]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 16 10:02:27 2014 +0000

    drm/i915: Disable PSMI sleep messages on all rings around context switches

Testcase: igt/gem_ctx_switch/render-interruptible #ivb
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem_context.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index fe580cb9501a..e5ad7b21e356 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -539,7 +539,7 @@  mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
 
 	len = 4;
 	if (INTEL_INFO(engine->dev)->gen >= 7)
-		len += 2 + (num_rings ? 4*num_rings + 2 : 0);
+		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
 
 	ret = intel_ring_begin(req, len);
 	if (ret)
@@ -579,6 +579,7 @@  mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
 	if (INTEL_INFO(engine->dev)->gen >= 7) {
 		if (num_rings) {
 			struct intel_engine_cs *signaller;
+			i915_reg_t last_reg = {}; /* keep gcc quiet */
 
 			intel_ring_emit(engine,
 					MI_LOAD_REGISTER_IMM(num_rings));
@@ -586,11 +587,19 @@  mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
 				if (signaller == engine)
 					continue;
 
-				intel_ring_emit_reg(engine,
-						    RING_PSMI_CTL(signaller->mmio_base));
+				last_reg = RING_PSMI_CTL(signaller->mmio_base);
+				intel_ring_emit_reg(engine, last_reg);
 				intel_ring_emit(engine,
 						_MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
 			}
+
+			/* Insert a delay before the next switch! */
+			intel_ring_emit(engine,
+					MI_STORE_REGISTER_MEM |
+					MI_SRM_LRM_GLOBAL_GTT);
+			intel_ring_emit_reg(engine, last_reg);
+			intel_ring_emit(engine, engine->scratch.gtt_offset);
+			intel_ring_emit(engine, MI_NOOP);
 		}
 		intel_ring_emit(engine, MI_ARB_ON_OFF | MI_ARB_ENABLE);
 	}

Comments

On Tue, Apr 12, 2016 at 09:03:05PM +0100, Chris Wilson wrote:
> Two concurrent writes into the same register cacheline has the chance of
> killing the machine on Ivybridge and other gen7. This includes LRI
> emitted from the command parser.  The MI_SET_CONTEXT itself serves as
> serialising barrier and prevents the pair of register writes in the first
> packet from triggering the fault.  However, if a second switch-context
> immediately occurs then we may have two adjacent blocks of LRI to the
> same registers which may then trigger the hang. To counteract this we
> need to insert a delay after the second register write using SRM.
> 
> This is easiest to reproduce with something like
> igt/gem_ctx_switch/interruptible that triggers back-to-back context
> switches (with no operations in between them in the command stream,
> which requires the execbuf operation to be interrupted after the
> MI_SET_CONTEXT) but can be observed sporadically elsewhere when running
> interruptible igt. No reports from the wild though, so it must be of low
> enough frequency that no one has correlated the random machine freezes
> with i915.ko
> 
> The issue was introduced with
> commit 2c550183476dfa25641309ae9a28d30feed14379 [v3.19]
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Dec 16 10:02:27 2014 +0000
> 
>     drm/i915: Disable PSMI sleep messages on all rings around context switches
> 
> Testcase: igt/gem_ctx_switch/render-interruptible #ivb
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index fe580cb9501a..e5ad7b21e356 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -539,7 +539,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>  
>  	len = 4;
>  	if (INTEL_INFO(engine->dev)->gen >= 7)
> -		len += 2 + (num_rings ? 4*num_rings + 2 : 0);
> +		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
>  
>  	ret = intel_ring_begin(req, len);
>  	if (ret)
> @@ -579,6 +579,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>  	if (INTEL_INFO(engine->dev)->gen >= 7) {
>  		if (num_rings) {
>  			struct intel_engine_cs *signaller;
> +			i915_reg_t last_reg = {}; /* keep gcc quiet */
>  
>  			intel_ring_emit(engine,
>  					MI_LOAD_REGISTER_IMM(num_rings));
> @@ -586,11 +587,19 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>  				if (signaller == engine)
>  					continue;
>  
> -				intel_ring_emit_reg(engine,
> -						    RING_PSMI_CTL(signaller->mmio_base));
> +				last_reg = RING_PSMI_CTL(signaller->mmio_base);
> +				intel_ring_emit_reg(engine, last_reg);
>  				intel_ring_emit(engine,
>  						_MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
>  			}
> +
> +			/* Insert a delay before the next switch! */
> +			intel_ring_emit(engine,
> +					MI_STORE_REGISTER_MEM |
> +					MI_SRM_LRM_GLOBAL_GTT);
> +			intel_ring_emit_reg(engine, last_reg);
> +			intel_ring_emit(engine, engine->scratch.gtt_offset);
> +			intel_ring_emit(engine, MI_NOOP);
>  		}
>  		intel_ring_emit(engine, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>  	}
> -- 
> 2.8.0.rc3
>
On Wed, 13 Apr 2016, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Apr 12, 2016 at 09:03:05PM +0100, Chris Wilson wrote:
>> Two concurrent writes into the same register cacheline has the chance of
>> killing the machine on Ivybridge and other gen7. This includes LRI
>> emitted from the command parser.  The MI_SET_CONTEXT itself serves as
>> serialising barrier and prevents the pair of register writes in the first
>> packet from triggering the fault.  However, if a second switch-context
>> immediately occurs then we may have two adjacent blocks of LRI to the
>> same registers which may then trigger the hang. To counteract this we
>> need to insert a delay after the second register write using SRM.
>> 
>> This is easiest to reproduce with something like
>> igt/gem_ctx_switch/interruptible that triggers back-to-back context
>> switches (with no operations in between them in the command stream,
>> which requires the execbuf operation to be interrupted after the
>> MI_SET_CONTEXT) but can be observed sporadically elsewhere when running
>> interruptible igt. No reports from the wild though, so it must be of low
>> enough frequency that no one has correlated the random machine freezes
>> with i915.ko
>> 
>> The issue was introduced with
>> commit 2c550183476dfa25641309ae9a28d30feed14379 [v3.19]
>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>> Date:   Tue Dec 16 10:02:27 2014 +0000
>> 
>>     drm/i915: Disable PSMI sleep messages on all rings around context switches
>> 
>> Testcase: igt/gem_ctx_switch/render-interruptible #ivb
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: stable@vger.kernel.org
>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

FYI, this (*) does not cherry-pick cleanly to drm-intel-fixes.

BR,
Jani.

(*) Well, not exactly *this* but rather
https://patchwork.freedesktop.org/patch/80952/ which was not posted on
the list so I can't reply to it.


>
>> ---
>>  drivers/gpu/drm/i915/i915_gem_context.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index fe580cb9501a..e5ad7b21e356 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -539,7 +539,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>  
>>  	len = 4;
>>  	if (INTEL_INFO(engine->dev)->gen >= 7)
>> -		len += 2 + (num_rings ? 4*num_rings + 2 : 0);
>> +		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
>>  
>>  	ret = intel_ring_begin(req, len);
>>  	if (ret)
>> @@ -579,6 +579,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>  	if (INTEL_INFO(engine->dev)->gen >= 7) {
>>  		if (num_rings) {
>>  			struct intel_engine_cs *signaller;
>> +			i915_reg_t last_reg = {}; /* keep gcc quiet */
>>  
>>  			intel_ring_emit(engine,
>>  					MI_LOAD_REGISTER_IMM(num_rings));
>> @@ -586,11 +587,19 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>  				if (signaller == engine)
>>  					continue;
>>  
>> -				intel_ring_emit_reg(engine,
>> -						    RING_PSMI_CTL(signaller->mmio_base));
>> +				last_reg = RING_PSMI_CTL(signaller->mmio_base);
>> +				intel_ring_emit_reg(engine, last_reg);
>>  				intel_ring_emit(engine,
>>  						_MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
>>  			}
>> +
>> +			/* Insert a delay before the next switch! */
>> +			intel_ring_emit(engine,
>> +					MI_STORE_REGISTER_MEM |
>> +					MI_SRM_LRM_GLOBAL_GTT);
>> +			intel_ring_emit_reg(engine, last_reg);
>> +			intel_ring_emit(engine, engine->scratch.gtt_offset);
>> +			intel_ring_emit(engine, MI_NOOP);
>>  		}
>>  		intel_ring_emit(engine, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>>  	}
>> -- 
>> 2.8.0.rc3
>>
On Mon, 18 Apr 2016, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Wed, 13 Apr 2016, Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Tue, Apr 12, 2016 at 09:03:05PM +0100, Chris Wilson wrote:
>>> Two concurrent writes into the same register cacheline has the chance of
>>> killing the machine on Ivybridge and other gen7. This includes LRI
>>> emitted from the command parser.  The MI_SET_CONTEXT itself serves as
>>> serialising barrier and prevents the pair of register writes in the first
>>> packet from triggering the fault.  However, if a second switch-context
>>> immediately occurs then we may have two adjacent blocks of LRI to the
>>> same registers which may then trigger the hang. To counteract this we
>>> need to insert a delay after the second register write using SRM.
>>> 
>>> This is easiest to reproduce with something like
>>> igt/gem_ctx_switch/interruptible that triggers back-to-back context
>>> switches (with no operations in between them in the command stream,
>>> which requires the execbuf operation to be interrupted after the
>>> MI_SET_CONTEXT) but can be observed sporadically elsewhere when running
>>> interruptible igt. No reports from the wild though, so it must be of low
>>> enough frequency that no one has correlated the random machine freezes
>>> with i915.ko
>>> 
>>> The issue was introduced with
>>> commit 2c550183476dfa25641309ae9a28d30feed14379 [v3.19]
>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>> Date:   Tue Dec 16 10:02:27 2014 +0000
>>> 
>>>     drm/i915: Disable PSMI sleep messages on all rings around context switches
>>> 
>>> Testcase: igt/gem_ctx_switch/render-interruptible #ivb
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>> Cc: stable@vger.kernel.org
>>
>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> FYI, this (*) does not cherry-pick cleanly to drm-intel-fixes.

By which I mean, regardless of cc: stable I'm not doing anything to
backport the commit to fixes. Same with the other patch in this
thread. If someone wants the fixes to 4.6 or stable, they need to do the
backporting.

BR,
Jani.


>
> BR,
> Jani.
>
> (*) Well, not exactly *this* but rather
> https://patchwork.freedesktop.org/patch/80952/ which was not posted on
> the list so I can't reply to it.
>
>
>>
>>> ---
>>>  drivers/gpu/drm/i915/i915_gem_context.c | 15 ++++++++++++---
>>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>>> index fe580cb9501a..e5ad7b21e356 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>> @@ -539,7 +539,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>>  
>>>  	len = 4;
>>>  	if (INTEL_INFO(engine->dev)->gen >= 7)
>>> -		len += 2 + (num_rings ? 4*num_rings + 2 : 0);
>>> +		len += 2 + (num_rings ? 4*num_rings + 6 : 0);
>>>  
>>>  	ret = intel_ring_begin(req, len);
>>>  	if (ret)
>>> @@ -579,6 +579,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>>  	if (INTEL_INFO(engine->dev)->gen >= 7) {
>>>  		if (num_rings) {
>>>  			struct intel_engine_cs *signaller;
>>> +			i915_reg_t last_reg = {}; /* keep gcc quiet */
>>>  
>>>  			intel_ring_emit(engine,
>>>  					MI_LOAD_REGISTER_IMM(num_rings));
>>> @@ -586,11 +587,19 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>>>  				if (signaller == engine)
>>>  					continue;
>>>  
>>> -				intel_ring_emit_reg(engine,
>>> -						    RING_PSMI_CTL(signaller->mmio_base));
>>> +				last_reg = RING_PSMI_CTL(signaller->mmio_base);
>>> +				intel_ring_emit_reg(engine, last_reg);
>>>  				intel_ring_emit(engine,
>>>  						_MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
>>>  			}
>>> +
>>> +			/* Insert a delay before the next switch! */
>>> +			intel_ring_emit(engine,
>>> +					MI_STORE_REGISTER_MEM |
>>> +					MI_SRM_LRM_GLOBAL_GTT);
>>> +			intel_ring_emit_reg(engine, last_reg);
>>> +			intel_ring_emit(engine, engine->scratch.gtt_offset);
>>> +			intel_ring_emit(engine, MI_NOOP);
>>>  		}
>>>  		intel_ring_emit(engine, MI_ARB_ON_OFF | MI_ARB_ENABLE);
>>>  	}
>>> -- 
>>> 2.8.0.rc3
>>>