drm/i915/tgl: Suspend pre-parser across GTT invalidations

Submitted by Chris Wilson on Sept. 14, 2019, 8:25 a.m.


Message ID 20190914082550.11547-1-chris@chris-wilson.co.uk
Series "drm/i915/tgl: Suspend pre-parser across GTT invalidations" (rev 1) in Intel GFX


Commit Message

Chris Wilson Sept. 14, 2019, 8:25 a.m.
Before we execute a batch, we must first issue any and all TLB
invalidations so that the batch picks up the new page table entries.
Tigerlake's pre-parser is weakening our post-sync CS_STALL inside the
invalidate pipe-control, allowing the batch buffer to be loaded
before we have set up its page table (and so it loads the wrong page and
executes indefinitely).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
Suggestions welcome as this does not seem intended behaviour, so I
suspect there is a strong pipecontrol flag we are missing.
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index a3f0e4999744..a9e690c303cc 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2796,11 +2796,14 @@  static int gen11_emit_flush_render(struct i915_request *request,
 		flags |= PIPE_CONTROL_QW_WRITE;
 		flags |= PIPE_CONTROL_GLOBAL_GTT_IVB;
 
-		cs = intel_ring_begin(request, 6);
+		cs = intel_ring_begin(request, 8);
 		if (IS_ERR(cs))
 			return PTR_ERR(cs);
 
+		*cs++ = MI_ARB_CHECK | 1 << 8 | 1;
 		cs = gen8_emit_pipe_control(cs, flags, scratch_addr);
+		*cs++ = MI_ARB_CHECK | 1 << 8;
+
 		intel_ring_advance(request, cs);
 	}
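
For reference, the magic constants in the two new dwords decode as follows: MI_ARB_CHECK is MI_INSTR(0x05, 0), i.e. opcode 0x05 in bits 28:23, and on Tigerlake bit 8 of the dword acts as a write-enable mask for bit 0, the pre-fetch disable bit. So `MI_ARB_CHECK | 1 << 8 | 1` suspends the pre-parser and `MI_ARB_CHECK | 1 << 8` resumes it. A minimal standalone C sketch of the encoding (MI_INSTR and MI_ARB_CHECK mirror the i915 headers; the PREPARSER_* names are illustrative, not from the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors i915's MI_INSTR(): opcode in bits 28:23 of the command dword. */
#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_ARB_CHECK             MI_INSTR(0x05, 0)

/* Tigerlake: bit 8 is the write-enable mask for bit 0 (pre-fetch disable). */
#define PREPARSER_MASK    (1u << 8)  /* illustrative name */
#define PREPARSER_DISABLE (1u << 0)  /* illustrative name */

static uint32_t arb_check_preparser_disable(void)
{
	/* Mask bit and value bit both set: stop pre-fetching. */
	return MI_ARB_CHECK | PREPARSER_MASK | PREPARSER_DISABLE;
}

static uint32_t arb_check_preparser_enable(void)
{
	/* Mask bit set, value bit clear: resume pre-fetching. */
	return MI_ARB_CHECK | PREPARSER_MASK;
}
```

The mask bit means the second dword actively writes a 0 to the disable bit rather than leaving it untouched, which is why re-enabling still needs `1 << 8`.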
 

Comments

On 9/14/19 1:25 AM, Chris Wilson wrote:
> Before we execute a batch, we must first issue any and all TLB
> invalidations so that batch picks up the new page table entries.
> Tigerlake's preparser is weakening our post-sync CS_STALL inside the
> invalidate pipe-control and allowing the loading of the batch buffer
> before we have setup its page table (and so it loads the wrong page and
> executes indefinitely).
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> Suggestions welcome as this does not seem intended behaviour, so I
> suspect there is a strong pipecontrol flag we are missing.

When I discussed the pre-parser with the HW team, the feedback I got was 
that the only way to make sure we don't race the memory update is to 
either leave enough cachelines of space or turn the parser off like you 
did below. That discussion was about actual physical memory access, 
though, not the TLB.
Anyway, turning off the parser around the pipe control, if it fixes the 
problem, shouldn't be too bad, since the main performance advantage of 
the parser is expected inside the user batch. The alternative would 
probably be to stop invalidating the TLBs from within the ring and 
switch to MMIO invalidations when lite-restoring a new request in the 
ring (the CS will implicitly stop the parser and invalidate everything 
on a full ctx switch).

Daniele
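
The "turn the parser off around the pipe control" shape described above can be modeled as a standalone C sketch of the patch's emission: one MI_ARB_CHECK suspending the pre-parser, the 6-dword invalidating PIPE_CONTROL, and one MI_ARB_CHECK resuming it, which accounts for the 6 to 8 dword bump in intel_ring_begin(). The helper name preparser_disable() and the simplified PIPE_CONTROL body are illustrative, not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_ARB_CHECK             MI_INSTR(0x05, 0)

/* GFX_OP_PIPE_CONTROL(n): header dword for an n-dword PIPE_CONTROL packet. */
#define GFX_OP_PIPE_CONTROL(n) \
	((0x3u << 29) | (0x3u << 27) | (0x2u << 24) | ((uint32_t)(n) - 2))

/* Tigerlake MI_ARB_CHECK: bit 8 write-enables bit 0 (pre-fetch disable). */
static uint32_t preparser_disable(int state)
{
	return MI_ARB_CHECK | 1u << 8 | (state ? 1u : 0u);
}

/*
 * Emit the invalidation with the pre-parser suspended across it:
 * 1 dword arb-check + 6-dword PIPE_CONTROL + 1 dword arb-check = 8 dwords.
 */
static size_t emit_invalidate(uint32_t *cs, uint32_t flags, uint32_t scratch)
{
	uint32_t *p = cs;

	*p++ = preparser_disable(1);
	*p++ = GFX_OP_PIPE_CONTROL(6);
	*p++ = flags;
	*p++ = scratch;
	*p++ = 0; /* post-sync address high / immediate value, simplified */
	*p++ = 0;
	*p++ = 0;
	*p++ = preparser_disable(0);

	return (size_t)(p - cs);
}
```

Nothing the pre-parser fetches between the two arb-checks can overtake the invalidation, at the cost of losing pre-fetch only for that short window.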

> ---
>   drivers/gpu/drm/i915/gt/intel_lrc.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index a3f0e4999744..a9e690c303cc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2796,11 +2796,14 @@ static int gen11_emit_flush_render(struct i915_request *request,
>   		flags |= PIPE_CONTROL_QW_WRITE;
>   		flags |= PIPE_CONTROL_GLOBAL_GTT_IVB;
>   
> -		cs = intel_ring_begin(request, 6);
> +		cs = intel_ring_begin(request, 8);
>   		if (IS_ERR(cs))
>   			return PTR_ERR(cs);
>   
> +		*cs++ = MI_ARB_CHECK | 1 << 8 | 1;
>   		cs = gen8_emit_pipe_control(cs, flags, scratch_addr);
> +		*cs++ = MI_ARB_CHECK | 1 << 8;
> +
>   		intel_ring_advance(request, cs);
>   	}
>   
>
Quoting Daniele Ceraolo Spurio (2019-09-16 21:37:26)
> 
> 
> On 9/14/19 1:25 AM, Chris Wilson wrote:
> > Before we execute a batch, we must first issue any and all TLB
> > invalidations so that batch picks up the new page table entries.
> > Tigerlake's preparser is weakening our post-sync CS_STALL inside the
> > invalidate pipe-control and allowing the loading of the batch buffer
> > before we have setup its page table (and so it loads the wrong page and
> > executes indefinitely).
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> > Suggestions welcome as this does not seem intended behaviour, so I
> > suspect there is a strong pipecontrol flag we are missing.
> 
> When I discussed the pre-parser with the HW team the feedback I got was 
> that the only way to make sure we don't race the memory update is to 
> either leave enough CL of space or turn the parser off like you did 
> below. That discussion was about actual physical memory access though 
> and not TLB.
> Anyway, turning off the parser around the pipe control, if it fixes the 
> problem, shouldn't be too bad since the main performance advantage of 
> the parser is expected inside the user batch. The alternative would 
> probably be to stop invalidating the TLBs from within the ring and 
> switch to MMIO invalidations when lite-restoring a new request in the 
> ring (the CS will implicitly stop the parser and invalidate everything 
> on a full ctx switch).

I also only observe the effect on rcs0 atm. Does that tie in with the
preparser theory? i.e. that either the MI_FLUSH_DW is a strong barrier
or the preparser is rcs0 only. (Strictly speaking I haven't yet run the
igt_cs_tlb test on tgl/bcs0 so I am basing the above off the igt test
results that pass on bcs0 but consistently failed on rcs0.)
-Chris
On 9/16/19 1:54 PM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-09-16 21:37:26)
>>
>>
>> On 9/14/19 1:25 AM, Chris Wilson wrote:
>>> Before we execute a batch, we must first issue any and all TLB
>>> invalidations so that batch picks up the new page table entries.
>>> Tigerlake's preparser is weakening our post-sync CS_STALL inside the
>>> invalidate pipe-control and allowing the loading of the batch buffer
>>> before we have setup its page table (and so it loads the wrong page and
>>> executes indefinitely).
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> ---
>>> Suggestions welcome as this does not seem intended behaviour, so I
>>> suspect there is a strong pipecontrol flag we are missing.
>>
>> When I discussed the pre-parser with the HW team the feedback I got was
>> that the only way to make sure we don't race the memory update is to
>> either leave enough CL of space or turn the parser off like you did
>> below. That discussion was about actual physical memory access though
>> and not TLB.
>> Anyway, turning off the parser around the pipe control, if it fixes the
>> problem, shouldn't be too bad since the main performance advantage of
>> the parser is expected inside the user batch. The alternative would
>> probably be to stop invalidating the TLBs from within the ring and
>> switch to MMIO invalidations when lite-restoring a new request in the
>> ring (the CS will implicitly stop the parser and invalidate everything
>> on a full ctx switch).
> 
> I also only observe the effect on rcs0 atm. Does that tie in with the
> preparser theory? i.e. that either the MI_FLUSH_DW is a strong barrier
> or the preparser is rcs0 only. (Strictly speaking I haven't yet run the
> igt_cs_tlb test on tgl/bcs0 so I am basing the above off the igt test
> results that pass on bcs0 but consistently failed on rcs0.)
> -Chris
> 

AFAIK the pre-parser is part of the core CS logic, which should be the 
same on all engines (although the new behavior was mainly required for 
RCS workloads). The specs also don't mention a different behavior on 
MI_FLUSH. It might just be a timing thing: I'd expect the MI_FLUSH to be 
faster to execute than the pipe control, so fewer instructions will have 
accumulated in the pre-fetch FIFO in the meantime, especially 
considering that we have 2 calls close to each other (one from the 
breadcrumb fini of the previous request).

Daniele