GART write flush error on SI w/ amdgpu

Submitted by Marek Olšák on May 26, 2017, 3:47 p.m.

Details

Message ID CAAxE2A7Q=qdxtm6HXF3Py9dYpARvA6+Z4ENkWTCEA572SdNEqw@mail.gmail.com
State New
Headers show
Series "GART write flush error on SI w/ amdgpu" ( rev: 2 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Marek Olšák May 26, 2017, 3:47 p.m.
On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
> Hi all,
>
> I'm seeing some very strange errors on Verde with CPU readback from GART,
> and am pretty much out of ideas. Some help would be very much appreciated.
>
> The error manifests with the
> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
but
> *not* on radeon. Here's what the test does:
>
> 1. Upload a texture.
> 2. Read the texture back via a shader that uses shader buffer writes to
> write data to a buffer that is allocated in GART.
> 3. The CPU then reads from the buffer -- and sometimes gets stale data.
>
> This sequence is repeated for many sub-tests. There are some sub-tests
where
> the CPU reads stale data from the buffer, i.e. the shader writes simply
> don't make it to the CPU. The tests vary superficially, e.g. the first
> failing test is (almost?) always one where data is written in 16-bit words
> (but there are succeeding sub-tests with 16-bit writes as well).
>
> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
> between the fence wait and the return of glMapBuffer does not fix the
> problem. The data must be stuck in a cache somewhere.
>
> Since the test runs okay with the radeon module, I tried some changes
based
> on comparing the IB submit between radeon and amdgpu, and based on
comparing
> register settings via scans obtained from umr. Some of the things I've
> tried:
>
> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
amdgpu/gfx9
> set this)
> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
> (radeon does this)
> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
PFP
> (which seems more logical, and is done by gfx7+), or remove the
> corresponding WRITE_DATA entirely
>
> None of these changes helped.
>
> What *does* help is adding an artificial wait. Specifically, I'm adding a
> sequence of
>
> - WRITE_DATA
> - CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
> - WAIT_REG_MEM
>
> as can be seen in the attached patch. This works around the problem, but
it
> makes no sense:
>
> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
works
> around the problem. However(!) it does not actually cause the UMD to wait
> any longer than before. Without this change, the UMD immediately sees a
> signaled user fence (and never uses an ioctl to wait), and with this
change,
> it *still* sees a signaled user fence.
>
> Also, note that the way I've hacked the change, the wait sequence is only
> added for the user fence emit (and I'm using a modified UMD to ensure that
> there is enough memory to be used by the added wait sequence).
>
> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
the
> problem.
>
> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
> encourages some part of the GPU to flush the data from wherever it's
stuck,
> and that's just really bizarre. There must be something really simple I'm
> missing, and any pointers would be appreciated.

Have you tried this?


One more cache flush there shouldn't hurt.

Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a
try.

Marek

Patch hide | download patch | download mbox

diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
b/src/gallium/drivers/radeonsi/si_hw_context.c
index 92c09cb..e6ac0ba 100644
--- a/src/gallium/drivers/radeonsi/si_hw_context.c
+++ b/src/gallium/drivers/radeonsi/si_hw_context.c
@@ -133,7 +133,8 @@  void si_context_gfx_flush(void *context, unsigned flags,
                        SI_CONTEXT_PS_PARTIAL_FLUSH;

        /* DRM 3.1.0 doesn't flush TC for VI correctly. */
-       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
+       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
||
+           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
                ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |
                                SI_CONTEXT_INV_VMEM_L1;

Comments

BTW, I noticed the flush sequence in the kernel is wrong. The correct
flush sequence should be:

1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
but no fence/interrupt.
2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
3) SURFACE_SYNC (TC, K$, I$)
4) Write CP_COHER_CNTL2.
5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the interrupt.

WAIT_REG_MEM wouldn't be needed if we were able to merge
CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
event.

The main issue with the current flush sequence in radeon and amdgpu is
that it doesn't wait for idle before writing CP_COHER_CNTL2 and
SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
idle in userspace IBs.

Marek


On Fri, May 26, 2017 at 5:47 PM, Marek Olšák <maraeo@gmail.com> wrote:
> On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
>> Hi all,
>>
>> I'm seeing some very strange errors on Verde with CPU readback from GART,
>> and am pretty much out of ideas. Some help would be very much appreciated.
>>
>> The error manifests with the
>> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
>> but
>> *not* on radeon. Here's what the test does:
>>
>> 1. Upload a texture.
>> 2. Read the texture back via a shader that uses shader buffer writes to
>> write data to a buffer that is allocated in GART.
>> 3. The CPU then reads from the buffer -- and sometimes gets stale data.
>>
>> This sequence is repeated for many sub-tests. There are some sub-tests
>> where
>> the CPU reads stale data from the buffer, i.e. the shader writes simply
>> don't make it to the CPU. The tests vary superficially, e.g. the first
>> failing test is (almost?) always one where data is written in 16-bit words
>> (but there are succeeding sub-tests with 16-bit writes as well).
>>
>> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
>> between the fence wait and the return of glMapBuffer does not fix the
>> problem. The data must be stuck in a cache somewhere.
>>
>> Since the test runs okay with the radeon module, I tried some changes
>> based
>> on comparing the IB submit between radeon and amdgpu, and based on
>> comparing
>> register settings via scans obtained from umr. Some of the things I've
>> tried:
>>
>> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
>> amdgpu/gfx9
>> set this)
>> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
>> (radeon does this)
>> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
>> PFP
>> (which seems more logical, and is done by gfx7+), or remove the
>> corresponding WRITE_DATA entirely
>>
>> None of these changes helped.
>>
>> What *does* help is adding an artificial wait. Specifically, I'm adding a
>> sequence of
>>
>> - WRITE_DATA
>> - CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
>> - WAIT_REG_MEM
>>
>> as can be seen in the attached patch. This works around the problem, but
>> it
>> makes no sense:
>>
>> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
>> works
>> around the problem. However(!) it does not actually cause the UMD to wait
>> any longer than before. Without this change, the UMD immediately sees a
>> signaled user fence (and never uses an ioctl to wait), and with this
>> change,
>> it *still* sees a signaled user fence.
>>
>> Also, note that the way I've hacked the change, the wait sequence is only
>> added for the user fence emit (and I'm using a modified UMD to ensure that
>> there is enough memory to be used by the added wait sequence).
>>
>> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
>> the
>> problem.
>>
>> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
>> encourages some part of the GPU to flush the data from wherever it's
>> stuck,
>> and that's just really bizarre. There must be something really simple I'm
>> missing, and any pointers would be appreciated.
>
> Have you tried this?
>
> diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
> b/src/gallium/drivers/radeonsi/si_hw_context.c
> index 92c09cb..e6ac0ba 100644
> --- a/src/gallium/drivers/radeonsi/si_hw_context.c
> +++ b/src/gallium/drivers/radeonsi/si_hw_context.c
> @@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags,
>                         SI_CONTEXT_PS_PARTIAL_FLUSH;
>
>         /* DRM 3.1.0 doesn't flush TC for VI correctly. */
> -       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
> +       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
> ||
> +           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
>                 ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |
>                                 SI_CONTEXT_INV_VMEM_L1;
>
> One more cache flush there shouldn't hurt.
>
> Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a
> try.
>
> Marek
Am 20.06.2017 um 12:34 schrieb Marek Olšák:
> BTW, I noticed the flush sequence in the kernel is wrong. The correct
> flush sequence should be:
>
> 1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
> but no fence/interrupt.
> 2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
> 3) SURFACE_SYNC (TC, K$, I$)
> 4) Write CP_COHER_CNTL2.
> 5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the interrupt.
>
> WAIT_REG_MEM wouldn't be needed if we were able to merge
> CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
> event.
>
> The main issue with the current flush sequence in radeon and amdgpu is
> that it doesn't wait for idle before writing CP_COHER_CNTL2 and
> SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
> idle in userspace IBs.

Well not waiting for idle between IBs is an explicit requirement, 
because it is rather bad for performance to do so.

David Zhou, Monk and I worked quite a lot on this to avoid both possible 
hazard and performance drop.

Christian.

>
> Marek
>
>
> On Fri, May 26, 2017 at 5:47 PM, Marek Olšák <maraeo@gmail.com> wrote:
>> On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
>>> Hi all,
>>>
>>> I'm seeing some very strange errors on Verde with CPU readback from GART,
>>> and am pretty much out of ideas. Some help would be very much appreciated.
>>>
>>> The error manifests with the
>>> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
>>> but
>>> *not* on radeon. Here's what the test does:
>>>
>>> 1. Upload a texture.
>>> 2. Read the texture back via a shader that uses shader buffer writes to
>>> write data to a buffer that is allocated in GART.
>>> 3. The CPU then reads from the buffer -- and sometimes gets stale data.
>>>
>>> This sequence is repeated for many sub-tests. There are some sub-tests
>>> where
>>> the CPU reads stale data from the buffer, i.e. the shader writes simply
>>> don't make it to the CPU. The tests vary superficially, e.g. the first
>>> failing test is (almost?) always one where data is written in 16-bit words
>>> (but there are succeeding sub-tests with 16-bit writes as well).
>>>
>>> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
>>> between the fence wait and the return of glMapBuffer does not fix the
>>> problem. The data must be stuck in a cache somewhere.
>>>
>>> Since the test runs okay with the radeon module, I tried some changes
>>> based
>>> on comparing the IB submit between radeon and amdgpu, and based on
>>> comparing
>>> register settings via scans obtained from umr. Some of the things I've
>>> tried:
>>>
>>> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
>>> amdgpu/gfx9
>>> set this)
>>> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
>>> (radeon does this)
>>> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
>>> PFP
>>> (which seems more logical, and is done by gfx7+), or remove the
>>> corresponding WRITE_DATA entirely
>>>
>>> None of these changes helped.
>>>
>>> What *does* help is adding an artificial wait. Specifically, I'm adding a
>>> sequence of
>>>
>>> - WRITE_DATA
>>> - CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
>>> - WAIT_REG_MEM
>>>
>>> as can be seen in the attached patch. This works around the problem, but
>>> it
>>> makes no sense:
>>>
>>> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
>>> works
>>> around the problem. However(!) it does not actually cause the UMD to wait
>>> any longer than before. Without this change, the UMD immediately sees a
>>> signaled user fence (and never uses an ioctl to wait), and with this
>>> change,
>>> it *still* sees a signaled user fence.
>>>
>>> Also, note that the way I've hacked the change, the wait sequence is only
>>> added for the user fence emit (and I'm using a modified UMD to ensure that
>>> there is enough memory to be used by the added wait sequence).
>>>
>>> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
>>> the
>>> problem.
>>>
>>> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
>>> encourages some part of the GPU to flush the data from wherever it's
>>> stuck,
>>> and that's just really bizarre. There must be something really simple I'm
>>> missing, and any pointers would be appreciated.
>> Have you tried this?
>>
>> diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
>> b/src/gallium/drivers/radeonsi/si_hw_context.c
>> index 92c09cb..e6ac0ba 100644
>> --- a/src/gallium/drivers/radeonsi/si_hw_context.c
>> +++ b/src/gallium/drivers/radeonsi/si_hw_context.c
>> @@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags,
>>                          SI_CONTEXT_PS_PARTIAL_FLUSH;
>>
>>          /* DRM 3.1.0 doesn't flush TC for VI correctly. */
>> -       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
>> +       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
>> ||
>> +           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
>>                  ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |
>>                                  SI_CONTEXT_INV_VMEM_L1;
>>
>> One more cache flush there shouldn't hurt.
>>
>> Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a
>> try.
>>
>> Marek
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On 20.06.2017 12:34, Marek Olšák wrote:
> BTW, I noticed the flush sequence in the kernel is wrong. The correct
> flush sequence should be:
> 
> 1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
> but no fence/interrupt.
> 2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
> 3) SURFACE_SYNC (TC, K$, I$)
> 4) Write CP_COHER_CNTL2.
> 5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the interrupt.
> 
> WAIT_REG_MEM wouldn't be needed if we were able to merge
> CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
> event.
> 
> The main issue with the current flush sequence in radeon and amdgpu is
> that it doesn't wait for idle before writing CP_COHER_CNTL2 and
> SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
> idle in userspace IBs.

This is gfx9-only though, right?

Cheers,
Nicolai


> 
> Marek
> 
> 
> On Fri, May 26, 2017 at 5:47 PM, Marek Olšák <maraeo@gmail.com> wrote:
>> On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
>>> Hi all,
>>>
>>> I'm seeing some very strange errors on Verde with CPU readback from GART,
>>> and am pretty much out of ideas. Some help would be very much appreciated.
>>>
>>> The error manifests with the
>>> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
>>> but
>>> *not* on radeon. Here's what the test does:
>>>
>>> 1. Upload a texture.
>>> 2. Read the texture back via a shader that uses shader buffer writes to
>>> write data to a buffer that is allocated in GART.
>>> 3. The CPU then reads from the buffer -- and sometimes gets stale data.
>>>
>>> This sequence is repeated for many sub-tests. There are some sub-tests
>>> where
>>> the CPU reads stale data from the buffer, i.e. the shader writes simply
>>> don't make it to the CPU. The tests vary superficially, e.g. the first
>>> failing test is (almost?) always one where data is written in 16-bit words
>>> (but there are succeeding sub-tests with 16-bit writes as well).
>>>
>>> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
>>> between the fence wait and the return of glMapBuffer does not fix the
>>> problem. The data must be stuck in a cache somewhere.
>>>
>>> Since the test runs okay with the radeon module, I tried some changes
>>> based
>>> on comparing the IB submit between radeon and amdgpu, and based on
>>> comparing
>>> register settings via scans obtained from umr. Some of the things I've
>>> tried:
>>>
>>> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
>>> amdgpu/gfx9
>>> set this)
>>> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
>>> (radeon does this)
>>> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
>>> PFP
>>> (which seems more logical, and is done by gfx7+), or remove the
>>> corresponding WRITE_DATA entirely
>>>
>>> None of these changes helped.
>>>
>>> What *does* help is adding an artificial wait. Specifically, I'm adding a
>>> sequence of
>>>
>>> - WRITE_DATA
>>> - CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
>>> - WAIT_REG_MEM
>>>
>>> as can be seen in the attached patch. This works around the problem, but
>>> it
>>> makes no sense:
>>>
>>> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
>>> works
>>> around the problem. However(!) it does not actually cause the UMD to wait
>>> any longer than before. Without this change, the UMD immediately sees a
>>> signaled user fence (and never uses an ioctl to wait), and with this
>>> change,
>>> it *still* sees a signaled user fence.
>>>
>>> Also, note that the way I've hacked the change, the wait sequence is only
>>> added for the user fence emit (and I'm using a modified UMD to ensure that
>>> there is enough memory to be used by the added wait sequence).
>>>
>>> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
>>> the
>>> problem.
>>>
>>> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
>>> encourages some part of the GPU to flush the data from wherever it's
>>> stuck,
>>> and that's just really bizarre. There must be something really simple I'm
>>> missing, and any pointers would be appreciated.
>>
>> Have you tried this?
>>
>> diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
>> b/src/gallium/drivers/radeonsi/si_hw_context.c
>> index 92c09cb..e6ac0ba 100644
>> --- a/src/gallium/drivers/radeonsi/si_hw_context.c
>> +++ b/src/gallium/drivers/radeonsi/si_hw_context.c
>> @@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags,
>>                          SI_CONTEXT_PS_PARTIAL_FLUSH;
>>
>>          /* DRM 3.1.0 doesn't flush TC for VI correctly. */
>> -       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
>> +       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
>> ||
>> +           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
>>                  ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |
>>                                  SI_CONTEXT_INV_VMEM_L1;
>>
>> One more cache flush there shouldn't hurt.
>>
>> Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a
>> try.
>>
>> Marek
On Tue, Jun 20, 2017 at 1:49 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
> On 20.06.2017 12:34, Marek Olšák wrote:
>>
>> BTW, I noticed the flush sequence in the kernel is wrong. The correct
>> flush sequence should be:
>>
>> 1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
>> but no fence/interrupt.
>> 2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
>> 3) SURFACE_SYNC (TC, K$, I$)
>> 4) Write CP_COHER_CNTL2.
>> 5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the
>> interrupt.
>>
>> WAIT_REG_MEM wouldn't be needed if we were able to merge
>> CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
>> event.
>>
>> The main issue with the current flush sequence in radeon and amdgpu is
>> that it doesn't wait for idle before writing CP_COHER_CNTL2 and
>> SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
>> idle in userspace IBs.
>
>
> This is gfx9-only though, right?

No, I'm only talking about SI.

Marek
On Tue, Jun 20, 2017 at 1:46 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Am 20.06.2017 um 12:34 schrieb Marek Olšák:
>>
>> BTW, I noticed the flush sequence in the kernel is wrong. The correct
>> flush sequence should be:
>>
>> 1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
>> but no fence/interrupt.
>> 2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
>> 3) SURFACE_SYNC (TC, K$, I$)
>> 4) Write CP_COHER_CNTL2.
>> 5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the
>> interrupt.
>>
>> WAIT_REG_MEM wouldn't be needed if we were able to merge
>> CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
>> event.
>>
>> The main issue with the current flush sequence in radeon and amdgpu is
>> that it doesn't wait for idle before writing CP_COHER_CNTL2 and
>> SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
>> idle in userspace IBs.
>
>
> Well not waiting for idle between IBs is an explicit requirement, because it
> is rather bad for performance to do so.
>
> David Zhou, Monk and I worked quite a lot on this to avoid both possible
> hazard and performance drop.

I guess the requirement was ignored for SI. If you don't do the TC
flush as part the EOP event, you have to wait for idle before
SURFACE_SYNC, because SURFACE_SYNC doesn't wait for idle. It's kinda
useless to flush TC when shaders are still in flight.

Marek