Regression on gfx8 with ring init

Submitted by Koenig, Christian on Sept. 18, 2018, 3 p.m.

Details

Message ID edd44be9-2ef3-3c39-3342-5d3b4bbfa40a@amd.com
State New
Series "Regression on gfx8 with ring init" ( rev: 1 ) in AMD X.Org drivers


Commit Message

Koenig, Christian Sept. 18, 2018, 3 p.m.
Tom,

can you check whether the following makes it work again?

  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
  {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
         .emit_ib = gfx_v8_0_ring_emit_ib_compute,
         .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
         .test_ring = gfx_v8_0_ring_test_ring,
-       .test_ib = gfx_v8_0_ring_test_ib,
+       .test_ib = gfx_v8_0_kiq_ring_test_ib,
         .insert_nop = amdgpu_ring_insert_nop,
         .pad_ib = amdgpu_ring_generic_pad_ib,
         .emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.

On 18.09.2018 at 16:41, Christian König wrote:
> CRTC and GFX interrupts seem to be working perfectly fine.
>
> It looks like only the EOP interrupts from the compute queue 
> are not handled correctly.
>
> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>
> Christian.
>
> Am 18.09.2018 um 16:36 schrieb Deucher, Alexander:
>>
>> FWIW, a number of consumer Raven boards have bad IVRS tables (Windows 
>> doesn't use interrupt remapping, so they are sometimes wrong and 
>> probably not validated).  There are a number of workarounds to manually 
>> override the IVRS tables to make interrupts work. I think specifying 
>> pci=noacpi is also a possible workaround.
>>
>>
>> Alex
>>
>> ------------------------------------------------------------------------
>> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of 
>> Christian König <christian.koenig@amd.com>
>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>> *Subject:* Re: Regression on gfx8 with ring init
>> Well, it looks like interrupt processing is working perfectly fine.
>>
>> But looking at the error message once more I see that this actually
>> affects ring number 9 and not the GFX ring.
>>
>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>> number?
>>
>> That must be one of the compute rings.
>>
>> Thanks,
>> Christian.
>>
>> On 18.09.2018 at 16:20, Tom St Denis wrote:
>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>> >> Hmm, there is no failed IB test in there any more, is there?
>> >
>> > oh sorry I thought you wanted to test HEAD~ ... Attached is a log from
>> > the tip of drm-next
>> >
>> > Tom
>> >
>> >>
>> >> Christian.
>> >>
>> >> On 18.09.2018 at 16:09, Tom St Denis wrote:
>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>> >>>
>> >>> Here's the log.
>> >>>
>> >>> Tom
>> >>>
>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>> >>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>> >>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>> >>>> panicked before loading the network stack.
>> >>>>
>> >>>> Bizarre.
>> >>>>
>> >>>> I'll keep trying.
>> >>>>
>> >>>> Tom
>> >>>>
>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>> >>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>> >>>>>>> Great, not sure if that is good or bad news.
>> >>>>>>>
>> >>>>>>> Anyway, going to revert the change for now. Does anybody
>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>> >>>>>>> correctly on Raven?
>> >>>>>>
>> >>>>>> What does "doesn't work correctly" mean?  My workstation is a Raven1
>> >>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>> >>>>>> perfectly stable (through suspend/resumes too I might add).
>> >>>>>>
>> >>>>>> Anything I could test with my devel raven?
>> >>>>>
>> >>>>> The problem seems to be that on some boards IH handling doesn't
>> >>>>> work as it should.
>> >>>>>
>> >>>>> Can you try to disable the onboard graphics and try again?
>> >>>>>
>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Christian.
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Tom
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Christian.
>> >>>>>>>
>> >>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>> >>>>>>>> This commit:
>> >>>>>>>>
>> >>>>>>>> [root@raven linux]# git bisect good
>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>> >>>>>>>> Author: Christian König <christian.koenig@amd.com>
>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>> >>>>>>>>
>> >>>>>>>>     drm/amdgpu: remove fence fallback
>> >>>>>>>>
>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>> >>>>>>>>
>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>> >>>>>>>>     matter what.
>> >>>>>>>>
>> >>>>>>>>     Signed-off-by: Christian König <christian.koenig@amd.com>
>> >>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
>> >>>>>>>>
>> >>>>>>>> Results in this:
>> >>>>>>>>
>> >>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>> >>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>> >>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>> >>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>> >>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>> >>>>>>>>
>> >>>>>>>> On init with my polaris/raven1 system.
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>> Tom
>> >>>>>>>> _______________________________________________
>> >>>>>>>> amd-gfx mailing list
>> >>>>>>>> amd-gfx@lists.freedesktop.org
>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@  static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
         return r;
  }

+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+       return 0;
+}
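
To make the effect of the patch concrete, here is a minimal, self-contained C sketch (not kernel code). It assumes simplified stand-ins for the driver structures: `demo_ring`, `demo_ring_funcs`, `demo_ring_test_ib`, `demo_kiq_ring_test_ib` and `demo_ib_ring_tests` are hypothetical names invented for illustration, and the `fence_signals` flag merely simulates whether the completion interrupt arrives. It shows a per-ring callback table in which the KIQ entry points at a stub that always returns 0, while other rings still report -ETIMEDOUT (the -110 in the log above, on Linux/x86) when the fence never signals. Failures are reported by ring name rather than by index, as requested in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

struct demo_ring;

/* Hypothetical, simplified stand-in for a per-ring callback table. */
struct demo_ring_funcs {
	int (*test_ib)(struct demo_ring *ring, long timeout);
};

/* Hypothetical, simplified stand-in for a ring. */
struct demo_ring {
	const char *name;
	int fence_signals;	/* 1 if the completion interrupt would arrive */
	const struct demo_ring_funcs *funcs;
};

/* Generic test: pretend to wait for a fence, time out if it never signals. */
static int demo_ring_test_ib(struct demo_ring *ring, long timeout)
{
	(void)timeout;	/* the sketch does not really wait */
	return ring->fence_signals ? 0 : -ETIMEDOUT;
}

/* Stub in the spirit of the patch: skip the test and report success. */
static int demo_kiq_ring_test_ib(struct demo_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

/* One table per ring type; only the KIQ table uses the stub. */
const struct demo_ring_funcs demo_funcs_compute = {
	.test_ib = demo_ring_test_ib,
};

const struct demo_ring_funcs demo_funcs_kiq = {
	.test_ib = demo_kiq_ring_test_ib,
};

/*
 * Run the IB test of every ring through its callback table.
 * Failures are logged with the ring name instead of its index.
 * Returns 0 if all tests pass, otherwise the first error seen.
 */
int demo_ib_ring_tests(struct demo_ring *rings, int count)
{
	int ret = 0;

	assert(rings != NULL && count >= 0);

	for (int i = 0; i < count; i++) {
		int r = rings[i].funcs->test_ib(&rings[i], 1000);

		if (r) {
			fprintf(stderr, "failed testing IB on ring %s (%d)\n",
				rings[i].name, r);
			if (!ret)
				ret = r;
		}
	}
	return ret;
}
```

In this sketch, a ring that uses `demo_funcs_compute` and never sees its fence signal makes `demo_ib_ring_tests()` return -ETIMEDOUT, whereas a ring that uses `demo_funcs_kiq` passes regardless of interrupt delivery. The stub hides the symptom on that one ring; it does not explain why the fence was never signalled.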


Comments

What's the status of this error and the suggested patch to fix it? It 
impacts GPU reset on Polaris11.

Do we want to investigate why the original patch breaks it, or just 
disable the test with the proposed patch?


P.S. Suspend/resume also stopped working on the latest branch - will bisect 
it later today or tomorrow.


Andrey


Ping...


Andrey


On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
>
> What's the status of this error and the suggested patch to fix it? 
> It impacts GPU reset on Polaris11.
>
> Do we want to investigate why the original patch breaks it, or just 
> disable the test with the proposed patch?
>
>
> P.S. Suspend/resume also stopped working on the latest branch - will bisect 
> it later today or tomorrow.
>
>
> Andrey
>
>
I unfortunately don't have a Polaris to test this myself.

But please give me time till Monday so that I can at least try one more 
thing to fix it.

Christian.

On 21.09.2018 at 19:11, Andrey Grodzovsky wrote:
>
> Ping...
>
>
> Andrey
>
>
No worries, I will just revert locally until then to clear the extra 
errors during my investigation of current GPU reset status and issues.


Andrey


On 09/21/2018 01:53 PM, Christian König wrote:
> I unfortunately don't have a Polaris to test this myself.
>
> But please give me time till Monday so that I can at least try one 
> more thing to fix it.
>
> Christian.
>
> Am 21.09.2018 um 19:11 schrieb Andrey Grodzovsky:
>>
>> Ping...
>>
>>
>> Andrey
>>
>>
>> On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
>>>
>>> What's the status of this error and the suggested patch to fix it?
>>> It impacts GPU reset on Polaris11.
>>>
>>> Do we want to investigate why the original patch breaks it, or just
>>> disable the test with the proposed patch?
>>>
>>>
>>> P.S. Suspend/resume also stopped working on the latest branch - I will
>>> bisect it later today or tomorrow.
>>>
>>>
>>> Andrey
>>>
>>>
>>> On 09/18/2018 11:00 AM, Christian König wrote:
>>>> Tom,
>>>>
>>>> can you check whether the following makes it work again?
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>> index b6160de70d12..d65f5ba92fc5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>> @@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>>         return r;
>>>>  }
>>>>
>>>> +static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>>
>>>>  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
>>>>  {
>>>> @@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
>>>>         .emit_ib = gfx_v8_0_ring_emit_ib_compute,
>>>>         .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
>>>>         .test_ring = gfx_v8_0_ring_test_ring,
>>>> -       .test_ib = gfx_v8_0_ring_test_ib,
>>>> +       .test_ib = gfx_v8_0_kiq_ring_test_ib,
>>>>         .insert_nop = amdgpu_ring_insert_nop,
>>>>         .pad_ib = amdgpu_ring_generic_pad_ib,
>>>>         .emit_rreg = gfx_v8_0_ring_emit_rreg,
>>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> Am 18.09.2018 um 16:41 schrieb Christian König:
>>>>> CRTC and GFX interrupts seem to be working perfectly fine.
>>>>>
>>>>> The problem looks like only the EOP interrupts from the compute 
>>>>> queue are not handled correctly.
>>>>>
>>>>> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 18.09.2018 um 16:36 schrieb Deucher, Alexander:
>>>>>>
>>>>>> FWIW, a number of consumer Raven boards have bad IVRS tables 
>>>>>> (Windows doesn't use interrupt remapping, so they are sometimes 
>>>>>> wrong and probably not validated).  There are a number of 
>>>>>> workarounds to manually override the IVRS tables to make 
>>>>>> interrupts work.  I think specifying pci=noacpi is also a 
>>>>>> possible workaround.
>>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf 
>>>>>> of Christian König <christian.koenig@amd.com>
>>>>>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>>>>>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>>>>>> *Subject:* Re: Regression on gfx8 with ring init
>>>>>>
>>>>>> Well, it looks like interrupt processing is working perfectly fine.
>>>>>>
>>>>>> But looking at the error message once more I see that this actually
>>>>>> affects ring number 9 and not the GFX ring.
>>>>>>
>>>>>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of
>>>>>> the number?
>>>>>>
>>>>>> That must be one of the compute rings.
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>> Am 18.09.2018 um 16:20 schrieb Tom St Denis:
>>>>>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>>>>>> >> Mhm, there is no failed IB test in there any more, is there?
>>>>>> >
>>>>>> > Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log
>>>>>> > from the tip of drm-next.
>>>>>> >
>>>>>> > Tom
>>>>>> >
>>>>>> >>
>>>>>> >> Christian.
>>>>>> >>
>>>>>> >> Am 18.09.2018 um 16:09 schrieb Tom St Denis:
>>>>>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>>>> >>>
>>>>>> >>> Here's the log.
>>>>>> >>>
>>>>>> >>> Tom
>>>>>> >>>
>>>>>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>>>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>>>> >>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>>>> >>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>>>>>> >>>> panicked before loading the network stack.
>>>>>> >>>>
>>>>>> >>>> Bizarre.
>>>>>> >>>>
>>>>>> >>>> I'll keep trying.
>>>>>> >>>>
>>>>>> >>>> Tom
>>>>>> >>>>
>>>>>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>>> >>>>> Am 18.09.2018 um 15:32 schrieb Tom St Denis:
>>>>>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>> >>>>>>> Great, not sure if that is good or bad news.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Anyway, I'm going to revert the change for now. Does anybody
>>>>>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>> >>>>>>> correctly on Raven?
>>>>>> >>>>>>
>>>>>> >>>>>> What does "doesn't work correctly" mean?  My workstation is a
>>>>>> >>>>>> Raven1 (Ryzen 2400G) and, other than the TTM bulk move issue, has
>>>>>> >>>>>> been perfectly stable (through suspend/resumes too, I might add).
>>>>>> >>>>>>
>>>>>> >>>>>> Anything I could test with my devel raven?
>>>>>> >>>>>
>>>>>> >>>>> The problem seems to be that on some boards IH handling doesn't
>>>>>> >>>>> work as it should.
>>>>>> >>>>>
>>>>>> >>>>> Can you try to disable the onboard graphics and try again?
>>>>>> >>>>>
>>>>>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>>>>>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>>>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>> Christian.
>>>>>> >>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>> Tom
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> Christian.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Am 18.09.2018 um 15:27 schrieb Tom St Denis:
>>>>>> >>>>>>>> This commit:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> [root@raven linux]# git bisect good
>>>>>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>> >>>>>>>> Author: Christian König <christian.koenig@amd.com>
>>>>>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>     drm/amdgpu: remove fence fallback
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>>>>> >>>>>>>>     matter what.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> >>>>>>>> Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Results in this:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>>>> >>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>>>> >>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>>>> >>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>>>> >>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>>>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On init with my polaris/raven1 system.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Cheers,
>>>>>> >>>>>>>> Tom
>
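For reference, the request quoted above (print ring->name rather than the ring number in amdgpu_ib_ring_tests()) could look roughly like the user-space sketch below. It is not the driver code: struct demo_ring, demo_format_ib_failure() and the ring name comp_example are hypothetical stand-ins. The sketch also decodes the error value, on the assumption that the -110 in the log is Linux's -ETIMEDOUT, which would fit the "IB test timed out" line.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the driver's ring object; only the two
 * fields needed for the message are modelled here. */
struct demo_ring {
    unsigned int idx;   /* numeric ring index, e.g. 9 in the log */
    const char *name;   /* human-readable ring name */
};

/* Format the failure message with the ring name instead of the index,
 * and append the errno text for the (negative) error code.
 * Returns what snprintf() returns: the length the message needs. */
int demo_format_ib_failure(char *buf, size_t len,
                           const struct demo_ring *ring, int err)
{
    return snprintf(buf, len, "failed testing IB on ring %s (%d: %s)",
                    ring->name, err, strerror(-err));
}
```

Called with a ring named comp_example and -ETIMEDOUT, the helper produces a message that identifies the ring by name and carries the errno text next to the number, so the reader no longer has to map an index such as 9 back to a ring.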
BTW, this also seems to be what breaks suspend/resume.


Andrey
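
As a rough illustration of what the workaround patch quoted in this thread does, the sketch below models a per-ring function table in plain user-space C. Every name in it (demo_ring, demo_ring_funcs, demo_run_ib_tests, the irq_works flag) is hypothetical and only mirrors the shape of the .test_ib change in the diff; it makes no claim about how the real amdgpu code behaves beyond that.

```c
#include <stddef.h>

struct demo_ring;

/* Hypothetical, cut-down analogue of a ring function table. */
struct demo_ring_funcs {
    int (*test_ib)(struct demo_ring *ring, long timeout);
};

struct demo_ring {
    const char *name;
    const struct demo_ring_funcs *funcs;
    int irq_works;  /* simulated: does the completion interrupt arrive? */
};

/* Generic test: succeeds only if the simulated completion arrives. */
int demo_ring_test_ib(struct demo_ring *ring, long timeout)
{
    (void)timeout;
    return ring->irq_works ? 0 : -110; /* -110 mirrors the value in the log */
}

/* Stub in the spirit of the patch: reports success without testing. */
int demo_kiq_ring_test_ib(struct demo_ring *ring, long timeout)
{
    (void)ring;
    (void)timeout;
    return 0;
}

const struct demo_ring_funcs demo_funcs_compute = { .test_ib = demo_ring_test_ib };
const struct demo_ring_funcs demo_funcs_kiq = { .test_ib = demo_kiq_ring_test_ib };

/* Run each ring's test through its own table; return the first failure. */
int demo_run_ib_tests(struct demo_ring *rings, size_t n, long timeout)
{
    for (size_t i = 0; i < n; i++) {
        int r = rings[i].funcs->test_ib(&rings[i], timeout);
        if (r)
            return r;
    }
    return 0;
}
```

In this model a ring wired to the stub always passes, even when its simulated interrupt never arrives, while a ring on the generic test still fails with -110. That is the trade-off behind the question raised earlier in the thread: the stub silences the test on one ring type rather than explaining why the completion is not seen.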

