"ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

Submitted by Grodzovsky, Andrey on Feb. 12, 2019, 5:46 p.m.

Details

Message ID 0349a887-b5b2-827a-830a-91c93dc9628d@amd.com
State New
Series ""ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)" ( rev: 1 ) in AMD X.Org drivers

Commit Message

I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
and is then stuck waiting for fences to signal in
reservation_object_wait_timeout_rcu (which won't signal because there
was a VM_FAULT). Then, when we try to shut down the display block during
reset recovery from drm_atomic_helper_suspend, we also try to reserve the
BO, probably from dm_plane_helper_cleanup_fb, ending in a deadlock.
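
The suspected lock ordering can be modeled in a few lines of userspace C.
This is only a sketch under stated assumptions: every model_* name is
hypothetical and does not exist in amdgpu, the reservation is reduced to a
boolean flag, and a call that would sleep in the kernel simply returns
false here so that the model itself never hangs.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in amdgpu. */
struct model_fence {
	bool signaled;              /* stays false after the VM fault */
};

struct model_bo {
	bool reserved;              /* stands in for the BO reservation */
	struct model_fence *fence;  /* fence the flip path waits on */
};

/* Take the reservation. The real call would sleep while the BO is
 * already reserved; the model returns false at that point instead. */
static bool model_bo_reserve(struct model_bo *bo)
{
	if (bo->reserved)
		return false;
	bo->reserved = true;
	return true;
}

static void model_bo_unreserve(struct model_bo *bo)
{
	bo->reserved = false;
}

/* Flip path: reserve, then wait for the fence with the reservation held.
 * Returns true when the flip completed, false when it is stuck waiting
 * (the reservation is deliberately left held in that case). */
static bool model_flip(struct model_bo *bo)
{
	if (!model_bo_reserve(bo))
		return false;
	if (!bo->fence->signaled)
		return false;
	model_bo_unreserve(bo);
	return true;
}

/* Reset path: cleanup needs the same reservation to unpin the BO. */
static bool model_cleanup_fb(struct model_bo *bo)
{
	if (!model_bo_reserve(bo))
		return false;       /* real code would block here */
	model_bo_unreserve(bo);
	return true;
}
```

With a fence that never signals, model_flip() returns false with the
reservation still held, and a following model_cleanup_fb() on the same BO
fails as well, which is the point where the real code would block forever.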

To confirm, I am attaching some printks around the BO reservation -
please apply and rerun.

Also, it is probably a good idea to open an FDO ticket for this instead
of using amd-gfx.

Andrey


On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com> wrote:
>> It should recover you - so this looks like a bug. I noticed in one of
>> the call traces this - drm_atomic_helper_suspend which points to system
>> going into sleep mode, is it what happened, did it hang when system
>> tried to sleep ?
>>
> It's weird because the computer was not enter in sleep mode. I am sure.
> Steps for reproduce:
> 1. Launch Shadow of the Tomb Raider on Proton
> 2. Wait some time until mouse stop respond
> 3. Dump gfx, waves and all other dumps including dmesg
>
> And of course the power button (button which enter in sleep mode) was
> not pressed.
>
> So the new dumps has any new useful info? Or they are pointless?
> --
> Best Regards,
> Mike Gavrilov.

Patch

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d59bafc..e15cd3c 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2353,6 +2353,8 @@  static int get_fb_info(const struct amdgpu_framebuffer *amdgpu_fb,
                       uint64_t *tiling_flags)
 {
        struct amdgpu_bo *rbo = gem_to_amdgpu_bo(amdgpu_fb->base.obj[0]);
+
+       DRM_ERROR("Before %p\n",rbo);
        int r = amdgpu_bo_reserve(rbo, false);
 
        if (unlikely(r)) {
@@ -2362,6 +2364,8 @@  static int get_fb_info(const struct amdgpu_framebuffer *amdgpu_fb,
                return r;
        }
 
+       DRM_ERROR("After %p\n",rbo);
+
        if (tiling_flags)
                amdgpu_bo_get_tiling_flags(rbo, tiling_flags);
 
@@ -3715,9 +3719,11 @@  static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
        obj = new_state->fb->obj[0];
        rbo = gem_to_amdgpu_bo(obj);
        adev = amdgpu_ttm_adev(rbo->tbo.bdev);
+       DRM_ERROR("Before %p\n",rbo);
        r = amdgpu_bo_reserve(rbo, false);
        if (unlikely(r != 0))
                return r;
+       DRM_ERROR("After %p\n",rbo);
 
        if (plane->type != DRM_PLANE_TYPE_CURSOR)
                domain = amdgpu_display_supported_domains(adev);
@@ -3790,11 +3796,13 @@  static void dm_plane_helper_cleanup_fb(struct drm_plane *plane,
                return;
 
        rbo = gem_to_amdgpu_bo(old_state->fb->obj[0]);
+       DRM_ERROR("Before %d\n",__LINE__);
        r = amdgpu_bo_reserve(rbo, false);
        if (unlikely(r)) {
                DRM_ERROR("failed to reserve rbo before unpin\n");
                return;
        }
+       DRM_ERROR("After %d\n",__LINE__);
 
        amdgpu_bo_unpin(rbo);
        amdgpu_bo_unreserve(rbo);
@@ -4801,15 +4809,17 @@  static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
                         * blocking commit to as per framework helpers
                         */
                        abo = gem_to_amdgpu_bo(fb->obj[0]);
+                       DRM_ERROR("Before %p\n",abo);
                        r = amdgpu_bo_reserve(abo, true);
                        if (unlikely(r != 0)) {
                                DRM_ERROR("failed to reserve buffer before flip\n");
                                WARN_ON(1);
                        }
-
+                       DRM_ERROR("After %p\n",abo);
                        /* Wait for all fences on this FB */
                        WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
-                                                                                   MAX_SCHEDULE_TIMEOUT) < 0);
+                                       msecs_to_jiffies(5000)) < 0);
+                       DRM_ERROR("After  reservation_object_wait_timeout_rcu %p\n",abo);
 
                        amdgpu_bo_get_tiling_flags(abo, &tiling_flags);


Comments

The MAX_SCHEDULE_TIMEOUT is probably not a good idea on the wait in DM.

I wonder if we could just do a shorter wait and skip the FB
update/programming if it fails after some reasonable amount of time.

This would still allow recovery to happen at least even if the display 
isn't showing the right buffer.
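
A bounded wait with a skip path could look roughly like the userspace
model below. It is a sketch, not driver code: the timed_fence type, the
tick-based polling loop and the flip_result enum are all hypothetical.
Only the return convention is borrowed from
reservation_object_wait_timeout_rcu, which returns a value greater than
zero on success and 0 when the wait timed out, so a check for "< 0" alone
does not catch the timeout case.

```c
#include <stdbool.h>

/* Hypothetical fence that signals at a given tick, or never if < 0. */
struct timed_fence {
	long signal_at_tick;
};

/* Poll for at most timeout_ticks. Returns the remaining ticks (> 0)
 * if the fence signaled in time, or 0 if the wait timed out. */
static long model_wait_timeout(const struct timed_fence *f,
			       long timeout_ticks)
{
	for (long t = 0; t < timeout_ticks; t++) {
		if (f->signal_at_tick >= 0 && t >= f->signal_at_tick)
			return timeout_ticks - t;
	}
	return 0;
}

enum flip_result {
	FLIP_PROGRAMMED,    /* fence signaled, FB update goes ahead */
	FLIP_SKIPPED        /* wait timed out, FB update is skipped */
};

/* Treat both errors (< 0) and timeouts (== 0) as "do not program". */
static enum flip_result model_flip_bounded(const struct timed_fence *f,
					   long timeout_ticks)
{
	if (model_wait_timeout(f, timeout_ticks) <= 0)
		return FLIP_SKIPPED;
	return FLIP_PROGRAMMED;
}
```

A fence that never signals then ends in FLIP_SKIPPED after the budget is
used up instead of holding the caller forever, which is what would let
recovery proceed.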

Nicholas Kazlauskas

Sure, that would probably be the solution. One missing detail here
(besides confirming with the debug prints that this is the scenario we
are hitting) is WHY we are even stuck in
reservation_object_wait_timeout_rcu: in amdgpu_device_pre_asic_reset
(during GPU reset) we first force completion of all outstanding HW
fences through amdgpu_fence_driver_force_completion BEFORE proceeding
to suspend the IP blocks in amdgpu_device_ip_suspend. One possible
explanation would be that the fence attached to the BO is a scheduler
fence (SW fence) and not the backing HW fence. I will be able to verify
this with some fence traces after confirming that the deadlock is
indeed the one I described.
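
One way this hypothesis could play out is sketched below as a userspace
model. The mechanism is an assumption and is not confirmed anywhere in
this thread: the model supposes that a scheduler fence only signals once
its backing HW fence has signaled, and that a job still queued in software
has no HW fence yet. All model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical HW fence, one per job already submitted to the ring. */
struct model_hw_fence {
	bool signaled;
};

/* Hypothetical scheduler (SW) fence. hw is NULL while the job is still
 * queued in software and has not reached the ring (assumption). */
struct model_sched_fence {
	bool signaled;
	struct model_hw_fence *hw;
};

/* Models forcing completion of all outstanding HW fences on reset. */
static void model_force_completion(struct model_hw_fence *fences,
				   size_t count)
{
	for (size_t i = 0; i < count; i++)
		fences[i].signaled = true;
}

/* A SW fence follows its HW fence; without one it cannot signal. */
static void model_sched_fence_update(struct model_sched_fence *sf)
{
	if (sf->hw != NULL && sf->hw->signaled)
		sf->signaled = true;
}
```

In this model, forcing completion signals the SW fence of a submitted job
but leaves the SW fence of a still-queued job unsignaled, so a waiter on
that fence would keep waiting. Fence traces would be needed to tell
whether anything like this happens in the real driver.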

Andrey

On Tue, 12 Feb 2019 at 22:46, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> wrote:
> I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
> and is then stuck waiting for fences to signal in
> reservation_object_wait_timeout_rcu (which won't signal because there
> was a VM_FAULT). Then, when we try to shut down the display block during
> reset recovery from drm_atomic_helper_suspend, we also try to reserve the
> BO, probably from dm_plane_helper_cleanup_fb, ending in a deadlock.
>
> To confirm, I am attaching some printks around the BO reservation -
> please apply and rerun.
>
> Also, it is probably a good idea to open an FDO ticket for this instead
> of using amd-gfx.
>
> Andrey

Hi Andrey,
It looks like at least one line is missing in the Linux tree, or the patch is incorrect.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c#n4606

[image: Screenshot from 2019-02-12 23-52-28.png]
--
Best Regards,
Mike Gavrilov.

On 2/12/19 2:00 PM, Mikhail Gavrilov wrote:
> Hi Andrey,
> It looks like at least one line is missing in the Linux tree, or the patch is incorrect.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c#n4606

Sorry, for your kernel this particular set of prints should go in
amdgpu_dm_do_flip (https://elixir.bootlin.com/linux/v5.0-rc6/ident/amdgpu_dm_do_flip).

Andrey

> [Screenshot from 2019-02-12 23-52-28.png]
> --
> Best Regards,
> Mike Gavrilov.