drm/amdgpu: Simplify amdgpu_lockup_timeout usage.

Submitted by Koenig, Christian on Dec. 19, 2017, 5:47 p.m.

Details

Message ID: f5351f8f-ad38-5517-ae87-e5d8efab1839@amd.com
State: New
Series: "drm/amdgpu: Simplify amdgpu_lockup_timeout usage." (rev 5) in AMD X.Org drivers

Commit Message

Koenig, Christian Dec. 19, 2017, 5:47 p.m.
Yeah, that is a known issue which came to the fore again because Andrey's 
patch is slightly buggy.

Please test and review the attached (only compile tested) fix for 
Andrey's patch.

Still working on finding the root cause, but so far I haven't had time 
for that.

Regards,
Christian.

On 19.12.2017 at 18:26, Michel Dänzer wrote:
> On 2017-12-13 08:44 PM, Andrey Grodzovsky wrote:
>> With the introduction of amdgpu_gpu_recovery we no longer need
>> to rely on amdgpu_lockup_timeout == 0 for disabling GPU reset.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Since this change landed, I'm once again unable to finish a piglit run
> on my development machine, see the attached dmesg output (happens pretty
> quickly, after ~5% of piglit tests have run). I realized that with
> lockup_timeout != 0, the
>
> 	WARN_ON_ONCE(bo->tbo.mem.mem_type == TTM_PL_SYSTEM);
>
> at the top of amdgpu_bo_gpu_offset has been triggering since the 4.15
> development cycle. See the bisection result below. Note that I'm not
> 100% sure this is the correct guilty commit, since it's probably been
> the most painful bisection I've done so far (14 skips, and I had to
> revert 4 commits that caused other regressions). But I'm quite sure this
> regression happened in the
> 84d43463a2d09c28c9222fbb7d1082c078e2523a..3f3333f8a0e90ac26f84ed7b0aa344efce695c08
> range.
>
>
> 3f3333f8a0e90ac26f84ed7b0aa344efce695c08 is the first bad commit
> commit 3f3333f8a0e90ac26f84ed7b0aa344efce695c08
> Author: Christian König <christian.koenig@amd.com>
> Date:   Thu Aug 3 14:02:13 2017 +0200
>
>      drm/amdgpu: track evicted page tables v2
>
>      Instead of validating all page tables when one was evicted,
>      track which one needs a validation.
>
>      v2: simplify amdgpu_vm_ready as well
>
>      Signed-off-by: Christian König <christian.koenig@amd.com>
>      Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
>      Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
>
>
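
The WARN_ON_ONCE quoted above fires when a buffer object still sits in
system memory, where it has no GPU-visible address, at the moment the
driver asks for its GPU offset. As a rough sketch of the function for
context (simplified from the 4.15-era code; the upstream version carries
additional sanity checks):

    /* Simplified sketch of amdgpu_bo_gpu_offset(), 4.15 era. */
    u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
    {
            /* A BO placed in TTM_PL_SYSTEM has no valid GPU mapping;
             * hitting this usually means a validation step was skipped,
             * e.g. an evicted page table that was never re-validated
             * before its address was used.
             */
            WARN_ON_ONCE(bo->tbo.mem.mem_type == TTM_PL_SYSTEM);

            return bo->tbo.offset;
    }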

Patch

From 769eb5d08910e16058902fada499d21655ccdba4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Tue, 19 Dec 2017 18:44:43 +0100
Subject: [PATCH] drm/amdgpu: fix test for shadow page tables
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

They don't work 100% correctly at the moment.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c87b7f902c9..9fac284315cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2833,7 +2833,11 @@ bool amdgpu_need_backup(struct amdgpu_device *adev)
 	if (adev->flags & AMD_IS_APU)
 		return false;
 
-	return amdgpu_gpu_recovery;
+	if (amdgpu_gpu_recovery == 0 ||
+	    (amdgpu_gpu_recovery == -1  && !amdgpu_sriov_vf(adev)))
+		return false;
+
+	return true;
 }
 
 static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
-- 
2.11.0
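
For context on the hunk above: amdgpu_gpu_recovery is a tri-state module
parameter (-1 = auto, 0 = disabled, 1 = enabled), where auto enables
recovery only for SR-IOV virtual functions. The change makes
amdgpu_need_backup() follow the same convention, so shadow copies are
only kept when a GPU recovery could actually consume them. Written out
as a standalone helper purely for illustration (the helper itself is
hypothetical, not part of the patch):

    /* Hypothetical helper mirroring the new amdgpu_need_backup() check;
     * amdgpu_gpu_recovery: -1 = auto, 0 = disabled, 1 = enabled.
     */
    static bool gpu_recovery_enabled(int amdgpu_gpu_recovery, bool is_sriov_vf)
    {
            if (amdgpu_gpu_recovery == 0)   /* explicitly disabled */
                    return false;
            if (amdgpu_gpu_recovery == -1)  /* auto: SR-IOV VFs only */
                    return is_sriov_vf;
            return true;                    /* explicitly enabled */
    }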


Comments

On 2017-12-19 06:47 PM, Christian König wrote:
> Yeah, that is a known issue which came to the fore again because Andrey's
> patch is slightly buggy.
> 
> Please test and review the attached (only compile tested) fix for
> Andrey's patch.
> 
> Still working on finding the root cause, but so far I haven't had time for
> that.

Unfortunately, I haven't been able to narrow it down to any particular
piglit tests. What I can say though is that it's triggered much more
quickly when running piglit with --process-isolation false.

On 21.12.2017 at 18:09, Michel Dänzer wrote:
> On 2017-12-19 06:47 PM, Christian König wrote:
>> Yeah, that is a known issue which came to the fore again because Andrey's
>> patch is slightly buggy.
>>
>> Please test and review the attached (only compile tested) fix for
>> Andrey's patch.
>>
>> Still working on finding the root cause, but so far I haven't had time for
>> that.
> Unfortunately, I haven't been able to narrow it down to any particular
> piglit tests. What I can say though is that it's triggered much more
> quickly when running piglit with --process-isolation false.

I think I'm able to reproduce it now as well.

I thought a bit about what the difference could be between my test 
machines and the systems you were able to reproduce it on, given the 
lack of VM faults on my side no matter what I do.

My conclusion was that it is somehow related to the amount of installed 
system memory, and reducing TTM's swap limit indeed resulted in a VM 
fault rather quickly.
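
This ties the reproduction back to the WARN_ON_ONCE above: once TTM
crosses its swap limit it starts swapping buffer objects out to system
memory, i.e. into TTM_PL_SYSTEM, so a swapped-out page table whose GPU
offset is still queried would trip exactly that warning. As a conceptual
sketch of the trigger condition (names are hypothetical; the real
accounting lives in TTM's global memory code):

    /* Conceptual only: TTM swaps BOs out to system memory once its
     * accounted usage crosses the swap threshold. Lowering the
     * threshold provokes this with far less memory pressure, which is
     * how the fault was reproduced on a machine with plenty of RAM.
     */
    static bool ttm_over_swap_limit(u64 used_pages, u64 swap_limit_pages)
    {
            return used_pages > swap_limit_pages;
    }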

I've only been able to reproduce this once so far, but it happened so 
quickly that I might have found the root cause.

Christian.