[v3,1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.

Submitted by Andrey Grodzovsky on Aug. 30, 2019, 4:39 p.m.

Details

Message ID 1567183153-11014-1-git-send-email-andrey.grodzovsky@amd.com
State New
Series "Series without cover letter" ( rev: 1 ) in AMD X.Org drivers

Commit Message

Andrey Grodzovsky Aug. 30, 2019, 4:39 p.m.
Issue 1:
In the XGMI case, amdgpu_device_lock_adev for the other devices in the
hive was called too late, after their respective schedulers had already
been accessed. So relocate the lock to the beginning, before any access
to the other devices.
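
To make the ordering concrete, here is a condensed sketch of the fixed
loop (scheduler handling elided; the full change is in the diff below):

	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
		/* Take the device lock before touching anything on it. */
		if (tmp_adev != adev)
			amdgpu_device_lock_adev(tmp_adev, false);

		/* Only now unregister the instance and stop schedulers. */
		amdgpu_unregister_gpu_instance(tmp_adev);
		/* ... stop this device's schedulers, reset job's ring ... */
	}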

Issue 2:
Using amdgpu_device_ip_need_full_reset to switch the device list from
all devices in the hive to the single 'master' device that owns this
reset call is wrong: when stopping the schedulers we iterate over all
the devices in the hive, but when restarting we would only reactivate
the 'master' device. Also, if amdgpu_device_pre_asic_reset concludes
that a full reset IS needed, we then have to stop the schedulers for
all devices in the hive, not only the 'master', but by that point we
have already missed the opportunity to do so. So just remove this logic
and always stop and start the schedulers for all devices in the hive.
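
Concretely, this is the switch being removed; it silently shrank
device_list_handle to the single-device list after all hive members'
schedulers had already been stopped:

	if (!amdgpu_device_ip_need_full_reset(adev))
		device_list_handle = &device_list;	/* only 'adev' left */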

Also a minor cleanup and print fix: the success/failure messages now
read each device's own gpu_reset_counter instead of the master's.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a5daccc..19f6624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3814,15 +3814,16 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		device_list_handle = &device_list;
 	}
 
-	/*
-	 * Mark these ASICs to be reseted as untracked first
-	 * And add them back after reset completed
-	 */
-	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
-		amdgpu_unregister_gpu_instance(tmp_adev);
-
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
+		if (tmp_adev != adev)
+			amdgpu_device_lock_adev(tmp_adev, false);
+		/*
+		 * Mark these ASICs to be reseted as untracked first
+		 * And add them back after reset completed
+		 */
+		amdgpu_unregister_gpu_instance(tmp_adev);
+
 		/* disable ras on ALL IPs */
 		if (amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
@@ -3848,9 +3849,6 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	    dma_fence_is_signaled(job->base.s_fence->parent))
 		job_signaled = true;
 
-	if (!amdgpu_device_ip_need_full_reset(adev))
-		device_list_handle = &device_list;
-
 	if (job_signaled) {
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
@@ -3869,10 +3867,9 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 retry:	/* Rest of adevs pre asic reset from XGMI hive. */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 
-		if (tmp_adev == adev)
+		if(tmp_adev == adev)
 			continue;
 
-		amdgpu_device_lock_adev(tmp_adev, false);
 		r = amdgpu_device_pre_asic_reset(tmp_adev,
 						 NULL,
 						 &need_full_reset);
@@ -3921,10 +3918,10 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 		if (r) {
 			/* bad news, how to tell it to userspace ? */
-			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
 			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
 		} else {
-			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
 		}
 
 		amdgpu_device_unlock_adev(tmp_adev);

Comments

On 2019-08-30 12:39 p.m., Andrey Grodzovsky wrote:
> Issue 1:
> In the XGMI case, amdgpu_device_lock_adev for the other devices in the
> hive was called too late, after their respective schedulers had already
> been accessed. So relocate the lock to the beginning, before any access
> to the other devices.
>
> Issue 2:
> Using amdgpu_device_ip_need_full_reset to switch the device list from
> all devices in the hive to the single 'master' device that owns this
> reset call is wrong: when stopping the schedulers we iterate over all
> the devices in the hive, but when restarting we would only reactivate
> the 'master' device. Also, if amdgpu_device_pre_asic_reset concludes
> that a full reset IS needed, we then have to stop the schedulers for
> all devices in the hive, not only the 'master', but by that point we
> have already missed the opportunity to do so. So just remove this logic
> and always stop and start the schedulers for all devices in the hive.
>
> Also a minor cleanup and print fix: the success/failure messages now
> read each device's own gpu_reset_counter instead of the master's.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>


Minor nit-pick inline. With that fixed, this patch is
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++--------------
>   1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a5daccc..19f6624 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3814,15 +3814,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		device_list_handle = &device_list;
>   	}
>   
> -	/*
> -	 * Mark these ASICs to be reseted as untracked first
> -	 * And add them back after reset completed
> -	 */
> -	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
> -		amdgpu_unregister_gpu_instance(tmp_adev);
> -
>   	/* block all schedulers and reset given job's ring */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +		if (tmp_adev != adev)
> +			amdgpu_device_lock_adev(tmp_adev, false);
> +		/*
> +		 * Mark these ASICs to be reseted as untracked first
> +		 * And add them back after reset completed
> +		 */
> +		amdgpu_unregister_gpu_instance(tmp_adev);
> +
>   		/* disable ras on ALL IPs */
>   		if (amdgpu_device_ip_need_full_reset(tmp_adev))
>   			amdgpu_ras_suspend(tmp_adev);
> @@ -3848,9 +3849,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>   		job_signaled = true;
>   
> -	if (!amdgpu_device_ip_need_full_reset(adev))
> -		device_list_handle = &device_list;
> -
>   	if (job_signaled) {
>   		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>   		goto skip_hw_reset;
> @@ -3869,10 +3867,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   retry:	/* Rest of adevs pre asic reset from XGMI hive. */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>   
> -		if (tmp_adev == adev)
> +		if(tmp_adev == adev)


The space before ( was correct coding style. This will trigger a 
checkpatch error or warning.
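
That is, per kernel coding style the line should stay as it was:

	if (tmp_adev == adev)
		continue;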


>   			continue;
>   
> -		amdgpu_device_lock_adev(tmp_adev, false);
>   		r = amdgpu_device_pre_asic_reset(tmp_adev,
>   						 NULL,
>   						 &need_full_reset);
> @@ -3921,10 +3918,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   		if (r) {
>   			/* bad news, how to tell it to userspace ? */
> -			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
> +			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
>   			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>   		} else {
> -			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
> +			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>   		}
>   
>   		amdgpu_device_unlock_adev(tmp_adev);