amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

Submitted by Marek Olšák on Dec. 11, 2017, 9:29 p.m.

Details

Message ID 1513027772-32408-1-git-send-email-maraeo@gmail.com
State New
Headers show
Series "amdgpu: disable GPU reset if amdgpu.lockup_timeout=0" ( rev: 1 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Marek Olšák Dec. 11, 2017, 9:29 p.m.
From: Marek Olšák <marek.olsak@amd.com>

Signed-off-by: Marek Olšák <marek.olsak@amd.com>
---

Is this really correct? I have no easy way to test it.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
 1 file changed, 4 insertions(+)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@  static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 {
 	struct drm_atomic_state *state = NULL;
 	uint64_t reset_flags = 0;
 	int i, r, resched;
 
+	/* amdgpu.lockup_timeout=0 disables GPU reset. */
+	if (amdgpu_lockup_timeout == 0)
+		return 0;
+
 	if (!amdgpu_check_soft_reset(adev)) {
 		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
 		return 0;
 	}
 
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
 	adev->in_gpu_reset = 1;

Comments

Am 11.12.2017 um 22:29 schrieb Marek Olšák:
> From: Marek Olšák <marek.olsak@amd.com>
>
> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
> ---
>
> Is this really correct? I have no easy way to test it.

It's a step in the right direction, but I would rather vote for 
something else:

Instead of disabling the timeout by default we only disable the GPU 
reset/recovery.

The idea is to add a new parameter amdgpu_gpu_recovery which makes 
amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at 
all (on bare metal systems).

Then we finally set the amdgpu_lockup_timeout to a non zero value by 
default.

Andrey could you take care of this when you have time?

Thanks,
Christian.

>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..56c41cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
>    *
>    * Attempt to reset the GPU if it has hung (all asics).
>    * Returns 0 for success or an error on failure.
>    */
>   int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>   	struct drm_atomic_state *state = NULL;
>   	uint64_t reset_flags = 0;
>   	int i, r, resched;
>   
> +	/* amdgpu.lockup_timeout=0 disables GPU reset. */
> +	if (amdgpu_lockup_timeout == 0)
> +		return 0;
> +
>   	if (!amdgpu_check_soft_reset(adev)) {
>   		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>   		return 0;
>   	}
>   
>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
>   	mutex_lock(&adev->lock_reset);
>   	atomic_inc(&adev->gpu_reset_counter);
>   	adev->in_gpu_reset = 1;
On 12/12/2017 04:01 AM, Christian König wrote:
> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>> From: Marek Olšák <marek.olsak@amd.com>
>>
>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>> ---
>>
>> Is this really correct? I have no easy way to test it.
>
> It's a step in the right direction, but I would rather vote for 
> something else:
>
> Instead of disabling the timeout by default we only disable the GPU 
> reset/recovery.
>
> The idea is to add a new parameter amdgpu_gpu_recovery which makes 
> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU 
> at all (on bare metal systems).
>
> Then we finally set the amdgpu_lockup_timeout to a non zero value by 
> default.
>
> Andrey could you take care of this when you have time?
>
> Thanks,
> Christian.

Sure.

Thanks,
Andrey

>
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8d03baa..56c41cf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct 
>> amdgpu_device *adev, uint64_t *reset_flags,
>>    *
>>    * Attempt to reset the GPU if it has hung (all asics).
>>    * Returns 0 for success or an error on failure.
>>    */
>>   int amdgpu_gpu_recover(struct amdgpu_device *adev, struct 
>> amdgpu_job *job)
>>   {
>>       struct drm_atomic_state *state = NULL;
>>       uint64_t reset_flags = 0;
>>       int i, r, resched;
>>   +    /* amdgpu.lockup_timeout=0 disables GPU reset. */
>> +    if (amdgpu_lockup_timeout == 0)
>> +        return 0;
>> +
>>       if (!amdgpu_check_soft_reset(adev)) {
>>           DRM_INFO("No hardware hang detected. Did some blocks 
>> stall?\n");
>>           return 0;
>>       }
>>         dev_info(adev->dev, "GPU reset begin!\n");
>>         mutex_lock(&adev->lock_reset);
>>       atomic_inc(&adev->gpu_reset_counter);
>>       adev->in_gpu_reset = 1;
>
On Tue, Dec 12, 2017 at 10:01 AM, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>
>> From: Marek Olšák <marek.olsak@amd.com>
>>
>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>> ---
>>
>> Is this really correct? I have no easy way to test it.
>
>
> It's a step in the right direction, but I would rather vote for something
> else:
>
> Instead of disabling the timeout by default we only disable the GPU
> reset/recovery.
>
> The idea is to add a new parameter amdgpu_gpu_recovery which makes
> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
> (on bare metal systems).
>
> Then we finally set the amdgpu_lockup_timeout to a non zero value by
> default.
>
> Andrey could you take care of this when you have time?

I don't understand this.

Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
disabled GPU reset? Why do we have to add another option for the same
thing?

Marek
Am 12.12.2017 um 15:57 schrieb Marek Olšák:
> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>> From: Marek Olšák <marek.olsak@amd.com>
>>>
>>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>>> ---
>>>
>>> Is this really correct? I have no easy way to test it.
>>
>> It's a step in the right direction, but I would rather vote for something
>> else:
>>
>> Instead of disabling the timeout by default we only disable the GPU
>> reset/recovery.
>>
>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
>> (on bare metal systems).
>>
>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>> default.
>>
>> Andrey could you take care of this when you have time?
> I don't understand this.
>
> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
> disabled GPU reset? Why do we have to add another option for the same
> thing?

lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

You could still manually trigger a reset and also invalid commands, 
invalid register writes and requests from the SRIOV hypervisor could 
trigger this.

And as Monk explained GPU resets are mandatory for SRIOV, you can't 
disable them at all in this case.

Additional to that we probably want the error message that something 
timed out, but not touching the hardware in any way.

Regards,
Christian.

>
> Marek
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Tue, Dec 12, 2017 at 5:36 PM, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 12.12.2017 um 15:57 schrieb Marek Olšák:
>>
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>
>>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>>>
>>>> From: Marek Olšák <marek.olsak@amd.com>
>>>>
>>>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>>>> ---
>>>>
>>>> Is this really correct? I have no easy way to test it.
>>>
>>>
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>>
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>>
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at
>>> all
>>> (on bare metal systems).
>>>
>>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>>> default.
>>>
>>> Andrey could you take care of this when you have time?
>>
>> I don't understand this.
>>
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
>
>
> lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

>
> You could still manually trigger a reset and also invalid commands, invalid
> register writes and requests from the SRIOV hypervisor could trigger this.

That's OK. Manual resets should always be allowed.

>
> And as Monk explained GPU resets are mandatory for SRIOV, you can't disable
> them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway?

Let's just say lockup_timeout=0 has undefined behavior with SRIOV.

>
> Additional to that we probably want the error message that something timed
> out, but not touching the hardware in any way.

Yes that is a fair point.

Marek