drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

Submitted by Deng, Emily on Aug. 8, 2018, 2:13 a.m.

Details

Message ID 1533694400-21433-1-git-send-email-Emily.Deng@amd.com
State New
Headers show
Series "drm/amdgpu/sriov: give 8s for recover vram under RUNTIME" ( rev: 2 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Deng, Emily Aug. 8, 2018, 2:13 a.m.
Modify the commit message

Extend the timeout for recovering vram bos from shadows on sr-iov
to cover the worst case scenario for timeslices and VFs

Under runtime, the wait fence time could be quite long when
other VFs are in exclusive mode. For example, for 4 VF, every
VF's exclusive timeout time is set to 3s, then the worst case is
9s. If the VF number is more than 4,then the worst case time will
be longer.
The 8s is the test data, with setting to 8s, it will pass the TDR
test for 1000 times.

SWDEV-161490

Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1d933db..ef82ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3124,7 +3124,7 @@  static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
 	long tmo;
 
 	if (amdgpu_sriov_runtime(adev))
-		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
+		tmo = msecs_to_jiffies(8000);
 	else
 		tmo = msecs_to_jiffies(100);
 

Comments

Dear Deng,


On 08/08/18 04:13, Emily Deng wrote:
> Modify the commit message

I guess the line above is a leftover from some template, and can be
removed?

> Extend the timeout for recovering vram bos from shadows on sr-iov
> to cover the worst case scenario for timeslices and VFs
> 
> Under runtime, the wait fence time could be quite long when
> other VFs are in exclusive mode. For example, for 4 VF, every
> VF's exclusive timeout time is set to 3s, then the worst case is
> 9s. If the VF number is more than 4,then the worst case time will
> be longer.

Nit: Missing space after the comma.

How did you get to nine seconds? Isn’t it four times three, which is
twelve?

> The 8s is the test data, with setting to 8s, it will pass the TDR
> test for 1000 times.
> 
> SWDEV-161490
> 
> Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1d933db..ef82ad1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3124,7 +3124,7 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>  	long tmo;
>  
>  	if (amdgpu_sriov_runtime(adev))
> -		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> +		tmo = msecs_to_jiffies(8000);

Actually, I do not understand the change at all. Isn’t that a module
parameter?

```
$ git grep amdgpu_lockup_timeout
drivers/gpu/drm/amd/amdgpu/amdgpu.h:extern int amdgpu_lockup_timeout;
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:     if (amdgpu_lockup_timeout == 0) {
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             amdgpu_lockup_timeout = 10000;
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:int amdgpu_lockup_timeout = 10000;
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:                      timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:  if (amdgpu_lockup_timeout == 0)
```

So, if it’s not set, the time-out is set to 10(!) seconds anyway,
isn’t it? What am I missing?

>  	else
>  		tmo = msecs_to_jiffies(100);
>  
> 

In my opinion, such time-outs need to have a big FIXME added to
it, and VF’s exclusive time-outs should be drastically decreased.


Kind regards,

Paul
Hi Paul,
     Answers as below:

-----Original Message-----
From: Paul Menzel <pmenzel+amd-gfx@molgen.mpg.de> 

Sent: Wednesday, August 8, 2018 10:29 PM
To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

Dear Deng,


On 08/08/18 04:13, Emily Deng wrote:
> Modify the commit message


I guess the line above is a leftover from some template, and can be
removed?
[Emily] It is removed
> Extend the timeout for recovering vram bos from shadows on sr-iov

> to cover the worst case scenario for timeslices and VFs

> 

> Under runtime, the wait fence time could be quite long when

> other VFs are in exclusive mode. For example, for 4 VF, every

> VF's exclusive timeout time is set to 3s, then the worst case is

> 9s. If the VF number is more than 4,then the worst case time will

> be longer.


Nit: Missing space after the comma.

How did you get to nine seconds? Isn’t it four times three, which is
twelve?
[Emily] It is 3 times three, not 4, as the VF only need to wait the other three VF's exclusive timeout.

> The 8s is the test data, with setting to 8s, it will pass the TDR

> test for 1000 times.

> 

> SWDEV-161490

> 

> Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a

> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

> Signed-off-by: Emily Deng <Emily.Deng@amd.com>

> ---

>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

> 

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> index 1d933db..ef82ad1 100644

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> @@ -3124,7 +3124,7 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)

>  	long tmo;

>  

>  	if (amdgpu_sriov_runtime(adev))

> -		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);

> +		tmo = msecs_to_jiffies(8000);


Actually, I do not understand the change at all. Isn’t that a module
parameter?

```
$ git grep amdgpu_lockup_timeout
drivers/gpu/drm/amd/amdgpu/amdgpu.h:extern int amdgpu_lockup_timeout;
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:     if (amdgpu_lockup_timeout == 0) {
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             amdgpu_lockup_timeout = 10000;
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:int amdgpu_lockup_timeout = 10000;
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:                      timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:  if (amdgpu_lockup_timeout == 0)
```

So, if it’s not set, the time-out is set to 10(!) seconds anyway,
isn’t it? What am I missing?

>  	else

>  		tmo = msecs_to_jiffies(100);

>  

> 


In my opinion, such time-outs need to have a big FIXME added to
it, and VF’s exclusive time-outs should be drastically decreased.


Kind regards,

Paul
Dear Emily,


Thank you for your reply.

Am 09.08.2018 um 04:11 schrieb Deng, Emily:
> Hi Paul,
>       Answers as below:

It’d be great if you had an mail user agent (email program) that
supports quoting. Maybe you could configure Outlook(?) that way? That
would tremendously useful for the mailing list archives too.

> -----Original Message-----
> From: Paul Menzel <pmenzel+amd-gfx@molgen.mpg.de>
> Sent: Wednesday, August 8, 2018 10:29 PM
> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu@amd.com>
> Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

> On 08/08/18 04:13, Emily Deng wrote:

[…]

>> Extend the timeout for recovering vram bos from shadows on sr-iov
>> to cover the worst case scenario for timeslices and VFs
>>
>> Under runtime, the wait fence time could be quite long when
>> other VFs are in exclusive mode. For example, for 4 VF, every
>> VF's exclusive timeout time is set to 3s, then the worst case is
>> 9s. If the VF number is more than 4,then the worst case time will
>> be longer.
> 
> Nit: Missing space after the comma.
> 
> How did you get to nine seconds? Isn’t it four times three, which is
> twelve?
> [Emily] It is 3 times three, not 4, as the VF only need to wait the
> other three VF's exclusive timeout.

Thank you, I didn’t know that.

(It looks like you missed my inline comments for the change itself
below.)

>> The 8s is the test data, with setting to 8s, it will pass the TDR
>> test for 1000 times.
>>
>> SWDEV-161490
>>
>> Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 1d933db..ef82ad1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3124,7 +3124,7 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>   	long tmo;
>>   
>>   	if (amdgpu_sriov_runtime(adev))
>> -		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
>> +		tmo = msecs_to_jiffies(8000);
> 
> Actually, I do not understand the change at all. Isn’t that a module
> parameter?
> 
> ```
> $ git grep amdgpu_lockup_timeout
> drivers/gpu/drm/amd/amdgpu/amdgpu.h:extern int amdgpu_lockup_timeout;
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:     if (amdgpu_lockup_timeout == 0) {
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             amdgpu_lockup_timeout = 10000;
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:int amdgpu_lockup_timeout = 10000;
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:                      timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:  if (amdgpu_lockup_timeout == 0)
> ```
> 
> So, if it’s not set, the time-out is set to 10(!) seconds anyway,
> isn’t it? What am I missing?
> 
>>   	else
>>   		tmo = msecs_to_jiffies(100);
>>   
>>
> 
> In my opinion, such time-outs need to have a big FIXME added to
> it, and VF’s exclusive time-outs should be drastically decreased.


Kind regards,

Paul
Hi Menzel,
     Thanks for your advice, will have a try.

Best wishes
Emily Deng

-----Original Message-----
From: Paul Menzel <pmenzel+amd-gfx@molgen.mpg.de> 

Sent: Thursday, August 9, 2018 2:47 PM
To: Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

Dear Emily,


Thank you for your reply.

Am 09.08.2018 um 04:11 schrieb Deng, Emily:
> Hi Paul,

>       Answers as below:


It’d be great if you had an mail user agent (email program) that supports quoting. Maybe you could configure Outlook(?) that way? That would tremendously useful for the mailing list archives too.

> -----Original Message-----

> From: Paul Menzel <pmenzel+amd-gfx@molgen.mpg.de>

> Sent: Wednesday, August 8, 2018 10:29 PM

> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

> Cc: Liu, Monk <Monk.Liu@amd.com>

> Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under 

> RUNTIME


> On 08/08/18 04:13, Emily Deng wrote:


[…]

>> Extend the timeout for recovering vram bos from shadows on sr-iov to 

>> cover the worst case scenario for timeslices and VFs

>>

>> Under runtime, the wait fence time could be quite long when other VFs 

>> are in exclusive mode. For example, for 4 VF, every VF's exclusive 

>> timeout time is set to 3s, then the worst case is 9s. If the VF 

>> number is more than 4,then the worst case time will be longer.

> 

> Nit: Missing space after the comma.

> 

> How did you get to nine seconds? Isn’t it four times three, which is 

> twelve?

> [Emily] It is 3 times three, not 4, as the VF only need to wait the 

> other three VF's exclusive timeout.


Thank you, I didn’t know that.

(It looks like you missed my inline comments for the change itself
below.)

>> The 8s is the test data, with setting to 8s, it will pass the TDR 

>> test for 1000 times.

>>

>> SWDEV-161490

>>

>> Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a

>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>

>> ---

>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-

>>   1 file changed, 1 insertion(+), 1 deletion(-)

>>

>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 

>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

>> index 1d933db..ef82ad1 100644

>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

>> @@ -3124,7 +3124,7 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)

>>   	long tmo;

>>   

>>   	if (amdgpu_sriov_runtime(adev))

>> -		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);

>> +		tmo = msecs_to_jiffies(8000);

> 

> Actually, I do not understand the change at all. Isn’t that a module 

> parameter?

> 

> ```

> $ git grep amdgpu_lockup_timeout

> drivers/gpu/drm/amd/amdgpu/amdgpu.h:extern int amdgpu_lockup_timeout;

> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:     if (amdgpu_lockup_timeout == 0) {

> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             amdgpu_lockup_timeout = 10000;

> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             tmo = msecs_to_jiffies(amdgpu_lockup_timeout);

> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:int amdgpu_lockup_timeout = 

> 10000; drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);

> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:                      timeout = msecs_to_jiffies(amdgpu_lockup_timeout);

> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:  if (amdgpu_lockup_timeout == 

> 0) ```

> 

> So, if it’s not set, the time-out is set to 10(!) seconds anyway, 

> isn’t it? What am I missing?

> 

>>   	else

>>   		tmo = msecs_to_jiffies(100);

>>   

>>

> 

> In my opinion, such time-outs need to have a big FIXME added to it, 

> and VF’s exclusive time-outs should be drastically decreased.



Kind regards,

Paul