drm/amdgpu/sriov: reject kms open if TDR not finished or failed

Submitted by Deng, Emily on April 25, 2018, 6:39 a.m.

Details

Message ID 1524638384-31977-1-git-send-email-Emily.Deng@amd.com
State New
Headers show
Series "drm/amdgpu/sriov: reject kms open if TDR not finished or failed" ( rev: 1 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Deng, Emily April 25, 2018, 6:39 a.m.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
 1 file changed, 3 insertions(+)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 6d55cae..adeca71 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -847,6 +847,9 @@  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
 	struct amdgpu_fpriv *fpriv;
 	int r, pasid;
 
+	if (adev->in_gpu_reset)
+		return -ECANCELED;
+
 	file_priv->driver_priv = NULL;
 
 	r = pm_runtime_get_sync(dev->dev);

Comments

On 2018-04-25 08:39 AM, Emily Deng wrote:
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 6d55cae..adeca71 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -847,6 +847,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>  	struct amdgpu_fpriv *fpriv;
>  	int r, pasid;
>  
> +	if (adev->in_gpu_reset)
> +		return -ECANCELED;

This seems like a bad idea, as it would cause basically any userspace
which wants to use the GPU to fail to start during a GPU reset.
Hi Michel,
     > This seems like a bad idea, as it would cause basically any userspace which

> wants to use the GPU to fail to start during a GPU reset.

[Emily] Yes, this is what the change want to do, when driver is doing gpu recover or hardware is doing reset,
it doesn't want to be interrupted, and during the reset any driver open kms is meaningless. This behavior could
let the gpu recover and hardware reset back more quickly. After finished the reset, we will set adev->in_gpu_reset=0
again, then could allow the application open the kms.

Best Wishes,
Emily Deng

> -----Original Message-----

> From: Michel Dänzer [mailto:michel@daenzer.net]

> Sent: Wednesday, April 25, 2018 3:48 PM

> To: Deng, Emily <Emily.Deng@amd.com>

> Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>

> Subject: Re: [PATCH] drm/amdgpu/sriov: reject kms open if TDR not finished

> or failed

> 

> On 2018-04-25 08:39 AM, Emily Deng wrote:

> > Signed-off-by: Monk Liu <Monk.Liu@amd.com>

> > ---

> >  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++

> >  1 file changed, 3 insertions(+)

> >

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > index 6d55cae..adeca71 100644

> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > @@ -847,6 +847,9 @@ int amdgpu_driver_open_kms(struct drm_device

> *dev, struct drm_file *file_priv)

> >  	struct amdgpu_fpriv *fpriv;

> >  	int r, pasid;

> >

> > +	if (adev->in_gpu_reset)

> > +		return -ECANCELED;

> 

> This seems like a bad idea, as it would cause basically any userspace which

> wants to use the GPU to fail to start during a GPU reset.

> 

> 

> --

> Earthling Michel Dänzer               |               http://www.amd.com

> Libre software enthusiast             |             Mesa and X developer
On 2018-04-26 04:03 AM, Deng, Emily wrote:
> Hi Michel,
>> This seems like a bad idea, as it would cause basically any
>> userspace which wants to use the GPU to fail to start during a GPU
>> reset.
> [Emily] Yes, this is what the change want to do, when driver is doing
> gpu recover or hardware is doing reset, it doesn't want to be
> interrupted, and during the reset any driver open kms is meaningless.

Applications randomly failing to start up during a GPU reset would be
surprising and confusing for the user. The driver needs to handle this
transparently.
This seems like a bad idea, as it would cause basically any userspace
which wants to use the GPU to fail to start during a GPU reset.
[Emily] Yes, this is what the change want to do, when driver is doing
gpu recover or hardware is doing reset, it doesn't want to be
 interrupted, and during the reset any driver open kms is meaningless.
 
Applications randomly failing to start up during a GPU reset would be
 surprising and confusing for the user. The driver needs to handle this
 transparently.
[Emily] Yes, you are right, how about to sleep for some time in here to wait the GPU reset 
successfully. After sleep, if it still in gpu reset, then return error to APP?

Best Wishes,
Emily Deng

> -----Original Message-----

> From: Michel Dänzer [mailto:michel@daenzer.net]

> Sent: Thursday, April 26, 2018 3:33 PM

> To: Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>

> Cc: amd-gfx@lists.freedesktop.org

> Subject: Re: [PATCH] drm/amdgpu/sriov: reject kms open if TDR not finished

> or failed

> 

> On 2018-04-26 04:03 AM, Deng, Emily wrote:

> > Hi Michel,

> >> This seems like a bad idea, as it would cause basically any userspace

> >> which wants to use the GPU to fail to start during a GPU reset.

> > [Emily] Yes, this is what the change want to do, when driver is doing

> > gpu recover or hardware is doing reset, it doesn't want to be

> > interrupted, and during the reset any driver open kms is meaningless.

> 

> Applications randomly failing to start up during a GPU reset would be

> surprising and confusing for the user. The driver needs to handle this

> transparently.

> 

> 

> --

> Earthling Michel Dänzer               |               http://www.amd.com

> Libre software enthusiast             |             Mesa and X developer
On 2018-04-26 09:57 AM, Deng, Emily wrote:
>>>> This seems like a bad idea, as it would cause basically any
>>>> userspace which wants to use the GPU to fail to start during a
>>>> GPU reset.
>>> [Emily] Yes, this is what the change want to do, when driver is
>>> doing gpu recover or hardware is doing reset, it doesn't want to
>>> be interrupted, and during the reset any driver open kms is
>>> meaningless.
>> 
>> Applications randomly failing to start up during a GPU reset would
>> be surprising and confusing for the user. The driver needs to
>> handle this transparently.
> [Emily] Yes, you are right, how about to sleep for some time in here
> to wait the GPU reset successfully. After sleep, if it still in gpu
> reset, then return error to APP?

I think the only case where returning an error here *might* be
appropriate is when the driver has tried resetting the GPU, but it has
definitely and irrecoverably failed. But even in that case, it might be
better to let the open succeed, and let userspace figure out what
happened by other means.
Sorry, I don't get your point, when the GPU is doing a reset, even let the dkms open, the userspace will
still fail, so during the reset, any driver open kms is meaningless.

Best Wishes,
Emily Deng

> -----Original Message-----

> From: Michel Dänzer [mailto:michel@daenzer.net]

> Sent: Thursday, April 26, 2018 5:01 PM

> To: Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>

> Cc: amd-gfx@lists.freedesktop.org

> Subject: Re: [PATCH] drm/amdgpu/sriov: reject kms open if TDR not finished

> or failed

> 

> On 2018-04-26 09:57 AM, Deng, Emily wrote:

> >>>> This seems like a bad idea, as it would cause basically any

> >>>> userspace which wants to use the GPU to fail to start during a GPU

> >>>> reset.

> >>> [Emily] Yes, this is what the change want to do, when driver is

> >>> doing gpu recover or hardware is doing reset, it doesn't want to be

> >>> interrupted, and during the reset any driver open kms is

> >>> meaningless.

> >>

> >> Applications randomly failing to start up during a GPU reset would be

> >> surprising and confusing for the user. The driver needs to handle

> >> this transparently.

> > [Emily] Yes, you are right, how about to sleep for some time in here

> > to wait the GPU reset successfully. After sleep, if it still in gpu

> > reset, then return error to APP?

> 

> I think the only case where returning an error here *might* be appropriate is

> when the driver has tried resetting the GPU, but it has definitely and

> irrecoverably failed. But even in that case, it might be better to let the open

> succeed, and let userspace figure out what happened by other means.

> 

> 

> --

> Earthling Michel Dänzer               |               http://www.amd.com

> Libre software enthusiast             |             Mesa and X developer
On 2018-04-27 07:29 AM, Deng, Emily wrote:
> Sorry, I don't get your point, when the GPU is doing a reset, even
> let the dkms open, the userspace will still fail, so during the
> reset, any driver open kms is meaningless.

That needs to be fixed, then. If the GPU reset succeeds, userspace
shouldn't have to fail.
No, I mean during GPU reset, the app will fail, it is reasonable. The flag in_gpu_reset indicates the GPU is in reset state, after reset, we will set the flag back.

Best Wishes,
Emily Deng

> -----Original Message-----

> From: Michel Dänzer [mailto:michel@daenzer.net]

> Sent: Friday, April 27, 2018 3:52 PM

> To: Deng, Emily <Emily.Deng@amd.com>; Liu, Monk <Monk.Liu@amd.com>

> Cc: amd-gfx@lists.freedesktop.org

> Subject: Re: [PATCH] drm/amdgpu/sriov: reject kms open if TDR not finished

> or failed

> 

> On 2018-04-27 07:29 AM, Deng, Emily wrote:

> > Sorry, I don't get your point, when the GPU is doing a reset, even let

> > the dkms open, the userspace will still fail, so during the reset, any

> > driver open kms is meaningless.

> 

> That needs to be fixed, then. If the GPU reset succeeds, userspace shouldn't

> have to fail.

> 

> 

> --

> Earthling Michel Dänzer               |               http://www.amd.com

> Libre software enthusiast             |             Mesa and X developer
On 2018-04-27 10:15 AM, Deng, Emily wrote:
> No, I mean during GPU reset, the app will fail, it is reasonable.

I understand, but I don't agree that it's reasonable. :)
Am 27.04.2018 um 10:22 schrieb Michel Dänzer:
> On 2018-04-27 10:15 AM, Deng, Emily wrote:
>> No, I mean during GPU reset, the app will fail, it is reasonable.
> I understand, but I don't agree that it's reasonable. :)

I agree with Michel here that this is a really bad idea.

Even while the GPU is in reset an application should still be able to 
load userspace and open the driver.

It migth need to reset it's context even if it never used them, but that 
is a completely different story.

Christian.