drm/amdgpu: Fix the dead lock issue.

Submitted by Deng, Emily on Sept. 10, 2018, 4:07 a.m.

Details

Message ID 1536552453-18595-1-git-send-email-Emily.Deng@amd.com
State New
Headers show
Series "drm/amdgpu: Fix the dead lock issue." ( rev: 1 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Deng, Emily Sept. 10, 2018, 4:07 a.m.
It will ramdomly have the dead lock issue when test TDR:
1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock
2. amdgpu_bo_create locked the bo's resv lock
3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock
4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
lock.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index de990bd..c75447d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -557,12 +557,8 @@  static int amdgpu_bo_create_shadow(struct amdgpu_device *adev,
 	bp.resv = bo->tbo.resv;
 
 	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);
-	if (!r) {
+	if (!r)
 		bo->shadow->parent = amdgpu_bo_ref(bo);
-		mutex_lock(&adev->shadow_list_lock);
-		list_add_tail(&bo->shadow_list, &adev->shadow_list);
-		mutex_unlock(&adev->shadow_list_lock);
-	}
 
 	return r;
 }
@@ -603,6 +599,12 @@  int amdgpu_bo_create(struct amdgpu_device *adev,
 		if (!bp->resv)
 			reservation_object_unlock((*bo_ptr)->tbo.resv);
 
+		if (!r) {
+			mutex_lock(&adev->shadow_list_lock);
+			list_add_tail(&(*bo_ptr)->shadow_list, &adev->shadow_list);
+			mutex_unlock(&adev->shadow_list_lock);
+		}
+
 		if (r)
 			amdgpu_bo_unref(bo_ptr);
 	}

Comments

Am 10.09.2018 um 06:07 schrieb Emily Deng:
> It will ramdomly have the dead lock issue when test TDR:
> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock
> 2. amdgpu_bo_create locked the bo's resv lock
> 3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock
> 4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
> lock.
>
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>

Good catch, problem is the fix doesn't work like this because the lock 
won't be dropped at this point in most cases.

Instead you need to fix amdgpu_device_recover_vram_from_shadow to make a 
local copy of the list.

You can do this by grabbing a reference to each BO and moving it to a 
local list.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index de990bd..c75447d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -557,12 +557,8 @@ static int amdgpu_bo_create_shadow(struct amdgpu_device *adev,
>   	bp.resv = bo->tbo.resv;
>   
>   	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);
> -	if (!r) {
> +	if (!r)
>   		bo->shadow->parent = amdgpu_bo_ref(bo);
> -		mutex_lock(&adev->shadow_list_lock);
> -		list_add_tail(&bo->shadow_list, &adev->shadow_list);
> -		mutex_unlock(&adev->shadow_list_lock);
> -	}
>   
>   	return r;
>   }
> @@ -603,6 +599,12 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
>   		if (!bp->resv)
>   			reservation_object_unlock((*bo_ptr)->tbo.resv);
>   
> +		if (!r) {
> +			mutex_lock(&adev->shadow_list_lock);
> +			list_add_tail(&(*bo_ptr)->shadow_list, &adev->shadow_list);
> +			mutex_unlock(&adev->shadow_list_lock);
> +		}
> +
>   		if (r)
>   			amdgpu_bo_unref(bo_ptr);
>   	}
>-----Original Message-----

>From: Christian König <ckoenig.leichtzumerken@gmail.com>

>Sent: Monday, September 10, 2018 3:06 PM

>To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

>Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.

>

>Am 10.09.2018 um 06:07 schrieb Emily Deng:

>> It will ramdomly have the dead lock issue when test TDR:

>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.

>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow

>> is waiting for the shadow_list_lock 4.

>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv

>> lock.

>>

>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>

>

>Good catch, problem is the fix doesn't work like this because the lock won't be

>dropped at this point in most cases.

Sorry, could you explain more, why the lock won't be dropped after calling reservation_object_unlock?
>

>Instead you need to fix amdgpu_device_recover_vram_from_shadow to make

>a local copy of the list.

>

>You can do this by grabbing a reference to each BO and moving it to a local

>list.

>

>Regards,

>Christian.

>

>> ---

>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 12 +++++++-----

>>   1 file changed, 7 insertions(+), 5 deletions(-)

>>

>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>> index de990bd..c75447d 100644

>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>> @@ -557,12 +557,8 @@ static int amdgpu_bo_create_shadow(struct

>amdgpu_device *adev,

>>   	bp.resv = bo->tbo.resv;

>>

>>   	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);

>> -	if (!r) {

>> +	if (!r)

>>   		bo->shadow->parent = amdgpu_bo_ref(bo);

>> -		mutex_lock(&adev->shadow_list_lock);

>> -		list_add_tail(&bo->shadow_list, &adev->shadow_list);

>> -		mutex_unlock(&adev->shadow_list_lock);

>> -	}

>>

>>   	return r;

>>   }

>> @@ -603,6 +599,12 @@ int amdgpu_bo_create(struct amdgpu_device

>*adev,

>>   		if (!bp->resv)

>>   			reservation_object_unlock((*bo_ptr)->tbo.resv);

>>

>> +		if (!r) {

>> +			mutex_lock(&adev->shadow_list_lock);

>> +			list_add_tail(&(*bo_ptr)->shadow_list, &adev-

>>shadow_list);

>> +			mutex_unlock(&adev->shadow_list_lock);

>> +		}

>> +

>>   		if (r)

>>   			amdgpu_bo_unref(bo_ptr);

>>   	}
Am 10.09.2018 um 09:19 schrieb Deng, Emily:
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Monday, September 10, 2018 3:06 PM
>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.
>>
>> Am 10.09.2018 um 06:07 schrieb Emily Deng:
>>> It will ramdomly have the dead lock issue when test TDR:
>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.
>>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow
>>> is waiting for the shadow_list_lock 4.
>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
>>> lock.
>>>
>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>> Good catch, problem is the fix doesn't work like this because the lock won't be
>> dropped at this point in most cases.
> Sorry, could you explain more, why the lock won't be dropped after calling reservation_object_unlock?

reservation_object_unlock won't be called for most BOs.

E.g. the shadow is used for page directories and all page directories 
use the root reservation object and so have bp->resv set here.

Christian.

>> Instead you need to fix amdgpu_device_recover_vram_from_shadow to make
>> a local copy of the list.
>>
>> You can do this by grabbing a reference to each BO and moving it to a local
>> list.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 12 +++++++-----
>>>    1 file changed, 7 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index de990bd..c75447d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -557,12 +557,8 @@ static int amdgpu_bo_create_shadow(struct
>> amdgpu_device *adev,
>>>    	bp.resv = bo->tbo.resv;
>>>
>>>    	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);
>>> -	if (!r) {
>>> +	if (!r)
>>>    		bo->shadow->parent = amdgpu_bo_ref(bo);
>>> -		mutex_lock(&adev->shadow_list_lock);
>>> -		list_add_tail(&bo->shadow_list, &adev->shadow_list);
>>> -		mutex_unlock(&adev->shadow_list_lock);
>>> -	}
>>>
>>>    	return r;
>>>    }
>>> @@ -603,6 +599,12 @@ int amdgpu_bo_create(struct amdgpu_device
>> *adev,
>>>    		if (!bp->resv)
>>>    			reservation_object_unlock((*bo_ptr)->tbo.resv);
>>>
>>> +		if (!r) {
>>> +			mutex_lock(&adev->shadow_list_lock);
>>> +			list_add_tail(&(*bo_ptr)->shadow_list, &adev-
>>> shadow_list);
>>> +			mutex_unlock(&adev->shadow_list_lock);
>>> +		}
>>> +
>>>    		if (r)
>>>    			amdgpu_bo_unref(bo_ptr);
>>>    	}
>-----Original Message-----

>From: Koenig, Christian

>Sent: Monday, September 10, 2018 3:23 PM

>To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

>Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.

>

>Am 10.09.2018 um 09:19 schrieb Deng, Emily:

>>> -----Original Message-----

>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>

>>> Sent: Monday, September 10, 2018 3:06 PM

>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

>>> Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.

>>>

>>> Am 10.09.2018 um 06:07 schrieb Emily Deng:

>>>> It will ramdomly have the dead lock issue when test TDR:

>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.

>>>> amdgpu_bo_create locked the bo's resv lock 3.

>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4.

>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv

>>>> lock.

>>>>

>>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>

>>> Good catch, problem is the fix doesn't work like this because the

>>> lock won't be dropped at this point in most cases.

>> Sorry, could you explain more, why the lock won't be dropped after calling

>reservation_object_unlock?

>

>reservation_object_unlock won't be called for most BOs.

>

>E.g. the shadow is used for page directories and all page directories use the

>root reservation object and so have bp->resv set here.

Thanks, will modify as your suggestion.

Best wishes
Emily Deng
>Christian.

>

>>> Instead you need to fix amdgpu_device_recover_vram_from_shadow to

>>> make a local copy of the list.

>>>

>>> You can do this by grabbing a reference to each BO and moving it to a

>>> local list.

>>>

>>> Regards,

>>> Christian.

>>>

>>>> ---

>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 12 +++++++-----

>>>>    1 file changed, 7 insertions(+), 5 deletions(-)

>>>>

>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>>>> index de990bd..c75447d 100644

>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

>>>> @@ -557,12 +557,8 @@ static int amdgpu_bo_create_shadow(struct

>>> amdgpu_device *adev,

>>>>    	bp.resv = bo->tbo.resv;

>>>>

>>>>    	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);

>>>> -	if (!r) {

>>>> +	if (!r)

>>>>    		bo->shadow->parent = amdgpu_bo_ref(bo);

>>>> -		mutex_lock(&adev->shadow_list_lock);

>>>> -		list_add_tail(&bo->shadow_list, &adev->shadow_list);

>>>> -		mutex_unlock(&adev->shadow_list_lock);

>>>> -	}

>>>>

>>>>    	return r;

>>>>    }

>>>> @@ -603,6 +599,12 @@ int amdgpu_bo_create(struct amdgpu_device

>>> *adev,

>>>>    		if (!bp->resv)

>>>>    			reservation_object_unlock((*bo_ptr)->tbo.resv);

>>>>

>>>> +		if (!r) {

>>>> +			mutex_lock(&adev->shadow_list_lock);

>>>> +			list_add_tail(&(*bo_ptr)->shadow_list, &adev-

>>>> shadow_list);

>>>> +			mutex_unlock(&adev->shadow_list_lock);

>>>> +		}

>>>> +

>>>>    		if (r)

>>>>    			amdgpu_bo_unref(bo_ptr);

>>>>    	}