[v5,5/5] drm/amdgpu: move PD/PT bos on LRU again

Submitted by Huang, Ray on Aug. 22, 2018, 7:52 a.m.

Details

Message ID 1534924375-5837-6-git-send-email-ray.huang@amd.com
State New
Headers show
Series "drm/ttm,amdgpu: Introduce LRU bulk move functionality" ( rev: 5 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Huang, Ray Aug. 22, 2018, 7:52 a.m.
The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
LRU is fixed. So move them on LRU again.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Tested-by: Mike Lothian <mike@fireburn.co.uk>
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index db1f28a..d195a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1107,7 +1107,7 @@  int amdgpu_vm_update_directories(struct amdgpu_device *adev,
 					   struct amdgpu_vm_bo_base,
 					   vm_status);
 		bo_base->moved = false;
-		list_del_init(&bo_base->vm_status);
+		list_move(&bo_base->vm_status, &vm->idle);
 
 		bo = bo_base->bo->parent;
 		if (!bo)

Comments

Hi Ray,


On 2018-08-22 9:52 a.m., Huang Rui wrote:
> The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
> LRU is fixed. So move them on LRU again.
> 
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> Tested-by: Mike Lothian <mike@fireburn.co.uk>
> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
> Acked-by: Chunming Zhou <david1.zhou@amd.com>
> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index db1f28a..d195a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>  					   struct amdgpu_vm_bo_base,
>  					   vm_status);
>  		bo_base->moved = false;
> -		list_del_init(&bo_base->vm_status);
> +		list_move(&bo_base->vm_status, &vm->idle);
>  
>  		bo = bo_base->bo->parent;
>  		if (!bo)
> 

Since this change, I'm getting various badness when running piglit using
radeonsi on Bonaire, see the attached dmesg excerpt.

Reverting just this change on top of current amd-staging-drm-next avoids
the problem.

Looks like some list manipulation isn't sufficiently protected against
concurrent execution?
On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
> 
> Hi Ray,
> 
> 
> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>> The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
>> LRU is fixed. So move them on LRU again.
>>
>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index db1f28a..d195a3d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>  					   struct amdgpu_vm_bo_base,
>>  					   vm_status);
>>  		bo_base->moved = false;
>> -		list_del_init(&bo_base->vm_status);
>> +		list_move(&bo_base->vm_status, &vm->idle);
>>  
>>  		bo = bo_base->bo->parent;
>>  		if (!bo)
>>
> 
> Since this change, I'm getting various badness when running piglit using
> radeonsi on Bonaire, see the attached dmesg excerpt.
> 
> Reverting just this change on top of current amd-staging-drm-next avoids
> the problem.
> 
> Looks like some list manipulation isn't sufficiently protected against
> concurrent execution?

KASAN pointed me to one issue:
https://patchwork.freedesktop.org/patch/246212/

However, this doesn't fully fix the problem.
On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>> The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
>>> LRU is fixed. So move them on LRU again.
>>>
>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index db1f28a..d195a3d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>  					   struct amdgpu_vm_bo_base,
>>>  					   vm_status);
>>>  		bo_base->moved = false;
>>> -		list_del_init(&bo_base->vm_status);
>>> +		list_move(&bo_base->vm_status, &vm->idle);
>>>  
>>>  		bo = bo_base->bo->parent;
>>>  		if (!bo)
>>>
>>
>> Since this change, I'm getting various badness when running piglit using
>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>
>> Reverting just this change on top of current amd-staging-drm-next avoids
>> the problem.
>>
>> Looks like some list manipulation isn't sufficiently protected against
>> concurrent execution?
> 
> KASAN pointed me to one issue:
> https://patchwork.freedesktop.org/patch/246212/
> 
> However, this doesn't fully fix the problem.

Ray, any ideas yet for solving this? If not, let's revert this change
for now.
Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>>> The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
>>>> LRU is fixed. So move them on LRU again.
>>>>
>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>>>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>>>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>>>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index db1f28a..d195a3d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>   					   struct amdgpu_vm_bo_base,
>>>>   					   vm_status);
>>>>   		bo_base->moved = false;
>>>> -		list_del_init(&bo_base->vm_status);
>>>> +		list_move(&bo_base->vm_status, &vm->idle);
>>>>   
>>>>   		bo = bo_base->bo->parent;
>>>>   		if (!bo)
>>>>
>>> Since this change, I'm getting various badness when running piglit using
>>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>>
>>> Reverting just this change on top of current amd-staging-drm-next avoids
>>> the problem.
>>>
>>> Looks like some list manipulation isn't sufficiently protected against
>>> concurrent execution?
>> KASAN pointed me to one issue:
>> https://patchwork.freedesktop.org/patch/246212/
>>
>> However, this doesn't fully fix the problem.
> Ray, any ideas yet for solving this? If not, let's revert this change
> for now.

I've gone over this multiple times now as well, but can't find anything 
obvious wrong either.

If we don't have any more ideas I would say revert it for now and try to 
debug it further.

BTW: Any idea how to force the issue?

Christian.
On 2018-08-29 10:57 a.m., Christian König wrote:
> Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
>> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>>>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>>>> The new bulk moving functionality is ready, the overhead of moving
>>>>> PD/PT bos to
>>>>> LRU is fixed. So move them on LRU again.
>>>>>
>>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>>>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>>>>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>>>>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>>>>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index db1f28a..d195a3d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
>>>>> amdgpu_device *adev,
>>>>>                          struct amdgpu_vm_bo_base,
>>>>>                          vm_status);
>>>>>           bo_base->moved = false;
>>>>> -        list_del_init(&bo_base->vm_status);
>>>>> +        list_move(&bo_base->vm_status, &vm->idle);
>>>>>             bo = bo_base->bo->parent;
>>>>>           if (!bo)
>>>>>
>>>> Since this change, I'm getting various badness when running piglit
>>>> using
>>>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>>>
>>>> Reverting just this change on top of current amd-staging-drm-next
>>>> avoids
>>>> the problem.
>>>>
>>>> Looks like some list manipulation isn't sufficiently protected against
>>>> concurrent execution?
>>> KASAN pointed me to one issue:
>>> https://patchwork.freedesktop.org/patch/246212/
>>>
>>> However, this doesn't fully fix the problem.
>> Ray, any ideas yet for solving this? If not, let's revert this change
>> for now.
> 
> I've gone over this multiple times now as well, but can't find anything
> obvious wrong either.

Thanks for looking into it.


> If we don't have any more ideas I would say revert it for now and try to
> debug it further.

Yep.


> BTW: Any idea how to force the issue?

Not specifically. It happens reliably and pretty quickly for me when
running the piglit gpu profile.
On 2018-08-29 10:57 a.m., Christian König wrote:
> Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
>> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>>>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>>>> The new bulk moving functionality is ready, the overhead of moving
>>>>> PD/PT bos to
>>>>> LRU is fixed. So move them on LRU again.
>>>>>
>>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>>>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>>>>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>>>>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>>>>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index db1f28a..d195a3d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
>>>>> amdgpu_device *adev,
>>>>>                          struct amdgpu_vm_bo_base,
>>>>>                          vm_status);
>>>>>           bo_base->moved = false;
>>>>> -        list_del_init(&bo_base->vm_status);
>>>>> +        list_move(&bo_base->vm_status, &vm->idle);
>>>>>             bo = bo_base->bo->parent;
>>>>>           if (!bo)
>>>>>
>>>> Since this change, I'm getting various badness when running piglit
>>>> using
>>>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>>>
>>>> Reverting just this change on top of current amd-staging-drm-next
>>>> avoids
>>>> the problem.
>>>>
>>>> Looks like some list manipulation isn't sufficiently protected against
>>>> concurrent execution?
>>> KASAN pointed me to one issue:
>>> https://patchwork.freedesktop.org/patch/246212/
>>>
>>> However, this doesn't fully fix the problem.
>> Ray, any ideas yet for solving this? If not, let's revert this change
>> for now.
> 
> I've gone over this multiple times now as well, but can't find anything
> obvious wrong either.

After looking at the code, one question: Why does vm->moved need a
spinlock, but not vm->idle? What is protecting against concurrent access
to the latter?
Am 29.08.2018 um 16:51 schrieb Michel Dänzer:
> On 2018-08-29 10:57 a.m., Christian König wrote:
>> Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
>>> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>>>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>>>>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>>>>> The new bulk moving functionality is ready, the overhead of moving
>>>>>> PD/PT bos to
>>>>>> LRU is fixed. So move them on LRU again.
>>>>>>
>>>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>>>>> Tested-by: Mike Lothian <mike@fireburn.co.uk>
>>>>>> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
>>>>>> Acked-by: Chunming Zhou <david1.zhou@amd.com>
>>>>>> Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> index db1f28a..d195a3d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
>>>>>> amdgpu_device *adev,
>>>>>>                           struct amdgpu_vm_bo_base,
>>>>>>                           vm_status);
>>>>>>            bo_base->moved = false;
>>>>>> -        list_del_init(&bo_base->vm_status);
>>>>>> +        list_move(&bo_base->vm_status, &vm->idle);
>>>>>>              bo = bo_base->bo->parent;
>>>>>>            if (!bo)
>>>>>>
>>>>> Since this change, I'm getting various badness when running piglit
>>>>> using
>>>>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>>>>
>>>>> Reverting just this change on top of current amd-staging-drm-next
>>>>> avoids
>>>>> the problem.
>>>>>
>>>>> Looks like some list manipulation isn't sufficiently protected against
>>>>> concurrent execution?
>>>> KASAN pointed me to one issue:
>>>> https://patchwork.freedesktop.org/patch/246212/
>>>>
>>>> However, this doesn't fully fix the problem.
>>> Ray, any ideas yet for solving this? If not, let's revert this change
>>> for now.
>> I've gone over this multiple times now as well, but can't find anything
>> obvious wrong either.
> After looking at the code, one question: Why does vm->moved need a
> spinlock, but not vm->idle? What is protecting against concurrent access
> to the latter?

The moved state is occupied by both normal and per VM BOs, e.g. BOs with 
different reservation objects.

All other states are only used by per VM BOs or PDs/PTs, so we only put 
the BOs on those when the reservation object of the root BO is locked.

We could probably split the moved state into two separate ones to 
further avoid that lock.

Christian.