drm/amdgpu: Fix KFD oversubscription by tracking queues correctly

Submitted by Cornwall, Jay on July 12, 2017, 6:26 p.m.

Details

Message ID 1499884012-7669-1-git-send-email-Jay.Cornwall@amd.com
State New
Headers show
Series "drm/amdgpu: Fix KFD oversubscription by tracking queues correctly" ( rev: 1 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Cornwall, Jay July 12, 2017, 6:26 p.m.
The number of compute queues available to the KFD was erroneously
calculated as 64. Only the first MEC can execute compute queues and
it has 32 queue slots.

This caused the oversubscription limit to be calculated incorrectly,
leading to a missing chained runlist command at the end of an
oversubscribed runlist.

Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 7060daf..aa4006a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -140,7 +140,7 @@  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
 		 * nbits is not compile time constant
 		 */
-		last_valid_bit = adev->gfx.mec.num_mec
+		last_valid_bit = 1 /* only first MEC can have compute queues */
 				* adev->gfx.mec.num_pipe_per_mec
 				* adev->gfx.mec.num_queue_per_pipe;
 		for (i = last_valid_bit; i < KGD_MAX_QUEUES; ++i)

Comments

On 2017-07-12 02:26 PM, Jay Cornwall wrote:
> The number of compute queues available to the KFD was erroneously
> calculated as 64. Only the first MEC can execute compute queues and
> it has 32 queue slots.
> 
> This caused the oversubscription limit to be calculated incorrectly,
> leading to a missing chained runlist command at the end of an
> oversubscribed runlist.
> 
> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
> Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 7060daf..aa4006a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>   		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
>   		 * nbits is not compile time constant
>   		 */
> -		last_valid_bit = adev->gfx.mec.num_mec
> +		last_valid_bit = 1 /* only first MEC can have compute queues */

Hey Jay,

Minor nitpick. We already have some similar resource patching in 
kgd2kfd_device_init(), and I think it would be good to keep all of these 
together.

Otherwise, looks good to me.

Regards,
Andres

>   				* adev->gfx.mec.num_pipe_per_mec
>   				* adev->gfx.mec.num_queue_per_pipe;
>   		for (i = last_valid_bit; i < KGD_MAX_QUEUES; ++i)
>
On 2017-07-13 02:36 PM, Andres Rodriguez wrote:
> On 2017-07-12 02:26 PM, Jay Cornwall wrote:
>> The number of compute queues available to the KFD was erroneously
>> calculated as 64. Only the first MEC can execute compute queues and
>> it has 32 queue slots.
>>
>> This caused the oversubscription limit to be calculated incorrectly,
>> leading to a missing chained runlist command at the end of an
>> oversubscribed runlist.
>>
>> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
>> Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 7060daf..aa4006a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct 
>> amdgpu_device *adev)
>>           /* According to linux/bitmap.h we shouldn't use bitmap_clear if
>>            * nbits is not compile time constant
>>            */
>> -        last_valid_bit = adev->gfx.mec.num_mec
>> +        last_valid_bit = 1 /* only first MEC can have compute queues */
> 
> Hey Jay,
> 
> Minor nitpick. We already have some similar resource patching in 
> kgd2kfd_device_init(), and I think it would be good to keep all of these 
> together.
> 
> Otherwise, looks good to me.

Just re-read my reply and wanted to clarify. I don't really have a 
strong opining on which side does the resource availability patched. 
Whether it happens here or on the KFD side it is fine.

I just don't think it is good to keep it in different two places.

Regards,
Andres

> 
> Regards,
> Andres
> 
>>                   * adev->gfx.mec.num_pipe_per_mec
>>                   * adev->gfx.mec.num_queue_per_pipe;
>>           for (i = last_valid_bit; i < KGD_MAX_QUEUES; ++i)
>>
On Thu, Jul 13, 2017, at 13:36, Andres Rodriguez wrote:
> On 2017-07-12 02:26 PM, Jay Cornwall wrote:
> > The number of compute queues available to the KFD was erroneously
> > calculated as 64. Only the first MEC can execute compute queues and
> > it has 32 queue slots.
> > 
> > This caused the oversubscription limit to be calculated incorrectly,
> > leading to a missing chained runlist command at the end of an
> > oversubscribed runlist.
> > 
> > Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
> > Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > index 7060daf..aa4006a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
> >   		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
> >   		 * nbits is not compile time constant
> >   		 */
> > -		last_valid_bit = adev->gfx.mec.num_mec
> > +		last_valid_bit = 1 /* only first MEC can have compute queues */
> 
> Hey Jay,
> 
> Minor nitpick. We already have some similar resource patching in 
> kgd2kfd_device_init(), and I think it would be good to keep all of these 
> together.

OK. I see shared_resources.num_mec is set to 1 in kgd2kfd_device_init.
That's not very clear (the number of MECs doesn't change) and num_mec
doesn't appear to be used anywhere except in dead code in kfd_device.c.
That code also runs after the queue bitmap setup.

How about I remove that field entirely?
On 17-07-13 03:15 PM, Jay Cornwall wrote:
> On Thu, Jul 13, 2017, at 13:36, Andres Rodriguez wrote:
>> On 2017-07-12 02:26 PM, Jay Cornwall wrote:
>>> The number of compute queues available to the KFD was erroneously
>>> calculated as 64. Only the first MEC can execute compute queues and
>>> it has 32 queue slots.
>>>
>>> This caused the oversubscription limit to be calculated incorrectly,
>>> leading to a missing chained runlist command at the end of an
>>> oversubscribed runlist.
>>>
>>> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
>>> Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 7060daf..aa4006a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>>   		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
>>>   		 * nbits is not compile time constant
>>>   		 */
>>> -		last_valid_bit = adev->gfx.mec.num_mec
>>> +		last_valid_bit = 1 /* only first MEC can have compute queues */
>> Hey Jay,
>>
>> Minor nitpick. We already have some similar resource patching in 
>> kgd2kfd_device_init(), and I think it would be good to keep all of these 
>> together.
> OK. I see shared_resources.num_mec is set to 1 in kgd2kfd_device_init.
> That's not very clear (the number of MECs doesn't change) and num_mec
> doesn't appear to be used anywhere except in dead code in kfd_device.c.
> That code also runs after the queue bitmap setup.
>
> How about I remove that field entirely?
Yeah, that's fine with me.
On 2017-07-13 03:35 PM, Felix Kuehling wrote:
> On 17-07-13 03:15 PM, Jay Cornwall wrote:
>> On Thu, Jul 13, 2017, at 13:36, Andres Rodriguez wrote:
>>> On 2017-07-12 02:26 PM, Jay Cornwall wrote:
>>>> The number of compute queues available to the KFD was erroneously
>>>> calculated as 64. Only the first MEC can execute compute queues and
>>>> it has 32 queue slots.
>>>>
>>>> This caused the oversubscription limit to be calculated incorrectly,
>>>> leading to a missing chained runlist command at the end of an
>>>> oversubscribed runlist.
>>>>
>>>> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
>>>> Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> index 7060daf..aa4006a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>>>    		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
>>>>    		 * nbits is not compile time constant
>>>>    		 */
>>>> -		last_valid_bit = adev->gfx.mec.num_mec
>>>> +		last_valid_bit = 1 /* only first MEC can have compute queues */
>>> Hey Jay,
>>>
>>> Minor nitpick. We already have some similar resource patching in
>>> kgd2kfd_device_init(), and I think it would be good to keep all of these
>>> together.
>> OK. I see shared_resources.num_mec is set to 1 in kgd2kfd_device_init.
>> That's not very clear (the number of MECs doesn't change) and num_mec
>> doesn't appear to be used anywhere except in dead code in kfd_device.c.
>> That code also runs after the queue bitmap setup.
>>
>> How about I remove that field entirely?
> Yeah, that's fine with me.
> 

Good with me as well.
There is a function get_mec_num use the field , but seems  no one  call it  , maybe remove it  as well. 

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Andres Rodriguez

Sent: Thursday, July 13, 2017 3:54 PM
To: Kuehling, Felix; Jay Cornwall; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Fix KFD oversubscription by tracking queues correctly



On 2017-07-13 03:35 PM, Felix Kuehling wrote:
> On 17-07-13 03:15 PM, Jay Cornwall wrote:

>> On Thu, Jul 13, 2017, at 13:36, Andres Rodriguez wrote:

>>> On 2017-07-12 02:26 PM, Jay Cornwall wrote:

>>>> The number of compute queues available to the KFD was erroneously 

>>>> calculated as 64. Only the first MEC can execute compute queues and 

>>>> it has 32 queue slots.

>>>>

>>>> This caused the oversubscription limit to be calculated 

>>>> incorrectly, leading to a missing chained runlist command at the 

>>>> end of an oversubscribed runlist.

>>>>

>>>> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461

>>>> Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>

>>>> ---

>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-

>>>>    1 file changed, 1 insertion(+), 1 deletion(-)

>>>>

>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 

>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

>>>> index 7060daf..aa4006a 100644

>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

>>>> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)

>>>>    		/* According to linux/bitmap.h we shouldn't use bitmap_clear if

>>>>    		 * nbits is not compile time constant

>>>>    		 */

>>>> -		last_valid_bit = adev->gfx.mec.num_mec

>>>> +		last_valid_bit = 1 /* only first MEC can have compute queues */

>>> Hey Jay,

>>>

>>> Minor nitpick. We already have some similar resource patching in 

>>> kgd2kfd_device_init(), and I think it would be good to keep all of 

>>> these together.

>> OK. I see shared_resources.num_mec is set to 1 in kgd2kfd_device_init.

>> That's not very clear (the number of MECs doesn't change) and num_mec 

>> doesn't appear to be used anywhere except in dead code in kfd_device.c.

>> That code also runs after the queue bitmap setup.

>>

>> How about I remove that field entirely?

> Yeah, that's fine with me.

> 


Good with me as well.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx