drm/amdgpu: add rcu_barrier after entity fini

Submitted by Deng, Emily on May 17, 2018, 10:03 a.m.

Details

Message ID 1526551432-12599-1-git-send-email-Emily.Deng@amd.com
State New
Headers show
Series "drm/amdgpu: add rcu_barrier after entity fini" ( rev: 3 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Deng, Emily May 17, 2018, 10:03 a.m.
To free the fence from the amdgpu_fence_slab, need twice call_rcu, to avoid
the amdgpu_fence_slab_fini call kmem_cache_destroy(amdgpu_fence_slab) before
kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after drm_sched_entity_fini.

The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
1.drm_sched_entity_fini ->
drm_sched_entity_cleanup ->
dma_fence_put(entity->last_scheduled) ->
drm_sched_fence_release_finished ->
drm_sched_fence_release_scheduled ->
call_rcu(&fence->finished.rcu, drm_sched_fence_free)

2.drm_sched_fence_free ->
dma_fence_put(fence->parent) ->
amdgpu_fence_release ->
call_rcu(&f->rcu, amdgpu_fence_free) ->
kmem_cache_free(amdgpu_fence_slab, fence);

v2:put the barrier before the kmem_cache_destroy

Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +
 1 file changed, 1 insertion(+)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 39ec6b8..42be65b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -69,6 +69,7 @@  int amdgpu_fence_slab_init(void)
 void amdgpu_fence_slab_fini(void)
 {
 	rcu_barrier();
+	rcu_barrier();
 	kmem_cache_destroy(amdgpu_fence_slab);
 }
 /*

Comments

Am 17.05.2018 um 12:03 schrieb Emily Deng:
> To free the fence from the amdgpu_fence_slab, need twice call_rcu, to avoid
> the amdgpu_fence_slab_fini call kmem_cache_destroy(amdgpu_fence_slab) before
> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after drm_sched_entity_fini.
>
> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
> 1.drm_sched_entity_fini ->
> drm_sched_entity_cleanup ->
> dma_fence_put(entity->last_scheduled) ->
> drm_sched_fence_release_finished ->
> drm_sched_fence_release_scheduled ->
> call_rcu(&fence->finished.rcu, drm_sched_fence_free)
>
> 2.drm_sched_fence_free ->
> dma_fence_put(fence->parent) ->
> amdgpu_fence_release ->
> call_rcu(&f->rcu, amdgpu_fence_free) ->
> kmem_cache_free(amdgpu_fence_slab, fence);
>
> v2:put the barrier before the kmem_cache_destroy
>
> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 39ec6b8..42be65b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)
>   void amdgpu_fence_slab_fini(void)
>   {
>   	rcu_barrier();
> +	rcu_barrier();

Well, you should have noted that there is already an rcu_barrier here 
and adding another one shouldn't have any additional effect. So your 
explanation and the proposed solution doesn't make to much sense.

I think the problem you run into is rather that the fence is reference 
counted and might live longer than the module who created it.

Complicated issue, one possible solution would be to release 
fence->parent earlier in the scheduler fence but that doesn't sound like 
a general purpose solution.

Christian.

>   	kmem_cache_destroy(amdgpu_fence_slab);
>   }
>   /*
Hi Christian,
     Yes, it has already one rcu_barrier, but it has called twice call_rcu, so the one rcu_barrier just could barrier one call_rcu some time.
    After I added another rcu_barrier, the kernel issue will disappear.

Best Wishes,
Emily Deng

> -----Original Message-----

> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]

> Sent: Thursday, May 17, 2018 7:08 PM

> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> 

> Am 17.05.2018 um 12:03 schrieb Emily Deng:

> > To free the fence from the amdgpu_fence_slab, need twice call_rcu, to

> > avoid the amdgpu_fence_slab_fini call

> > kmem_cache_destroy(amdgpu_fence_slab) before

> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after

> drm_sched_entity_fini.

> >

> > The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:

> > 1.drm_sched_entity_fini ->

> > drm_sched_entity_cleanup ->

> > dma_fence_put(entity->last_scheduled) ->

> > drm_sched_fence_release_finished ->

> drm_sched_fence_release_scheduled

> > -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)

> >

> > 2.drm_sched_fence_free ->

> > dma_fence_put(fence->parent) ->

> > amdgpu_fence_release ->

> > call_rcu(&f->rcu, amdgpu_fence_free) ->

> > kmem_cache_free(amdgpu_fence_slab, fence);

> >

> > v2:put the barrier before the kmem_cache_destroy

> >

> > Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df

> > Signed-off-by: Emily Deng <Emily.Deng@amd.com>

> > ---

> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +

> >   1 file changed, 1 insertion(+)

> >

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > index 39ec6b8..42be65b 100644

> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)

> >   void amdgpu_fence_slab_fini(void)

> >   {

> >   	rcu_barrier();

> > +	rcu_barrier();

> 

> Well, you should have noted that there is already an rcu_barrier here and

> adding another one shouldn't have any additional effect. So your explanation

> and the proposed solution doesn't make to much sense.

> 

> I think the problem you run into is rather that the fence is reference counted

> and might live longer than the module who created it.

> 

> Complicated issue, one possible solution would be to release

> fence->parent earlier in the scheduler fence but that doesn't sound like

> a general purpose solution.

> 

> Christian.

> 

> >   	kmem_cache_destroy(amdgpu_fence_slab);

> >   }

> >   /*
Ping......

Best Wishes,
Emily Deng




> -----Original Message-----

> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf

> Of Deng, Emily

> Sent: Friday, May 18, 2018 11:20 AM

> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-

> gfx@lists.freedesktop.org

> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> 

> Hi Christian,

>      Yes, it has already one rcu_barrier, but it has called twice call_rcu, so the

> one rcu_barrier just could barrier one call_rcu some time.

>     After I added another rcu_barrier, the kernel issue will disappear.

> 

> Best Wishes,

> Emily Deng

> 

> > -----Original Message-----

> > From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]

> > Sent: Thursday, May 17, 2018 7:08 PM

> > To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

> > Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> >

> > Am 17.05.2018 um 12:03 schrieb Emily Deng:

> > > To free the fence from the amdgpu_fence_slab, need twice call_rcu,

> > > to avoid the amdgpu_fence_slab_fini call

> > > kmem_cache_destroy(amdgpu_fence_slab) before

> > kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after

> > drm_sched_entity_fini.

> > >

> > > The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:

> > > 1.drm_sched_entity_fini ->

> > > drm_sched_entity_cleanup ->

> > > dma_fence_put(entity->last_scheduled) ->

> > > drm_sched_fence_release_finished ->

> > drm_sched_fence_release_scheduled

> > > -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)

> > >

> > > 2.drm_sched_fence_free ->

> > > dma_fence_put(fence->parent) ->

> > > amdgpu_fence_release ->

> > > call_rcu(&f->rcu, amdgpu_fence_free) ->

> > > kmem_cache_free(amdgpu_fence_slab, fence);

> > >

> > > v2:put the barrier before the kmem_cache_destroy

> > >

> > > Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df

> > > Signed-off-by: Emily Deng <Emily.Deng@amd.com>

> > > ---

> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +

> > >   1 file changed, 1 insertion(+)

> > >

> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > > index 39ec6b8..42be65b 100644

> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> > > @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)

> > >   void amdgpu_fence_slab_fini(void)

> > >   {

> > >   	rcu_barrier();

> > > +	rcu_barrier();

> >

> > Well, you should have noted that there is already an rcu_barrier here

> > and adding another one shouldn't have any additional effect. So your

> > explanation and the proposed solution doesn't make to much sense.

> >

> > I think the problem you run into is rather that the fence is reference

> > counted and might live longer than the module who created it.

> >

> > Complicated issue, one possible solution would be to release

> > fence->parent earlier in the scheduler fence but that doesn't sound

> > fence->like

> > a general purpose solution.

> >

> > Christian.

> >

> > >   	kmem_cache_destroy(amdgpu_fence_slab);

> > >   }

> > >   /*

> 

> _______________________________________________

> amd-gfx mailing list

> amd-gfx@lists.freedesktop.org

> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Ok, I'm lost where do we use call_rcu() twice? Cause that sounds 
incorrect in the first place.

Christian.

Am 18.05.2018 um 11:41 schrieb Deng, Emily:
> Ping......
>
> Best Wishes,
> Emily Deng
>
>
>
>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>> Of Deng, Emily
>> Sent: Friday, May 18, 2018 11:20 AM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-
>> gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
>>
>> Hi Christian,
>>       Yes, it has already one rcu_barrier, but it has called twice call_rcu, so the
>> one rcu_barrier just could barrier one call_rcu some time.
>>      After I added another rcu_barrier, the kernel issue will disappear.
>>
>> Best Wishes,
>> Emily Deng
>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: Thursday, May 17, 2018 7:08 PM
>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
>>>
>>> Am 17.05.2018 um 12:03 schrieb Emily Deng:
>>>> To free the fence from the amdgpu_fence_slab, need twice call_rcu,
>>>> to avoid the amdgpu_fence_slab_fini call
>>>> kmem_cache_destroy(amdgpu_fence_slab) before
>>> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after
>>> drm_sched_entity_fini.
>>>> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
>>>> 1.drm_sched_entity_fini ->
>>>> drm_sched_entity_cleanup ->
>>>> dma_fence_put(entity->last_scheduled) ->
>>>> drm_sched_fence_release_finished ->
>>> drm_sched_fence_release_scheduled
>>>> -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)
>>>>
>>>> 2.drm_sched_fence_free ->
>>>> dma_fence_put(fence->parent) ->
>>>> amdgpu_fence_release ->
>>>> call_rcu(&f->rcu, amdgpu_fence_free) ->
>>>> kmem_cache_free(amdgpu_fence_slab, fence);
>>>>
>>>> v2:put the barrier before the kmem_cache_destroy
>>>>
>>>> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df
>>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 39ec6b8..42be65b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)
>>>>    void amdgpu_fence_slab_fini(void)
>>>>    {
>>>>    	rcu_barrier();
>>>> +	rcu_barrier();
>>> Well, you should have noted that there is already an rcu_barrier here
>>> and adding another one shouldn't have any additional effect. So your
>>> explanation and the proposed solution doesn't make to much sense.
>>>
>>> I think the problem you run into is rather that the fence is reference
>>> counted and might live longer than the module who created it.
>>>
>>> Complicated issue, one possible solution would be to release
>>> fence->parent earlier in the scheduler fence but that doesn't sound
>>> fence->like
>>> a general purpose solution.
>>>
>>> Christian.
>>>
>>>>    	kmem_cache_destroy(amdgpu_fence_slab);
>>>>    }
>>>>    /*
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Hi Christian,
        When we free an IB fence, we first call one call_rcu in drm_sched_fence_release_scheduled as the call trace one, then after the call trace one,
 we call the call_rcu second in the  amdgpu_fence_release in call trace two, as below.

 The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
    1.drm_sched_entity_fini ->
    drm_sched_entity_cleanup ->
    dma_fence_put(entity->last_scheduled) ->
    drm_sched_fence_release_finished ->
    drm_sched_fence_release_scheduled ->
    call_rcu(&fence->finished.rcu, drm_sched_fence_free)

    2.drm_sched_fence_free ->
    dma_fence_put(fence->parent) ->
    amdgpu_fence_release ->
    call_rcu(&f->rcu, amdgpu_fence_free) ->
    kmem_cache_free(amdgpu_fence_slab, fence);

> -----Original Message-----

> From: Koenig, Christian

> Sent: Friday, May 18, 2018 5:46 PM

> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org

> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

>

> Ok, I'm lost where do we use call_rcu() twice? Cause that sounds incorrect in

> the first place.

>

> Christian.

>

> Am 18.05.2018 um 11:41 schrieb Deng, Emily:

> > Ping......

> >

> > Best Wishes,

> > Emily Deng

> >

> >

> >

> >

> >> -----Original Message-----

> >> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On

> >> Behalf Of Deng, Emily

> >> Sent: Friday, May 18, 2018 11:20 AM

> >> To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-

> >> gfx@lists.freedesktop.org<mailto:gfx@lists.freedesktop.org>

> >> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> >>

> >> Hi Christian,

> >>       Yes, it has already one rcu_barrier, but it has called twice

> >> call_rcu, so the one rcu_barrier just could barrier one call_rcu some time.

> >>      After I added another rcu_barrier, the kernel issue will disappear.

> >>

> >> Best Wishes,

> >> Emily Deng

> >>

> >>> -----Original Message-----

> >>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]

> >>> Sent: Thursday, May 17, 2018 7:08 PM

> >>> To: Deng, Emily <Emily.Deng@amd.com<mailto:Emily.Deng@amd.com>>; amd-<mailto:amd-gfx@lists.freedesktop.org>

> gfx@lists.freedesktop.org

> >>> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> >>>

> >>> Am 17.05.2018 um 12:03 schrieb Emily Deng:

> >>>> To free the fence from the amdgpu_fence_slab, need twice call_rcu,

> >>>> to avoid the amdgpu_fence_slab_fini call

> >>>> kmem_cache_destroy(amdgpu_fence_slab) before

> >>> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after

> >>> drm_sched_entity_fini.

> >>>> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as

> below:

> >>>> 1.drm_sched_entity_fini ->

> >>>> drm_sched_entity_cleanup ->

> >>>> dma_fence_put(entity->last_scheduled) ->

> >>>> drm_sched_fence_release_finished ->

> >>> drm_sched_fence_release_scheduled

> >>>> -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)

> >>>>

> >>>> 2.drm_sched_fence_free ->

> >>>> dma_fence_put(fence->parent) ->

> >>>> amdgpu_fence_release ->

> >>>> call_rcu(&f->rcu, amdgpu_fence_free) ->

> >>>> kmem_cache_free(amdgpu_fence_slab, fence);

> >>>>

> >>>> v2:put the barrier before the kmem_cache_destroy

> >>>>

> >>>> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df

> >>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com<mailto:Emily.Deng@amd.com>>

> >>>> ---

> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +

> >>>>    1 file changed, 1 insertion(+)

> >>>>

> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> index 39ec6b8..42be65b 100644

> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)

> >>>>    void amdgpu_fence_slab_fini(void)

> >>>>    {

> >>>>          rcu_barrier();

> >>>> +        rcu_barrier();

> >>> Well, you should have noted that there is already an rcu_barrier

> >>> here and adding another one shouldn't have any additional effect. So

> >>> your explanation and the proposed solution doesn't make to much

> sense.

> >>>

> >>> I think the problem you run into is rather that the fence is

> >>> reference counted and might live longer than the module who created it.

> >>>

> >>> Complicated issue, one possible solution would be to release

> >>> fence->parent earlier in the scheduler fence but that doesn't sound

> >>> fence->like

> >>> a general purpose solution.

> >>>

> >>> Christian.

> >>>

> >>>>          kmem_cache_destroy(amdgpu_fence_slab);

> >>>>    }

> >>>>    /*

> >> _______________________________________________

> >> amd-gfx mailing list

> >> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Ah! So we have one RCU grace period caused by the scheduler fence and 
another one from the hardware fence.

Thanks for explaining that once more, that was really not obvious from 
reading the code.

But this means we could we also fix it by moving the 
"dma_fence_put(fence->parent);" from drm_sched_fence_free() into 
drm_sched_fence_release_scheduled() directly before the call_rcu(), 
can't we?

I think that would be cleaner, cause otherwise every driver using the 
GPU scheduler would need that workaround.

And drm_sched_fence_release_scheduled() is only called when the 
reference count becomes zero, so when anybody accesses the structure 
after that then that would be a bug as well.

Thanks,
Christian.

Am 18.05.2018 um 11:56 schrieb Deng, Emily:
> Hi Christian,
>         When we free an IB fence, we first call one call_rcu in 
> drm_sched_fence_release_scheduled as the call trace one, then after 
> the call trace one,
> we call the call_rcu second in the  amdgpu_fence_release in call trace 
> two, as below.
> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
>     1.drm_sched_entity_fini ->
>     drm_sched_entity_cleanup ->
>     dma_fence_put(entity->last_scheduled) ->
>     drm_sched_fence_release_finished ->
>     drm_sched_fence_release_scheduled ->
>     call_rcu(&fence->finished.rcu, drm_sched_fence_free)
>     2.drm_sched_fence_free ->
>     dma_fence_put(fence->parent) ->
>     amdgpu_fence_release ->
>     call_rcu(&f->rcu, amdgpu_fence_free) ->
>     kmem_cache_free(amdgpu_fence_slab, fence);
> > -----Original Message-----
> > From: Koenig, Christian
> > Sent: Friday, May 18, 2018 5:46 PM
> > To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
> > 
> > Ok, I'm lost where do we use call_rcu() twice? Cause that sounds incorrect in
> > the first place.
> > 
> > Christian.
> > 
> > Am 18.05.2018 um 11:41 schrieb Deng, Emily:
> > > Ping......
> > >
> > > Best Wishes,
> > > Emily Deng
> > >
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On
> > >> Behalf Of Deng, Emily
> > >> Sent: Friday, May 18, 2018 11:20 AM
> > >> To: Koenig, Christian <Christian.Koenig@amd.com <mailto:Christian.Koenig@amd.com>>; amd-
> > >> gfx@lists.freedesktop.org <mailto:gfx@lists.freedesktop.org>
> > >> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
> > >>
> > >> Hi Christian,
> > >>       Yes, it has already one rcu_barrier, but it has called twice
> > >> call_rcu, so the one rcu_barrier just could barrier one call_rcu some time.
> > >>      After I added another rcu_barrier, the kernel issue will disappear.
> > >>
> > >> Best Wishes,
> > >> Emily Deng
> > >>
> > >>> -----Original Message-----
> > >>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> > >>> Sent: Thursday, May 17, 2018 7:08 PM
> > >>> To: Deng, Emily <Emily.Deng@amd.com <mailto:Emily.Deng@amd.com>>; amd- 
> <mailto:amd-gfx@lists.freedesktop.org>
> > gfx@lists.freedesktop.org
> > >>> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
> > >>>
> > >>> Am 17.05.2018 um 12:03 schrieb Emily Deng:
> > >>>> To free the fence from the amdgpu_fence_slab, need twice call_rcu,
> > >>>> to avoid the amdgpu_fence_slab_fini call
> > >>>> kmem_cache_destroy(amdgpu_fence_slab) before
> > >>> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after
> > >>> drm_sched_entity_fini.
> > >>>> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as
> > below:
> > >>>> 1.drm_sched_entity_fini ->
> > >>>> drm_sched_entity_cleanup ->
> > >>>> dma_fence_put(entity->last_scheduled) ->
> > >>>> drm_sched_fence_release_finished ->
> > >>> drm_sched_fence_release_scheduled
> > >>>> -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)
> > >>>>
> > >>>> 2.drm_sched_fence_free ->
> > >>>> dma_fence_put(fence->parent) ->
> > >>>> amdgpu_fence_release ->
> > >>>> call_rcu(&f->rcu, amdgpu_fence_free) ->
> > >>>> kmem_cache_free(amdgpu_fence_slab, fence);
> > >>>>
> > >>>> v2:put the barrier before the kmem_cache_destroy
> > >>>>
> > >>>> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df
> > >>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com <mailto:Emily.Deng@amd.com>>
> > >>>> ---
> > >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +
> > >>>>    1 file changed, 1 insertion(+)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > >>>> index 39ec6b8..42be65b 100644
> > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > >>>> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)
> > >>>>    void amdgpu_fence_slab_fini(void)
> > >>>>    {
> > >>>>           rcu_barrier();
> > >>>> +        rcu_barrier();
> > >>> Well, you should have noted that there is already an rcu_barrier
> > >>> here and adding another one shouldn't have any additional effect. So
> > >>> your explanation and the proposed solution doesn't make to much
> > sense.
> > >>>
> > >>> I think the problem you run into is rather that the fence is
> > >>> reference counted and might live longer than the module who created it.
> > >>>
> > >>> Complicated issue, one possible solution would be to release
> > >>> fence->parent earlier in the scheduler fence but that doesn't sound
> > >>> fence->like
> > >>> a general purpose solution.
> > >>>
> > >>> Christian.
> > >>>
> > >>>>          kmem_cache_destroy(amdgpu_fence_slab);
> > >>>>    }
> > >>>>    /*
> > >> _______________________________________________
> > >> amd-gfx mailing list
> > >> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> > >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Hi Christian,
     Thanks for your advice, will send a modify patch later.

Best Wishes,
Emily Deng
From: Koenig, Christian

Sent: Friday, May 18, 2018 6:36 PM
To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

Ah! So we have one RCU grace period caused by the scheduler fence and another one from the hardware fence.

Thanks for explaining that once more, that was really not obvious from reading the code.

But this means we could we also fix it by moving the "dma_fence_put(fence->parent);" from drm_sched_fence_free() into drm_sched_fence_release_scheduled() directly before the call_rcu(), can't we?

I think that would be cleaner, cause otherwise every driver using the GPU scheduler would need that workaround.

And drm_sched_fence_release_scheduled() is only called when the reference count becomes zero, so when anybody accesses the structure after that then that would be a bug as well.

Thanks,
Christian.

Am 18.05.2018 um 11:56 schrieb Deng, Emily:
Hi Christian,
        When we free an IB fence, we first call one call_rcu in drm_sched_fence_release_scheduled as the call trace one, then after the call trace one,
we call the call_rcu second in the  amdgpu_fence_release in call trace two, as below.

The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
    1.drm_sched_entity_fini ->
    drm_sched_entity_cleanup ->
    dma_fence_put(entity->last_scheduled) ->
    drm_sched_fence_release_finished ->
    drm_sched_fence_release_scheduled ->
    call_rcu(&fence->finished.rcu, drm_sched_fence_free)

    2.drm_sched_fence_free ->
    dma_fence_put(fence->parent) ->
    amdgpu_fence_release ->
    call_rcu(&f->rcu, amdgpu_fence_free) ->
    kmem_cache_free(amdgpu_fence_slab, fence);

> -----Original Message-----

> From: Koenig, Christian

> Sent: Friday, May 18, 2018 5:46 PM

> To: Deng, Emily <Emily.Deng@amd.com><mailto:Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

>

> Ok, I'm lost where do we use call_rcu() twice? Cause that sounds incorrect in

> the first place.

>

> Christian.

>

> Am 18.05.2018 um 11:41 schrieb Deng, Emily:

> > Ping......

> >

> > Best Wishes,

> > Emily Deng

> >

> >

> >

> >

> >> -----Original Message-----

> >> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On

> >> Behalf Of Deng, Emily

> >> Sent: Friday, May 18, 2018 11:20 AM

> >> To: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-

> >> gfx@lists.freedesktop.org<mailto:gfx@lists.freedesktop.org>

> >> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> >>

> >> Hi Christian,

> >>       Yes, it has already one rcu_barrier, but it has called twice

> >> call_rcu, so the one rcu_barrier just could barrier one call_rcu some time.

> >>      After I added another rcu_barrier, the kernel issue will disappear.

> >>

> >> Best Wishes,

> >> Emily Deng

> >>

> >>> -----Original Message-----

> >>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]

> >>> Sent: Thursday, May 17, 2018 7:08 PM

> >>> To: Deng, Emily <Emily.Deng@amd.com<mailto:Emily.Deng@amd.com>>; amd-<mailto:amd-gfx@lists.freedesktop.org>

> gfx@lists.freedesktop.org<mailto:gfx@lists.freedesktop.org>

> >>> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini

> >>>

> >>> Am 17.05.2018 um 12:03 schrieb Emily Deng:

> >>>> To free the fence from the amdgpu_fence_slab, need twice call_rcu,

> >>>> to avoid the amdgpu_fence_slab_fini call

> >>>> kmem_cache_destroy(amdgpu_fence_slab) before

> >>> kmem_cache_free(amdgpu_fence_slab, fence), add rcu_barrier after

> >>> drm_sched_entity_fini.

> >>>> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as

> below:

> >>>> 1.drm_sched_entity_fini ->

> >>>> drm_sched_entity_cleanup ->

> >>>> dma_fence_put(entity->last_scheduled) ->

> >>>> drm_sched_fence_release_finished ->

> >>> drm_sched_fence_release_scheduled

> >>>> -> call_rcu(&fence->finished.rcu, drm_sched_fence_free)

> >>>>

> >>>> 2.drm_sched_fence_free ->

> >>>> dma_fence_put(fence->parent) ->

> >>>> amdgpu_fence_release ->

> >>>> call_rcu(&f->rcu, amdgpu_fence_free) ->

> >>>> kmem_cache_free(amdgpu_fence_slab, fence);

> >>>>

> >>>> v2:put the barrier before the kmem_cache_destroy

> >>>>

> >>>> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df

> >>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com<mailto:Emily.Deng@amd.com>>

> >>>> ---

> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +

> >>>>    1 file changed, 1 insertion(+)

> >>>>

> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> index 39ec6b8..42be65b 100644

> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

> >>>> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)

> >>>>    void amdgpu_fence_slab_fini(void)

> >>>>    {

> >>>>           rcu_barrier();

> >>>> +        rcu_barrier();

> >>> Well, you should have noted that there is already an rcu_barrier

> >>> here and adding another one shouldn't have any additional effect. So

> >>> your explanation and the proposed solution doesn't make to much

> sense.

> >>>

> >>> I think the problem you run into is rather that the fence is

> >>> reference counted and might live longer than the module who created it.

> >>>

> >>> Complicated issue, one possible solution would be to release

> >>> fence->parent earlier in the scheduler fence but that doesn't sound

> >>> fence->like

> >>> a general purpose solution.

> >>>

> >>> Christian.

> >>>

> >>>>           kmem_cache_destroy(amdgpu_fence_slab);

> >>>>    }

> >>>>    /*

> >> _______________________________________________

> >> amd-gfx mailing list

> >> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx