drm/gpu-sched: fix force APP kill hang

Submitted by Deng, Emily on March 28, 2018, 8:07 a.m.

Message ID: 1522224449-8825-1-git-send-email-Emily.Deng@amd.com
Series: "drm/gpu-sched: fix force APP kill hang" (rev 1) in AMD X.Org drivers

Commit Message
issue:
There are VMC page faults if an app is force killed during a 3dmark
test. The cause is that in entity_fini() we manually signal all jobs
remaining in the entity's queue, which confuses the sync/dependency
mechanism:

1) A page fault occurred in SDMA's clear job, which operates on a
shadow buffer. The shadow buffer's GART table was cleared by
ttm_bo_release() because the fence in its reservation was fake
signaled by entity_fini() after SIGKILL was received.

2) A page fault occurred in a GFX job because, during the lifetime of
the GFX job, we manually fake signal all jobs from its entity in
entity_fini(). The unmapping/clear-PTE job depending on those result
fences is thus satisfied, so SDMA starts clearing the PTEs and causes
a GFX page fault.

fix:
1) In the SIGKILL case, at least wait in entity_fini() until all jobs
already scheduled have completed.

2) If a fence signals with an error while clearing some entity's
dependency, mark that entity guilty to prevent its jobs from really
running, since the dependency was only fake signaled.

related issue ticket:
http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)


diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 2bd69c4..9b306d3 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -198,6 +198,28 @@  static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
 	return true;
 }
 
+static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
+				struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *last;
+	signed long r;
+
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
+		if (last->s_fence->scheduled.context == entity->fence_context) {
+			dma_fence_get(&last->s_fence->finished);
+			break;
+		}
+	spin_unlock(&sched->job_list_lock);
+
+	if (&last->node != &sched->ring_mirror_list) {
+		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
+		if (r == 0)
+			DRM_WARN("wait on the fly job timeout\n");
+		dma_fence_put(&last->s_fence->finished);
+	}
+}
+
 /**
  * Destroy a context entity
  *
@@ -238,6 +260,12 @@  void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
 			entity->dependency = NULL;
 		}
 
+		/* Wait till all jobs from this entity really finished otherwise below
+		 * fake signaling would kickstart sdma's clear PTE jobs and lead to
+		 * vm fault
+		 */
+		drm_sched_entity_wait_otf_signal(sched, entity);
+
 		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
 			struct drm_sched_fence *s_fence = job->s_fence;
 			drm_sched_fence_scheduled(s_fence);
@@ -255,6 +283,14 @@  static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
 {
 	struct drm_sched_entity *entity =
 		container_of(cb, struct drm_sched_entity, cb);
+
+	/* set the entity guity since its dependency is
+	 * not really cleared but fake signaled (by SIGKILL
+	 * or GPU recover)
+	 */
+	if (f->error && entity->guilty)
+		atomic_set(entity->guilty, 1);
+
 	entity->dependency = NULL;
 	dma_fence_put(f);
 	drm_sched_wakeup(entity->sched);
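One subtlety in the hunk above: if `list_for_each_entry_reverse` runs to completion without hitting the `break`, the iterator does not point at a real job but at the offset-adjusted list head, which is why the code tests `&last->node != &sched->ring_mirror_list` before waiting. A self-contained sketch of that idiom (toy types and a minimal macro re-implementation, not the kernel's `list.h`):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal re-implementation of the kernel list idiom, only to show why
 * the post-loop head check works. Toy code, not kernel code. */
struct list_head { struct list_head *prev, *next; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry_reverse(pos, head, member)                 \
	for (pos = container_of((head)->prev, __typeof__(*pos), member);   \
	     &pos->member != (head);                                       \
	     pos = container_of(pos->member.prev, __typeof__(*pos), member))

struct toy_job { int ctx; struct list_head node; };

/* Returns the last job with a matching ctx, or NULL when the loop ran
 * off the head without breaking (no match, or empty list). */
static struct toy_job *find_last(struct list_head *head, int ctx)
{
	struct toy_job *last;

	list_for_each_entry_reverse(last, head, node)
		if (last->ctx == ctx)
			break;

	/* Same check the patch does against ring_mirror_list. */
	return (&last->node != head) ? last : NULL;
}
```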

Comments

On March 28, 2018, 10:07, Emily Deng wrote:
> issue:
> There are VMC page faults if an app is force killed during a 3dmark
> test. The cause is that in entity_fini() we manually signal all jobs
> remaining in the entity's queue, which confuses the sync/dependency
> mechanism:
>
> 1) A page fault occurred in SDMA's clear job, which operates on a
> shadow buffer. The shadow buffer's GART table was cleared by
> ttm_bo_release() because the fence in its reservation was fake
> signaled by entity_fini() after SIGKILL was received.
>
> 2) A page fault occurred in a GFX job because, during the lifetime of
> the GFX job, we manually fake signal all jobs from its entity in
> entity_fini(). The unmapping/clear-PTE job depending on those result
> fences is thus satisfied, so SDMA starts clearing the PTEs and causes
> a GFX page fault.

Nice catch, but the fixes won't work like this.

> fix:
> 1) In the SIGKILL case, at least wait in entity_fini() until all jobs
> already scheduled have completed.

Well that is not a good idea because when we kill a process we actually 
want to tear down the task as fast as possible and not wait for 
anything. That is actually the reason why we have this handling.

> 2) If a fence signals with an error while clearing some entity's
> dependency, mark that entity guilty to prevent its jobs from really
> running, since the dependency was only fake signaled.

Well, that is a clear NAK. It would mean marking things like the X
server or Wayland queue guilty because some client was killed.

And since unmapping/clear jobs don't have a guilty pointer, it should
actually not have any effect on the issue.

Regards,
Christian.


Hi Christian,
     Thanks for your review. Could you please give some advice on how to resolve the issue? How about adding a check for the ESRCH
fence status when checking whether the fence has signaled? If so, we need to define the exact behavior when the fence status is ESRCH.

Best Wishes,
Emily Deng
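A minimal sketch of what Emily's suggestion could amount to, assuming the fence carries an `-ESRCH`-style error when it was fake signaled. All names below are hypothetical toy types, not the real scheduler API:

```c
#include <assert.h>

/* Toy model of a fence carrying an error status. In the real kernel
 * the status lives in dma_fence->error; here -TOY_ESRCH stands in for
 * "fake signaled because the owning process was killed". */
#define TOY_ESRCH 3

struct toy_fence {
	int signaled;   /* 1 once the fence fired */
	int error;      /* 0 on success, negative errno otherwise */
};

/* A dependency only counts as satisfied when the fence signaled
 * without the "owner killed" error; a -ESRCH fence means the result
 * never materialized, so the depending job must not run. */
static int toy_dependency_satisfied(const struct toy_fence *f)
{
	return f->signaled && f->error != -TOY_ESRCH;
}
```

The open question Emily raises is exactly what the consumer should do in the `-ESRCH` case (drop the job, re-queue it, or fail it), which this sketch deliberately leaves out.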

On March 29, 2018, 10:45, Deng, Emily wrote:
> Hi Christian,
>       Thanks for your review, could you please give some advices on how to resolve the issue? How about adding the fence status ESRCH
> checking when check the fence signal?  If so,  need to identify the detail behavior if the fence status is ESRCH.
That's a possible way, but it seems like no small effort to handle
every fence_wait/cb everywhere. You can avoid new jobs being
submitted, but it cannot solve the problem of resources still in use
while the GPU is executing.

Waiting for scheduled jobs to signal would kill the fake-signal
design, and I'm not sure fake signaling is a good way to keep going:
a fake signal could result in resources that fences depend on being
freed ahead of GPU execution.

A worse way would be to do a GPU hw reset immediately when an app is
killed, but that method is too heavy.

Regards,
David Zhou
> 2) If a fence signals with an error while clearing some entity's
> dependency, mark that entity guilty to prevent its jobs from really
> running, since the dependency was only fake signaled.

Well, that is a clear NAK. It would mean marking things like the X server or Wayland queue guilty because some client was killed.

And since unmapping/clear jobs don't have a guilty pointer, it should actually not have any effect on the issue.


[ML] Yeah, that's a good point. How about this: if a fence is fake signaled and tries to clear another entity's dependency, we only mark that entity guilty if it shares the same ctx (or even process?) as the entity of the job whose fence was fake signaled. That way, for a given process, a fake-signaled GFX fence won't be able to wake another VCE/SDMA job.

/Monk
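Monk's refinement could be sketched roughly like this, assuming the entities expose some comparable context identifier. The field names below are hypothetical toy fields, not the real `drm_sched_entity` layout:

```c
#include <assert.h>

/* Toy model: only poison the waiting entity when the fake signal came
 * from its own context; an unrelated X/Wayland queue stays innocent. */
struct toy_entity {
	unsigned long ctx_id;  /* context/process the entity belongs to */
	int guilty;            /* stands in for the atomic_t *guilty */
};

static void toy_clear_dependency(struct toy_entity *waiter,
				 const struct toy_entity *producer,
				 int fence_error)
{
	/* fence_error != 0 means the dependency was fake signaled */
	if (fence_error && waiter->ctx_id == producer->ctx_id)
		waiter->guilty = 1;
}
```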

I think the core of the problem is that we don't abort all entities of
the process at the same time.

How about splitting drm_sched_entity_fini() into two functions?

The first one does the waiting, removes the entity from the run queue,
and returns an error when the process was killed.

During destruction this one is called first for all contexts as well as
for the entity used for VM updates.

The second one then goes over the entity and signals all jobs with an
error code.

This way no VM updates would be executed any longer after the process
is killed (something which doesn't make sense anyway and just costs us
time).

Regards,
Christian.
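Christian's proposed split might look roughly like the following toy model. The names are purely illustrative (a similar two-phase split was later adopted upstream, but this is not that code):

```c
#include <assert.h>

/* Toy model of the two-phase teardown: phase one detaches the entity
 * from the run queue (and would wait, unless the process was killed);
 * phase two signals leftover jobs with an error. Phase one runs for
 * every context AND the VM-update entity before any phase two runs,
 * so no page-table job can chase a fake signal. */
#define TOY_MAX_JOBS 8

struct toy_job { int error; };

struct toy_entity {
	int on_runqueue;
	int njobs;
	struct toy_job jobs[TOY_MAX_JOBS];
};

/* Phase 1: stop scheduling work. Returns -1 when the process was
 * killed (the wait is skipped), 0 when the queue drained normally. */
static int toy_entity_flush(struct toy_entity *e, int killed)
{
	e->on_runqueue = 0;
	return killed ? -1 : 0;
}

/* Phase 2: signal every remaining job with an error code. */
static void toy_entity_kill_jobs(struct toy_entity *e, int err)
{
	for (int i = 0; i < e->njobs; i++)
		e->jobs[i].error = err;
	e->njobs = 0;
}
```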

Hi Christian,
     Thanks for the good idea, I will try this.

Best Wishes,
Emily Deng





> >>    1 file changed, 36 insertions(+)

> >>

> >> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c

> >> b/drivers/gpu/drm/scheduler/gpu_scheduler.c

> >> index 2bd69c4..9b306d3 100644

> >> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c

> >> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c

> >> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct

> drm_sched_entity *entity)

> >>    	return true;

> >>    }

> >>

> >> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler

> *sched,

> >> +				struct drm_sched_entity *entity) {

> >> +	struct drm_sched_job *last;

> >> +	signed long r;

> >> +

> >> +	spin_lock(&sched->job_list_lock);

> >> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)

> >> +		if (last->s_fence->scheduled.context == entity-

> >fence_context) {

> >> +			dma_fence_get(&last->s_fence->finished);

> >> +			break;

> >> +		}

> >> +	spin_unlock(&sched->job_list_lock);

> >> +

> >> +	if (&last->node != &sched->ring_mirror_list) {

> >> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,

> msecs_to_jiffies(500));

> >> +		if (r == 0)

> >> +			DRM_WARN("wait on the fly job timeout\n");

> >> +		dma_fence_put(&last->s_fence->finished);

> >> +	}

> >> +}

> >> +

> >>    /**

> >>     * Destroy a context entity

> >>     *

> >> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct

> drm_gpu_scheduler *sched,

> >>    			entity->dependency = NULL;

> >>    		}

> >>

> >> +		/* Wait till all jobs from this entity really finished otherwise

> below

> >> +		 * fake signaling would kickstart sdma's clear PTE jobs and

> lead to

> >> +		 * vm fault

> >> +		 */

> >> +		drm_sched_entity_wait_otf_signal(sched, entity);

> >> +

> >>    		while ((job = to_drm_sched_job(spsc_queue_pop(&entity-

> >job_queue)))) {

> >>    			struct drm_sched_fence *s_fence = job->s_fence;

> >>    			drm_sched_fence_scheduled(s_fence);

> >> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct

> dma_fence *f, struct dma_fence_cb *cb

> >>    {

> >>    	struct drm_sched_entity *entity =

> >>    		container_of(cb, struct drm_sched_entity, cb);

> >> +

> >> +	/* set the entity guity since its dependency is

> >> +	 * not really cleared but fake signaled (by SIGKILL

> >> +	 * or GPU recover)

> >> +	 */

> >> +	if (f->error && entity->guilty)

> >> +		atomic_set(entity->guilty, 1);

> >> +

> >>    	entity->dependency = NULL;

> >>    	dma_fence_put(f);

> >>    	drm_sched_wakeup(entity->sched);
First let's consider the shadow buffer case:

After you signal all jobs with an error code, e.g. you signal an unmapping/clear-pte job on the sdma ring (it is running on sdma), the reservation fences are then all cleared. This way,
during amdgpu_bo_undef() on the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT

/Monk
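The race Monk describes can be modeled outside the kernel. In the sketch below, `toy_bo`, `toy_release()` and `toy_vm_fault()` are invented stand-ins for the BO/reservation/TTM machinery, and the point is only that TTM trusts the reservation fence, not the hardware, so a fake-signaled fence makes the release legal while the SDMA engine is still writing through the mapping:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (names invented): TTM may only release a BO and its GART
 * mapping once every fence in the reservation object has signaled.
 * Fake-signaling that fence from entity_fini() makes the release look
 * legal while the SDMA job is still writing through the mapping. */

struct toy_bo {
	bool sdma_job_done;   /* what the hardware actually finished */
	bool fence_signaled;  /* what the reservation object reports */
	bool released;        /* GART mapping torn down */
};

/* ttm_bo_release() analogue: it trusts the fence, not the hardware */
static void toy_release(struct toy_bo *bo)
{
	if (bo->fence_signaled)
		bo->released = true;
}

/* the failure mode: mapping gone while the job still needs it */
static bool toy_vm_fault(const struct toy_bo *bo)
{
	return bo->released && !bo->sdma_job_done;
}
```

With a genuinely signaled fence the release is safe; with a fake-signaled one, `toy_vm_fault()` reports the page-fault condition.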
-----Original Message-----
From: Koenig, Christian 

Sent: 2018年3月29日 17:05
To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

I think the core of the problem is that we don't abort all entities of the process at the same time.

How about splitting drm_sched_entity_fini() into two functions?

The first one does the waiting, removes the entity from the runqueue and returns an error when the process was killed.

During destruction this one is called first for all contexts as well as the entity for VM updates.

The second one then goes over the entity and signals all jobs with an error code.

This way no VM updates should be executed any longer after the process is killed (something which doesn't make sense anyway and just costs us time).

Regards,
Christian.
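Christian's two-phase split can be sketched in plain C. The names below (`fake_entity_flush`, `fake_entity_fini`) and types are invented for illustration and are not the real drm_sched API; the point is only the ordering: phase 1 stops every entity of the process first, and only then does phase 2 error-signal the leftover jobs:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the proposed split of drm_sched_entity_fini()
 * into two phases.  All names are made up for this sketch. */

struct fake_job { int error; bool signaled; };

struct fake_entity {
	bool stopped;              /* removed from the run queue */
	struct fake_job jobs[4];
	int njobs;
};

/* Phase 1: remove the entity from the run queue and report whether the
 * owning process was killed.  Called for ALL entities of the process
 * (contexts as well as the VM-update entity) before any job is
 * signaled. */
static int fake_entity_flush(struct fake_entity *e, bool killed)
{
	e->stopped = true;
	return killed ? -1 /* -EINTR in real code */ : 0;
}

/* Phase 2: only after every entity is stopped, signal the remaining
 * jobs with an error, so no dependent job (e.g. a PTE clear) can be
 * woken up and actually run. */
static void fake_entity_fini(struct fake_entity *e)
{
	int i;

	assert(e->stopped);        /* phase 1 must have run first */
	for (i = 0; i < e->njobs; i++) {
		e->jobs[i].error = -1;
		e->jobs[i].signaled = true;
	}
}
```

Calling `flush` on every entity before any `fini` is what prevents a fake-signaled fence on one entity from kickstarting a still-runnable job on another.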

Hi Monk,

well that isn't a problem.

The idea is that we first stop ALL the entities and then mark all fences
as signaled with error. This way we won't even start running the
unmapping/clear-pte job in the first place.

I mean as I wrote when the process is killed we should cancel ALL still 
pending jobs of that process including pending submissions and page 
table updates.

Regards,
Christian.

Am 29.03.2018 um 12:11 schrieb Liu, Monk:
> First let's consider the shadow buffer case:
>
> After you signal all jobs with an error code, e.g. you signal an unmapping/clear-pte job on the sdma ring (it is running on sdma), the reservation fences are then all cleared. This way,
> during amdgpu_bo_undef() on the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT
>
> /Monk
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月29日 17:05
> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> I think the core of the problem is that we don't abort all entities of the process at the same time.
>
> How about splitting drm_sched_entity_fini() into two functions?
>
> The first one does the waiting, removes the entity from the runqueue and returns an error when the process was killed.
>
> During destruction this one is called first for all contexts as well as the entity for VM updates.
>
> The second one then goes over the entity and signals all jobs with an error code.
>
> This way no VM updates should be executed any longer after the process is killed (something which doesn't make sense anyway and just costs us time).
>
> Regards,
> Christian.
>
> Am 29.03.2018 um 06:14 schrieb Liu, Monk:
>>> 2)if a fence signaled and try to clear some entity's dependency,
>>> should set this entity guilty to prevent its job really run since the
>>> dependency is fake signaled.
>> Well, that is a clear NAK. It would mean that you mark things like the X server or Wayland queue guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>
>>
>> [ML], yeah that's a good point, how about this way: if a fence is fake
>> signaled and tries to clear another entity's dependency, we only allow the entity to be marked as guilty if it shares the same ctx (or even process?) as the entity of the job behind that fake-signaled fence?
>> This way, for a certain process, a fake-signaled GFX fence won't be
>> able to wake another VCE/SDMA job
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月28日 19:57
>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Liu, Monk <Monk.Liu@amd.com>
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Am 28.03.2018 um 10:07 schrieb Emily Deng:
>>> issue:
>>> there are VMC page fault occured if force APP kill during 3dmark
>>> test, the cause is in entity_fini we manually signal all those jobs
>>> in entity's queue which confuse the sync/dep
>>> mechanism:
>>>
>>> 1)page fault occured in sdma's clear job which operate on shadow
>>> buffer, and shadow buffer's Gart table is cleaned by ttm_bo_release
>>> since the fence in its reservation was fake signaled by entity_fini()
>>> under the case of SIGKILL received.
>>>
>>> 2)page fault occured in gfx' job because during the lifetime of gfx
>>> job we manually fake signal all jobs from its entity in
>>> entity_fini(), thus the unmapping/clear PTE job depend on those
>>> result fence is satisfied and sdma start clearing the PTE and lead to GFX page fault.
>> Nice catch, but the fixes won't work like this.
>>
>>> fix:
>>> 1)should at least wait all jobs already scheduled complete in
>>> entity_fini() if SIGKILL is the case.
>> Well that is not a good idea because when we kill a process we actually want to tear down the task as fast as possible and not wait for anything. That is actually the reason why we have this handling.
>>
>>> 2)if a fence signaled and try to clear some entity's dependency,
>>> should set this entity guilty to prevent its job really run since the
>>> dependency is fake signaled.
>> Well, that is a clear NAK. It would mean that you mark things like the X server or Wayland queue guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>
>> Regards,
>> Christian.
>>
>>
>>> related issue ticket:
>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>     1 file changed, 36 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 2bd69c4..9b306d3 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>     	return true;
>>>     }
>>>     
>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_entity *entity) {
>>> +	struct drm_sched_job *last;
>>> +	signed long r;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>> +			dma_fence_get(&last->s_fence->finished);
>>> +			break;
>>> +		}
>>> +	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	if (&last->node != &sched->ring_mirror_list) {
>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
>>> +		if (r == 0)
>>> +			DRM_WARN("wait on the fly job timeout\n");
>>> +		dma_fence_put(&last->s_fence->finished);
>>> +	}
>>> +}
>>> +
>>>     /**
>>>      * Destroy a context entity
>>>      *
>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>     			entity->dependency = NULL;
>>>     		}
>>>     
>>> +		/* Wait till all jobs from this entity really finished otherwise below
>>> +		 * fake signaling would kickstart sdma's clear PTE jobs and lead to
>>> +		 * vm fault
>>> +		 */
>>> +		drm_sched_entity_wait_otf_signal(sched, entity);
>>> +
>>>     		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>     			struct drm_sched_fence *s_fence = job->s_fence;
>>>     			drm_sched_fence_scheduled(s_fence);
>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>     {
>>>     	struct drm_sched_entity *entity =
>>>     		container_of(cb, struct drm_sched_entity, cb);
>>> +
>>> +	/* set the entity guity since its dependency is
>>> +	 * not really cleared but fake signaled (by SIGKILL
>>> +	 * or GPU recover)
>>> +	 */
>>> +	if (f->error && entity->guilty)
>>> +		atomic_set(entity->guilty, 1);
>>> +
>>>     	entity->dependency = NULL;
>>>     	dma_fence_put(f);
>>>     	drm_sched_wakeup(entity->sched);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Hi Christian 

> This way we won't even start running the unmapping/clear-pte job in the first place.


What if there is already an unmapping/clear-pte job running before you kill the app? E.g. the app is naturally releasing some resource
and by coincidence you kill it at that very moment?

/Monk





Am 29.03.2018 um 13:14 schrieb Liu, Monk:
> Hi Christian
>
>> This way we won't even start running the unmapping/clear-pte job in the first place.
> What if there is already an unmapping/clear-pte job running before you kill the app? E.g. the app is naturally releasing some resource
> and by coincidence you kill it at that very moment?

At least nothing problematic. Since the job is already running we won't 
do anything with its fence.

Christian.
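The distinction Christian draws, that only jobs which were never handed to the hardware get error-signaled while an already-running job keeps its real fence, can be sketched as follows (all names invented for this model, not real scheduler API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: on teardown we only fake-signal jobs still sitting in the
 * queue; a job already running on the ring keeps its fence untouched
 * and signals normally when the hardware completes it. */

enum job_state { JOB_QUEUED, JOB_RUNNING };

struct model_job {
	enum job_state state;
	int fence_error;      /* 0 = no error recorded */
	bool fence_signaled;
};

static void model_cancel(struct model_job *job)
{
	if (job->state == JOB_RUNNING)
		return;            /* leave the in-flight fence alone */
	job->fence_error = -1;     /* fake-signal with an error */
	job->fence_signaled = true;
}
```

Because the running job's fence is never touched, nothing waiting on it is woken early; it signals only when the ring really finishes, which is why an already-running unmap/clear-pte job is harmless under this scheme.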

>>>>      			struct drm_sched_fence *s_fence = job->s_fence;
>>>>      			drm_sched_fence_scheduled(s_fence);
>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>      {
>>>>      	struct drm_sched_entity *entity =
>>>>      		container_of(cb, struct drm_sched_entity, cb);
>>>> +
>>>> +	/* set the entity guity since its dependency is
>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>> +	 * or GPU recover)
>>>> +	 */
>>>> +	if (f->error && entity->guilty)
>>>> +		atomic_set(entity->guilty, 1);
>>>> +
>>>>      	entity->dependency = NULL;
>>>>      	dma_fence_put(f);
>>>>      	drm_sched_wakeup(entity->sched);
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
This sdma job is running, and you will fake signal all fences eventually,
which leads to this process's PD reservation object being freed, so amdgpu_bo_undef() can call amdgpu_gart_unbind() on the shadow buffer's ttm,
and the still-running sdma job then hits a VMC page fault on vmid0.

/Monk
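Monk's failure chain can be read in miniature like this. A hedged user-space sketch, not amdgpu code: `struct bo`, `bo_release()` and `sdma_job_faults()` are illustrative stand-ins for the TTM buffer object, the ttm_bo_release()/amdgpu_gart_unbind() path and the in-flight SDMA job.

```c
#include <assert.h>
#include <stdbool.h>

struct fence {
	bool signaled;
	int error;		/* non-zero when fake signaled with an error */
};

struct bo {
	struct fence *excl_fence;	/* last fence attached to the reservation */
	bool gart_bound;
};

/* ttm-style release: tears down the GART mapping only once the
 * exclusive fence in the reservation reports completion. */
static void bo_release(struct bo *bo)
{
	if (bo->excl_fence && bo->excl_fence->signaled)
		bo->gart_bound = false;	/* stands in for amdgpu_gart_unbind() */
}

/* The hazard: if the fence was fake signaled while the SDMA job is
 * still running, the GART entries are gone under the job. */
static bool sdma_job_faults(const struct bo *shadow, bool job_still_running)
{
	return job_still_running && !shadow->gart_bound;
}
```

The point of the sketch: nothing in the release path distinguishes a fence signaled by the hardware from one fake signaled by entity_fini(), so the unbind can run under a job that is still using the mapping.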





-----Original Message-----
From: Koenig, Christian
Sent: 2018年3月29日 19:24
To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

Am 29.03.2018 um 13:14 schrieb Liu, Monk:
> Hi Christian
>
>> This way we won't even start running the unmapping/clear-pte job in the first place.
> What if there is already an unmapping/clear-pte job running before you
> kill app ? like app is naturally release some resource And by coincidence you meanwhile kill the app ?
At least nothing problematic. Since the job is already running we won't do anything with its fence.

Christian.

>
> /Monk
>
>
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月29日 18:16
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> Hi Monk,
>
> well that isn't a problem.
>
> The idea is that we first stop the ALL entities and then mark all fences as signaled with error. This way we won't even start running the unmapping/clear-pte job in the first place.
>
> I mean as I wrote when the process is killed we should cancel ALL still pending jobs of that process including pending submissions and page table updates.
>
> Regards,
> Christian.
>
> Am 29.03.2018 um 12:11 schrieb Liu, Monk:
>> First let's consider the shadow buffer case:
>>
>> After you signal all jobs with an error code, e.g. you signals an
>> unmapping/clear-pte job on sdma ring (it is running on sdma), the
>> reservation are then all cleared, this way during amdgpu_bo_undef()
>> on the SHADOW BUFFER, above sdma job would hit VMC PAGE FAULT
>>
>> /Monk
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年3月29日 17:05
>> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> I think the core of the problem is that we don't abort all entities of the process at the same time.
>>
>> How about splitting drm_sched_entity_fini() into two functions?
>>
>> The first one is does the waiting, removes the entity from the runqueue and returns an error when the process was killed.
>>
>> During destruction this one is called first for all contexts as well as the entity for VM updates.
>>
>> The second one then goes over the entity and signals all jobs with an error code.
>>
>> This way no VM updates should be executed any longer after the process is killed (something which doesn't makes sense anyway and just costs us time).
>>
>> Regards,
>> Christian.

> you will fake signal all fences eventually ,
That is not correct. We fake signal all *pending* fences, e.g. fences 
which are not pushed to the hardware ring.

Fences/jobs which are already running on the hardware ring are not 
touched in any way.

Regards,
Christian.
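Christian's pending-vs-running distinction can be modeled the same way. Again a hedged user-space sketch under simplified assumptions, not scheduler code: `entity_kill_pending()` and the `on_hw_ring` flag are illustrative stand-ins for the fake-signal loop in drm_sched_entity_fini() and the ring mirror list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct job {
	bool on_hw_ring;	/* already pushed to the hardware ring */
	bool signaled;
	int  error;		/* dma_fence-style error code, 0 if none */
};

/* Teardown sketch: only jobs still sitting in the entity queue are
 * fake signaled with an error; jobs already on the hardware ring are
 * left alone and will be signaled by the hardware when they complete. */
static void entity_kill_pending(struct job *jobs[], size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (jobs[i]->on_hw_ring)
			continue;		/* running: don't touch its fence */
		jobs[i]->error = -ECANCELED;	/* mark as fake signaled */
		jobs[i]->signaled = true;
	}
}
```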

Am 29.03.2018 um 13:31 schrieb Liu, Monk:
> This sdma job is running, and you will fake signal all fences eventually ,
> Which lead to this process's PD reservation object free and amdgpu_bo_undef() can call amdgpu_gart_unbind() on the shadow buffer's ttm
> And lead to sdma job hit VMC page fault on vmid0
>
> /Monk
>
>
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月29日 19:24
> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> Am 29.03.2018 um 13:14 schrieb Liu, Monk:
>> Hi Christian
>>
>>> This way we won't even start running the unmapping/clear-pte job in the first place.
>> What if there is already an unmapping/clear-pte job running before you
>> kill app ? like app is naturally release some resource And by coincidence you meanwhile kill the app ?
> At least nothing problematic. Since the job is already running we won't do anything with its fence.
>
> Christian.
>
>> /Monk
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月29日 18:16
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Hi Monk,
>>
>> well that isn't a problem.
>>
>> The idea is that we first stop the ALL entities and then mark all fences as signaled with error. This way we won't even start running the unmapping/clear-pte job in the first place.
>>
>> I mean as I wrote when the process is killed we should cancel ALL still pending jobs of that process including pending submissions and page table updates.
>>
>> Regards,
>> Christian.
>>
>> Am 29.03.2018 um 12:11 schrieb Liu, Monk:
>>> First let's consider the shadow buffer case:
>>>
>>> After you signal all jobs with an error code, e.g. you signals an
>>> unmapping/clear-pte job on sdma ring (it is running on sdma), the
>>> reservation are then all cleared, this way during amdgpu_bo_undef()
>>> on the SHADOW BUFFER, above sdma job would hit VMC PAGE FAULT
>>>
>>> /Monk
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2018年3月29日 17:05
>>> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>>> amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> I think the core of the problem is that we don't abort all entities of the process at the same time.
>>>
>>> How about splitting drm_sched_entity_fini() into two functions?
>>>
>>> The first one is does the waiting, removes the entity from the runqueue and returns an error when the process was killed.
>>>
>>> During destruction this one is called first for all contexts as well as the entity for VM updates.
>>>
>>> The second one then goes over the entity and signals all jobs with an error code.
>>>
>>> This way no VM updates should be executed any longer after the process is killed (something which doesn't makes sense anyway and just costs us time).
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 29.03.2018 um 06:14 schrieb Liu, Monk:
>>>>> 2)if a fence signaled and try to clear some entity's dependency,
>>>>> should set this entity guilty to prevent its job really run since
>>>>> the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things like the X server or Wayland queue is marked guilty because some client is killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>>>
>>>>
>>>> [ML], yeah that's a good point, how about this way: if a fence is
>>>> fake signaled and try to clear other entity's dependency we only allow entity marked as guilty If this entity share the same ctx (or even process?) of the entity of the job of that fake signaled fence ?
>>>> This way for a certain process a faked signaled GFX fence won't be
>>>> able to wake another VCE/SDMA job
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年3月28日 19:57
>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Liu, Monk <Monk.Liu@amd.com>
>>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>>
>>>> Am 28.03.2018 um 10:07 schrieb Emily Deng:
>>>>> issue:
>>>>> there are VMC page fault occured if force APP kill during 3dmark
>>>>> test, the cause is in entity_fini we manually signal all those jobs
>>>>> in entity's queue which confuse the sync/dep
>>>>> mechanism:
>>>>>
>>>>> 1)page fault occured in sdma's clear job which operate on shadow
>>>>> buffer, and shadow buffer's Gart table is cleaned by ttm_bo_release
>>>>> since the fence in its reservation was fake signaled by
>>>>> entity_fini() under the case of SIGKILL received.
>>>>>
>>>>> 2)page fault occured in gfx' job because during the lifetime of gfx
>>>>> job we manually fake signal all jobs from its entity in
>>>>> entity_fini(), thus the unmapping/clear PTE job depend on those
>>>>> result fence is satisfied and sdma start clearing the PTE and lead to GFX page fault.
>>>> Nice catch, but the fixes won't work like this.
>>>>
>>>>> fix:
>>>>> 1)should at least wait all jobs already scheduled complete in
>>>>> entity_fini() if SIGKILL is the case.
>>>> Well that is not a good idea because when we kill a process we actually want to tear down the task as fast as possible and not wait for anything. That is actually the reason why we have this handling.
>>>>
>>>>> 2)if a fence signaled and try to clear some entity's dependency,
>>>>> should set this entity guilty to prevent its job really run since
>>>>> the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things like the X server or Wayland queue is marked guilty because some client is killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>> related issue ticket:
>>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>>
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>>       1 file changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> index 2bd69c4..9b306d3 100644
>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>>       	return true;
>>>>>       }
>>>>>       
>>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>>> +				struct drm_sched_entity *entity) {
>>>>> +	struct drm_sched_job *last;
>>>>> +	signed long r;
>>>>> +
>>>>> +	spin_lock(&sched->job_list_lock);
>>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>>> +			dma_fence_get(&last->s_fence->finished);
>>>>> +			break;
>>>>> +		}
>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
>>>>> +		if (r == 0)
>>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>>> +		dma_fence_put(&last->s_fence->finished);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>       /**
>>>>>        * Destroy a context entity
>>>>>        *
>>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>>       			entity->dependency = NULL;
>>>>>       		}
>>>>>       
>>>>> +		/* Wait till all jobs from this entity really finished otherwise below
>>>>> +		 * fake signaling would kickstart sdma's clear PTE jobs and lead to
>>>>> +		 * vm fault
>>>>> +		 */
>>>>> +		drm_sched_entity_wait_otf_signal(sched, entity);
>>>>> +
>>>>>       		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>>       			struct drm_sched_fence *s_fence = job->s_fence;
>>>>>       			drm_sched_fence_scheduled(s_fence);
>>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>>       {
>>>>>       	struct drm_sched_entity *entity =
>>>>>       		container_of(cb, struct drm_sched_entity, cb);
>>>>> +
>>>>> +	/* set the entity guity since its dependency is
>>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>>> +	 * or GPU recover)
>>>>> +	 */
>>>>> +	if (f->error && entity->guilty)
>>>>> +		atomic_set(entity->guilty, 1);
>>>>> +
>>>>>       	entity->dependency = NULL;
>>>>>       	dma_fence_put(f);
>>>>>       	drm_sched_wakeup(entity->sched);
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
But the fence hooked to the PD's reservation only remembers the last one, and the last one will be fake signaled ...

Am I wrong on some concept?

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
Sent: 2018年3月29日 19:47
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

> you will fake signal all fences eventually ,

That is not correct. We fake signal all *pending* fences, e.g. fences which are not pushed to the hardware ring.

Fences/jobs which are already running on the hardware ring are not touched in any way.

Regards,
Christian.

Ah, crap, you're right! Yeah in this case the approach won't work.

How about this instead:

The first part is similar to how Emily wanted to fix it, e.g. we either
figure out the last scheduled fence of the entity or, even better,
remember it during scheduling.

But then instead of waiting for this fence we install it as completion 
signal for the remaining jobs. This way the remaining fences won't 
signal before the ones already scheduled.
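A minimal, self-contained C sketch of that idea (all names here are illustrative stand-ins, not the real drm_sched structures): the last scheduled fence of the entity is found by walking the scheduled list backwards, and instead of waiting on it, it is installed as the completion gate of every still-pending job, so none of them can signal before the already-scheduled work does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a scheduler fence: which entity produced it,
 * whether it signaled, and an optional completion it must wait on. */
struct fence {
	unsigned context;     /* entity's fence context */
	bool signaled;
	struct fence *chain;  /* completion gate, NULL if none */
};

/* Walk the scheduled list backwards to find the last fence
 * belonging to the given entity (by fence context). */
static struct fence *last_scheduled(struct fence **scheduled, size_t n,
				    unsigned context)
{
	for (size_t i = n; i-- > 0;)
		if (scheduled[i]->context == context)
			return scheduled[i];
	return NULL;
}

/* Instead of waiting for the last fence, install it as the
 * completion gate of every still-pending fence. */
static void chain_pending(struct fence **pending, size_t n,
			  struct fence *last)
{
	for (size_t i = 0; i < n; i++)
		pending[i]->chain = last;
}

/* A pending fence may only be (fake-)signaled once its gate fired. */
static bool try_signal(struct fence *f)
{
	if (f->chain && !f->chain->signaled)
		return false;
	f->signaled = true;
	return true;
}
```

The point of the design choice is that teardown never blocks: the ordering constraint is pushed into the fence graph rather than enforced with a timeout wait in entity_fini().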

Regards,
Christian.

Am 29.03.2018 um 13:52 schrieb Liu, Monk:
> But the fence hooked to the PD's reservation only remembers the last one, and that last one will be fake signaled ...
>
> Did I get some concept wrong?
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: March 29, 2018 19:47
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
>> you will fake signal all fences eventually,
> That is not correct. We fake signal all *pending* fences, i.e. fences which were not yet pushed to the hardware ring.
>
> Fences/jobs which are already running on the hardware ring are not touched in any way.
>
> Regards,
> Christian.
>
> Am 29.03.2018 um 13:31 schrieb Liu, Monk:
>> This sdma job is running, and you will fake signal all fences
>> eventually, which leads to this process's PD reservation object being
>> freed, so amdgpu_bo_undef() can call amdgpu_gart_unbind() on the shadow
>> buffer's ttm and make the sdma job hit a VMC page fault on vmid0
>>
>> /Monk
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: March 29, 2018 19:24
>> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Am 29.03.2018 um 13:14 schrieb Liu, Monk:
>>> Hi Christian
>>>
>>>> This way we won't even start running the unmapping/clear-pte job in the first place.
>>> What if there is already an unmapping/clear-pte job running before
>>> you kill the app? E.g. the app is naturally releasing some resources, and by coincidence you kill it in the meantime?
>> At least nothing problematic. Since the job is already running we won't do anything with its fence.
>>
>> Christian.
>>
>>> /Monk
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: March 29, 2018 18:16
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>>> amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> Hi Monk,
>>>
>>> well that isn't a problem.
>>>
>>> The idea is that we first stop ALL entities and then mark all fences as signaled with an error. This way we won't even start running the unmapping/clear-pte job in the first place.
>>>
>>> I mean, as I wrote, when the process is killed we should cancel ALL still-pending jobs of that process, including pending submissions and page table updates.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 29.03.2018 um 12:11 schrieb Liu, Monk:
>>>> First let's consider the shadow buffer case:
>>>>
>>>> After you signal all jobs with an error code, e.g. you signal an
>>>> unmapping/clear-pte job on the sdma ring (it is running on sdma), the
>>>> reservations are then all cleared; this way, during amdgpu_bo_undef()
>>>> on the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT
>>>>
>>>> /Monk
>>>> -----Original Message-----
>>>> From: Koenig, Christian
>>>> Sent: March 29, 2018 17:05
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>;
>>>> amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>>
>>>> I think the core of the problem is that we don't abort all entities of the process at the same time.
>>>>
>>>> How about splitting drm_sched_entity_fini() into two functions?
>>>>
>>>> The first one does the waiting, removes the entity from the runqueue and returns an error when the process was killed.
>>>>
>>>> During destruction this one is called first for all contexts as well as the entity for VM updates.
>>>>
>>>> The second one then goes over the entity and signals all jobs with an error code.
>>>>
>>>> This way no VM updates should be executed any longer after the process is killed (something which doesn't make sense anyway and just costs us time).
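A hedged sketch of the proposed split, with made-up names (the real drm_sched_entity_fini() signature differs): phase one pulls the entity off the runqueue and reports whether the process was killed; only after every entity of the process has gone through phase one does phase two flush the leftover jobs with an error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_JOBS 8

/* Toy entity: a runqueue membership flag and a queue of pending
 * jobs, where a job's value records the error it was signaled with
 * (0 = not yet signaled). */
struct entity {
	bool on_runqueue;
	int queue[MAX_JOBS];
	size_t count;
};

/* Phase 1: stop the entity so no further job reaches the hardware.
 * Returns false when the process was killed, meaning the caller
 * skipped waiting for pending submissions. */
static bool entity_fini_stop(struct entity *e, bool killed)
{
	e->on_runqueue = false;
	return !killed;
}

/* Phase 2: signal every remaining job with an error. In the proposal
 * this runs only after phase 1 was done for ALL entities of the
 * process, so no entity can still push work that depends on these
 * error-signaled fences. */
static void entity_fini_flush(struct entity *e, int error)
{
	for (size_t i = 0; i < e->count; i++)
		e->queue[i] = error;
}
```

The two-pass structure is the key point: because every entity is stopped before any fence is error-signaled, a fake-signaled fence can no longer wake a clear-PTE job that is still eligible to run.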
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 29.03.2018 um 06:14 schrieb Liu, Monk:
>>>>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>>>>> we should set this entity guilty to prevent its job from really running,
>>>>>> since the dependency was fake signaled.
>>>>> Well, that is a clear NAK. It would mean that things like the X server or Wayland queue get marked guilty because some client was killed.
>>>>>
>>>>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>>>>
>>>>>
>>>>> [ML], yeah, that's a good point. How about this way: if a fence is
>>>>> fake signaled and tries to clear another entity's dependency, we only mark that entity as guilty if it shares the same ctx (or even process?) as the entity of the job of that fake-signaled fence.
>>>>> This way a fake-signaled GFX fence of a certain process won't be
>>>>> able to wake another VCE/SDMA job
>>>>>
>>>>> /Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>>> Sent: March 28, 2018 19:57
>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Cc: Liu, Monk <Monk.Liu@amd.com>
>>>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>>>
>>>>> Am 28.03.2018 um 10:07 schrieb Emily Deng:
>>>>>> issue:
>>>>>> there is a VMC page fault if the APP is force-killed during a 3dmark
>>>>>> test; the cause is that in entity_fini we manually signal all those
>>>>>> jobs in the entity's queue, which confuses the sync/dep
>>>>>> mechanism:
>>>>>>
>>>>>> 1) a page fault occurred in sdma's clear job, which operates on the
>>>>>> shadow buffer; the shadow buffer's GART table was cleaned by
>>>>>> ttm_bo_release since the fence in its reservation was fake
>>>>>> signaled by
>>>>>> entity_fini() in the case of SIGKILL being received.
>>>>>>
>>>>>> 2) a page fault occurred in the gfx job because, during the lifetime
>>>>>> of the gfx job, we manually fake signal all jobs from its entity in
>>>>>> entity_fini(); thus the unmapping/clear-PTE job depending on those
>>>>>> result fences is satisfied, and sdma starts clearing the PTEs, leading to the GFX page fault.
>>>>> Nice catch, but the fixes won't work like this.
>>>>>
>>>>>> fix:
>>>>>> 1) we should at least wait for all already-scheduled jobs to complete
>>>>>> in entity_fini() if SIGKILL is the case.
>>>>> Well that is not a good idea because when we kill a process we actually want to tear down the task as fast as possible and not wait for anything. That is actually the reason why we have this handling.
>>>>>
>>>>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>>>>> we should set this entity guilty to prevent its job from really running,
>>>>>> since the dependency was fake signaled.
>>>>> Well, that is a clear NAK. It would mean that things like the X server or Wayland queue get marked guilty because some client was killed.
>>>>>
>>>>> And since unmapping/clear jobs don't have a guilty pointer it should actually not have any effect on the issue.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>> related issue ticket:
>>>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>>>
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>>>        1 file changed, 36 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> index 2bd69c4..9b306d3 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>>>        	return true;
>>>>>>        }
>>>>>>        
>>>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>>>> +				struct drm_sched_entity *entity) {
>>>>>> +	struct drm_sched_job *last;
>>>>>> +	signed long r;
>>>>>> +
>>>>>> +	spin_lock(&sched->job_list_lock);
>>>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>>>> +			dma_fence_get(&last->s_fence->finished);
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>>> +
>>>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
>>>>>> +		if (r == 0)
>>>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>>>> +		dma_fence_put(&last->s_fence->finished);
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>>        /**
>>>>>>         * Destroy a context entity
>>>>>>         *
>>>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>>>        			entity->dependency = NULL;
>>>>>>        		}
>>>>>>        
>>>>>> +		/* Wait till all jobs from this entity really finished otherwise below
>>>>>> +		 * fake signaling would kickstart sdma's clear PTE jobs and lead to
>>>>>> +		 * vm fault
>>>>>> +		 */
>>>>>> +		drm_sched_entity_wait_otf_signal(sched, entity);
>>>>>> +
>>>>>>        		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>>>        			struct drm_sched_fence *s_fence = job->s_fence;
>>>>>>        			drm_sched_fence_scheduled(s_fence);
>>>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>>>        {
>>>>>>        	struct drm_sched_entity *entity =
>>>>>>        		container_of(cb, struct drm_sched_entity, cb);
>>>>>> +
>>>>>> +	/* set the entity guilty since its dependency is
>>>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>>>> +	 * or GPU recover)
>>>>>> +	 */
>>>>>> +	if (f->error && entity->guilty)
>>>>>> +		atomic_set(entity->guilty, 1);
>>>>>> +
>>>>>>        	entity->dependency = NULL;
>>>>>>        	dma_fence_put(f);
>>>>>>        	drm_sched_wakeup(entity->sched);
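For illustration, a self-contained C simulation of what this hunk does (simplified types, not the kernel's dma_fence/atomic_t API): a dependency fence that completed with an error marks the entity guilty, and a guilty entity's job is skipped; note that entities without a guilty pointer, like the unmapping/clear jobs discussed above, are unaffected, which is part of the NAK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dependency fence: a nonzero error means it was fake signaled
 * (SIGKILL teardown or GPU recovery) rather than really completed. */
struct dep_fence {
	int error;
};

/* Toy entity: guilty may be NULL, mirroring entities that have no
 * guilty pointer at all (e.g. VM update entities). */
struct sim_entity {
	int *guilty;
	struct dep_fence *dependency;
};

/* Simulates the wakeup callback from the patch: if the dependency
 * completed with an error and the entity has a guilty pointer, flag
 * it, then clear the dependency as the real callback does. */
static void entity_wakeup(struct dep_fence *f, struct sim_entity *e)
{
	if (f->error && e->guilty)
		*e->guilty = 1;
	e->dependency = NULL;
}

/* A guilty entity's next job is skipped instead of executed. */
static bool job_should_run(const struct sim_entity *e)
{
	return !(e->guilty && *e->guilty);
}
```

Running this through both cases shows the asymmetry Christian objects to: a per-context entity gets poisoned by a fake-signaled dependency, while an entity with no guilty pointer keeps running regardless.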