regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Submitted by Koenig, Christian on Oct. 13, 2017, 9 a.m.

Details

Message ID 53f57508-6669-82ed-1b95-011f30610e05@amd.com
State New
Headers show
Series "regression with d6c650c0a8f6f671e49553725e1db541376d95f2" ( rev: 1 ) in AMD X.Org drivers

Not browsing as part of any series.

Commit Message

Koenig, Christian Oct. 13, 2017, 9 a.m.
> There is chance that free_job() called before that 
> “trace_amd_sched_process_job”, correct?
>
Correct, but that is harmless.

Take a look what trace_amd_sched_process_job actually does, it just 
prints the pointer of the fence structure (but the pointer might be 
stale at this point).

Nevertheless you are right that this is really ugly.

How about the attached patch to fix the issue?

Regards,
Christian.

Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>
> The free_job() is called in sched_job_finish() which is queued on a 
> WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with 
> sched_process_job()
>
> There is chance that free_job() called before that 
> “trace_amd_sched_process_job”, correct?
>
> And if so the s_fence referred by it maybe a wild pointer
>
> BR Monk
>
> *From:*Liu, Monk
> *Sent:* 2017年10月13日16:49
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Nicolai Hähnle 
> <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> No that’s not true
>
> The free_job() is called in sched_job_finish() which is queued on a 
> WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with 
> sched_process_job()
>
> How can you sure free_job() must before “trace_amd_sched_process_job”?
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:43
> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Nicolai 
> Hähnle <nhaehnle@gmail.com <mailto:nhaehnle@gmail.com>>; 
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> The free_job() callback is called only way after the job has finished.
>
> That is one change actually made by you to the code :)
>
> Christian.
>
> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>
>     I doubt it would always work fine…
>
>     First, we have FENCE_TRACE reference s_fence->finished after
>     “fence_signal(&fence->finished)”
>
>     Second, we have trace_amd_sched_proess_job(s_fence) after
>     “amd_sched_fence_finished()”,
>
>     If you put the finished before free_job() and by coincidence the
>     job_finish() get very soon executed you’ll have odds to hit wild
>     pointer on above two cases
>
>     BR Monk
>
>     *From:*Koenig, Christian
>     *Sent:* 2017年10月13日16:17
>     *To:* Liu, Monk <Monk.Liu@amd.com> <mailto:Monk.Liu@amd.com>;
>     Nicolai Hähnle <nhaehnle@gmail.com> <mailto:nhaehnle@gmail.com>;
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     *Subject:* Re: regression with
>     d6c650c0a8f6f671e49553725e1db541376d95f2
>
>     Yeah, that change is actually incorrect and should be reverted.
>
>     What we really need to do is remove dropping sched_job->s_fence
>     from amd_sched_process_job() into amd_sched_job_finish() directly
>     before the call to free_job().
>
>     Regards,
>     Christian.
>
>     Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>
>         commit d6c650c0a8f6f671e49553725e1db541376d95f2
>         Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
>         <mailto:nicolai.haehnle@amd.com>
>         @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>
>                         fence = sched->ops->run_job(sched_job);
>                         amd_sched_fence_scheduled(s_fence);
>         +
>         +               /* amd_sched_process_job drops the job's
>         reference of the fence. */
>         +               sched_job->s_fence = NULL;
>         +
>                         if (fence) {
>                                 s_fence->parent = dma_fence_get(fence);
>                                 r = dma_fence_add_callback(fence,
>         &s_fence->cb,
>
>         Hi Nicolai
>
>         with this patch, you will break
>         "amdgpu_sched_hw_job_reset()"routine:
>
>         void
>
>         amd_sched_hw_job_reset(structamd_gpu_scheduler
>
>         *sched)
>
>         {
>
>         structamd_sched_job
>
>         *s_job;
>
>         spin_lock(&sched->job_list_lock);
>
>         list_for_each_entry_reverse(s_job,
>
>         &sched->ring_mirror_list, node) {
>
>         if(s_job->s_fence->parent
>
>         &&
>
>         fence_remove_callback(s_job->s_fence->parent,
>
>         &s_job->s_fence->cb))
>
>         {
>
>         fence_put(s_job->s_fence->parent);
>
>                     s_job->s_fence->parent
>
>         =
>
>         NULL;
>
>         atomic_dec(&sched->hw_rq_count);
>
>                 }
>
>             }
>
>         spin_unlock(&sched->job_list_lock);
>
>         }
>
>         see that without sched_job->s_fence, you cannot remove the
>         callback from its hw fence,
>
>         any idea??
>
>         BR Monk
>

Patch hide | download patch | download mbox

From 43723de3f71d49ead63619920a5ed63f3efdfee9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Fri, 13 Oct 2017 10:58:15 +0200
Subject: [PATCH] drm/amd/sched: fix job tear down order
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move the trace before we signal the scheduler fence and drop the
scheduler fence reference directly before we free the job.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 59f1325d975c..04ac783fb954 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -406,6 +406,7 @@  static void amd_sched_job_finish(struct work_struct *work)
 			schedule_delayed_work(&next->work_tdr, sched->timeout);
 	}
 	spin_unlock(&sched->job_list_lock);
+	dma_fence_put(&s_job->s_fence->finished);
 	sched->ops->free_job(s_job);
 }
 
@@ -587,10 +588,9 @@  static void amd_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
 	struct amd_gpu_scheduler *sched = s_fence->sched;
 
 	atomic_dec(&sched->hw_rq_count);
+	trace_amd_sched_process_job(s_fence);
 	amd_sched_fence_finished(s_fence);
 
-	trace_amd_sched_process_job(s_fence);
-	dma_fence_put(&s_fence->finished);
 	wake_up_interruptible(&sched->wake_up_worker);
 }
 
@@ -638,9 +638,6 @@  static int amd_sched_main(void *param)
 		fence = sched->ops->run_job(sched_job);
 		amd_sched_fence_scheduled(s_fence);
 
-		/* amd_sched_process_job drops the job's reference of the fence. */
-		sched_job->s_fence = NULL;
-
 		if (fence) {
 			s_fence->parent = dma_fence_get(fence);
 			r = dma_fence_add_callback(fence, &s_fence->cb,
-- 
2.11.0


Comments

your patch looks good,  do you think we should also do this:


 void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
 {
-       int ret = fence_signal(&fence->scheduled);
+       int ret;
+
+       fence_get(&fence->scheduled;)
+       ret = fence_signal(&fence->scheduled);

        if (!ret)
                FENCE_TRACE(&fence->scheduled, "signaled from irq context\n");
        else
                FENCE_TRACE(&fence->scheduled, "was already signaled\n");
+       fence_put(&fence->scheduled);
 }



________________________________
From: Koenig, Christian

Sent: Friday, October 13, 2017 5:00:27 PM
To: Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
Correct, but that is harmless.

Take a look what trace_amd_sched_process_job actually does, it just prints the pointer of the fence structure (but the pointer might be stale at this point).

Nevertheless you are right that this is really ugly.

How about the attached patch to fix the issue?

Regards,
Christian.

Am 13.10.2017 um 10:51 schrieb Liu, Monk:
The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
And if so the s_fence referred by it maybe a wild pointer


BR Monk


From: Liu, Monk

Sent: 2017年10月13日 16:49
To: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

No that’s not true

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

How can you sure free_job() must before “trace_amd_sched_process_job” ?

From: Koenig, Christian

Sent: 2017年10月13日 16:43
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Nicolai Hähnle <nhaehnle@gmail.com<mailto:nhaehnle@gmail.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian

Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk