drm/amdgpu: fix the null pointer to get timeline by scheduler fence

Submitted by Huang, Ray on Aug. 8, 2018, 7:05 a.m.

Message ID 1533711936-22450-1-git-send-email-ray.huang@amd.com
State New
Series "drm/amdgpu: fix the null pointer to get timeline by scheduler fence" ( rev: 1 ) in AMD X.Org drivers

Commit Message

Huang, Ray Aug. 8, 2018, 7:05 a.m.
We no longer initialize the fence's scheduler in drm_sched_fence_create(), so
enabling the trace event dereferences a NULL fence scheduler when it looks up
the timeline name. The timeline name is really the scheduler name of the
entity, so add a macro that replaces the legacy way of getting the timeline
name from the job.

[  212.844281] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[  212.852401] PGD 8000000427c13067 P4D 8000000427c13067 PUD 4235fc067 PMD 0
[  212.859419] Oops: 0000 [#1] SMP PTI
[  212.862981] CPU: 4 PID: 1520 Comm: amdgpu_test Tainted: G           OE     4.18.0-rc1-custom #1
[  212.872194] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  212.881704] RIP: 0010:drm_sched_fence_get_timeline_name+0x2b/0x30 [gpu_sched]
[  212.888948] Code: 1f 44 00 00 48 8b 47 08 48 3d c0 b1 4f c0 74 13 48 83 ef 60 48 3d 60 b1 4f c0 b8 00 00 00 00 48 0f 45 f8 48 8b 87 e0 00 00 00 <48> 8b 40 18 c3 0f 1f 44 00 00 b8 01 00 00 00 c3 0f 1f 44 00 00 0f
[  212.908162] RSP: 0018:ffffa3ed81f27af0 EFLAGS: 00010246
[  212.913483] RAX: 0000000000000000 RBX: 0000000000070034 RCX: ffffa3ed81f27da8
[  212.920735] RDX: ffff8f24ebfb5460 RSI: ffff8f24e40d3c00 RDI: ffff8f24ebfb5400
[  212.928008] RBP: ffff8f24e40d3c00 R08: 0000000000000000 R09: ffffffffae4deafc
[  212.935263] R10: ffffffffada000ed R11: 0000000000000001 R12: ffff8f24e891f898
[  212.942558] R13: 0000000000000000 R14: ffff8f24ebc46000 R15: ffff8f24e3de97a8
[  212.949796] FS:  00007ffff7fd2700(0000) GS:ffff8f24fed00000(0000) knlGS:0000000000000000
[  212.958047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  212.963921] CR2: 0000000000000018 CR3: 0000000423422003 CR4: 00000000003606e0
[  212.971201] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  212.978482] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  212.985720] Call Trace:
[  212.988236]  trace_event_raw_event_amdgpu_cs_ioctl+0x4c/0x170 [amdgpu]
[  212.994904]  ? amdgpu_ctx_add_fence+0xa9/0x110 [amdgpu]
[  213.000246]  ? amdgpu_job_free_resources+0x4b/0x70 [amdgpu]
[  213.005944]  amdgpu_cs_ioctl+0x16d1/0x1b50 [amdgpu]
[  213.010920]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
[  213.016354]  drm_ioctl_kernel+0x8a/0xd0 [drm]
[  213.020794]  ? recalc_sigpending+0x17/0x50
[  213.024965]  drm_ioctl+0x2d7/0x390 [drm]
[  213.028979]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
[  213.034366]  ? do_signal+0x36/0x700
[  213.037928]  ? signal_wake_up_state+0x15/0x30
[  213.042375]  amdgpu_drm_ioctl+0x46/0x80 [amdgpu]
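To make the oops concrete: the faulting bytes `48 8b 40 18` load from RAX+0x18 while RAX is 0, which matches CR2 = 0x18. The user-space C sketch below mocks the pointer chain that drm_sched_fence_get_timeline_name() follows (fence->sched->name). The struct layout is an assumption: it reproduces only what we believe were the leading fields of struct drm_gpu_scheduler in this kernel (ops, hw_submission_limit, timeout, name), which would put name at offset 0x18 on an LP64 target such as x86-64 Linux. All mock_* names and the NULL-checked variant are hypothetical, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mock of the leading fields of struct drm_gpu_scheduler
 * (assumed layout, LP64). */
struct mock_gpu_scheduler {
	const void *ops;              /* offset 0x00 */
	uint32_t hw_submission_limit; /* offset 0x08, then 4 bytes of padding */
	long timeout;                 /* offset 0x10 */
	const char *name;             /* offset 0x18 */
};

/* Hypothetical mock of a scheduler fence: sched stays NULL until known. */
struct mock_sched_fence {
	struct mock_gpu_scheduler *sched;
};

/* What the crashing callback effectively does: return fence->sched->name.
 * With sched == NULL this reads address 0x18, the CR2 value in the oops. */
static const char *mock_timeline_name(const struct mock_sched_fence *f)
{
	return f->sched->name;
}

/* Hypothetical defensive variant that tolerates an unset scheduler. */
static const char *mock_timeline_name_checked(const struct mock_sched_fence *f)
{
	return f->sched ? f->sched->name : "unscheduled";
}
```

The checked variant is only one possible defensive shape; the review in this thread argues for fixing when the scheduler is chosen rather than papering over the NULL.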

Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index e12871d..be01e1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1247,7 +1247,7 @@  static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
 	amdgpu_job_free_resources(job);
 
-	trace_amdgpu_cs_ioctl(job);
+	trace_amdgpu_cs_ioctl(job, entity);
 	amdgpu_vm_bo_trace_cs(&fpriv->vm, &p->ticket);
 	priority = job->base.s_priority;
 	drm_sched_entity_push_job(&job->base, entity);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 8c2dab2..25cdcb7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -36,6 +36,8 @@ 
 
 #define AMDGPU_JOB_GET_TIMELINE_NAME(job) \
 	 job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence->finished)
+#define AMDGPU_GET_SCHED_NAME(entity) \
+	 (entity->rq->sched->name)
 
 TRACE_EVENT(amdgpu_mm_rreg,
 	    TP_PROTO(unsigned did, uint32_t reg, uint32_t value),
@@ -161,11 +163,11 @@  TRACE_EVENT(amdgpu_cs,
 );
 
 TRACE_EVENT(amdgpu_cs_ioctl,
-	    TP_PROTO(struct amdgpu_job *job),
-	    TP_ARGS(job),
+	    TP_PROTO(struct amdgpu_job *job, struct drm_sched_entity *entity),
+	    TP_ARGS(job, entity),
 	    TP_STRUCT__entry(
 			     __field(uint64_t, sched_job_id)
-			     __string(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
+			     __string(timeline, AMDGPU_GET_SCHED_NAME(entity))
 			     __field(unsigned int, context)
 			     __field(unsigned int, seqno)
 			     __field(struct dma_fence *, fence)
@@ -175,7 +177,7 @@  TRACE_EVENT(amdgpu_cs_ioctl,
 
 	    TP_fast_assign(
 			   __entry->sched_job_id = job->base.id;
-			   __assign_str(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
+			   __assign_str(timeline, AMDGPU_GET_SCHED_NAME(entity))
 			   __entry->context = job->base.s_fence->finished.context;
 			   __entry->seqno = job->base.s_fence->finished.seqno;
 			   __entry->ring_name = to_amdgpu_ring(job->base.sched)->name;
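A minimal sketch of what the patched tracepoint records, using hypothetical mock structures that keep only the entity->rq->sched->name chain read by AMDGPU_GET_SCHED_NAME(). The copy into a fixed buffer stands in for __assign_str(), which stores the string in the trace record at the moment the event fires; the buffer size and the mock_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mocks: only the pointer chain the macro walks. */
struct mock_sched  { const char *name; };
struct mock_rq     { struct mock_sched *sched; };
struct mock_entity { struct mock_rq *rq; };

/* Same chain as the patch's AMDGPU_GET_SCHED_NAME(entity). */
#define MOCK_GET_SCHED_NAME(entity) ((entity)->rq->sched->name)

/* Hypothetical trace record: the timeline string is copied, so the record
 * keeps whatever name the entity pointed at when the event fired. */
struct mock_cs_ioctl_event {
	char timeline[32];
};

static void mock_trace_cs_ioctl(struct mock_cs_ioctl_event *ev,
				const struct mock_entity *entity)
{
	snprintf(ev->timeline, sizeof(ev->timeline), "%s",
		 MOCK_GET_SCHED_NAME(entity));
}
```

Because the name comes from the entity's current run queue, no NULL fence scheduler is touched; whether that run queue is the one the job finally executes on is a separate question.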

Comments

Yeah, that is a known issue, but this solution is not correct either.

The problem is that the scheduler the job will execute on is simply not
determined yet when we want to trace it.

So using the scheduler name from the entity is wrong as well.

We should probably move the reschedule from drm_sched_entity_push_job()
to drm_sched_job_init() to fix that.

I will prepare a patch for that today,
Christian.
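The ordering problem can be modeled in a few lines. This hypothetical C sketch uses a least-loaded pick as a stand-in for the scheduler's real run-queue selection (an assumption; the actual policy may differ) and compares tracing before the selection, as when it happens in drm_sched_entity_push_job(), with selecting in drm_sched_job_init() first. The mock_* types and submit_* helpers are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_SCHEDS 2

struct mock_sched { const char *name; unsigned int load; };

struct mock_job {
	struct mock_sched *sched; /* s_fence->sched stand-in; NULL = undecided */
	const char *traced_name;  /* what the trace event recorded */
};

/* Hypothetical load balancer: pick the least-loaded scheduler. */
static struct mock_sched *mock_select_sched(struct mock_sched *scheds, size_t n)
{
	struct mock_sched *best = &scheds[0];

	for (size_t i = 1; i < n; i++)
		if (scheds[i].load < best->load)
			best = &scheds[i];
	return best;
}

/* Current order: trace first, select in push_job. The trace can only
 * guess, here using the entity's current scheduler. */
static void submit_select_at_push(struct mock_job *job,
				  struct mock_sched *entity_current,
				  struct mock_sched *scheds, size_t n)
{
	job->sched = NULL;                          /* job_init: unknown */
	job->traced_name = entity_current->name;    /* trace: best guess */
	job->sched = mock_select_sched(scheds, n);  /* push_job: decided */
}

/* Proposed order: select in job_init, so the trace sees the final choice. */
static void submit_select_at_init(struct mock_job *job,
				  struct mock_sched *scheds, size_t n)
{
	job->sched = mock_select_sched(scheds, n);  /* job_init: decided */
	job->traced_name = job->sched->name;        /* trace: accurate */
}
```

With loads of 5 and 1, the first ordering records sdma0 while the job lands on sdma1; in the second ordering the traced name and the chosen scheduler always agree.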

On Wed, Aug 08, 2018 at 03:10:07PM +0800, Koenig, Christian wrote:
> Yeah, that is a known issue, but this solution is not correct either.
> 
> The problem is that the scheduler the job will execute on is simply not
> determined yet when we want to trace it.
> 
> So using the scheduler name from the entity is wrong as well.
> 
> We should probably move the reschedule from drm_sched_entity_push_job()
> to drm_sched_job_init() to fix that.

Could you please explain why moving the reschedule alone would fix the
issue? It seems that simply assigning the entity rq's sched to s_fence's
sched would avoid the issue:

sched_job->s_fence->sched = entity->rq->sched

Thanks,
Ray

On Wed, Aug 8, 2018 at 4:58 PM Huang Rui <ray.huang@amd.com> wrote:

> On Wed, Aug 08, 2018 at 03:10:07PM +0800, Koenig, Christian wrote:
> > Yeah, that is a known issue, but this solution is not correct either.
> >
> > The problem is that the scheduler the job will execute on is simply not
> > determined yet when we want to trace it.
> >
> > So using the scheduler name from the entity is wrong as well.
> >
> > We should probably move the reschedule from drm_sched_entity_push_job()
> > to drm_sched_job_init() to fix that.
>
> Could you please explain why moving the reschedule alone would fix the
> issue? It seems that simply assigning the entity rq's sched to s_fence's
> sched would avoid the issue:
>
> sched_job->s_fence->sched = entity->rq->sched
>

Because entity->rq->sched is not necessarily the scheduler on which this
job will get scheduled, and assigning the wrong scheduler could lead to
wrong dependency optimizations. Hence it is initially set to NULL while we
don't yet know which scheduler the job will be scheduled on, to avoid any
wrong optimizations.
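The "dependency optimization" at stake is, as far as we understand it, the shortcut where a job whose dependency was submitted to the same scheduler only waits for that dependency to be scheduled, because the ring executes jobs in order. This is our reading of drm_sched_entity_add_dependency_cb() and should be checked against the actual source. The hypothetical sketch below shows why a guessed s_fence->sched is dangerous while NULL is safe; the mock_* names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

struct mock_sched { const char *name; };

/* Stand-in for a scheduler fence: the scheduler it is *recorded* on. */
struct mock_sched_fence { const struct mock_sched *sched; };

enum mock_wait { WAIT_FOR_FINISHED, WAIT_FOR_SCHEDULED };

/*
 * Sketch of the same-scheduler shortcut: if the dependency is recorded on
 * the same scheduler as the new job, in-order execution on that ring means
 * waiting until it is scheduled is enough. Otherwise (including an unknown,
 * NULL scheduler) the new job must wait for the dependency to finish.
 */
static enum mock_wait mock_dependency_wait(const struct mock_sched_fence *dep,
					   const struct mock_sched *target)
{
	if (dep->sched != NULL && dep->sched == target)
		return WAIT_FOR_SCHEDULED;
	return WAIT_FOR_FINISHED;
}
```

If a dependency that really runs on ring1 is wrongly recorded on ring0, a new job on ring0 only waits for it to be scheduled, so the two could overlap on different rings; leaving sched NULL forces the conservative wait.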

Cheers,
Nayan

On Wed, Aug 08, 2018 at 05:30:17PM +0530, Nayan Deshmukh wrote:
> On Wed, Aug 8, 2018 at 4:58 PM Huang Rui <ray.huang@amd.com> wrote:
> 
> > Could you please explain why moving the reschedule alone would fix the
> > issue? It seems that simply assigning the entity rq's sched to s_fence's
> > sched would avoid the issue:
> >
> > sched_job->s_fence->sched = entity->rq->sched
> 
> Because entity->rq->sched is not necessarily the scheduler on which this
> job will get scheduled, and assigning the wrong scheduler could lead to
> wrong dependency optimizations. Hence it is initially set to NULL while we
> don't yet know which scheduler the job will be scheduled on, to avoid any
> wrong optimizations.
> 

Nayan, thank you, I get it now. That's why we set sfence->sched to NULL
in drm_sched_job_init().

Thanks,
Ray

> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Any updates on this issue?

Regards,
Andres

On 2018-08-08 03:10 AM, Christian König wrote:
> Yeah, that is a known issue, but this solution is not correct either.
> 
> The problem is that the scheduler the job will execute on is simply not
> determined yet when we want to trace it.
> 
> So using the scheduler name from the entity is wrong as well.
> 
> We should probably move the reschedule from drm_sched_entity_push_job()
> to drm_sched_job_init() to fix that.
> 
> I will prepare a patch for that today,
> Christian.
> 
> Am 08.08.2018 um 09:05 schrieb Huang Rui:
>> We won't initialize fence scheduler in drm_sched_fence_create() 
>> anymore, so it
>> will refer null fence scheduler if open trace event to get the 
>> timeline name.
>> Actually, it is the scheduler name from the entity, so add a macro to 
>> replace
>> legacy getting timeline name by job.
>>
>> [  212.844281] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>> [  212.852401] PGD 8000000427c13067 P4D 8000000427c13067 PUD 4235fc067 PMD 0
>> [  212.859419] Oops: 0000 [#1] SMP PTI
>> [  212.862981] CPU: 4 PID: 1520 Comm: amdgpu_test Tainted: G           OE     4.18.0-rc1-custom #1
>> [  212.872194] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
>> [  212.881704] RIP: 0010:drm_sched_fence_get_timeline_name+0x2b/0x30 [gpu_sched]
>> [  212.888948] Code: 1f 44 00 00 48 8b 47 08 48 3d c0 b1 4f c0 74 13 48 83 ef 60 48 3d 60 b1 4f c0 b8 00 00 00 00 48 0f 45 f8 48 8b 87 e0 00 00 00 <48> 8b 40 18 c3 0f 1f 44 00 00 b8 01 00 00 00 c3 0f 1f 44 00 00 0f
>> [  212.908162] RSP: 0018:ffffa3ed81f27af0 EFLAGS: 00010246
>> [  212.913483] RAX: 0000000000000000 RBX: 0000000000070034 RCX: ffffa3ed81f27da8
>> [  212.920735] RDX: ffff8f24ebfb5460 RSI: ffff8f24e40d3c00 RDI: ffff8f24ebfb5400
>> [  212.928008] RBP: ffff8f24e40d3c00 R08: 0000000000000000 R09: ffffffffae4deafc
>> [  212.935263] R10: ffffffffada000ed R11: 0000000000000001 R12: ffff8f24e891f898
>> [  212.942558] R13: 0000000000000000 R14: ffff8f24ebc46000 R15: ffff8f24e3de97a8
>> [  212.949796] FS:  00007ffff7fd2700(0000) GS:ffff8f24fed00000(0000) knlGS:0000000000000000
>> [  212.958047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  212.963921] CR2: 0000000000000018 CR3: 0000000423422003 CR4: 00000000003606e0
>> [  212.971201] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  212.978482] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  212.985720] Call Trace:
>> [  212.988236]  trace_event_raw_event_amdgpu_cs_ioctl+0x4c/0x170 [amdgpu]
>> [  212.994904]  ? amdgpu_ctx_add_fence+0xa9/0x110 [amdgpu]
>> [  213.000246]  ? amdgpu_job_free_resources+0x4b/0x70 [amdgpu]
>> [  213.005944]  amdgpu_cs_ioctl+0x16d1/0x1b50 [amdgpu]
>> [  213.010920]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
>> [  213.016354]  drm_ioctl_kernel+0x8a/0xd0 [drm]
>> [  213.020794]  ? recalc_sigpending+0x17/0x50
>> [  213.024965]  drm_ioctl+0x2d7/0x390 [drm]
>> [  213.028979]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
>> [  213.034366]  ? do_signal+0x36/0x700
>> [  213.037928]  ? signal_wake_up_state+0x15/0x30
>> [  213.042375]  amdgpu_drm_ioctl+0x46/0x80 [amdgpu]
>>
>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 10 ++++++----
>>   2 files changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index e12871d..be01e1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -1247,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>>       amdgpu_job_free_resources(job);
>> -    trace_amdgpu_cs_ioctl(job);
>> +    trace_amdgpu_cs_ioctl(job, entity);
>>       amdgpu_vm_bo_trace_cs(&fpriv->vm, &p->ticket);
>>       priority = job->base.s_priority;
>>       drm_sched_entity_push_job(&job->base, entity);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> index 8c2dab2..25cdcb7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> @@ -36,6 +36,8 @@
>>   #define AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>>        job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence->finished)
>>
>> +#define AMDGPU_GET_SCHED_NAME(entity) \
>> +     (entity->rq->sched->name)
>>   TRACE_EVENT(amdgpu_mm_rreg,
>>           TP_PROTO(unsigned did, uint32_t reg, uint32_t value),
>> @@ -161,11 +163,11 @@ TRACE_EVENT(amdgpu_cs,
>>   );
>>   TRACE_EVENT(amdgpu_cs_ioctl,
>> -        TP_PROTO(struct amdgpu_job *job),
>> -        TP_ARGS(job),
>> +        TP_PROTO(struct amdgpu_job *job, struct drm_sched_entity *entity),
>> +        TP_ARGS(job, entity),
>>           TP_STRUCT__entry(
>>                    __field(uint64_t, sched_job_id)
>> -                 __string(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
>> +                 __string(timeline, AMDGPU_GET_SCHED_NAME(entity))
>>                    __field(unsigned int, context)
>>                    __field(unsigned int, seqno)
>>                    __field(struct dma_fence *, fence)
>> @@ -175,7 +177,7 @@ TRACE_EVENT(amdgpu_cs_ioctl,
>>           TP_fast_assign(
>>                  __entry->sched_job_id = job->base.id;
>> -               __assign_str(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
>> +               __assign_str(timeline, AMDGPU_GET_SCHED_NAME(entity))
>>                  __entry->context = job->base.s_fence->finished.context;
>>                  __entry->seqno = job->base.s_fence->finished.seqno;
>>                  __entry->ring_name = to_amdgpu_ring(job->base.sched)->name;
> 
That is fixed by "drm/scheduler: bind job earlier to scheduler".

Christian.

On 13.08.2018 at 16:33, Andres Rodriguez wrote:
> Any updates on this issue?
>
> Regards,
> Andres