drm/amdgpu: initialize the context reset_counter in amdgpu_ctx_init

Submitted by Nicolai Hähnle on Oct. 4, 2016, 7:45 a.m.

Details

Message ID 1475567106-5818-1-git-send-email-nhaehnle@gmail.com
State New
Series "drm/amdgpu: initialize the context reset_counter in amdgpu_ctx_init" ( rev: 1 ) in AMD X.Org drivers


Commit Message

Nicolai Hähnle Oct. 4, 2016, 7:45 a.m.
From: Nicolai Hähnle <nicolai.haehnle@amd.com>

Ensure that we really only report a GPU reset if one has happened since the
creation of the context.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 +++
 1 file changed, 3 insertions(+)


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index e203e55..a5e2fcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -36,20 +36,23 @@  static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
 	spin_lock_init(&ctx->ring_lock);
 	ctx->fences = kcalloc(amdgpu_sched_jobs * AMDGPU_MAX_RINGS,
 			      sizeof(struct fence*), GFP_KERNEL);
 	if (!ctx->fences)
 		return -ENOMEM;
 
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		ctx->rings[i].sequence = 1;
 		ctx->rings[i].fences = &ctx->fences[amdgpu_sched_jobs * i];
 	}
+
+	ctx->reset_counter = atomic_read(&adev->gpu_reset_counter);
+
 	/* create context entity for each ring */
 	for (i = 0; i < adev->num_rings; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
 		struct amd_sched_rq *rq;
 
 		rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL];
 		r = amd_sched_entity_init(&ring->sched, &ctx->rings[i].entity,
 					  rq, amdgpu_sched_jobs);
 		if (r)
 			break;
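
For context, the snapshot taken by the new line above is what the query path later compares against. A simplified sketch of that consumer, based on amdgpu_ctx_query() in the same file (trimmed for illustration; exact field names may vary by kernel version):

	/* Report a reset only if the device-global counter has moved
	 * past the snapshot stored in the context. */
	reset_counter = atomic_read(&adev->gpu_reset_counter);
	if (ctx->reset_counter == reset_counter)
		out->state.reset_status = AMDGPU_CTX_NO_RESET;
	else
		out->state.reset_status = AMDGPU_CTX_UNKNOWN_RESET;
	ctx->reset_counter = reset_counter;

Before this patch, amdgpu_ctx_init() left ctx->reset_counter at zero (the structure is zero-initialized), so a context created after the device's first reset would report AMDGPU_CTX_UNKNOWN_RESET on its first query even though the reset predates the context.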

Comments

On 04.10.2016 at 09:45, Nicolai Hähnle wrote:
> From: Nicolai Hähnle <nicolai.haehnle@amd.com>
>
> Ensure that we really only report a GPU reset if one has happened since the
> creation of the context.
>
> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> [patch body snipped; identical to the diff above]
Do we need to bump the DRM version for this bug fix?

Marek
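
A DRM version bump would let userspace detect the fix at runtime. A hedged sketch of the usual libdrm check; the minor number N is hypothetical, since no bump has been decided in this thread:

	/* Sketch using libdrm's drmGetVersion(); amdgpu's DRM major
	 * was 3 at the time, and N stands in for whatever minor a
	 * bump would land as. */
	drmVersionPtr ver = drmGetVersion(fd);
	bool reset_counter_reliable =
		ver->version_major > 3 ||
		(ver->version_major == 3 && ver->version_minor >= N);
	drmFreeVersion(ver);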

On Oct 4, 2016 10:20 AM, "Christian König" <deathsimple@vodafone.de> wrote:

> On 04.10.2016 at 09:45, Nicolai Hähnle wrote:
>
>> From: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>
>> Ensure that we really only report a GPU reset if one has happened since
>> the
>> creation of the context.
>>
>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
>> [patch body snipped; identical to the diff above]
On Thu, Oct 6, 2016 at 6:28 AM, Marek Olšák <maraeo@gmail.com> wrote:
> Do we need to bump the DRM version for this bug fix?
>

Alternatively, we could just cc stable. Given that the whole reset
handling in Mesa is still kind of up in the air, I'm not sure how
critical it is.

Alex
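
For reference, "cc stable" means marking the commit for the stable trees per the kernel's stable-kernel-rules: a tag in the commit message next to the Signed-off-by line, e.g.

	Cc: stable@vger.kernel.org

so the fix is picked up for backports automatically once it hits mainline.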

> Marek
>
>
> On Oct 4, 2016 10:20 AM, "Christian König" <deathsimple@vodafone.de> wrote:
>>
>> On 04.10.2016 at 09:45, Nicolai Hähnle wrote:
>>>
>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>>
>>> Ensure that we really only report a GPU reset if one has happened since
>>> the
>>> creation of the context.
>>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>
>>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>
>>> [patch body snipped; identical to the diff above]
On Thu, Oct 6, 2016 at 4:24 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Thu, Oct 6, 2016 at 6:28 AM, Marek Olšák <maraeo@gmail.com> wrote:
>> Do we need to bump the DRM version for this bug fix?
>>
>
> Alternatively, we could just cc stable. Given that the whole reset
> handling in Mesa is still kind of up in the air, I'm not sure how
> critical it is.

The Mesa part is actually complete. The problem is that if an
application requests the GL_LOSE_CONTEXT_ON_RESET strategy, an
incorrect reset counter returned by the kernel will automatically kill
the OpenGL context (by setting the no-op dispatch), so it's a pretty
serious issue and the kernel should return correct values regardless
of the state of GPU reset there. If the kernel returns an incorrect
reset counter, we need a version bump for the fix.

Marek
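
The application-visible side of what Marek describes goes through GL_ARB_robustness. A hedged sketch of the app-side check (not code from this thread; the recovery helper is hypothetical):

	/* Under the GL_LOSE_CONTEXT_ON_RESET strategy, a reported reset
	 * means every further GL call on this context is a no-op. */
	GLenum status = glGetGraphicsResetStatusARB();
	if (status != GL_NO_ERROR) {
		/* GL_GUILTY/INNOCENT/UNKNOWN_CONTEXT_RESET_ARB: the
		 * context is lost; tear it down and start over. */
		recreate_gl_context(); /* hypothetical application helper */
	}

This is why a spuriously advanced kernel counter matters: Mesa sees a "reset" that never affected the context and installs the no-op dispatch anyway.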
On 06.10.2016 16:43, Marek Olšák wrote:
> On Thu, Oct 6, 2016 at 4:24 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Thu, Oct 6, 2016 at 6:28 AM, Marek Olšák <maraeo@gmail.com> wrote:
>>> Do we need to bump the DRM version for this bug fix?
>>>
>>
>> Alternatively, we could just cc stable. Given that the whole reset
>> handling in Mesa is still kind of up in the air, I'm not sure how
>> critical it is.
>
> The Mesa part is actually complete. The problem is that if an
> application requests the GL_LOSE_CONTEXT_ON_RESET strategy, an
> incorrect reset counter returned by the kernel will automatically kill
> the OpenGL context (by setting the no-op dispatch), so it's a pretty
> serious issue and the kernel should return correct values regardless
> of the state of GPU reset there. If the kernel returns an incorrect
> reset counter, we need a version bump for the fix.

Yeah, that seems to be the safest solution.

Nicolai

>
> Marek
>
On Thu, Oct 6, 2016 at 10:43 AM, Marek Olšák <maraeo@gmail.com> wrote:
> On Thu, Oct 6, 2016 at 4:24 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Thu, Oct 6, 2016 at 6:28 AM, Marek Olšák <maraeo@gmail.com> wrote:
>>> Do we need to bump the DRM version for this bug fix?
>>>
>>
>> Alternatively, we could just cc stable. Given that the whole reset
>> handling in Mesa is still kind of up in the air, I'm not sure how
>> critical it is.
>
> The Mesa part is actually complete. The problem is that if an
> application requests the GL_LOSE_CONTEXT_ON_RESET strategy, an
> incorrect reset counter returned by the kernel will automatically kill
> the OpenGL context (by setting the no-op dispatch), so it's a pretty
> serious issue and the kernel should return correct values regardless
> of the state of GPU reset there. If the kernel returns an incorrect
> reset counter, we need a version bump for the fix.

GPU reset is currently disabled by default, so the counter should never
change at this point.

Alex