[4/4] i965: Drop the batch and limp along if execbuf fails.

Submitted by Kenneth Graunke on Sept. 24, 2017, 6:02 a.m.

Details

Message ID 20170924060206.18753-4-kenneth@whitecape.org
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in Mesa

Not browsing as part of any series.

Commit Message

Kenneth Graunke Sept. 24, 2017, 6:02 a.m.
The execbuf2 ioctl can fail for several reasons:

- a catastrophic bug in Mesa (we're programming garbage commands)
- repeated GPU hangs, where the kernel has stepped in and banned our
  process (or at least fd) from talking to the GPU anymore
- some sort of transient failures (low memory, GPU resetting a lot?)

We've not been too concerned with handling this case, because we thought
that the first two were the only ways this could happen.  In those cases
(which shouldn't happen anyway) it's probably better to exit and avoid
sabotaging the GPU repeatedly, which potentially could tank the system.

But it seems like we can hit this occasionally in other circumstances.
It appears to happen in certain low memory situations.  It might also
happen if someone else is tanking the GPU a bunch of times.  When the
failures are temporary, it's rude to outright kill the application
(especially if it's the X server or Wayland compositor).

With this patch, we raise a GL_OUT_OF_MEMORY error and move on, ignoring
the failure.  For normal flushing, we'll make a new batch and proceed as
normal - hoping that things will work out better in the future, and that
we miraculously avoid things like mapping failures which could cause us
to crash.  For fencing-triggered flushes, we drop the batch reference so
we don't block on it forever.

I'm not entirely sure how this will work out in practice, but the
existing code is dire, so we may as well give it a try and hope it
works out better for our users.
---
 src/mesa/drivers/dri/i965/intel_batchbuffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Patch hide | download patch | download mbox

diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
index d564510d06a..bc8a2283f9c 100644
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
@@ -782,6 +782,7 @@  submit_batch(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
    const struct gen_device_info *devinfo = &brw->screen->devinfo;
    __DRIscreen *dri_screen = brw->screen->driScrnPriv;
    struct intel_batchbuffer *batch = &brw->batch;
+   struct gl_context *ctx = &brw->ctx;
    int ret = 0;
 
    if (batch->batch_cpu_map) {
@@ -865,9 +866,8 @@  submit_batch(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
       brw_check_for_reset(brw);
 
    if (ret != 0) {
-      fprintf(stderr, "i965: Failed to submit batchbuffer: %s\n",
-              strerror(-ret));
-      exit(1);
+      _mesa_error(ctx, GL_OUT_OF_MEMORY,
+                  "i965: Failed to submit batchbuffer: %s\n", strerror(-ret));
    }
 
    return ret;

Comments

I've got this a few times recently and it's really annoying.  I don't know 
if this will fix anything or not but it may be worth a go.  I fear, 
however, that ignoring an execbuf failure will lead to permanently 
corrupted rendering or even additional hangs due to a chunk of the stream 
being missing.  That seems undesirable.  I would feel more comfortable 
about if you flagged BRW_NEW_CONTEXT in this cases to force a full state 
re-emit.


On September 24, 2017 1:02:18 AM Kenneth Graunke <kenneth@whitecape.org> wrote:

> The execbuf2 ioctl can fail for several reasons:
>
> - a catastrophic bug in Mesa (we're programming garbage commands)
> - repeated GPU hangs, where the kernel has stepped in and banned our
>   process (or at least fd) from talking to the GPU anymore
> - some sort of transient failures (low memory, GPU resetting a lot?)
>
> We've not been too concerned with handling this case, because we thought
> that the first two were the only ways this could happen.  In those cases
> (which shouldn't happen anyway) it's probably better to exit and avoid
> sabotaging the GPU repeatedly, which potentially could tank the system.
>
> But it seems like we can hit this occasionally in other circumstances.
> It appears to happen in certain low memory situations.  It might also
> happen if someone else is tanking the GPU a bunch of times.  When the
> failures are temporary, it's rude to outright kill the application
> (especially if it's the X server or Wayland compositor).
>
> With this patch, we raise a GL_OUT_OF_MEMORY error and move on, ignoring
> the failure.  For normal flushing, we'll make a new batch and proceed as
> normal - hoping that things will work out better in the future, and that
> we miraculously avoid things like mapping failures which could cause us
> to crash.  For fencing-triggered flushes, we drop the batch reference so
> we don't block on it forever.
>
> I'm not entirely sure how this will work out in practice, but the
> existing code is dire, so we may as well give it a try and hope it
> works out better for our users.
> ---
>  src/mesa/drivers/dri/i965/intel_batchbuffer.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c 
> b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index d564510d06a..bc8a2283f9c 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -782,6 +782,7 @@ submit_batch(struct brw_context *brw, int in_fence_fd, 
> int *out_fence_fd)
>     const struct gen_device_info *devinfo = &brw->screen->devinfo;
>     __DRIscreen *dri_screen = brw->screen->driScrnPriv;
>     struct intel_batchbuffer *batch = &brw->batch;
> +   struct gl_context *ctx = &brw->ctx;
>     int ret = 0;
>
>     if (batch->batch_cpu_map) {
> @@ -865,9 +866,8 @@ submit_batch(struct brw_context *brw, int in_fence_fd, 
> int *out_fence_fd)
>        brw_check_for_reset(brw);
>
>     if (ret != 0) {
> -      fprintf(stderr, "i965: Failed to submit batchbuffer: %s\n",
> -              strerror(-ret));
> -      exit(1);
> +      _mesa_error(ctx, GL_OUT_OF_MEMORY,
> +                  "i965: Failed to submit batchbuffer: %s\n", strerror(-ret));
>     }
>
>     return ret;
> --
> 2.14.1
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Quoting Jason Ekstrand (2017-09-24 22:53:04)
> I've got this a few times recently and it's really annoying.  I don't know 
> if this will fix anything or not but it may be worth a go.  I fear, 
> however, that ignoring an execbuf failure will lead to permanently 
> corrupted rendering or even additional hangs due to a chunk of the stream 
> being missing.  That seems undesirable.  I would feel more comfortable 
> about if you flagged BRW_NEW_CONTEXT in this cases to force a full state 
> re-emit.

The other step to consider is dependency chains. If you replace the
batch with an empty one (just MI_BB_END) and execute it with the rest of
the execobjects, then all the fences (both implicit and explicit) will
be valid. Just the contents garbage, and gpu state can be fixed up by
NEW_CONTEXT as Jason suggested.
-Chris
On Monday, September 25, 2017 8:05:29 AM PDT Chris Wilson wrote:
> Quoting Jason Ekstrand (2017-09-24 22:53:04)
> > I've got this a few times recently and it's really annoying.  I don't know 
> > if this will fix anything or not but it may be worth a go.  I fear, 
> > however, that ignoring an execbuf failure will lead to permanently 
> > corrupted rendering or even additional hangs due to a chunk of the stream 
> > being missing.  That seems undesirable.  I would feel more comfortable 
> > about if you flagged BRW_NEW_CONTEXT in this cases to force a full state 
> > re-emit.
> 
> The other step to consider is dependency chains. If you replace the
> batch with an empty one (just MI_BB_END) and execute it with the rest of
> the execobjects, then all the fences (both implicit and explicit) will
> be valid. Just the contents garbage, and gpu state can be fixed up by
> NEW_CONTEXT as Jason suggested.
> -Chris

That seems like a good idea.  What do you think we should do if that
fails?  I guess it would still try to pin all the execobject BOs, so
couldn't it still hit a low memory problem and fail?

Jason's email also reminded me that there's a bunch of CPU-side "current
state of the GPU" fields that we'd need to reset too (which I also need
to fix when we exceed the aperture).  They're kind of scattered all over
the place now, so I'll try and round them up and fix that...

--Ken
Quoting Kenneth Graunke (2017-09-25 17:46:11)
> On Monday, September 25, 2017 8:05:29 AM PDT Chris Wilson wrote:
> > Quoting Jason Ekstrand (2017-09-24 22:53:04)
> > > I've got this a few times recently and it's really annoying.  I don't know 
> > > if this will fix anything or not but it may be worth a go.  I fear, 
> > > however, that ignoring an execbuf failure will lead to permanently 
> > > corrupted rendering or even additional hangs due to a chunk of the stream 
> > > being missing.  That seems undesirable.  I would feel more comfortable 
> > > about if you flagged BRW_NEW_CONTEXT in this cases to force a full state 
> > > re-emit.
> > 
> > The other step to consider is dependency chains. If you replace the
> > batch with an empty one (just MI_BB_END) and execute it with the rest of
> > the execobjects, then all the fences (both implicit and explicit) will
> > be valid. Just the contents garbage, and gpu state can be fixed up by
> > NEW_CONTEXT as Jason suggested.
> > -Chris
> 
> That seems like a good idea.  What do you think we should do if that
> fails?  I guess it would still try to pin all the execobject BOs, so
> couldn't it still hit a low memory problem and fail?

At that point, I suggest crawling into a corner and hiding under a
blanket. If the nop batch fails, then you know the system is under too
much pressure to continue -- tell the client to switch to using a GTT
mmap to write their last wishes. I don't think pulling the plug is the
answer, report the error and let the client sort it out (or they ignore
it and either get killed by the kernel, or it just works after a while).
-Chris
On September 25, 2017 11:46:18 AM Kenneth Graunke <kenneth@whitecape.org> 
wrote:

> On Monday, September 25, 2017 8:05:29 AM PDT Chris Wilson wrote:
>> Quoting Jason Ekstrand (2017-09-24 22:53:04)
>> > I've got this a few times recently and it's really annoying.  I don't know
>> > if this will fix anything or not but it may be worth a go.  I fear,
>> > however, that ignoring an execbuf failure will lead to permanently
>> > corrupted rendering or even additional hangs due to a chunk of the stream
>> > being missing.  That seems undesirable.  I would feel more comfortable
>> > about if you flagged BRW_NEW_CONTEXT in this cases to force a full state
>> > re-emit.
>>
>> The other step to consider is dependency chains. If you replace the
>> batch with an empty one (just MI_BB_END) and execute it with the rest of
>> the execobjects, then all the fences (both implicit and explicit) will
>> be valid. Just the contents garbage, and gpu state can be fixed up by
>> NEW_CONTEXT as Jason suggested.
>> -Chris
>
> That seems like a good idea.  What do you think we should do if that
> fails?  I guess it would still try to pin all the execobject BOs, so
> couldn't it still hit a low memory problem and fail?
>
> Jason's email also reminded me that there's a bunch of CPU-side "current
> state of the GPU" fields that we'd need to reset too (which I also need
> to fix when we exceed the aperture).  They're kind of scattered all over
> the place now, so I'll try and round them up and fix that...

The most dangerous one I can think of is URB size.  That one can lead to 
done nasty issues.

--Jason