[2/2] tests/gem_eio: Resilience against "hanging too fast"

Submitted by Daniel Vetter on Nov. 26, 2015, 11:34 a.m.

Details

Message ID 1448537675-481-2-git-send-email-daniel.vetter@ffwll.ch
State New
Series "Series without cover letter" (rev 1) in Intel GFX


Commit Message

Daniel Vetter Nov. 26, 2015, 11:34 a.m.
Since $debugfs/i915_wedged restores a wedged GPU by using a normal GPU
hang, we need to be careful not to run into the "hanging too fast"
check:

- don't restore the ban period, but instead keep it at 0.
- make sure we fully idle the GPU before hanging it again (the wait
  subtest missed that).

With this, gem_eio now works reliably even when I don't run the
subtests individually.

Of course it's a bit fishy that the default ctx gets blamed for
essentially doing nothing, but until that's figured out upstream it's
better to make the test work for now.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index 8345d1a7a429..6c07d9175b95 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -84,13 +84,17 @@  static void trigger_reset(int fd)
 
 static void wedge_gpu(int fd)
 {
+	igt_hang_ring_t hang;
+
 	/* First idle the GPU then disable GPU resets before injecting a hang */
 	gem_quiescent_gpu(fd);
 
 	igt_require(i915_reset_control(false));
 
 	igt_debug("Wedging GPU by injecting hang\n");
-	igt_post_hang_ring(fd, igt_hang_ring(fd, I915_EXEC_DEFAULT));
+	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
+	hang.ban = 0;
+	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
 }
@@ -161,10 +165,14 @@  static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	/* First idle the GPU then disable GPU resets before injecting a hang */
+	gem_quiescent_gpu(fd);
+
 	igt_require(i915_reset_control(false));
 
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), 0);
+	hang.ban = 0;
 	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));

Comments

On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> Since $debugfs/i915_wedged restores a wedged GPU by using a normal GPU
> hang, we need to be careful not to run into the "hanging too fast"
> check:
> 
> - don't restore the ban period, but instead keep it at 0.
> - make sure we fully idle the GPU before hanging it again (the wait
>   subtest missed that).
> 
> With this, gem_eio now works reliably even when I don't run the
> subtests individually.
> 
> Of course it's a bit fishy that the default ctx gets blamed for
> essentially doing nothing, but until that's figured out upstream it's
> better to make the test work for now.

This used to be reliable. And just disabling all banning in the kernel
forever more is silly.

During igt_post_hang_ring:
1. we wait upon the hanging batch
 - this returns when hangcheck fires
2. reset the ban period to normal
 - this takes mutex_lock_interruptible and so must wait for the reset
   handler to run before it can make the change,
 - ergo the hanging batch never triggers a ban for itself.
 - (a subsequent nonsimulated GPU hang may trigger the ban though)

Nak.
-Chris
On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > [...]
> 
> This used to be reliable. And just disabling all banning in the kernel
> forever more is silly.
> 
> During igt_post_hang_ring:
> 1. we wait upon the hanging batch
>  - this returns when hangcheck fires
> 2. reset the ban period to normal
>  - this takes mutex_lock_interruptible and so must wait for the reset
>    handler to run before it can make the change,
>  - ergo the hanging batch never triggers a ban for itself.
>  - (a subsequent nonsimulated GPU hang may trigger the ban though)

This isn't where it dies. It dies when we do the echo 1 > i915_wedged. I
suspect quiescent_gpu or whatever is getting in the way, but I really only
wanted to get things running first. And since i915_wedged is a developer
feature, and it does work perfectly if you don't intend to reuse contexts,
I didn't see any point in trying to fix it up first.

So I still maintain that this is a good enough approach, at least if
there's no obvious fix in-flight already.
-Daniel
On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> [...]
> 
> This isn't where it dies. It dies when we do the echo 1 > i915_wedged.

That is not where it dies.

> I suspect quiescent_gpu or whatever is getting in the way, but I really only
> wanted to get things to run first. And since i915_wedged is a developer
> feature, and it does work perfectly if you don't intend to reuse contexts
> I didn't see any point in first trying to fix it up.
> 
> So I still maintain that this is a good enough approach, at least if
> there's no obvious fix in-flight already.

No way. This is a kernel regression since 4.0, having just tested with
v4.0 on snb/ivb/hsw.
-Chris
On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> [...]
> > 
> > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> 
> That is not where it dies.

Well, at least it happens after we start the hang recovery from i915_wedged.

> > I suspect quiescent_gpu or whatever is getting in the way, but I really only
> > wanted to get things to run first. And since i915_wedged is a developer
> > feature, and it does work perfectly if you don't intend to reuse contexts
> > I didn't see any point in first trying to fix it up.
> > 
> > So I still maintain that this is a good enough approach, at least if
> > there's no obvious fix in-flight already.
> 
> No way. This is a kernel regression since 4.0, having just tested with
> v4.0 on snb/ivb/hsw.

Ok, I didn't realize that. I figured this had been broken forever, since
i915_wedged will return -EAGAIN anyway when we are terminally wedged, and
that seems to have been the case ever since we started with reset_counter.
I guess I missed something.
-Daniel
On Thu, Nov 26, 2015 at 04:51:13PM +0100, Daniel Vetter wrote:
> [...]
> 
> Ok, I didn't realize that. I figured since i915_wedged will return -EAGAIN
> anyway when we are terminally wedged, and that seems to have been the case
> ever since we started with reset_counter this has been broken forever. I
> guess I missed something.

The bug I see is SNB specific, and introduced between v4.0 and v4.1.
-Chris
On Thu, Nov 26, 2015 at 09:10:57PM +0000, Chris Wilson wrote:
> [...]
> 
> The bug I see is SNB specific, and introduced between v4.0 and v4.1.

The irony, I hacked on gem_eio on snb too ;-)
-Daniel