tests/gem_eio: Resilience against "hanging too fast"

Submitted by Daniel Vetter on Nov. 25, 2015, 4:19 p.m.

Details

Message ID 1448468364-20610-1-git-send-email-daniel.vetter@ffwll.ch
State New
Headers show
Series "tests/gem_eio: Resilience against "hanging too fast"" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Daniel Vetter Nov. 25, 2015, 4:19 p.m.
Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
hang we need to be careful to not run into the "hanging too fast
check":

- don't restore the ban period, but instead keep it at 0.
- make sure we idle the gpu fully before hanging it again (wait
  subtest missted that).

With this gem_eio works now reliable even when I don't run the
subtests individually.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index ad67332eae59..f6e41db63a7d 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -84,13 +84,17 @@  static void trigger_reset(int fd)
 
 static void wedge_gpu(int fd)
 {
+	igt_hang_ring_t hang;
+
 	/* First idle the GPU then disable GPU resets before injecting a hang */
 	gem_quiescent_gpu(fd);
 
 	igt_require(i915_reset_control(false));
 
 	igt_debug("Wedging GPU by injecting hang\n");
-	igt_post_hang_ring(fd, igt_hang_ring(fd, I915_EXEC_DEFAULT));
+	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
+	hang.ban = 0;
+	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
 }
@@ -161,10 +165,14 @@  static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	/* First idle the GPU then disable GPU resets before injecting a hang */
+	gem_quiescent_gpu(fd);
+
 	igt_require(i915_reset_control(false));
 
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
+	hang.ban = 0;
 	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));

Comments

On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> hang we need to be careful to not run into the "hanging too fast
> check":

That's not how it works. It restores the GPU by triggering a manual
reset.
-Chris
On Wed, Nov 25, 2015 at 04:28:02PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > hang we need to be careful to not run into the "hanging too fast
> > check":
> 
> That's not how it works. It restores the GPU by triggering a manual
> reset.

Maybe in your tree, but here it goes through the full hang recovery. And
the gem_eio default context _does_ get banned without this. Maybe it's a
bug in the ban logic, but it didn't look like that. After all
debugfs/i915_wedged is a hack for developers - by that time userspace has
already either died or fallen back to sw rendering anyway. So I figured no
point in fixing this in the kernel with special cases.
-Daniel
On Thu, Nov 26, 2015 at 09:34:01AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 04:28:02PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > hang we need to be careful to not run into the "hanging too fast
> > > check":
> > 
> > That's not how it works. It restores the GPU by triggering a manual
> > reset.
> 
> Maybe in your tree, but here it goes through the full hang recovery.

Right.

> And
> the gem_eio default context _does_ get banned without this.

Wrong.

> Maybe it's a
> bug in the ban logic, but it didn't look like that.

Indeed, that is not where the *kernel* bug is.
-Chris