tests/gem_eio: Disable reset for wait subtests

Submitted by Daniel Vetter on Nov. 25, 2015, 3:58 p.m.

Details

Message ID 1448467099-1876-1-git-send-email-daniel.vetter@ffwll.ch
State New
Headers show
Series "tests/gem_eio: Disable reset for wait subtests" ( rev: 1 ) in Intel GFX

Not browsing as part of any series.

Commit Message

Daniel Vetter Nov. 25, 2015, 3:58 p.m.
This testcase tries to validate -EIO behaviour by disabling gpu reset
support in the kernel. Except that the wait subtest forgot to do that,
and therefore gets a return value of 0 instead of the expected -EIO.

With this fix gem_eio passes on a kernel with my fixes completely.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 4 ++++
 1 file changed, 4 insertions(+)

Patch hide | download patch | download mbox

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index a24c8f1c53b5..ad67332eae59 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -161,10 +161,14 @@  static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	igt_require(i915_reset_control(false));
+
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
 	igt_post_hang_ring(fd, hang);
 
+	igt_assert(i915_reset_control(true));
+
 	trigger_reset(fd);
 }
 

Comments

On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> This testcase tries to validate -EIO behaviour by disabling gpu reset
> support in the kernel. Except that the wait subtest forgot to do that,
> and therefore gets a return value of 0 instead of the expected -EIO.
> 

Wrong. It was intentionally not using reset=false.
-Chris
On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > support in the kernel. Except that the wait subtest forgot to do that,
> > and therefore gets a return value of 0 instead of the expected -EIO.
> > 
> 
> Wrong. It was intentionally not using reset=false.

To be more precise, the reason here is that we are not wedging the GPU
but the expectation is that a wait upon a request that hangs reports the
hang. Since the wait on GPU activity is explicit in the ioctl, the
presumption is that the user actually cares about that activity and so
should be given greater information about how it completes (timeout, GPU
hung, or success).
-Chris
On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > support in the kernel. Except that the wait subtest forgot to do that,
> > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > 
> > 
> > Wrong. It was intentionally not using reset=false.
> 
> To be more precise, the reason here is that we are not wedging the GPU
> but the expectation is that a wait upon a request that hangs reports the
> hang. Since the wait on GPU activity is explicit in the ioctl, the
> presumption is that the user actually cares about that activity and so
> should be given greater information about how it completes (timeout, GPU
> hung, or success).

The only place we reprt hangs is in the reset_stats_ioctl. And
fundamentally wait_ioctl can't do (right now) what you want, since if the
reset recovery happens before userspace calls wait_ioctl then it will
happily return 0 for success.

So if you want this then we need:
- a reason for userspace to want this
- reorg all our reset handling and move (or well copy) the reset stats to
  every object

I don't see the point in that.
-Daniel
On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > > 
> > > 
> > > Wrong. It was intentionally not using reset=false.
> > 
> > To be more precise, the reason here is that we are not wedging the GPU
> > but the expectation is that a wait upon a request that hangs reports the
> > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > presumption is that the user actually cares about that activity and so
> > should be given greater information about how it completes (timeout, GPU
> > hung, or success).
> 
> The only place we reprt hangs is in the reset_stats_ioctl. And
> fundamentally wait_ioctl can't do (right now) what you want, since if the
> reset recovery happens before userspace calls wait_ioctl then it will
> happily return 0 for success.
> 
> So if you want this then we need:
> - a reason for userspace to want this
> - reorg all our reset handling and move (or well copy) the reset stats to
>   every object

I buy that if it resets before the wait_ioctl then it is invisible,
which makes the information incomplete. I still think that as a caller
of wait_ioctl, knowing that my wait was broken by a reset is valuable
information. The drivers use some form of waiting to throttle
themselves, therefore it is a convenient way of having the information
that a reset has occurred and take action. This usecase also wouldn't
miss a reset happening.

You may disagree about wait_ioctl, but a wait_fence must definitely
report EIO. (And this is as close to wait_fence as we currently have in
the code and in use today.)
-Chris
On Thu, Nov 26, 2015 at 10:03:01AM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> > On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > > > 
> > > > 
> > > > Wrong. It was intentionally not using reset=false.
> > > 
> > > To be more precise, the reason here is that we are not wedging the GPU
> > > but the expectation is that a wait upon a request that hangs reports the
> > > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > > presumption is that the user actually cares about that activity and so
> > > should be given greater information about how it completes (timeout, GPU
> > > hung, or success).
> > 
> > The only place we reprt hangs is in the reset_stats_ioctl. And
> > fundamentally wait_ioctl can't do (right now) what you want, since if the
> > reset recovery happens before userspace calls wait_ioctl then it will
> > happily return 0 for success.
> > 
> > So if you want this then we need:
> > - a reason for userspace to want this
> > - reorg all our reset handling and move (or well copy) the reset stats to
> >   every object
> 
> I buy that if it resets before the wait_ioctl then it is invisible,
> which makes the information incomplete. I still think that as a caller
> of wait_ioctl, knowing that my wait was broken by a reset is valuable
> information. The drivers use some form of waiting to throttle
> themselves, therefore it is a convenient way of having the information
> that a reset has occurred and take action. This usecase also wouldn't
> miss a reset happening.

Iirc we even discussed this when doing arb robustness, but the trouble is
that arb robusteness really wants its reset stats per context. Until that
changes (with fences or not) I think we should just curb the EIO from
wait_ioctl like in all the other places.

> You may disagree about wait_ioctl, but a wait_fence must definitely
> report EIO. (And this is as close to wait_fence as we currently have in
> the code and in use today.)

Hm, the plan for fence waiting is to use the android syncpt stuff. I
didn't check whether that bothers with reporting failures, but should be
simple to add. And definitely makes sense.
-Daniel