[Intel-gfx] tests/initial_state: Add a test to capture the state of the GPU

Submitted by Lofstedt, Marta on May 12, 2017, 11:07 a.m.

Details

Message ID 20170512110735.30752-1-marta.lofstedt@intel.com
State New
Headers show
Series "tests/initial_state: Add a test to capture the state of the GPU" ( rev: 1 ) in IGT (deprecated)

Not browsing as part of any series.

Commit Message

Lofstedt, Marta May 12, 2017, 11:07 a.m.
When running testlist for CI iot would be good to know if
the state of the GPU is as expected. This adds a subtest to check
if the GPU is wedged.

Signed-off-by: Marta Lofstedt <marta.lofstedt@intel.com>
---
 tests/Makefile.sources                |  1 +
 tests/initial_state.c                 | 52 +++++++++++++++++++++++++++++++++++
 tests/intel-ci/extended.testlist      |  1 +
 tests/intel-ci/fast-feedback.testlist |  1 +
 4 files changed, 55 insertions(+)
 create mode 100644 tests/initial_state.c

Patch hide | download patch | download mbox

diff --git a/tests/Makefile.sources b/tests/Makefile.sources
index 9553e4d9..32431a05 100644
--- a/tests/Makefile.sources
+++ b/tests/Makefile.sources
@@ -152,6 +152,7 @@  TESTS_progs_M = \
 	vgem_basic \
 	vgem_slow \
 	meta_test \
+	initial_state \
 	$(NULL)
 
 if HAVE_CHAMELIUM
diff --git a/tests/initial_state.c b/tests/initial_state.c
new file mode 100644
index 00000000..1c5d9a74
--- /dev/null
+++ b/tests/initial_state.c
@@ -0,0 +1,52 @@ 
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include "igt.h"
+
+/*
+ * The purpose of this test is to capture the
+ * state of the GPU before running a testlist.
+ *
+ * Note, this test currently only checks if
+ * i915 is wedged. But, it could be extended to report
+ * on other bad initial states.
+ */
+
+static void is_wedged(void)
+{
+	char buf[128];
+	int fd = drm_open_driver(DRIVER_INTEL);
+
+	igt_debugfs_read(fd, "i915_wedged", buf);
+	igt_assert(!strstr(buf, "1"));
+}
+
+igt_main
+{
+
+	igt_subtest("is_wedged")
+		is_wedged();
+}
diff --git a/tests/intel-ci/extended.testlist b/tests/intel-ci/extended.testlist
index a16c9c84..2ec08b55 100644
--- a/tests/intel-ci/extended.testlist
+++ b/tests/intel-ci/extended.testlist
@@ -1,3 +1,4 @@ 
+igt@initial_state@is_wedged
 igt@core_auth@many-magics
 igt@core_getclient
 igt@core_get_client_auth@master-drop
diff --git a/tests/intel-ci/fast-feedback.testlist b/tests/intel-ci/fast-feedback.testlist
index 5ffa2cef..a6a0a810 100644
--- a/tests/intel-ci/fast-feedback.testlist
+++ b/tests/intel-ci/fast-feedback.testlist
@@ -1,3 +1,4 @@ 
+igt@initial_state@is_wedged
 igt@core_auth@basic-auth
 igt@core_prop_blob@basic
 igt@drv_getparams_basic@basic-eu-total

Comments

The commit name "Add a test to capture the state of the GPU" should 
likely be renamed to "Add a test that checks the current state of the 
driver".

On 12/05/17 14:07, Marta Lofstedt wrote:
> When running testlist for CI iot would be good to know if

iot -> it

> the state of the GPU is as expected. This adds a subtest to check
> if the GPU is wedged.

This is not mandatory, but how about adding:

"When coupled with piglit's abort-on feature, this makes the piglit 
results contain the reason for the abort when the driver's initial state 
is unacceptable, improving debuggability"

With the commit title and the typo fixed, the patch is:
Acked-by: Martin Peres <martin.peres@linux.intel.com>

> 
> Signed-off-by: Marta Lofstedt <marta.lofstedt@intel.com>
> ---
>   tests/Makefile.sources                |  1 +
>   tests/initial_state.c                 | 52 +++++++++++++++++++++++++++++++++++
>   tests/intel-ci/extended.testlist      |  1 +
>   tests/intel-ci/fast-feedback.testlist |  1 +
>   4 files changed, 55 insertions(+)
>   create mode 100644 tests/initial_state.c
> 
> diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> index 9553e4d9..32431a05 100644
> --- a/tests/Makefile.sources
> +++ b/tests/Makefile.sources
> @@ -152,6 +152,7 @@ TESTS_progs_M = \
>   	vgem_basic \
>   	vgem_slow \
>   	meta_test \
> +	initial_state \
>   	$(NULL)
>   
>   if HAVE_CHAMELIUM
> diff --git a/tests/initial_state.c b/tests/initial_state.c
> new file mode 100644
> index 00000000..1c5d9a74
> --- /dev/null
> +++ b/tests/initial_state.c
> @@ -0,0 +1,52 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +#include "igt.h"
> +
> +/*
> + * The purpose of this test is to capture the
> + * state of the GPU before running a testlist.
> + *
> + * Note, this test currently only checks if
> + * i915 is wedged. But, it could be extended to report
> + * on other bad initial states.
> + */
> +
> +static void is_wedged(void)
> +{
> +	char buf[128];
> +	int fd = drm_open_driver(DRIVER_INTEL);
> +
> +	igt_debugfs_read(fd, "i915_wedged", buf);
> +	igt_assert(!strstr(buf, "1"));
> +}
> +
> +igt_main
> +{
> +
> +	igt_subtest("is_wedged")
> +		is_wedged();
> +}
> diff --git a/tests/intel-ci/extended.testlist b/tests/intel-ci/extended.testlist
> index a16c9c84..2ec08b55 100644
> --- a/tests/intel-ci/extended.testlist
> +++ b/tests/intel-ci/extended.testlist
> @@ -1,3 +1,4 @@
> +igt@initial_state@is_wedged
>   igt@core_auth@many-magics
>   igt@core_getclient
>   igt@core_get_client_auth@master-drop
> diff --git a/tests/intel-ci/fast-feedback.testlist b/tests/intel-ci/fast-feedback.testlist
> index 5ffa2cef..a6a0a810 100644
> --- a/tests/intel-ci/fast-feedback.testlist
> +++ b/tests/intel-ci/fast-feedback.testlist
> @@ -1,3 +1,4 @@
> +igt@initial_state@is_wedged
>   igt@core_auth@basic-auth
>   igt@core_prop_blob@basic
>   igt@drv_getparams_basic@basic-eu-total
>
On Fri, May 12, 2017 at 04:08:48PM +0300, Martin Peres wrote:
> The commit name "Add a test to capture the state of the GPU" should
> likely be renamed to "Add a test that checks the current state of
> the driver".
> 
> On 12/05/17 14:07, Marta Lofstedt wrote:
> >When running testlist for CI iot would be good to know if
> 
> iot -> it
> 
> >the state of the GPU is as expected. This adds a subtest to check
> >if the GPU is wedged.
> 
> This is not mandatory, but how about adding:
> 
> "When coupled with piglit's abort-on feature, this makes the piglit
> results contain the reason for the abort when the driver's initial
> state is unacceptable, improving debuggability"
> 
> With the commit title and the typo fixed, the patch is:
> Acked-by: Martin Peres <martin.peres@linux.intel.com>
> 
> >
> >Signed-off-by: Marta Lofstedt <marta.lofstedt@intel.com>
> >---
> >  tests/Makefile.sources                |  1 +
> >  tests/initial_state.c                 | 52 +++++++++++++++++++++++++++++++++++
> >  tests/intel-ci/extended.testlist      |  1 +
> >  tests/intel-ci/fast-feedback.testlist |  1 +
> >  4 files changed, 55 insertions(+)
> >  create mode 100644 tests/initial_state.c
> >
> >diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> >index 9553e4d9..32431a05 100644
> >--- a/tests/Makefile.sources
> >+++ b/tests/Makefile.sources
> >@@ -152,6 +152,7 @@ TESTS_progs_M = \
> >  	vgem_basic \
> >  	vgem_slow \
> >  	meta_test \
> >+	initial_state \
> >  	$(NULL)
> >  if HAVE_CHAMELIUM
> >diff --git a/tests/initial_state.c b/tests/initial_state.c
> >new file mode 100644
> >index 00000000..1c5d9a74
> >--- /dev/null
> >+++ b/tests/initial_state.c
> >@@ -0,0 +1,52 @@
> >+/*
> >+ * Copyright © 2017 Intel Corporation
> >+ *
> >+ * Permission is hereby granted, free of charge, to any person obtaining a
> >+ * copy of this software and associated documentation files (the "Software"),
> >+ * to deal in the Software without restriction, including without limitation
> >+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> >+ * and/or sell copies of the Software, and to permit persons to whom the
> >+ * Software is furnished to do so, subject to the following conditions:
> >+ *
> >+ * The above copyright notice and this permission notice (including the next
> >+ * paragraph) shall be included in all copies or substantial portions of the
> >+ * Software.
> >+ *
> >+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> >+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> >+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> >+ * IN THE SOFTWARE.
> >+ *
> >+ */
> >+
> >+#include <stdio.h>
> >+#include <string.h>
> >+#include "igt.h"
> >+
> >+/*
> >+ * The purpose of this test is to capture the
> >+ * state of the GPU before running a testlist.
> >+ *
> >+ * Note, this test currently only checks if
> >+ * i915 is wedged. But, it could be extended to report
> >+ * on other bad initial states.
> >+ */
> >+
> >+static void is_wedged(void)
> >+{
> >+	char buf[128];
> >+	int fd = drm_open_driver(DRIVER_INTEL);
> >+
> >+	igt_debugfs_read(fd, "i915_wedged", buf);

The ABI for testing whether the device is wedged before use is
gem_throttle(). This is encapsulated into igt_require_gem() and any test
that depends upon execution on the GPU should be checking for GEM first.

This test should just be igt_gem_require(drm_open_driver(DRIVER_INTEL));
-Chris
On Fri, May 12, 2017 at 02:17:29PM +0100, Chris Wilson wrote:
> The ABI for testing whether the device is wedged before use is
> gem_throttle(). This is encapsulated into igt_require_gem() and any test
> that depends upon execution on the GPU should be checking for GEM first.
> 
> This test should just be igt_gem_require(drm_open_driver(DRIVER_INTEL));

Well, how do you expect people to spot your nifty tools and refactorings
if they don't show up on the m-l, don't get acks or reviews, and don't
come with api-docs? At least neither gmail nor google did find that patch
submitted anywhere.

Marta, can you pls review Chris' patch:

commit 9518cb59abe35143f15abac2b7ffdb99820ef53c                                                                                         
Author: Chris Wilson <chris@chris-wilson.co.uk>                                                                                         
Date:   Wed Feb 22 15:24:54 2017 +0000                                                                                                  

    igt: Start marking up GEM tests that require an alive GPU to function

Best would be to check that all gem tests really use this, and then also
add abi docs for this new helper.

Also, since we have igt_require_gem() already, what's the value of this
additional testcase? If we want to reboot machines if the gpu died, then
we need to have such a test at a higher level, in piglit. What we could do
is add a special exit code to igt that indicates a runtime problem (and
tells piglit to reboot), essentially PLEASE_REBOOT instead of SKIP (which
is what igt_require_gem() results in).
-Daniel
On Mon, May 15, 2017 at 04:00:55PM +0200, Daniel Vetter wrote:
> On Fri, May 12, 2017 at 02:17:29PM +0100, Chris Wilson wrote:
> > The ABI for testing whether the device is wedged before use is
> > gem_throttle(). This is encapsulated into igt_require_gem() and any test
> > that depends upon execution on the GPU should be checking for GEM first.
> > 
> > This test should just be igt_gem_require(drm_open_driver(DRIVER_INTEL));
> 
> Well, how do you expect people to spot your nifty tools and refactorings

The ABI was established in

commit 537e73f3f935b917f2f5f9b51499cb29d65e3889
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 24 17:37:41 2010 +0100

    Disable dri2 after forcing fallbacks
I hereby pull-out this patch.
The idea of it was to know if we were already wedged at the beginning of testing, that would give us information on how to interpret silly results; such that test starting to get skipped and/or we got dmesg-warns/incomplete on tests that usually should be skipped.
Also, we are planning to soon deploy a piglit.conf solution where testing will be terminated on wedged, so I agree that my test isn't really needed.

/Marta

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Monday, May 15, 2017 5:01 PM
> To: Chris Wilson <chris@chris-wilson.co.uk>; Martin Peres
> <martin.peres@linux.intel.com>; Lofstedt, Marta
> <marta.lofstedt@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> the state of the GPU
> 
> On Fri, May 12, 2017 at 02:17:29PM +0100, Chris Wilson wrote:
> > The ABI for testing whether the device is wedged before use is
> > gem_throttle(). This is encapsulated into igt_require_gem() and any
> > test that depends upon execution on the GPU should be checking for GEM
> first.
> >
> > This test should just be
> > igt_gem_require(drm_open_driver(DRIVER_INTEL));
> 
> Well, how do you expect people to spot your nifty tools and refactorings if
> they don't show up on the m-l, don't get acks or reviews, and don't come
> with api-docs? At least neither gmail nor google did find that patch submitted
> anywhere.
> 
> Marta, can you pls review Chris' patch:
> 
> commit 9518cb59abe35143f15abac2b7ffdb99820ef53c
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Feb 22 15:24:54 2017 +0000
> 
>     igt: Start marking up GEM tests that require an alive GPU to function
> 
> Best would be to check that all gem tests really use this, and then also add abi
> docs for this new helper.
> 
> Also, since we have igt_require_gem() already, what's the value of this
> additional testcase? If we want to reboot machines if the gpu died, then we
> need to have such a test at a higher level, in piglit. What we could do is add a
> special exit code to igt that indicates a runtime problem (and tells piglit to
> reboot), essentially PLEASE_REBOOT instead of SKIP (which is what
> igt_require_gem() results in).
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> I hereby pull-out this patch.
> The idea of it was to know if we were already wedged at the beginning of testing, that would give us information on how to interpret silly results; such that test starting to get skipped and/or we got dmesg-warns/incomplete on tests that usually should be skipped.
> Also, we are planning to soon deploy a piglit.conf solution where testing will be terminated on wedged, so I agree that my test isn't really needed.

Not everything is broken by wedged; internally we just use that as an
indicator that GEM is hosed. KMS should still work, we must still be
able to drive the displays to show the error and keep the servers alive
until the data is saved (and hopefully gracefully degrade that we don't
have to interrupt their immediate session).

Whilst that may be out of scope for BAT, it should not be completely
beyond us for robustness testing.
-Chris
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> Sent: Tuesday, May 16, 2017 11:21 AM
> To: Lofstedt, Marta <marta.lofstedt@intel.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> the state of the GPU
> 
> On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > I hereby pull-out this patch.
> > The idea of it was to know if we were already wedged at the beginning of
> testing, that would give us information on how to interpret silly results; such
> that test starting to get skipped and/or we got dmesg-warns/incomplete on
> tests that usually should be skipped.
> > Also, we are planning to soon deploy a piglit.conf solution where testing
> will be terminated on wedged, so I agree that my test isn't really needed.
> 
> Not everything is broken by wedged; internally we just use that as an
> indicator that GEM is hosed. KMS should still work, we must still be able to
> drive the displays to show the error and keep the servers alive until the data
> is saved (and hopefully gracefully degrade that we don't have to interrupt
> their immediate session).

It doesn't matter if it is broken or not, if we are terminally wedged the rest of the result may be silly. Look for example at CI_DRM_2612, the fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where the machine hangs, but it is a gem test so it should have been skipped, right. My conclusion from seeing this pattern multiple times is that after terminally wedged, silly things can happen, i.e. we can't trust the results, and since we don't want silly bugs, the CI testing should be stopped.

> 
> Whilst that may be out of scope for BAT, it should not be completely beyond
> us for robustness testing.
> -Chris
> 
> --
> Chris Wilson, Intel Open Source Technology Centre
On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > Sent: Tuesday, May 16, 2017 11:21 AM
> > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > I hereby pull-out this patch.
> > > The idea of it was to know if we were already wedged at the beginning of
> > testing, that would give us information on how to interpret silly results; such
> > that test starting to get skipped and/or we got dmesg-warns/incomplete on
> > tests that usually should be skipped.
> > > Also, we are planning to soon deploy a piglit.conf solution where testing
> > will be terminated on wedged, so I agree that my test isn't really needed.
> > 
> > Not everything is broken by wedged; internally we just use that as an
> > indicator that GEM is hosed. KMS should still work, we must still be able to
> > drive the displays to show the error and keep the servers alive until the data
> > is saved (and hopefully gracefully degrade that we don't have to interrupt
> > their immediate session).
> 
> It doesn't matter if it is broken or not, if we are terminally wedged the rest of the result may be silly. Look for example at CI_DRM_2612, the fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where the machine hangs, but it is a gem test so it should have been skipped, right. My conclusion from seeing this pattern multiple times is that after terminally wedged, silly things can happen, i.e. we can't trust the results, and since we don't want silly bugs, the CI testing should be stopped.

The machine didn't hang, it was remotely killed because the run timed out.
-Chris
On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > Sent: Tuesday, May 16, 2017 11:21 AM
> > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > I hereby pull-out this patch.
> > > The idea of it was to know if we were already wedged at the beginning of
> > testing, that would give us information on how to interpret silly results; such
> > that test starting to get skipped and/or we got dmesg-warns/incomplete on
> > tests that usually should be skipped.
> > > Also, we are planning to soon deploy a piglit.conf solution where testing
> > will be terminated on wedged, so I agree that my test isn't really needed.
> > 
> > Not everything is broken by wedged; internally we just use that as an
> > indicator that GEM is hosed. KMS should still work, we must still be able to
> > drive the displays to show the error and keep the servers alive until the data
> > is saved (and hopefully gracefully degrade that we don't have to interrupt
> > their immediate session).
> 
> It doesn't matter if it is broken or not, if we are terminally wedged the rest of the result may be silly. Look for example at CI_DRM_2612, the fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where the machine hangs, but it is a gem test so it should have been skipped, right. My conclusion from seeing this pattern multiple times is that after terminally wedged, silly things can happen, i.e. we can't trust the results, and since we don't want silly bugs, the CI testing should be stopped.

Oh, and note it was running so slowly because of a KMS bug that is quite
common.
-Chris
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> Sent: Tuesday, May 16, 2017 12:04 PM
> To: Lofstedt, Marta <marta.lofstedt@intel.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> the state of the GPU
> 
> On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> >
> >
> > > -----Original Message-----
> > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a
> > > test to capture the state of the GPU
> > >
> > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > I hereby pull-out this patch.
> > > > The idea of it was to know if we were already wedged at the
> > > > beginning of
> > > testing, that would give us information on how to interpret silly
> > > results; such that test starting to get skipped and/or we got
> > > dmesg-warns/incomplete on tests that usually should be skipped.
> > > > Also, we are planning to soon deploy a piglit.conf solution where
> > > > testing
> > > will be terminated on wedged, so I agree that my test isn't really needed.
> > >
> > > Not everything is broken by wedged; internally we just use that as
> > > an indicator that GEM is hosed. KMS should still work, we must still
> > > be able to drive the displays to show the error and keep the servers
> > > alive until the data is saved (and hopefully gracefully degrade that
> > > we don't have to interrupt their immediate session).
> >
> > It doesn't matter if it is broken or not, if we are terminally wedged the rest
> of the result may be silly. Look for example at CI_DRM_2612, the fi-elk-e7500
> is wedged at igt@gem_busy@basic-hang-default, then all test are skipped
> until gem_exec_reloc@basic-cpu-gtt-noreloc where the machine hangs, but
> it is a gem test so it should have been skipped, right. My conclusion from
> seeing this pattern multiple times is that after terminally wedged, silly things
> can happen, i.e. we can't trust the results, and since we don't want silly bugs,
> the CI testing should be stopped.
> 
> The machine didn't hang, it was remotely killed because the run timed out.
How do you know that?
> -Chris
> 
> --
> Chris Wilson, Intel Open Source Technology Centre
On Tue, May 16, 2017 at 09:43:52AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > Sent: Tuesday, May 16, 2017 12:04 PM
> > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a
> > > > test to capture the state of the GPU
> > > >
> > > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > > I hereby pull-out this patch.
> > > > > The idea of it was to know if we were already wedged at the
> > > > > beginning of
> > > > testing, that would give us information on how to interpret silly
> > > > results; such that test starting to get skipped and/or we got
> > > > dmesg-warns/incomplete on tests that usually should be skipped.
> > > > > Also, we are planning to soon deploy a piglit.conf solution where
> > > > > testing
> > > > will be terminated on wedged, so I agree that my test isn't really needed.
> > > >
> > > > Not everything is broken by wedged; internally we just use that as
> > > > an indicator that GEM is hosed. KMS should still work, we must still
> > > > be able to drive the displays to show the error and keep the servers
> > > > alive until the data is saved (and hopefully gracefully degrade that
> > > > we don't have to interrupt their immediate session).
> > >
> > > It doesn't matter if it is broken or not, if we are terminally wedged the rest
> > of the result may be silly. Look for example at CI_DRM_2612, the fi-elk-e7500
> > is wedged at igt@gem_busy@basic-hang-default, then all test are skipped
> > until gem_exec_reloc@basic-cpu-gtt-noreloc where the machine hangs, but
> > it is a gem test so it should have been skipped, right. My conclusion from
> > seeing this pattern multiple times is that after terminally wedged, silly things
> > can happen, i.e. we can't trust the results, and since we don't want silly bugs,
> > the CI testing should be stopped.
> > 
> > The machine didn't hang, it was remotely killed because the run timed out.
> How do you know that?

The dmesg is a stream of flip timeouts until we run out of total BAT
runtime (12 minutes + some startup slack).
-Chris
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> Sent: Tuesday, May 16, 2017 12:48 PM
> To: Lofstedt, Marta <marta.lofstedt@intel.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> the state of the GPU
> 
> On Tue, May 16, 2017 at 09:43:52AM +0000, Lofstedt, Marta wrote:
> >
> >
> > > -----Original Message-----
> > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > Sent: Tuesday, May 16, 2017 12:04 PM
> > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a
> > > test to capture the state of the GPU
> > >
> > > On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add
> > > > > a test to capture the state of the GPU
> > > > >
> > > > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > > > I hereby pull-out this patch.
> > > > > > The idea of it was to know if we were already wedged at the
> > > > > > beginning of
> > > > > testing, that would give us information on how to interpret
> > > > > silly results; such that test starting to get skipped and/or we
> > > > > got dmesg-warns/incomplete on tests that usually should be skipped.
> > > > > > Also, we are planning to soon deploy a piglit.conf solution
> > > > > > where testing
> > > > > will be terminated on wedged, so I agree that my test isn't really
> needed.
> > > > >
> > > > > Not everything is broken by wedged; internally we just use that
> > > > > as an indicator that GEM is hosed. KMS should still work, we
> > > > > must still be able to drive the displays to show the error and
> > > > > keep the servers alive until the data is saved (and hopefully
> > > > > gracefully degrade that we don't have to interrupt their immediate
> session).
> > > >
> > > > It doesn't matter if it is broken or not, if we are terminally
> > > > wedged the rest
> > > of the result may be silly. Look for example at CI_DRM_2612, the
> > > fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all
> > > test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where
> > > the machine hangs, but it is a gem test so it should have been
> > > skipped, right. My conclusion from seeing this pattern multiple
> > > times is that after terminally wedged, silly things can happen, i.e.
> > > we can't trust the results, and since we don't want silly bugs, the CI
> testing should be stopped.
> > >
> > > The machine didn't hang, it was remotely killed because the run timed
> out.
> > How do you know that?
> 
> The dmesg is a stream of flip timeouts until we run out of total BAT runtime
> (12 minutes + some startup slack).
> -Chris

Then look at CI_DRM_2602, wedged at igt@gem_busy@basic-hang-default, after a lot of skipping, we get incomplete result for another test, this time gem_exec_reloc@basic-gtt-cpu-noreloc

So, gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu-noreloc are falsely getting blamed and my conclusion is that this is due to the permanent wedging started at gem_busy@basic-hang-default. So, to avoid bug reports for gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu- noreloc the suggestion is to stop testing after we are terminally wedged. 

> 
> --
> Chris Wilson, Intel Open Source Technology Centre
On Tue, May 16, 2017 at 10:07:41AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > Sent: Tuesday, May 16, 2017 12:48 PM
> > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 09:43:52AM +0000, Lofstedt, Marta wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > > Sent: Tuesday, May 16, 2017 12:04 PM
> > > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a
> > > > test to capture the state of the GPU
> > > >
> > > > On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> > > > > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > > > > To: Lofstedt, Marta <marta.lofstedt@intel.com>
> > > > > > Cc: Daniel Vetter <daniel@ffwll.ch>; Martin Peres
> > > > > > <martin.peres@linux.intel.com>; intel-gfx@lists.freedesktop.org
> > > > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add
> > > > > > a test to capture the state of the GPU
> > > > > >
> > > > > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > > > > I hereby pull-out this patch.
> > > > > > > The idea of it was to know if we were already wedged at the
> > > > > > > beginning of
> > > > > > testing, that would give us information on how to interpret
> > > > > > silly results; such that test starting to get skipped and/or we
> > > > > > got dmesg-warns/incomplete on tests that usually should be skipped.
> > > > > > > Also, we are planning to soon deploy a piglit.conf solution
> > > > > > > where testing
> > > > > > will be terminated on wedged, so I agree that my test isn't really
> > needed.
> > > > > >
> > > > > > Not everything is broken by wedged; internally we just use that
> > > > > > as an indicator that GEM is hosed. KMS should still work, we
> > > > > > must still be able to drive the displays to show the error and
> > > > > > keep the servers alive until the data is saved (and hopefully
> > > > > > gracefully degrade that we don't have to interrupt their immediate
> > session).
> > > > >
> > > > > It doesn't matter if it is broken or not, if we are terminally
> > > > > wedged the rest
> > > > of the result may be silly. Look for example at CI_DRM_2612, the
> > > > fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all
> > > > test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where
> > > > the machine hangs, but it is a gem test so it should have been
> > > > skipped, right. My conclusion from seeing this pattern multiple
> > > > times is that after terminally wedged, silly things can happen, i.e.
> > > > we can't trust the results, and since we don't want silly bugs, the CI
> > testing should be stopped.
> > > >
> > > > The machine didn't hang, it was remotely killed because the run timed
> > out.
> > > How do you know that?
> > 
> > The dmesg is a stream of flip timeouts until we run out of total BAT runtime
> > (12 minutes + some startup slack).
> > -Chris
> 
> Then look at CI_DRM_2602, wedged at igt@gem_busy@basic-hang-default, after a lot of skipping, we get incomplete result for another test, this time gem_exec_reloc@basic-gtt-cpu-noreloc
> 
> So, gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu-noreloc are falsely getting blamed and my conclusion is that this is due to the permanent wedging started at gem_busy@basic-hang-default. So, to avoid bug reports for gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu- noreloc the suggestion is to stop testing after we are terminally wedged. 

The conclusion is still wrong though, it doesn't derive from the wedged
state itself but from another bug that can be triggered by a recovered
gpu hang. The issue is that post-processing isn't yet smart enough to
determine the real cause and picks on the victim. We are always going to
have that problem in some form, the question is what can be done to make
the process smart to avoid false positives or rather just say this is a
general bug and be careful not to be precise on the blame.
-Chris