[i-g-t] runner: Make taint aborts more verbose

Submitted by Arkadiusz Hiler on Nov. 26, 2018, 8:52 a.m.

Details

Message ID 20181126085222.12001-1-arkadiusz.hiler@intel.com
State New
Series "runner: Make taint aborts more verbose"
Headers show

Commit Message

Arkadiusz Hiler Nov. 26, 2018, 8:52 a.m.
Provide the reader with the taint names and a short explanation, as well
as point in the direction of the dmesg for more details.

Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
---
 runner/executor.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

Patch hide | download patch | download mbox

diff --git a/runner/executor.c b/runner/executor.c
index 2038c3fd..954cd9a7 100644
--- a/runner/executor.c
+++ b/runner/executor.c
@@ -139,24 +139,46 @@  static char *handle_lockdep(void)
 
 static char *handle_taint(void)
 {
-	const unsigned long bad_taints =
-		0x20  | /* TAINT_PAGE */
-		0x80  | /* TAINT_DIE */
-		0x200; /* TAINT_OOPS */
-	unsigned long taints = 0;
+	/* see Linux's include/linux/kernel.h */
+	static const struct {
+		unsigned long bit;
+		const char *explanation;
+	} taints[] = {
+	  {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
+	  {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
+	  {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
+	  {0, 0}};
+
+	unsigned long system_taints = 0;
 	char *reason = NULL;
 	FILE *f;
 
+	unsigned long bad_taints = 0;
+	for (typeof(*taints) *taint = taints; taint->bit; taint++)
+		bad_taints |= taint->bit;
+
 	f = fopen("/proc/sys/kernel/tainted", "r");
 	if (f) {
-		fscanf(f, "%lu", &taints);
+		fscanf(f, "%lu", &system_taints);
 		fclose(f);
 	}
 
-	if (taints & bad_taints)
-		asprintf(&reason,
-			 "Kernel tainted (%#lx -- %lx)",
-			 taints, taints & bad_taints);
+	if (system_taints & bad_taints) {
+		asprintf(&reason, "Kernel badly tainted (%#lx) "
+			 "(check dmesg for details if possible):\n",
+			 system_taints);
+
+		for (typeof(*taints) *taint = taints; taint->bit; taint++) {
+			if (taint->bit & system_taints) {
+				char *old_reason = reason;
+				asprintf(&reason, "%s\t(%#lx) %s\n",
+						old_reason,
+						taint->bit,
+						taint->explanation);
+				free(old_reason);
+			}
+		}
+	}
 
 	return reason;
 }

Comments

Chris Wilson Nov. 26, 2018, 9:12 a.m.
Quoting Arkadiusz Hiler (2018-11-26 08:52:22)
> Provide the reader with the taint names and a short explanation, as well
> as point in the direction of the dmesg for more details.
> 
> Cc: Petri Latvala <petri.latvala@intel.com>
> Cc: Martin Peres <martin.peres@linux.intel.com>
> Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> ---
>  runner/executor.c | 42 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 32 insertions(+), 10 deletions(-)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index 2038c3fd..954cd9a7 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -139,24 +139,46 @@ static char *handle_lockdep(void)
>  
>  static char *handle_taint(void)
>  {
> -       const unsigned long bad_taints =
> -               0x20  | /* TAINT_PAGE */
> -               0x80  | /* TAINT_DIE */
> -               0x200; /* TAINT_OOPS */
> -       unsigned long taints = 0;
> +       /* see Linux's include/linux/kernel.h */
> +       static const struct {
> +               unsigned long bit;
> +               const char *explanation;
> +       } taints[] = {
> +         {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> +         {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
> +         {(1 << 9), "TAINT_WARN: WARN_ON has happened."},

Pure noise, these don't add any useful information, so why? The only
information is the explanation in dmesg. Checking tainted is just
shorthand for us to determine a significant event.

If you want to add the kmsg output filtered by WARN+, that would be more
useful than hinting to the user they need to do it themselves. But that
should be redundant as that is already part of the runner -- so is not
the problem the actual integration, that this appears as a separate test
and not flagging the igt itself, with exceptions to handle tainted
before a test?
-Chris
Arkadiusz Hiler Nov. 26, 2018, 10:36 a.m.
On Mon, Nov 26, 2018 at 09:12:46AM +0000, Chris Wilson wrote:
> Quoting Arkadiusz Hiler (2018-11-26 08:52:22)
> > Provide the reader with the taint names and a short explanation, as well
> > as point in the direction of the dmesg for more details.
> > 
> > Cc: Petri Latvala <petri.latvala@intel.com>
> > Cc: Martin Peres <martin.peres@linux.intel.com>
> > Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> > ---
> >  runner/executor.c | 42 ++++++++++++++++++++++++++++++++----------
> >  1 file changed, 32 insertions(+), 10 deletions(-)
> > 
> > diff --git a/runner/executor.c b/runner/executor.c
> > index 2038c3fd..954cd9a7 100644
> > --- a/runner/executor.c
> > +++ b/runner/executor.c
> > @@ -139,24 +139,46 @@ static char *handle_lockdep(void)
> >  
> >  static char *handle_taint(void)
> >  {
> > -       const unsigned long bad_taints =
> > -               0x20  | /* TAINT_PAGE */
> > -               0x80  | /* TAINT_DIE */
> > -               0x200; /* TAINT_OOPS */
> > -       unsigned long taints = 0;
> > +       /* see Linux's include/linux/kernel.h */
> > +       static const struct {
> > +               unsigned long bit;
> > +               const char *explanation;
> > +       } taints[] = {
> > +         {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> > +         {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
> > +         {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
> 
> Pure noise, these don't add any useful information, so why? The only
> information is the explanation in dmesg. Checking tainted is just
> shorthand for us to determine a significant event.

People not knowing what taint means (or that it is a "bit field") were
getting confused by those aborts. I just got tired of explaining that we
don't check for "unsafe module option taint" which is usually the first
result they get while greping dmesg for "taint".

> If you want to add the kmsg output filtered by WARN+, that would be more
> useful than hinting to the user they need to do it themselves. But that
> should be redundant as that is already part of the runner -- so is not
> the problem the actual integration, that this appears as a separate test
> and not flagging the igt itself, with exceptions to handle tainted
> before a test?
> -Chris

Providing more logs, as a part of the aborted results, indeed sounds
good, but it is harder to implement, as we would have to keep track of
dmesg between taint checks too. Petri, what do you think?

I believe that this is, for now, a good middle ground. We give people
more context, explain what we hit so they know that they have to look
for a warn, oops or a slab spill in the logs.
Petri Latvala Nov. 26, 2018, 12:22 p.m.
On Mon, Nov 26, 2018 at 12:36:44PM +0200, Arkadiusz Hiler wrote:
> On Mon, Nov 26, 2018 at 09:12:46AM +0000, Chris Wilson wrote:
> > Quoting Arkadiusz Hiler (2018-11-26 08:52:22)
> > > Provide the reader with the taint names and a short explanation, as well
> > > as point in the direction of the dmesg for more details.
> > > 
> > > Cc: Petri Latvala <petri.latvala@intel.com>
> > > Cc: Martin Peres <martin.peres@linux.intel.com>
> > > Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> > > ---
> > >  runner/executor.c | 42 ++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 32 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/runner/executor.c b/runner/executor.c
> > > index 2038c3fd..954cd9a7 100644
> > > --- a/runner/executor.c
> > > +++ b/runner/executor.c
> > > @@ -139,24 +139,46 @@ static char *handle_lockdep(void)
> > >  
> > >  static char *handle_taint(void)
> > >  {
> > > -       const unsigned long bad_taints =
> > > -               0x20  | /* TAINT_PAGE */
> > > -               0x80  | /* TAINT_DIE */
> > > -               0x200; /* TAINT_OOPS */
> > > -       unsigned long taints = 0;
> > > +       /* see Linux's include/linux/kernel.h */
> > > +       static const struct {
> > > +               unsigned long bit;
> > > +               const char *explanation;
> > > +       } taints[] = {
> > > +         {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> > > +         {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
> > > +         {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
> > 
> > Pure noise, these don't add any useful information, so why? The only
> > information is the explanation in dmesg. Checking tainted is just
> > shorthand for us to determine a significant event.
> 
> People not knowing what taint means (or that it is a "bit field") were
> getting confused by those aborts. I just got tired of explaining that we
> don't check for "unsafe module option taint" which is usually the first
> result they get while greping dmesg for "taint".


Yeah this has been causing confusion in all kinds of circles.

In general it's always better to explain numbers in outputs meant for
humans. Surely if people ask about stuff, adding explanations to it is
useful information by definition?


> 
> > If you want to add the kmsg output filtered by WARN+, that would be more
> > useful than hinting to the user they need to do it themselves. But that
> > should be redundant as that is already part of the runner -- so is not
> > the problem the actual integration, that this appears as a separate test
> > and not flagging the igt itself, with exceptions to handle tainted
> > before a test?
> > -Chris
> 
> Providing more logs, as a part of the aborted results, indeed sounds
> good, but it is harder to implement, as we would have to keep track of
> dmesg between taint checks too. Petri, what do you think?


It's possible [citation needed] to inject the previous test
execution's dmesg into the aborted test result, but only in the form
of igt_runner CI is using today. Resuming, especially when using
multiple-mode, will make it harder. Actually, even today CI is not
fully able to visualize shard results when multiple shards abort.

What might be easier is injecting the aborting blame into the
previously executed test but even that is not a one-liner.

Either way, aborting and its reporting is not done and abandoned yet. WIP.