[i-g-t,v2] runner: Make taint abort messages more verbose

Submitted by Arkadiusz Hiler on Dec. 3, 2018, 3 p.m.

Details

Message ID 20181203150012.26768-1-arkadiusz.hiler@intel.com
State New
Series "runner: Make taint abort messages more verbose"
Headers show

Commit Message

Arkadiusz Hiler Dec. 3, 2018, 3 p.m.
Since not everyone is familiar with kernel taints, and it is easy to get
confused and attribute an abort to an innocent TAINT_USER caused by an
unsafe module option, which is usually the first thing people find
greping dmesg for "taint", we should provide more guidance.

This patch extends the abort log by printing the taint names, as found
in the kernel, along with a short explanation, so people know what to
look for in the dmesg.

v2: rebase, reword

Cc: Petri Latvala <petri.latvala@intel.com>
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
---
 runner/executor.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

Patch hide | download patch | download mbox

diff --git a/runner/executor.c b/runner/executor.c
index 54c530b7..c0639a66 100644
--- a/runner/executor.c
+++ b/runner/executor.c
@@ -137,13 +137,23 @@  static char *handle_lockdep(void)
 	return NULL;
 }
 
+/* see Linux's include/linux/kernel.h */
+static const struct {
+	unsigned long bit;
+	const char *explanation;
+} abort_taints[] = {
+  {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
+  {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
+  {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
+  {0, 0}};
+
 static unsigned long tainted(unsigned long *taints)
 {
-	const unsigned long bad_taints =
-		0x20  | /* TAINT_PAGE */
-		0x80  | /* TAINT_DIE */
-		0x200; /* TAINT_OOPS */
 	FILE *f;
+	unsigned long bad_taints = 0;
+
+	for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++)
+		bad_taints |= taint->bit;
 
 	*taints = 0;
 
@@ -158,14 +168,26 @@  static unsigned long tainted(unsigned long *taints)
 
 static char *handle_taint(void)
 {
-	unsigned long taints, bad_taints;
+	unsigned long taints;
 	char *reason;
 
-	bad_taints = tainted(&taints);
-	if (!bad_taints)
+	if (!tainted(&taints))
 		return NULL;
 
-	asprintf(&reason, "Kernel tainted (%#lx -- %lx)", taints, bad_taints);
+	asprintf(&reason, "Kernel badly tainted (%#lx) (check dmesg for details):\n",
+		 taints);
+
+	for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++) {
+		if (taint->bit & taints) {
+			char *old_reason = reason;
+			asprintf(&reason, "%s\t(%#lx) %s\n",
+					old_reason,
+					taint->bit,
+					taint->explanation);
+			free(old_reason);
+		}
+	}
+
 	return reason;
 }
 

Comments

Chris Wilson Dec. 3, 2018, 3:09 p.m.
Quoting Arkadiusz Hiler (2018-12-03 15:00:12)
> Since not everyone is familiar with kernel taints, and it is easy to get
> confused and attribute an abort to an innocent TAINT_USER caused by an
> unsafe module option, which is usually the first thing people find
> greping dmesg for "taint", we should provide more guidance.
> 
> This patch extends the abort log by printing the taint names, as found
> in the kernel, along with a short explanation, so people know what to
> look for in the dmesg.
> 
> v2: rebase, reword
> 
> Cc: Petri Latvala <petri.latvala@intel.com>
> Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> ---
>  runner/executor.c | 38 ++++++++++++++++++++++++++++++--------
>  1 file changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index 54c530b7..c0639a66 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -137,13 +137,23 @@ static char *handle_lockdep(void)
>         return NULL;
>  }
>  
> +/* see Linux's include/linux/kernel.h */
> +static const struct {
> +       unsigned long bit;
> +       const char *explanation;
> +} abort_taints[] = {
> +  {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> +  {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
> +  {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
> +  {0, 0}};
> +
>  static unsigned long tainted(unsigned long *taints)
>  {
> -       const unsigned long bad_taints =
> -               0x20  | /* TAINT_PAGE */
> -               0x80  | /* TAINT_DIE */
> -               0x200; /* TAINT_OOPS */
>         FILE *f;
> +       unsigned long bad_taints = 0;
> +
> +       for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++)
> +               bad_taints |= taint->bit;
>  
>         *taints = 0;
>  
> @@ -158,14 +168,26 @@ static unsigned long tainted(unsigned long *taints)
>  
>  static char *handle_taint(void)
>  {
> -       unsigned long taints, bad_taints;
> +       unsigned long taints;
>         char *reason;
>  
> -       bad_taints = tainted(&taints);
> -       if (!bad_taints)
> +       if (!tainted(&taints))
>                 return NULL;
>  
> -       asprintf(&reason, "Kernel tainted (%#lx -- %lx)", taints, bad_taints);
> +       asprintf(&reason, "Kernel badly tainted (%#lx) (check dmesg for details):\n",
> +                taints);
> +
> +       for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++) {
> +               if (taint->bit & taints) {
> +                       char *old_reason = reason;
> +                       asprintf(&reason, "%s\t(%#lx) %s\n",
> +                                       old_reason,
> +                                       taint->bit,
> +                                       taint->explanation);
> +                       free(old_reason);

Still this gives no more information, than having to manually check
dmesg for the explanation.

	"TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
	"TAINT_DIE: Kernel has died - BUG/OOPS."},
	"TAINT_WARN: WARN_ON has happened."},

are quite frankly misleading, and not informative.

And since the runner is handling the dmesg, do the work to present the
relevant information.
-Chris
Arkadiusz Hiler Dec. 3, 2018, 3:45 p.m.
On Mon, Dec 03, 2018 at 03:09:29PM +0000, Chris Wilson wrote:
> Quoting Arkadiusz Hiler (2018-12-03 15:00:12)
> > Since not everyone is familiar with kernel taints, and it is easy to get
> > confused and attribute an abort to an innocent TAINT_USER caused by an
> > unsafe module option, which is usually the first thing people find
> > greping dmesg for "taint", we should provide more guidance.
> > 
> > This patch extends the abort log by printing the taint names, as found
> > in the kernel, along with a short explanation, so people know what to
> > look for in the dmesg.
> > 
> > v2: rebase, reword
> > 
> > Cc: Petri Latvala <petri.latvala@intel.com>
> > Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> > ---
> >  runner/executor.c | 38 ++++++++++++++++++++++++++++++--------
> >  1 file changed, 30 insertions(+), 8 deletions(-)
> > 
> > diff --git a/runner/executor.c b/runner/executor.c
> > index 54c530b7..c0639a66 100644
> > --- a/runner/executor.c
> > +++ b/runner/executor.c
> > @@ -137,13 +137,23 @@ static char *handle_lockdep(void)
> >         return NULL;
> >  }
> >  
> > +/* see Linux's include/linux/kernel.h */
> > +static const struct {
> > +       unsigned long bit;
> > +       const char *explanation;
> > +} abort_taints[] = {
> > +  {(1 << 5), "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> > +  {(1 << 7), "TAINT_DIE: Kernel has died - BUG/OOPS."},
> > +  {(1 << 9), "TAINT_WARN: WARN_ON has happened."},
> > +  {0, 0}};
> > +
> >  static unsigned long tainted(unsigned long *taints)
> >  {
> > -       const unsigned long bad_taints =
> > -               0x20  | /* TAINT_PAGE */
> > -               0x80  | /* TAINT_DIE */
> > -               0x200; /* TAINT_OOPS */
> >         FILE *f;
> > +       unsigned long bad_taints = 0;
> > +
> > +       for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++)
> > +               bad_taints |= taint->bit;
> >  
> >         *taints = 0;
> >  
> > @@ -158,14 +168,26 @@ static unsigned long tainted(unsigned long *taints)
> >  
> >  static char *handle_taint(void)
> >  {
> > -       unsigned long taints, bad_taints;
> > +       unsigned long taints;
> >         char *reason;
> >  
> > -       bad_taints = tainted(&taints);
> > -       if (!bad_taints)
> > +       if (!tainted(&taints))
> >                 return NULL;
> >  
> > -       asprintf(&reason, "Kernel tainted (%#lx -- %lx)", taints, bad_taints);
> > +       asprintf(&reason, "Kernel badly tainted (%#lx) (check dmesg for details):\n",
> > +                taints);
> > +
> > +       for (typeof(*abort_taints) *taint = abort_taints; taint->bit; taint++) {
> > +               if (taint->bit & taints) {
> > +                       char *old_reason = reason;
> > +                       asprintf(&reason, "%s\t(%#lx) %s\n",
> > +                                       old_reason,
> > +                                       taint->bit,
> > +                                       taint->explanation);
> > +                       free(old_reason);
> 
> Still this gives no more information, than having to manually check
> dmesg for the explanation.

I still stand by what has been said last week:
https://lists.freedesktop.org/archives/igt-dev/2018-November/007128.html

I have got two more questions on why are we aborting on unsafe module
options since then.

> 	"TAINT_BAD_PAGE: Bad page reference or an unexpected page flags."},
> 	"TAINT_DIE: Kernel has died - BUG/OOPS."},
> 	"TAINT_WARN: WARN_ON has happened."},
> 
> are quite frankly misleading, and not informative.

Any suggestions on how to make them less misleading?

> And since the runner is handling the dmesg, do the work to present the
> relevant information.
> -Chris

WIP by Petri, but that is going to take a while.