[i-g-t,3/9] lib: Don't leak children in igt_waitchildren_timeout

Submitted by Daniel Vetter on Feb. 11, 2019, 6:02 p.m.

Details

Message ID 20190211180208.29406-3-daniel.vetter@ffwll.ch
State New
Headers show
Series "Series without cover letter" ( rev: 1 ) in IGT

Not browsing as part of any series.

Commit Message

Daniel Vetter Feb. 11, 2019, 6:02 p.m.
Instead of cleaning up the mess in igt_exit make sure we don't even
let it out of the container. See also

commit 754876378d6c9b2775e8c07b4d16f9878c55949f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 26 22:11:10 2016 +0000

    igt/gem_sync: Enforce a timeout of 20s

which added this helper.

To make sure that everyone follows the rules, add an assert.

We're keeping the cleanup code as a failsafe, and because it speeds
up the testcase I'm following up with.

v2: Chris pointed out that my original patch did nothing. Which I
didn't catch because my testcase was also broken. Unfortunately this
means we need to open code part of the waiting.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 lib/igt_core.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/lib/igt_core.c b/lib/igt_core.c
index cbbe79f88070..3053697da58c 100644
--- a/lib/igt_core.c
+++ b/lib/igt_core.c
@@ -1525,6 +1525,7 @@  void igt_exit(void)
 
 	for (int c = 0; c < num_test_children; c++)
 		kill(test_children[c], SIGKILL);
+	assert(!num_test_children);
 
 	if (!test_with_subtests) {
 		struct timespec now;
@@ -1832,20 +1833,48 @@  void igt_waitchildren(void)
 		igt_fail(err);
 }
 
+static bool igt_killchidren_timed_out;
+
+static void igt_alarm_killchildren(int signal)
+{
+	igt_info("Timed out waiting for children\n");
+
+	igt_killchidren_timed_out = true;
+
+	for (int c = 0; c < num_test_children; c++)
+		kill(test_children[c], SIGKILL);
+}
+
 /**
  * igt_waitchildren_timeout:
  * @seconds: timeout in seconds to wait
  * @reason: debug string explaining what timedout
  *
- * Wait for all children forked with igt_fork, for a maximum of @seconds.
- *
- * Wraps igt_waitchildren() and igt_set_timeout()
+ * Wait for all children forked with igt_fork, for a maximum of @seconds. If the
+ * timeout expires, kills all children and cleans them up.
  */
 void igt_waitchildren_timeout(int seconds, const char *reason)
 {
-	igt_set_timeout(seconds, reason);
-	igt_waitchildren();
+	struct sigaction sa;
+	int ret;
+
+	sa.sa_handler = igt_alarm_killchildren;
+	sigemptyset(&sa.sa_mask);
+	sa.sa_flags = 0;
+
+	igt_killchidren_timed_out = false;
+
+	sigaction(SIGALRM, &sa, NULL);
+
+	alarm(seconds);
+
+	ret = __igt_waitchildren();
+	if (!igt_killchidren_timed_out && ret)
+		igt_fail(ret);
 	igt_reset_timeout();
+	__igt_waitchildren();
+	if (igt_killchidren_timed_out)
+		igt_fail(IGT_EXIT_FAILURE);
 }
 
 /* exit handler code */

Comments

Quoting Daniel Vetter (2019-02-11 18:02:02)
> Instead of cleaning up the mess in igt_exit make sure we don't even
> let it out of the container. See also
> 
> commit 754876378d6c9b2775e8c07b4d16f9878c55949f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Feb 26 22:11:10 2016 +0000
> 
>     igt/gem_sync: Enforce a timeout of 20s
> 
> which added this helper.
> 
> To make sure that everyone follows the rules, add an assert.
> 
> We're keeping the cleanup code as a failsafe, and because it speeds
> up the testcase I'm following up with.
> 
> v2: Chris pointed out that my original patch did nothing. Which I
> didn't catch because my testcase was also broken. Unfortunately this
> means we need to open code part of the waiting.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  lib/igt_core.c | 39 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/igt_core.c b/lib/igt_core.c
> index cbbe79f88070..3053697da58c 100644
> --- a/lib/igt_core.c
> +++ b/lib/igt_core.c
> @@ -1525,6 +1525,7 @@ void igt_exit(void)
>  
>         for (int c = 0; c < num_test_children; c++)
>                 kill(test_children[c], SIGKILL);
> +       assert(!num_test_children);
>  
>         if (!test_with_subtests) {
>                 struct timespec now;
> @@ -1832,20 +1833,48 @@ void igt_waitchildren(void)
>                 igt_fail(err);
>  }
>  
> +static bool igt_killchidren_timed_out;
> +
> +static void igt_alarm_killchildren(int signal)
> +{
> +       igt_info("Timed out waiting for children\n");

We used to print the caller supplied reason. Although that appears to
have never been used, so might as well drop it from the API.

> +
> +       igt_killchidren_timed_out = true;
> +
> +       for (int c = 0; c < num_test_children; c++)
> +               kill(test_children[c], SIGKILL);

Ok, kill() is signal-safe.

> +}
> +
>  /**
>   * igt_waitchildren_timeout:
>   * @seconds: timeout in seconds to wait
>   * @reason: debug string explaining what timedout
>   *
> - * Wait for all children forked with igt_fork, for a maximum of @seconds.
> - *
> - * Wraps igt_waitchildren() and igt_set_timeout()
> + * Wait for all children forked with igt_fork, for a maximum of @seconds. If the
> + * timeout expires, kills all children and cleans them up.
>   */
>  void igt_waitchildren_timeout(int seconds, const char *reason)
>  {
> -       igt_set_timeout(seconds, reason);
> -       igt_waitchildren();
> +       struct sigaction sa;
> +       int ret;
> +
> +       sa.sa_handler = igt_alarm_killchildren;
> +       sigemptyset(&sa.sa_mask);
> +       sa.sa_flags = 0;
> +
> +       igt_killchidren_timed_out = false;
> +
> +       sigaction(SIGALRM, &sa, NULL);
> +
> +       alarm(seconds);
> +
> +       ret = __igt_waitchildren();
> +       if (!igt_killchidren_timed_out && ret)
> +               igt_fail(ret);
>         igt_reset_timeout();
> +       __igt_waitchildren();
> +       if (igt_killchidren_timed_out)
> +               igt_fail(IGT_EXIT_FAILURE);

Double __igt_waitchildren()? My reading of __igt_waitchildren() is that
it will continue on after the alarm() wakes wait() up with SIGINT and
repeat the wait() until all children die. And now those children will
die, rather than the parent as before.
-Chris
On Mon, Feb 11, 2019 at 09:03:12PM +0000, Chris Wilson wrote:
> Quoting Daniel Vetter (2019-02-11 18:02:02)
> > Instead of cleaning up the mess in igt_exit make sure we don't even
> > let it out of the container. See also
> > 
> > commit 754876378d6c9b2775e8c07b4d16f9878c55949f
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Fri Feb 26 22:11:10 2016 +0000
> > 
> >     igt/gem_sync: Enforce a timeout of 20s
> > 
> > which added this helper.
> > 
> > To make sure that everyone follows the rules, add an assert.
> > 
> > We're keeping the cleanup code as a failsafe, and because it speeds
> > up the testcase I'm following up with.
> > 
> > v2: Chris pointed out that my original patch did nothing. Which I
> > didn't catch because my testcase was also broken. Unfortunately this
> > means we need to open code part of the waiting.
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  lib/igt_core.c | 39 ++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 34 insertions(+), 5 deletions(-)
> > 
> > diff --git a/lib/igt_core.c b/lib/igt_core.c
> > index cbbe79f88070..3053697da58c 100644
> > --- a/lib/igt_core.c
> > +++ b/lib/igt_core.c
> > @@ -1525,6 +1525,7 @@ void igt_exit(void)
> >  
> >         for (int c = 0; c < num_test_children; c++)
> >                 kill(test_children[c], SIGKILL);
> > +       assert(!num_test_children);
> >  
> >         if (!test_with_subtests) {
> >                 struct timespec now;
> > @@ -1832,20 +1833,48 @@ void igt_waitchildren(void)
> >                 igt_fail(err);
> >  }
> >  
> > +static bool igt_killchidren_timed_out;
> > +
> > +static void igt_alarm_killchildren(int signal)
> > +{
> > +       igt_info("Timed out waiting for children\n");
> 
> We used to print the caller supplied reason. Although that appears to
> have never been used, so might as well drop it from the API.
> 
> > +
> > +       igt_killchidren_timed_out = true;
> > +
> > +       for (int c = 0; c < num_test_children; c++)
> > +               kill(test_children[c], SIGKILL);
> 
> Ok, kill() is signal-safe.
> 
> > +}
> > +
> >  /**
> >   * igt_waitchildren_timeout:
> >   * @seconds: timeout in seconds to wait
> >   * @reason: debug string explaining what timedout
> >   *
> > - * Wait for all children forked with igt_fork, for a maximum of @seconds.
> > - *
> > - * Wraps igt_waitchildren() and igt_set_timeout()
> > + * Wait for all children forked with igt_fork, for a maximum of @seconds. If the
> > + * timeout expires, kills all children and cleans them up.
> >   */
> >  void igt_waitchildren_timeout(int seconds, const char *reason)
> >  {
> > -       igt_set_timeout(seconds, reason);
> > -       igt_waitchildren();
> > +       struct sigaction sa;
> > +       int ret;
> > +
> > +       sa.sa_handler = igt_alarm_killchildren;
> > +       sigemptyset(&sa.sa_mask);
> > +       sa.sa_flags = 0;
> > +
> > +       igt_killchidren_timed_out = false;
> > +
> > +       sigaction(SIGALRM, &sa, NULL);
> > +
> > +       alarm(seconds);
> > +
> > +       ret = __igt_waitchildren();
> > +       if (!igt_killchidren_timed_out && ret)
> > +               igt_fail(ret);
> >         igt_reset_timeout();
> > +       __igt_waitchildren();
> > +       if (igt_killchidren_timed_out)
> > +               igt_fail(IGT_EXIT_FAILURE);
> 
> Double __igt_waitchildren()? My reading of __igt_waitchildren() is that
> it will continue on after the alarm() wakes wait() up with SIGINT and
> repeat the wait() until all children die. And now those children will
> die, rather than the parent as before.

igt_waitchildren_timeout before this patch wouldn't work if it did that.
The pid == wait(); if (pid == -1) continue; bails out to the next child in
case of interrupts. Hence the wait; kill; wait sequence here. I discovered
this through testcases :-)
-Daniel
Quoting Daniel Vetter (2019-02-11 22:38:25)
> On Mon, Feb 11, 2019 at 09:03:12PM +0000, Chris Wilson wrote:
> > Quoting Daniel Vetter (2019-02-11 18:02:02)
> > > Instead of cleaning up the mess in igt_exit make sure we don't even
> > > let it out of the container. See also
> > > 
> > > commit 754876378d6c9b2775e8c07b4d16f9878c55949f
> > > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > > Date:   Fri Feb 26 22:11:10 2016 +0000
> > > 
> > >     igt/gem_sync: Enforce a timeout of 20s
> > > 
> > > which added this helper.
> > > 
> > > To make sure that everyone follows the rules, add an assert.
> > > 
> > > We're keeping the cleanup code as a failsafe, and because it speeds
> > > up the testcase I'm following up with.
> > > 
> > > v2: Chris pointed out that my original patch did nothing. Which I
> > > didn't catch because my testcase was also broken. Unfortunately this
> > > means we need to open code part of the waiting.
> > > 
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  lib/igt_core.c | 39 ++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 34 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/lib/igt_core.c b/lib/igt_core.c
> > > index cbbe79f88070..3053697da58c 100644
> > > --- a/lib/igt_core.c
> > > +++ b/lib/igt_core.c
> > > @@ -1525,6 +1525,7 @@ void igt_exit(void)
> > >  
> > >         for (int c = 0; c < num_test_children; c++)
> > >                 kill(test_children[c], SIGKILL);
> > > +       assert(!num_test_children);
> > >  
> > >         if (!test_with_subtests) {
> > >                 struct timespec now;
> > > @@ -1832,20 +1833,48 @@ void igt_waitchildren(void)
> > >                 igt_fail(err);
> > >  }
> > >  
> > > +static bool igt_killchidren_timed_out;
> > > +
> > > +static void igt_alarm_killchildren(int signal)
> > > +{
> > > +       igt_info("Timed out waiting for children\n");
> > 
> > We used to print the caller supplied reason. Although that appears to
> > have never been used, so might as well drop it from the API.
> > 
> > > +
> > > +       igt_killchidren_timed_out = true;
> > > +
> > > +       for (int c = 0; c < num_test_children; c++)
> > > +               kill(test_children[c], SIGKILL);
> > 
> > Ok, kill() is signal-safe.
> > 
> > > +}
> > > +
> > >  /**
> > >   * igt_waitchildren_timeout:
> > >   * @seconds: timeout in seconds to wait
> > >   * @reason: debug string explaining what timedout
> > >   *
> > > - * Wait for all children forked with igt_fork, for a maximum of @seconds.
> > > - *
> > > - * Wraps igt_waitchildren() and igt_set_timeout()
> > > + * Wait for all children forked with igt_fork, for a maximum of @seconds. If the
> > > + * timeout expires, kills all children and cleans them up.
> > >   */
> > >  void igt_waitchildren_timeout(int seconds, const char *reason)
> > >  {
> > > -       igt_set_timeout(seconds, reason);
> > > -       igt_waitchildren();
> > > +       struct sigaction sa;
> > > +       int ret;
> > > +
> > > +       sa.sa_handler = igt_alarm_killchildren;
> > > +       sigemptyset(&sa.sa_mask);
> > > +       sa.sa_flags = 0;
> > > +
> > > +       igt_killchidren_timed_out = false;
> > > +
> > > +       sigaction(SIGALRM, &sa, NULL);
> > > +
> > > +       alarm(seconds);
> > > +
> > > +       ret = __igt_waitchildren();
> > > +       if (!igt_killchidren_timed_out && ret)
> > > +               igt_fail(ret);
> > >         igt_reset_timeout();
> > > +       __igt_waitchildren();
> > > +       if (igt_killchidren_timed_out)
> > > +               igt_fail(IGT_EXIT_FAILURE);
> > 
> > Double __igt_waitchildren()? My reading of __igt_waitchildren() is that
> > it will continue on after the alarm() wakes wait() up with SIGINT and
> > repeat the wait() until all children die. And now those children will
> > die, rather than the parent as before.
> 
> igt_waitchildren_timeout before this patch wouldn't work if it did that.

Before this patch, the alarm handler did igt_fail -> exit(1) in the
parent.

> The pid == wait(); if (pid == -1) continue; bails out to the next child in
> case of interrupts. Hence the wait; kill; wait sequence here. I discovered
> this through testcases :-)

Now alarm -> wait() returning -1 + errno=EINTR, right? Then the
sighandler does killall -9, so on the next wait it sees that all the
children are dead.

__igt_waitchildren() even sets num_test_children = 0 on return, so the
second __igt_waitchildren() must do nothing. Or I am very confused.
-Chris
On Mon, Feb 11, 2019 at 11:01:16PM +0000, Chris Wilson wrote:
> Quoting Daniel Vetter (2019-02-11 22:38:25)
> > On Mon, Feb 11, 2019 at 09:03:12PM +0000, Chris Wilson wrote:
> > > Quoting Daniel Vetter (2019-02-11 18:02:02)
> > > > Instead of cleaning up the mess in igt_exit make sure we don't even
> > > > let it out of the container. See also
> > > > 
> > > > commit 754876378d6c9b2775e8c07b4d16f9878c55949f
> > > > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Date:   Fri Feb 26 22:11:10 2016 +0000
> > > > 
> > > >     igt/gem_sync: Enforce a timeout of 20s
> > > > 
> > > > which added this helper.
> > > > 
> > > > To make sure that everyone follows the rules, add an assert.
> > > > 
> > > > We're keeping the cleanup code as a failsafe, and because it speeds
> > > > up the testcase I'm following up with.
> > > > 
> > > > v2: Chris pointed out that my original patch did nothing. Which I
> > > > didn't catch because my testcase was also broken. Unfortunately this
> > > > means we need to open code part of the waiting.
> > > > 
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >  lib/igt_core.c | 39 ++++++++++++++++++++++++++++++++++-----
> > > >  1 file changed, 34 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/lib/igt_core.c b/lib/igt_core.c
> > > > index cbbe79f88070..3053697da58c 100644
> > > > --- a/lib/igt_core.c
> > > > +++ b/lib/igt_core.c
> > > > @@ -1525,6 +1525,7 @@ void igt_exit(void)
> > > >  
> > > >         for (int c = 0; c < num_test_children; c++)
> > > >                 kill(test_children[c], SIGKILL);
> > > > +       assert(!num_test_children);
> > > >  
> > > >         if (!test_with_subtests) {
> > > >                 struct timespec now;
> > > > @@ -1832,20 +1833,48 @@ void igt_waitchildren(void)
> > > >                 igt_fail(err);
> > > >  }
> > > >  
> > > > +static bool igt_killchidren_timed_out;
> > > > +
> > > > +static void igt_alarm_killchildren(int signal)
> > > > +{
> > > > +       igt_info("Timed out waiting for children\n");
> > > 
> > > We used to print the caller supplied reason. Although that appears to
> > > have never been used, so might as well drop it from the API.
> > > 
> > > > +
> > > > +       igt_killchidren_timed_out = true;
> > > > +
> > > > +       for (int c = 0; c < num_test_children; c++)
> > > > +               kill(test_children[c], SIGKILL);
> > > 
> > > Ok, kill() is signal-safe.
> > > 
> > > > +}
> > > > +
> > > >  /**
> > > >   * igt_waitchildren_timeout:
> > > >   * @seconds: timeout in seconds to wait
> > > >   * @reason: debug string explaining what timedout
> > > >   *
> > > > - * Wait for all children forked with igt_fork, for a maximum of @seconds.
> > > > - *
> > > > - * Wraps igt_waitchildren() and igt_set_timeout()
> > > > + * Wait for all children forked with igt_fork, for a maximum of @seconds. If the
> > > > + * timeout expires, kills all children and cleans them up.
> > > >   */
> > > >  void igt_waitchildren_timeout(int seconds, const char *reason)
> > > >  {
> > > > -       igt_set_timeout(seconds, reason);
> > > > -       igt_waitchildren();
> > > > +       struct sigaction sa;
> > > > +       int ret;
> > > > +
> > > > +       sa.sa_handler = igt_alarm_killchildren;
> > > > +       sigemptyset(&sa.sa_mask);
> > > > +       sa.sa_flags = 0;
> > > > +
> > > > +       igt_killchidren_timed_out = false;
> > > > +
> > > > +       sigaction(SIGALRM, &sa, NULL);
> > > > +
> > > > +       alarm(seconds);
> > > > +
> > > > +       ret = __igt_waitchildren();
> > > > +       if (!igt_killchidren_timed_out && ret)
> > > > +               igt_fail(ret);
> > > >         igt_reset_timeout();
> > > > +       __igt_waitchildren();
> > > > +       if (igt_killchidren_timed_out)
> > > > +               igt_fail(IGT_EXIT_FAILURE);
> > > 
> > > Double __igt_waitchildren()? My reading of __igt_waitchildren() is that
> > > it will continue on after the alarm() wakes wait() up with SIGINT and
> > > repeat the wait() until all children die. And now those children will
> > > die, rather than the parent as before.
> > 
> > igt_waitchildren_timeout before this patch wouldn't work if it did that.
> 
> Before this patch, the alarm handler did igt_fail -> exit(1) in the
> parent.
> 
> > The pid == wait(); if (pid == -1) continue; bails out to the next child in
> > case of interrupts. Hence the wait; kill; wait sequence here. I discovered
> > this through testcases :-)
> 
> Now alarm -> wait() returning -1 + errno=EINTR, right? Then the
> sighandler does killall -9, so on the next wait it sees that all the
> children are dead.
> 
> __igt_waitchildren() even sets num_test_children = 0 on return, so the
> second __igt_waitchildren() must do nothing. Or I am very confused.

Hm right I think I got myself totally confused with some older version of
this (which was broken), in combination with also broken testcases.

Looking at the code again it does indeed loop until it has them all, plus
there is even a loop to kill all the children if one failed. I'll see what
happens when I drop the 2nd waitchildren, and double-check my tests do
tests what I want to test.
-Daniel