Handle EAGAIN errno from poll(2) or select(2)

Submitted by Jeremy Sequoia on Aug. 19, 2015, 11:09 p.m.

Details

Message ID 1440025773-24254-1-git-send-email-jeremyhu@apple.com
State New

Commit Message

Jeremy Sequoia Aug. 19, 2015, 11:09 p.m.
No known fallout from this, but I spotted the possible issue when auditing
this code to track down a related issue.  While not noted in SUS, some
implementations (like darwin) may return EAGAIN for (possibly) transient
kernel issues that would suggest trying again.

Signed-off-by: Jeremy Huddleston Sequoia <jeremyhu@apple.com>
---
 src/xcb_conn.c | 2 +-
 src/xcb_in.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/xcb_conn.c b/src/xcb_conn.c
index 7d09637..7e49384 100644
--- a/src/xcb_conn.c
+++ b/src/xcb_conn.c
@@ -487,7 +487,7 @@  int _xcb_conn_wait(xcb_connection_t *c, pthread_cond_t *cond, struct iovec **vec
 #else
         ret = select(c->fd + 1, &rfds, &wfds, 0, 0);
 #endif
-    } while (ret == -1 && errno == EINTR);
+    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
     if(ret < 0)
     {
         _xcb_conn_shutdown(c, XCB_CONN_ERROR);
diff --git a/src/xcb_in.c b/src/xcb_in.c
index bab4bc7..e806388 100644
--- a/src/xcb_in.c
+++ b/src/xcb_in.c
@@ -386,7 +386,7 @@  static int read_block(const int fd, void *buf, const ssize_t len)
             pfd.revents = 0;
             do {
                 ret = poll(&pfd, 1, -1);
-            } while (ret == -1 && errno == EINTR);
+            } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
 #else
             fd_set fds;
             FD_ZERO(&fds);
@@ -396,7 +396,7 @@  static int read_block(const int fd, void *buf, const ssize_t len)
             errno = 0;
             do {
                 ret = select(fd + 1, &fds, 0, 0, 0);
-            } while (ret == -1 && errno == EINTR);
+            } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
 #endif /* USE_POLL */
         }
         if(ret <= 0)
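
For readers without the libxcb tree at hand, the poll(2) branch of the patched loop can be approximated by the self-contained sketch below.  The helper name wait_readable() is hypothetical and not part of libxcb; the sketch assumes a blocking wait with no timeout, as in read_block().

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper (not a libxcb function): block until fd is readable.
 * Returns poll()'s result: > 0 when the descriptor is ready, or -1 with
 * errno set when poll() failed with an error that is not retried. */
static int wait_readable(int fd)
{
    struct pollfd pfd;
    int ret;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    do {
        /* A timeout of -1 means "wait indefinitely". */
        ret = poll(&pfd, 1, -1);
        /* EINTR: the call was interrupted by a signal.
         * EAGAIN: the retry condition this patch adds. */
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
    return ret;
}
```

Note that the EAGAIN retry is immediate, so a persistent EAGAIN would keep this loop running without ever sleeping.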

Comments

> From: Jeremy Huddleston Sequoia <jeremyhu@apple.com>
> Date: Wed, 19 Aug 2015 16:09:33 -0700
> 
> No known fallout from this, but I spotted the possible issue when auditing
> this code to track down a related issue.  While not noted in SUS, some
> implementations (like darwin) may return EAGAIN for (possibly) transient
> kernel issues that would suggest trying again.

Well, EAGAIN suggests "try again *later*".  Presumably the kernel
would return EAGAIN immediately and therefore this change may very
well introduce a spinning loop.  That would not be good, and I'd say
returning an error would be preferable over having the application
spin.

Having the kernel return EAGAIN for a blocking poll or select would be
a serious bug IMHB.  It should just wait until resources are
available.  Does Darwin really have such a bug, or are you just trying
to strike pre-emptively?

Yeah, I thought about sleeping before retrying in the EAGAIN case to avoid a possible busy loop.  I can do that if you prefer.

As I indicated in the commit message, there is no known fallout from the lack of EAGAIN handling.  There is no behavioral problem.  Indeed the only time someone should ever get back EAGAIN from poll or select on darwin is under resource pressure, and it's likely the user would have bigger concerns than this at that point.

I just happened to notice this while tracing code to figure out why someone on stackoverflow was seeing recv() of the DISPLAY socket erring out with EAGAIN and then hanging.
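
The "sleep before retrying" alternative mentioned above could look roughly like the sketch below.  This is only an illustration under stated assumptions: the helper name poll_with_backoff(), the 1 ms delay, and the retry cap are all hypothetical and appear nowhere in the patch.

```c
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Hypothetical helper: like poll(), but retries EINTR immediately and
 * retries EAGAIN after a short sleep, giving up after max_eagain retries.
 * Each retry passes timeout_ms again, so a finite timeout restarts. */
static int poll_with_backoff(struct pollfd *pfd, nfds_t n, int timeout_ms,
                             int max_eagain)
{
    const struct timespec delay = { 0, 1000000L };  /* 1 ms, arbitrary */
    int attempts = 0;

    for (;;) {
        int ret = poll(pfd, n, timeout_ms);
        if (ret != -1)
            return ret;             /* ready count, or 0 on timeout */
        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry now */
        if (errno != EAGAIN || ++attempts > max_eagain)
            return -1;              /* errno still holds poll()'s error */
        /* Sleep so that a persistent EAGAIN cannot turn into a busy loop. */
        nanosleep(&delay, NULL);
    }
}
```

The cap means a long-lasting resource shortage eventually surfaces as an error to the caller instead of stalling it indefinitely.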

On Aug 19, 2015, at 23:59, Mark Kettenis <mark.kettenis@xs4all.nl> wrote:

>> From: Jeremy Huddleston Sequoia <jeremyhu@apple.com>
>> Date: Wed, 19 Aug 2015 16:09:33 -0700
>> 
>> No known fallout from this, but I spotted the possible issue when auditing
>> this code to track down a related issue.  While not noted in SUS, some
>> implementations (like darwin) may return EAGAIN for (possibly) transient
>> kernel issues that would suggest trying again.
> 
> Well, EAGAIN suggests "try again *later*".  Presumably the kernel
> would return EAGAIN immediately and therefore this change may very
> well introduce a spinning loop.  That would not be good, and I'd say
> returning an error would be preferable over having the application
> spin.
> 
> Having the kernel return EAGAIN for a blocking poll or select would be
> a serious bug IMHB.  It should just wait until resources are
> available.  Does Darwin really have such a bug, or are you just trying
> to strike pre-emptively?
On Thu, Aug 20, 2015 at 12:18:41AM -0700, Jeremy Sequoia wrote:
> Yeah, I thought about sleeping before retrying in the EAGAIN case to
> avoid a possible busy loop.  I can do that if you prefer.
> 
> As I indicated in the commit message, there is no known fallout from
> the lack of EAGAIN handling.  There is no behavioral problem.  Indeed
> the only time someone should ever get back EAGAIN from poll or select
> on darwin is under resource pressure, and it's likely the user would
> have bigger concerns than this at that point.
> 
> I just happened to notice this while tracing code to figure out why
> someone on stackoverflow was seeing recv() of the DISPLAY socket
> erring out with EAGAIN and then hanging.

If Darwin/OSX returns EAGAIN to a blocking call under *any*
circumstances, including "resource pressure", that's a serious bug.
Don't work around it in XCB or any other library, *especially* because
no other platform should behave the same way.  EAGAIN means "The socket
is marked nonblocking and the receive operation would block, or a
receive timeout had been set and the timeout expired before data was
received."  A blocking call with no timeout should never return EAGAIN;
it should either block or return some fatal error.

Libraries should *definitely* not have to include "wait a bit and try
again" logic; that's the kernel's job.

If you want a way to work around that on Darwin, you could create some
wrapper library around functions like recv that hides the incorrect
behavior.  But first, I'd suggest reporting a bug against Darwin for
violating the spec.

- Josh Triplett
> On Aug 20, 2015, at 09:21, Josh Triplett <josh@joshtriplett.org> wrote:
> 
> If Darwin/OSX returns EAGAIN to a blocking call under *any*
> circumstances, including "resource pressure", that's a serious bug.
> Don't work around it in XCB or any other library, *especially* because
> no other platform should behave the same way.  EAGAIN means "The socket
> is marked nonblocking and the receive operation would block, or a
> receive timeout had been set and the timeout expired before data was
> received."  

No, that is not what EAGAIN means.  From SUSv4 at http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html

"""
The poll() function shall fail if:

[EAGAIN]
The allocation of internal data structures failed but a subsequent request may succeed.
...
"""

True, select(2) does not specify EAGAIN as a possible returned error in SUSv4 (http://pubs.opengroup.org/onlinepubs/9699919799/functions/select.html), but darwin extends the standard to indicate that select(2) can return EAGAIN for basically the same reasons as poll(2) can.

> A blocking call with no timeout should never return EAGAIN;
> it should either block or return some fatal error.

Not according to UNIX.

> Libraries should *definitely* not have to include "wait a bit and try
> again" logic; that's the kernel's job.

That is why I decided to just try again immediately.

> If you want a way to work around that on Darwin

This is a UNIX issue (for poll(2)).  Yes, select(2) returning an EAGAIN error is a darwin *extension*, but I don't imagine that should cause any additional headaches given that we already have to handle that exact case for the poll(2) codepath.

> , you could create some
> wrapper library around functions like recv that hides the incorrect
> behavior.  But first, I'd suggest reporting a bug against Darwin for
> violating the spec.

That's not necessary, as I've indicated above.

--Jeremy
On Sat, Aug 22, 2015 at 02:33:46AM -0700, Jeremy Huddleston Sequoia wrote:
> No, that is not what EAGAIN means.  From SUSv4 at http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html
> 
> """
> The poll() function shall fail if:
> 
> [EAGAIN]
> The allocation of internal data structures failed but a subsequent request may succeed.
> ...
> """

Ah, I see; I'd forgotten that the spec actually allows EAGAIN and
EWOULDBLOCK to be different.  EWOULDBLOCK definitely has the semantics I
had in mind and that the Linux manpage documents; from
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03

> Operation would block. An operation on a socket marked as non-blocking has encountered a situation such as no data available that otherwise would have caused the function to suspend execution.

But sure enough, for EAGAIN it says "Resource temporarily unavailable.
This is a temporary condition and later calls to the same routine may
complete normally."  So if an implementation ignores the spec language
saying "A conforming implementation may assign the same values for
[EWOULDBLOCK] and [EAGAIN]." and makes them separate, EAGAIN can indeed
mean the kernel is making its internal problems the application's
problems and requiring the application to try again.  Sigh.

> > A blocking call with no timeout should never return EAGAIN;
> > it should either block or return some fatal error.
> 
> Not according to UNIX.

s/EAGAIN/EWOULDBLOCK/ and the statement holds.

> > Libraries should *definitely* not have to include "wait a bit and try
> > again" logic; that's the kernel's job.

I stand by this statement, but evidently the spec allows this particular
bit of ridiculosity.  Personally, I'd argue that if the kernel has a
resource allocation failure, it should be returning -ENOMEM.

Could I talk you into adding a "EAGAIN != EWOULDBLOCK && " before
checking for EAGAIN?  That way, the "retry immediately on EAGAIN" logic
will only run on platforms where EAGAIN *doesn't* have the same meaning
as EWOULDBLOCK's "this is non-blocking and would block".  On platforms
that define those two identically, the extra logic will constant-fold
away.

(I also wonder whether every other application and library includes this
logic on Darwin, or if other applications and libraries end up just
exiting with an error in this case.)

- Josh Triplett
> On Aug 22, 2015, at 10:30, Josh Triplett <josh@joshtriplett.org> wrote:
> 
>>> A blocking call with no timeout should never return EAGAIN;
>>> it should either block or return some fatal error.
>> 
>> Not according to UNIX.
> 
> s/EAGAIN/EWOULDBLOCK/ and the statement holds.

Yep!

>>> Libraries should *definitely* not have to include "wait a bit and try
>>> again" logic; that's the kernel's job.
> 
> I stand by this statement, but evidently the spec allows this particular
> bit of ridiculosity.  Personally, I'd argue that if the kernel has a
> resource allocation failure, it should be returning -ENOMEM.

I agree, but sadly nobody consulted either you or me when writing the SUS.

> Could I talk you into adding a "EAGAIN != EWOULDBLOCK && " before
> checking for EAGAIN?  That way, the "retry immediately on EAGAIN" logic
> will only run on platforms where EAGAIN *doesn't* have the same meaning
> as EWOULDBLOCK's "this is non-blocking and would block".  On platforms
> that define those two identically, the extra logic will constant-fold
> away.

They won't constant fold because we're not checking for EWOULDBLOCK because it doesn't really make sense in this case.  I don't think any implementation of poll(2) or select(2) would return EWOULDBLOCK because it doesn't really make sense to have non-blocking implementations of those syscalls.  The whole point of those syscalls is to block until data is available.

> (I also wonder whether every other application and library includes this
> logic on Darwin, or if other applications and libraries end up just
> exiting with an error in this case.)

I doubt many OS X applications are doing this check.  Error handling is so bad in a lot of code that we're in good shape if we catch the most common errors.  Running out of memory on desktops is mostly unheard of these days, and most desktop systems are likely destined to panic anyway in such a case.

Modern design for embedded systems, however, changes this.  As engineers, we now need to consider systems with limited memory and no swap, where running out of memory might be a common occurrence.  The kernel can kill runaway or idle processes to reclaim memory, after which the same operation can succeed.  I agree with your position above that the kernel should deal with it internally rather than returning EAGAIN, but since this code is designed to work anywhere (or at least anywhere UNIXish), we should try to handle that case.
On Sat, Aug 22, 2015 at 10:52:17AM -0700, Jeremy Huddleston Sequoia wrote:
> They won't constant fold because we're not checking for EWOULDBLOCK
> because it doesn't really make sense in this case.  I don't think any
> implementation of poll(2) or select(2) would return EWOULDBLOCK
> because it doesn't really make sense to have non-blocking
> implementations of those syscalls.  The whole point of those syscalls
> is to block until data is available.

That's not what I mean.  "EAGAIN != EWOULDBLOCK" constant-folds into 0
on a system where those two are equal.  So, something like "if (EAGAIN
!= EWOULDBLOCK && errno == EAGAIN) { loop and try again }" will fold
away to nothing except on a system that has EAGAIN as a separate error
from EWOULDBLOCK, which conveniently matches those systems where
retrying on EAGAIN makes sense.

- Josh Triplett
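
The guard proposed above can be sketched as a small predicate.  The function name should_retry_wait() is hypothetical and not taken from the patch; this only illustrates how the "EAGAIN != EWOULDBLOCK &&" test would sit in front of the errno check.

```c
#include <errno.h>

/* Hypothetical predicate: should a blocking poll()/select() that returned
 * -1 with errno == err simply be called again? */
static int should_retry_wait(int err)
{
    if (err == EINTR)
        return 1;
    /* Both macros expand to integer constants, so the left operand is
     * known at compile time.  Where the platform gives them the same
     * value it is 0 and the compiler can drop the EAGAIN test. */
    return EAGAIN != EWOULDBLOCK && err == EAGAIN;
}
```

With this predicate the loop condition would read "ret == -1 && should_retry_wait(errno)".  On a platform where the two macros share a value, it reduces to the EINTR check alone.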
> On Aug 22, 2015, at 18:43, Josh Triplett <josh@joshtriplett.org> wrote:
> 
> That's not what I mean.  "EAGAIN != EWOULDBLOCK" constant-folds into 0
> on a system where those two are equal.  So, something like "if (EAGAIN
> != EWOULDBLOCK && errno == EAGAIN) { loop and try again }" will fold
> away to nothing except on a system that has EAGAIN as a separate error
> from EWOULDBLOCK, which conveniently matches those systems where
> retrying on EAGAIN makes sense.

I'm not sure how you are concluding that this has anything to do with whether or not EAGAIN and EWOULDBLOCK are the same value, but that is not the case.

POSIX allows compliant implementations to define those two errnos to the same value and it also defines the conditions in which poll(2) can return EAGAIN.  There's nothing about the first which has any bearing on the second.

For example, darwin defines the two errnos to the same value, and I think most Linux and BSDs do the same, but we still have to deal with the possibility of poll EAGAINing.

This is also on the slow path, so I'm not sure it is worth making use of platform specific knowledge instead of coding to the standard.  If you prefer, I can keep the EAGAIN bits out of the select(2) path and keep them only in the poll(2) path.
On Sat, Aug 22, 2015 at 07:11:41PM -0700, Jeremy Sequoia wrote:
> > On Aug 22, 2015, at 18:43, Josh Triplett <josh@joshtriplett.org> wrote:
> >> On Sat, Aug 22, 2015 at 10:52:17AM -0700, Jeremy Huddleston Sequoia wrote:
> >>> On Aug 22, 2015, at 10:30, Josh Triplett <josh@joshtriplett.org> wrote:
> >>> On Sat, Aug 22, 2015 at 02:33:46AM -0700, Jeremy Huddleston Sequoia wrote:
> >>>>> On Aug 20, 2015, at 09:21, Josh Triplett <josh@joshtriplett.org> wrote:
> >>>>> 
> >>>>> On Thu, Aug 20, 2015 at 12:18:41AM -0700, Jeremy Sequoia wrote:
> >>>>>> Yeah, I thought about sleeping before retrying in the EAGAIN case to
> >>>>>> avoid a possible busy loop.  I can do that if you prefer.
> >>>>>> 
> >>>>>> As I indicated in the commit message, there is no known fallout from
> >>>>>> the lack of EAGAIN handling.  There is no behavioral problem.  Indeed
> >>>>>> the only time someone should ever get back EAGAIN from poll or select
> >>>>>> on darwin is under resource pressure, and it's likely the user would
> >>>>>> have bigger concerns than this at that point.
> >>>>>> 
> >>>>>> I just happened to notice this while tracing code to figure out why
> >>>>>> someone on stackoverflow was seeing recv() of the DISPLAY socket
> >>>>>> erring out with EAGAIN and then hanging.
> >>>>> 
> >>>>> If Darwin/OSX returns EAGAIN to a blocking call under *any*
> >>>>> circumstances, including "resource pressure", that's a serious bug.
> >>>>> Don't work around it in XCB or any other library, *especially* because
> >>>>> no other platform should behave the same way.  EAGAIN means "The socket
> >>>>> is marked nonblocking and the receive operation would block, or a
> >>>>> receive timeout had been set and the timeout expired before data was
> >>>>> received."  
> >>>> 
> >>>> No, that is not what EAGAIN means.  From SUSv4 at http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html
> >>>> 
> >>>> """
> >>>> The poll() function shall fail if:
> >>>> 
> >>>> [EAGAIN]
> >>>> The allocation of internal data structures failed but a subsequent request may succeed.
> >>>> ...
> >>>> """
> >>> 
> >>> Ah, I see; I'd forgotten that the spec actually allows EAGAIN and
> >>> EWOULDBLOCK to be different.  EWOULDBLOCK definitely has the semantics I
> >>> had in mind and that the Linux manpage documents; from
> >>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> >>> 
> >>>> Operation would block. An operation on a socket marked as non-blocking has encountered a situation such as no data available that otherwise would have caused the function to suspend execution.
> >>> 
> >>> But sure enough, for EAGAIN it says "Resource temporarily unavailable.
> >>> This is a temporary condition and later calls to the same routine may
> >>> complete normally."  So if an implementation ignores the spec language
> >>> saying "A conforming implementation may assign the same values for
> >>> [EWOULDBLOCK] and [EAGAIN]." and makes them separate, EAGAIN can indeed
> >>> mean the kernel is making its internal problems the application's
> >>> problems and requiring the application to try again.  Sigh.
> >>> 
> >>>>> A blocking call with no timeout should never return EAGAIN;
> >>>>> it should either block or return some fatal error.
> >>>> 
> >>>> Not according to UNIX.
> >>> 
> >>> s/EAGAIN/EWOULDBLOCK/ and the statement holds.
> >> 
> >> Yep!
> >> 
> >>>>> Libraries should *definitely* not have to include "wait a bit and try
> >>>>> again" logic; that's the kernel's job.
> >>> 
> >>> I stand by this statement, but evidently the spec allows this particular
> >>> bit of ridiculosity.  Personally, I'd argue that if the kernel has a
> >>> resource allocation failure, it should be returning -ENOMEM.
> >> 
> >> I agree, but sadly nobody consulted either you or I when writing the SUS.
> >> 
> >>> Could I talk you into adding a "EAGAIN != EWOULDBLOCK && " before
> >>> checking for EAGAIN?  That way, the "retry immediately on EAGAIN" logic
> >>> will only run on platforms where EAGAIN *doesn't* have the same meaning
> >>> as EWOULDBLOCK's "this is non-blocking and would block".  On platforms
> >>> that define those two identically, the extra logic will constant-fold
> >>> away.
> >> 
> >> They won't constant fold because we're not checking for EWOULDBLOCK
> >> because it doesn't really make sense in this case.  I don't think any
> >> implementation of poll(2) or select(2) would return EWOULDBLOCK
> >> because it doesn't really make sense to have non-blocking
> >> implementations of those syscalls.  The whole point of those syscalls
> >> is to block until data is available.
> > 
> > That's not what I mean.  "EAGAIN != EWOULDBLOCK" constant-folds into 0
> > on a system where those two are equal.  So, something like "if (EAGAIN
> > != EWOULDBLOCK && errno == EAGAIN) { loop and try again }" will fold
> > away to nothing except on a system that has EAGAIN as a separate error
> > from EWOULDBLOCK, which conveniently matches those systems where
> > retrying on EAGAIN makes sense.
> 
> I'm not sure how you are concluding that this has anything to do with
> whether or not EAGAIN and EWOULDBLOCK are the same value, but that is
> not the case.
> 
> POSIX allows compliant implementations to define those two errnos to
> the same value and it also defines the conditions in which poll(2) can
> return EAGAIN.  There's nothing about the first which has any bearing
> on the second.
> 
> For example, darwin defines the two errnos to the same value, and I
> think most Linux and BSDs do the same, but we still have to deal with
> the possibility of poll EAGAINing.

Sigh.  Apparently I was still underestimating how unusual an
implementation can be and still technically comply with the spec.  I had
assumed that if an implementation was going to use EAGAIN as a special
"try again later, my internal failure is now your problem" value, it
wouldn't simultaneously use the same errno value (under the name
EWOULDBLOCK) to mean "you asked me not to block so I didn't".  But if
Darwin equates the two errno values, then that check won't work.
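
To make the failure of that guard concrete, here is a minimal sketch of the
two retry predicates under discussion.  The helper names retry_as_posted()
and retry_guarded() are hypothetical and exist only for illustration; they
are not part of libxcb.

```c
#include <errno.h>

/* Hypothetical helper: the retry condition as written in the posted patch. */
static int retry_as_posted(int err)
{
    return err == EINTR || err == EAGAIN;
}

/*
 * Hypothetical helper: the guarded variant suggested during review.
 * "EAGAIN != EWOULDBLOCK" is a compile-time constant, so on a platform
 * that gives both macros the same value the EAGAIN test folds to 0.
 */
static int retry_guarded(int err)
{
    return err == EINTR || (EAGAIN != EWOULDBLOCK && err == EAGAIN);
}
```

On a platform where the two macros share a value, which the thread reports
for Darwin, retry_guarded() never retries on EAGAIN, so the guard would
switch the retry off on exactly the platform the patch is aimed at.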

Is this issue limited to poll() and select(), or can Darwin also return
EAGAIN from functions that can return EWOULDBLOCK if called on a
non-blocking file descriptor that isn't ready?

> This is also on the slow path, so I'm not sure it is worth making use
> of platform specific knowledge instead of coding to the standard.  If
> you prefer, I can keep the EAGAIN bits out of the select(2) path and
> keep them only in the poll(2) path.

Standards-compliant or not, it's *odd* behavior, and not particularly
sensible.  I was trying to find a way to have this not affect systems
other than those with the problem.  However, I'm now out of ideas for
how to do so, so go ahead and apply them.  To both select and poll, if
they both can spuriously return EAGAIN.

For the benefit of Linux developers who are used to an entirely
different meaning of EAGAIN, please do include a comment next to the
conditional, specifically explaining that it has nothing to do with
non-blocking descriptors in this case, that Darwin was observed to
return it from poll or select when it fails to allocate kernel-internal
resources, and that the spec allows it (citing
http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html )
and says that a subsequent call may succeed, hence the retry.  That way,
nobody will come across that line of the source and get confused about
why poll is returning EAGAIN when
http://man7.org/linux/man-pages/man2/poll.2.html doesn't mention EAGAIN.

I've also submitted a request to the Linux man-pages project to add a
portability note about this.

- Josh Triplett
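
A minimal sketch of what the requested comment might look like next to the
poll(2) retry loop.  The function name wait_readable() is hypothetical and
the comment wording is only a suggestion; the loop condition is the one from
the posted patch.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper, not libxcb code: block until fd is readable. */
static int wait_readable(int fd)
{
    struct pollfd pfd;
    int ret;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        ret = poll(&pfd, 1, -1);
        /*
         * EAGAIN here has nothing to do with non-blocking descriptors.
         * POSIX allows poll() to fail with EAGAIN when the allocation of
         * internal data structures failed but a subsequent request may
         * succeed, so the call is simply retried.  See
         * http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html
         */
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    return ret;
}
```

As written this retries immediately, without the sleep that was floated
earlier in the thread, so it could spin if EAGAIN were returned persistently.
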
> On Aug 23, 2015, at 00:32, Josh Triplett <josh@joshtriplett.org> wrote:
> 
> On Sat, Aug 22, 2015 at 07:11:41PM -0700, Jeremy Sequoia wrote:
>>> On Aug 22, 2015, at 18:43, Josh Triplett <josh@joshtriplett.org> wrote:
>>>> On Sat, Aug 22, 2015 at 10:52:17AM -0700, Jeremy Huddleston Sequoia wrote:
>>>>> On Aug 22, 2015, at 10:30, Josh Triplett <josh@joshtriplett.org> wrote:
>>>>> On Sat, Aug 22, 2015 at 02:33:46AM -0700, Jeremy Huddleston Sequoia wrote:
>>>>>>> On Aug 20, 2015, at 09:21, Josh Triplett <josh@joshtriplett.org> wrote:
>>>>>>> 
>>>>>>> On Thu, Aug 20, 2015 at 12:18:41AM -0700, Jeremy Sequoia wrote:
>>>>>>>> Yeah, I thought about sleeping before retrying in the EAGAIN case to
>>>>>>>> avoid a possible busy loop.  I can do that if you prefer.
>>>>>>>> 
>>>>>>>> As I indicated in the commit message, there is no known fallout from
>>>>>>>> the lack of EAGAIN handling.  There is no behavioral problem.  Indeed
>>>>>>>> the only time someone should ever get back EAGAIN from poll or select
>>>>>>>> on darwin is under resource pressure, and it's likely the user would
>>>>>>>> have bigger concerns than this at that point.
>>>>>>>> 
>>>>>>>> I just happened to notice this while tracing code to figure out why
>>>>>>>> someone on stackoverflow was seeing recv() of the DISPLAY socket
>>>>>>>> erring out with EAGAIN and then hanging.
>>>>>>> 
>>>>>>> If Darwin/OSX returns EAGAIN to a blocking call under *any*
>>>>>>> circumstances, including "resource pressure", that's a serious bug.
>>>>>>> Don't work around it in XCB or any other library, *especially* because
>>>>>>> no other platform should behave the same way.  EAGAIN means "The socket
>>>>>>> is marked nonblocking and the receive operation would block, or a
>>>>>>> receive timeout had been set and the timeout expired before data was
>>>>>>> received."  
>>>>>> 
>>>>>> No, that is not what EAGAIN means.  From SUSv4 at http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html
>>>>>> 
>>>>>> """
>>>>>> The poll() function shall fail if:
>>>>>> 
>>>>>> [EAGAIN]
>>>>>> The allocation of internal data structures failed but a subsequent request may succeed.
>>>>>> ...
>>>>>> """
>>>>> 
>>>>> Ah, I see; I'd forgotten that the spec actually allows EAGAIN and
>>>>> EWOULDBLOCK to be different.  EWOULDBLOCK definitely has the semantics I
>>>>> had in mind and that the Linux manpage documents; from
>>>>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
>>>>> 
>>>>>> Operation would block. An operation on a socket marked as non-blocking has encountered a situation such as no data available that otherwise would have caused the function to suspend execution.
>>>>> 
>>>>> But sure enough, for EAGAIN it says "Resource temporarily unavailable.
>>>>> This is a temporary condition and later calls to the same routine may
>>>>> complete normally."  So if an implementation ignores the spec language
>>>>> saying "A conforming implementation may assign the same values for
>>>>> [EWOULDBLOCK] and [EAGAIN]." and makes them separate, EAGAIN can indeed
>>>>> mean the kernel is making its internal problems the application's
>>>>> problems and requiring the application to try again.  Sigh.
>>>>> 
>>>>>>> A blocking call with no timeout should never return EAGAIN;
>>>>>>> it should either block or return some fatal error.
>>>>>> 
>>>>>> Not according to UNIX.
>>>>> 
>>>>> s/EAGAIN/EWOULDBLOCK/ and the statement holds.
>>>> 
>>>> Yep!
>>>> 
>>>>>>> Libraries should *definitely* not have to include "wait a bit and try
>>>>>>> again" logic; that's the kernel's job.
>>>>> 
>>>>> I stand by this statement, but evidently the spec allows this particular
>>>>> bit of ridiculosity.  Personally, I'd argue that if the kernel has a
>>>>> resource allocation failure, it should be returning -ENOMEM.
>>>> 
>>>> I agree, but sadly nobody consulted either you or I when writing the SUS.
>>>> 
>>>>> Could I talk you into adding a "EAGAIN != EWOULDBLOCK && " before
>>>>> checking for EAGAIN?  That way, the "retry immediately on EAGAIN" logic
>>>>> will only run on platforms where EAGAIN *doesn't* have the same meaning
>>>>> as EWOULDBLOCK's "this is non-blocking and would block".  On platforms
>>>>> that define those two identically, the extra logic will constant-fold
>>>>> away.
>>>> 
>>>> They won't constant fold because we're not checking for EWOULDBLOCK
>>>> because it doesn't really make sense in this case.  I don't think any
>>>> implementation of poll(2) or select(2) would return EWOULDBLOCK
>>>> because it doesn't really make sense to have non-blocking
>>>> implementations of those syscalls.  The whole point of those syscalls
>>>> is to block until data is available.
>>> 
>>> That's not what I mean.  "EAGAIN != EWOULDBLOCK" constant-folds into 0
>>> on a system where those two are equal.  So, something like "if (EAGAIN
>>> != EWOULDBLOCK && errno == EAGAIN) { loop and try again }" will fold
>>> away to nothing except on a system that has EAGAIN as a separate error
>>> from EWOULDBLOCK, which conveniently matches those systems where
>>> retrying on EAGAIN makes sense.
>> 
>> I'm not sure how you are concluding that this has anything to do with
>> whether or not EAGAIN and EWOULDBLOCK are the same value, but that is
>> not the case.
>> 
>> POSIX allows compliant implementations to define those two errnos to
>> the same value and it also defines the conditions in which poll(2) can
>> return EAGAIN.  There's nothing about the first which has any bearing
>> on the second.
>> 
>> For example, darwin defines the two errnos to the same value, and I
>> think most Linux and BSDs do the same, but we still have to deal with
>> the possibility of poll EAGAINing.
> 
> Sigh.  Apparently I was still underestimating how unusual an
> implementation can be and still technically comply with the spec.  I had
> assumed that if an implementation was going to use EAGAIN as a special
> "try again later, my internal failure is now your problem" value, it
> wouldn't simultaneously use the same errno value (under the name
> EWOULDBLOCK) to mean "you asked me not to block so I didn't".  But if
> Darwin equates the two errno values, then that check won't work.
> 
> Is this issue limited to poll() and select(), or can Darwin also return
> EAGAIN from functions that can return EWOULDBLOCK if called on a
> non-blocking file descriptor that isn't ready?

The only Darwinism involved in this discussion is the extension to select() to allow it to behave like poll().  You can see this in http://opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/kern/sys_generic.c:

/*
 * Select system call.
 *
 * Returns:	0			Success
 *		EINVAL			Invalid argument
 *		EAGAIN			Nonconformant error if allocation fails
 *	selprocess:???
 */

Prior to Leopard, xnu would panic if memory was exhausted in various syscalls.  A change was made to return an error to userland instead of hitting a panic.  Reviewing the commit that introduced this change to select() and some other syscalls, it looks like select() was the only one that used EAGAIN, presumably to be consistent with poll().  Some other syscalls were updated to return ENOMEM or ENOBUFS depending on the case.


>> This is also on the slow path, so I'm not sure it is worth making use
>> of platform specific knowledge instead of coding to the standard.  If
>> you prefer, I can keep the EAGAIN bits out of the select(2) path and
>> keep them only in the poll(2) path.
> 
> Standards-compliant or not, it's *odd* behavior, and not particularly
> sensible.  I was trying to find a way to have this not affect systems
> other than those with the problem.  However, I'm now out of ideas for
> how to do so, so go ahead and apply them.  To both select and poll, if
> they both can spuriously return EAGAIN.
> 
> For the benefit of Linux developers who are used to an entirely
> different meaning of EAGAIN, please do include a comment next to the
> conditional, specifically explaining that it has nothing to do with
> non-blocking descriptors in this case, that Darwin was observed to
> return it from poll or select

As I mentioned, I haven't actually observed this happening.  This is just being proactive because I noticed that EAGAIN wasn't handled when auditing the codepath.  There is no actual known fallout.  This is all theoretical but not actually being observed anywhere.

> when it fails to allocate kernel-internal
> resources, and that the spec allows it (citing
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html )
> and says that a subsequent call may succeed, hence the retry.  That way,
> nobody will come across that line of the source and get confused about
> why poll is returning EAGAIN when
> http://man7.org/linux/man-pages/man2/poll.2.html doesn't mention EAGAIN.
> 
> I've also submitted a request to the Linux man-pages project to add a
> portability note about this.

Thanks.  Will do.


> - Josh Triplett
And actually, digging deeper into it, it looks like the implementation in xnu should guarantee that the MALLOC in question never fails (it will block until memory is available), so we should actually never even see the EAGAIN returned for select() case even though it's documented to be a possibility.

> On Aug 23, 2015, at 01:26, Jeremy Huddleston Sequoia <jeremyhu@apple.com> wrote:
> 
> 
>> On Aug 23, 2015, at 00:32, Josh Triplett <josh@joshtriplett.org> wrote:
>> 
>> On Sat, Aug 22, 2015 at 07:11:41PM -0700, Jeremy Sequoia wrote:
>>>> On Aug 22, 2015, at 18:43, Josh Triplett <josh@joshtriplett.org> wrote:
>>>>> On Sat, Aug 22, 2015 at 10:52:17AM -0700, Jeremy Huddleston Sequoia wrote:
>>>>>> On Aug 22, 2015, at 10:30, Josh Triplett <josh@joshtriplett.org> wrote:
>>>>>> On Sat, Aug 22, 2015 at 02:33:46AM -0700, Jeremy Huddleston Sequoia wrote:
>>>>>>>> On Aug 20, 2015, at 09:21, Josh Triplett <josh@joshtriplett.org> wrote:
>>>>>>>> 
>>>>>>>> On Thu, Aug 20, 2015 at 12:18:41AM -0700, Jeremy Sequoia wrote:
>>>>>>>>> Yeah, I thought about sleeping before retrying in the EAGAIN case to
>>>>>>>>> avoid a possible busy loop.  I can do that if you prefer.
>>>>>>>>> 
>>>>>>>>> As I indicated in the commit message, there is no known fallout from
>>>>>>>>> the lack of EAGAIN handling.  There is no behavioral problem.  Indeed
>>>>>>>>> the only time someone should ever get back EAGAIN from poll or select
>>>>>>>>> on darwin is under resource pressure, and it's likely the user would
>>>>>>>>> have bigger concerns than this at that point.
>>>>>>>>> 
>>>>>>>>> I just happened to notice this while tracing code to figure out why
>>>>>>>>> someone on stackoverflow was seeing recv() of the DISPLAY socket
>>>>>>>>> erring out with EAGAIN and then hanging.
>>>>>>>> 
>>>>>>>> If Darwin/OSX returns EAGAIN to a blocking call under *any*
>>>>>>>> circumstances, including "resource pressure", that's a serious bug.
>>>>>>>> Don't work around it in XCB or any other library, *especially* because
>>>>>>>> no other platform should behave the same way.  EAGAIN means "The socket
>>>>>>>> is marked nonblocking and the receive operation would block, or a
>>>>>>>> receive timeout had been set and the timeout expired before data was
>>>>>>>> received."  
>>>>>>> 
>>>>>>> No, that is not what EAGAIN means.  From SUSv4 at http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html
>>>>>>> 
>>>>>>> """
>>>>>>> The poll() function shall fail if:
>>>>>>> 
>>>>>>> [EAGAIN]
>>>>>>> The allocation of internal data structures failed but a subsequent request may succeed.
>>>>>>> ...
>>>>>>> """
>>>>>> 
>>>>>> Ah, I see; I'd forgotten that the spec actually allows EAGAIN and
>>>>>> EWOULDBLOCK to be different.  EWOULDBLOCK definitely has the semantics I
>>>>>> had in mind and that the Linux manpage documents; from
>>>>>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
>>>>>> 
>>>>>>> Operation would block. An operation on a socket marked as non-blocking has encountered a situation such as no data available that otherwise would have caused the function to suspend execution.
>>>>>> 
>>>>>> But sure enough, for EAGAIN it says "Resource temporarily unavailable.
>>>>>> This is a temporary condition and later calls to the same routine may
>>>>>> complete normally."  So if an implementation ignores the spec language
>>>>>> saying "A conforming implementation may assign the same values for
>>>>>> [EWOULDBLOCK] and [EAGAIN]." and makes them separate, EAGAIN can indeed
>>>>>> mean the kernel is making its internal problems the application's
>>>>>> problems and requiring the application to try again.  Sigh.
>>>>>> 
>>>>>>>> A blocking call with no timeout should never return EAGAIN;
>>>>>>>> it should either block or return some fatal error.
>>>>>>> 
>>>>>>> Not according to UNIX.
>>>>>> 
>>>>>> s/EAGAIN/EWOULDBLOCK/ and the statement holds.
>>>>> 
>>>>> Yep!
>>>>> 
>>>>>>>> Libraries should *definitely* not have to include "wait a bit and try
>>>>>>>> again" logic; that's the kernel's job.
>>>>>> 
>>>>>> I stand by this statement, but evidently the spec allows this particular
>>>>>> bit of ridiculosity.  Personally, I'd argue that if the kernel has a
>>>>>> resource allocation failure, it should be returning -ENOMEM.
>>>>> 
>>>>> I agree, but sadly nobody consulted either you or I when writing the SUS.
>>>>> 
>>>>>> Could I talk you into adding a "EAGAIN != EWOULDBLOCK && " before
>>>>>> checking for EAGAIN?  That way, the "retry immediately on EAGAIN" logic
>>>>>> will only run on platforms where EAGAIN *doesn't* have the same meaning
>>>>>> as EWOULDBLOCK's "this is non-blocking and would block".  On platforms
>>>>>> that define those two identically, the extra logic will constant-fold
>>>>>> away.
>>>>> 
>>>>> They won't constant fold because we're not checking for EWOULDBLOCK
>>>>> because it doesn't really make sense in this case.  I don't think any
>>>>> implementation of poll(2) or select(2) would return EWOULDBLOCK
>>>>> because it doesn't really make sense to have non-blocking
>>>>> implementations of those syscalls.  The whole point of those syscalls
>>>>> is to block until data is available.
>>>> 
>>>> That's not what I mean.  "EAGAIN != EWOULDBLOCK" constant-folds into 0
>>>> on a system where those two are equal.  So, something like "if (EAGAIN
>>>> != EWOULDBLOCK && errno == EAGAIN) { loop and try again }" will fold
>>>> away to nothing except on a system that has EAGAIN as a separate error
>>>> from EWOULDBLOCK, which conveniently matches those systems where
>>>> retrying on EAGAIN makes sense.
>>> 
>>> I'm not sure how you are concluding that this has anything to do with
>>> whether or not EAGAIN and EWOULDBLOCK are the same value, but that is
>>> not the case.
>>> 
>>> POSIX allows compliant implementations to define those two errnos to
>>> the same value and it also defines the conditions in which poll(2) can
>>> return EAGAIN.  There's nothing about the first which has any bearing
>>> on the second.
>>> 
>>> For example, darwin defines the two errnos to the same value, and I
>>> think most Linux and BSDs do the same, but we still have to deal with
>>> the possibility of poll EAGAINing.
>> 
>> Sigh.  Apparently I was still underestimating how unusual an
>> implementation can be and still technically comply with the spec.  I had
>> assumed that if an implementation was going to use EAGAIN as a special
>> "try again later, my internal failure is now your problem" value, it
>> wouldn't simultaneously use the same errno value (under the name
>> EWOULDBLOCK) to mean "you asked me not to block so I didn't".  But if
>> Darwin equates the two errno values, then that check won't work.
>> 
>> Is this issue limited to poll() and select(), or can Darwin also return
>> EAGAIN from functions that can return EWOULDBLOCK if called on a
>> non-blocking file descriptor that isn't ready?
> 
> The only Darwinism involved in this discussion is the extension to select() to allow it to behave like poll().  You can see this in http://opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/kern/sys_generic.c:
> 
> /*
> * Select system call.
> *
> * Returns:	0			Success
> *		EINVAL			Invalid argument
> *		EAGAIN			Nonconformant error if allocation fails
> *	selprocess:???
> */
> 
> Prior to Leopard, xnu would panic if memory was exhausted in various syscalls.  A change was made to return an error to userland instead of hitting a panic.  Reviewing the commit that introduced this change to select() and some other syscalls, it looks like select() was the only one that used EAGAIN, presumably to be consistent with poll().  Some other syscalls were updated to return ENOMEM or ENOBUFS depending on the case.
> 
> 
>>> This is also on the slow path, so I'm not sure it is worth making use
>>> of platform specific knowledge instead of coding to the standard.  If
>>> you prefer, I can keep the EAGAIN bits out of the select(2) path and
>>> keep them only in the poll(2) path.
>> 
>> Standards-compliant or not, it's *odd* behavior, and not particularly
>> sensible.  I was trying to find a way to have this not affect systems
>> other than those with the problem.  However, I'm now out of ideas for
>> how to do so, so go ahead and apply them.  To both select and poll, if
>> they both can spuriously return EAGAIN.
>> 
>> For the benefit of Linux developers who are used to an entirely
>> different meaning of EAGAIN, please do include a comment next to the
>> conditional, specifically explaining that it has nothing to do with
>> non-blocking descriptors in this case, that Darwin was observed to
>> return it from poll or select
> 
> As I mentioned, I haven't actually observed this happening.  This is just being proactive because I noticed that EAGAIN wasn't handled when auditing the codepath.  There is no actual known fallout.  This is all theoretical but not actually being observed anywhere.
> 
>> when it fails to allocate kernel-internal
>> resources, and that the spec allows it (citing
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html )
>> and says that a subsequent call may succeed, hence the retry.  That way,
>> nobody will come across that line of the source and get confused about
>> why poll is returning EAGAIN when
>> http://man7.org/linux/man-pages/man2/poll.2.html doesn't mention EAGAIN.
>> 
>> I've also submitted a request to the Linux man-pages project to add a
>> portability note about this.
> 
> Thanks.  Will do.
> 
> 
>> - Josh Triplett
>
On Sun, Aug 23, 2015 at 01:51:45AM -0700, Jeremy Huddleston Sequoia wrote:
> And actually, digging deeper into it, it looks like the implementation
> in xnu should guarantee that the MALLOC in question never fails (it
> will block until memory is available), so we should actually never
> even see the EAGAIN returned for select() case even though it's
> documented to be a possibility.

Given that, is it still a good idea to make this change, or should we
wait until some real system exists where this is a problem?

- Josh Triplett
Resurrecting this thread as it looks like there actually is a real-world case of this returning EAGAIN ... maybe...  I'm not convinced that there isn't something else going on here, but I wanted to connect the dots:

https://bugs.freedesktop.org/show_bug.cgi?id=92652

> On Aug 23, 2015, at 16:02, Josh Triplett <josh@joshtriplett.org> wrote:
> 
> On Sun, Aug 23, 2015 at 01:51:45AM -0700, Jeremy Huddleston Sequoia wrote:
>> And actually, digging deeper into it, it looks like the implementation
>> in xnu should guarantee that the MALLOC in question never fails (it
>> will block until memory is available), so we should actually never
>> even see the EAGAIN returned for select() case even though it's
>> documented to be a possibility.
> 
> Given that, is it still a good idea to make this change, or should we
> wait until some real system exists where this is a problem?
> 
> - Josh Triplett
> _______________________________________________
> Xcb mailing list
> Xcb@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/xcb
Am 29.05.2016 um 08:58 schrieb Jeremy Huddleston Sequoia:
> Resurrecting this thread as it looks like there actually is a real-world case of this returning EAGAIN ... maybe...  I'm not convinced that there isn't something else going on here, but I wanted to connect the dots:
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=92652

That doesn't sound like resource exhaustion at all (and retrying a failed
xcb_wait_for_event() is pointless, because the connection will be in an error
state after the failure and the following attempts immediately return a failure).

Given the easy reproducibility of this, I'd use the Mac equivalent of strace to
figure out what exactly is going on. select()/poll() reproducibly failing
sounds more like the WM using KillClient to have the server close the
connection, but since the test program has WM_DELETE_WINDOW...

Cheers,
Uli
> On May 29, 2016, at 00:39, Uli Schlachter <psychon@znc.in> wrote:
> 
> Am 29.05.2016 um 08:58 schrieb Jeremy Huddleston Sequoia:
>> Resurrecting this thread as it looks like there actually is a real-world case of this returning EAGAIN ... maybe...  I'm not convinced that there isn't something else going on here, but I wanted to connect the dots:
>> 
>> https://bugs.freedesktop.org/show_bug.cgi?id=92652
> 
> That doesn't sound like resource exhaustion at all (and retrying a failed
> xcb_wait_for_event() is pointless, because the connection will be in an error
> state after the failure and the following attempts immediately return a failure).

Yeah, I just wanted to connect the two threads since I found myself having some deja vu reading through that code.  It's certainly not resource exhaustion, but we are getting back EAGAIN, which is surprising.  I want to figure out why that is being returned instead of something more appropriate.

> Given the easy reproducibility of this, I'd use the Mac equivalent of strace to
> figure out what exactly is going on. select()/poll() reproducibly failing
> sounds more like the WM using KillClient to have the server close the
> connection, but since the test program has WM_DELETE_WINDOW...

Yeah, I ended up going down that same path by watching the server side:

(lldb) bt
* thread #3: tid = 0x20aab8, function: CloseDownConnection , stop reason = breakpoint 5.1
  * frame #0: 0x0000000101f44780 X11.bin`CloseDownConnection
    frame #1: 0x0000000101e1f804 X11.bin`CloseDownClient + 484
    frame #2: 0x0000000101e3a84a X11.bin`ProcKillClient + 314
    frame #3: 0x0000000101e1ebf4 X11.bin`Dispatch + 1172

so that seems like what's going on.

> Cheers,
> Uli
> -- 
> "Do you know that books smell like nutmeg or some spice from a foreign land?"
>                                                  -- Faber in Fahrenheit 451