From b6f41cc07d366ee50ace720d3426656aba7daa2b Mon Sep 17 00:00:00 2001
From: Tom Smeding
Date: Sat, 27 Aug 2022 15:35:56 +0200
Subject: Add bugs/efault post

---
 bugs/efault.html | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 bugs/efault.html

diff --git a/bugs/efault.html b/bugs/efault.html
new file mode 100644
index 0000000..06be3c1
--- /dev/null
+++ b/bugs/efault.html
@@ -0,0 +1,44 @@

The impossible EFAULT

I have written a program; suppose it's called worker. (While the program is written in Haskell, I don't think that's particularly relevant to this post.)

When run, worker starts a bunch of copies of a script. Under normal circumstances this script sets up a container using Linux cgroups and Linux user namespaces, but none of that is relevant, because the strange behaviour in question occurs just fine without all of that. In fact, we'll let it start the following script, say ./sleep.sh:

#!/bin/bash
sleep 10

Clearly, there is no weird behaviour here, assuming that the system has bash under /bin, and mine does.

The copies of sleep.sh are started by passing ./sleep.sh to posix_spawnp(3). (The Haskell process library does this for me.) The thing is, occasionally (approximately once every 5 to 10 invocations of ./worker), posix_spawnp returns EFAULT ("Bad address"). The manpage for posix_spawnp says that:

ERRORS

The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails; in these cases, these functions return an error number, which will be one of the errors described for fork(2), vfork(2) or clone(2).

In addition, these functions fail if:

ENOSYS Function not supported on this system.

Okay, so I should look for EFAULT in fork(2), vfork(2) and clone(2) to figure out what goes wrong, right? Wrong. Or, in any case, none of those manpages mention EFAULT. I've looked through the source code of posix_spawnp in glibc, and it at least doesn't return EFAULT directly; presumably, one of the subroutines it calls does. glibc is large and I don't think looking through the entire call tree will be very productive, so I tried to diagnose the issue from the outside instead.
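For concreteness, here is a rough sketch of the spawn-and-check step. The real worker is written in Haskell and uses the process library, so this Python version is only an illustration: the function name spawn_and_wait is made up, and the only thing taken from this post is that an EFAULT from posix_spawnp leads to the line Oops EFAULT being printed.

```python
import errno
import os


def spawn_and_wait(argv, env=None):
    """Spawn argv with posix_spawnp and return the child's exit code.

    Hypothetical helper: prints "Oops EFAULT" and returns None if the
    spawn itself fails with EFAULT; any other spawn error is re-raised.
    """
    if env is None:
        env = os.environ
    try:
        # posix_spawnp searches PATH when argv[0] contains no slash.
        pid = os.posix_spawnp(argv[0], argv, env)
    except OSError as exc:
        if exc.errno == errno.EFAULT:
            print("Oops EFAULT", flush=True)
            return None
        raise
    # Wait for the child and translate the raw wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Calling spawn_and_wait(["./sleep.sh"]) in a directory containing the script above would correspond to starting one copy; nothing here is claimed to reproduce the error.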

And this is where the weirdness starts. Whenever my program encounters EFAULT from posix_spawnp, it prints Oops EFAULT; hence grepping for EFAULT gives output precisely if the error occurred in this run. I get the following observations:

+ +

("errors occur" means that once every few executions I get output indicating that EFAULT occurred; in the negative case I've run it for >20x the number of invocations that are necessary to produce EFAULT in the other cases, without any EFAULT.)
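The "run it many times and look for EFAULT" procedure can be approximated with a small harness like the one below. This is a sketch under assumptions: count_efault_runs is a hypothetical name, and the harness reads the child's output through a pipe until EOF, which makes it one particular kind of consumer. Given the observations above, that choice might itself influence whether the error shows up.

```python
import subprocess


def count_efault_runs(argv, runs):
    """Run argv `runs` times; count the runs whose stdout mentions EFAULT."""
    hits = 0
    for _ in range(runs):
        # capture_output connects the child's stdout to a pipe, so this is
        # one of the "piped" setups, never the terminal case.
        result = subprocess.run(argv, capture_output=True, text=True)
        if "EFAULT" in result.stdout:
            hits += 1
    return hits
```

For example, count_efault_runs(["./worker"], 200) would give the number of failing runs out of 200.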

The only situation in which posix_spawnp seems to always succeed is when the stdout of the process that worker's output is piped to is block-buffered. But this makes no sense: there shouldn't be any reasonable way for worker to even determine whether this is the case! Sure, it can distinguish between ./worker | cat and ./worker (using isatty(3); this is precisely what grep does when not passed --line-buffered), but in all of the above cases the output is piped to another process anyway.
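To make the isatty point concrete, this sketch shows roughly what a process can learn about one of its own file descriptors. The function name describe_fd is hypothetical. Note what is absent: nothing here reveals how the process on the other end of a pipe buffers its own stdout.

```python
import os
import stat


def describe_fd(fd):
    """Classify a file descriptor using only what the process itself can see."""
    if os.isatty(fd):
        return "terminal"
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"
```

With stdout as file descriptor 1, describe_fd(1) would be expected to report a terminal for a plain ./worker started from a shell, and a pipe for ./worker | cat, regardless of what cat does with its output.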

This is already spooky, but it gets even spookier: if I replace the invocation of ./sleep.sh by an invocation of sleep (i.e. removing the indirection of the shell script), errors occur in none of the above setups. Somehow, starting a script is different from starting a native process (and changing bash to dash in sleep.sh doesn't change anything). posix_spawnp shouldn't care what it is starting! That's the job of the loader, as far as I know. So what gives?
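One difference between the two cases that is known: on Linux, when execve is given a file that starts with #!, the kernel runs the interpreter named on that first line and passes it the script path, instead of loading the file as a binary. The sketch below only mimics, in simplified form, how that interpreter is picked from the first line. The name shebang_interpreter is made up, and whether this code path has anything to do with the EFAULT is not established here.

```python
def shebang_interpreter(path):
    """Return the interpreter named on a '#!' first line, or None.

    Simplified: takes the first whitespace-separated word after '#!'
    and ignores any optional argument that follows it.
    """
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    parts = first_line[2:].split()
    return parts[0].decode() if parts else None
```

For the sleep.sh from this post, this would return /bin/bash; for a native executable such as sleep it would return None.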

I'll try to reduce my own program to a minimal reproducer, and if I find anything I'll post an update here. In the meantime, spookiness.
