From b6f41cc07d366ee50ace720d3426656aba7daa2b Mon Sep 17 00:00:00 2001
From: Tom Smeding
Date: Sat, 27 Aug 2022 15:35:56 +0200
Subject: Add bugs/efault post

---
 bugs/efault.md | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)
 create mode 100644 bugs/efault.md

(limited to 'bugs/efault.md')

diff --git a/bugs/efault.md b/bugs/efault.md
new file mode 100644
index 0000000..c9a539d
--- /dev/null
+++ b/bugs/efault.md
@@ -0,0 +1,57 @@
+## The impossible EFAULT
+
+I have written a program; suppose it's called `worker`.
+(While the program is written in Haskell, I don't think that's particularly relevant to this post.)
+
+When run, `worker` starts a bunch of copies of a script.
+Under normal circumstances this script sets up a container using Linux cgroups and Linux user namespaces, but none of that is relevant because the strange behaviour in question occurs just fine without all of that -- in fact, we'll let it start the following script, say `./sleep.sh`:
+
+```bash
+#!/bin/bash
+sleep 10
+```
+
+Clearly, there is no weird behaviour here, assuming that the system has `bash` under `/bin`, and mine does.
+
+The copies of `sleep.sh` are started by passing `./sleep.sh` to `posix_spawnp(3)`.
+(The Haskell `process` library does this for me.)
+The thing is, occasionally (once every 5 to 10 invocations of `./worker`, approximately), `posix_spawnp` returns `EFAULT` ("Bad address").
+The manpage for `posix_spawnp` says that:
+
+> **ERRORS**
+>
+> The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails; in these cases, these functions return an error number, which will be one of the errors described for fork(2), vfork(2) or clone(2).
+>
+> In addition, these functions fail if:
+>
+> **ENOSYS** Function not supported on this system.
+
+Okay, so I should look for `EFAULT` in `fork(2)`, `vfork(2)` and `clone(2)` to figure out what goes wrong, right?
+Wrong.
+Or, in any case, none of those manpages mention `EFAULT`.
+I've looked through the source code of `posix_spawnp` in glibc and it at least doesn't return `EFAULT` directly; presumably, one of the subroutines it calls does.
+glibc is large and I don't think looking through the entire call tree will be very productive, so I tried to diagnose the issue from the outside instead.
+
+And this is where the weirdness starts.
+Whenever my program encounters `EFAULT` from `posix_spawnp`, it prints `Oops EFAULT`; hence grepping for `EFAULT` gives output precisely if the error occurred in this run.
+I get the following observations:
+
+- `./worker 2>&1 | grep EFAULT`: errors occur.
+- `./worker 2>&1 | grep EFAULT | cat`: errors DO NOT occur.
+- `./worker 2>&1 | grep --line-buffered EFAULT | cat`: errors occur.
+- `./worker 2>&1 | grep --line-buffered EFAULT`: errors occur.
+
+("errors occur" means that once every few executions I get output indicating that `EFAULT` occurred; in the negative case I've run it for >20x the number of invocations that are necessary to produce `EFAULT` in the other cases, without any `EFAULT`.)
+
+The only situation in which `posix_spawnp` seems to always succeed is when `stdout` of the process that `worker`'s output is piped to is block-buffered.
+But this makes no sense: there shouldn't be a reasonable way in which `worker` _can_ even determine whether this is the case!
+Surely it can distinguish between `./worker | cat` and `./worker` (using `isatty(3)` -- this is precisely what `grep` does when not passed `--line-buffered`), but in all of the above cases the output is piped to another process anyway.
+
+This is already spooky, but it gets even spookier: if I replace the invocation of `./sleep.sh` by an invocation of `sleep` (i.e. removing the indirection of the shell script), errors occur in none of the above setups.
+Somehow, starting a script is different from starting a native process (and changing `bash` to `dash` in `sleep.sh` doesn't change anything).
+`posix_spawnp` shouldn't care what it is starting!
+That's the job of the loader, as far as I know.
+So what gives?
+
+I'll try to reduce my own program to a minimal reproducer, and if I find anything I'll post an update to this post.
+In the meantime, spookiness.
-- 
cgit v1.2.3-54-g00ecf