author Tom Smeding <t.j.smeding@uu.nl> 2022-08-27 15:35:56 +0200
committer Tom Smeding <t.j.smeding@uu.nl> 2022-08-27 15:35:56 +0200
commit b6f41cc07d366ee50ace720d3426656aba7daa2b (patch)
tree 7c5c3b47c94e402fca37c0d2402ef6e73da8cde9 /bugs
parent cc3b4084bc328023b20cdf14ea6e9be1f86d940c (diff)
Add bugs/efault post
Diffstat (limited to 'bugs')
-rw-r--r-- bugs/efault.html 44
-rw-r--r-- bugs/efault.md 57
2 files changed, 101 insertions, 0 deletions
diff --git a/bugs/efault.html b/bugs/efault.html
new file mode 100644
index 0000000..06be3c1
--- /dev/null
+++ b/bugs/efault.html
@@ -0,0 +1,44 @@
+<h2>The impossible EFAULT</h2>
+<p>I have written a program; suppose it's called <code>worker</code>.
+(While the program is written in Haskell, I don't think that's particularly relevant to this post.)</p>
+<p>When run, <code>worker</code> starts a bunch of copies of a script.
+Under normal circumstances this script sets up a container using Linux cgroups and Linux user namespaces, but none of that is relevant because the strange behaviour in question occurs just fine without all of that -- in fact, we'll let it start the following script, say <code>./sleep.sh</code>:</p>
+<pre><code class="language-bash">#!/bin/bash
+sleep 10
+</code></pre>
+<p>Clearly, there is no weird behaviour here, assuming that the system has <code>bash</code> under <code>/bin</code>, and mine does.</p>
+<p>The copies of <code>sleep.sh</code> are started by passing <code>./sleep.sh</code> to <code>posix_spawnp(3)</code>.
+(The Haskell <code>process</code> library does this for me.)
+The thing is, occasionally (roughly once every 5 to 10 invocations of <code>./worker</code>), <code>posix_spawnp</code> returns <code>EFAULT</code> (&quot;Bad address&quot;).
+The manpage for <code>posix_spawnp</code> says that:</p>
+<blockquote>
+<p><strong>ERRORS</strong></p>
+<p>The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails; in these cases, these functions return an error number, which will be one of the errors described for fork(2), vfork(2) or clone(2).</p>
+<p>In addition, these functions fail if:</p>
+<p><strong>ENOSYS</strong> Function not supported on this system.</p>
+</blockquote>
+<p>Okay, so I should look for <code>EFAULT</code> in <code>fork(2)</code>, <code>vfork(2)</code> and <code>clone(2)</code> to figure out what goes wrong, right?
+Wrong.
+Or, in any case, none of those manpages mention <code>EFAULT</code>.
+I've looked through the source code of <code>posix_spawnp</code> in glibc and it at least doesn't return <code>EFAULT</code> directly; presumably, one of the subroutines it calls does.
+glibc is large and I don't think looking through the entire call tree will be very productive, so I tried to diagnose the issue from the outside instead.</p>
+<p>And this is where the weirdness starts.
+Whenever my program encounters <code>EFAULT</code> from <code>posix_spawnp</code>, it prints <code>Oops EFAULT</code>; hence grepping for <code>EFAULT</code> produces output precisely when the error occurred in that run.
+I get the following observations:</p>
+<ul>
+<li><code>./worker 2&gt;&amp;1 | grep EFAULT</code>: errors occur.</li>
+<li><code>./worker 2&gt;&amp;1 | grep EFAULT | cat</code>: errors DO NOT occur.</li>
+<li><code>./worker 2&gt;&amp;1 | grep --line-buffered EFAULT | cat</code>: errors occur.</li>
+<li><code>./worker 2&gt;&amp;1 | grep --line-buffered EFAULT</code>: errors occur.</li>
+</ul>
+<p>(&quot;errors occur&quot; means that once every few executions I get output indicating that <code>EFAULT</code> occurred; in the negative case I've run it for &gt;20x the number of invocations that are necessary to produce <code>EFAULT</code> in the other cases, without any <code>EFAULT</code>.)</p>
+<p>The only situation in which <code>posix_spawnp</code> seems to always succeed is when the <code>stdout</code> of the process that <code>worker</code>'s output is piped into is block-buffered.
+But this makes no sense: there shouldn't be a reasonable way in which <code>worker</code> <em>can</em> even determine whether this is the case!
+Sure, it can distinguish between <code>./worker | cat</code> and <code>./worker</code> (using <code>isatty(3)</code> -- this is precisely what <code>grep</code> does when not passed <code>--line-buffered</code>), but in all of the above cases the output is piped to another process anyway.</p>
+<p>This is already spooky, but it gets even spookier: if I replace the invocation of <code>./sleep.sh</code> by an invocation of <code>sleep</code> (i.e. removing the indirection of the shell script), errors occur in none of the above setups.
+Somehow, starting a script is different from starting a native process (and changing <code>bash</code> to <code>dash</code> in <code>sleep.sh</code> doesn't change anything).
+<code>posix_spawnp</code> shouldn't care what it is starting!
+That's the job of the loader, as far as I know.
+So what gives?</p>
+<p>I'll try to reduce my own program to a minimal reproducer, and if I find anything I'll update this post.
+In the meantime, spookiness.</p>
diff --git a/bugs/efault.md b/bugs/efault.md
new file mode 100644
index 0000000..c9a539d
--- /dev/null
+++ b/bugs/efault.md
@@ -0,0 +1,57 @@
+## The impossible EFAULT
+
+I have written a program; suppose it's called `worker`.
+(While the program is written in Haskell, I don't think that's particularly relevant to this post.)
+
+When run, `worker` starts a bunch of copies of a script.
+Under normal circumstances this script sets up a container using Linux cgroups and Linux user namespaces, but none of that is relevant because the strange behaviour in question occurs just fine without all of that -- in fact, we'll let it start the following script, say `./sleep.sh`:
+
+```bash
+#!/bin/bash
+sleep 10
+```
+
+Clearly, there is no weird behaviour here, assuming that the system has `bash` under `/bin`, and mine does.
+
+The copies of `sleep.sh` are started by passing `./sleep.sh` to `posix_spawnp(3)`.
+(The Haskell `process` library does this for me.)
+The thing is, occasionally (roughly once every 5 to 10 invocations of `./worker`), `posix_spawnp` returns `EFAULT` ("Bad address").
+The manpage for `posix_spawnp` says that:
+
+> **ERRORS**
+>
+> The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails; in these cases, these functions return an error number, which will be one of the errors described for fork(2), vfork(2) or clone(2).
+>
+> In addition, these functions fail if:
+>
+> **ENOSYS** Function not supported on this system.
+
+Okay, so I should look for `EFAULT` in `fork(2)`, `vfork(2)` and `clone(2)` to figure out what goes wrong, right?
+Wrong.
+Or, in any case, none of those manpages mention `EFAULT`.
+I've looked through the source code of `posix_spawnp` in glibc and it at least doesn't return `EFAULT` directly; presumably, one of the subroutines it calls does.
+glibc is large and I don't think looking through the entire call tree will be very productive, so I tried to diagnose the issue from the outside instead.
+
+And this is where the weirdness starts.
+Whenever my program encounters `EFAULT` from `posix_spawnp`, it prints `Oops EFAULT`; hence grepping for `EFAULT` produces output precisely when the error occurred in that run.
+I get the following observations:
+
+- `./worker 2>&1 | grep EFAULT`: errors occur.
+- `./worker 2>&1 | grep EFAULT | cat`: errors DO NOT occur.
+- `./worker 2>&1 | grep --line-buffered EFAULT | cat`: errors occur.
+- `./worker 2>&1 | grep --line-buffered EFAULT`: errors occur.
+
+("errors occur" means that once every few executions I get output indicating that `EFAULT` occurred; in the negative case I've run it for >20x the number of invocations that are necessary to produce `EFAULT` in the other cases, without any `EFAULT`.)
+
+The only situation in which `posix_spawnp` seems to always succeed is when the `stdout` of the process that `worker`'s output is piped into is block-buffered.
+But this makes no sense: there shouldn't be a reasonable way in which `worker` _can_ even determine whether this is the case!
+Sure, it can distinguish between `./worker | cat` and `./worker` (using `isatty(3)` -- this is precisely what `grep` does when not passed `--line-buffered`), but in all of the above cases the output is piped to another process anyway.
+
+This is already spooky, but it gets even spookier: if I replace the invocation of `./sleep.sh` by an invocation of `sleep` (i.e. removing the indirection of the shell script), errors occur in none of the above setups.
+Somehow, starting a script is different from starting a native process (and changing `bash` to `dash` in `sleep.sh` doesn't change anything).
+`posix_spawnp` shouldn't care what it is starting!
+That's the job of the loader, as far as I know.
+So what gives?
+
+I'll try to reduce my own program to a minimal reproducer, and if I find anything I'll update this post.
+In the meantime, spookiness.
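
The post closes by promising a minimal reproducer. A rough sketch of what one could look like follows, written in Python rather than Haskell purely for brevity. `spawn_copies` and `reap` are hypothetical helper names, and because this bypasses the Haskell `process` library there is no guarantee it triggers the same `EFAULT`. It relies only on `os.posix_spawnp` (Python 3.8+), which wraps the libc function of the same name.

```python
import errno
import os


def spawn_copies(path, copies=8):
    """Start `copies` children via posix_spawnp.

    Returns (pids, failures), where failures holds the errno of every
    spawn attempt that raised OSError.
    """
    pids, failures = [], []
    for _ in range(copies):
        try:
            # `path` is looked up on PATH unless it contains a slash,
            # matching the behaviour of posix_spawnp(3).
            pids.append(os.posix_spawnp(path, [path], os.environ))
        except OSError as exc:
            failures.append(exc.errno)
            if exc.errno == errno.EFAULT:
                # Same marker line the post greps for.
                print("Oops EFAULT", flush=True)
    return pids, failures


def reap(pids):
    """Wait for every child and return the raw wait statuses (0 = clean exit)."""
    return [os.waitpid(pid, 0)[1] for pid in pids]
```

Calling `spawn_copies("./sleep.sh")` followed by `reap` from a small script, and running that script under each of the four pipelines listed in the post, would be one way to check whether the behaviour depends on `worker` itself.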