<h2>The impossible EFAULT</h2>
<p>I have written a program; suppose it's called <code>worker</code>.
(While the program is written in Haskell, I don't think that's particularly relevant to this post.)</p>
<p>(EDIT: Reproducer can be found <a href="https://git.tomsmeding.com/snap-efault/tree/">here</a>.)</p>
<p>(EDIT 2: Diagnosis by <code>int-e</code> on irc <a href="https://paste.tomsmeding.com/D22SvR2T">here</a>.)</p>
<p>When run, <code>worker</code> starts a bunch of copies of a script.
Under normal circumstances this script sets up a container using Linux cgroups and Linux user namespaces, but none of that is relevant: the strange behaviour in question occurs just fine without all of that. In fact, we'll let it start the following script, say <code>./sleep.sh</code>:</p>
<pre><code class="language-bash">#!/bin/bash
sleep 10
</code></pre>
<p>Clearly, there is no weird behaviour here, assuming that the system has <code>bash</code> under <code>/bin</code>, and mine does.</p>
<p>The copies of <code>sleep.sh</code> are started by passing <code>./sleep.sh</code> to <code>posix_spawnp(3)</code>.
(The Haskell <code>process</code> library does this for me.)
The thing is, occasionally (approximately once every 5 to 10 invocations of <code>./worker</code>), <code>posix_spawnp</code> returns <code>EFAULT</code> (&quot;Bad address&quot;).
The manpage for <code>posix_spawnp</code> says that:</p>
<blockquote>
<p><strong>ERRORS</strong></p>
<p>The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails; in these cases, these functions return an error number, which will be one of the errors described for fork(2), vfork(2) or clone(2).</p>
<p>In addition, these functions fail if:</p>
<p><strong>ENOSYS</strong> Function not supported on this system.</p>
</blockquote>
<p>Okay, so I should look for <code>EFAULT</code> in <code>fork(2)</code>, <code>vfork(2)</code> and <code>clone(2)</code> to figure out what goes wrong, right?
Wrong.
Or, in any case, none of those manpages mention <code>EFAULT</code>.
I've looked through the source code of <code>posix_spawnp</code> in glibc and it at least doesn't return <code>EFAULT</code> directly; presumably, one of the subroutines it calls does.
glibc is large and I don't think looking through the entire call tree will be very productive, so I tried to diagnose the issue from the outside instead.</p>
<p>And this is where the weirdness starts.
Whenever my program encounters <code>EFAULT</code> from <code>posix_spawnp</code>, it prints <code>Oops EFAULT</code>; hence grepping for <code>EFAULT</code> gives output precisely if the error occurred in this run.
I observe the following:</p>
<ul>
<li><code>./worker 2&gt;&amp;1 | grep EFAULT</code>: errors occur.</li>
<li><code>./worker 2&gt;&amp;1 | grep EFAULT | cat</code>: errors DO NOT occur.</li>
<li><code>./worker 2&gt;&amp;1 | grep --line-buffered EFAULT | cat</code>: errors occur.</li>
<li><code>./worker 2&gt;&amp;1 | grep --line-buffered EFAULT</code>: errors occur.</li>
</ul>
<p>(&quot;errors occur&quot; means that once every few executions I get output indicating that <code>EFAULT</code> occurred; in the negative case I've run it for &gt;20x the number of invocations that are necessary to produce <code>EFAULT</code> in the other cases, without any <code>EFAULT</code>.)</p>
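<p>Counting this by hand gets tedious, so here is a rough sketch of how the tallying could be automated for the <code>--line-buffered</code> variant without <code>cat</code>. The helper name <code>count_efault_runs</code> and the variable <code>efault_runs</code> are made up for this illustration. It relies on <code>grep</code> exiting with status 0 only if it matched something, so the output never has to be captured; capturing it with <code>$(...)</code> would turn <code>grep</code>'s <code>stdout</code> into a pipe and thereby change the very thing being tested.</p>

```shell
#!/bin/sh
# count_efault_runs RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times through the
# "2>&1 | grep --line-buffered EFAULT" pipeline and stores in $efault_runs
# the number of runs in which grep saw at least one matching line.
count_efault_runs() {
    runs=$1
    shift
    efault_runs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # The exit status of a pipeline is that of its last command, and
        # grep exits 0 only if at least one line matched.
        if "$@" 2>&1 | grep --line-buffered EFAULT; then
            efault_runs=$((efault_runs + 1))
        fi
        i=$((i + 1))
    done
}
```

<p>Usage would look like <code>count_efault_runs 50 ./worker; echo "$efault_runs of 50"</code>. For the variants ending in <code>| cat</code> the pipeline's exit status is that of <code>cat</code>, so those would need something like bash's <code>PIPESTATUS</code> instead.</p>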
<p>The only situation in which <code>posix_spawnp</code> seems to always succeed is when the process that <code>worker</code>'s output is piped to has a block-buffered <code>stdout</code>.
But this makes no sense: there shouldn't be any reasonable way in which <code>worker</code> <em>can</em> determine whether this is the case!
Sure, it can distinguish between <code>./worker | cat</code> and <code>./worker</code> (using <code>isatty(3)</code> -- this is precisely what <code>grep</code> does when not passed <code>--line-buffered</code>), but in all of the above cases the output is piped to another process anyway.</p>
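<p>To make the <code>isatty(3)</code> point concrete, here is a small sketch in shell, where <code>[ -t 1 ]</code> performs the same check on file descriptor 1. The function name <code>describe_stdout</code> is invented for this example. It reports on <code>stderr</code>, so the answer stays visible even when <code>stdout</code> is piped away.</p>

```shell
#!/bin/sh
# describe_stdout: hypothetical helper that reports what kind of stdout it
# has been given. The report goes to stderr so it is not swallowed by
# whatever stdout is connected to.
describe_stdout() {
    # [ -t 1 ] succeeds if and only if file descriptor 1 is a terminal.
    if [ -t 1 ]; then
        echo "terminal" >&2
    else
        echo "not a terminal" >&2
    fi
}
```

<p>Both <code>describe_stdout | cat</code> and <code>describe_stdout | grep x | cat</code> report <code>not a terminal</code>: the function only sees that its own <code>stdout</code> is a pipe, not what the process at the other end does with its output.</p>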
<p>This is already spooky, but it gets even spookier: if I replace the invocation of <code>./sleep.sh</code> by an invocation of <code>sleep</code> (i.e. removing the indirection of the shell script), errors occur in none of the above setups.
Somehow, starting a script is different from starting a native process (and changing <code>bash</code> to <code>dash</code> in <code>sleep.sh</code> doesn't change anything).
<code>posix_spawnp</code> shouldn't care what it is starting!
That's the job of the loader, as far as I know.
So what gives?</p>
<h3>The cause</h3>
<p><s>I'll try to reduce my own program to a minimal reproducer, and if I find anything I'll post an update to this post.
In the meantime, spookiness.</s></p>
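<p><code>snap-server</code> <a href="https://github.com/snapframework/snap-server/blob/8d89c10014d8d295bfbf5419bbb8551de32d7f85/src/Snap/Http/Server.hs#L161">modifies the environment</a> to set the locale, and <code>setenv(3)</code> is neither atomic nor thread-safe.
In particular, when a <code>setenv</code> call in one thread races with an <code>execve(2)</code> that is reading that same environment, the <code>execve</code> can fail, and this is what happens here: the <code>EFAULT</code> comes from the <code>execve</code> that <code>posix_spawnp</code> performs, not from <code>fork</code>, <code>vfork</code> or <code>clone</code>.
All possible solutions to this problem are hacks.</p>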