s6-svscan does not terminate after SIGTERM under amd64 (Docker) qemulation

From: Saj Goonatilleke <saj_at_discourse.org>
Date: Fri, 4 Feb 2022 15:42:38 +1100

Hello,

I stumbled on what might be an odd bug.

Most of our production workloads are containerised. We use Docker CE
for builds and execution on Linux. Outside of my day job, I have grown
quite fond of s6, and so it felt natural to make use of it in these
containerised environments.

(Please skip the next two paragraphs if you are familiar with Docker.)

Many of our staff, myself included, happen to work on Apple machines.
Docker (the company) distribute 'Docker Desktop' for the Mac platform.
Essentially, this tool works by shipping a tiny Linux virtual machine
and a whole lot of magic that plumbs the VM into the host operating
system. The magic works surprisingly well.

Over the past couple of years, Apple have pivoted their computing
products away from amd64/x86-64 to their own arm64-based CPUs. This
upstream change would have posed some inconvenience to Docker's existing
Mac userbase, and so they added automatic amd64 emulation on arm64 host
machines -- courtesy of the QEMU user space emulator. The end result
provides user-friendly (easy!) execution of linux/amd64 executables on a
darwin/arm64 host. Again, the magic works surprisingly well.
Performance aside, I had never been tripped up by an observable
difference in behaviour between these platforms until today... in s6.

Included below is the source for a simple container. PID 1 in the
container is s6-svscan. s6-svscan forks one s6-supervise process, which
in turn forks a simple sleep(1). The experiment uses s6 as bundled with
Debian bullseye; deb version 2.10.0.2-1.

https://packages.debian.org/bullseye/s6
https://salsa.debian.org/zhsj/s6/-/tree/debian/2.10.0.2-1

Interesting things happen when we signal s6-svscan with SIGTERM:

docker kill -s TERM <container-name>

Without any QEMU emulation, the supervision tree and s6-svscan are
terminated, and the container then ceases execution as one would
expect. This works fine with dockerd executing natively on a
linux/amd64 host machine, and also when Docker Desktop is executing
native arm64 images on an Apple M1 Pro.

However, if we repeat the experiment with an amd64 image on an arm64 Mac
(i.e.: with qemu) something completely different happens. None of the
three processes are terminated. s6-svscan continues running. The
container never terminates.

Note that svc/.s6-svscan/SIGTERM is not marked executable. This was
intentional. After reading src/supervision/s6-svscan.c, I thought this
might provide a hint as to what is happening.

Output from the happy case (without qemu) is as expected:

> % docker run --rm ...
> *signalled*
> s6-svscan: warning: unable to spawn .s6-svscan/SIGTERM: Permission denied
> *terminates*

Output from the sad case (with qemu) is quite curious:

> % docker run --rm ...
> WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
> *signalled*

Notably, there is no stderr write from strerr_warnwu2sys() to say the
call to posix_spawn(3) failed.

ptrace is not available in qemu, though we do have qemu -strace, which
sort of emulates the output from Linux's strace(1). Things get even
more curious here.

For this particular test, I forked s6-svscan from tini. tini was PID 1.
This made the initial signal delivery more obvious in the trace.
Problematic s6-svscan behaviour otherwise appeared to be the same as
the test without tini.

PID 1 is tini.
PID 11 is s6-svscan.
PID 25 appears to be the process forked for .s6-svscan/SIGTERM.

> 1 wait4(4294967295,274903107948,1,0,0,274905173824) = 0
> 1 rt_sigtimedwait(274903108064,274903107856,274877935648,8,0,274905173824) = -1 errno=11 (Resource temporarily unavailable)
> 1 wait4(4294967295,274903107948,1,0,0,274905173824) = 0
> 1 rt_sigtimedwait(274903108064,274903107856,274877935648,8,0,274905173824) = -1 errno=11 (Resource temporarily unavailable)
> 1 wait4(4294967295,274903107948,1,0,0,274905173824) = 0
> 1 rt_sigtimedwait(274903108064,274903107856,274877935648,8,0,274905173824) = 15
> 1 kill(11,SIGTERM) = 0
> 1 wait4(4294967295,274903107948,1,0,0,274905173824) = 0
> 11 ppoll(274903108336,2,274903106832,0,8,1) = 1
> 11 clock_gettime(CLOCK_MONOTONIC,0x0000004001808560) = 0 ({tv_sec = 356286,tv_nsec = 304382892})
> 11 read(6,0x1808550,128) = 128
> 11 mmap(NULL,36864,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS|0x20000,-1,0) = 0x0000004001a89000
> 11 rt_sigprocmask(SIG_BLOCK,0x0000004001a09f40,0x0000004001808100) = 0
> 11 clone(CLONE_VM|CLONE_VFORK|0x11,child_stack=0x0000004001a91ff0,parent_tidptr=0x0000004001a74740,tls=0x0000000000000000,child_tidptr=0x00000040018085d0) = 25
> 11 munmap(0x0000004001a89000,36864) = 0
> 11 rt_sigprocmask(SIG_SETMASK,0x0000004001808100,NULL) = 0
> 11 read(6,0x1808550,128) = -1 errno=11 (Resource temporarily unavailable)
> 11 clone(CLONE_VM|CLONE_VFORK|0x11,child_stack=0x0000004001a91ff0,parent_tidptr=0x0000004001a74740,tls=0x0000000000000000,child_tidptr=0x00000040018085d0) = 0
> 25 rt_sigprocmask(SIG_BLOCK,NULL,0x0000004001a91ea0) = 0
> 25 rt_sigaction(SIGHUP,NULL,0x0000004001a91dd0) = 0
> 25 rt_sigaction(SIGHUP,0x0000004001a91d30,NULL) = 0
> ...
> 25 rt_sigaction(61,NULL,0x0000004001a91dd0) = 0
> 25 rt_sigaction(61,0x0000004001a91d30,NULL) = 0
> 25 rt_sigprocmask(SIG_SETMASK,0x00000040018084b8,NULL) = 0
> 25 execve(".s6-svscan/SIGTERM",{".s6-svscan/SIGTERM",NULL}) = -1 errno=13 (Permission denied)
> 25 exit_group(127)
> 11 ppoll(274903108336,2,274903106832,0,8,1) = 1
> 11 clock_gettime(CLOCK_MONOTONIC,0x0000004001808560) = 0 ({tv_sec = 356286,tv_nsec = 308767933})
> 11 read(6,0x1808550,128) = 128
> 11 read(6,0x1808550,128) = -1 errno=11 (Resource temporarily unavailable)
> 11 wait4(4294967295,274903106972,1,0,8,1) = 25
> 11 wait4(4294967295,274903106972,1,0,8,1) = 0
> 1 rt_sigtimedwait(274903108064,274903107856,274877935648,8,0,274905173824) = -1 errno=11 (Resource temporarily unavailable)
> 1 wait4(4294967295,274903107948,1,0,0,274905173824) = 0

(Many intervening calls to rt_sigaction() redacted for brevity.)

The posix_spawn(3) call is evident in the trace, but there is no
evidence of any follow-up to the failure: s6-svscan should call term()
and die soon afterwards, yet that never happens.

Repeating the test with no .s6-svscan/SIGTERM file produced a similar
result (albeit with errno=2 on exec).

At this stage, I am unsure where the problem lies. It may not be in s6.
Aside from this quirk, s6 does seem to function just fine under qemu.
Non-trivial supervision trees come up without a hitch. I first noticed
this problem while working on a container with supervised HAProxy and
Go processes -- all non-trivial programs. No problems were observed in
these other programs under emulation.

I suppose the next step would be to avail myself of some real debugging
tools -- or at least put myself in a position where I can build an s6
with some simple printf debug statements. In the meantime, I thought
I'd reach out in case I have missed something obvious.

Test input:

> % tar --format=mtree --options 'mtree:!all,mtree:flags,mode,size,type' -c .
> #mtree
> . mode=700 type=dir
> ./Dockerfile mode=644 type=file size=389
> ./svc mode=755 type=dir
> ./svc/.s6-svscan mode=755 type=dir
> ./svc/.s6-svscan/SIGTERM mode=644 type=file size=0
> ./svc/sleep mode=755 type=dir
> ./svc/sleep/run mode=755 type=file size=43

> % find -s . \! -type d -print0 | xargs -0 grep .
> ./Dockerfile:FROM --platform=linux/amd64 debian:bullseye-slim
> ./Dockerfile:RUN DEBIAN_FRONTEND=noninteractive apt-get update \
> ./Dockerfile: && DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade \
> ./Dockerfile: && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends \
> ./Dockerfile: procps \
> ./Dockerfile: psmisc \
> ./Dockerfile: s6 \
> ./Dockerfile: && DEBIAN_FRONTEND=noninteractive apt-get clean
> ./Dockerfile:COPY svc/ svc/
> ./Dockerfile:ENTRYPOINT ["/usr/bin/s6-svscan", "/svc"]
> ./svc/sleep/run:#!/bin/sh
> ./svc/sleep/run:# GNU sleep
> ./svc/sleep/run:exec sleep infinity

Cheers,

-- 
Saj Goonatilleke