s6-rc-init hangs during stage2 from Vasil Yordanov on 2016-06-06 (supervision)

From: Vasil Yordanov <vasil.yordanov_at_gmail.com>
Date: Mon, 6 Jun 2016 12:35:49 +0300

Hello Laurent,

I have an issue with s6-rc, using the latest master code of the s6, s6-rc,
s6-linux-init.

When I run a docker container with s6 installed on it, from time to time it
hangs on s6-rc-init.

This happens from four times, ones or even more rear, but unfortunately
happens :(

I have investigated a bit more the issue with strace, but it seems more
complex to understand the core reason of the problem.

Here is my setup.

Docker, with ENTRYPOINT "/etc/s6-linux-init/init"
root_at_25b2cb1a6998 / # cat /etc/s6-linux-init/init
#!/bin/execlineb -P

/bin/export PATH "/usr/bin:/usr/sbin:/bin:/sbin"
/bin/cd /
s6-setsid -qb --
umask 022
if { s6-echo -- "\n init created by s6-linux-init-maker\n seehttp://
skarnet.org/software/s6-linux-init/\n\n
<http://skarnet.org/software/s6-linux-init/%5Cn%5Cn>" }
#if { s6-mount -nwt tmpfs -o mode=0755 tmpfs "/run" }
if { s6-hiercopy "/etc/s6-linux-init"/run-image "/run" }
emptyenv -p
s6-envdir -I -- "/etc/s6-linux-init"/env
redirfd -r 0 /dev/null
redirfd -wnb 1 "/run"/service/s6-svscan-log/fifo
background
{
  s6-setsid --
  redirfd -w 1 "/run"/service/s6-svscan-log/fifo
  fdmove -c 1 2
  "/etc/rc.init"
}
unexport !
cd "/run"/service
fdmove -c 2 1
s6-svscan -st0

and the rc.init is
root_at_25b2cb1a6998 / # cat /etc/rc.init
#!/bin/sh
strace -o /tmp/str_s6-rc-init s6-rc-init -c /tmp/s6-rc-compiled-dir
/run/service
s6-rc change

Currently I have put strace to trace s6-rc-init.

Here is how my processes looks like when it hangs:
root 1 0 0 Jun04 ? 00:00:00 s6-svscan -st0
root 7 1 0 Jun04 ? 00:00:00 /bin/sh /etc/rc.init
root 8 1 0 Jun04 ? 00:00:00 s6-supervise s6-svscan-log
nobody 9 8 0 Jun04 ? 00:00:00 s6-log -bp -- t
/run/uncaught-logs
root 10 7 0 Jun04 ? 00:00:00 strace -o
/tmp/str_s6-rc-init s6-rc-init -c /tmp/s6-rc-compiled-dir /run/service
root 13 10 0 Jun04 ? 00:00:00 s6-rc-init -c
/tmp/s6-rc-compiled-dir /run/service
root 14 13 0 Jun04 ? 00:00:00 s6-ftrigrd
....

You can see there is one additional process s6-ftrigrd.

In the strace log you can see:

....
open("/run/s6-rc/scandir/.s6-svscan/control", O_WRONLY|O_NONBLOCK) = 6
fcntl(6, F_GETFL) = 0x8801 (flags
O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(6, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
write(6, "a", 1) = 1
close(6) = 0
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 950409000},
NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {1152921504606846975,
909368723})
recvmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"\0\0\0\4\0\0\0\f!s\0\0\0\4\0\0\0
!s\0\0\0\4\0\0\0\"!s", 2047}, {NULL, 0}], msg_controllen=0,
msg_flags=MSG_CMSG_C
LOEXEC}, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = 30
recvmsg(5, 0x7ffd3eaca250, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = -1
EAGAIN (Resource temporarily unavailable)
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 909072000},
NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {1152921504606846975,
907791669})
recvmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"\0\0\0\4\0\0\0\
0!s\0\0\0\4\0\0\0#!s\0\0\0\4\0\0\0!!s\0\0"..., 2018}, {"", 29}],
msg_controllen=0, msg_flags=MSG_C
MSG_CLOEXEC}, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = 350
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 907695000},
NULL, 8

it seems s6-rc-init is hanging and waiting for some event.

The normal strace without hangs is
....
open("/run/s6-rc/scandir/.s6-svscan/control", O_WRONLY|O_NONBLOCK) = 6
fcntl(6, F_GETFL) = 0x8801 (flags
O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(6, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
write(6, "a", 1) = 1
close(6) = 0
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 950277000},
NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {1152921504606846975,
911900084})
recvmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"\0\0\0\4\0\0\0\0!s", 2047},
{NULL, 0}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC},
MSG_DONTWAIT|MSG_WAITALL|M
SG_CMSG_CLOEXEC) = 10
recvmsg(5, 0x7fff43f02640, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = -1
EAGAIN (Resource temporarily unavailable)
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 911598000},
NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {1152921504606846975,
910204953})
recvmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"\0\0\0\4\0\0\0%!
s\0\0\0\4\0\0\0$!s\0\0\0\4\0\0\0\1!s\0\0"..., 2038}, {"", 9}],
msg_controllen=0, msg_flags=MSG_CM
SG_CLOEXEC}, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = 290
recvmsg(5, 0x7fff43f02640, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = -1
EAGAIN (Resource temporarily unavailable)
ppoll([{fd=5, events=POLLIN|POLLHUP}], 1, {1152921504606846975, 910117000},
NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {1152921504606846975,
909970613})
recvmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"\0\0\0\4\0\0\0#!
s\0\0\0\4\0\0\0\t!s\0\0\0\4\0\0\0\6!s\0\0"..., 1748}, {"", 299}],
msg_controllen=0, msg_flags=MSG
_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = 80
recvmsg(5, 0x7fff43f02640, MSG_DONTWAIT|MSG_WAITALL|MSG_CMSG_CLOEXEC) = -1
EAGAIN (Resource temporarily unavailable)
close(4) = 0
close(5) = 0
wait4(14, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14, si_status=0,
si_utime=0, si_stime=1} ---
exit_group(0) = ?
+++ exited with 0 +++

And the processes are:
root 1 0 0 06:11 ? 00:00:00 s6-svscan -st0
root 8 1 0 06:11 ? 00:00:00 s6-supervise s6-svscan-log
nobody 9 8 0 06:11 ? 00:00:00 s6-log -bp -- t
/run/uncaught-logs
...

I have observed that the hangs of s6-rc-init happens more often when I have
more services. Currently I have 38 services and 2 bundles that group these
services

Please help me to resolve this issue.

Regards,
Vasil

PS. do we actually need to run s6-rc-init on each boot as a first line in
the stage2 script?
Received on Mon Jun 06 2016 - 09:35:49 UTC

This archive was generated by hypermail 2.3.0 : Sun May 09 2021 - 19:44:19 UTC