My process became PID 1 and now signals behave strangely

Or let's write our own init process

When your process runs as PID 1 in a Docker container, signal handling behaves differently to what you might expect.

First let's sanity check what happens when a process is not PID 1 on a “normal” system.

A simple Python process that just sleeps

Aarons-iMac:bin aaronkalair$ cat mypy.py
import subprocess

subprocess.call(["sleep", "100"])

And if we run it and send SIGTERM

Aarons-iMac:init-proc aaronkalair$ ps -ef | grep python
501 14013 6588 0 2:08pm ttys004 0:00.02 python mypy.py

Aarons-iMac:bin aaronkalair$ kill 14013
Terminated: 15

It gets terminated, nothing surprising here

And now let’s run it as PID 1 in a Docker container

Aarons-iMac:bin aaronkalair$ cat Dockerfile
from ubuntu:16.04

RUN apt-get update
RUN apt-get install -y python
COPY mypy.py /srv/

CMD ["python", "/srv/mypy.py"]

Run this container, exec in and then send the same signal

Aarons-iMac:init-proc aaronkalair$ docker exec -it 0229aa205b48 bash

root@0229aa205b48:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:15 ?        00:00:00 python /srv/mypy.py
root         7     1  0 14:15 ?        00:00:00 sleep 100

root@0229aa205b48:/# kill 1

root@0229aa205b48:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:15 ?        00:00:00 python /srv/mypy.py
root         7     1  0 14:15 ?        00:00:00 sleep 100

And now nothing happens!

Let's try this with a Go process that does something similar

package main

import (
    "time"
)

func main() {
    time.Sleep(time.Duration(100000) * time.Millisecond)
}

Pop this into a Docker container, run it, exec in and send it SIGTERM

Aarons-iMac:init-proc aaronkalair$ docker exec -it e6ccf11be060 bash

root@e6ccf11be060:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:28 ?        00:00:00 ./srv/sleep-spawner

root@e6ccf11be060:/# kill 1

root@e6ccf11be060:/# Aarons-iMac:init-proc aaronkalair$

And it's killed, just as it would be if it wasn't running as PID 1

So what’s going on here then?

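Well, PID 1 is special in Linux. Amongst other things, it ignores any signal whose action is still the default, so a signal only has an effect if the process has explicitly registered a handler for it. The Python script above registers no SIGTERM handler, so the signal is dropped, whereas the Go runtime installs its own signal handlers at startup, which is why the Go program still exits. From the Docker docs: https://docs.docker.com/engine/reference/run/#foreground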

Note: A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on _SIGINT_ or _SIGTERM_ unless it is coded to do so.

We could just define handlers for those signals in every process we want to run in a Docker container but this is a lot of work and we may not have the source code to do so. Furthermore there are other responsibilities for PID 1 that we’ll explore later.
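For a process whose source we do control, coding it to terminate on SIGTERM only takes a few lines. The following is a minimal sketch in Go and is not taken from the programs above: the function name handlesSIGTERM and the one second timeout are made up for illustration, and the program sends SIGTERM to itself purely so that it finishes when run standalone.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// handlesSIGTERM (hypothetical helper) registers an explicit handler for
// SIGTERM, sends SIGTERM to this process, and reports whether the handler
// received it before the timeout.
func handlesSIGTERM() bool {
	sigs := make(chan os.Signal, 1)
	// Registering interest replaces the default action for SIGTERM,
	// which is what PID 1 would otherwise silently ignore.
	signal.Notify(sigs, syscall.SIGTERM)
	defer signal.Stop(sigs)

	// Stand-in for `kill <pid>` run from another shell.
	if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
		return false
	}

	select {
	case sig := <-sigs:
		fmt.Println("received", sig, "- shutting down")
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	if !handlesSIGTERM() {
		fmt.Println("no signal received")
	}
}
```

A real service would block on the channel in its main loop and do its cleanup before exiting, rather than signalling itself.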

So instead we could run a different process as PID 1 and have it proxy signals to the actual process we want to run and perform the other duties of a standard init process

There are numerous solutions that do this, for example:

Yelp's dumb-init: https://github.com/Yelp/dumb-init

Tini, which is shipped with Docker: https://docs.docker.com/engine/reference/run/#specify-an-init-process

And many more which you can find by searching around.

But I’m going to write my own…

So let's start with the basics. I need a program that takes the name of another program (plus its arguments) and executes it:

package main

import (
    "os"
    "os/exec"
)

func main() {
    // Run the command given as our arguments, e.g. init-proc python /srv/mypy.py
    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    err := cmd.Start()
    if err != nil {
        panic(err)
    }
    // Block until the child exits, which also cleans it up
    err = cmd.Wait()
    if err != nil {
        panic(err)
    }
}

A couple of things about how we do this will be important later.

After we Start() the new process we call Wait(). This blocks until the command exits and then cleans up any resources associated with it.

Failing to wait on a process you spawn leads to zombie processes, which hang around after they've finished executing and keep occupying a slot in the kernel's process table.

From the man page — http://man7.org/linux/man-pages/man2/waitpid.2.html#NOTES

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.
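To make the Start() then Wait() pairing concrete, here is a small sketch that also recovers the child's exit code once it has been waited on. The helper name runAndWait is hypothetical and the `sh -c "exit 3"` command is just a convenient child that exits with a known status; it assumes a Unix-like system with sh on the PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAndWait (hypothetical helper) starts the command and then waits for it.
// Waiting is what lets the kernel drop the child's process table entry, so
// the child never lingers as a zombie. It returns the child's exit code.
func runAndWait(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return -1, err
	}

	err := cmd.Wait()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The child ran but exited with a non-zero status.
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err
	}
	return 0, nil
}

func main() {
	code, err := runAndWait("sh", "-c", "exit 3")
	fmt.Println("exit code:", code, "error:", err)
}
```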

So let's try out the beginnings of our signal proxy. If we run it in a container…

CMD ["./srv/init-proc", "/srv/sleep-spawner", "1"]

We can see that our proxy process is now PID 1 and has spawned off sleep-spawner

root@36c4892039db:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 17:45 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        11     1  0 17:45 ?        00:00:00 /srv/sleep-spawner 1

Alright, the next step is to register our interest in all the possible signals

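func main() {
    signalChannel := make(chan os.Signal, 2)
    // With no signals listed, Notify relays every incoming signal to the channel
    signal.Notify(signalChannel)
    pid := -1

    go sigHandler(&pid, signalChannel)

    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    err := cmd.Start()
    // Check the error before touching cmd.Process, which is nil if Start failed
    if err != nil {
        panic(err)
    }
    pid = cmd.Process.Pid

    err = cmd.Wait()
    if err != nil {
        panic(err)
    }
}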

With sigHandler defined as:

func sigHandler(pid *int, signalChannel chan os.Signal) {
    var sigToSend syscall.Signal = syscall.SIGHUP
    for {
        sig := <-signalChannel
        switch sig {
        // #1 - Sent when the controlling terminal is closed, typically used by daemonised processes to reload config
        case syscall.SIGHUP:
            sigToSend = syscall.SIGHUP
        // #2 - Like pressing CTRL+C
        case syscall.SIGINT:
            sigToSend = syscall.SIGINT
        // ..... repeat for all signals
        }
        // Until the child has started pid is still -1, and kill(-1, ...) would signal every process
        if *pid <= 0 {
            continue
        }
        syscall.Kill(*pid, sigToSend)
    }
}

It simply switches on all the signals Go supports — https://golang.org/pkg/syscall/#pkg-constants

And then uses the kill system call to send the signal through to the process that's being run.
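As an alternative to switching on every signal by hand, the forwarding step can be sketched with os.Process.Signal, which accepts whatever os.Signal came off the channel. This is an illustrative sketch and not the init-proc source: runWithForwarding and forwardDemo are hypothetical names, only SIGTERM is registered, and the demo signals itself in place of a `kill` from another shell. It assumes a Unix-like system with a sleep binary.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runWithForwarding (hypothetical) starts cmd, relays every signal arriving
// on sigs to it, and returns the number of the signal that terminated the
// child, or 0 if the child exited by itself.
func runWithForwarding(cmd *exec.Cmd, sigs <-chan os.Signal) (int, error) {
	if err := cmd.Start(); err != nil {
		return 0, err
	}

	done := make(chan struct{})
	defer close(done)
	go func() {
		for {
			select {
			case sig := <-sigs:
				// Ignore errors: the child may already have exited.
				_ = cmd.Process.Signal(sig)
			case <-done:
				return
			}
		}
	}()

	err := cmd.Wait()
	var exitErr *exec.ExitError
	if err != nil && !errors.As(err, &exitErr) {
		return 0, err
	}
	if ws, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		return int(ws.Signal()), nil
	}
	return 0, nil
}

// forwardDemo (hypothetical) sends SIGTERM to this process and lets the
// proxy pass it on to a long-running `sleep` child.
func forwardDemo() (int, error) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	defer signal.Stop(sigs)

	// Stand-in for `kill <pid>` run from another shell.
	if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
		return 0, err
	}
	return runWithForwarding(exec.Command("sleep", "100"), sigs)
}

func main() {
	sig, err := forwardDemo()
	fmt.Println("child terminated by signal number:", sig, "error:", err)
}
```

Because the child is only signalled after Start() has returned, there is no window in which a signal could be sent to a placeholder PID.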

Now let's use it to run our Python program and see if it handles SIGTERM correctly.

Aarons-iMac:init-proc aaronkalair$ docker exec -it 579ef1d3ce77 bash

root@579ef1d3ce77:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:33 ?        00:00:00 ./srv/init-proc python /srv/mypy.py
root        13     1  0 18:33 ?        00:00:00 python /srv/mypy.py
root        14    13  0 18:33 ?        00:00:00 sleep 100

root@579ef1d3ce77:/# kill 1

root@579ef1d3ce77:/# Aarons-iMac:init-proc aaronkalair$

And it works!

Now let’s take care of another thing PID 1 is responsible for, cleaning up Zombie processes.

Imagine this scenario

A — spawns -> B — spawns-> C

Now if B dies or exits before C, C becomes an orphan process. Who is C's parent now?

Well the operating system is responsible for reparenting orphan processes to PID 1, so it now looks like

A — parent of -> C

Now when C exits, A will receive the SIGCHLD signal and is responsible for calling wait on C to clean up this zombie process.

So let's add this logic to the SIGCHLD case:

case syscall.SIGCHLD:
    var status syscall.WaitStatus
    var rusage syscall.Rusage
    syscall.Wait4(-1, &status, syscall.WNOHANG, &rusage)
    sigToSend = syscall.SIGCHLD

-1 means wait for any child process to change state rather than a specific one, as we don't know the ID of the process that has exited when we get the signal.

WNOHANG means that if there are no child processes that have changed state, don't block waiting for one, return immediately.

Performing wait on a terminated child cleans up its resources, preventing it from remaining a zombie process.

From the wait manpage — http://man7.org/linux/man-pages/man2/waitpid.2.html

In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a "zombie" state
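The effect of WNOHANG is easiest to see by polling a child that is still running. In this sketch (the names wait4Retry and pollThenReap are hypothetical) the child's own PID is passed instead of -1, purely so the result doesn't depend on any other children the process might have. It assumes a Unix-like system with a sleep binary.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// wait4Retry calls wait4 and retries if the call is interrupted by a signal.
func wait4Retry(pid int, status *syscall.WaitStatus, options int) (int, error) {
	for {
		wpid, err := syscall.Wait4(pid, status, options, nil)
		if err != syscall.EINTR {
			return wpid, err
		}
	}
}

// pollThenReap (hypothetical) starts a long-running child, polls it once
// with WNOHANG while it is still running, then kills and reaps it.
// It returns the value from the non-blocking poll and whether the final,
// blocking wait reaped the killed child.
func pollThenReap() (int, bool, error) {
	cmd := exec.Command("sleep", "100")
	if err := cmd.Start(); err != nil {
		return 0, false, err
	}
	pid := cmd.Process.Pid
	var status syscall.WaitStatus

	// The child exists but has not changed state, so wait4 returns 0
	// immediately instead of blocking.
	polled, err := wait4Retry(pid, &status, syscall.WNOHANG)
	if err != nil {
		return 0, false, err
	}

	// Kill the child, then wait without WNOHANG so we block until it is reaped.
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
		return polled, false, err
	}
	reapedPid, err := wait4Retry(pid, &status, 0)
	if err != nil {
		return polled, false, err
	}
	return polled, reapedPid == pid && status.Signaled(), nil
}

func main() {
	polled, reaped, err := pollThenReap()
	fmt.Println("poll returned:", polled, "reaped:", reaped, "error:", err)
}
```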

Now there's just one more case to handle. Imagine:

A — spawns -> B — spawns -> C

Now C exits but B doesn’t call wait on it

A — parent of-> B — parent of-> C (defunct zombie process)

wait only works on child processes, so no matter how many times our init process A called wait it wouldn't clean up the resources C was using. (And note that SIGCHLD would only be sent to B, so A wouldn't even be aware of C exiting.)

Now B exits, A receives SIGCHLD, calls wait, and B is cleaned up nicely.

C is now an orphan that gets reparented to A so we have

A — parent of -> C (defunct zombie process)

We can see the above in action with some modifications to our sleeping program, producing processes whose parents exit before their children and don't call wait

func main() {
    MAX_LEVEL := 4

    level, err := strconv.Atoi(os.Args[1])
    if err != nil {
        panic(err)
    }

    // We'll have a bunch of processes that immediately exit at the max level
    if level == MAX_LEVEL {
        return
    }

    // Need the top level to outlive the others, otherwise the container would exit and you wouldn't be able to inspect the process tree
    sleepTime := 0
    if level == 1 {
        sleepTime = 20000000
    } else {
        // Generate processes where children sleep for longer than their parents so parents exit first without waiting on the children, showing what happens to orphan / zombie processes
        sleepTime = level * 1000
    }

    level += 1
    for i := 0; i < 2; i++ {
        // Spawn a command and intentionally don't wait on it
        err := exec.Command("/srv/sleep-spawner", strconv.Itoa(level)).Start()
        if err != nil {
            panic(err)
        }
    }
    time.Sleep(time.Duration(sleepTime) * time.Millisecond)
}

It's available on GitHub here: https://github.com/AaronKalair/sleep-spawner

And if we run this we can see what the process tree looks like:

Aarons-iMac:init-proc aaronkalair$ docker exec -it 854a232d4b89 bash
root@854a232d4b89:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 22:13 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        12     1  0 22:13 ?        00:00:00 /srv/sleep-spawner 1
root        17    12  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>
root        22    12  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>
root        32     1  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>

With our current implementation this will remain the situation forever, so we need to modify it slightly to handle cases like this:

case syscall.SIGCHLD:
    var status syscall.WaitStatus
    var rusage syscall.Rusage
    for {
        retValue, err := syscall.Wait4(-1, &status, syscall.WNOHANG, &rusage)
        // ECHILD just means there are no children left to wait for
        if err == syscall.ECHILD {
            break
        }
        if err != nil {
            panic(err)
        }
        if retValue <= 0 {
            break
        }
    }
    sigToSend = syscall.SIGCHLD

We take advantage of the return value of wait4 when used in combination with WNOHANG to call it in a loop every time we get a SIGCHLD signal.

Again from the man page (wait4's return value conforms to waitpid — http://man7.org/linux/man-pages/man2/waitpid.2.html )

on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On error, -1 is returned.

So we can sit calling Wait4 until we get a return value less than or equal to 0 knowing that it’s cleaning up exited processes.
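Pulled out of the signal handler, the reaping loop can be sketched as a standalone function. This is an illustration under a few assumptions and not the init-proc source: reapZombies and spawnAndReap are hypothetical names, the demo polls on a short timer where a real init would run the loop on each SIGCHLD, and it treats ECHILD (no children at all) as "nothing left to reap" instead of an error.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// reapZombies (hypothetical) keeps calling wait4 with WNOHANG until no
// exited children remain, and returns the PIDs it cleaned up.
func reapZombies() ([]int, error) {
	var reaped []int
	for {
		var status syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		switch {
		case err == syscall.EINTR:
			continue // interrupted by a signal, try again
		case err == syscall.ECHILD:
			return reaped, nil // no children at all
		case err != nil:
			return reaped, err
		case pid == 0:
			return reaped, nil // children exist, but none have exited yet
		}
		reaped = append(reaped, pid)
	}
}

// spawnAndReap (hypothetical) starts n short-lived children without waiting
// on them, then polls reapZombies until all of them have been collected or
// five seconds have passed. It returns how many of them were reaped.
func spawnAndReap(n int) (int, error) {
	spawned := make(map[int]bool)
	for i := 0; i < n; i++ {
		cmd := exec.Command("sh", "-c", "exit 0")
		if err := cmd.Start(); err != nil {
			return 0, err
		}
		spawned[cmd.Process.Pid] = true
	}

	reaped := 0
	deadline := time.Now().Add(5 * time.Second)
	for reaped < n && time.Now().Before(deadline) {
		pids, err := reapZombies()
		if err != nil {
			return reaped, err
		}
		for _, pid := range pids {
			if spawned[pid] {
				reaped++
			}
		}
		time.Sleep(10 * time.Millisecond)
	}
	return reaped, nil
}

func main() {
	n, err := spawnAndReap(3)
	fmt.Println("reaped children:", n, "error:", err)
}
```

Looping matters because several children can exit before the handler gets to run, and one pass through the handler may then need to collect more than one of them.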

Now if we run this and exec inside the container and check with ps

Aarons-iMac:init-proc aaronkalair$ docker exec -it 30f13d4e53bd bash
root@30f13d4e53bd:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 22:05 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        12     1  0 22:05 ?        00:00:00 /srv/sleep-spawner 1
root        17    12  0 22:05 ?        00:00:00 [sleep-spawner] <defunct>
root        18    12  0 22:05 ?        00:00:00 [sleep-spawner] <defunct>

We can see that the zombies parented to PID 1 have now been cleaned up!

And there we have it! We've made a basic init process that lets us send signals to processes running in Docker containers and have them behave the same way they would outside of a container, and that can clean up zombie processes!

See the full source code here — https://github.com/AaronKalair/init-proc

Follow me on Twitter @AaronKalair