The Container Anatomy: A Kernel Introduction

Welcome to this tutorial series, where we start from the anatomy of a container inside the Linux kernel and keep building, piece by piece, until we publish a service on an orchestration platform. The general idea is to detail as much as possible (without being overwhelming) how things work under the hood.

In this very first article, we will look at what a container actually is, to build the proper mindset for working with one. This matters for troubleshooting and for architecture decisions, where you need to understand well how something works, in this case a container. A good starting point is to recall how virtualization evolved.

In virtualization (generally speaking), we have an operating system (let's use a common word to describe this one: Bastion) that operates the hardware directly and exposes it to its virtual machines, which are basically processes running in Bastion. This image helps to illustrate this explanation:

This exposure enables these processes, the virtual machines, to operate the hardware in different ways (bypassing instructions and so on) so that they can do their work. From Bastion's point of view each virtual machine is just a process, but each of those processes carries its own operating system to run its applications. This is what we call full virtualization.

When we talk about containers, the general idea is the same, a process that runs in Bastion, but with one big difference: Bastion doesn't expose the underlying hardware, and the process doesn't need another operating system on top of it to run its applications.

This is done using a long-established, rock-stable kernel feature called namespaces. Namespaces are an abstraction layer inside kernel space that partitions kernel subsystems: each process sees its own instance of a subsystem and behaves as if it were running on its own kernel, while in reality all processes share the single kernel of the underlying host.

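At the time of writing, this article covers six kernel namespaces, each isolating its own kernel subsystem (newer kernels add more, such as the cgroup namespace that shows up in the container listing later in this article):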

IPC (Inter-Process Communication) - Introduced in kernel 2.6.19. Isolates certain System V IPC objects and, since 2.6.30, POSIX message queues;
Network - Introduced in kernel 2.6.24 and finished in 2.6.29. Isolates the logical resources used for network communication, such as network interfaces, routing tables, IP addresses and so on;
Mount - Introduced in kernel 2.4.19. Isolates the mount points seen by the process;
PID (Process Identifier) - Introduced in kernel 2.6.24. Isolates the process ID space: inside the namespace, each process can have its own process number without conflicting with Bastion's PID namespace. The processes in a PID namespace can be migrated to a different Bastion while keeping the same PIDs;
User - Introduced in kernel 2.6.23 and completed in 3.8. Isolates the user and group ID space; in other words, a user or group ID inside the container can differ from the ID of the same user or group in Bastion;
UTS - Introduced in kernel 2.6.19. Isolates the host-wide identifiers `nodename` and `domainname`, returned by the `uname()` syscall. In the context of containers, it allows each container to have its own hostname and domain name.

Basically, when we create a container, the container engine asks the kernel for a new instance (a new "table") of each namespace in which the container will run. To Bastion, the container looks like an ordinary process; to the process, it looks like a brand new dedicated OS, but it is not. This is the main difference from virtual machines: containers are lighter and faster, which is why we can spin one up in less than a second and why it uses far less disk space than a virtual machine.
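The idea of asking the kernel for a fresh namespace can be tried without any container engine. The sketch below is only an illustration, not what a container engine actually runs: it assumes the `unshare` tool from util-linux and the `hostname` command are installed, and that the kernel allows unprivileged user namespaces (otherwise run it as root and drop the two user-namespace options). The function name `uts_demo` and the hostname `demo-container` are made up for this example.

```shell
#!/bin/sh
# Sketch: change the hostname inside a new UTS namespace only.
# Assumptions: util-linux `unshare`, the `hostname` command, and a kernel
# that permits unprivileged user namespaces.
uts_demo() {
    # --user --map-root-user : new user namespace in which we appear as root
    # --uts                  : new UTS namespace, so the rename stays private
    unshare --user --map-root-user --uts \
        sh -c 'hostname "$1" && uname -n' sh "$1"
}

echo "host before:      $(uname -n)"
echo "inside namespace: $(uts_demo demo-container)"
echo "host after:       $(uname -n)"
```

If the assumptions hold, the middle line reports the new name while the first and last lines stay identical, because the rename happened in a UTS namespace that only the short-lived child process belonged to.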

A little hands-on work makes this theory concrete. On any Linux system, the following command shows which namespaces a given process runs in:

# ls -lai /proc/1/ns

ipc -> 'ipc:[4026531839]'
mnt -> 'mnt:[4026531840]'
net -> 'net:[4026531992]'
pid -> 'pid:[4026531836]'
pid_for_children -> 'pid:[4026531836]'
user -> 'user:[4026531837]'
uts -> 'uts:[4026531838]'

The namespaces used by Init System
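Reading `/proc/1/ns` normally requires root. As a hedged alternative, the sketch below prints the same information for the current shell, which any user can read. The helper name `list_ns` is hypothetical; it relies only on `readlink` and on each entry under `/proc/<pid>/ns` being a symlink whose target looks like `uts:[4026531838]`.

```shell
#!/bin/sh
# Print "type inode" for each namespace of a PID (default: this shell).
list_ns() {
    pid="${1:-$$}"
    for link in /proc/"$pid"/ns/*; do
        # Skip entries that do not exist or that we are not allowed to read.
        target=$(readlink "$link" 2>/dev/null) || continue
        inode=${target#*\[}     # drop everything up to and including "["
        inode=${inode%\]}       # drop the trailing "]"
        printf '%s %s\n' "$(basename "$link")" "$inode"
    done
}

list_ns        # namespaces of the current shell
```

Passing a PID, for example `list_ns 1` as root, gives the listing for that process instead.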

Inside the virtual `/proc` filesystem lives the runtime state of the kernel. During boot, the very first process the kernel starts is the init system. In Enterprise Linux 6 this was Upstart (which ran System V style init scripts), and in Enterprise Linux 7 and above it is systemd. systemd is responsible for starting the other boot-time processes and for setting up the baseline that lets kernel space and user space interact.

When we run the `ls` command with the `-i` option, we are requesting the inode number of each file (remember, everything in Linux is a file!), and here each file represents a different namespace. If the same `ls` on a different process returns different inode numbers, that process is running in different namespaces, which is exactly what happens with containers:

# docker container ps

565a6681bf8c [...Output Omitted...] ecstatic_nightingale

# docker container inspect --format '{{.State.Pid}}, {{.Name}}' ecstatic_nightingale
10494, /ecstatic_nightingale

# ls /proc | grep -i 10494
10494

# ls -lai /proc/10494/ns/

cgroup -> 'cgroup:[4026531835]'
ipc -> 'ipc:[4026532611]'
mnt -> 'mnt:[4026532609]'
net -> 'net:[4026532614]'
pid -> 'pid:[4026532612]'
pid_for_children -> 'pid:[4026532612]'
user -> 'user:[4026531837]'
uts -> 'uts:[4026532610]'

Namespaces used by a container process

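Above, I retrieved the container's PID by inspecting the container by name, then listed the namespaces used by that process. Comparing the two listings, every inode differs from the init system's except user, so this container has its own ipc, mnt, net, pid and uts namespaces while sharing Bastion's user namespace.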
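The inode comparison done by eye above can also be scripted. This is a minimal sketch under the same assumptions as the listings (a Linux host with `/proc` mounted, and enough privileges to read the other process's `ns` directory, which usually means root); `compare_ns` is a hypothetical helper name.

```shell
#!/bin/sh
# Report, per namespace type, whether two PIDs share the same namespace.
compare_ns() {
    for ns in ipc mnt net pid user uts; do
        a=$(readlink "/proc/$1/ns/$ns" 2>/dev/null)
        b=$(readlink "/proc/$2/ns/$ns" 2>/dev/null)
        if [ -z "$a" ] || [ -z "$b" ]; then
            echo "$ns unknown"      # entry missing or not readable
        elif [ "$a" = "$b" ]; then
            echo "$ns shared"
        else
            echo "$ns isolated"
        fi
    done
}

# Comparing a process with itself can never report "isolated".
compare_ns $$ $$
```

With the PIDs from this article, `compare_ns 1 10494` run as root should mark user as shared and the other five as isolated, matching the inode numbers in the two listings.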

Conclusion: To Bastion, a container is a simple process attached to different namespaces. Inside the container, we can operate everything (within its limitations) as if it were a separate host, only lightweight and secure. In the next article, we will start our very first container and look for the points we learned in this one.

About the author - Sudip is a Solution Architect with more than 15 years of working experience and likes sharing his knowledge by regularly writing for Hackernoon, DZone, Appfleet and many more. And while he is not doing that, he must be fishing or playing chess.

Previously posted at https://appfleet.com/.