Containerization In 2023: Strive for Maximum Modularity via the Cattle Model

Written by chayka | Published 2023/10/20
Tech Story Tags: devops | containerization | namespaces | linux | virtualization | ubuntu | docker-image | kernel

TL;DR: The major trend today is to strive for maximum modularity. This approach makes it possible to build customized environments from a set of libraries and ensure they work almost everywhere. This greatly accelerates the introduction of new solutions to the market, provides portability and fault tolerance, and simplifies load balancing. It also brings other benefits associated with the transition from the "Pets" model (unique, hand-customized services and servers) to the "Cattle" model (homogeneous entities raised and destroyed by the hundreds). However, it is important to remember that Docker is a convenient wrapper around standard kernel mechanisms. These mechanisms are used not only in the context of Docker but also in conjunction with many other technologies.

Experienced professionals may justifiably roll their eyes, believing that containerization has been discussed many times before. They might even say, "Read the documentation." And in a sense, they would be right.

However, new specialists never saw the times when Ubuntu disks were given away for free and have no experience compiling an operating system from scratch. For them, "read the documentation" is easier said than done.

Every new technology starts with enthusiasts who fully immerse themselves in its intricacies. For example, early amateur radio operators knew almost everything about radio communications.

They could build a receiver with their own hands and catch signals on a homemade antenna. Early GNU/Linux users knew the operating system's inner workings. Many even knew how to patch and recompile modem drivers to set up a network connection.

The current generation of users generally skips these basics, because the entry threshold for the technology has become much lower.

These processes are happening not only among end users but also among engineers. On the one hand, deep specialization is normal. On the other hand, specialists risk turning into priests who know the magic rituals but cannot fix something at a low level when it breaks.

This phenomenon is also seen in the context of containerization. These days, more and more DevOps engineers know how to use Docker and Podman and can write a Dockerfile. However, they feel insecure when asked about Linux namespaces or are confused by the question of how containerization differs from deployment via RPM packages.

My name is Tanya, and I am a Senior DevOps Engineer at Plerdy. Today, I would like to discuss the containerization topic from a slightly different angle and shed light on the following aspects:

  • Basic principles on which modern containerization is based
  • The differences between containerization and virtualization
  • What Linux namespaces are

Get your cup of coffee ready, and let's get started on this fascinating journey into the world of containerization.

Abstraction Levels in Containerization

Let’s start with the basics.

Hardware Level

The backbone of modern infrastructure is hardware. It is often said that the cloud is nothing but an interface to hardware in someone else's data center. But that hardware can range from no-name Chinese components with their peculiarities to branded servers from HP or even homemade clusters assembled on Raspberry Pi with individual controllers.

Ideally, developers want their code to run everywhere with minimal changes and without worrying about what hardware is being used - SSD, HDD, or even a refurbished floppy disk just for fun. They also don't want to worry about accessing files, RAM areas, and other low-level details.

Kernel Level

The operating system is used to achieve this abstraction between hardware and software. It provides the necessary system calls that allow interaction with hardware resources. These system calls are often low-level, reliable, and simple: for example, "open a file, read 10 bytes from it, close it."
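To make this concrete, you can watch the system calls an ordinary command issues with strace. The output below is trimmed and illustrative; the exact calls, buffer sizes, and file contents depend on your libc, coreutils build, and machine:

  $ strace -e trace=openat,read,close cat /etc/hostname
  ...                                      (calls made by the loader and libc omitted)
  openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
  read(3, "myhost\n", 131072)              = 7
  close(3)                                 = 0
  myhost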

In theory, you can develop applications using almost nothing but system calls. Let's say you need to create a windowed application without using high-level libraries. You can directly write the window image to the video card buffer, display it on the screen, and respond to keystrokes.

Then, through system calls, you can track mouse coordinates and locate screen clicks to properly handle events like "user pressed button_1".

This is possible, but it is rarely done for graphical applications: the ability to reuse existing code is limited, and it requires highly skilled developers - sometimes up to and including writing machine code for the processor directly.

Library Level

A more common and convenient approach is to use off-the-shelf libraries that wrap system calls in higher-level abstractions and provide convenient interfaces to the computer mouse, network resources, and sound devices. One example is PulseAudio, which provides convenient access to sound devices and audio streams.

User-Space Level

The next level of abstraction is user space. A program running at this level has no direct knowledge of the hardware, operating system, or other low-level details. It is usually just a set of high-level instructions that rely on dependencies in the form of libraries from the previous level.

The operating system creates a separate isolated process for our program. It allows the application to not worry about available resources and other technical details. The program runs inside this process under the notion that it is executed in an isolated environment where it owns all the resources.

In reality, the operating system manages resources and ensures that RAM is allocated, monitors CPU time allocation, and manages task execution priorities. It can "suspend" one application to perform calculations for another, all according to quotas and priorities. This greatly facilitates development and ensures reliable system operation.

How, ultimately, can a program without direct knowledge of the operating system and low-level details interact with all these components and infrastructure? This is accomplished by using dynamically linked libraries.

These libraries are sets of standard routines, separated into their own abstraction layer to simplify interaction with the system and make programming more convenient.

Let's look at working in the console and understand how it functions.

First, we run the file command on the ls binary to confirm that it is dynamically linked to libraries. It will not work without them because it relies on their presence in the OS. Next, using ldd, we can see exactly which libraries the program loads when it runs.
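Here is roughly what that looks like on a typical Ubuntu system; the exact paths, library versions, and addresses will differ on your machine:

  $ file /bin/ls
  /bin/ls: ELF 64-bit LSB pie executable, x86-64, dynamically linked,
  interpreter /lib64/ld-linux-x86-64.so.2, ...

  $ ldd /bin/ls
      linux-vdso.so.1 (0x00007ffd...)
      libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f...)
      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)
      /lib64/ld-linux-x86-64.so.2 (0x00007f...)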

It is important that we link not just any libraries but quite specific versions of them. Minor releases may fix bugs, optimize something, or patch vulnerabilities, while major releases may add new features and break old ones.

In the case of popular libraries, developers try to support all old API calls for as long as possible. Still, they must be removed if they interfere with implementing new architectural solutions.

Usually, they warn about this in advance with long and persistent messages that a particular call is deprecated. And, as usual, nobody does anything about it until the last moment if the business doesn't want to pay for it right now.

No One Likes Maintenance

Maintaining software is hard. Businesses are extremely reluctant to part with money to solve problems that haven't come yet. In addition, there are difficulties in the interaction between the developer and the end user who will run the application.

For example, a developer sits on an ancient Ubuntu release and still can't update. Or, long ago, he dragged a library somewhere deep into the application's codebase, where it is now buried under several layers of code and abstractions. So, he takes an artifact of his application or its source code and gives it to DevOps.

The engineer rolls that application out into the environment, and everything crashes. The application complains that it can't feel its legs or its arms: one library was never brought in, another is of an unknown version, and the calls are wrong.

The DevOps engineer returns to the developer and says, "It doesn't work. You need to figure out how to run it." But he may get a response like, "I don't know anything; I have the same leg, and it doesn't hurt. Everything works for me; I checked."

And that's where the dilemma comes in - what's the best way to reproduce the environment the application needs to fully work? Let's assume I gather the right set of libraries and somehow get the application running in a test environment. But how do I then hand the binary over to the customer?

Should I write a huge instruction on how to set up the environment with the necessary libraries, drag everything in a huge archive, and install it with a set of scripts? Or find a simpler solution?

Distributions Maintenance

Supporting different operating system distributions is an important task with pros and cons, depending on your situation and company resources. Here are a few important aspects.

Pros of supporting multiple distributions:

  • Accessibility to a wider audience. Supporting different distributions makes your application accessible to more users, as different organizations and individuals may prefer different distributions.

  • End-user friendly. Users can install the application with minimal effort using package management tools such as yum or apt-get.

  • Compatibility. Supporting different distributions helps to ensure compatibility with different operating system versions and libraries.

Cons of supporting multiple distributions:

  • Complexity and cost. Supporting multiple distributions requires additional effort and resources to test and ensure compatibility. It can be expensive and time-consuming.

  • Development constraints. The need to support older versions of libraries and environments can limit the ability to introduce new features and optimize code.

  • Dependency on distribution maintainers. You may need to work with distribution maintainers to ensure your application is properly integrated and supported. This may require additional effort.

Some companies choose to support only a few popular distributions to simplify the development process and reduce costs. In contrast, others strive to provide broad availability and user-friendliness, even at additional cost.

Containerization or Virtualization

Most companies don't have the resources to support many distributions. You can limit yourself to a small number of distributions. But problems will start when the distribution becomes oldstable, and a small team will have to update all dependencies when upgrading to a new OS version.

In this case, the traditional solution is to pack everything into a Docker image and send it to the customer. What exactly he will run it on is not so important. Thanks to the abstraction layer, our image will contain everything we need in a portable form. If the DevOps engineer is good, the image will be minimalistic and contain only the essentials.
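As a rough sketch, this is what packaging a small service might look like. The base image, file names, and tag here are assumptions made for the example, not a recommendation:

  # write a minimal Dockerfile (python:3.12-slim and app.py are hypothetical)
  $ cat > Dockerfile <<'EOF'
  FROM python:3.12-slim
  COPY app.py /app/app.py
  CMD ["python", "/app/app.py"]
  EOF

  # build the image and run it; the host only needs a kernel and a container runtime
  $ docker build -t myapp:1.0 .
  $ docker run --rm myapp:1.0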

Then it suddenly turns out that the application needs to work with hardware tokens, hard disk controllers, and cash registers. We need full-fledged drivers, and we can't stuff our own kernel into a container - the host OS won't let us.

In this case, only full virtualization will help, where a hypervisor on the host OS allows us to run any kernel with the drivers we need. This is often the better solution. But here, we will consider the situation where we don't want the overhead of many running full-OS kernels just to run small applications.

In the case of a container, we send system calls to the kernel on which all containers are running. Then, we drag the libraries together with the binary inside the Docker image. This allows us to tell the process: "Don't look for libraries over there - look here. They are inside the container."

The process knows nothing about the file system or the computer as a whole. The process gets everything from the kernel. When we say, "Give us a list of files in the directory", we send a request to the kernel. The kernel responds to this system call with a list of directory files.

We send a request to the kernel, "Please give us a list of network interfaces." The kernel responds with a list of network interfaces. "Kernel, please open a TCP connection for us." The kernel returns a pointer to this connection.

That is, the process is basically helpless without the kernel, even though it has brought with it all the libraries necessary for the application.
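You can see this dependence on the host kernel directly: a container reports the same kernel version as the host, because there is no other kernel to ask. The version string below is just an example:

  $ uname -r
  6.5.0-35-generic
  $ docker run --rm ubuntu:22.04 uname -r
  6.5.0-35-generic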

Namespaces

The Linux kernel natively supports namespaces. This mechanism does not require any additional packages to be installed; it is provided right at the kernel level and allows the kernel to answer the same system call differently depending on the namespace of the calling process.
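You can see which namespaces your own shell belongs to by looking at /proc; every process holds a reference to one namespace of each type. The inode numbers, dates, and permissions below are illustrative:

  $ ls -l /proc/$$/ns
  lrwxrwxrwx ... cgroup -> 'cgroup:[4026531835]'
  lrwxrwxrwx ... ipc    -> 'ipc:[4026531839]'
  lrwxrwxrwx ... mnt    -> 'mnt:[4026531841]'
  lrwxrwxrwx ... net    -> 'net:[4026531840]'
  lrwxrwxrwx ... pid    -> 'pid:[4026531836]'
  lrwxrwxrwx ... user   -> 'user:[4026531837]'
  lrwxrwxrwx ... uts    -> 'uts:[4026531838]'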

As this mechanism has evolved, new features have been added to virtualize certain functions.

Namespace   Kernel version   Year
Mount       2.4.19           2002
UTS         2.6.19           2006
PID         2.6.24           2008
Network     2.6.29           2009
IPC         2.6.30           2009
User        3.8              2013
Time        5.6              2020

With the advent of this technology, the kernel can, for example, return different sets of available network interfaces to applications running in different namespaces.

The table above shows the main namespaces available today. The time namespace is a relatively recent addition; it lets processes in different namespaces see different values of the monotonic and boot-time clocks, which is useful, for example, when restoring checkpointed containers.

Let's look at some of them a little closer.

Network namespaces provide a means of isolating network resources. They allow processes running in different namespaces to have their own network devices, routing tables, and firewall rules. If an application runs in a separate net namespace, we provide complete isolation of the network from the rest of the system, allowing communication only through dedicated interfaces.
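For example, with the ip tool from iproute2 you can create an empty network namespace and look inside it; until you explicitly add interfaces, it only contains a loopback device. This requires root, and "demo" is just a name chosen for the example:

  $ sudo ip netns add demo
  $ sudo ip netns exec demo ip link
  1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...
  $ sudo ip netns delete demo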

On the other hand, mount namespaces provide the application with its own representation of the file system. We can create a specific structure in a container and tell the process, "This is the root of the file system. Use this as the base." In this way, we give the application access only to specific files and directories, thus providing file system isolation within the container.
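A quick way to experiment with this is the unshare utility from util-linux. In the sketch below, a tmpfs mounted inside the new mount namespace is invisible to the rest of the system (requires root):

  $ sudo unshare --mount bash    # start a shell in a new mount namespace
  # mount -t tmpfs tmpfs /mnt    # this mount exists only inside our namespace
  # ls /mnt                      # empty tmpfs here; other processes still see the old /mnt
  # exit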

The PID namespace gives each isolated environment its own process ID tree, the User namespace allows it to have its own set of user and group identifiers, and the IPC namespace isolates inter-process communication. There is also the UTS namespace, which allows us to set a separate hostname for each isolated environment.
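The same unshare utility can combine several of these at once. In the sketch below, the shell believes it is PID 1, sees only its own processes, and has its own hostname; none of this is visible outside the namespaces (requires root, and the hostname is arbitrary):

  $ sudo unshare --pid --fork --mount-proc --uts bash
  # hostname demo-host           # changes the hostname only in this UTS namespace
  # echo $$                      # the shell sees itself as PID 1
  1
  # ps -e                        # only processes in this PID namespace are listed
  # exit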

These types of namespaces provide complete isolation and control over resources and processes within the container.

Conclusion

The major trend today is to strive for maximum modularity. This approach makes it possible to build customized environments from a set of libraries and ensure they work almost everywhere.

This greatly accelerates the introduction of new solutions to the market, provides portability and fault tolerance, and simplifies load balancing. It also brings other benefits associated with the transition from the "Pets" model (unique, hand-customized services and servers) to the "Cattle" model (homogeneous entities raised and destroyed by the hundreds).

However, it is important to remember that Docker is a convenient wrapper around standard kernel mechanisms. These mechanisms are used not only in the context of Docker but also in conjunction with many other technologies.

