Container cornerstones: namespace and cgroup

This article aims to give readers a concrete understanding of namespaces and cgroups. My own knowledge of Linux is limited, so it does not go very deep.

"A container is an execution environment that shares the kernel with the host system but is isolated from the other processes and resources in the operating system." This sentence is the key to understanding container technology. The processes running in a container are ordinary processes, no different from any others in the operating system, and they use the same hardware resources. The only difference is that processes inside the environment cannot see processes outside it, and operations on either side do not affect the other; this is what isolation means. Running multiple containers simply means running each container's processes in its own isolated environment.

A container is just an abstract, logical concept: anything with the characteristics above can be called a container. On Linux, these characteristics are implemented with the namespaces and cgroups provided by the kernel. Namespaces provide resource isolation, ensuring that operations on resources in different namespaces do not affect each other; cgroups provide resource limits, ensuring that the resource usage of a group does not exceed a preset value.

namespace

Which resources need to be isolated to provide a complete running environment, that is, a container? Roughly the following categories:

  • File system (Mount namespace): file operations do not affect each other
  • Network (Network namespace): the container needs its own IP addresses, ports, routing rules, etc.
  • Hostname (UTS namespace): the container needs to identify itself on the network
  • Interprocess communication (IPC namespace): message queues, shared memory, etc.
  • User permissions (User namespace): the container should have its own complete set of users and permissions
  • PID (PID namespace): PIDs in the container need to be isolated from the PIDs of the host machine

For each category, Linux provides a corresponding namespace type, so there are six types of namespaces here, each isolating a different resource (newer kernels add a few more, such as the cgroup namespace).

The purpose of namespaces is to enable "lightweight virtualization" (i.e. containers) with support at the kernel level. Processes in the same namespace can see and be aware of each other; processes in different namespaces cannot see each other at all, as if each group were running on its own operating system.

To start a container, you only need to create the container's process in new namespaces. Linux supports this through the following system calls:

  • clone(): create a new process, optionally placing it in new namespaces selected by the CLONE_NEW* flags
  • setns(): move the calling process into an existing namespace
  • unshare(): move the calling process into new namespaces without creating a new process. The PID namespace is an exception: the caller stays in its original PID namespace, and only the child processes it creates afterwards are in the new one

We can write a small program to try out and verify resource isolation. Here is the C code:

// namespace.c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];

char* const child_args[] = {
  "/bin/bash",
  NULL
};

// Entry point of the child process, which runs inside the new namespaces
int child_main(void* args) {
  (void)args;
  printf("In child processes!\n");
  // Set a new hostname; only the child's UTS namespace is affected
  sethostname("NewNamespace", 12);
  // Replace the child with bash. execv() returns only on failure, so the
  // lines below run only if bash could not be started
  execv(child_args[0], child_args);
  perror("execv");
  return 1;
}

int main(void) {
  printf("Program start: \n");
  // Create a child process in new IPC, UTS, PID, mount, network and user namespaces.
  // The stack grows downward on most architectures, so pass the top of child_stack
  int child_pid = clone(child_main, child_stack + STACK_SIZE,
                        CLONE_NEWIPC |
                        CLONE_NEWUTS |
                        CLONE_NEWPID |
                        CLONE_NEWNS |
                        CLONE_NEWNET |
                        CLONE_NEWUSER |
                        SIGCHLD, NULL);
  if (child_pid == -1) {
    perror("clone");
    return 1;
  }
  // Debug output: the numeric value of the CLONE_NEWUTS flag
  printf("%d", CLONE_NEWUTS);
  // Wait for the child process to end
  waitpid(child_pid, NULL, 0);
  printf("Exited \n");
  return 0;
}

Compile and run it on Linux (note: root is required for it to run successfully):

# Compile and execute
root@10-9-175-15:/home/ubuntu/docker-learn# gcc -Wall namespace.c -o namespace.o && ./namespace.o 
Program start: 
In child processes!
# Output current process number
nobody@NewNamespace:/home/ubuntu/docker-learn$ echo $$
1
# Exit bash and the process exits
nobody@NewNamespace:/home/ubuntu/docker-learn$ exit
exit
67108864 Exited

As you can see, in the new process:

  • The user name has changed from root to nobody, so the user is isolated (no UID mapping has been set up for the new user namespace, which is why the process appears as nobody)
  • The hostname has become the NewNamespace we set ourselves, so the hostname is isolated
  • The process number is 1, so the PID is isolated. PID 1 is very important in Linux: it is the init process, which plays a special role, for example adopting orphaned processes

The other kinds of resource isolation can be verified in similar ways, but they are not needed for understanding the idea, so we won't go into them here. From this we can already imagine that container implementations such as Docker, containerd and runc create processes in isolated namespaces with calls like the ones above.

cgroup

Namespaces are responsible for resource isolation, but processes in different namespaces must not be allowed to consume resources without limit. Otherwise a bug or a malicious program in one container could easily exhaust resources and threaten the processes of other containers. Resource limits are therefore required, and that is the job of cgroups. Besides limiting resources, cgroups can also record resource usage statistics (which can be used for billing in cloud services) and suspend and resume tasks.

There are several concepts in cgroup:

  • task: a task represents a process
  • cgroup: control group, a group of tasks divided according to some resource control criterion; it can be attached to one or more subsystems
  • subsystem: a resource scheduling controller, such as the cpu subsystem and the memory subsystem. Specifically, Docker uses the following:
    • blkio: sets input/output limits for block devices, such as disks
    • cpu: controls CPU scheduling
    • cpuacct: automatically generates reports on the CPU usage of tasks in a cgroup
    • cpuset: assigns dedicated CPUs and memory nodes to tasks in a cgroup
    • devices: allows or denies access to devices by tasks in a cgroup
    • freezer: suspends or resumes tasks in a cgroup
    • memory: sets memory usage limits for tasks in a cgroup and generates reports on their memory usage
    • perf_event: lets the tasks in a cgroup be monitored as a group with perf
    • net_cls: marks network packets with a class identifier so that Linux traffic control programs can identify packets originating from a specific cgroup
  • hierarchy: a set of cgroups arranged in a tree structure

We can list the subsystems mounted on the current system:

root@10-9-175-15:/home/ubuntu/docker-learn# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

You can see that each subsystem corresponds to a directory in the file system. Let's take a look at the cgroup of a running Docker container.

Listing the Docker containers running locally shows a container with ID ee4a4efd4a5b:

root@10-9-175-15:/home/ubuntu/docker-learn# docker ps
CONTAINER ID   IMAGE          COMMAND                  CREATED         STATUS        PORTS                                      NAMES
ee4a4efd4a5b   halohub/halo   "/bin/sh -c 'java -X..."   10 months ago   Up 22 hours   0.0.0.0:8090->8090/tcp, :::8090->8090/tcp  halo

View the corresponding CPU limit configuration in /sys/fs/cgroup/cpu/docker/ee4a4efd4a5b2a9a6e8154bc4336bf2a7f1528205e9f53adb8443868adad7eeb:

root@10-9-175-15:/sys/fs/cgroup/cpu/docker/ee4a4efd4a5b2a9a6e8154bc4336bf2a7f1528205e9f53adb8443868adad7eeb# ls
cgroup.clone_children  cpuacct.usage         cpuacct.usage_percpu_sys   cpuacct.usage_user  cpu.shares      cpu.uclamp.min
cgroup.procs           cpuacct.usage_all     cpuacct.usage_percpu_user  cpu.cfs_period_us   cpu.stat        notify_on_release
cpuacct.stat           cpuacct.usage_percpu  cpuacct.usage_sys          cpu.cfs_quota_us    cpu.uclamp.max  tasks

root@10-9-175-15:/sys/fs/cgroup/cpu/docker# tree ee4a4efd4a5b2a9a6e8154bc4336bf2a7f1528205e9f53adb8443868adad7eeb
ee4a4efd4a5b2a9a6e8154bc4336bf2a7f1528205e9f53adb8443868adad7eeb
├── cgroup.clone_children
├── cgroup.procs
├── cpuacct.stat
├── cpuacct.usage
├── cpuacct.usage_all
├── cpuacct.usage_percpu
├── cpuacct.usage_percpu_sys
├── cpuacct.usage_percpu_user
├── cpuacct.usage_sys
├── cpuacct.usage_user
├── cpu.cfs_period_us
├── cpu.cfs_quota_us
├── cpu.shares
├── cpu.stat
├── cpu.uclamp.max
├── cpu.uclamp.min
├── notify_on_release
└── tasks

You can see that there are many files, each corresponding to a CPU setting or a monitoring value. Note that the tasks file lists the tasks managed by this cgroup. Look at any one of them, for example cpu.cfs_quota_us, the CPU quota. The default is -1, which means there is no limit:

root@10-9-175-15:/sys/fs/cgroup/cpu/docker/ee4a4efd4a5b2a9a6e8154bc4336bf2a7f1528205e9f53adb8443868adad7eeb# cat cpu.cfs_quota_us 
-1

Of course, we can also put our own processes into a cgroup to restrict them. To do so, create a directory under the corresponding subsystem's directory; the system automatically creates the control files shown above inside it. We can then write process IDs to the tasks file and write settings to the corresponding control files.

I won't go into the details here; see this article and the official manual.

Conclusion

As we have seen, with namespaces and cgroups, creating a container's process is no different from creating an ordinary process, so the resulting container is very lightweight. The relationship between containers and process creation can be compared to that between coroutines and function calls: both build on an ordinary mechanism to be lightweight and fast.

In any case, whenever I read about these topics I feel my lack of Linux knowledge, so learning Linux is necessary. The most important areas are process management, Linux networking, the file system, and so on.

Keywords: Linux Operation & Maintenance Docker Container

Added by chrisg101 on Sun, 05 Dec 2021 15:43:23 +0200