Isolation and Resource Limits of Linux Containers

Introduction to Linux Processes

Suppose you want to write a small program that performs addition: it reads its input from one file and writes the result to another file.

Because a computer only understands 0s and 1s, the code must be translated into a binary file in some way, whatever language it is written in, before it can run on the computer's operating system.

For the code to do useful work, we usually have to supply it with data, such as the input file our addition program needs. This data, together with the binary file of the code itself, is stored on disk. That is what we usually call a program, also known as the executable image of the code.

Then we can run the program on the computer.

First, the operating system sees from the program that the input is stored in a file, so it loads that data into memory. It also reads the instruction to perform the addition and directs the CPU to carry it out. While the CPU works with memory on the calculation, it uses registers to hold values and the stack in memory to hold the instructions being executed and their variables. Meanwhile, files are opened, and various I/O devices are called on and keep changing their state.

In this way, once the program is executed, it changes from a binary file on disk into a collection of data in memory, values in registers, instructions on the stack, open files, and the state of various devices. The sum of this execution environment of a running program is our protagonist: the process.

For a process, then, its static form is the program, which usually sits quietly on disk. Once it runs, it becomes the sum of data and state inside the computer, which is its dynamic form.
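On Linux, part of this dynamic state can be inspected under /proc. The following is only a minimal sketch, assuming a Linux host with /proc mounted; the function name process_snapshot is made up for illustration.

```shell
# Print part of a process's dynamic state as exposed by /proc.
# Usage: process_snapshot [PID]   (defaults to the current shell)
process_snapshot() {
    pid=${1:-$$}
    # The static side: the executable image on disk the process was started from.
    echo "program on disk: $(readlink "/proc/$pid/exe" 2>/dev/null)"
    # The dynamic side: scheduler state and resident memory of the running process.
    grep -E '^(State|VmRSS):' "/proc/$pid/status" 2>/dev/null || true
    # Open files belong to the process as well.
    echo "open files: $(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
}

process_snapshot
```

Called without arguments, it describes the shell that runs it; pass another PID to look at a different process (reading another user's process may require root).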

Isolation of Linux containers

A Docker container is essentially a process of the Linux operating system; Docker isolates resources between processes by means of namespaces. Put that way, it sounds very abstract to many people, so let's learn about it hands-on!

First, we create a container:

# docker run -it busybox /bin/sh
/ # 

Run the ps command in the container:

/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    6 root      0:00 ps

As you can see, the /bin/sh we started in Docker is process No. 1 (PID=1) inside the container, and only two processes are running in the container. This means that the /bin/sh we started earlier and the ps we just ran have been isolated by Docker in a world different from the host.

How on earth did this happen?

Normally, whenever we run a /bin/sh program on the host, the operating system assigns it a process number, such as PID=100. This number uniquely identifies the process, much like an employee's badge number. So PID=100 can be roughly understood as: this /bin/sh is employee No. 100 of our company, while employee No. 1 is naturally someone in charge of the whole operation, like Bill Gates. Now we run the /bin/sh program in a container through Docker. When employee No. 100 starts the job, Docker puts a blindfold on him so that he never sees the other 99 employees ahead of him, let alone Bill Gates. As a result, he mistakenly believes that he is employee No. 1. This mechanism works by altering the process space that the isolated application sees, so that these processes can only see recalculated process numbers, such as PID=1. In fact, on the host's operating system they are still the original process No. 100.

This technology is the Namespace mechanism in Linux. The way Namespace is used is also interesting: it is really just an optional flag passed when Linux creates a new process. The system call that creates a new process (or thread) in Linux is clone(), for example:

/* stack_top points to the top of a memory region allocated for the child's stack */
int pid = clone(main_function, stack_top, SIGCHLD, NULL);

This system call will create a new process for us and return its process number pid.

When we create a new process with the clone() system call, we can pass the CLONE_NEWPID flag in its arguments, such as:

int pid = clone(main_function, stack_top, CLONE_NEWPID | SIGCHLD, NULL);

The newly created process will now "see" a brand-new process space in which its PID is 1. We say "see" because this is only a trick: in the real process space of the host, the PID of the process is still its real value, such as 100.
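Both numbers can be observed from the host. On kernels 4.1 and later, /proc/PID/status contains an NSpid line that lists the PID of a process in every nested PID namespace, outermost first. The sketch below assumes such a kernel; the function name ns_pids is hypothetical.

```shell
# Print the PIDs a process has in each nested PID namespace, outermost first.
# Usage: ns_pids [STATUS_FILE]   (defaults to the status file of the current shell)
ns_pids() {
    status_file=${1:-/proc/$$/status}
    # The line looks like "NSpid:<TAB>100<TAB>1": PID 100 on the host, PID 1 inside.
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$status_file"
}

# Example: ns_pids /proc/100/status
```

For a process that lives only in the host's PID namespace, a single number is printed.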

Of course, we can also execute the clone() call above several times, which creates multiple PID namespaces. The application process in each Namespace considers itself process No. 1 of the current container; it can see neither the real process space of the host nor what is going on in other PID namespaces.
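The same effect can be tried from a shell without writing any C, using the unshare command from util-linux. This is a sketch under assumptions: it needs root (or CAP_SYS_ADMIN), and the wrapper name in_new_pid_ns and its DRY_RUN switch are invented for this example.

```shell
# Run a command as PID 1 of a new PID namespace (needs root or CAP_SYS_ADMIN).
# Set DRY_RUN=1 to only print the command line instead of running it.
in_new_pid_ns() {
    # --pid:        create a new PID namespace (CLONE_NEWPID under the hood)
    # --fork:       fork first, so that the command becomes PID 1 in the new namespace
    # --mount-proc: mount a fresh /proc so that ps reports the new namespace
    set -- unshare --pid --fork --mount-proc "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (as root): in_new_pid_ns ps
```

Running the example as root should list only a handful of processes, with the first command started inside the namespace shown as PID 1, much like the ps output in the busybox container.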

In addition to the PID Namespace we just used, the Linux operating system also provides Mount, UTS, IPC, Network and User namespaces, which put the same kind of blindfold on other parts of the process context. For example, the Mount Namespace lets the isolated process see only the mount point information of the current Namespace, and the Network Namespace lets the isolated process see only the network devices and configuration of the current Namespace.
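The namespaces a process belongs to are visible as symbolic links under /proc/PID/ns. A minimal sketch, assuming a Linux host; list_namespaces is a hypothetical helper name.

```shell
# List the namespaces a process belongs to, one "type -> identifier" per line.
# Usage: list_namespaces [NS_DIR]   (defaults to the current shell's namespaces)
list_namespaces() {
    ns_dir=${1:-/proc/$$/ns}
    for link in "$ns_dir"/*; do
        # Each entry is a symlink such as pid -> pid:[4026531836]
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink "$link")"
    done
}

# Example: list_namespaces /proc/100/ns
```

Two processes are in the same namespace of a given type exactly when the identifiers in brackets match, so comparing the output for a host shell and a container process shows which namespaces differ.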

This is the most basic implementation principle of Linux containers.

So the seemingly mysterious Docker container really comes down to specifying, when the container process is created, the set of Namespace parameters that this process needs to enable. As a result, the container can only "see" the resources, files, devices, state, or configuration scoped by its current Namespaces. The host and all other unrelated programs are completely invisible to it.

Therefore, the container is actually a special process.

Restrictions on Linux containers

Why do you need to restrict containers?

Although process No. 1 in the container can only see what is inside the container thanks to this trick, on the host it is still process No. 100 and competes with all other processes on an equal footing. This means that although process No. 100 appears to be isolated, the resources it uses (such as CPU and memory) can be taken at any time by other processes (or other containers) on the host. And, of course, process No. 100 may itself eat up all the resources. Obviously, neither is reasonable behavior for a sandbox.

What are Linux Cgroups?

Cgroups, short for control groups, is a Linux mechanism for limiting the resources of a process or a group of processes. It allows fine-grained control over CPU, memory and other resources. For example, Docker on Linux implements resource control on top of the limits provided by cgroups. Developers can also use cgroups directly to control process resources: if a web service and a computing service are deployed on an 8-core machine, the web service can be restricted to six cores, leaving the remaining two cores to the computing service. Besides limiting the number of cores used, the cgroups CPU limits can also set a CPU usage ratio. Note that this ratio applies only when every cgroup is busy: if one cgroup is idle and another is busy, the busy cgroup may take up the whole CPU core.
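To make the usage ratio concrete, the sketch below assumes it refers to the cpu.shares file of the cgroup v1 CPU subsystem (default value 1024), where competing groups receive CPU time in proportion to their shares. The function name cpu_share_percent is hypothetical, and the result is rounded down.

```shell
# Percentage of CPU time a group gets when it and all listed competitors are busy.
# Usage: cpu_share_percent MY_SHARES OTHER_SHARES...
cpu_share_percent() {
    mine=$1
    total=0
    # "$@" still includes the group's own shares as the first argument.
    for shares in "$@"; do
        total=$((total + shares))
    done
    echo $((mine * 100 / total))
}

cpu_share_percent 1024 512   # a group with 1024 shares competing with one that has 512
cpu_share_percent 512 1024   # the other side of the same competition
```

The numbers only describe the contended case: if the other group is idle, either group may use the whole CPU regardless of its shares.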

In Linux, the interface that Cgroups exposes to users is a file system: it is organized as files and directories under the /sys/fs/cgroup path of the operating system. On a CentOS machine, we can display them with the mount command:

/ # mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,pids)

Under /sys/fs/cgroup there are many subdirectories such as cpuset, cpu and memory, also known as subsystems. These are the kinds of resources that Cgroups can currently limit on this machine. Under the directory of each subsystem, you can see the specific ways in which that kind of resource can be limited.
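The kernel also reports its subsystems in /proc/cgroups, a table with the columns subsys_name, hierarchy, num_cgroups and enabled. The following sketch lists the enabled ones; the function name list_enabled_subsystems is made up for illustration.

```shell
# Print the names of the cgroup subsystems that are enabled in the kernel.
# Usage: list_enabled_subsystems [CGROUPS_FILE]   (defaults to /proc/cgroups)
list_enabled_subsystems() {
    cgroups_file=${1:-/proc/cgroups}
    # Skip the "#subsys_name ..." header; column 4 is the "enabled" flag.
    awk '!/^#/ && $4 == 1 { print $1 }' "$cgroups_file"
}

# Example: list_enabled_subsystems
```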

For example, for the CPU subsystem, we can see the following configuration files:

/ # ls -l /sys/fs/cgroup/cpu/
total 0
-rw-r--r--    1 root     root             0 Aug 12 10:55 cgroup.clone_children
--w--w--w-    1 root     root             0 Aug 12 10:55 cgroup.event_control
-rw-r--r--    1 root     root             0 Aug 12 10:55 cgroup.procs
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpu.cfs_period_us
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpu.cfs_quota_us
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpu.rt_period_us
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpu.rt_runtime_us
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpu.shares
-r--r--r--    1 root     root             0 Aug 12 10:55 cpu.stat
-r--r--r--    1 root     root             0 Aug 12 10:55 cpuacct.stat
-rw-r--r--    1 root     root             0 Aug 12 10:55 cpuacct.usage
-r--r--r--    1 root     root             0 Aug 12 10:55 cpuacct.usage_percpu
-rw-r--r--    1 root     root             0 Aug 12 10:55 notify_on_release
-rw-r--r--    1 root     root             0 Aug 12 10:55 tasks

Readers familiar with Linux CPU management will notice keywords such as cfs_period and cfs_quota. These two parameters are used together to limit the processes in a group to a total of cfs_quota of CPU time within each period of length cfs_period.
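The relationship between the two values is simple arithmetic: the allowed CPU bandwidth is quota divided by period, and a quota of -1 means no limit. A small sketch, with the hypothetical function name cfs_limit_percent and integer rounding down:

```shell
# CPU bandwidth, in percent of one CPU, allowed by a quota/period pair.
# Usage: cfs_limit_percent QUOTA_US PERIOD_US
cfs_limit_percent() {
    quota=$1
    period=$2
    if [ "$quota" -lt 0 ]; then
        # A negative quota (-1) means the group is not throttled at all.
        echo "unlimited"
    else
        echo $((quota * 100 / period))
    fi
}

cfs_limit_percent 20000 100000   # 20 ms out of every 100 ms
```

A quota larger than the period is valid and yields more than 100, meaning the group may use more than one CPU.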

Next, let's put this configuration to use.

First, we need to create a directory under the corresponding subsystem:

# cd /sys/fs/cgroup/cpu
# mkdir container
# cd container/
# ll
total 0
-rw-r--r--. 1 root root 0 Aug 12 19:38 cgroup.clone_children
--w--w--w-. 1 root root 0 Aug 12 19:38 cgroup.event_control
-rw-r--r--. 1 root root 0 Aug 12 19:38 cgroup.procs
-r--r--r--. 1 root root 0 Aug 12 19:38 cpuacct.stat
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpuacct.usage
-r--r--r--. 1 root root 0 Aug 12 19:38 cpuacct.usage_percpu
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Aug 12 19:38 cpu.shares
-r--r--r--. 1 root root 0 Aug 12 19:38 cpu.stat
-rw-r--r--. 1 root root 0 Aug 12 19:38 notify_on_release
-rw-r--r--. 1 root root 0 Aug 12 19:38 tasks

This directory is called a control group. You will find that the operating system automatically generates the resource limit files of the subsystem under the newly created container directory.

Now we run an endless loop that eats up 100% of a CPU:

# while : ; do : ; done &
# top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                 
 7996 root      20   0    1320    256    212 R  100  0.0   1:12.75 sh 

The top command shows that CPU utilization has reached 100%.

Looking at the files in the container directory, we can see that the CPU quota of the container control group is not limited yet (the value is -1):

# cat  /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us 
-1

Next, we set the restrictions by modifying these files:

Write 20 ms (20000 us) to the cpu.cfs_quota_us file of the container group:

# echo 20000 > /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us

With the default period of 100 ms (cpu.cfs_period_us = 100000), a process limited by this control group can use only 20 ms of CPU time in every 100 ms, that is, only 20% of the CPU bandwidth.

Next, we write the PID of the process to be limited into the tasks file of the container group, and the setting above takes effect for that process:

# echo 7996 > /sys/fs/cgroup/cpu/container/tasks 

Then check again with top:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND               
 7996 root      20   0  119484   6140   1652 R  20.3  0.2   3:45.10 sh 

As you can see, the CPU utilization of the process immediately dropped to about 20%.

Isn't it amazing?
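The three manual steps (create the directory, write the quota, write the PID) can be wrapped in one small function. This is a sketch under assumptions, not what Docker itself runs: it expects a cgroup v1 CPU hierarchy and root privileges, and the name limit_cpu is hypothetical.

```shell
# Put a process into a CPU control group with the given CFS quota.
# Usage: limit_cpu GROUP_DIR QUOTA_US PID
limit_cpu() {
    group_dir=$1
    quota_us=$2
    pid=$3
    # Creating the directory creates the control group; on a real cgroup
    # file system the kernel then fills it with the resource limit files.
    mkdir -p "$group_dir" || return 1
    # Limit the group to QUOTA_US microseconds of CPU time per period.
    echo "$quota_us" > "$group_dir/cpu.cfs_quota_us" || return 1
    # Moving the PID into the group makes the limit apply to that process.
    echo "$pid" > "$group_dir/tasks"
}

# Example (as root): limit_cpu /sys/fs/cgroup/cpu/container 20000 7996
```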

In addition to the CPU subsystem, every subsystem of Cgroups has its own resource limitation capabilities. For example:

  • blkio, which sets I/O limits for block devices. It is generally used for devices such as disks.
  • cpuset, which allocates separate CPU cores and corresponding memory nodes for the process.
  • memory, which sets the memory usage limit for the process.

The design of Linux Cgroups is fairly easy to use: in short, it is a subsystem directory combined with a set of resource limit files. Docker and other Linux container projects only need to create a control group (i.e. a new directory) for each container under each subsystem and, after starting the container process, write the PID of that process into the tasks file of the corresponding control group.
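Which control group a process ended up in can be read back from /proc/PID/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below assumes the cgroup v1 layout used throughout this article; cgroup_path is a hypothetical helper name.

```shell
# Print the control group path of a process for one controller.
# Usage: cgroup_path CONTROLLER [CGROUP_FILE]   (defaults to the current shell)
cgroup_path() {
    controller=$1
    cgroup_file=${2:-/proc/$$/cgroup}
    awk -F: -v want="$controller" '
        {
            # Field 2 may name several co-mounted controllers, e.g. "cpu,cpuacct".
            n = split($2, names, ",")
            for (i = 1; i <= n; i++) {
                if (names[i] == want) {
                    path = $0
                    sub(/^[^:]*:[^:]*:/, "", path)   # keep everything after the 2nd colon
                    print path
                    exit
                }
            }
        }' "$cgroup_file"
}

# Example: cgroup_path cpu /proc/7996/cgroup
```

For a process started by Docker with the layout described here, the printed path would begin with /docker/ followed by the container ID.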

The values to write into the resource files under these control groups are specified by the user through the parameters of docker run, for example:

# docker run -it --cpu-period=10000 --cpu-quota=20000 ubuntu /bin/bash

After starting the container, we can confirm this by checking the resource limit files of the docker control group under the CPU subsystem of the Cgroup file system:

# cat /sys/fs/cgroup/cpu/docker/0712c3d12935b9a3f69ac976b9d70309b78cb7db9a5a5c8a612742370b7453e4/cpu.cfs_period_us
10000
# cat /sys/fs/cgroup/cpu/docker/0712c3d12935b9a3f69ac976b9d70309b78cb7db9a5a5c8a612742370b7453e4/cpu.cfs_quota_us
20000
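The two files can also be read together and turned into a single figure. The sketch below assumes the /sys/fs/cgroup/cpu/docker/CONTAINER_ID layout shown above, which may differ with other cgroup drivers or with cgroup v2; the function name container_cpu_limit and its optional root argument are hypothetical.

```shell
# Report how many CPUs' worth of time a Docker container may use.
# Usage: container_cpu_limit CONTAINER_ID [CGROUP_ROOT]   (default root: /sys/fs/cgroup)
container_cpu_limit() {
    container_id=$1
    cgroup_root=${2:-/sys/fs/cgroup}
    dir="$cgroup_root/cpu/docker/$container_id"
    period=$(cat "$dir/cpu.cfs_period_us") || return 1
    quota=$(cat "$dir/cpu.cfs_quota_us") || return 1
    if [ "$quota" -lt 0 ]; then
        echo "unlimited"
    else
        # quota / period is the number of CPUs the group may keep busy.
        awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f CPUs\n", q / p }'
    fi
}
```

With the values above, a period of 10000 us and a quota of 20000 us, the result is 2.00 CPUs: the quota is twice the period, so the container may keep up to two cores fully busy.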

Added by neel_basu on Thu, 30 Dec 2021 00:04:45 +0200