After we install Docker, it handles all the network virtualization for us. But under the hood, Docker is really combining several features that Linux already provides. This is also why Docker was criticized as "old wine in a new bottle" when it was first released: it merely packages capabilities of the operating system behind a set of easy-to-use APIs, without inventing anything fundamentally new.
PS: of course, I don't agree with this view :)
This article makes heavy use of the ip command, which comes from the iproute2 package. Most systems install it by default; if yours does not, please install it yourself. Also note that the ip command requires root privileges because it modifies system network settings, so do not experiment on production or other important systems, to avoid unexpected breakage.
1. Linux Namespace
A namespace is an isolation mechanism that Linux provides for global system resources. From a process's point of view, processes in the same namespace see their own independent copy of those global resources: changes to them are visible only within that namespace and have no effect on others. Docker uses the namespace mechanism to isolate networking and process space; different containers belong to different namespaces, so the resources of different containers are isolated from one another and do not interfere.
Namespaces can isolate a container's process PIDs, file system mount points, host name, and other resources. But today we only care about the network namespace, or netns for short. It gives each namespace a logically independent network stack, including network devices, routing tables, ARP tables, iptables rules, sockets, and so on, so that each network namespace behaves as if it were running on its own independent network.
First, we try to create a new netns:
ip netns add netns1
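As an aside, the other ip netns subcommands cover the rest of a namespace's lifecycle (a quick reference sketch, not part of this experiment):

# List the named network namespaces on this host
$ ip netns list
# Run a command inside a given namespace
$ ip netns exec netns1 <command>
# Delete a namespace when it is no longer needed
$ ip netns delete netns1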
Next, let's check the routing table, iptables rules, and network devices of this netns:
[root@centos ~]$ ip netns exec netns1 route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

[root@centos ~]$ ip netns exec netns1 iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

[root@centos ~]$ ip netns exec netns1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
As you can see, because this netns was just created, all of the above is empty; there is only a single lo device, still in the DOWN state.
Through netns, we can isolate the network stacks of different containers so they cannot affect or pollute each other. Next, let's look at how different netns can communicate with each other.
2. Veth
Linux provides a way to simulate a hardware NIC in software: veth (Virtual Ethernet devices). A veth is a virtual network device in Linux, and veths always appear in twos, hence the usual name "veth pair". Their behavior is very simple: if v-a and v-b are a veth pair, data sent into v-a is received by v-b, and vice versa. In effect, a veth pair is a virtual "network cable": whatever goes in one end naturally comes out of the other.
Both ends of a veth pair attach directly to the network stack, so creating a veth pair adds two new network interfaces to the host. This kind of virtual device is actually nothing unfamiliar: the lo loopback device (127.0.0.1) used for local network IO is also such a virtual device. The only difference is that veths always come in pairs.
Under Linux, we can create a veth pair with the ip command. ip link add creates the pair, where link indicates that this is a link-layer interface:
$ ip link add veth1 type veth peer name veth1-peer
View the result with ip link show. You can see that veth1 and veth1-peer are interconnected:
[root@centos ~]$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:09:10:27 brd ff:ff:ff:ff:ff:ff
5: veth1-peer@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:20:02:e7:bf:bd brd ff:ff:ff:ff:ff:ff
6: veth1@veth1-peer: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:c4:57:ee:46:60 brd ff:ff:ff:ff:ff:ff
Move the veth1-peer end into the netns1 we just created:
$ ip link set veth1-peer netns netns1
Check again and you will find that veth1-peer has disappeared from the host, because the device now lives in netns1:
[root@centos ~]$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:09:10:27 brd ff:ff:ff:ff:ff:ff
6: veth1@if5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether aa:c4:57:ee:46:60 brd ff:ff:ff:ff:ff:ff link-netnsid 0

[root@centos ~]$ ip netns exec netns1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: veth1-peer@if6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:20:02:e7:bf:bd brd ff:ff:ff:ff:ff:ff link-netnsid 0
Next, configure IPs for both ends of the veth pair and bring the devices up:
$ ip addr add 172.16.0.1/24 dev veth1
$ ip link set dev veth1 up
# For a device in another netns, use ip netns exec $name to run the command inside that netns
$ ip netns exec netns1 ip addr add 172.16.0.2/24 dev veth1-peer
$ ip netns exec netns1 ip link set dev veth1-peer up
After the devices are started, we can view them through the familiar ifconfig:
[root@centos ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.1.48  netmask 255.255.255.0  broadcast 10.0.1.255
        inet6 fe80::5054:ff:fe09:1027  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:09:10:27  txqueuelen 1000  (Ethernet)
        RX packets 44996  bytes 61990718 (59.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8462  bytes 684565 (668.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2  bytes 256 (256.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 256 (256.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

veth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::a8c4:57ff:feee:4660  prefixlen 64  scopeid 0x20<link>
        ether aa:c4:57:ee:46:60  txqueuelen 1000  (Ethernet)
        RX packets 8  bytes 656 (656.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@centos ~]$ ip netns exec netns1 ifconfig
veth1-peer: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.0.2  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::420:2ff:fee7:bfbd  prefixlen 64  scopeid 0x20<link>
        ether 06:20:02:e7:bf:bd  txqueuelen 1000  (Ethernet)
        RX packets 8  bytes 656 (656.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
The network topology at this time is:
+------------------+                  +------------------+
|       host       |                  |      netns1      |
|                  |    veth pair     |                  |
|  172.16.0.1/24  +-+--------------+-+  172.16.0.2/24   |
|     (veth1)      |                  |   (veth1-peer)   |
|                  |                  |                  |
+------------------+                  +------------------+
Now we can test whether the two ends of the veth pair can reach each other:
[root@centos ~]$ ip netns exec netns1 ping 172.16.0.1 -I veth1-peer -c 1
PING 172.16.0.1 (172.16.0.1) from 172.16.0.2 veth1-peer: 56(84) bytes of data.
64 bytes from 172.16.0.1: icmp_seq=1 ttl=64 time=0.031 ms

--- 172.16.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.031/0.031/0.031/0.000 ms

[root@centos ~]$ ping 172.16.0.2 -I veth1 -c 1
PING 172.16.0.2 (172.16.0.2) from 172.16.0.1 veth1: 56(84) bytes of data.
64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=0.015 ms

--- 172.16.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.015/0.015/0.015/0.000 ms
As you can see, communication works in both directions: from the netns to the host, and from the host to the device inside the netns. At this point we have achieved point-to-point communication.
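Tearing the experiment down is symmetric, by the way (a reference sketch only; we keep netns1 around for now). Deleting either end of a veth pair removes both ends, and deleting a netns removes the devices inside it:

# Delete the veth pair (removing one end removes its peer too)
$ ip link delete veth1
# Delete the namespace; devices inside it are removed along with it
$ ip netns delete netns1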
3. Linux Bridge
Veths always come in pairs, with the two ends bound to each other. In the real world, though, dozens or even hundreds of containers on the same server may need to talk to one another, and creating a dedicated veth pair for every pair of containers that must communicate is clearly impractical, just as in the physical world we do not run a separate cable between every two servers in a cluster.
In the physical world, we solve this with switches, and Linux provides a software-simulated switch: the Linux Bridge. With a bridge, we no longer need point-to-point connections; instead, we plug the free end of every veth pair into the bridge, and the bridge forwards packets between the different pairs. This way, all the containers can reach one another.
In the previous experiment, we created netns1 and a veth pair with one end placed inside it, which is a typical point-to-point setup. Next, let's use a bridge to communicate across multiple netns.
First, we create two new netns, netns2 and netns3, then assign and bring up network interfaces for them:
# Create two new netns
[root@centos ~]$ ip netns add netns2
[root@centos ~]$ ip netns add netns3

# Create two new veth pairs
[root@centos ~]$ ip link add veth2 type veth peer name veth2-peer
[root@centos ~]$ ip link add veth3 type veth peer name veth3-peer

# Move one end of each pair into its netns
[root@centos ~]$ ip link set veth2-peer netns netns2
[root@centos ~]$ ip link set veth3-peer netns netns3

# Assign IPs and bring the devices up
[root@centos ~]$ ip netns exec netns2 ip addr add 172.16.0.102/24 dev veth2-peer
[root@centos ~]$ ip netns exec netns2 ip link set veth2-peer up
[root@centos ~]$ ip netns exec netns3 ip addr add 172.16.0.103/24 dev veth3-peer
[root@centos ~]$ ip netns exec netns3 ip link set veth3-peer up
With that, we have two virtual network environments on one Linux host. At this point, a request from netns2 to netns3 cannot succeed, because the namespaces completely isolate the two networks from each other. Next, we need a bridge to connect these two virtual networks.
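You can verify the isolation yourself; until the bridge is wired up, a ping between the two netns should get no reply, since nothing connects the free ends of the two veth pairs (shown as a sketch, without captured output):

# Expected to fail: netns2 and netns3 are not yet connected to anything
$ ip netns exec netns2 ping 172.16.0.103 -I veth2-peer -c 1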
First, we create a bridge and "plug" the free ends of the two veth pairs into it:
The brctl command comes from the bridge-utils package and may need to be installed on some systems.
# Create a bridge
[root@centos ~]$ brctl addbr br0

# Plug the two veths into br0
[root@centos ~]$ ip link set dev veth2 master br0
[root@centos ~]$ ip link set dev veth3 master br0

# Assign an address to br0 as well
[root@centos ~]$ ip addr add 172.16.0.100/24 dev br0

# Bring all the devices up
[root@centos ~]$ ip link set veth2 up
[root@centos ~]$ ip link set veth3 up
[root@centos ~]$ ip link set br0 up

# Check that it worked
[root@centos ~]$ brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.c63e968442c6       no              veth2
                                                        veth3
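If brctl is unavailable and you would rather not install bridge-utils, iproute2 alone can do the same job (a sketch of the equivalent commands):

# Create the bridge with iproute2 instead of brctl
$ ip link add name br0 type bridge
# Attaching ports works exactly as above
$ ip link set dev veth2 master br0
$ ip link set dev veth3 master br0
# Inspect bridge ports with the bridge command, also part of iproute2
$ bridge link show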
The topology is:
+------------------+        +------------------+
|      netns2      |        |      netns3      |
|  172.16.0.102/24 |        |  172.16.0.103/24 |
+---(veth2-peer)---+        +---(veth3-peer)---+
          |                           |
          |                           |
+------(veth2)-------------------(veth3)------+
|                linux-bridge                 |
|            (br0/172.16.0.100/24)            |
+---------------------------------------------+
After the above operations, we can test whether the two different netns can now reach each other:
# From netns2, ping 172.16.0.103 in netns3 via veth2-peer
[root@centos ~]$ ip netns exec netns2 ping 172.16.0.103 -I veth2-peer -c 1
PING 172.16.0.103 (172.16.0.103) from 172.16.0.102 veth2-peer: 56(84) bytes of data.
64 bytes from 172.16.0.103: icmp_seq=1 ttl=64 time=0.038 ms

--- 172.16.0.103 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.038/0.038/0.038/0.000 ms

# From netns3, ping 172.16.0.102 in netns2 via veth3-peer
[root@centos ~]$ ip netns exec netns3 ping 172.16.0.102 -I veth3-peer -c 1
PING 172.16.0.102 (172.16.0.102) from 172.16.0.103 veth3-peer: 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=64 time=0.023 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms
The communication succeeds. We now have connectivity across different netns, and whenever a new netns appears later, we only need to plug one end of its veth pair into br0 for it to reach all the other netns already on the bridge.
4. Communication with the external network
We have solved communication between containers, and everything seems fine, but one last problem remains: communication between containers and the outside world. Let's simply try it:
[root@centos ~]$ ip netns exec netns3 ping baidu.com -I veth3-peer -c 1
ping: baidu.com: Name or service not known
So from inside a netns we cannot reach the outside; we can only talk to the networks attached to br0. In the real world, though, containers need to initiate outbound connections and accept inbound ones, for example making public-network requests from inside a container, or exposing a container port on the host the way Docker does.
To meet this requirement, we need to introduce two new tools: the routing table and iptables.
The idea behind routing is simple: deciding which network device (virtual devices count too) a packet should be sent out of. The rules live in routing tables, and Linux can have several of them; the most important and commonly used are local and main. The local table records routes for the IP addresses of the devices in the current network namespace, while ordinary routes are generally written to main, which is what route -n displays. The following command shows the routes created automatically during our experiments:
[root@centos ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.1.1        0.0.0.0         UG    0      0        0 eth0
10.0.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.16.0.0      0.0.0.0         255.255.255.0   U     0      0        0 br0
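The route command is somewhat dated; the same tables can be inspected with iproute2, including the local table that route -n does not show (a sketch):

# Show the main routing table (equivalent to route -n)
$ ip route show table main
# Show the local table: routes for the addresses owned by this namespace's devices
$ ip route show table local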
Because the Linux network stack runs in kernel space while the commands we execute run in user space, in theory we have no way to interfere with the stack's behavior. To support various needs, however, Linux exposes a number of hooks from the kernel that user space can attach to, and iptables is the tool through which we use those hooks. For details and usage of iptables, please refer to the iptables(8) Linux man page, which will not be repeated here.
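To get a feel for it, you can list the current rules in the nat table, which is the one we will be working with below (a sketch):

# List nat table rules with packet counters and rule numbers
$ iptables -t nat -L -n -v --line-numbers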
Let's first tackle the container's need to initiate requests to external addresses. "External" here means anything beyond the virtual network's host; it does not necessarily have to be the public Internet. We continue to reuse the environment built above, assuming that our host's eth0 address is 10.0.1.48 and that another host on the LAN has the IP 10.0.1.26.
First, we try to ping 10.0.1.26 from netns2. Before starting, let's make sure the two hosts on the LAN can communicate at all:
[root@centos ~]$ ping 10.0.1.26 -c 1
PING 10.0.1.26 (10.0.1.26) 56(84) bytes of data.
64 bytes from 10.0.1.26: icmp_seq=1 ttl=64 time=0.193 ms

--- 10.0.1.26 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.193/0.193/0.193/0.000 ms

# Now try the same from inside the netns
[root@centos ~]$ ip netns exec netns2 ping 10.0.1.26 -I veth2-peer -c 1
PING 10.0.1.26 (10.0.1.26) from 172.16.0.102 veth2-peer: 56(84) bytes of data.

--- 10.0.1.26 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
As expected, the host can reach the external machine, but from inside the netns direct communication with the external network fails.
At this point, let's check the routing table of netns2:
[root@centos ~]$ ip netns exec netns2 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.0.0      0.0.0.0         255.255.255.0   U     0      0        0 veth2-peer
There is only one record in the routing table, with Destination 172.16.0.0, while our target IP is 10.0.1.26. The target obviously cannot match this record, so the ping failed. Recall that the other end of our virtual NIC is plugged into br0; by the nature of veth, br0 is the only thing we can reach. So let's add a default route that forwards everything unmatched toward br0:
[root@centos ~]$ ip netns exec netns2 route add default gw 172.16.0.100 veth2-peer
[root@centos ~]$ ip netns exec netns2 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.16.0.100    0.0.0.0         UG    0      0        0 veth2-peer
172.16.0.0      0.0.0.0         255.255.255.0   U     0      0        0 veth2-peer
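On systems where the legacy route command is absent, the same default route can be added with iproute2 (a sketch):

# iproute2 equivalent of the route add above
$ ip netns exec netns2 ip route add default via 172.16.0.100 dev veth2-peer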
Let's continue to analyze what happens when the traffic reaches br0. Check the routing table on the host:
[root@centos ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.1.1        0.0.0.0         UG    0      0        0 eth0
10.0.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.16.0.0      0.0.0.0         255.255.255.0   U     0      0        0 br0
When the traffic reaches br0 on the host, the default route (Destination 0.0.0.0) sends it on to the next hop, through which we can normally reach 10.0.1.26. But there is a problem: 10.0.1.26 is not in the same subnet as br0 (172.16.0.100), from which the request was sent, so a layer-3 forward is needed. Linux supports this, but it is disabled by default; you can enable it with net.ipv4.conf.all.forwarding = 1.
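Concretely, that means checking and flipping the kernel switch via sysctl (a sketch; add the line to /etc/sysctl.conf if you want it to survive reboots):

# Check whether IP forwarding is currently enabled (0 = off, 1 = on)
$ sysctl net.ipv4.conf.all.forwarding
# Enable it for this boot
$ sysctl -w net.ipv4.conf.all.forwarding=1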
Continuing the analysis: the outbound path of the traffic is now veth2-peer -> br0 -> eth0 -> 10.0.1.26. But when 10.0.1.26 replies, it knows nothing about the 172.16.0.0/24 segment, so we also need NAT (Network Address Translation). The principle of NAT is simple: when an internal address needs to talk to an external one, rewrite the source address and port of the egress traffic to an address owned by a gateway that can reach the outside; when the reply comes back, consult the NAT table to decide which internal address the returning traffic should be routed to. This technique is currently in wide use by ISPs to cope with the exhaustion of public IPv4 addresses.
In Linux, we can implement software NAT through iptables. Since we want to modify the egress traffic, we need source NAT, i.e. SNAT:
[root@centos ~]$ iptables -t nat -A POSTROUTING -s 172.16.0.0/24 ! -o br0 -j MASQUERADE
# -t nat: operate on the nat table; if not specified, the default is the filter table.
# -A: Append a new rule (-D deletes, -L lists).
# POSTROUTING: the chain to append to; POSTROUTING is the packet egress phase.
# -s 172.16.0.0/24: matching condition; the source address is an IP in the 172.16.0.0/24 segment.
# ! -o br0: match only packets NOT going out through br0, so bridge-internal traffic is left alone.
# -j MASQUERADE: the action on match; MASQUERADE disguises the source IP as the IP of the outgoing NIC. For the nat table, the available targets include DNAT/MASQUERADE/REDIRECT/SNAT.
Now, trying the ping again, we find that communication succeeds right away:
[root@centos ~]$ ip netns exec netns2 ping 10.0.1.26 -I veth2-peer -c 1
PING 10.0.1.26 (10.0.1.26) from 172.16.0.102 veth2-peer: 56(84) bytes of data.
64 bytes from 10.0.1.26: icmp_seq=1 ttl=63 time=0.249 ms

--- 10.0.1.26 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.249/0.249/0.249/0.000 ms
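If you are curious about the translation itself, the kernel's connection-tracking table records every NATed flow. One way to peek at it (a sketch; the conntrack tool ships in the conntrack-tools package and may need installing):

# List tracked ICMP flows; the src/dst pairs before and after translation are shown
$ conntrack -L -p icmp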
With internal-to-external communication solved, the last problem to consider is how to let the external network talk to the container directly. The address 172.16.0.102 means nothing to the outside world; only the host knows who it is, so obviously we need to do NAT again.
This time the goal is to modify the ingress traffic, so we need DNAT (Destination Network Address Translation). With SNAT we translated a whole segment globally, regardless of port; with DNAT, however, we must declare which host port maps to which port in the container, otherwise traffic cannot be routed accurately.
Again, we use iptables for the NAT:
[root@centos ~]$ iptables -t nat -A PREROUTING ! -i br0 -p tcp -m tcp --dport 8088 -j DNAT --to-destination 172.16.0.102:80
# -t nat: operate on the nat table; if not specified, the default is the filter table.
# -A: Append a new rule (-D deletes, -L lists).
# PREROUTING: the chain to append to; PREROUTING is the packet ingress phase, before routing.
# ! -i br0: match packets that did NOT arrive through br0, i.e. traffic from outside.
# -p tcp: the protocol, tcp or udp.
# --dport 8088: the destination port, i.e. the port the host listens on.
# -j DNAT: the action on match; DNAT forwards the matching traffic to port 80 on 172.16.0.102.
After executing this, we can test it. First, listen on port 80 inside netns2:
[root@centos ~]$ ip netns exec netns2 nc -lp 80
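One caveat: nc flags vary across implementations. The invocation above suits traditional netcat; with the OpenBSD variant, -l and -p may not combine, so you might need the following instead (an assumption about your local netcat, so check nc -h):

$ ip netns exec netns2 nc -l 80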
Then, from 10.0.1.26 (the machine we used earlier to test egress traffic), telnet to the host:
[root@10.0.1.26 ~]$ telnet 10.0.1.48 8088
Trying 10.0.1.48...
Connected to 10.0.1.48.
Escape character is '^]'.
As you can see, the traffic is correctly directed to the port listening inside the netns2 namespace.
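To verify that the rules are being hit, or to undo them once you are done, the nat table can be listed with counters and rules deleted by number (a sketch):

# Show nat rules with hit counters; the pkts column increments as traffic matches
$ iptables -t nat -L -n -v --line-numbers
# Delete, e.g., rule 1 from the PREROUTING chain when you no longer need it
$ iptables -t nat -D PREROUTING 1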
So far, we have walked through virtual networking inside a single host. In the next installment, we will explore cross-host virtual networks and the various networking solutions under Kubernetes.
This series of articles is produced by FinClip engineers.