DPVS FullNAT mode deployment

This article describes how to deploy DPVS in FullNAT mode on a CentOS 7.9 system and how to install the toa module on the RealServer (RS) to obtain the client's real IP.

Two previous articles, "DPVS introduction and deployment" and "Application and principle analysis of DPDK in DPVS", cover the background; readers who need it can read those first. The deployment steps in the earlier article covered only DPVS itself and not the configuration of the individual load balancing modes, and both DPVS and its corresponding DPDK version have been updated in the half year or more since, so a new, detailed deployment tutorial is written here.

The DPVS version installed in this article is 1.8-10 and the DPDK version is 18.11.2. These differ from the versions in the earlier article, so the installation steps and operations differ as well.

1. Preparatory work

Before the actual installation, we need to adjust the hardware settings of the machine. DPVS has certain hardware requirements (mainly because of the DPDK it is built on), and the DPDK project publishes a support list. Although that list covers a wide range of platforms, in practice Intel's hardware platform seems to offer the best compatibility and performance.

1.1 hardware

1.1.1 hardware parameters

Machine model: PowerEdge R630
CPU: 2 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Memory: 16G*8 DDR4-2400 MT/s (Configured in 2133 MT/s), 64G for each CPU, 128G in total
Network card 1: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Network card 2: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
System: CentOS Linux release 7.9.2009 (Core)
Kernel: 3.10.0-1160.36.2.el7.x86_64

1.1.2 BIOS settings

Before starting, enter the BIOS, turn off hyper-threading and enable the NUMA policy. DPVS is a typical CPU-bound application (the CPU utilization of its process is always 100%), so turning off hyper-threading is recommended for performance. To preserve CPU affinity, it is also best to enable NUMA manually in the BIOS.

1.1.3 network card PCI ID

Once the PMD driver used by dpvs takes over the network cards, they are easy to mix up if there are many of them. It is best to record each card's name, MAC address and PCI ID in advance to avoid confusion in later steps.

Use the lspci command to view the PCI IDs of the network cards. Then inspect the device file under each card's directory in /sys/class/net/ to find out which PCI ID belongs to which interface name. With that, the interface name, MAC address and PCI ID can be tied together.

$ lspci | grep -i net
01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

$ file /sys/class/net/eth0/device
/sys/class/net/eth0/device: symbolic link to `../../../0000:01:00.0'
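
The lookup above can be repeated for every card in one pass. The following is a minimal sketch, not part of the original procedure: the function name `list_nics` is hypothetical, and it assumes the usual sysfs layout in which each physical interface has an `address` file and a `device` symlink whose target directory is named after the PCI ID.

```shell
# Print "name MAC PCI-ID" for every physical NIC under a sysfs net directory.
# The directory argument defaults to /sys/class/net; it is a parameter only so
# that the function can be tried against a test tree.
list_nics() {
    local net_dir="${1:-/sys/class/net}" nic name mac pci
    for nic in "$net_dir"/*; do
        # Virtual interfaces (lo, bridges, ...) have no "device" link; skip them.
        [ -e "$nic/device" ] || continue
        name=$(basename "$nic")
        mac=$(cat "$nic/address")
        # The last path component of the resolved link is the PCI ID.
        pci=$(basename "$(readlink -f "$nic/device")")
        printf '%s %s %s\n' "$name" "$mac" "$pci"
    done
}

# Usage: list_nics > nic-map.txt   (save it before binding the cards to igb_uio)
```

Saving this output before the driver is changed is useful because the interface names disappear from /sys/class/net once the kernel driver is unbound.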

1.2 software

1.2.1 system software

# Tools for compiling and installing dpvs and tools for viewing CPU NUMA information
$ yum group install "Development Tools" 
$ yum install patch libnuma* numactl numactl-devel kernel-devel openssl* popt* libpcap-devel -y
# If you need ipvsadm to support IPv6, you need to install libnl3-devel
$ yum install libnl libnl-devel libnl3 libnl3-devel -y


# Note that the kernel-related packages must match the version of the running kernel
$ uname -r
3.10.0-1160.36.2.el7.x86_64
$ rpm -qa | grep kernel | grep "3.10.0-1160.36.2"
kernel-3.10.0-1160.36.2.el7.x86_64
kernel-devel-3.10.0-1160.36.2.el7.x86_64
kernel-tools-libs-3.10.0-1160.36.2.el7.x86_64
kernel-debug-devel-3.10.0-1160.36.2.el7.x86_64
kernel-tools-3.10.0-1160.36.2.el7.x86_64
kernel-headers-3.10.0-1160.36.2.el7.x86_64
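
The version check above can also be scripted. This is a small sketch under the assumption that the package is named `kernel-devel-<release>` as in the listing; the function name `has_matching_kernel_devel` is made up for illustration.

```shell
# Succeed if a kernel-devel package for the given kernel release appears in
# the package list read from stdin (one package name per line).
has_matching_kernel_devel() {
    local release="$1"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "kernel-devel-$release"
}

# Usage:
#   rpm -qa | has_matching_kernel_devel "$(uname -r)" \
#       || echo "kernel-devel does not match the running kernel"
```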

1.2.2 dpvs and dpdk

# Pull the latest version of dpvs directly from GitHub with git
$ git clone https://github.com/iqiyi/dpvs.git
# Download dpdk 18.11.2 from the official website and put it in the dpvs directory for convenience
$ cd dpvs/
$ wget https://fast.dpdk.org/rel/dpdk-18.11.2.tar.xz
$ tar -Jxvf dpdk-18.11.2.tar.xz

After completing the above steps, you can start the following installation.

2. Installation steps

2.1 DPDK installation

2.1.1 installing dpdk patch

The patch directory of the dpvs folder contains patches for each supported dpdk version. If you don't know which patches you need, the official recommendation is to apply all of them.

$ ll dpvs/patch/dpdk-stable-18.11.2
total 44
-rw-r--r-- 1 root root  4185 Jul 22 12:47 0001-kni-use-netlink-event-for-multicast-driver-part.patch
-rw-r--r-- 1 root root  1771 Jul 22 12:47 0002-net-support-variable-IP-header-len-for-checksum-API.patch
-rw-r--r-- 1 root root  1130 Jul 22 12:47 0003-driver-kni-enable-flow_item-type-comparsion-in-flow_.patch
-rw-r--r-- 1 root root  1706 Jul 22 12:47 0004-rm-rte_experimental-attribute-of-rte_memseg_walk.patch
-rw-r--r-- 1 root root 16538 Jul 22 12:47 0005-enable-pdump-and-change-dpdk-pdump-tool-for-dpvs.patch
-rw-r--r-- 1 root root  2189 Jul 22 12:47 0006-enable-dpdk-eal-memory-debug.patch

Applying the patches is straightforward:

# We first copy all the patches to the root directory of dpdk
$ cp dpvs/patch/dpdk-stable-18.11.2/*patch dpvs/dpdk-stable-18.11.2/
$ cd dpvs/dpdk-stable-18.11.2/
# Then we install it in the order of the file names of the patch
$ patch -p 1 < 0001-kni-use-netlink-event-for-multicast-driver-part.patch
patching file kernel/linux/kni/kni_net.c
$ patch -p 1 < 0002-net-support-variable-IP-header-len-for-checksum-API.patch
patching file lib/librte_net/rte_ip.h
$ patch -p 1 < 0003-driver-kni-enable-flow_item-type-comparsion-in-flow_.patch
patching file drivers/net/mlx5/mlx5_flow.c
$ patch -p 1 < 0004-rm-rte_experimental-attribute-of-rte_memseg_walk.patch
patching file lib/librte_eal/common/eal_common_memory.c
Hunk #1 succeeded at 606 (offset 5 lines).
patching file lib/librte_eal/common/include/rte_memory.h
$ patch -p 1 < 0005-enable-pdump-and-change-dpdk-pdump-tool-for-dpvs.patch
patching file app/pdump/main.c
patching file config/common_base
patching file lib/librte_pdump/rte_pdump.c
patching file lib/librte_pdump/rte_pdump.h
$ patch -p 1 < 0006-enable-dpdk-eal-memory-debug.patch
patching file config/common_base
patching file lib/librte_eal/common/include/rte_malloc.h
patching file lib/librte_eal/common/rte_malloc.c
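
Since the patches are applied in file name order, the six commands above can be written as a loop. The sketch below assumes the patch files have already been copied into the dpdk source directory as shown; the function name `apply_patches` is hypothetical.

```shell
# Apply every numbered *.patch file in a source directory, in file name order.
#   $1: directory that contains both the sources and the copied patch files
apply_patches() {
    local src_dir="$1" p
    (
        cd "$src_dir" || exit 1
        # The shell expands the glob in sorted order: 0001-..., 0002-..., ...
        for p in [0-9]*.patch; do
            [ -e "$p" ] || continue        # no patch files present
            echo "applying $p"
            patch -p1 < "$p" || exit 1     # stop at the first failure
        done
    )
}

# Usage: apply_patches dpvs/dpdk-stable-18.11.2
```

Stopping at the first failure keeps a half-patched tree from going unnoticed, which is easy to miss when the commands are pasted one by one.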

2.1.2 dpdk compilation and installation

$ cd dpvs/dpdk-stable-18.11.2
$ make config T=x86_64-native-linuxapp-gcc
$ make 

# The message "Build complete [x86_64-native-linuxapp-gcc]" indicates that make succeeded

$ export RTE_SDK=$PWD
$ export RTE_TARGET=build

The ndo_change_mtu problem seen with dpdk 17.11.2 does not occur when compiling and installing this version.

2.1.3 configure hugepage

Unlike ordinary programs, the dpdk used by dpvs does not request memory from the operating system on demand; it uses hugepage memory directly, which greatly improves the efficiency of memory allocation. Configuring hugepages is fairly simple. The official configuration process uses 2MB hugepages. The value 28672 here means that 28672 hugepages of 2MB each are allocated, that is 56GB of memory per NUMA node and 112GB in total. The amount can be adjusted to the size of the machine, but allocating less than 1GB may cause a startup error.

For a single-CPU (non-NUMA) system, refer to the official dpdk documentation.

# for NUMA machine
$ echo 28672 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ echo 28672 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

$ mkdir /mnt/huge
$ mount -t hugetlbfs nodev /mnt/huge

# To mount hugetlbfs automatically at boot, add it to /etc/fstab
$ echo "nodev /mnt/huge hugetlbfs defaults 0 0" >> /etc/fstab

# After the configuration is completed, the used memory increases immediately
$ free -g	# Before configuration
              total        used        free      shared  buff/cache   available
Mem:            125           1         122           0           1         123
$ free -g	# After configuration
              total        used        free      shared  buff/cache   available
Mem:            125         113          10           0           1          11
# numactl also shows that about 56G of memory has been taken from each of the two NUMA nodes
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18
node 0 size: 64184 MB
node 0 free: 4687 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19
node 1 size: 64494 MB
node 1 free: 5759 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
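
The page count written to nr_hugepages follows from simple arithmetic: memory to reserve divided by page size. A minimal sketch for working out the value for a different machine (the function name `hugepages_for_gb` is hypothetical):

```shell
# Number of hugepages needed to reserve a given amount of memory.
#   $1: memory to reserve per NUMA node, in GB
#   $2: hugepage size in kB (defaults to 2048, i.e. the 2MB pages used here)
hugepages_for_gb() {
    local gb="$1" page_kb="${2:-2048}"
    echo $(( gb * 1024 * 1024 / page_kb ))
}

hugepages_for_gb 56            # prints 28672, the value used above
hugepages_for_gb 56 1048576    # with 1GB pages: prints 56
```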

2.1.4 configuring ulimit

By default, the ulimit of the system limits the number of open file descriptors. If it is too small, it will affect the normal operation of dpvs, so we increase it:

$ ulimit -n 655350
$ echo "ulimit -n 655350" >> /etc/rc.local
$ chmod a+x /etc/rc.local
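
Before starting dpvs it can be worth confirming that the limit is actually in effect in the current shell. A small sketch; the function name `nofile_at_least` is hypothetical and the threshold is simply the value used above.

```shell
# Succeed if the open-file limit of the current shell is at least $1.
nofile_at_least() {
    local want="$1" cur
    cur=$(ulimit -n)
    # "unlimited" satisfies any requirement; otherwise compare numerically.
    [ "$cur" = "unlimited" ] || [ "$cur" -ge "$want" ]
}

# Usage:
#   nofile_at_least 655350 || echo "open file limit too low: $(ulimit -n)"
```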

2.2 load the driver modules

First, we load the dpdk driver (PMD driver) modules we have just compiled into the system, and then replace the default driver of each network card with that PMD driver.

$ modprobe uio
$ insmod /path/to/dpdk-stable-18.11.2/build/kmod/igb_uio.ko
$ insmod /path/to/dpdk-stable-18.11.2/build/kmod/rte_kni.ko carrier=on

Note that the carrier parameter was added in dpdk v18.11 and defaults to off. The rte_kni.ko module must be loaded with the carrier=on parameter for the KNI devices to work properly.

The dpdk-stable-18.11.2/usertools directory contains some scripts that help with installing and using dpdk and reduce the complexity of configuration. Here we use the dpdk-devbind.py script to change the driver of the network cards.

# First, bring down the network cards that will be bound to the PMD driver
$ ifdown eth{2,3,4,5}

# Check the status of the network cards. Pay special attention to the PCI ID of each card. Only the relevant part of the output is shown below
$ ./usertools/dpdk-devbind.py --status
Network devices using kernel driver
===================================
0000:04:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth2 drv=ixgbe unused=igb_uio
0000:04:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth3 drv=ixgbe unused=igb_uio
0000:82:00.0 'Ethernet 10G 2P X520 Adapter 154d' if=eth4 drv=ixgbe unused=igb_uio
0000:82:00.1 'Ethernet 10G 2P X520 Adapter 154d' if=eth5 drv=ixgbe unused=igb_uio

The output above shows that the network cards currently use the ixgbe driver; our goal is to make them use the igb_uio driver. If the system has many network cards, the interface name, MAC address and PCI ID recorded earlier help to identify the right ones at this point.

# Bind the igb_uio driver to the network cards that dpvs will use
$ ./usertools/dpdk-devbind.py -b igb_uio 0000:04:00.0
$ ./usertools/dpdk-devbind.py -b igb_uio 0000:04:00.1
$ ./usertools/dpdk-devbind.py -b igb_uio 0000:82:00.0
$ ./usertools/dpdk-devbind.py -b igb_uio 0000:82:00.1

# Check again whether the binding succeeded. Only the relevant part of the output is shown below
$ ./usertools/dpdk-devbind.py --status
Network devices using DPDK-compatible driver
============================================
0000:04:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:04:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:82:00.0 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe
0000:82:00.1 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe
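
The status listing can also be checked mechanically. The following sketch filters the `--status` output for devices bound to a given driver; it assumes the line format shown above (PCI ID first, then a `drv=` field), and the function name `pci_ids_with_driver` is hypothetical.

```shell
# Print the PCI IDs of all devices whose current driver is $1.
# Reads `dpdk-devbind.py --status` output from stdin.
pci_ids_with_driver() {
    local drv="$1"
    awk -v want="drv=$drv" '
        # Device lines start with a PCI address such as 0000:04:00.0
        /^[0-9a-fA-F]+:[0-9a-fA-F]+:[0-9a-fA-F]+\.[0-9a-fA-F]/ {
            for (i = 2; i <= NF; i++)
                if ($i == want) print $1
        }'
}

# Usage:
#   ./usertools/dpdk-devbind.py --status | pci_ids_with_driver igb_uio
```

Matching on the whole `drv=` field keeps the `unused=` field from producing false hits, since both fields can name the same driver.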

2.3 DPVS installation

$ cd /path/to/dpdk-stable-18.11.2/
$ export RTE_SDK=$PWD
$ cd /path/to/dpvs
$ make 
$ make install
# View the binary files in the bin directory
$ ls /path/to/dpvs/bin/
dpip  dpvs  ipvsadm  keepalived

# Pay attention to the messages printed during make, especially the keepalived part. If the following appears, IPVS supports IPv6
Keepalived configuration
------------------------
Keepalived version       : 2.0.19
Compiler                 : gcc
Preprocessor flags       : -D_GNU_SOURCE -I/usr/include/libnl3
Compiler flags           : -g -g -O2 -fPIE -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -O2
Linker flags             : -pie -Wl,-z,relro -Wl,-z,now
Extra Lib                : -lm -lcrypto -lssl -lnl-genl-3 -lnl-3
Use IPVS Framework       : Yes
IPVS use libnl           : Yes
IPVS syncd attributes    : No
IPVS 64 bit stats        : No



# For easier management, the relevant commands can be symlinked into /sbin so they can be run from anywhere
$ ln -s /path/to/dpvs/bin/dpvs /sbin/dpvs
$ ln -s /path/to/dpvs/bin/dpip /sbin/dpip
$ ln -s /path/to/dpvs/bin/ipvsadm /sbin/ipvsadm
$ ln -s /path/to/dpvs/bin/keepalived /sbin/keepalived

# Check whether the commands related to dpvs can work normally. Note that other commands can be used normally only after the dpvs process is started
$ dpvs -v
dpvs version: 1.8-10, build on 2021.07.26.15:34:26

2.4 configuring dpvs.conf

The dpvs/conf directory contains sample dpvs configuration files for the various configuration modes, and the dpvs.conf.items file documents all parameters. It is recommended to read through them and understand the basic syntax before writing your own configuration. The default configuration file read by dpvs at startup is /etc/dpvs.conf.

Here is a brief summary of some parts (! is the comment symbol):

The log level can be set to DEBUG and the log output location changed, which makes troubleshooting easier:
```
global_defs {
    log_level   DEBUG
    log_file    /path/to/dpvs/logs/dpvs.log
}
```

If you need to define multiple network cards, you can refer to this configuration

netif_defs {
    <init> pktpool_size     1048575
    <init> pktpool_cache    256

    <init> device dpdk0 {
        rx {
            queue_number        16
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        16
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        kni_name                dpdk0.kni
    }

    <init> device dpdk1 {
        rx {
            queue_number        16
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        16
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        kni_name                dpdk1.kni
    }

    <init> device dpdk2 {
        rx {
            queue_number        16
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        16
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        kni_name                dpdk2.kni
    }

    <init> device dpdk3 {
        rx {
            queue_number        16
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        16
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        kni_name                dpdk3.kni
    }

}
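
The four device blocks above differ only in the port name, so they can be generated rather than copied by hand. This is an illustrative sketch, not a dpvs tool: the function names `gen_device` and `gen_netif_defs` are hypothetical, and the parameter values are simply the ones used in the configuration above.

```shell
# Print one <init> device block for dpvs.conf.
#   $1: DPDK port name (dpdk0, dpdk1, ...)
#   $2: number of rx/tx queues (defaults to 16, as in this article)
gen_device() {
    local name="$1" queues="${2:-16}"
    cat <<EOF
    <init> device $name {
        rx {
            queue_number        $queues
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        $queues
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        kni_name                $name.kni
    }

EOF
}

# Print a complete netif_defs section.
#   $1: queue count, remaining arguments: port names
gen_netif_defs() {
    local queues="$1" port
    shift
    echo "netif_defs {"
    echo "    <init> pktpool_size     1048575"
    echo "    <init> pktpool_cache    256"
    echo
    for port in "$@"; do
        gen_device "$port" "$queues"
    done
    echo "}"
}

# Usage: gen_netif_defs 16 dpdk0 dpdk1 dpdk2 dpdk3
```

The output is meant to be reviewed and pasted into /etc/dpvs.conf, not written there automatically.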

The rx/tx queue with the same number on each network card is served by the same CPU:

    <init> worker cpu1 {
        type    slave
        cpu_id  1
        port    dpdk0 {
            rx_queue_ids     0
            tx_queue_ids     0
        }
        port    dpdk1 {
            rx_queue_ids     0
            tx_queue_ids     0
        }
        port    dpdk2 {
            rx_queue_ids     0
            tx_queue_ids     0
        }
        port    dpdk3 {
            rx_queue_ids     0
            tx_queue_ids     0
        }
    }
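
Worker blocks follow the same repetitive pattern. The sketch below generates one; it assumes the mapping suggested by the examples in this article, where worker cpuN handles queue N-1 on every port (cpu1 uses queue 0, cpu16 uses queue 15). That mapping is an inference from those examples rather than a dpvs rule, and the function name `gen_worker` is hypothetical.

```shell
# Print one <init> worker block for dpvs.conf.
#   $1: CPU id of the worker, remaining arguments: port names
# Assumption: worker cpuN serves rx/tx queue N-1 on each port.
gen_worker() {
    local cpu="$1" queue port
    shift
    queue=$(( cpu - 1 ))
    echo "    <init> worker cpu$cpu {"
    echo "        type    slave"
    echo "        cpu_id  $cpu"
    for port in "$@"; do
        echo "        port    $port {"
        echo "            rx_queue_ids     $queue"
        echo "            tx_queue_ids     $queue"
        echo "        }"
    done
    echo "    }"
}

# Usage: gen_worker 1 dpdk0 dpdk1 dpdk2 dpdk3
```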

To dedicate a CPU to processing ICMP packets, add icmp_redirect_core to that worker's parameters:

    <init> worker cpu16 {
        type    slave
        cpu_id  16
        icmp_redirect_core
        port    dpdk0 {
            rx_queue_ids     15
            tx_queue_ids     15
        }
    }

After the DPVS process is started, the corresponding kni network card can be configured directly in the Linux network configuration files, exactly like other network cards such as eth0.

Once dpvs is running, the corresponding dpdk network card is visible through the dpip command as well as the usual ip and ifconfig commands, and both IPv4 and IPv6 work normally. The output below shows only part of the information; the IP and MAC addresses have been anonymized and the IPv6 information has been removed.

$ dpip link show
1: dpdk0: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 10000 Mbps full-duplex auto-nego
    addr AA:BB:CC:23:33:33 OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM

$ ip a
67: dpdk0.kni: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether AA:BB:CC:23:33:33 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.1/24 brd 1.1.1.255 scope global dpdk0.kni
       valid_lft forever preferred_lft forever
       
$ ifconfig dpdk0.kni
dpdk0.kni: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.1.1.1  netmask 255.255.254.0  broadcast 1.1.1.255
        ether AA:BB:CC:23:33:33  txqueuelen 1000  (Ethernet)
        RX packets 1790  bytes 136602 (133.4 KiB)
        RX errors 0  dropped 52  overruns 0  frame 0
        TX packets 115  bytes 24290 (23.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

3. Configure FullNat

To verify that DPVS works properly, we follow the official configuration document and first configure the simplest two-arm FNAT setup. Based on the official architecture diagram, with the IP addresses changed, we get the following simple architecture diagram.

In this mode, there is no need to configure the kni network cards virtualized by DPVS with system tools such as ip and ifconfig.

[Figure: two-arm FullNAT architecture diagram (https://resource.tinychen.com/20210728123202.svg)]

Here, we use the dpdk2 network card as the WAN port and the dpdk0 network card as the LAN port.

# First, we add VIP 10.0.96.204 to the dpdk2 network card (wan)
$ dpip addr add 10.0.96.204/32 dev dpdk2

# Next, add two routes: one for the WAN segment and one for the RS segment
$ dpip route add 10.0.96.0/24 dev dpdk2
$ dpip route add 192.168.229.0/24 dev dpdk0
# It is better to add a default route to the gateway to ensure that the return packet of ICMP packet can run through
$ dpip route add default via 10.0.96.254 dev dpdk2

# Establish forwarding rules using RR algorithm
# add service <VIP:vport> to forwarding, scheduling mode is RR.
# use ipvsadm --help for more info.
$ ipvsadm -A -t 10.0.96.204:80 -s rr

# For ease of testing, only one RS is added here
# add one RS for the service, forwarding mode is FNAT (-b)
$ ipvsadm -a -t 10.0.96.204:80 -r 192.168.229.1 -b

# Add a Local IP (LIP); FNAT mode requires one
# add at least one Local-IP (LIP) for FNAT on LAN interface
$ ipvsadm --add-laddr -z 192.168.229.204 -t 10.0.96.204:80 -F dpdk0


# Then let's see the effect
$ dpip route show
inet 192.168.229.204/32 via 0.0.0.0 src 0.0.0.0 dev dpdk0 mtu 1500 tos 0 scope host metric 0 proto auto
inet 10.0.96.204/32 via 0.0.0.0 src 0.0.0.0 dev dpdk2 mtu 1500 tos 0 scope host metric 0 proto auto
inet 10.0.96.0/24 via 0.0.0.0 src 0.0.0.0 dev dpdk2 mtu 1500 tos 0 scope link metric 0 proto auto
inet 192.168.229.0/24 via 0.0.0.0 src 0.0.0.0 dev dpdk0 mtu 1500 tos 0 scope link metric 0 proto auto
inet 0.0.0.0/0 via 10.0.96.254 src 0.0.0.0 dev dpdk2 mtu 1500 tos 0 scope global metric 0 proto auto

$ dpip addr show
inet 10.0.96.204/32 scope global dpdk2
     valid_lft forever preferred_lft forever
inet 192.168.229.204/32 scope global dpdk0
     valid_lft forever preferred_lft forever

$ ipvsadm  -ln
IP Virtual Server version 0.0.0 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.96.204:80 rr
  -> 192.168.229.1:80              FullNat 1      0          0
$ ipvsadm  -G
VIP:VPORT            TOTAL    SNAT_IP              CONFLICTS  CONNS
10.0.96.204:80    1
                              192.168.229.204       0          0
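
The VIP, service, RS and LIP commands above share the same few values, so they can be collected into a parameterised helper. The sketch below only prints the commands, using exactly the dpip and ipvsadm invocations shown above; the function name `fnat_commands` is hypothetical, and the routes are left out because they depend on the local subnets.

```shell
# Print the dpip/ipvsadm commands for a one-RS FullNAT service.
#   $1: VIP   $2: virtual port   $3: WAN device   $4: LAN device
#   $5: RS address               $6: Local IP (LIP)
fnat_commands() {
    local vip="$1" vport="$2" wan_dev="$3" lan_dev="$4" rs="$5" lip="$6"
    echo "dpip addr add $vip/32 dev $wan_dev"
    echo "ipvsadm -A -t $vip:$vport -s rr"
    echo "ipvsadm -a -t $vip:$vport -r $rs -b"
    echo "ipvsadm --add-laddr -z $lip -t $vip:$vport -F $lan_dev"
}

# Reproduces the commands used in this section:
fnat_commands 10.0.96.204 80 dpdk2 dpdk0 192.168.229.1 192.168.229.204
```

Printing instead of executing makes it possible to review the commands first; once they look right, the output can be piped to `sh` on the DPVS machine.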

Then we start nginx on the RS and have it return the client IP and port number it sees, to check the effect:

    server {
        listen 80 default;

        location / {
            default_type text/plain;
            return 200 "Your IP and port is $remote_addr:$remote_port\n";
        }

    }

Test VIP directly with ping and curl commands:

$ ping -c4 10.0.96.204
PING 10.0.96.204 (10.0.96.204) 56(84) bytes of data.
64 bytes from 10.0.96.204: icmp_seq=1 ttl=54 time=47.2 ms
64 bytes from 10.0.96.204: icmp_seq=2 ttl=54 time=48.10 ms
64 bytes from 10.0.96.204: icmp_seq=3 ttl=54 time=48.5 ms
64 bytes from 10.0.96.204: icmp_seq=4 ttl=54 time=48.5 ms

--- 10.0.96.204 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 8ms
rtt min/avg/max/mdev = 47.235/48.311/48.969/0.684 ms

$ curl 10.0.96.204
Your IP and port is 192.168.229.204:1033

No matter which machine the request comes from, only the IP and port number of the LIP are returned. To obtain the user's real IP, the TOA module must be installed.

4. RS installation TOA module

The open source community currently provides many versions of the TOA module. To ensure compatibility, we use the toa and uoa modules shipped with dpvs. According to the official description, their toa module is derived from Alibaba TOA:

TOA source code is included into DPVS project(in directory kmod/toa) since v1.7 to support IPv6 and NAT64. It is derived from the Alibaba TOA. For IPv6 applications which need client's real IP address, we suggest to use this TOA version.

Since both RS machines and DPVS machines here use the CentOS7 system, we can directly compile the toa module on the DPVS machine and then copy it to each RS machine for use

$ cd /path/to/dpvs/kmod/toa/
$ make

After a successful build, a toa.ko module file is generated in the current directory; this is the file we need. Load the module with the insmod command and then check it:

$ insmod toa.ko
$ lsmod  | grep toa
toa                   279641  0

To make sure the module is loaded at boot, add the following line to the rc.local file:

/usr/sbin/insmod /path/to/toa.ko
# for example: 
# /usr/sbin/insmod /home/dpvs/kmod/toa/toa.ko
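
Running insmod on a module that is already loaded fails, so a guard can make the rc.local entry safe to re-run. A minimal sketch, assuming the usual lsmod output format with the module name in the first column; the function name `module_loaded` is hypothetical.

```shell
# Succeed if module $1 appears in lsmod-style output read from stdin.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage (path as in the rc.local line above):
#   lsmod | module_loaded toa || /usr/sbin/insmod /path/to/toa.ko
```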

Besides the toa module, there is also a uoa module for the UDP protocol. Its compilation and installation process is the same as for the toa module above, so it is not repeated here.

After loading the toa module on the RS, we run curl again:

$ curl 10.0.96.204
Your IP and port is 172.16.0.1:62844

At this point, DPVS in FullNAT mode is fully deployed and working normally. Since DPVS supports many configuration combinations, a separate article on IPv6, NAT64, keepalived, bonding and Master/Backup mode configuration will be written later.

Keywords: Load Balance lvs

Added by DeathStar on Sat, 26 Feb 2022 18:54:28 +0200
