Author: Zhang Hua published on: December 28, 2021
Copyright notice: you may reprint this article freely. When reprinting, please indicate the original source, the author information, and this copyright notice in the form of a hyperlink
( http://blog.csdn.net/quqi99 )
Problem
To understand this error:
openvswitch: ovs-system: deferred action limit reached, drop recirc action
From a first reading of the kernel code, the call path is:
ovs_dp_process_packet -> ovs_execute_actions -> process_deferred_actions -> do_execute_actions(OVS_ACTION_ATTR_RECIRC) -> (execute_recirc|sample|clone|execute_check_pkt_len) -> clone_execute -> add_deferred_actions -> action_fifo_put
However, because the structure of the OVS code is not obvious at first, this call path alone does not explain the error
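To make the tail of this call path concrete, here is a minimal Python model of the bounded FIFO behind the error message. It is a sketch, not the kernel code: the names only loosely mirror action_fifo_put and clone_execute, the size of 10 is an assumed value of DEFERRED_ACTION_FIFO_SIZE that should be checked against net/openvswitch/actions.c for the kernel in question, and the real per-CPU FIFO uses head/tail indexes whose exact boundary check is not reproduced here.

```python
from collections import deque

# Assumed value; verify against net/openvswitch/actions.c for your kernel.
DEFERRED_ACTION_FIFO_SIZE = 10


class ActionFifo:
    """Toy stand-in for the kernel's per-CPU action FIFO."""

    def __init__(self, size=DEFERRED_ACTION_FIFO_SIZE):
        self.size = size
        self.items = deque()

    def put(self, item):
        """Loosely mimic action_fifo_put(): refuse the entry when full."""
        if len(self.items) >= self.size:
            return False
        self.items.append(item)
        return True

    def get(self):
        """Return the oldest deferred entry, or None when the FIFO is empty."""
        return self.items.popleft() if self.items else None


def defer_recirc(fifo, packet, recirc_id, log):
    """Loosely mimic clone_execute(): drop and log when the FIFO is full."""
    if fifo.put((packet, recirc_id)):
        return True
    log.append("deferred action limit reached, drop recirc action")
    return False


def simulate(n_recirc):
    """Defer n_recirc recirculations without draining the FIFO in between."""
    fifo, log = ActionFifo(), []
    queued = sum(defer_recirc(fifo, "pkt", i, log) for i in range(n_recirc))
    return queued, len(log)
```

Under these assumptions, simulate(12) queues ten recirculations and logs two drops. The point of the model is only that the message means the queue filled up before it could be drained: too many actions were deferred while a single packet was being processed.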
The motivation for studying this error is the following problem, in which a VM cannot ping its gateway (GW):
VM(10.10.30.20, hosted in ecs4), GW chassis(ecs12)
Capture packets on the compute node while pinging the GW from the VM, then analyze them:
$ openstack port list --server <vm>
$ tcpdump -enli tap<first-11-chars> -w `hostname`_<vm-1-tap>.pcap
$ tshark -r ecs4_xx.pcap ip.src==10.10.30.20 and icmp
Capture packets on the GW chassis (the gateway chassis with the highest priority):
$ sudo ovn-nbctl lrp-list neutron-<router-uuid>
$ sudo ovn-nbctl lrp-get-gateway-chassis lrp-<ovn-port-uuid>
$ tcpdump -enli bond1 "(icmp or arp)" -w `hostname`_bond1.pcap
$ tshark -r ecs4_xx.pcap ip.src==10.10.30.20 and icmp
The capture does show intermittent ping loss: the request with seq=99 gets no reply, while the request with seq=100 is answered:
$ tshark -r ecs4_37552ee4-38.pcap ip.src==10.10.30.20 and icmp
254 74.664491 10.10.30.20 → 10.10.30.1 ICMP 98 Echo (ping) request id=0x17ab, seq=99/25344, ttl=64
267 75.679441 10.10.30.20 → 10.10.30.1 ICMP 98 Echo (ping) request id=0x17ab, seq=100/25600, ttl=64
268 75.679799 10.10.30.1 → 10.10.30.20 ICMP 98 Echo (ping) reply id=0x17ab, seq=100/25600, ttl=254 (request in 267)
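Spotting missing replies by eye gets tedious on a long capture. The following is a small sketch that scans tshark one-line summaries and reports echo requests that have no matching reply. The function name unanswered_requests and the regular expression are my own, and they assume the summary format of the tshark output shown above.

```python
import re

# Matches the ICMP part of a tshark one-line summary, for example:
#   "... Echo (ping) request id=0x17ab, seq=99/25344, ttl=64"
ECHO_RE = re.compile(
    r"Echo \(ping\) (request|reply)\s+id=(0x[0-9a-fA-F]+), seq=(\d+)/"
)


def unanswered_requests(lines):
    """Return the seq numbers of echo requests with no matching echo reply."""
    requests, replies = [], set()
    for line in lines:
        m = ECHO_RE.search(line)
        if not m:
            continue  # not an ICMP echo summary line
        kind, ident, seq = m.group(1), m.group(2), int(m.group(3))
        if kind == "request":
            requests.append((ident, seq))
        else:
            replies.add((ident, seq))
    return [seq for ident, seq in requests if (ident, seq) not in replies]


sample = [
    "254 74.664491 10.10.30.20 → 10.10.30.1 ICMP 98 Echo (ping) request id=0x17ab, seq=99/25344, ttl=64",
    "267 75.679441 10.10.30.20 → 10.10.30.1 ICMP 98 Echo (ping) request id=0x17ab, seq=100/25600, ttl=64",
    "268 75.679799 10.10.30.1 → 10.10.30.20 ICMP 98 Echo (ping) reply id=0x17ab, seq=100/25600, ttl=254 (request in 267)",
]
lost = unanswered_requests(sample)  # [99]
```

For a real capture, the same function can be fed the lines of the tshark output saved to a text file.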
The sosreport from ecs4 shows the following three types of errors:
$ grep -r 'deferred action limit reached' var/log/kern.log |tail -n1
Nov 8 13:14:30 ecs4 kernel: [9964180.307470] openvswitch: ovs-system: deferred action limit reached, drop recirc action
2021-11-10T00:00:31.476Z|147680|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 (10.10.5.180:42162<->10.10.5.166:6642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-11-10T00:01:07.194Z|147681|timeval|WARN|Unreasonably long 1110ms poll interval (1095ms user, 12ms system)
2021-11-10T00:01:07.194Z|147682|timeval|WARN|faults: 17299 minor, 0 major
2021-11-10T00:01:07.194Z|147683|coverage|INFO|Dropped 5 log messages in last 74 seconds (most recently, 35 seconds ago) due to excessive rate
$ var/log/ovn/ovn-controller.log.1.gz:2021-11-10T08:21:40.110Z|154925|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-fbf33f64-0cce-497d-a261-2d3d88e20b80 and \"::\") for index on columns \"logical_port\" and \"ip\". First row, with UUID 4e63d47d-791b-4cc1-ab3c-3d3ac29b5439, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6d6281a0-a16e-4fbc-b8b2-da59038f22d5, was inserted by this transaction.","error":"constraint violation"}
$ sudo ovn-nbctl show| egrep "^router |lrp-fbf33f64-0cce-497d-a261-2d3d88e20b80"| grep "port lrp" -B1
router 4307456d-3f8b-412c-a784-812f3e73fbfc (neutron-dbf7c13b-751a-41da-b504-09576617213e) (aka ansible-int)
port lrp-fbf33f64-0cce-497d-a261-2d3d88e20b80
$ sudo ovn-nbctl show 4307456d-3f8b-412c-a784-812f3e73fbfc
OVS code structure
This article gives a good overview: https://blog.csdn.net/nb_zsy/article/details/107893255
An OpenFlow controller sends flow rules to ovsdb-server / ovs-vswitchd, which then caches them in the kernel datapath through netlink (you can view the cached flow rules with ovs-appctl dpctl/dump-flows type=ovs). From then on, the kernel datapath forwards packets directly according to these cached flow rules (the fast path). If the datapath has no cached flow that tells it how to forward a packet, it has to ask userspace through netlink (the slow path). The design idea of OVS is to forward network traffic efficiently by combining the slow path and the fast path. Similarly, if the network card supports hardware offload, the flow rules can be cached in the network card hardware through TC to improve performance further.
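The interplay of fast path and slow path can be sketched as a cache in front of an expensive lookup. This is a toy model, not OVS code: the Datapath class, the openflow_tables function, and the string-based keys and actions are all hypothetical, and the real datapath matches on masked keys rather than on exact ones.

```python
class Datapath:
    """Toy model of the kernel flow cache in front of a userspace slow path."""

    def __init__(self, slow_path):
        self.flow_cache = {}        # stands in for the kernel flow table
        self.slow_path = slow_path  # stands in for the netlink upcall to userspace
        self.upcalls = 0            # how many packets needed the slow path

    def process_packet(self, key):
        actions = self.flow_cache.get(key)
        if actions is None:
            # Cache miss: ask userspace, then install the answer in the cache.
            self.upcalls += 1
            actions = self.slow_path(key)
            self.flow_cache[key] = actions
        return actions


def openflow_tables(key):
    """Hypothetical slow path: pretend the OpenFlow tables say 'output:2'."""
    return ["output:2"]


dp = Datapath(openflow_tables)
key = ("in_port=1", "ip_dst=10.10.30.1")
first = dp.process_packet(key)   # miss, goes through the slow path
second = dp.process_packet(key)  # hit, served from the cache
```

Only the first packet of the flow pays for the slow path here; dp.upcalls stays at 1 after the second packet.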
Datapath
The openvswitch datapath kernel module is responsible for packet processing: it matches the packets received on an input port against the flow table and executes the actions of the matching flow. A datapath can have multiple vports; a vport is similar to a port on a physical switch. A datapath is associated with a flow table, and a flow table contains multiple entries. Each entry consists of two parts: a match/key and an action
Packet processing starts in ovs_dp_process_packet, which looks up a flow according to the mask and the key. If no flow is found, the packet has to be sent up to userspace for slow-path matching. If a flow matches, its actions are executed. They include:
- OVS_ACTION_ATTR_OUTPUT: gets the port number and calls do_output() to send the packet to that port
- OVS_ACTION_ATTR_USERSPACE: calls output_userspace() to send the packet to userspace
- OVS_ACTION_ATTR_HASH: calls execute_hash() to compute the hash of the skb and store it in ovs_flow_hash
- OVS_ACTION_ATTR_PUSH_VLAN: calls push_vlan() to add a VLAN header
- OVS_ACTION_ATTR_RECIRC: adds a deferred_action to the global action_fifos array
- OVS_ACTION_ATTR_SET: calls execute_set_action() to set the relevant fields
- OVS_ACTION_ATTR_SAMPLE: sends the packet to userspace with a given probability (related to sFlow)
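The mask-and-key lookup that precedes these actions can be sketched as follows. This is a simplified illustration under my own assumptions, not the kernel data structure: keys are plain tuples of integers (in_port, ip_dst), and FlowTable, apply_mask, and the action strings are hypothetical names.

```python
def apply_mask(key, mask):
    """AND every field of the key with the corresponding mask field."""
    return tuple(k & m for k, m in zip(key, mask))


class FlowTable:
    """Toy flow table: one exact-match dictionary per distinct mask."""

    def __init__(self):
        self.by_mask = {}  # mask -> {masked key -> actions}

    def insert(self, key, mask, actions):
        self.by_mask.setdefault(mask, {})[apply_mask(key, mask)] = actions

    def lookup(self, key):
        # Try each mask in turn: mask the packet key, then do an exact match.
        for mask, flows in self.by_mask.items():
            actions = flows.get(apply_mask(key, mask))
            if actions is not None:
                return actions
        return None  # miss: the real datapath would send the packet to userspace


table = FlowTable()
# Match in_port exactly and ip_dst on 10.10.30.0/24 (0x0A0A1E00).
table.insert((1, 0x0A0A1E00), (0xFFFFFFFF, 0xFFFFFF00), ["output:2"])

hit = table.lookup((1, 0x0A0A1E14))   # in_port 1, ip_dst 10.10.30.20
miss = table.lookup((2, 0x0A0A1E14))  # same destination, different in_port
```

Grouping flows by mask means each lookup costs one dictionary probe per distinct mask rather than one comparison per flow, which is why the number of masks matters for lookup cost in this model.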
What is OVS_ACTION_ATTR_RECIRC
This article explains it well: OpenvSwitch code reading notes (1) -- Freezing Translation: https://xiaohutou.github.io/2018/06/01/ovs-code-note-1/
References
[1] https://blog.csdn.net/nb_zsy/article/details/107893255
[2] https://blog.csdn.net/quqi99/article/details/111831695
[3] https://xiaohutou.github.io/2018/06/01/ovs-code-note-1/