DPDK troubleshooting: a port of a four-port X710 network card cannot be initialized

Problem description

On a newly introduced hardware platform, the service ports use an X710 network card with four optical ports. The DPDK program reports the following errors during initialization:

EAL: PCI device 0000:16:00.0 on NUMA socket -1
EAL:   probe driver: 8086:1581 rte_i40e_pmd
EAL:   PCI memory mapped at 0x4200120000
EAL:   PCI memory mapped at 0x4200920000
PMD: eth_i40e_dev_init(): FW 6.2 API 1.5 NVM 06.08.00 eetrack 80004cf1
PMD: i40e_dcb_init_configure(): Failed to stop lldp
EAL: PCI device 0000:16:00.1 on NUMA socket -1
EAL:   probe driver: 8086:1581 rte_i40e_pmd
EAL:   PCI memory mapped at 0x4200928000
EAL:   PCI memory mapped at 0x4201128000
PMD: eth_i40e_dev_init(): FW 6.2 API 1.5 NVM 06.08.00 eetrack 80004cf1
PMD: i40e_dcb_init_configure(): Failed to stop lldp
EAL: PCI device 0000:16:00.2 on NUMA socket -1
EAL:   probe driver: 8086:1581 rte_i40e_pmd
EAL:   PCI memory mapped at 0x4201130000
EAL:   PCI memory mapped at 0x4201930000
PMD: eth_i40e_dev_init(): FW 6.2 API 1.5 NVM 06.08.00 eetrack 80004cf1
PMD: eth_i40e_dev_init(): Failed to do parameter init: -22
EAL: Error - exiting with code: 1

Ports 16:00.0 and 16:00.1 initialized normally, but a parameter error (-22) was reported while initializing 16:00.2.

Troubleshooting process

1. dpdk-16.04 l2fwd test

After enabling the relevant debug output, the following additional messages are printed:

PMD: eth_i40e_dev_init(): FW 6.1 API 1.7 NVM 06.08.00 eetrack 80003cf1
PMD: i40e_configure_registers(): Read from 0x26ce00: 0x10000200
PMD: i40e_configure_registers(): Read from 0x26ce08: 0x11f0200
PMD: i40e_configure_registers(): Read from 0x269fbc: 0x3030303
PMD: i40e_pf_parameter_init(): No queue or VSI left for VMDq
PMD: i40e_pf_parameter_init(): Failed to allocate 2 VSIs, which exceeds the hardware maximum 0
PMD: eth_i40e_dev_init(): Failed to do parameter init: -22
PMD: i40e_free_dma_mem_d(): memzone i40e_dma_664339303382194

This output shows that i40e_pf_parameter_init failed to allocate 2 VSIs because the request exceeds the hardware maximum, which the card reports as 0.

To quickly check whether this is a driver problem, the test was repeated with dpdk-19.11.
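When the same check has to be repeated on several machines or DPDK versions, the two numbers in that message can be pulled out of the log mechanically. The following is only a minimal sketch: parse_vsi_failure is a hypothetical helper, not part of DPDK, and it assumes the message text is exactly the one shown in the logs in this article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a DPDK API): extracts the requested VSI count and
 * the reported hardware maximum from an i40e "Failed to allocate ... VSIs"
 * log line. Returns 1 if the line matches, 0 otherwise. */
int parse_vsi_failure(const char *line, unsigned *requested, unsigned *hw_max)
{
    const char *p;

    assert(line != NULL && requested != NULL && hw_max != NULL);

    /* Search for the message body so that lines with and without the
     * "PMD: " prefix are both accepted. */
    p = strstr(line, "Failed to allocate ");
    if (p == NULL)
        return 0;

    /* Both numbers must be converted for the line to count as a match. */
    return sscanf(p,
                  "Failed to allocate %u VSIs, which exceeds the hardware maximum %u",
                  requested, hw_max) == 2;
}
```

Feeding it the failing line from the log above should yield a requested count of 2 and a hardware maximum of 0. Other lines, such as the "Failed to stop lldp" message, are rejected.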

2. dpdk-19.11 l2fwd test

Relevant error output:

i40e_pf_parameter_init(): Failed to allocate 2 VSIs, which exceeds the hardware maximum 0
eth_i40e_dev_init(): Failed to do parameter init: -22
EAL: ethdev initialisation failedEAL: Requested device 0000:16:00.2 cannot be used
EAL: PCI device 0000:16:00.3 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:1581 net_i40e
i40e_pf_parameter_init(): Failed to allocate 2 VSIs, which exceeds the hardware maximum 0
eth_i40e_dev_init(): Failed to do parameter init: -22
EAL: ethdev initialisation failedEAL: Requested device 0000:16:00.3 cannot be used

The error is the same with both DPDK versions, so a driver problem can largely be ruled out.

3. Relevant code analysis

The code in the i40e_pf_parameter_init function that performs this check is as follows:

if (qp_count > hw->func_caps.num_tx_qp) {
	PMD_DRV_LOG(ERR, "Failed to allocate %u queues, which exceeds "
		    "the hardware maximum %u", qp_count,
		    hw->func_caps.num_tx_qp);
	return -EINVAL;
}
if (vsi_count > hw->func_caps.num_vsis) {
	PMD_DRV_LOG(ERR, "Failed to allocate %u VSIs, which exceeds "
		    "the hardware maximum %u", vsi_count,
		    hw->func_caps.num_vsis);
	return -EINVAL;
}

The limits used in this check (hw->func_caps) are read from the network card hardware. In addition, this driver has been running normally on many X710 network cards, so the problem is most likely in the hardware.

Questions to rule out

1. Is the software configuration of the four optical ports consistent?

l2fwd configures all interfaces in the same way, so the software configuration is consistent across the ports.

2. Is it an isolated failure of a single card?

Testing confirmed that multiple network cards show the same problem.

3. What are the main differences in the newly adapted hardware environment?

The main difference is the physical network card itself, not the driver: the driver version is the same. From a layered perspective, hardware problems should be investigated first.

Confirmed root cause

Further investigation confirmed that the firmware flashed by the manufacturer was abnormal, which caused the problem. After the firmware was flashed again, the problem was resolved.

Summary

When troubleshooting, the first priority is to establish a baseline. In this case, the baseline is the X710 driver version and the fact that this version runs normally on X710 network cards in multiple environments.

With a baseline in place, use a layered approach to find the layer where the significant difference lies, and investigate that layer first. Here, the items that changed point to the network card itself, so the card's hardware should be checked first and the driver details should not be examined too early. The problem was finally traced to the card (its firmware), which shows that the driver is working normally and supports the layered approach.

Keywords: dpdk

Added by bluedogatdingdong on Tue, 21 Dec 2021 10:52:12 +0200