Fault analysis of an intelligent network card (mellanox)

Background: This is reproduced in centos 7.6.1810. There are many smart network cards at present
The network card on the ECS is standard configuration. In OPPO, it is mainly used in vpc and other scenarios. The code of the intelligent network card will follow
The complexity has been increasing due to the enhancement of functions. The driver bug has always been the big head of the kernel bug. When encountering similar problems, the kernel developers are not familiar with the driver code, so it will be difficult to troubleshoot. The background knowledge involved itself is dma_pool,dma_page,net_device,mlx5_core_dev device, device uninstall, uaf problem, etc. in addition, this bug has not been solved visually in the latest linux baseline. This article lists it separately because the uaf problem is relatively unique.
Here's how we troubleshoot and solve this problem.

1, Fault phenomenon

The OPPO cloud kernel team received the connectivity alarm and found that the machine was reset:

UPTIME: 00:04:16-------------The running time is very short
LOAD AVERAGE: 0.25, 0.23, 0.11
TASKS: 2027
RELEASE: 3.10.0-1062.18.1.el7.x86_sixty-four
MEMORY: 127.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at           (null)"
PID: 23283
COMMAND: "spider-agent"
TASK: ffff9d1fbb090000  [THREAD_INFO: ffff9d1f9a0d8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
 #0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
 #1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
 #2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
 #3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
 #4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
 #5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
 #6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
 #7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
 #8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]//caq: exception address
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_ alloc_ cmd_ MSG at ffffc03e10e3 [mlx5_core] / / modules involved
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00000000004a5030  RSP: 000000c001099378  RFLAGS: 00000212
    RAX: 0000000000000000  RBX: 000000c000040000  RCX: ffffffffffffffff
    RDX: 000000000000000a  RSI: 000000c00109976e  RDI: 000000000000000d---read File fd number
    RBP: 000000c001099640   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000c
    R13: 0000000000000032  R14: 0000000000f710c4  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

From the stack, it is a process reading a file that triggers a kernel state null pointer reference.

2, Fault phenomenon analysis

From the stack information:

1. At that time, the process opens the file with fd number 13, which can be seen from the rdi value.

2,speed_show and__ ethtool_ get_ link_ Kssettings indicates the rate at which the network card is being read
Let's see which file is open,

crash> files 23283
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
ROOT: /rootfs    CWD: /rootfs/home/service/app/spider
 FD       FILE            DENTRY           INODE       TYPE PATH
....
  9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed---This is still there
 10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed---Note correspondence 0000:5e:00.0 corresponding p3p1
 11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed---This is still there
 13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed---Note correspondence 0000:5e:00.1 corresponding p3p2
....

Note the correspondence between the pci number and the network card name above, which will be used later.
Opening a file to read speed itself should be a very common process,
Let's start with exception RIP: dma_pool_alloc+427 further analyzes why NULL pointer dereference is triggered
Expand the specific stack as follows:

#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00 
    ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0 
    ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80 
    ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4 
    ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080 
    ffff9d1f9a0db968: 0000000000000246 0000000000001000 
    ffff9d1f9a0db978: 0000000000000000 0000000000000246 
    ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff 
    ffff9d1f9a0db998: ffffffffb680efab 0000000000000010 
    ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8 
    ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45 
    ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000 
    ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c 
    ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1 
    ffff9d1f9a0db9f8: 0000000000000000 0000000000000000 
    ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000 
    ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70 
    ffff9d1f9a0dba28: ffffffffc03e10e3 
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
    ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001 
    ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050 
    ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005 --r12 yes rdi ,ffff9d0fa3c800c0
    ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0 
    ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92 
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]

Take the corresponding mlx5 from the stack_ core_ Dev is ffff9d0fa3c800c0

crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
  [ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
  pool = 0xffff9d0fa45f4c00------This is dma_pool，Students who write driver code often encounter

The code line number of the problem is:

crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>:        mov    (%r15),%ecx
 And the corresponding r15，From the stack above, it is null. 
    305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
    306                      dma_addr_t *handle)
    307 {
...
    315         spin_lock_irqsave(&pool->lock, flags);
    316         list_for_each_entry(page, &pool->page_list, page_list) {
    317                 if (page->offset < pool->allocation)---//caq: currently satisfied
    318                         goto ready;//caq: jump to ready
    319         }
    320 
    321         /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
    322         spin_unlock_irqrestore(&pool->lock, flags);
    323 
    324         page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
    325         if (!page)
    326                 return NULL;
    327 
    328         spin_lock_irqsave(&pool->lock, flags);
    329 
    330         list_add(&page->page_list, &pool->page_list);
    331  ready:
    332         page->in_use++;//caq: indicates that it is being referenced
    333         offset = page->offset;//Use it from the last place
    334         page->offset = *(int *)(page->vaddr + offset);//caq: the line number of the problem
...
    }

From the above code, page - > vaddr is NULL and offset is 0. Page has two sources,

The first is the page from the pool_ List,

The second is from pool_alloc_page is a temporary application. Of course, after the application, it will be linked to the page in the pool_ list,

Let's take a look at this page_list.

crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
  page_list = {
    next = 0xffff9d0fa45f4c80, 
    prev = 0xffff9d0fa45f4c00
  }, 
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  }, 
  size = 0x400, 
  dev = 0xffff9d1fbddec098, 
  allocation = 0x1000, 
  boundary = 0x1000, 
  name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  pools = {
    next = 0xdead000000000100, 
    prev = 0xdead000000000200
  }
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0

Slave DMA_ pool_ From the code logic of alloc function, pool - > page_ The list is indeed not empty and satisfied
If (page - > offset < pool - > allocation), so the first page should be ffff9d0fa45f4c80
That is, from the first case:

crash> dma_page ffff9d0fa45f4c80
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f4d00, 
    prev = 0xffff9d0fa45f4c80
  }, 
  vaddr = 0x0, //caq: this exception will cause crash if it is referenced
  dma = 0, 
  in_use = 1, //caq: this flag is in use and conforms to page - > in_ use++;
  offset = 0
}

This is the end of the problem analysis, because DMA_ vaddr will be initialized after the page in the pool is applied,
Usually in the pool_ alloc_ How can it be NULL when initializing in page?
Then check this address:

crash> kmem ffff9d0fa45f4c80-------This is dma_pool Medium page
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128//caq: note the length 128 8963 14976 234 8K
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4c80  

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

Because similar dma functions have been used before, I am impressed with dma_page is not so big. Let's look at the second dma_ The page is as follows:

crash> kmem ffff9d0fa45f4d00
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128              128       8963     14976    234     8k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4d00  

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

crash> dma_page ffff9d0fa45f4d00
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f5000, 
    prev = 0xffff9d0fa45f4d00
  }, 
  vaddr = 0x0, -----------caq: Also null
  dma = 0, 
  in_use = 0, 
  offset = 0
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0
ffff9d0fa45f5000
  offset = 0
  vaddr = 0x0
.........

It seems that it is not only the first DMA_ There is a problem with the page. All DMAS in the pool_ All units are the same,
Check DMA directly_ Normal size of page:

crash> p sizeof(struct dma_page)
$3 = 40

According to the principle, the length is only 40 bytes. Even if you apply for slab, it should be expanded to 64 bytes. How can it be like the DMA above_ Is page 128 bytes? To solve this puzzle, find a normal node and compare it with other nodes:

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff8f9e800be000  lo     127.0.0.1
ffff8f9e62640000  p1p1   
ffff8f9e626c0000  p1p2   
ffff8f9e627c0000  p3p1   -----//caq: take this as an example
ffff8f9e62100000  p3p2   

Then according to the code: net_device see mlx5e_priv: 

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                    struct ethtool_link_ksettings *link_ksettings)
{
...
    struct mlx5e_priv *priv    = netdev_priv(netdev);
...
}

static inline void *netdev_priv(const struct net_device *dev)
{
	return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}

crash> px sizeof(struct net_device)
$2 = 0x8c0

crash> mlx5e_priv.mdev ffff8f9e627c08c0---Based on offset calculation
  mdev = 0xffff8f9e67c400c0

crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
  [ffff8f9e67c40138] struct mlx5_cmd cmd;
}

crash> mlx5_cmd.pool ffff8f9e67c40138
  pool = 0xffff8f9e7bf48f80

crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
  page_list = {
    next = 0xffff8f9e79c60880, //caq: one of the dma_page
    prev = 0xffff8fae6e4db800
  }, 
.......
  size = 1024, 
  dev = 0xffff8f9e800b3098, 
  allocation = 4096, 
  boundary = 4096, 
  name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234>\250\217\217\377\377", 
  pools = {
    next = 0xffff8f9e800b3290, 
    prev = 0xffff8f9e800b3290
  }
}
crash> dma_page 0xffff8f9e79c60880     //caq: check this dma_page
struct dma_page {
  page_list = {
    next = 0xffff8f9e79c60840, -------One of them dma_page
    prev = 0xffff8f9e7bf48f80
  }, 
  vaddr = 0xffff8f9e6fc9b000, //caq: normal vaddr cannot be NULL
  dma = 69521223680, 
  in_use = 0, 
  offset = 0
}

crash> kmem 0xffff8f9e79c60880
CACHE            NAME             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff8f8fbfc07b00 kmalloc-64--Normal length    64     six hundred and sixty-seven thousand nine hundred and twenty-one    745024  11641  4 k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffde5140e71800  ffff8f9e79c60000     0     64         64     0
  FREE / [ALLOCATED]
  [ffff8f9e79c60880]

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffde5140e71800 1039c60000                0        0  1 2fffff00000080 slab

The above operation requirements are correct_ Device and mlx5 related driver code are familiar.
Compared with abnormal dma_page, normal dma_page is a 64 byte slab, so obviously,
Either this is a memory problem or a uaf(used after free) problem.
General questions: how can we quickly judge which type it is? Because these two problems involve memory disorder, they are generally difficult to check. At this time, we need to jump out. Let's take a look at other running processes and find a process as follows:

crash> bt 48263
PID: 48263  TASK: ffff9d0f4ee0a0e0  CPU: 56  COMMAND: "reboot"
 #0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
 #1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
 #2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
 #3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
 #7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
 #8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
 #9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00007fc9be7a5226  RSP: 00007ffd9a19e448  RFLAGS: 00010246
    RAX: 00000000000000a9  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 0000000001234567  RSI: 0000000028121969  RDI: fffffffffee1dead
    RBP: 0000000000000002   R8: 00005575d529558c   R9: 0000000000000000
    R10: 00007fc9bea767b8  R11: 0000000000000206  R12: 0000000000000000
    R13: 00007ffd9a19e690  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

Why do you pay attention to this process? Over the years, uaf problems caused by uninstalling modules have been checked no less than 20 times, sometimes reboots, sometimes unload s, and sometimes releases resources in work. Therefore, intuitively, I think it has a lot to do with this uninstallation. Let's analyze the operation in the reboot process.

2141 void device_shutdown(void)
   2142 {
   2143         struct device *dev, *parent;
   2144 
   2145         spin_lock(&devices_kset->list_lock);
   2146         /*
   2147          * Walk the devices list backward, shutting down each in turn.
   2148          * Beware that device unplug events may also start pulling
   2149          * devices offline, even as the system is shutting down.
   2150          */
   2151         while (!list_empty(&devices_kset->list)) {
   2152                 dev = list_entry(devices_kset->list.prev, struct device,
   2153                                 kobj.entry);
........
   2178                 if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
   2179                         if (initcall_debug)
   2180                                 dev_info(dev, "shutdown_pre\n");
   2181                         dev->device_rh->class_shutdown_pre(dev);
   2182                 }
   2183                 if (dev->bus && dev->bus->shutdown) {
   2184                         if (initcall_debug)
   2185                                 dev_info(dev, "shutdown\n");
   2186                         dev->bus->shutdown(dev);
   2187                 } else if (dev->driver && dev->driver->shutdown) {
   2188                         if (initcall_debug)
   2189                                 dev_info(dev, "shutdown\n");
   2190                         dev->driver->shutdown(dev);
   2191                 }
   }

From the above code, we can see the following two points:

1. Kobj of each device The entry member is concatenated in devices_ Kset - > list.

2. The shutdown process of each device is from device_shutdown is serial.

From the stack of reboot, the process of unloading an mlx device includes the following:

pci_device_shutdown–>shutdown–>mlx5_unload_one–>mlx5_detach_device
–>mlx5_cmd_cleanup–>dma_pool_destroy

mlx5_ detach_ The process branch of device is:

void dma_pool_destroy(struct dma_pool *pool)
{
.......
        while (!list_empty(&pool->page_list)) {//caq: DMA in pool_ Page delete one by one
                struct dma_page *page;
                page = list_entry(pool->page_list.next,
                                  struct dma_page, page_list);
                if (is_page_busy(page)) {
.......
                        list_del(&page->page_list);
                        kfree(page);
                } else
                        pool_free_page(pool, page);//Per dma_page to release
        }

        kfree(pool);//caq: release pool
.......        
}

static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
        dma_addr_t dma = page->dma;

#ifdef  DMAPOOL_DEBUG
        memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
        dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
        list_del(&page->page_list);//caq: after release, the page_list member poisoning
        kfree(page);
}

From the stack of reboot, view the corresponding information

 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
    ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18 
    ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200 
    ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138 
    ffff9d0f95d7fb40: 0000000000000000 0000001002020000 
    ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8 
    ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000 
    ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0 ----mlx5_core_dev Equipment, corresponding to p3p1，
    ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8 
    ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8 
    ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b 
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
    ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0 
    ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00 
    ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d 
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]

Note that reboot is releasing mlx5_core_dev is ffff9d0fa74800c0, which corresponds to net_device is:
p3p1, while the 23283 process is accessing mlx5_core_dev is ffff9d0fa3c800c0, corresponding to p3p2.

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff9d0fc003e000  lo     127.0.0.1
ffff9d1fad200000  p1p1   
ffff9d0fa0700000  p1p2   
ffff9d0fa00c0000  p3p1  Corresponding mlx5_core_dev yes ffff9d0fa74800c0
ffff9d0fa0200000  p3p2  Corresponding mlx5_core_dev yes ffff9d0fa3c800c0

Let's take a look at what's still left_ device in kset:

crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
  next = 0xffffffffb72f2a38, 
  prev = 0xffff9d0fbe0ea130
}

crash> list -H -o 0x18 0xffffffffb72f2a38 -s device.kobj.name >device.list

We found p3p1 And p3p2 None device.list In,

[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.0 device.list //caq: not found. This is p3p1. The current reboot process is unloading.
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.1 device.list //caq: not found. This is p3p2, which has been uninstalled
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.0  device.list //caq: this mlx5 device has not been unload ed yet
  kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0",
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.1 device.list //caq: this mlx5 device has not been unload ed yet
  kobj.name = 0xffff9d1fbe82aae0 "0000:3b:00.1",

Because p3p2 and p3p1 are not in device List, and according to PCI_ device_ In the serial uninstall process of shutdown, p3p1 is currently being unloaded, so it is certain that the 23283 process accesses the unloaded cmd_pool, according to the unloading process described above:
pci_device_shutdown–>shutdown–>mlx5_unload_one–>mlx5_cmd_cleanup–>dma_pool_destroy
At this time, the pool has been released, and DMA in the pool_ All pages are invalid.

Then try the bug corresponding to google, and see that RedHat has encountered a similar problem, which is very similar to the current phenomenon: https://access.redhat.com/solutions/5132931

However, in this link, red hat thinks that the problem of uaf has been solved, and the combined patch is:

commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit <parav@mellanox.com>
Date:   Thu Dec 12 13:30:21 2019 +0200

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
                      const struct mlx5_ib_profile *profile,
                      int stage)
 {
+       dev->ib_active = false;
+
        /* Number of stages to cleanup */
        while (stage) {
                stage--;

Knock on the blackboard three times:
This integration cannot solve the corresponding bug, such as the following Concurrency:
Let's use a simple diagram to show concurrent processing:

    CPU1                                                            CPU2
                                                                   dev_attr_show
    pci_device_shutdown                                            speed_show
      shutdown                          
        mlx5_unload_one
          mlx5_detach_device
            mlx5_detach_interface
              mlx5e_detach
               mlx5e_detach_netdev
                 mlx5e_nic_disable
                   rtnl_lock
                     mlx5e_close_locked 
                     clear_bit(MLX5E_STATE_OPENED, &priv->state);---Only this was cleaned bit
                   rtnl_unlock                   
                                                  rtnl_trylock---After locking successfully
                                                  netif_running Just judgment net_device.state Lowest order of
                                                    __ethtool_get_link_ksettings
                                                    mlx5e_get_link_ksettings
                                                      mlx5_query_port_ptys()
                                                      mlx5_core_access_reg()
                                                      mlx5_cmd_exec
                                                      cmd_exec
                                                      mlx5_alloc_cmd_msg
          mlx5_cmd_cleanup---clear dma_pool                                                       
                                                      dma_pool_alloc---visit cmd.pool,trigger crash

So if you want to really solve this problem, you need netif_ device_ Cleaning in detach__ LINK_ STATE_ Bit of start, or in speed_ Judge in show__ LINK_STATE_PRESENT bit? If you consider the scope of influence and do not want to move the public process, you should
In mlx5e_ get_ link_ Judge from ksettings__ LINK_STATE_PRESENT.
This is left to the students who like to deal with the community to improve it.

static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
.......
	rtnl_lock();
	if (netif_running(priv->netdev))
		mlx5e_close(priv->netdev);
	netif_device_detach(priv->netdev);
  //caq: add cleaning__ LINK_STATE_PRESENT bit 
	rtnl_unlock();
.......

3, Fault recurrence

1. The competition problem can create a competition scenario similar to cpu1 and cpu2 in the figure above.

4, Fault avoidance or resolution

Possible solutions are:

1. Don't follow red hat https://access.redhat.com/solutions/5132931 Upgrade like that.

2. Patch separately.

Introduction to the author

Anqing

At present, I am responsible for linux kernel, container, virtual machine and other virtualization in OPPO hybrid cloud

For more exciting content, please scan code to focus on [OPPO technology] official account.

Keywords: Back-end

Added by premiso on Fri, 14 Jan 2022 03:33:46 +0200

Programming VIP