Background: the problem described here was reproduced on CentOS 7.6.1810. Smart NICs are now standard equipment on ECS hosts; at OPPO they are used mainly in VPC and similar scenarios. As their feature set grows, smart-NIC driver code keeps getting more complex, and driver bugs have always made up the bulk of kernel bugs. When a problem like this one hits, kernel developers who are not familiar with the driver code have a hard time troubleshooting it. The background knowledge involved includes dma_pool, dma_page, net_device, the mlx5_core_dev device, device teardown, and use-after-free (UAF) problems. In addition, this bug does not appear to be fixed in the latest Linux baseline. Because the UAF here is fairly unusual, we describe it in its own article.
Here is how we troubleshot and resolved the problem.
1, Fault phenomenon
The OPPO cloud kernel team received a connectivity alarm and found that the machine had reset:
      UPTIME: 00:04:16             ---the machine had only been up a few minutes
LOAD AVERAGE: 0.25, 0.23, 0.11
       TASKS: 2027
     RELEASE: 3.10.0-1062.18.1.el7.x86_64
      MEMORY: 127.6 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
         PID: 23283
     COMMAND: "spider-agent"
        TASK: ffff9d1fbb090000  [THREAD_INFO: ffff9d1f9a0d8000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0  COMMAND: "spider-agent"
 #0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
 #1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
 #2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
 #3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
 #4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
 #5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
 #6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
 #7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
 #8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]    //caq: the faulting address
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]    //caq: the module involved
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00000000004a5030  RSP: 000000c001099378  RFLAGS: 00000212
    RAX: 0000000000000000  RBX: 000000c000040000  RCX: ffffffffffffffff
    RDX: 000000000000000a  RSI: 000000c00109976e  RDI: 000000000000000d    ---fd number of the file being read
    RBP: 000000c001099640   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000c
    R13: 0000000000000032  R14: 0000000000f710c4  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
From the stack, a process reading a file triggered a NULL pointer dereference in kernel mode.
2, Fault phenomenon analysis
From the stack information:
1. The process was reading the file with fd 13, as seen from the RDI value in the user-space registers.
2. speed_show and __ethtool_get_link_ksettings indicate that the NIC's link speed was being read (the kernel-side read path is sketched below).
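For reference, the sysfs read path in kernels of this era looks roughly like this (paraphrased from net/core/net-sysfs.c; details may differ slightly in the RHEL source). The two checks it makes before calling into the driver matter later:

static ssize_t speed_show(struct device *dev,
                          struct device_attribute *attr, char *buf)
{
        struct net_device *netdev = to_net_dev(dev);
        int ret = -EINVAL;

        if (!rtnl_trylock())
                return restart_syscall();

        /* only netif_running() is checked, i.e. the __LINK_STATE_START bit */
        if (netif_running(netdev)) {
                struct ethtool_link_ksettings cmd;

                if (!__ethtool_get_link_ksettings(netdev, &cmd))
                        ret = sprintf(buf, "%d\n", cmd.base.speed);
        }
        rtnl_unlock();
        return ret;
}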
Let's see which files the process has open:
crash> files 23283
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0  COMMAND: "spider-agent"
ROOT: /rootfs    CWD: /rootfs/home/service/app/spider
 FD       FILE            DENTRY           INODE       TYPE PATH
....
  9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed    ---still present
 10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed    ---note: 0000:5e:00.0 corresponds to p3p1
 11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed    ---still present
 13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed    ---note: 0000:5e:00.1 corresponds to p3p2
....
Note the correspondence between the PCI addresses and the NIC names above; we will use it later.
Opening and reading a speed file is itself a very common operation; a user-space equivalent of what the agent was doing is sketched below.
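A minimal C version of that read (the interface name here is just an example; the monitoring agent would iterate over all interfaces):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[32];
        ssize_t n;
        /* the same kind of sysfs file that fd 13 pointed at in the vmcore */
        int fd = open("/sys/class/net/p3p2/speed", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
                buf[n] = '\0';
                printf("speed: %s", buf);
        }
        close(fd);
        return 0;
}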
Let's start from the exception RIP, dma_pool_alloc+427, and analyze why the NULL pointer dereference was triggered.
The expanded stack is as follows:
 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00
    ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0
    ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80
    ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4
    ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080
    ffff9d1f9a0db968: 0000000000000246 0000000000001000
    ffff9d1f9a0db978: 0000000000000000 0000000000000246
    ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff
    ffff9d1f9a0db998: ffffffffb680efab 0000000000000010
    ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8
    ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45
    ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000
    ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c
    ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1
    ffff9d1f9a0db9f8: 0000000000000000 0000000000000000
    ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000
    ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70
    ffff9d1f9a0dba28: ffffffffc03e10e3
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
    ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001
    ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050
    ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005    ---the saved r12 here holds the rdi (dev) value, ffff9d0fa3c800c0
    ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0
    ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
From the stack, the mlx5_core_dev involved is ffff9d0fa3c800c0:
crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
  [ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
  pool = 0xffff9d0fa45f4c00    ---this is the dma_pool; anyone who has written driver code will have run into it
The source line at the faulting address is:
crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>: mov (%r15),%ecx    ---and r15, as the register dump above shows, is NULL

305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
306                      dma_addr_t *handle)
307 {
...
315     spin_lock_irqsave(&pool->lock, flags);
316     list_for_each_entry(page, &pool->page_list, page_list) {
317         if (page->offset < pool->allocation)    //caq: this condition is satisfied here
318             goto ready;                         //caq: jump to ready
319     }
320
321     /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
322     spin_unlock_irqrestore(&pool->lock, flags);
323
324     page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
325     if (!page)
326         return NULL;
327
328     spin_lock_irqsave(&pool->lock, flags);
329
330     list_add(&page->page_list, &pool->page_list);
331  ready:
332     page->in_use++;                                 //caq: mark the page as referenced
333     offset = page->offset;                          //caq: continue from where we left off last time
334     page->offset = *(int *)(page->vaddr + offset);  //caq: the faulting line
...
}
From the code above, page->vaddr is NULL and offset is 0. The page can come from two places:
The first is an existing page already on pool->page_list.
The second is a page freshly allocated by pool_alloc_page(); after allocation it is also linked into pool->page_list. The in-page free list that line 334 walks is sketched below.
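dma_pool keeps its free list inside the DMA page itself: the first bytes of each free block store the offset of the next free block, which is exactly what line 334 reads. A minimal userspace model of that technique (purely illustrative; fake_dma_page and the constants are made up, not kernel code):

#include <stdio.h>
#include <stdlib.h>

#define ALLOCATION 4096   /* like pool->allocation */
#define BLOCK_SIZE 1024   /* like pool->size */

struct fake_dma_page {
        void *vaddr;
        unsigned int offset;   /* offset of the next free block */
};

/* mimics pool_initialise_page(): chain the free blocks through the page itself */
static void init_page(struct fake_dma_page *page)
{
        unsigned int off, next;

        page->vaddr = calloc(1, ALLOCATION);
        page->offset = 0;
        for (off = 0; off < ALLOCATION; off = next) {
                next = off + BLOCK_SIZE;
                *(int *)((char *)page->vaddr + off) = next;
        }
}

/* mimics line 334: read the next-free offset from inside the page;
 * if vaddr were NULL, this would be the same NULL + 0 dereference
 * seen in the vmcore */
static void *alloc_block(struct fake_dma_page *page)
{
        unsigned int offset = page->offset;

        page->offset = *(int *)((char *)page->vaddr + offset);
        return (char *)page->vaddr + offset;
}

int main(void)
{
        struct fake_dma_page page;

        init_page(&page);
        printf("first block at %p, next free offset %u\n",
               alloc_block(&page), page.offset);
        free(page.vaddr);
        return 0;
}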
Let's take a look at this page_list.
crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
  page_list = {
    next = 0xffff9d0fa45f4c80,
    prev = 0xffff9d0fa45f4c00
  },
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  },
  size = 0x400,
  dev = 0xffff9d1fbddec098,
  allocation = 0x1000,
  boundary = 0x1000,
  name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  pools = {
    next = 0xdead000000000100,
    prev = 0xdead000000000200
  }
}
crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0
From the logic of dma_pool_alloc(), pool->page_list is indeed non-empty, and the first page satisfies
if (page->offset < pool->allocation), so the page being used should be the first one, ffff9d0fa45f4c80.
In other words, we are in the first case:
crash> dma_page ffff9d0fa45f4c80
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f4d00,
    prev = 0xffff9d0fa45f4c80
  },
  vaddr = 0x0,    //caq: abnormal -- dereferencing this is what crashes
  dma = 0,
  in_use = 1,     //caq: marked in use, consistent with page->in_use++
  offset = 0
}
Here the analysis seems to hit a dead end: a dma_page's vaddr is initialized as soon as the page is allocated,
normally inside pool_alloc_page(). How could it possibly be NULL? (See the paraphrased allocation path below.)
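For reference, pool_alloc_page() in mm/dmapool.c of this era looks roughly like this (paraphrased; check the exact RHEL source before relying on details). A page only ever reaches page_list with a valid vaddr:

static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
{
        struct dma_page *page;

        page = kmalloc(sizeof(*page), mem_flags);
        if (!page)
                return NULL;
        /* vaddr is set right here, from dma_alloc_coherent() */
        page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
                                         &page->dma, mem_flags);
        if (page->vaddr) {
                pool_initialise_page(pool, page);   /* builds the in-page free list */
                page->in_use = 0;
                page->offset = 0;
        } else {
                kfree(page);
                page = NULL;
        }
        return page;
}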
Then check this address:
crash> kmem ffff9d0fa45f4c80    ---a dma_page belonging to the dma_pool
CACHE            NAME          OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128       128       8963  14976    234     8K    //caq: note the object size, 128
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4c80
      PAGE          PHYSICAL    MAPPING           INDEX  CNT  FLAGS
ffffe299c0917d00  10245f4000          0  ffff9d0fa45f4c00  1  2fffff00004080 slab,head
Having used these DMA APIs before, I remembered that a dma_page is nowhere near that big. Let's look at the second dma_page:
crash> kmem ffff9d0fa45f4d00
CACHE            NAME          OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128       128       8963  14976    234     8k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4d00
      PAGE          PHYSICAL    MAPPING           INDEX  CNT  FLAGS
ffffe299c0917d00  10245f4000          0  ffff9d0fa45f4c00  1  2fffff00004080 slab,head

crash> dma_page ffff9d0fa45f4d00
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f5000,
    prev = 0xffff9d0fa45f4d00
  },
  vaddr = 0x0,    ---caq: also NULL
  dma = 0,
  in_use = 0,
  offset = 0
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0
ffff9d0fa45f5000
  offset = 0
  vaddr = 0x0
.........
So it is not only the first dma_page that is bad; every dma_page in the pool looks the same.
Let's check the normal size of a dma_page directly:
crash> p sizeof(struct dma_page)
$3 = 40
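As a sanity check on that number, the 3.10-era struct dma_page can be replicated in user space and measured (the layout below mirrors mm/dmapool.c of that kernel; treat it as an assumption):

#include <stdio.h>
#include <stdint.h>

/* userspace replica of struct dma_page from mm/dmapool.c;
 * list_head is two pointers, dma_addr_t is 8 bytes on x86_64 */
struct list_head { struct list_head *next, *prev; };
typedef uint64_t dma_addr_t;

struct dma_page {
        struct list_head page_list;   /* 16 bytes */
        void *vaddr;                  /*  8 bytes */
        dma_addr_t dma;               /*  8 bytes */
        unsigned int in_use;          /*  4 bytes */
        unsigned int offset;          /*  4 bytes */
};

int main(void)
{
        /* 40 bytes: kmalloc would round this up to the 64-byte cache,
         * so finding these objects in kmalloc-128 is suspicious */
        printf("sizeof(struct dma_page) = %zu\n", sizeof(struct dma_page));
        return 0;
}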
The structure is only 40 bytes; even after slab rounding it should land in the 64-byte cache. How did the dma_page objects above end up in kmalloc-128? To resolve this puzzle, let's find a healthy node and compare:
crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff8f9e800be000  lo     127.0.0.1
ffff8f9e62640000  p1p1
ffff8f9e626c0000  p1p2
ffff8f9e627c0000  p3p1    ---//caq: take this one as an example
ffff8f9e62100000  p3p2

Then, following the code, go from the net_device to the mlx5e_priv:

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                                    struct ethtool_link_ksettings *link_ksettings)
{
...
    struct mlx5e_priv *priv = netdev_priv(netdev);
...
}

static inline void *netdev_priv(const struct net_device *dev)
{
    return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}

crash> px sizeof(struct net_device)
$2 = 0x8c0

crash> mlx5e_priv.mdev ffff8f9e627c08c0    ---address computed from the offset above
  mdev = 0xffff8f9e67c400c0

crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
  [ffff8f9e67c40138] struct mlx5_cmd cmd;
}

crash> mlx5_cmd.pool ffff8f9e67c40138
  pool = 0xffff8f9e7bf48f80

crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
  page_list = {
    next = 0xffff8f9e79c60880,    //caq: one of the dma_pages
    prev = 0xffff8fae6e4db800
  },
.......
  size = 1024,
  dev = 0xffff8f9e800b3098,
  allocation = 4096,
  boundary = 4096,
  name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234>\250\217\217\377\377",
  pools = {
    next = 0xffff8f9e800b3290,
    prev = 0xffff8f9e800b3290
  }
}

crash> dma_page 0xffff8f9e79c60880    //caq: examine this dma_page
struct dma_page {
  page_list = {
    next = 0xffff8f9e79c60840,    ---another dma_page
    prev = 0xffff8f9e7bf48f80
  },
  vaddr = 0xffff8f9e6fc9b000,    //caq: a normal vaddr is never NULL
  dma = 69521223680,
  in_use = 0,
  offset = 0
}

crash> kmem 0xffff8f9e79c60880
CACHE            NAME         OBJSIZE  ALLOCATED   TOTAL  SLABS  SSIZE
ffff8f8fbfc07b00 kmalloc-64        64     667921  745024  11641     4k    ---normal object size: 64
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffde5140e71800  ffff8f9e79c60000     0     64         64     0
  FREE / [ALLOCATED]
  [ffff8f9e79c60880]
      PAGE          PHYSICAL    MAPPING  INDEX  CNT  FLAGS
ffffde5140e71800  1039c60000          0      0    1  2fffff00000080 slab
The steps above assume familiarity with net_device and the mlx5 driver code; the netdev_priv() arithmetic, at least, can be checked by hand with the sketch below.
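A tiny userspace sketch of that arithmetic (the constants are simply the values read out of this vmcore):

#include <stdio.h>
#include <stdint.h>

#define NETDEV_ALIGN 32
#define ALIGN(x, a) (((x) + (a) - 1) & ~((uintptr_t)(a) - 1))

int main(void)
{
        uintptr_t netdev = 0xffff8f9e627c0000UL;  /* p3p1's net_device from "crash> net" */
        uintptr_t netdev_size = 0x8c0;            /* px sizeof(struct net_device) on this kernel */

        /* same arithmetic as netdev_priv(): priv sits right after the aligned net_device */
        printf("mlx5e_priv at %#lx\n",
               (unsigned long)(netdev + ALIGN(netdev_size, NETDEV_ALIGN)));
        return 0;
}

Running it prints 0xffff8f9e627c08c0, matching the address fed to "crash> mlx5e_priv.mdev" above.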
Unlike the abnormal dma_page, a normal dma_page is a 64-byte slab object. So clearly,
either this is memory corruption, or it is a UAF (use-after-free).
A common question: how do we quickly judge which of the two it is? Both involve memory being trampled, and both are usually hard to pin down. At this point it helps to step back and look at the other running processes, and one of them stands out:
crash> bt 48263
PID: 48263  TASK: ffff9d0f4ee0a0e0  CPU: 56  COMMAND: "reboot"
 #0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
 #1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
 #2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
 #3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
 #7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
 #8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
 #9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00007fc9be7a5226  RSP: 00007ffd9a19e448  RFLAGS: 00010246
    RAX: 00000000000000a9  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 0000000001234567  RSI: 0000000028121969  RDI: fffffffffee1dead
    RBP: 0000000000000002   R8: 00005575d529558c   R9: 0000000000000000
    R10: 00007fc9bea767b8  R11: 0000000000000206  R12: 0000000000000000
    R13: 00007ffd9a19e690  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b
Why does this process matter? Over the years I have debugged no fewer than 20 UAF problems caused by device or module teardown: sometimes a reboot, sometimes a module unload, sometimes resources being released from a work item. Intuition says this teardown is closely related, so let's analyze what the reboot process is doing.
2141 void device_shutdown(void)
2142 {
2143     struct device *dev, *parent;
2144
2145     spin_lock(&devices_kset->list_lock);
2146     /*
2147      * Walk the devices list backward, shutting down each in turn.
2148      * Beware that device unplug events may also start pulling
2149      * devices offline, even as the system is shutting down.
2150      */
2151     while (!list_empty(&devices_kset->list)) {
2152         dev = list_entry(devices_kset->list.prev, struct device,
2153                          kobj.entry);
........
2178         if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
2179             if (initcall_debug)
2180                 dev_info(dev, "shutdown_pre\n");
2181             dev->device_rh->class_shutdown_pre(dev);
2182         }
2183         if (dev->bus && dev->bus->shutdown) {
2184             if (initcall_debug)
2185                 dev_info(dev, "shutdown\n");
2186             dev->bus->shutdown(dev);
2187         } else if (dev->driver && dev->driver->shutdown) {
2188             if (initcall_debug)
2189                 dev_info(dev, "shutdown\n");
2190             dev->driver->shutdown(dev);
2191         }
........
}
From this code we can see two things:
1. Each device's kobj.entry member is linked into devices_kset->list.
2. Devices are shut down serially by device_shutdown().
From the reboot stack, unloading an mlx device includes the following flow:

pci_device_shutdown --> shutdown --> mlx5_unload_one --> mlx5_detach_device
                                                     --> mlx5_cmd_cleanup --> dma_pool_destroy

On this path, the dma_pool is torn down as follows:
void dma_pool_destroy(struct dma_pool *pool)
{
    .......
    while (!list_empty(&pool->page_list)) {    //caq: remove the dma_pages in the pool one by one
        struct dma_page *page;
        page = list_entry(pool->page_list.next,
                          struct dma_page, page_list);
        if (is_page_busy(page)) {
            .......
            list_del(&page->page_list);
            kfree(page);
        } else
            pool_free_page(pool, page);    //caq: free each dma_page
    }
    kfree(pool);    //caq: free the pool itself
    .......
}

static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
    dma_addr_t dma = page->dma;
#ifdef DMAPOOL_DEBUG
    memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
    dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
    list_del(&page->page_list);    //caq: list_del poisons the page_list member, then the page is freed
    kfree(page);
}
Now look at the relevant data in the reboot stack:
 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
    ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18
    ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200
    ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138
    ffff9d0f95d7fb40: 0000000000000000 0000001002020000
    ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8
    ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000
    ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0    ---the mlx5_core_dev being torn down; it corresponds to p3p1
    ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8
    ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8
    ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
    ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0
    ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00
    ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
Note that the reboot process is tearing down the mlx5_core_dev at ffff9d0fa74800c0, whose net_device is p3p1, while process 23283 is accessing the mlx5_core_dev at ffff9d0fa3c800c0, which corresponds to p3p2.
crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff9d0fc003e000  lo     127.0.0.1
ffff9d1fad200000  p1p1
ffff9d0fa0700000  p1p2
ffff9d0fa00c0000  p3p1    ---its mlx5_core_dev is ffff9d0fa74800c0
ffff9d0fa0200000  p3p2    ---its mlx5_core_dev is ffff9d0fa3c800c0
Let's see which devices are still left in devices_kset:
crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
  next = 0xffffffffb72f2a38,
  prev = 0xffff9d0fbe0ea130
}
crash> list -H -o 0x18 0xffffffffb72f2a38 -s device.kobj.name > device.list

Neither p3p1 nor p3p2 is in device.list:

[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.0 device.list
    //caq: not found -- this is p3p1, which the reboot process is currently unloading
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.1 device.list
    //caq: not found -- this is p3p2, which has already been unloaded
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.0 device.list
  kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0",    //caq: this mlx5 device has not been unloaded yet
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.1 device.list
  kobj.name = 0xffff9d1fbe82aae0 "0000:3b:00.1",    //caq: this mlx5 device has not been unloaded yet
Since neither p3p2 nor p3p1 is in device.list, and pci_device_shutdown unloads devices serially with p3p1 the one currently being unloaded, p3p2 must already be gone. So process 23283 was accessing a cmd pool that had already been destroyed along the teardown path described above:

pci_device_shutdown --> shutdown --> mlx5_unload_one --> mlx5_cmd_cleanup --> dma_pool_destroy

By that point the pool had been freed and every dma_page in it was invalid.
Googling for the bug, we found that Red Hat had hit a very similar problem: https://access.redhat.com/solutions/5132931
In that article, however, Red Hat considers the UAF solved, and the merged patch is:
commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit <parav@mellanox.com>
Date:   Thu Dec 12 13:30:21 2019 +0200

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
                       const struct mlx5_ib_profile *profile,
                       int stage)
 {
+       dev->ib_active = false;
+
        /* Number of stages to cleanup */
        while (stage) {
                stage--;
Knock on the blackboard three times: this merged patch does not fix our bug. It only flips dev->ib_active in the mlx5_ib removal path and does nothing for the ethtool path through mlx5_core, such as the following concurrency.
A simple diagram of the concurrent flows:
CPU1                                              CPU2
dev_attr_show                                     pci_device_shutdown
  speed_show                                        shutdown
                                                      mlx5_unload_one
                                                        mlx5_detach_device
                                                          mlx5_detach_interface
                                                            mlx5e_detach
                                                              mlx5e_detach_netdev
                                                                mlx5e_nic_disable
                                                                  rtnl_lock
                                                                  mlx5e_close_locked
                                                                    clear_bit(MLX5E_STATE_OPENED, &priv->state)
                                                                      ---only this bit gets cleared
                                                                  rtnl_unlock
    rtnl_trylock    ---succeeds
    netif_running   ---only checks the lowest bit of net_device.state
    __ethtool_get_link_ksettings
      mlx5e_get_link_ksettings
        mlx5_query_port_ptys()
          mlx5_core_access_reg()
            mlx5_cmd_exec
              cmd_exec
                mlx5_alloc_cmd_msg
                                                      mlx5_cmd_cleanup    ---destroys the dma_pool
                  dma_pool_alloc    ---accesses cmd.pool, triggers the crash
So a real fix needs either netif_device_detach() to clear the __LINK_STATE_START bit, or speed_show() to check the __LINK_STATE_PRESENT bit. If you want to limit the blast radius and avoid touching common code, the __LINK_STATE_PRESENT check belongs in mlx5e_get_link_ksettings().
We leave pushing this upstream to those who enjoy working with the community.
static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
    .......
    rtnl_lock();
    if (netif_running(priv->netdev))
        mlx5e_close(priv->netdev);
    netif_device_detach(priv->netdev);    //caq: added -- clears the __LINK_STATE_PRESENT bit
    rtnl_unlock();
    .......
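On the reading side, the driver-local check could look roughly like this (a sketch of the idea, not a tested patch; netif_device_present() tests the __LINK_STATE_PRESENT bit):

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                                    struct ethtool_link_ksettings *link_ksettings)
{
        struct mlx5e_priv *priv = netdev_priv(netdev);

        /* bail out if mlx5e_nic_disable() has already detached the device,
         * instead of walking into a freed cmd pool */
        if (!netif_device_present(netdev))
                return -ENODEV;
        ...
}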
3, Fault recurrence
1. Since this is a race, it can be reproduced by constructing a scenario like the CPU1/CPU2 interleaving shown in the diagram above.
4, Fault avoidance or resolution
Possible solutions are:
1. Do not simply upgrade as Red Hat suggests in https://access.redhat.com/solutions/5132931; that alone does not fix this bug.
2. Apply a separate patch of your own.
About the author
Anqing
Currently responsible for Linux kernel, container, virtual machine, and other virtualization work in OPPO Hybrid Cloud.
For more content, scan the QR code to follow the [OPPO Technology] official account.