Background
A monitoring alarm reported a high load on one of our online servers, with the load average exceeding 100. After logging in to the machine and checking with the top command, we found that CPU and IO utilization were both low. At the same time, the system contained many ps and pidof processes in state D, and these processes could not be killed with the kill command.
Problem analysis
Our initial suspicion was that the large number of ps and pidof processes in state D was causing the high load.
Query the process status description through man ps:
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped, either by a job control signal or because it is being traced
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent
According to this description, process state D means uninterruptible sleep, which is usually caused by an IO operation.
According to the references we consulted, a process in state D is usually waiting for IO, such as disk IO, network IO, or IO on another peripheral. A process in this state does not respond to signals until it wakes up, so it cannot be killed with kill.
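To see which commands are stuck, the D-state processes can be grouped by command name. The following is a minimal sketch, assuming the procps version of ps; the function name count_d_by_command is made up for illustration.

```shell
#!/bin/sh
# Count processes in uninterruptible sleep (state D), grouped by command name.
# Expects "STAT COMMAND" pairs on stdin, one process per line.
# count_d_by_command is a hypothetical helper name, not a standard tool.
count_d_by_command() {
    awk '$1 ~ /^D/ { n[$2]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Usage (procps ps assumed; separate -o options keep the headers empty):
#   ps -e -o stat= -o comm= | count_d_by_command
```

In a situation like ours, ps and pidof would be expected to appear at the top of the output.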
So why does a process in state D cause the average load of the machine to increase?
Calculation of system average load
Take a single-CPU system as an example. The load average is the average number of processes that were running or waiting to run over a period such as the past minute. Linux differs slightly: processes in uninterruptible sleep, which are typically waiting for IO, are also included in the calculation. As a result, CPU utilization and load average can differ greatly. When most processes are waiting on IO, CPU utilization may be very low while the load average is very high.
In our scenario, there are a large number of processes in state D. These processes are effectively waiting for IO and are counted in the load average, which is why the load average of the machine rises.
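The relationship can be checked roughly by counting the processes that are in state R or D at a given instant and comparing the result with the reported load average. This is only a sketch: it assumes the procps version of ps, counts processes rather than individual threads, and takes a single sample, so it approximates the averaged value. The function name count_load_contributors is hypothetical.

```shell
#!/bin/sh
# Count processes that contribute to the Linux load average right now:
# runnable (R) plus uninterruptible sleep (D).
# Expects one STAT value per line on stdin.
# count_load_contributors is a hypothetical helper name.
count_load_contributors() {
    awk '$1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Usage (procps ps assumed; the ps process itself is counted as running):
#   ps -e -o stat= | count_load_contributors
#   cut -d ' ' -f 1-3 /proc/loadavg    # 1-, 5- and 15-minute load averages
```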
Why are there a large number of D-state processes
Processes in state D are usually waiting for IO. A large number of them typically has one of two causes: an IO device failure or a deadlock in a kernel IO operation. In our scenario, both the group-level monitoring agent and our own monitoring agent call ps and pidof frequently, and an internal error in these calls led to a deadlock in a kernel IO operation.
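One way to narrow down the cause is to look at the kernel function each D-state process is sleeping in (its wait channel). The sketch below assumes the procps version of ps, whose wchan column may show only "-" on some kernels or for unprivileged users; the function name d_state_wchan is hypothetical.

```shell
#!/bin/sh
# Print PID, kernel wait channel and command for every D-state process.
# Expects "PID STAT WCHAN COMMAND" on stdin, one process per line.
# A header line is skipped automatically because "STAT" does not start with D.
# d_state_wchan is a hypothetical helper name.
d_state_wchan() {
    awk '$2 ~ /^D/ { print $1, $3, $4 }'
}

# Usage (procps ps assumed):
#   ps -eo pid,stat,wchan:32,comm | d_state_wchan
# For one process, the kernel stack may also be readable as root,
# if the kernel provides it:
#   cat /proc/<pid>/stack
```

If many processes share the same wait channel, that function is a reasonable starting point for investigating the deadlock.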
Handling processes in D state
First, determine whether an IO device has failed. If it has, the only option is to try to repair the IO device.
If the cause is a deadlock in a kernel IO operation, you can try killing all of the abnormal processes to see whether the lock is released.
Because a process in state D cannot be killed directly, you can try changing its state first and then killing it. The specific steps are as follows:
1. Write the kernel module
Source file: killd.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

MODULE_LICENSE("BSD");

/* PID of the target process, passed in when the module is loaded. */
static int pid = -1;
module_param(pid, int, S_IRUGO);

static int killd_init(void)
{
    struct task_struct *p;

    /* Find the D-state process and force it into the stopped state.
     * Note: set_task_state() exists only on older kernels. */
    for_each_process(p) {
        if (p->pid == pid) {
            set_task_state(p, TASK_STOPPED);
            return 0;
        }
    }
    return 0;
}

static void killd_exit(void)
{
    /* Nothing to clean up. */
}

module_init(killd_init);
module_exit(killd_exit);
2. Compile the kernel module
2.1 Makefile:
obj-m := killd.o
2.2 Get the kernel release name:
uname -r
2.3 Compile:
make -C /lib/modules/<kernel name>/build M=`pwd` modules
3. Load the module, passing in the process ID, to change the process state to stopped
3.1 Create a pids file and add the process IDs to it, one per line
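The pids file can be written by hand or generated from the ps output. The following is a minimal sketch, assuming the procps version of ps and that only the stuck ps and pidof processes should be targeted; the function name select_stuck_pids is hypothetical.

```shell
#!/bin/sh
# Print the PIDs of D-state processes whose command is ps or pidof.
# Expects "PID STAT COMMAND" on stdin, one process per line.
# select_stuck_pids is a hypothetical helper name.
select_stuck_pids() {
    awk '$2 ~ /^D/ && ($3 == "ps" || $3 == "pidof") { print $1 }'
}

# Usage (procps ps assumed; separate -o options keep the headers empty):
#   ps -e -o pid= -o stat= -o comm= | select_stuck_pids > pids
```

It is worth reviewing the resulting file before going further, since every PID listed in it will be stopped and killed.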
3.2 Create the stop.sh file:
#!/bin/bash
# For each PID in the pids file: mark it stopped, then kill it.
while read -r line
do
    echo "$line"
    insmod ./killd.ko pid="$line" && rmmod killd
    kill -9 "$line"
done < pids
3.3 Run it:
chmod 755 stop.sh && ./stop.sh
Killing processes in uninterruptible sleep will quite possibly not resolve the IO deadlock, and may even cause other system anomalies. The simplest and most effective approach is to reboot the machine.