Troubleshooting high load when CPU and IO are idle

Background

The monitoring system raised an alarm for an online server: its load was high, above 100. After logging in to the machine and checking with the top command, both CPU and IO utilization turned out to be low. At the same time, the system contained many ps and pidof processes in state D, and the kill command could not kill them.

Problem analysis

The initial suspicion is that the large number of ps and pidof processes in state D is what drives the load up.

Query the process status description through man ps:

D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped, either by a job control signal or because it is being traced
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent

According to the description, process state D means uninterruptible sleep, which is usually caused by an IO operation.

According to the documentation, a process in D state is usually waiting for IO, such as disk IO, network IO, or IO on another peripheral. A process in this state does not act on signals until the wait finishes, so the kill command cannot terminate it.
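
Because ps itself is one of the stuck commands in this incident, one way to list D-state processes without starting yet another ps is to read /proc/<pid>/stat directly. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name list_d_state is made up for illustration.

```shell
# Read /proc/<pid>/stat lines on stdin and print "pid comm" for every
# process whose state field is D (uninterruptible sleep).
# The comm field is wrapped in parentheses and may contain spaces, so the
# pattern relies on every field after the state letter being numeric.
list_d_state() {
    sed -E -n 's/^([0-9]+) \((.*)\) D( -?[0-9]+)+$/\1 \2/p'
}

# Usage (assumes a Linux /proc filesystem):
#   cat /proc/[0-9]*/stat 2>/dev/null | list_d_state
```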

So why does a process in state D cause the average load of the machine to increase?

Calculation of system average load

Take a single CPU as an example. On most Unix systems, the load average is the average number of processes that were running or waiting to run over a period such as the past minute. Linux is slightly different: processes waiting for IO in uninterruptible sleep are also counted. CPU utilization and load average can therefore differ greatly. When most processes are waiting for IO, CPU utilization can be very low while the load average is very high.

In our scenario, there are a large number of processes in D state. These processes are effectively waiting for IO and are included in the load average calculation, so the load average of the machine rises.
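
To see this effect on a live machine, the number of running (R) and uninterruptible (D) processes can be compared with /proc/loadavg. The sketch below is only an approximation: it takes an instantaneous snapshot and counts processes rather than individual threads, while the kernel's load average is smoothed over time. The function name count_load_tasks is hypothetical.

```shell
# Read /proc/<pid>/stat lines on stdin and count the processes in the two
# states that Linux includes in the load average: R and D.
count_load_tasks() {
    awk '
    {
        # The state letter follows the last ")" on the line.
        i = length($0)
        while (i > 0 && substr($0, i, 1) != ")") i--
        state = substr($0, i + 2, 1)
        if (state == "R") running++
        else if (state == "D") blocked++
    }
    END { printf "running=%d uninterruptible=%d\n", running, blocked }
    '
}

# Usage (assumes a Linux /proc filesystem):
#   cat /proc/[0-9]*/stat 2>/dev/null | count_load_tasks
#   cat /proc/loadavg
```

If the uninterruptible count is close to the load average while the running count stays small, the load is coming from D-state processes rather than from CPU demand.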

Why are there a large number of D-state processes

Processes in D state are usually waiting for IO. A large number of them usually has one of two causes: an IO device failure or a deadlock in a kernel IO operation. In our scenario, both the group-level monitoring agent and our own monitoring agent call ps and pidof frequently, and an exception inside those calls led to a deadlock in a kernel IO operation.
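
To get a hint about where a stuck process is blocked inside the kernel, /proc/<pid>/wchan and /proc/<pid>/stack can be inspected. This is a sketch under assumptions: /proc/<pid>/stack normally requires root and a kernel built with stack trace support, and the function name and its optional second argument (an alternative proc root, useful for testing) are invented for this example.

```shell
# Print the kernel wait channel and, if readable, the kernel stack of a process.
# $1: PID to inspect
# $2: optional proc root (hypothetical parameter; defaults to /proc)
show_block_point() {
    pid=$1
    proc_root=${2:-/proc}
    dir="$proc_root/$pid"
    if [ ! -d "$dir" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    printf 'wchan: %s\n' "$(cat "$dir/wchan" 2>/dev/null)"
    if [ -r "$dir/stack" ]; then
        echo "stack:"
        cat "$dir/stack"
    fi
}

# Usage: show_block_point 12345
```

If many D-state processes report the same wait channel, that is consistent with them waiting on the same kernel resource.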

Handling processes in D state

First, determine whether an IO device has failed. If it has, the only option is to try to repair the device.
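
One possible starting point for checking the IO device is the kernel log. The sketch below filters log lines for two common kinds of message: I/O errors reported by drivers and hung task warnings. The function name and the chosen patterns are assumptions for illustration, not a complete list of failure signatures.

```shell
# Read kernel log lines on stdin and keep those that mention an I/O error
# or a task that has been blocked for a long time.
filter_io_errors() {
    grep -E -i 'I/O error|blocked for more than [0-9]+ seconds'
}

# Usage (reading the kernel log may require root):
#   dmesg | filter_io_errors
```

No output does not prove the device is healthy; it only means that neither pattern appeared in the log.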

If it is a kernel IO operation deadlock, you can try to kill all abnormal processes to see whether the lock can be released.

Since a process in D state cannot be killed directly, you can try changing its state first and then killing it. The specific steps are as follows:

1. Writing kernel modules

Source file: killd.c

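#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

/* A GPL-compatible license string avoids tainting the kernel on load. */
MODULE_LICENSE("Dual BSD/GPL");

/* PID of the D-state process to modify; passed as pid=<n> to insmod. */
static int pid = -1;
module_param(pid, int, S_IRUGO);

static int __init killd_init(void)
{
    struct task_struct *p;

    /* Walk the task list under RCU and move the matching task out of D state. */
    rcu_read_lock();
    for_each_process(p) {
        if (p->pid == pid) {
            /* set_task_state() is only available on older kernels. */
            set_task_state(p, TASK_STOPPED);
            break;
        }
    }
    rcu_read_unlock();
    return 0;
}

static void __exit killd_exit(void)
{
    /* Nothing to clean up. */
}

module_init(killd_init);
module_exit(killd_exit);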

2. Compiling kernel modules

2.1 Makefile:

obj-m := killd.o

2.2 Get the kernel release:

uname -r

2.3 Compile:

make -C /lib/modules/<kernel name>/build M=`pwd` modules

3. Loading the module with the process ID to change the process state to stopped

3.1 Create a pids file and add the process IDs to it, one per line
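
The pids file can be written by hand. As an alternative, the following sketch extracts the PIDs of D-state ps and pidof processes from /proc/<pid>/stat lines. The function name select_stuck_pids is hypothetical, and the command names are taken from this incident, so adjust them for other cases.

```shell
# Read /proc/<pid>/stat lines on stdin and print the PID of every process
# named ps or pidof whose state is D, one PID per line.
select_stuck_pids() {
    sed -E -n 's/^([0-9]+) \((ps|pidof)\) D( -?[0-9]+)+$/\1/p'
}

# Usage (assumes a Linux /proc filesystem):
#   cat /proc/[0-9]*/stat 2>/dev/null | select_stuck_pids > pids
```

Review the resulting file before using it, since every PID listed in it will be stopped and killed.
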
3.2 Create the stop.sh file:

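#!/bin/sh
# For each PID in the pids file: load the module to mark the task as stopped,
# unload the module again, then send SIGKILL to the process.
while read -r pid
do
    echo "$pid"
    insmod ./killd.ko pid="$pid" && rmmod killd
    kill -9 "$pid"
done < pids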

3.3 Run it:

chmod 755 stop.sh && ./stop.sh

Killing processes in uninterruptible sleep will very likely not resolve the underlying IO deadlock, and it may even cause other system exceptions. The simplest and most effective fix is to reboot the machine.

Added by SsirhC on Tue, 21 Dec 2021 12:11:56 +0200