CRUDBoy growth plan (6) - CPU100% excellent notes summary
data:image/s3,"s3://crabby-images/a4964/a49648608668b69dc10964966311708f21bc1a24" alt=""
be careful:
- This program is aimed at programmers who have some work experience and want to improve their technology and work ability.
- For details of the plan, see: CRUD Boy growth plan
- purpose First of all, there are too many things like not popularizing theoretical knowledge. Second, if you don't do self moving things, you seem to have learned for a long time. In fact, you can't use it for work and can't ask it in the interview. Finally, the utilitarian point is to improve the salary. In order to have better performance in work, write the code better, easier to use, better maintenance and more elegant. In order to stand out from a group of eight part Liberal Arts candidates in the next interview. In the final analysis, the most important thing for a technical post is the ability to solve problems, not the ability to endorse. Because everyone's experience can not be copied, we need to learn from each other and explore a path for programmers' self-growth. The overall direction focuses on two core professional qualities and technical capabilities. Not only to write the code and meet the requirements now, but also to talk about the salary in the future.
This week's topic: what to do when the CPU is 100%?
Because everyone's notes this week are too wonderful, so let's make a collection and share some excellent notes. learn from each other.
Summary of notes
Since everyone's troubleshooting process is similar, you can directly put the previous collection blog
High CPU usage in server troubleshooting manual
This manual is used to check the CPU performance problems of Java projects (the top us value is too high), and gradually find and print out the problem thread stack according to the thread id, so as to determine the method calls that cause performance problems. An application occupies a high CPU. In addition to being a computing intensive application, it may also have an dead cycle. Solutions / solutions
- The top command finds the Java process and thread id in question:
- Turn on thread display mode
- Sort by CPU usage
- Note the Java process id and its CPU high thread id
- Using the process id as the parameter, jstack the Java process with problems
- Manually convert thread id to hexadecimal (printf% x 1234 can be used)
- Find the hexadecimal thread id (grep can be used)
- View the corresponding thread stack
Next, we take an actual fault as an example to introduce how to locate and solve this kind of problem.
According to the top command, it is found that the Java process with PID 5989 of the test environment occupies up to 100% of the CPU, which is suspected to be faulty. However, how to locate specific threads or code?
data:image/s3,"s3://crabby-images/2ba14/2ba14c7278d7586a7eb327e45e10c9421b6b0849" alt=""
image-20211230174244190
The list of threads is displayed first:
ps -mp pid -o THREAD,tid,time
data:image/s3,"s3://crabby-images/f5cc2/f5cc223c576a3d5d98595c316134837ec42e521e" alt=""
image-20211230174320441
The most time-consuming thread 6048 has been found, which has occupied CPU time for two hours
Secondly, convert the required thread ID to hexadecimal format:
printf "%x\n" tid
data:image/s3,"s3://crabby-images/cbf90/cbf90cae154c8079d0b189819e4871ed41eaa31a" alt=""
image-20211230174346931
Finally, the stack information of the thread is printed
jstack pid | grep tid -A 30
data:image/s3,"s3://crabby-images/d6865/d68659e62f7f5ccd41a2c7c55a3806e9f71c220b" alt=""
image-20211230174413323
Found the code stack information of the problem. The specific modification depends on the specific business logic, which is not discussed here
Summarize the methods and skills for troubleshooting CPU faults:
- top command: you can view the real-time CPU usage and the latest CPU usage
reflection
- How to turn on thread display mode
- How to sort by CPU usage
- How to sort by memory usage
- PS command: powerful process status monitoring command, which can view the current CPU usage of the process and threads in the process, and the sampling data belonging to the current status
- jstack: command provided by Java You can view the current thread stack operation of a process. According to the output of this command, you can locate the current running state, running code, deadlock, etc. of all threads of a process
- pstack: Linux command You can view the current thread stack operation of a process
Finally, in the actual project, this operation will be carried out many times to determine the problem. The above process is too cumbersome. Share a script for you:
This script is used to quickly check the CPU performance problems of Java (the top us value is too high), automatically find out the threads that consume too much CPU in the running java process, and print out their thread stack, so as to determine the method calls that cause performance problems.
#!/bin/bash # @Function # Find out the highest cpu consumed threads of java, and print the stack of these threads. # # @Usage # $ ./show-busy-java-threads.sh # # @author sean readonly PROG=`basename $0` readonly -a COMMAND_LINE=("$0" "$@") usage() { cat <<EOF Usage: ${PROG} [OPTION]... Find out the highest cpu consumed threads of java, and print the stack of these threads. Example: ${PROG} -c 10 Options: -p, --pid find out the highest cpu consumed threads from the specifed java process, default from all java process. -c, --count set the thread count to show, default is 5 -h, --help display this help and exit EOF exit $1 } readonly ARGS=`getopt -n "$PROG" -a -o c:p:h -l count:,pid:,help -- "$@"` [ $? -ne 0 ] && usage 1 eval set -- "${ARGS}" while true; do case "$1" in -c|--count) count="$2" shift 2 ;; -p|--pid) pid="$2" shift 2 ;; -h|--help) usage ;; --) shift break ;; esac done count=${count:-5} redEcho() { [ -c /dev/stdout ] && { # if stdout is console, turn on color output. echo -ne "\033[1;31m" echo -n "$@" echo -e "\033[0m" } || echo "$@" } yellowEcho() { [ -c /dev/stdout ] && { # if stdout is console, turn on color output. echo -ne "\033[1;33m" echo -n "$@" echo -e "\033[0m" } || echo "$@" } blueEcho() { [ -c /dev/stdout ] && { # if stdout is console, turn on color output. echo -ne "\033[1;36m" echo -n "$@" echo -e "\033[0m" } || echo "$@" } # Check the existence of jstack command! if ! which jstack &> /dev/null; then [ -z "$JAVA_HOME" ] && { redEcho "Error: jstack not found on PATH!" exit 1 } ! [ -f "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack not found on PATH and $JAVA_HOME/bin/jstack file does NOT exists!" exit 1 } ! [ -x "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack not found on PATH and $JAVA_HOME/bin/jstack is NOT executalbe!" exit 1 } export PATH="$JAVA_HOME/bin:$PATH" fi readonly uuid=`date +%s`_${RANDOM}_$$ cleanupWhenExit() { rm /tmp/${uuid}_* &> /dev/null } trap "cleanupWhenExit" EXIT printStackOfThreads() { local line local count=1 while IFS=" " read -a line ; do local pid=${line[0]} local threadId=${line[1]} local threadId0x="0x`printf %x ${threadId}`" local user=${line[2]} local pcpu=${line[4]} local jstackFile=/tmp/${uuid}_${pid} [ ! -f "${jstackFile}" ] && { { if [ "${user}" == "${USER}" ]; then jstack ${pid} > ${jstackFile} else if [ $UID == 0 ]; then sudo -u ${user} jstack ${pid} > ${jstackFile} else redEcho "[$((count++))] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user})." redEcho "User of java process($user) is not current user($USER), need sudo to run again:" yellowEcho " sudo ${COMMAND_LINE[@]}" echo continue fi fi } || { redEcho "[$((count++))] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user})." echo rm ${jstackFile} continue } } blueEcho "[$((count++))] Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user}):" sed "/nid=${threadId0x} /,/^$/p" -n ${jstackFile} done } ps -Leo pid,lwp,user,comm,pcpu --no-headers | { [ -z "${pid}" ] && awk '$4=="java"{print $0}' || awk -v "pid=${pid}" '$1==pid,$4=="java"{print $0}' } | sort -k5 -r -n | head --lines "${count}" | printStackOfThreads
usage
show-busy-java-threads.sh # Find out the most CPU consuming thread (5 by default) from all Java processes and print out its thread stack. show-busy-java-threads.sh -c <Number of thread stacks to display> show-busy-java-threads.sh -c <Number of thread stacks to display> -p <designated Java Process> ############################## # be careful: ############################## # If the user of the Java process is different from the current user executing the script, jstack cannot the Java process. # In order to switch to the user of Java process, sudo needs to be added for execution, which can solve the following problems: sudo show-busy-java-threads.sh
Examples
$ show-busy-java-threads.sh [1] Busy(57.0%) thread(23355/0x5b3b) stack of java process(23269) under user(admin): "pool-1-thread-1" prio=10 tid=0x000000005b5c5000 nid=0x5b3b runnable [0x000000004062c000] java.lang.Thread.State: RUNNABLE at java.text.DateFormat.format(DateFormat.java:316) at com.xxx.foo.services.common.DateFormatUtil.format(DateFormatUtil.java:41) at com.xxx.foo.shared.monitor.schedule.AppMonitorDataAvgScheduler.run(AppMonitorDataAvgScheduler.java:127) at com.xxx.foo.services.common.utils.AliTimer$2.run(AliTimer.java:128) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [2] Busy(26.1%) thread(24018/0x5dd2) stack of java process(23269) under user(admin): "pool-1-thread-2" prio=10 tid=0x000000005a968800 nid=0x5dd2 runnable [0x00000000420e9000] java.lang.Thread.State: RUNNABLE at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572) at java.lang.StringBuffer.append(StringBuffer.java:320) - locked <0x00000007908d0030> (a java.lang.StringBuffer) at java.text.SimpleDateFormat.format(SimpleDateFormat.java:890) at java.text.SimpleDateFormat.format(SimpleDateFormat.java:869) at java.text.DateFormat.format(DateFormat.java:316) at com.xxx.foo.services.common.DateFormatUtil.format(DateFormatUtil.java:41) at com.xxx.foo.shared.monitor.schedule.AppMonitorDataAvgScheduler.run(AppMonitorDataAvgScheduler.java:126) at com.xxx.foo.services.common.utils.AliTimer$2.run(AliTimer.java:128) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) ...
As can be seen from the above thread stack, the two threads with the highest CPU consumption are executing Java text. DateFormat. Format, the method corresponding to the business code is shared monitor. schedule. AppMonitorDataAvgScheduler. run. It can be basically determined that:
- AppMonitorDataAvgScheduler.run calls dateformat Format times are frequent.
- DateFormat.format is slow. (this can be determined by the implementation of DateFormat.format.)
Execute show busy Java threads several times SH, if the above situation occurs with a high probability, the above decision can be determined.
Because the fewer calls, the faster code execution, the lower the probability of online stack.
Analyze shared monitor. schedule. AppMonitorDataAvgScheduler. Run implements logic and call mode to optimize the implementation and solve the problem.
Basic concepts
For the sudden slow running problem of the online system, if the problem makes the online system unavailable, the first thing to do is to export jstack and memory information, and then restart the system to ensure the availability of the system as soon as possible. There are two possible reasons for this:
- The large amount of data read from a certain position in the code leads to the depletion of system memory, resulting in too many times of Full GC and slow system;
- There are CPU consuming operations in the code, resulting in too high CPU and slow system operation;
Relatively speaking, these are the two most frequent online problems, and they will directly lead to system unavailability. In addition, there are several situations that can lead to slow operation of a function, but will not lead to unavailability of the system:
- There is a blocking operation at a certain position of the code, which leads to the overall time-consuming function call, but the occurrence is relatively random;
- A thread enters the WAITING state for some reason. At this time, the function is not available as a whole, but it cannot be reproduced;
- Due to the improper use of locks, multiple threads enter the deadlock state, resulting in the overall slow speed of the system.
For these three cases, you can't find out the specific problems by checking the CPU and system memory, because they are relatively blocking operations. The CPU and system memory usage are not high, but the function is very slow.
- The large amount of data read from a certain position in the code leads to the depletion of system memory, resulting in too many times of Full GC and slow system;
- There are CPU consuming operations in the code, resulting in too high CPU and slow system operation;
These two are basically the reasons for most accidents. There was a problem in the previous project: a collection was encapsulated in the entity class of interface response. Because there was no restriction on the collection (there was too little data at the beginning, and the students who wrote the code didn't care), the whole table data was encapsulated each time, resulting in OOM.
Thread stack information
Thread stack, also known as thread call stack, is an instantaneous snapshot of the state of threads (including locks) in the virtual machine, that is, the running state of all threads at a certain time, including the call stack of each thread and the holding of locks. Although the formats printed by different virtual machines are somewhat different, the information of thread stack includes:
- Thread name, id, number of threads, etc.
- The running state of the thread and the state of the lock (which thread holds the lock, which thread is waiting for the lock, etc.)
- Call stack (that is, the call hierarchy of functions) the call stack contains the complete class name, the method executed, and the number of lines of source code.
You can mainly see the function call relationship, through which you can quickly locate the problem code.
case
For new applications, due to the small traffic, the cluster QPS is only about 5, and the rt of the write interface is about 30ms.
After that, a new service is connected. The data given by the business party is that the daily QPS can reach 2000, and the peak QPS may reach 10000.
During the subsequent pressure measurement, it is found that when the QPS of a single machine reaches about 200, the rt of the interface does not change significantly, but the CPU utilization increases sharply until it is full.
After the pressure measurement stopped, the CPU utilization immediately decreased.
Troubleshooting and solution
Use Alibaba's open source tool Arthas or JDK's own tool
First, check the CPU usage. You can directly use the top command:
At this time, it can be seen that the Java process with process ID 3480 occupies a high CPU. It can be basically concluded that a large amount of CPU is consumed during the execution of application code
Check the code, and there are problems in the creation and use of sequence
public Long insert(T dataObject) { if (dataObject.getId() == null) { Long id = next(); dataObject.setId(id); } if (sqlSession.insert(getNamespace() + ".insert", dataObject) > 0) { return dataObject.getId(); } else { return null; } } public Sequence sequence() { return SequenceBuilder.create() .name(getTableName()) .sequenceDao(sequenceDao) .build(); } /** \* Get next primary key ID * \* @return */ protected Long next() { try { return sequence().nextValue(); } catch (SequenceException e) { throw new RuntimeException(e); } }
It can be found that each insert statement re build s a new sequence, which leads to the loss of the local cache. Therefore, we will pull 1000 items from the database every time, but only one item is used, and 1000 items are retrieved again next time, and over and over again
Optimization method:
Adjust the code structure and change the generation of sequence instance to initialization at application startup. Later, when obtaining a sequence, you will not interact with the database every time, but first check the local cache. When the local cache is exhausted, you will interact with the database to obtain a new sequence.
public abstract class BaseMybatisDAO implements InitializingBean { @Override public void afterPropertiesSet() throws Exception { sequence = SequenceBuilder.create().name(getTableName()).sequenceDao(sequenceDao).build(); } }
Initialize the Sequence in this method by implementing InitializingBean and overriding the afterpropertieset () method.