Record a troubleshooting of the slow response of the server

I checked the problem for a week. I was a little nervous and panicked. I was ignorant before I determined the cause of the problem.

1. It is found that there is mysql connection timeout (Communications link failure, The last packet sent successfully to the server was 0 milliseconds ago. Caused by: java.net.ConnectException: Connection timed out) in the log;


2. It is found that some requests take 6s, some take 12s, and some exceed 15s to report abnormalities;
3. Considering that the connection pool is not large enough to reach the upper limit of the connection pool, increase the upper limit configuration of the connection pool of the resin data source from 20 to 50; Check the database slow log and find that there is no record of business query

 


4. The problem still exists. I wanted to analyze the log but couldn't find a useful tool. So I wrote a java program to analyze the log and counted the time distribution of connection requests. It was found that 91% responded within 1s, some 1-2s, some 3s, some 6s, and some 12s. It was very regular. 1.5 - > 3 - > 6 - > 12 doubled, and other response times were few

The main start and end logs of the interface are in this format:

2022-01-19 08:09:32469 Info - [http://*:8080-4] - [AgentController] requestId=xxxx, enter the loginAndCall() method of AgentController, parameters: {"agentId":"","ani":"2","dn":"xxxx","number":"xxxx","requestId":"xxxx","um":"xxxx"}

2022-01-19 08:09:32499 Info - [pool-1-thread-1] - [AgentController] requestId=xxxx, forced login ----- start

The analysis code is as follows, using regular expressions to extract key information

public static void main(String[] args) throws ParseException {
        List<String> list = readFile("E:\\Document\\CRM\\CDesk\\2022 January\\Log 0124\\AgentController59-0124.log");
        Pattern pattern = Pattern.compile("(.*) INFO.*http.*AgentController.*requestId=(.*?),.*|(.*) INFO.*AgentController.*requestId=(.*?),.*Force login.*start.*");
        Map<String, List<String>> requestIdMap = new LinkedHashMap<String, List<String>>();
        for (String s : list) {
            Matcher matcher = pattern.matcher(s);
            if (matcher.find()) {
                String time = matcher.group(1);
                String requestId = matcher.group(2);
                if (time == null) {
                    time = matcher.group(3);
                    requestId = matcher.group(4);
                }
                List<String> timeList = requestIdMap.get(requestId);
                if (timeList == null) {
                    timeList = new ArrayList<String>();
                    timeList.add(time);
                    timeList.add(time);
                    requestIdMap.put(requestId, timeList);
                }
                if (timeList.get(1).compareTo(time) < 0) {
                    timeList.set(1, time);
                }
                //System.out.println(matcher.group(1) + "\t" + matcher.group(2));
            }
        }
        List<Long> statistic = new ArrayList<Long>();
        for (int i = 0; i < 50; i++) {
            statistic.add(0L);
        }
        for (String requestId : requestIdMap.keySet()) {
            List<String> timeList = requestIdMap.get(requestId);
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
            Date begin = sdf.parse(timeList.get(0));
            Date end = sdf.parse(timeList.get(1));
            long diff = (end.getTime() - begin.getTime()) / 1000;
            int interval = (int) diff;
            if (interval > 6) {
                System.out.println(requestId + " " + diff);
            }
            statistic.set(interval, 1 + statistic.get(interval));
        }
        //System.out.println(statistic);
        long sum = 0;
        for (int i = 0; i < 50; i++) {
            System.out.println(i + "s Number of requests" + statistic.get(i));
            sum += statistic.get(i);
        }
        System.out.println(sum);
    }


5. Considering that the database connection pool 50 is not enough, there may be an algorithm for increasing the interval time to obtain the connection pool, so adjust the connection pool from 50 to 500;
6. The problem still exists. There is no way to consider the network problem. Capture packets of the machine where the program is located on the database port 3306;
7. There are a large number of TCP retransmissions in the packet capture file. Some retransmissions take 6 times to get a successful response, which takes exactly 6 s; The interval time of each retransmission is doubled;


8. The network group finally checked the network problem and found that the keepalive id was heavy, so it solved the network problem and analyzed the log. It was normal. In a day, only one of the 10979 http requests of the two instances exceeded 1s, and the others responded within 1s

Conclusion: I thought the network problem was an excuse to prevaricate when I couldn't find the real cause of the problem, so I didn't think of catching the package to check the network problem. I wasted several days, but I didn't think it was really a network problem.

Keywords: Java Database server

Added by horstuff on Wed, 26 Jan 2022 18:10:45 +0200