Understanding TCP's three-way handshake and four-way close from a single production incident

Introduction:

  • The production incident
  • The TCP three-way handshake
  • The TCP four-way close
  • Analyzing the source code with the Java stack
  • Finding the "culprit" in the stack
  • Summary of the optimization plan

1. The production incident

Business background:

This service mainly exposes proxy interfaces to external callers. Most of them call third-party interfaces, then aggregate the data and return it to clients.

One night during the system's peak period, the project team was happily eating its overtime dinner. We had just put the first bite of "🍚" into our mouths when the email and SMS alarms went off at the same time.

A server interface was timing out. We occasionally receive similar alarms, sometimes because of network fluctuations (sorry to our colleagues on the network team, who always end up taking the blame :)). This time, however, the alarm did not clear. It went on for more than ten minutes, which felt like a bad sign.

I clicked the interface URL in the alarm email, and the page just kept spinning. The response was painfully slow!

At that moment I silently pushed the boxed meal aside; I couldn't bear to look at it :(

Basic troubleshooting process:

1) Determine the scope of impact

There are multiple servers behind the service and only one of them was affected, so the impact on users was limited.
We temporarily removed it from the registry so that clients would stop retrying against this machine, and left it untouched to preserve the scene of the incident.

2) Check the monitoring metrics

We looked at the traffic to the interface service. Because it was the evening peak, traffic was higher than at other times of day, but it was not significantly higher than during the same period on previous days.
The server's CPU, memory, IO and network metrics were all normal.

3) Troubleshoot on the server

We logged in to the server and, together with the monitoring data, took a closer look at CPU, memory and the other indicators. The service log looked normal: no unusual Exception output, no OOM, nothing else abnormal.

What we could see was that the interface service no longer responded normally. The application runs on the JVM, so we ran a quick check with the standard tools that ship with the JDK.

The following command dumps the thread stacks:

`
jstack -l $pid > jstack.log
`

The thread states in the dump can be counted as follows:

cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c

994    java.lang.Thread.State: WAITING (parking)
501    java.lang.Thread.State: WAITING (on object monitor)
7      java.lang.Thread.State: TIMED_WAITING (sleeping)
13     java.lang.Thread.State: TIMED_WAITING (parking)
2      java.lang.Thread.State: TIMED_WAITING (on object monitor)
23     java.lang.Thread.State: RUNNABLE

If you see a large number of threads in a state such as java.lang.Thread.State: WAITING (parking) or java.lang.Thread.State: WAITING (on object monitor), pay attention. This is usually caused by the program itself.
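The same tally can also be done in Java, which is handy when a dump has to be processed on a machine without the usual shell tools. The following is only a minimal sketch: the class name ThreadStateCounter and its methods are made up for illustration, and it assumes nothing more than that the dump contains lines of the form "java.lang.Thread.State: <state>".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tallies the "java.lang.Thread.State: ..." lines of a jstack dump.
final class ThreadStateCounter {
    private static final Pattern STATE_LINE =
            Pattern.compile("java\\.lang\\.Thread\\.State:\\s*(.+?)\\s*$");

    /** Returns a map from state text, e.g. "WAITING (parking)", to its number of occurrences. */
    static Map<String, Integer> count(Iterable<String> jstackLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jstackLines) {
            Matcher m = STATE_LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ThreadStateCounter.java <jstack.log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        count(lines).forEach((state, n) -> System.out.println(n + "\t" + state));
    }
}
```

Run against jstack.log, it prints one line per state with its count, which should match what the grep pipeline reports.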

Searching jstack.log for java.lang.Thread.State: WAITING shows a large number of threads suspended while making requests through the HttpClient utility class. The specific stacks are analyzed in detail below.

These service calls go through the HttpClient utility class, which wraps Spring RestTemplate. Underneath, RestTemplate in turn uses Apache HttpClient to make the calls.

Besides the anomalies in the jstack log, we also checked the network state on the server, a troubleshooting step commonly used by operations engineers.

The following command counts network connections by state:

netstat -n | awk '/^tcp/ {++State[$NF]} END {for(i in State) print i, State[i]}'

Statistical results:

TIME_WAIT 9
CLOSE_WAIT 3826
SYN_SENT 2
ESTABLISHED 28
SYN_RECV 8

Note that the connection statistics on this server look odd: a large number of connections are in the CLOSE_WAIT state.

Moreover, when the command is run again after a while, these connections are still there; that is, they are never released. The problem looked serious.

We suspected that these CLOSE_WAIT connections were related to the slow interface responses and to the HttpClient threads blocked in the Java stacks, and took this as the starting point for the analysis.

Let's first understand the CLOSE_WAIT state. It occurs while a TCP connection is being torn down. When one side (here called the client) initiates the close, the other side (the server) receives that first FIN, replies with an acknowledgment, and enters CLOSE_WAIT. The connection leaves CLOSE_WAIT only when the server's application closes its own end of the socket, which sends the server's FIN to the client; after that the connection is closed normally.
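To see a CLOSE_WAIT socket with your own eyes, a small loopback experiment is enough. The sketch below is illustrative only (the class CloseWaitDemo is invented for this article): one end of a local connection is closed right away, and the other end is deliberately kept open for a while. Java cannot report the TCP state itself, so the only thing the code observes is that read() returns -1 once the peer's FIN has arrived; the state has to be checked from outside with netstat.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo: the peer closes first, our side delays its own close().
final class CloseWaitDemo {

    /**
     * Opens a loopback connection, lets the accepting side close immediately,
     * reads once on the other side and then keeps that socket open for holdOpenMillis.
     * Returns the result of read(): -1 means the peer's FIN has been received.
     */
    static int readAfterPeerClose(long holdOpenMillis) {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket local = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
            // The accepting side closes at once, which sends a FIN to "local".
            listener.accept().close();
            int result = local.getInputStream().read();
            // Until local.close() runs (end of the try block), the OS should keep
            // this socket in CLOSE_WAIT.
            try {
                Thread.sleep(holdOpenMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Holding the socket open for 30 s; check netstat -n in another terminal.");
        System.out.println("read() returned " + readAfterPeerClose(30_000));
    }
}
```

While the program sleeps, the netstat command shown earlier should list one extra CLOSE_WAIT connection on the loopback address. It disappears as soon as the socket is closed, which is exactly what never happened on the faulty server.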

2. The TCP three-way handshake

Although the CLOSE_WAIT state belongs to the four-way close of a TCP connection, it is still worth understanding the three-way handshake first, because establishing a TCP connection is the first thing that happens when a request is sent to a server.

Technology comes from life.

An example from daily life helps to explain the TCP three-way handshake.

Suppose you want to chat on WeChat with a friend you haven't seen for a long time:

Xiaodong: Xiaosheng, are you there?

(after a long time...)

Xiaosheng: Yes. Are you still there?

(Xiaodong happens to be online; he scrolls through his Moments every day...)

Xiaodong: Mm-hmm, I'm here.

(then the two start chatting away...)

If you usually chatted with friends like this, wouldn't it be tiring?

In fact, this exchange maps nicely onto the TCP three-way handshake.

Let's treat Xiaodong as the "client" and Xiaosheng as the "server".
Xiaodong is a programmer working in IT. Xiaosheng is starting a business in his hometown.

The TCP three-way handshake, step by step:

1) Xiaodong, the "client", starts a chat with Xiaosheng, the "server", by sending him a network packet flagged SYN. [This is the first step of the handshake. Xiaodong is now in the SYN_SENT state.]

2) Xiaosheng receives Xiaodong's packet and has to acknowledge it. Xiaosheng is busy with other things, though; unlike Xiaodong, who works in IT, he isn't on WeChat all day. In the evening he finally has time to look at his phone, and Xiaodong's message pops up. He replies excitedly: he acknowledges Xiaodong's SYN with an ACK and, because so much time has passed that he can't be sure Xiaodong is still online, sends his own SYN along with it. [This is the second step of the handshake. Xiaosheng is now in the SYN_RCVD state.]

3) Xiaodong happens to be online and receives Xiaosheng's reply. He immediately acknowledges Xiaosheng's SYN + ACK with an ACK of his own. [This is the third step of the handshake. Xiaodong enters the ESTABLISHED state when he sends this ACK, and Xiaosheng enters it when he receives it.]

4) Xiaosheng is online too by now, and the two start chatting away. [Data is now being transferred. Both Xiaodong and Xiaosheng are in the ESTABLISHED state.]

The states mentioned above, SYN_SENT, SYN_RCVD and ESTABLISHED, are the key states in the TCP three-way handshake.

[Figure: the TCP three-way handshake]

3. The TCP four-way close

The lively chat between Xiaodong and Xiaosheng is winding down. It's late, it has been a busy day, and it's time to rest.

Xiaodong has to get up early tomorrow for work, so he tells Xiaosheng first:

Xiaodong: I have to get up at 4 a.m. tomorrow to upgrade the system, so I'm going to turn in early. I'll buy you a drink some other day!
Xiaosheng: Really? OK, got it!
Xiaosheng: Then get some rest. And I'm holding you to that drink!
Xiaodong: Mm-hmm. Good night! You should get to bed early too.
Xiaosheng: OK, good night, brother!

Mapping this onto the TCP four-way close:

1) Xiaodong is going to rest, so he sends a FIN (call it fin1) to end the chat. [Xiaodong is now in the FIN_WAIT_1 state. This is the first step of the close.]

2) Xiaosheng receives Xiaodong's fin1 packet and replies with an ACK. [Xiaosheng is now in the CLOSE_WAIT state, and Xiaodong moves to FIN_WAIT_2 once he receives the ACK. This is the second step of the close.]

3) Xiaosheng has nothing more to say either and sends his own FIN (fin2). [Xiaosheng is now in the LAST_ACK state. This is the third step of the close.]

4) Xiaodong replies with a final ACK for Xiaosheng's fin2 packet. [Xiaodong is now in the TIME_WAIT state. This is the fourth step of the close.]

Why does Xiaodong stay in the TIME_WAIT state?

Because, by the "old rules", he has to make sure that Xiaosheng received the final ACK before the chat can really end.

The standard duration of the TIME_WAIT state is 4 minutes. During this period, the socket resources (ports) of the TCP connection between Xiaodong and Xiaosheng cannot be reused by others or reclaimed by the system.

If Xiaosheng does not receive the ACK, he retransmits the fin2 packet until Xiaodong's ACK arrives.
Only then do the two sides formally close the chat channel, release the port resources and close the connection.

The 4-minute wait is 2 MSL, with each MSL being 2 minutes. MSL stands for Maximum Segment Lifetime, the longest time a segment may live in the network. The 2-minute value comes from the TCP specification (RFC 793); real implementations often use a shorter one, for example Linux holds TIME_WAIT for 60 seconds.
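The four steps can also be written down as a tiny state machine, which makes it easier to see which side is stuck where. This is a deliberately simplified sketch, not a TCP implementation: the names TcpCloseWalkthrough, State and Event are invented here, and simultaneous close, resets and retransmissions are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the connection-teardown states discussed above.
final class TcpCloseWalkthrough {
    enum State { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSE_WAIT, LAST_ACK, CLOSED }
    enum Event { APP_CLOSE, RECV_FIN, RECV_ACK, TIMEOUT_2MSL }

    /** Returns the next state, or throws if the transition is not part of this simplified model. */
    static State next(State s, Event e) {
        switch (s) {
            case ESTABLISHED:
                if (e == Event.APP_CLOSE) return State.FIN_WAIT_1; // active side sends FIN
                if (e == Event.RECV_FIN) return State.CLOSE_WAIT;  // passive side ACKs the FIN
                break;
            case FIN_WAIT_1:
                if (e == Event.RECV_ACK) return State.FIN_WAIT_2;
                break;
            case FIN_WAIT_2:
                if (e == Event.RECV_FIN) return State.TIME_WAIT;   // sends the final ACK
                break;
            case TIME_WAIT:
                if (e == Event.TIMEOUT_2MSL) return State.CLOSED;
                break;
            case CLOSE_WAIT:
                // The only way out of CLOSE_WAIT is the application closing its socket.
                if (e == Event.APP_CLOSE) return State.LAST_ACK;   // passive side sends FIN
                break;
            case LAST_ACK:
                if (e == Event.RECV_ACK) return State.CLOSED;
                break;
            default:
                break;
        }
        throw new IllegalStateException(e + " is not modelled in state " + s);
    }

    /** Applies the events in order and returns every state visited, including the start. */
    static List<State> walk(State start, Event... events) {
        List<State> path = new ArrayList<>();
        State s = start;
        path.add(s);
        for (Event e : events) {
            s = next(s, e);
            path.add(s);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println("Xiaodong:  " + walk(State.ESTABLISHED,
                Event.APP_CLOSE, Event.RECV_ACK, Event.RECV_FIN, Event.TIMEOUT_2MSL));
        System.out.println("Xiaosheng: " + walk(State.ESTABLISHED,
                Event.RECV_FIN, Event.APP_CLOSE, Event.RECV_ACK));
    }
}
```

Note that in this model a socket in CLOSE_WAIT has exactly one outgoing transition, APP_CLOSE. If the application never closes the socket, the connection stays in CLOSE_WAIT indefinitely.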

[Figure: the TCP four-way close]

4. Analyzing the source code with the Java stack

From the analysis of the four-way close we know that when the server receives the client's request to close the TCP connection, it replies with an acknowledgment and enters the CLOSE_WAIT state.

Now we know where CLOSE_WAIT sits in the life of a connection. Putting this together with the symptoms described earlier: a large number of threads are blocked in HttpClient calls with thread state WAITING, and the statistics on the server show a large number of CLOSE_WAIT connections that are never released.

Threads are a valuable resource in a JVM process. If many threads stay waiting or blocked, sooner or later every thread is occupied and the service can no longer respond normally.

Let's work out the specific cause from the Java thread stacks and the source code.

The first key stack in the dump:

"http-nio-8970-exec-1108" #24971 daemon prio=5 os_prio=0 tid=0x00007f45b4445800 nid=0x61ad waiting on condition [0x00007f444ad69000]
java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c2f30968> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
        at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
        - locked <0x0000000641c7fe38> (a org.apache.http.pool.AbstractConnPool$2)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
        at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
        at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
        at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:660)
        at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:629)
        at org.springframework.web.client.RestTemplate.getForEntity(RestTemplate.java:329)
        at com.xxx.proxy.common.util.HttpClientUtil.getForEntity(HttpClientUtil.java:267)
        at com.xxx.proxy.common.util.HttpClientUtil.getForObject(HttpClientUtil.java:521)
... ...

Stacks like the one above appear in large numbers in the dump, almost all of them triggered by calls through the HttpClient utility class. All of these threads are in the java.lang.Thread.State: WAITING (parking) state.

Threads end up suspended in WAITING (parking) because the interface service calls many third-party interfaces internally. Each call needs an HTTP connection from the pool; when none can be obtained, the thread has to wait.

The HttpClientUtil utility class extends Spring's RestTemplate (org.springframework.web.client.RestTemplate) and customizes some parameters, the retry mechanism and the proxy.

[Figure: class diagram of HttpClientUtil]

Example code that creates the HttpClient utility:

HttpClientFactoryBean httpClientFactoryBean = new HttpClientFactoryBean(config);
HttpComponentsClientHttpRequestFactory httpRequestFactory =
        new HttpComponentsClientHttpRequestFactory(httpClientFactoryBean.getObject());
return new HttpClientUtil(httpRequestFactory);

HttpClientFactoryBean extends AbstractFactoryBean and overrides the getObjectType() and createInstance() methods.

[Figure: class diagram of HttpClientFactoryBean]

Part of HttpClientFactoryBean, as an example:

@Override
public Class<?> getObjectType() {
        return HttpClient.class;
}
        
@Override
protected HttpClient createInstance() {
    if (restConfig == null) {
        // No configuration supplied: fall back to a default client
        return HttpClients.custom().build();
    }
    // Maximum connections per route
    int maxPerRoute = restConfig.getMaxConnections();
    // Total maximum connections
    int maxTotal = restConfig.getMaxTotalConnections();
    // Connection timeout
    int connectTimeout = restConfig.getConnectionTimeout();
  // Timeout for reading data
    int socketTimeout = restConfig.getTimeout();
    
    PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(30, TimeUnit.SECONDS);
    connManager.setDefaultMaxPerRoute(maxPerRoute);
    connManager.setMaxTotal(maxTotal);
    connManager.setValidateAfterInactivity(1000);

    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(connectTimeout)
            .setSocketTimeout(socketTimeout)
            .build();

    // ... some code omitted
    return HttpClients.custom()
            .setConnectionManager(connManager)
            .evictExpiredConnections()
            .setDefaultRequestConfig(requestConfig)
            .build();
}

The stack shows that the connection is requested through the leaseConnection() method of PoolingHttpClientConnectionManager. Let's look at the source code in detail to see why no connection was obtained.

Finding the right source is easy: the call chain in the stack trace tells us exactly which classes, methods and line numbers were involved.

From the jstack log above:

org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)

Judging by the names, AbstractConnPool is an abstract connection pool class, and the thread is at line 380 inside getPoolEntryBlocking(), a blocking method that obtains a connection.

The source code:

private E getPoolEntryBlocking(
        final T route, final Object state,
        final long timeout, final TimeUnit tunit,
        final Future<E> future) throws IOException, InterruptedException, TimeoutException {

    Date deadline = null;
    // Deadline derived from the timeout for obtaining a connection
    if (timeout > 0) {
        deadline = new Date(System.currentTimeMillis() + tunit.toMillis(timeout));
    }
    this.lock.lock();
    try {
        final RouteSpecificPool<T, C, E> pool = getPool(route);
        for (;;) {
            // ... some source code omitted ...

            boolean success = false;
            try {
                if (future.isCancelled()) {
                    throw new InterruptedException("Operation interrupted");
                }
                // Queue the future (actually a Future<CPoolEntry>) on the pending linked list
                pool.queue(future);
                this.pending.add(future);
                if (deadline != null) {
                    success = this.condition.awaitUntil(deadline);
                } else {
                    // This is line 380 of the source code.
                    this.condition.await();
                    success = true;
                }
                if (future.isCancelled()) {
                    throw new InterruptedException("Operation interrupted");
                }
            } finally {
                // In case of 'success', we were woken up by the
                // connection pool and should now have a connection
                // waiting for us, or else we're shutting down.
                // Just continue in the loop, both cases are checked.
                pool.unqueue(future);
                this.pending.remove(future);
            }
            // check for spurious wakeup vs. timeout
            if (!success && (deadline != null && deadline.getTime() <= System.currentTimeMillis())) {
                break;
            }
        }
        throw new TimeoutException("Timeout waiting for connection");
    } finally {
        this.lock.unlock();
    }
}

Line 380 of the source calls the condition's await() method:

this.condition.await();

Here a Condition from java.util.concurrent (J.U.C.) is used to coordinate between threads. When await() is called, the current thread is added to the Condition's wait queue, which is a FIFO queue, and the lock held by the current thread is released. If the lock were not released, no other thread could acquire it, and the result could be a deadlock.

Source code of await() method:

public final void await() throws InterruptedException {
        if (Thread.interrupted())
                throw new InterruptedException();
        // Add the current thread to the Condition wait queue
        Node node = addConditionWaiter();
        // Fully release the lock held by the current thread
        int savedState = fullyRelease(node);
        int interruptMode = 0;
        // While the node is not yet on the AQS synchronization queue, park the current thread;
        // leave the loop once the node has been transferred to that queue
        while (!isOnSyncQueue(node)) {
                LockSupport.park(this);
                if ((interruptMode = checkInterruptWhileWaiting(node)) != 0)
                        break;
        }
        // After being woken up (for example by signal()), queue up to reacquire the lock
        if (acquireQueued(node, savedState) && interruptMode != THROW_IE)
                interruptMode = REINTERRUPT;
        if (node.nextWaiter != null) // clean up if cancelled
                unlinkCancelledWaiters();
        if (interruptMode != 0)
                reportInterruptAfterWait(interruptMode);
}

[Figure: the current thread joining the Condition wait queue]

When signal() or signalAll() is called on the Condition, the first node of the wait queue (or, for signalAll(), every node) is moved to the AQS synchronization queue, and the thread in that node is eventually woken up through LockSupport.

[Figure: a node moving from the Condition wait queue to the AQS synchronization queue]
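The behaviour described above, await() parking the thread and giving up the lock until a signal arrives, can be reproduced with the JDK alone. The following is a small sketch under stated assumptions, not HttpClient code: AwaitSignalDemo, connectionFreed and freed are names invented for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo: one thread awaits a condition, another signals it.
final class AwaitSignalDemo {

    /** Returns true if the waiter was parked on the condition, got signalled and then finished. */
    static boolean signalWakesWaiter() {
        ReentrantLock lock = new ReentrantLock();
        Condition connectionFreed = lock.newCondition();
        boolean[] freed = {false}; // guarded by lock

        Thread waiter = new Thread(() -> {
            lock.lock();
            try {
                while (!freed[0]) {
                    // Parks this thread AND releases the lock while it waits.
                    connectionFreed.awaitUninterruptibly();
                }
            } finally {
                lock.unlock();
            }
        }, "waiter");
        waiter.setDaemon(true);
        waiter.start();

        long giveUpAt = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        boolean signalled = false;
        while (!signalled && System.nanoTime() < giveUpAt) {
            lock.lock(); // succeeds even though the waiter is still inside await()
            try {
                if (lock.hasWaiters(connectionFreed)) {
                    freed[0] = true;
                    connectionFreed.signal(); // moves the waiter to the lock's sync queue
                    signalled = true;
                }
            } finally {
                lock.unlock();
            }
            if (!signalled) {
                LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(1));
            }
        }
        while (waiter.isAlive() && System.nanoTime() < giveUpAt) {
            LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(1));
        }
        return signalled && !waiter.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("waiter woken by signal: " + signalWakesWaiter());
    }
}
```

The main thread can only take the lock and call hasWaiters() because await() released it. If nobody ever called signal(), the waiter would stay parked forever, which is the WAITING (parking) state seen in the dump.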

The code that releases a connection is the release() method of the AbstractConnPool class.

Its source code is as follows:

@Override
public void release(final E entry, final boolean reusable) {
        this.lock.lock();
    try {
        if (this.leased.remove(entry)) {
                final RouteSpecificPool<T, C, E> pool = getPool(entry.getRoute());
                pool.free(entry, reusable);
                if (reusable && !this.isShutDown) {
                        this.available.addFirst(entry);
                } else {
                        entry.close();
                }
                onRelease(entry);
                Future<E> future = pool.nextPending();
                if (future != null) {
                        this.pending.remove(future);
                } else {
                        future = this.pending.poll();
                }
                if (future != null) {
                        this.condition.signalAll();
                }
        }
    } finally {
            this.lock.unlock();
    }
}

As we can see, releasing a connection calls this.condition.signalAll(), which wakes up all threads in the wait queue. Although all of them are woken up, only one thread at a time can acquire the lock; the others have to keep waiting for it.

signalAll() delegates to doSignalAll(), which calls transferForSignal() for each node:

private void doSignalAll(Node first) {
    lastWaiter = firstWaiter = null;
    do {
            Node next = first.nextWaiter;
            first.nextWaiter = null;
            // Signal notification
            transferForSignal(first);
            first = next;
    } while (first != null);
}

final boolean transferForSignal(Node node) {
    /*
     * Set waitStatus of node: condition - > 0
     */
    if (!compareAndSetWaitStatus(node, Node.CONDITION, 0))
        return false;

    // Append the node to the AQS synchronization queue so it can compete for the lock again;
    // p is its predecessor, whose status we try to set to SIGNAL.
    Node p = enq(node);
    int c = p.waitStatus;
    if (c > 0 || !compareAndSetWaitStatus(p, c, Node.SIGNAL))
        // Predecessor cancelled or CAS failed: wake the thread directly via LockSupport.unpark()
        LockSupport.unpark(node.thread);
    return true;
}

Looking back after analyzing the underlying code, the cause is clear: the thread called the condition's parameterless await() method, could never obtain an HTTP connection, and so kept occupying a Tomcat worker thread forever.

At the beginning of the getPoolEntryBlocking() method there is a piece of code that must not be overlooked:

Date deadline = null;
// Connection get timeout parameter
if (timeout > 0) {
    deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
}

At first glance this code handles a timeout. Our guess was that timeout is the time to wait when obtaining a connection from the connection pool.

Further down in getPoolEntryBlocking():

if (deadline != null) {
    success = this.condition.awaitUntil(deadline);
}

If deadline is not null, the condition's awaitUntil(deadline) method is called. With awaitUntil(deadline), a thread that has not been signalled by the time the deadline is reached wakes up on its own and rejoins the AQS synchronization queue to reacquire the lock.

We can follow the stack to the caller to see where the timeout behind deadline comes from.

at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)

Part of the source of MainClientExec#execute():

final HttpClientConnection managedConn;
try {
    final int timeout = config.getConnectionRequestTimeout();
    managedConn = connRequest.get(timeout > 0 ? timeout : 0, TimeUnit.MILLISECONDS);
} catch (final InterruptedException interrupted) {
    // ... handling omitted
}

The timeout here, connectionRequestTimeout, is exactly the value used to compute deadline.
This confirms our guess.

When our HttpClient utility was initialized, the connectionRequestTimeout parameter was not configured. This parameter is critical: if it is not set and a thread suspended by park is never woken by a signal, the thread waits forever.

So this parameter must be set. deadline is an absolute point in time. If it is not null, the condition's awaitUntil(deadline) method is called, and even without a signal the thread wakes up on its own and competes for the lock again, instead of staying blocked forever.

Incidentally, awaitUntil(deadline) is designed much like awaitNanos(long nanosTimeout), which also works against a deadline.

Once the configured timeout has been reached without a signal, the success variable ends up false. The code breaks out of the loop and throws TimeoutException("Timeout waiting for connection").

When this exception is thrown, the error log clearly shows that no connection could be obtained, and the thread is no longer occupied forever.
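The difference between the two branches is easy to reproduce with a toy pool built only on the JDK. The sketch below is not HttpClient's implementation: TinyPool, lease() and release() are invented names, and the logic is reduced to the deadline handling discussed above. In HttpClient 4.x itself the corresponding setting is RequestConfig.Builder#setConnectionRequestTimeout.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Date;
import java.util.Deque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical miniature pool illustrating await() versus awaitUntil(deadline).
final class TinyPool<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private final Deque<T> available = new ArrayDeque<>();

    TinyPool(Iterable<T> items) {
        for (T item : items) {
            available.add(item);
        }
    }

    /** Leases an item. A timeout <= 0 means "wait forever", like an unset connectionRequestTimeout. */
    T lease(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        Date deadline = timeout > 0
                ? new Date(System.currentTimeMillis() + unit.toMillis(timeout))
                : null;
        lock.lock();
        try {
            while (available.isEmpty()) {
                if (deadline == null) {
                    released.await(); // may park forever if nothing is ever released
                } else if (!released.awaitUntil(deadline) && available.isEmpty()) {
                    // awaitUntil() returned false: the deadline passed without a usable item
                    throw new TimeoutException("Timeout waiting for connection");
                }
            }
            return available.poll();
        } finally {
            lock.unlock();
        }
    }

    void release(T item) {
        lock.lock();
        try {
            available.addFirst(item);
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Usage sketch: one pooled item, two leases; the second gives up after 50 ms. */
    static String demo() {
        TinyPool<String> pool = new TinyPool<>(Collections.singletonList("conn-1"));
        try {
            String first = pool.lease(50, TimeUnit.MILLISECONDS);
            pool.lease(50, TimeUnit.MILLISECONDS); // pool is empty now, so this times out
            return "unexpected: second lease succeeded after " + first;
        } catch (TimeoutException e) {
            return e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a positive timeout the second caller fails fast with a clear message. Had it passed 0, it would have parked in await() until some other thread called release(), just like the Tomcat threads in the dump.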

5. Finding the "culprit" in the stack

The previous section went from the first stack in the dump all the way down to the source code of the Condition concurrency primitive.
But we are not done yet. So far we have only analyzed the WAITING (parking) state, and it is not necessarily the root cause. Next we analyze the other "abnormal" thread state, WAITING (on object monitor).

The second key stack in the Java dump:

"http-nio-8970-exec-462" #24297 daemon prio=5 os_prio=0 tid=0x00007f45b41bd000 nid=0x5f0b in Object.wait() [0x00007f446befa000]
 java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at java.lang.Object.wait(Object.java:502)
            at java.net.InetAddress.checkLookupTable(InetAddress.java:1393)
            - locked <0x00000006c05a5570> (a java.util.HashMap)
            at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1310)
            at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
            at java.net.InetAddress.getAllByName(InetAddress.java:1192)
            at java.net.InetAddress.getAllByName(InetAddress.java:1126)
            at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
            at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
            at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
            at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
            at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:660)
            at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:629)
            at org.springframework.web.client.RestTemplate.getForEntity(RestTemplate.java:329)
            at com.xxx.tvproxy.common.util.HttpClientUtil.getForEntity(HttpClientUtil.java:267)
            at com.xxx.tvproxy.common.util.HttpClientUtil.getForObject(HttpClientUtil.java:521)

java.lang.Thread.State: WAITING (on object monitor) is another thread state that deserves special attention: the thread is waiting on an object monitor and stays blocked.

Judging by the thread stack, we first guessed this was related to the HttpClient parameter settings, so we reviewed the parameters used to create the client.
Then, looking at the top of the stack, we saw a call to Object.wait(), which means the thread is waiting for another thread to notify it. That notification never arrived, so the current thread stayed in the WAITING state.

Next, find out who made the call:

at java.net.InetAddress.checkLookupTable(InetAddress.java:1393)

The corresponding source code is:

private static InetAddress[] checkLookupTable(String host) {
    synchronized (lookupTable) {
        // If the host isn't in the lookupTable, add it in the
        // lookuptable and return null. The caller should do
        // the lookup.
        if (lookupTable.containsKey(host) == false) {
                lookupTable.put(host, null);
                return null;
        }

        // If the host is in the lookupTable, it means that another
        // thread is trying to look up the addresses of this host.
        // This thread should wait.
        while (lookupTable.containsKey(host)) {
                try {
                        // Corresponds to java.net.InetAddress.checkLookupTable(InetAddress.java:1393) in the stack
                        lookupTable.wait();
                } catch (InterruptedException e) {
                }
        }
    }

    // The other thread has finished looking up the addresses of
    // the host. This thread should retry to get the addresses
    // from the addressCache. If it doesn't get the addresses from
    // the cache, it will try to look up the addresses itself.
    InetAddress[] addresses = getCachedAddresses(host);
    if (addresses == null) {
        synchronized (lookupTable) {
                lookupTable.put(host, null);
                return null;
        }
    }

    return addresses;
}

The culprit here is the lookupTable object: the code enters a synchronized block on it and calls lookupTable.wait(), and the thread stays blocked because the notification never comes.
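The pattern in checkLookupTable() is "the first caller does the lookup, everyone else waits until it is done". The sketch below models only that pattern; it is not the JDK code. LookupTableSketch, Outcome and lookupFinished() are invented names, and unlike the JDK method this version gives up after maxWaitMillis so that the effect of a lookup that never finishes can be observed without hanging.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "one thread resolves, the others wait" pattern.
final class LookupTableSketch {
    enum Outcome { DO_LOOKUP, OTHER_THREAD_FINISHED, STILL_WAITING }

    private final Set<String> inProgress = new HashSet<>();

    /** First caller for a host is told to do the lookup; later callers wait for it to finish. */
    Outcome checkLookupTable(String host, long maxWaitMillis) {
        synchronized (inProgress) {
            if (inProgress.add(host)) {
                return Outcome.DO_LOOKUP;
            }
            long deadline = System.currentTimeMillis() + maxWaitMillis;
            while (inProgress.contains(host)) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    // Assumption of this sketch only: the code shown above has no such exit.
                    return Outcome.STILL_WAITING;
                }
                try {
                    inProgress.wait(remaining); // releases the monitor while waiting
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Outcome.STILL_WAITING;
                }
            }
            return Outcome.OTHER_THREAD_FINISHED;
        }
    }

    /** Must be called by the thread that did the lookup; wakes up all waiters. */
    void lookupFinished(String host) {
        synchronized (inProgress) {
            inProgress.remove(host);
            inProgress.notifyAll();
        }
    }

    public static void main(String[] args) {
        LookupTableSketch table = new LookupTableSketch();
        System.out.println(table.checkLookupTable("example.invalid", 100)); // first caller
        System.out.println(table.checkLookupTable("example.invalid", 100)); // lookup never finished
        table.lookupFinished("example.invalid");
        System.out.println(table.checkLookupTable("example.invalid", 100)); // free again
    }
}
```

If the thread that was told to do the lookup hangs and lookupFinished() is never called, every later caller for the same host waits. With an unbounded wait(), as in the source shown above, they wait forever.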

You can't see any problems in the code troubleshooting, because it has little to do with the application itself. It's the JVM thread deadlock caused by IPV6.

Refer to foreign zimbra site wiki: https://wiki.zimbra.com/wiki/...

Here's why:

The application itself is in IPv4 environment. If IPv6 is tried to be used, it will cause some known problems.

When Inet6AddressImpl.lookupAllHostAddr() method is called, because there is a bug between Java and the operating system libc library, when a specific race condition occurs, the search for host address action will be endless. The frequency of this situation is very low, but once it happens, it will lead to the JVM deadlock problem, which will cause all threads in the JVM to be blocked.

Following this analysis, a third key thread can be found in the jstack output:

java.lang.Thread.State: RUNNABLE
   at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
   at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
   at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
   at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
   at java.net.InetAddress.getAllByName(InetAddress.java:1192)
   at java.net.InetAddress.getAllByName(InetAddress.java:1126)
   at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
   at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
   at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
   at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
   at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
   at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
   at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
   at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
   at
   ... ...

How to determine if IPv6 is enabled in the operating system?

Here are two ways:

1) ifconfig

If the output contains inet6 addr entries, IPv6 is enabled.

2) lsmod

[root@BJ]# lsmod | grep ipv6
Module                  Size  Used by
ipv6                  335951  73 bridge

Look at the count in the Used by column. Here it is 73 (that is, 70+), which means IPv6 is in use. In an environment that does not use IPv6, the count is 1.
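The same question can also be asked from inside the JVM, which is the view that matters for this fault. The following is a minimal sketch using only java.net.NetworkInterface. The class name Ipv6Probe and the method localIpv6Addresses are made up for illustration, and treating an enumeration failure as "nothing found" is a simplifying assumption.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Lists the IPv6 addresses the JVM can see on local interfaces (illustrative helper). */
class Ipv6Probe {
    /** Returns "interface address" strings for every IPv6 address found; may be empty. */
    static List<String> localIpv6Addresses() {
        List<String> found = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return found; // no interfaces reported
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (addr instanceof Inet6Address) {
                        found.add(nic.getName() + " " + addr.getHostAddress());
                    }
                }
            }
        } catch (SocketException e) {
            // Assumption for this sketch: an enumeration failure is reported as "nothing found".
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> v6 = localIpv6Addresses();
        System.out.println(v6.isEmpty() ? "no IPv6 addresses visible" : "IPv6 addresses: " + v6);
    }
}
```

An empty result only says that the JVM sees no IPv6 address on any interface. It is a complement to ifconfig and lsmod, not a replacement for them.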

6. Summary of problem optimization scheme

Analyzing the key thread states in the Java stack made the causes of the problem clear. The solutions follow.

First problem:

When no connection can be obtained from the HTTP connection pool, the requesting thread can block indefinitely.

In the HttpClient initialization parameters, set connectionRequestTimeout, the timeout for obtaining a connection from the pool. It should generally not be too large. We set it to 500 ms.

Once it is set, the underlying Condition#awaitUntil(deadline) method is called. If the thread is not woken by a signal before the absolute deadline is reached, it wakes up on its own, leaves the wait queue and joins the AQS sync queue to compete for the lock.
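The effect of the deadline can be seen with the plain JDK classes, without HttpClient. This is only a sketch of the mechanism, not the pool's actual code: DeadlineWaitDemo, waitForConnection and the condition name connectionAvailable are hypothetical, and nobody ever signals the condition, which imitates a pool whose connections are never returned.

```java
import java.util.Date;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Timed wait on a Condition; illustrative stand-in for "wait for a pooled connection". */
class DeadlineWaitDemo {
    private static final ReentrantLock lock = new ReentrantLock();
    private static final Condition connectionAvailable = lock.newCondition();

    /**
     * Waits for a signal until an absolute deadline.
     * Returns true if woken before the deadline, false if the deadline elapsed.
     */
    static boolean waitForConnection(long timeoutMillis) {
        Date deadline = new Date(System.currentTimeMillis() + timeoutMillis);
        lock.lock();
        try {
            // Unlike await(), this returns on its own once the deadline has passed.
            return connectionAvailable.awaitUntil(deadline);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean gotOne = waitForConnection(500); // same 500 ms budget as in the article
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("connection obtained: " + gotOne + ", waited about " + waitedMs + " ms");
    }
}
```

With an untimed await() the same thread would stay in WAITING (parking) until someone signals it, which is the state the blocked Tomcat threads were in.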

Second problem:

There are two solutions to the JVM deadlock caused by IPv6:

1) Disable IPv6 at the operating system level

Edit the /etc/sysctl.conf file
and add the following two lines:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Save the file, then run sysctl -p to make the change take effect.

Alternatively, run the following commands to apply the settings to the running system immediately:

sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1

2) Java application level

Add -Djava.net.preferIPv4Stack=true to the application's JVM startup parameters.

Disabling IPv6 at the operating system level can affect other applications deployed on the same server, so keep an eye on them after the change. If you run into problems, a search engine will usually help.

Many of our servers are maintained by the operations team, so I chose the second method: adding a parameter directly to the JVM is simple and convenient.
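After a restart it is worth confirming that the flag actually reached the JVM. Below is a minimal sketch of such a startup check. The class Ipv4StackCheck and its methods are hypothetical, and resolving localhost is just a convenient probe, not something the original setup is known to do.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Startup sanity check for -Djava.net.preferIPv4Stack=true (illustrative helper). */
class Ipv4StackCheck {
    /** True when the system property java.net.preferIPv4Stack is set to "true". */
    static boolean prefersIpv4Stack() {
        return Boolean.getBoolean("java.net.preferIPv4Stack");
    }

    /** Counts the IPv6 entries among a set of resolved addresses. */
    static int countIpv6(InetAddress[] addresses) {
        int count = 0;
        for (InetAddress address : addresses) {
            if (address instanceof Inet6Address) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println("java.net.preferIPv4Stack = " + prefersIpv4Stack());
        InetAddress[] resolved = InetAddress.getAllByName("localhost");
        System.out.println("localhost resolves to " + resolved.length
                + " address(es), " + countIpv6(resolved) + " of them IPv6");
    }
}
```

Run it with the same startup parameters as the application, for example java -Djava.net.preferIPv4Stack=true Ipv4StackCheck, and compare the output with a run without the flag.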

The final conclusion:

The Java stack log contains two key WAITING thread states. WAITING (on object monitor) appears first: because of the IPv6 problem, all threads in the HttpClient pool are blocked. WAITING (parking) appears afterwards: Tomcat threads receive the forwarded requests, and when a request reaches HttpClient, a large number of threads block because no HTTP connection can be obtained and no timeout was set for obtaining one.

After both problems were fixed, we observed the service online for a long time, including under higher traffic than when the fault occurred, and the JVM thread blocking did not recur.
Network command-line statistics also no longer show a large number of connections in the CLOSE_WAIT state.


Keywords: Java Apache network jvm

Added by UnknownPlayer on Sat, 19 Oct 2019 11:03:00 +0300