Database idle connection disconnection caused by firewall

catalog

1, Problem description

A new project of the company is online and in the trial operation stage. Although the project is accessible from the external network, it is deployed in the DMZ area, but only a few employees' addresses and accounts (including some leaders) are given in the trial operation stage, so the number of visits is very small, but the project is still very important.

During the trial operation stage, exceptions will be reported irregularly in the application log of the project, especially when it is used at the beginning of the morning and it is used again after a period of idle time. The specific exceptions are as follows:

ERROR [com.alibaba.druid.util.JdbcUtils] - close connection error
java.sql.SQLRecoverableException: IO Error: Broken pipe
    at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:556)
    at oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:3984)
    at com.alibaba.druid.filter.FilterChainImpl.connection_close(FilterChainImpl.java:167)
    at com.alibaba.druid.filter.stat.StatFilter.connection_close(StatFilter.java:254)
    at com.alibaba.druid.filter.FilterChainImpl.connection_close(FilterChainImpl.java:163)
    at com.alibaba.druid.proxy.jdbc.ConnectionProxyImpl.close(ConnectionProxyImpl.java:115)
    at com.alibaba.druid.util.JdbcUtils.close(JdbcUtils.java:79)
    at com.alibaba.druid.pool.DruidDataSource.discardConnection(DruidDataSource.java:965)
    at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:932)
    at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:4534)
    at com.alibaba.druid.filter.stat.StatFilter.dataSource_getConnection(StatFilter.java:661)
    at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:4530)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:884)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:876)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:92)
    at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:205)
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:420)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:257)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
    at org.springframework.aop.framework.CglibAopProxyDynamicAdvisedInterceptor.intercept(CglibAopProxy.java:644)atxxx.xx.modules.deposit.api.service.DepositApiServiceEnhancerBySpringCGLIB

$59c8f6e2.doRecharge()
    at xxx.xx.modules.deposit.FundDepositController.rechargeConfirm(FundDepositController.java:125)
......

Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at oracle.net.ns.DataPacket.send(DataPacket.java:210)
    at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:230)
    at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:312)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:260)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:185)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:102)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)
    at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290)
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
    at oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)
    at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:543)
    ... 69 more

From the exception information, we can see that the problem is that when Druid database connection pool closes the physical database connection, it reports SocketException: Broken pipe. But why does Druid close the database connection when it is in use, and why does Druid report SocketException when it closes the data connection? How does this anomaly affect the system? Next step by step analysis.

2, Step by step analysis of problems

1, java.net.SocketException : how is the broken pipe exception generated? What's the impact?

In the project, the Druid is used to connect to the database. Why do you try to close the database connection after the system is idle for a period of time, and when you close it, you still throw it away java.net.SocketException What about broken pipe?
From the exception stack information, or look at the Druid source code, we can know that an exception occurs when a connection is obtained from the database connection pool for subsequent database operations DruidDataSource.getConnectionDirect(maxWaitMillis) method has the following logic:

public DruidPooledConnection getConnectionDirect(long maxWaitMillis) throws SQLException {
    //loop
    for (;;) {
        //Get a connection from the connection pool in maxWaitMillis time
        DruidPooledConnection poolalbeConnection = getConnectionInternal(maxWaitMillis);

        //testOnBorrow is true, that is, the connection needs to be checked after getting the connection from the pool
        if (isTestOnBorrow()) {
            boolean validate = testConnectionInternal(poolalbeConnection.getConnection());
            if (!validate) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("skip not validate connection.");
                }

                Connection realConnection = poolalbeConnection.getConnection();
                discardConnection(realConnection);
                continue;
            }
        } 
        else {
            Connection realConnection = poolalbeConnection.getConnection();
            //If the connection is closed, get another one from the pool
            if (realConnection.isClosed()) {
                discardConnection(null); // Pass in null to avoid repeated shutdown
                continue;
            }

            //testWhileIdle is true, i.e. the connection needs to be checked after idle
            if (isTestWhileIdle()) {
                //Connection idle time (current time - last ActiveTime)
                long idleMillis = System.currentTimeMillis()
                                  - poolalbeConnection.getConnectionHolder().getLastActiveTimeMillis();
                
                //Connection idle time > time between evolutionrunsmillis, check connection
                if (idleMillis >= this.getTimeBetweenEvictionRunsMillis()) {
                    boolean validate = testConnectionInternal(poolalbeConnection.getConnection());
                    
                    //Failed to check the connection, print the log, discard the connection and get another connection
                    if (!validate) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("skip not validate connection.");
                        }

                        discardConnection(realConnection);
                        continue;
                    }
                }
            }
        }

        //If connection timeout recycle is enabled
        if (isRemoveAbandoned()) {
            StackTraceElement[] stackTrace = Thread.currentThread().getStackTrace();
            poolalbeConnection.setConnectStackTrace(stackTrace);
            poolalbeConnection.setConnectedTimeNano(); //Set the current time to ConnectedTime
            poolalbeConnection.setTraceEnable(true);

            synchronized (activeConnections) {
                activeConnections.put(poolalbeConnection, PRESENT); //Put the connection in the activeConnections Map
            }
        }

        if (!this.isDefaultAutoCommit()) {
            poolalbeConnection.setAutoCommit(false);
        }

        return poolalbeConnection;
    }
}

In short, when getting a database connection from Druid, you can test. This code contains the logic of testOnBorrow and testWhileIdle. This project is in the configuration file

testOnBorrow = false
testWhileIdle = true
timeBetweenEvictionRunsMillis = 60000(60s)

Therefore, it can only be detected when the Connection is idle for 60s and then used again. In fact, it is to execute a SQL. If it fails to execute the SQL, it will call JdbcUtils.close(realConnection) to close the Connection, a SocketException exception is thrown when closing the Connection, but in fact, this exception will not have a great impact on the program that wants to get the Connection to execute SQL query, because jdbcu tils.close () the exception is caught in the method, and the log is printed without being thrown up

public static void close(Connection x) {
    if (x == null) {
        return;
    }
    try {
        x.close();
    } catch (Exception e) {
        LOG.debug("close connection error", e);
    }
}

that java.net.SocketException What does broken pipe mean?
In fact, the tcp connection established with the database is disconnected for some reasons, which leads to "pipe rupture". In general, the database connection pool will maintain a long connection with the database. When necessary, the process of establishing a connection will be omitted and used directly. Why are these idle connections disconnected? By whom?

2. Why is the database TCP connection broken?

At the beginning, I couldn't figure out. Was it because the Oracle database was disconnected? For some reasons, such as too many connections from the server to the database? Obviously not. This project is still in the trial operation stage, and there are not many people using it. In addition, we observe the connection pool monitoring of Druid. Generally, there are only a few connections established. Later, we learned that similar situations have happened to other project groups in the process of discussion with colleagues. What they have in common with this project is that the service is in the DMZ area, the external network is accessible, and the database is in the internal network, so we need to communicate with each other The database can only be accessed through the firewall. So I went to my colleagues who are responsible for maintaining the network and firewall to understand that the firewall had a TCP timeout, which is currently set to half an hour. The meaning is that for all TCP connections through the firewall, if there is no activity within half an hour, they will be removed by the firewall, which will cause the connection to be interrupted. When the connection is removed, no data is sent to both ends of the connection to inform that the connection has been removed.
The reason for the disconnection of database is found. Then this is a typical problem that an application and database will encounter in different networks when the connection needs to go through the firewall. How to make the application and database maintain a certain number of long connections even if they are idle is an urgent problem to be solved.

3. The effect of firewall cutting off database connection
  • Database session executing long SQL
    Before the connection is cut off, the Oracle session corresponding to the connection is executing a SQL that takes a very long time, such as a stored procedure, during which no data is output to the client. So when the SQL execution is completed and the result is returned to the client, if the TCP connection has been interrupted by the firewall, there will be an error obviously. If the connection is interrupted, the session will also be interrupted. However, the client does not know that it will always wait for the server to return results.
    If the client does not have a connection recovery mechanism for this kind of time-consuming SQL, the connection of the client will always be in a waiting state. If the client continues to execute this kind of time-consuming SQL, the client will accumulate more and more waiting connections.
    The removeAbandoned configuration and logic of Druid connection pool is to solve this connection recycling setting.
  • Database session idle
    Before the connection is cut off, the Oracle session is always idle. After the firewall is interrupted, when the client submits SQL to the Oracle server, because the TCP connection has been interrupted, the client detects the connection interruption, then the client will report the error ORA-03113/ORA-03114, and then the session is interrupted. But on the Oracle server side, the session is always waiting for client messages.
    For Druid, which has the detection mechanism of testOnBorrow and testWhileIdle, and the connection pool can be reestablished if the detection fails, the idle connection cut off by the firewall will be rebuilt continuously in the future, while on the database server side, the connection is more and more, that is, the number of sessions is more and more, even exceeding the maximum number of data connections.

3, Solutions

1. Adjust the cut-off time of firewall

This is a temporary solution, such as adjusting the connection timeout of firewall to 8 hours, which can avoid the disconnection of idle connection as much as possible, but cannot be completely avoided, because it is impossible to predict how long the connection will be idle. If your system is not always accessed, the connection will be disconnected sooner or later because it is idle, leading to some unexpected problems, and the connection will be enlarged Timeout is just a relief

2. tcp keepalive function

The keepalive of TCP is actually used to maintain the TCP connection. Its principle is simply that if a TCP connection does not have any activity within the specified time, it will send a detection packet to the opposite end of the connection to detect whether the opposite end of the connection still exists. If the opposite end still does not respond to the detection within a certain time, it will send the detection packet again. After several times of sending, it still does not ring Should, consider that the connection has failed, close the local connection.

tcp keepalive is not enabled by default. When developing programs, you can set tcp keepalive to true, so that the TCP connection will start detection if there is no data message transmission within a certain period of time. This time is generally specified by the operating system, which can be set in Linux system net.ipv4.tcp_keepalive_time to modify. The default is 7200 seconds, or 2 hours. Of course, this time can also be set for the current socket when programming, but it seems that only keepalive=true is set in the Socket API of Java, and TCP cannot be set_ keepalive_ Time, when tcp keepalive is set, the firewall will check that there is still data transmission in the connection and will not disconnect the connection as long as the time of sending the TCP detection packet is less than the connection timeout of the firewall.

The database tcp connection created by JDBC is not set with keepalive, which can be verified on the database client (i.e. application side) by netstat or ss command of Linux
Use the command netstat -ano or ss -ano, where the parameter o shows the timer timer, which can keep the connection alive when the connection is established
The netstat command is shown as: off (0.00/0/0) for tcp connections without keepalive enabled
ss command will not display timer for tcp connection without keepalive

3. DCD of Oracle Database

Oracle provides a mechanism similar to tcp keepalive, namely DCD (Dead Conneciton Detection). At $ORACLE_HOME/network/admin/sqlnet.ora Add the following line to the file:

sqlnet.expire_time=NNN

Here NNN is the number of minutes. When the session ID time exceeds the specified time, Oracle database will check whether the peer (i.e. client) of the session is still valid. Avoid that the client exits abnormally, which causes the session to exist all the time.
Similarly, if the time of DCD is shorter than that of firewall cutting off the idle connection, the connection can also be kept all the time

4. The program executes query from time to time

The above methods either use the keepalive feature of tcp connection, or use the idle connection detection of database side. Our program can also take the initiative to do this heartbeat detection

  • Druid database connection pool started from 1.0.28, added druid.keepAlive Property, default off on druid.keepAlive After that, when the connection pool is idle, if the number of minidles in the pool is less than the number of minidletimemillis, the keepalive operation will be executed
  • druid.validationQuery The specified query SQL is generally select * from dual. As long as the minevictable idletimemillis is set to be less than the firewall cut-off time, it can ensure that when the connection is idle, it will automatically do the security detection and will not be cut off by the firewall

Keywords: Java Database Druid Oracle

Added by Popcorn on Fri, 05 Jun 2020 10:51:02 +0300