Stepping on the pit: a bloody case caused by timeout of dubbo asynchronous call

background
The response time limit for new business requirements shall be controlled within 5s(md5 warehouse collision time limit of 1.2s, business flow time limit of 3.8s), which will directly lead to production accidents after online (Party A's father feedbacks that there are a lot of overtime).

environment
dubbo 2.6.0

problem analysis
For the timeout problem, we first think of two configurations related to timeliness. Among them, the business flow part operates stably for a long time, and there are almost no problems in timeliness control. md5 library collision timeliness is the newly accessed demand in this demand. Therefore, we focus on the timeliness control of md5 library collision part, and then the business flow part.

Troubleshooting

(ApiCommonAspect.java:204) requestId:539177628, product:67941, costL:6055ms
(ApiCommonAspect.java:204) requestId:539177695, product:67941, costL:5028ms
(ApiCommonAspect.java:204) requestId:539177767, product:67941, costL:6191ms
(ApiCommonAspect.java:204) requestId:539177769, product:67941, costL:5236ms
(ApiCommonAspect.java:204) requestId:539178024, product:67941, costL:6283ms
(ApiCommonAspect.java:204) requestId:539178073, product:67941, costL:6104ms
(ApiCommonAspect.java:204) requestId:539178098, product:67941, costL:5997ms
(ApiCommonAspect.java:204) requestId:539178116, product:67941, costL:6064ms
(ApiCommonAspect.java:204) requestId:539178170, product:67941, costL:6128ms

First pull all the request record IDs and find the response log records through the request id. it is found that the maximum timeout time is no more than 6.3 seconds. There seems to be a relationship between 1.2 and 1.3. HMM ~ keep going

When calling md5 asynchronously through RPC, I found the problem. The whole call took 2.4s and the call time doubled, but the clearly set time was 1.2s. Why did it double! Think carefully and fear!

public Object get(int timeout) throws RemotingException {
        if (timeout <= 0) {
            timeout = Constants.DEFAULT_TIMEOUT;
        }
        if (!isDone()) {
            long start = System.currentTimeMillis();
            lock.lock();
            try {
                while (!isDone()) {
                    done.await(timeout, TimeUnit.MILLISECONDS);
                    if (isDone() || System.currentTimeMillis() - start > timeout) {
                        break;
                    }
                }
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            } finally {
                lock.unlock();
            }
            if (!isDone()) {
                throw new TimeoutException(sent > 0, channel, getTimeoutMessage(false));
            }
        }
        return returnFromResponse();
    }

Finally, the answer is found in the default get method of dubbo. At the end of the first dubbo waiting time, a time judgment will be made with the current time. The default time unit is Ms. when the call ends or is greater than its waiting time, jump out of the cycle, otherwise continue to wait. There is a fatal problem in this logic, and 1ms is also a large unit in the computer system, The computer can do a lot of things in this 1ms time, so there is no 1ms when judging the time, that is, the judgment result is false, entering the second waiting, resulting in the doubling of the whole dubbo call time

Solution

private static class RemotingInvocationTimeoutScan implements Runnable {

        public final int DURATION = 30;

        private Object getFiledValue(Class clazz, DefaultFuture future, String property) throws NoSuchFieldException, IllegalAccessException {
            Field field = clazz.getDeclaredField(property);
            field.setAccessible(true);
            return field.get(future);
        }

        @Override
        @SuppressWarnings("squid:S2189")
        public void run() {
            while (true) {
                try {
                    Iterator<Map.Entry<DefaultFuture, Long>> iterator = FUTURES.entrySet().iterator();
                    while (iterator.hasNext()) {
                        Map.Entry<DefaultFuture, Long> entry = iterator.next();

                        DefaultFuture future = entry.getKey();
                        if (future == null || future.isDone()) {
                            //Call succeeded, release future
                            iterator.remove();
                            continue;
                        }
                        Long timeout = Long.valueOf(entry.getValue().toString());
                        if (timeout == null) {
                            logger.error("timeout is null by class:[{}] future:[{}]", clazz.getName(), JSON.toJSONString(future));
                            continue;
                        }
                        Long start = Long.valueOf(String.valueOf(getFiledValue(clazz, future, "start")));
                        if (start == null) {
                            logger.error("start is null by class:[{}] future:[{}]", clazz.getName(), JSON.toJSONString(future));
                            continue;
                        }
                        if (System.currentTimeMillis() - start >= timeout) {
                            Long id = Long.valueOf(String.valueOf(getFiledValue(clazz, future, "id")));
                            if (id == null) {
                                logger.error("id is null by class:[{}] future:[{}]", clazz.getName(), JSON.toJSONString(future));
                                continue;
                            }
                            Long sent = Long.valueOf(String.valueOf(getFiledValue(clazz, future, "sent").toString()));
                            String msg = String.valueOf(invokeMethod(clazz, future, "getTimeoutMessage"));
                            Channel channel = (Channel) getFiledValue(clazz, future, "channel");
                            if (channel == null) {
                                logger.error("channel is null by class:[{}] future:[{}]", clazz.getName(), JSON.toJSONString(future));
                                continue;
                            }
                            Response timeoutResponse = new Response(id);
                            timeoutResponse.setStatus(sent != null && sent > 0 ? Response.SERVER_TIMEOUT : Response.CLIENT_TIMEOUT);
                            timeoutResponse.setErrorMessage(msg);
                            if (!future.isDone()) {
                                iterator.remove();
                                DefaultFuture.received(channel, timeoutResponse);
                            }
                        }
                    }
                    Thread.sleep(DURATION);
                } catch (Throwable e) {
                    logger.error("Exception when scan the timeout invocation of remoting.", e);
                }
            }
        }

        private Object invokeMethod(Class clazz, DefaultFuture future, String property) throws NoSuchMethodException, InvocationTargetException, IllegalAccessException {
            Method method = clazz.getDeclaredMethod(property, boolean.class);
            method.setAccessible(true);
            return method.invoke(future, true);
        }
    }

The first solution is to construct an asynchronous thread adapter to inherit dubbo's FutureAdapter, and build a futureMap to store the future currently called asynchronously through Rpc. At the same time, open a dubbo scanning thread in this class to scan futureMap circularly to judge whether it timed out. If it timed out and the current state is incomplete, remove the dubbo thread

The second solution is to upgrade the dubbo Version (after evaluation, the impact is too large, give up...)

summary

Behind this production accident, there are three problems in the whole development process. 1 No in-depth understanding of dubbo, wanton development 2 Insufficient communication during the development process leads to repeated stepping on the pit 3 The test students did not include the timeout problem into the use case during the test

In the follow-up, consider connecting the request timeout problem to the nail alarm, actively find and solve it immediately.

Forgetting words
Forgetting words
Focus on Java. Share daily learning and development experience.
official account

Keywords: Java Dubbo Distribution Microservices

Added by Chesso on Sat, 19 Feb 2022 07:38:31 +0200