A hidden Dubbo pitfall: a production failure caused by generic invocation

Last month the company's zk cluster failed. Afterwards, every project team was required to check whether it used Dubbo programmatic/generic (generalized) calls, and was told to create consumers with @Reference instead. The specific cause was that a large number of requests from an online service hit zk in a short time and created 2.4 million+ nodes, which brought down the zk servers one after another, and multiple applications reported errors because they could not connect to zk. The underlying reason was that the provider was not started when the generic calls were made, so every request created a consumer node in zk.

The failure happened in a project team that has little to do with mine, so I did not know the details. I still wanted to understand what happened, so I ran the following experiments:

Experiment 1: generic invocation without caching

Dubbo generic invocation

public Result<Map> getProductGenericCache(ProductDTO dto) {
    ReferenceConfig<GenericService> reference = new ReferenceConfig<GenericService>();
    ApplicationConfig application = new ApplicationConfig();
    application.setName("pangu-client-consumer-generic");
    // Connection registry configuration
    RegistryConfig registry = new RegistryConfig();
    registry.setAddress("zookeeper://127.0.0.1:2181");
    // Service consumer default configuration
    ConsumerConfig consumer = new ConsumerConfig();
    consumer.setTimeout(5000);
    consumer.setRetries(0);

    reference.setApplication(application);
    reference.setRegistry(registry);
    reference.setConsumer(consumer);
    reference.setInterface(org.pangu.api.ProductService.class); // Weakly typed interface name
    //        reference.setVersion("");
    //        reference.setGroup("");
    reference.setGeneric(true); // Declared as a generalized interface
    GenericService svc = reference.get();
    Object target = svc.$invoke("findProduct", new String[]{ProductDTO.class.getName()}, new Object[]{dto});//In the actual gateway, the method name, parameter type and parameter are passed in as parameters
    return Result.success((Map)target);
}

Here the reference is not cached, so every request to this method creates a consumer node in zk (whether or not the provider is started). Under heavy traffic, the zk servers crash one after another. Anyone who has read the official documentation even briefly is unlikely to use generic invocation without caching. Moreover, the application feature that caused this failure was not newly released; it had been running for some time, and the number of zk nodes is monitored in production, so an uncached reference would have been noticed right after the first release. We can therefore basically rule out the possibility that the other team was not using the cache.

Experiment 2: generic invocation with caching

@Override
public Result<Map> getProductGenericCache(ProductDTO dto) {
    ReferenceConfigCache referenceCache = ReferenceConfigCache.getCache();

    // The reference must be cached. Otherwise each request initializes a new ReferenceConfig and registers a node in zk, which can eventually leave zk with too many nodes and hurt performance
    ReferenceConfig<GenericService> reference = new ReferenceConfig<GenericService>();
    ApplicationConfig application = new ApplicationConfig();
    application.setName("pangu-client-consumer-generic");
    // Connection registry configuration
    RegistryConfig registry = new RegistryConfig();
    registry.setAddress("zookeeper://127.0.0.1:2181");

    // Service consumer default configuration
    ConsumerConfig consumer = new ConsumerConfig();
    consumer.setTimeout(5000);
    consumer.setRetries(0);

    reference.setApplication(application);
    reference.setRegistry(registry);
    reference.setConsumer(consumer);
    reference.setInterface(org.pangu.api.ProductService.class); // Weakly typed interface name
    //        reference.setVersion("");
    //        reference.setGroup("");
    reference.setGeneric(true); // Declared as a generalized interface
    // Cached lookup: the cache's get method stores the ReferenceConfig and calls ReferenceConfig.get() to initialize it
    GenericService svc = referenceCache.get(reference);
    Object target = svc.$invoke("findProduct", new String[]{ProductDTO.class.getName()}, new Object[]{dto});//In the actual gateway, the method name, parameter type and parameter are passed in as parameters
    return Result.success((Map)target);
}

Whether or not the provider is started, only one consumer node is created in zk.
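To make the effect of the cache concrete, here is a minimal stand-alone sketch. It is not Dubbo's ReferenceConfigCache: the class `ReferenceCacheSketch`, its methods, and the `group/interface:version` key format are assumptions made for illustration. It only shows the general idea that repeated lookups for the same service reuse one initialized reference instead of building a new one per request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a reference cache; not Dubbo's ReferenceConfigCache. */
final class ReferenceCacheSketch {
    private final Map<String, Object> proxies = new ConcurrentHashMap<>();

    /** Builds the cache key; the group/interface:version format is an assumption. */
    static String key(String group, String iface, String version) {
        StringBuilder sb = new StringBuilder();
        if (group != null && !group.isEmpty()) {
            sb.append(group).append('/');
        }
        sb.append(iface);
        if (version != null && !version.isEmpty()) {
            sb.append(':').append(version);
        }
        return sb.toString();
    }

    /** Returns the cached proxy and runs the expensive factory only on a cache miss. */
    Object get(String key, Supplier<Object> factory) {
        return proxies.computeIfAbsent(key, k -> factory.get());
    }

    /** Issues the given number of lookups for one service and counts how many proxies were built. */
    static int countCreations(int requests) {
        ReferenceCacheSketch cache = new ReferenceCacheSketch();
        AtomicInteger created = new AtomicInteger();
        String k = key(null, "org.pangu.api.ProductService", null);
        for (int i = 0; i < requests; i++) {
            cache.get(k, () -> {
                created.incrementAndGet(); // stands in for "initialize and register in zk"
                return new Object();
            });
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(countCreations(1000));
    }
}
```

In this sketch the factory runs once no matter how many requests arrive, which matches the single consumer node observed in Experiment 2.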

Experiment 3: enable the service check with reference.setCheck(true)

Having ruled out the first two experiments, I looked at the Dubbo source code. Generic invocation goes through ReferenceConfig, so ReferenceConfig.get() is always executed. The code is as follows:

public synchronized T get() {
    if (destroyed) {
        throw new IllegalStateException("Already destroyed!");
    }
    if (ref == null) {
        init();
    }
    return ref;
}

If ref is null, init() is called. Where does ref come from? It is produced by createProxy during init. The createProxy code is as follows:

//com.alibaba.dubbo.config.ReferenceConfig.createProxy(Map<String, String>)
private T createProxy(Map<String, String> map) {
    // Earlier code omitted: it uses Protocol to create the Invoker
    // and creates a consumer node in zk

    Boolean c = check;
    if (c == null && consumer != null) {
        c = consumer.isCheck();
    }
    if (c == null) {
        c = true; // default true
    }
    if (c && !invoker.isAvailable()) {
        // make it possible for consumer to retry later if provider is temporarily unavailable
        initialized = false;
        throw new IllegalStateException("Failed to check the status of the service " + interfaceName + ". No provider available for the service " + (group == null ? "" : group + "/") + interfaceName + (version == null ? "" : ":" + version) + " from the url " + invoker.getUrl() + " to the consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion());
    }
    if (logger.isInfoEnabled()) {
        logger.info("Refer dubbo service " + interfaceClass.getName() + " from url " + invoker.getUrl());
    }
    // create service proxy
    return (T) proxyFactory.getProxy(invoker);
}

The logic is:

1. Create an Invoker using Protocol.

2. If check=false, use proxyFactory to create the proxy object for the Invoker, which becomes ref.

3. If check=true and the provider is not started, an IllegalStateException is thrown, so ref is still null. On the next call, because ref is null, init -> createProxy runs again and creates a consumer node in zk.
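The three steps above can be modelled in a few lines of plain Java. This is a simplified sketch, not Dubbo code: `FakeRegistry`, `FakeReference`, and `nodesAfter` are hypothetical names, and the model assumes that every registration attempt produces a distinct consumer node. It shows only the control flow: registration happens before the availability check, and a failed check leaves ref null so the next call starts over.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckRetrySketch {
    /** Hypothetical registry that records every consumer registration it receives. */
    static final class FakeRegistry {
        final List<String> consumerNodes = new ArrayList<>();
        boolean providerOnline;
    }

    /** Simplified model of the get() -> init() -> createProxy() path. */
    static final class FakeReference {
        private final FakeRegistry registry;
        private final boolean check;
        private Object ref;
        private int attempts;

        FakeReference(FakeRegistry registry, boolean check) {
            this.registry = registry;
            this.check = check;
        }

        synchronized Object get() {
            if (ref == null) {
                ref = createProxy(); // if this throws, ref stays null
            }
            return ref;
        }

        private Object createProxy() {
            attempts++;
            // Registration happens before the availability check.
            registry.consumerNodes.add("consumer-attempt-" + attempts);
            if (check && !registry.providerOnline) {
                throw new IllegalStateException("No provider available");
            }
            return new Object();
        }
    }

    /** Sends the given number of requests and returns how many consumer nodes were registered. */
    static int nodesAfter(int requests, boolean check, boolean providerOnline) {
        FakeRegistry registry = new FakeRegistry();
        registry.providerOnline = providerOnline;
        FakeReference reference = new FakeReference(registry, check);
        for (int i = 0; i < requests; i++) {
            try {
                reference.get();
            } catch (IllegalStateException expected) {
                // the caller sees an error; the registration is already done
            }
        }
        return registry.consumerNodes.size();
    }

    public static void main(String[] args) {
        System.out.println(nodesAfter(5, true, false));  // check=true, provider down
        System.out.println(nodesAfter(5, false, false)); // check=false, provider down
        System.out.println(nodesAfter(5, true, true));   // check=true, provider up
    }
}
```

Under these assumptions, only the combination of check=true and a missing provider registers once per request; the other two cases register once in total.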

So how does Dubbo check whether the service is alive? It executes RegistryDirectory.isAvailable(), which checks whether RegistryDirectory.urlInvokerMap is empty. If it is empty, the provider does not exist.

PS: RegistryDirectory.urlInvokerMap caches the set of Invokers.

With the problem roughly understood, I tested it by setting check=true:

@Override
public Result<Map> getProductGenericCache(ProductDTO dto) {
    ReferenceConfigCache referenceCache = ReferenceConfigCache.getCache();

    // The reference must be cached. Otherwise each request initializes a new ReferenceConfig and registers a node in zk, which can eventually leave zk with too many nodes and hurt performance
    ReferenceConfig<GenericService> reference = new ReferenceConfig<GenericService>();
    ApplicationConfig application = new ApplicationConfig();
    application.setName("pangu-client-consumer-generic");
    // Connection registry configuration
    RegistryConfig registry = new RegistryConfig();
    registry.setAddress("zookeeper://127.0.0.1:2181");

    // Service consumer default configuration
    ConsumerConfig consumer = new ConsumerConfig();
    consumer.setTimeout(5000);
    consumer.setRetries(0);

    reference.setApplication(application);
    reference.setRegistry(registry);
    reference.setConsumer(consumer);
    reference.setCheck(true); // Experiment 3: enable the provider availability check
    reference.setInterface(org.pangu.api.ProductService.class); // Weakly typed interface name
    //        reference.setVersion("");
    //        reference.setGroup("");
    reference.setGeneric(true); // Declared as a generalized interface
    // Cached lookup: the cache's get method stores the ReferenceConfig and calls ReferenceConfig.get() to initialize it
    GenericService svc = referenceCache.get(reference);
    Object target = svc.$invoke("findProduct", new String[]{ProductDTO.class.getName()}, new Object[]{dto});//In the actual gateway, the method name, parameter type and parameter are passed in as parameters
    return Result.success((Map)target);
}

Verification 1: start the provider first, then start the generic consumer and request the generic method. Only one consumer node is registered in zk. Then stop the provider and request the generic method again: the number of nodes in zk does not change. Why? Because the ref of ReferenceConfig was already created as a proxy when the reference was first initialized (the provider existed at that point, so the check=true check passed). get() returns the existing ref, so createProxy is not run again and no new node is created.

Verification 2: without starting the provider, start the generic consumer directly and request the generic method. Every request creates a consumer node in zk. This reproduces the failure.

In this case, why does every request create a consumer node in zk? What is the root cause?

private T createProxy(Map<String, String> map) {
    //Ignore other codes

    if (isJvmRefer) {
    //Ignore other codes
    } else {
        if (url != null && url.length() > 0) { 
            //Ignore other codes
        } else { // assemble URL from register center's configuration
            List<URL> us = loadRegistries(false);//Code @ 1
            if (us != null && !us.isEmpty()) {
                for (URL u : us) {
                    URL monitorUrl = loadMonitor(u);
                    if (monitorUrl != null) {
                        map.put(Constants.MONITOR_KEY, URL.encode(monitorUrl.toFullString()));
                    }
                    urls.add(u.addParameterAndEncoded(Constants.REFER_KEY, StringUtils.toQueryString(map)));//Code @ 2
                }
            }
            if (urls.isEmpty()) {
                throw new IllegalStateException("No such any registry to reference " + interfaceName + " on the consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion() + ", please config <dubbo:registry address=\"...\" /> to your spring config.");
            }
        }

        if (urls.size() == 1) {
            invoker = refprotocol.refer(interfaceClass, urls.get(0));//Code @ 3
        } else {
            List<Invoker<?>> invokers = new ArrayList<Invoker<?>>();
            URL registryURL = null;
            for (URL url : urls) {//Code @ 4
                invokers.add(refprotocol.refer(interfaceClass, url));
                if (Constants.REGISTRY_PROTOCOL.equals(url.getProtocol())) {
                    registryURL = url; // use last registry url
                }
            }
            if (registryURL != null) { // registry url is available
                // use AvailableCluster only when register's cluster is available
                URL u = registryURL.addParameterIfAbsent(Constants.CLUSTER_KEY, AvailableCluster.NAME);
                invoker = cluster.join(new StaticDirectory(u, invokers));
            } else { // not a registry url
                invoker = cluster.join(new StaticDirectory(invokers));
            }
        }
    }

    Boolean c = check;
    if (c == null && consumer != null) {
        c = consumer.isCheck();
    }
    if (c == null) {
        c = true; // default true
    }
    if (c && !invoker.isAvailable()) {//check=true, the provider service does not exist, and an exception is thrown
        // make it possible for consumer to retry later if provider is temporarily unavailable
        initialized = false;
        throw new IllegalStateException("Failed to check the status of the service " + interfaceName + ". No provider available for the service " + (group == null ? "" : group + "/") + interfaceName + (version == null ? "" : ":" + version) + " from the url " + invoker.getUrl() + " to the consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion());
    }
    if (logger.isInfoEnabled()) {
        logger.info("Refer dubbo service " + interfaceClass.getName() + " from url " + invoker.getUrl());
    }
    // create service proxy
    return (T) proxyFactory.getProxy(invoker);
}

1. On the first request to the generic method, the ref of ReferenceConfig is null, so createProxy is executed and runs Code @ 1, @ 2, and @ 3. A consumer node is created in zk. However, because check=true, an IllegalStateException is thrown, and the ref of ReferenceConfig is still null.

2. On the second request, the ReferenceConfig has been cached, so the ReferenceConfig object used is the one from the first request. Its ref is still null, so createProxy is executed again and runs Code @ 1, @ 2, and @ 4, creating another consumer node in zk. Because check=true, an IllegalStateException is thrown again, and the ref of ReferenceConfig is still null.

3. The third and subsequent requests behave the same as the second request.

The only way a consumer node can be created in zk on every attempt is if the subscription URL is different each time. If the URLs were identical, no new node would be created. So which part of a service's subscription URL changes? Looking at ReferenceConfig.init(), the subscription URL carries a timestamp parameter set to the current time. This explains why every attempt registers again: the subscription URL is different every time, as shown in the following figure.

Is it reasonable to put this timestamp in the subscription URL? According to the official project, the timestamp was removed from the subscription URL in version 2.7.5, so only one URL is subscribed.
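The role of the timestamp can be illustrated without Dubbo or zk. The sketch below is an assumption-laden model: `TimestampUrlSketch`, the URL layout, the host `10.0.0.1`, and the fake clock are all invented for illustration, and a `HashSet` stands in for zk's behaviour of keeping one node per distinct path. A deterministic fake clock is used because real millisecond timestamps could collide in a tight loop.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;

final class TimestampUrlSketch {
    /** Builds a hypothetical consumer URL; the timestamp parameter is optional. */
    static String consumerUrl(String iface, Long timestamp) {
        String url = "consumer://10.0.0.1/" + iface + "?generic=true&interface=" + iface;
        return timestamp == null ? url : url + "&timestamp=" + timestamp;
    }

    /** Registers the given number of URLs in a set and returns how many distinct nodes result. */
    static int distinctNodes(int attempts, boolean withTimestamp) {
        Set<String> zkNodes = new HashSet<>();      // stand-in for zk: one node per distinct URL
        long[] now = {1_644_260_000_000L};          // deterministic fake clock (milliseconds)
        LongSupplier clock = () -> now[0]++;
        for (int i = 0; i < attempts; i++) {
            Long ts = withTimestamp ? Long.valueOf(clock.getAsLong()) : null;
            zkNodes.add(consumerUrl("org.pangu.api.ProductService", ts));
        }
        return zkNodes.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctNodes(100, true));  // one node per attempt
        System.out.println(distinctNodes(100, false)); // a single node
    }
}
```

With the timestamp, 100 attempts leave 100 distinct entries; without it, they collapse into one.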

The following figure shows the time of the failure. Analysis of the ZK dump showed that the ZK directory held 1.7 million nodes at that time, compared with about 100,000 normally.

Impact of check=true in Dubbo consumer generic invocation on the application side

private T createProxy(Map<String, String> map) {
    //Ignore other codes

    if (isJvmRefer) {
    //Ignore other codes
    } else {
        if (url != null && url.length() > 0) { 
            //Ignore other codes
        } else { // assemble URL from register center's configuration
            List<URL> us = loadRegistries(false);//Code @ 1
            if (us != null && !us.isEmpty()) {
                for (URL u : us) {
                    URL monitorUrl = loadMonitor(u);
                    if (monitorUrl != null) {
                        map.put(Constants.MONITOR_KEY, URL.encode(monitorUrl.toFullString()));
                    }
                    urls.add(u.addParameterAndEncoded(Constants.REFER_KEY, StringUtils.toQueryString(map)));//Code @ 2
                }
            }
            if (urls.isEmpty()) {
                throw new IllegalStateException("No such any registry to reference " + interfaceName + " on the consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion() + ", please config <dubbo:registry address=\"...\" /> to your spring config.");
            }
        }

        if (urls.size() == 1) {
            invoker = refprotocol.refer(interfaceClass, urls.get(0));//Code @ 3
        } else {
            List<Invoker<?>> invokers = new ArrayList<Invoker<?>>();
            URL registryURL = null;
            for (URL url : urls) {//Code @ 4
                invokers.add(refprotocol.refer(interfaceClass, url));
                if (Constants.REGISTRY_PROTOCOL.equals(url.getProtocol())) {
                    registryURL = url; // use last registry url
                }
            }
            if (registryURL != null) { // registry url is available
                // use AvailableCluster only when register's cluster is available
                URL u = registryURL.addParameterIfAbsent(Constants.CLUSTER_KEY, AvailableCluster.NAME);
                invoker = cluster.join(new StaticDirectory(u, invokers));
            } else { // not a registry url
                invoker = cluster.join(new StaticDirectory(invokers));
            }
        }
    }

    Boolean c = check;
    if (c == null && consumer != null) {
        c = consumer.isCheck();
    }
    if (c == null) {
        c = true; // default true
    }
    if (c && !invoker.isAvailable()) {//check=true, the provider service does not exist, and an exception is thrown
        // make it possible for consumer to retry later if provider is temporarily unavailable
        initialized = false;
        throw new IllegalStateException("Failed to check the status of the service " + interfaceName + ". No provider available for the service " + (group == null ? "" : group + "/") + interfaceName + (version == null ? "" : ":" + version) + " from the url " + invoker.getUrl() + " to the consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion());
    }
    if (logger.isInfoEnabled()) {
        logger.info("Refer dubbo service " + interfaceClass.getName() + " from url " + invoker.getUrl());
    }
    // create service proxy
    return (T) proxyFactory.getProxy(invoker);
}

1. On the first request to the generic method, the ref of ReferenceConfig is null, so createProxy is executed and runs Code @ 1, @ 2, and @ 3. A consumer node is created in zk. However, because check=true, an IllegalStateException is thrown, and the ref of ReferenceConfig is still null. The timestamped URL is added to the ReferenceConfig.urls collection, and one RegistryDirectory is created.

2. On the second request, the ReferenceConfig has been cached, so the ReferenceConfig object used is the one from the first request. Its ref is still null, so createProxy is executed again and runs Code @ 1, @ 2, and @ 4, creating another consumer node in zk. Because check=true, an IllegalStateException is thrown again, and ref is still null. ReferenceConfig.urls now holds two URLs, so the loop traverses urls and executes refprotocol.refer(interfaceClass, url) for each one, creating two RegistryDirectory objects.

3. The third request is basically the same as the second, except that ReferenceConfig.urls now holds three URLs, so traversing urls and executing refprotocol.refer(interfaceClass, url) creates three RegistryDirectory objects.

And so on. After the nth request, the total number of RegistryDirectory objects created is 1 + 2 + 3 + ... + n = n(n+1)/2. Therefore, when Dubbo generic invocation is configured with check=true and the provider is unavailable, it eventually leads both to the zk failure and to an OOM in the local application.
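The quadratic growth can be checked with a small simulation. This is only a counting model under the assumptions described above (one new URL appended per failed request, one RegistryDirectory created per URL on each traversal); `DirectoryGrowthSketch` and `totalDirectories` are hypothetical names, and no Dubbo classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

final class DirectoryGrowthSketch {
    /** Counts the directories created over the given number of failed requests against one cached reference. */
    static long totalDirectories(int requests) {
        List<String> urls = new ArrayList<>(); // models the urls field that is never cleared
        long created = 0;
        for (int i = 1; i <= requests; i++) {
            // each failed attempt appends one more timestamped registry URL
            urls.add("registry://127.0.0.1:2181?timestamp=" + i);
            // one refer() call, and so one directory, per accumulated URL
            created += urls.size();
        }
        return created;
    }

    public static void main(String[] args) {
        System.out.println(totalDirectories(4));
        System.out.println(totalDirectories(1000));
    }
}
```

For 4 requests the model yields 1 + 2 + 3 + 4 = 10 objects, and for 1000 requests it yields 500,500, which gives a sense of how quickly the heap fills under sustained traffic.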

I used this test to investigate the OOM problem and to learn how to analyze a dump.

JMeter configuration

The details are in the Pangu client parent project.

The results are as follows


Added by erikjan on Mon, 07 Feb 2022 21:19:07 +0200