Interview challenge: how does Netty solve the Selector empty-polling bug? (illustrated, easy to understand, the most complete ever)


Netty's strategy for solving the Selector empty-polling bug (illustrated, easy to understand, the most complete ever)

The Selector empty-polling bug

If Selector.select() keeps returning 0 ready keys even though it was not woken up and there are no new events to process, the event loop spins without ever blocking. This is called empty polling, and it drives CPU utilization to 100%.

Note: 100% CPU usage makes this a very serious bug.

This notorious epoll bug is a bug in JDK NIO. It was officially claimed to be fixed in JDK 1.6 update 18; however, the problem still existed in JDK 1.7 and JDK 1.8. The probability of hitting it was reduced, but it was never fundamentally fixed. For this bug and the list of related issues, see the following links:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=2147719

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6403933

How Netty solves empty polling

Overview of Netty's solution:

  • 1. Detect: count the select() calls made within one polling cycle, including every one that returns early with nothing to process. If N such empty polls occur in a row, Netty assumes the epoll busy-loop bug has been triggered.
  • 2. Rebuild the Selector: first check whether the rebuild was requested from a thread other than the event loop thread; if so, the rebuild is submitted as a task to the event loop. Otherwise, each channel (such as a SocketChannel) is deregistered from the old Selector and registered again with a new Selector, and the old Selector is closed.
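The detection half of this overview can be sketched as a small counter class. This is a minimal sketch, not Netty code: EmptyPollDetector and onSelect are hypothetical names, and the sketch simplifies Netty's logic by resetting the count to 0 whenever keys are selected.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a Netty class): counts consecutive premature, empty select() returns.
final class EmptyPollDetector {
    private final int threshold; // N: consecutive empty polls treated as "bug triggered"; <= 0 disables
    private int selectCnt;       // select() calls counted in the current cycle

    EmptyPollDetector(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one select() result and returns true if the selector should be rebuilt.
     * selectedKeys: value returned by select(); elapsedNanos: how long select() actually blocked.
     */
    boolean onSelect(int selectedKeys, long elapsedNanos, long timeoutMillis) {
        selectCnt++;
        if (selectedKeys != 0) {
            selectCnt = 0;           // real IO events: not an empty poll
            return false;
        }
        if (elapsedNanos >= TimeUnit.MILLISECONDS.toNanos(timeoutMillis)) {
            selectCnt = 1;           // blocked for the full timeout: normal, restart counting
            return false;
        }
        if (threshold > 0 && selectCnt >= threshold) {
            selectCnt = 0;           // the caller is expected to rebuild the selector now
            return true;
        }
        return false;                // premature empty return, threshold not reached yet
    }

    int count() {
        return selectCnt;
    }
}
```

Keeping the counter separate from the rebuild keeps detection testable without a real Selector.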

The 4 steps Netty uses to resolve empty polling

Netty resolves empty polling in four steps, as follows:

Step 1: timed blocking select(timeoutMillis)

  • First record the current time in currentTimeNanos.
  • Then compute timeoutMillis, the time left until the nearest scheduled task is due.
  • Call the timed, blocking selector.select(timeoutMillis).
  • Increment selectCnt after every select call.
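Step 1 can be sketched with the plain JDK Selector API. This is a minimal sketch: TimedSelect and selectOnce are hypothetical names, and the handling of an already-expired deadline is simplified compared with Netty (which also exits its loop in that case). The rounding expression is the one used in NioEventLoop.select().

```java
import java.io.IOException;
import java.nio.channels.Selector;

// Hypothetical helper (not a Netty class) illustrating step 1.
final class TimedSelect {
    int selectCnt; // consecutive select() calls, like the local counter in NioEventLoop.select()

    // Time left until the deadline, rounded half up to milliseconds; the same expression as
    // (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L in Netty.
    static long timeoutMillis(long deadlineNanos, long currentTimeNanos) {
        return (deadlineNanos - currentTimeNanos + 500_000L) / 1_000_000L;
    }

    int selectOnce(Selector selector, long deadlineNanos) throws IOException {
        long currentTimeNanos = System.nanoTime();                    // record the current time
        long timeout = timeoutMillis(deadlineNanos, currentTimeNanos); // compute timeoutMillis
        int selected = timeout > 0
                ? selector.select(timeout) // timed blocking select
                : selector.selectNow();    // deadline reached: do not block (simplified)
        selectCnt++;                       // count every select call
        return selected;
    }
}
```

A non-positive timeout must not be passed to select(): select(0) blocks indefinitely, and a negative value throws IllegalArgumentException.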

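Step 2: handling valid IO events and wakeups

  • If select() returned ready keys, or the selector was woken up, or there are pending tasks or scheduled tasks, select() returns normally and the event loop goes on to process them.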
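The exit condition for a valid return is a plain boolean expression. Below is a minimal sketch that takes each input as a parameter instead of reading Netty's internal state; the class ValidReturn is hypothetical.

```java
// Hypothetical helper (not a Netty class): the "normal return" test from NioEventLoop.select().
final class ValidReturn {
    // Parameters stand in for selectedKeys, oldWakenUp, wakenUp.get(), hasTasks(), hasScheduledTasks().
    static boolean isValidReturn(int selectedKeys, boolean oldWakenUp, boolean wakenUp,
                                 boolean hasTasks, boolean hasScheduledTasks) {
        return selectedKeys != 0      // IO events are ready
                || oldWakenUp         // already woken up when select() was entered
                || wakenUp            // woken up by another thread in the meantime
                || hasTasks           // ordinary tasks are waiting in the queue
                || hasScheduledTasks; // scheduled tasks exist
    }
}
```

Only when every input is false (0 keys, no wakeup, no tasks) does the loop go on to the timeout and threshold checks.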

Step 3: timeout handling logic

  • If select() returned because timeoutMillis fully elapsed (a normal timeout), selectCnt is reset to 1.
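The timeout test in step 3 can be isolated as follows. A minimal sketch with hypothetical names (TimeoutCheck, timedOut, nextSelectCnt); the comparison itself is the one in the NioEventLoop code shown in this article.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a Netty class) illustrating step 3.
final class TimeoutCheck {
    // True if at least timeoutMillis elapsed between the start time and the time read after select().
    static boolean timedOut(long currentTimeNanos, long timeAfterSelectNanos, long timeoutMillis) {
        return timeAfterSelectNanos - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos;
    }

    // New value of selectCnt after an empty select(): 1 on a normal timeout, unchanged otherwise.
    static int nextSelectCnt(int selectCnt, long currentTimeNanos, long timeAfterSelectNanos, long timeoutMillis) {
        return timedOut(currentTimeNanos, timeAfterSelectNanos, timeoutMillis) ? 1 : selectCnt;
    }
}
```

If the counter is left unchanged, the select() call returned early, and it counts toward the rebuild threshold in step 4.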

Step 4: work around the empty-polling bug

  • If select() keeps returning before the timeout with no keys selected and selectCnt reaches SELECTOR_AUTO_REBUILD_THRESHOLD, the selector is rebuilt to break out of the busy loop.
  • This threshold is 512 by default.
  • Rebuilding means creating a new selector and re-registering the channels with it.
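Rebuilding only needs standard JDK NIO calls. Below is a simplified sketch of the idea behind Netty's rebuildSelector0() (listed later in this article). The class SelectorRebuilder is hypothetical, and unlike Netty it neither updates channel wrapper objects nor closes channels that fail to re-register.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper (not a Netty class): moves all registrations to a fresh Selector.
final class SelectorRebuilder {
    // Returns the new Selector; the caller must use it instead of the old (now closed) one.
    static Selector rebuild(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            if (!key.isValid() || key.channel().keyFor(newSelector) != null) {
                continue; // skip cancelled keys and channels that were already migrated
            }
            int interestOps = key.interestOps();
            Object attachment = key.attachment();
            key.cancel(); // deregister from the old selector (takes effect on its next select/close)
            key.channel().register(newSelector, interestOps, attachment);
        }
        oldSelector.close(); // closing deregisters the cancelled keys; the channels stay open
        return newSelector;
    }
}
```

Cancelling inside the loop is safe because cancel() only adds the key to the cancelled-key set; the key set being iterated is not modified until the next selection operation.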

Core code of Netty's four-step fix for empty polling

//Excerpt from the loop body of NioEventLoop.select(); currentTimeNanos is the start time of this iteration
//(the Thread.interrupted() check of the full method is omitted here)

//Call select(); the blocking time is the time until the nearest scheduled task, calculated earlier
int selectedKeys = selector.select(timeoutMillis);

//Increment the counter
++selectCnt;

if (selectedKeys != 0 || oldWakenUp || this.wakenUp.get() || this.hasTasks() || this.hasScheduledTasks()) {
   //Entering this branch indicates a normal scenario

   //selectedKeys != 0: there are ready IO events
   //oldWakenUp: the selector had already been woken up when select() was entered
   //wakenUp.get(): the selector has been woken up in the meantime
   //hasTasks() || hasScheduledTasks(): there are tasks or scheduled tasks to execute
   //If any of the above holds, return directly

   break;
}

//Read the clock only after select() has returned
long time = System.nanoTime();

//If (current time - start time of this iteration) >= timeoutMillis, select() blocked for the full timeout,
//so this is a normal return
if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
   //Entering this branch indicates a timeout, which is a normal scenario
   //A blocking poll has occurred and timed out, so the counter restarts at 1
   selectCnt = 1;
} else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 && selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
   //Entering this branch means select() returned before the timeout with selectedKeys == 0
   //This is the abnormal (premature return) scenario
   //SELECTOR_AUTO_REBUILD_THRESHOLD > 0 means the select bug workaround is enabled,
   //i.e. io.netty.selectorAutoRebuildThreshold was not configured below 3, and the number of
   //consecutive premature returns of select() has reached the threshold,
   //so the selector is rebuilt

   //Rebuild the selector
   //After the rebuild, call the non-blocking selectNow() once and return directly
   selector = this.selectRebuildSelector(selectCnt);
   selectCnt = 1;
   break;
}
currentTimeNanos = time;
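To see how this loop body behaves when the bug strikes, it can be replayed against a fake select() that returns 0 immediately and a fake clock. This is a hypothetical simulation (SpinSimulation, SelectCall and runUntilRebuild are invented names), not Netty code; it keeps only the counter and timeout logic of the excerpt.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical simulation (not Netty code) of the excerpt's detection logic.
final class SpinSimulation {
    // Hypothetical stand-in for selector.select(timeoutMillis): returns the number of ready keys.
    interface SelectCall {
        int select(long timeoutMillis);
    }

    // Returns the selectCnt value at which a rebuild would be triggered, or -1 if the loop exits
    // normally (ready keys) or maxIterations is reached first (e.g. when detection is disabled).
    static int runUntilRebuild(SelectCall select, LongSupplier clock, long timeoutMillis,
                               int threshold, int maxIterations) {
        int selectCnt = 0;
        long currentTimeNanos = clock.getAsLong();
        for (int i = 0; i < maxIterations; i++) {
            int selectedKeys = select.select(timeoutMillis);
            selectCnt++;
            if (selectedKeys != 0) {
                return -1; // normal return: IO events are ready
            }
            long time = clock.getAsLong(); // measured after select() returns
            if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                selectCnt = 1; // blocked for the full timeout: normal
            } else if (threshold > 0 && selectCnt >= threshold) {
                return selectCnt; // too many premature empty returns: rebuild
            }
            currentTimeNanos = time;
        }
        return -1;
    }
}
```

With a frozen clock every call looks premature, so the default threshold of 512 is reached on the 512th call; with a clock that advances by the full timeout each time, the counter keeps resetting to 1 and no rebuild happens.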

Netty's logic for detecting and handling premature returns from Selector.select() lives mainly in the NioEventLoop.select() method. The complete code is as follows:

public final class NioEventLoop extends SingleThreadEventLoop {

    private void select(boolean oldWakenUp) throws IOException {
        Selector selector = this.selector;

        try {
            //The counter is set to 0
            int selectCnt = 0;
            long currentTimeNanos = System.nanoTime();
            
            //Compute the deadline for this select operation from the nearest registered scheduled task
            long selectDeadLineNanos = currentTimeNanos + this.delayNanos(currentTimeNanos);

            while(true) {
                //Each iteration of the loop recalculates the blocking time of the select
                long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
                
                //If the blocking time is 0 or less, a scheduled task is about to become due
                //In that case, if this is the first loop iteration (selectCnt == 0), call selector.selectNow() once, then exit the loop and return
                //selectNow() is called mainly to detect and process network events that are already ready
                if (timeoutMillis <= 0L) {
                    if (selectCnt == 0) {
                        selector.selectNow();
                        selectCnt = 1;
                    }
                    break;
                }
                
                //If no scheduled task is due yet but there are pending tasks (not only scheduled ones),
                //and wakenUp can be switched from false to true, call selectNow() and return
                if (this.hasTasks() && this.wakenUp.compareAndSet(false, true)) {
                    selector.selectNow();
                    selectCnt = 1;
                    break;
                }
                
                //Call select(); the blocking time is the time until the nearest scheduled task, calculated above
                int selectedKeys = selector.select(timeoutMillis);

                //Increment the counter
                ++selectCnt;

                if (selectedKeys != 0 || oldWakenUp || this.wakenUp.get() || this.hasTasks() || this.hasScheduledTasks()) {
                    //Entering this branch indicates a normal scenario

                    //selectedKeys != 0: there are ready IO events
                    //oldWakenUp: the selector had already been woken up when select() was entered
                    //wakenUp.get(): the selector has been woken up in the meantime
                    //hasTasks() || hasScheduledTasks(): there are tasks or scheduled tasks to execute
                    //If any of the above holds, return directly
                    break;
                }

                //If the thread was interrupted, reset the counter to 1 and return directly
                if (Thread.interrupted()) {
                    if (logger.isDebugEnabled()) {
                        logger.debug("Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
                    }

                    selectCnt = 1;
                    break;
                }

                //Check whether select() returned because the calculated timeout expired;
                //that is also a normal return, so the counter is reset to 1 and the loop continues
                long time = System.nanoTime();
                if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                    //Entering this branch indicates timeout, which is a normal scenario
                    //Indicates that a blocking poll has occurred and timed out
                    selectCnt = 1;
                } else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 && selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
                    //Entering this branch means select() returned before the timeout with selectedKeys == 0
                    //This is the abnormal (premature return) scenario
                    //SELECTOR_AUTO_REBUILD_THRESHOLD > 0 means the select bug workaround is enabled,
                    //i.e. io.netty.selectorAutoRebuildThreshold was not configured below 3, and the number of
                    //consecutive premature returns of select() has reached the threshold,
                    //so the selector is rebuilt

                    //Rebuild the selector
                    //After the rebuild, call the non-blocking selectNow() once and return directly
                    selector = this.selectRebuildSelector(selectCnt);
                    selectCnt = 1;
                    break;
                }

                currentTimeNanos = time;
            }

            //If select() returned prematurely several times in a row (selectCnt > 3) before the loop exited,
            //e.g. because the rebuild mechanism is disabled, just log it to make troubleshooting easier
            if (selectCnt > 3 && logger.isDebugEnabled()) {
                logger.debug("Selector.select() returned prematurely {} times in a row for Selector {}.", selectCnt - 1, selector);
            }
        } catch (CancelledKeyException var13) {
            if (logger.isDebugEnabled()) {
                logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector {} - JDK bug?", selector, var13);
            }
        }

    }
    
    private Selector selectRebuildSelector(int selectCnt) throws IOException {
        logger.warn("Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.", selectCnt, this.selector);
        //Perform selector rebuild
        this.rebuildSelector();
        Selector selector = this.selector;
        //After the rebuild, call the non-blocking selectNow() once and return
        selector.selectNow();
        return selector;
    }   
}

The source code of this.rebuildSelector() called above is as follows:

public final class NioEventLoop extends SingleThreadEventLoop {

    public void rebuildSelector() {
        //If not called from the event loop thread, submit the rebuild to the task queue
        if (!this.inEventLoop()) {
            this.execute(new Runnable() {
                public void run() {
                    NioEventLoop.this.rebuildSelector0();
                }
            });
        } else {
            //Otherwise we are already on the event loop thread, so call the actual rebuild method directly
            this.rebuildSelector0();
        }
    }
    
    private void rebuildSelector0() {
        Selector oldSelector = this.selector;
        
        //If there is no old selector (null), there is nothing to rebuild
        if (oldSelector != null) {
            NioEventLoop.SelectorTuple newSelectorTuple;
            try {
                //Create a new selector
                newSelectorTuple = this.openSelector();
            } catch (Exception var9) {
                logger.warn("Failed to create a new Selector.", var9);
                return;
            }

            int nChannels = 0;
            Iterator var4 = oldSelector.keys().iterator();
            
            //Re-register every key registered on the old selector with the new selector, one by one
            while(var4.hasNext()) {
                SelectionKey key = (SelectionKey)var4.next();
                Object a = key.attachment();

                try {
                    if (key.isValid() && key.channel().keyFor(newSelectorTuple.unwrappedSelector) == null) {
                        int interestOps = key.interestOps();
                        key.cancel();
                        SelectionKey newKey = key.channel().register(newSelectorTuple.unwrappedSelector, interestOps, a);
                        if (a instanceof AbstractNioChannel) {
                            ((AbstractNioChannel)a).selectionKey = newKey;
                        }

                        ++nChannels;
                    }
                } catch (Exception var11) {
                    logger.warn("Failed to re-register a Channel to the new Selector.", var11);
                    if (a instanceof AbstractNioChannel) {
                        AbstractNioChannel ch = (AbstractNioChannel)a;
                        ch.unsafe().close(ch.unsafe().voidPromise());
                    } else {
                        NioTask<SelectableChannel> task = (NioTask)a;
                        invokeChannelUnregistered(task, key, var11);
                    }
                }
            }

            //Assign the selector associated with the NioEventLoop as the new selector
            this.selector = newSelectorTuple.selector;
            this.unwrappedSelector = newSelectorTuple.unwrappedSelector;

            try {
                //Close the old selector
                oldSelector.close();
            } catch (Throwable var10) {
                if (logger.isWarnEnabled()) {
                    logger.warn("Failed to close the old Selector.", var10);
                }
            }

            if (logger.isInfoEnabled()) {
                logger.info("Migrated " + nChannels + " channel(s) to the new Selector.");
            }
        }
    }
}
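The inEventLoop() check in rebuildSelector() follows a common pattern: state owned by the event loop thread (here, the Selector) is only touched on that thread. Below is a minimal sketch of the same dispatch built on a single-threaded ExecutorService; MiniEventLoop is a hypothetical class, and its rebuildSelector0 only records which thread ran it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical single-threaded "event loop" (not Netty code) used only to show the dispatch pattern.
final class MiniEventLoop implements AutoCloseable {
    private volatile Thread loopThread;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "mini-event-loop");
        loopThread = t; // remember which thread runs the loop
        return t;
    });
    volatile String lastRebuildThread;

    boolean inEventLoop() {
        return Thread.currentThread() == loopThread;
    }

    // Same shape as NioEventLoop.rebuildSelector(): run directly when already on the loop
    // thread, otherwise submit the work as a task so that only the loop thread does it.
    Future<?> rebuildSelector() {
        if (!inEventLoop()) {
            return executor.submit(this::rebuildSelector0);
        }
        rebuildSelector0();
        return CompletableFuture.completedFuture(null);
    }

    private void rebuildSelector0() {
        // A real implementation would migrate the channels to a new Selector here.
        lastRebuildThread = Thread.currentThread().getName();
    }

    @Override
    public void close() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Handing the work to the loop thread avoids racing with a select() call that may be running on it at that moment.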

Netty's threshold configuration for empty polling

Netty handles this problem in NioEventLoop: when the select method returns abnormally (the Netty source code comments call this returning "prematurely", i.e. returning early) a certain number of times in a row, Netty works around the bug by creating a new Selector.

Netty provides the system property io.netty.selectorAutoRebuildThreshold, which lets the user set how many consecutive premature returns of select() trigger the creation of a new Selector. Reaching this count triggers the automatic rebuild of the Selector; the default is 512.

However, if io.netty.selectorAutoRebuildThreshold is set to a value less than 3, Netty sets the threshold to 0, which turns the feature off.
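The threshold rule can be sketched with plain JDK calls. This sketch uses Integer.getInteger in place of Netty's internal SystemPropertyUtil.getInt, and the class RebuildThreshold is hypothetical.

```java
// Hypothetical helper (not a Netty class): default 512, values below 3 mean "disabled" (0).
final class RebuildThreshold {
    static final String PROPERTY = "io.netty.selectorAutoRebuildThreshold";

    static int fromSystemProperty() {
        // Integer.getInteger returns the default if the property is missing or not a number.
        return normalize(Integer.getInteger(PROPERTY, 512));
    }

    static int normalize(int configured) {
        return configured < 3 ? 0 : configured;
    }

    static boolean enabled(int threshold) {
        return threshold > 0; // mirrors the SELECTOR_AUTO_REBUILD_THRESHOLD > 0 check
    }
}
```

In practice the property is passed on the JVM command line, e.g. -Dio.netty.selectorAutoRebuildThreshold=1024. Because Netty reads it in a static initializer, setting it with System.setProperty only takes effect if it happens before NioEventLoop is loaded.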

public final class NioEventLoop extends SingleThreadEventLoop {

    private static final int SELECTOR_AUTO_REBUILD_THRESHOLD;

    static {
        //... some code omitted

        int selectorAutoRebuildThreshold = SystemPropertyUtil.getInt("io.netty.selectorAutoRebuildThreshold", 512);
        if (selectorAutoRebuildThreshold < 3) {
            selectorAutoRebuildThreshold = 0;
        }

        SELECTOR_AUTO_REBUILD_THRESHOLD = selectorAutoRebuildThreshold;
        if (logger.isDebugEnabled()) {
            logger.debug("-Dio.netty.noKeySetOptimization: {}", DISABLE_KEY_SET_OPTIMIZATION);
            logger.debug("-Dio.netty.selectorAutoRebuildThreshold: {}", SELECTOR_AUTO_REBUILD_THRESHOLD);
        }

    }
}


Keywords: Java Database Interview

Added by cloudhybrid on Fri, 08 Oct 2021 01:54:05 +0300