Programmers chatting downstairs: a JVM crash investigation

Downstairs of an office building on Dawang Road.

Ape A: Back at my old company we ran a lot of data-processing jobs in the middle of the night, and the program crashed all the time.
Me: Oh? How did you deal with it?
Ape A: Our architect at the time was a bit of a fraud. He said we should adjust the ratio of Eden to the from/to survivor spaces... which had nothing to do with the crash.
Me: Young man, your story has caught my attention. Come, let me buy you a drink and you can tell me all about it.

Business scenarios

"Ape A" is a colleague and close friend of mine. After talking with him in detail, I pieced together the scene at the time.

As far as I understand it, a scheduled job would pull a huge amount of data out of Hive (reportedly about 200 million records), integrate and process it according to the business logic, and finally push it to ES (Elasticsearch). (I have never worked with Hive myself, but that doesn't hinder the discussion.)

The business-processing part used the thread pool returned by Executors.newFixedThreadPool.

Simulating the troubleshooting process

Problem Location

Ape A: At the time we suspected an OOM had crashed the JVM, and then suspected a large number of objects the GC couldn't reclaim, so we turned on GC logging.
Me: Hmm, didn't you analyze the hs_err_pid_xxx.log file?
Ape A: I was young then and didn't know that trick yet...

"Ape A"'s approach at the time was rough: as soon as there was a suspicion, he simply added -XX:+PrintGC to the startup parameters, which prints GC logs. Assuming the production environment used the CMS collector, I wrote a demo to simulate the scene:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class CMSGCLogs {
    // Startup parameters: -Xmx10m -Xms10m -Xmn4M -XX:+PrintGC -XX:+UseConcMarkSweepGC
    public static void main(String[] args) throws InterruptedException {
        // Set the number of threads to 1, named "T-1"
        ExecutorService es = Executors.newFixedThreadPool(1, new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "T-1");
            }
        });

        boolean flag = true;
        while (flag) {
            es.execute(() -> {
                try {
                    byte[] bytes = new byte[1024 * 1000 * 1];  // Simulate reading a batch of data from Hive (1 MB)
                    TimeUnit.MILLISECONDS.sleep(50L);          // Simulate the process of writing to ES
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        es.shutdown();
    }
}

Let's start with how a thread pool works.

While the number of submitted tasks is below the core pool size, the pool creates a new thread per task until the core size is reached.
After that, tasks the core threads cannot digest are put into the blocking queue.
Once the blocking queue is full, the pool resumes creating threads, up to the maximum pool size.
When even the maximum number of threads cannot carry the still-surging tasks, the rejection policy is executed.

FixedThreadPool, as a member of the thread pool family, also follows the rules above.
The difference is that its core pool size equals its maximum pool size, and its LinkedBlockingQueue is unbounded.

// How the fixed thread pool is created in the JDK source
public static ExecutorService newFixedThreadPool(int nThreads, ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(nThreads, nThreads,
                                  0L, TimeUnit.MILLISECONDS,
                                  new LinkedBlockingQueue<Runnable>(),
                                  threadFactory);
}
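The "unbounded" part is the LinkedBlockingQueue built by its no-argument constructor, whose capacity is Integer.MAX_VALUE. A minimal sketch (class name is mine, not from the article) confirms this:

```java
import java.util.concurrent.LinkedBlockingQueue;

public class UnboundedQueueDemo {
    public static void main(String[] args) {
        // The no-arg constructor sets the capacity to Integer.MAX_VALUE,
        // i.e. "unbounded" for all practical purposes.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        System.out.println(queue.remainingCapacity()); // 2147483647
    }
}
```

So every task that the single worker thread cannot keep up with simply accumulates in this queue, eating heap memory.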

Run it, and the demo output looks roughly like this:

[GC (Allocation Failure)  3302K->1174K(9856K), 0.0037023 secs]
[GC (Allocation Failure)  3664K->1541K(9856K), 0.0014799 secs]
[Full GC (Allocation Failure)  1541K->1446K(9856K), 0.0039197 secs]
[Full GC (Allocation Failure)  1446K->1387K(9856K), 0.0037783 secs]
## Thread T-1 OOM
Exception in thread "T-1" java.lang.OutOfMemoryError: Java heap space
    at com.evolution.gc.CMSGCLogs.lambda$main$0(CMSGCLogs.java:44)
    at com.evolution.gc.CMSGCLogs$$Lambda$1/233530418.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

...

[Full GC (Allocation Failure)  9676K->9676K(9856K), 0.0253352 secs]
[Full GC (Allocation Failure)  9676K->9676K(9856K), 0.0190795 secs]
## Thread T-1 OOM
Exception in thread "T-1" java.lang.OutOfMemoryError: Java heap space
Exception in thread "T-1" java.lang.OutOfMemoryError: Java heap space

...

[GC (CMS Initial Mark)  9855K(9856K), 0.0051677 secs]
## Thread main OOM
Exception in thread "main" [Full GC (Allocation Failure)  9855K->9855K(9856K), 0.0211383 secs]
[Full GC (Allocation Failure)  9855K->9855K(9856K), 0.0203374 secs]
[Full GC (Allocation Failure)  9855K->9855K(9856K), 0.0433360 secs]
[GC (CMS Initial Mark)  9855K(9856K), 0.0029999 secs]
[Full GC (Allocation Failure) java.lang.OutOfMemoryError: Java heap space
 9855K->9855K(9856K), 0.0249560 secs]
[Full GC (Allocation Failure)  9855K->9854K(9856K), 0.0244291 secs]
[GC (CMS Initial Mark)  9854K(9856K), 0.0063567 secs]
[Full GC (Allocation Failure)  9854K->9854K(9856K), 0.0208301 secs]
[Full GC (Allocation Failure)  9854K->9854K(9856K), 0.0346616 secs]
[Full GC (Allocation Failure)  9855K->1147K(9856K), 0.0067034 secs]
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message can't create byte arrau at JPLISAgent.c line: 813

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "T-1"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

In fact, readers familiar with FixedThreadPool can guess the problem just from seeing it used here: the unbounded queue of a FixedThreadPool will keep storing tasks indefinitely until memory overflows.
The GC log confirms exactly that. It even thoughtfully points out which thread, which object, and even which line threw the exception (at com.evolution.gc.CMSGCLogs$$Lambda$1/233530418.run); of course, I used an anonymous inner class and a lambda here, so it is a little harder to read.

Branch Task 1: The sub-thread OOM problem

But one thing differed from what I imagined: in the log, thread T-1 crashes many times, and the program still holds on for a while before the main thread finally crashes.

That's strange. Isn't heap memory shared? Sub-thread T-1 crashed; why didn't it bring down the entire JVM?

So I used VisualVM to observe the memory behavior:

The cliff in the chart is thread T-1's OOM. After checking the relevant material, we can draw a conclusion: when a thread (T-1) hits an OOM, the memory reachable only from that thread is released, without affecting the other threads (main)!

The detailed process is as follows:

  1. Thread T-1 processes its business slowly (TimeUnit.MILLISECONDS.sleep(50L);) while tasks keep pouring in, so the FixedThreadPool's blocking queue frantically stockpiles pending tasks and memory gradually runs short;
  2. Each execution by thread T-1 claims another 1 MB of memory (byte[] bytes = new byte[1024*1000*1];); with less than 1 MB left (the rest occupied by the blocking queue), thread T-1 has no choice but to crash with an OOM;
  3. When thread T-1 dies, its resources are released and the cycle starts again (in the next round, T-1 crashes once more...). The whole JVM finally crashes and exits at the moment the main thread also runs out of memory.
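The conclusion can be verified in isolation with a minimal sketch of my own (not from the original investigation): the child thread dies of an OutOfMemoryError alone, while main keeps running.

```java
public class ThreadOOMDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            // An allocation (~16 GB) far beyond any typical heap;
            // the resulting OutOfMemoryError kills only this thread.
            long[] huge = new long[Integer.MAX_VALUE];
        }, "T-1");
        t.start();
        t.join();  // wait for T-1 to die of its OutOfMemoryError
        System.out.println("main is still alive");  // main is unaffected
    }
}
```

An uncaught OutOfMemoryError terminates only the thread it was thrown in; the JVM as a whole exits only when all non-daemon threads are gone (or the heap can no longer satisfy anyone).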

Branch Task 2: CMS STW

While debugging the demo, I accidentally increased the simulated ES write time enormously (as follows):

while (flag) {
    es.execute(() -> {
        try {
            byte[] bytes = new byte[1024 * 1000 * 1];
            TimeUnit.HOURS.sleep(50L);    // Change the unit from MILLISECONDS to HOURS
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
}

I originally thought this would crash the JVM even faster: if thread T-1 takes 50 hours to finish a task while the backlog grows without end, the JVM should die sooner.
It turned out to be just the opposite!

The effect of program execution becomes as follows:

31.949: [GC (CMS Initial Mark)  9855K(9856K), 0.0033049 secs]
31.966: [GC (CMS Final Remark)  9855K(9856K), 0.0046193 secs]
32.401: [Full GC (Allocation Failure)  9855K->9855K(9856K), 0.0243874 secs]
32.425: [Full GC (Allocation Failure)  9855K->9855K(9856K), 0.0240275 secs]
32.450: [GC (CMS Initial Mark)  9855K(9856K), 0.0033793 secs]
32.466: [GC (CMS Final Remark)  9855K(9856K), 0.0048207 secs]
34.473: [GC (CMS Initial Mark)  9855K(9856K), 0.0033645 secs]
34.485: [GC (CMS Final Remark)  9855K(9856K), 0.0046805 secs]
36.491: [GC (CMS Initial Mark)  9855K(9856K), 0.0032574 secs]
...
109.890: [GC (CMS Initial Mark)  9855K(9856K), 0.0038242 secs]
109.902: [GC (CMS Final Remark)  9855K(9856K), 0.0053191 secs]
111.909: [GC (CMS Initial Mark)  9855K(9856K), 0.0106591 secs]
111.945: [GC (CMS Final Remark)  9855K(9856K), 0.0059029 secs]

What I got was an endless stream of CMS Initial Mark and CMS Final Remark.

The VisualVM analysis chart looks like this:

Memory stays almost full, yet the "last straw that breaks the camel's back" never arrives.
Thinking it over, I remembered that CMS has two stop-the-world (STW) phases in its execution; aren't they exactly the two brothers, CMS Initial Mark and CMS Final Remark?

Then the phenomenon can be explained:
as the heap fills up, the CMS collector starts running, but neither marking phase can reclaim anything. (The 1 MB byte arrays cannot be collected because the tasks holding them are still sleeping for 50 hours; the tasks in the blocking queue cannot be collected because the JVM believes it will eventually execute them.)
So, on the eve of a memory-starved JVM crash, CMS keeps executing the initial and final marks, and those two STW phases repeatedly pause the user thread (thread T-1), giving it no chance to deliver the allocation that would finally choke the JVM.
A delicate balance...

Solution

The two branch tasks above were my own encounters during the demo simulation; Comrade Ape A didn't go through so many twists and turns (though he may get other chances).
Now that the problem has been pinned on FixedThreadPool, we just need to replace it with a hand-built ThreadPoolExecutor:

  • Specify a bounded queue via the constructor parameters
  • Change the rejection policy to ThreadPoolExecutor.CallerRunsPolicy

The default rejection policy is ThreadPoolExecutor.AbortPolicy, which discards the task and throws an exception; ThreadPoolExecutor.CallerRunsPolicy doesn't trouble the thread pool at all, and instead hands the task back to the submitting thread to run (in the demo, the main thread that called execute).

Tip: here are all the policies for reference:

ThreadPoolExecutor.AbortPolicy: discards the task and throws a RejectedExecutionException.
ThreadPoolExecutor.DiscardPolicy: also discards the task, but throws no exception.
ThreadPoolExecutor.DiscardOldestPolicy: discards the task at the head of the queue and retries the submission (repeating this process).
ThreadPoolExecutor.CallerRunsPolicy: the task is handled by the calling thread.
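A small sketch of my own (class name hypothetical) showing the default AbortPolicy in action: with one worker thread and a one-slot queue, the third submission has nowhere to go and is rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
    public static void main(String[] args) {
        // One worker thread and a one-slot queue: the third submission
        // finds the worker busy and the queue full, so AbortPolicy fires.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L,
                TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        Runnable sleepy = () -> {
            try { TimeUnit.SECONDS.sleep(1); } catch (InterruptedException ignored) {}
        };
        pool.execute(sleepy); // taken by the worker thread
        pool.execute(sleepy); // parked in the queue
        try {
            pool.execute(sleepy); // no room left -> rejected
        } catch (RejectedExecutionException e) {
            System.out.println("third task rejected");
        }
        pool.shutdownNow();
    }
}
```

Swap in CallerRunsPolicy and the third task would instead run synchronously on the main thread, which is exactly the back-pressure effect used in the fix below.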

Modified demo code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CMSGCLogs {
    // -Xmx10m -Xms10m -Xmn4M -XX:+PrintGC -XX:+UseConcMarkSweepGC
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> list = new ArrayBlockingQueue<>(20);    // Modification 1: a bounded queue
        ExecutorService es = new ThreadPoolExecutor(1, 1,
                0, TimeUnit.MILLISECONDS, list, new ThreadFactory() {
                    @Override
                    public Thread newThread(Runnable r) {
                        return new Thread(r, "T-1");
                    }
                },
                new ThreadPoolExecutor.CallerRunsPolicy());   // Modification 2: a different rejection policy
        boolean flag = true;
        while (flag) {
            es.execute(() -> {
                try {
                    byte[] bytes = new byte[1024 * 1000 * 1];
                    TimeUnit.MILLISECONDS.sleep(50L);
                    System.out.println(String.format("curThread=%s,run task!", Thread.currentThread())); // Print the executing thread
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        es.shutdown();
    }
}

Look at the effect:

53.422: [GC (Allocation Failure)  6615K->3550K(9856K), 0.0006514 secs]
curThread=Thread[main,5,main],run task!
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
53.473: [GC (Allocation Failure)  6615K->3550K(9856K), 0.0003791 secs]
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
53.573: [GC (Allocation Failure)  6615K->3550K(9856K), 0.0003733 secs]
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
53.623: [GC (Allocation Failure)  6615K->3550K(9856K), 0.0009536 secs]
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
curThread=Thread[main,5,main],run task!
curThread=Thread[T-1,5,main],run task!
53.724: [GC (Allocation Failure)  6615K->3550K(9856K), 0.0004394 secs]

We can see the program now runs smoothly; observing the JVM heap at this point, everything is normal:

The role of young-generation space ratios

The problem has been solved. Looking back on the architect's proposal at that time:

Ape A: Our architect at the time was a bit of a fraud. He said we should adjust the ratio of Eden to the from/to survivor spaces... which had nothing to do with the crash.

By "the architect's proposal" I mean setting the -XX:SurvivorRatio parameter to adjust the proportions of the spaces within the young generation.

This ratio (Eden : from : to) defaults to 8:1:1. Let's look at the impact of adjusting it.

  • Taken to one extreme: tune Eden very large.

In a young-generation GC, whether an object is promoted to the old generation depends mainly on two factors: the object's age and its size.
If an object's size exceeds what the shrunken survivor spaces can hold, it enters the old generation directly. That increases the pressure on old-generation GC and may produce more STW pauses.

  • The opposite extreme: tune Eden very small.

Because one of the two survivor spaces (S0/S1) is always empty, this adjustment wastes more memory and triggers young-generation GC more frequently.
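To make the default concrete, here is an illustrative flag combination (the application name MyApp is hypothetical, not from the article):

```shell
# Illustrative only: with -XX:SurvivorRatio=8 (the default), a 40 MB young
# generation (-Xmn40m) is divided as Eden 32 MB : S0 4 MB : S1 4 MB, i.e. 8:1:1.
java -Xmn40m -XX:SurvivorRatio=8 -XX:+PrintGCDetails MyApp
```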


That gives us an intuitive sense of what this ratio does. So, was the architect's proposal related to the problem Ape A ran into? The answer is no; that architect really was a fraud...

Keywords: Java jvm hive Lambda

Added by hofmann777 on Mon, 09 Sep 2019 13:09:17 +0300