Write it down once NET analysis of CPU explosion of an intelligent transportation background service

1: Background

1. Tell a story

The day before yesterday, a friend added wechat to ask for help. There was a problem of high CPU in his program. It was a red envelope at the beginning, which scared me! 🤣🤣🤣

As it was a young year in the south, it was inconvenient for me to deal with it in my hometown, so I didn't help him deal with it at the first time. My friend found the problem the next morning and fed back that it was caused by a while(true), which was a little interesting. In many cases of CPU explosion I analyzed, I didn't encounter a while(true), so I always had regrets. It's really lucky, I caught up years ago, ha ha 😄😄😄.

Next, let's use windbg to analyze it.

2: Windbg analysis

1. Check CPU usage

My friends who have been paying attention to me all know, use! tp command is OK.

0:022> !tp
CPU utilization: 95 Unknown format characterUnknown format control characterWorker Thread: Total: 11 Running: 11 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 0
--------------------------------------
Number of Timers: 3
--------------------------------------
Completion Port Thread:Total: 4 Free: 4 MaxFree: 8 CurrentLimit: 4 MaxLimit: 1000 MinLimit: 4

The above sentence "Unknown format characterUnknown format control characterWorker" seems discordant. In fact, it means% and I don't know why NETCore has this coding problem 😂😂😂, Next, let's use it! eeversion take a look.

0:022> !eeversion
4.700.21.56803 (3.x runtime) free
4,700,21,56803 @Commit: 28bb6f994c28bc91f09bc0ddb5dcb51d0f066806
Workstation mode
In plan phase of garbage collection
SOS Version: 5.0.4.36902 retail build

From the basic information, the current situation is netcore 3.x version, and it is obvious that the current GC is in the planning stage. What is the planning stage?

2. What is the planning stage

In short, the GC in the planning stage needs to decide whether the current managed heap is to do a simple mark free operation or a heavyweight compression operation. If the compression processing is to be done, it also needs to involve the relocation of managed heap objects, which often consumes a considerable amount of CPU time. Next, what causes the GC trigger?

3. GC trigger reason

Because the triggering of GC is often caused by the data allocated by the user thread, one link in the whole execution flow triggered by GC is to freeze the CLR execution engine, that is, suspend EE, which can be used in GC Find out in the app.

Why do you have to mention SuspendEE? Because I can pass! t -special find the thread of SuspendEE, so that the accuracy is higher.

0:072> !t -special
ThreadCount:      54
UnstartedThread:  0
BackgroundThread: 40
PendingThread:    0
DeadThread:       1
Hosted Runtime:   no

          OSID Special thread type
        1 6328 DbgHelper 
        2 35c0 Finalizer 
        4 5aac Timer 
        5 38b0 ThreadpoolWorker 
       17 3530 ThreadpoolWorker 
       18 4484 ThreadpoolWorker 
       19 1e4c ThreadpoolWorker 
       21 6380 ThreadpoolWorker 
       44 5bc4 SuspendEE 
       52  8ac ThreadpoolWorker 
       54 4164 ThreadpoolWorker 
       56 61c8 ThreadpoolWorker 
       58 1fa4 ThreadpoolWorker 
       60 2788 ThreadpoolWorker 
       69 48f4 IOCompletion 
       70 5708 IOCompletion 
       71 3b58 ThreadpoolWorker 
       72 17a0 GC 
       73 2f00 Gate 
       74 35e8 IOCompletion 
       75 5730 IOCompletion 

You can see that the current thread 44 is the thread that triggers GC. Then it becomes clear. Let's see what thread 44 is doing? Switch to thread 44, and then! Just clrstack.

0:044> !clrstack 
OS Thread Id: 0x5bc4 (44)
        Child SP               IP Call Site
000000A2B0C3E4C8 00007ffd471ead3a [HelperMethodFrame: 000000a2b0c3e4c8] 
000000A2B0C3E5E0 00007ffce8b4e506 System.Collections.Generic.List`1[[System.__Canon, System.Private.CoreLib]].System.Collections.Generic.IEnumerable.GetEnumerator() [/_/src/System.Private.CoreLib/shared/System/Collections/Generic/List.cs @ 585]
000000A2B0C3E630 00007ffce85e7a10 xxx.Program.DeletexxxExipredDate()
000000A2B0C3E780 00007ffd46bc1f0b System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
000000A2B0C3E7B0 00007ffd46bb90b6 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 172]
000000A2B0C3E830 00007ffd46ba535b System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
000000A2B0C3EA48 00007ffd47236c93 [GCFrame: 000000a2b0c3ea48] 
000000A2B0C3ECB0 00007ffd47236c93 [DebuggerU2MCatchHandlerFrame: 000000a2b0c3ecb0] 

From the output information, the problem lies in the deletexxexipredate () method. Next, explore the source code of this method.

        private static void DeletexxxExipredDate()
        {
            while (true)
            {
                foreach (string key in xxx.xxxSpeedLimit.Keys)
                {
                    try
                    {
                        string[] array = xxx.xxxSpeedLimit[key].Split('$');
                        if (array.Length > 1)
                        {
                            DateTime dateTime = Convert.ToDateTime(array[1]);
                            if ((DateTime.Now - dateTime).TotalSeconds > 21600.0 && xxx.xxxSpeedLimit.ContainsKey(key))
                            {
                                xxx.xxxSpeedLimit.TryRemove(key, out var _);
                            }
                        }
                    }
                    catch (Exception ex)
                    {
                        LogHelper.WriteAppExceptionLog("Exception in deleting data:" + ex.Message, ex);
                    }
                    Thread.Sleep(20000);
                }
            }
        }

If you have a friend with rich experience in stepping on pits, I believe you can see the problems in the code at a glance. Yes, when the xxxSpeedLimit dictionary is empty, it is equivalent to a while(true) loop, isn't it?

In order to verify my statement, you can use it! dso finds the memory address of dict and uses it! wconcurrentdict is enough.

0:044> !dso 
OS Thread Id: 0x5bc4 (44)
RSP/REG          Object           Name
...
000000A2B0C3E708 000001ba8007f618 System.Collections.Concurrent.ConcurrentDictionary`2[[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib]]
000000A2B0C3E760 000001ba88501cd0 System.Collections.Generic.List`1+Enumerator[[System.String, System.Private.CoreLib]]
000000A2B0C3E768 000001ba80050ec0 System.Threading.ContextCallback
000000A2B0C3E7F8 000001ba80a1a818 System.Threading.Thread
000000A2B0C3EA28 000001ba80a1a898 System.Threading.ThreadStart
0:044> !wconcurrentdict 000001ba8007f618
Empty ConcurrentDictionary

You can see that the current dictionary is empty 😂😂😂

3: Summary

The main reason for this accident: the coder lacks some programming experience and the process processing of empty dictionary when writing business logic, which leads to the embarrassment of while(true), or the thread Sleep (20000) is misplaced 🤣🤣🤣

In general, I am very grateful for the dump provided by this friend, which makes me really believe what I see!

Keywords: windbg

Added by unibroue on Sat, 29 Jan 2022 06:11:35 +0200