Write it down once NET analysis of CPU burst height of HIS system in a hospital

1: Background

1. Tell a story

A few days ago, a friend jia wx complained that his program always had an inexplicable high cpu during the peak period. How to analyze it for help?

After communicating with this friend, it is said that this problem has plagued them for several years 😂, We also invited Microsoft engineers to solve the problem and ended up with nothing. We still didn't find the right boss of Microsoft...

As for the problem of high CPU explosion, regular readers should know that I have written several articles, which are basically divided into two categories:

  • GC trigger

  • Large number of lock locks

A few are all kinds of inexplicable problems, which cannot be listed one by one 🤣🤣🤣, Now that my friend has found me, I have to find a way to solve it for him. I don't talk much and go to windbg.

2: windbg analysis

1. View the synchronization block table

In case of this problem, it is my inertial thinking to check the synchronization block table for the first time. The command is very simple! syncblk .

0:000> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
-----------------------------
Total           20779
CCW             16
RCW             21
ComClassFactory 0
Free            16490

I went and threw an empty one. There was nothing in the synchronization block table... Since it has nothing to do with the lock, let's look at the thread. After all, the thread depends on the CPU.

2. View threads

To view the managed threads in the system, you can use! t command, there are many threads, which is a little simplified.

0:000> !t
ThreadCount:      135
UnstartedThread:  0
BackgroundThread: 132
PendingThread:    0
DeadThread:       1
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  34    1 25d4 000001ea28702130    28220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn 
  74    2 3ed0 000001ea286fa940    2b220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     MTA (Finalizer) 
  76    3 4a70 000001f4447d7810  102a220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     MTA (Threadpool Worker) 
  77    4 326c 000001f4447dbe60    21220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn 
  78    6 2dac 000001f4447d9750  1020220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn (Threadpool Worker) 
  79    7 1468 000001f444a2d6f0    21220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 1     Ukn (GC) 
  80    8   f0 000001f444a2cf20    21220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn 
  81    9 3118 000001f444a2f630    21220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn 

First sell a pass, can a friend see what's different about these threads??? Yes, there is a Ukn (GC) tag in the last column of thread 79. I think you must be curious. What does this mean? There are some changes in the mode of the underlying GC, but anyway, it tells you to a certain extent that your program triggers GC. For further verification, you can use it! t -special see what category thread 79 belongs to and more detailed information.

0:000> !t -special
ThreadCount:      135
UnstartedThread:  0
BackgroundThread: 132
PendingThread:    0
DeadThread:       1
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  34    1 25d4 000001ea28702130    28220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 0     Ukn 
  79    7 1468 000001f444a2d6f0    21220 Preemptive  0000000000000000:0000000000000000 000001ea286ee080 1     Ukn (GC) 

OSID Special thread type
41 38d8 DbgHelper 
42 1c88 GC 
74 3ed0 Finalizer 
75 402c ProfilingAPIAttach 
76 4a70 Timer
79 1468 GC SuspendEE 

As you can see from the last line of output, 79 is the GC thread, followed by a strange SuspendEE tag. You're curious. What does that mean?

SuspendEE = Suspend CLR Execution Engine (EE)

In other words, thread 79 freezes the CLR execution engine for the simple purpose of facilitating the other 31 GC threads to take care of the current managed heap, but this old man machine is really 🐂👃, 32core, I don't know which hospital is so awesome. Add it! cpuid validation.

0:000> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,62,4  <unavailable>   2600
 1  6,62,4  <unavailable>   2600
 2  6,62,4  <unavailable>   2600
 3  6,62,4  <unavailable>   2600
 xxx
31  6,62,4  <unavailable>   2600

Since the GC trigger is predicted, the next step is to call out the managed and unmanaged stacks of all threads.

3. View each thread stack

To view the managed and unmanaged stacks of each thread, it's easy to use! eestack, and then you can retrieve the keyword WaitUntilGCComplete to determine how many threads are waiting for GC processing to complete.

It can be seen from the figure that 40 threads are blocked at present. It's good. The problem is becoming clearer and clearer. Next, analyze which thread did what it shouldn't do, resulting in GC triggering. You can also search try_allocate_more_space to determine which threads are allocating space.

I'll go. It's obvious that threads 170 and 187 are allocating large objects_ heap::allocate_ large_ Object triggers GC. The large object heap itself is a daunting thing. Its recycling and cleaning are very CPU intensive. This is also consistent with the fact that the CPU drops in about one minute.

4. Find the culprit

Now the focus is on these two threads. I've seen that these two thread stacks are the same method, so here's a 187 thread to analyze. It can be used! clrstack looks at its hosting stack.

0:187> !clrstack 
OS Thread Id: 0x1ef0 (187)
        Child SP               IP Call Site
00000054ce631e30 00007ffc4021bde2 System.String.FillStringChecked(System.String, Int32, System.String)
00000054ce631e70 00007ffc402222a8 System.String.ConcatArray(System.String[], Int32)
00000054ce631ed0 00007ffc40223528 System.String.Concat(System.String[])
00000054ce631f40 00007ffbe6dbdafb BLL.xxx.xxx.GetRowString(System.String, Boolean, Boolean, System.String, System.String, System.String, System.String, System.String, System.String, Int32, System.String, System.String, System.String, System.String, System.String, System.String, Int32, Int32, System.String, System.Nullable`1, System.Nullable`1, System.String, System.Nullable`1, System.Nullable`1, System.Nullable`1, System.Nullable`1, System.String, System.String, System.String, System.String, System.String ByRef)
00000054ce65cf40 00007ffbe6ab3187 BLL.xxx.xxx.ExpZB(System.String, Boolean, Boolean, System.String, System.String, System.String, System.String, Int32, System.String, System.String, System.String, Int32, System.String, System.String, System.String, System.String, System.String, System.String, Int32, Int32, System.String, System.Nullable`1, System.Nullable`1, System.String, System.Nullable`1, System.Nullable`1, System.Nullable`1, System.Nullable`1, System.String, System.String, System.String, System.String)

From the stack point of view, it seems to use system String. Concat splicing is caused by string. Good guy, string splicing has been criticized many times over the years. There are still many people stepping on the pit 😓😓😓, For further verification, you can use! clrstack -p + !da -details xxx look at this system What exactly is string [] simplified as follows:

0:187> !clrstack -p
OS Thread Id: 0x1ef0 (187)
00000054ce631e70 00007ffc402222a8 System.String.ConcatArray(System.String[], Int32)
    PARAMETERS:
        values (<CLR reg>) = 0x000001ea69e8d2f8
        totalLength = <no data>
0:187> !da -details 0x000001ea69e8d2f8
Name:        System.String[]
Size:        128(0x80) bytes
Array:       Rank 1, Number of elements 13, Type CLASS
Element Methodtable: 00007ffc403d6948
[0] 000001f2391a83f0
    Name:        System.String
    MethodTable: 00007ffc403d6948
    EEClass:     00007ffc3fcd50e0
    Size:        445950(0x6cdfe) bytes
    File:        C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
    String:      xxxxx

From the output information, we can see that there are 13 elements in string [], of which the largest string is 445950 bytes = 445k, which is larger than the 85k definition of large objects. Therefore, Concat here is the crux, and the same is true for thread 170. The next question I have to solve is: why is such a large string generated, What is done in the code??? To find the answer, you have to export the source code from dump to find out.

5. Check the problem code

To analyze the problem code, you can use! ip2md + !savemodule export Bll xxx. xxx. Getrowstring method.

0:187> !ip2md 00007ffbe6dbdafb
MethodDesc:   00007ffbe5342118
Method Name:  BLL.xxx.xxx.GetRowString(System.String, Boolean, Boolean, System.String, System.String, System.String, System.String, System.String, System.String, Int32, System.String, System.String, System.String, System.String, System.String, System.String, Int32, Int32, System.String, System.Nullable`1<Int32>, System.Nullable`1<Int32>, System.String, System.Nullable`1<Single>, System.Nullable`1<Single>, System.Nullable`1<Single>, System.Nullable`1<Single>, System.String, System.String, System.String, System.String, System.String ByRef)
Class:        00007ffbe52fe328
MethodTable:  00007ffbe53421d8
mdToken:      0000000006000096
Module:       00007ffbe471a890
0:187> !savemodule  00007ffbe471a890 E:\dumps\RowString.dll
3 sections in file
section 0 - VA=2000, VASize=f7fcc, FileAddr=200, FileSize=f8000
section 1 - VA=fa000, VASize=3bc, FileAddr=f8200, FileSize=400
section 2 - VA=fc000, VASize=c, FileAddr=f8600, FileSize=200

Then, ILSpy decompiles this code.

Good guy, it's really written 🐂👃, With countless string splices, I feel that gen and LOH have no time to allocate memory segments 😥😥😥, It really killed GC...

3: Summary

In fact, this is a textbook problem and there are also textbook solutions. Moreover, I have seen that this method has more than 600 lines of code, basically doing string splicing. Finally, let's talk about the solution.

  • Refactor this method and try to replace String with StringBuilder to minimize the number of GC triggers.

It seems that the result of this analysis coincides with the deep suspicion of friends...

More high quality dry goods: see my GitHub: dotnetfly

Added by phpcat on Thu, 10 Feb 2022 16:56:04 +0200