In the third part of this series, I used an example to detail the Impala query profile.In this section, I will use an Impala Profile file from a real production system to explain the steps I take to read the Profile file when I encounter a problem and gradually find the real cause of the problem.If you missed the first three parts of my article, you can click on the following link to read it:
- Understanding Impala query profile step by step
- Understanding Impala query profile step by step (2)
- Understanding Impala query profile step by step (3)
Okay, let's get started.Because the Profile file itself is quite large because it involves multiple Impalad processes, the page will look ugly if I show the full content on it.So, from here link Download and view the file on your local computer or browser.
The problem with this query is that for some unknown reason, the same query could have been executed in a few minutes before, but it now takes more than an hour to complete.This Profile file is just one example, and in fact, all queries running in this cluster had the same problem at that time.So take a moment to look at this Profile to see if you can get any useful information and get a general idea of the scenario in which the problem occurs.
Now let's take a look at the steps to troubleshoot the problem.
1. Since user feedback takes longer to query than normal, the first thing I want to know is exactly how long it takes?So first, I look at the beginning of Profile to see when queries begin and end:
Start Time: 2020-01-03 07:33:42.928171000 End Time: 2020-01-03 08:47:55.745537000
I find it now takes 1 hour and 14 minutes to complete the query, which is consistent with user feedback.
2. Query failed abnormally due to user cancellation:
Query State: EXCEPTION Query Status: Cancelled
The user has lost patience and has to cancel the query because it has been running for too long and there is nothing to worry about here.
3. Next, I checked the complexity of the query:
Sql Statement: SELECT COUNT(*),NVL(SUM(NVL(NUMBER_OF_CALLS,0)),0) FROM cef_hpmn WHERE settlement_month IN ( SELECT DISTINCT dd.month_id FROM ext_call_event_fact cef, date_dim dd WHERE CAST(settlement_date_id AS bigint) = dd.date_id ) AND process_date = TO_TIMESTAMP ('01/02/2020', 'MM/dd/yyyy')
I reformatted it for readability.You can see that the query is very simple, a direct SELECT statement with subquery conditions.
4. What I see next is the host name of the Coordinator node:
If I can't draw any conclusions from this Profile file, the next step should be to get the impalad log on this host.But for now, I can only continue because I don't have a log yet.
5. As I mentioned in my previous article, the next part of the Profile file is Query Plan. Usually I skip this part first and then skip to the Summary section to see how long each operation took and to see if any obvious information might make me discover:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ----------------------------------------------------------------------------------------------------------- 12:AGGREGATE 1 0.000ns 0.000ns 0 1 16.00 KB 10.00 MB FINALIZE 11:EXCHANGE 1 0.000ns 0.000ns 28 1 232.00 KB 0 UNPA... 06:AGGREGATE 29 61.358ms 110.536ms 28 1 13.60 MB 10.00 MB 05:HASH JOIN 29 223.055ms 345.295ms 277.41M 116.66M 2.00 MB 1.94 MB LEFT... |--10:EXCHANGE 29 29.977us 67.789us 2 261 24.00 KB 0 BROADCAST | 09:AGGREGATE 29 1.569ms 2.903ms 2 261 1.96 MB 10.00 MB FINALIZE | 08:EXCHANGE 29 33.880us 419.580us 58 261 240.00 KB 0 HASH(.. | 04:AGGREGATE 29 9s751ms 3m53s 58 261 4.58 MB 10.00 MB STREAMING | 03:HASH JOIN 29 1s099ms 2s635ms 411.27M 411.27M 3.35 MB 1.94 MB INNER .. | |--07:EXCHANGE 29 142.532us 334.794us 8.07K 8.07K 304.00 KB 0 BROADCAST | | 02:SCAN HDFS 1 1.700ms 1.700ms 8.07K 8.07K 332.94 KB 32.00 MB xx.. | 01:SCAN HDFS 29 44s391ms 13m18s 411.27M 411.27M 525.95 MB 4.38 GB xx.. 00:SCAN HDFS 29 3m7s 56m13s 277.41M 116.66M 7.83 MB 160.00 MB xx..
Here are a few key messages I found:
- Reverse reading, from bottom to top, because this is the order in which Impala operates
- Compare data from Avg Time and Max Time columns
- Compare data for Rows and Est. #Rows columns
- Check the Detail column to see the Join type for each operation
Soon, I noticed that there was a big difference between Avg Time and Max Time for scanning HDFS (SCAN HDFS) operations, with an average time of 3 minutes and 7 seconds, but the maximum time for one of the 29 hosts was 56 minutes and 13 seconds.As I read on, I also noticed the same problem with the second scan of HDFS, 44 seconds and 13 minutes and 18 seconds, respectively.
Therefore, my next thought is to determine which hosts are performing slower than others and whether they are from the same host or not.To do this, I search for Profile files using the string "id=0", which is the operation number at the beginning of each line in the Summary section "00:SCAN HDFS".
I searched the first part of "id=0" from the beginning of the file:
HDFS_SCAN_NODE (id=0) .... - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995) - MaterializeTupleTime(*): 226ms (226568242) - ScannerThreadsSysTime: 322ms (322168172) - ScannerThreadsUserTime: 6.76s (6758158482) - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907) - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560) - TotalRawHdfsReadTime(*): 3.4m (201957505069) - TotalReadThroughput: 749.9 KiB/s (767874) - TotalTime: 3.1m (187289950304)
I found TotalTime to be 3.1 minutes, which matches the 3.7 minutes seen in the Summary section, so this is Averaged Fragment.To confirm, I scroll back and see the following:
Averaged Fragment F00
To continue searching for files, I'll go to the following section (second example, "id=0"):
HDFS_SCAN_NODE (id=0) .... - ScannerThreadsTotalWallClockTime: 10.4m (626435081910) - MaterializeTupleTime(*): 278ms (278689886) - ScannerThreadsSysTime: 266ms (266094000) - ScannerThreadsUserTime: 5.75s (5748833000) - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285) - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839) - TotalRawHdfsReadTime(*): 1.9m (114598713147) - TotalReadThroughput: 731.0 KiB/s (748535) - TotalTime: 2.1m (125005670562)
This told me it took 2.1 minutes, faster than the average of 3.1 minutes, and then I rolled back to confirm the host:
Fragment F00 Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
Now you can see three main pieces of information I found:
Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000) HDFS_SCAN_NODE (id=0) - TotalTime: 2.1m (125005670562)
Next, I use grep to filter out what I want.Since the Profile file is indented, I use egrep to do the following:
egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=0\)|^ - TotalTime: ' profile-example.txt
The results are as follows:
... Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 2.1m (125005670562) Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.9m (114395426955) Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-cdn020.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.5m (92671503850) Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-cdn012.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (86459970122) Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-cdn014.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (82187347776) Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-cdn006.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (82146306944) Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-cdn024.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (80468400288) Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-cdn022.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (79714897965) Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-cdn017.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (78877950983) Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (77593734314) Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-cdn019.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (76164245478) Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-cdn007.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (75588331159) Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-cdn010.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (73596530464) Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-cdn018.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (72946574082) Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-cdn026.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (69918383242) Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-cdn011.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (69355611992) Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (68527129814) Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-cdn016.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (67249633571) Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-cdn013.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (63989781076) Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-cdn028.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (62739870946) Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-cdn025.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (62136511127) Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-cdn005.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (61943905274) Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-cdn027.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (61955797776) Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-cdn021.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (60045780252) Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-cdn029.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 58.05s (58048904552) Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-cdn023.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 57.34s (57338024825) Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-cdn008.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 53.13s (53130104765) Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-cdn004.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 43.24s (43238668974) Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 56.2m (3373973559713)
I omitted other relevant information, leaving only the information that interests me.Now I can clearly see which server node is the bottleneck: host xxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, which takes 56.2 minutes, while all other hosts take about 40 seconds to 2 minutes.
Now, I remember that another HDFS SCAN had the same problem, operation 01 (01:SCAN HDFS), so I executed the same egrep command (remember, indentation may be different for different operations, so I need to search for that operation in the Profile file and copy and paste all the spaces before it to get the results I want):
egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=1\)|^ - TotalTime: ' profile-example.txt
The result reconfirms the same conclusion:
.... Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 13.3m (798747290751) ... Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 28.16s (28163113611) Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 23.29s (23285966387) ...
Obviously, the same host xxxx-xxx-cdh-cdn015.xxx.XXXX.com has the same problem. It runs much slower than other hosts, 13.3 minutes longer than 28.16 seconds.
Then I came to the conclusion that there were some problems with the server that needed to be repaired.
To confirm my findings after troubleshooting, I asked users to stop the Impalad process on this host and test their queries again. They confirmed that the problem has been resolved.Later they reported that they found something wrong with the server's hardware and that it had been down and maintained.
Compiled from: IMPALA QUERY PROFILE EXPLAINED – PART 4