Understanding Impala query profile step by step

In the third part of this series, I walked through an example Impala query profile in detail. In this part, I will use a Profile file from a real production system to show, step by step, how I read a Profile when a problem is reported and how I gradually narrow down the real cause. If you missed the first three parts of this series, you can read them via the links below:

Okay, let's get started. The Profile file itself is quite large because it involves many impalad processes, and the page would look ugly if I showed the full content inline. So please download the file from the link and view it on your local computer or in a browser.

The problem with this query was that, for some unknown reason, the same query used to finish in a few minutes but now took more than an hour to complete. This Profile file is just one example; in fact, every query running in this cluster had the same problem at the time. So take a moment to look through the Profile and see whether you can spot anything useful and get a general idea of the scenario in which the problem occurred.

Now let's take a look at the steps to troubleshoot the problem.

1. Since the user reported that the query was taking longer than normal, the first thing I want to know is exactly how long it took. So I look at the beginning of the Profile for the query's start and end times:

Start Time: 2020-01-03 07:33:42.928171000
End Time: 2020-01-03 08:47:55.745537000

The query took 1 hour and 14 minutes to complete, which is consistent with the user's feedback.

2. The query ended abnormally because the user cancelled it:

Query State: EXCEPTION
Query Status: Cancelled

The user simply lost patience and cancelled the query because it had been running for too long, so there is nothing to worry about here: the EXCEPTION state is just the result of the cancellation.
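Incidentally, all of these header fields can be pulled out with a single grep. Here is a quick sketch, assuming the Profile is saved as profile-example.txt (the same file name used with egrep later in this post):

# Extract the timing and status fields from the Profile header
grep -E 'Start Time:|End Time:|Query State:|Query Status:' profile-example.txt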

3. Next, I check how complex the query is:

    Sql Statement: 
SELECT COUNT(*),NVL(SUM(NVL(NUMBER_OF_CALLS,0)),0) 
FROM cef_hpmn 
WHERE settlement_month IN (
    SELECT DISTINCT dd.month_id 
    FROM  ext_call_event_fact cef, date_dim dd 
    WHERE CAST(settlement_date_id AS bigint) = dd.date_id
) 
AND process_date = TO_TIMESTAMP ('01/02/2020', 'MM/dd/yyyy')

I have reformatted it for readability. You can see that the query is quite simple: a straightforward SELECT statement with a subquery in the WHERE clause.

4. The next thing I note is the hostname of the Coordinator node:

Coordinator: xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000

If I can't draw any conclusions from the Profile file alone, the next step would be to fetch the impalad log from this host. But since I don't have the log yet, I simply continue with the Profile for now.
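As an aside, when I do get to the log, I grep it for the query ID. A rough sketch, assuming the CDH default log location of /var/log/impalad (adjust the path for your deployment); the ID prefix below is the query ID portion shared by all the fragment instance IDs that appear later in this Profile:

# Run on the coordinator host; impalad.INFO is the current INFO log
grep '94481a81355e51e4' /var/log/impalad/impalad.INFO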

5. As I mentioned in my previous article, the next part of the Profile file is the Query Plan. I usually skip it at first and jump straight to the Summary section to see how long each operation took, and whether anything obvious stands out:

Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                             
-----------------------------------------------------------------------------------------------------------
12:AGGREGATE              1    0.000ns    0.000ns        0           1   16.00 KB       10.00 MB  FINALIZE
11:EXCHANGE               1    0.000ns    0.000ns       28           1  232.00 KB              0  UNPA...
06:AGGREGATE             29   61.358ms  110.536ms       28           1   13.60 MB       10.00 MB  
05:HASH JOIN             29  223.055ms  345.295ms  277.41M     116.66M    2.00 MB        1.94 MB  LEFT... 
|--10:EXCHANGE           29   29.977us   67.789us        2         261   24.00 KB              0  BROADCAST
|  09:AGGREGATE          29    1.569ms    2.903ms        2         261    1.96 MB       10.00 MB  FINALIZE
|  08:EXCHANGE           29   33.880us  419.580us       58         261  240.00 KB              0  HASH(..
|  04:AGGREGATE          29    9s751ms      3m53s       58         261    4.58 MB       10.00 MB  STREAMING
|  03:HASH JOIN          29    1s099ms    2s635ms  411.27M     411.27M    3.35 MB        1.94 MB  INNER ..
|  |--07:EXCHANGE        29  142.532us  334.794us    8.07K       8.07K  304.00 KB              0  BROADCAST 
|  |  02:SCAN HDFS        1    1.700ms    1.700ms    8.07K       8.07K  332.94 KB       32.00 MB  xx..             
|  01:SCAN HDFS          29   44s391ms     13m18s  411.27M     411.27M  525.95 MB        4.38 GB  xx.. 
00:SCAN HDFS             29       3m7s     56m13s  277.41M     116.66M    7.83 MB      160.00 MB  xx..

Here is how I read the Summary:

  • Read it from the bottom up, because that is the order in which Impala executes the operations
  • Compare the Avg Time and Max Time columns
  • Compare the #Rows and Est. #Rows columns
  • Check the Detail column for the Join type of each operation

Very quickly, I noticed a big difference between Avg Time and Max Time for the HDFS scan (SCAN HDFS) operation: the average across the 29 hosts was 3 minutes and 7 seconds, but the slowest host took 56 minutes and 13 seconds. Reading on, I noticed the same problem with the second HDFS scan: 44 seconds on average versus a maximum of 13 minutes and 18 seconds.

So my next step is to determine which host was slower than the others, and whether the slow scans came from the same host. To do this, I search the Profile file for the string "id=0", since 0 is the operator number at the start of the "00:SCAN HDFS" line in the Summary section.
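Before stepping through the matches one by one, it can also help to list every place operator 0 appears, with line numbers; a quick sketch against the same file:

# -F treats the pattern as a fixed string, so the parentheses need no escaping
grep -nF 'HDFS_SCAN_NODE (id=0)' profile-example.txt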

Searching from the beginning of the file, the first match for "id=0" is:

    HDFS_SCAN_NODE (id=0)
 ....
            - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995)
              - MaterializeTupleTime(*): 226ms (226568242)
              - ScannerThreadsSysTime: 322ms (322168172)
              - ScannerThreadsUserTime: 6.76s (6758158482)
            - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907)
            - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560)
            - TotalRawHdfsReadTime(*): 3.4m (201957505069)
            - TotalReadThroughput: 749.9 KiB/s (767874)
            - TotalTime: 3.1m (187289950304)

The TotalTime here is 3.1 minutes, which matches the 3m7s Avg Time in the Summary section, so this must be the averaged fragment. To confirm, I scroll back up and see the following:

Averaged Fragment F00
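Rather than scrolling, you can also locate the averaged-fragment headings directly:

# List all averaged fragments with their line numbers
grep -n 'Averaged Fragment' profile-example.txt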

Continuing the search through the file, the next match for "id=0" is:

            HDFS_SCAN_NODE (id=0)
....
              - ScannerThreadsTotalWallClockTime: 10.4m (626435081910)
                - MaterializeTupleTime(*): 278ms (278689886)
                - ScannerThreadsSysTime: 266ms (266094000)
                - ScannerThreadsUserTime: 5.75s (5748833000)
              - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285)
              - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839)
              - TotalRawHdfsReadTime(*): 1.9m (114598713147)
              - TotalReadThroughput: 731.0 KiB/s (748535)
              - TotalTime: 2.1m (125005670562)

This instance took 2.1 minutes, faster than the 3.1-minute average, so I scroll back up to identify the host:

    Fragment F00
      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)

So there are three key pieces of information to collect for each instance:

      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 2.1m (125005670562)

Next, I use grep to extract just those lines. Since the Profile file is indented consistently, I can use egrep with the exact leading whitespace:

egrep ' Instance .*\)|^            HDFS_SCAN_NODE \(id=0\)|^              - TotalTime: ' profile-example.txt

The results are as follows:

...
      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 2.1m (125005670562)
      Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.9m (114395426955)
      Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-cdn020.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.5m (92671503850)
      Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-cdn012.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (86459970122)
      Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-cdn014.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (82187347776)
      Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-cdn006.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (82146306944)
      Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-cdn024.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (80468400288)
      Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-cdn022.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (79714897965)
      Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-cdn017.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (78877950983)
      Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (77593734314)
      Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-cdn019.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (76164245478)
      Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-cdn007.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (75588331159)
      Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-cdn010.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (73596530464)
      Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-cdn018.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (72946574082)
      Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-cdn026.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (69918383242)
      Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-cdn011.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (69355611992)
      Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (68527129814)
      Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-cdn016.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (67249633571)
      Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-cdn013.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (63989781076)
      Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-cdn028.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (62739870946)
      Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-cdn025.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (62136511127)
      Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-cdn005.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (61943905274)
      Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-cdn027.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (61955797776)
      Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-cdn021.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (60045780252)
      Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-cdn029.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 58.05s (58048904552)
      Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-cdn023.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 57.34s (57338024825)
      Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-cdn008.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 53.13s (53130104765)
      Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-cdn004.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 43.24s (43238668974)
      Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 56.2m (3373973559713)

I have omitted the other details, keeping only the lines that interest me. Now I can clearly see which node is the bottleneck: host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com took 56.2 minutes, while every other host took between roughly 40 seconds and 2 minutes.
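Rather than eyeballing the long listing above, a small awk/sort pipeline can rank the instances by scan time. This is my own quick sketch, not part of the original workflow: it pairs each instance's host with the nanosecond value in parentheses on the scan node's TotalTime line, then sorts descending so the slowest host floats to the top (the one line with no host is the averaged fragment, which precedes the per-instance sections):

awk '/ Instance .*host=/       { host = $NF }          # remember the current host
     /HDFS_SCAN_NODE \(id=0\)/ { in_scan = 1; next }   # entered scan node 0
     in_scan && /- TotalTime:/ { ns = $NF
                                 gsub(/[()]/, "", ns)  # strip the parentheses
                                 print ns, host
                                 in_scan = 0 }' profile-example.txt | sort -rn | head -5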

I also remember that the other HDFS scan, operation 01 (01:SCAN HDFS), had the same problem, so I run the same egrep command (note that the indentation can differ between operations, so I first find the operation in the Profile file and copy the exact leading spaces into the pattern):

egrep ' Instance .*\)|^            HDFS_SCAN_NODE \(id=1\)|^              - TotalTime: ' profile-example.txt

The result reconfirms the same conclusion:

....
      Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 13.3m (798747290751)
...
      Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 28.16s (28163113611)
      Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 23.29s (23285966387)
...

Clearly the same host, xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, is again the outlier: it took 13.3 minutes while the other hosts took under 30 seconds.

I therefore concluded that something was wrong with that particular server and it needed to be looked at.

To confirm my finding, I asked the user to stop the impalad process on this host and re-run the query, and they confirmed that the problem was resolved. Later they reported that they had indeed found a hardware fault on that server, and it was taken offline for maintenance.

Compiled from: IMPALA QUERY PROFILE EXPLAINED – PART 4
