Hive explain implementation plan

Hive explain implementation plan

 

hive provides the EXPLAIN command to show the execution plan of a query. This execution plan is very helpful for us to understand the underlying principle, hive tuning, troubleshooting data skew, etc

The syntax used is as follows:

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION | ANALYZE] query

explain can be followed by the following optional parameters. Note: these optional parameters are not supported by every version of hive

  1. extended: add extended to output additional information about the plan. This is usually physical information, such as file names. This extra information is of little use to us
  2. Dependency: the use of dependency in the EXPLAIN statement produces additional information about the input in the plan. It shows the various properties of the input
  3. AUTHORIZATION: displays the queries and AUTHORIZATION failures that all entities need to be authorized to execute (if any)
  4. VECTORIZATION: add details to the EXPLAIN output to show why Map and Reduce are not vectorized. Support from Hive 2.3.0
  5. ANALYZE: annotate the plan with the actual number of lines. Support from Hive 2.2.0

Execute the following statement on the hive client

Mainly sql citizen Table and city Table conduct join And sift it out city_name != " upper sea " According to the data, then city_name Group and count the total number of people in each city 2 City of
explain
SELECT 
city.city_name AS city_name,
COUNT(*) AS citizen_cnt 
FROM citizen 
JOIN city ON citizen.city_id = city.city_id 
WHERE city.city_name != 'Shanghai' 
GROUP BY city.city_name 
HAVING COUNT(*) >= 2 
ORDER BY city.city_name ASC 
LIMIT 2;

Results obtained:

Explain
STAGE DEPENDENCIES:
  Stage-6 is a root stage
  Stage-2 depends on stages: Stage-6
  Stage-3 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-6
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:city 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:city 
          TableScan
            alias: city
            Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_name <> 'Shanghai') and city_id is not null) (type: boolean)
              Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: city_id (type: int), city_name (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: citizen
            Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: city_id is not null (type: boolean)
              Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: city_id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col2
                  Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count()
                    keys: _col2 (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 4 Data size: 31 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col1 >= 2) (type: boolean)
            Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
              TopN Hash Memory Usage: 0.1
              value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 2
            Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 2
      Processor Tree:
        ListSink

Time taken: 0.566 seconds, Fetched: 114 row(s)

A HIVE query is transformed into a sequence composed of one or more stages (directed acyclic graph DAG). These stages can be MapReduce stage s, stages responsible for metadata storage, or stages responsible for file system operations (such as moving and renaming).

Split the above results into two parts, starting from the outermost layer:

  1. stage dependencies: dependencies between stages
  2. stage plan: the execution plan of each stage

The first part is stage dependencies, which includes four stages. Stage-6 is the root stage. Stage-2 depends on Stage-6, Stage-3 depends on stage-2, and Stage-0 depends on Stage-1. Execute Stage-6 first, then stage-2, Stage-3, and finally Stage-0.

Part II stage plan,

Stage: Stage-6 is a Map Reduce Local Work process with only map (running tasks in local mode. For small data sets, the execution time can be significantly shortened)

Stage: Stage-2, Stage: Stage-3 are two complete Map Reduce processes, including

  1. Map Operator Tree: the execution plan tree on the MAP side
  2. Reduce Operator Tree: the execution plan tree on the reduce side

The two execution plan trees contain the operator s of this sql statement:

  1. The first operation on the map side must be to load tables, so it is a TableScan table scanning operation,
    • alias: table name
    • Statistics: table statistics, including the number and size of data in the table (the number of data is not accurate enough, and only one is displayed most of the time. I don't know if there is a problem with my operation!!)
  2. Filter Operator: filter operation,
    • Predict: filter condition, ((city_name < > 'Shanghai') and city_id is not null) (type: boolean)), and the null data in the associated field will be removed (join only)
    • Statistics: table statistics, including the number and size of data in the table
  3. Select Operator: select operation,
    • expressions: required field name and field type
    • outputColumnNames: the name of the output column
    • Statistics: table statistics, including the number and size of data in the table
  4. Map Join Operator: join operation,
    • condition map: join mode, such as inner join 0 to 1, left outer join 0 to 2
    • Keys: condition field of join
    • outputColumnNames: the fields to be output after the join is completed
    • Statistics: the number and size of data pieces generated after the join is completed
  5. Group By Operator: group aggregation operation,
    • Aggregates: displays aggregate function information, such as count(), sum(), agv()
    • keys: grouped field. If there is no grouping, there is no such field
    • Mode: aggregation mode, with hash: random aggregation, namely hash partition; partial: local aggregation; Final: final aggregation
    • outputColumnNames: output column names after aggregation
    • Statistics: table statistics, including the number and size of data after grouping and aggregation
  6. Reduce Output Operator: output to reduce operation,
    • key expressions: key field name and field type
    • sort order: do not sort if the value is empty; The value is + positive sort, and the value is - reverse sort; The columns with + - value are sorted in two columns. The first column is in positive order and the second column is in reverse order
    • Map reduce partition columns: columns of partition
    • Statistics: table statistics, including the number and size of data after grouping and aggregation
    • value expressions: result field name and field type
  7. File Output Operator: file output operation,
    • compressed: whether to compress
    • Table: table information, including input and output file format, serialization, etc
  8. Fetch Operator client obtains data,
    • Limit. A value of - 1 indicates that there is no limit on the number of entries. Other values are the number of entries

At this point, the explain query result of the whole sql can be clear. You can try to analyze it through explain when writing sql in the future.

 

 

Keywords: hive

Added by freeheader on Tue, 08 Mar 2022 21:55:19 +0200