Hive explain implementation plan
hive provides the EXPLAIN command to show the execution plan of a query. This execution plan is very helpful for us to understand the underlying principle, hive tuning, troubleshooting data skew, etc
The syntax used is as follows:
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION | ANALYZE] query
explain can be followed by the following optional parameters. Note: these optional parameters are not supported by every version of hive
- extended: add extended to output additional information about the plan. This is usually physical information, such as file names. This extra information is of little use to us
- Dependency: the use of dependency in the EXPLAIN statement produces additional information about the input in the plan. It shows the various properties of the input
- AUTHORIZATION: displays the queries and AUTHORIZATION failures that all entities need to be authorized to execute (if any)
- VECTORIZATION: add details to the EXPLAIN output to show why Map and Reduce are not vectorized. Support from Hive 2.3.0
- ANALYZE: annotate the plan with the actual number of lines. Support from Hive 2.2.0
Execute the following statement on the hive client
explain SELECT city.city_name AS city_name, COUNT(*) AS citizen_cnt FROM citizen JOIN city ON citizen.city_id = city.city_id WHERE city.city_name != 'Shanghai' GROUP BY city.city_name HAVING COUNT(*) >= 2 ORDER BY city.city_name ASC LIMIT 2;
Results obtained:
Explain STAGE DEPENDENCIES: Stage-6 is a root stage Stage-2 depends on stages: Stage-6 Stage-3 depends on stages: Stage-2 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-6 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_1:city Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_1:city TableScan alias: city Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((city_name <> 'Shanghai') and city_id is not null) (type: boolean) Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: city_id (type: int), city_name (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 _col0 (type: int) 1 _col0 (type: int) Stage: Stage-2 Map Reduce Map Operator Tree: TableScan alias: citizen Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: city_id is not null (type: boolean) Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: city_id (type: int) outputColumnNames: _col0 Statistics: Num rows: 9 Data size: 65 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col2 Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() keys: _col2 (type: string) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 9 Data size: 71 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: bigint) Local Work: Map Reduce Local Work Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 4 Data size: 31 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col1 >= 2) (type: boolean) Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe Stage: Stage-3 Map Reduce Map Operator Tree: TableScan Reduce Output Operator key expressions: _col0 (type: string) sort order: + Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE TopN Hash Memory Usage: 0.1 value expressions: _col1 (type: bigint) Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 (type: bigint) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE Limit Number of rows: 2 Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 7 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: 2 Processor Tree: ListSink Time taken: 0.566 seconds, Fetched: 114 row(s)
A HIVE query is transformed into a sequence composed of one or more stages (directed acyclic graph DAG). These stages can be MapReduce stage s, stages responsible for metadata storage, or stages responsible for file system operations (such as moving and renaming).
Split the above results into two parts, starting from the outermost layer:
- stage dependencies: dependencies between stages
- stage plan: the execution plan of each stage
The first part is stage dependencies, which includes four stages. Stage-6 is the root stage. Stage-2 depends on Stage-6, Stage-3 depends on stage-2, and Stage-0 depends on Stage-1. Execute Stage-6 first, then stage-2, Stage-3, and finally Stage-0.
Part II stage plan,
Stage: Stage-6 is a Map Reduce Local Work process with only map (running tasks in local mode. For small data sets, the execution time can be significantly shortened)
Stage: Stage-2, Stage: Stage-3 are two complete Map Reduce processes, including
- Map Operator Tree: the execution plan tree on the MAP side
- Reduce Operator Tree: the execution plan tree on the reduce side
The two execution plan trees contain the operator s of this sql statement:
- The first operation on the map side must be to load tables, so it is a TableScan table scanning operation,
- alias: table name
- Statistics: table statistics, including the number and size of data in the table (the number of data is not accurate enough, and only one is displayed most of the time. I don't know if there is a problem with my operation!!)
- Filter Operator: filter operation,
- Predict: filter condition, ((city_name < > 'Shanghai') and city_id is not null) (type: boolean)), and the null data in the associated field will be removed (join only)
- Statistics: table statistics, including the number and size of data in the table
- Select Operator: select operation,
- expressions: required field name and field type
- outputColumnNames: the name of the output column
- Statistics: table statistics, including the number and size of data in the table
- Map Join Operator: join operation,
- condition map: join mode, such as inner join 0 to 1, left outer join 0 to 2
- Keys: condition field of join
- outputColumnNames: the fields to be output after the join is completed
- Statistics: the number and size of data pieces generated after the join is completed
- Group By Operator: group aggregation operation,
- Aggregates: displays aggregate function information, such as count(), sum(), agv()
- keys: grouped field. If there is no grouping, there is no such field
- Mode: aggregation mode, with hash: random aggregation, namely hash partition; partial: local aggregation; Final: final aggregation
- outputColumnNames: output column names after aggregation
- Statistics: table statistics, including the number and size of data after grouping and aggregation
- Reduce Output Operator: output to reduce operation,
- key expressions: key field name and field type
- sort order: do not sort if the value is empty; The value is + positive sort, and the value is - reverse sort; The columns with + - value are sorted in two columns. The first column is in positive order and the second column is in reverse order
- Map reduce partition columns: columns of partition
- Statistics: table statistics, including the number and size of data after grouping and aggregation
- value expressions: result field name and field type
- File Output Operator: file output operation,
- compressed: whether to compress
- Table: table information, including input and output file format, serialization, etc
- Fetch Operator client obtains data,
- Limit. A value of - 1 indicates that there is no limit on the number of entries. Other values are the number of entries
At this point, the explain query result of the whole sql can be clear. You can try to analyze it through explain when writing sql in the future.