Exception analysis of Hive SQL inserts into dynamic partitions

82.1 Problem

When an INSERT ... SELECT statement inserts data into a Parquet or ORC table with dynamic partitioning enabled, the statement fails to execute.
The Hive client reports:

Task with the most failures(4):
Diagnostic Messages for this Task:
Error: GC overhead limit exceeded
...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Checking the failed map task's log in the YARN web UI (port 8088) shows:

2017-10-27 17:08:04,317 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

82.2 Solutions

Method 1: set hive.optimize.sort.dynamic.partition to true

  • This optimization uses relatively little memory when writing Parquet files, at the cost of sorting on the partition column
  • With it enabled, a job that would otherwise be map-only gains a reduce phase, so the dynamic partition column (for example, the date) arrives at the reducers in sorted order
  • Because the partition column is sorted, each reducer only needs to keep one record writer open at any time; once all rows for a given partition have been received, that writer is closed, which reduces memory pressure
SET hive.optimize.sort.dynamic.partition=true; 
INSERT OVERWRITE TABLE [table] SELECT ...

Method 2: increase the memory allocated to each mapper, i.e. raise mapreduce.map.memory.mb and mapreduce.map.java.opts, so that there is more memory available for all of the open file writer buffers
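
A minimal sketch of Method 2 (the concrete values are illustrative, not from the article, and should be tuned for your cluster):

SET mapreduce.map.memory.mb=4096;        -- container memory for each map task
SET mapreduce.map.java.opts=-Xmx3072m;   -- JVM heap, kept at roughly 75% of the container size
INSERT OVERWRITE TABLE [table] SELECT ...
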
Method 3: break the query into several smaller queries so that each query creates fewer partitions
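
An illustrative sketch of Method 3, assuming a day partition column and a source_table name that are not in the original article:

-- each statement creates only half a year's worth of partitions
INSERT OVERWRITE TABLE [table] PARTITION (day)
SELECT ..., day FROM source_table WHERE day >= '2017-01-01' AND day < '2017-07-01';
INSERT OVERWRITE TABLE [table] PARTITION (day)
SELECT ..., day FROM source_table WHERE day >= '2017-07-01' AND day < '2018-01-01';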

  • Hive allocates a 128MB buffer for each open Parquet file
  • The buffer size is controlled by the parameter parquet.block.size
  • For best performance, the Parquet buffer size should be aligned with the HDFS block size (for example, equal to it), so that each Parquet file sits within a single HDFS block and each I/O request can read the entire data file without fetching subsequent blocks over the network
-- set the Parquet buffer size to 256MB (value in bytes)
set parquet.block.size=268435456;
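
A hedged companion to the setting above (not from the original article): to keep each Parquet file within a single HDFS block, the HDFS block size used when writing can be set to the same value; dfs.blocksize is a standard client-side Hadoop setting, but whether it may be overridden depends on the cluster configuration.

-- write the new files with a 256MB HDFS block, matching parquet.block.size
set dfs.blocksize=268435456;
set parquet.block.size=268435456;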

Descriptions of some parameters related to dynamic partitioning

hive.exec.dynamic.partition
## Default: false
## Whether dynamic partitioning is enabled; it is off by default
## This parameter must be set to true when using dynamic partitions

mapreduce.input.fileinputformat.split.maxsize
mapreduce.input.fileinputformat.split.minsize
## These two parameters together control the number of map tasks in the MapReduce job
## For example, setting both to 1073741824 makes each map task process roughly 1GB of input
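
For example, a hedged illustration of the 1GB-per-map setting described above:

set mapreduce.input.fileinputformat.split.maxsize=1073741824;
set mapreduce.input.fileinputformat.split.minsize=1073741824;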

mapreduce.map.java.opts
## The JVM heap size of the map task; generally set to at most 75% of mapreduce.map.memory.mb so that the task has enough memory outside the heap

mapreduce.map.memory.mb
## The physical memory allocated to the map task's container, usually set to 1GB, 2GB, 4GB, etc.

hive.exec.max.created.files
## Default: 100000
## The maximum number of HDFS files that can be created by the whole MR job
## The default is generally sufficient; increase it only when the data volume is very large and the job needs to create more than 100000 files

hive.exec.max.dynamic.partitions
## Default: 1000
## The maximum number of dynamic partitions that can be created across all MR nodes

hive.exec.max.dynamic.partitions.pernode
## Default: 100
## The maximum number of dynamic partitions that can be created on each node executing the MR job
## This parameter must be set according to the actual data
## For example, if the source data covers one year, i.e. the day field has 365 distinct values, this parameter must be set to a value greater than 365; with the default of 100 the job will fail


hive.exec.dynamic.partition.mode
## Default: strict
## The dynamic partition mode; the default strict requires at least one partition column to be specified as a static partition, while nonstrict allows all partition columns to be dynamic
## It usually needs to be set to nonstrict
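
Putting these parameters together, a minimal end-to-end sketch (the table and column names are illustrative and the limits should be sized to your data; the one-year day example above would need a per-node limit greater than 365):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=400;
-- the dynamic partition column (day) must be the last column in the SELECT list
INSERT OVERWRITE TABLE target_table PARTITION (day)
SELECT col1, col2, day FROM source_table;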
