Hive dynamic partition

Hive provides a dynamic partitioning feature: it infers partition values from the positions of the query's columns and creates the corresponding partitions automatically.
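
For intuition, here is a minimal sketch contrasting a static-partition insert with a dynamic one (the logs and staging tables and the dt column are hypothetical, not from this article):

-- Static partition: the partition value is written literally.
insert into table logs partition(dt='2020-06-14')
select id, msg from staging where dt='2020-06-14';

-- Dynamic partition: Hive takes dt from the last select column and
-- creates one partition per distinct value.
insert into table logs partition(dt)
select id, msg, dt from staging;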

1. Configuration parameters

1.1 Main configuration parameters

The following are the configuration parameters involved in Hive dynamic partitioning:

set hive.exec.dynamic.partition=true   --Enables the dynamic partition feature. The default is false.
set hive.exec.dynamic.partition.mode=nonstrict  --Allows all partition columns to be dynamic; otherwise at least one partition column must be static. The default is strict.
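
Under the default strict mode, at least the leading partition column must be given a static value; only trailing columns may be dynamic. A minimal sketch, assuming a hypothetical table partitioned by dt and city:

-- Allowed even in strict mode: dt is static, city is dynamic.
insert into table logs partition(dt='2020-06-14', city)
select id, msg, city from staging;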

1.2 Tuning-related parameters

Tuning parameters related to dynamic partitioning:

set hive.exec.max.dynamic.partitions.pernode=100
--The maximum number of dynamic partitions each mapper or reducer may create. The default is 100; exceeding it raises an error. It can generally be set larger, such as 1000.

set hive.exec.max.dynamic.partitions=1000
--The maximum number of dynamic partitions a single statement may create (default 1000); exceeding it raises an error.

set hive.exec.max.created.files=100000
--The maximum number of HDFS files a job may create globally (default 100000); exceeding it raises an error.
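
If you are unsure what a parameter is currently set to, the Hive CLI prints its value when you issue set with just the parameter name; the output line below is illustrative:

hive> set hive.exec.max.dynamic.partitions;
hive.exec.max.dynamic.partitions=1000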

2. Implement basic dynamic partitioning

2.1 Data preparation

The following shows the steps to prepare the data and the data's contents:

hive> create table user_3g (id int, name string, city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.259 seconds
hive> load data local inpath '/root/book/3guser.txt' overwrite into table user_3g;
Loading data to table test.user_3g
Table test.user_3g stats: [numFiles=1, numRows=0, totalSize=127, rawDataSize=0]
OK
Time taken: 1.379 seconds
hive> select * from user_3g;
OK
1	Liu Bei	shu
2	Guan Yu	shu
3	Zhang Fei	shu
4	Cao Cao	wei
5	Cao Pi	wei
6	Sun Quan	wu
7	Sun Ce	wu
8	Sun Shangxiang	wu
9	Xiahou Dun	wei
Time taken: 0.354 seconds, Fetched: 9 row(s)

2.2 Create the partition tables

The following statement creates an internal (managed) partition table:

hive> create table user_3g_area(id int, name string) partitioned by(city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.081 seconds

The following creates an external partition table.

I already have a ready-made /test directory here; if you do not, remember to add the -p parameter so that the parent directories are created as well.

#To create an external table, first create a data directory
[root@hadoop ~]# hadoop fs -mkdir /test/user_xy_area_out/

#Then create a table

hive> create external table user_3g_area_out (id int, name string) partitioned by (city string) row format delimited fields terminated by '\t' location '/test/user_xy_area_out';
OK
Time taken: 0.336 seconds

2.3 Setting dynamic partition parameters

Only the main dynamic partition parameters are configured here. Other parameters can be configured as needed.

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

2.4 Start importing data

Internal partition table data import:

hive> insert into table user_3g_area partition(city) select id, name, city from user_3g;
Query ID = root_20200614162430_25d9fa2d-7811-484a-bb52-efa9dae1fa72
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592121918794_0001, Tracking URL = http://hadoop:8088/proxy/application_1592121918794_0001/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592121918794_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-14 16:24:44,429 Stage-1 map = 0%,  reduce = 0%
2020-06-14 16:24:52,638 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1592121918794_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area/.hive-staging_hive_2020-06-14_16-24-30_674_3273172977143207137-1/-ext-10000
Loading data to table test.user_3g_area partition (city=null)
	 Time taken for load dynamic partitions : 451
	Loading partition {city=wu}
	Loading partition {city=shu}
	Loading partition {city=wei}
	 Time taken for adding to write entity : 7
Partition test.user_3g_area{city=shu} stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
Partition test.user_3g_area{city=wei} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
Partition test.user_3g_area{city=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.19 sec   HDFS Read: 3768 HDFS Write: 258 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 190 msec
OK
Time taken: 25.395 seconds

External partition table data import:

hive> insert into table user_3g_area_out partition(city) select id,name,city from user_3g;
Query ID = root_20200615232438_f6a9d195-c416-4d5e-bd31-b96fb9c0133e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0001, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0001/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-15 23:24:53,191 Stage-1 map = 0%,  reduce = 0%
2020-06-15 23:25:02,148 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1592152196890_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/test/user_xy_area_out/.hive-staging_hive_2020-06-15_23-24-38_208_8331054339743666004-1/-ext-10000
Loading data to table test.user_3g_area_out partition (city=null)
	 Time taken for load dynamic partitions : 466
	Loading partition {city=wu}
	Loading partition {city=wei}
	Loading partition {city=shu}
	 Time taken for adding to write entity : 8
Partition test.user_3g_area_out{city=shu} stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
Partition test.user_3g_area_out{city=wei} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
Partition test.user_3g_area_out{city=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.19 sec   HDFS Read: 3718 HDFS Write: 270 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 190 msec
OK
Time taken: 26.51 seconds

2.5 Viewing partitions

To view the internal partition table:

hive> show partitions user_3g_area;
OK
city=shu
city=wei
city=wu
Time taken: 0.6 seconds, Fetched: 3 row(s)

To view the external partition table:

#View partition:
hive> show partitions user_3g_area_out;
OK
city=shu
city=wei
city=wu
Time taken: 0.138 seconds, Fetched: 3 row(s)

#View data directory
[root@hadoop ~]# hadoop fs -ls /test/user_xy_area_out/
20/06/15 23:29:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=shu
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=wei
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=wu

As the demonstration shows, dynamic partitioning works almost identically for internal and external tables; the only difference is where the data is stored.
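
To see that difference directly, you can list both data directories; the internal table's warehouse path is taken from the job output above:

#Internal table: data lives under the Hive warehouse
[root@hadoop ~]# hadoop fs -ls /user/hive/warehouse/test.db/user_3g_area/
#External table: data lives under the location we specified
[root@hadoop ~]# hadoop fs -ls /test/user_xy_area_out/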

Why call this "basic"? Because the partition values here are plain English characters; next we implement partitions whose values contain Chinese characters.

Note: when importing into a table with insert ... select, the number of query columns must exactly match the number of target columns, neither more nor fewer, or an error is raised. If the column types are inconsistent, however, no error is raised; the mismatched values are filled with NULL. Loading data with load data performs no check at all: extra fields are discarded, missing fields are filled with NULL, and type mismatches are likewise filled with NULL.
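
A short sketch of those rules, reusing the tables above (behavior as described in the note; the first statement should fail, the second should succeed with NULLs):

-- Fails: the query supplies two columns, but the target needs
-- id, name plus the dynamic partition column city (three in total).
insert into table user_3g_area partition(city)
select id, name from user_3g;

-- Succeeds: name (string) fed into id (int) cannot be cast,
-- so the id column is filled with NULL instead of raising an error.
insert into table user_3g_area partition(city)
select name, name, city from user_3g;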

3. Dynamic partitioning with Chinese characters

The steps are the same whether the partition values are Chinese or English; the difference lies in Hive and MySQL character set support and the resulting mojibake problems.

3.1 MySQL

MySQL versions before 5.7 cannot index very long character fields for Hive metadata: InnoDB limits index keys to 767 bytes, and utf8 uses up to 3 bytes per character, so the limit is reached quickly. For this reason, the Hive metastore database on MySQL before 5.7 uses the latin1 character set by default, and metadata storage does not support Chinese.

If, on MySQL before 5.7, you forcibly change the character set of the Hive metastore database or of fields in some of its tables, the following error occurs:

Specified key was too long; max key length is 767 bytes

So the character set cannot be changed directly because of the indexed field lengths. Alternatively, you can shorten those field lengths first and then change the character set; that method is not covered here.

When Chinese content is to appear in Hive's metadata, first make sure the MySQL database is version 5.7 or above: MySQL 5.7 raised the maximum supported index key length, which is enough for Hive's needs.

If a version before 5.7 is already installed, upgrade the database to 5.7 with the existing data in mind. For the upgrade procedure, see: MySQL 5.6.29 upgrade to MySQL 5.7.29.

If it is a fresh environment, install version 5.7 or above directly. To install Hive, see: HIVE introduction.

The following statements change the character sets involved in the Hive metastore database:

#Modify the database character set
alter database hive default character set utf8;
#Modify the table character sets
alter table BUCKETING_COLS default character set utf8;
alter table CDS default character set utf8;
alter table COLUMNS_V2 default character set utf8;
alter table DATABASE_PARAMS default character set utf8;
alter table DBS default character set utf8;
alter table FUNCS default character set utf8;
alter table FUNC_RU default character set utf8;
alter table GLOBAL_PRIVS default character set utf8;
alter table PARTITIONS default character set utf8;
alter table PARTITION_KEYS default character set utf8;
alter table PARTITION_KEY_VALS default character set utf8;
alter table PARTITION_PARAMS default character set utf8;
alter table PART_COL_STATS default character set utf8;
alter table ROLES default character set utf8;
alter table SDS default character set utf8;
alter table SD_PARAMS default character set utf8;
alter table SEQUENCE_TABLE default character set utf8;
alter table SERDES default character set utf8;
alter table SERDE_PARAMS default character set utf8;
alter table SKEWED_COL_NAMES default character set utf8;
alter table SKEWED_COL_VALUE_LOC_MAP default character set utf8;
alter table SKEWED_STRING_LIST default character set utf8;
alter table SKEWED_STRING_LIST_VALUES default character set utf8;
alter table SKEWED_VALUES default character set utf8;
alter table SORT_COLS default character set utf8;
alter table TABLE_PARAMS default character set utf8;
alter table TAB_COL_STATS default character set utf8;
alter table TBLS default character set utf8;
alter table VERSION default character set utf8;
#Modify the column character sets (convert existing data to utf8)
alter table BUCKETING_COLS convert to character set utf8;
alter table CDS convert to character set utf8;
alter table COLUMNS_V2 convert to character set utf8;
alter table DATABASE_PARAMS convert to character set utf8;
alter table DBS convert to character set utf8;
alter table FUNCS convert to character set utf8;
alter table FUNC_RU convert to character set utf8;
alter table GLOBAL_PRIVS convert to character set utf8;
alter table PARTITIONS convert to character set utf8;
alter table PARTITION_KEYS convert to character set utf8;
alter table PARTITION_KEY_VALS convert to character set utf8;
alter table PARTITION_PARAMS convert to character set utf8;
alter table PART_COL_STATS convert to character set utf8;
alter table ROLES convert to character set utf8;
alter table SDS convert to character set utf8;
alter table SD_PARAMS convert to character set utf8;
alter table SEQUENCE_TABLE convert to character set utf8;
alter table SERDES convert to character set utf8;
alter table SERDE_PARAMS convert to character set utf8;
alter table SKEWED_COL_NAMES convert to character set utf8;
alter table SKEWED_COL_VALUE_LOC_MAP convert to character set utf8;
alter table SKEWED_STRING_LIST convert to character set utf8;
alter table SKEWED_STRING_LIST_VALUES convert to character set utf8;
alter table SKEWED_VALUES convert to character set utf8;
alter table SORT_COLS convert to character set utf8;
alter table TABLE_PARAMS convert to character set utf8;
alter table TAB_COL_STATS convert to character set utf8;
alter table TBLS convert to character set utf8;
alter table VERSION convert to character set utf8;
#Modify the session character set variables. Run these if the database defaults are not UTF-8.
SET character_set_client = utf8 ;
SET character_set_connection = utf8; 
SET character_set_database = utf8; 
SET character_set_results = utf8; 
SET character_set_server = utf8; 
SET collation_connection = utf8_general_ci;
SET collation_database = utf8_general_ci;
SET collation_server = utf8_general_ci;
SET NAMES 'utf8';
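
After running these statements, you can verify that the changes took effect, for example:

#Check the database default character set
show create database hive;
#Check the current character set variables
show variables like 'character_set_%';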

3.2 Implementation

When all the above are ready, you can proceed to the following steps:

#Prepare data
hive> create table user_xy(id int, name string, city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.102 seconds
hive> load data local inpath '/root/book/user.txt' overwrite into table user_xy;
Loading data to table test.user_xy
Table test.user_xy stats: [numFiles=1, numRows=0, totalSize=232, rawDataSize=0]
OK
Time taken: 0.33 seconds
hive> select * from user_xy;
OK
1	Sun Wukong	Huaguo Mountain
2	Zhu Bajie	Gao Laozhuang
3	Sand monk	Liusha River
4	Tang Xuanzang	Xi'an
5	Little white dragon	Qingtan stream
6	Tang Xuanzong	Xi'an
7	Avalokitesvara	South China Sea
8	Jade Emperor	Lingxiao Palace
9	Yellow wind monster	Xiaoleiyin Temple
10	Tathagata Buddha	Leiyin Temple
Time taken: 0.094 seconds, Fetched: 10 row(s)
#Set parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
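#The target table user_xy_area is assumed to already exist; if it does not,
#create it the same way as the table in section 2.2:
hive> create table user_xy_area(id int, name string) partitioned by(city string) row format delimited fields terminated by '\t';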
#Start dynamic partition import data
hive> insert into table user_xy_area partition(city) select id,name,city from user_xy;
Query ID = root_20200614163041_ff9e15f6-ce41-4be2-bb86-ea569cab795e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592121918794_0003, Tracking URL = http://hadoop:8088/proxy/application_1592121918794_0003/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592121918794_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-14 16:30:49,649 Stage-1 map = 0%,  reduce = 0%
2020-06-14 16:30:57,202 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.26 sec
MapReduce Total cumulative CPU time: 1 seconds 260 msec
Ended Job = job_1592121918794_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_area/.hive-staging_hive_2020-06-14_16-30-41_545_7128885006398250032-1/-ext-10000
Loading data to table test.user_xy_area partition (city=null)
	 Time taken for load dynamic partitions : 793
	Loading partition {city=Huaguo Mountain}
	Loading partition {city=Lingxiao Palace}
	Loading partition {city=Leiyin Temple}
	Loading partition {city=South China Sea}
	Loading partition {city=Qingtan stream}
	Loading partition {city=Gao Laozhuang}
	Loading partition {city=Liusha River}
	Loading partition {city=Xi'an}
	Loading partition {city=Xiaoleiyin Temple}
	 Time taken for adding to write entity : 1
Partition test.user_xy_area{city=Lingxiao Palace} stats: [numFiles=1, numRows=1, totalSize=15, rawDataSize=14]
Partition test.user_xy_area{city=South China Sea} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=Xiaoleiyin Temple} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=Liusha River} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=Qingtan stream} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=Huaguo Mountain} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=Xi'an} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_xy_area{city=Leiyin Temple} stats: [numFiles=1, numRows=1, totalSize=16, rawDataSize=15]
Partition test.user_xy_area{city=Gao Laozhuang} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.26 sec   HDFS Read: 4053 HDFS Write: 632 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 260 msec
OK
Time taken: 19.53 seconds

#Check the data, no problem
hive> select * from user_xy_area;
OK
8	Jade Emperor	Lingxiao Palace
7	Avalokitesvara	South China Sea
9	Yellow wind monster	Xiaoleiyin Temple
3	Sand monk	Liusha River
5	Little white dragon	Qingtan stream
1	Sun Wukong	Huaguo Mountain
4	Tang Xuanzang	Xi'an
6	Tang Xuanzong	Xi'an
10	Tathagata Buddha	Leiyin Temple
2	Zhu Bajie	Gao Laozhuang
Time taken: 0.242 seconds, Fetched: 10 row(s)

3.3 Garbled partition names

View the partitions:

hive> show partitions user_xy_area;
OK
city=� ��
city=Ww
city=���
city=A��
city=m�
city=��q
city�
city=���
city=� �
Time taken: 0.135 seconds, Fetched: 9 row(s)

It's garbled. Maddening, isn't it? One pit after another!

Then modify the configuration file hive-site.xml:

[root@hadoop conf]# vim hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
</configuration>

The fragment added to the connection URL is:

&amp;useUnicode=true&amp;characterEncoding=UTF-8

Note the delimiters: the & symbol must be escaped as &amp; in an XML file.

If there is no escape, the following error occurs:

[root@hadoop ~]# hive
[Fatal Error] hive-site.xml:21:105: The reference to entity "characterEncoding" must end with the ';' delimiter.

After everything is done and Hive is restarted, the result is as follows:

hive> show partitions user_xy_area;
OK
city=� ��
city=Ww
city=���
city=A��
city=m�
city=��q
city�
city=���
city=� �
Time taken: 0.135 seconds, Fetched: 9 row(s)

Still garbled. For now I cannot solve this problem! If any expert has a solution, please let me know, thank you!

Added by alexcmm on Sun, 02 Jan 2022 19:26:32 +0200