[Hard Hive] Hive Basics (18): Hive DDL Syntax: Partition Tables and Bucket Tables

Welcome to the blog home page: WeChat search "import_bigdata", a hard-core original author in the big data field, import_bigdata's CSDN blog.
Likes, bookmarks, and comments are welcome!
This article was originally written by [Wang Zhiwu] and first published on the CSDN blog. Reprinting without permission from CSDN and the author is strictly prohibited!

This article is the Hive section of [Hard Big Data Learning Route], a learning guide from zero to expert (fully upgraded edition).

1 partition table

A partition table corresponds to an independent folder on the HDFS file system, and under that folder sit all the data files of the partition. A partition in Hive is a subdirectory that divides a large data set into smaller data sets according to business needs. At query time, selecting only the required partitions through expressions in the WHERE clause greatly improves query efficiency.

1.1 basic operation of partition table

1) A use case for partition tables (managing log files by date, simulated here with department information)

dept_20200401.log
dept_20200402.log
dept_20200403.log

2) Create partition table syntax

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Note: the partition field cannot be a column that already exists in the table; it can be regarded as a pseudo-column of the table.
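Because the partition field behaves like an ordinary column, it can be referenced in the select list as well as in WHERE clauses; a quick illustration (assumes the dept_partition table above, after data has been loaded):

```sql
hive (default)> select deptno, dname, day from dept_partition;
```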

3) Load data into partitioned table

(1) Data preparation

dept_20200401.log

10 ACCOUNTING 1700
20 RESEARCH 1800

dept_20200402.log

30 SALES 1900
40 OPERATIONS 1700

dept_20200403.log

50 TEST 2000
60 DEV 1900

(2) Load data

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table dept_partition 
partition(day='20200401');
hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200402.log' into table dept_partition 
partition(day='20200402');
hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200403.log' into table dept_partition 
partition(day='20200403');

Note: the partition must be specified when loading data into a partition table.
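After the three loads above, each partition appears as its own subdirectory under the table's HDFS directory; one way to check (the warehouse path below is an assumption based on the default hive.metastore.warehouse.dir):

```sql
hive (default)> dfs -ls /user/hive/warehouse/dept_partition;
```

The listing should show the subdirectories day=20200401, day=20200402, and day=20200403.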

4) Query data in partition table

Single partition query  

hive (default)> select * from dept_partition where day='20200401';

Multi-partition query (via union or OR conditions)

hive (default)> select * from dept_partition where day='20200401'
 union
 select * from dept_partition where day='20200402'
 union
 select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
 day='20200402' or day='20200403';

5) Add partition

Create a single partition

hive (default)> alter table dept_partition add partition(day='20200404');

Create multiple partitions at the same time

hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

Note: when adding multiple partitions, the partition clauses are separated by spaces, not commas.

6) Delete partition

Delete a single partition

hive (default)> alter table dept_partition drop partition(day='20200406');

Delete multiple partitions at the same time

hive (default)> alter table dept_partition drop partition(day='20200404'), partition(day='20200405');

Note: when dropping multiple partitions, the partition clauses are separated by commas.

7) View how many partitions the partition table has

hive> show partitions dept_partition;

8) View partition table structure

hive> desc formatted dept_partition;
# Partition Information 
# col_name data_type comment 
day string

1.2 Secondary partitions

Thinking: how to split the log data when there is a large amount of log data in a day?

1) Create secondary partition table  

hive (default)> create table dept_partition2(
 deptno int, dname string, loc string
 )
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';

2) Normal load data

(1) Load data into secondary partition table

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');

(2) Query partition data  

hive (default)> select * from dept_partition2 where day='20200401' and 
hour='12';

3) Three ways to upload data directly to a partition directory and associate it with the partition table

(1) Method 1: repair after uploading data

Upload data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log 
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query the data (the data just uploaded cannot be queried yet, because the new partition is not registered in the metastore)

hive (default)> select * from dept_partition2 where day='20200401' and 
hour='13';

Execute the repair command (msck repair table scans the table's directory on HDFS and registers in the metastore any partitions missing from it)

hive> msck repair table dept_partition2;

Query data again

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

(2) Method 2: add partition after uploading data

Upload data  

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log 
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Execute add partition

hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and hour='14';

(3) Method 3: after creating a folder, load data to the partition

Create directory  

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

Upload data  

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401',hour='15');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

1.3 Dynamic partition adjustment

In a relational database, when inserting data into a partitioned table, the database automatically routes each row into the corresponding partition according to the value of the partition field. Hive provides a similar mechanism, dynamic partitioning, but Hive's dynamic partitioning requires the corresponding configuration.

1) Enable dynamic partition parameter setting

(1) Enable the dynamic partition function (true by default, enabled)

hive.exec.dynamic.partition=true

(2) Set non-strict mode (the dynamic partition mode is strict by default; strict mode requires at least one partition field to be specified as a static partition, while nonstrict mode allows all partition fields to use dynamic partitioning)

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum number of dynamic partitions that can be created across all MR nodes. Default 1000

hive.exec.max.dynamic.partitions=1000

(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter needs to be set according to the actual data. For example, if the source data contains one year of data, i.e. the day field has 365 distinct values, this parameter must be set greater than 365; with the default value of 100, an error is reported.

hive.exec.max.dynamic.partitions.pernode=100

(5) How many HDFS files can be created in the whole MR Job. Default 100000

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Generally, no setting is required. Default false

hive.error.on.empty.partition=false 
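Putting the parameters above together, a session-level setup for the case practice below might look like this (all values are the defaults discussed above, except the partition mode):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.created.files=100000;
set hive.error.on.empty.partition=false;
```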

2) Case practice

Requirement: insert the data in the dept table into the corresponding partitions of the target table dept_partition_dy by region (the loc field).

(1) Create target partition table  

hive (default)> create table dept_partition_dy(id int, name string) 
partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Set dynamic partition  

set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

(3) View the partition of the target partition table  

hive (default)> show partitions dept_partition_dy;

Think: how does the target partition table match to the partition field?  
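A hint for the question above: in a dynamic partition insert, Hive matches the partition field by position, not by name; the last column(s) of the SELECT supply the partition value(s). In the insert statement above, loc fills partition(loc) simply because it is listed last:

```sql
-- the last select column (loc) supplies the value for partition(loc)
hive (default)> insert into table dept_partition_dy partition(loc)
                select deptno, dname, loc from dept;
```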

2 Bucket tables

Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be divided into reasonable partitions. For a table or partition, Hive can organize the data further into buckets, a finer-grained division of the data range.

Bucketing is another technique for breaking a data set into parts that are easier to manage.

Partitioning targets the data's storage path (directories); bucketing targets the data files themselves.

1) Create a bucket table

(1) Data preparation

1001 ss1
1002 ss2
1003 ss3
1004 ss4
1005 ss5
1006 ss6
1007 ss7
1008 ss8
1009 ss9
1010 ss10
1011 ss11
1012 ss12
1013 ss13
1014 ss14
1015 ss15
1016 ss16

(2) Create bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

(3) View table structure

hive (default)> desc formatted stu_buck;
Num Buckets: 4

(4) Import data into the bucket table via load

hive (default)> load data inpath '/student.txt' into table stu_buck;

(5) Check whether the created bucket table is divided into 4 buckets (the table's directory on HDFS should contain 4 files)

(6) Query bucket data

hive(default)> select * from stu_buck;

(7) Bucketing rule:

As the results show, Hive decides which bucket a record is stored in by hashing the value of the bucketing column and taking the remainder modulo the number of buckets.
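The rule can be checked directly in Hive (hash is Hive's built-in hash function and pmod keeps the remainder non-negative; for an int column, hash(id) is simply id):

```sql
hive (default)> select id, pmod(hash(id), 4) as bucket_no from stu_buck;
```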

2) Notes on operating bucket tables:

(1) Set the number of reducers to -1 and let the Job decide how many reducers to use, or set the number of reducers greater than or equal to the number of buckets in the bucket table

(2) Load data from HDFS into the bucket table, to avoid the problem of a local file not being found

(3) Do not use local mode

3) Import data into bucket table in insert mode

hive(default)>insert into table stu_buck select * from student_insert; 

3 sampling query

For very large data sets, users sometimes need a representative subset of the results rather than all of them. Hive can meet this need by sampling the table.

Syntax: TABLESAMPLE(BUCKET x OUT OF y)

Query sampled data from table stu_buck:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
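With 4 buckets, tablesample(bucket x out of y on id) returns roughly 4/y buckets' worth of data, starting at bucket x and then taking every y-th bucket; two sketches:

```sql
-- samples buckets 1 and 3: about half of the data
hive (default)> select * from stu_buck tablesample(bucket 1 out of 2 on id);
-- samples about half of bucket 1: about one eighth of the data
hive (default)> select * from stu_buck tablesample(bucket 1 out of 8 on id);
```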

Note: the value of x must be less than or equal to the value of y; otherwise the query fails with:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

 


Added by jsschmitt on Sat, 11 Sep 2021 04:27:01 +0300