This article was originally written by [Wang Zhiwu] and first published on the CSDN blog. Reprinting without permission from CSDN and the author is strictly prohibited.
This article is the Hive section of the [Hard-Core Big Data Learning Route] guide for going from zero to big data expert (fully upgraded version).
1 Partition table
A partition table corresponds to an independent folder on HDFS; under this folder are all the data files of that partition. A partition in Hive is a subdirectory, which divides a large data set into smaller data sets according to business needs. When querying, selecting only the required partitions through expressions in the WHERE clause greatly improves query efficiency.
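For intuition, each partition is just a subdirectory under the table's directory on HDFS. A rough sketch of the layout for the dept_partition table created below (assuming the default warehouse location; the exact prefix depends on your configuration):
/user/hive/warehouse/dept_partition/day=20200401/dept_20200401.log
/user/hive/warehouse/dept_partition/day=20200402/dept_20200402.log
/user/hive/warehouse/dept_partition/day=20200403/dept_20200403.log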
1.1 Basic operations on partition tables
1) A typical use case for partition tables: managing logs by date (simulated here with department data)
dept_20200401.log
dept_20200402.log
dept_20200403.log
2) Create partition table syntax
hive (default)> create table dept_partition(
                deptno int, dname string, loc string
                )
                partitioned by (day string)
                row format delimited fields terminated by '\t';
Note: the partition field cannot be a column that already exists in the table; it can be regarded as a pseudo-column of the table.
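Even though the partition field is not stored in the data files, it can be referenced like an ordinary column. A minimal sketch (assuming the dept_partition table above has been loaded):
hive (default)> select deptno, dname, day from dept_partition;
hive (default)> select day, count(*) from dept_partition group by day;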
3) Load data into partitioned table
(1) Data preparation
dept_20200401.log
10	ACCOUNTING	1700
20	RESEARCH	1800
dept_20200402.log
30	SALES	1900
40	OPERATIONS	1700
dept_20200403.log
50	TEST	2000
60	DEV	1900
(2) Load data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
Note: the partition must be specified when loading data into a partition table.
4) Query data in partition table
Single partition query
hive (default)> select * from dept_partition where day='20200401';
Multi partition joint query
hive (default)> select * from dept_partition where day='20200401'
                union
                select * from dept_partition where day='20200402'
                union
                select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or day='20200402' or day='20200403';
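Both queries benefit from partition pruning: because the filter is on the partition field, Hive reads only the matching day=... directories instead of scanning the whole table. This can be checked with explain (the exact plan output varies by Hive version):
hive (default)> explain select * from dept_partition where day='20200401';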
5) Add partition
Create a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
Create multiple partitions at the same time
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
6) Delete partition
Delete a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
Delete multiple partitions at the same time
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
Note: when adding multiple partitions, the partition specs are separated by spaces; when dropping multiple partitions, they are separated by commas.
7) View how many partitions the partition table has
hive> show partitions dept_partition;
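Having loaded three partitions in step 3) and then dropped, in step 6), all of the partitions added in step 5), the output at this point should be roughly:
day=20200401
day=20200402
day=20200403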
8) View partition table structure
hive> desc formatted dept_partition;
# Partition Information
# col_name              data_type               comment
day                     string
1.2 Secondary partitions
Thinking: when the amount of log data in a single day is itself large, how can the data be split further?
1) Create secondary partition table
hive (default)> create table dept_partition2(
                deptno int, dname string, loc string
                )
                partitioned by (day string, hour string)
                row format delimited fields terminated by '\t';
2) Load data normally
(1) Load data into secondary partition table
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
(2) Query partition data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
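On HDFS, the second-level partition simply adds one more directory level. A sketch (using the mydb.db path that appears below; adjust to your own database):
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=12/dept_20200401.log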
3) There are three ways to upload data directly to the partition directory and associate it with the partition table
(1) Method 1: repair after uploading data
Upload data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
Query data (the newly uploaded data cannot be queried yet)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
Execute the repair command (msck repair scans the table's directories on HDFS and registers any missing partitions in the metastore)
hive> msck repair table dept_partition2;
Query data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
(2) Method 2: add partition after uploading data
Upload data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
Execute add partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
(3) Method 3: after creating a folder, load data to the partition
Create directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
Upload data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
1.3 Dynamic partition adjustment
In a relational database, when inserting data into a partitioned table, the database automatically inserts each row into the corresponding partition according to the value of its partition field. Hive provides a similar mechanism, dynamic partitioning, but Hive's dynamic partitioning requires the corresponding configuration.
1) Dynamic partition parameter settings (a combined session sketch follows this list)
(1) Enable the dynamic partition function (true by default, enabled)
hive.exec.dynamic.partition=true
(2) Set non-strict mode (the dynamic partition mode is strict by default, which requires at least one partition field to be specified as a static partition; nonstrict mode allows all partition fields to use dynamic partitioning)
hive.exec.dynamic.partition.mode=nonstrict
(3) How many dynamic partitions can be created on all MR nodes. Default 1000
hive.exec.max.dynamic.partitions=1000
(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter must be set according to the actual data. For example, if the source data contains one year of data, i.e. the day field has 365 distinct values, this parameter needs to be set greater than 365; with the default value of 100, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
(5) How many HDFS files can be created in the whole MR Job. Default 100000
hive.exec.max.created.files=100000
(6) Whether to throw an exception when an empty partition is generated. Generally, no setting is required. Default false
hive.error.on.empty.partition=false
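A minimal sketch of configuring these parameters for one session before a dynamic-partition insert (all values except the mode are the defaults; the pernode value of 400 is just an illustration for the one-year example in (4)):
hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.max.dynamic.partitions=1000;
hive (default)> set hive.exec.max.dynamic.partitions.pernode=400;
hive (default)> set hive.exec.max.created.files=100000;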
2) Case practice
Requirement: insert the data from the dept table into the corresponding partitions of the target table dept_partition_dy, partitioned by region (the loc field).
(1) Create target partition table
hive (default)> create table dept_partition_dy(
                id int, name string
                )
                partitioned by (loc int)
                row format delimited fields terminated by '\t';
(2) Set dynamic partition
hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
(3) View the partition of the target partition table
hive (default)> show partitions dept_partition_dy;
Thinking: how does the target partition table match the data to the partition field? (See the sketch below.)
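The matching is by position, not by name: Hive takes the last column(s) of the SELECT as the values of the dynamic partition field(s). A sketch to illustrate (the alias here is a made-up name and is ignored; only the column's position matters):
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc as anything from dept;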
2 Bucket table
Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be divided into reasonable partitions. For a table or partition, Hive can organize the data further into buckets, a more fine-grained division of the data range.
Bucket splitting is another technique for breaking up a data set into parts that are easier to manage.
Partitioning targets the data's storage path; bucketing targets the data files.
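The difference is visible directly on HDFS: a partition is a subdirectory, while buckets are separate files inside the table (or partition) directory. A rough sketch using the tables from this article (the stu_buck table is created below; bucket file names vary by version and engine):
dept_partition/day=20200401/    one subdirectory per partition
stu_buck/000000_0               one data file per bucket
stu_buck/000001_0
stu_buck/000002_0
stu_buck/000003_0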
1) First create a bucket table
(1) Data preparation
1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16
(2) Create bucket table
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
(3) View table structure
hive (default)> desc formatted stu_buck;
Num Buckets:            4
(4) Import data into the bucket table by loading
hive (default)> load data inpath '/student.txt' into table stu_buck;
(5) Check whether the created bucket table is divided into 4 buckets
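This can be checked from the Hive CLI by listing the table's directory (assuming the default warehouse path):
hive (default)> dfs -ls /user/hive/warehouse/stu_buck;
Four data files should be listed, one per bucket.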
(6) Query bucket data
hive (default)> select * from stu_buck;
(7) Bucketing rule:
As the results show, Hive determines which bucket a record is stored in by taking the hash of the bucketing column's value modulo the number of buckets.
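For an int column the hash value is typically the integer itself, so with 4 buckets the bucket index is simply id % 4. A worked example with the data above:
1001 % 4 = 1  -> bucket 1
1002 % 4 = 2  -> bucket 2
1003 % 4 = 3  -> bucket 3
1004 % 4 = 0  -> bucket 0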
2) Precautions when operating on bucket tables:
(1) Set the number of reducers to -1 and let the Job decide how many reducers to use, or set the number of reducers to a value greater than or equal to the table's bucket count
(2) Load data into the bucket table from HDFS, to avoid the problem of the local file not being found
(3) Do not use local mode
3) Import data into the bucket table with insert
hive (default)> insert into table stu_buck select * from student_insert;
3 Sampling query
For very large data sets, sometimes users need a representative sample of the query results rather than the full result set. Hive meets this requirement by sampling tables.
Syntax: TABLESAMPLE(BUCKET x OUT OF y)
Query data in the table stu_buck.
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: the value of x must be less than or equal to the value of y; otherwise:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
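The y value also determines the sampling fraction: roughly, the table's buckets are divided into y groups and the x-th group is returned. With stu_buck's 4 buckets, y=4 returns one full bucket (1/4 of the data), while y=8 returns half a bucket (1/8). A sketch:
hive (default)> select * from stu_buck tablesample(bucket 1 out of 8 on id);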