Partition table
Definition of partition table
- A partition of a partitioned table is actually a separate directory on the HDFS file system; all the data files of that partition live under that directory
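As an illustration, a sketch of the HDFS layout for the dept_partition table created below, assuming the warehouse path /user/hive/warehouse/mydb.db used elsewhere in this section:

```
/user/hive/warehouse/mydb.db/dept_partition/
├── day=20200401/
│   └── dept_20200401.log
├── day=20200402/
│   └── dept_20200402.log
└── day=20200403/
    └── dept_20200403.log
```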
Basic operations on partitioned tables
- Motivating example for partitioned tables: manage logs by date, simulated here with department information, one log file per day:

```
dept_20200401.log  dept_20200402.log  dept_20200403.log  ......
```
- Create partition table syntax

```
create table dept_partition(
    deptno int,
    dname string,
    loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
```
- Load data into the partition table

```
# Data preparation
# dept_20200401.log
10	ACCOUNTING	1700
20	RESEARCH	1800
# dept_20200402.log
30	SALES	1900
40	OPERATIONS	1700
# dept_20200403.log
50	TEST	2000
60	DEV	1900

# Load data (a partition must be specified when loading data into a partitioned table)
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
```
- Query data in the partition table

```
# Single-partition query
select * from dept_partition where day='20200401';
# Multi-partition union query
select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402';
```
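An equivalent way to hit several partitions in one scan is a simple IN filter on the partition column; a minimal sketch against the same table:

```
select * from dept_partition where day in ('20200401', '20200402');
```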
- Add partitions

```
# Add a single partition
alter table dept_partition add partition(day='20200404');
# Add multiple partitions at once (separated by spaces)
alter table dept_partition add partition(day='20200405') partition(day='20200406');
```
- Delete partitions

```
# Delete a single partition
alter table dept_partition drop partition(day='20200406');
# Delete multiple partitions at once (separated by commas)
alter table dept_partition drop partition(day='20200404'), partition(day='20200405');
```
- See how many partitions the table has

```
show partitions dept_partition;
```
- View the partition table structure

```
desc formatted dept_partition;
```
Secondary partition
Create secondary partition table
```
create table dept_partition2(
    deptno int,
    dname string,
    loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
```
Load data normally

```
# Load data into the secondary partition table
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
# Query partition data
select * from dept_partition2 where day='20200401' and hour='12';
```
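Filtering on only the first-level partition key is also allowed; Hive then scans every hour= subdirectory under that day. A minimal sketch:

```
select * from dept_partition2 where day='20200401';
```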
Upload the data directly to the partition directory
- Method 1: repair the metadata after uploading data

```
# Upload data
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
# Query data (the data just uploaded cannot be queried yet)
select * from dept_partition2 where day='20200401' and hour='13';
# Execute the repair command
msck repair table dept_partition2;
# Query the data again
select * from dept_partition2 where day='20200401' and hour='13';
```
- Method 2: add the partition after uploading data

```
# Upload data
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
# Add the partition
alter table dept_partition2 add partition(day='20200401',hour='14');
# Query data
select * from dept_partition2 where day='20200401' and hour='14';
```
- Method 3: create the directory, then load the data into the partition

```
# Create the directory
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
# Load data (load also registers the partition in the metastore)
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
# Query data
select * from dept_partition2 where day='20200401' and hour='15';
```
Dynamic partition
Parameter setting
```
# Enable the dynamic partition function (default: true, enabled)
hive.exec.dynamic.partition=true
# Dynamic partition mode. The default is strict, which requires at least one partition column to be a static partition;
# nonstrict mode allows all partition columns to be dynamic
hive.exec.dynamic.partition.mode=nonstrict
# Maximum number of dynamic partitions that can be created across all MR nodes. Default 1000
hive.exec.max.dynamic.partitions=1000
# Maximum number of dynamic partitions that can be created on each node executing MR.
# If the day field has 365 distinct values, this must be set greater than 365; with the default of 100 the job fails
hive.exec.max.dynamic.partitions.pernode=100
# Maximum number of HDFS files that can be created by the whole MR job. Default 100000
hive.exec.max.created.files=100000
# Whether to throw an exception when an empty partition is generated. Usually no setting is required. Default false
hive.error.on.empty.partition=false
```
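Following the note above about a day field with 365 distinct values, a hedged sketch of the session-level settings one might raise before such an insert (the property names are the standard Hive settings listed above):

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- must exceed the number of distinct day values (365), or the job errors out
set hive.exec.max.dynamic.partitions.pernode=366;
```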
Practical case

```
# Create the target partition table
create table dept_partition_dy(id int, name string)
partitioned by (loc int)
row format delimited fields terminated by '\t';
# Set dynamic partition mode, then insert (loc is filled in dynamically from the select)
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
# View the partitions of the target partition table
show partitions dept_partition_dy;
```
Bucketed table
- Not every data set can be split into reasonable partitions
- For a table or partition, Hive can organize the data further into buckets
- Partitioning divides the data's storage paths (directories), while bucketing divides the data files themselves
Data preparation
```
1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16
```
Create the bucketed table

```
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';
```
Import data into the bucketed table with load

```
load data inpath '/student.txt' into table stu_buck;
```
Import data into the bucketed table with insert

```
insert into table stu_buck select * from student_insert;
```
Query the bucketed data

```
select * from stu_buck;
```
Bucketing rule
- Hive determines which bucket a record is stored in by hashing the value of the bucketing column and taking the result modulo the number of buckets
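A worked sketch of the rule for stu_buck (4 buckets, clustered by id), assuming classic Hive bucketing where the hash of an int column is the value itself (newer Hive versions may use a Murmur hash instead):

```
-- id = 1004  ->  hash(1004) % 4 = 0  ->  bucket 0 (file 000000_0)
-- id = 1005  ->  hash(1005) % 4 = 1  ->  bucket 1 (file 000001_0)
-- id = 1006  ->  hash(1006) % 4 = 2  ->  bucket 2 (file 000002_0)
-- id = 1007  ->  hash(1007) % 4 = 3  ->  bucket 3 (file 000003_0)
```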
Precautions for bucketed table operations
- Set the number of reducers to -1 (set mapreduce.job.reduces=-1) and let the job decide how many reducers to use
- Or set the number of reducers to be greater than the number of buckets in the bucketed table
- Load data into the bucketed table from HDFS, to avoid problems where a local file cannot be found
- Do not use local mode
Sampling query
- For very large data sets, users sometimes need a representative sample of the results rather than all of them. Hive meets this need by sampling the table

```
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```
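In tablesample(bucket x out of y on id), y controls the fraction sampled (buckets ÷ y buckets are read) and x is the first bucket read, after which every yth bucket is taken; x must be less than or equal to y. A second sketch against the same 4-bucket table:

```
-- reads 4/2 = 2 of the 4 buckets: bucket 1, then bucket 1+2 = 3
select * from stu_buck tablesample(bucket 1 out of 2 on id);
```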
Common built-in functions
Empty field assignment
- NVL: assign a default value to NULL data

```
# If the employee's comm is NULL, use -1 instead
select comm, nvl(comm, -1) from emp;
# If the employee's comm is NULL, use the leader's id (mgr) instead
select comm, nvl(comm, mgr) from emp;
```
case when then else end
- Requirement: count how many men and women there are in each department
- Create local emp_sex.txt and import the data

```
vim emp_sex.txt

Sun WuKong	A	male
sea	A	male
Song song	B	male
Miss Luo Yu feng	A	female
Sister Ting	B	female
Tingting	B	female

create table emp_sex(
    name string,
    dept_id string,
    sex string)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/hive/datas/emp_sex.txt' into table emp_sex;
```
- Query data per the requirement

```
select
    dept_id,
    sum(case sex when 'male' then 1 else 0 end) male_count,
    sum(case sex when 'female' then 1 else 0 end) female_count
from emp_sex
group by dept_id;
```
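The same counts can also be written with Hive's if() function instead of case when; an equivalent sketch:

```
select
    dept_id,
    sum(if(sex = 'male', 1, 0)) male_count,
    sum(if(sex = 'female', 1, 0)) female_count
from emp_sex
group by dept_id;
```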
Row to column
- collect_set(col): accepts only primitive data types; its main purpose is to deduplicate and aggregate the values of a field into an array
- concat_ws(separator, ...): the remaining arguments must be string or array<string>
- Requirement: group people who have the same constellation and blood type. Expected result:

```
sagittarius,A	sea|Miss Luo Yu feng
Aries,A	Sun WuKong|Zhu Bajie
Aries,B	Song song|Aoi Sora
```
- Create local person_info.txt and import the data

```
Sun WuKong	Aries	A
sea	sagittarius	A
Song song	Aries	B
Zhu Bajie	Aries	A
Miss Luo Yu feng	sagittarius	A
Aoi Sora	Aries	B

create table person_info(
    name string,
    constellation string,
    blood_type string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/hive/datas/person_info.txt" into table person_info;
```
- Query data per the requirement

```
select
    t1.c_b,
    concat_ws('|', collect_set(t1.name))
from (
    select
        name,
        concat_ws(',', constellation, blood_type) c_b
    from person_info
) t1
group by t1.c_b;
```
Column to row
- explode(col): splits a complex array or map structure in a Hive column into multiple rows
- Requirement: expand the array data in the movie categories. Expected result:

```
<Suspect tracking>	Suspense
<Suspect tracking>	action
<Suspect tracking>	science fiction
<Suspect tracking>	plot
<Lie to me>	Suspense
<Lie to me>	gangster
<Lie to me>	action
<Lie to me>	psychology
<Lie to me>	plot
<War wolf 2>	Warfare
<War wolf 2>	action
<War wolf 2>	disaster
```
- Create local movie_info.txt and import the data

```
<Suspect tracking>	Suspense,action,science fiction,plot
<Lie to me>	Suspense,gangster,action,psychology,plot
<War wolf 2>	Warfare,action,disaster
```
- Create the Hive table and import the data

```
create table movie_info(
    movie string,
    category string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/hive/datas/movie_info.txt" into table movie_info;
```
- Query data per the requirement

```
select
    movie,
    category_name
from movie_info
lateral view explode(split(category, ",")) movie_info_tmp as category_name;
```
Window function (windowing function)
- over(): specifies the window of data over which the analysis function works; the window may change from row to row
- n preceding: the previous n rows of data
- n following: the next n rows of data
- unbounded: the starting point (unbounded preceding) or the end point (unbounded following)
- lag(col, n, default_val): the value of col from the nth row before the current row
- lead(col, n, default_val): the value of col from the nth row after the current row
- ntile(n): distributes the rows of an ordered window into n numbered groups (numbering starts from 1); for each row, ntile returns the number of the group the row belongs to. n must be of type int
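Before the worked example, a minimal sketch of how these pieces combine in one windowed expression (it uses the business table created in the steps below):

```
select
    name,
    orderdate,
    sum(cost) over(
        partition by name                        -- one window per customer
        order by orderdate                       -- ordered by purchase date within the window
        rows between 1 preceding and current row -- frame: previous row plus current row
    ) as pair_sum
from business;
```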
Data preparation
```
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
```
Requirements

- Query the customers who purchased in April 2017, together with the total count of such customers
- Query each customer's purchase details and the monthly total purchase amount
- On top of the above, accumulate each customer's cost by date
- Query each customer's previous purchase time
- Query the order information of the first 20% of orders by date
Create Hive table and import data
```
create table business(
    name string,
    orderdate string,
    cost int
)
row format delimited fields terminated by ',';
load data local inpath "/opt/module/hive/datas/business.txt" into table business;
```
Query data per the requirements

```
# Query the customers who purchased in April 2017 and the total count of such customers
select name, count(*) over()
from business
where substring(orderdate, 1, 7) = '2017-04'
group by name;

# Query each customer's purchase details and the monthly total purchase amount
select name, orderdate, cost, sum(cost) over(partition by month(orderdate)) from business;

# Accumulate each customer's cost by date
select name, orderdate, cost,
    sum(cost) over() as sample1, -- sum over all rows
    sum(cost) over(partition by name) as sample2, -- group by name, sum within each group
    sum(cost) over(partition by name order by orderdate) as sample3, -- group by name, running total within each group
    sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row) as sample4, -- same as sample3: aggregation from the starting point to the current row
    sum(cost) over(partition by name order by orderdate rows between 1 preceding and current row) as sample5, -- aggregate the previous row with the current row
    sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following) as sample6, -- the previous row, the current row, and the next row
    sum(cost) over(partition by name order by orderdate rows between current row and unbounded following) as sample7 -- the current row and all subsequent rows
from business;

# Query each customer's previous purchase time
select name, orderdate, cost,
    lag(orderdate, 1, '1900-01-01') over(partition by name order by orderdate) as time1,
    lag(orderdate, 2) over(partition by name order by orderdate) as time2
from business;

# Query the order information of the first 20% of orders by date
select * from (
    select name, orderdate, cost,
        ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;
```
Rank
- rank(): tied values receive the same rank, and gaps appear after the tie (the total count does not change)
- dense_rank(): tied values receive the same rank, with no gaps (the total count decreases)
- row_number(): assigns consecutive numbers in order, even for ties
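For intuition, take the English scores from the data below, sorted descending (84, 84, 78, 68); the three functions treat the tie differently:

```
score:         84  84  78  68
rank():         1   1   3   4   -- tie repeats, rank 2 is skipped
dense_rank():   1   1   2   3   -- tie repeats, no gap
row_number():   1   2   3   4   -- consecutive; the tie is broken arbitrarily
```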
Data preparation
| name | subject | score |
| --- | --- | --- |
| Sun WuKong | language | 87 |
| Sun WuKong | mathematics | 95 |
| Sun WuKong | English | 68 |
| sea | language | 94 |
| sea | mathematics | 56 |
| sea | English | 84 |
| Song song | language | 64 |
| Song song | mathematics | 86 |
| Song song | English | 84 |
| Tingting | language | 65 |
| Tingting | mathematics | 85 |
| Tingting | English | 78 |
Requirement
- Calculate the score ranking within each subject
Create Hive table and import data
```
create table score(
    name string,
    subject string,
    score int)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/hive/datas/score.txt' into table score;
```
Query data per the requirement

```
select name, subject, score,
    rank() over(partition by subject order by score desc) rp,
    dense_rank() over(partition by subject order by score desc) drp,
    row_number() over(partition by subject order by score desc) rmp
from score;
```
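A common follow-up, sketched here as an extension: keep only the top three scores per subject by filtering on the window function in an outer query (window functions cannot appear directly in a where clause):

```
select * from (
    select name, subject, score,
        dense_rank() over(partition by subject order by score desc) drp
    from score
) t
where drp <= 3;
```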