Partition table, bucket table and functions

Partition table

Definition of partition table

  • A partition of a Hive table is actually a separate directory on the HDFS file system; all of the partition's data files are stored under that directory. Queries that filter on the partition column scan only the matching directories (partition pruning) rather than the whole table

Basic operation of partition table

  • Introducing the partition table (logs are managed by date, simulated here with department information)

    dept_20200401.log
    dept_20200402.log
    dept_20200403.log
    ......
    
  • Create partition table syntax

    create table dept_partition(
    deptno int, dname string, loc string
    )
    partitioned by (day string)
    row format delimited fields terminated by '\t';
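
  • The partition column (day here) is not stored inside the data files; Hive exposes it as a pseudo-column that can be selected and filtered like a regular column. An illustrative query against this table:

    select deptno, dname, loc, day from dept_partition where day='20200401';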
    
  • Load data into partition table

    # Data preparation
    # dept_20200401.log
    10	ACCOUNTING	1700
    20	RESEARCH	1800
    
    # dept_20200402.log
    30	SALES	1900
    40	OPERATIONS	1700
    
    # dept_20200403.log
    50	TEST	2000
    60	DEV	1900
    
    # Load data
    # The target partition must be specified when loading data into a partition table
    load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
    load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
    load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
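
  • After these loads, each partition appears as a separate directory under the table's location on HDFS. A sketch of the layout, assuming the default warehouse path and the mydb database used later in this article:

    dfs -ls /user/hive/warehouse/mydb.db/dept_partition;
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200401
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200402
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200403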
    
  • Query data in partition table

    # Single partition query
    select * from dept_partition where day='20200401';
    
    # Query multiple partitions together (union)
    select * from dept_partition where day='20200401'
    union
    select * from dept_partition where day='20200402';
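
    # Equivalent single scan that filters on both partitions at once (an illustrative alternative)
    select * from dept_partition where day='20200401' or day='20200402';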
    
  • Add partition

    # Create a single partition
    alter table dept_partition add partition(day='20200404');
    
    # Create multiple partitions at once (partition clauses separated by spaces)
    alter table dept_partition add partition(day='20200405') partition(day='20200406');
    
  • Delete a partition

    # Delete a single partition
    alter table dept_partition drop partition (day='20200406');
    
    # Delete multiple partitions at once (partition clauses separated by commas)
    alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
    
  • Show the partitions of the partition table

    show partitions dept_partition;
    
  • View partition table structure

    desc formatted dept_partition;
    

Secondary partition

Create secondary partition table

create table dept_partition2(
    deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

Loading data normally

# Load data into secondary partition table
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');

# Query partition data
select * from dept_partition2 where day='20200401' and hour='12';

Upload the data directly to the partition directory

  • Method 1: repair after uploading data

    # Upload data
    dfs -mkdir -p
    /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
    dfs -put /opt/module/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
    
    # Query data (the newly uploaded data cannot be queried yet because the metastore has no record of this partition)
    select * from dept_partition2 where day='20200401' and hour='13';
    
    # Execute the repair command (synchronizes the partition metadata with the directories found on HDFS)
    msck repair table dept_partition2;
    
    # Query data again
    select * from dept_partition2 where day='20200401' and hour='13';
    
  • Method 2: add partition after uploading data

    # Upload data
    dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
    dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
    
    # Add the partition to the metastore
    alter table dept_partition2 add partition(day='20200401',hour='14');
    
  • Method 3: create the directory, then load the data into the partition

    # Create directory
    dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
    
    # Load data with load (this both moves the file and registers the partition)
    load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table
    dept_partition2 partition(day='20200401',hour='15');
    
    # Query data
    select * from dept_partition2 where day='20200401' and hour='15';
    

Dynamic partition

Parameter setting

# Enable the dynamic partition function (default: enabled)
hive.exec.dynamic.partition=true

# Dynamic partition mode. The default is strict, which requires at least one
# partition column to be given a static value; nonstrict allows all partition
# columns to use dynamic partitioning
hive.exec.dynamic.partition.mode=nonstrict

# Maximum number of dynamic partitions that can be created across all MR nodes. Default 1000
hive.exec.max.dynamic.partitions=1000

# Maximum number of dynamic partitions that can be created on each node executing MR.
# If the day field has 365 distinct values, this must be set greater than 365;
# with the default of 100, the job will fail
hive.exec.max.dynamic.partitions.pernode=100

# Maximum number of HDFS files that can be created by the whole MR job. Default 100000
hive.exec.max.created.files=100000

# Whether to throw an exception when an empty partition is generated. Usually left unset. Default false
hive.error.on.empty.partition=false
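
These parameters are set per session with set before running the insert; a typical preamble (a sketch; the pernode value of 366 follows the 365-day example above and is otherwise an assumption):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=366;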

Practical cases

# Create target partition table
create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';

# Set dynamic partition to non-strict mode
set hive.exec.dynamic.partition.mode = nonstrict;

# The dynamic partition value is taken from the last column of the select list, by position (not by name)
insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

# View the partitions of the target partition table
show partitions dept_partition_dy;

Bucket table

  • Not every data set can be split into reasonable partitions
  • For a table or partition, Hive can go a step further and organize the data into buckets, a finer-grained division
  • Partitioning divides the data's storage paths (directories), while bucketing divides the data files themselves

Data preparation

1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16

Create bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

Import data into the bucket table with load

load data inpath '/student.txt' into table stu_buck;

Import data into the bucket table with insert

insert into table stu_buck select * from student_insert;

Query bucket data

select * from stu_buck;
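
On HDFS the table directory now holds one file per bucket (a sketch, assuming the warehouse path used earlier in this article; exact file names vary by Hive version and engine). As the bucketing rule below explains, rows are assigned by hash modulo 4:

dfs -ls /user/hive/warehouse/mydb.db/stu_buck;
# 000000_0   rows where hash(id) % 4 == 0
# 000001_0   rows where hash(id) % 4 == 1
# 000002_0   rows where hash(id) % 4 == 2
# 000003_0   rows where hash(id) % 4 == 3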

Bucketing rule

  • Hive decides which bucket a record is stored in by hashing the value of the bucketing column and taking that hash modulo the number of buckets, as sketched below
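
A sketch to inspect the rule: for an int column the hash is effectively the value itself, so with 4 buckets id 1004 lands in bucket 1004 % 4 = 0 (the hash() call and modulo here are illustrative, not Hive internals):

select id, hash(id) % 4 as expected_bucket from stu_buck;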

Precautions for bucket table operation

  • Set the number of reducers to -1 and let the job decide how many reducers to use
  • Or set the number of reducers to be greater than or equal to the number of buckets in the bucket table
  • load data into a bucket table from HDFS, to avoid the problem that a local file cannot be found
  • Do not use local mode

Sampling query

  • For very large data sets, users sometimes need a representative sample of the results rather than all of them. Hive meets this need with table sampling:

    select * from stu_buck tablesample(bucket 1 out of 4 on id);
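
  • In bucket x out of y, y controls the sampled fraction (with 4 buckets, out of 4 draws one bucket's worth, a quarter of the data, while out of 8 draws half a bucket) and x selects which portion to take; x must not be greater than y, otherwise Hive reports an error. An illustrative variant:

    # Roughly one eighth of the data (half of one bucket)
    select * from stu_buck tablesample(bucket 1 out of 8 on id);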

Common built-in functions

Empty field assignment

  • NVL(value, default_value): returns default_value when value is NULL, otherwise returns value

    # If the employee's comm is NULL, use -1 instead
    select comm, nvl(comm, -1) from emp;
    
    # If the employee's comm is NULL, use the leader's id (mgr) instead
    select comm, nvl(comm, mgr) from emp;
    

case when then else end

  • Requirement: count how many men and women there are in each department

  • Create local emp_sex.txt and import the data

    vim emp_sex.txt
    Sun WuKong	A	male
    sea	A	male
    Song song	B	male
    Miss Luo Yu feng	A	female
    Sister Ting	B	female
    Tingting	B	female
    
    create table emp_sex(
    name string, 
    dept_id string, 
    sex string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath '/opt/module/hive/datas/emp_sex.txt' into table emp_sex;
    
  • Query the data as required

    select 
      dept_id,
      sum(case sex when 'male' then 1 else 0 end) male_count,
      sum(case sex when 'female' then 1 else 0 end) female_count
    from 
      emp_sex
    group by
      dept_id;
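
  • An equivalent formulation using Hive's if() function (a sketch; same result):

    select 
      dept_id,
      sum(if(sex='male', 1, 0)) male_count,
      sum(if(sex='female', 1, 0)) female_count
    from emp_sex
    group by dept_id;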
    

Row to column

  • collect_set(col): accepts only primitive data types; its main job is to deduplicate and aggregate the values of a column into an array

  • concat_ws(separator, ...): the arguments after the separator must be string or array<string>

  • Requirement: group together people with the same constellation and blood type

    Sagittarius,A            sea|Miss Luo Yu feng
    Aries,A                  Sun WuKong|Zhu Bajie
    Aries,B                  Song song|Aoi Sora
    
  • Create local person_info.txt and import the data

    Sun WuKong	Aries	A
    sea	Sagittarius	A
    Song song	Aries	B
    Zhu Bajie	Aries	A
    Miss Luo Yu feng	Sagittarius	A
    Aoi Sora	Aries	B
    
    create table person_info(
    name string, 
    constellation string, 
    blood_type string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath "/opt/module/hive/datas/person_info.txt" into table person_info;
    
  • Query the data as required

    SELECT t1.c_b, CONCAT_WS("|", collect_set(t1.name))
    FROM (
        SELECT name, CONCAT_WS(',', constellation, blood_type) c_b
        FROM person_info
    ) t1
    GROUP BY t1.c_b;
    

Column to row

  • explode(col): splits an array or map stored in a Hive column into multiple rows; it is usually combined with lateral view, which joins each exploded row back to its source row

  • Requirement: expand the array data in the movie category column

    <Suspect tracking>      Suspense
    <Suspect tracking>      action
    <Suspect tracking>      science fiction
    <Suspect tracking>      plot
    <Lie to me>             Suspense
    <Lie to me>             gangster
    <Lie to me>             action
    <Lie to me>             psychology
    <Lie to me>             plot
    <War wolf 2>            Warfare
    <War wolf 2>            action
    <War wolf 2>            disaster
    
  • Create local movie_info.txt and import the data

    <Suspect tracking>	Suspense,action,science fiction,plot
    <Lie to me>	Suspense,gangster,action,psychology,plot
    <War wolf 2>	Warfare,action,disaster
    
  • Create hive table and import data

    create table movie_info(
        movie string, 
        category string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath "/opt/module/hive/datas/movie_info.txt" into table movie_info;
    
  • Query the data as required

    SELECT movie, category_name
    FROM movie_info
    lateral view
    explode(split(category, ",")) movie_info_tmp AS category_name;
    

Window function (windowing function)

  • over(): specifies the window of data the analysis function works on; the window may change from row to row
  • n preceding: the previous n rows
  • n following: the next n rows
  • unbounded: unbounded preceding means from the start of the window; unbounded following means to the end of the window
  • lag(col, n, default_val): the value of col from the nth previous row
  • lead(col, n, default_val): the value of col from the nth following row
  • ntile(n): distributes the rows of an ordered window into n numbered groups, starting from 1; for each row, ntile returns the number of the group the row belongs to. n must be of type int

Data preparation

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Requirements

# Query the customers who purchased in April 2017 and the total number of such customers
# Query the customer's purchase details and the monthly total purchase amount
# In the above scenario, accumulate each customer's cost by date
# Query each customer's previous purchase time
# Query the first 20% of the order information (by date)

Create hive table and import data

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/hive/datas/business.txt" into table business;

Query the data as required

# Query the customers who purchased in April 2017 and the total number of such customers
# group by collapses the rows to one per customer; count(*) over() then counts across those rows
select name, count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

# Query the customer's purchase details and monthly total purchase amount
select name,orderdate,cost,sum(cost) 
over(partition by month(orderdate)) from
business;

# Accumulate each customer's cost by date
select name, orderdate, cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- sum within each name group
sum(cost) over(partition by name order by orderdate) as sample3, -- running total within each name group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, -- same as sample3: with order by, the default frame runs from the start of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, -- previous row, current row and next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 -- current row and all following rows
from business;

# Query each customer's previous purchase time
select name, orderdate, cost,
lag(orderdate, 1, '1900-01-01') over(partition by name order by orderdate) as time1,
lag(orderdate, 2) over(partition by name order by orderdate) as time2
from business;
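
# The symmetric query with lead(): each customer's next purchase time
# (a sketch; the '9999-99-99' default for the last row is an illustrative placeholder)
select name, orderdate, cost,
lead(orderdate, 1, '9999-99-99') over(partition by name order by orderdate) as next_time
from business;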

# Query the first 20% of the order information (by date)
# ntile(5) splits the ordered rows into 5 equal groups; group 1 is the first 20%
select * from (
    select name, orderdate, cost, ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;

Rank

  • rank(): equal values receive the same rank and leave gaps, so the total count is unchanged
  • dense_rank(): equal values receive the same rank without gaps, so the highest rank is reduced
  • row_number(): assigns consecutive numbers in order, ignoring ties

Data preparation

name subject score
Sun WuKong	language	87
Sun WuKong	mathematics	95
Sun WuKong	English	68
sea	language	94
sea	mathematics	56
sea	English	84
Song song	language	64
Song song	mathematics	86
Song song	English	84
Tingting	language	65
Tingting	mathematics	85
Tingting	English	78

Requirement

  • Calculate the score ranking of each subject

Create Hive table and import data

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/hive/datas/score.txt' into table score;

Query the data as required

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
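
A common follow-up use of this pattern (a sketch): keep only the top three scores in each subject by filtering on the computed rank:

select name, subject, score
from (
    select name, subject, score,
    dense_rank() over(partition by subject order by score desc) drp
    from score
) t
where drp <= 3;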
