Partition table, bucket table and functions

Partition table

Definition of partition table

  • A partition of a Hive table is actually a separate directory on the HDFS file system; all of the partition's data files are stored under that directory. Queries that filter on the partition column scan only the matching directories (partition pruning) rather than the whole table

Basic operation of partition table

  • Introducing the partition table (logs are managed by date, simulated here with department information)

    dept_20200401.log
    dept_20200402.log
    dept_20200403.log
    ......
    
  • Create partition table syntax

    create table dept_partition(
    deptno int, dname string, loc string
    )
    partitioned by (day string)
    row format delimited fields terminated by '\t';
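
  • The partition column (day here) is not stored inside the data files; Hive exposes it as a pseudo-column that can be selected and filtered like a regular column. An illustrative query against this table:

    select deptno, dname, loc, day from dept_partition where day='20200401';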
    
  • Load data into partition table

    # Data preparation
    # dept_20200401.log
    10	ACCOUNTING	1700
    20	RESEARCH	1800
    
    # dept_20200402.log
    30	SALES	1900
    40	OPERATIONS	1700
    
    # dept_20200403.log
    50	TEST	2000
    60	DEV	1900
    
    # Load data
    # The target partition must be specified when loading data into a partition table
    load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
    load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
    load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
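
  • After these loads, each partition appears as a separate directory under the table's location on HDFS. A sketch of the layout, assuming the default warehouse path and the mydb database used later in this article:

    dfs -ls /user/hive/warehouse/mydb.db/dept_partition;
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200401
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200402
    # /user/hive/warehouse/mydb.db/dept_partition/day=20200403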
    
  • Query data in partition table

    # Single partition query
    select * from dept_partition where day='20200401';
    
    # Query multiple partitions together (union)
    select * from dept_partition where day='20200401'
    union
    select * from dept_partition where day='20200402';
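
    # Equivalent single scan that filters on both partitions at once (an illustrative alternative)
    select * from dept_partition where day='20200401' or day='20200402';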
    
  • Add partition

    # Create a single partition
    alter table dept_partition add partition(day='20200404');
    
    # Create multiple partitions at once (partition clauses separated by spaces)
    alter table dept_partition add partition(day='20200405') partition(day='20200406');
    
  • Delete a partition

    # Delete a single partition
    alter table dept_partition drop partition (day='20200406');
    
    # Delete multiple partitions at once (partition clauses separated by commas)
    alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
    
  • Show the partitions of the partition table

    show partitions dept_partition;
    
  • View partition table structure

    desc formatted dept_partition;
    

Secondary partition

Create secondary partition table

create table dept_partition2(
    deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

Loading data normally

# Load data into secondary partition table
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');

# Query partition data
select * from dept_partition2 where day='20200401' and hour='12';

Upload the data directly to the partition directory

  • Method 1: repair after uploading data

    # Upload data
    dfs -mkdir -p
    /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
    dfs -put /opt/module/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
    
    # Query data (the newly uploaded data cannot be queried yet because the metastore has no record of this partition)
    select * from dept_partition2 where day='20200401' and hour='13';
    
    # Execute the repair command (synchronizes the partition metadata with the directories found on HDFS)
    msck repair table dept_partition2;
    
    # Query data again
    select * from dept_partition2 where day='20200401' and hour='13';
    
  • Method 2: add partition after uploading data

    # Upload data
    dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
    dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
    
    # Add the partition to the metastore
    alter table dept_partition2 add partition(day='20200401',hour='14');
    
  • Method 3: create the directory, then load the data into the partition

    # Create directory
    dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
    
    # Load data with load (this both moves the file and registers the partition)
    load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table
    dept_partition2 partition(day='20200401',hour='15');
    
    # Query data
    select * from dept_partition2 where day='20200401' and hour='15';
    

Dynamic partition

Parameter setting

# Enable the dynamic partition function (default: enabled)
hive.exec.dynamic.partition=true

# Dynamic partition mode. The default is strict, which requires at least one
# partition column to be given a static value; nonstrict allows all partition
# columns to use dynamic partitioning
hive.exec.dynamic.partition.mode=nonstrict

# Maximum number of dynamic partitions that can be created across all MR nodes. Default 1000
hive.exec.max.dynamic.partitions=1000

# Maximum number of dynamic partitions that can be created on each node executing MR.
# If the day field has 365 distinct values, this must be set greater than 365;
# with the default of 100, the job will fail
hive.exec.max.dynamic.partitions.pernode=100

# Maximum number of HDFS files that can be created by the whole MR job. Default 100000
hive.exec.max.created.files=100000

# Whether to throw an exception when an empty partition is generated. Usually left unset. Default false
hive.error.on.empty.partition=false
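
These parameters are set per session with set before running the insert; a typical preamble (a sketch; the pernode value of 366 follows the 365-day example above and is otherwise an assumption):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=366;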

Practical cases

# Create target partition table
create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';

# Set dynamic partition to non-strict mode
set hive.exec.dynamic.partition.mode = nonstrict;

# The dynamic partition value is taken from the last column of the select list, by position (not by name)
insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

# View the partitions of the target partition table
show partitions dept_partition_dy;

Bucket table

  • Not every data set can be split into reasonable partitions
  • For a table or partition, Hive can go a step further and organize the data into buckets, a finer-grained division
  • Partitioning divides the data's storage paths (directories), while bucketing divides the data files themselves

Data preparation

1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16

Create bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

Import data into the bucket table with load

load data inpath '/student.txt' into table stu_buck;

Import data into the bucket table with insert

insert into table stu_buck select * from student_insert;

Query bucket data

select * from stu_buck;
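
On HDFS the table directory now holds one file per bucket (a sketch, assuming the warehouse path used earlier in this article; exact file names vary by Hive version and engine). As the bucketing rule below explains, rows are assigned by hash modulo 4:

dfs -ls /user/hive/warehouse/mydb.db/stu_buck;
# 000000_0   rows where hash(id) % 4 == 0
# 000001_0   rows where hash(id) % 4 == 1
# 000002_0   rows where hash(id) % 4 == 2
# 000003_0   rows where hash(id) % 4 == 3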

Bucketing rule

  • Hive decides which bucket a record is stored in by hashing the value of the bucketing column and taking that hash modulo the number of buckets, as sketched below
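
A sketch to inspect the rule: for an int column the hash is effectively the value itself, so with 4 buckets id 1004 lands in bucket 1004 % 4 = 0 (the hash() call and modulo here are illustrative, not Hive internals):

select id, hash(id) % 4 as expected_bucket from stu_buck;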

Precautions for bucket table operation

  • Set the number of reducers to -1 and let the job decide how many reducers to use
  • Or set the number of reducers to be greater than or equal to the number of buckets in the bucket table
  • load data into a bucket table from HDFS, to avoid the problem that a local file cannot be found
  • Do not use local mode

Sampling query

  • For very large data sets, users sometimes need a representative sample of the results rather than all of them. Hive meets this need with table sampling:

    select * from stu_buck tablesample(bucket 1 out of 4 on id);
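
  • In bucket x out of y, y controls the sampled fraction (with 4 buckets, out of 4 draws one bucket's worth, a quarter of the data, while out of 8 draws half a bucket) and x selects which portion to take; x must not be greater than y, otherwise Hive reports an error. An illustrative variant:

    # Roughly one eighth of the data (half of one bucket)
    select * from stu_buck tablesample(bucket 1 out of 8 on id);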

Common built-in functions

Empty field assignment

  • NVL(value, default_value): returns default_value when value is NULL, otherwise returns value

    # If the employee's comm is NULL, use -1 instead
    select comm, nvl(comm, -1) from emp;
    
    # If the employee's comm is NULL, use the leader's id (mgr) instead
    select comm, nvl(comm, mgr) from emp;
    

case when then else end

  • Requirement: count how many men and women there are in each department

  • Create local emp_sex.txt and import the data

    vim emp_sex.txt
    Sun WuKong	A	male
    sea	A	male
    Song song	B	male
    Miss Luo Yu feng	A	female
    Sister Ting	B	female
    Tingting	B	female
    
    create table emp_sex(
    name string, 
    dept_id string, 
    sex string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath '/opt/module/hive/datas/emp_sex.txt' into table emp_sex;
    
  • Query the data as required

    select 
      dept_id,
      sum(case sex when 'male' then 1 else 0 end) male_count,
      sum(case sex when 'female' then 1 else 0 end) female_count
    from 
      emp_sex
    group by
      dept_id;
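
  • An equivalent formulation using Hive's if() function (a sketch; same result):

    select 
      dept_id,
      sum(if(sex='male', 1, 0)) male_count,
      sum(if(sex='female', 1, 0)) female_count
    from emp_sex
    group by dept_id;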
    

Row to column

  • collect_set(col): accepts only primitive data types; its main job is to deduplicate and aggregate the values of a column into an array

  • concat_ws(separator, ...): the arguments after the separator must be string or array<string>

  • Requirement: group together people with the same constellation and blood type

    Sagittarius,A            sea|Miss Luo Yu feng
    Aries,A                  Sun WuKong|Zhu Bajie
    Aries,B                  Song song|Aoi Sora
    
  • Create local person_info.txt and import the data

    Sun WuKong	Aries	A
    sea	Sagittarius	A
    Song song	Aries	B
    Zhu Bajie	Aries	A
    Miss Luo Yu feng	Sagittarius	A
    Aoi Sora	Aries	B
    
    create table person_info(
    name string, 
    constellation string, 
    blood_type string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath "/opt/module/hive/datas/person_info.txt" into table person_info;
    
  • Query the data as required

    SELECT t1.c_b, CONCAT_WS("|", collect_set(t1.name))
    FROM (
        SELECT name, CONCAT_WS(',', constellation, blood_type) c_b
        FROM person_info
    ) t1
    GROUP BY t1.c_b;
    

Column to row

  • explode(col): splits an array or map stored in a Hive column into multiple rows; it is usually combined with lateral view, which joins each exploded row back to its source row

  • Requirement: expand the array data in the movie category column

    <Suspect tracking>      Suspense
    <Suspect tracking>      action
    <Suspect tracking>      science fiction
    <Suspect tracking>      plot
    <Lie to me>             Suspense
    <Lie to me>             gangster
    <Lie to me>             action
    <Lie to me>             psychology
    <Lie to me>             plot
    <War wolf 2>            Warfare
    <War wolf 2>            action
    <War wolf 2>            disaster
    
  • Create local movie_info.txt and import the data

    <Suspect tracking>	Suspense,action,science fiction,plot
    <Lie to me>	Suspense,gangster,action,psychology,plot
    <War wolf 2>	Warfare,action,disaster
    
  • Create hive table and import data

    create table movie_info(
        movie string, 
        category string) 
    row format delimited fields terminated by "\t";
    
    load data local inpath "/opt/module/hive/datas/movie_info.txt" into table movie_info;
    
  • Query the data as required

    SELECT movie, category_name
    FROM movie_info
    lateral view
    explode(split(category, ",")) movie_info_tmp AS category_name;
    

Window function (windowing function)

  • over(): specifies the window of data the analysis function works on; the window may change from row to row
  • n preceding: the previous n rows
  • n following: the next n rows
  • unbounded: unbounded preceding means from the start of the window; unbounded following means to the end of the window
  • lag(col, n, default_val): the value of col from the nth previous row
  • lead(col, n, default_val): the value of col from the nth following row
  • ntile(n): distributes the rows of an ordered window into n numbered groups, starting from 1; for each row, ntile returns the number of the group the row belongs to. n must be of type int

Data preparation

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Requirements

# Query the customers who purchased in April 2017 and the total number of such customers
# Query the customer's purchase details and the monthly total purchase amount
# In the above scenario, accumulate each customer's cost by date
# Query each customer's previous purchase time
# Query the first 20% of the order information (by date)

Create hive table and import data

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/hive/datas/business.txt" into table business;

Query the data as required

# Query the customers who purchased in April 2017 and the total number of such customers
# group by collapses the rows to one per customer; count(*) over() then counts across those rows
select name, count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

# Query the customer's purchase details and monthly total purchase amount
select name,orderdate,cost,sum(cost) 
over(partition by month(orderdate)) from
business;

# Accumulate each customer's cost by date
select name, orderdate, cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- sum within each name group
sum(cost) over(partition by name order by orderdate) as sample3, -- running total within each name group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, -- same as sample3: with order by, the default frame runs from the start of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, -- previous row, current row and next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 -- current row and all following rows
from business;

# Query each customer's previous purchase time
select name, orderdate, cost,
lag(orderdate, 1, '1900-01-01') over(partition by name order by orderdate) as time1,
lag(orderdate, 2) over(partition by name order by orderdate) as time2
from business;
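
# The symmetric query with lead(): each customer's next purchase time
# (a sketch; the '9999-99-99' default for the last row is an illustrative placeholder)
select name, orderdate, cost,
lead(orderdate, 1, '9999-99-99') over(partition by name order by orderdate) as next_time
from business;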

# Query the first 20% of the order information (by date)
# ntile(5) splits the ordered rows into 5 equal groups; group 1 is the first 20%
select * from (
    select name, orderdate, cost, ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;

Rank

  • rank(): equal values receive the same rank and leave gaps, so the total count is unchanged
  • dense_rank(): equal values receive the same rank without gaps, so the highest rank is reduced
  • row_number(): assigns consecutive numbers in order, ignoring ties

Data preparation

name subject score
Sun WuKong	language	87
Sun WuKong	mathematics	95
Sun WuKong	English	68
sea	language	94
sea	mathematics	56
sea	English	84
Song song	language	64
Song song	mathematics	86
Song song	English	84
Tingting	language	65
Tingting	mathematics	85
Tingting	English	78

Requirement

  • Calculate the score ranking of each subject

Create Hive table and import data

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/hive/datas/score.txt' into table score;

Query the data as required

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
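
A common follow-up use of this pattern (a sketch): keep only the top three scores in each subject by filtering on the computed rank:

select name, subject, score
from (
    select name, subject, score,
    dense_rank() over(partition by subject order by score desc) drp
    from score
) t
where drp <= 3;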
