1, Hive table structure
Format for creating table:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name (
    col1Name col1Type [COMMENT col_comment],
    col2Name col2Type [COMMENT col_comment],
    col3Name col3Type [COMMENT col_comment],
    ......
    colNName colNType [COMMENT col_comment]
)
[PARTITIONED BY (col_name data_type ...)]                                 -- Partitioned table structure
[CLUSTERED BY (col_name ...) [SORTED BY (col_name ...)] INTO N BUCKETS]   -- Bucketed table structure
[ROW FORMAT row_format]                                                   -- Specifies the delimiters of the data file
    row format delimited
    fields terminated by 'column separator'                               -- Column separator, default is \001
    lines terminated by 'line separator'                                  -- Line separator, default is \n
[STORED AS file_format]                                                   -- Specifies the storage format of the file
[LOCATION hdfs_path]                                                      -- Specifies the location of the table directory; by default it is under the database directory
1. Common table structure
MapReduce processing rules:
Step 1: retrieve the metadata and find the HDFS directory corresponding to the table
Step 2: take the last level directory of the table as the input of MapReduce program
Structure:
Data warehouse directory/Database directory/Table directory/Data file
Characteristics:
The last-level directory of the table is the table directory itself
Question: with a common table structure, if a file is manually put into the table directory with an HDFS command, can it be read through the table?
Yes. Because the last-level directory of the table is an ordinary directory, the file can be read through the table (see the sketch below)
Application:
Tables created by default use the common table structure
Generally used to build a table structure on top of the original data files
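A minimal sketch of this behavior, assuming an illustrative table tb_emp_plain in database db_emp, a local file /export/data/emp.txt, and the default warehouse directory /user/hive/warehouse:
create table db_emp.tb_emp_plain(
    empno string,
    ename string,
    deptno string
) row format delimited fields terminated by '\t';
-- putting a file into the table directory with an HDFS command is enough for Hive to read it through the table:
--   hdfs dfs -put /export/data/emp.txt /user/hive/warehouse/db_emp.db/tb_emp_plain/
select count(*) from db_emp.tb_emp_plain;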
2. Partition table structure
What is the problem with the common table structure?
A large number of meaningless filtering operations need to be done in the Map stage, resulting in a waste of resources
2.1 design of partition table structure
Design idea:
Divide the data into different directories according to certain rules and conditions for partition storage
When querying, you can filter at the directory level according to the query criteria, and MapReduce directly loads the directory of the data to be processed
Essence:
Divide the data into different directories for storage in advance
Reduce the amount of input data of the underlying MapReduce through query conditions, avoid useless filtering and improve performance
Structure:
Data warehouse directory/Database directory/Table directory/Partition directory/Partition data file
Characteristics:
The last-level directory of the table is the partition directory
Application:
Most commonly used table: partitioned external table
Partition: it is generally divided according to time
2.2 implementation of partition table structure:
1. Static partition
The data files are already organized by partition; create the partition table directly and load the data of each partition into it
Step 1: create partition table directly
Step 2: load each file into the corresponding partition
Create partition table:
create table tb_emp_part1(
    empno string,
    ename string,
    job string,
    managerno string,
    hiredate string,
    salary double,
    jiangjin double,
    deptno string
)
partitioned by (department int)
row format delimited fields terminated by '\t';
Load the corresponding data file to the corresponding partition
load data local inpath '/export/data/emp10.txt' into table tb_emp_part1 partition (department=10);
load data local inpath '/export/data/emp20.txt' into table tb_emp_part1 partition (department=20);
load data local inpath '/export/data/emp30.txt' into table tb_emp_part1 partition (department=30);
Test SQL execution plan: explain
Common table:
explain extended select count(*) as numb from tb_emp where deptno = 20;
Partition table:
explain extended select count(*) as numb from tb_emp_part1 where department = 20;
View the metadata (in the metastore database):
PARTITIONS table
SDS table
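A hedged sketch of checking these metastore tables, assuming a MySQL metastore (table and column names follow the standard metastore schema):
-- run against the metastore database, not inside Hive
select PART_ID, PART_NAME, TBL_ID from PARTITIONS;
select SD_ID, LOCATION from SDS;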
2. Dynamic partitioning
The data itself is not divided by partition rules in advance; the program divides it automatically and dynamically
Steps:
Step 1: first create a common table and load the overall data
tb_emp: common table. The employee information of all departments is in one directory file
#1. Create the employee table
create database db_emp;
use db_emp;
create table tb_emp(
    empno string,
    ename string,
    job string,
    managerid string,
    hiredate string,
    salary double,
    jiangjin double,
    deptno string
) row format delimited fields terminated by '\t';
#2. Load data
load data local inpath '/export/data/emp.txt' into table tb_emp;
Step 2: create partition table
create table tb_emp_part2(
    empno string,
    ename string,
    job string,
    managerno string,
    hiredate string,
    salary double,
    jiangjin double
)
partitioned by (dept string)
row format delimited fields terminated by '\t';
Enable dynamic partition:
set hive.exec.dynamic.partition.mode=nonstrict;
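Depending on the Hive version, dynamic partitioning itself may also need to be switched on explicitly; a hedged addition:
set hive.exec.dynamic.partition=true;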
step3: insert the data of the common table into the partition table to realize dynamic partitioning
insert into table tb_emp_part2 partition(dept) select ......,deptno from tb_emp ;
Requirement: the query should generally not use select *, and the last field of the select list must be the partition field (a spelled-out version of the insert is sketched below)
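A hedged, spelled-out version of the insert above, with the column list taken from the table definitions (positional mapping of the non-partition columns is assumed):
insert into table tb_emp_part2 partition(dept)
select empno, ename, job, managerid, hiredate, salary, jiangjin, deptno
from tb_emp;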
3. Multi-level partitioning
Create multi-level partition table
create table tb_ds_source(
    id string, url string, referer string, keyword string, type string, guid string,
    pageId string, moduleId string, linkId string, attachedInfo string, sessionId string,
    trackerU string, trackerType string, ip string, trackerSrc string, cookie string,
    orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string,
    productId string, curMerchantId string, provinceId string, cityId string, fee string,
    edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string,
    internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string
)
partitioned by (daystr string, hourstr string)
row format delimited fields terminated by '\t';
Load multi-level partition data
load data local inpath '/export/data/2015082818' into table tb_ds_source partition (daystr='20150828',hourstr='18');
load data local inpath '/export/data/2015082819' into table tb_ds_source partition (daystr='20150828',hourstr='19');
View partition Directory:
show partitions tb_ds_source;
A partition is at the directory level: one directory represents one partition
Directory name: partition field=partition value
Partition fields are logical: they exist as directory names rather than in the data files (see the query sketch below)
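For example, a hedged sketch of a query that prunes at the directory level, using the partitions loaded above:
select count(*) as numb
from tb_ds_source
where daystr = '20150828' and hourstr = '18';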
3. Bucket table structure
1. Problems with Join
Map Join: good performance, suitable for a small table joining a large table
Reduce Join: implemented through the grouping of the Shuffle phase; suitable for a large table joining a large table
2. Bucketing design
Idea: divide a large dataset into multiple small datasets by rule; each small dataset goes through a Map Join, which reduces the number of comparisons per record and improves performance
Purpose: optimize large-table joins, and allow sampling by bucket
Essence: split the data into multiple files through the partitioner (the Reduce partitioning rules) of the underlying MapReduce
Each file = each bucket = each Reduce
Division rule: hash modulo
Bucket fields are physical (real columns stored in the data files)
A bucketed table is a file-level design
Data cannot be loaded into a bucketed table with load; it is written with insert ... select
Structure:
Data warehouse directory/Database directory/Table directory/Bucketed data files
Bucket both large tables by the same rule to realize a Bucket Join: Map Joins are performed directly between corresponding buckets, reducing the number of comparisons and improving performance
Process:
step1: bucket both tables
step2: implement the bucket Join
Hive automatically judges whether the conditions for a bucket Join are met; if they are, the bucket Join is applied automatically
3. Implementing bucketing
Syntax:
clustered by col [sorted by col] into N buckets
    clustered by: which column to bucket by
    sorted by: which column to sort by inside each bucket
    N: the number of buckets, which determines how many Reduce tasks write the data at the bottom layer
Open configuration
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Create bucket table
create table tb_emp_bucket(
    empno string,
    ename string,
    job string,
    managerno string,
    hiredate string,
    salary double,
    jiangjin double,
    deptno string
)
clustered by (deptno) into 3 BUCKETS
row format delimited fields terminated by '\t';
Write bucket table
insert overwrite table tb_emp_bucket select * from tb_emp cluster by (deptno);
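A hedged sketch of the resulting bucket join, assuming a second illustrative table tb_dept_bucket that is also bucketed on deptno (with a bucket count that is a multiple of 3):
select a.empno, a.ename, b.dname
from tb_emp_bucket a join tb_dept_bucket b on a.deptno = b.deptno;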
2, Join in Hive
1.inner join: inner join
select a.empno, a.ename, b.deptno, b.dname from tb_emp a join tb_dept b on a.deptno = b.deptno;
Features: only rows that match on both sides appear in the result
2.left outer join: left outer join
select a.empno, a.ename, b.deptno, b.dname from tb_emp a left join tb_dept b on a.deptno = b.deptno;
Features: every row of the left table appears in the result; unmatched right-side columns are NULL
3.right outer join: right outer join
select a.empno, a.ename, b.deptno, b.dname from tb_emp a right join tb_dept b on a.deptno = b.deptno;
Features: every row of the right table appears in the result; unmatched left-side columns are NULL
4. full join: full outer join
select a.empno, a.ename, b.deptno, b.dname from tb_emp a full join tb_dept b on a.deptno = b.deptno;
Features: rows from either side appear in the result; unmatched columns are NULL
5.map join
Features: put the data of the small table into the distributed cache and Join with each part of the large table. It occurs on the Map side without shuffling
Scenario: small table join small table, small table join large table
Requirement: one table must be a small table
Implementation rules:
By default, Hive first judges whether the conditions for a Map Join are met
The judgment is based on the table's file size: smaller than 25MB
If the conditions are met, a Map Join is performed automatically
If not, it automatically falls back to a Reduce Join
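A hedged sketch of the related settings (property names from the standard Hive configuration; the small-table threshold defaults to roughly 25MB):
set hive.auto.convert.join=true;                  -- automatically convert to Map Join when possible
set hive.mapjoin.smalltable.filesize=25000000;    -- small-table size threshold, in bytes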
6.Reduce Join
Features: using Shuffle grouping to realize the Join process, which takes place at the Reduce end and needs to go through Shuffle
Scenario: large table join large table
Requirement: Hive falls back to Reduce Join automatically when the Map Join conditions are not satisfied
7.Bucket Join
Features: divide large datasets into multiple small datasets; each small dataset is a bucket, and the join is performed bucket by bucket
Scenario: large table join large table; worth optimizing when the same tables are joined repeatedly
Requirements:
Bucket Map Join: the ordinary bucket join
Both tables are bucketed, the bucket counts are multiples of each other, and the Join field = bucket field
Bucket Sort Merge Map Join: a bucket Map Join based on sorted buckets
Both tables are bucketed
The bucket counts are multiples of each other
Join field = bucket field = sort field
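A hedged sketch of the settings commonly used for these two variants (assuming the tables already satisfy the bucket and sort requirements above):
-- Bucket Map Join
set hive.optimize.bucketmapjoin = true;
-- Bucket Sort Merge Map Join
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join = true;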
3, Select syntax: order by and sort by
1. Number of ReduceTasks in Hive
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
    #Amount of data processed by each Reduce
    set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
    #The maximum number of Reduce tasks allowed to start
    set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
    #Set the number of Reduce tasks
    set mapreduce.job.reduces=<number>
#Specify the number of reduce as 2
set mapreduce.job.reduces=2;
2.order by
Function: Global sorting
select empno,ename,deptno,salary from tb_emp order by salary desc;
Question: if there are more than one Reduce, can global sorting be realized?
No: there cannot be more than one Reduce. When order by is used, only one Reduce is started even if more are configured (see the sketch below)
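A hedged sketch of this behavior: even with the reduce count set to 2, the order by query still runs with a single Reduce
set mapreduce.job.reduces=2;
select empno,ename,deptno,salary from tb_emp order by salary desc;    -- still runs with 1 Reduce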
3.sort by
Function: local sorting
In multiple reduce scenarios, each reduce is orderly
select empno,ename,deptno,salary from tb_emp sort by salary desc;
Write the SQL results to a local directory:
insert overwrite local directory '/export/data/sort'
row format delimited fields terminated by '\t'
select empno,ename,deptno,salary from tb_emp sort by salary desc;
4. What are the functions and differences between order by and sort by?
order by: Global sort. There can only be one reduce
sort by: local sort, multiple reduce, local sort for each reduce
4, Select syntax: distribute by and cluster by
1.distribute by
Function: intervene in the underlying MapReduce by specifying which field is used as K2 (the map output key), which decides how rows are distributed to the Reduce tasks
#Specify the number of reduce as 3
set mapreduce.job.reduces=3;
insert overwrite local directory '/export/data/distby'
row format delimited fields terminated by '\t'
select empno,ename,deptno,salary from tb_emp distribute by deptno;
2. Using distribute by and sort by together
insert overwrite local directory '/export/data/distby'
row format delimited fields terminated by '\t'
select empno,ename,deptno,salary from tb_emp distribute by deptno sort by salary desc;
Application:
distribute by 1       => puts all data into a single Reduce
distribute by rand()  => random distribution, used to avoid data skew
3.cluster by
Function: when distribute by and sort by use the same field, they can be replaced by cluster by
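For example, a minimal sketch of the equivalence on the tb_emp table above:
-- equivalent to: distribute by salary sort by salary
select empno,ename,deptno,salary from tb_emp cluster by salary;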
5, Data type: Array
1. Generate data
#1. Create array.txt
vim /export/data/array.txt
#2. Add data
zhangsan beijing,shanghai,tianjin
wangwu shanghai,chengdu,wuhan,haerbin
2. Create table
#1. Create the db_complex database
create database db_complex;
use db_complex;
#2. Create the complex_array table
create table if not exists complex_array(
    name string,
    work_locations array<string>
)
row format delimited fields terminated by '\t'
COLLECTION ITEMS TERMINATED BY ',';
COLLECTION ITEMS TERMINATED BY ',' is used to specify the separator between elements of the array
3. Load data
load data local inpath '/export/data/array.txt' into table complex_array;
4. Take out the required data
#1. Count the number of cities each user has worked in
select name,size(work_locations) as numb from complex_array;
#2. Take out individual elements of the array
select name,work_locations[0],work_locations[1] from complex_array;
6, Data type: Map
1. Generate data
#1. Create map.txt
vim /export/data/map.txt
#2. Add data
1,zhangsan,sing:like it very much-dance:like-swimming:so-so
2,lisi,play games:like it very much-basketball:dislike
2. Create table
create table if not exists complex_map(
    id int,
    name string,
    hobby map<string,string>
)
row format delimited fields terminated by ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':';
COLLECTION ITEMS TERMINATED BY '-'
Used to separate the KV pairs from each other
MAP KEYS TERMINATED BY ':'
Used to separate K from V
3. Load data
load data local inpath '/export/data/map.txt' into table complex_map;
4. Take out the required data
#1. Count how many hobbies each person has
select name,size(hobby) as numb from complex_map;
#2. Take out each person's preference for singing
select name,hobby["sing"] as deep from complex_map;
7, Regular loading
Hive's default delimited row format does not support multi-byte delimiters
1. Separator problem
The column separator in the data is a multi-byte string, such as ||
The delimiters of each column in the data are inconsistent
The data field contains delimiters
2. Treatment scheme
Scheme 1: do ETL first: develop a program to preprocess the data, then load the processed results into the Hive table
For example, replace the separator programmatically
Scheme 2: Hive officially provides a regular loading method
Regular expressions are used to match each column in each piece of data
3. Regular loading
Generate data
#1. Create regex.txt
vim /export/data/regex.txt
#2. Add data
2019-08-28 00:03:00 tom
2019-08-28 10:00:00 frank
2019-08-28 11:00:00 jack
2019-08-29 00:13:23 tom
2019-08-29 10:00:00 frank
2019-08-30 10:00:00 tom
2019-08-30 12:00:00 jack
Create table
Normal creation
#1. Create table regex1
create table regex1(
    timestr string,
    name string
)
row format delimited fields terminated by ' ';
#2. Load data
load data local inpath '/export/data/regex.txt' into table regex1;
Regular loading
#1. Create table regex2
create table regex2(
    timestr string,
    name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^}]*) ([^ ]*)"
)
STORED AS TEXTFILE;
#2. Load data
load data local inpath '/export/data/regex.txt' into table regex2;
8, Functions in Hive: built in functions
#1. View functions
show functions;
#2. Check function usage
desc function [extended] funName;
#3. Common functions
#3.1 Aggregate functions: count, sum, avg, max, min
#3.2 Conditional functions: if, case when
#3.3 String functions
    Substring: substring, substr
    Concatenation: concat, concat_ws
    Splitting: split
    Search: instr
    Replace: regexp_replace
    Length: length
#3.4 Date functions
    Conversion: unix_timestamp, from_unixtime
    Date: current_date, date_sub, date_add
    Extraction: year, month, day, hour
#4. Special functions
    JSON: json_tuple, get_json_object
    URL: parse_url, parse_url_tuple
    Window functions:
        Aggregation windows: sum, count......
        Positional windows: lag, lead, first_value, last_value
    Analytic functions: row_number, rank, dense_rank
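A few hedged usage sketches of the common functions listed above (expected results shown as comments):
select concat_ws('-', '2019', '12', '24');                                   -- 2019-12-24
select split('a,b,c', ',')[1];                                               -- b
select if(1 > 0, 'yes', 'no');                                               -- yes
select from_unixtime(unix_timestamp('2019-12-24 15:55:01'), 'yyyy/MM/dd');   -- 2019/12/24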
9, Functions in Hive: custom functions
1. Function classification
UDF: one-to-one function, similar to substr
UDAF: many-to-one function, similar to an aggregate function such as count
UDTF: one-to-many function; explode, for example, turns each element of a collection into a row
2. User defined UDF function
Requirement: convert 24/Dec/2019:15:55:01 -> 2019-12-24 15:55:01
step1: define a class that extends the UDF class
step2: implement at least one evaluate method in the class to define the data-processing logic
step3: build a jar package and add it to Hive's classpath
add jar /export/data/udf.jar;
step4: register the class as a function
create temporary function transFDate as 'com.miao.hive.udf.UserUDF';
Step 5: call function
select transFDate("24/Dec/2019:15:55:01");
3. Customize UDAF and UDTF
UDAF
Step 1: register the class as a function
create temporary function userMax as 'com.miao.hive.udaf.UserUDAF';
Step 2: call function
select userMax(cast(deptno as int)) from db_emp.tb_dept;
Cast: cast function
cast(column as type)
UDTF
Step 1: register the class as a function
create temporary function transMap as 'com.miao.hive.udtf.UserUDTF';
Step 2: call function
select transMap("uuid=root&url=www.taobao.com") as (userCol1,userCol2);
10, Functions in Hive: parse_url_tuple
#1. Create lateral.txt
vim /export/data/lateral.txt
#2. Add data
1 http://facebook.com/path/p1.php?query=1
2 http://www.baidu.com/news/index.jsp?uuid=frank
3 http://www.jd.com/index?source=baidu
#3. Create table
create table tb_url(
    id int,
    url string
)
row format delimited fields terminated by '\t';
#4. Load data
load data local inpath '/export/data/lateral.txt' into table tb_url;
1.parse_url: used to parse URLs. Only one element can be parsed at a time
2.parse_url_tuple: used to parse URLs. It is a UDTF function that parses multiple elements at a time
3. Syntax:
select parse_url(url,'HOST') from tb_url;
select parse_url(url,'PATH') from tb_url;
select parse_url(url,'QUERY') from tb_url;
select parse_url_tuple(url,'HOST','PATH','QUERY') from tb_url;
11, Function in Hive: lateral view
1.UDTF function problem
A UDTF can only be used directly in the select clause; it cannot be combined with other fields, cannot be nested, and cannot be used together with group by / cluster by / distribute by / sort by
2.lateral view
Used in combination with a UDTF: the result of the UDTF is turned into a view-like table, which makes it convenient to operate on together with the original table
Syntax:
select ...... from tableA lateral view UDTF(xxx) alias as col1,col2,col3......
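For example, a hedged sketch combining lateral view with the parse_url_tuple UDTF on the tb_url table from the previous section:
select a.id, b.host, b.path, b.query
from tb_url a
lateral view parse_url_tuple(url,'HOST','PATH','QUERY') b as host, path, query;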
12, Functions in Hive: explode
1. Expand function
Turns each element of a collection (Map or Array) into a row
Syntax:
explode( Map | Array)
2. Usage
Use alone
select explode(work_locations) as loc from complex_array;
select explode(hobby) from complex_map;
Use with lateral view
select a.id, a.name, b.* from complex_map a lateral view explode(hobby) b as hobby,deep;
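A hedged follow-up sketch of why the lateral view matters: unlike a bare UDTF call, its output can be combined with group by (the column aliases hb and deep are illustrative):
select b.hb, count(*) as numb
from complex_map a
lateral view explode(hobby) b as hb, deep
group by b.hb;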