The previous chapter covered installing Hive and gave a basic introduction to it. Here we start actually using Hive: first the basic statements, then what internal and external tables are.
Hive basic statements
Let's take a look at the most basic format first. There are many variants, so we will start with a general one and then break it down piece by piece.
1. CREATE TABLE statement
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
// Define field names and field types
[(col_name data_type [COMMENT col_comment], ...)]
// Comment on the table
[COMMENT table_comment]
// Partitioning
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
// Bucketing
[CLUSTERED BY (col_name, col_name, ...)
// Set ascending/descending sort fields
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
  // Specify row and column separators
  [ROW FORMAT row_format]
  // Specify the Hive storage format: textFile, rcFile, SequenceFile. The default is textFile
  [STORED AS file_format]
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] (Note: only available starting with 0.6.0)
]
// Specify the storage location
[LOCATION hdfs_path]
// Can be used in conjunction with external tables, e.g. to map HBase tables and then query the HBase data with HQL (slowly, of course)
[TBLPROPERTIES (property_name=property_value, ...)] (Note: only available starting with 0.6.0)
[AS select_statement] (Note: this feature is only available starting with 0.5.0)
Let's go through these one by one:
Create table 1: use the default table creation method
Here is the most basic default CREATE TABLE statement. The last clause can be omitted, but if you omit it Hive falls back to its default field delimiter ('\001'), so comma-separated data would all end up in a single column.
create table students
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; // required: specifies the column separator
Create table 2: specify location (this method is also commonly used)
Generally, when the data has already been uploaded to HDFS and you want to use it directly, you specify a location. LOCATION is usually used together with external tables, while internal tables generally use the default location.
Specifying where the data is stored does not by itself make a table external; more on this later.
create table students2
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/input1'; // specifies where the Hive table's data is stored
Create table 3: specify storage format
This specifies the table's storage format. Several data storage formats were introduced in the previous chapter.
create table students3
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS rcfile; // specifies rcfile as the storage format (inputFormat: RCFileInputFormat, outputFormat: RCFileOutputFormat); defaults to textfile if unspecified

Note: except for textfile, data cannot be loaded directly into the other storage formats; you have to insert it from another table.
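For example, a minimal sketch of that insert-from-table workaround, assuming the plain-text students table from above already holds data:

// populate the rcfile table by copying rows out of a textfile table
insert into table students3 select * from students;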
Create table 4: create table ... as select
create table xxxx as select_statement (an SQL query) (this method is quite commonly used)
Does this way of creating a table look familiar? We have seen it in MySQL as well.
create table students4 as select * from students2;
Create table 5: create table ... like
create table xxxx like table_name — use this when you only want to copy a table's structure without loading data
create table students5 like students;
Loading data into Hive
Upload file
- Use hdfs dfs -put to copy local data into the HDFS directory corresponding to the Hive table
- Using load data inpath
The first way needs no further explanation. Let's look at the second:

// Move the data under the /input1 directory on HDFS into the HDFS directory corresponding to the students table. Note: move, move, move
load data inpath '/input1/students.txt' into table students;

Why emphasize moving here? Because when files are transferred this way, they no longer exist in the original directory. If you upload files the first way, the files at the original path still exist. That is a simple way to distinguish the two.
Note here: uploading data into some arbitrary HDFS directory creates no relationship with a Hive table; only data placed in the directory backing a Hive table is related to that table (the data has to be moved there).
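For instance, a hedged sketch of putting a file straight into the table's own directory (assuming the default warehouse location and the default database):

hdfs dfs -put /usr/local/soft/data/students.txt /user/hive/warehouse/students/

After this, the rows are immediately visible to select * from students, because the file now lives in the directory backing the table.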
Truncate a table
Empty the contents of a table:
truncate table students;
Note here: add the local keyword to upload files from a local Linux directory into the Hive table; the original local files are not deleted (a local load copies rather than moves).
load data local inpath '/usr/local/soft/data/students.txt' into table students;
overwrite: replace the table's existing contents while loading

load data local inpath '/usr/local/soft/data/students.txt' overwrite into table students;
Load data
Basic format (dfs commands can be run directly inside the Hive shell):

hive> dfs -put /usr/local/soft/data/students.txt /input2/;
hive> dfs -put /usr/local/soft/data/students.txt /input3/;
Delete table
Format:

drop table <table_name>;

Dropping an internal table moves its data to the trash; dropping an external table leaves the data alone:

Moved: 'hdfs://master:9000/input2' to trash at: hdfs://master:9000/user/root/.Trash/Current
OK
Time taken: 0.474 seconds
hive> drop table students_external;
OK
Time taken: 0.09 seconds
hive>
Note:
- As you can see, when an internal table is deleted, the data in the table (the files on HDFS) is deleted along with the table's metadata.
- When an external table is deleted, only the table's metadata is deleted; the data in the table (the files on HDFS) is not deleted.
- In practice, companies mostly use external tables, because the data can then be used by multiple programs and accidental deletion is avoided. External tables are usually used together with location (see the sketch after this list).
- External tables can also map data from other data sources into Hive, such as HBase or Elasticsearch.
- The original intention behind external tables is to decouple a table's metadata from its data.
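A minimal sketch of that usual external-table-plus-location pattern (the table name students_ext is just for illustration; /input3 was populated earlier):

create external table students_ext
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/input3'; // the files already under /input3 become queryable; drop table leaves them in place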
Hive partition
Concept
A partitioned table is really just a set of subdirectories under the table directory, one named after each partition.
Effect
Partition pruning avoids full-table scans, reduces the amount of data MapReduce has to process, and improves efficiency.
Notes:
- In a company, almost all tables are partitioned tables, usually partitioned by date and by region.
- Remember to add the partition field to your conditions when querying a partitioned table.
- More partitions are not always better; generally go no deeper than three levels, judged against the actual business.
Create partition table
create external table students_pt1
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Add a partition
alter table students_pt1 add partition(pt='20210904');
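To see the subdirectory-per-partition layout described above, you can list the table's directory (a sketch assuming the default warehouse location and the default database):

hive> dfs -ls /user/hive/warehouse/students_pt1;
// expect one subdirectory per partition, e.g. .../students_pt1/pt=20210904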
Delete a partition
alter table students_pt1 drop partition(pt='20210904');
View partitions
View all partitions in a table:

show partitions students_pt1; // recommended (reads partition information directly from the metadata)

select distinct pt from students_pt1; // not recommended (launches a job that scans the data)
Insert data
Add data to a partition:

insert into table students_pt1 partition(pt='20210902') select * from students;

load data local inpath '/usr/local/soft/data/students.txt' into table students_pt1 partition(pt='20210902');
Query partition data
Query the data of a partition
// Full-table scan: not recommended, inefficient
select count(*) from students_pt1;

// A where condition on the partition column performs partition pruning, avoiding a full-table scan: efficient
select count(*) from students_pt1 where pt='20210101';

// Non-equality predicates can also be used in the where condition
select count(*) from students_pt1 where pt<='20210112' and pt>='20210110';
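If you want to confirm that pruning really happened, one hedged option (available since roughly Hive 0.10) is explain dependency, which lists exactly the partitions a query will read:

explain dependency select count(*) from students_pt1 where pt='20210101';
// the JSON output's input_partitions field should contain only the pt=20210101 partition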
Hive dynamic partition
Sometimes the data in our original table contains a date field dt, and we want rows with different dates in dt to go into different partitions, turning the original table into a partitioned table. Hive does not enable dynamic partitioning by default.
Concept
Dynamic partitioning: divide the data into different partitions according to the different values of certain columns.
Enable Hive's dynamic partition support
# Enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;

# Dynamic partition mode: strict (must be combined with a static partition), nonstrict
# strict example: insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

# The maximum number of dynamic partitions per node is set to 1000 here; adjust it according to the business
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
Practice
- Create the original table and load the data
create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create partition table and load data
create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Insert data using dynamic partitions
// The partition field must be placed at the end of the select list; with multiple partition fields, matching is by position, not by name
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;

// For example, the following statement will use age (the last select column) as the partition value for dt, not students_dt's dt column
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;
- Multilevel partition
create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;
Hive bucketing
Concept
Bucketing in Hive is in essence a further splitting of a table's files (data) into a fixed number of pieces.
Bucketing is off by default in Hive.
Effect
When data is inserted into a bucketed table, each row is hash-partitioned on the field specified by clustered by: the hash is taken modulo the configured number of buckets, splitting the data into that many files. This spreads the data evenly, which helps with Map-side data skew, makes it convenient to take data samples, and improves map join efficiency. The bucketing field needs to be chosen according to the business.
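As a rough illustration of that assignment rule (a sketch using Hive's built-in hash and pmod functions; the actual bucket-file assignment follows the same hash-mod idea):

// which of 12 buckets each clazz value would land in
select clazz, pmod(hash(clazz), 12) as bucket_id
from students
limit 10;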
Enable bucketing
Since bucketing is off by default, it has to be switched on first:
hive> set hive.enforce.bucketing=true;
Create a bucketed table
create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Add data
// Loading with load data directly does not split the data into buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

// Insert data in the following way for the bucketed table to take effect
insert into students_buks select * from students;
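One payoff of bucketing is cheap sampling; a hedged sketch using TABLESAMPLE, which on a table bucketed on clazz can read just the matching bucket files:

// sample roughly 1 of the 12 buckets
select * from students_buks tablesample(bucket 1 out of 12 on clazz);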
Usage scenarios, advantages, and disadvantages of Hive bucketed tables: https://zhuanlan.zhihu.com/p/93728864
Hive JDBC
Start hiveserver2

hive --service hiveserver2 &
// or
hiveserver2 &
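Before writing any Java, it may be worth checking that the server is reachable; a sketch using beeline, which ships with Hive (the root username is an assumption for your environment):

beeline -u jdbc:hive2://master:10000 -n root -e "show databases;"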
Create a new Maven project and add two dependencies
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
Writing JDBC code
import java.sql.*;

public class HiveJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to hiveserver2 (default port 10000), database test3
        Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/test3");
        Statement stat = conn.createStatement();
        ResultSet rs = stat.executeQuery("select * from students limit 10");
        // Iterate over the result set; columns are read by position
        while (rs.next()) {
            int id = rs.getInt(1);
            String name = rs.getString(2);
            int age = rs.getInt(3);
            String gender = rs.getString(4);
            String clazz = rs.getString(5);
            System.out.println(id + "," + name + "," + age + "," + gender + "," + clazz);
        }
        // Release resources
        rs.close();
        stat.close();
        conn.close();
    }
}