Hive learning ---- basic statements

The previous chapter covered how to install Hive and gave a basic introduction to it. Here we start actually using Hive. Before doing so, let's go through Hive's basic statements and learn what internal tables and external tables are.

Hive basic statements

Let's start with the most general format of the statement; since it has many variants, we first look at the overall shape and then break it down piece by piece.

1. CREATE TABLE statement

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  // Define the column names and column types
  [(col_name data_type [COMMENT col_comment], ...)]
  // Comment on the table
  [COMMENT table_comment]
  // Partitioning
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  // Bucketing
  [CLUSTERED BY (col_name, col_name, ...) 
  // Set the sort columns, ascending or descending
  [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [
  	// Specify the row and column separators 
   [ROW FORMAT row_format] 
   // Specify the Hive storage format: TextFile, RCFile, SequenceFile. The default is TextFile
   [STORED AS file_format]
   // A storage handler can be used together with external tables, e.g. to map an HBase table and then query the HBase data with HQL (though this is slow)
   | STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]  (Note:  only available starting with 0.6.0)
  ]
  // Specify the storage location
  [LOCATION hdfs_path]
  // Set table properties
  [TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
  [AS select_statement]  (Note: this feature is only available starting with 0.5.0.)

Let's go into detail one by one:

Create table 1: use all the defaults

Here is the most basic create-table statement. The last line can be omitted, but then Hive falls back to its default field delimiter (\001), so comma-separated data would all end up in a single column.

create table students
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; // Required to specify the column separator 

Create table 2: specify a location (this method is also commonly used)

Generally, when the data has already been uploaded to HDFS and you want to use it directly, you specify a location. LOCATION is usually used together with external tables, while internal tables generally use the default location.
Note that specifying the data's storage location does not by itself make a table external; this is described later.

create table students2
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/input1'; // Specifies where the Hive table's data is stored

Create table 3: specify the storage format

This specifies the table's storage format; several storage formats were introduced in the previous chapter.

create table students3
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS rcfile; // The specified storage format is rcfile (inputFormat: RCFileInputFormat, outputFormat: RCFileOutputFormat). If not specified, it defaults to textfile. Note: for storage formats other than textfile, data cannot be loaded directly with load data; it has to be inserted from another table (insert ... select).

Create table 4

create table xxxx as select_statement (an SQL statement) (this method is quite commonly used)

Does this way of creating tables look familiar? We have seen it in MySQL as well.

create table students4 as select * from students2;

Create table 5

create table xxxx like table_name  // creates only the table structure, without loading any data
create table students5 like students;

Hive load data

Upload file

  1. Use hdfs dfs -put to upload a local file into the HDFS directory corresponding to the Hive table
  2. Use load data inpath
The first way needs no explanation. Let's look at the second:
// Move the data under the /input1 directory on HDFS into the HDFS directory corresponding to the students table. Note that this is a move, not a copy
load data inpath '/input1/students.txt' into table students;
Why emphasize the move? Because when a file is loaded this way, it no longer exists in its original directory. If you upload a file the first way, however, the file still exists at its original path. That is a simple way to tell the two apart.
Note:
  Putting data straight into an HDFS directory does not involve the Hive table itself;
  loading data into a Hive table goes through Hive (data conversion may be required).

Truncate table

Empty a table's contents

truncate table students;

Note: with the local keyword, files are loaded from the local Linux filesystem into the Hive table; in this case the original file is copied up rather than moved, so nothing in the source directory is deleted

load data local inpath '/usr/local/soft/data/students.txt' into table students;
Use the overwrite keyword to overwrite-load (replace the table's existing contents):
load data local inpath '/usr/local/soft/data/students.txt' overwrite into table students;
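The difference between into and overwrite can be pictured as append versus replace. A minimal Python sketch (hypothetical, just to model the semantics; not Hive code):

```python
# Hypothetical model: a Hive table's data directory is just a list of rows.
def load_data(table_rows, new_rows, overwrite=False):
    """Mimic 'load data ... [overwrite] into table': append by default,
    replace all existing contents when overwrite is set."""
    if overwrite:
        table_rows.clear()   # overwrite: existing files are removed first
    table_rows.extend(new_rows)
    return table_rows

students = ["row1", "row2"]
load_data(students, ["row3"])                       # into: appends
print(len(students))                                # 3
load_data(students, ["row4"], overwrite=True)       # overwrite: replaces
print(len(students))                                # 1
```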

Loading data via the hive shell

HDFS commands can also be run directly inside the hive shell:

hive> dfs -put /usr/local/soft/data/students.txt /input2/;
hive> dfs -put /usr/local/soft/data/students.txt /input3/;

Delete table

format

drop table <table_name>;

Example session (the first drop is of an internal table, so its HDFS data is moved to trash; dropping the external table leaves the data in place):

Moved: 'hdfs://master:9000/input2' to trash at: hdfs://master:9000/user/root/.Trash/Current

OK
Time taken: 0.474 seconds
hive> drop table students_external;
OK
Time taken: 0.09 seconds
hive> 

Note:

  1. As you can see, when an internal table is dropped, the table's data (the files on HDFS) is deleted along with the table's metadata.
  2. When an external table is dropped, only the table's metadata is deleted; the table's data (the files on HDFS) is kept.
  3. In general, companies use external tables more, because the data can then be shared by multiple programs and accidental deletion is avoided. External tables are usually used together with LOCATION.
  4. External tables can also map data from other data sources into Hive, such as HBase or Elasticsearch.
  5. The original intent of external tables is to decouple a table's metadata from its data.
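The note above can be pictured with a tiny Python sketch (hypothetical classes, not Hive internals): dropping a table always removes the metadata, but only an internal table's drop also removes the files.

```python
# Hypothetical sketch of drop semantics for internal vs external tables.
class HiveTable:
    def __init__(self, name, external, hdfs):
        self.name = name
        self.external = external
        self.hdfs = hdfs                      # dict: table name -> files
        self.metadata = {"name": name, "external": external}

    def drop(self):
        self.metadata = None                  # metadata is always removed
        if not self.external:                 # internal table: data goes too
            self.hdfs.pop(self.name, None)

hdfs = {"students": ["part-0"], "students_external": ["part-0"]}
HiveTable("students", external=False, hdfs=hdfs).drop()
HiveTable("students_external", external=True, hdfs=hdfs).drop()
print(sorted(hdfs))   # only the external table's files remain
```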

Hive partition

concept

A partitioned table is really just a set of subdirectories, named after the partitions, under the table's directory

effect

Partition pruning avoids full table scans, reduces the amount of data MapReduce has to process, and improves efficiency

Notes:

  1. In a company, almost all tables are generally partitioned tables, usually partitioned by date and region.
  2. Remember to filter on the partition column when querying a partitioned table
  3. More partitions is not always better; generally go no deeper than three levels, and decide based on the actual business.
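Partition pruning can be sketched in Python (hypothetical layout, not Hive internals): each partition is a directory, and a predicate on the partition column decides which directories are read at all.

```python
# Hypothetical layout: partition directory name -> rows stored under it.
table = {
    "pt=20210110": [("a", 1)],
    "pt=20210111": [("b", 2)],
    "pt=20210902": [("c", 3)],
}

def scan(table, pt_pred=lambda pt: True):
    rows = []
    for part_dir, data in table.items():
        pt = part_dir.split("=")[1]
        if pt_pred(pt):              # prune: skip non-matching partitions
            rows.extend(data)
    return rows

full = scan(table)                                  # full table scan
pruned = scan(table, lambda pt: pt <= "20210112")   # reads 2 of 3 partitions
print(len(full), len(pruned))                       # 3 2
```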

Create partition table

create external table students_pt1
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Add a partition

alter table students_pt1 add partition(pt='20210904');

Delete a partition

alter table students_pt1 drop partition(pt='20210904');

View partition

View all partitions in a table

show partitions students_pt; // This method is recommended (get partition information directly from metadata)

select distinct pt from students_pt; // Not recommended

insert data

Add data to partition

insert into table students_pt partition(pt='20210902') select * from students;

load data local inpath '/usr/local/soft/data/students.txt' into table students_pt partition(pt='20210902');

Query partition data

Query the data of a partition

// Full table scanning, not recommended, low efficiency
select count(*) from students_pt;

// The use of where conditions for partition clipping avoids full table scanning and has high efficiency
select count(*) from students_pt where pt='20210101';

// Non equivalent judgment can also be used in the where condition
select count(*) from students_pt where pt<='20210112' and pt>='20210110';

Hive dynamic partition

Sometimes the original table contains a date field dt, and we want rows with different dates to go into different partitions, turning the original table into a partitioned table.
Hive does not enable dynamic partitioning by default.

concept

Dynamic partitioning: split the data into different partitions according to the values of certain columns.

Enable Hive's dynamic partition support

# Enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
# Dynamic partition mode: strict (requires at least one static partition), nonstrict
# strict:  insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# Maximum number of dynamic partitions each node may create, here 1000; adjust to the business
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

practice

  1. Create the original table and load the data
create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  2. Create the partition table and load the data
create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  3. Insert data using dynamic partitions
// The partition column must come last in the select. With multiple partition columns, matching is by position, not by name
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;


// For example, the following statement would use age as the partition value rather than the dt column of students_dt
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;
  4. Multilevel partitions
create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;
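The positional matching rule can be sketched in Python (hypothetical helper, not Hive code): the trailing select columns are taken as the partition values, in order, and each row is routed to the partition named after them.

```python
# Rows with 5 data columns followed by 2 partition columns (year, month).
rows = [
    (1, "zs", 22, "m", "c1", "2021", "09"),
    (2, "ls", 23, "f", "c2", "2021", "10"),
]

def dynamic_insert(rows, n_part_cols):
    """Route each row to a partition keyed by its trailing columns."""
    parts = {}
    for row in rows:
        data, part_vals = row[:-n_part_cols], row[-n_part_cols:]
        parts.setdefault(part_vals, []).append(data)  # positional match
    return parts

parts = dynamic_insert(rows, n_part_cols=2)
print(sorted(parts))   # [('2021', '09'), ('2021', '10')]
```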

Hive bucketing

concept

Hive bucketing is a further splitting of the files (data).
It is off by default in Hive.

effect

When inserting data into a bucketed table, rows are hash-partitioned on the column specified by clustered by; the hash value modulo the specified number of buckets determines which of the bucket files each row goes to. This spreads the data evenly, which helps with data skew on the Map side, makes it easy to take sample data, and improves the efficiency of map joins.

The bucketing column should be chosen according to the business
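The hash-and-modulo assignment can be sketched in Python (Python's built-in hash merely stands in for Hive's own hash function):

```python
def bucket_id(clazz, n_buckets=12):
    """Assign a row to a bucket file by hashing the clustered-by value."""
    return hash(clazz) % n_buckets   # stand-in for Hive's hash

rows = ["class1", "class2", "class1", "class3"]
buckets = {}
for clazz in rows:
    buckets.setdefault(bucket_id(clazz), []).append(clazz)

# Rows with the same clazz always land in the same bucket file,
# and every bucket id lies in [0, n_buckets).
print(bucket_id("class1") == bucket_id("class1"),
      all(0 <= b < 12 for b in buckets))
```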

Enable bucketing

Since bucketing is off by default, turn it on first:

hive> set hive.enforce.bucketing=true;

Create bucket table

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 

Add data

// Loading with load data directly does not split the data into buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

// You need to insert data with insert ... select for bucketing to take effect
insert into students_buks select * from students;

Usage scenarios, advantages and disadvantages of Hive bucketed tables: https://zhuanlan.zhihu.com/p/93728864

Hive JDBC

Start hive server2

hive --service hiveserver2 &

or
hiveserver2 &
Create a new Maven project and add two dependencies:
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.2.1</version>
    </dependency>
Write the JDBC code:
import java.sql.*;

public class HiveJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/test3");
        Statement stat = conn.createStatement();
        ResultSet rs = stat.executeQuery("select * from students limit 10");
        while (rs.next()) {
            int id = rs.getInt(1);
            String name = rs.getString(2);
            int age = rs.getInt(3);
            String gender = rs.getString(4);
            String clazz = rs.getString(5);
            System.out.println(id + "," + name + "," + age + "," + gender + "," + clazz);
        }
        rs.close();
        stat.close();
        conn.close();
    }
}


Added by FezEvils on Mon, 27 Dec 2021 07:35:09 +0200