Hive SQL syntax summary

I've been doing Hive-related work lately. Fortunately, I had learned a little before and got up to speed quickly. Now that I have some free time, let's systematically review the syntax of Hive SQL.

Preface

Hive is a data warehouse tool built on top of Hadoop. It is used to process structured data stored in Hadoop, and it lets you operate on that data through a SQL-like language.

Hive query execution strictly follows the Hadoop MapReduce job execution model: Hive converts the user's Hive SQL statements into MapReduce jobs through its interpreter and submits them to the Hadoop cluster; Hadoop monitors the job execution process and returns the results to the user. Hive is not designed for online transaction processing; it does not provide real-time queries or row-level updates. Hive is best suited for batch jobs over large data sets.

Common SQL syntax

(clauses enclosed in "[]" are optional)

Create database

CREATE DATABASE name;
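
A fuller form, for reference (the database name test_db, the comment, and the location are illustrative):

CREATE DATABASE IF NOT EXISTS test_db
COMMENT 'a demo database'
LOCATION '/user/hive/warehouse/test_db.db';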
  • Display and view operation commands
show tables; --List tables

show databases; --List databases

show partitions table_name; --Show all partitions of the table named table_name

show functions; --Show all functions

describe extended table_name col_name; --View details of a field in the table
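
For example, assuming the person table created later in this article, a minimal sketch:

describe person; --list the table's columns
describe extended person; --also show detailed table metadata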

DDL (Data Definition Language)

Data definition language

  • Create table structure
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]
  • CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the user can suppress this exception with the IF NOT EXISTS option
  • EXTERNAL keyword allows users to create an EXTERNAL table and specify a path (LOCATION) to the actual data while creating the table
  • LIKE allows users to copy existing table structures without copying data
  • COMMENT can add descriptions to tables and fields
  • ROW FORMAT sets the row data segmentation format
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
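
As an illustration of the SERDE branch, here is a sketch using Hive's built-in OpenCSVSerde (the table name csv_table is hypothetical):

CREATE TABLE csv_table (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;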
  • STORED AS
SEQUENCEFILE

| TEXTFILE

| RCFILE

| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

If the file data is plain text, you can use STORED AS TEXTFILE.

If the data needs to be compressed, use STORED AS SEQUENCEFILE.
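
To actually produce compressed SequenceFile output from a query, settings along these lines are commonly used (a sketch; property names can vary across Hadoop/Hive versions):

set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;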

To create a simple table:

CREATE TABLE person(name STRING,age INT);

To create an external table:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User',

country STRING COMMENT 'country of origination')

COMMENT 'Write the description of the table here'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'

STORED AS TEXTFILE

LOCATION '<hdfs_location>';

Create partition table:

CREATE TABLE par_table(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(date STRING, pos STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LINES TERMINATED BY '\n'

STORED AS SEQUENCEFILE;

Create bucket table:

CREATE TABLE bucket_table(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(date STRING, pos STRING)

CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LINES TERMINATED BY '\n'

STORED AS SEQUENCEFILE;

To create a table with a partition field:

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Copy an empty table:

CREATE TABLE empty_key_value_store

LIKE key_value_store;

Show all tables:

SHOW TABLES;

List tables matching a regular expression:

SHOW TABLES '.*s';

Add a field to the table:

ALTER TABLE pokes ADD COLUMNS (new_col INT);

Add a field and comment on it:

ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');

Delete columns (REPLACE COLUMNS rewrites the entire column list, so any column not listed is dropped):

ALTER TABLE test REPLACE COLUMNS(id BIGINT, name STRING);

Change table name:

ALTER TABLE events RENAME TO new_events;

Adding and deleting partitions

--Add:

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...

partition_spec:

: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

   

--Delete:

ALTER TABLE table_name DROP partition_spec, partition_spec,...
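
A concrete sketch using the par_table defined earlier (the partition values and the HDFS path are illustrative):

ALTER TABLE par_table ADD IF NOT EXISTS PARTITION (date='2021-09-26', pos='1') LOCATION '/data/par_table/2021-09-26/1';
ALTER TABLE par_table DROP PARTITION (date='2021-09-26', pos='1');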

Change the file format and organization of the table:

ALTER TABLE table_name SET FILEFORMAT file_format

ALTER TABLE table_name CLUSTERED BY(userid) SORTED BY(viewTime) INTO num_buckets BUCKETS --This command modifies the physical storage properties of the table

Create and delete views:

--Create view:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT view_comment] [TBLPROPERTIES (property_name = property_value, ...)] AS select_statement;

   

--Delete view:

DROP VIEW view_name;
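
A minimal sketch using the person table from above (the view name adult_person is illustrative):

CREATE VIEW IF NOT EXISTS adult_person
COMMENT 'people aged 18 or older'
AS SELECT name, age FROM person WHERE age >= 18;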

DML (Data Manipulation Language)

The data manipulation language covers adding, deleting, and modifying data in the database. DML includes INSERT, UPDATE, and DELETE.

To load a file into a data table:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

--The LOAD operation is just a simple copy/move operation that moves the data file to the location corresponding to the Hive table.

--Load local data

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

   

--Load HDFS data and supply partition information

LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

Insert query results into Hive table:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

   

--Multi insert mode:

FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

   

--Dynamic partition mode

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
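
Dynamic partition inserts usually require enabling these settings first; the staging_table below is hypothetical:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE par_table PARTITION (date, pos)
SELECT viewTime, userid, page_url, referrer_url, ip, date, pos FROM staging_table;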

Insert query results into HDFS file system:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

FROM from_statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]

INSERT INTO

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

The difference between insert overwrite and insert into:

insert overwrite replaces the existing data: the original data in the table (or partition) is removed first, and then the new data is inserted.
insert into is a simple append: it does not touch the existing data and directly adds the new rows, so the table ends up containing both the original data and the newly inserted data.
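
A minimal sketch, assuming a hypothetical staging table person_new with the same schema as person:

INSERT OVERWRITE TABLE person SELECT * FROM person_new; --person now contains only the rows from person_new
INSERT INTO TABLE person SELECT * FROM person_new; --appends; person keeps its original rows as well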

DQL (Data Query Language): the SELECT operation

SELECT query structure:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list [HAVING condition]]

[ CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]

]

[LIMIT number]
  • Use the ALL and DISTINCT options to control how duplicate records are handled. The default is ALL, meaning all records are returned; DISTINCT removes duplicates
  • The WHERE condition is similar to the WHERE condition in traditional SQL
  • ORDER BY performs a global sort, so only one reduce task is used
  • SORT BY sorts only within each reducer; see the sketch below
  • LIMIT restricts the number of rows returned (and, in newer Hive versions, the starting offset)
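
A sketch contrasting the two sorts, using the pokes table from earlier:

SELECT * FROM pokes ORDER BY foo; --single reducer, globally sorted result
SELECT * FROM pokes DISTRIBUTE BY bar SORT BY foo; --sorted only within each reducer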

Output query data to directory:

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';

Output query results to local directory:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Insert the results of one table into another:

FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;

INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

JOIN

FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Insert data from one table into multiple tables and directories (multi-insert):

FROM src

INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100

INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200

INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

Hive only supports equi-joins, outer joins, and left semi joins. Hive does not support non-equi joins, because they are very difficult to convert into map/reduce tasks.

  • The LEFT, RIGHT and FULL OUTER keywords are used to handle rows that have no match on the other side of the join (they are padded with NULLs)
  • LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery
  • During a join, each map/reduce task works as follows: the reducer caches the records of every table in the join sequence except the last one, then streams the last table's records through and writes the results to the file system
  • In practice, therefore, join small tables to large tables and put the largest table last; you can also control this with a hint, as in the sketch below
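
Two Hive hints are relevant here: STREAMTABLE overrides which table is streamed, and MAPJOIN asks Hive to load a small table into memory and do a map-side join (tables a and b as in the examples below):

SELECT /*+ STREAMTABLE(a) */ a.val, b.val FROM a JOIN b ON (a.key = b.key); --stream a instead of the last table
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON (a.key = b.key); --load small table b into memory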

Points to note during join query:

--Only equi-joins are supported

SELECT a.* FROM a JOIN b ON (a.id = b.id)

   

SELECT a.* FROM a JOIN b

ON (a.id = b.id AND a.department = b.department)

   

--You can join multiple tables

SELECT a.val, b.val, c.val FROM a JOIN b

ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

   

--If the multiple joined tables use the same join key, the joins are converted into a single map/reduce task

LEFT, RIGHT and FULL OUTER keywords

--Left outer join

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

--Right outer join

SELECT a.val, b.val FROM a RIGHT OUTER JOIN b ON (a.key=b.key)

--Full outer join

SELECT a.val, b.val FROM a FULL OUTER JOIN b ON (a.key=b.key)

LEFT SEMI JOIN keyword

--The restriction of LEFT SEMI JOIN is that the right table may only be referenced in the ON clause to set filter conditions; it cannot appear in the WHERE clause, the SELECT clause, or anywhere else

SELECT a.key, a.value

FROM a

WHERE a.key in

(SELECT b.key

FROM b);

--Can be written as:

SELECT a.key, a.val

FROM a LEFT SEMI JOIN b on (a.key = b.key)

UNION and UNION ALL

--Used to merge the results of multiple SELECTs; the fields of each SELECT must be consistent (same number and compatible types)

select_statement UNION ALL select_statement UNION ALL select_statement ...

--The difference between UNION and UNION ALL

--UNION returns only the distinct rows of the combined queries; duplicate rows are removed

--UNION ALL returns all rows of the combined queries, including duplicates
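
A minimal sketch using the person and invites tables from earlier (each SELECT returns a single STRING column, so the fields are consistent; top-level UNION ALL requires Hive 0.13+, older versions need it wrapped in a subquery):

SELECT name FROM person
UNION ALL
SELECT bar FROM invites;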
