I've been doing Hive-related work these days. Fortunately, I had learned a little beforehand, so I got up to speed quickly. Now that I have some free time, let's systematically review Hive SQL syntax.
Preface
Hive is a data warehouse tool built on top of Hadoop. It is used to process structured data stored in Hadoop, operating on that data through SQL.
Hive query execution strictly follows the Hadoop MapReduce job execution model: Hive converts the user's Hive SQL statements into MapReduce jobs through an interpreter and submits them to the Hadoop cluster; Hadoop monitors the job execution process and returns the execution results to the user. Hive is not designed for online transaction processing; it provides neither real-time queries nor row-level data updates. Hive is best suited for batch jobs over large data sets.
Common SQL syntax
(clauses enclosed in "[]" are optional; they may be written or omitted)
Create database
CREATE DATABASE name;
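A slightly fuller sketch, assuming a hypothetical database name my_db (COMMENT and LOCATION are optional):
CREATE DATABASE IF NOT EXISTS my_db
COMMENT 'a demo database'
LOCATION '/user/hive/warehouse/my_db.db';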
- Display and view commands
show tables;                             --Display all tables
show databases;                          --Display all databases
show partitions table_name;             --Display all partitions of the table named table_name
show functions;                          --Display all functions
describe extended table_name col_name;  --View a field of a table
DDL (Data Definition Language)
Data definition language
- Create table structure
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
- CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the user can ignore the exception with the IF NOT EXISTS option
- The EXTERNAL keyword lets users create an external table and specify a path (LOCATION) to the actual data while creating the table
- LIKE allows users to copy an existing table structure without copying its data
- COMMENT can add descriptions to tables and fields
- ROW FORMAT sets how row data is delimited and serialized
DELIMITED [FIELDS TERMINATED BY char]
          [COLLECTION ITEMS TERMINATED BY char]
          [MAP KEYS TERMINATED BY char]
          [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
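For illustration, a sketch that combines these delimiters, assuming a hypothetical emp table with an array and a map column:
CREATE TABLE emp(
  name STRING,
  skills ARRAY<STRING>,
  props MAP<STRING, STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;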
- STORED AS
SEQUENCEFILE | TEXTFILE | RCFILE | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
If the file data is plain text, you can use STORED AS TEXTFILE.
If the data needs to be compressed, use STORED AS SEQUENCEFILE.
To create a simple table:
CREATE TABLE person(name STRING, age INT);
To create an external table:
CREATE EXTERNAL TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'Write the description of the table here'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
Create partition table:
CREATE TABLE par_table(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, pos STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
Create bucket table:
CREATE TABLE bucket_table(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, pos STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
To create a table with indexed fields:
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (dindex STRING);
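Older Hive releases also offered separate index DDL (it was removed in Hive 3); a sketch of that form over the invites table, with a hypothetical index name idx_bar:
CREATE INDEX idx_bar ON TABLE invites (bar)
AS 'COMPACT' WITH DEFERRED REBUILD;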
Copy an empty table:
CREATE TABLE empty_key_value_store LIKE key_value_store;
Show all tables:
SHOW TABLES;
List tables matching a regular expression:
SHOW TABLES '.*s';
Add a field to the table:
ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a field and comment on it:
ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Delete columns (by replacing the full column list):
ALTER TABLE test REPLACE COLUMNS(id BIGINT, name STRING);
Change table name:
ALTER TABLE events RENAME TO new_events;
Adding and deleting partitions:
--Add:
ALTER TABLE table_name ADD [IF NOT EXISTS]
  partition_spec [ LOCATION 'location1' ]
  partition_spec [ LOCATION 'location2' ] ...
--where:
partition_spec: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
--Delete:
ALTER TABLE table_name DROP partition_spec, partition_spec, ...
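A concrete sketch, assuming a hypothetical logs table partitioned by ds:
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (ds='2008-08-15') LOCATION '/data/logs/2008-08-15';
ALTER TABLE logs DROP PARTITION (ds='2008-08-15');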
Change the file format and organization of the table:
ALTER TABLE table_name SET FILEFORMAT file_format
ALTER TABLE table_name CLUSTERED BY(userid) SORTED BY(viewTime) INTO num_buckets BUCKETS
--These commands modify the physical storage properties of the table
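For example, on a hypothetical logs table (a sketch only; both commands change table metadata and do not rewrite data already stored):
ALTER TABLE logs SET FILEFORMAT SEQUENCEFILE;
ALTER TABLE logs CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS;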
Create and delete Views:
--Create view:
CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
--Delete view:
DROP VIEW view_name;
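A minimal sketch, assuming a hypothetical users table:
CREATE VIEW IF NOT EXISTS active_users
COMMENT 'users that are currently active'
AS SELECT id, name FROM users WHERE active = 1;
DROP VIEW active_users;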
DML (Data Manipulation Language)
Data manipulation language covers the three modifying operations on a database: adding, deleting, and changing data. DML includes INSERT, UPDATE, and DELETE.
To load a file into a data table:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
  INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
--The load operation is just a copy/move that places the data file at the location corresponding to the Hive table.
--Load local data:
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
--Load HDFS data, with partition information:
LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Insert query results into Hive table:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;
--Multi-insert mode:
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...
--Auto (dynamic) partition mode:
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
  select_statement FROM from_statement;
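Auto partition mode usually requires dynamic partitioning to be switched on first; a sketch with hypothetical staging_sales and sales tables (the partition column ds must come last in the SELECT list):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (ds)
SELECT s.item, s.amount, s.ds FROM staging_sales s;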
Write query results into the file system (HDFS, or a local directory with LOCAL):
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]
INSERT INTO
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
The difference between insert overwrite and insert into:
insert overwrite replaces the existing data: if the target table (or partition) already contains data, that data is removed first and the new data is written in its place.
insert into is a simple append: it leaves the existing data untouched and adds the new rows directly, so the table ends up holding both the original data and the newly inserted data.
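A small illustration of the difference, assuming a hypothetical single-column table t that already holds the row 'a', and any non-empty table src:
INSERT INTO TABLE t SELECT 'b' FROM src LIMIT 1;       --append: t now holds 'a' and 'b'
INSERT OVERWRITE TABLE t SELECT 'c' FROM src LIMIT 1;  --overwrite: t now holds only 'c'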
DQL (Data Query Language)
Data query language: the SELECT operation.
SELECT query structure:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
- The ALL and DISTINCT options control how duplicate records are handled. The default is ALL, which returns all records; DISTINCT removes duplicates
- The WHERE condition works like the WHERE condition in traditional SQL
- ORDER BY performs a global sort, so it runs with only one reduce task
- SORT BY sorts only within each reducer (see the sketch after this list)
- LIMIT restricts the number of rows output (and, with an offset, the starting position of the output)
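Putting the clauses together, a sketch over a hypothetical logs table, contrasting global and per-reducer sorting:
--Aggregate, filter the groups, and cap the output
SELECT userid, count(1) AS hits
FROM logs
WHERE ds = '2008-08-15'
GROUP BY userid
HAVING count(1) > 10
LIMIT 100;
--ORDER BY sorts globally through a single reducer; DISTRIBUTE BY + SORT BY sorts within each reducer
SELECT * FROM logs ORDER BY viewTime;
SELECT * FROM logs DISTRIBUTE BY userid SORT BY viewTime;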
Output query data to directory:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
Output query results to local directory:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Insert the results of one table into another:
FROM invites a
INSERT OVERWRITE TABLE events
SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;

INSERT OVERWRITE TABLE events
SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

--JOIN:
FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar)
INSERT OVERWRITE TABLE events
SELECT t1.bar, t1.foo, t2.foo;
Insert data from one table into multiple tables:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Hive supports only equi-joins, outer joins, and left semi joins. It does not support non-equi joins, because they are very difficult to express as map/reduce jobs.
- The LEFT, RIGHT and FULL OUTER keywords handle rows that have no match on the other side of the join (the missing side is filled with NULLs)
- LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery
- During a join, each map/reduce task works as follows: the reducer caches the records of every table in the join sequence except the last one, then streams the last table through and serializes the results to the file system
- In practice, therefore, join small tables to large tables, listing the large table last; when one table is small enough, a map-side join helps further (see the sketch below)
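When one side of the join fits in memory, a map-side join hint avoids the reduce phase entirely; a sketch with hypothetical tables small_t and big_t:
SELECT /*+ MAPJOIN(s) */ s.val, b.val
FROM small_t s JOIN big_t b ON (s.key = b.key);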
Points to note during join query:
--Only equi-joins are supported:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
--More than two tables can be joined:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
--If every join uses the same join key, the joins are converted into a single map/reduce task
LEFT, RIGHT and FULL OUTER keywords
--Left outer join:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
--Right outer join:
SELECT a.val, b.val FROM a RIGHT OUTER JOIN b ON (a.key=b.key)
--Full outer join:
SELECT a.val, b.val FROM a FULL OUTER JOIN b ON (a.key=b.key)
LEFT SEMI JOIN keyword
--The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced
--in the ON clause, not in the WHERE clause, the SELECT clause, or anywhere else.
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
--Can be rewritten as:
SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key)
UNION and UNION ALL
--Used to merge the results of multiple SELECT statements; the columns of each SELECT must match
select_statement UNION ALL select_statement UNION ALL select_statement ...
--Difference between UNION and UNION ALL:
--UNION returns only the distinct rows; duplicated rows appear once
--UNION ALL returns all rows from every query, including duplicates
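A concrete sketch, assuming hypothetical tables t1 and t2 that each have a name column (older Hive versions required the UNION ALL to sit inside a subquery, as here):
SELECT u.name FROM (
  SELECT t1.name FROM t1
  UNION ALL
  SELECT t2.name FROM t2
) u;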