Data warehouse tool hive

1. What's Hive

1. General
Apache Hive data warehouse software provides query and management of large data sets stored in distributed. It is built on Apache Hadoop and mainly provides the following functions:

(1) It provides a series of tools that can be used to extract / transform / load data (ETL);

(2) It is a mechanism that can store, query and analyze large-scale data stored in HDFS (or HBase);

(3) HQL query is completed through MapReduce [select *from]

2. The difference between data warehouse and data hive is to solve the problem of structural data analysis of MR and complete the storage of massive data

2.1. Concept:

Database: the number of structured data storage is small, and it is more inclined to fine data management. When data storage is divided into databases and tables, the concept of transaction is stored

Data warehouse: a structure with a large amount of data. Data storage tends to data management, without the concept of transaction

2.2. Usage: Database: mysql/oracle/sql server standard sql data warehouse: hql dialect sql in hive

2.3. Application scenario: Database: OLTP on line transaction processing insert delete update data warehouse: OLAP: on line analysis processing online analysis transaction processing select does not support update delete in hive, but supports insert

**2.4. Data volume database: * * centralized data volume is relatively small. Data warehouse: distributed data volume is large. 2.5. Delayed database: real-time data warehouse: offline

3. Summary
3.1 hive3 is a data warehouse. Hive is based on Hadoop. The final original data of hive is stored in HDFS. 3.2 provides a way similar to SQL statements to complete MapReduce. The essence of hive is to convert SQL statements into MapReduce tasks
3.3 users unfamiliar with MapReduce can easily use HQL to process and calculate structured data on HDFS

Hive Architecture Principle

Data storage location

Hive is built on Hadoop, and all hive data is stored in HDFS. The database can save the data in a block device or a local file system.

Execution delay

When Hive queries data, it needs to scan the whole table because there is no index, so the delay is high. Another factor causing high Hive execution latency is the MapReduce framework. Because MapReduce itself has high latency, it will also have high latency when using MapReduce to execute Hive queries. In contrast, the execution latency of the database is low. Of course, this low is conditional, that is, the data scale is small. When the data scale is large enough to exceed the processing capacity of the database, Hive's parallel computing can obviously reflect its advantages.

Scalability

Because Hive is built on Hadoop, the scalability of Hive is consistent with that of Hadoop (the world's largest Hadoop cluster is Yahoo!, with a scale of about 4000 nodes in 2009). Due to the strict restriction of ACID semantics, the extended rows in the database are very limited. At present, the theoretical expansion capacity of the most advanced parallel database Oracle is only about 100.

Hive installation

mysql database is used instead of Derby to store metadata to solve the problem of multi-user concurrent access

2.Hive installation address

1. Hive official website address

http://hive.apache.org/

2. Document viewing address

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

3. Download address

http://archive.apache.org/dist/hive/

4. github address

https://github.com/apache/hive

Hive installation deployment

1. Hive installation and configuration

(1) Upload apache-hive-1.2.1-bin.tar.gz to the / opt/software directory of linux

(2) Unzip apache-hive-1.2.1-bin.tar.gz to / opt/module /

[root@hadoop102 software]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz -C
 /opt/module/

(3) Change the name of apache-hive-1.2.1-bin.tar.gz to hive

# 1. Switch to the module directory
[root@hadoop102 software]$ cd ../module

# 2. Modify the name
[root@hadoop102 module]$ mv apache-hive-1.2.1-bin/ hive

(4) Modify the name of hiv-env.sh.template in / opt/module/hive/conf directory to hiv-env.sh

[root@hadoop102 conf]$ mv hive-env.sh.template hive-env.sh

(5) Configure the hive-env.sh file

(a)to configure HADOOP_HOME route

export HADOOP_HOME=/opt/module/hadoop-2.7.2

(b)to configure HIVE_CONF_DIR route

export HIVE_CONF_DIR=/opt/module/hive/conf

2. Hadoop cluster configuration

(1) hdfs and yarn must be started

[root@hadoop102 hadoop-2.7.2]$ start-dfs.sh
[root@hadoop103 hadoop-2.7.2]$ start-yarn.sh

(2) Create / tmp and / user/hive/warehouse directories on HDFS and modify their same group permissions to be writable

[root@hadoop102 hadoop-2.7.2]$ hadoop fs -mkdir /tmp
[root@hadoop102 hadoop-2.7.2]$ hadoop fs -mkdir -p /user/hive/warehouse
[root@hadoop102 hadoop-2.7.2]$ hadoop fs -chmod g+w /tmp
[root@hadoop102 hadoop-2.7.2]$ hadoop fs -chmod g+w /user/hive/warehouse

3. Hive basic operation

(1)start-up hive
 /opt/module/hive
[root@hadoop102 hive]$ bin/hive
(2)view the database
hive> show databases;
(3)Open default database
hive> use default;
(4)display default Table in database
hive> show tables;
(5)Create a table
hive> create table student(id int, name string);
(6)Displays how many tables are in the database
hive> show tables;
(7)View table structure
hive> desc student;
(8)Insert data into a table
hive> insert into student values(1000,"ss");
(9)Data in query table
hive> select * from student;
(10)sign out hive
hive> quit; 
Description: (view) hive stay hdfs (structure in)
Databases: in hdfs As shown in ${hive.metastore.warehouse.dir}Next folder under directory

Table: displays a folder under the db directory in hdfs, where the specific data in the table is stored

Import local files into Hive cases

demand

Import the data in the local / opt / module / hive / data / student.txt directory into the student (ID, int, name string) table of hive.

1. Data preparation

Prepare data in the directory / opt / module / data

(1) Create data in the / opt/module / directory

[root@hadoop102 module]$ mkdir datas

(2) Create a student.txt file in the / opt / module / data / directory and add data

[root@hadoop102 datas]$ touch student.txt
[root@hadoop102 datas]$ vi student.txt

1001	zhangshan
1002	lishi
1003	zhaoliu

Note the tab interval.
2. Hive actual operation

(1)start-up hive
[root@hadoop102 hive]$ bin/hive

(2)show database
hive> show databases;

(3)use default database
hive> use default;

(4)display default Table in database
hive> show tables;

(5)Delete created student surface
hive> drop table student;

(6)establish student surface, And declare the file separator'\t'
hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED
 BY '\t';

(7)load/opt/module/datas/student.txt File to student In the database table.
hive> load data local inpath '/opt/module/hive/datas/student.txt' into table student;

(8)Hive Query results
hive> select * from student;
OK

1001	zhangshan
1002	lishi
1003	zhaoliu

Time taken: 0.266 seconds, Fetched: 3 row(s)

3.Mysql installation

Installation package preparation

1. Check whether MySQL is installed. If so, uninstall mysql

(1) view

[root@hadoop102 desktop]# rpm -qa|grep mysql
		rpm -qa |grep -i mysql 
mysql-libs-5.1.73-5.el6_6.x86_64

(2) unloading

[root@hadoop102 desktop]# rpm -e --nodeps mysql-libs-5.1.73-5.el6_6.x86_64

2. Unzip the mysql-libs.zip file to the current directory

[root@hadoop102 software]# unzip mysql-libs.zip
[root@hadoop102 software]# ls
mysql-libs.zip
mysql-libs

3. Enter the MySQL LIBS folder

[root@hadoop102 mysql-libs]# ll
 Total consumption 76048
-rw-r--r--. 1 root root 18509960 3 June 26, 2015 MySQL-client-5.6.24-1.el6.x86_64.rpm
-rw-r--r--. 1 root root  3575135 12 January 2013 mysql-connector-java-5.1.27.tar.gz
-rw-r--r--. 1 root root 55782196 3 June 26, 2015 MySQL-server-5.6.24-1.el6.x86_64.rpm

Install MySql server

1. Install mysql server

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm

2. View the generated random password

[root@hadoop102 mysql-libs]# cat /root/.mysql_secret
sP2wb_zdwA1o3zAh

3. View mysql status

[root@hadoop102 mysql-libs]# service mysql status

4. Start mysql

[root@hadoop102 mysql-libs]# service mysql start

Install MySql client

1. Install mysql client

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm

2. Link mysql

​    cat /root/.mysql_secret

[root@hadoop102 mysql-libs]# mysql -uroot -pdDpX_qNGzPgXomRd

3. Change password

mysql>SET PASSWORD=PASSWORD('000000');

4. Exit mysql

mysql>exit

Host configuration in user table in MySql

As long as the configuration is root + password, you can log in to MySQL database on any host.

1. Enter mysql

[root@hadoop102 mysql-libs]# mysql -uroot -p000000

2. Display database

mysql>show databases;

3. Use mysql database

mysql>use mysql;

4. Display all tables in mysql database

mysql>show tables;

5. Show the structure of the user table

mysql>desc user;

6. Query user table

mysql>select User, Host, Password from user;

7. Modify the user table and the Host table to%

mysql>update user set host='%' where host='localhost';

8. Delete other host s of the root user

mysql>delete from user where Host='node2';
mysql>delete from user where Host='127.0.0.1';
mysql>delete from user where Host='::1';

9. Refresh

mysql>flush privileges;

10. Exit

mysql>quit;

4. Configure hive metadata to MySql

4.1 driver copy

1. Unzip the mysql-connector-java-5.1.27.tar.gz driver package in / opt / software / MySQL LIBS directory

[root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz

2. Copy mysql-connector-java-5.1.27-bin.jar in / opt / software / MySQL LIBS / mysql-connector-java-5.1.27 directory to / opt/module/hive/lib/

[root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/

4.2 configuring Metastore to MySql

1. Create a hive-site.xml in the / opt/module/hive/conf directory

[root@hadoop102 conf]$ touch hive-site.xml
[root@hadoop102 conf]$ vi hive-site.xml

2. Copy the data to hive-site.xml file according to the official document configuration parameters

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
	  <description>JDBC connect string for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	  <description>Driver class name for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>root</value>
	  <description>username to use against metastore database</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>000000</value>
	  <description>password to use against metastore database</description>
	</property>
</configuration>

3. After configuration, if hive is abnormal, you can restart the virtual machine. (after restarting, don't forget to start the hadoop cluster)

4.3 multi window launch Hive test

1. Start MySQL first

[root@hadoop102 mysql-libs]$ mysql -uroot -p000000

How many databases are there

mysql> show databases;

±-------------------+

| Database |

±-------------------+

| information_schema |

| mysql |

| performance_schema |

| test |

±-------------------+

2. Open multiple windows again and start hive respectively

[root@hadoop102 hive]$ bin/hive

3. After you start hive, go back to the MySQL window to view the database. The metastore database is added

​ mysql> show databases;

±-------------------+

| Database |

±-------------------+

| information_schema |

| metastore |

| mysql |

| performance_schema |

| test |

±-------------------+

5.HiveJDBC access

5.1 start the hiveserver2 service

[root@hadoop102 hive]$ bin/hiveserver2

5.2 start beeline

[root@hadoop102 hive]$ bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline>

5.3 connecting the hiveserver2

beeline> !connect jdbc:hive2://Hadoop 102:10000 (enter)

Connecting to jdbc:hive2://hadoop102:10000

Enter username for jdbc:hive2://Hadoop 102:10000: root (enter)

Enter password for jdbc:hive2://Hadoop 102:10000: (enter directly)

Connected to: Apache Hive (version 1.2.1)

Driver: Hive JDBC (version 1.2.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://hadoop102:10000> show databases;

+----------------+--+

| database_name  |

+----------------+--+

| default     |

+----------------+--+

6 Hive common interactive commands

[root@hadoop102 hive]$ bin/hive -help

1. "- e" does not enter the interactive window of hive to execute sql statements

[root@hadoop102 hive]$ bin/hive -e "select id from student;"

2. "- f" executes sql statements in the script

(1) create hivef.sql file in / opt / module / data directory

[root@hadoop102 datas]$ touch hivef.sql

Write the correct sql statement in the file

	select *from student;

(2) execute the sql statements in the file

[root@hadoop102 hive]$ bin/hive -f /opt/module/datas/hivef.sql

(3) Execute the sql statements in the file and write the results to the file

[root@hadoop102 hive]$ bin/hive -f /opt/module/datas/hivef.sql  > /opt/module/datas/hive_result.txt

7 Hive other command operations

1. Exit hive window:

hive(default)>exit;
hive(default)>quit;

In the new version of hive, there is no difference. In the previous version, there was:

exit: submit data implicitly before exiting;

Quit: quit without submitting data;

2. How to view the hdfs file system in the hive cli command window

hive(default)>dfs -ls /;

3. How to view the local file system in the hive cli command window

hive(default)>! ls /opt/module/datas;

4. View all historical commands entered in hive

(1) enter the root directory / root or / home/root of the current user

(2) view the. Historyfile

[root@hadoop102 ~]$ cat .hivehistory

7.1 Hive data warehouse location configuration

1) the original location of the Default data warehouse is in the: / user/hive/warehouse Path on hdfs.

2) under the warehouse directory, no folder is created for the default database default. If a table belongs to the default database, create a folder directly under the data warehouse directory.

3) modify the original location of the default data warehouse (copy the following configuration information of hive-default.xml.template to hive-site.xml file).

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>

Configure that users in the same group have execution permission

bin/hdfs dfs -chmod g+w /user/hive/warehouse

7.2 information display configuration after query

1) Add the following configuration information to the hive-site.xml file to display the header information of the current database and query table.

<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

2) restart hive and compare the differences before and after configuration.

Display data warehouse

7.3 Hive operation log information configuration

1. Hive's log is stored in / tmp/root/hive.log directory by default (under the current user name)

2. Modify hive's log and save the log to / opt/module/hive/logs

(1) modify the file name of / opt/module/hive/conf/hive-log4j.properties.template to

hive-log4j.properties

[root@hadoop102 conf]$ pwd

/opt/module/hive/conf

[root@hadoop102 conf]$ mv hive-log4j.properties.template hive-log4j.properties

(2) modify the log storage location in hive-log4j.properties file

hive.log.dir=/opt/module/hive/logs

Keywords: Big Data hive SQL

Added by SleepyP on Sat, 16 Oct 2021 08:51:21 +0300