Tutorial address: http://www.showmeai.tech/tutorials/84
Article address: http://www.showmeai.tech/article-detail/171
1. Hive setup and configuration
Hive is one of the most important tool platforms in the big data ecosystem and a key component of offline computing. It is widely used for data warehouse construction, and in most companies big data statistics and reports are implemented through SQL on top of it. Let's walk through how to set up, configure, and use Hive.
1) Download Hive
Please refer to the installation procedure in the official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted.
As the documentation recommends, download the latest release from http://www.apache.org/dyn/closer.cgi/hive/; here we take Hive 3.1.3 as an example.
Extract the installation package onto a machine that already has the Hadoop environment installed:
```
root@ubuntu:~/bigdata# ll
total 20
drwxr-xr-x 27 root root 4096 Sep 30 07:24 azkaban/
drwxr-xr-x  2 root root 4096 Oct 13 08:36 demo/
drwxr-xr-x 12 1001 1001 4096 Sep 30 09:43 hadoop-3.3.0/
drwxr-xr-x 11 root root 4096 Oct 13 07:58 hive-3.1.3/
drwxr-xr-x 32 work work 4096 Aug 28 07:28 spark-3.0.1/
root@ubuntu:~/bigdata# pwd
/root/bigdata
```
Next, export Hive's bin directory to PATH:
```
root@ubuntu:~/bigdata# cat /etc/profile.d/java.sh
export PATH=/usr/local/jdk/bin:/root/bigdata/hadoop-3.3.0/bin:/root/bigdata/spark-3.0.1/bin:/root/bigdata/hive-3.1.3/bin:${PATH}
```
Upgrade guava to the version shipped with Hadoop:
```shell
mv lib/guava-19.0.jar lib/guava-19.0.jar.bk
ln -s /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib/guava-27.0-jre.jar /root/bigdata/hive-3.1.3/lib/guava-27.0-jre.jar
```
The first command backs up Hive's bundled guava jar, and the second links in the newer version that ships with Hadoop. Keeping this version consistent between Hive and Hadoop is one of the keys to Hive running properly.
2) Install MySQL
Hive's metastore service is deployed independently and stores its data in MySQL. In an Ubuntu environment you can install MySQL with apt:
```shell
apt-get install mysql-server
```
If you get an access denied error when logging in with mysql -h localhost -u root -p, you need to find and delete a special rule for localhost in MySQL's user table:
```sql
use mysql;
delete from user where User='root' and Host='localhost';
FLUSH PRIVILEGES;
```
Then create the hive database:
```sql
create database hive;
```
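Optionally, rather than letting Hive connect as root (as the sample config below does), you can create a dedicated metastore account. A minimal sketch, where the user name hive and the password are placeholders of our own choosing, not from the original setup; whatever you pick must match javax.jdo.option.ConnectionUserName and ConnectionPassword in hive-site.xml:
```sql
-- placeholder account name and password; adjust to your own security policy
CREATE USER 'hive'@'%' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;
```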
Then download the MySQL JDBC driver with wget and put the jar in Hive's lib directory (the Hive metastore service uses this JDBC driver to connect to MySQL and read/write metadata):
```shell
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.21/mysql-connector-java-8.0.21.jar
mv mysql-connector-java-8.0.21.jar lib
```
3) Configure Hive
Hive automatically loads the conf/hive-site.xml configuration file. The official release provides a template at conf/hive-default.xml.template, which documents the default values Hive falls back on for anything not set in hive-site.xml. Refer to the template to fill in hive-site.xml; below is a configured sample (only the necessary items are set):
```
root@ubuntu:~/bigdata/hive-3.1.3# cat conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxxx</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
```
The core items are explained as follows:
- hive.cli.print.header: whether the Hive command line tool prints column headers with query results, which makes them easier to read
- javax.jdo.option.ConnectionURL: the JDBC URL of the metadata database; here it points to the hive database in MySQL
- javax.jdo.option.ConnectionDriverName: the JDBC driver class name; note that the class name changed in the 8.x MySQL connector jars (now com.mysql.cj.jdbc.Driver)
- javax.jdo.option.ConnectionUserName: the MySQL user name
- javax.jdo.option.ConnectionPassword: the MySQL password
- hive.metastore.uris: the Thrift address the metastore service listens on, which clients use to connect
4) Start the metastore service
First, initialize the metadata tables in MySQL with Hive's schematool:
```shell
bin/schematool -dbType mysql -initSchema
```
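If the initialization succeeds, the metastore tables now exist in the hive database. A quick sanity check from the MySQL client (the VERSION table is part of the standard metastore schema, though the exact table set varies a little across Hive versions):
```sql
use hive;
show tables;                          -- expect metastore tables such as DBS, TBLS, ...
select SCHEMA_VERSION from VERSION;   -- the schema version recorded by schematool
```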
Then start the metastore service:
```shell
nohup hive --service metastore &
```
The service listens on localhost:9083. In a production environment the listening host must be reachable from other servers, because the clients accessing the metastore service are not necessarily on the same machine.
Now the hive command line tool will automatically connect to the metastore service according to hive-site.xml. Run hive to test:
```
root@ubuntu:~/bigdata/hive-3.1.3# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-3.1.3/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = f0d4bf60-d85f-456a-98fb-e904d50f5242

Logging initialized using configuration in jar:file:/root/bigdata/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive Session ID = 959e0cda-f8eb-4fc1-b798-cb5175e735d2
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
database_name
default
Time taken: 0.516 seconds, Fetched: 2 row(s)
```
After entering the hive CLI you can see the default database, and Hive warns that it is using MR as its compute engine. Hive now recommends Spark or Tez as the underlying SQL execution engine, and MR support will be removed entirely in a future release.
For now, you can keep using MR as the execution engine. Hive locates the Hadoop and YARN configuration files automatically through the hadoop command, and SQL statements are ultimately executed on YARN as MR jobs.
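The engine is controlled by the hive.execution.engine property, which can be inspected and overridden per session; a small sketch (tez and spark only work if those engines are deployed and configured separately):
```sql
SET hive.execution.engine;         -- print the current engine
SET hive.execution.engine=mr;      -- stay on MapReduce explicitly
-- SET hive.execution.engine=tez;  -- switch once Tez is installed and configured
```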
That completes the Hive setup process, and you can now happily start using Hive.
2. Hive application case
This case study analyzes a video website's data along various metrics to provide decision support for site managers.
1) Requirement description
Compute overall metrics for a YouTube video dataset, along with various TopN rankings:
- Top 10 videos by view count
- Top 10 video categories by popularity
- Categories of the Top 20 most-viewed videos, and how many Top 20 videos each category contains
- Categories of the videos related to the Top 50 most-viewed videos, ranked
- Top 10 most popular videos within each category
- Top 10 videos by traffic within each category
- Top 10 users by number of uploaded videos, and the videos they uploaded
- Top 10 most-viewed videos within each category
2) Project table fields
Video table
Field | Remarks | Detailed description |
---|---|---|
videoId | Unique video id | 11-character string |
uploader | Video uploader | User name of the user who uploaded the video, string |
age | Video age | Number of days the video has been on the platform, integer |
category | Video categories | Categories the video was assigned to when uploaded |
length | Video length | Video length, integer |
views | View count | Number of times the video has been viewed |
rate | Video rating | Rating out of 5 points |
ratings | Traffic | Video traffic, integer |
comments | Number of comments | Number of comments on the video, integer |
relatedId | Related video ids | Ids of related videos, up to 20 |
User table
Field | Remarks | Field type |
---|---|---|
uploader | Uploader user name | string |
videos | Number of videos uploaded | int |
friends | Number of friends | int |
ETL the raw data
Looking at the raw data, we can see that a video can belong to multiple categories, separated by the & symbol with whitespace on either side of the separator; likewise, a video can have multiple related videos, separated by \t (tab).
To make these multi-valued fields easier to work with during analysis, we first reorganize and clean the data: the categories are split on &, with the surrounding spaces removed, and the multiple related video ids are re-joined with & as well.
The cleaning step needs to do three things (a HiveQL sketch follows this list):
- Drop rows with fewer than 9 fields
- Remove the spaces around & in the video category field
- Change the separator between related video ids from \t to &
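The original project performs this cleaning as a standalone ETL job before any data is loaded. Purely as an illustration, here is a minimal HiveQL sketch of the same three steps, assuming the raw file is first loaded line-by-line into a single-column staging table; youtubevideo_raw and the /input path are placeholder names, not part of the original project:
```sql
-- hypothetical staging table: one raw line per row
create table youtubevideo_raw (raw_line string) stored as textfile;
load data inpath "/input/youtube_video_raw" into table youtubevideo_raw;

-- clean and rewrite the data to the directory the load step below expects
insert overwrite directory "/output/youtube_video"
select concat_ws("\t",
    split(raw_line, "\t")[0],                                    -- videoId
    split(raw_line, "\t")[1],                                    -- uploader
    split(raw_line, "\t")[2],                                    -- age
    regexp_replace(split(raw_line, "\t")[3], "\\s*&\\s*", "&"),  -- category: strip spaces around '&'
    split(raw_line, "\t")[4],                                    -- length
    split(raw_line, "\t")[5],                                    -- views
    split(raw_line, "\t")[6],                                    -- rate
    split(raw_line, "\t")[7],                                    -- ratings
    split(raw_line, "\t")[8],                                    -- comments
    -- related ids: everything after the ninth field, re-joined with '&'
    regexp_replace(regexp_extract(raw_line, "^(?:[^\t]*\t){9}(.*)$", 1), "\t", "&"))
from youtubevideo_raw
where size(split(raw_line, "\t")) >= 9;   -- drop rows with fewer than 9 fields
```
After this runs, /output/youtube_video holds the cleaned, tab-and-& separated data that the load statements in the next step expect.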
3) Preparatory work
(1) Create Hive tables
Create the original text-format tables: youtubevideo_ori, youtubevideo_user_ori
Create the ORC tables: youtubevideo_orc, youtubevideo_user_orc
```sql
-- create the youtubevideo_ori table
create table youtubevideo_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- create the youtubevideo_user_ori table
create table youtubevideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as textfile;

-- create the youtubevideo_orc table
create table youtubevideo_orc(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;

-- create the youtubevideo_user_orc table
create table youtubevideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as orc;
```
(2) Load the ETL output
youtubevideo_ori:
load data inpath "/output/youtube_video" into table youtubevideo_ori;
youtubevideo_user_ori:
load data inpath "/youtube_video/user" into table youtubevideo_user_ori;
(3) Insert data into ORC table
youtubevideo_orc:
```sql
insert overwrite table youtubevideo_orc select * from youtubevideo_ori;
```
youtubevideo_user_orc:
```sql
insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;
```
3. Business analysis
1) Top 10 videos by view count
Idea: use ORDER BY to sort globally on the views field, and limit the output to the first 10 rows.
```sql
-- Mode 1
SELECT videoid, uploader, age, category, length, views, rate, ratings, comments
FROM   youtubevideo_orc
ORDER  BY views DESC
LIMIT  10;

-- Mode 2
SELECT *
FROM   (SELECT videoid, age, category, length, views,
               Row_number() OVER(ORDER BY views DESC) AS rn
        FROM   youtubevideo_orc) t
WHERE  t.rn <= 10;
```
2) Top 10 video categories by popularity
Idea: count the number of videos in each category and display the top 10 categories with the most videos.
- ① Aggregate with GROUP BY on category, then count the videoIds in each group.
- ② Because in the current table one video maps to one or more categories, the category column must first be exploded (column to rows) before counting.
- ③ Finally, display the top 10 in descending order of popularity.
```sql
SELECT   category_name AS category,
         Count(t1.videoid) AS hot
FROM     (SELECT videoid, category_name
          FROM   youtubevideo_orc
                 lateral VIEW explode(category) t_category AS category_name) t1
GROUP BY t1.category_name
ORDER BY hot DESC
LIMIT    10;
```
3) Count the categories of the 20 most-viewed videos and the number of Top 20 videos in each category
Idea:
- ① First find all the information of the 20 most-viewed videos, in descending order of views
- ② Split (explode) the category field of these 20 rows
- ③ Finally, group by category name and count how many of the Top 20 videos fall under each category
```sql
SELECT   category_name AS category,
         Count(t2.videoid) AS hot_with_views
FROM     (SELECT videoid, category_name
          FROM   (SELECT *
                  FROM   youtubevideo_orc
                  ORDER  BY views DESC
                  LIMIT  20) t1
                 lateral VIEW explode(category) t_category AS category_name) t2
GROUP BY category_name
ORDER BY hot_with_views DESC;
```
4) Top 10 most popular videos in each category, taking Music as an example
Idea:
- ① To get the Top 10 videos in the Music category we first need to find the Music category, which requires exploding the category field; so create a table that stores the category-exploded data.
- ② Insert data into this category-exploded table.
- ③ Count video popularity within the target category (Music).
```sql
-- create the category-exploded table:
CREATE TABLE youtubevideo_category (
    videoid    STRING,
    uploader   STRING,
    age        INT,
    categoryid STRING,
    length     INT,
    views      INT,
    rate       FLOAT,
    ratings    INT,
    comments   INT,
    relatedid  ARRAY<string>)
row format delimited fields terminated BY "\t"
collection items terminated BY "&"
stored AS orc;

-- insert data into the category table:
INSERT INTO table youtubevideo_category
SELECT videoid, uploader, age, categoryid, length, views,
       rate, ratings, comments, relatedid
FROM   youtubevideo_orc
       lateral VIEW explode(category) t_category AS categoryid;

-- Top 10 in the Music category (other categories work the same way)
SELECT videoid, views
FROM   youtubevideo_category
WHERE  categoryid = "Music"
ORDER  BY views DESC
LIMIT  10;

-- Mode 2 (note: category names are case-sensitive, so match "Music" exactly)
SELECT *
FROM   (SELECT videoid, age, categoryid, length, views,
               Row_number() OVER(ORDER BY views DESC) AS rn
        FROM   youtubevideo_category
        WHERE  categoryid = "Music") t
WHERE  t.rn <= 10;
```
5) Top 10 videos by traffic in each category
Idea:
- ① Use the category-exploded table youtubevideo_category created above (the categoryId column is already split out)
- ② Rank by ratings within each category and take the top 10
```sql
SELECT *
FROM   (SELECT videoid, age, categoryid, length, ratings,
               Row_number() OVER(partition BY categoryid ORDER BY ratings DESC) AS rn
        FROM   youtubevideo_category) t
WHERE  t.rn <= 10;
```
6) Top 10 users by number of uploaded videos, and their Top 20 most-viewed videos
Idea:
- ① First find the user information of the top 10 users by number of uploaded videos
- ② Then join with the youtubevideo_orc table on the uploader field, and sort the result by views
```sql
-- step 1:
SELECT *
FROM   youtubevideo_user_orc
ORDER  BY videos DESC
LIMIT  10;

-- step 2:
SELECT t2.videoid, t2.uploader, t2.views, t2.ratings, t1.videos, t1.friends
FROM   (SELECT *
        FROM   youtubevideo_user_orc
        ORDER  BY videos DESC
        LIMIT  10) t1
       JOIN youtubevideo_orc t2
         ON t1.uploader = t2.uploader
ORDER  BY views DESC
LIMIT  20;
```