Illustrated Big Data | Hands-On Case: Hive Setup and an Application Case

Author: [Han Xinzi](https://github.com/HanXinzi-AI) @ [ShowMeAI](http://www.showmeai.tech/)
Tutorial address: http://www.showmeai.tech/tutorials/84
Article address: http://www.showmeai.tech/article-detail/171
Notice: All rights reserved. For reprints, please contact the platform and the author, and indicate the source.

1. Hive setup and configuration

Hive is one of the most important tool platforms in the big data ecosystem and a key component of offline computing. It is often used to build data warehouses; in companies, big data statistics and reports are typically produced through SQL on Hive. Let's walk through how to set up, configure, and use Hive.

1) Download Hive

For the installation procedure, please refer to the official documentation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted.

Following the document's recommendation, download the latest release from http://www.apache.org/dyn/closer.cgi/hive/; here we take Hive 3.1.3 as an example.

Extract the installation package on the machine where the Hadoop environment is installed:

root@ubuntu:~/bigdata# ll
total 20
drwxr-xr-x 27 root root 4096 Sep 30 07:24 azkaban/
drwxr-xr-x  2 root root 4096 Oct 13 08:36 demo/
drwxr-xr-x 12 1001 1001 4096 Sep 30 09:43 hadoop-3.3.0/
drwxr-xr-x 11 root root 4096 Oct 13 07:58 hive-3.1.3/
drwxr-xr-x 32 work work 4096 Aug 28 07:28 spark-3.0.1/
root@ubuntu:~/bigdata# pwd
/root/bigdata

Next, export Hive's bin directory to PATH:

root@ubuntu:~/bigdata# cat /etc/profile.d/java.sh 
export PATH=/usr/local/jdk/bin:/root/bigdata/hadoop-3.3.0/bin:/root/bigdata/spark-3.0.1/bin:/root/bigdata/hive-3.1.3/bin:${PATH}

Replace Hive's guava jar with Hadoop's version:

mv lib/guava-19.0.jar lib/guava-19.0.jar.bk 
ln -s  /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib/guava-27.0-jre.jar  /root/bigdata/hive-3.1.3/lib/guava-27.0-jre.jar

The first command above backs up Hive's bundled guava jar, and the second links in Hadoop's newer version. Keeping this version consistent between Hive and Hadoop is one of the keys to Hive running correctly.

2) Install MySQL

Hive's metastore service is deployed independently and stores its data in MySQL. In an Ubuntu environment, you can install MySQL server with apt:

apt-get install mysql-server

If logging in with mysql -h localhost -u root -p reports access denied, you need to find and delete the special rule for localhost in MySQL's user table:

delete from user where User='root' and Host='localhost';
FLUSH PRIVILEGES;

Then create the hive database:

create database hive;

Then download the MySQL JDBC driver with wget and put the jar into Hive's lib directory (the Hive metastore service uses this JDBC driver to connect to MySQL and read/write metadata):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.21/mysql-connector-java-8.0.21.jar
mv mysql-connector-java-8.0.21.jar lib

3) Configure Hive

Hive automatically loads the conf/hive-site.xml configuration file. The distribution officially provides a template at conf/hive-default.xml.template, which documents the default values Hive falls back on for anything not set in hive-site.xml. Using that template as a reference, fill in conf/hive-site.xml; the following is a configured sample (only the necessary items are set):

root@ubuntu:~/bigdata/hive-3.1.3# cat conf/hive-site.xml 
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
                <description>Whether to print the names of the columns in query output.</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost:3306/hive</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>xxxxx</value>
        </property>
        <property>
                <name>hive.metastore.uris</name>
                <value>thrift://localhost:9083</value>
                <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
        </property>
</configuration>

The core items are explained as follows:

  • hive.cli.print.header: the Hive command-line tool will print table headers, making results easier to read
  • javax.jdo.option.ConnectionURL: the JDBC URL of the metadata database; here it points to the hive database in MySQL
  • javax.jdo.option.ConnectionDriverName: the JDBC driver class name; note that it changed in the MySQL connector 8.x jar (now com.mysql.cj.jdbc.Driver)
  • javax.jdo.option.ConnectionUserName: the MySQL user name
  • javax.jdo.option.ConnectionPassword: the MySQL password
  • hive.metastore.uris: the address the metastore service listens on

4) Start the metastore service

First run Hive's schematool to initialize the metadata tables in MySQL:

bin/schematool -dbType mysql -initSchema

Then start the metastore service in the background:

nohup hive --service metastore &

The service listens on localhost:9083. In production, the host must be reachable from other servers, because the clients of the metastore service are not necessarily on the same machine.

Now the hive command-line tool will automatically connect to the metastore service according to hive-site.xml. Run the hive command to test:

root@ubuntu:~/bigdata/hive-3.1.3# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-3.1.3/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = f0d4bf60-d85f-456a-98fb-e904d50f5242

Logging initialized using configuration in jar:file:/root/bigdata/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive Session ID = 959e0cda-f8eb-4fc1-b798-cb5175e735d2
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> show databases;
OK
database_name
default
Time taken: 0.516 seconds, Fetched: 2 row(s)

After entering the Hive CLI, you can see the default database; Hive also warns that MR is being used as the execution engine. Hive recommends switching to Spark or Tez as the underlying SQL execution engine, and MR support will be removed entirely in the future.

For now, you can continue to use MR as the execution engine. Hive automatically locates the Hadoop and YARN configuration files via the hadoop command, and the SQL is ultimately executed as MR jobs on YARN.
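If Tez or Spark has been installed and configured, the execution engine can be switched per session through the hive.execution.engine setting (a sketch only; setting the property does not install the engine itself, which must be set up separately):

```sql
-- switch the engine for the current session (supported values: mr, tez, spark)
set hive.execution.engine=tez;
-- print the current value to verify
set hive.execution.engine;
```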

That completes the Hive setup process; you can now happily start using Hive.

2. Hive application case

This case computes various metrics over a video website's data to provide decision support for managers.

1) Requirement description

Compute common metrics and various Top-N statistics for the YouTube video data:

  • Top 10 videos by view count
  • Top 10 video categories by heat
  • Categories of the Top 20 most-viewed videos
  • Category ranking of the videos related to the Top 50 most-viewed videos
  • Top 10 most popular videos in each category
  • Top 10 videos by traffic in each category
  • Top 10 users by number of uploaded videos, and the videos they uploaded
  • Top 10 most-viewed videos in each category

2) Project table fields

Video table

field     | remarks           | description
videoId   | unique video id   | 11-character string
uploader  | video uploader    | username of the user who uploaded the video (string)
age       | video age         | number of days the video has been on the platform (integer)
category  | video category    | the category assigned when the video was uploaded
length    | video length      | length of the video (integer)
views     | view count        | number of times the video has been viewed (integer)
rate      | video rating      | rating out of a full score of 5
ratings   | traffic           | video traffic (integer)
comments  | comment count     | number of comments on the video (integer)
relatedId | related video ids | ids of related videos, up to 20

User table

field    | remarks                   | type
uploader | uploader user name        | string
videos   | number of uploaded videos | int
friends  | number of friends         | int

ETL raw data

Looking at the raw data, a video can belong to multiple categories: the categories are separated by the & symbol, with whitespace on both sides of each separator. Likewise, a video can have multiple related videos, which are separated by \t.

To make these multi-valued fields easier to work with during analysis, the data is first reorganized and cleaned: all categories are joined with & (with the surrounding spaces removed), and the multiple related video ids are also joined with &.

The core ETL needs to do three things:

  • Drop records with fewer than 9 fields
  • Strip the spaces inside the video category field
  • Change the separator between related video ids from \t to &
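The three cleaning rules above can be sketched in Python (a minimal sketch under the field layout described earlier; the original project may implement this step as a MapReduce job, and the function name etl_line is our own):

```python
def etl_line(line):
    """Clean one raw video record; return None if the record is invalid."""
    fields = line.strip("\n").split("\t")
    # Rule 1: drop records with fewer than 9 fields
    if len(fields) < 9:
        return None
    # Rule 2: categories are '&'-separated with spaces around each separator;
    # strip the spaces so the field becomes e.g. "Music&Film"
    fields[3] = "&".join(part.strip() for part in fields[3].split("&"))
    # Rule 3: related video ids occupy field 10 onwards and were tab-separated;
    # re-join them with '&' so every record has at most 10 columns
    cleaned = fields[:9]
    related = "&".join(fields[9:])
    if related:
        cleaned.append(related)
    return "\t".join(cleaned)
```

The cleaned output then matches the table definitions below, where fields are terminated by \t and collection items by &.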

3) Preparatory work

(1) Create Hive table

Create the source tables: youtubevideo_ori, youtubevideo_user_ori
Create the ORC tables: youtubevideo_orc, youtubevideo_user_orc

-- create the youtubevideo_ori table:
create table youtubevideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- create the youtubevideo_user_ori table:
create table youtubevideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as textfile;

-- create the youtubevideo_orc table:
create table youtubevideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

-- create the youtubevideo_user_orc table:
create table youtubevideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as orc;

(2) Import the ETL-processed data

youtubevideo_ori:

load data inpath "/output/youtube_video" into table youtubevideo_ori;

youtubevideo_user_ori:

load data inpath "/youtube_video/user" into table youtubevideo_user_ori;

(3) Insert data into the ORC tables

youtubevideo_orc:

insert overwrite table youtubevideo_orc select * from youtubevideo_ori;

youtubevideo_user_orc:

insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;

3. Business analysis

1) Top 10 videos by view count

Idea: use ORDER BY to do a global sort on the views field, and show only the first 10 rows.

SELECT   videoid,
         uploader,
         age,
         category,
         length,
         views,
         rate,
         ratings,
         comments
FROM     youtubevideo_orc
ORDER BY views DESC limit 10;

-- Method 2: using a window function
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         category,
                         length,
                         views,
                         Row_number() OVER( ORDER BY views DESC) AS rn
                FROM     youtubevideo_orc )t
WHERE  t.rn <= 10;

2) Top 10 video categories by heat

Idea: count the number of videos in each category and display the top 10 categories with the most videos.

  • ① Aggregate with GROUP BY on category, then count the videoIds in each group.
  • ② Because one video maps to one or more categories, you must first explode the category column (turn the array into rows) before grouping.
  • ③ Finally, sort by heat in descending order and show the top 10.

SELECT   category_name     AS category,
         Count(t1.videoid) AS hot
FROM     (
                SELECT videoid,
                       category_name
                FROM   youtubevideo_orc lateral view explode(category) t_category as category_name) t1
GROUP BY t1.category_name
ORDER BY hot DESC limit 10;

3) Categories of the 20 most-viewed videos, and how many Top 20 videos each category contains

Idea:

  • ① First get the full records of the 20 most-viewed videos, in descending order of views
  • ② Explode the category field of these 20 records (column to rows)
  • ③ Finally, group by category name and count how many Top 20 videos fall under each category

SELECT   category_name     AS category,
         Count(t2.videoid) AS hot_with_views
FROM     (
                SELECT videoid,
                       category_name
                FROM   (
                                SELECT   *
                                FROM     youtubevideo_orc
                                ORDER BY views DESC limit 20) t1 lateral VIEW explode(category) t_category as category_name) t2
GROUP BY category_name
ORDER BY hot_with_views DESC;

4) Top 10 videos by popularity in each category, taking Music as an example

Idea:

  • ① To get the Top 10 by popularity within the Music category, you first need to filter by category, so explode the category array and store the result in a new table with a categoryId column.
  • ② Insert the exploded data into that table.
  • ③ Count video heat within the target category (Music).

-- create the category-expanded table:
CREATE TABLE youtubevideo_category
             (
                          videoid STRING,
                          uploader STRING,
                          age INT,
                          categoryid STRING,
                          length INT,
                          views  INT,
                          rate FLOAT,
                          ratings  INT,
                          comments INT,
                          relatedid ARRAY<string>
             )
             row format delimited fields terminated BY "\t" collection items terminated BY "&" stored AS orc;

-- insert data into the category-expanded table:
INSERT INTO table youtubevideo_category
SELECT videoid,
       uploader,
       age,
       categoryid,
       length,
       views,
       rate,
       ratings,
       comments,
       relatedid
FROM   youtubevideo_orc lateral view explode(category) t_category AS categoryid;

-- Top 10 in the Music category (other categories can be computed the same way)
SELECT   videoid,
         views
FROM     youtubevideo_category
WHERE    categoryid = "Music"
ORDER BY views DESC limit 10;

-- Method 2: using a window function
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         categoryid,
                         length,
                         views,
                         Row_number() OVER( ORDER BY views DESC) AS rn
                FROM     youtubevideo_category
                WHERE    categoryid = "Music" )t
WHERE  t.rn <= 10;

5) Top 10 videos by traffic in each category

Idea:

  • ① Use the category-expanded table created above (the table with the category column exploded into categoryId)
  • ② Partition by categoryId and rank by ratings

SELECT *
FROM  (SELECT videoid,
              age,
              categoryid,
              length,
              ratings,
              Row_number()
                OVER(
                  partition BY categoryid
                  ORDER BY ratings DESC) AS rn
       FROM   youtubevideo_category)t
WHERE  t.rn <= 10; 

6) Top 10 users by number of uploaded videos, and the most-viewed videos they uploaded

Idea:

  • ① First find the records of the top 10 users by number of uploaded videos
  • ② Join them with the youtubevideo_orc table on the uploader field, and sort the joined result by views

--First step:
SELECT *
FROM   youtubevideo_user_orc
ORDER  BY videos DESC
LIMIT  10;

--Step two:
SELECT t2.videoid,
       t2.uploader,
       t2.views,
       t2.ratings,
       t1.videos,
       t1.friends
FROM   (SELECT *
        FROM   youtubevideo_user_orc
        ORDER  BY videos DESC
        LIMIT  10) t1
       JOIN youtubevideo_orc t2
         ON t1.uploader = t2.uploader
ORDER  BY views DESC
LIMIT  20;
