Hive's Learning Notes - Chapter 10 Hive Practice

1. Requirement Description

Statistics Silicon Valley video and video website general indicators, various TopN indicators:

  • Count Video Views Top10
  • Statistics Video Category Heat Top10
  • Statistics Video Views Top20 Categories
  • Statistical Video Views Rank of the Category of Video Associated with Top50
  • Count video heat Top10 in each category
  • Count video views per category Top10

2. Data structure

3. Preparations

(1) Create external tables: video s table, user table

#video table
create external table video_ori(
  videoId string, 
  uploader string, 
  age int, 
  category array<string>, 
  length int, 
  views int, 
  rate float, 
  ratings int, 
  comments int,
  relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
location '/mario/hive/9/video';

#user table
create external table user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t" 
location '/mario/hive/9/user';

(2) Then set up the internal table: video_orc table, user_orc table

#video_orc table
create table video_orc(
  videoId string, 
  uploader string, 
  age int, 
  category array<string>, 
  length int, 
  views int, 
  rate float, 
  ratings int, 
  comments int,
  relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");

#user_orc table
create table user_orc(
    uploader string,
    videos int,
    friends int)
stored as orc
tblproperties("orc.compress"="SNAPPY");

(3) Inserting data into ORC tables

insert into table video_orc select * from video_ori;
insert into table user_orc select * from user_ori;

4. Business Analysis

4.1 Statistics Video Views Top10

Ideas:

Use order by to sort globally by the views field, and we set the display to show only the first 10 entries

Final Code:

SELECT videoid,
       views
FROM video_orc
ORDER BY views DESC
LIMIT 10;

4.2 Statistics Video Category Heat Top10

Ideas:

(1) Define video category warmth (assuming the number of videos in a category)

(2) We need to aggregate by category group by and then count the number of videoId s in the group.

(3) Because the current table structure is: one video corresponds to one or more categories. So if you want a group by category, you need to first row by column (expand) the category, then count it [t1]

SELECT videoid,
       cate
FROM video_orc LATERAL VIEW explode(category) tbl AS cate;

(4) counting how many videos there are in each category, showing the top 10 categories that contain the most videos

SELECT cate,
       COUNT(videoid) n
FROM t1
GROUP BY cate
ORDER BY n DESC
LIMIT 10;

Final Code:

SELECT cate,
       COUNT(videoid) n
FROM
  ( SELECT videoid,
           cate
   FROM video_orc LATERAL VIEW explode(category) tbl AS cate) t1
GROUP BY cate
ORDER BY n DESC
LIMIT 10;

4.3 Count the top 20 video categories and the number of Top20 videos in each category

Ideas:

(1) First find all the information about the entries of the 20 most viewed videos, in descending order [t1]

SELECT videoid,
       views,
       category
FROM video_orc
ORDER BY views DESC
LIMIT 20;

(2) Split the category of the 20 pieces of information [t2]

SELECT videoid,
       cate
FROM t1 LATERAL VIEW explode(category) tbl AS cate;

(3) Final query for video category name and how many Top20 videos are under that category

SELECT cate,
       COUNT(videoid) n
FROM t2
GROUP BY cate
ORDER BY n DESC;

Final Code:

SELECT cate,
       COUNT(videoid) n
FROM
  ( SELECT videoid,
           cate
   FROM
     ( SELECT videoid,
              views,
              category
      FROM video_orc
      ORDER BY views DESC
      LIMIT 20) t1 LATERAL VIEW explode(category) tbl AS cate) t2
GROUP BY cate
ORDER BY n DESC;

4.4 Statistics Video Watches Order of Categories of Videos Associated with Top50

Ideas:

(1) Count the Related videos of the top 50 videos [t1]

SELECT videoid,
       views,
       relatedid
FROM video_orc
ORDER BY views DESC
LIMIT 50;

(2) explosion switch video [t2]

SELECT explode(relatedid) videoid
FROM t1;

(3) Categories of associated videos obtained by Join from the original table [t3]

SELECT DISTINCT t2.videoid,
                v.category
FROM t2
JOIN video_orc v ON t2.videoid = v.videoid;

(4) Explode category [t4]

SELECT explode(category) cate
FROM t3;

(5) Category thermometer [t5]

SELECT cate,
       COUNT(videoid) n
FROM
  (SELECT videoid,
          cate
   FROM video_orc LATERAL VIEW explode(category) tbl AS cate) g1
GROUP BY cate

(6) Join sorting with category thermometer [t5]

SELECT DISTINCT t4.cate,
                t5.n
FROM t4
JOIN t5 ON t4.cate = t5.cate
ORDER BY t5.n DESC;

Final Code:

SELECT DISTINCT t4.cate,
                t5.n
FROM
  ( SELECT explode(category) cate
   FROM
     ( SELECT DISTINCT t2.videoid,
                       v.category
      FROM
        ( SELECT explode(relatedid) videoid
         FROM
           ( SELECT videoid,
                    views,
                    relatedid
            FROM video_orc
            ORDER BY views DESC
            LIMIT 50) t1) t2
      JOIN video_orc v ON t2.videoid = v.videoid) t3) t4
JOIN
  ( SELECT cate,
           COUNT(videoid) n
   FROM
     ( SELECT videoid,
              cate
      FROM video_orc LATERAL VIEW explode(category) tbl AS cate) g1
   GROUP BY cate) t5 ON t4.cate = t5.cate
ORDER BY t5.n DESC;

4.5 Count video heat Top10 in each category, take Music for example

Ideas:

(1) If you want to count Top10 video heat in the Music category, you need to find the Music category first, then you need to expand the category, so you can create a table to store the data of the categoryId expansion.

create table video_category(
    videoId string, 
    uploader string, 
    age int, 
    categoryId string, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int, 
    relatedId array<string>)
row format delimited 
fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

(2) Insert data into a category-expanded table.

INSERT INTO TABLE video_category
SELECT videoId,
       uploader,
       age,
       categoryId,
       LENGTH,
       views,
       rate,
       ratings,
       comments,
       relatedId
FROM video_orc LATERAL VIEW explode(category) catetory AS categoryId;

(3) Statistics of video heat in corresponding categories (Music).

SELECT videoid,
       views
FROM video_category
WHERE cate ="Music"
ORDER BY views DESC
LIMIT 10;

Final Code:

SELECT videoId,
       views
FROM gulivideo_category
WHERE categoryId = "Music"
ORDER BY views DESC
LIMIT 10;

4.6 Count video views per category Top10

Ideas:

From Video_ Catgory directly queries the top 10 videos of Music traffic

Final Code:

SELECT cate,
       videoid,
       views
FROM
  ( SELECT cate,
           videoid,
           views,
           RANK() OVER(PARTITION BY cate
                       ORDER BY views DESC) hot
   FROM video_category) t1
WHERE hot <= 10;

4.7 Count Top10 users who uploaded the most videos and the videos they uploaded in the top 20

4.7.1 Top 10 users Top 20 per person

Ideas:

(1) Statistics of Top10 [t1] users who upload most videos

SELECT uploader,
       videos
FROM user_orc
ORDER BY videos DESC
LIMIT 10;

(2) and video_orc works together to find videos uploaded by these users and rank them by heat [t2]

SELECT t1.uploader,
       v.videoid,
       RANK() OVER(PARTITION BY t1.uploader
                   ORDER BY v.views DESC) hot
FROM t1
LEFT JOIN video_orc v ON t1.uploader = v.uploader;

(3) Ask each person for the first 20

SELECT t2.uploader,
       t2.videoid,
       t2.hot
FROM t2
WHERE hot <= 20;

4.7.2 Top 20 of Top 10 Users

(1) Statistics of Top10 users who upload most videos

SELECT uploader,
       videos
FROM user_orc
ORDER BY videos DESC
LIMIT 10;

(2) Watch the top 20 videos

SELECT videoid,
       uploader,
       views
FROM video_orc
ORDER BY views DESC
LIMIT 20;

(3) Join two tables to see if they have uploaded

SELECT t1.uploader,
       t2.videoid
FROM t1
LEFT JOIN t2 ON t1.uploader = t2.uploader;

4.8 Count video views per category Top10

(1) From video_category table to find the ranking of video views per category

SELECT cate,
       videoid,
       views,
       RANK() OVER(PARTITION BY cate
                   ORDER BY views DESC) hot
FROM video_category;

(2) Take Top10 for each category

SELECT cate,
       videoid,
       views
FROM t1
WHERE hot <= 10;

Keywords: hive

Added by dawnrae on Sat, 15 Jan 2022 04:50:44 +0200