1. Requirement Description
Statistics Silicon Valley video and video website general indicators, various TopN indicators:
- Count Video Views Top10
- Statistics Video Category Heat Top10
- Statistics Video Views Top20 Categories
- Statistical Video Views Rank of the Category of Video Associated with Top50
- Count video heat Top10 in each category
- Count video views per category Top10
2. Data structure
3. Preparations
(1) Create external tables: video s table, user table
#video table create external table video_ori( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" location '/mario/hive/9/video'; #user table create external table user_ori( uploader string, videos int, friends int) row format delimited fields terminated by "\t" location '/mario/hive/9/user';
(2) Then set up the internal table: video_orc table, user_orc table
#video_orc table create table video_orc( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) stored as orc tblproperties("orc.compress"="SNAPPY"); #user_orc table create table user_orc( uploader string, videos int, friends int) stored as orc tblproperties("orc.compress"="SNAPPY");
(3) Inserting data into ORC tables
insert into table video_orc select * from video_ori; insert into table user_orc select * from user_ori;
4. Business Analysis
4.1 Statistics Video Views Top10
Ideas:
Use order by to sort globally by the views field, and we set the display to show only the first 10 entries
Final Code:
SELECT videoid, views FROM video_orc ORDER BY views DESC LIMIT 10;
4.2 Statistics Video Category Heat Top10
Ideas:
(1) Define video category warmth (assuming the number of videos in a category)
(2) We need to aggregate by category group by and then count the number of videoId s in the group.
(3) Because the current table structure is: one video corresponds to one or more categories. So if you want a group by category, you need to first row by column (expand) the category, then count it [t1]
SELECT videoid, cate FROM video_orc LATERAL VIEW explode(category) tbl AS cate;
(4) counting how many videos there are in each category, showing the top 10 categories that contain the most videos
SELECT cate, COUNT(videoid) n FROM t1 GROUP BY cate ORDER BY n DESC LIMIT 10;
Final Code:
SELECT cate, COUNT(videoid) n FROM ( SELECT videoid, cate FROM video_orc LATERAL VIEW explode(category) tbl AS cate) t1 GROUP BY cate ORDER BY n DESC LIMIT 10;
4.3 Count the top 20 video categories and the number of Top20 videos in each category
Ideas:
(1) First find all the information about the entries of the 20 most viewed videos, in descending order [t1]
SELECT videoid, views, category FROM video_orc ORDER BY views DESC LIMIT 20;
(2) Split the category of the 20 pieces of information [t2]
SELECT videoid, cate FROM t1 LATERAL VIEW explode(category) tbl AS cate;
(3) Final query for video category name and how many Top20 videos are under that category
SELECT cate, COUNT(videoid) n FROM t2 GROUP BY cate ORDER BY n DESC;
Final Code:
SELECT cate, COUNT(videoid) n FROM ( SELECT videoid, cate FROM ( SELECT videoid, views, category FROM video_orc ORDER BY views DESC LIMIT 20) t1 LATERAL VIEW explode(category) tbl AS cate) t2 GROUP BY cate ORDER BY n DESC;
4.4 Statistics Video Watches Order of Categories of Videos Associated with Top50
Ideas:
(1) Count the Related videos of the top 50 videos [t1]
SELECT videoid, views, relatedid FROM video_orc ORDER BY views DESC LIMIT 50;
(2) explosion switch video [t2]
SELECT explode(relatedid) videoid FROM t1;
(3) Categories of associated videos obtained by Join from the original table [t3]
SELECT DISTINCT t2.videoid, v.category FROM t2 JOIN video_orc v ON t2.videoid = v.videoid;
(4) Explode category [t4]
SELECT explode(category) cate FROM t3;
(5) Category thermometer [t5]
SELECT cate, COUNT(videoid) n FROM (SELECT videoid, cate FROM video_orc LATERAL VIEW explode(category) tbl AS cate) g1 GROUP BY cate
(6) Join sorting with category thermometer [t5]
SELECT DISTINCT t4.cate, t5.n FROM t4 JOIN t5 ON t4.cate = t5.cate ORDER BY t5.n DESC;
Final Code:
SELECT DISTINCT t4.cate, t5.n FROM ( SELECT explode(category) cate FROM ( SELECT DISTINCT t2.videoid, v.category FROM ( SELECT explode(relatedid) videoid FROM ( SELECT videoid, views, relatedid FROM video_orc ORDER BY views DESC LIMIT 50) t1) t2 JOIN video_orc v ON t2.videoid = v.videoid) t3) t4 JOIN ( SELECT cate, COUNT(videoid) n FROM ( SELECT videoid, cate FROM video_orc LATERAL VIEW explode(category) tbl AS cate) g1 GROUP BY cate) t5 ON t4.cate = t5.cate ORDER BY t5.n DESC;
4.5 Count video heat Top10 in each category, take Music for example
Ideas:
(1) If you want to count Top10 video heat in the Music category, you need to find the Music category first, then you need to expand the category, so you can create a table to store the data of the categoryId expansion.
create table video_category( videoId string, uploader string, age int, categoryId string, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc;
(2) Insert data into a category-expanded table.
INSERT INTO TABLE video_category SELECT videoId, uploader, age, categoryId, LENGTH, views, rate, ratings, comments, relatedId FROM video_orc LATERAL VIEW explode(category) catetory AS categoryId;
(3) Statistics of video heat in corresponding categories (Music).
SELECT videoid, views FROM video_category WHERE cate ="Music" ORDER BY views DESC LIMIT 10;
Final Code:
SELECT videoId, views FROM gulivideo_category WHERE categoryId = "Music" ORDER BY views DESC LIMIT 10;
4.6 Count video views per category Top10
Ideas:
From Video_ Catgory directly queries the top 10 videos of Music traffic
Final Code:
SELECT cate, videoid, views FROM ( SELECT cate, videoid, views, RANK() OVER(PARTITION BY cate ORDER BY views DESC) hot FROM video_category) t1 WHERE hot <= 10;
4.7 Count Top10 users who uploaded the most videos and the videos they uploaded in the top 20
4.7.1 Top 10 users Top 20 per person
Ideas:
(1) Statistics of Top10 [t1] users who upload most videos
SELECT uploader, videos FROM user_orc ORDER BY videos DESC LIMIT 10;
(2) and video_orc works together to find videos uploaded by these users and rank them by heat [t2]
SELECT t1.uploader, v.videoid, RANK() OVER(PARTITION BY t1.uploader ORDER BY v.views DESC) hot FROM t1 LEFT JOIN video_orc v ON t1.uploader = v.uploader;
(3) Ask each person for the first 20
SELECT t2.uploader, t2.videoid, t2.hot FROM t2 WHERE hot <= 20;
4.7.2 Top 20 of Top 10 Users
(1) Statistics of Top10 users who upload most videos
SELECT uploader, videos FROM user_orc ORDER BY videos DESC LIMIT 10;
(2) Watch the top 20 videos
SELECT videoid, uploader, views FROM video_orc ORDER BY views DESC LIMIT 20;
(3) Join two tables to see if they have uploaded
SELECT t1.uploader, t2.videoid FROM t1 LEFT JOIN t2 ON t1.uploader = t2.uploader;
4.8 Count video views per category Top10
(1) From video_category table to find the ranking of video views per category
SELECT cate, videoid, views, RANK() OVER(PARTITION BY cate ORDER BY views DESC) hot FROM video_category;
(2) Take Top10 for each category
SELECT cate, videoid, views FROM t1 WHERE hot <= 10;