Hive sql programming interview questions
Question 1
Table structure: uid,subject_id,score
Ask: find out the students whose scores in all subjects are greater than the average score of a certain subject
The data set is as follows
1001 01 90 1001 02 90 1001 03 90 1002 01 85 1002 02 85 1002 03 70 1003 01 70 1003 02 70 1003 03 85
1) Create table statement
create table score( uid string, subject_id string, score int) row format delimited fields terminated by '\t';
2) Calculate the average score of each subject
select uid, score, avg(score) over(partition by subject_id) avg_score from score;t1
3) Record the flag according to whether it is greater than the average score. If it is greater than, it will be recorded as 0; otherwise, it will be recorded as 1
select uid, if(score>avg_score,0,1) flag from t1;t2
4) The sum of flag is counted according to the student id. if the sum is 0, all subjects are greater than the average score
select uid from t2 group by uid having sum(flag)=0;
5) Final SQL
select uid from (select uid, if(score>avg_score,0,1) flag from (select uid, score, avg(score) over(partition by subject_id) avg_score from score)t1)t2 group by uid having sum(flag)=0;
Question 2
We have the following users to access data
userId | visitDate | visitCount |
---|---|---|
u01 | 2017/1/21 | 5 |
u02 | 2017/1/23 | 6 |
u03 | 2017/1/22 | 8 |
u04 | 2017/1/20 | 3 |
u01 | 2017/1/23 | 6 |
u01 | 2017/2/21 | 8 |
U02 | 2017/1/23 | 6 |
U01 | 2017/2/22 | 4 |
It is required to use SQL to count the cumulative access times of each user, as shown in the following table:
User id | month | Subtotal | accumulate |
---|---|---|---|
u01 | 2017-01 | 11 | 11 |
u01 | 2017-02 | 12 | 23 |
u02 | 2017-01 | 12 | 12 |
u03 | 2017-01 | 8 | 8 |
u04 | 2017-01 | 3 | 3 |
data set
u01 2017/1/21 5 u02 2017/1/23 6 u03 2017/1/22 8 u04 2017/1/20 3 u01 2017/1/23 6 u01 2017/2/21 8 u02 2017/1/23 6 u01 2017/2/22 4
1) Create table
create table action (userId string, visitDate string, visitCount int) row format delimited fields terminated by "\t";
2) Modify data format
select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCount from action;t1
3) Calculate monthly visits per person
select userId, mn, sum(visitCount) mn_count from t1 group by userId,mn;t2
4) Cumulative monthly visits
select userId, mn, mn_count, sum(mn_count) over(partition by userId order by mn) from t2;
5) Final SQL
select userId, mn, mn_count, sum(mn_count) over(partition by userId order by mn) from ( select userId, mn, sum(visitCount) mn_count from (select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCount from action)t1 group by userId,mn)t2;
Question 3
There are 50W JD stores. When each customer or visitor accesses any commodity in any store, an access log will be generated. The table stored in the access log is called Visit, and the visitor's user id is user_id, the name of the visited store is shop, please make statistics:
1) UV per store (number of visitors)
2) Visitor information of top 3 visits per store. Output store name, visitor id and number of visits
data set
u1 a u2 b u1 b u1 a u3 c u4 b u1 a u2 c u5 b u4 b u6 c u2 c u1 b u2 a u2 a u3 a u5 a u5 a u5 a
1) Build table
create table visit(user_id string,shop string) row format delimited fields terminated by '\t';
2) UV per store (number of visitors)
select shop,count(distinct user_id) from visit group by shop;
3) Visitor information of top 3 visits per store. Output store name, visitor id and number of visits
(1) Query the number of times each store has been visited by each user
select shop,user_id,count(*) ct from visit group by shop,user_id;t1
(2) Calculate the ranking of each store visited by users
select shop,user_id,ct,rank() over(partition by shop order by ct) rk from t1;t2
(3) Take the top 3 of each store
select shop,user_id,ct from t2 where rk<=3;
(4) Final SQL
select shop, user_id, ct from (select shop, user_id, ct, rank() over(partition by shop order by ct) rk from (select shop, user_id, count(*) ct from visit group by shop, user_id)t1 )t2 where rk<=3;
Question 4
A known table STG.ORDER has the following fields: Date, Order_id,User_id,amount. Please give an example of sql statistics:
2017-01-01,10029028,1000003251,33.57.
1) Give the number of orders, users and total transaction amount of each month in 2017.
2) Give the number of new customers in November 2017 (the first order is only in November)
Build table
create table order_tab(dt string,order_id string,user_id string,amount decimal(10,2)) row format delimited fields terminated by '\t';
1) Give the number of orders, users and total transaction amount of each month in 2017.
select date_format(dt,'yyyy-MM'), count(order_id), count(distinct user_id), sum(amount) from order_tab where date_format(dt,'yyyy')='2017' group by date_format(dt,'yyyy-MM');
2) Give the number of new customers in November 2017 (the first order is only in November)
select count(user_id) from order_tab group by user_id having date_format(min(dt),'yyyy-MM')='2017-11';
Question 5
The log is as follows. Please write the code to get the total number and average age of all users and active users. (active user refers to the user who has access records for two consecutive days) date user age
data set
2019-02-11,test_1,23 2019-02-11,test_2,19 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-12,test_2,19 2019-02-13,test_1,23 2019-02-15,test_2,19 2019-02-16,test_2,19
1) Build table
create table user_age(dt string,user_id string,age int)row format delimited fields terminated by ',';
2) Grouped by date and user, sorted by date and ranked
select dt, user_id, min(age) age, rank() over(partition by user_id order by dt) rk from user_age group by dt,user_id;t1
3) Calculate the difference between date and ranking
select user_id, age, date_sub(dt,rk) flag from t1;t2
4) Filter out users whose difference is greater than or equal to 2, that is, users who are active for two consecutive days
select user_id, min(age) age from t2 group by user_id,flag having count(*)>=2;t3
5) De reprocess the data (a user can log in continuously at two different time points), for example: a user logs in on January 10, January 11, January 20 and January 21.
select user_id, min(age) age from t3 group by user_id;t4
6) Calculate the number of active users (two consecutive days of access) and the average age
select count(*) ct, cast(sum(age)/count(*) as decimal(10,2)) from t4;
7) De duplication of the full data set according to the user
select user_id, min(age) age from user_age group by user_id;t5
8) Calculate the number and average age of all users
select count(*) user_count, cast((sum(age)/count(*)) as decimal(10,1)) from t5;
9) Perform the union all operation on the two data sets in step 5 and step 7
select 0 user_total_count, 0 user_total_avg_age, count(*) twice_count, cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age from ( select user_id, min(age) age from (select user_id, min(age) age from ( select user_id, age, date_sub(dt,rk) flag from ( select dt, user_id, min(age) age, rank() over(partition by user_id order by dt) rk from user_age group by dt,user_id )t1 )t2 group by user_id,flag having count(*)>=2)t3 group by user_id )t4 union all select count(*) user_total_count, cast((sum(age)/count(*)) as decimal(10,1)), 0 twice_count, 0 twice_count_avg_age from ( select user_id, min(age) age from user_age group by user_id )t5;t6
10) Sum and splice to final SQL
select sum(user_total_count), sum(user_total_avg_age), sum(twice_count), sum(twice_count_avg_age) from (select 0 user_total_count, 0 user_total_avg_age, count(*) twice_count, cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age from ( select user_id, min(age) age from (select user_id, min(age) age from ( select user_id, age, date_sub(dt,rk) flag from ( select dt, user_id, min(age) age, rank() over(partition by user_id order by dt) rk from user_age group by dt,user_id )t1 )t2 group by user_id,flag having count(*)>=2)t3 group by user_id )t4 union all select count(*) user_total_count, cast((sum(age)/count(*)) as decimal(10,1)), 0 twice_count, 0 twice_count_avg_age from ( select user_id, min(age) age from user_age group by user_id )t5)t6;
Question 6
Please use sql to write out the amount of goods purchased by all users for the first time in October this year, and the ordertable field (purchase user: userid, amount: money, purchase time: paymenttime (Format: 2017-10-01), order id: orderid)
1) Build table
create table ordertable( userid string, money int, paymenttime string, orderid string) row format delimited fields terminated by '\t';
2) Find out
select userid, min(paymenttime) paymenttime from ordertable where date_format(paymenttime,'yyyy-MM')='2017-10' group by userid;t1 select t1.userid, t1.paymenttime, od.money from t1 join ordertable od on t1.userid=od.userid and t1.paymenttime=od.paymenttime; select t1.userid, t1.paymenttime, od.money from (select userid, min(paymenttime) paymenttime from ordertable where date_format(paymenttime,'yyyy-MM')='2017-10' group by userid)t1 join ordertable od on t1.userid=od.userid and t1.paymenttime=od.paymenttime;
Question 7
There is an online server access log with the following format (answer with sql)
Time interface ip address
2016-11-09 11: 22: 05 /api/user/login 110.23.5.33 2016-11-09 11: 23: 10 /api/user/detail 57.3.2.16 ..... 2016-11-09 23: 59: 40 /api/user/login 200.6.5.166
Ask for the ip address of top10 of api/user/login interface at 14:00 p.m. (14-15:00) on November 9
data set
2016-11-09 14:22:05 /api/user/login 110.23.5.33 2016-11-09 11:23:10 /api/user/detail 57.3.2.16 2016-11-09 14:59:40 /api/user/login 200.6.5.166 2016-11-09 14:22:05 /api/user/login 110.23.5.34 2016-11-09 14:22:05 /api/user/login 110.23.5.34 2016-11-09 14:22:05 /api/user/login 110.23.5.34 2016-11-09 11:23:10 /api/user/detail 57.3.2.16 2016-11-09 23:59:40 /api/user/login 200.6.5.166 2016-11-09 14:22:05 /api/user/login 110.23.5.34 2016-11-09 11:23:10 /api/user/detail 57.3.2.16 2016-11-09 23:59:40 /api/user/login 200.6.5.166 2016-11-09 14:22:05 /api/user/login 110.23.5.35 2016-11-09 14:23:10 /api/user/detail 57.3.2.16 2016-11-09 23:59:40 /api/user/login 200.6.5.166 2016-11-09 14:59:40 /api/user/login 200.6.5.166 2016-11-09 14:59:40 /api/user/login 200.6.5.166
1) Build table
create table ip( time string, interface string, ip string) row format delimited fields terminated by '\t';
2) Final SQL
select ip, interface, count(*) ct from ip where date_format(time,'yyyy-MM-dd HH')>='2016-11-09 14' and date_format(time,'yyyy-MM-dd HH')<='2016-11-09 15' and interface='/api/user/login' group by ip,interface order by ct desc limit 2;t1
Question 8
1) There are three tables: member table, sale s table and regoods table
(1) The member table has the fields memberid (member id, primary key) and credits (points);
(2) The sales table has a field memberid (member id, foreign key) and purchase amount (MNAccount);
(3) There is a field memberid (member id, foreign key) and return amount (RMNAccount) in the return table.
2) Business description
(1) The sales records in the sales table can be purchased by members or non members. (that is, the memberid in the sales table can be empty);
(2) A member in the sales table can have multiple purchase records;
(3) The return records in the return table can be members or non members;
(4) A member can have one or more return records.
Query demand: find the purchase amount of all members in the sales table in groups, find the return amount of all members in the return table in groups, and update the results obtained from the purchase amount refund amount with the same member id to the credit field of the corresponding member in the member table
data set
sale
1001 50.3 1002 56.5 1003 235 1001 23.6 1005 56.2 25.6 33.5
regoods
1001 20.1 1002 23.6 1001 10.1 23.5 10.2 1005 0.8
1) Build table
create table member(memberid string,credits double) row format delimited fields terminated by '\t'; create table sale(memberid string,MNAccount double) row format delimited fields terminated by '\t'; create table regoods(memberid string,RMNAccount double) row format delimited fields terminated by '\t';
2) Final SQL
insert into table member select t1.memberid, MNAccount-RMNAccount from (select memberid, sum(MNAccount) MNAccount from sale where memberid!='' group by memberid )t1 join (select memberid, sum(RMNAccount) RMNAccount from regoods where memberid!='' group by memberid )t2 on t1.memberid=t2.memberid;
Question 9
1. Use an SQL statement to query the names of students with scores greater than 80 in each course
name kecheng fenshu
Zhang San Chinese 81 Zhang San mathematics 75 Li Si Chinese 76 Li Si mathematics 90 Wang Wu language 81 Wang Wu mathematics 100 Wang Wu English 90
A: select distinct name from table where name not in (select distinct name from table where fenshu<=80) B: select name from table group by name having min(fenshu)>80
-
The student list is as follows:
Automatic numbering student number name course number course name score1 2005001 Zhang San 0001 mathematics 69 2 2005002 Li Si 0001 mathematics 89 3 2005001 Zhang San 0001 mathematics 69
Delete redundant student information that is the same except for automatic numbering
A: delete tablename where auto number not in (select min from tablename group by student number, name, course number, course name, score)
3. In a table called team, there is only one field name. There are four records, namely a, B, C and D, corresponding to four teams. Now the four teams are playing, and an sql statement is used to display all possible game combinations
Answer:
select a.name, b.name from team a, team b where a.name < b.name
4. Interview question: how to put such a
year month amount
1991 1 1.1 1991 2 1.2 1991 3 1.3 1991 4 1.4 1992 1 2.1 1992 2 2.2 1992 3 2.3 1992 4 2.4
It turns out that
year m1 m2 m3 m4
1991 1.1 1.2 1.3 1.4
1992 2.1 2.2 2.3 2.4
answer
select year, (select amount from aaa m where month=1 and m.year=aaa.year) as m1, (select amount from aaa m where month=2 and m.year=aaa.year) as m2, (select amount from aaa m where month=3 and m.year=aaa.year) as m3, (select amount from aaa m where month=4 and m.year=aaa.year) as m4 from aaa group by year
Original table:
courseid coursename score
-------------------------------------
1 java 70
2 oracle 90
3 xml 40
4 jsp 30
5 servlet 80
-------------------------------------
For ease of reading, the results after querying this table are as follows (the passing score is 60):
courseid coursename score mark
---------------------------------------------------
1 java 70 pass
2 oracle 90 pass
3 xml 40 fail
4 jsp 30 fail
5 servlet 80 pass
---------------------------------------------------
Write this query statement
select courseid, coursename ,score ,if(score>=60, "pass","fail") as mark from course
7. Table name: shopping information
Shopper's commodity name and quantity
A a 2
B B 4
C C 1
A Ding 2
B C 5
......
Give the records of shoppers who have purchased two or more goods
Answer: select * from shopping info where shopper in (select shopper from shopping info group by shopper having count (*) > = 2);
info table
date result
2005-05-09 win
2005-05-09 lose
2005-05-09 lose
2005-05-09 lose
2005-05-10 win
2005-05-10 lose
2005-05-10 lose
How to write sql statements if you want to generate the following results?
win lose
2005-05-09 2 2
2005-05-10 1 2
answer:
(1) select date, sum(case when result = "win" then 1 else 0 end) as "win", sum(case when result = "lose" then 1 else 0 end) as "lose" from info group by date;
(2) select a.date, a.result as win, b.result as lose
from
(select date, count(result) as result from info where result = "win" group by date) as a
join
(select date, count(result) as result from info where result = "lose" group by date) as b
on a.date = b.date;
Question 10
uid age gender orderInfo
user2;20;female; { "orderId": 1, "mobiles": ["166666661","177777771"], "orderDetails": [ {"goodsId":"1111","goodsPrice":40.1}, {"goodsId":"2222","goodsPrice":50.1} ] }
Parse the data and expand as follows:
uid age gender orderId mobile goodsId goodsPrice
user2,20,female,1,166666661,1111,40.1 user2,20,female,1,177777771,2222,50.1
It mainly examines your ability to process json format data. It is not a pure json format, but a nested and complex json format.
Custom function, select custom UDTF function to solve!!
UDTF function – enter a row of data to get the results of multiple rows and columns.
To customize UDTF:
1. Create a class to inherit GernericUDTF
2. Rewrite three of the methods
2.1 initialize method: determine the output field name and data type
2.2 process method: customize the logic of parsing data
2.3 close method: close and release some resources
Key point: send data using the forward method
Get more information
[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (IMG riiwiqr2-1633685256899) (hive SQL programming interview question. assets/Bob QR code. jpg)]
Having count (*) > = 2);
info table
date result
2005-05-09 win
2005-05-09 lose
2005-05-09 lose
2005-05-09 lose
2005-05-10 win
2005-05-10 lose
2005-05-10 lose
How to write sql statements if you want to generate the following results?
win lose
2005-05-09 2 2
2005-05-10 1 2
answer:
(1) select date, sum(case when result = "win" then 1 else 0 end) as "win", sum(case when result = "lose" then 1 else 0 end) as "lose" from info group by date;
(2) select a.date, a.result as win, b.result as lose
from
(select date, count(result) as result from info where result = "win" group by date) as a
join
(select date, count(result) as result from info where result = "lose" group by date) as b
on a.date = b.date;
Question 10
uid age gender orderInfo
user2;20;female; { "orderId": 1, "mobiles": ["166666661","177777771"], "orderDetails": [ {"goodsId":"1111","goodsPrice":40.1}, {"goodsId":"2222","goodsPrice":50.1} ] }
Parse the data and expand as follows:
uid age gender orderId mobile goodsId goodsPrice
user2,20,female,1,166666661,1111,40.1 user2,20,female,1,177777771,2222,50.1
It mainly examines your ability to process json format data. It is not a pure json format, but a nested and complex json format.
Custom function, select custom UDTF function to solve!!
UDTF function – enter a row of data to get the results of multiple rows and columns.
To customize UDTF:
1. Create a class to inherit GernericUDTF
2. Rewrite three of the methods
2.1 initialize method: determine the output field name and data type
2.2 process method: customize the logic of parsing data
2.3 close method: close and release some resources
Key point: send data using the forward method