Hive sql programming interview questions

Hive sql programming interview questions

Question 1

Table structure: uid,subject_id,score

Ask: find out the students whose scores in all subjects are greater than the average score of a certain subject

The data set is as follows

1001	01	90
1001	02	90
1001	03	90
1002	01	85
1002	02	85
1002	03	70
1003	01	70
1003	02	70
1003	03	85

1) Create table statement

create table score(

  uid string,

  subject_id string,

  score int)

row format delimited fields terminated by '\t'; 

2) Calculate the average score of each subject

select

  uid,

  score,

  avg(score) over(partition by subject_id) avg_score

from

  score;t1

3) Record the flag according to whether it is greater than the average score. If it is greater than, it will be recorded as 0; otherwise, it will be recorded as 1

select

  uid,

  if(score>avg_score,0,1) flag

from

  t1;t2

4) The sum of flag is counted according to the student id. if the sum is 0, all subjects are greater than the average score

select

  uid

from

  t2

group by

  uid

having

  sum(flag)=0;

5) Final SQL

select

  uid

from

  (select

  uid,

  if(score>avg_score,0,1) flag

from

  (select

  uid,

  score,
  avg(score) over(partition by subject_id) avg_score
from

  score)t1)t2

group by

  uid

having

  sum(flag)=0;

Question 2

We have the following users to access data

userIdvisitDatevisitCount
u012017/1/215
u022017/1/236
u032017/1/228
u042017/1/203
u012017/1/236
u012017/2/218
U022017/1/236
U012017/2/224

It is required to use SQL to count the cumulative access times of each user, as shown in the following table:

User idmonthSubtotalaccumulate
u012017-011111
u012017-021223
u022017-011212
u032017-0188
u042017-0133

data set

u01   2017/1/21    5
u02   2017/1/23    6
u03   2017/1/22    8
u04   2017/1/20    3
u01   2017/1/23    6
u01   2017/2/21    8
u02   2017/1/23    6
u01   2017/2/22    4

1) Create table

create table action
(userId string,
visitDate string,
visitCount int) 
row format delimited fields terminated by "\t";

2) Modify data format

select

   userId,

   date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
   visitCount
from

   action;t1

3) Calculate monthly visits per person

select

  userId,

  mn,

  sum(visitCount) mn_count

from

  t1

group by 

userId,mn;t2

4) Cumulative monthly visits

select

  userId,

  mn,

  mn_count,

  sum(mn_count) over(partition by userId order by mn)

from t2;

5) Final SQL

select

  userId,

  mn,

  mn_count,

  sum(mn_count) over(partition by userId order by mn)

from 

(  select

   userId,

    mn,

    sum(visitCount) mn_count

  from

     (select

       userId,

       date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,

       visitCount

     from

       action)t1

group by userId,mn)t2;

Question 3

There are 50W JD stores. When each customer or visitor accesses any commodity in any store, an access log will be generated. The table stored in the access log is called Visit, and the visitor's user id is user_id, the name of the visited store is shop, please make statistics:

1) UV per store (number of visitors)

2) Visitor information of top 3 visits per store. Output store name, visitor id and number of visits

data set

u1	a
u2	b
u1	b
u1	a
u3	c
u4	b
u1	a
u2	c
u5	b
u4	b
u6	c
u2	c
u1	b
u2	a
u2	a
u3	a
u5	a
u5	a
u5	a

1) Build table

create table visit(user_id string,shop string) row format delimited fields terminated by '\t';

2) UV per store (number of visitors)

select shop,count(distinct user_id) from visit group by shop;

3) Visitor information of top 3 visits per store. Output store name, visitor id and number of visits

(1) Query the number of times each store has been visited by each user

select shop,user_id,count(*) ct

from visit

group by shop,user_id;t1

(2) Calculate the ranking of each store visited by users

select shop,user_id,ct,rank() over(partition by shop order by ct) rk

from t1;t2

(3) Take the top 3 of each store

select shop,user_id,ct

from t2

where rk<=3;

(4) Final SQL

select 

shop,

user_id,

ct

from

(select 

shop,

user_id,

ct,

rank() over(partition by shop order by ct) rk

from 

(select 

shop,

user_id,

count(*) ct

from visit

group by 

shop,

user_id)t1

)t2

where rk<=3;

Question 4

A known table STG.ORDER has the following fields: Date, Order_id,User_id,amount. Please give an example of sql statistics:

2017-01-01,10029028,1000003251,33.57.

1) Give the number of orders, users and total transaction amount of each month in 2017.

2) Give the number of new customers in November 2017 (the first order is only in November)

Build table

create table order_tab(dt string,order_id string,user_id string,amount decimal(10,2)) row format delimited fields terminated by '\t';

1) Give the number of orders, users and total transaction amount of each month in 2017.

select

  date_format(dt,'yyyy-MM'),

  count(order_id),

  count(distinct user_id),

  sum(amount)

from

  order_tab

where

  date_format(dt,'yyyy')='2017'

group by

  date_format(dt,'yyyy-MM');

2) Give the number of new customers in November 2017 (the first order is only in November)

select

  count(user_id)

from

  order_tab

group by

  user_id

having

  date_format(min(dt),'yyyy-MM')='2017-11';

Question 5

The log is as follows. Please write the code to get the total number and average age of all users and active users. (active user refers to the user who has access records for two consecutive days) date user age

data set

2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19

1) Build table

create table user_age(dt string,user_id string,age int)row format delimited fields terminated by ',';

2) Grouped by date and user, sorted by date and ranked

select

  dt,

  user_id,

  min(age) age,

  rank() over(partition by user_id order by dt) rk

from

  user_age

group by

  dt,user_id;t1

3) Calculate the difference between date and ranking

select

  user_id,

  age,

  date_sub(dt,rk) flag

from

  t1;t2

4) Filter out users whose difference is greater than or equal to 2, that is, users who are active for two consecutive days

select

  user_id,

  min(age) age

from

  t2

group by

  user_id,flag

having

  count(*)>=2;t3

5) De reprocess the data (a user can log in continuously at two different time points), for example: a user logs in on January 10, January 11, January 20 and January 21.

select

  user_id,

  min(age) age

from

  t3

group by

  user_id;t4

6) Calculate the number of active users (two consecutive days of access) and the average age

select

  count(*) ct,

  cast(sum(age)/count(*) as decimal(10,2))

from t4;

7) De duplication of the full data set according to the user

select

  user_id,

  min(age) age 

from

  user_age 

group by 

  user_id;t5

8) Calculate the number and average age of all users

select

  count(*) user_count,

  cast((sum(age)/count(*)) as decimal(10,1)) 

from 

  t5;

9) Perform the union all operation on the two data sets in step 5 and step 7

select

  0 user_total_count,

  0 user_total_avg_age,

  count(*) twice_count,

  cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age

from 

(

  select

  user_id,

  min(age) age

from

  (select

  user_id,

  min(age) age

from

  (

  select

  user_id,

  age,

  date_sub(dt,rk) flag

from

  (

  select

​    dt,

​    user_id,

​    min(age) age,

​    rank() over(partition by user_id order by dt) rk

  from

​    user_age

  group by

​    dt,user_id

  )t1

  )t2

group by

  user_id,flag

having

  count(*)>=2)t3

group by

  user_id 

)t4

 

union all

 

select

  count(*) user_total_count,

  cast((sum(age)/count(*)) as decimal(10,1)),

  0 twice_count,

  0 twice_count_avg_age

from 

  (

   select

​     user_id,

​     min(age) age 

   from 

​     user_age 

   group by 

​     user_id

  )t5;t6

10) Sum and splice to final SQL

select 

  sum(user_total_count),

  sum(user_total_avg_age),

  sum(twice_count),

  sum(twice_count_avg_age)

from 

(select

  0 user_total_count,

  0 user_total_avg_age,

  count(*) twice_count,

  cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age

from 

(

  select

  user_id,

  min(age) age

from

  (select

  user_id,

  min(age) age

from

  (

  select

  user_id,

  age,

  date_sub(dt,rk) flag

from

  (

  select

​    dt,

​    user_id,

​    min(age) age,

​    rank() over(partition by user_id order by dt) rk

  from

​    user_age

  group by

​    dt,user_id

  )t1

  )t2

group by

  user_id,flag

having

  count(*)>=2)t3

group by

  user_id 

)t4

 

union all

 

select

  count(*) user_total_count,

  cast((sum(age)/count(*)) as decimal(10,1)),

  0 twice_count,

  0 twice_count_avg_age

from 

  (

   select

​     user_id,

​     min(age) age 

   from 

​     user_age 

   group by 

​     user_id

  )t5)t6;

Question 6

Please use sql to write out the amount of goods purchased by all users for the first time in October this year, and the ordertable field (purchase user: userid, amount: money, purchase time: paymenttime (Format: 2017-10-01), order id: orderid)

1) Build table

create table ordertable(
  userid string,
  money int,
  paymenttime string,
  orderid string)
row format delimited fields terminated by '\t';

2) Find out

select

  userid,

  min(paymenttime) paymenttime

from

  ordertable

where

  date_format(paymenttime,'yyyy-MM')='2017-10'

group by

  userid;t1

 

select

  t1.userid,

  t1.paymenttime,

  od.money

from

  t1

join

  ordertable od

on

  t1.userid=od.userid

  and

  t1.paymenttime=od.paymenttime;

 

select

  t1.userid,

  t1.paymenttime,

  od.money

from

  (select

  userid,

  min(paymenttime) paymenttime

from

  ordertable

where

  date_format(paymenttime,'yyyy-MM')='2017-10'

group by

  userid)t1

join

  ordertable od

on

  t1.userid=od.userid

  and

  t1.paymenttime=od.paymenttime;

Question 7

There is an online server access log with the following format (answer with sql)

Time interface ip address

2016-11-09 11: 22: 05   /api/user/login          110.23.5.33

2016-11-09 11: 23: 10   /api/user/detail          57.3.2.16

.....

2016-11-09 23: 59: 40   /api/user/login          200.6.5.166

Ask for the ip address of top10 of api/user/login interface at 14:00 p.m. (14-15:00) on November 9

data set

2016-11-09 14:22:05	/api/user/login	110.23.5.33
2016-11-09 11:23:10	/api/user/detail	57.3.2.16
2016-11-09 14:59:40	/api/user/login	200.6.5.166
2016-11-09 14:22:05	/api/user/login	110.23.5.34
2016-11-09 14:22:05	/api/user/login	110.23.5.34
2016-11-09 14:22:05	/api/user/login	110.23.5.34
2016-11-09 11:23:10	/api/user/detail	57.3.2.16
2016-11-09 23:59:40	/api/user/login	200.6.5.166
2016-11-09 14:22:05	/api/user/login	110.23.5.34
2016-11-09 11:23:10	/api/user/detail	57.3.2.16
2016-11-09 23:59:40	/api/user/login	200.6.5.166
2016-11-09 14:22:05	/api/user/login	110.23.5.35
2016-11-09 14:23:10	/api/user/detail	57.3.2.16
2016-11-09 23:59:40	/api/user/login	200.6.5.166
2016-11-09 14:59:40	/api/user/login	200.6.5.166
2016-11-09 14:59:40	/api/user/login	200.6.5.166

1) Build table

create table ip(

  time string,

  interface string,

  ip string)

row format delimited fields terminated by '\t';

2) Final SQL

select

  ip,

  interface,

  count(*) ct

from

  ip

where

  date_format(time,'yyyy-MM-dd HH')>='2016-11-09 14'
  and 
  date_format(time,'yyyy-MM-dd HH')<='2016-11-09 15'

  and

  interface='/api/user/login'

group by

ip,interface

order by

  ct desc

limit 2;t1

Question 8

1) There are three tables: member table, sale s table and regoods table

(1) The member table has the fields memberid (member id, primary key) and credits (points);

(2) The sales table has a field memberid (member id, foreign key) and purchase amount (MNAccount);

(3) There is a field memberid (member id, foreign key) and return amount (RMNAccount) in the return table.

2) Business description

(1) The sales records in the sales table can be purchased by members or non members. (that is, the memberid in the sales table can be empty);

(2) A member in the sales table can have multiple purchase records;

(3) The return records in the return table can be members or non members;

(4) A member can have one or more return records.

Query demand: find the purchase amount of all members in the sales table in groups, find the return amount of all members in the return table in groups, and update the results obtained from the purchase amount refund amount with the same member id to the credit field of the corresponding member in the member table

data set

sale

1001	50.3
1002	56.5
1003	235
1001	23.6
1005	56.2
	25.6
	33.5

regoods

1001   20.1

1002   23.6

1001   10.1

​    23.5

​    10.2

1005   0.8

1) Build table

create table member(memberid string,credits double) row format delimited fields terminated by '\t';

 

create table sale(memberid string,MNAccount double) row format delimited fields terminated by '\t';

 

create table regoods(memberid string,RMNAccount double) row format delimited fields terminated by '\t';

2) Final SQL

insert into table member

select

  t1.memberid,

  MNAccount-RMNAccount

from

  (select

​    memberid,

​    sum(MNAccount) MNAccount

  from

​    sale

  where

​    memberid!=''

  group by

​    memberid

  )t1

join

  (select

​    memberid,

​    sum(RMNAccount) RMNAccount

  from

​    regoods

  where

​    memberid!=''

  group by

​    memberid

  )t2

on

  t1.memberid=t2.memberid;

Question 9

1. Use an SQL statement to query the names of students with scores greater than 80 in each course

name kecheng fenshu

Zhang San Chinese 81

Zhang San mathematics 75

Li Si Chinese 76

Li Si mathematics 90

Wang Wu language 81

Wang Wu mathematics 100

Wang Wu English 90
A: select distinct name from table where name not in (select distinct name from table where fenshu<=80)

B: select name from table group by name having min(fenshu)>80
  1. The student list is as follows:
    Automatic numbering student number name course number course name score

    1   2005001 Zhang San 0001 mathematics 69
    2   2005002 Li Si 0001 mathematics 89
    3   2005001 Zhang San 0001 mathematics 69
    

    Delete redundant student information that is the same except for automatic numbering

A: delete tablename where auto number not in (select min from tablename group by student number, name, course number, course name, score)

3. In a table called team, there is only one field name. There are four records, namely a, B, C and D, corresponding to four teams. Now the four teams are playing, and an sql statement is used to display all possible game combinations

Answer:

select a.name, b.name
from team a, team b
where a.name < b.name

4. Interview question: how to put such a
year month amount

1991  1   1.1
1991  2   1.2
1991  3   1.3
1991  4   1.4
1992  1   2.1
1992  2   2.2
1992  3   2.3
1992  4   2.4

It turns out that
year m1 m2 m3 m4
1991 1.1 1.2 1.3 1.4
1992 2.1 2.2 2.3 2.4

answer

select year, 
(select amount from aaa m where month=1 and m.year=aaa.year) as m1,
(select amount from aaa m where month=2 and m.year=aaa.year) as m2,
(select amount from aaa m where month=3 and m.year=aaa.year) as m3,
(select amount from  aaa m where month=4 and m.year=aaa.year) as m4
from aaa group by year

Original table:
courseid coursename score
-------------------------------------
1 java 70
2 oracle 90
3 xml 40
4 jsp 30
5 servlet 80
-------------------------------------
For ease of reading, the results after querying this table are as follows (the passing score is 60):
courseid coursename score mark
---------------------------------------------------
1 java 70 pass
2 oracle 90 pass
3 xml 40 fail
4 jsp 30 fail
5 servlet 80 pass
---------------------------------------------------
Write this query statement
select courseid, coursename ,score ,if(score>=60, "pass","fail") as mark from course

7. Table name: shopping information

Shopper's commodity name and quantity

A a 2

B B 4

C C 1

A Ding 2

B C 5

......

Give the records of shoppers who have purchased two or more goods

Answer: select * from shopping info where shopper in (select shopper from shopping info group by shopper having count (*) > = 2);

info table

date result

2005-05-09 win

2005-05-09 lose

2005-05-09 lose

2005-05-09 lose

2005-05-10 win

2005-05-10 lose

2005-05-10 lose

How to write sql statements if you want to generate the following results?

win lose

2005-05-09 2 2

2005-05-10 1 2

answer:

(1) select date, sum(case when result = "win" then 1 else 0 end) as "win", sum(case when result = "lose" then 1 else 0 end) as "lose" from info group by date;

(2) select a.date, a.result as win, b.result as lose

from

(select date, count(result) as result from info where result = "win" group by date) as a

join

(select date, count(result) as result from info where result = "lose" group by date) as b

on a.date = b.date;

Question 10

uid age gender orderInfo

user2;20;female;  
{
"orderId": 1,
"mobiles": ["166666661","177777771"],
"orderDetails": [
{"goodsId":"1111","goodsPrice":40.1},
{"goodsId":"2222","goodsPrice":50.1}
]
}

Parse the data and expand as follows:

uid age gender orderId mobile goodsId goodsPrice

user2,20,female,1,166666661,1111,40.1
user2,20,female,1,177777771,2222,50.1

It mainly examines your ability to process json format data. It is not a pure json format, but a nested and complex json format.

Custom function, select custom UDTF function to solve!!

UDTF function – enter a row of data to get the results of multiple rows and columns.

To customize UDTF:

1. Create a class to inherit GernericUDTF

2. Rewrite three of the methods

2.1 initialize method: determine the output field name and data type

2.2 process method: customize the logic of parsing data

2.3 close method: close and release some resources

Key point: send data using the forward method

Get more information

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (IMG riiwiqr2-1633685256899) (hive SQL programming interview question. assets/Bob QR code. jpg)]

Having count (*) > = 2);

info table

date result

2005-05-09 win

2005-05-09 lose

2005-05-09 lose

2005-05-09 lose

2005-05-10 win

2005-05-10 lose

2005-05-10 lose

How to write sql statements if you want to generate the following results?

win lose

2005-05-09 2 2

2005-05-10 1 2

answer:

(1) select date, sum(case when result = "win" then 1 else 0 end) as "win", sum(case when result = "lose" then 1 else 0 end) as "lose" from info group by date;

(2) select a.date, a.result as win, b.result as lose

from

(select date, count(result) as result from info where result = "win" group by date) as a

join

(select date, count(result) as result from info where result = "lose" group by date) as b

on a.date = b.date;

Question 10

uid age gender orderInfo

user2;20;female;  
{
"orderId": 1,
"mobiles": ["166666661","177777771"],
"orderDetails": [
{"goodsId":"1111","goodsPrice":40.1},
{"goodsId":"2222","goodsPrice":50.1}
]
}

Parse the data and expand as follows:

uid age gender orderId mobile goodsId goodsPrice

user2,20,female,1,166666661,1111,40.1
user2,20,female,1,177777771,2222,50.1

It mainly examines your ability to process json format data. It is not a pure json format, but a nested and complex json format.

Custom function, select custom UDTF function to solve!!

UDTF function – enter a row of data to get the results of multiple rows and columns.

To customize UDTF:

1. Create a class to inherit GernericUDTF

2. Rewrite three of the methods

2.1 initialize method: determine the output field name and data type

2.2 process method: customize the logic of parsing data

2.3 close method: close and release some resources

Key point: send data using the forward method

Keywords: hive SQL Interview

Added by vinpkl on Fri, 08 Oct 2021 11:53:20 +0300