The log is as follows. Please write the code to get the total number and average age of all users and active users. (active users refer to users who have access records for two consecutive days)
Date user age 2019-02-11,test_1,23 2019-02-11,test_2,19 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-12,test_2,19 2019-02-13,test_1,23 2019-02-15,test_2,19 2019-02-16,test_2,19 2019-02-17,test_4,35 2019-02-18,test_4,35 2019-02-22,test_4,35 2019-02-23,test_4,35
Build table
create table users(day_time string,user_id string,age int) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
Import data
load data local inpath '/opt/data/user.txt' into table users;
Note: it is assumed that everyone's age is constant here, regardless of a person's age growth;
The knowledge points used include hive's over() windowing function and row_number() function, lead() function
Train of thought I
1. Find the total number and average age of all users
1) First find the age of each user
select user_id,avg(age) from users group by user_id;
user_id c1 test_1 23.0 test_2 19.0 test_3 39.0 test_4 35.0
2) Then find the total number and average age of all users
select count(a.user_id) `Number of users`, avg(a.av) `Average age of all users` from( select user_id,avg(age) av from users group by user_id ) a;
Number of users average age of all users 4 29.0
2. Calculate the total number and average age of active users (active users refer to users who have access records for two consecutive days)
1) Because a user may log in multiple times a day, log in once a day when calculating the number of active users, even if they are active today, it is necessary to re log in users every day
select distinct day_time,user_id,age from users;
day_time user_id age 2019-02-11 test_1 23 2019-02-11 test_2 19 2019-02-11 test_3 39 2019-02-12 test_2 19 2019-02-13 test_1 23 2019-02-15 test_2 19 2019-02-16 test_2 19 2019-02-17 test_4 35 2019-02-18 test_4 35 2019-02-22 test_4 35 2019-02-23 test_4 35
2) Here comes the calculation logic of active users:
Use row first_ number() over(partiton by _ order by _) Group users and sort login dates. After sorting, there will be one more column (sorted column);
select t1.user_id,t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1;
user_id day_time rk age test_1 2019-02-11 1 23 test_1 2019-02-13 2 23 test_2 2019-02-11 1 19 test_2 2019-02-12 2 19 test_2 2019-02-15 3 19 test_2 2019-02-16 4 19 test_3 2019-02-11 1 39 test_4 2019-02-17 1 35 test_4 2019-02-18 2 35 test_4 2019-02-22 3 35 test_4 2019-02-23 4 35
Then subtract the sorted column (rk) from the time column. In the case of continuous login, the result of each subtraction is the same;
select t2.user_id, t2.day_time, t2.rk, date_sub(t2.day_time,rk) ds, age from( select t1.user_id, t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1 ) t2;
user_id day_time rk ds age test_1 2019-02-11 1 2019-02-10 23 test_1 2019-02-13 2 2019-02-11 23 test_2 2019-02-11 1 2019-02-10 19 test_2 2019-02-12 2 2019-02-10 19 test_2 2019-02-15 3 2019-02-12 19 test_2 2019-02-16 4 2019-02-12 19 test_3 2019-02-11 1 2019-02-10 39 test_4 2019-02-17 1 2019-02-16 35 test_4 2019-02-18 2 2019-02-16 35 test_4 2019-02-22 3 2019-02-19 35 test_4 2019-02-23 4 2019-02-19 35
3) The rule of active users is that two consecutive logins are active users, so they are grouped by user id and ds column, and the number of data in the group is counted. When the number of data is greater than or equal to 2, they are active users; By the way, you can get the start date, end date and days of continuous login;
select t3.user_id, min(t3.day_time), max(t3.day_time), count(*), avg(t3.age) from( select t2.user_id, t2.day_time, t2.rk, date_sub(t2.day_time,rk) ds, age from( select t1.user_id, t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1 ) t2 ) t3 group by t3.user_id,t3.ds having count(*)>=2;
user_id Start date End date Continuous login days age test_2 2019-02-11 2019-02-12 2 19.0 test_2 2019-02-15 2019-02-16 2 19.0 test_4 2019-02-17 2019-02-18 2 35.0 test_4 2019-02-22 2019-02-23 2 35.0
4) Find the total number and average age of active users
Step 1: because the same user can be active multiple times (log in continuously for 2 days in different time periods), but only one active user is counted, it is necessary to de duplicate first, and first find the id and corresponding age of the active user (de duplicate);
select t4.user_id, avg(t4.ag) from( select t3.user_id, min(t3.day_time), max(t3.day_time), count(*), avg(t3.age) ag from( select t2.user_id, t2.day_time, t2.rk, date_sub(t2.day_time,rk) ds, age from( select t1.user_id, t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1 ) t2 ) t3 group by t3.user_id,t3.ds having count(*)>=2 ) t4 group by t4.user_id;
user_id age test_2 19.0 test_4 35.0
If you log in for 3 consecutive days, you can count (*) > = 3;
If you log in for 4 consecutive days, you can count (*) > = 4;
If you log in for 5 consecutive days, you can count (*) > = 5;
By logical analogy;
Step 2: calculate the total number and average age of active users
select count(t5.user_id) `Total active users`, avg(t5.ag2) `Average age of active users` from( select t4.user_id, avg(t4.ag) ag2 from( select t3.user_id, min(t3.day_time), max(t3.day_time), count(*), avg(t3.age) ag from( select t2.user_id, t2.day_time, t2.rk, date_sub(t2.day_time,rk) ds, age from( select t1.user_id, t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1 ) t2 ) t3 group by t3.user_id,t3.ds having count(*)>=2 ) t4 group by t4.user_id ) t5;
Total active users Average age of active users 2 27.0
The last step: combine the two results
select b.`Number of users` `Number of all users`, b.`Average age of all users` `Average age of all users`, 0 `Total active users`, 0 `Average age of active users` from( select count(a.user_id) `Number of users`, avg(a.av) `Average age of all users` from( select user_id,avg(age) av from users group by user_id ) a ) b union all select 0 `Number of all users`, 0 `Average age of all users`, t6.`Total active users` `Total active users`, t6.`Average age of active users` `Average age of active users` from( select count(t5.user_id) `Total active users`, avg(t5.ag2) `Average age of active users` from( select t4.user_id, avg(t4.ag) ag2 from( select t3.user_id, min(t3.day_time), max(t3.day_time), count(*), avg(t3.age) ag from( select t2.user_id, t2.day_time, t2.rk, date_sub(t2.day_time,rk) ds, age from( select t1.user_id, t1.day_time, row_number() over(partition by t1.user_id order by t1.day_time) rk, t1.age from(select distinct day_time,user_id,age from users) t1 ) t2 ) t3 group by t3.user_id,t3.ds having count(*)>=2 ) t4 group by t4.user_id ) t5 ) t6;
u1.Number of all users u1.Average age of all users u1.Total active users u1.Average age of active users 4 29.0 0 0.0 0 0.0 2 27.0
Idea 2: use the function lead() in hive
day_time user_id age 2019-02-11,test_1,23 2019-02-11,test_2,19 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-11,test_3,39 2019-02-11,test_1,23 2019-02-12,test_2,19 2019-02-13,test_1,23 2019-02-15,test_2,19 2019-02-16,test_2,19 2019-02-17,test_4,35 2019-02-18,test_4,35 2019-02-22,test_4,35 2019-02-23,test_4,35
Find the total number and average age of active users (active users refer to users who have access records for two consecutive days)
1)select distinct user_id,day_time,age from users;
user_id day_time age test_1 2019-02-11 23 test_2 2019-02-11 19 test_3 2019-02-11 39 test_2 2019-02-12 19 test_1 2019-02-13 23 test_2 2019-02-15 19 test_2 2019-02-16 19 test_4 2019-02-17 35 test_4 2019-02-18 35 test_4 2019-02-22 35 test_4 2019-02-23 35
2) select t1.user_id, t1.day_time, lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt, t1.age from( select distinct user_id,day_time,age from users ) t1;
t1.user_id t1.day_time nt t1.age test_1 2019-02-11 2019-02-13 23 test_1 2019-02-13 NULL 23 test_2 2019-02-11 2019-02-12 19 test_2 2019-02-12 2019-02-15 19 test_2 2019-02-15 2019-02-16 19 test_2 2019-02-16 NULL 19 test_3 2019-02-11 NULL 39 test_4 2019-02-17 2019-02-18 35 test_4 2019-02-18 2019-02-22 35 test_4 2019-02-22 2019-02-23 35 test_4 2019-02-23 NULL 35
3) select t2.user_id, t2.age from( select t1.user_id, t1.day_time, lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt, t1.age from( select distinct user_id,day_time,age from users ) t1 ) t2 where datediff(t2.nt,t2.day_time)=1;
t2.user_id t2.age test_2 19 test_2 19 test_4 35 test_4 35
Active users who log in for 2 consecutive days:
Use lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=1;
Active users who log in for 3 consecutive days:
Use lead(t1.day_time,2) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=2;
Log in for 4 consecutive days, even if the user is active:
Use lead(t1.day_time,3) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=3;
By logical analogy;
select t3.user_id, avg(t3.age) ag from( select t2.user_id, t2.age from( select t1.user_id, t1.day_time, lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt, t1.age from( select distinct user_id,day_time,age from users ) t1 ) t2 where datediff(t2.nt,t2.day_time)=1 ) t3 group by t3.user_id;
t3.user_id ag test_2 19.0 test_4 35.0
5) select count(t6.user_id) `Total active users`, avg(t6.ag) `Average age of active users` from( select t3.user_id, avg(t3.age) ag from( select t2.user_id, t2.age from( select t1.user_id, t1.day_time, lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt, t1.age from( select distinct user_id,day_time,age from users ) t1 ) t2 where datediff(t2.nt,t2.day_time)=1 ) t3 group by t3.user_id ) t6;
Total active users Average age of active users 2 27.0
6) hinder union all Just like the idea, omit;