hive sql calculates the total number and average age of all users and active users

The log is as follows. Please write the code to get the total number and average age of all users and active users. (active users refer to users who have access records for two consecutive days)

Date user age
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
2019-02-17,test_4,35
2019-02-18,test_4,35
2019-02-22,test_4,35
2019-02-23,test_4,35

Build table

create table users(day_time string,user_id string,age int)
row format delimited
fields terminated by ',' 
lines terminated by '\n'
stored as textfile;

Import data

load data local inpath '/opt/data/user.txt' into table users;

Note: it is assumed that everyone's age is constant here, regardless of a person's age growth;

The knowledge points used include hive's over() windowing function and row_number() function, lead() function

Train of thought I

1. Find the total number and average age of all users

1) First find the age of each user

select user_id,avg(age) from users group by user_id;

user_id c1
test_1  23.0
test_2  19.0
test_3  39.0
test_4  35.0

2) Then find the total number and average age of all users

select 
	count(a.user_id) `Number of users`,
	avg(a.av) `Average age of all users`
from(
	select user_id,avg(age) av from users group by user_id
) a;

Number of users average age of all users
4       29.0

2. Calculate the total number and average age of active users (active users refer to users who have access records for two consecutive days)

1) Because a user may log in multiple times a day, log in once a day when calculating the number of active users, even if they are active today, it is necessary to re log in users every day

select distinct day_time,user_id,age from users;

day_time        user_id age
2019-02-11      test_1  23
2019-02-11      test_2  19
2019-02-11      test_3  39
2019-02-12      test_2  19
2019-02-13      test_1  23
2019-02-15      test_2  19
2019-02-16      test_2  19
2019-02-17      test_4  35
2019-02-18      test_4  35
2019-02-22      test_4  35
2019-02-23      test_4  35

2) Here comes the calculation logic of active users:

Use row first_ number() over(partiton by _ order by _) Group users and sort login dates. After sorting, there will be one more column (sorted column);

select 
	t1.user_id,t1.day_time,
	row_number() over(partition by t1.user_id order by t1.day_time) rk,
	t1.age
from(select distinct day_time,user_id,age from users) t1;

user_id  day_time      rk       age
test_1  2019-02-11      1       23
test_1  2019-02-13      2       23
test_2  2019-02-11      1       19
test_2  2019-02-12      2       19
test_2  2019-02-15      3       19
test_2  2019-02-16      4       19
test_3  2019-02-11      1       39
test_4  2019-02-17      1       35
test_4  2019-02-18      2       35
test_4  2019-02-22      3       35
test_4  2019-02-23      4       35

Then subtract the sorted column (rk) from the time column. In the case of continuous login, the result of each subtraction is the same;

select 
	t2.user_id,
	t2.day_time,
	t2.rk,
	date_sub(t2.day_time,rk) ds,
	age
from(
	select 
		t1.user_id,
		t1.day_time,
		row_number() over(partition by t1.user_id order by t1.day_time) rk,
		t1.age
	from(select distinct day_time,user_id,age from users) t1
) t2;

user_id  day_time      rk         ds			age
test_1  2019-02-11      1       2019-02-10      23
test_1  2019-02-13      2       2019-02-11      23
test_2  2019-02-11      1       2019-02-10      19
test_2  2019-02-12      2       2019-02-10      19
test_2  2019-02-15      3       2019-02-12      19
test_2  2019-02-16      4       2019-02-12      19
test_3  2019-02-11      1       2019-02-10      39
test_4  2019-02-17      1       2019-02-16      35
test_4  2019-02-18      2       2019-02-16      35
test_4  2019-02-22      3       2019-02-19      35
test_4  2019-02-23      4       2019-02-19      35

3) The rule of active users is that two consecutive logins are active users, so they are grouped by user id and ds column, and the number of data in the group is counted. When the number of data is greater than or equal to 2, they are active users; By the way, you can get the start date, end date and days of continuous login;

select 
	t3.user_id,
	min(t3.day_time),
	max(t3.day_time),
	count(*),
	avg(t3.age)
from(
	select 
		t2.user_id,
		t2.day_time,
		t2.rk,
		date_sub(t2.day_time,rk) ds,
		age
	from(
		select 
			t1.user_id,
			t1.day_time,
			row_number() over(partition by t1.user_id order by t1.day_time) rk,
			t1.age
		from(select distinct day_time,user_id,age from users) t1
	) t2
) t3 group by t3.user_id,t3.ds having count(*)>=2;

user_id   Start date			End date	Continuous login days		age
test_2  2019-02-11      2019-02-12      2       	19.0
test_2  2019-02-15      2019-02-16      2       	19.0
test_4  2019-02-17      2019-02-18      2       	35.0
test_4  2019-02-22      2019-02-23      2       	35.0

4) Find the total number and average age of active users

Step 1: because the same user can be active multiple times (log in continuously for 2 days in different time periods), but only one active user is counted, it is necessary to de duplicate first, and first find the id and corresponding age of the active user (de duplicate);

select
	t4.user_id,
	avg(t4.ag)
from(
	select 
		t3.user_id,
		min(t3.day_time),
		max(t3.day_time),
		count(*),
		avg(t3.age) ag
	from(
		select 
			t2.user_id,
			t2.day_time,
			t2.rk,
			date_sub(t2.day_time,rk) ds,
			age
		from(
			select 
				t1.user_id,
				t1.day_time,
				row_number() over(partition by t1.user_id order by t1.day_time) rk,
				t1.age
			from(select distinct day_time,user_id,age from users) t1
		) t2
	) t3 group by t3.user_id,t3.ds having count(*)>=2
) t4 group by t4.user_id;

user_id  	age
test_2  	19.0
test_4  	35.0

If you log in for 3 consecutive days, you can count (*) > = 3;

If you log in for 4 consecutive days, you can count (*) > = 4;

If you log in for 5 consecutive days, you can count (*) > = 5;

By logical analogy;

Step 2: calculate the total number and average age of active users

select
	count(t5.user_id) `Total active users`,
	avg(t5.ag2) `Average age of active users`
from(
	select
		t4.user_id,
		avg(t4.ag) ag2
	from(
		select 
			t3.user_id,
			min(t3.day_time),
			max(t3.day_time),
			count(*),
			avg(t3.age) ag
		from(
			select 
				t2.user_id,
				t2.day_time,
				t2.rk,
				date_sub(t2.day_time,rk) ds,
				age
			from(
				select 
					t1.user_id,
					t1.day_time,
					row_number() over(partition by t1.user_id order by t1.day_time) rk,
					t1.age
				from(select distinct day_time,user_id,age from users) t1
			) t2
		) t3 group by t3.user_id,t3.ds having count(*)>=2
	) t4 group by t4.user_id
) t5;

Total active users    Average age of active users
2       		27.0

The last step: combine the two results

select 
	b.`Number of users` `Number of all users`,
	b.`Average age of all users` `Average age of all users`,
	0 `Total active users`,
	0 `Average age of active users`
from(
	select 
		count(a.user_id) `Number of users`,
		avg(a.av) `Average age of all users`
	from(
		select user_id,avg(age) av from users group by user_id
	) a
) b
union all
select 
	0 `Number of all users`,
	0 `Average age of all users`,
	t6.`Total active users` `Total active users`,
	t6.`Average age of active users` `Average age of active users`
from(
	select
		count(t5.user_id) `Total active users`,
		avg(t5.ag2) `Average age of active users`
	from(
		select
			t4.user_id,
			avg(t4.ag) ag2
		from(
			select 
				t3.user_id,
				min(t3.day_time),
				max(t3.day_time),
				count(*),
				avg(t3.age) ag
			from(
				select 
					t2.user_id,
					t2.day_time,
					t2.rk,
					date_sub(t2.day_time,rk) ds,
					age
				from(
					select 
						t1.user_id,
						t1.day_time,
						row_number() over(partition by t1.user_id order by t1.day_time) rk,
						t1.age
					from(select distinct day_time,user_id,age from users) t1
				) t2
			) t3 group by t3.user_id,t3.ds having count(*)>=2
		) t4 group by t4.user_id
	) t5
) t6;

u1.Number of all users   	u1.Average age of all users   u1.Total active users u1.Average age of active users
4       		29.0    				0      			 0.0
0       		0.0     				2       		27.0

Idea 2: use the function lead() in hive

day_time user_id age
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
2019-02-17,test_4,35
2019-02-18,test_4,35
2019-02-22,test_4,35
2019-02-23,test_4,35

Find the total number and average age of active users (active users refer to users who have access records for two consecutive days)

1)select distinct user_id,day_time,age from users;

user_id day_time        age
test_1  2019-02-11      23
test_2  2019-02-11      19
test_3  2019-02-11      39
test_2  2019-02-12      19
test_1  2019-02-13      23
test_2  2019-02-15      19
test_2  2019-02-16      19
test_4  2019-02-17      35
test_4  2019-02-18      35
test_4  2019-02-22      35
test_4  2019-02-23      35

2)
select 
	t1.user_id,
	t1.day_time,
	lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
	t1.age
from(
	select distinct user_id,day_time,age from users
) t1;

t1.user_id      t1.day_time     	nt      	t1.age
test_1   		2019-02-11      2019-02-13      23
test_1  		2019-02-13      NULL    		23
test_2  		2019-02-11      2019-02-12      19
test_2  		2019-02-12      2019-02-15      19
test_2  		2019-02-15      2019-02-16      19
test_2  		2019-02-16      NULL    		19
test_3  		2019-02-11      NULL    		39
test_4  		2019-02-17      2019-02-18      35
test_4  		2019-02-18      2019-02-22      35
test_4  		2019-02-22      2019-02-23      35
test_4  		2019-02-23      NULL    		35

3)
select
	t2.user_id,
	t2.age
from(
	select 
		t1.user_id,
		t1.day_time,
		lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
		t1.age
	from(
		select distinct user_id,day_time,age from users
	) t1
) t2 where datediff(t2.nt,t2.day_time)=1;

t2.user_id     	 t2.age
test_2  			19
test_2  			19
test_4  			35
test_4  			35

Active users who log in for 2 consecutive days:
Use lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=1;

Active users who log in for 3 consecutive days:
Use lead(t1.day_time,2) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=2;

Log in for 4 consecutive days, even if the user is active:
Use lead(t1.day_time,3) over(partition by t1.user_id order by t1.day_time) nt
Then NT day_ time=3;