[SQL Interview Question] Count the number of people in the online rush hour and the time line of the rush hour

1. Requirements

id      start_time              end_time

1001	2021-12-01 08:08:05	    2021-12-01 19:02:07
1002	2021-12-01 08:15:07	    2021-12-01 19:05:10
1003	2021-12-01 09:10:20	    2021-12-01 21:02:07
1005	2021-12-01 09:26:05	    2021-12-01 19:52:15
1006	2021-12-01 10:08:05	    2021-12-01 22:05:06
1004	2021-12-01 10:20:07	    2021-12-01 18:03:05
1001	2021-12-01 20:08:05	    2021-12-01 23:02:07

         Data id is the host ID,start_time represents the upload time, end_time indicates download time
         Data for a platform hosts download and download schedule, the platform hosts at the same time online peak number and the time line of peak.

Prepare data:

//Create table player
create table if not exists player(
	id string,
	start_time string,
	end_time string) 
row format delimited fields terminated by '\t'; 
//Load data
load data local inpath '/opt/module/data/play1.txt' into table player;

2. Solution ideas

         If it is difficult to start this topic directly from the SQL itself, let's change our mind. Assuming we get a piece of data, what should we do with java program now? In fact, it is the idea of an accumulator (such as the accumulator of SPARK). First, we need to split such a record into different records or data streams and enter the accumulator, then mark each record type, where the data streams enter the accumulator in chronological order, and then overlay in the accumulator, where the cumulative result is the maximum. The essence is to use the idea of an accumulator, but the data entering the accumulator is sequential stream data in chronological order (data entering in chronological order).

         1) Divide the data by the time of upload and download, mark the time of upload and download, then union is merged into a column and sorted by time

         2) Data enters the accumulator to add up the number of people online during that time period and complete the end time of that number of people online period

         3) Get data for the current maximum number of people online in the accumulator

3. Step-by-step code and testing

1. Divide the data by the time of upload and download, mark the time of upload and download, then union is merged into a column and sorted by time

select 
	id,
	start_time at_time,
	'login' type
from player
union all
select
	id,
	end_time at_time,
	'logout' type
from player
order by at_time; t1

Query results:

id      at_time             type

1001	2021-12-01 08:08:05	login 
1002	2021-12-01 08:15:07	login 
1003	2021-12-01 09:10:20	login 
1005	2021-12-01 09:26:05	login 
1006	2021-12-01 10:08:05	login 
1004	2021-12-01 10:20:07	login 
1004	2021-12-01 18:03:05	logout
1001	2021-12-01 19:02:07	logout
1002	2021-12-01 19:05:10	logout
1005	2021-12-01 19:52:15	logout
1001	2021-12-01 20:08:05	login 
1003	2021-12-01 21:02:07	logout 
1006	2021-12-01 22:05:06	logout 
1001	2021-12-01 23:02:07	logout 

2. Data enters the accumulator to add up the number of people online during that time period and complete the end time of that number of people online period

         Number of people online during this online period: sum() window open, type login + 1, type logout - 1, no boundaries on the range to the current line
         The end time of this time period for online population: the next data time is the end time of this time period, which can be easily calculated using lead(), and this data time is taken without the next data

select
	at_time,
	lead(at_time,1,at_time) over(order by at_time) lead_time,
	sum(if(type = 'login',1,-1)) over(order by at_time rows between unbounded preceding and current row) online_ct
from t1; t2

Query results:

at_time 	            lead_time 	            online_ct
2021-12-01 08:08:05		2021-12-01 08:15:07		1
2021-12-01 08:15:07		2021-12-01 09:10:20		2
2021-12-01 09:10:20		2021-12-01 09:26:05		3
2021-12-01 09:26:05		2021-12-01 10:08:05		4
2021-12-01 10:08:05		2021-12-01 10:20:07		5
2021-12-01 10:20:07		2021-12-01 18:03:05		6
2021-12-01 18:03:05		2021-12-01 19:02:07		5
2021-12-01 19:02:07		2021-12-01 19:05:10		4
2021-12-01 19:05:10		2021-12-01 19:52:15		3
2021-12-01 19:52:15		2021-12-01 20:08:05		2
2021-12-01 20:08:05		2021-12-01 21:02:07		3
2021-12-01 21:02:07		2021-12-01 22:05:06		2
2021-12-01 22:05:06		2021-12-01 23:02:07		1
2021-12-01 23:02:07		2021-12-01 23:02:07		0

3. Get data for the current maximum number of people online in the accumulator

         The first data, sorted in reverse order by the number of people online, corresponds to the maximum number of people online

select
	at_time,
	lead_time,
	online_ct
from t2
order by online_ct desc
limit 1; 

Query results:

at_time 	           lead_time           	  online_ct
2021-12-01 10:20:07    2021-12-01 18:03:05    6

IV. Final Code and Testing

select
    at_time,
    lead_time,
    online_ct
from
    (select
         at_time,
         lead(at_time, 1, at_time) over (order by at_time)                            lead_time,
         sum(if(type = 'login', 1, -1))
             over (order by at_time rows between unbounded preceding and current row) online_ct
     from
         (select
              id,
              start_time at_time,
              'login'    type
          from
              player
          union all
          select
              id,
              end_time at_time,
              'logout' type
          from
              player
          order by at_time) t1) t2
order by
    online_ct desc
limit 1;

Test results:

at_time 	           lead_time           	  online_ct
2021-12-01 10:20:07    2021-12-01 18:03:05    6

V. Summary

         This paper analyses the problem of SQL statistics simultaneous online population, solves it with the idea of accumulator, and finally classifies it into time series data, makes time series data analysis (common techniques: labeling, forming series, multi-sequence analysis), and finally uses sum() over() to accumulate the labels to find the current online population. The key point of this topic is the idea of converting into time series data and accumulators.

         In fact, the analysis of this problem is of great business importance. We can track the number of people online over time, understand the load changes on the server, the number of concurrent servers in real time, and so on. This problem has different meanings in different business scenarios, such as the number of people online at the same time in a game, such as the number of real-time concurrencies on a server, such as the backlog of goods in a warehouse, the maximum number of orders in the service process at the same time over a period of time, and so on. In fact, it is the same thing to find the maximum number of people online and the number of people online in real time. The maximum number depends on the current online number table. Only by first finding the current online number table can you find the maximum number of people online at the same time.

Keywords: Database SQL

Added by Salsaboy on Thu, 02 Dec 2021 19:17:43 +0200