1. Requirements
id start_time end_time 1001 2021-12-01 08:08:05 2021-12-01 19:02:07 1002 2021-12-01 08:15:07 2021-12-01 19:05:10 1003 2021-12-01 09:10:20 2021-12-01 21:02:07 1005 2021-12-01 09:26:05 2021-12-01 19:52:15 1006 2021-12-01 10:08:05 2021-12-01 22:05:06 1004 2021-12-01 10:20:07 2021-12-01 18:03:05 1001 2021-12-01 20:08:05 2021-12-01 23:02:07
Data id is the host ID,start_time represents the upload time, end_time indicates download time
Data for a platform hosts download and download schedule, the platform hosts at the same time online peak number and the time line of peak.
Prepare data:
//Create table player create table if not exists player( id string, start_time string, end_time string) row format delimited fields terminated by '\t'; //Load data load data local inpath '/opt/module/data/play1.txt' into table player;
2. Solution ideas
If it is difficult to start this topic directly from the SQL itself, let's change our mind. Assuming we get a piece of data, what should we do with java program now? In fact, it is the idea of an accumulator (such as the accumulator of SPARK). First, we need to split such a record into different records or data streams and enter the accumulator, then mark each record type, where the data streams enter the accumulator in chronological order, and then overlay in the accumulator, where the cumulative result is the maximum. The essence is to use the idea of an accumulator, but the data entering the accumulator is sequential stream data in chronological order (data entering in chronological order).
1) Divide the data by the time of upload and download, mark the time of upload and download, then union is merged into a column and sorted by time
2) Data enters the accumulator to add up the number of people online during that time period and complete the end time of that number of people online period
3) Get data for the current maximum number of people online in the accumulator
3. Step-by-step code and testing
1. Divide the data by the time of upload and download, mark the time of upload and download, then union is merged into a column and sorted by time
select id, start_time at_time, 'login' type from player union all select id, end_time at_time, 'logout' type from player order by at_time; t1
Query results:
id at_time type 1001 2021-12-01 08:08:05 login 1002 2021-12-01 08:15:07 login 1003 2021-12-01 09:10:20 login 1005 2021-12-01 09:26:05 login 1006 2021-12-01 10:08:05 login 1004 2021-12-01 10:20:07 login 1004 2021-12-01 18:03:05 logout 1001 2021-12-01 19:02:07 logout 1002 2021-12-01 19:05:10 logout 1005 2021-12-01 19:52:15 logout 1001 2021-12-01 20:08:05 login 1003 2021-12-01 21:02:07 logout 1006 2021-12-01 22:05:06 logout 1001 2021-12-01 23:02:07 logout
2. Data enters the accumulator to add up the number of people online during that time period and complete the end time of that number of people online period
Number of people online during this online period: sum() window open, type login + 1, type logout - 1, no boundaries on the range to the current line
The end time of this time period for online population: the next data time is the end time of this time period, which can be easily calculated using lead(), and this data time is taken without the next data
select at_time, lead(at_time,1,at_time) over(order by at_time) lead_time, sum(if(type = 'login',1,-1)) over(order by at_time rows between unbounded preceding and current row) online_ct from t1; t2
Query results:
at_time lead_time online_ct 2021-12-01 08:08:05 2021-12-01 08:15:07 1 2021-12-01 08:15:07 2021-12-01 09:10:20 2 2021-12-01 09:10:20 2021-12-01 09:26:05 3 2021-12-01 09:26:05 2021-12-01 10:08:05 4 2021-12-01 10:08:05 2021-12-01 10:20:07 5 2021-12-01 10:20:07 2021-12-01 18:03:05 6 2021-12-01 18:03:05 2021-12-01 19:02:07 5 2021-12-01 19:02:07 2021-12-01 19:05:10 4 2021-12-01 19:05:10 2021-12-01 19:52:15 3 2021-12-01 19:52:15 2021-12-01 20:08:05 2 2021-12-01 20:08:05 2021-12-01 21:02:07 3 2021-12-01 21:02:07 2021-12-01 22:05:06 2 2021-12-01 22:05:06 2021-12-01 23:02:07 1 2021-12-01 23:02:07 2021-12-01 23:02:07 0
3. Get data for the current maximum number of people online in the accumulator
The first data, sorted in reverse order by the number of people online, corresponds to the maximum number of people online
select at_time, lead_time, online_ct from t2 order by online_ct desc limit 1;
Query results:
at_time lead_time online_ct 2021-12-01 10:20:07 2021-12-01 18:03:05 6
IV. Final Code and Testing
select at_time, lead_time, online_ct from (select at_time, lead(at_time, 1, at_time) over (order by at_time) lead_time, sum(if(type = 'login', 1, -1)) over (order by at_time rows between unbounded preceding and current row) online_ct from (select id, start_time at_time, 'login' type from player union all select id, end_time at_time, 'logout' type from player order by at_time) t1) t2 order by online_ct desc limit 1;
Test results:
at_time lead_time online_ct 2021-12-01 10:20:07 2021-12-01 18:03:05 6
V. Summary
This paper analyses the problem of SQL statistics simultaneous online population, solves it with the idea of accumulator, and finally classifies it into time series data, makes time series data analysis (common techniques: labeling, forming series, multi-sequence analysis), and finally uses sum() over() to accumulate the labels to find the current online population. The key point of this topic is the idea of converting into time series data and accumulators.
In fact, the analysis of this problem is of great business importance. We can track the number of people online over time, understand the load changes on the server, the number of concurrent servers in real time, and so on. This problem has different meanings in different business scenarios, such as the number of people online at the same time in a game, such as the number of real-time concurrencies on a server, such as the backlog of goods in a warehouse, the maximum number of orders in the service process at the same time over a period of time, and so on. In fact, it is the same thing to find the maximum number of people online and the number of people online in real time. The maximum number depends on the current online number table. Only by first finding the current online number table can you find the maximum number of people online at the same time.