-Catalogue
- 1. Aggregate function + over
- 2. partition by clause
- 3. order by clause
- 4. ★ window clause (included) - predicting: forward - FOLLOWING: backward - CURRENT ROW: CURRENT ROW - UNBOUNDED: starting point
- 5. ★ sequence function in window function: including NTILE(n), row_number,rank,dense_rank
brief introduction
The hive function in this article is mainly introduced. The window functions in hive are similar to those in sql. They are used to do some data analysis. They are generally used for olap analysis
Note: OVER(): Specifies the size of the data window of the analysis function, which determines the scope of the aggregation function. The size of this data window may change with the change of rows. At the same time, the following can be used to limit the scope.
concept
We all know that there is a kind of function called aggregate function in sql, such as sum(), avg(), max(), etc. this kind of function can aggregate multiple rows of data into one row according to rules. Generally speaking, the number of rows after aggregation is less than that before aggregation.
But sometimes we want to display both the data before aggregation and the data after aggregation. At this time, we introduce the window function.
Before delving into the Over sentence, we must note that in SQL processing, the window function is the last step, and only before the Order by sentence
Data preparation
We prepare an order table with fields of name, orderdate and cost. The data contents are as follows:
jack,2015-01-01,10 tony,2015-01-02,15 jack,2015-02-03,23 tony,2015-01-04,29 jack,2015-01-05,46 jack,2015-04-06,42 tony,2015-01-07,50 jack,2015-01-08,55 mart,2015-04-08,62 mart,2015-04-09,68 neil,2015-05-10,12 mart,2015-04-11,75 neil,2015-06-12,80 mart,2015-04-13,94
Create a table t in hive_ Window and insert the data into it
example
Aggregate function + over
If we want to query the customers and total number of people who bought in April 2015, we can use the window function to realize it
select name,count(*) over () from t_window where substring(orderdate,1,7) = '2015-04'
The results are as follows:
name count_window_0 mart 5 mart 5 mart 5 mart 5 jack 5
It can be seen that in April 2015, there were five purchase records, four for mart and one for jack In fact, in most cases, we only look at the results after repetition In this case, we have two implementation methods
First: distinct
select distinct name,count(*) over () from t_window where substring(orderdate,1,7) = '2015-04'
Second: group by
select name,count(*) over () from t_window where substring(orderdate,1,7) = '2015-04' group by name
The results after implementation are as follows:
name count_window_0 mart 2 jack 2
partition by clause
The first thing mentioned after the Over clause is partition by The partition by clause can also be called a query partition clause, which is very similar to Group By. It groups data according to boundary values. The functions before Over are performed within each group. If the grouping is exceeded, the function will recalculate
example
If we want to see the customer's purchase details and monthly total purchase amount, we can execute the following sql
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from t_window
The results are as follows:
name orderdate cost sum_window_0 jack 2015-01-01 10 205 jack 2015-01-08 55 205 tony 2015-01-07 50 205 jack 2015-01-05 46 205 tony 2015-01-04 29 205 tony 2015-01-02 15 205 jack 2015-02-03 23 23 mart 2015-04-13 94 341 jack 2015-04-06 42 341 mart 2015-04-11 75 341 mart 2015-04-09 68 341 mart 2015-04-08 62 341 neil 2015-05-10 12 12 neil 2015-06-12 80 80
It can be seen that the data have been summarized by month
order by clause
In the above scenario, if we want to accumulate cost by month At this point, we introduce the order by clause
The order by clause will force the input data to be sorted (as mentioned earlier in the article, the window function is the last function executed by the SQL statement, so you can think of the SQL result set as input data). The order by clause is not valid for items such as Row_Number(), Lead(), LAG() and other functions are necessary, because if the data is out of order, the results of these functions have no meaning. Therefore, if there is an order by clause, the results calculated by Count(), Min(), etc. have no meaning.
We add order by to the above code
select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate ) from t_window
The results are as follows: (order by aggregates the data from the starting line to the current line by default)
name orderdate cost sum_window_0 jack 2015-01-01 10 10 tony 2015-01-02 15 25 tony 2015-01-04 29 54 jack 2015-01-05 46 100 tony 2015-01-07 50 150 jack 2015-01-08 55 205 jack 2015-02-03 23 23 jack 2015-04-06 42 42 mart 2015-04-08 62 104 mart 2015-04-09 68 172 mart 2015-04-11 75 247 mart 2015-04-13 94 341 neil 2015-05-10 12 12 neil 2015-06-12 80 80
window clause
We have grouped the data by using the partition by clause above If we want more fine - grained partitioning, we need to introduce the window clause
We should first understand two concepts:
-If only the partition by clause is used and order by is not specified, our aggregation is within the group
-The order by clause is used. If the window clause is not used, the default is from the starting point to the current line
When there are multiple window functions in the same select query, they have no influence on each other Each window function applies its own rules
window clause:
-Predicting: forward
-FOLLOWING: back
-CURRENT ROW: CURRENT ROW
-UNBOUNDED: the starting point, UNBOUNDED predicting means the starting point from the front, UNBOUNDED FOLLOWING means the ending point from the back
We divide by name, sort by shopping time, and accumulate cost s
As follows, we use the window clause to query
select name,orderdate,cost, sum(cost) over() as sample1,--Add all rows sum(cost) over(partition by name) as sample2,--Press name Grouping, intra group data addition sum(cost) over(partition by name order by orderdate) as sample3,--Press name Grouping, intra group data accumulation sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--and sample3 equally,Aggregation from starting point to current row sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --Aggregate the current row with the previous row sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--The current line, the front line and the back line sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --Current line and all subsequent lines from t_window;
The query results are as follows:
name orderdate cost sample1 sample2 sample3 sample4 sample5 sample6 sample7 jack 2015-01-01 10 661 176 10 10 10 56 176 jack 2015-01-05 46 661 176 56 56 56 111 166 jack 2015-01-08 55 661 176 111 111 101 124 120 jack 2015-02-03 23 661 176 134 134 78 120 65 jack 2015-04-06 42 661 176 176 176 65 65 42 mart 2015-04-08 62 661 299 62 62 62 130 299 mart 2015-04-09 68 661 299 130 130 130 205 237 mart 2015-04-11 75 661 299 205 205 143 237 169 mart 2015-04-13 94 661 299 299 299 169 169 94 neil 2015-05-10 12 661 92 12 12 12 92 92 neil 2015-06-12 80 661 92 92 92 92 92 80 tony 2015-01-02 15 661 94 15 15 15 44 94 tony 2015-01-04 29 661 94 44 44 44 94 79 tony 2015-01-07 50 661 94 94 94 79 79 50
Sequence function in window function
The main sequence function does not support window clause
The sequence functions commonly used in hive are as follows:
NTILE
-
NTILE(n), which is used to divide the grouped data into n slices in order and return the current slice value
-
NTILE does not support ROWS BETWEEN,
For example, ntile (2) over (partition by cookie order by createtime rows between 3 predicting and current row) - If the slices are not uniform, the distribution of the first slice is increased by default
What application scenario does this function use? We can use this function if we want the transaction record of the first third of each customer's purchase amount
select name,orderdate,cost, ntile(3) over() as sample1 , --Global data slicing ntile(3) over(partition by name), -- according to name Group,Cut the data into 3 copies in the group ntile(3) over(order by cost),--Global according to cost Ascending arrangement,Cut the data into 3 parts ntile(3) over(partition by name order by cost ) --according to name Group by cost Ascending arrangement,Cut the data into 3 parts from t_window
The data obtained are as follows:
name orderdate cost sample1 sample2 sample3 sample4 jack 2015-01-01 10 3 1 1 1 jack 2015-02-03 23 3 1 1 1 jack 2015-04-06 42 2 2 2 2 jack 2015-01-05 46 2 2 2 2 jack 2015-01-08 55 2 3 2 3 mart 2015-04-08 62 2 1 2 1 mart 2015-04-09 68 1 2 3 1 mart 2015-04-11 75 1 3 3 2 mart 2015-04-13 94 1 1 3 3 neil 2015-05-10 12 1 2 1 1 neil 2015-06-12 80 1 1 3 2 tony 2015-01-02 15 3 2 1 1 tony 2015-01-04 29 3 3 1 2 tony 2015-01-07 50 2 1 2 3
As for the above data, the part with sample4 = 1 is the result we want
row_number,rank,dense_rank
There are many usage scenarios for these three window functions
- row_number() starts from 1 and generates the sequence of records in the group in order, row_ The value of number () will not be repeated. When the sorted values are the same, they will be arranged according to the order of records in the table
-RANK() generates the ranking of data items in the group. If the ranking is equal, there will be a vacancy in the ranking
- DENSE_RANK() generates the ranking of data items in the group. If the ranking is equal, there will be no vacancy in the ranking
**Note:
Rank and deny_ The difference between rank and rank is whether there will be a vacancy when the ranking is equal**
Examples are as follows:
SELECT cookieid, createtime, pv, RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1, DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 FROM lxw1234 WHERE cookieid = 'cookie1'; cookieid day pv rn1 rn2 rn3 cookie1 2015-04-12 7 1 1 1 cookie1 2015-04-11 5 2 2 2 cookie1 2015-04-15 4 3 3 3 cookie1 2015-04-16 4 3 3 4 cookie1 2015-04-13 3 5 4 5 cookie1 2015-04-14 2 6 5 6 cookie1 2015-04-10 1 7 6 7
rn1: 15 No. 1 and No. 16 are tied for No. 3, 13 Row 5 rn2: 15 No. 1 and No. 16 are tied for No. 3, 13 Row 4 rn3: If they are equal, they will be sorted by record values to generate a unique order. If all record values are equal, they may be arranged randomly.
LAG and LEAD functions
These two functions are commonly used window functions, which can return the data of upper and lower data rows
Take our order form as an example. If we want to check the last purchase time of customers, we can check it in this way
select name,orderdate,cost, lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 from t_window;
The queried data is:
name orderdate cost time1 time2 jack 2015-01-01 10 1900-01-01 NULL jack 2015-01-05 46 2015-01-01 NULL jack 2015-01-08 55 2015-01-05 2015-01-01 jack 2015-02-03 23 2015-01-08 2015-01-05 jack 2015-04-06 42 2015-02-03 2015-01-08 mart 2015-04-08 62 1900-01-01 NULL mart 2015-04-09 68 2015-04-08 NULL mart 2015-04-11 75 2015-04-09 2015-04-08 mart 2015-04-13 94 2015-04-11 2015-04-09 neil 2015-05-10 12 1900-01-01 NULL neil 2015-06-12 80 2015-05-10 NULL tony 2015-01-02 15 1900-01-01 NULL tony 2015-01-04 29 2015-01-02 NULL tony 2015-01-07 50 2015-01-04 2015-01-02
time1 refers to grouping by name, which is arranged in ascending order and takes the value of the previous row of data
time2 takes the value of the data in the above two rows, which is grouped by name in ascending order. Note that when the lag function is set to the row value, it defaults to 1 row If no default value is set, null value will be taken
The direction of the lead function is opposite to that of the lag function, and the data is oriented
first_value and last_value
first_value gets the first value of the current row after sorting in the group
last_value is the last value until the current row after sorting in the group
select name,orderdate,cost, first_value(orderdate) over(partition by name order by orderdate) as time1, last_value(orderdate) over(partition by name order by orderdate) as time2 from t_window
The query results are as follows:
name orderdate cost time1 time2 jack 2015-01-01 10 2015-01-01 2015-01-01 jack 2015-01-05 46 2015-01-01 2015-01-05 jack 2015-01-08 55 2015-01-01 2015-01-08 jack 2015-02-03 23 2015-01-01 2015-02-03 jack 2015-04-06 42 2015-01-01 2015-04-06 mart 2015-04-08 62 2015-04-08 2015-04-08 mart 2015-04-09 68 2015-04-08 2015-04-09 mart 2015-04-11 75 2015-04-08 2015-04-11 mart 2015-04-13 94 2015-04-08 2015-04-13 neil 2015-05-10 12 2015-05-10 2015-05-10 neil 2015-06-12 80 2015-05-10 2015-06-12 tony 2015-01-02 15 2015-01-02 2015-01-02 tony 2015-01-04 29 2015-01-02 2015-01-04 tony 2015-01-07 50 2015-01-02 2015-01-07
At the end of the article, I recommend some popular technology blog links:
- JAVA related in-depth technology blog links
- Link to Flink related technology blog
- Spark # core technology link
- Design pattern - deep technology blog link
- Machine learning - deep technology blog link
- Blog link technology
- Super dry goods - Flink mind map, which took about 3 weeks to write and proofread
- Go deep into the JVM core principle of JAVA and solve various online faults [case attached]
- Please talk about your understanding of volatile-- A recent "hard core contest" between Xiao Li and the interviewer
- Talk about RPC communication, an interview question often asked. Source code + notes, package understand
- In depth talk about Java garbage collection mechanism [schematic diagram and tuning method attached]
Welcome to scan the two-dimensional code below or search for the official account of "big data senior architect". We will send more and timely information to you. Welcome to communicate more.