Hive window functions: OVER and PARTITION BY

Contents

  • 1. Aggregate function + over
  • 2. partition by clause
  • 3. order by clause
  • 4. ★ window clause: PRECEDING (rows before), FOLLOWING (rows after), CURRENT ROW (the current row), UNBOUNDED (the partition boundary)
  • 5. ★ sequence functions in window functions: NTILE(n), row_number, rank, dense_rank
     

Brief introduction

This article mainly introduces Hive window functions. Window functions in Hive are similar to those in standard SQL; they are used for data analysis and are typically applied in OLAP scenarios.

Note: OVER() specifies the data window of the analysis function, i.e. the range of rows over which the aggregate is computed. This window may change from row to row, and the clauses described below can be used to restrict it.

Concept

We all know that SQL has a class of functions called aggregate functions, such as sum(), avg(), max(), etc. These functions aggregate multiple rows into one row according to some rule, so the number of rows after aggregation is usually smaller than before.

But sometimes we want to display both the pre-aggregation rows and the aggregated value at the same time. This is where window functions come in.

Before delving into the OVER clause, note that in SQL query processing window functions are evaluated almost last, just before the ORDER BY clause.

Data preparation

We prepare an order table with the fields name, orderdate and cost. The data is as follows:

jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94

Create a table t_window in Hive and insert the data into it.
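A minimal sketch of how the table could be created and loaded, assuming the sample data sits in a local comma-delimited text file (the column types and the file path are assumptions):

-- column types are assumptions; the data file is comma-delimited
create table t_window(
    name string,
    orderdate string,
    cost int
) row format delimited fields terminated by ',';

-- the local path is hypothetical; point it at wherever the sample file lives
load data local inpath '/tmp/t_window.txt' into table t_window;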

Example

Aggregate function + over

If we want to query the customers who made purchases in April 2015, together with a total count, we can do it with a window function:

select name,count(*) over ()
from t_window
where substring(orderdate,1,7) = '2015-04'
   

The results are as follows:

name    count_window_0
mart    5
mart    5
mart    5
mart    5
jack    5

As you can see, there were five purchase records in April 2015: four for mart and one for jack. In most cases, however, we only want the deduplicated result. There are two ways to achieve this.

First: distinct

select distinct name,count(*) over ()
from t_window
where substring(orderdate,1,7) = '2015-04'

Second: group by

select name,count(*) over ()
from t_window
where substring(orderdate,1,7) = '2015-04'
group by name

The result of the GROUP BY version is as follows (note that with DISTINCT the window function is still evaluated over all five detail rows, so it would return 5 rather than 2):

name count_window_0
mart 2
jack 2

partition by clause

The first thing to mention inside the OVER clause is PARTITION BY. The PARTITION BY clause, also called the query partition clause, is very similar to GROUP BY: it groups the data by the specified columns. The function before OVER is computed within each group, and when the rows cross into a new group the function is recalculated for that group.

Example

If we want to see each customer's purchase details together with the monthly total purchase amount, we can execute the following SQL:

select name,orderdate,cost,sum(cost) over(partition by month(orderdate))
from t_window

The results are as follows:

name    orderdate   cost    sum_window_0
jack    2015-01-01  10  205
jack    2015-01-08  55  205
tony    2015-01-07  50  205
jack    2015-01-05  46  205
tony    2015-01-04  29  205
tony    2015-01-02  15  205
jack    2015-02-03  23  23
mart    2015-04-13  94  341
jack    2015-04-06  42  341
mart    2015-04-11  75  341
mart    2015-04-09  68  341
mart    2015-04-08  62  341
neil    2015-05-10  12  12
neil    2015-06-12  80  80

As you can see, the cost has been summed up by month.
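PARTITION BY can also take several expressions. For example, to get each customer's own monthly total rather than the overall monthly total, we could partition by both name and month (a sketch; the alias is mine):

select name,orderdate,cost,
       sum(cost) over(partition by name, month(orderdate)) as name_month_total
from t_window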

order by clause

In the scenario above, suppose we want a running total of cost within each month. This is where the ORDER BY clause comes in.

The ORDER BY clause forces the input rows to be sorted (as mentioned earlier, the window function is about the last thing a SQL statement executes, so you can think of the query's result set as its input). ORDER BY is essential for functions such as row_number(), lead() and lag(), because if the data is unordered their results are meaningless. Also note that once ORDER BY is present, aggregates such as count() and min() no longer describe the whole group: by default they are computed over a running window from the start of the partition to the current row.

Let's add order by to the query above:

select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate )
from t_window

The results are as follows (with ORDER BY, the default window runs from the start of the partition to the current row, so the sum accumulates):

name    orderdate   cost    sum_window_0
jack    2015-01-01  10  10
tony    2015-01-02  15  25
tony    2015-01-04  29  54
jack    2015-01-05  46  100
tony    2015-01-07  50  150
jack    2015-01-08  55  205
jack    2015-02-03  23  23
jack    2015-04-06  42  42
mart    2015-04-08  62  104
mart    2015-04-09  68  172
mart    2015-04-11  75  247
mart    2015-04-13  94  341
neil    2015-05-10  12  12
neil    2015-06-12  80  80

window clause

We have already grouped the data with the PARTITION BY clause above. If we want finer-grained control over which rows take part in each calculation, we need the window clause.

We should first understand two defaults:
-If only the partition by clause is used and order by is not specified, the aggregation is over the whole group
-If order by is used but no window clause is given, the default window runs from the start of the partition to the current row

When one SELECT query contains several window functions, they do not affect each other; each window function applies its own rules.

window clause keywords:
-PRECEDING: rows before the current row
-FOLLOWING: rows after the current row
-CURRENT ROW: the current row
-UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means from the first row of the partition, UNBOUNDED FOLLOWING means to the last row of the partition

Below we partition by name, sort by purchase date, and accumulate cost, using the window clause in several ways:

select name,orderdate,cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- group by name: sum within each group
sum(cost) over(partition by name order by orderdate) as sample3, -- group by name: running total within each group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, -- same as sample3: from the start of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, -- previous row, current row and next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 -- current row and all following rows
from t_window;

The query results are as follows:

name    orderdate   cost    sample1 sample2 sample3 sample4 sample5 sample6 sample7
jack    2015-01-01  10  661 176 10  10  10  56  176
jack    2015-01-05  46  661 176 56  56  56  111 166
jack    2015-01-08  55  661 176 111 111 101 124 120
jack    2015-02-03  23  661 176 134 134 78  120 65
jack    2015-04-06  42  661 176 176 176 65  65  42
mart    2015-04-08  62  661 299 62  62  62  130 299
mart    2015-04-09  68  661 299 130 130 130 205 237
mart    2015-04-11  75  661 299 205 205 143 237 169
mart    2015-04-13  94  661 299 299 299 169 169 94
neil    2015-05-10  12  661 92  12  12  12  92  92
neil    2015-06-12  80  661 92  92  92  92  92  80
tony    2015-01-02  15  661 94  15  15  15  44  94
tony    2015-01-04  29  661 94  44  44  44  94  79
tony    2015-01-07  50  661 94  94  94  79  79  50
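The same ROWS BETWEEN frames work with other aggregates as well. For instance, a sketch of a three-row moving average of cost per customer (previous row, current row and next row; the alias is mine):

select name,orderdate,cost,
       avg(cost) over(partition by name order by orderdate
                      rows between 1 preceding and 1 following) as moving_avg
from t_window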

Sequence functions in window functions

Note that sequence functions do not support the window (ROWS BETWEEN) clause.

The sequence functions commonly used in hive are as follows:

NTILE

  • NTILE(n) divides the ordered rows of each group into n slices and returns the slice number of the current row

  • NTILE does not support ROWS BETWEEN;
    for example, ntile(2) over (partition by cookieid order by createtime rows between 3 preceding and current row) is not allowed

  • If the rows cannot be divided evenly, the earlier slices are given one extra row by default

When is this function useful? For example, if we want the first third of each customer's purchase records when ordered by cost, we can use it as follows:

select name,orderdate,cost,
       ntile(3) over() as sample1, -- slice all rows into 3 parts
       ntile(3) over(partition by name) as sample2, -- group by name: slice each group into 3 parts
       ntile(3) over(order by cost) as sample3, -- order all rows by cost ascending, slice into 3 parts
       ntile(3) over(partition by name order by cost) as sample4 -- group by name, order by cost ascending, slice each group into 3 parts
from t_window

The data obtained are as follows:

name    orderdate   cost    sample1 sample2 sample3 sample4
jack    2015-01-01  10  3   1   1   1
jack    2015-02-03  23  3   1   1   1
jack    2015-04-06  42  2   2   2   2
jack    2015-01-05  46  2   2   2   2
jack    2015-01-08  55  2   3   2   3
mart    2015-04-08  62  2   1   2   1
mart    2015-04-09  68  1   2   3   1
mart    2015-04-11  75  1   3   3   2
mart    2015-04-13  94  1   1   3   3
neil    2015-05-10  12  1   2   1   1
neil    2015-06-12  80  1   1   3   2
tony    2015-01-02  15  3   2   1   1
tony    2015-01-04  29  3   3   1   2
tony    2015-01-07  50  2   1   2   3

In the data above, the rows with sample4 = 1 are the result we want.
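Since a window function cannot be used directly in a WHERE clause, one way to keep only those rows is to filter in an outer query; a sketch:

select name,orderdate,cost
from (
    select name,orderdate,cost,
           ntile(3) over(partition by name order by cost) as sample4
    from t_window
) t
where sample4 = 1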

row_number,rank,dense_rank

These three window functions have many usage scenarios:
- row_number() generates a sequence of row numbers within the group, starting from 1. The values of row_number() never repeat; when the sort values are equal, rows are numbered in the order they appear in the table
- rank() generates the ranking of rows within the group; tied values get the same rank, and a gap appears in the ranking afterwards
- dense_rank() generates the ranking of rows within the group; tied values get the same rank, and no gap appears

**Note:
The difference between rank and dense_rank is whether a gap appears in the ranking after tied values**

Examples are as follows:

SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM lxw1234 
WHERE cookieid = 'cookie1';

cookieid day           pv       rn1     rn2     rn3 

cookie1 2015-04-12      7       1       1       1
cookie1 2015-04-11      5       2       2       2
cookie1 2015-04-15      4       3       3       3
cookie1 2015-04-16      4       3       3       4
cookie1 2015-04-13      3       5       4       5
cookie1 2015-04-14      2       6       5       6
cookie1 2015-04-10      1       7       6       7
rn1: 2015-04-15 and 2015-04-16 tie for rank 3, so 2015-04-13 gets rank 5 (a gap appears)
rn2: 2015-04-15 and 2015-04-16 tie for rank 3, and 2015-04-13 gets rank 4 (no gap)
rn3: row_number() always produces a unique sequence; when values tie, the order among the tied rows is arbitrary
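A typical application of row_number() is picking the top record per group. For example, a sketch that returns each customer's single most expensive order from t_window (the rn alias is mine):

select name,orderdate,cost
from (
    select name,orderdate,cost,
           row_number() over(partition by name order by cost desc) as rn
    from t_window
) t
where rn = 1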

LAG and LEAD functions

These two commonly used window functions return values from preceding and following rows.
Taking our order table as an example, if we want to look up each customer's previous purchase date, we can query it like this:

select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from t_window;
   

The queried data is:

name    orderdate   cost    time1   time2
jack    2015-01-01  10  1900-01-01  NULL
jack    2015-01-05  46  2015-01-01  NULL
jack    2015-01-08  55  2015-01-05  2015-01-01
jack    2015-02-03  23  2015-01-08  2015-01-05
jack    2015-04-06  42  2015-02-03  2015-01-08
mart    2015-04-08  62  1900-01-01  NULL
mart    2015-04-09  68  2015-04-08  NULL
mart    2015-04-11  75  2015-04-09  2015-04-08
mart    2015-04-13  94  2015-04-11  2015-04-09
neil    2015-05-10  12  1900-01-01  NULL
neil    2015-06-12  80  2015-05-10  NULL
tony    2015-01-02  15  1900-01-01  NULL
tony    2015-01-04  29  2015-01-02  NULL
tony    2015-01-07  50  2015-01-04  2015-01-02
   

time1 is computed within each name group, ordered ascending by orderdate, and takes the value from the previous row.

time2 takes the value from two rows earlier, also grouped by name and ordered ascending. Note that if the offset argument of lag is omitted it defaults to 1, and if no default value is supplied and the referenced row does not exist, NULL is returned.

The lead function works in the opposite direction of lag: it returns values from the following rows instead of the preceding ones, as sketched below.
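A sketch of the symmetric query with lead(), showing each customer's next purchase date (the '9999-99-99' default is only an assumption used to mark rows with no later purchase):

select name,orderdate,cost,
lead(orderdate,1,'9999-99-99') over(partition by name order by orderdate) as next_time1,
lead(orderdate,2) over(partition by name order by orderdate) as next_time2
from t_window;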

first_value and last_value

first_value returns, for the current row, the first value in the sorted group.
last_value returns the last value in the sorted group up to and including the current row.

select name,orderdate,cost,
first_value(orderdate) over(partition by name order by orderdate) as time1,
last_value(orderdate) over(partition by name order by orderdate) as time2
from t_window
   

The query results are as follows:

name    orderdate   cost    time1   time2
jack    2015-01-01  10  2015-01-01  2015-01-01
jack    2015-01-05  46  2015-01-01  2015-01-05
jack    2015-01-08  55  2015-01-01  2015-01-08
jack    2015-02-03  23  2015-01-01  2015-02-03
jack    2015-04-06  42  2015-01-01  2015-04-06
mart    2015-04-08  62  2015-04-08  2015-04-08
mart    2015-04-09  68  2015-04-08  2015-04-09
mart    2015-04-11  75  2015-04-08  2015-04-11
mart    2015-04-13  94  2015-04-08  2015-04-13
neil    2015-05-10  12  2015-05-10  2015-05-10
neil    2015-06-12  80  2015-05-10  2015-06-12
tony    2015-01-02  15  2015-01-02  2015-01-02
tony    2015-01-04  29  2015-01-02  2015-01-04
tony    2015-01-07  50  2015-01-02  2015-01-07
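Because the default window with ORDER BY ends at the current row, time2 above is simply each row's own orderdate. If what we actually want is the last order date of the whole group, we can widen the frame explicitly; a sketch:

select name,orderdate,cost,
last_value(orderdate) over(partition by name order by orderdate
                           rows between unbounded preceding and unbounded following) as last_in_group
from t_window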
