A Summary of Five Kinds of Window Functions
2022-07-22 10:40:00 【Energetic Wang Dapeng】
Window functions, also called OLAP functions, operate on a set of rows. Unlike a GROUP BY clause, they achieve grouped calculations without collapsing the rows, and a single row can fall into several windows and take part in several calculations.
For scenarios like running totals, cumulative aggregates, or "up to some point in time" calculations, reach for the OVER window clause first.
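As a minimal sketch of the difference (assuming a table t with columns taskid and pv), GROUP BY collapses each group into one output row, while the window version keeps every row and attaches the group-level result to each:
-- GROUP BY: one output row per taskid
SELECT taskid, SUM(pv) AS total_pv FROM t GROUP BY taskid;
-- Window function: every input row is kept, each carrying its group's total
SELECT taskid, pv, SUM(pv) OVER(PARTITION BY taskid) AS total_pv FROM t;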
1. Aggregation: sum, avg, max, min
Data
task1,2022-04-10,1
task1,2022-04-11,5
task1,2022-04-12,7
task1,2022-04-13,3
task1,2022-04-14,2
task1,2022-04-15,4
task1,2022-04-16,4
Cumulative calculation
SELECT taskid, createtime, pv,
-- pv1: cumulative pv within the group from the start to the current row; e.g. the 11th = pv of the 10th + pv of the 11th; the 12th = 10th + 11th + 12th
SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime) AS pv1,
-- pv2: same as pv1, with the frame written out explicitly
SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
-- pv3: sum of all pv within the group (task1); no ORDER BY, so the frame is the whole partition
SUM(pv) OVER(PARTITION BY taskid) AS pv3,
-- pv4: current row + 3 preceding rows; e.g. the 11th = 10th + 11th; the 12th = 10th + 11th + 12th; the 13th = 10th + 11th + 12th + 13th; the 14th = 11th + 12th + 13th + 14th
SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,
-- pv5: current row + 3 preceding rows + 1 following row; e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,
-- pv6: current row + all following rows; e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13; the 14th = 14th + 15th + 16th = 2+4+4 = 10
SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6
FROM hive_zjyprc_hadoop.tmp.task1
ORDER BY taskid, createtime;
Results:
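The original post showed the output as a screenshot; working through the sample data by hand gives:
taskid  createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
task1   2022-04-10  1   1    1    26   1    6    26
task1   2022-04-11  5   6    6    26   6    13   25
task1   2022-04-12  7   13   13   26   13   16   20
task1   2022-04-13  3   16   16   26   16   18   13
task1   2022-04-14  2   18   18   26   17   21   10
task1   2022-04-15  4   22   22   26   16   20   8
task1   2022-04-16  4   26   26   26   13   13   4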
Templates
The core question of windowed aggregation is: when a record is being calculated, which rows should be pulled into its window for the aggregate? Take pv4 as the example: SUM(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4
- First, determine which rows form a group, and in what order the rows appear within it
PARTITION BY taskid ORDER BY createtime: partition by taskid and sort by createtime. Partitioning can be understood as opening a window over each group; with an ORDER BY but no explicit frame, the default frame runs from the start of the partition to the current row (which is why pv1 and pv2 agree).
- Then, determine the length of the window
The window frame specifies, for the record currently being processed, which rows take part in that record's calculation.
The frame template is: ROWS BETWEEN A AND B
A has three possible forms:
- unbounded preceding: start from the first row of the current partition
- X preceding: start X rows before the current row, not counting the current row
- current row: the current row
B has three possible forms:
- unbounded following: extend to the last row of the partition
- X following: extend X rows after the current row
- current row: the current row
Looking at pv4: ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
When the row for 2022-04-13 is reached, the frame reaches back 3 rows and includes the current row, so the sum is 1+5+7+3 = 16.
- Finally, determine what calculation is performed over the window
The last step is the aggregate function itself: over the window you can take sum, avg, max, or min.
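As an extra sketch of the ROWS BETWEEN A AND B template, a 3-row centered moving average over the same task1 table would combine one preceding and one following row:
SELECT taskid, createtime, pv,
AVG(pv) OVER(PARTITION BY taskid ORDER BY createtime ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
FROM hive_zjyprc_hadoop.tmp.task1;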
2. Ranking: row_number, rank, dense_rank
When every row in the table needs to be ranked within its group, row_number, rank, and dense_rank are very useful.
Data
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
Row number
ROW_NUMBER() generates a sequence of numbers within a group, starting from 1 and following the sort order. For example:
- rank the days within each group by pv in descending order;
- fetch the top-ranked record in each group;
- fetch the first referer of a session.
SELECT taskid, createtime, pv,
ROW_NUMBER() OVER(PARTITION BY taskid ORDER BY pv desc) AS rn
FROM hive_zjyprc_hadoop.tmp.task1;
So if you need the top 3 of each group, simply filter on rn <= 3.
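A minimal sketch of that filter: window function results cannot be referenced directly in WHERE, so wrap the ranked query in a subquery:
SELECT taskid, createtime, pv
FROM (
    SELECT taskid, createtime, pv,
           ROW_NUMBER() OVER(PARTITION BY taskid ORDER BY pv DESC) AS rn
    FROM hive_zjyprc_hadoop.tmp.task1
) t
WHERE t.rn <= 3;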
rank and dense_rank
- RANK() generates the ranking of items within the group; when values tie, the tied rows share a rank and a gap is left after them, so later ranks jump (e.g. 3, 3, 5).
- DENSE_RANK() generates the ranking of items within the group; when values tie, the tied rows share a rank and no gap is left, so later ranks are unchanged (e.g. 3, 3, 4).
SELECT taskid, createtime, pv,
RANK() OVER(PARTITION BY taskid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY taskid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY taskid ORDER BY pv DESC) AS rn3
FROM hive_zjyprc_hadoop.tmp.task1
WHERE taskid = 'task1';
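Working through task1 by hand (the two rows with pv = 4 tie; row_number breaks the tie arbitrarily), the output would be:
createtime   pv   rn1   rn2   rn3
2022-04-12   7    1     1     1
2022-04-11   5    2     2     2
2022-04-15   4    3     3     3
2022-04-16   4    3     3     4
2022-04-13   3    5     4     5
2022-04-14   2    6     5     6
2022-04-10   1    7     6     7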
3. Slicing: ntile, cume_dist, percent_rank
Data
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
ntile
Syntax
NTILE(n) splits the rows of a group into n slices and returns the slice number of the current row. If the rows do not divide evenly, the earlier slices are given one extra row each by default.
NTILE does not support the frame syntax (ROWS BETWEEN); e.g. NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is invalid.
Example
SELECT taskid,createtime,pv,
-- split the group's rows into 2 slices
NTILE(2) OVER(PARTITION BY taskid ORDER BY createtime) AS rn1,
-- split the group's rows into 3 slices
NTILE(3) OVER(PARTITION BY taskid ORDER BY createtime) AS rn2,
-- split all rows (no PARTITION BY) into 4 slices, sorted ascending by default
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM hive_zjyprc_hadoop.tmp.task1
ORDER BY taskid,createtime;
ntile is well suited to top-percentage statistics.
Scenario: find the average usage duration of the top 60% of users by daily duration.
Use ntile to split users into five groups in descending order of duration; the groups numbered 1, 2, and 3 are the top 60% of users.
SELECT avg(duration)
FROM (
    SELECT duration, NTILE(5) OVER(ORDER BY duration DESC) AS ranking
    FROM table1
) a
WHERE a.ranking IN (1, 2, 3);
cume_dist
CUME_DIST: (number of rows less than or equal to the current value) / (total number of rows in the group).
Example
For the data above: for each person, compute the proportion of people whose salary is less than or equal to theirs.
SELECT dept, userid, sal,
round(CUME_DIST() OVER(ORDER BY sal), 2) AS rn1,
round(CUME_DIST() OVER(PARTITION BY dept ORDER BY sal desc), 2) AS rn2
FROM hive_zjyprc_hadoop.tmp.task1;
rn1: no PARTITION BY, so all rows form one group; the total row count is 5.
- Row 1: 3 rows have sal <= 3000, so 3/5 = 0.6
- Row 3: 1 row has sal <= 1000, so 1/5 = 0.2
rn2: partitioned by dept; dept=d1 has 3 rows.
- Row 2: the order is descending, so 2 rows have sal >= 2000, giving 2/3 = 0.67
For duplicate values, CUME_DIST uses the position of the last row among the duplicates (e.g. for values 1, 2, 2, 4, both rows with value 2 get 3/4 = 0.75).
Scenario: within each department, count the proportion of people whose salary is less than or equal to the current salary.
In practice, cume_dist can be used to measure how a value is distributed within the whole population, e.g.:
SELECT department, person, salary,
round(CUME_DIST() OVER(PARTITION BY department ORDER BY salary), 2) AS rn1
FROM tbl;
percent_rank
percent_rank differs from cume_dist only in how the distribution value is computed: (rank within the group - 1) / (total rows in the group - 1).
SELECT dept, userid, sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1, -- distribution over all rows
RANK() OVER(ORDER BY sal) AS rn11, -- RANK value over all rows
SUM(1) OVER(PARTITION BY NULL) AS rn12, -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM hive_zjyprc_hadoop.tmp.task1;
rn1: rn1 = (rn11 - 1) / (rn12 - 1)
- Row 1: (1-1)/(5-1) = 0/4 = 0
- Row 2: (2-1)/(5-1) = 1/4 = 0.25
- Row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: partitioned by dept; dept=d1 has 3 rows in total
- Row 1: (1-1)/(3-1) = 0
- Row 3: (3-1)/(3-1) = 1
4. Grouping: grouping sets, grouping__id, cube, rollup
These constructs are usually used in OLAP scenarios, for metrics that cannot simply be accumulated (such as distinct counts) and that must be reported while drilling up and down across dimensions, e.g. counting UV by hour, by day, and by month.
Data
2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1
grouping sets
Within a single GROUP BY query, aggregate by several different dimension combinations at once; it is equivalent to UNION ALL of the result sets of the individual GROUP BY queries.
SELECT month, day,
COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM hive_zjyprc_hadoop.tmp.task4
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
-- equivalent to separate GROUP BY queries combined with UNION ALL
ORDER BY GROUPING__ID;
GROUPING__ID indicates which grouping set each result row belongs to.
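For reference, a hand-written sketch of the equivalent UNION ALL form of the query above (in some Hive versions the bare NULL literals may need an explicit CAST(NULL AS STRING)):
SELECT month, NULL AS day, COUNT(DISTINCT cookieid) AS uv FROM hive_zjyprc_hadoop.tmp.task4 GROUP BY month
UNION ALL
SELECT NULL AS month, day, COUNT(DISTINCT cookieid) AS uv FROM hive_zjyprc_hadoop.tmp.task4 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv FROM hive_zjyprc_hadoop.tmp.task4 GROUP BY month, day;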
cube
Aggregate over all possible combinations of the GROUP BY dimensions.
SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM hive_zjyprc_hadoop.tmp.task4
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
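With two dimensions, CUBE covers all four combinations: (month, day), (month), (day), and the grand total (). A sketch of the equivalent GROUPING SETS form:
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM hive_zjyprc_hadoop.tmp.task4
GROUP BY month, day
GROUPING SETS ((month, day), month, day, ())
ORDER BY GROUPING__ID;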
rollup
ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension. For example, to aggregate hierarchically by month, the SQL is:
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM hive_zjyprc_hadoop.tmp.task4
GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID;
The result rolls up as: daily UV -> monthly UV -> total UV. ROLLUP on (month, day) is equivalent to GROUPING SETS ((month, day), (month), ()).
WITH ROLLUP produces rows containing NULL: in each such row, NULL marks the dimension that has been rolled up (aggregated away) for that grouping set.
5. Taking preceding or following values: lag, lead, first_value, last_value
Data
Peter,2015-10-12 01:10:00,url1
Peter,2015-10-12 01:15:10,url2
Peter,2015-10-12 01:16:40,url3
Peter,2015-10-12 02:13:00,url4
Peter,2015-10-12 03:14:30,url5
Marry,2015-11-12 01:10:00,url1
Marry,2015-11-12 01:15:10,url2
Marry,2015-11-12 01:16:40,url3
Marry,2015-11-12 02:13:00,url4
Marry,2015-11-12 03:14:30,url5
lag and lead
Syntax
These two functions return, as a separate column in the same query, the value of a field N rows before the current row (LAG) or N rows after it (LEAD).
- LAG(col, n, DEFAULT) fetches the value n rows above the current row within the window.
The first parameter is the column name;
the second parameter is how many rows up (optional, default 1);
the third parameter is a default value (used when the value n rows above is NULL; if unspecified, NULL is returned).
- LEAD(col, n, DEFAULT) fetches the value n rows below the current row within the window.
The first parameter is the column name;
the second parameter is how many rows down (optional, default 1);
the third parameter is a default value (used when the value n rows below is NULL; if unspecified, NULL is returned).
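A short sketch of both parameter forms against the task5 table; the '1970-01-01 00:00:00' default is an arbitrary sentinel chosen here for illustration:
SELECT cookieid, createtime, url,
LAG(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM hive_zjyprc_hadoop.tmp.task5;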
Scenario: compute how long each user stays on each page, and the total dwell time of each page.
Approach:
- First get the start and end time of a user's stay on a page (the end time is the next record's start time):
select cookieid,
createtime stime,
lead(createtime) over(partition by cookieid order by createtime) etime,
url
from hive_zjyprc_hadoop.tmp.task5;
- Compute the interval the user stays on the page:
SELECT cookieid
,stime
,etime
,UNIX_TIMESTAMP(etime,'yyyy-MM-dd HH:mm:ss')- UNIX_TIMESTAMP(stime,'yyyy-MM-dd HH:mm:ss') AS period
,url
FROM
(
SELECT cookieid
,createtime stime
,lead(createtime) over(partition by cookieid ORDER BY createtime) etime
,url
FROM hive_zjyprc_hadoop.tmp.task5
) temp; -- Hive requires an alias on the derived table
- Compute each user's total dwell time on each page, and each page's total dwell time (via ROLLUP):
SELECT url
,cookieid
,sum(period)
FROM
(
SELECT cookieid
,createtime stime
,lead(createtime) over(partition by cookieid ORDER BY createtime) etime
,UNIX_TIMESTAMP(lead(createtime) over(partition by cookieid ORDER BY createtime),'yyyy-MM-dd HH:mm:ss')- UNIX_TIMESTAMP(createtime,'yyyy-MM-dd HH:mm:ss') AS period
,url
FROM hive_zjyprc_hadoop.tmp.task5
) temp
group by url, cookieid with ROLLUP;
first_value and last_value
Syntax
first_value: after sorting within the group, the first value up to the current row.
last_value: after sorting within the group, the last value up to the current row.
In general the two are interchangeable: taking last_value over a group sorted in ascending order is equivalent to sorting the group in descending order and taking first_value.
Scenario: find the first url each user visited.
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM hive_zjyprc_hadoop.tmp.task5;
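Applying the reverse-order trick from above, a sketch for fetching each user's most recently visited url (preferable to LAST_VALUE, whose default frame only extends to the current row):
SELECT cookieid, createtime, url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM hive_zjyprc_hadoop.tmp.task5;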