聊聊Hive常見的分析函數(shù)
本文轉載自微信公眾號「大數(shù)據(jù)技術與數(shù)倉」,作者西貝。轉載本文請聯(lián)系大數(shù)據(jù)技術與數(shù)倉公眾號。
1.基本語法
- Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
- [<window_expression>])
Function (arg1,..., argn) 可以是下面的四類函數(shù):
- Aggregate Functions: 聚合函數(shù),比如:sum(...)、 max(...)、min(...)、avg(...)等
- Sort Functions: 數(shù)據(jù)排序函數(shù), 比如 :rank(...)、row_number(...)等
- Analytics Functions: 統(tǒng)計和比較函數(shù), 比如:lead(...)、lag(...)、 first_value(...)等
2.數(shù)據(jù)準備
樣例數(shù)據(jù)
- [職工姓名|部門編號|職工ID|工資|崗位類型|入職時間]
- Michael|1000|100|5000|full|2014-01-29
- Will|1000|101|4000|full|2013-10-02
- Wendy|1000|101|4000|part|2014-10-02
- Steven|1000|102|6400|part|2012-11-03
- Lucy|1000|103|5500|full|2010-01-03
- Lily|1001|104|5000|part|2014-11-29
- Jess|1001|105|6000|part|2014-12-02
- Mike|1001|106|6400|part|2013-11-03
- Wei|1002|107|7000|part|2010-04-03
- Yun|1002|108|5500|full|2014-01-29
- Richard|1002|109|8000|full|2013-09-01
建表語句:
- CREATE TABLE IF NOT EXISTS employee (
- name string,
- dept_num int,
- employee_id int,
- salary int,
- type string,
- start_date date
- )
- ROW FORMAT DELIMITED
- FIELDS TERMINATED BY '|'
- STORED as TEXTFILE;
加載數(shù)據(jù)
- load data local inpath '/opt/datas/data/employee_contract.txt' into table employee;
3.窗口聚合函數(shù)
(1)查詢姓名、部門編號、工資以及部門人數(shù)
- select
- name,
- dept_num as deptno ,
- salary,
- count(*) over (partition by dept_num) as cnt
- from employee ;
結果輸出:
- name deptno salary cnt
- Lucy 1000 5500 5
- Steven 1000 6400 5
- Wendy 1000 4000 5
- Will 1000 4000 5
- Michael 1000 5000 5
- Mike 1001 6400 3
- Jess 1001 6000 3
- Lily 1001 5000 3
- Richard 1002 8000 3
- Yun 1002 5500 3
- Wei 1002 7000 3
(2)查詢姓名、部門編號、工資以及每個部門的總工資,部門總工資按照降序輸出
- select
- name ,
- dept_num as deptno,
- salary,
- sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary
- from employee
- order by sum_dept_salary desc;
結果輸出:
- name deptno salary sum_dept_salary
- Michael 1000 5000 24900
- Will 1000 4000 24900
- Wendy 1000 4000 24900
- Steven 1000 6400 24900
- Lucy 1000 5500 24900
- Wei 1002 7000 20500
- Yun 1002 5500 20500
- Richard 1002 8000 20500
- Lily 1001 5000 17400
- Jess 1001 6000 17400
- Mike 1001 6400 17400
4.窗口排序函數(shù)
簡介
窗口排序函數(shù)提供了數(shù)據(jù)的排序信息,比如行號和排名。在一個分組的內部將行號或者排名作為數(shù)據(jù)的一部分進行返回,最常用的排序函數(shù)主要包括:
row_number:根據(jù)具體的分組和排序,為每行數(shù)據(jù)生成一個起始值等于1的唯一序列數(shù)
rank:對組中的數(shù)據(jù)進行排名,如果名次相同,則排名也相同,但是下一個名次的排名序號會出現(xiàn)不連續(xù)。比如查找具體條件的topN行
dense_rank:dense_rank函數(shù)的功能與rank函數(shù)類似,dense_rank函數(shù)在生成序號時是連續(xù)的,而rank函數(shù)生成的序號有可能不連續(xù)。當出現(xiàn)名次相同時,則排名序號也相同。而下一個排名的序號與上一個排名序號是連續(xù)的。
percent_rank:排名計算公式為:(current rank - 1)/(total number of rows - 1)
ntile:將一個有序的數(shù)據(jù)集劃分為多個桶(bucket),并為每行分配一個適當?shù)耐皵?shù)。它可用于將數(shù)據(jù)劃分為相等的小切片,為每一行分配該小切片的數(shù)字序號。
(1)查詢姓名、部門編號、工資、排名編號(按工資的多少排名)
- select
- name ,
- dept_num as dept_no ,
- salary,
- row_number() over (order by salary desc ) rnum
- from employee;
結果輸出:
- name dept_no salary rnum
- Richard 1002 8000 1
- Wei 1002 7000 2
- Mike 1001 6400 3
- Steven 1000 6400 4
- Jess 1001 6000 5
- Yun 1002 5500 6
- Lucy 1000 5500 7
- Lily 1001 5000 8
- Michael 1000 5000 9
- Wendy 1000 4000 10
- Will 1000 4000 11
(2)查詢每個部門工資最高的兩個人的信息(姓名、部門、薪水)
- select
- name,
- dept_num,
- salary
- from
- (
- select name ,
- dept_num ,
- salary,
- row_number() over (partition by dept_num order by salary desc ) rnum
- from employee) t1
- where rnum <= 2;
結果輸出:
- name dept_num salary
- Steven 1000 6400
- Lucy 1000 5500
- Mike 1001 6400
- Jess 1001 6000
- Richard 1002 8000
- Wei 1002 7000
(3)查詢每個部門的員工工資排名信息
- select
- name ,
- dept_num as dept_no ,
- salary,row_number() over (partition by dept_num order by salary desc ) rnum
- from employee;
結果輸出:
- name dept_no salary rnum
- Steven 1000 6400 1
- Lucy 1000 5500 2
- Michael 1000 5000 3
- Wendy 1000 4000 4
- Will 1000 4000 5
- Mike 1001 6400 1
- Jess 1001 6000 2
- Lily 1001 5000 3
- Richard 1002 8000 1
- Wei 1002 7000 2
- Yun 1002 5500 3
(4)使用rank函數(shù)進行排名
- select
- name,
- dept_num,
- salary,
- rank() over (order by salary desc) rank
- from employee;
結果輸出:
- name dept_num salary rank
- Richard 1002 8000 1
- Wei 1002 7000 2
- Mike 1001 6400 3
- Steven 1000 6400 3
- Jess 1001 6000 5
- Yun 1002 5500 6
- Lucy 1000 5500 6
- Lily 1001 5000 8
- Michael 1000 5000 8
- Wendy 1000 4000 10
- Will 1000 4000 10
(5)使用dense_rank進行排名
- select
- name,
- dept_num,
- salary,
- dense_rank() over (order by salary desc) rank
- from employee;
結果輸出:
- name dept_num salary rank
- Richard 1002 8000 1
- Wei 1002 7000 2
- Mike 1001 6400 3
- Steven 1000 6400 3
- Jess 1001 6000 4
- Yun 1002 5500 5
- Lucy 1000 5500 5
- Lily 1001 5000 6
- Michael 1000 5000 6
- Wendy 1000 4000 7
- Will 1000 4000 7
(6)使用percent_rank()進行排名
- select
- name,
- dept_num,
- salary,
- percent_rank() over (order by salary desc) rank
- from employee;
結果輸出:
- name dept_num salary rank
- Richard 1002 8000 0.0
- Wei 1002 7000 0.1
- Mike 1001 6400 0.2
- Steven 1000 6400 0.2
- Jess 1001 6000 0.4
- Yun 1002 5500 0.5
- Lucy 1000 5500 0.5
- Lily 1001 5000 0.7
- Michael 1000 5000 0.7
- Wendy 1000 4000 0.9
- Will 1000 4000 0.9
(7)使用ntile進行數(shù)據(jù)分片排名
- SELECT
- name,
- dept_num as deptno,
- salary,
- ntile(4) OVER(ORDER BY salary desc) as ntile
- FROM employee;
結果輸出:
- name deptno salary ntile
- Richard 1002 8000 1
- Wei 1002 7000 1
- Mike 1001 6400 1
- Steven 1000 6400 2
- Jess 1001 6000 2
- Yun 1002 5500 2
- Lucy 1000 5500 3
- Lily 1001 5000 3
- Michael 1000 5000 3
- Wendy 1000 4000 4
- Will 1000 4000 4
從 Hive v2.1.0開始, 支持在OVER語句里使用聚集函數(shù),比如
- SELECT
- dept_num,
- row_number() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
- FROM employee
- GROUP BY dept_num;
結果輸出:
- dept_num rk
- 1000 1
- 1001 1
- 1002 1
5.窗口分析函數(shù)
常用的分析函數(shù)主要包括:
- cume_dist
如果按升序排列,則統(tǒng)計:小于等于當前值的行數(shù)/總行數(shù)(number of rows ≤ current row)/(total number of rows)。如果是降序排列,則統(tǒng)計:大于等于當前值的行數(shù)/總行數(shù)。比如,統(tǒng)計小于等于當前工資的人數(shù)占總人數(shù)的比例 ,用于累計統(tǒng)計。
- lead(value_expr[,offset[,default]])
用于統(tǒng)計窗口內往下第n行值。第一個參數(shù)為列名,第二個參數(shù)為往下第n行(可選,默認為1),第三個參數(shù)為默認值(當往下第n行為NULL時候,取默認值,如不指定,則為NULL
- lag(value_expr[,offset[,default]]):
與lead相反,用于統(tǒng)計窗口內往上第n行值。第一個參數(shù)為列名,第二個參數(shù)為往上第n行(可選,默認為1),第三個參數(shù)為默認值(當往上第n行為NULL時候,取默認值,如不指定,則為NULL)
- first_value: 取分組內排序后,截止到當前行,第一個值
- last_value
取分組內排序后,截止到當前行,最后一個值
- (1)統(tǒng)計小于等于當前工資的人數(shù)占總人數(shù)的比例
- SELECT
- name,
- dept_num as deptno,
- salary,
- cume_dist() OVER (ORDER BY salary) as cume
- FROM employee;
結果輸出:
- name deptno salary cume
- Wendy 1000 4000 0.18181818181818182
- Will 1000 4000 0.18181818181818182
- Lily 1001 5000 0.36363636363636365
- Michael 1000 5000 0.36363636363636365
- Yun 1002 5500 0.5454545454545454
- Lucy 1000 5500 0.5454545454545454
- Jess 1001 6000 0.6363636363636364
- Mike 1001 6400 0.8181818181818182
- Steven 1000 6400 0.8181818181818182
- Wei 1002 7000 0.9090909090909091
- Richard 1002 8000 1.0
(2)統(tǒng)計大于等于當前工資的人數(shù)占總人數(shù)的比例
- SELECT
- name,
- dept_num as deptno,
- salary,
- cume_dist() OVER (ORDER BY salary desc) as cume
- FROM employee;
結果輸出:
- name deptno salary cume
- Richard 1002 8000 0.09090909090909091
- Wei 1002 7000 0.18181818181818182
- Mike 1001 6400 0.36363636363636365
- Steven 1000 6400 0.36363636363636365
- Jess 1001 6000 0.45454545454545453
- Yun 1002 5500 0.6363636363636364
- Lucy 1000 5500 0.6363636363636364
- Lily 1001 5000 0.8181818181818182
- Michael 1000 5000 0.8181818181818182
- Wendy 1000 4000 1.0
- Will 1000 4000 1.0
(3)按照部門統(tǒng)計小于等于當前工資的人數(shù)占部門總人數(shù)的比例
- SELECT
- name,
- dept_num as deptno,
- salary,
- cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
- FROM employee;
結果輸出:
- name deptno salary cume
- Wendy 1000 4000 0.4
- Will 1000 4000 0.4
- Michael 1000 5000 0.6
- Lucy 1000 5500 0.8
- Steven 1000 6400 1.0
- Lily 1001 5000 0.3333333333333333
- Jess 1001 6000 0.6666666666666666
- Mike 1001 6400 1.0
- Yun 1002 5500 0.3333333333333333
- Wei 1002 7000 0.6666666666666666
- Richard 1002 8000 1.0
(4)按部門分組,統(tǒng)計每個部門員工的工資以及大于等于該員工工資的下一個員工的工資
- SELECT
- name,
- dept_num as deptno,
- salary,
- lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
- FROM employee;
結果輸出:
- name deptno salary lead
- Wendy 1000 4000 4000
- Will 1000 4000 5000
- Michael 1000 5000 5500
- Lucy 1000 5500 6400
- Steven 1000 6400 NULL
- Lily 1001 5000 6000
- Jess 1001 6000 6400
- Mike 1001 6400 NULL
- Yun 1002 5500 7000
- Wei 1002 7000 8000
- Richard 1002 8000 NULL
(5)按部門分組,統(tǒng)計每個部門員工的工資以及小于等于該員工工資的上一個員工的工資
- SELECT
- name,
- dept_num as deptno,
- salary,
- lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
- FROM employee;
結果輸出:
- name deptno salary lead
- Wendy 1000 4000 NULL
- Will 1000 4000 4000
- Michael 1000 5000 4000
- Lucy 1000 5500 5000
- Steven 1000 6400 5500
- Lily 1001 5000 NULL
- Jess 1001 6000 5000
- Mike 1001 6400 6000
- Yun 1002 5500 NULL
- Wei 1002 7000 5500
- Richard 1002 8000 7000
(6)按部門分組,統(tǒng)計每個部門員工工資以及該部門最低的員工工資
- SELECT
- name,
- dept_num as deptno,
- salary,
- first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
- FROM employee;
結果輸出:
- name deptno salary fval
- Wendy 1000 4000 4000
- Will 1000 4000 4000
- Michael 1000 5000 4000
- Lucy 1000 5500 4000
- Steven 1000 6400 4000
- Lily 1001 5000 5000
- Jess 1001 6000 5000
- Mike 1001 6400 5000
- Yun 1002 5500 5500
- Wei 1002 7000 5500
- Richard 1002 8000 5500
(7)按部門分組,統(tǒng)計每個部門員工工資以及該部門最高的員工工資
- SELECT
- name,
- dept_num as deptno,
- salary,
- last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
- BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
- FROM employee;
結果輸出:
- name deptno salary lval
- Wendy 1000 4000 6400
- Will 1000 4000 6400
- Michael 1000 5000 6400
- Lucy 1000 5500 6400
- Steven 1000 6400 6400
- Lily 1001 5000 6400
- Jess 1001 6000 6400
- Mike 1001 6400 6400
- Yun 1002 5500 8000
- Wei 1002 7000 8000
- Richard 1002 8000 8000
注意:last_value默認的窗口是RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,表示當前行永遠是最后一個值,需改成RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。
img
- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
為默認值,即當指定了ORDER BY從句,而省略了window從句 ,表示從開始到當前行。
- RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
表示從當前行到最后一行
- RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
表示所有行
- n PRECEDING m FOLLOWING
表示窗口的范圍是:[(當前行的行數(shù))- n, (當前行的行數(shù))+m]