高頻HIVE-SQL筆試題詳解
本文轉(zhuǎn)載自微信公眾號(hào)「數(shù)師兄」,作者數(shù)師兄。轉(zhuǎn)載本文請(qǐng)聯(lián)系數(shù)師兄公眾號(hào)。
在上期的SQL題目中,我們以求連續(xù)活躍為例,講述了lead 、lag錯(cuò)位相減的用法。但若不采用錯(cuò)位相減的話,其實(shí)也可以實(shí)現(xiàn)這一目的。本次我們就先講述一下最大連續(xù)活躍天數(shù)的第二種解法,然后再講述另一常見(jiàn)面試SQL題目。
這一方法就是利用row_number給每一行生成一個(gè)連續(xù)的序號(hào),這樣,在連續(xù)活躍的時(shí)間段,就有了一個(gè)對(duì)照值,利用日期和行號(hào)這兩者就可以得到一個(gè)差值,在連續(xù)活躍的時(shí)段內(nèi),其差值是相同的,那么我們按照這個(gè)差值進(jìn)行分組就可以進(jìn)一步計(jì)算本次連續(xù)活躍有多少天了。
SQL如下
- WITH active AS
- (SELECT 100 UID,
- '2021-04-01'dt
- UNION ALL SELECT 101 UID,
- '2021-04-01'dt
- UNION ALL SELECT 102 UID,
- '2021-04-01'dt
- UNION ALL SELECT 103 UID,
- '2021-04-01'dt
- UNION ALL SELECT 100 UID,
- '2021-04-02'dt
- UNION ALL SELECT 101 UID,
- '2021-04-02'dt
- UNION ALL SELECT 102 UID,
- '2021-04-02'dt
- UNION ALL SELECT 103 UID,
- '2021-04-02'dt
- UNION ALL SELECT 104 UID,
- '2021-04-02'dt
- UNION ALL SELECT 100 UID,
- '2021-04-03'dt
- UNION ALL SELECT 104 UID,
- '2021-04-03'dt
- UNION ALL SELECT 101 UID,
- '2021-04-04'dt
- UNION ALL SELECT 102 UID,
- '2021-04-04'dt
- UNION ALL SELECT 103 UID,
- '2021-04-04'dt
- UNION ALL SELECT 104 UID,
- '2021-04-04'dt
- UNION ALL SELECT 105 UID,
- '2021-04-04'dt
- UNION ALL SELECT 102 UID,
- '2021-04-03'dt)
- SELECT UID,
- max(days)days
- FROM
- (SELECT date_sub(dt,rn),
- UID,
- count(DISTINCT dt)days
- FROM
- (SELECT UID,
- dt,
- row_number()over(PARTITION BY UID
- ORDER BY dt)rn
- FROM active)x
- GROUP BY date_sub(dt,rn),
- UID)y
- GROUP BY UID
結(jié)果如下:
至此,連續(xù)活躍的問(wèn)題就告一段落了。下面我們來(lái)看另一個(gè)常見(jiàn)問(wèn)題:
2.行列轉(zhuǎn)換
(1)某直播產(chǎn)品,直播記錄表lives中有l(wèi)iveid(直播id)、invite_list(邀約名單,string,用戶(hù)id以“,”分割)字段,直播觀看記錄表record中有uid(用戶(hù)id)、liveid(直播id)、duration(觀看時(shí)長(zhǎng)),試求出邀約名單中所有用戶(hù)的觀看時(shí)長(zhǎng)
由于存在邀約用戶(hù)未進(jìn)房觀看的問(wèn)題,所以我們直接從觀看記錄中來(lái)取,不一定能將名單中所有用戶(hù)包含進(jìn)去,所以,我們需要以直播記錄中的邀約名單為準(zhǔn)。這就需要利用explode和lateral view來(lái)將這個(gè)string類(lèi)型轉(zhuǎn)換為行記錄的形式。
仍然是使用with構(gòu)造lives和record兩個(gè)簡(jiǎn)單的臨時(shí)表,詳細(xì)SQL如下:
- with lives as(
- select 100 liveid,'A01,A02,A03,A04'invite_list union all
- select 101 liveid,'A05,A06,A07,A08'invite_list
- ),
- record as(
- select 100 liveid,'A01' uid,30 duration union all
- select 100 liveid,'A02' uid,50 duration union all
- select 100 liveid,'A03' uid,15 duration union all
- select 101 liveid,'A07' uid,20 duration union all
- select 101 liveid,'A08' uid,60 duration
- )
- select
- list.liveid,list.ulist,if(record.duration is null,0,record.duration)duration
- from (
- select liveid,ulist
- from lives
- lateral view explode(split(invite_list,',')) uid as ulist
- )list left join record on list.liveid=record.liveid and list.ulist=record.uid
這里簡(jiǎn)單說(shuō)明下:lateral view 是為原表調(diào)用explode函數(shù),將一個(gè)array或者map的字段分解成多行并作為一個(gè)臨時(shí)表,然后再與其他字段組合到一起。
(2)某APP用戶(hù)訪問(wèn)頁(yè)面記錄record,含有字段uid(用戶(hù)id)、pageid(頁(yè)面id),將每個(gè)用戶(hù)的訪問(wèn)路徑輸出在一個(gè)字段中
前一道題目是列轉(zhuǎn)行,這道題是行專(zhuān)列。利用collect_set可以將分組下面某個(gè)字段的所有記錄聚合成一個(gè)列表,然后可以進(jìn)一步使用concat_ws函數(shù)把a(bǔ)rray轉(zhuǎn)為字符串,并用“>”將其連接起來(lái)。
- WITH record as
- (SELECT 100 UID,'A01' pageid
- UNION ALL SELECT 100 UID,'A02' pageid
- UNION ALL SELECT 100 UID,'A03' pageid
- UNION ALL SELECT 101 UID,'A07' pageid
- UNION ALL SELECT 101 UID,'A08' pageid)
- SELECT UID,
- concat_ws('>',collect_list(pageid))page
- FROM record
- GROUP BY UID
3.求留存
對(duì)于多數(shù)互聯(lián)網(wǎng)公司來(lái)說(shuō),任何時(shí)候留存率都是重中之重。所以在數(shù)據(jù)分析的面試當(dāng)中,求留存率的SQL題目出現(xiàn)的頻率也是極高。
所謂“留存率”,通常是定義一個(gè)初始行為與一個(gè)后繼行為,符合初始條件的用戶(hù)量在經(jīng)過(guò)N天之后,剩余的比例就是留存率。發(fā)散來(lái)講,回訪、復(fù)看等概念也與留存相類(lèi)似。如:
某APP新注冊(cè)用戶(hù)表user,含有字段uid(用戶(hù)id)、dt(注冊(cè)日期),活躍表active,有字段uid(用戶(hù)id),dt(活躍日期),求每天新增用戶(hù)的次日留存率、次2日留存率、次3日留存率……次7日留存率
此題目就是以新注冊(cè)作為初始行為,活躍作為后繼行為。顯而易見(jiàn)地,我們可以將初始行為與后繼行為做左連接,然后按注冊(cè)日期進(jìn)行聚合,求相應(yīng)的新增用戶(hù)和次N日留存人數(shù)。
- WITH user as
- (SELECT 100 UID,'2021-01-01' dt
- UNION ALL SELECT 101 UID,'2021-01-01' dt
- UNION ALL SELECT 102 UID,'2021-01-01' dt
- UNION ALL SELECT 103 UID,'2021-01-02' dt
- UNION ALL SELECT 104 UID,'2021-01-02' dt
- UNION ALL SELECT 105 UID,'2021-01-02' dt
- UNION ALL SELECT 106 UID,'2021-01-02' dt),
- active as(
- select 100 UID,'2021-01-01'dt union all
- select 100 UID,'2021-01-02'dt union all
- select 100 UID,'2021-01-03'dt union all
- select 100 UID,'2021-01-05'dt union all
- select 100 UID,'2021-01-07'dt union all
- select 101 UID,'2021-01-01'dt union all
- select 101 UID,'2021-01-07'dt union all
- select 102 UID,'2021-01-01'dt union all
- select 103 UID,'2021-01-01'dt union all
- select 103 UID,'2021-01-02'dt union all
- select 103 UID,'2021-01-03'dt union all
- select 103 UID,'2021-01-05'dt union all
- select 104 UID,'2021-01-02'dt union all
- select 104 UID,'2021-01-03'dt union all
- select 104 UID,'2021-01-04'dt union all
- select 105 UID,'2021-01-02'dt union all
- select 105 UID,'2021-01-03'dt union all
- select 105 UID,'2021-01-04'dt union all
- select 105 UID,'2021-01-05'dt union all
- select 105 UID,'2021-01-07'dt union all
- select 106 UID,'2021-01-02'dt union all
- select 106 UID,'2021-01-04'dt
- )
- select
- user.dt,
- count(distinct user.uid)new_user,
- count(distinct if(datediff(active.dt,user.dt)=1,active.uid,null))retain1,
- count(distinct if(datediff(active.dt,user.dt)=2,active.uid,null))retain2,
- count(distinct if(datediff(active.dt,user.dt)=3,active.uid,null))retain3,
- count(distinct if(datediff(active.dt,user.dt)=4,active.uid,null))retain4,
- count(distinct if(datediff(active.dt,user.dt)=5,active.uid,null))retain5,
- count(distinct if(datediff(active.dt,user.dt)=6,active.uid,null))retain6,
- count(distinct if(datediff(active.dt,user.dt)=7,active.uid,null))retain7
- from user left join active on user.uid=active.uid
- group by user.dt
其結(jié)果如下:
這種方法可能比較麻煩一些,得把要求的每一個(gè)留存指標(biāo)都要單獨(dú)寫(xiě)一個(gè)字段。若想省事兒,可以單獨(dú)求出每日的新增數(shù)量,再將新增表與活躍表連接而求出每個(gè)注冊(cè)日期在后續(xù)每天的活躍數(shù)量,然后將新增數(shù)量與活躍數(shù)量再進(jìn)行關(guān)聯(lián),從而求得任意周期的留存。