自拍偷在线精品自拍偷,亚洲欧美中文日韩v在线观看不卡

五個(gè)最難的SQL任務(wù):數(shù)據(jù)缺失、透視、獨(dú)熱編碼

數(shù)據(jù)庫(kù)
本文分享一些最難以在SQL中學(xué)習(xí)和執(zhí)行的具體轉(zhuǎn)換,并提供實(shí)際所需的SQL代碼。

許多人體驗(yàn)到了云數(shù)據(jù)倉(cāng)庫(kù)內(nèi)集中計(jì)算的速度和效率優(yōu)勢(shì),但同時(shí)大家也認(rèn)識(shí)到這種方法存在一些缺點(diǎn)。缺點(diǎn)之一是需要學(xué)習(xí)和執(zhí)行不同語言(尤其是SQL)的查詢,這增加了很多復(fù)雜性。

為了解決這個(gè)問題,本文分享一些最難以在SQL中學(xué)習(xí)和執(zhí)行的具體轉(zhuǎn)換,并提供實(shí)際所需的SQL代碼。

1. Datespine

Datespine是生成日期索引的轉(zhuǎn)換。

想象下,您正在分析每日銷售數(shù)據(jù),表格如下:

因16日和17日沒有銷售,對(duì)應(yīng)的行缺失。如果想計(jì)算平均每日銷售額或構(gòu)建時(shí)間序列預(yù)測(cè)模型,那這種格式是個(gè)問題,我們需要插入缺失日期的行。

基本概念:

  • 生成或選擇唯一日期
  • 生成或選擇唯一產(chǎn)品
  • 交叉連接(笛卡爾積)1&2的所有組合
  • 將第3步的結(jié)果與原始數(shù)據(jù)進(jìn)行外連接
WITH GLOBAL_SPINE AS (
  SELECT 
    ROW_NUMBER() OVER (
      ORDER BY 
        NULL
    ) as INTERVAL_ID, 
    DATEADD(
      'day', 
      (INTERVAL_ID - 1), 
      '2020-01-01T00:00' :: timestamp_ntz
    ) as SPINE_START, 
    DATEADD(
      'day', INTERVAL_ID, '2020-01-01T00:00' :: timestamp_ntz
    ) as SPINE_END 
  FROM 
    TABLE (
      GENERATOR(ROWCOUNT => 1097)
    )
), 
GROUPS AS (
  SELECT 
    product, 
    MIN(sales_date) AS LOCAL_START, 
    MAX(sales_date) AS LOCAL_END 
  FROM 
    My_First_Table 
  GROUP BY 
    product
), 
GROUP_SPINE AS (
  SELECT 
    product, 
    SPINE_START AS GROUP_START, 
    SPINE_END AS GROUP_END 
  FROM 
    GROUPS G CROSS 
    JOIN LATERAL (
      SELECT 
        SPINE_START, 
        SPINE_END 
      FROM 
        GLOBAL_SPINE S 
      WHERE 
        S.SPINE_START >= G.LOCAL_START
    )
) 
SELECT 
  G.product AS GROUP_BY_product, 
  GROUP_START, 
  GROUP_END, 
  T.* 
FROM 
  GROUP_SPINE G 
  LEFT JOIN My_First_Table T ON sales_date >= G.GROUP_START 
  AND sales_date < G.GROUP_END 
  AND G.product = T.product;

最終結(jié)果如下所示:

2. Pivot

Pivot是一種將行數(shù)據(jù)轉(zhuǎn)換為列數(shù)據(jù)的操作,以便更好地進(jìn)行分析和可視化。

有時(shí),在進(jìn)行分析時(shí),您希望重新構(gòu)造表格。例如,有個(gè)羅列學(xué)生、科目和成績(jī)表格,我們想把具體科目分解為每個(gè)列。

之前:

SELECT Student, MATHEMATICS, GEOGRAPHY, PHYS_ED
FROM ( SELECT Student, Grade, Subject FROM skool)
PIVOT ( AVG ( Grade ) FOR Subject IN ( 'Mathematics', 'Geography', 'Phys Ed' ) ) as p
( Student, MATHEMATICS, GEOGRAPHY, PHYS_ED );

結(jié)果:

3. One-Hot編碼(或“虛擬”變量)

One-Hot編碼是一種將分類變量轉(zhuǎn)換為數(shù)字變量的方法。

這里是一個(gè)使用STATE作為列進(jìn)行獨(dú)熱編碼的示例。

之前:

SELECT *,
    CASE WHEN State = 'AL' THEN 1 ELSE 0 END as STATE_AL, 
    CASE WHEN State = 'AK' THEN 1 ELSE 0 END as STATE_AK, 
    CASE WHEN State = 'AZ' THEN 1 ELSE 0 END as STATE_AZ, 
    CASE WHEN State = 'AR' THEN 1 ELSE 0 END as STATE_AR, 
    CASE WHEN State = 'AS' THEN 1 ELSE 0 END as STATE_AS, 
    CASE WHEN State = 'CA' THEN 1 ELSE 0 END as STATE_CA, 
    CASE WHEN State = 'CO' THEN 1 ELSE 0 END as STATE_CO, 
    CASE WHEN State = 'CT' THEN 1 ELSE 0 END as STATE_CT, 
    CASE WHEN State = 'DC' THEN 1 ELSE 0 END as STATE_DC, 
    CASE WHEN State = 'FL' THEN 1 ELSE 0 END as STATE_FL, 
    CASE WHEN State = 'GA' THEN 1 ELSE 0 END as STATE_GA, 
    CASE WHEN State = 'HI' THEN 1 ELSE 0 END as STATE_HI, 
    CASE WHEN State = 'ID' THEN 1 ELSE 0 END as STATE_ID, 
    CASE WHEN State = 'IL' THEN 1 ELSE 0 END as STATE_IL, 
    CASE WHEN State = 'IN' THEN 1 ELSE 0 END as STATE_IN, 
    CASE WHEN State = 'IA' THEN 1 ELSE 0 END as STATE_IA, 
    CASE WHEN State = 'KS' THEN 1 ELSE 0 END as STATE_KS, 
    CASE WHEN State = 'KY' THEN 1 ELSE 0 END as STATE_KY, 
    CASE WHEN State = 'LA' THEN 1 ELSE 0 END as STATE_LA, 
    CASE WHEN State = 'ME' THEN 1 ELSE 0 END as STATE_ME, 
    CASE WHEN State = 'MD' THEN 1 ELSE 0 END as STATE_MD, 
    CASE WHEN State = 'MA' THEN 1 ELSE 0 END as STATE_MA, 
    CASE WHEN State = 'MI' THEN 1 ELSE 0 END as STATE_MI, 
    CASE WHEN State = 'MN' THEN 1 ELSE 0 END as STATE_MN, 
    CASE WHEN State = 'MS' THEN 1 ELSE 0 END as STATE_MS, 
    CASE WHEN State = 'MO' THEN 1 ELSE 0 END as STATE_MO, 
    CASE WHEN State = 'MT' THEN 1 ELSE 0 END as STATE_MT, 
    CASE WHEN State = 'NE' THEN 1 ELSE 0 END as STATE_NE, 
    CASE WHEN State = 'NV' THEN 1 ELSE 0 END as STATE_NV, 
    CASE WHEN State = 'NH' THEN 1 ELSE 0 END as STATE_NH, 
    CASE WHEN State = 'NJ' THEN 1 ELSE 0 END as STATE_NJ, 
    CASE WHEN State = 'NM' THEN 1 ELSE 0 END as STATE_NM, 
    CASE WHEN State = 'NY' THEN 1 ELSE 0 END as STATE_NY, 
    CASE WHEN State = 'NC' THEN 1 ELSE 0 END as STATE_NC, 
    CASE WHEN State = 'ND' THEN 1 ELSE 0 END as STATE_ND, 
    CASE WHEN State = 'OH' THEN 1 ELSE 0 END as STATE_OH, 
    CASE WHEN State = 'OK' THEN 1 ELSE 0 END as STATE_OK, 
    CASE WHEN State = 'OR' THEN 1 ELSE 0 END as STATE_OR, 
    CASE WHEN State = 'PA' THEN 1 ELSE 0 END as STATE_PA, 
    CASE WHEN State = 'RI' THEN 1 ELSE 0 END as STATE_RI, 
    CASE WHEN State = 'SC' THEN 1 ELSE 0 END as STATE_SC, 
    CASE WHEN State = 'SD' THEN 1 ELSE 0 END as STATE_SD, 
    CASE WHEN State = 'TN' THEN 1 ELSE 0 END as STATE_TN, 
    CASE WHEN State = 'TX' THEN 1 ELSE 0 END as STATE_TX, 
    CASE WHEN State = 'UT' THEN 1 ELSE 0 END as STATE_UT, 
    CASE WHEN State = 'VT' THEN 1 ELSE 0 END as STATE_VT, 
    CASE WHEN State = 'VA' THEN 1 ELSE 0 END as STATE_VA, 
    CASE WHEN State = 'WA' THEN 1 ELSE 0 END as STATE_WA, 
    CASE WHEN State = 'WV' THEN 1 ELSE 0 END as STATE_WV, 
    CASE WHEN State = 'WI' THEN 1 ELSE 0 END as STATE_WI, 
    CASE WHEN State = 'WY' THEN 1 ELSE 0 END as STATE_WY
FROM BABYTABLE;

結(jié)果:

4. 市場(chǎng)籃子分析

市場(chǎng)籃子分析是購(gòu)物籃分析和挖掘關(guān)聯(lián)規(guī)則的一種方法。在這個(gè)過程中,首先需要對(duì)數(shù)據(jù)進(jìn)行格式化,以便將每筆交易聚合到單個(gè)記錄中。對(duì)于個(gè)人電腦來說,這一步驟可能很具挑戰(zhàn)性,因?yàn)樗婕皵?shù)據(jù)處理和轉(zhuǎn)換。然而,數(shù)據(jù)倉(cāng)庫(kù)專為高效處理這些數(shù)據(jù)而設(shè)計(jì),因此它更適合執(zhí)行市場(chǎng)籃子分析所需的數(shù)據(jù)格式化任務(wù)。數(shù)據(jù)倉(cāng)庫(kù)提供了便捷的功能,使數(shù)據(jù)格式化更加容易,從而支持購(gòu)物籃分析和關(guān)聯(lián)規(guī)則的挖掘。

WITH order_detail as (
  SELECT 
    SALESORDERNUMBER, 
    listagg(ENGLISHPRODUCTNAME, ', ') WITHIN group (
      order by 
        ENGLISHPRODUCTNAME
    ) as ENGLISHPRODUCTNAME_listagg, 
    COUNT(ENGLISHPRODUCTNAME) as num_products 
  FROM 
    transactions 
  GROUP BY 
    SALESORDERNUMBER
) 
SELECT 
  ENGLISHPRODUCTNAME_listagg, 
  count(SALESORDERNUMBER) as NumTransactions 
FROM 
  order_detail 
where 
  num_products > 1 
GROUP BY 
  ENGLISHPRODUCTNAME_listagg 
order by 
  count(SALESORDERNUMBER) desc;

結(jié)果:

5. 時(shí)間序列聚合

時(shí)間序列聚合是指將時(shí)間序列數(shù)據(jù)按照一定的時(shí)間間隔進(jìn)行匯總和聚合,廣泛用語分析數(shù)據(jù)。然而,要正確執(zhí)行時(shí)間序列聚合,關(guān)鍵因素之一是需要對(duì)數(shù)據(jù)進(jìn)行適當(dāng)?shù)母袷交?,尤其是在使用窗口函?shù)時(shí)。這一步驟的正確處理對(duì)于獲得準(zhǔn)確且有意義的聚合結(jié)果至關(guān)重要。

例如,假設(shè)我們想計(jì)算過去14天的平均銷售額。使用窗口函數(shù)的方法要求我們將所有銷售數(shù)據(jù)轉(zhuǎn)換為每天一行的格式。然而,實(shí)際上,大多數(shù)銷售數(shù)據(jù)是以交易級(jí)別存儲(chǔ)的,這讓格式轉(zhuǎn)換變得困難。這就是時(shí)間序列聚合發(fā)揮作用的地方。通過時(shí)間序列聚合,我們可以創(chuàng)建歷史指標(biāo)的聚合結(jié)果,無需重新格式化整個(gè)數(shù)據(jù)集。如果我們想一次添加多個(gè)指標(biāo),它也會(huì)派上用場(chǎng):

  • 過去14天內(nèi)的平均銷售額
  • 過去6個(gè)月中最大的購(gòu)買
  • 統(tǒng)計(jì)過去 90 天內(nèi)的不同產(chǎn)品類型

如果您想使用窗口函數(shù),則需要通過幾個(gè)步驟獨(dú)立構(gòu)建每個(gè)指標(biāo)。

處理此問題的更好方法是使用公共表表達(dá)式 (CTE) 來定義每個(gè)預(yù)先聚合的歷史窗口。

例如:

WITH BASIC_OFFSET_14DAY AS (
  SELECT 
    A.CustomerID, 
    A.TransactionDate, 
    AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST14DAY, 
    MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST14DAY, 
    COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST14DAY
  FROM 
    My_First_Table A 
    INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID 
    AND 1 = 1 
  WHERE 
    B.TransactionDate >= DATEADD(day, -14, A.TransactionDate) 
    AND B.TransactionDate <= A.TransactionDate 
  GROUP BY 
    A.CustomerID, 
    A.TransactionDate
), 
BASIC_OFFSET_90DAY AS (
  SELECT 
    A.CustomerID, 
    A.TransactionDate, 
    AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST90DAY, 
    MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST90DAY, 
    COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST90DAY
  FROM 
    My_First_Table A 
    INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID 
    AND 1 = 1 
  WHERE 
    B.TransactionDate >= DATEADD(day, -90, A.TransactionDate) 
    AND B.TransactionDate <= A.TransactionDate 
  GROUP BY 
    A.CustomerID, 
    A.TransactionDate
), 
BASIC_OFFSET_180DAY AS (
  SELECT 
    A.CustomerID, 
    A.TransactionDate, 
    AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST180DAY, 
    MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST180DAY, 
    COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST180DAY
  FROM 
    My_First_Table A 
    INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID 
    AND 1 = 1 
  WHERE 
    B.TransactionDate >= DATEADD(day, -180, A.TransactionDate) 
    AND B.TransactionDate <= A.TransactionDate 
  GROUP BY 
    A.CustomerID, 
    A.TransactionDate
) 
SELECT 
  src.*, 
  BASIC_OFFSET_14DAY.AVG_PURCHASEAMOUNT_PAST14DAY, 
  BASIC_OFFSET_14DAY.MAX_PURCHASEAMOUNT_PAST14DAY, 
  BASIC_OFFSET_14DAY.COUNT_DISTINCT_TRANSACTIONID_PAST14DAY, 
  BASIC_OFFSET_90DAY.AVG_PURCHASEAMOUNT_PAST90DAY, 
  BASIC_OFFSET_90DAY.MAX_PURCHASEAMOUNT_PAST90DAY, 
  BASIC_OFFSET_90DAY.COUNT_DISTINCT_TRANSACTIONID_PAST90DAY, 
  BASIC_OFFSET_180DAY.AVG_PURCHASEAMOUNT_PAST180DAY, 
  BASIC_OFFSET_180DAY.MAX_PURCHASEAMOUNT_PAST180DAY, 
  BASIC_OFFSET_180DAY.COUNT_DISTINCT_TRANSACTIONID_PAST180DAY 
FROM 
  My_First_Table src 
  LEFT OUTER JOIN BASIC_OFFSET_14DAY ON BASIC_OFFSET_14DAY.TransactionDate = src.TransactionDate 
  AND BASIC_OFFSET_14DAY.CustomerID = src.CustomerID 
  LEFT OUTER JOIN BASIC_OFFSET_90DAY ON BASIC_OFFSET_90DAY.TransactionDate = src.TransactionDate 
  AND BASIC_OFFSET_90DAY.CustomerID = src.CustomerID 
  LEFT OUTER JOIN BASIC_OFFSET_180DAY ON BASIC_OFFSET_180DAY.TransactionDate = src.TransactionDate 
  AND BASIC_OFFSET_180DAY.CustomerID = src.CustomerID;

結(jié)果:

結(jié)語

希望本篇文章有助于闡明數(shù)據(jù)從業(yè)者在操作現(xiàn)代數(shù)據(jù)堆棧時(shí)會(huì)遇到的不同問題。當(dāng)涉及查詢?cè)茢?shù)據(jù)倉(cāng)庫(kù)時(shí),SQL是一把雙刃劍。雖然將計(jì)算集中在云數(shù)據(jù)倉(cāng)庫(kù)中可以提高速度,但有時(shí)需要一些額外的SQL技能。

責(zé)任編輯:趙寧寧 來源: Java學(xué)研大本營(yíng)
相關(guān)推薦

2020-09-23 13:44:26

分類變量獨(dú)熱編碼編碼

2011-07-08 13:40:18

2023-07-03 07:21:23

軟件敏捷編碼

2021-02-20 23:32:51

DevSecOps開發(fā)漏洞

2021-06-08 09:18:54

SQLPandas數(shù)據(jù)透視表

2022-08-29 18:34:46

Pythonsubprocess系統(tǒng)

2010-09-15 08:53:50

SQL Server

2020-12-30 06:19:15

編碼分類特征數(shù)字特征

2024-09-29 15:15:46

2021-05-29 07:36:08

MySQLSQL數(shù)據(jù)庫(kù)

2010-07-14 14:16:30

SQL Server

2022-04-01 06:37:47

CIOIT人才

2009-02-18 20:39:34

2012-02-15 14:49:19

2017-06-01 12:30:59

SQL云計(jì)算云端

2023-03-28 23:08:18

Bash編碼Shell

2018-09-04 23:04:31

大數(shù)據(jù)架構(gòu)大數(shù)據(jù)大數(shù)據(jù)分析

2021-03-25 10:43:24

數(shù)據(jù)缺失數(shù)據(jù)數(shù)據(jù)保護(hù)

2010-09-27 15:51:58

SQL角色

2022-10-18 15:45:17

數(shù)獨(dú)Sudoku鴻蒙
點(diǎn)贊
收藏

51CTO技術(shù)棧公眾號(hào)