Many people have experienced the speed and efficiency of centralizing compute in a cloud data warehouse, but they have also come to recognize some drawbacks of this approach. One of them is having to learn and run queries in a different language, SQL in particular, which adds a great deal of complexity.
To address this, this article walks through some of the specific transformations that are hardest to learn and perform in SQL, and provides the SQL code actually needed to do them.
1. Datespine
A date spine is a transformation that generates a date index.
Imagine you are analyzing daily sales data laid out in a table like the one below:
Because there were no sales on the 16th and 17th, those rows are simply missing. That is a problem if you want to compute average daily sales or build a time-series forecasting model, so we need to insert rows for the missing dates.
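The sales table itself is not reproduced here, so the following is a minimal, hypothetical version of My_First_Table you could use to follow along. The table name and the product and sales_date columns come from the query further down; the sales_amount column, the product name, and the values are made up for illustration.
-- Hypothetical sample data (the real table is not shown in the article)
CREATE OR REPLACE TABLE My_First_Table (
  product      varchar,
  sales_date   date,
  sales_amount number(10,2)   -- assumed measure column
);
INSERT INTO My_First_Table VALUES
  ('widget', '2022-03-14', 100.00),
  ('widget', '2022-03-15', 250.00),
  -- no rows for the 16th and 17th: no sales were recorded
  ('widget', '2022-03-18', 125.00);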
The basic idea:
- Generate or select the unique dates
- Generate or select the unique products
- Cross join (Cartesian product) every combination of 1 and 2
- Outer join the result of step 3 to the original data
WITH GLOBAL_SPINE AS (
  -- one row per day; 1097 rows covers roughly three years starting 2020-01-01
  SELECT
    ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
    DATEADD('day', (INTERVAL_ID - 1), '2020-01-01T00:00' :: timestamp_ntz) as SPINE_START,
    DATEADD('day', INTERVAL_ID, '2020-01-01T00:00' :: timestamp_ntz) as SPINE_END
  FROM
    TABLE (GENERATOR(ROWCOUNT => 1097))
),
GROUPS AS (
  -- first and last sales date per product
  SELECT
    product,
    MIN(sales_date) AS LOCAL_START,
    MAX(sales_date) AS LOCAL_END
  FROM
    My_First_Table
  GROUP BY
    product
),
GROUP_SPINE AS (
  -- one row per product per day, limited to each product's own date range
  SELECT
    product,
    SPINE_START AS GROUP_START,
    SPINE_END AS GROUP_END
  FROM
    GROUPS G
    CROSS JOIN LATERAL (
      SELECT
        SPINE_START,
        SPINE_END
      FROM
        GLOBAL_SPINE S
      WHERE
        S.SPINE_START >= G.LOCAL_START
        AND S.SPINE_START <= G.LOCAL_END
    )
)
-- outer join the spine back to the original data; days with no sales come through as NULLs
SELECT
  G.product AS GROUP_BY_product,
  GROUP_START,
  GROUP_END,
  T.*
FROM
  GROUP_SPINE G
  LEFT JOIN My_First_Table T ON T.sales_date >= G.GROUP_START
  AND T.sales_date < G.GROUP_END
  AND G.product = T.product;
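Note that GENERATOR(ROWCOUNT => 1097) produces roughly three years of daily rows beginning 2020-01-01; adjust the row count, the start date, and the 'day' interval to match the range and grain of your own data.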
The final result looks like this:
2. Pivot
A pivot turns rows into columns so the data is easier to analyze and visualize.
Sometimes you want to restructure a table while analyzing it. For example, given a table listing students, subjects, and grades, we want to break each subject out into its own column.
Before:
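The "before" table is not reproduced here, so the following is a minimal, hypothetical version of the skool table the query below assumes; the Student, Subject, and Grade columns come from that query, while the names and grades are made up.
CREATE OR REPLACE TABLE skool (
  Student varchar,
  Subject varchar,
  Grade   number
);
-- made-up rows purely for illustration
INSERT INTO skool VALUES
  ('Alice', 'Mathematics', 61),
  ('Alice', 'Geography', 94),
  ('Alice', 'Phys Ed', 84),
  ('Bob', 'Mathematics', 99),
  ('Bob', 'Geography', 93),
  ('Bob', 'Phys Ed', 71);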
-- one output column per subject listed in the IN clause; the trailing
-- alias list (Student, MATHEMATICS, ...) renames the pivoted columns
SELECT Student, MATHEMATICS, GEOGRAPHY, PHYS_ED
FROM (SELECT Student, Grade, Subject FROM skool)
PIVOT (AVG(Grade) FOR Subject IN ('Mathematics', 'Geography', 'Phys Ed')) as p
  (Student, MATHEMATICS, GEOGRAPHY, PHYS_ED);
Result:
3. One-Hot Encoding (or "Dummy" Variables)
One-hot encoding is a way of turning a categorical variable into numeric variables.
Here is an example that one-hot encodes the STATE column.
Before:
SELECT *,
-- one 0/1 indicator column per state value
CASE WHEN State = 'AL' THEN 1 ELSE 0 END as STATE_AL,
CASE WHEN State = 'AK' THEN 1 ELSE 0 END as STATE_AK,
CASE WHEN State = 'AZ' THEN 1 ELSE 0 END as STATE_AZ,
CASE WHEN State = 'AR' THEN 1 ELSE 0 END as STATE_AR,
CASE WHEN State = 'AS' THEN 1 ELSE 0 END as STATE_AS,
CASE WHEN State = 'CA' THEN 1 ELSE 0 END as STATE_CA,
CASE WHEN State = 'CO' THEN 1 ELSE 0 END as STATE_CO,
CASE WHEN State = 'CT' THEN 1 ELSE 0 END as STATE_CT,
CASE WHEN State = 'DC' THEN 1 ELSE 0 END as STATE_DC,
CASE WHEN State = 'FL' THEN 1 ELSE 0 END as STATE_FL,
CASE WHEN State = 'GA' THEN 1 ELSE 0 END as STATE_GA,
CASE WHEN State = 'HI' THEN 1 ELSE 0 END as STATE_HI,
CASE WHEN State = 'ID' THEN 1 ELSE 0 END as STATE_ID,
CASE WHEN State = 'IL' THEN 1 ELSE 0 END as STATE_IL,
CASE WHEN State = 'IN' THEN 1 ELSE 0 END as STATE_IN,
CASE WHEN State = 'IA' THEN 1 ELSE 0 END as STATE_IA,
CASE WHEN State = 'KS' THEN 1 ELSE 0 END as STATE_KS,
CASE WHEN State = 'KY' THEN 1 ELSE 0 END as STATE_KY,
CASE WHEN State = 'LA' THEN 1 ELSE 0 END as STATE_LA,
CASE WHEN State = 'ME' THEN 1 ELSE 0 END as STATE_ME,
CASE WHEN State = 'MD' THEN 1 ELSE 0 END as STATE_MD,
CASE WHEN State = 'MA' THEN 1 ELSE 0 END as STATE_MA,
CASE WHEN State = 'MI' THEN 1 ELSE 0 END as STATE_MI,
CASE WHEN State = 'MN' THEN 1 ELSE 0 END as STATE_MN,
CASE WHEN State = 'MS' THEN 1 ELSE 0 END as STATE_MS,
CASE WHEN State = 'MO' THEN 1 ELSE 0 END as STATE_MO,
CASE WHEN State = 'MT' THEN 1 ELSE 0 END as STATE_MT,
CASE WHEN State = 'NE' THEN 1 ELSE 0 END as STATE_NE,
CASE WHEN State = 'NV' THEN 1 ELSE 0 END as STATE_NV,
CASE WHEN State = 'NH' THEN 1 ELSE 0 END as STATE_NH,
CASE WHEN State = 'NJ' THEN 1 ELSE 0 END as STATE_NJ,
CASE WHEN State = 'NM' THEN 1 ELSE 0 END as STATE_NM,
CASE WHEN State = 'NY' THEN 1 ELSE 0 END as STATE_NY,
CASE WHEN State = 'NC' THEN 1 ELSE 0 END as STATE_NC,
CASE WHEN State = 'ND' THEN 1 ELSE 0 END as STATE_ND,
CASE WHEN State = 'OH' THEN 1 ELSE 0 END as STATE_OH,
CASE WHEN State = 'OK' THEN 1 ELSE 0 END as STATE_OK,
CASE WHEN State = 'OR' THEN 1 ELSE 0 END as STATE_OR,
CASE WHEN State = 'PA' THEN 1 ELSE 0 END as STATE_PA,
CASE WHEN State = 'RI' THEN 1 ELSE 0 END as STATE_RI,
CASE WHEN State = 'SC' THEN 1 ELSE 0 END as STATE_SC,
CASE WHEN State = 'SD' THEN 1 ELSE 0 END as STATE_SD,
CASE WHEN State = 'TN' THEN 1 ELSE 0 END as STATE_TN,
CASE WHEN State = 'TX' THEN 1 ELSE 0 END as STATE_TX,
CASE WHEN State = 'UT' THEN 1 ELSE 0 END as STATE_UT,
CASE WHEN State = 'VT' THEN 1 ELSE 0 END as STATE_VT,
CASE WHEN State = 'VA' THEN 1 ELSE 0 END as STATE_VA,
CASE WHEN State = 'WA' THEN 1 ELSE 0 END as STATE_WA,
CASE WHEN State = 'WV' THEN 1 ELSE 0 END as STATE_WV,
CASE WHEN State = 'WI' THEN 1 ELSE 0 END as STATE_WI,
CASE WHEN State = 'WY' THEN 1 ELSE 0 END as STATE_WY
FROM BABYTABLE;
Result:
4. Market Basket Analysis
Market basket analysis is a way of doing basket analysis and mining association rules. The first step is to format the data so that each transaction is aggregated into a single record. On a personal computer this step can be challenging because of the data processing and reshaping involved, but a data warehouse is built to handle it efficiently and provides convenient functions that make the formatting much easier.
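Here is a minimal, hypothetical version of the transactions table the query below assumes; the SALESORDERNUMBER and ENGLISHPRODUCTNAME column names come from that query, while the order numbers and product names are made up.
CREATE OR REPLACE TABLE transactions (
  SALESORDERNUMBER   varchar,
  ENGLISHPRODUCTNAME varchar
);
-- made-up orders purely for illustration
INSERT INTO transactions VALUES
  ('SO100', 'Mountain Bike'),
  ('SO100', 'Water Bottle'),
  ('SO101', 'Water Bottle'),
  ('SO102', 'Mountain Bike'),
  ('SO102', 'Water Bottle');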
WITH order_detail as (
-- collapse each order into a single record listing its products
SELECT
SALESORDERNUMBER,
listagg(ENGLISHPRODUCTNAME, ', ') WITHIN group (
order by
ENGLISHPRODUCTNAME
) as ENGLISHPRODUCTNAME_listagg,
COUNT(ENGLISHPRODUCTNAME) as num_products
FROM
transactions
GROUP BY
SALESORDERNUMBER
)
SELECT
ENGLISHPRODUCTNAME_listagg,
count(SALESORDERNUMBER) as NumTransactions
FROM
order_detail
where
num_products > 1
GROUP BY
ENGLISHPRODUCTNAME_listagg
order by
count(SALESORDERNUMBER) desc;
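Because LISTAGG sorts the product names within each order, two orders containing the same items always produce the same string and therefore group together in the outer query; the num_products > 1 filter keeps only true multi-item baskets.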
Result:
5. Time Series Aggregation
Time series aggregation means rolling time series data up over fixed time intervals, and it is widely used in analytics. To do it correctly, though, the data has to be formatted properly, especially when window functions are involved; getting this step right is essential for accurate, meaningful results.
For example, suppose we want to calculate the average sales over the last 14 days. The window-function approach requires reshaping all the sales data into one row per day, but in practice most sales data is stored at the transaction level, which makes that reshaping awkward. This is where time series aggregation comes in: it lets us build aggregates over historical windows without reformatting the entire dataset. It also comes in handy when we want to add several metrics at once:
- Average sales over the last 14 days
- Largest purchase in the last 6 months
- Count of distinct product types in the last 90 days
If you wanted to use window functions, you would have to build each metric independently over several steps.
A better way to handle this is to use common table expressions (CTEs) to define each pre-aggregated historical window.
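For reference, the following is a minimal, hypothetical sketch of the transaction-grain table the query below expects; the CustomerID, TransactionID, TransactionDate, and PurchaseAmount columns come from the query, the values are made up, and note that My_First_Table here has a different shape than in the date spine example.
CREATE OR REPLACE TABLE My_First_Table (
  CustomerID      integer,
  TransactionID   integer,
  TransactionDate date,
  PurchaseAmount  number(10,2)
);
-- made-up transactions purely for illustration
INSERT INTO My_First_Table VALUES
  (1, 1001, '2022-01-05',  50.00),
  (1, 1002, '2022-01-12', 120.00),
  (1, 1003, '2022-03-01',  80.00),
  (2, 1004, '2022-01-07', 200.00);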
For example:
WITH BASIC_OFFSET_14DAY AS (
-- self-join each transaction to the same customer's transactions in the
-- trailing 14 days, then aggregate that window
SELECT
A.CustomerID,
A.TransactionDate,
AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST14DAY,
MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST14DAY,
COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST14DAY
FROM
My_First_Table A
INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID
WHERE
B.TransactionDate >= DATEADD(day, -14, A.TransactionDate)
AND B.TransactionDate <= A.TransactionDate
GROUP BY
A.CustomerID,
A.TransactionDate
),
BASIC_OFFSET_90DAY AS (
-- same pattern over a trailing 90-day window
SELECT
A.CustomerID,
A.TransactionDate,
AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST90DAY,
MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST90DAY,
COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST90DAY
FROM
My_First_Table A
INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID
WHERE
B.TransactionDate >= DATEADD(day, -90, A.TransactionDate)
AND B.TransactionDate <= A.TransactionDate
GROUP BY
A.CustomerID,
A.TransactionDate
),
BASIC_OFFSET_180DAY AS (
-- same pattern over a trailing 180-day window
SELECT
A.CustomerID,
A.TransactionDate,
AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST180DAY,
MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST180DAY,
COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST180DAY
FROM
My_First_Table A
INNER JOIN My_First_Table B ON A.CustomerID = B.CustomerID
WHERE
B.TransactionDate >= DATEADD(day, -180, A.TransactionDate)
AND B.TransactionDate <= A.TransactionDate
GROUP BY
A.CustomerID,
A.TransactionDate
)
SELECT
src.*,
BASIC_OFFSET_14DAY.AVG_PURCHASEAMOUNT_PAST14DAY,
BASIC_OFFSET_14DAY.MAX_PURCHASEAMOUNT_PAST14DAY,
BASIC_OFFSET_14DAY.COUNT_DISTINCT_TRANSACTIONID_PAST14DAY,
BASIC_OFFSET_90DAY.AVG_PURCHASEAMOUNT_PAST90DAY,
BASIC_OFFSET_90DAY.MAX_PURCHASEAMOUNT_PAST90DAY,
BASIC_OFFSET_90DAY.COUNT_DISTINCT_TRANSACTIONID_PAST90DAY,
BASIC_OFFSET_180DAY.AVG_PURCHASEAMOUNT_PAST180DAY,
BASIC_OFFSET_180DAY.MAX_PURCHASEAMOUNT_PAST180DAY,
BASIC_OFFSET_180DAY.COUNT_DISTINCT_TRANSACTIONID_PAST180DAY
FROM
My_First_Table src
LEFT OUTER JOIN BASIC_OFFSET_14DAY ON BASIC_OFFSET_14DAY.TransactionDate = src.TransactionDate
AND BASIC_OFFSET_14DAY.CustomerID = src.CustomerID
LEFT OUTER JOIN BASIC_OFFSET_90DAY ON BASIC_OFFSET_90DAY.TransactionDate = src.TransactionDate
AND BASIC_OFFSET_90DAY.CustomerID = src.CustomerID
LEFT OUTER JOIN BASIC_OFFSET_180DAY ON BASIC_OFFSET_180DAY.TransactionDate = src.TransactionDate
AND BASIC_OFFSET_180DAY.CustomerID = src.CustomerID;
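Each CTE pre-aggregates one window length, and the final query joins them back to the transaction grain on CustomerID and TransactionDate, so every row picks up its own trailing 14-, 90-, and 180-day metrics; adding another window is just a matter of copying a CTE and changing the DATEADD offset.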
Result:
Conclusion
Hopefully this article helps clarify the kinds of problems data practitioners run into when working with the modern data stack. When it comes to querying a cloud data warehouse, SQL is a double-edged sword: centralizing compute in the warehouse improves speed, but it sometimes demands a few extra SQL skills.