本文介紹了一種高性能的每組最大值SQL查詢方法的處理方法,對大家解決問題具有一定的參考價值,需要的朋友們下面隨著小編來一起學(xué)習(xí)吧!
問題描述
我正在嘗試構(gòu)建一個基礎(chǔ)結(jié)構(gòu),用于按需快速運(yùn)行回歸,從包含我們的Web服務(wù)器上所有歷史活動的數(shù)據(jù)庫中提取Apache請求。為了通過確保我們?nèi)匀贿f減來自較小客戶端的請求來提高覆蓋率,我希望通過為每個客戶端檢索至多n個(對于這個問題,假設(shè)10個)請求來確保請求的分布。
我在這里找到了許多類似問題的答案,其中最接近的似乎是SQL query to return top N rows per ID across a range of IDs,但答案大多是與性能無關(guān)的解決方案
我已經(jīng)試過了。例如,ROW_NUMBER()分析函數(shù)為我們提供了所需的數(shù)據(jù):
SELECT
*
FROM
(
SELECT
dailylogdata.*,
row_number() over (partition by dailylogdata.contextid order by occurrencedate) rn
FROM
dailylogdata
WHERE
shorturl in (?)
)
WHERE
rn <= 10;
但是,假設(shè)該表包含給定一天的數(shù)百萬個條目,并且這種方法需要從索引中讀取與我們的選擇標(biāo)準(zhǔn)匹配的所有行,以便應(yīng)用ROW_NUMBER分析函數(shù),則性能非常糟糕。我們最終選擇了近一百萬行,卻因為它們的row_number超過10而丟棄了其中的絕大多數(shù)。
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12222 |00:09:08.94 | 895K| 584K| 301 | | | | ||
||* 1 | VIEW | | 1 | 4427K| 12222 |00:09:08.94 | 895K| 584K| 301 | | | | ||
||* 2 | WINDOW SORT PUSHED RANK | | 1 | 4427K| 13536 |00:09:08.94 | 895K| 584K| 301 | 2709K| 743K| 97M (1)| 4096 ||
|| 3 | PARTITION RANGE SINGLE | | 1 | 4427K| 932K|00:22:27.90 | 895K| 584K| 0 | | | | ||
|| 4 | TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA | 1 | 4427K| 932K|00:22:27.61 | 895K| 584K| 0 | | | | ||
||* 5 | INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT | 1 | 17345 | 932K|00:00:00.75 | 1448 | 0 | 0 | | | | ||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter("RN"<=:SYS_B_2) |
| 2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DAILYLOGDATA"."CONTEXTID" ORDER BY "OCCURRENCEDATE")<=:SYS_B_2) |
| 5 - access("SHORTURL"=:P1) |
| |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
但是,如果我們只查詢特定上下文ID的前10個結(jié)果,則可以大大加快執(zhí)行速度:
SELECT
*
FROM
(
SELECT
dailylogdata.*
FROM
dailylogdata
WHERE
shorturl in (?)
and contextid = ?
)
WHERE
rownum <= 10;
運(yùn)行此查詢的統(tǒng)計信息:
|-------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers ||
|-------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.01 | 14 ||
||* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.01 | 14 ||
|| 2 | PARTITION RANGE SINGLE | | 1 | 10 | 10 |00:00:00.01 | 14 ||
|| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA | 1 | 10 | 10 |00:00:00.01 | 14 ||
||* 4 | INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT | 1 | 1 | 10 |00:00:00.01 | 5 ||
|-------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter(ROWNUM<=10) |
| 4 - access("SHORTURL"=:P1 AND "CONTEXTID"=TO_NUMBER(:P2)) |
| |
+-------------------------------------------------------------------------------------------------------------------------+
在本例中,Oracle足夠聰明,可以在獲得10個結(jié)果后停止檢索數(shù)據(jù)。我可以收集一組完整的上下文ID并以編程方式生成一個查詢,該查詢由每個上下文ID的一個查詢實例和union all
整個查詢組成,但是考慮到上下文ID的絕對數(shù)量,我們可能會遇到Oracle內(nèi)部的限制,即使不是這樣,這種方法也充滿了笨拙的味道。
有沒有人知道有一種方法可以保持第一個查詢的簡單性,同時保持與第二個查詢相稱的性能?還要注意,我實際上并不關(guān)心檢索一組穩(wěn)定的行;只要它們滿足我的標(biāo)準(zhǔn),它們就可以用于回歸。
編輯:Adam Musch的建議奏效了。我在這里附加了他的更改的性能結(jié)果,因為我無法將它們放在對他的答案的評論回應(yīng)中。這次我還使用了一個更大的數(shù)據(jù)集進(jìn)行測試,以下是來自我最初的ROW_NUMBER方法的(緩存)統(tǒng)計數(shù)據(jù)以供比較:
|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12624 |00:00:22.34 | 1186K| 931K| | | ||
||* 1 | VIEW | | 1 | 1163K| 12624 |00:00:22.34 | 1186K| 931K| | | ||
||* 2 | WINDOW NOSORT | | 1 | 1163K| 1213K|00:00:21.82 | 1186K| 931K| 3036M| 17M| ||
|| 3 | TABLE ACCESS BY INDEX ROWID| TWTEST | 1 | 1163K| 1213K|00:00:20.41 | 1186K| 931K| | | ||
||* 4 | INDEX RANGE SCAN | TWTEST_URLCONTEXT | 1 | 1163K| 1213K|00:00:00.81 | 8568 | 0 | | | ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter("RN"<=10) |
| 2 - filter(ROW_NUMBER() OVER ( PARTITION BY "CONTEXTID" ORDER BY NULL )<=10) |
| 4 - access("SHORTURL"=:P1) |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
我冒昧地略微簡化了Adam的建議;以下是修改后的查詢…
select
*
from
twtest
where
rowid in (
select
rowid
from (
select
rowid,
shorturl,
row_number() over (partition by shorturl, contextid
order by null) rn
from
twtest
)
where rn <= 10
and shorturl in (?)
);
…及其(緩存)評估的統(tǒng)計信息:
|--------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem ||
|--------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12624 |00:00:01.33 | 19391 | | | ||
|| 1 | NESTED LOOPS | | 1 | 1 | 12624 |00:00:01.33 | 19391 | | | ||
|| 2 | VIEW | VW_NSO_1 | 1 | 1163K| 12624 |00:00:01.27 | 6770 | | | ||
|| 3 | HASH UNIQUE | | 1 | 1 | 12624 |00:00:01.27 | 6770 | 1377K| 1377K| 5065K (0)||
||* 4 | VIEW | | 1 | 1163K| 12624 |00:00:01.25 | 6770 | | | ||
||* 5 | WINDOW NOSORT | | 1 | 1163K| 1213K|00:00:01.09 | 6770 | 283M| 5598K| ||
||* 6 | INDEX RANGE SCAN | TWTEST_URLCONTEXT | 1 | 1163K| 1213K|00:00:00.40 | 6770 | | | ||
|| 7 | TABLE ACCESS BY USER ROWID| TWTEST | 12624 | 1 | 12624 |00:00:00.04 | 12621 | | | ||
|--------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 4 - filter("RN"<=10) |
| 5 - filter(ROW_NUMBER() OVER ( PARTITION BY "SHORTURL","CONTEXTID" ORDER BY NULL NULL )<=10) |
| 6 - access("SHORTURL"=:P1) |
| |
|Note |
|----- |
| - dynamic sampling used for this statement (level=2) |
| |
+--------------------------------------------------------------------------------------------------------------------------------------+
正如所宣傳的,我們只訪問經(jīng)過完全篩選的行的dailylogdata表。我擔(dān)心似乎仍在根據(jù)它聲稱正在選擇的行數(shù)(1213K)對urlContext索引進(jìn)行完全掃描,但考慮到它只使用6770個緩沖區(qū)(即使我增加了上下文特定結(jié)果的數(shù)量,這個數(shù)字也保持不變),這可能具有誤導(dǎo)性。
推薦答案
這是一個不太好的解決方案,但似乎做了您想做的事情:盡快縮短索引掃描,并在通過篩選條件和top-n查詢條件限定數(shù)據(jù)之前不讀取數(shù)據(jù)。
請注意,測試時使用的是shorturl =
條件,而不是shorturl IN
條件。
with rowid_list as
(select rowid
from (select *
from (select rowid,
row_number() over (partition by shorturl, contextid
order by null) rn
from dailylogdata
)
where rn <= 10
)
where shorturl = ?
)
select *
from dailylogdata
where rowid in (select rowid from rowid_list)
with
子句獲取前10個ROID,為滿足您的條件的shorturl
和contextid
的每個唯一組合篩選一個窗口NOSORT。然后,它循環(huán)遍歷這組roid,按rowid獲取每個roid。
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 286 | 1536 (1)| 00:00:19 |
| 1 | NESTED LOOPS | | 1 | 286 | 1536 (1)| 00:00:19 |
| 2 | VIEW | VW_NSO_1 | 136K| 1596K| 910 (1)| 00:00:11 |
| 3 | HASH UNIQUE | | 1 | 3326K| | |
|* 4 | VIEW | | 136K| 3326K| 910 (1)| 00:00:11 |
|* 5 | WINDOW NOSORT | | 136K| 2794K| 910 (1)| 00:00:11 |
|* 6 | INDEX RANGE SCAN | TABLE_REDACTED_INDEX | 136K| 2794K| 910 (1)| 00:00:11 |
| 7 | TABLE ACCESS BY USER ROWID| TABLE_REDACTED | 1 | 274 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("RN"<=10)
5 - filter(ROW_NUMBER() OVER ( PARTITION BY "CLIENT_ID","SCE_ID" ORDER BY NULL NULL
)<=10)
6 - access("TABLE_REDACTED"."SHORTURL"=:b1)
這篇關(guān)于一種高性能的每組最大值SQL查詢方法的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,