Given this data:
ID DIFF_ID START_DAT END_DATE AMOUNTFor each consecutive time period (month) that there is an amount, count how many buckets, up to six.
--------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
2 4 01-FEB-08 29-FEB-08 0
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
5 4 01-MAY-08 31-MAY-08 0
6 1 01-JAN-08 31-JAN-08 10
7 1 01-FEB-08 29-FEB-08 0
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
First thought was definitely Analytics. I toiled away on what became a very unwieldy query (took more than one page anyway). A whole bunch of LAGs with the same number of ever increasing CASE statements.
My first obstacle overcome was to filter out those that had a 0 for amount. That left me with:
ID DIFF_ID START_DAT END_DATE AMOUNTIt took a good while to figure that out for some reason.
---------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
6 1 01-JAN-08 31-JAN-08 10
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
Once I had that figured, I needed to figure out which were consecutive. Frank Zhou is always solving puzzles with SQL and I remembered I had responded to one of his about a year ago. If you get a chance, please take a look at his site...he solves some pretty cool puzzles with SQL using the MODEL clause and analytics.
Anyway, his post, How to find the earliest start date and the latest end date for consecutive transactions in SQL was similar (and my response similar), so I found it to revisit my thinking at the time.
First, I use the LAG function to get the previous row's ID (unique) and call it PREV_ID. I use DIFF_ID in the PARTITION clause (window) and order by END_DATE; then add one to see if the months are consecutive. If that value matches the START_DATE of the current row, it's consecutive and I use LAG again to get the previous row's ID.
SELECTThat produces the following output:
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
DIFF_ID ID START_DAT END_DATE PREV_ID AMOUNTAs you can see, I have 3 records with the PREV_ID populated.
-------- ---------- --------- --------- ---------- ----------
1 6 01-JAN-08 31-JAN-08 10
1 8 01-MAR-08 31-MAR-08 10
1 9 01-APR-08 30-APR-08 8 10
1 10 01-MAY-08 31-MAY-08 9 10
4 1 01-JAN-08 31-JAN-08 40
4 3 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 3 10
As I am building it, I realize I keep nesting the queries, so in comes the WITH clause (when I first learned of that it was terribly difficult to search for, I didn't know it was also called subquery factoring clause...).
WITH subMuch better. Note the START WITH and CONNECT BY PRIOR, I created my own heirarchical table to determine another window to PARTITION on (MIN_START_DATE of the consecutive records).
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
id,
start_date,
end_date,
TO_DATE( SUBSTR( SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 ), 'MMDDYYYY' ) min_start_date
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
DIFF_ID ID START_DAT END_DATE MIN_STARTNow all I have to do is PIVOT the table (I chose not to use the new PIVOT feature) on DIFF_ID and add an analytic COUNT on my new window (MIN_START_DATE).
------- ---------- --------- --------- ---------
1 6 01-JAN-08 31-JAN-08 01-JAN-08
1 8 01-MAR-08 31-MAR-08 01-MAR-08
1 9 01-APR-08 30-APR-08 01-MAR-08
1 10 01-MAY-08 31-MAY-08 01-MAR-08
4 1 01-JAN-08 31-JAN-08 01-JAN-08
4 3 01-MAR-08 31-MAR-08 01-MAR-08
4 4 01-APR-08 30-APR-08 01-MAR-08
WITH subAnd voila!
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
COUNT( CASE WHEN d = 1 THEN 1 ELSE NULL END ) b1,
COUNT( CASE WHEN d = 2 THEN 1 ELSE NULL END ) b2,
COUNT( CASE WHEN d = 3 THEN 1 ELSE NULL END ) b3,
COUNT( CASE WHEN d = 4 THEN 1 ELSE NULL END ) b4,
COUNT( CASE WHEN d = 5 THEN 1 ELSE NULL END ) b5,
COUNT( CASE WHEN d = 6 THEN 1 ELSE NULL END ) b6
FROM
(
SELECT
diff_id,
COUNT( id ) OVER
( PARTITION BY diff_id, SUBSTR(
SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 )
ORDER BY end_date ) d
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
)
GROUP BY diff_id;
DIFF_ID B1 B2 B3 B4 B5 B6Problem solved!
------- ---- ---- ---- ---- ---- ----
1 2 1 1 0 0 0
4 2 1 0 0 0 0
Table creation and data can be found here.
I do not like using hierarchies for non-hierarchic queries. It will not scale well
ReplyDeleteselect diff_id,count(decode(r,1,1))b1,
count(decode(r,2,1))b2,count(decode(r,3
,1))b3,count(decode(r,4,1))b4,count(
decode(r,5,1))b5,count(decode(r,6,1))b6
from(select diff_id,x,row_number()over(
partition by diff_id,x order by s)r
from(SELECT diff_id,start_date s,
ROW_NUMBER()OVER(PARTITION BY diff_id
ORDER BY start_date)-MONTHS_BETWEEN(
start_date,DATE '2000-01-01')x FROM
col_test WHERE amount>0))group by
diff_id;
I have noticed that as well, my previous go at it had 15 million records.
ReplyDeleteHow about the use of a virtual column that is indexed? Use the CASE/LAG statement to create it?
One of my colleagues suggested a PIPELINEd function, but it was procedural and I try to avoid that.
I do like yours as it is much more concise.
Thanks Mr. Schneider!
I advise you not to use analytics and sys_connect_by_path in the same query to avoid bug 3564507
ReplyDeleteRe Laurent's query: Too sweet! I wouldn't have thought of the ROW_NUMBER - MONTHS_BETWEEN trick.
ReplyDeleteWhen I've done this sort of thing, I've usually flagged rows where a new bucket starts, instead of what you did (flagging rows that continue a bucket). I flag them with whatever unique ID I have handy (in your case I'd use a ROW_NUMBER).
Then in the next query/WITH clause I'd use MAX(flag) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING) to get the latest row that started a bucket prior to this row.
I think my approach is clearer... although I must admire the cleverness of Laurent's solution. :)
@jb318
ReplyDeleteI compared Laurent's to a full set of my data and it didn't match mine. I am going to look at it more closely later on and provide more feedback.
I was trying to figure how to flag the sliding window, but I couldn't figure out a way to do it. I was thinking something like a date, but couldn't quite get it.
I hadn't seen the RANGE UNBOUNDED clause...I'll look into that.