Tuesday, September 9, 2008

Fun with SQL: Analytics and Heirarchical

I've had this problem since yesterday and I believe I finally solved it.

Given this data:
       ID    DIFF_ID START_DAT END_DATE      AMOUNT
--------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
2 4 01-FEB-08 29-FEB-08 0
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
5 4 01-MAY-08 31-MAY-08 0
6 1 01-JAN-08 31-JAN-08 10
7 1 01-FEB-08 29-FEB-08 0
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
For each consecutive time period (month) that there is an amount, count how many buckets, up to six.

First thought was definitely Analytics. I toiled away on what became a very unwieldy query (took more than one page anyway). A whole bunch of LAGs with the same number of ever increasing CASE statements.

My first obstacle overcome was to filter out those that had a 0 for amount. That left me with:
        ID    DIFF_ID START_DAT END_DATE      AMOUNT
---------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
6 1 01-JAN-08 31-JAN-08 10
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
It took a good while to figure that out for some reason.

Once I had that figured, I needed to figure out which were consecutive. Frank Zhou is always solving puzzles with SQL and I remembered I had responded to one of his about a year ago. If you get a chance, please take a look at his site...he solves some pretty cool puzzles with SQL using the MODEL clause and analytics.

Anyway, his post, How to find the earliest start date and the latest end date for consecutive transactions in SQL was similar (and my response similar), so I found it to revisit my thinking at the time.

First, I use the LAG function to get the previous row's ID (unique) and call it PREV_ID. I use DIFF_ID in the PARTITION clause (window) and order by END_DATE; then add one to see if the months are consecutive. If that value matches the START_DATE of the current row, it's consecutive and I use LAG again to get the previous row's ID.
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
That produces the following output:
 DIFF_ID         ID START_DAT END_DATE     PREV_ID     AMOUNT
-------- ---------- --------- --------- ---------- ----------
1 6 01-JAN-08 31-JAN-08 10
1 8 01-MAR-08 31-MAR-08 10
1 9 01-APR-08 30-APR-08 8 10
1 10 01-MAY-08 31-MAY-08 9 10
4 1 01-JAN-08 31-JAN-08 40
4 3 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 3 10
As you can see, I have 3 records with the PREV_ID populated.

As I am building it, I realize I keep nesting the queries, so in comes the WITH clause (when I first learned of that it was terribly difficult to search for, I didn't know it was also called subquery factoring clause...).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
id,
start_date,
end_date,
TO_DATE( SUBSTR( SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 ), 'MMDDYYYY' ) min_start_date
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
Much better. Note the START WITH and CONNECT BY PRIOR, I created my own heirarchical table to determine another window to PARTITION on (MIN_START_DATE of the consecutive records).
DIFF_ID         ID START_DAT END_DATE  MIN_START
------- ---------- --------- --------- ---------
1 6 01-JAN-08 31-JAN-08 01-JAN-08
1 8 01-MAR-08 31-MAR-08 01-MAR-08
1 9 01-APR-08 30-APR-08 01-MAR-08
1 10 01-MAY-08 31-MAY-08 01-MAR-08
4 1 01-JAN-08 31-JAN-08 01-JAN-08
4 3 01-MAR-08 31-MAR-08 01-MAR-08
4 4 01-APR-08 30-APR-08 01-MAR-08
Now all I have to do is PIVOT the table (I chose not to use the new PIVOT feature) on DIFF_ID and add an analytic COUNT on my new window (MIN_START_DATE).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
COUNT( CASE WHEN d = 1 THEN 1 ELSE NULL END ) b1,
COUNT( CASE WHEN d = 2 THEN 1 ELSE NULL END ) b2,
COUNT( CASE WHEN d = 3 THEN 1 ELSE NULL END ) b3,
COUNT( CASE WHEN d = 4 THEN 1 ELSE NULL END ) b4,
COUNT( CASE WHEN d = 5 THEN 1 ELSE NULL END ) b5,
COUNT( CASE WHEN d = 6 THEN 1 ELSE NULL END ) b6
FROM
(
SELECT
diff_id,
COUNT( id ) OVER
( PARTITION BY diff_id, SUBSTR(
SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 )
ORDER BY end_date ) d
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
)
GROUP BY diff_id;
And voila!
DIFF_ID   B1   B2   B3   B4   B5   B6
------- ---- ---- ---- ---- ---- ----
1 2 1 1 0 0 0
4 2 1 0 0 0 0
Problem solved!

Table creation and data can be found here.

5 comments:

  1. I do not like using hierarchies for non-hierarchic queries. It will not scale well

    select diff_id,count(decode(r,1,1))b1,
    count(decode(r,2,1))b2,count(decode(r,3
    ,1))b3,count(decode(r,4,1))b4,count(
    decode(r,5,1))b5,count(decode(r,6,1))b6
    from(select diff_id,x,row_number()over(
    partition by diff_id,x order by s)r
    from(SELECT diff_id,start_date s,
    ROW_NUMBER()OVER(PARTITION BY diff_id
    ORDER BY start_date)-MONTHS_BETWEEN(
    start_date,DATE '2000-01-01')x FROM
    col_test WHERE amount>0))group by
    diff_id;

    ReplyDelete
  2. I have noticed that as well, my previous go at it had 15 million records.

    How about the use of a virtual column that is indexed? Use the CASE/LAG statement to create it?

    One of my colleagues suggested a PIPELINEd function, but it was procedural and I try to avoid that.

    I do like yours as it is much more concise.

    Thanks Mr. Schneider!

    ReplyDelete
  3. I advise you not to use analytics and sys_connect_by_path in the same query to avoid bug 3564507

    ReplyDelete
  4. Re Laurent's query: Too sweet! I wouldn't have thought of the ROW_NUMBER - MONTHS_BETWEEN trick.

    When I've done this sort of thing, I've usually flagged rows where a new bucket starts, instead of what you did (flagging rows that continue a bucket). I flag them with whatever unique ID I have handy (in your case I'd use a ROW_NUMBER).

    Then in the next query/WITH clause I'd use MAX(flag) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING) to get the latest row that started a bucket prior to this row.

    I think my approach is clearer... although I must admire the cleverness of Laurent's solution. :)

    ReplyDelete
  5. @jb318

    I compared Laurent's to a full set of my data and it didn't match mine. I am going to look at it more closely later on and provide more feedback.

    I was trying to figure how to flag the sliding window, but I couldn't figure out a way to do it. I was thinking something like a date, but couldn't quite get it.

    I hadn't seen the RANGE UNBOUNDED clause...I'll look into that.

    ReplyDelete