patternsqlMinor
Tricky selection of grouped rows, selecting based on values of two distinct but related rows
Viewed 0 times
rowsdistincttrickyselectionbutrelatedtwogroupedselectingbased
Problem
Say I have a table with the following information (sorted here for display purposes)
I want to select the groups of rows that have the same reference where the first occurrence belonged to a particular centreid, but where the second occurrence was within a particular date range. For example, I want to find all taskids that have the same reference where the second date falls in October but the first centreid is 1, the results should be:
It's possible that there are more than two taskids with the same reference, in which case, I am always interested in the last date and the second to last centreid (when ordered by date_out). Each time I write a query it ends up being very complex and still not returning exactly the rows I want.
I have:
Thanks in advance for any help. I'm using PostgreSQL 9.1 if that makes any difference...
Update (2012-11-10)
I have used the answer provided by Erwin, but the self join i
taskid | reference | centreid | date_out
---------------------------------------------------
10001 | 'IX#001323' | 1 | '2012-09-28'
10022 | 'IX#001323' | 2 | '2012-10-04'
10032 | 'IX#001544' | 1 | '2012-10-09'
10046 | 'IX#001666' | 1 | '2012-10-10'
10056 | 'IX#001666' | 3 | '2012-10-13'
10078 | 'IX#002100' | 2 | '2012-10-23'
10098 | 'IX#002100' | 1 | '2012-11-01'I want to select the groups of rows that have the same reference where the first occurrence belonged to a particular centreid, but where the second occurrence was within a particular date range. For example, I want to find all taskids that have the same reference where the second date falls in October but the first centreid is 1, the results should be:
taskid | reference | centreid | date_out
---------------------------------------------------
10001 | 'IX#001323' | 1 | '2012-09-28'
10022 | 'IX#001323' | 2 | '2012-10-04'
10046 | 'IX#001666' | 1 | '2012-10-10'
10056 | 'IX#001666' | 3 | '2012-10-13'It's possible that there are more than two taskids with the same reference, in which case, I am always interested in the last date and the second to last centreid (when ordered by date_out). Each time I write a query it ends up being very complex and still not returning exactly the rows I want.
I have:
SELECT taskid
FROM tasks t1
WHERE (EXISTS (SELECT 1
FROM tasks t2
WHERE t1.taskid <> t2.taskd
AND t1.reference = t2.reference))
AND
-- STUCK
-- Unsure how to select amongst remaining rows where max(date) between two datesThanks in advance for any help. I'm using PostgreSQL 9.1 if that makes any difference...
Update (2012-11-10)
I have used the answer provided by Erwin, but the self join i
Solution
Get qualifying rows only
One way ..
Another way:
I'd expect the second one to be faster. Depends on your data distribution. Test with
Get all rows for qualifying references
-> sqlfiddle
Answer to follow-up in comment
Separate groups if more than 30 days between entries.
-> sqlfiddle
But while pure SQL is a beaut .. I would solve this procedurally in a plpgsql function. Very similar to this recent answer on SO. Would be faster, because it can be done in a single table scan.
One way ..
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
, y AS (
SELECT *
FROM x
WHERE x.rn = 1
AND date_out >= '2012-10-01'
AND date_out < '2012-11-01'
)
, z AS (
SELECT x.*
FROM x
JOIN y USING (reference)
WHERE x.rn = 2
AND x.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
JOIN z USING (reference)
UNION ALL
SELECT taskid, reference, centreid, date_out
FROM z
ORDER BY reference, date_out;Another way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
UNION ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM y
ORDER BY reference, date_out;I'd expect the second one to be faster. Depends on your data distribution. Test with
EXPLAIN ANALYZE.Get all rows for qualifying references
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT reference
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT *
FROM tbl
JOIN y USING (reference)
ORDER BY reference, date_out;-> sqlfiddle
Answer to follow-up in comment
Separate groups if more than 30 days between entries.
WITH a AS (
SELECT *
,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
,CASE WHEN date_out >
(lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
THEN 0 ELSE 1
END AS step
FROM tbl
)
,b AS (
SELECT *
,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
)
,c AS (
SELECT *
,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
FROM b
)
,d AS (
SELECT reference, grp
FROM c
JOIN c d USING (reference, grp)
WHERE c.rn = 1
AND c.date_out >= '2012-10-01'
AND c.date_out < '2012-11-01'
AND d.rn = 2
AND d.centreid = 1
)
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM b
JOIN d USING (reference, grp)
ORDER BY reference, date_out-> sqlfiddle
But while pure SQL is a beaut .. I would solve this procedurally in a plpgsql function. Very similar to this recent answer on SO. Would be faster, because it can be done in a single table scan.
Code Snippets
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
, y AS (
SELECT *
FROM x
WHERE x.rn = 1
AND date_out >= '2012-10-01'
AND date_out < '2012-11-01'
)
, z AS (
SELECT x.*
FROM x
JOIN y USING (reference)
WHERE x.rn = 2
AND x.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
JOIN z USING (reference)
UNION ALL
SELECT taskid, reference, centreid, date_out
FROM z
ORDER BY reference, date_out;WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
UNION ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM y
ORDER BY reference, date_out;WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT reference
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT *
FROM tbl
JOIN y USING (reference)
ORDER BY reference, date_out;WITH a AS (
SELECT *
,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
,CASE WHEN date_out >
(lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
THEN 0 ELSE 1
END AS step
FROM tbl
)
,b AS (
SELECT *
,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
)
,c AS (
SELECT *
,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
FROM b
)
,d AS (
SELECT reference, grp
FROM c
JOIN c d USING (reference, grp)
WHERE c.rn = 1
AND c.date_out >= '2012-10-01'
AND c.date_out < '2012-11-01'
AND d.rn = 2
AND d.centreid = 1
)
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM b
JOIN d USING (reference, grp)
ORDER BY reference, date_outContext
StackExchange Database Administrators Q#28162, answer score: 2
Revisions (0)
No revisions yet.