patternsqlModerate
Solving "Gaps and Islands" with row_number() and dense_rank()?
Viewed 0 times
solvingwithdense_rankrow_numberandgapsislands
Problem
How does one solve the islands part of gaps-and-islands with
Let's use something like this for example data (example uses PostgreSQL),
Which should produce something like this.
What we want is something like this... (where
Such that we can do either
That would look like this for (x,grp) in the above example.. (where
Background
This has come up only a few times in answers in my research.
It also seems to explicitly confuse others though it has never been explained on this network,
dense_rank() and row_number(). I've seen this now a few times and I'm wondering if someone could explain it,Let's use something like this for example data (example uses PostgreSQL),
CREATE TABLE foo
AS
SELECT x AS id, trunc(random()*3+1) AS x
FROM generate_series(1,50)
AS t(x);Which should produce something like this.
id | x
----+---
1 | 3
2 | 1
3 | 3
4 | 3
5 | 3
6 | 2
7 | 3
8 | 2
9 | 1
10 | 3
...What we want is something like this... (where
z is some value that we can use to GROUP BY)id | x | grp
----+------
1 | 3 | z
2 | 1 | z
3 | 3 | z
4 | 3 | z
5 | 3 | z
6 | 2 | z
7 | 3 | z
8 | 2 | z
9 | 1 | z
10 | 3 | z
...Such that we can do either
GROUP BY x,grpwhere(x,grp)is a unique island
GROUP BY grpwheregrpis a unique island.
That would look like this for (x,grp) in the above example.. (where
z is some value that we can use to GROUP BY)id | x | grp
----+------
1 | 3 | 1
2 | 1 | 1
3 | 3 | 2
4 | 3 | 2
5 | 3 | 2
6 | 2 | z -- Because we're grouping by an (x,grp)
7 | 3 | z -- Does not have to be `3` (or something unique)
8 | 2 | z -- z=1, is fine to use again if we
9 | 1 | z -- GROUP BY (x,grp)
10 | 3 | zBackground
This has come up only a few times in answers in my research.
- Using Row_Number to find consecutive row count by @Martin Smith
- Getting each status change in a table by @Martin Smith
- Grouping or Window by @ErikE
It also seems to explicitly confuse others though it has never been explained on this network,
- Question 1 (never explained), solved and accepted with
lag()
- Question 2 (never explained), solved an accepted with
lag()
Solution
So the trick here is a property of two equally incrementing series which produce a difference that can be used to identify islands
Background
Stepping aside from the problem. Let's check this out.. Here we
Here is some code.
This looks pretty good, and grouping by the
We'll we have to find a way define the second offset statically.
And, again we're back on track to making groups for each pairs of offsets. This isn't quite what we're doing, but it's pretty close and hopefully it serves to help illustrate how the two sets can be subtracted to create islands.
Application
Now let's revisit the problem above with table
You can see here all of the variables at play, in addition to
The group
{11,12,13} - {1,2,3} = {10,10,10}. This property isn't enough to identify islands in and of itself, but it's a crucial step that we can exploit to do so.Background
Stepping aside from the problem. Let's check this out.. Here we
- Define 1 group as 3 rows.
- Create one group for each pair/row of offsets in xoffsets.
Here is some code.
SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(2,42),
(13,7),
(42,2)
)
AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
----+----+------------
3 | 43 | -40
4 | 44 | -40
5 | 45 | -40
14 | 8 | 6
15 | 9 | 6
16 | 10 | 6
43 | 3 | 40
44 | 4 | 40
45 | 5 | 40
(9 rows)This looks pretty good, and grouping by the
difference would be good enough in this example. You can see we have three groups starting at (2,42), (13,7), and (42,2), corresponding to the groups in xoffsets. This is essentially the problem in reverse. But we have one major issue because we're demonstrating this with static offsets. If the difference between any two offsets o1-o2 is the same we'll have a problem. Like this,SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(100,90),
(90,80)
) AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
-----+----+------------
91 | 81 | 10
92 | 82 | 10
93 | 83 | 10
101 | 91 | 10
102 | 92 | 10
103 | 93 | 10We'll we have to find a way define the second offset statically.
SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(100,0),
(90,0)
) AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
-----+----+------------
91 | 1 | 90
92 | 2 | 90
93 | 3 | 90
101 | 1 | 100
102 | 2 | 100
103 | 3 | 100
(6 rows)And, again we're back on track to making groups for each pairs of offsets. This isn't quite what we're doing, but it's pretty close and hopefully it serves to help illustrate how the two sets can be subtracted to create islands.
Application
Now let's revisit the problem above with table
foo. We sandwich the variables between two copies of x for display purposes only.SELECT
id,
x,
dense_rank() OVER (w1) AS lhs,
dense_rank() OVER (w2) AS rhs,
dense_rank() OVER (w1)
- dense_rank() OVER (w2)
AS diff,
(
x,
dense_rank() OVER (w1)
- dense_rank() OVER (w2)
) AS grp,
x
FROM foo
WINDOW w1 AS (ORDER BY id),
w2 AS (PARTITION BY x ORDER BY id)
ORDER BY id
LIMIT 10;
id | x | lhs | rhs | diff | grp | x
----+---+-----+-----+------+-------+---
1 | 2 | 1 | 1 | 0 | (2,0) | 2
2 | 1 | 2 | 1 | 1 | (1,1) | 1
3 | 2 | 3 | 2 | 1 | (2,1) | 2
4 | 3 | 4 | 1 | 3 | (3,3) | 3
5 | 3 | 5 | 2 | 3 | (3,3) | 3
6 | 2 | 6 | 3 | 3 | (2,3) | 2
7 | 3 | 7 | 3 | 4 | (3,4) | 3
8 | 1 | 8 | 2 | 6 | (1,6) | 1
9 | 3 | 9 | 4 | 5 | (3,5) | 3
10 | 1 | 10 | 3 | 7 | (1,7) | 1
(10 rows)You can see here all of the variables at play, in addition to
x and id- The
lhsis simple. We're just generating a unique sequential identifier with that (because we're feeding it a unique sequential identifier:id-- though never forget that ids are seldom gapless)
- The
rhsis slightly more complex. We partition byxand generate a sequential identifier with that amongst the differentxvalues. Observe howrhsincrements over the set each time it sees a row with a value that it has already seen. The property ofrhsis how many times the value was seen.
- The
diffis the simple result of subtraction but it's not too useful to think of it like that. Think of more like subtracting a sequence than digits (though they're digits for any single row). We have a sequence increasing by one for the amount of times a value is seen. And, we have a sequence increasing by one for every distinct id (every time). Subtracting these two sequences will return the same number for repeating values, just like in our example above.(11,12,13) - (1,2,3) = (10,10,10). This is the same principle in the first section of this answer
diffdoesn't independently mark the group by itself
diffforces all identical islands ofxinto a group (which may have false-positives, ex. the three cases wherediff=3above and their correspondingxvalues)
The group
grp is a function of (x, diff). It seCode Snippets
SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(2,42),
(13,7),
(42,2)
)
AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
----+----+------------
3 | 43 | -40
4 | 44 | -40
5 | 45 | -40
14 | 8 | 6
15 | 9 | 6
16 | 10 | 6
43 | 3 | 40
44 | 4 | 40
45 | 5 | 40
(9 rows)SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(100,90),
(90,80)
) AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
-----+----+------------
91 | 81 | 10
92 | 82 | 10
93 | 83 | 10
101 | 91 | 10
102 | 92 | 10
103 | 93 | 10SELECT x1, x2, x1-x2 AS difference
FROM (VALUES
(100,0),
(90,0)
) AS xoffsets(o1,o2)
CROSS JOIN LATERAL (
SELECT x+o1,x+o2 -- here we add the offsets above to x
FROM generate_series(1,3) AS t(x)
) AS t(x1, x2)
ORDER BY x1, x2;
x1 | x2 | difference
-----+----+------------
91 | 1 | 90
92 | 2 | 90
93 | 3 | 90
101 | 1 | 100
102 | 2 | 100
103 | 3 | 100
(6 rows)SELECT
id,
x,
dense_rank() OVER (w1) AS lhs,
dense_rank() OVER (w2) AS rhs,
dense_rank() OVER (w1)
- dense_rank() OVER (w2)
AS diff,
(
x,
dense_rank() OVER (w1)
- dense_rank() OVER (w2)
) AS grp,
x
FROM foo
WINDOW w1 AS (ORDER BY id),
w2 AS (PARTITION BY x ORDER BY id)
ORDER BY id
LIMIT 10;
id | x | lhs | rhs | diff | grp | x
----+---+-----+-----+------+-------+---
1 | 2 | 1 | 1 | 0 | (2,0) | 2
2 | 1 | 2 | 1 | 1 | (1,1) | 1
3 | 2 | 3 | 2 | 1 | (2,1) | 2
4 | 3 | 4 | 1 | 3 | (3,3) | 3
5 | 3 | 5 | 2 | 3 | (3,3) | 3
6 | 2 | 6 | 3 | 3 | (2,3) | 2
7 | 3 | 7 | 3 | 4 | (3,4) | 3
8 | 1 | 8 | 2 | 6 | (1,6) | 1
9 | 3 | 9 | 4 | 5 | (3,5) | 3
10 | 1 | 10 | 3 | 7 | (1,7) | 1
(10 rows)SELECT x, diff, count(*)
FROM (
SELECT
id,
x,
dense_rank() OVER (ORDER BY id)
- dense_rank() OVER (PARTITION BY x ORDER BY id)
AS diff
FROM foo
) AS t
GROUP BY x, diff;Context
StackExchange Database Administrators Q#167068, answer score: 12
Revisions (0)
No revisions yet.