patternsqlMinor
Group by array overlapping
Viewed 0 times
arrayoverlappinggroup
Problem
I have a PostgreSQL table with id and clusters like this:
If you aggregate clusters grouped by id, you can see that there are overlapping values in the cluster arrays:
i.e. cluster 4 covers id 1 and 5, cluster 2 covers id 2, 3 and 4, whereas cluster 5 corresponds only to one id.
How can I now aggregate ids grouped by cluster arrays overlapping?
i.e. the expected result is:
I don't care much about the cluster column just need ids properly aggregated.
There is no restriction on number of possible overlappings. Number of clusters per id is not restricted either (it can be hundreds or even more). Clusters are assined to ids not sequenctially.
There are millions of rows in the table!!!
Using PostgreSQL 11.
CREATE TABLE w (id bigint, clst int);
INSERT INTO w (id,clst)
VALUES
(1,0),
(1,4),
(2,1),
(2,2),
(2,3),
(3,2),
(4,2),
(5,4),
(6,5);If you aggregate clusters grouped by id, you can see that there are overlapping values in the cluster arrays:
select id, array_agg(clst) clst from w group by id order by id;
id | clst
----+---------
1 | {0,4}
2 | {1,2,3}
3 | {2}
4 | {2}
5 | {4}
6 | {5}i.e. cluster 4 covers id 1 and 5, cluster 2 covers id 2, 3 and 4, whereas cluster 5 corresponds only to one id.
How can I now aggregate ids grouped by cluster arrays overlapping?
i.e. the expected result is:
id | clst
---------+-------
{1,5} | {0,4,4}
{2,3,4} | {1,2,3,2,2}
{6} | {5}I don't care much about the cluster column just need ids properly aggregated.
There is no restriction on number of possible overlappings. Number of clusters per id is not restricted either (it can be hundreds or even more). Clusters are assined to ids not sequenctially.
There are millions of rows in the table!!!
Using PostgreSQL 11.
Solution
I don't care much about the cluster column just need ids properly aggregated.
In that case we can make use of the
array_agg | clst
:-------- | :------
{6} | {5}
{2,3,4} | {1,2,3}
{1,5} | {0,4}
db<>fiddle here
Bear in mind that this is unlikely to perform well on millions of rows.
In that case we can make use of the
uniq and sort functions in the intarray extension:with recursive a as (
select id, array_agg(distinct clst) clst from w group by id)
, t(id,pid,clst) as (
select id,id,clst from a
union all
select t.id,a.id,t.clst|a.clst
from t join a on a.id<>t.pid and t.clst&&a.clst and not t.clst@>a.clst)
, d as (
select distinct on(id) id, clst from t order by id, cardinality(clst) desc)
select array_agg(id), clst from d group by clst;array_agg | clst
:-------- | :------
{6} | {5}
{2,3,4} | {1,2,3}
{1,5} | {0,4}
db<>fiddle here
Bear in mind that this is unlikely to perform well on millions of rows.
Code Snippets
with recursive a as (
select id, array_agg(distinct clst) clst from w group by id)
, t(id,pid,clst) as (
select id,id,clst from a
union all
select t.id,a.id,t.clst|a.clst
from t join a on a.id<>t.pid and t.clst&&a.clst and not t.clst@>a.clst)
, d as (
select distinct on(id) id, clst from t order by id, cardinality(clst) desc)
select array_agg(id), clst from d group by clst;Context
StackExchange Database Administrators Q#235318, answer score: 4
Revisions (0)
No revisions yet.