patternsqlMinor
Grouping by overlapping arrays, transitively, without duplicates
Viewed 0 times
withoutarraysgroupingtransitivelyoverlappingduplicates
Problem
I've found:
However, I'm having trouble putting it to use in my case.
I have a table like so (the real
The result I can't work out how to produce is:
The logic here is that any ids with ips in common are grouped together, transitively. For instance,
There are tens of thousands of rows, no specific limit on how many ips for any given id, and no specific limit on how many ids may show any given ip.
I know how to aggregate by ip, or aggregate by id, but doing both transitively is giving me trouble. I tried a recursive CTE but I couldn't seem to get it right, and I'm not sure if that was the right approach in the first place. (If I could first group by id and then group by overlapping array of ips, and avoid duplicates in the aggregation, I would be all set, but there may be
- Group by array overlapping
- https://stackoverflow.com/a/23650080/5419599
However, I'm having trouble putting it to use in my case.
I have a table like so (the real
myid values are hashes, but simplified here for illustration):create temp table a (myid text, ip inet);
insert into a (myid, ip)
values
('0a', '10.10.1.1'),
('0a', '10.10.1.2'),
('0a', '10.10.1.3'),
('0b', '10.10.1.2'),
('0b', '10.10.1.4'),
('0c', '10.10.1.5'),
('0d', '10.10.1.3'),
('0e', '10.10.1.6'),
('0e', '10.10.1.7'),
('0f', '10.10.1.8'),
('0f', '10.10.1.9'),
('10', '10.10.1.9'),
('11', '10.10.1.10'),
('12', '10.10.1.11'),
('12', '10.10.1.4'),
('1a', '10.10.1.2'),
('1a', '10.10.1.4'),
('1e', '10.10.1.11'),
('1f', '10.10.1.12'),
('23', '10.10.1.12');The result I can't work out how to produce is:
ids | ips
---------------------+------------------------------------------------------
{0a,0b,0d,12,1a,1e} | {10.10.1.1,10.10.1.2,10.10.1.3,10.10.1.4,10.10.1.11}
{0c} | {10.10.1.5}
{0e} | {10.10.1.6,10.10.1.7}
{0f,10} | {10.10.1.8,10.10.1.9}
{11} | {10.10.1.10}
{1f,23} | {10.10.1.12}The logic here is that any ids with ips in common are grouped together, transitively. For instance,
0a has an ip in common with 0b; 0b has one in common with 12; 12 has one in common with 1e, and so forth.There are tens of thousands of rows, no specific limit on how many ips for any given id, and no specific limit on how many ids may show any given ip.
I know how to aggregate by ip, or aggregate by id, but doing both transitively is giving me trouble. I tried a recursive CTE but I couldn't seem to get it right, and I'm not sure if that was the right approach in the first place. (If I could first group by id and then group by overlapping array of ips, and avoid duplicates in the aggregation, I would be all set, but there may be
Solution
One of the key parts of Jack Douglas's solution in Group by array overlapping is the
This operator concatenates two arrays suppressing duplicate items. The reason that answer cannot be directly applied to your setup is because apparently the
You can do that by treating the arrays as row sets. If you notice, what the
can be replaced with a correlated subquery:
Yes, the substitution is quite unwieldy in comparison, but it does the job, and that is something to start with.
Adapting the solution to your example (and adding a bit of white space to the original code), the complete query would look like this:
You can test the query in this demo at db<>fiddle.uk.
Note also that Jack's word of caution probably applies to your situation as well:
Bear in mind that this is unlikely to perform well on millions of rows.
| (pipe) operator used on arrays in the recursive part of the recursive t CTE like this:...
select t.id, a.id, t.clst | a.clst
...This operator concatenates two arrays suppressing duplicate items. The reason that answer cannot be directly applied to your setup is because apparently the
| operator is defined for int arrays only, while you need a way to perform the same operation on inet arrays.You can do that by treating the arrays as row sets. If you notice, what the
| operator produces is effectively a union of two sets. Therefore, if you unnest both arrays, union them and aggregate the combined set back as an array, you will get the same result. So, this expression,t.clst | a.clstcan be replaced with a correlated subquery:
(
select
array_agg(sub.n)
from
(
select unnest(t.clst)
union
select unnest(a.clst)
) as sub (n)
)Yes, the substitution is quite unwieldy in comparison, but it does the job, and that is something to start with.
Adapting the solution to your example (and adding a bit of white space to the original code), the complete query would look like this:
with recursive
cte_a as
(
select
myid,
array_agg(distinct ip) as ip
from
a
group by
myid
)
, cte_t (myid, pmyid, ip) as
(
select
myid,
myid,
ip
from
cte_a
union all
select
t.myid,
a.myid,
( /* this is the replacement expression */
select
array_agg(sub.n)
from
(
select unnest(t.ip)
union
select unnest(a.ip)
) as sub (n)
)
from
cte_t as t
join cte_a as a
on a.myid <> t.pmyid and t.ip && a.ip and not t.ip @> a.ip
)
, cte_d as
(
select distinct on (myid)
myid,
ip
from
cte_t
order by
myid,
cardinality(ip) desc
)
select
array_agg(myid),
ip
from
cte_d
group by
ip
;You can test the query in this demo at db<>fiddle.uk.
Note also that Jack's word of caution probably applies to your situation as well:
Bear in mind that this is unlikely to perform well on millions of rows.
Code Snippets
...
select t.id, a.id, t.clst | a.clst
...t.clst | a.clst(
select
array_agg(sub.n)
from
(
select unnest(t.clst)
union
select unnest(a.clst)
) as sub (n)
)with recursive
cte_a as
(
select
myid,
array_agg(distinct ip) as ip
from
a
group by
myid
)
, cte_t (myid, pmyid, ip) as
(
select
myid,
myid,
ip
from
cte_a
union all
select
t.myid,
a.myid,
( /* this is the replacement expression */
select
array_agg(sub.n)
from
(
select unnest(t.ip)
union
select unnest(a.ip)
) as sub (n)
)
from
cte_t as t
join cte_a as a
on a.myid <> t.pmyid and t.ip && a.ip and not t.ip @> a.ip
)
, cte_d as
(
select distinct on (myid)
myid,
ip
from
cte_t
order by
myid,
cardinality(ip) desc
)
select
array_agg(myid),
ip
from
cte_d
group by
ip
;Context
StackExchange Database Administrators Q#250223, answer score: 7
Revisions (0)
No revisions yet.