patternsqlMinor
Match .csv values as INTs. If one value from a group in the .csv matches one in another group, then merge strings
Viewed 0 times
matchesthegroupmergecsvmatchvalueoneanotherthen
Problem
Here we have two sets of numbers. The problem is that I can't figure out how to get from the input to the output of numbers (DDL and DML below and also in the fiddle here).
We are simply trying to match a group of numbers based on if a single number matches in any other group. Then merge all of these groups.
For instance.
If we have:
We want:
We consolidate them into a single line in PostgreSQL.
or (another example):
desired output:
The query would group these numbers because they have the number 10 in common. But it wouldn't group with the previous row (i.e.
When we have these groups in a table. They will only group together if they at least have one matching number in each group.
======== DDL and DML =============
current_data
1,2,3
1,4
1,5,7
8,9,10
10,11,15
expected_outcome
1,2,3,4,5,7
8,9,10,11,15We are simply trying to match a group of numbers based on if a single number matches in any other group. Then merge all of these groups.
For instance.
If we have:
('1,2,3'),
('1,4'),
('1,5,7')We want:
(1,2,3,4,5,7)We consolidate them into a single line in PostgreSQL.
or (another example):
('8,9,10'),
('10,11,15')desired output:
(8,9,10,11,15)The query would group these numbers because they have the number 10 in common. But it wouldn't group with the previous row (i.e.
(1,2,3,4,5,7)) that don't share a number.When we have these groups in a table. They will only group together if they at least have one matching number in each group.
======== DDL and DML =============
create table current (current_data text not null);
create table expected_output (expected_outcome text not null);
insert into current (current_data) values ('1,2,3'),('1,4'),('1,5,7'),('8,9,10'), ('10,11,15');
insert into expected_output (expected_outcome) values ('1,2,3,4,5,7'),('8,9,10,11,15');Solution
If you want to see how this problem should NOT be tackled, take a look at the edits to see my first two solutions.
TL;DR (fiddle)
Result:
/TL;DR
Introduction:
SQL was not designed for work with strings and, in particular, looking inside strings and manipulating individual elements of that string (unlike, say,
I changed the data quite a bit (even a lot!) so that I could test my solution - and when you're dealing with row numbers that are similar to your actual data, things quickly become confusing!
You might well be better off (performance wise) using
With regards to the word
I found these answers helpful:
-
Split column into multiple rows in Postgres
-
Converting comma separated string to integer array in Postgres
-
(https://dba.stackexchange.com/questions/226456/how-can-i-get-a-unique-array-in-postgresql).
-
https://stackoverflow.com/questions/3994556/eliminate-duplicate-array-values-in-postgres
The table is as created in the question, the input data is as follows:
and the desired output is:
The
if one value from the group matches another group.
My reading of this (
The final SQL is shown above but also (and the whole process step-by-step) can be found on the fiddle:
Step 1:
We need to find a way of identifying which numbers in the strings belong together - we do this using the ROW_NUMBER() function! If you're unclear on SQL window functions, I would urge you to watch Bruce Momjian's YouTube "The Magic of Window Functions in Postgres" presentation. They are extremely powerful, part of the SQL standard, and will repay any effort spent learning how to use them many times over!
So, we run this query:
Result:
Step 2:
Above, I mentioned that strings are not SQL's forté - what is their forté is tables and records - so, we turn our data into a table by running:
Result:
This makes use of two non-standard functions -
STRING_TO_ARRAY does what it says on the tin! It takes the string and the provided delimiter (
The
We then use the UNNEST function which turns the elements of these arrays into rows in a "virtual" table - so we now have a table which contains our original combined string data as integer rows in a table!
Step 3:
Now, we determine the tokens in which we are interested - i.e. the ones which are repeated across groups! We make use of the SQL above and the aggregate COUNT() function (which can also be a window function to operate on subsets - i.e. "win
TL;DR (fiddle)
SELECT num AS "Token", STRING_AGG(val::TEXT, ',' ORDER BY val::INT) AS "The string"
FROM
(
SELECT DISTINCT t02.d AS num, UNNEST(t03.d2) AS val FROM
(
SELECT d, COUNT(d) AS cnt
FROM
(
SELECT
UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
ROW_NUMBER() OVER () AS rn
FROM current c
) AS t01
GROUP BY d
HAVING COUNT(d) > 1
) AS t02
JOIN
(
SELECT
STRING_TO_ARRAY(c.current_data, ',')::INT[] AS d2
FROM current c
) AS t03
ON t02.d = ANY(t03.d2)
) AS t04
GROUP BY num
ORDER BY num;Result:
Token The string
11 1,11,99,1203,2222,6666
44 13,44,1005,1110,10078
1005 13,44,992,1005,1007,1008,1110,10078/TL;DR
Introduction:
SQL was not designed for work with strings and, in particular, looking inside strings and manipulating individual elements of that string (unlike, say,
C) - and although most RDBMS suppliers now have a healthy stable of string functions, it's not SQL's forté (or strong point)!I changed the data quite a bit (even a lot!) so that I could test my solution - and when you're dealing with row numbers that are similar to your actual data, things quickly become confusing!
You might well be better off (performance wise) using
C (or another language of your choice) to do this on the app side! Just because it can be done in "pure" SQL, that doesn't mean it should be done!With regards to the word
"pure", this solution makes use of a few non-standard extensions - so does a SQL Server solution and I imagine that it would probably be nigh-on impossible to solve this without these handy tools!I found these answers helpful:
-
Split column into multiple rows in Postgres
-
Converting comma separated string to integer array in Postgres
-
(https://dba.stackexchange.com/questions/226456/how-can-i-get-a-unique-array-in-postgresql).
-
https://stackoverflow.com/questions/3994556/eliminate-duplicate-array-values-in-postgres
The table is as created in the question, the input data is as follows:
INSERT INTO current (current_data) VALUES
('992,1005,1007,1008'),
('44,1005,1110'),
('13,44,1005,10078'),
('11,1203,6666'),
('1,11,99,2222'),
('1234'); -- note the singleton!and the desired output is:
Token The string
11 1,11,99,1203,2222,6666
44 13,44,1005,1110,10078
1005 13,44,992,1005,1007,1008,100The
1234 singleton is not included in the desired resultset because the question asks:if one value from the group matches another group.
My reading of this (
if .... matches...) is that singletons are excluded!The final SQL is shown above but also (and the whole process step-by-step) can be found on the fiddle:
Step 1:
We need to find a way of identifying which numbers in the strings belong together - we do this using the ROW_NUMBER() function! If you're unclear on SQL window functions, I would urge you to watch Bruce Momjian's YouTube "The Magic of Window Functions in Postgres" presentation. They are extremely powerful, part of the SQL standard, and will repay any effort spent learning how to use them many times over!
So, we run this query:
SELECT
c.current_data AS d,
ROW_NUMBER() OVER () AS rn
FROM current c;Result:
d rn
992,1005,1007,1008 1
44,1005,1110 2
13,44,1005,10078 3
11,1203,6666 4
1,11,99,2222 5
1234 6Step 2:
Above, I mentioned that strings are not SQL's forté - what is their forté is tables and records - so, we turn our data into a table by running:
SELECT
UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
ROW_NUMBER() OVER () AS rn
FROM current c
ORDER BY rn, d;Result:
d rn
992 1
1005 1
1007 1
1008 1
44 2
1005 2
...
... snipped for brevity
...This makes use of two non-standard functions -
STRING_TO_ARRAY() and UNNEST()!STRING_TO_ARRAY does what it says on the tin! It takes the string and the provided delimiter (
',') and turns it into an array. PostgreSQL makes use of arrays "under the hood" so this is quite an efficient type!The
:: operator is PostgreSQL shorthand for CASTing - way more elegant than any other RDBMS - so we now have an array of integers.We then use the UNNEST function which turns the elements of these arrays into rows in a "virtual" table - so we now have a table which contains our original combined string data as integer rows in a table!
Step 3:
Now, we determine the tokens in which we are interested - i.e. the ones which are repeated across groups! We make use of the SQL above and the aggregate COUNT() function (which can also be a window function to operate on subsets - i.e. "win
Code Snippets
SELECT num AS "Token", STRING_AGG(val::TEXT, ',' ORDER BY val::INT) AS "The string"
FROM
(
SELECT DISTINCT t02.d AS num, UNNEST(t03.d2) AS val FROM
(
SELECT d, COUNT(d) AS cnt
FROM
(
SELECT
UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
ROW_NUMBER() OVER () AS rn
FROM current c
) AS t01
GROUP BY d
HAVING COUNT(d) > 1
) AS t02
JOIN
(
SELECT
STRING_TO_ARRAY(c.current_data, ',')::INT[] AS d2
FROM current c
) AS t03
ON t02.d = ANY(t03.d2)
) AS t04
GROUP BY num
ORDER BY num;Token The string
11 1,11,99,1203,2222,6666
44 13,44,1005,1110,10078
1005 13,44,992,1005,1007,1008,1110,10078INSERT INTO current (current_data) VALUES
('992,1005,1007,1008'),
('44,1005,1110'),
('13,44,1005,10078'),
('11,1203,6666'),
('1,11,99,2222'),
('1234'); -- note the singleton!Token The string
11 1,11,99,1203,2222,6666
44 13,44,1005,1110,10078
1005 13,44,992,1005,1007,1008,100SELECT
c.current_data AS d,
ROW_NUMBER() OVER () AS rn
FROM current c;Context
StackExchange Database Administrators Q#298120, answer score: 3
Revisions (0)
No revisions yet.