HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Match .csv values as INTs. If one value from a group in the .csv matches one in another group, then merge strings

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
matchesthegroupmergecsvmatchvalueoneanotherthen

Problem

Here we have two sets of numbers. The problem is that I can't figure out how to get from the input to the output of numbers (DDL and DML below and also in the fiddle here).

current_data
1,2,3
1,4
1,5,7
8,9,10
10,11,15

expected_outcome
1,2,3,4,5,7
8,9,10,11,15


We are simply trying to match a group of numbers based on if a single number matches in any other group. Then merge all of these groups.

For instance.

If we have:

('1,2,3'),
('1,4'),
('1,5,7')


We want:

(1,2,3,4,5,7)


We consolidate them into a single line in PostgreSQL.

or (another example):

('8,9,10'), 
('10,11,15')


desired output:

(8,9,10,11,15)


The query would group these numbers because they have the number 10 in common. But it wouldn't group with the previous row (i.e. (1,2,3,4,5,7)) that don't share a number.

When we have these groups in a table. They will only group together if they at least have one matching number in each group.

======== DDL and DML =============

create table current (current_data text not null);
create table expected_output (expected_outcome text not null);

insert into current (current_data) values ('1,2,3'),('1,4'),('1,5,7'),('8,9,10'), ('10,11,15');
insert into expected_output (expected_outcome) values ('1,2,3,4,5,7'),('8,9,10,11,15');

Solution

If you want to see how this problem should NOT be tackled, take a look at the edits to see my first two solutions.
TL;DR (fiddle)

SELECT num AS "Token", STRING_AGG(val::TEXT, ',' ORDER BY val::INT) AS "The string"
FROM
(
  SELECT DISTINCT t02.d AS num, UNNEST(t03.d2) AS val FROM
  (
    SELECT d, COUNT(d) AS cnt
    FROM
    (
      SELECT 
        UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
        ROW_NUMBER() OVER () AS rn
      FROM current c
    ) AS t01
    GROUP BY d
    HAVING COUNT(d) > 1
  ) AS t02
  JOIN
  (
    SELECT 
      STRING_TO_ARRAY(c.current_data, ',')::INT[] AS d2
    FROM current c
  ) AS t03
  ON t02.d = ANY(t03.d2)
) AS t04
GROUP BY num
ORDER BY num;


Result:

Token             The string
   11                 1,11,99,1203,2222,6666
   44                  13,44,1005,1110,10078
 1005    13,44,992,1005,1007,1008,1110,10078


/TL;DR
Introduction:

SQL was not designed for work with strings and, in particular, looking inside strings and manipulating individual elements of that string (unlike, say, C) - and although most RDBMS suppliers now have a healthy stable of string functions, it's not SQL's forté (or strong point)!

I changed the data quite a bit (even a lot!) so that I could test my solution - and when you're dealing with row numbers that are similar to your actual data, things quickly become confusing!

You might well be better off (performance wise) using C (or another language of your choice) to do this on the app side! Just because it can be done in "pure" SQL, that doesn't mean it should be done!

With regards to the word "pure", this solution makes use of a few non-standard extensions - so does a SQL Server solution and I imagine that it would probably be nigh-on impossible to solve this without these handy tools!

I found these answers helpful:

-
Split column into multiple rows in Postgres

-
Converting comma separated string to integer array in Postgres

-
(https://dba.stackexchange.com/questions/226456/how-can-i-get-a-unique-array-in-postgresql).

-
https://stackoverflow.com/questions/3994556/eliminate-duplicate-array-values-in-postgres

The table is as created in the question, the input data is as follows:

INSERT INTO current (current_data) VALUES 
('992,1005,1007,1008'),
('44,1005,1110'),
('13,44,1005,10078'),
('11,1203,6666'),
('1,11,99,2222'),
('1234');           -- note the singleton!


and the desired output is:

Token                       The string
   11           1,11,99,1203,2222,6666
   44            13,44,1005,1110,10078
 1005     13,44,992,1005,1007,1008,100


The 1234 singleton is not included in the desired resultset because the question asks:

if one value from the group matches another group.

My reading of this (if .... matches...) is that singletons are excluded!

The final SQL is shown above but also (and the whole process step-by-step) can be found on the fiddle:
Step 1:

We need to find a way of identifying which numbers in the strings belong together - we do this using the ROW_NUMBER() function! If you're unclear on SQL window functions, I would urge you to watch Bruce Momjian's YouTube "The Magic of Window Functions in Postgres" presentation. They are extremely powerful, part of the SQL standard, and will repay any effort spent learning how to use them many times over!

So, we run this query:

SELECT 
  c.current_data AS d,
  ROW_NUMBER() OVER () AS rn
FROM current c;


Result:

d             rn
992,1005,1007,1008      1
      44,1005,1110      2
  13,44,1005,10078      3
      11,1203,6666      4
      1,11,99,2222      5
              1234      6


Step 2:

Above, I mentioned that strings are not SQL's forté - what is their forté is tables and records - so, we turn our data into a table by running:

SELECT 
  UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
  ROW_NUMBER() OVER () AS rn
FROM current c
ORDER BY rn, d;


Result:

d         rn
  992          1
 1005          1
 1007          1
 1008          1
   44          2
 1005          2
  ...
  ... snipped for brevity
  ...


This makes use of two non-standard functions - STRING_TO_ARRAY() and UNNEST()!

STRING_TO_ARRAY does what it says on the tin! It takes the string and the provided delimiter (',') and turns it into an array. PostgreSQL makes use of arrays "under the hood" so this is quite an efficient type!

The :: operator is PostgreSQL shorthand for CASTing - way more elegant than any other RDBMS - so we now have an array of integers.

We then use the UNNEST function which turns the elements of these arrays into rows in a "virtual" table - so we now have a table which contains our original combined string data as integer rows in a table!
Step 3:

Now, we determine the tokens in which we are interested - i.e. the ones which are repeated across groups! We make use of the SQL above and the aggregate COUNT() function (which can also be a window function to operate on subsets - i.e. "win

Code Snippets

SELECT num AS "Token", STRING_AGG(val::TEXT, ',' ORDER BY val::INT) AS "The string"
FROM
(
  SELECT DISTINCT t02.d AS num, UNNEST(t03.d2) AS val FROM
  (
    SELECT d, COUNT(d) AS cnt
    FROM
    (
      SELECT 
        UNNEST(STRING_TO_ARRAY(c.current_data, ','))::INT AS d,
        ROW_NUMBER() OVER () AS rn
      FROM current c
    ) AS t01
    GROUP BY d
    HAVING COUNT(d) > 1
  ) AS t02
  JOIN
  (
    SELECT 
      STRING_TO_ARRAY(c.current_data, ',')::INT[] AS d2
    FROM current c
  ) AS t03
  ON t02.d = ANY(t03.d2)
) AS t04
GROUP BY num
ORDER BY num;
Token             The string
   11                 1,11,99,1203,2222,6666
   44                  13,44,1005,1110,10078
 1005    13,44,992,1005,1007,1008,1110,10078
INSERT INTO current (current_data) VALUES 
('992,1005,1007,1008'),
('44,1005,1110'),
('13,44,1005,10078'),
('11,1203,6666'),
('1,11,99,2222'),
('1234');           -- note the singleton!
Token                       The string
   11           1,11,99,1203,2222,6666
   44            13,44,1005,1110,10078
 1005     13,44,992,1005,1007,1008,100
SELECT 
  c.current_data AS d,
  ROW_NUMBER() OVER () AS rn
FROM current c;

Context

StackExchange Database Administrators Q#298120, answer score: 3

Revisions (0)

No revisions yet.