HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Find duplicate records across one-to-many relationship tables

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
findtablesduplicaterecordsonemanyacrossrelationship

Problem

Let's say I have two tables that have a one-to-many relationship: People and Attributes. I'm trying to find duplicates based on the first_name, last_name, and all attributes must match exactly.

CREATE TABLE People (Id int, 
                 first_name varchar(100), 
                 last_name varchar(100));

CREATE TABLE Attributes (Id int, 
                     person_id int, 
                     field varchar(100),
                     field_value varchar(100));

INSERT INTO People VALUES (1, 'John', 'Smith');
INSERT INTO People VALUES (2, 'John', 'Smith');
INSERT INTO People VALUES (3, 'John', 'Smith');

INSERT INTO Attributes VALUES (1, 1, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (2, 1, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (3, 2, 'HairColor', 'Brown');
INSERT INTO Attributes VALUES (4, 2, 'EyeColor', 'Blue');
INSERT INTO Attributes VALUES (5, 3, 'HairColor', 'Blonde');


Which gives us:

id | first_name | last_name
----+------------+-----------
  1 | John       | Smith
  2 | John       | Smith
  3 | John       | Smith

 id | person_id |   field   | field_value
----+-----------+-----------+-------------
  1 |         1 | HairColor | Brown
  2 |         1 | EyeColor  | Blue
  3 |         2 | HairColor | Brown
  4 |         2 | EyeColor  | Blue
  5 |         3 | HairColor | Blonde


I'd like a query that will return IDs 1 and 2 from the People table. I can find duplicates within a single table:

select first_name,last_name,count(*) from People 
    group by first_name,last_name having ( count(*) > 1 );


But I'm having trouble joining the one-to-many table and detecting duplicates across both tables. How can I detect duplicates across tables that have a one-to-many relationship?

Solution

One way to do this (check the SQLfiddle):

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id  2
   ) ;


and another:

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      union 
        select field, field_value
        from attributes a2
       where a2.person_id = p2.id
      ) 
    except
      ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      intersect
        select field, field_value
        from attributes a2
        where a2.person_id = p2.id
      )
    ) ;


Intersect has higher precedence than except/minus, at least in Postgres and SQL Server. Just to be safe, you can use parentheses, to ensure the precedence.

Code Snippets

select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( select 1
      from 
      ( select * 
        from attributes a1
        where a1.person_id = p1.id
      union all
        select * 
        from attributes a2
        where a2.person_id = p2.id
      ) g
      group by field, field_value
      having count(*) <> 2
   ) ;
select 
    p1.id as id1, 
    p2.id as id2
from people p1
  join people p2
    on  p1.first_name = p2.first_name
    and p1.last_name = p2.last_name
    and p1.id < p2.id
where not exists
    ( ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      union 
        select field, field_value
        from attributes a2
       where a2.person_id = p2.id
      ) 
    except
      ( select field, field_value
        from attributes a1
        where a1.person_id = p1.id
      intersect
        select field, field_value
        from attributes a2
        where a2.person_id = p2.id
      )
    ) ;

Context

StackExchange Database Administrators Q#91959, answer score: 4

Revisions (0)

No revisions yet.