HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Marking duplicate records

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
duplicaterecordsmarking

Problem

I have a list of things, some of which are duplicates.

create table things (
  id int primary key,
  title text not null,
  author text not null,
  duplicate_of int null references things(id)
);


I'd like to mark duplicates as such (can't delete them). The first item (defined by the lowest id) should never be marked as a duplicate. Its duplicates should have their duplicate_of value set to its id.

This is what I have so far:

update things a set duplicate_of = (
  select 
    id 
  from things b 
  where 
    b.title = a.title 
    and b.author = a.author 
    and b.id < a.id 
  order by 
    id asc 
  limit 1
  );


Is there a more efficient way to accomplish this?

Solution

This can be solved with the help of the FIRST_VALUE analytical function:

UPDATE
  things AS tgt
SET
  duplicate_of = src.first_id
FROM
  (
    SELECT
      id
    , title
    , author
    , FIRST_VALUE(id) OVER (PARTITION BY title, author ORDER BY id ASC) as first_id
    FROM
      things
  ) AS src
WHERE
  tgt.id = src.id
  AND tgt.id <> src.first_id
;


The FIRST_VALUE function returns the first value in each subset defined by the PARTITION BY clause based on the sorting order defined by ORDER BY.

Suppose we have a dataset like this:

id
title
author
duplicate_of

1
Hamlet
William Shakespeare
null

2
Done Quixote
Miguel de Cervantes
null

3
War and Peace
Leo Tolstoy
null

4
Hamlet
William Shakespeare
null

5
War and Peace
Leo Tolstoy
null

6
To Kill a Mockingbird
Harper Lee
null

7
Hamlet
William Shakespeare
null

For the above sample, the src derived table will evaluate like this:

id
title
author
first_id

1
Hamlet
William Shakespeare
1

2
Done Quixote
Miguel de Cervantes
2

3
War and Peace
Leo Tolstoy
3

4
Hamlet
William Shakespeare
1

5
War and Peace
Leo Tolstoy
3

6
To Kill a Mockingbird
Harper Lee
6

7
Hamlet
William Shakespeare
1

As you can see, for the three instances of Hamlet, the first_id calculated column shows 1, which is the ID of the first instance. Similar for the two instances of War and Peace.

Joining this set with the original table allows you to use the values of first_id to update the duplicate_of column, which is exactly what the query does. The additional filter of tgt.id <> src.first_id is there to prevent updating the first instance of each duplicate.

This is what the table ends up like after the query:

id
title
author
duplicate_of

1
Hamlet
William Shakespeare
null

2
Done Quixote
Miguel de Cervantes
null

3
War and Peace
Leo Tolstoy
null

4
Hamlet
William Shakespeare
1

5
War and Peace
Leo Tolstoy
3

6
To Kill a Mockingbird
Harper Lee
null

7
Hamlet
William Shakespeare
1

The test setup and the solution are available in a live demo at db<>fiddle.

More about analytical and other window functions can be read in the online PostgreSQL manual (current version).

Code Snippets

UPDATE
  things AS tgt
SET
  duplicate_of = src.first_id
FROM
  (
    SELECT
      id
    , title
    , author
    , FIRST_VALUE(id) OVER (PARTITION BY title, author ORDER BY id ASC) as first_id
    FROM
      things
  ) AS src
WHERE
  tgt.id = src.id
  AND tgt.id <> src.first_id
;

Context

StackExchange Database Administrators Q#83224, answer score: 3

Revisions (0)

No revisions yet.