patternsqlMinor
Marking duplicate records
Viewed 0 times
duplicaterecordsmarking
Problem
I have a list of things, some of which are duplicates.
I'd like to mark duplicates as such (can't delete them). The first item (defined by the lowest id) should never be marked as a duplicate. Its duplicates should have their
This is what I have so far:
Is there a more efficient way to accomplish this?
create table things (
id int primary key,
title text not null,
author text not null,
duplicate_of int null references things(id)
);I'd like to mark duplicates as such (can't delete them). The first item (defined by the lowest id) should never be marked as a duplicate. Its duplicates should have their
duplicate_of value set to its id.This is what I have so far:
update things a set duplicate_of = (
select
id
from things b
where
b.title = a.title
and b.author = a.author
and b.id < a.id
order by
id asc
limit 1
);Is there a more efficient way to accomplish this?
Solution
This can be solved with the help of the
The
Suppose we have a dataset like this:
id
title
author
duplicate_of
1
Hamlet
William Shakespeare
null
2
Done Quixote
Miguel de Cervantes
null
3
War and Peace
Leo Tolstoy
null
4
Hamlet
William Shakespeare
null
5
War and Peace
Leo Tolstoy
null
6
To Kill a Mockingbird
Harper Lee
null
7
Hamlet
William Shakespeare
null
For the above sample, the
id
title
author
first_id
1
Hamlet
William Shakespeare
1
2
Done Quixote
Miguel de Cervantes
2
3
War and Peace
Leo Tolstoy
3
4
Hamlet
William Shakespeare
1
5
War and Peace
Leo Tolstoy
3
6
To Kill a Mockingbird
Harper Lee
6
7
Hamlet
William Shakespeare
1
As you can see, for the three instances of Hamlet, the
Joining this set with the original table allows you to use the values of
This is what the table ends up like after the query:
id
title
author
duplicate_of
1
Hamlet
William Shakespeare
null
2
Done Quixote
Miguel de Cervantes
null
3
War and Peace
Leo Tolstoy
null
4
Hamlet
William Shakespeare
1
5
War and Peace
Leo Tolstoy
3
6
To Kill a Mockingbird
Harper Lee
null
7
Hamlet
William Shakespeare
1
The test setup and the solution are available in a live demo at db<>fiddle.
More about analytical and other window functions can be read in the online PostgreSQL manual (current version).
FIRST_VALUE analytical function:UPDATE
things AS tgt
SET
duplicate_of = src.first_id
FROM
(
SELECT
id
, title
, author
, FIRST_VALUE(id) OVER (PARTITION BY title, author ORDER BY id ASC) as first_id
FROM
things
) AS src
WHERE
tgt.id = src.id
AND tgt.id <> src.first_id
;The
FIRST_VALUE function returns the first value in each subset defined by the PARTITION BY clause based on the sorting order defined by ORDER BY.Suppose we have a dataset like this:
id
title
author
duplicate_of
1
Hamlet
William Shakespeare
null
2
Done Quixote
Miguel de Cervantes
null
3
War and Peace
Leo Tolstoy
null
4
Hamlet
William Shakespeare
null
5
War and Peace
Leo Tolstoy
null
6
To Kill a Mockingbird
Harper Lee
null
7
Hamlet
William Shakespeare
null
For the above sample, the
src derived table will evaluate like this:id
title
author
first_id
1
Hamlet
William Shakespeare
1
2
Done Quixote
Miguel de Cervantes
2
3
War and Peace
Leo Tolstoy
3
4
Hamlet
William Shakespeare
1
5
War and Peace
Leo Tolstoy
3
6
To Kill a Mockingbird
Harper Lee
6
7
Hamlet
William Shakespeare
1
As you can see, for the three instances of Hamlet, the
first_id calculated column shows 1, which is the ID of the first instance. Similar for the two instances of War and Peace.Joining this set with the original table allows you to use the values of
first_id to update the duplicate_of column, which is exactly what the query does. The additional filter of tgt.id <> src.first_id is there to prevent updating the first instance of each duplicate.This is what the table ends up like after the query:
id
title
author
duplicate_of
1
Hamlet
William Shakespeare
null
2
Done Quixote
Miguel de Cervantes
null
3
War and Peace
Leo Tolstoy
null
4
Hamlet
William Shakespeare
1
5
War and Peace
Leo Tolstoy
3
6
To Kill a Mockingbird
Harper Lee
null
7
Hamlet
William Shakespeare
1
The test setup and the solution are available in a live demo at db<>fiddle.
More about analytical and other window functions can be read in the online PostgreSQL manual (current version).
Code Snippets
UPDATE
things AS tgt
SET
duplicate_of = src.first_id
FROM
(
SELECT
id
, title
, author
, FIRST_VALUE(id) OVER (PARTITION BY title, author ORDER BY id ASC) as first_id
FROM
things
) AS src
WHERE
tgt.id = src.id
AND tgt.id <> src.first_id
;Context
StackExchange Database Administrators Q#83224, answer score: 3
Revisions (0)
No revisions yet.