How do I remove duplicate records in a join table in PostgreSQL?
Problem
I have a table that has a schema like this:
create_table "questions_tags", :id => false, :force => true do |t|
t.integer "question_id"
t.integer "tag_id"
end
add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"
I would like to remove records that are duplicates, i.e. they have both the same tag_id and question_id as another record.
What does the SQL look like for that?
Solution
In my experience (and as shown in many tests), NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, like in this case), but this query with EXISTS (doing exactly what you asked) should be much faster yet, with big tables by orders of magnitude:
DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );
This deletes every row where another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively, it keeps the first instance according to the physical order of tuples.) I am using ctid in the absence of a better alternative: your table does not seem to have a PK or any other unique (set of) column(s).
ctid is the internal tuple identifier present in every row and necessarily unique within a table. Further reading:
- How do I decompose ctid into page and row numbers?
- How list all tables with data changes in the last 24 hours?
- How do I (or can I) SELECT DISTINCT on multiple columns?
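Before running the DELETE for real, it may help to preview its effect. The following is only a sketch, assuming the questions_tags table from the question: the first query reuses the predicate of the DELETE to count the rows it would remove, and the transaction lets you verify the result and roll back if something looks wrong.

```sql
-- Count the rows the DELETE would remove (same predicate as the DELETE).
SELECT count(*) AS rows_to_delete
FROM   questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

BEGIN;

DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

-- Verification: this should return no rows once duplicates are gone.
SELECT question_id, tag_id, count(*) AS copies
FROM   questions_tags
GROUP  BY question_id, tag_id
HAVING count(*) > 1;

COMMIT;  -- or ROLLBACK; if the verification query returned rows
```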
Test
I ran a test case with this table matched to your question and 100k rows:
CREATE TABLE questions_tags(
question_id integer NOT NULL
, tag_id integer NOT NULL
);
INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM generate_series(1, 100000);
ANALYZE questions_tags;
Indexes do not help in this case.
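To see how duplicate-heavy this test data is, a quick count can be run before either DELETE. This is a sketch of my own, not part of the original test. Since (random()* 100)::int yields integers from 0 to 100, there can be at most 101 × 101 = 10,201 distinct pairs, so at least 89,799 of the 100k rows must be surplus duplicates.

```sql
-- Compare total rows with distinct (question_id, tag_id) pairs.
SELECT count(*)                                            AS total_rows
     , count(DISTINCT (question_id, tag_id))               AS distinct_pairs
     , count(*) - count(DISTINCT (question_id, tag_id))    AS surplus_rows
FROM   questions_tags;
```

The surplus_rows figure is the number of rows a correct de-duplication should delete.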
Results
NOT IN: The SQLfiddle times out. I tried the same locally but canceled it, too, after several minutes.
EXISTS: Finishes in half a second in this SQLfiddle.
Alternatives
If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original and rename the survivors' table. Careful, this has implications if you have views or foreign keys (or other dependencies) defined on the original.
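A minimal sketch of that approach, assuming no views or foreign keys depend on the table. The name questions_tags_new is hypothetical, and the index names are taken from the schema in the question. Note that CREATE TABLE AS copies only the data, so NOT NULL constraints and indexes have to be re-created by hand.

```sql
BEGIN;

-- Keep exactly one row per (question_id, tag_id) in a new table.
CREATE TABLE questions_tags_new AS          -- hypothetical table name
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- Raises an error if other objects (e.g. views) depend on the table,
-- unless CASCADE is added, which would drop those objects as well.
DROP TABLE questions_tags;

ALTER TABLE questions_tags_new RENAME TO questions_tags;

-- Re-create what CREATE TABLE AS did not carry over.
ALTER TABLE questions_tags
   ALTER COLUMN question_id SET NOT NULL
 , ALTER COLUMN tag_id      SET NOT NULL;

CREATE INDEX index_questions_tags_on_question_id ON questions_tags (question_id);
CREATE INDEX index_questions_tags_on_tag_id      ON questions_tags (tag_id);

COMMIT;
```

DDL is transactional in PostgreSQL, so a failure anywhere in the block leaves the original table untouched.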
If you have dependencies and want to keep them, you could:
- Drop all foreign keys and indexes - for performance.
- SELECT survivors to a temporary table.
- TRUNCATE the original.
- Re-INSERT survivors.
- Re-CREATE indexes and foreign keys.
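The steps above could look roughly like this. It is a sketch under assumptions: the index names come from the schema in the question, the temporary table name questions_tags_survivors is hypothetical, and the foreign key statements are left as commented placeholders because the question does not show any.

```sql
BEGIN;

-- 1. Drop indexes (and any foreign keys) for performance.
DROP INDEX index_questions_tags_on_question_id;
DROP INDEX index_questions_tags_on_tag_id;
-- ALTER TABLE questions_tags DROP CONSTRAINT some_fk_name;  -- hypothetical FK

-- 2. SELECT survivors to a temporary table (dropped automatically at COMMIT).
CREATE TEMP TABLE questions_tags_survivors ON COMMIT DROP AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- 3. TRUNCATE the original. The table itself, and views on it, stay in place.
TRUNCATE questions_tags;

-- 4. Re-INSERT survivors.
INSERT INTO questions_tags (question_id, tag_id)
SELECT question_id, tag_id
FROM   questions_tags_survivors;

-- 5. Re-CREATE indexes (and foreign keys).
CREATE INDEX index_questions_tags_on_question_id ON questions_tags (question_id);
CREATE INDEX index_questions_tags_on_tag_id      ON questions_tags (tag_id);
-- ALTER TABLE questions_tags ADD CONSTRAINT some_fk_name ...;  -- hypothetical FK

COMMIT;
```

TRUNCATE takes an exclusive lock on the table for the rest of the transaction, so concurrent readers and writers wait until COMMIT.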
Views can just stay, they have no impact on performance. More here or here.
Context
StackExchange Database Administrators Q#36539, answer score: 15