HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Most efficient to prevent duplicates

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
duplicatespreventefficientmost

Problem

I'm building an application that will fetch messages with JSON every once in a while and store these in a postgres db. Now I want to prevent duplicates, so I calculate the sha-512 of the message and set it as the primary key. Since I want to prevent duplicates I can do two things:

  • Fetch all hashes from the database and see when inserting new messages if the hash already exists (checking in code)



  • Use an insert ignore (in postgres an update probably, since postgres doesn't support insert ignore) and let the database handle these exceptions



Considering I will probably be inserting a few messages at a time, I was wondering if method two will look up the primary key and see if exists for every insert as opposed to just storing a list in memory with method 1. What would be the most efficient?

Solution

PostgreSQL doesn't have INSERT ... IGNORE so that isn't an option. You can use upsert-like operations (search for "postgresql upsert) but they certainly have a performance cost. Inserting and ignoring the resulting error on duplicate key is possible, but it tends to result in pretty spammy logs.

Personally, I'd do what's effectively do a bulk upsert, where I'd do something like this:

  • CREATE TEMPORARY TABLE staging_table(...)



  • COPY TO staging_table(...) FROM ... (or do bulk inserts)



  • LOCK TABLE real_table IN EXCLUSIVE MODE



  • INSERT INTO real_table(...) SELECT col1,col2,col3,... FROM staging_table WHERE NOT EXISTS (SELECT 1 FROM real_table WHERE real_table.primarykey = staging_table.primarykey)



  • COMMIT



You can do an UPDATE before the INSERT if you want, updating existing records with new values.

The LOCK TABLE is necessary to prevent concurrent updates failing with errors. You can still SELECT from the table during the update.

If you EXPLAIN the plan for your INSERT you should see that it's been transformed into a JOIN when you're inserting larger numbers of records.

Context

StackExchange Database Administrators Q#38979, answer score: 7

Revisions (0)

No revisions yet.