patternsqlMinor
Most efficient to prevent duplicates
Viewed 0 times
duplicatespreventefficientmost
Problem
I'm building an application that will fetch messages with JSON every once in a while and store these in a postgres db. Now I want to prevent duplicates, so I calculate the sha-512 of the message and set it as the primary key. Since I want to prevent duplicates I can do two things:
Considering I will probably be inserting a few messages at a time, I was wondering if method two will look up the primary key and see if exists for every insert as opposed to just storing a list in memory with method 1. What would be the most efficient?
- Fetch all hashes from the database and see when inserting new messages if the hash already exists (checking in code)
- Use an insert ignore (in postgres an update probably, since postgres doesn't support insert ignore) and let the database handle these exceptions
Considering I will probably be inserting a few messages at a time, I was wondering if method two will look up the primary key and see if exists for every insert as opposed to just storing a list in memory with method 1. What would be the most efficient?
Solution
PostgreSQL doesn't have
Personally, I'd do what's effectively do a bulk upsert, where I'd do something like this:
You can do an
The
If you
INSERT ... IGNORE so that isn't an option. You can use upsert-like operations (search for "postgresql upsert) but they certainly have a performance cost. Inserting and ignoring the resulting error on duplicate key is possible, but it tends to result in pretty spammy logs.Personally, I'd do what's effectively do a bulk upsert, where I'd do something like this:
CREATE TEMPORARY TABLE staging_table(...)
COPY TO staging_table(...) FROM ...(or do bulk inserts)
LOCK TABLE real_table IN EXCLUSIVE MODE
INSERT INTO real_table(...) SELECT col1,col2,col3,... FROM staging_table WHERE NOT EXISTS (SELECT 1 FROM real_table WHERE real_table.primarykey = staging_table.primarykey)
COMMIT
You can do an
UPDATE before the INSERT if you want, updating existing records with new values.The
LOCK TABLE is necessary to prevent concurrent updates failing with errors. You can still SELECT from the table during the update.If you
EXPLAIN the plan for your INSERT you should see that it's been transformed into a JOIN when you're inserting larger numbers of records.Context
StackExchange Database Administrators Q#38979, answer score: 7
Revisions (0)
No revisions yet.