How to ignore duplicates during bulk inserts?
Problem
In Postgres 9.3.5, I'm importing records from an external source where duplicates are VERY rare, but they do happen. Given a readings table with a unique compound key on (real_time_device_id, recorded_at), the following will fail once in a blue moon:

INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
... many more records ...
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);

(FWIW, the above fails 'properly' with a duplicate key violation.)
I know that handling exceptions is expensive, but as I said, duplicate entries are very rare. So to keep the code simple, I followed an example given in Optimal way to ignore duplicate inserts?:
BEGIN
INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
... many more records ...
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
EXCEPTION WHEN unique_violation THEN
-- silently ignore inserts
END;

The above gets two errors:
psql:sketches/t15.sql:11: ERROR: syntax error at or near "INSERT"
LINE 2: INSERT INTO readings (real_time_device_id, recorded_...
^
psql:sketches/t15.sql:14: ERROR: syntax error at or near "EXCEPTION"
LINE 1: EXCEPTION WHEN unique_violation THEN
^

Can anyone set me straight on the correct syntax? Or is my error deeper than mere syntax? (For example, will all of the INSERTs be ignored if there is a single duplicate?)
Generally, what's a good way to do bulk inserts where very few (< .1%) are duplicates?
Solution
There are 3 possible kinds of duplicates:
1. Duplicates within the rows of the bulk insert. That's your immediate cause for the exception.

2. Duplicates between inserted rows and existing rows.

3. Duplicates between inserted rows and concurrently inserted / updated rows from other transactions. (Assuming no overlapping rows are ever deleted concurrently, which would introduce new challenges.)
1. and 2. can be fixed easily. But you really need to define exactly how to resolve conflicts. Which row is to be picked?
INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (real_time_device_id, recorded_at) -- solves 1.
i.real_time_device_id, i.recorded_at, i.dur
FROM (
VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0)
, ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
) i (real_time_device_id, recorded_at, dur)
LEFT JOIN readings r1 USING (real_time_device_id, recorded_at)
WHERE r1.real_time_device_id IS NULL -- solves 2.

I am picking an arbitrary row from each set of dupes with DISTINCT ON. You may want to define deterministically which row to pick instead:
- Select first row in each GROUP BY group?
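For instance, a minimal sketch of a deterministic pick, assuming the query above and the (purely illustrative) rule "keep the longest duration per key" - trailing ORDER BY expressions after the DISTINCT ON columns decide which row wins in each set of dupes:

INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (real_time_device_id, recorded_at)
       i.real_time_device_id, i.recorded_at, i.dur
FROM (
   VALUES
   ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0)
 , ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
   ) i (real_time_device_id, recorded_at, dur)
LEFT JOIN readings r1 USING (real_time_device_id, recorded_at)
WHERE r1.real_time_device_id IS NULL
ORDER BY i.real_time_device_id, i.recorded_at, i.dur DESC;  -- dur DESC decides the winner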
3. is more tricky - but hopefully not applicable to your case ...
If it applies after all, consider this:
- How to use RETURNING with ON CONFLICT in PostgreSQL?
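For reference, on Postgres 9.5 or later (not the 9.3 in the question), a minimal sketch with ON CONFLICT DO NOTHING, which quietly skips rows that collide with existing rows, concurrently inserted rows, or earlier rows of the same statement:

INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
ON CONFLICT (real_time_device_id, recorded_at) DO NOTHING;  -- duplicates are silently dropped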
To insert ~250K rows, it would be much more efficient to COPY the lot to a temp table (or as much as you can easily store in RAM and process without spilling to disk) and proceed from there, as sketched below.

If that's a large part of the table, it might pay to drop indexes you don't need for the duration of the update and create them afterwards ...
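A minimal sketch of that route - the file path and the staging-table name are assumptions for illustration:

CREATE TEMP TABLE readings_stage (LIKE readings);

COPY readings_stage (real_time_device_id, recorded_at, duration)
FROM '/path/to/readings.csv' WITH (FORMAT csv);  -- or \copy in psql for a client-side file

INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (real_time_device_id, recorded_at)  -- de-dupe within the batch
       s.real_time_device_id, s.recorded_at, s.duration
FROM   readings_stage s
LEFT   JOIN readings r USING (real_time_device_id, recorded_at)
WHERE  r.real_time_device_id IS NULL;  -- skip rows already in the table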
Related:
- Getting error while importing large data through csv
- Call UPSERT function multiple times to different inputs
- How to update selected rows with values from a CSV file in Postgres?
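As an aside on the syntax errors in the question: BEGIN ... EXCEPTION ... END is PL/pgSQL, not plain SQL, so it only works inside a function body or a DO block. A sketch of the working form - with a caveat that answers the parenthetical question:

DO
$$
BEGIN
   INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
   ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
   ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
EXCEPTION WHEN unique_violation THEN
   NULL;  -- yes: a single duplicate discards ALL rows of the INSERT, not just the offender
END
$$;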
Context
StackExchange Database Administrators Q#84058, answer score: 4