HiveBrain v1.2.0

Is there a performance penalty for unique key constraint violations in Postgres?

Submitted by: @import:stackexchange-dba

Problem

In my API, the user can potentially send a request that tries to create a new row when a row with that unique key already exists.

Currently, I'm catching the unique key error and returning a message saying that X already exists. But is it more performant to look up the row first (on the same connection), and only run the INSERT statement if that row doesn't exist?

My intuition says that reading the error from PostgreSQL should be more efficient, but I'd like to be sure I'm doing things idiomatically.

PostgreSQL is version 12

The unique key in my API isn't a surrogate ID value, it's a composite made up of a foreign key, combined with a text value. The database does already generate its own surrogate ID for this row if the unique key constraint doesn't fail. So the ID of the row is not what I'm checking. The correct behavior is to not insert the row, because the FK/text value needs to be unique in the table. If the request contains a FK/text value that already exists in the table, then no row should be inserted.
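The setup described above (a surrogate primary key plus a composite unique key made of a foreign key and a text value, with the error caught in application code) can be sketched as follows. This is a minimal illustration using Python's sqlite3 in-memory database rather than PostgreSQL; the table and column names (`item`, `category_id`, `slug`) are hypothetical, and the caught exception is `sqlite3.IntegrityError` where a PostgreSQL driver would raise its own unique-violation error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY);
    CREATE TABLE item (
        id          INTEGER PRIMARY KEY,   -- surrogate ID, generated by the DB
        category_id INTEGER NOT NULL REFERENCES category(id),
        slug        TEXT NOT NULL,
        UNIQUE (category_id, slug)         -- the composite unique key
    );
""")
conn.execute("INSERT INTO category (id) VALUES (1)")

def create_item(category_id, slug):
    """Insert a row; return True on success, False on a duplicate key."""
    try:
        conn.execute("INSERT INTO item (category_id, slug) VALUES (?, ?)",
                     (category_id, slug))
        return True
    except sqlite3.IntegrityError:   # unique constraint violation
        return False

print(create_item(1, "widget"))   # True  -> row inserted
print(create_item(1, "widget"))   # False -> duplicate rejected
```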

Solution

is it more performant to lookup the row first (on the same connection), and only run the INSERT statement if that row doesn't exist?

If someone else inserts a duplicate concurrently, there's no issue: the SELECT won't see it, but the unique constraint will still be enforced, so you still have to keep your error-handling code. Conversely, if someone deletes the duplicate after the SELECT saw it, the new row won't be inserted even though it could have been.

I ran a Python benchmark; the source code is available on pastebin. It's a simple example using a table with only a primary key and a dummy text column. For each id in the range 0..99, it attempts the insert 100 times. Only the first attempt succeeds; the rest are rejected by the unique constraint.

The candidates are:

- insert_only: sends the INSERT; either it works or it fails on the unique constraint.

- select_then_insert: does one SELECT to check, then the INSERT.

- insert_select: combines the previous two queries into one, which also removes the race condition:

  INSERT INTO testins (id,t) SELECT %s,'hello, world'
  WHERE NOT EXISTS( SELECT FROM testins WHERE id=%s )
  RETURNING id

- on_conflict: uses the upsert feature:

  INSERT INTO testins (id,t) VALUES (%s,'hello, world')
  ON CONFLICT (id) DO NOTHING
  RETURNING id


RETURNING id simply returns the id if the row was inserted, so you know it was. If the query returns nothing, it means there was a duplicate.
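The "returns a row vs. returns nothing" behaviour can be seen in a short sketch. This again uses Python's sqlite3 in-memory database as a stand-in for PostgreSQL: the `%s` placeholders become `?`, the bare `SELECT FROM` (valid in Postgres) becomes `SELECT 1 FROM`, and `RETURNING` requires SQLite ≥ 3.35.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testins (id INTEGER PRIMARY KEY, t TEXT)")

def insert_select(i):
    # insert_select: INSERT guarded by NOT EXISTS -- no error on a duplicate
    cur = conn.execute(
        "INSERT INTO testins (id, t) SELECT ?, 'hello, world' "
        "WHERE NOT EXISTS (SELECT 1 FROM testins WHERE id = ?) "
        "RETURNING id", (i, i))
    return cur.fetchone()          # a row if inserted, None if duplicate

def on_conflict(i):
    # on_conflict: upsert-style, DO NOTHING on a duplicate key
    cur = conn.execute(
        "INSERT INTO testins (id, t) VALUES (?, 'hello, world') "
        "ON CONFLICT (id) DO NOTHING RETURNING id", (i,))
    return cur.fetchone()

print(insert_select(1))   # (1,)  -> inserted
print(insert_select(1))   # None  -> duplicate, nothing returned
print(on_conflict(2))     # (2,)  -> inserted
print(on_conflict(2))     # None  -> duplicate, silently skipped
```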

Results: "Latency" is time per INSERT attempt. "No. of rows" is the number of successful INSERTs having left a row in the table. There are 10k INSERT attempts total.

Method              Latency   No. of rows   Table size
on_conflict         68.3µs    100           8.000 kB
insert_only         85.0µs    100           512.000 kB
select_then_insert  73.6µs    100           8.000 kB
insert_select       61.5µs    100           8.000 kB

This test has 99 duplicates for each insert, so let's try a more reasonable rate of 1 duplicate per insert:

Method              Latency   No. of rows   Table size
on_conflict         74.3µs    5000          256.000 kB
insert_only         78.2µs    5000          512.000 kB
select_then_insert  94.3µs    5000          256.000 kB
insert_select       66.8µs    5000          256.000 kB

No duplicates:

Method              Latency   No. of rows   Table size
on_conflict         81.6µs    10000         512.000 kB
insert_only         69.1µs    10000         512.000 kB
select_then_insert  184µs     10000         512.000 kB
insert_select       77.7µs    10000         512.000 kB

In all cases, most of the time is spent doing roundtrips on the connection, and committing transactions.

Conclusion:

The issue with the straight INSERT is that it still writes the row to the table, then tries to write it to the index, fails on the duplicate, and rolls back the transaction. This causes disk writes (table and WAL) and bloats the table with dead rows that will need VACUUMing, which explains the small performance penalty.

The other solutions don't insert the row if there is a duplicate, which avoids useless writes and table bloat.

The most idiomatic approach for Postgres is ON CONFLICT.

So if you expect a lot of duplicates, i.e. most of the time the INSERT will fail, and traffic on this query is high, it is advantageous to use ON CONFLICT.

If you expect few duplicates, i.e. most of the time the INSERT will work, then you can just let it throw the error.

If this is part of a larger transaction that you'd rather not fail, roll back, and do all the work again, then ON CONFLICT can help, since it won't throw an error in case of a duplicate.
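That last point can be illustrated with a small sketch: a batch of inserts runs in one transaction, one of them is a duplicate, and ON CONFLICT DO NOTHING lets the surrounding work commit instead of raising. This again uses Python's sqlite3 as a stand-in for PostgreSQL (note that unlike PostgreSQL, SQLite would not put the whole transaction into an aborted state on a constraint error, so the benefit is even larger on Postgres).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testins (id INTEGER PRIMARY KEY, t TEXT)")
conn.execute("INSERT INTO testins VALUES (2, 'already here')")
conn.commit()

# One transaction inserting a batch that contains a duplicate (id=2).
# With ON CONFLICT DO NOTHING no error is raised, so the transaction
# completes and the non-duplicate rows are kept.
with conn:   # commits on success, rolls back on an exception
    for i in [1, 2, 3]:
        conn.execute(
            "INSERT INTO testins (id, t) VALUES (?, 'hello, world') "
            "ON CONFLICT (id) DO NOTHING", (i,))

rows = [r[0] for r in conn.execute("SELECT id FROM testins ORDER BY id")]
print(rows)   # [1, 2, 3]
```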


Context

StackExchange Database Administrators Q#334904, answer score: 13
