PostgreSQL "upsert" tuning with unique constraint
Tags: postgresql, unique, with, tuning, upsert, constraint
Problem
I have the following table (PostgreSQL 9.6, but I could upgrade):
# \d annotation_annotation
Table "public.annotation_annotation"
Column | Type | Modifiers
-------------------+-------------------------+--------------------------------------------------------------------
id | bigint | not null default nextval('annotation_annotation_id_seq'::regclass)
release | character varying(32) | not null
chromosome | character varying(32) | not null
position | integer | not null
reference | character varying(512) | not null
alternative | character varying(512) | not null
database | character varying(8) |
effect | character varying(64)[] | not null
gene_id | character varying(64) |
transcript_id | character varying(64) |
transcript_coding | boolean |
hgvs_c | character varying(512) |
hgvs_p | character varying(512) |
Indexes:
"annotation_annotation_pkey" PRIMARY KEY, btree (id)
"annotation_annotation_release_chromosome_posit_d102b713_uniq" UNIQUE CONSTRAINT, btree (release, chromosome, "position", reference, alternative, transcript_id)
"annotation__release_3f254e_idx" btree (release, chromosome, "position", reference, alternative, gene_id)

I'm running bulk insertions with ON CONFLICT DO NOTHING of tens of millions of files into the table. I'm expecting a high rate of duplicates (>90%). The table is append-only. Using pg_activity, I can see that PostgreSQL spends more time reading than writing (as I would expect when the duplicate rate is high). What would be a good way of tuning "upsert" performance here:
- Would partitioning with PostgreSQL 11 help? I could provide an additional integer "chromosome_no" column that could be used for partitioning.
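
For reference, the statement in question is roughly of this shape (a sketch; the values and the abridged column list are invented for illustration):

```sql
-- Sketch of one row of the bulk insert; real values come from the input files.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id, effect)
VALUES
    ('GRCh37', '1', 12345, 'A', 'G', 'ENST00000000001', '{missense_variant}')
ON CONFLICT (release, chromosome, "position", reference, alternative, transcript_id)
    DO NOTHING;
```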
Solution
Partitioning the table would only help if the data in the files you are trying to insert are also partitioned using the same scheme.
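If you did partition, a sketch of the layout (PostgreSQL 11+, assuming the proposed integer chromosome_no column; remaining columns omitted) might look like this. Note that in PostgreSQL 11 every unique constraint on a partitioned table must include the partition key, so chromosome_no would have to become part of the unique index:

```sql
-- Sketch only: list-partition by the proposed chromosome_no column.
CREATE TABLE annotation_partitioned (
    id            bigserial,
    chromosome_no integer NOT NULL,
    release       varchar(32)  NOT NULL,
    chromosome    varchar(32)  NOT NULL,
    "position"    integer      NOT NULL,
    reference     varchar(512) NOT NULL,
    alternative   varchar(512) NOT NULL,
    transcript_id varchar(64),
    -- the partition key must be part of any unique constraint
    UNIQUE (chromosome_no, release, chromosome, "position",
            reference, alternative, transcript_id)
) PARTITION BY LIST (chromosome_no);

CREATE TABLE annotation_chr1 PARTITION OF annotation_partitioned
    FOR VALUES IN (1);
```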
If you sort the data to be inserted into the same order as the unique index is defined on, that should help if the bottleneck is reading from the index. But if the table is not clustered in the same order as the index, and the bottleneck is reading from the table, this might not help much.
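One way to get the inserts into index order is to stage each batch and sort it before the conflict-checked insert. A sketch, assuming a server-readable file at a hypothetical path (the column list is abridged):

```sql
-- Stage a batch in an unlogged table, then insert in the unique index's order.
CREATE UNLOGGED TABLE annotation_staging
    (LIKE annotation_annotation INCLUDING DEFAULTS);

COPY annotation_staging (release, chromosome, "position",
                         reference, alternative, transcript_id, effect)
    FROM '/path/to/batch.tsv';  -- hypothetical server-side path

INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id, effect)
SELECT release, chromosome, "position", reference, alternative, transcript_id, effect
FROM annotation_staging
ORDER BY release, chromosome, "position", reference, alternative, transcript_id
ON CONFLICT (release, chromosome, "position", reference, alternative, transcript_id)
    DO NOTHING;
```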
For this type of task, I would usually deduplicate outside the database. You might also look into using something like file_fdw to map the new data into the database, and then query and insert for new data.
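A sketch of the file_fdw approach: map the incoming file as a foreign table, then insert only rows that do not already exist. The server name, file path, and abridged column list are assumptions; the NOT DISTINCT FROM comparison is there because transcript_id is nullable:

```sql
-- Sketch: expose the new batch as a foreign table via file_fdw.
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER import_srv FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE annotation_import (
    release       varchar(32),
    chromosome    varchar(32),
    "position"    integer,
    reference     varchar(512),
    alternative   varchar(512),
    transcript_id varchar(64)
    -- remaining columns omitted for brevity
) SERVER import_srv
  OPTIONS (filename '/path/to/batch.tsv', format 'text');

-- Insert only rows not already present (anti-join on the unique key).
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id)
SELECT i.release, i.chromosome, i."position",
       i.reference, i.alternative, i.transcript_id
FROM annotation_import i
WHERE NOT EXISTS (
    SELECT 1 FROM annotation_annotation a
    WHERE a.release = i.release
      AND a.chromosome = i.chromosome
      AND a."position" = i."position"
      AND a.reference = i.reference
      AND a.alternative = i.alternative
      AND a.transcript_id IS NOT DISTINCT FROM i.transcript_id
);
```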
Context
StackExchange Database Administrators Q#237571, answer score: 3