HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL "upsert" tuning with unique constraint

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqluniquewithtuningupsertconstraint

Problem

I have the following table (PostgreSQL 9.6 but could upgrade):

# \d annotation_annotation 
                                       Table "public.annotation_annotation"
      Column       |          Type           |                             Modifiers                              
-------------------+-------------------------+--------------------------------------------------------------------
 id                | bigint                  | not null default nextval('annotation_annotation_id_seq'::regclass)
 release           | character varying(32)   | not null
 chromosome        | character varying(32)   | not null
 position          | integer                 | not null
 reference         | character varying(512)  | not null
 alternative       | character varying(512)  | not null
 database          | character varying(8)    | 
 effect            | character varying(64)[] | not null
 gene_id           | character varying(64)   | 
 transcript_id     | character varying(64)   | 
 transcript_coding | boolean                 | 
 hgvs_c            | character varying(512)  | 
 hgvs_p            | character varying(512)  | 
Indexes:
    "annotation_annotation_pkey" PRIMARY KEY, btree (id)
    "annotation_annotation_release_chromosome_posit_d102b713_uniq" UNIQUE CONSTRAINT, btree (release, chromosome, "position", reference, alternative, transcript_id)
    "annotation__release_3f254e_idx" btree (release, chromosome, "position", reference, alternative, gene_id)


I'm running bulk insertions with ON CONFLICT IGNORE of tens of millions of files into the table. I'm expecting a high number of duplicates (>90%). The table is append-only. Using pg_activity, I'm seeing that PostgreSQL is spending more time reading than writing (as I would expect if the number of duplicates is high).

What would be a good way of tuning "upsert" performance here:

  • Would partitioning with PostgreSQL 11 help? I could provide an additional integer "chromosome_no" column that could be used

Solution

Partitioning the table would only help if the data in the files you are trying to insert are also partitioned using the same scheme.

If you sort the data to be inserted into the same order as the unique index is defined on, that should help if the bottleneck is reading from the index. But if the table is not clustered in the same order as the index, and the bottleneck is reading from the table, this might not help much.

For this type of task, I would usually deduplicate outside the database. You might also look into using something like file_fdw to map the new data into the database, and then query and insert for new data.

Context

StackExchange Database Administrators Q#237571, answer score: 3

Revisions (0)

No revisions yet.