HiveBrain v1.2.0
pattern · sql · Moderate

Query performance on static, large PostgreSQL table

Submitted by: @import:stackexchange-dba
Tags: postgresql · query · large · performance · table · static

Problem

I tried to have this as detailed as possible. Sorry about the length!
Background

I created the following partitioned table protein_snp_assoc on a PostgreSQL (version 12.13) database:

CREATE TABLE protein_snp_assoc (
  protein_id    int not null,
  snp_id        int not null,
  beta          double precision,
  se            double precision,
  logp          double precision
) PARTITION BY RANGE (snp_id);


I then created 51 partitions, each containing roughly 150 million rows (7.65 billion rows in total), based on the following template:

CREATE TABLE IF NOT EXISTS protein_snp_assoc_(x) PARTITION OF protein_snp_assoc
  FOR VALUES FROM (y) TO (z);


where x ranged from 1 to 51, and y and z defined intervals of length 150,000 each. As an example, the first two and the last partitions are:

protein_snp_assoc_1 FOR VALUES FROM (1) TO (150001),
protein_snp_assoc_2 FOR VALUES FROM (150001) TO (300001), ...
protein_snp_assoc_51 FOR VALUES FROM (7500001) TO (7650001)


The column protein_id has 1,000 unique values (1 to 1,000) and snp_id has 7,500,000 unique values (1 to 7,650,001). Since the pair (snp_id, protein_id) uniquely determines a row in the table, I used the two columns to create a B-tree index, with snp_id as the leftmost column:

CREATE INDEX ON protein_snp_assoc (snp_id, protein_id);


This will be a static database. It currently has about 20% of the total data that will be on it (since I'm prototyping), but once all the data has been added to the database, no further rows will be added (nor deleted).
Typical queries

The most common queries will be (a) single SNP/protein queries, (b) single protein, multiple SNPs queries, and (c) multiple proteins and multiple SNPs queries.

Example queries follow. I use VALUES because I read on this site that it can improve performance when IN (...) contains multiple values.

```
-- Single SNP/protein
SELECT *
FROM   protein_snp_assoc
WHERE  snp_id IN (VALUES (1))
AND    protein_id IN (VALUES (1));
```
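The example list is cut off above. Under the same VALUES pattern, the (b) and (c) query shapes would presumably look like this (the IDs are hypothetical placeholders):

```
-- (b) Single protein, multiple SNPs
SELECT *
FROM   protein_snp_assoc
WHERE  protein_id = 42
AND    snp_id IN (VALUES (1), (100), (1000));

-- (c) Multiple proteins, multiple SNPs
SELECT *
FROM   protein_snp_assoc
WHERE  protein_id IN (VALUES (42), (43))
AND    snp_id IN (VALUES (1), (100), (1000));
```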

Solution

Based on this:

  • protein_id has 1,000 unique values

  • snp_id has 7,500,000 unique values

  • the pair (snp_id, protein_id) uniquely determines a row in the table

Typical queries:

(a) single SNP/protein queries

(b) single protein, multiple SNPs queries

(c) multiple proteins and multiple SNPs queries

Table is read-only.

I don't see the benefit of partitioning. Drop partitioning and use a plain table instead:

CREATE TABLE protein_snp_assoc (
  protein_id int
, snp_id     int
, beta       double precision
, se         double precision
, logp       double precision
);


After filling the table, add this PK:

ALTER TABLE protein_snp_assoc
ADD CONSTRAINT protein_snp_assoc_pkey
    PRIMARY KEY (protein_id, snp_id) WITH (FILLFACTOR = 100);


Notably, the PK replaces your index on (snp_id, protein_id), but with protein_id leading, since that is more commonly the single value in your queries. FILLFACTOR = 100 is used because your table is read-only; the default for B-tree indexes is 90. See:

  • Fillfactor for a sequential index that is PK

Apart from the 10 % savings from the higher fill factor, the size of the underlying index is the same as your original one. Two integer columns are pretty ideal for your multicolumn index. See:

  • Is a composite index also good for queries on the first field?

Once populated, cluster the table once on the PK:

CLUSTER protein_snp_assoc USING protein_snp_assoc_pkey;


Expensive for the huge table, but do it once. This rewrites table and index, so that you have both in pristine condition.
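To verify that the physical row order actually matches the index afterwards, one can inspect the planner statistics (a sketch; it assumes a recent ANALYZE, and a correlation near 1.0 for protein_id means rows are well ordered on disk):

SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'protein_snp_assoc'
AND    attname IN ('protein_id', 'snp_id');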

Or (better) if at all possible, fill the table with sorted data to begin with. Ideally use COPY with FREEZE:

COPY protein_snp_assoc FROM '/path/to/filename' WITH (FREEZE);


... after creating or truncating the table in the same transaction. Read about FREEZE in the manual. And this:

COPY is fastest when used within the same transaction as an earlier
CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be
written, because in case of an error, the files containing the newly
loaded data will be removed anyway. However, this consideration only
applies when wal_level is minimal as all commands must write
WAL otherwise.

You must be a superuser and have access to the server's filesystem, which should be the case here. Otherwise, the next-best option is psql's \copy instead of COPY.

Either way, add the PK only after filling the table. That's a lot cheaper, and you get an index in pristine condition right away. Read the manual about populating a database.
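Putting those pieces together, the load sequence might look like this (the file path is the placeholder from above; the WAL savings only apply with wal_level = minimal):

BEGIN;
TRUNCATE protein_snp_assoc;  -- same transaction as COPY, which enables FREEZE
COPY protein_snp_assoc FROM '/path/to/filename' WITH (FREEZE);
COMMIT;

-- only now add the PK
ALTER TABLE protein_snp_assoc
ADD CONSTRAINT protein_snp_assoc_pkey
    PRIMARY KEY (protein_id, snp_id) WITH (FILLFACTOR = 100);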

Turn off autovacuum for this read-only table right from the start. And run a single VACUUM ANALYZE protein_snp_assoc after populating it. (Or VACUUM FREEZE ANALYZE protein_snp_assoc if you didn't go the COPY ... FREEZE route.) With your even data distribution, column statistics are not very critical, but you still need them, and the updated visibility map.
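For reference, disabling autovacuum per table and the follow-up vacuum would look like this:

ALTER TABLE protein_snp_assoc SET (autovacuum_enabled = false);
VACUUM ANALYZE protein_snp_assoc;            -- after COPY ... FREEZE
-- VACUUM FREEZE ANALYZE protein_snp_assoc; -- if FREEZE was not used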

Either way, and crucially, rows with the same leading protein_id are now physically clustered, which minimizes the number of data pages that have to be read to satisfy your queries.

The table will take around 420 GB in pristine condition and the index a little over 140 GB: 7.5 billion rows × 20 bytes per index tuple (plus some overhead), or 60 bytes per table row. See:

  • How to determine or predict space on disk per table row?

  • Configuring PostgreSQL for read performance
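Once the table is loaded, those estimates can be checked against actual sizes with the standard admin functions:

SELECT pg_size_pretty(pg_table_size('protein_snp_assoc'))   AS table_size
     , pg_size_pretty(pg_indexes_size('protein_snp_assoc')) AS index_size;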



Now that we have minimized and optimized the disk footprint, caching the index will be crucial, especially with your 3.5" rotating disk. 64 GB of RAM is not exactly overkill. Postgres (in cooperation with the underlying OS) will still cache the most frequently accessed parts of index and table, and if you happen to focus on a subset of protein_id at a time, that will cover most of what needs to be cached. If your queries are all over the place all the time, 256 GB or more would help (a lot).
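If it fits your access pattern, the pg_prewarm extension (shipped with Postgres) can pull the index into shared buffers after a restart instead of waiting for it to warm up organically; a sketch:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('protein_snp_assoc_pkey');  -- returns the number of blocks read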

Update: You revealed in comments that there are 5 times as many rows. And read patterns are completely random. So you are back to disk-reads dominating the cost. You must get faster storage. SSD instead of HDD, there is no good alternative to that. Plus, all the RAM you can get to at least cache the most frequented parts of the index.

Do all of this with the latest version of Postgres (Postgres 15.2 at the time of writing) since there has been a steady flow of performance improvements, especially for big data.

Code Snippets

-- 1. Create the plain (unpartitioned) table
CREATE TABLE protein_snp_assoc (
  protein_id int
, snp_id     int
, beta       double precision
, se         double precision
, logp       double precision
);

-- 2. Bulk-load, in the same transaction as the CREATE or a TRUNCATE
COPY protein_snp_assoc FROM '/path/to/filename' WITH (FREEZE);

-- 3. Add the PK only after loading
ALTER TABLE protein_snp_assoc
ADD CONSTRAINT protein_snp_assoc_pkey
    PRIMARY KEY (protein_id, snp_id) WITH (FILLFACTOR = 100);

-- 4. Cluster once on the PK (unnecessary if the data was loaded pre-sorted)
CLUSTER protein_snp_assoc USING protein_snp_assoc_pkey;

Context

StackExchange Database Administrators Q#324640, answer score: 10

Revisions (0)

No revisions yet.