HiveBrain v1.2.0

Non-deterministic performance of query on select, from 1s to 60s on table with 1 billion rows

Submitted by: @import:stackexchange-dba

Problem

I'm trying to investigate why the performance of this query is so non-deterministic: it can take anywhere from 1 second to over 60 seconds. The nature of the query is to select a "time window" and get all rows from within that time window.

Here's the query in question, running on a table of approximately 1 billion rows:

SELECT CAST(extract(EPOCH from ts)*1000000 as bigint) as ts
    , ticks
    , quantity
    , side
FROM order_book
WHERE ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
    AND ts < TO_TIMESTAMP(1618969033073383/1000000.0)
    AND zx_prod_id = 0
ORDER BY ts ASC, del desc;


This is how the table is created:

CREATE TABLE public.order_book
(
    ts timestamp with time zone NOT NULL,
    zx_prod_id smallint NOT NULL,
    ticks integer NOT NULL,
    quantity integer NOT NULL,
    side boolean NOT NULL,
    del boolean NOT NULL
);


The values within TO_TIMESTAMP will keep sliding forward as I walk the whole table. Here is the EXPLAIN ANALYZE output for the same query on two different time windows:
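For reference, the two epoch literals in the WHERE clause span exactly one day:

```sql
SELECT TO_TIMESTAMP(1618882633073383 / 1000000.0) AS window_start
     , TO_TIMESTAMP(1618969033073383 / 1000000.0) AS window_end;
-- 1618969033073383 - 1618882633073383 = 86400000000 µs = exactly 24 hours
```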

Slow Performance

```
Gather Merge  (cost=105996.20..177498.48 rows=586308 width=18) (actual time=45196.559..45280.769 rows=539265 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=116386 read=42298
  ->  Sort  (cost=104996.11..105240.40 rows=97718 width=18) (actual time=45169.717..45176.775 rows=77038 loops=7)
        Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
        Sort Method: quicksort  Memory: 9327kB
        Worker 0:  Sort Method: quicksort  Memory: 8967kB
        Worker 1:  Sort Method: quicksort  Memory: 9121kB
        Worker 2:  Sort Method: quicksort  Memory: 9098kB
        Worker 3:  Sort Method: quicksort  Memory: 9075kB
        Worker 4:  Sort Method: quicksort  Memory: 9019kB
        Worker 5:  Sort Method: quicksort  Memory: 9031kB
        Buffers: shared hit=116386 read=42298
        ->  Result  (cost=0.57
```

Solution

Explanation

The all-important difference between the two query plans is the added read=xyz bit in multiple places of the slow version.

Slow:

Buffers: shared hit=116296 read=42298


Fast:

Buffers: shared hit=158122


This tells you that Postgres encountered data (or index) pages that were not cached yet. Repeat the slow query (possibly more than once, until the added read=xyz vanishes), and you'll see comparable performance on equally cached data pages.

Read the manual about BUFFERS in the EXPLAIN output.
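Those Buffers: lines come from EXPLAIN with the BUFFERS option enabled; to reproduce them, run the question's query like this (a sketch reusing its literals):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT ts, ticks, quantity, side
FROM   order_book
WHERE  ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND    ts <  TO_TIMESTAMP(1618969033073383/1000000.0)
AND    zx_prod_id = 0;
```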

Related:

  • "Recheck Cond:" line in query plans with a bitmap index scan



Notably, increasing work_mem can actually hurt your situation, as work_mem and cache memory (both the operating system cache and Postgres' own shared memory buffers) are competitors for available RAM. Setting work_mem higher can lead to data being evicted from the (then smaller) cache sooner, so your "slow query" becomes more likely to occur. Don't forget that Postgres can use many times the amount set with work_mem. The manual:

Note that for a complex query, several sort or hash operations might
be running in parallel; each operation will generally be allowed to
use as much memory as this value specifies before it starts to write
data into temporary files. Also, several running sessions could be
doing such operations concurrently. Therefore, the total memory used
could be many times the value of work_mem;

Your query uses 6 parallel workers; that's already 6 × work_mem as a possible worst case, though not every worker will use its maximum allowance.

In case of insufficient work_mem you see disk mentioned in various contexts of the EXPLAIN output. That's when additional work_mem is in order.
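Insufficient work_mem shows up as an external sort in the plan; a quick way to check, and to raise the setting for the current session only (the 128MB figure is a hypothetical example):

```sql
-- In EXPLAIN ANALYZE output, compare:
--   Sort Method: external merge  Disk: 102400kB   -- spilled to disk; more work_mem would help
--   Sort Method: quicksort  Memory: 9327kB        -- fits in memory (as in the plan above)
SHOW work_mem;
SET work_mem = '128MB';  -- session-local; does not affect other connections
```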

Increasing shared_buffers can make more sense. Still, don't set it too high, either. The manual:

If you have a dedicated database server with 1GB or more of RAM, a
reasonable starting value for shared_buffers is 25% of the memory in
your system. There are some workloads where even larger settings for
shared_buffers are effective, but because PostgreSQL also relies on
the operating system cache, it is unlikely that an allocation of more
than 40% of RAM to shared_buffers will work better than a smaller amount.
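A corresponding settings sketch (the 64 GB server and the 16GB value are hypothetical figures; shared_buffers only takes effect after a server restart):

```sql
ALTER SYSTEM SET shared_buffers = '16GB';  -- ~25% of RAM on a hypothetical dedicated 64 GB server
-- restart the server for the change to apply
```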

If you have plenty of RAM for everything (hard to imagine for a "table with 1 billion rows"), then all that's missing is that data pages in question just were not cached, yet. You may be interested in pg_prewarm. See:

  • PostgreSQL: Force data into memory
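A minimal pg_prewarm sketch (the extension ships with standard Postgres; the index name in the comment is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('public.order_book');   -- loads the table's blocks; returns the number of blocks read
-- indexes can be prewarmed the same way:
-- SELECT pg_prewarm('order_book_ts_idx');
```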



Aside: the second query plan shows Filter: (zx_prod_id = 4), which does not match the displayed query, but I think that's insignificant for the problem.

Additional improvements

You already tried an index on (zx_prod_id, ts) but it ...

didn't seem to do anything.

That index would typically look good for the query, though.

The multicolumn index is around 50% larger than the index on just (ts): 4 + 8 + 16 = 28 bytes vs. 4 + 8 + 8 = 20 bytes per index tuple, and possibly less compression from "index deduplication". And since RAM seems to be a contested resource, that is a downside. See:

  • Configuring PostgreSQL for read performance
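To compare actual on-disk sizes once both indexes exist (the index names here are hypothetical):

```sql
SELECT pg_size_pretty(pg_relation_size('order_book_ts_idx'))    AS ts_only_index
     , pg_size_pretty(pg_relation_size('order_book_zx_ts_idx')) AS multicolumn_index;
```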



I see rows=539265 vs. Rows Removed by Filter: 465513. So the query reads twice the rows it needs to. Depending on data distribution, that index may or may not be an improvement. It would certainly be if your table was physically clustered on that index. Then most data pages contain rows with the same zx_prod_id value, which makes that index much more effective. Again, see:

  • Configuring PostgreSQL for read performance
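Physically clustering on such an index would look like this on a plain table (index name hypothetical; CLUSTER takes an exclusive lock and is a one-time rewrite, and the _hyper_… chunk names in the plan suggest a TimescaleDB hypertable, which needs its own reordering tooling instead):

```sql
CREATE INDEX order_book_zx_ts_idx ON order_book (zx_prod_id, ts);
CLUSTER order_book USING order_book_zx_ts_idx;  -- rewrites the table in index order
ANALYZE order_book;                             -- refresh planner statistics afterwards
```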



Your query seems to always filter for a single value of zx_prod_id. If that's focused around a few values, or if there are only few distinct values to begin with, a couple of partial indexes can make sense. Like, if only zx_prod_id IN (1,4) are targeted:

CREATE INDEX order_book_ts_zx1_idx ON order_book (ts) WHERE zx_prod_id = 1;
CREATE INDEX order_book_ts_zx4_idx ON order_book (ts) WHERE zx_prod_id = 4;


Then the query can work with a (much) smaller index, and does not have to filter half of the rows.

ts obviously is type timestamptz. Still, you handle timestamps as bigint numbers representing microseconds. That split approach has downsides and pitfalls. It's typically best to work with timestamps across the whole food chain. Executing CAST(extract(EPOCH from ts)*1000000 as bigint) for every row is not free, and it adds up for half a million rows.
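A cast-free variant of the question's query (the client converts to microseconds only if it really needs them):

```sql
SELECT ts, ticks, quantity, side
FROM   order_book
WHERE  ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND    ts <  TO_TIMESTAMP(1618969033073383/1000000.0)
AND    zx_prod_id = 0
ORDER  BY ts, del DESC;
```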


Context

StackExchange Database Administrators Q#290860, answer score: 7
