HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Slow index scans in large table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
slowscanslargeindextable

Problem

Update 2020-08-04:

Since this answer is apparently still being viewed regularly I wanted to provide an update on the situation. We're currently using PG 11 with table partitioning on timestamp and are easily handling a few billion rows in the table(s). The index-only scans are a life saver and it wouldn't be possible without.

Using PostgreSQL 9.2, I have troubles with slow queries on a relatively large table (200+ million rows). I'm not trying anything crazy, just adding historic values. Below is the query and the query plan output.

My table layout:

Table "public.energy_energyentry"
  Column   |           Type           |                            Modifiers
-----------+--------------------------+-----------------------------------------------------------------
 id        | integer                  | not null default nextval('energy_energyentry_id_seq'::regclass)
 prop_id   | integer                  | not null
 timestamp | timestamp with time zone | not null
 value     | double precision         | not null
Indexes:
    "energy_energyentry_pkey" PRIMARY KEY, btree (id)
    "energy_energyentry_prop_id" btree (prop_id)
    "energy_energyentry_prop_id_timestamp_idx" btree (prop_id, "timestamp")
Foreign-key constraints:
    "energy_energyentry_prop_id_fkey" FOREIGN KEY (prop_id) REFERENCES gateway_peripheralproperty(id) DEFERRABLE INITIALLY DEFERRED


The data ranges from 2012-01-01 till now, with new data constantly being added. There are about 2.2k distinct values in the prop_id foreign key, distributed evenly.

I notice that the row estimates aren't far off, but the cost estimates seem larger by factor 4x. This probably isn't an issue, but is there anything I could do about it?

I expect that disk access might be the issue, since the table isn't in memory all the time.

EXPLAIN ANALYZE 
SELECT SUM("value") 
FROM "energy_energyentry" 
WHERE 
  "prop_id"=82411 
  AND "timestamp">'2014-06-11' 
  AND "timestamp"<'2014-11-11'
;


```
Aggregate (cost

Solution

Your table is big, and so is any index spanning the whole table. Assuming that:

  • only new data (with timestamp = now()) is entered



  • existing rows are neither changed nor deleted.



  • you have data since 2012-01-01 but queries are predominantly on the current year (?)



I would suggest a partial, multi-column (covering!) index:

CREATE INDEX ON energy_energyentry (prop_id, "timestamp", value)
WHERE "timestamp" >= '2014-01-01 0:0';  -- adapt to your needs


In Postgres 11 or later use a "covering" index instead:

CREATE INDEX ON energy_energyentry (prop_id, "timestamp") INCLUDE (value)
WHERE "timestamp" >= '2014-01-01 0:0';


See:

  • Do covering indexes in PostgreSQL help JOIN columns?



Only include the time range that is queried regularly. Only makes sense if you cut off most of the big table this way. Effectiveness deteriorates over time with new entries. Recreate the index from time to time. (You may need to adapt your queries.) See linked answer below.

The last column value is only included to get index-only scans out of this. Aggressive autovacuum setting may help by keeping the visibility map up to date, like @jjanes already mentioned.

The partial index should fit into RAM more easily and stay there longer.

You may need to include this WHERE condition in queries to make the planner understand the index is applicable to the query, like:

SELECT sum(value) AS sum_value
FROM   energy_energyentry
WHERE  prop_id = 82411 
AND   "timestamp" > '2014-06-11 0:0' 
AND   "timestamp" = '2014-01-01 0:0'; -- seems redundant, but may be needed


Since your query is summing a lot of rows (rows=13578), this is going to take some time, even with an index-only scan. It shouldn't be anywhere near 50 seconds, though. Less than a second on any halfway decent hardware.

Related (but ignore CLUSTER and FILLFACTOR, both are irrelevant if you can get index-only scans out of this):

  • Index optimization with dates



Aside:

Since you currently have an index on (prop_id, "timestamp"), the additional index on just (prop_id) may cost more than it's worth:

  • Is a composite index also good for queries on the first field?

Code Snippets

CREATE INDEX ON energy_energyentry (prop_id, "timestamp", value)
WHERE "timestamp" >= '2014-01-01 0:0';  -- adapt to your needs
CREATE INDEX ON energy_energyentry (prop_id, "timestamp") INCLUDE (value)
WHERE "timestamp" >= '2014-01-01 0:0';
SELECT sum(value) AS sum_value
FROM   energy_energyentry
WHERE  prop_id = 82411 
AND   "timestamp" > '2014-06-11 0:0' 
AND   "timestamp" < '2014-11-11 0:0'
AND   "timestamp" >= '2014-01-01 0:0'; -- seems redundant, but may be needed

Context

StackExchange Database Administrators Q#81456, answer score: 20

Revisions (0)

No revisions yet.