HiveBrain v1.2.0

Index optimization with dates

Submitted by: @import:stackexchange-dba
Tags: with, index, dates, optimization

Problem

I have a large table of objects (15M+ rows) in PostgreSQL 9.0.8, which I want to query for rows with an outdated updated_at field.

I want to divide the query into million-row chunks of ids, for scalability and concurrency, and I want to fetch all rows whose updated_at field has a date of a few days ago.

I have tried many indexes and queries, working on a million ids at a time, and I can't seem to get performance under 100 seconds on Heroku's Ronin hardware.

I am looking for suggestions I haven't tried to make this as efficient as possible.
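The "divide by millions" idea above can be sketched as follows. This is an illustrative helper (the function name, bucket size, and generated SQL strings are my assumptions, not part of the original question): it splits the id space into million-row buckets and builds one count query per bucket, so each worker scans only its own range.

```python
# Sketch: split a ~15M-row id space into million-row buckets so each
# worker can run its own bounded query. Names and ranges are illustrative.

def id_buckets(max_id: int, bucket_size: int = 1_000_000):
    """Yield (lo, hi) inclusive bounds, matching `id BETWEEN lo AND hi`."""
    lo = 0
    while lo <= max_id:
        hi = lo + bucket_size - 1
        yield (lo, hi)
        lo = hi + 1

# One query per bucket; each worker would take one of these.
queries = [
    f"SELECT count(*) FROM objects WHERE id BETWEEN {lo} AND {hi} "
    f"AND updated_at::date < now()::date - 7"
    for lo, hi in id_buckets(2_999_999)
]
# → three buckets: (0, 999999), (1000000, 1999999), (2000000, 2999999)
```

Using `id BETWEEN lo AND hi` instead of `id/1000000 = n` keeps the predicate sargable against a plain b-tree on id, which matters for the index designs discussed below.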

TRY #1

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE (date(updated_at)) < (date(now())-7) AND id >= 5000001 AND id < 6000001;
 INDEX USED: (date(updated_at),id)
 268578.934 ms


TRY #2

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE ((date(now()) - (date(updated_at)) > 7)) AND id >= 5000001 AND id < 6000001;
 INDEX USED: primary key
 335555.144 ms


TRY #3

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5;
 INDEX USED: (date(updated_at),(id/1000000))
 243427.042 ms


TRY #4

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5 AND updated_at IS NOT NULL;
 INDEX USED: (date(updated_at),(id/1000000)) WHERE updated_at IS NOT NULL 
 706714.812 ms


TRY #5 (for a single month of outdated data)

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE (EXTRACT(MONTH from date(updated_at)) = 8) AND id/1000000 = 5;
 INDEX USED: (EXTRACT(MONTH from date(updated_at)),(id/1000000))
 107241.472 ms


TRY #6

EXPLAIN ANALYZE SELECT count(*) FROM objects
 WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5;
 INDEX USED: ( (id/1000000 ) ASC ,updated_at DESC NULLS LAST)
 106842.395 ms


TRY #7 (see: http://explain.depesz.com/s/DQP)

EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE id/1000000 = 5 and (date(updated_at)) < (date(now())-7);
INDEX USED: ( (id/1000000 ) ASC ,date(updated_at) DESC NULLS LAST);
100732.049 ms
Second run: 872 ms

Solution

First off, is that really what you want? You write:

I want to fetch all data with the updated_at field with a date of a
few days ago.

But your WHERE condition is:

(date(updated_at)) < (date(now()) - 7)

Shouldn't that be >?

Indexes

For optimal performance, you could ...

  • partition your indexes
  • exclude irrelevant rows from the indexes
  • automatically recreate indexes at off hours with an updated predicate

Your indexes could look like:

-- Each partial index needs a distinct name:
CREATE INDEX objects_id0_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE  id BETWEEN 0 AND 999999
AND    updated_at > '2012-10-01 0:0'::timestamp;  -- some minimum date

CREATE INDEX objects_id1_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE  id BETWEEN 1000000 AND 1999999
AND    updated_at > '2012-10-01 0:0'::timestamp;  -- some minimum date

-- etc.


(Assuming updated_at is type timestamp. With timestamptz, the cast to date is not IMMUTABLE and you need to define the time zone first ...)

The second condition excludes irrelevant rows from the index right away, which should make it smaller and faster - depending on your actual data distribution. In accordance with my preliminary comment, I am assuming you want newer rows.

The condition also automatically excludes NULL values in updated_at - which you seem to allow in the table and obviously want to exclude in the query.

The usefulness of the index deteriorates over time, since the query always retrieves the latest entries. Recreate the index with an updated WHERE clause periodically. This requires an exclusive lock on the table, so do it at off hours. There is also CREATE INDEX CONCURRENTLY to minimize the duration of locks:

CREATE INDEX CONCURRENTLY objects_id_up_201211_idx ...; -- create new idx
DROP INDEX CONCURRENTLY objects_id_up_201210_idx;       -- then drop old


DROP INDEX allows CONCURRENTLY since Postgres 9.2.
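Spelled out, one rotation step could look like this (the index names, date cutoff, and id range are illustrative assumptions following the naming scheme above, not commands from the original answer):

```sql
-- Hypothetical monthly rotation: build the new index with a fresher
-- minimum date, then drop the stale one. Requires Postgres 9.2+ for
-- DROP INDEX CONCURRENTLY.
CREATE INDEX CONCURRENTLY objects_id_up_201211_idx
ON objects ((updated_at::date) DESC NULLS LAST)
WHERE  id BETWEEN 0 AND 999999
AND    updated_at > '2012-11-01 0:0'::timestamp;

DROP INDEX CONCURRENTLY objects_id_up_201210_idx;
```

Embedding the cutoff month in the index name makes it easy to see at a glance which predicate each index carries.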

Related answer on SO:

  • Postgres returns records in wrong order



To further optimize, you could use CLUSTER, as mentioned in the comments. But you need a full index for that; it doesn't work with a partial index. You would temporarily create:

CREATE INDEX objects_full_idx ON objects (id/1000000, (updated_at::date) DESC NULLS LAST);


This form of the full index matches the sort order of above partial indexes.

CLUSTER objects USING objects_full_idx;
ANALYZE objects;


This will take a while, since the table is rewritten physically. It is also effectively a VACUUM FULL. It needs an exclusive write lock on the table, so do it at off-hours - provided you can afford that at all. Again, there are less invasive alternatives: pg_repack or pg_squeeze.

You can then drop the index again (if it's unused). It's a one-time effect. I would at least try this once to see how much your queries benefit from it. The effect deteriorates with subsequent write operations. You could repeat this procedure at off hours if you see a substantial effect.

If your table receives a lot of write operations, you have to weigh cost and benefit for this step. For many UPDATEs consider setting a FILLFACTOR lower than 100. Do that before you CLUSTER.
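The order of operations matters here. A minimal sketch (the fillfactor value of 90 is an assumption; tune it to your update rate):

```sql
-- Lower fillfactor leaves free space on each page for HOT updates.
ALTER TABLE objects SET (fillfactor = 90);

-- Rewrite the table *after* changing fillfactor, so the rewritten
-- pages are laid out with the new free-space target.
CLUSTER objects USING objects_full_idx;
ANALYZE objects;
```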

Query

SELECT count(*)
FROM   objects
WHERE  id BETWEEN 0 AND 999999  -- match conditions of partial index!
AND    updated_at > '2012-10-01 0:0'::timestamp
AND    updated_at::date > (now()::date - 7);


More

A more advanced technique for index partitioning:

  • Can spatial index help a "range - order by - limit" query



Among other things it provides example code for automatic index (re-)creation.

Make sure that autovacuum is running properly. The huge gain by CLUSTER you have reported may be due in part to the implicit vacuuming that you get from CLUSTER. Maybe this is set up by Heroku automatically, not sure.

The settings in your question look good. So that's probably not an issue here and CLUSTER really was that effective.

Declarative partitioning

has finally matured in Postgres 12. I would consider using that now instead of manual index partitioning (or at least additionally). Range partitioning with updated_at as the partition key. There are also multiple improvements to general performance, big data and B-tree index performance in particular.
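A sketch of what that could look like for this table (column list, table names, and monthly bounds are illustrative assumptions; the real table surely has more columns):

```sql
-- Postgres 12+: range partitioning on updated_at replaces the manual
-- partial-index scheme. Each partition gets its own (smaller) indexes.
CREATE TABLE objects_part (
    id         bigint NOT NULL,
    updated_at timestamp
) PARTITION BY RANGE (updated_at);

CREATE TABLE objects_2012_10 PARTITION OF objects_part
    FOR VALUES FROM ('2012-10-01') TO ('2012-11-01');

CREATE TABLE objects_2012_11 PARTITION OF objects_part
    FOR VALUES FROM ('2012-11-01') TO ('2012-12-01');
```

With a WHERE clause on updated_at, the planner then prunes partitions outside the requested range instead of relying on hand-maintained partial-index predicates.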


Context

StackExchange Database Administrators Q#27928, answer score: 35
