patternsqlMajor
Index optimization with dates
Viewed 0 times
withindexdatesoptimization
Problem
I have a large table of objects (15M+ row) in PostgreSQL 9.0.8, for which I want to query for outdated field.
I want to divide the query by millions, for scalability & concurrency purposes, and I want to fetch all data with the updated_at field with a date of a few days ago.
I have tried many indexes, and queries, on a million ids, and I can't seem to get performance under 100 seconds with Heroku's Ronin hardware.
I am looking for suggestions I haven't tried to make this as efficient as possible.
TRY #1
TRY #2
TRY #3
TRY #4
TRY #5 (for a single month of outdated data)
TRY #6
TRY #7 (see: http://explain.depesz.com/s/DQP)
```
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE id/1000000 = 5 and (date(updated_at)) < (date(now())-7);
INDEX USED: ( (id/1000000 ) ASC ,date(updated_at) DESC NULLS LAST);
100732.049 ms
Second try: 872
I want to divide the query by millions, for scalability & concurrency purposes, and I want to fetch all data with the updated_at field with a date of a few days ago.
I have tried many indexes, and queries, on a million ids, and I can't seem to get performance under 100 seconds with Heroku's Ronin hardware.
I am looking for suggestions I haven't tried to make this as efficient as possible.
TRY #1
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE (date(updated_at)) = 5000001 AND id < 6000001;
INDEX USED: (date(updated_at),id)
268578.934 msTRY #2
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE ((date(now()) - (date(updated_at)) > 7)) AND id >= 5000001 AND id < 6000001;
INDEX USED: primary key
335555.144 msTRY #3
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5;
INDEX USED: (date(updated_at),(id/1000000))
243427.042 msTRY #4
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5 AND updated_at IS NOT NULL;
INDEX USED: (date(updated_at),(id/1000000)) WHERE updated_at IS NOT NULL
706714.812 msTRY #5 (for a single month of outdated data)
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE (EXTRACT(MONTH from date(updated_at)) = 8) AND id/1000000 = 5;
INDEX USED: (EXTRACT(MONTH from date(updated_at)),(id/1000000))
107241.472 msTRY #6
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE (date(updated_at)) < (date(now())-7) AND id/1000000 = 5;
INDEX USED: ( (id/1000000 ) ASC ,updated_at DESC NULLS LAST)
106842.395 msTRY #7 (see: http://explain.depesz.com/s/DQP)
```
EXPLAIN ANALYZE SELECT count(*) FROM objects
WHERE id/1000000 = 5 and (date(updated_at)) < (date(now())-7);
INDEX USED: ( (id/1000000 ) ASC ,date(updated_at) DESC NULLS LAST);
100732.049 ms
Second try: 872
Solution
First off, can it be? You write:
I want to fetch all data with the updated_at field with a date of a
few days ago.
But your
(date(updated_at))
Should be
Indexes
For optimal performance, you could ...
Your indexes could look like:
(Assuming
The second condition excludes irrelevant rows from the index right away, which should make it smaller and faster - depending on your actual data distribution. In accordance with my preliminary comment, I am assuming you want newer rows.
The condition also automatically excludes NULL values in
Related answer on SO:
To further optimize, you could use
This form of the full index matches the sort order of above partial indexes.
This will take a while, since the table is rewritten physically. It is also effectively a
You can then drop the index again (if it's unused). It's a one-time effect. I would at least try this once to see how much your queries benefit from it. The effect deteriorates with subsequent write operations. You could repeat this procedure at off hours if you see a substantial effect.
If your table receives a lot of write operations, you have to weigh cost and benefit for this step. For many
Query
More
A more advanced technique for index partitioning:
Among other things it provides example code for automatic index (re-)creation.
Make sure that
The settings in your question look good. So that's probably not an issue here and
Declarative partitioning
has finally matured in Postgres 12. I would consider using that now instead of manual index partitioning (or at least additionally). Range partitioning with
I want to fetch all data with the updated_at field with a date of a
few days ago.
But your
WHERE condition is:(date(updated_at))
Should be
>?Indexes
For optimal performance, you could ...
- partition your indexes
- exclude irrelevant rows from the indexes
- automatically recreate indexes at off-hours with updated predicate.
Your indexes could look like:
CREATE INDEX objects_id_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 0 AND 999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date
CREATE INDEX objects_id_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 1000000 AND 1999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date
-- etc.(Assuming
updated_at is type timestamp. With timestamptz, the cast to date is not IMMUTABLE and you need to define the time zone first ...)The second condition excludes irrelevant rows from the index right away, which should make it smaller and faster - depending on your actual data distribution. In accordance with my preliminary comment, I am assuming you want newer rows.
The condition also automatically excludes NULL values in
updated_at - which you seem to allow in the table and obviously want to exclude in the query. The usefulness of the index deteriorates over time. The query always retrieves the latest entries. Recreate the index with an updated WHERE clause periodically. This requires an exclusive lock on the table, so do it at off hours. There is also CREATE INDEX CONCURRENTLY to minimize the duration of locks:CREATE INDEX CONCURRENTLY objects_id_up_201211_idx ...; -- create new idx
DROP INDEX CONCURRENTLY objects_id_up_201210_idx; -- then drop oldDROP INDEX allows CONCURRENTLY since Postgres 9.2.Related answer on SO:
- Postgres returns records in wrong order
To further optimize, you could use
CLUSTER like we mentioned in the comments. But you need a full index for that. Doesn't work with a partial index. You would create temporarily:CREATE INDEX objects_full_idx ON objects (id/1000000, (updated_at::date) DESC NULLS LAST);This form of the full index matches the sort order of above partial indexes.
CLUSTER objects USING objects_full_idx;
ANALYZE objects;This will take a while, since the table is rewritten physically. It is also effectively a
VACUUM FULL. It needs an exclusive write lock on the table, so do it at off-hours - provided you can afford that at all. Again, there are less invasive alternatives: pg_repack or pg_squeeze.You can then drop the index again (if it's unused). It's a one-time effect. I would at least try this once to see how much your queries benefit from it. The effect deteriorates with subsequent write operations. You could repeat this procedure at off hours if you see a substantial effect.
If your table receives a lot of write operations, you have to weigh cost and benefit for this step. For many
UPDATEs consider setting a FILLFACTOR lower than 100. Do that before you CLUSTER.Query
SELECT count(*)
FROM objects
WHERE id BETWEEN 0 AND 999999 -- match conditions of partial index!
AND updated_at > '2012-10-01 0:0'::timestamp
AND updated_at::date > (now()::date - 7);More
A more advanced technique for index partitioning:
- Can spatial index help a "range - order by - limit" query
Among other things it provides example code for automatic index (re-)creation.
Make sure that
autovacuum is running properly. The huge gain by CLUSTER you have reported may be due in part to the implicit vacuuming that you get from CLUSTER. Maybe this is set up by Heroku automatically, not sure.The settings in your question look good. So that's probably not an issue here and
CLUSTER really was that effective.Declarative partitioning
has finally matured in Postgres 12. I would consider using that now instead of manual index partitioning (or at least additionally). Range partitioning with
updated_at as the partition key. There are also multiple improvements to general performance, big data and B-tree index performance in particular.Code Snippets
CREATE INDEX objects_id_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 0 AND 999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date
CREATE INDEX objects_id_updated_at_idx ON objects ((updated_at::date) DESC NULLS LAST)
WHERE id BETWEEN 1000000 AND 1999999
AND updated_at > '2012-10-01 0:0'::timestamp; -- some minimum date
-- etc.CREATE INDEX CONCURRENTLY objects_id_up_201211_idx ...; -- create new idx
DROP INDEX CONCURRENTLY objects_id_up_201210_idx; -- then drop oldCREATE INDEX objects_full_idx ON objects (id/1000000, (updated_at::date) DESC NULLS LAST);CLUSTER objects USING objects_full_idx;
ANALYZE objects;SELECT count(*)
FROM objects
WHERE id BETWEEN 0 AND 999999 -- match conditions of partial index!
AND updated_at > '2012-10-01 0:0'::timestamp
AND updated_at::date > (now()::date - 7);Context
StackExchange Database Administrators Q#27928, answer score: 35
Revisions (0)
No revisions yet.