patternsqlMinor
Index for inequality on multiple columns (col1 >= 'const_1' AND col2 <= 'const_2')
Viewed 0 times
const_2const_1columnscol2inequalityformultipleandindexcol1
Problem
I have the following query:
I've created the following indexes:
Yet when I run the query, it scans all elements in plane_tracks.
When I drop the updated_at part, like this:
Then I get this:
At least this last part makes sense.
But I don't understand why
Tom
SELECT * FROM plane_tracks
WHERE created_at >= '2016-12-29 08:00:00' AND updated_at <= '2016-12-30 00:00:00';I've created the following indexes:
create index plane_tracks_on_created_at on plane_tracks(created_at);
create index plane_tracks_on_updated_at on plane_tracks(updated_at);
create index plane_tracks_on_created_at_updated_at on plane_tracks(created_at,updated_at);
create index plane_tracks_on_created_at_updated_at_desc on plane_tracks(created_at,updated_at DESC);Yet when I run the query, it scans all elements in plane_tracks.
# EXPLAIN ANALYZE SELECT * FROM plane_tracks WHERE created_at >= '2016-12-29 08:00:00' AND updated_at = '2016-12-29 08:00:00'::timestamp without time zone) AND (updated_at <= '2016-12-30 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 15941
Planning time: 0.366 ms
Execution time: 3.618 ms
(5 rows)When I drop the updated_at part, like this:
SELECT * FROM plane_tracks WHERE created_at >= '2016-12-29 08:00:00';Then I get this:
# EXPLAIN ANALYZE SELECT * FROM plane_tracks WHERE created_at >= '2016-12-29 08:00:00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on plane_tracks (cost=0.00..537.99 rows=8595 width=471) (actual time=0.022..3.261 rows=8593 loops=1)
Filter: (created_at >= '2016-12-29 08:00:00'::timestamp without time zone)
Rows Removed by Filter: 9326
Planning time: 0.093 ms
Execution time: 3.901 ms
(5 rows)At least this last part makes sense.
But I don't understand why
- the first query isn't at least as efficient as the second one.
- why first query isn't much better than the second one, given that there are also indexes on updated_at.
- what index to create to make it really efficient
Tom
Solution
Whether PostgreSQL will use one index or not depends (a lot!) on the actual values of your tables, the conditions on your queries, whether statistics are up-to-date for the values in the table, and your configuration settings. So... whatever comes next depends a lot on your actual data. If you can provide a better description of your actual values, I guess we'll be able to better help you.
Anyhow, a first guess:
I assume your table has a structure similar to this:
I populate the table with some simulated data:
I simulate that (about) half your rows are updated, and that changes "udpated_at":
I make sure PostgreSQL has good statistics:
If I use under these conditions an index on (created_at, updated_at) with your query:
The index IS really used:
The reason why the index is used: PostgreSQL thinks it will help it select 12.209 out of the 60.475, and this will take less time than a sequential scan. This is giving good selectivity.
If you just change your query, and you make the condition broader, see what happens:
The reason: PostgreSQL has decided that, if the number of rows returned is going to be much higher than before, it makes more sense to scan the whole table than to use the index. Using an index means: first check the index, then gather the data from the main table. The savings on data gathering (SELECT *) would not be compensated by the expense of also checking the index.
In practice: I'd probably have one ("reasonable") index, and let PostgreSQL decide whether it is worth using it or not. The way you use your index (one comparison >=, the other
-
The choice of sequential versus index scans depends not only on your data, but also on a number of PostgreSQL settings. You can actually force PostgreSQL to not use index scans, or set some cost parameters for random access so high that, in practice, the usage of indexes will be heavily restricted. Check the Query Planning documentation.
-
When you're not sure about which indices to use, you can start by creating (a few) more than you actually need, analyze the index usage statistics after the system has been running in practice for a period that you think is long enough, and remove the seldom used ones.
Anyhow, a first guess:
I assume your table has a structure similar to this:
CREATE TABLE plane_tracks
(
plane_track_id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
updated_at TIMESTAMP WITHOUT TIME ZONE NULL,
some_pay_load TEXT
) ;I populate the table with some simulated data:
-- Generate data for a few days (60.475 rows).
INSERT INTO
plane_tracks (created_at, some_pay_load)
SELECT
generate_series(timestamp '2016-12-25 00:00:00',
timestamp '2016-12-31 23:59:00', interval '10 seconds'), 'payload' ;I simulate that (about) half your rows are updated, and that changes "udpated_at":
-- Update randomly some of these data (just simulate updated_at)
UPDATE
plane_tracks
SET
updated_at = created_at + ((random()*10) || ' minutes')::interval
WHERE
plane_track_id IN
(SELECT plane_track_id FROM plane_tracks ORDER BY random() LIMIT 30000) ;I make sure PostgreSQL has good statistics:
ANALYZE VERBOSE plane_tracks ;If I use under these conditions an index on (created_at, updated_at) with your query:
CREATE INDEX
plane_tracks_on_created_at_updated_at
ON plane_tracks(created_at,updated_at);The index IS really used:
explain analyze SELECT * FROM plane_tracks
WHERE created_at >= '2016-12-29 08:00:00'
AND updated_at = '2016-12-29 08:00:00'::timestamp without time zone) AND (updated_at Bitmap Index Scan on plane_tracks_on_created_at_updated_at (cost=0.00..887.29 rows=12209 width=0) (actual time=4.596..4.596 rows=2834 loops=1)
Index Cond: ((created_at >= '2016-12-29 08:00:00'::timestamp without time zone) AND (updated_at <= '2016-12-30 00:00:00'::timestamp without time zone))
Planning time: 0.147 ms
Execution time: 5.443 msThe reason why the index is used: PostgreSQL thinks it will help it select 12.209 out of the 60.475, and this will take less time than a sequential scan. This is giving good selectivity.
If you just change your query, and you make the condition broader, see what happens:
explain analyze SELECT * FROM plane_tracks
WHERE created_at >= '2016-12-26 08:00:00' /* 26 instead of 29 */
AND updated_at = '2016-12-26 08:00:00'::timestamp without time zone) AND (updated_at <= '2016-12-31 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 40482
Planning time: 0.086 ms
Execution time: 22.441 msThe reason: PostgreSQL has decided that, if the number of rows returned is going to be much higher than before, it makes more sense to scan the whole table than to use the index. Using an index means: first check the index, then gather the data from the main table. The savings on data gathering (SELECT *) would not be compensated by the expense of also checking the index.
In practice: I'd probably have one ("reasonable") index, and let PostgreSQL decide whether it is worth using it or not. The way you use your index (one comparison >=, the other
-
The choice of sequential versus index scans depends not only on your data, but also on a number of PostgreSQL settings. You can actually force PostgreSQL to not use index scans, or set some cost parameters for random access so high that, in practice, the usage of indexes will be heavily restricted. Check the Query Planning documentation.
-
When you're not sure about which indices to use, you can start by creating (a few) more than you actually need, analyze the index usage statistics after the system has been running in practice for a period that you think is long enough, and remove the seldom used ones.
Code Snippets
CREATE TABLE plane_tracks
(
plane_track_id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
updated_at TIMESTAMP WITHOUT TIME ZONE NULL,
some_pay_load TEXT
) ;-- Generate data for a few days (60.475 rows).
INSERT INTO
plane_tracks (created_at, some_pay_load)
SELECT
generate_series(timestamp '2016-12-25 00:00:00',
timestamp '2016-12-31 23:59:00', interval '10 seconds'), 'payload' ;-- Update randomly some of these data (just simulate updated_at)
UPDATE
plane_tracks
SET
updated_at = created_at + ((random()*10) || ' minutes')::interval
WHERE
plane_track_id IN
(SELECT plane_track_id FROM plane_tracks ORDER BY random() LIMIT 30000) ;ANALYZE VERBOSE plane_tracks ;CREATE INDEX
plane_tracks_on_created_at_updated_at
ON plane_tracks(created_at,updated_at);Context
StackExchange Database Administrators Q#159679, answer score: 2
Revisions (0)
No revisions yet.