Order by on two columns very slow compared to order by on a single column
Problem
I am using Postgres, and I see that with an order by on two columns my query is several orders of magnitude slower than an order by on only one column. I have approximately 29.5 million rows in the table under consideration.
Here are the results of three different queries.

With order by only on id:

```
EXPLAIN ANALYZE SELECT "api_meterdata"."id", "api_meterdata"."meter_id", "api_meterdata"."datetime", "api_meter"."id" FROM "api_meterdata" INNER JOIN "api_meter" ON ( "api_meterdata"."meter_id" = "api_meter"."id" ) ORDER BY "api_meterdata"."id" DESC LIMIT 100;

                                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..321.49 rows=100 width=20) (actual time=0.407..30.424 rows=100 loops=1)
   ->  Nested Loop  (cost=0.44..94824299.30 rows=29535145 width=20) (actual time=0.402..30.090 rows=100 loops=1)
         Join Filter: (api_meterdata.meter_id = api_meter.id)
         Rows Removed by Join Filter: 8147
         ->  Index Scan Backward using api_meterdata_pkey on api_meterdata  (cost=0.44..58053041.74 rows=29535145 width=16) (actual time=0.103..0.867 rows=100 loops=1)
         ->  Materialize  (cost=0.00..2.25 rows=83 width=4) (actual time=0.002..0.144 rows=82 loops=100)
               ->  Seq Scan on api_meter  (cost=0.00..1.83 rows=83 width=4) (actual time=0.008..0.153 rows=83 loops=1)
 Planning time: 0.491 ms
 Execution time: 30.701 ms
(9 rows)
```

With order by only on datetime:

```
EXPLAIN ANALYZE SELECT "api_meterdata"."id", "api_meterdata"."meter_id", "api_meterdata"."datetime", "api_meter"."id" FROM "api_meterdata" INNER JOIN "api_meter" ON ( "api_meterdata"."meter_id" = "api_meter"."id" ) ORDER BY "api_meterdata"."datetime" ASC LIMIT 100;
```
Solution
The difference in timings is due to several facts:
- Your queries don't have a WHERE clause that filters out which results to retrieve.
- The speed of the query is defined by the fact that you DO have a LIMIT clause.
- If the planner can find an index by which it can retrieve the rows of your query in the same order that your ORDER BY specifies, it will start picking them one by one until it has read 100 (the number specified by your LIMIT clause). The index, if it is multi-column, will need to have the same columns, in the same order, and with the same ASC/DESC sort directions (or all of them reversed).
- If the planner cannot find an index fulfilling this role, it will have to perform a Sort step, put all rows in order in a (temporary, virtual) table, and then retrieve the first 100 rows.
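As an illustration of the last two points: if the slow two-column query were, say, ordered by `(datetime, id)` (an assumption, since the exact two-column ORDER BY isn't shown above), an index with those same columns in that same order would let the planner skip the Sort step entirely:

```sql
-- Hypothetical index, assuming the slow query ends in
-- ORDER BY datetime ASC, id ASC. With it in place, the planner can walk
-- the index in order and stop after 100 rows instead of sorting all
-- ~29.5 million of them first.
CREATE INDEX api_meterdata_datetime_id_idx
    ON api_meterdata (datetime, id);
```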
The need for retrieving all the data (not just 100 rows already ordered), having to join all of them and then having a Sort step is what's causing such a big difference in performance. This can be seen crystal clear by using explain.depesz.com.
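A quick way to see this for yourself is to EXPLAIN a two-column ordering that no index matches and look for a Sort node feeding the Limit. This is a sketch against the fiddle schema below; the `(datetime, id)` pair is an assumed example of an un-indexed ordering, not necessarily your exact query:

```sql
-- Hypothetical two-column ordering with no matching index: the plan will
-- typically show Limit -> Sort (a top-N heapsort fed by a scan of the whole
-- table) rather than an Index Scan that stops after 100 rows.
EXPLAIN ANALYZE
SELECT api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime
FROM api_meterdata
ORDER BY api_meterdata.datetime ASC, api_meterdata.id ASC
LIMIT 100;
```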
Find a simulation of your scenario at dbfiddle here, with the different cases covered and explained, and the suggestion from @ypercube taken into account for another index. Also note that some of your indexes were redundant.
DDL for your scenario, and some simulated data:
```
CREATE TABLE api_meter
(
    id INTEGER PRIMARY KEY
) ;

INSERT INTO
    api_meter
    (id)
SELECT
    generate_series(1, 83) ;
```

... and for the table holding your meter_data:
```
CREATE TABLE api_meterdata
(
    id serial /* integer */ PRIMARY KEY,
    meter_id integer REFERENCES api_meter(id),
    datetime timestamp NOT NULL default now()
) ;

-- The PK will have made an implicit index ON (id)

-- Index on (meter_id, datetime), which is probably the *NATURAL KEY*
CREATE UNIQUE INDEX api_meterdata_meter_id_datetime_unique
    ON api_meterdata (meter_id, datetime) ;

-- The following index is redundant: the column meter_id is already the first
-- in the previous one.
-- CREATE INDEX api_meterdata_meter_id_idx
--     ON api_meterdata (meter_id) ;

CREATE INDEX api_meterdata_datetime_idx
    ON api_meterdata (datetime) ;
```

... and some simulated data (648001 rows, to make it realistic). The data is less than what you have, but DBFiddle reaches its limits if I try to put more:
```
INSERT INTO
    api_meterdata
    (meter_id, datetime)
SELECT
    random()*82+1, d
FROM
    generate_series(timestamp '2017-01-01', timestamp '2017-01-31',
                    interval '4 second') AS s(d);

-- Make sure statistics are good
ANALYZE api_meterdata;
ANALYZE api_meter;
```

Analysis of your 1st query:
```
-- This query doesn't have a WHERE clause, so indexes will be used based on
-- ORDER BY + LIMIT (and, eventually, column coverage)
--
-- * The index helping this case is the one corresponding to the PK of
--   api_meterdata, used in DESC order
-- * A second index will help: the one used for the JOIN condition
-- * How PostgreSQL chooses to JOIN will depend on specific data value
--   distribution, sizes, etc.
EXPLAIN ANALYZE
SELECT
    api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime,
    api_meter.id
FROM
    api_meterdata
    INNER JOIN api_meter ON ( api_meterdata.meter_id = api_meter.id )
ORDER BY
    api_meterdata.id DESC
LIMIT 100;
```

| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Limit (cost=0.57..20.71 rows=100 width=20) (actual time=0.033..0.188 rows=100 loops=1) |
| -> Nested Loop (cost=0.57..130514.61 rows=648001 width=20) (actual time=0.031..0.175 rows=100 loops=1) |
| -> Index Scan Backward using api_meterdata_pkey on api_meterdata (cost=0.42..20342.44 rows=648001 width=16) (actual time=0.023..0.038 rows=100 loops=1) |
| -> Index Only Scan using api_meter_pkey on api_meter (cost=0.14..0.16 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=100) |
| Index Cond: (id = api_meterdata.meter_id) |
| Heap Fetches: 100 |
| Planning time: 0.331 ms |
| Execution time: 0.216 ms |
Analysis of your 2nd query:

```
-- This query doesn't have a WHERE clause either, so indexes will be used
-- based on ORDER BY + LIMIT (and, eventually, column coverage).
-- * The index helping this case is the one corresponding to
--   ON api_meterdata (datetime), because that's the only column used in the
--   ORDER BY.
-- * A second index will help: the one used for the JOIN condition
-- * How PostgreSQL chooses to JOIN will depend on specific data value
--   distribution
EXPLAIN ANALYZE
SELECT
    api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime,
    api_meter.id
FROM
    api_meterdata
    INNER JOIN api_meter ON ( api_meterdata.meter_id = api_meter.id )
ORDER BY
    api_meterdata.datetime ASC
LIMIT 100;
```

Context
StackExchange Database Administrators Q#178344, answer score: 8