HiveBrain v1.2.0

Order by on two columns very slow compared to order by on a single column

Submitted by: @import:stackexchange-dba

Problem

I am using Postgres, and I see that with ORDER BY on two columns my query is several orders of magnitude slower than with ORDER BY on only one column. The table under consideration has approximately 29.5 million rows.

Here are the results of three different queries:

With order by only on id:

EXPLAIN ANALYZE SELECT "api_meterdata"."id", "api_meterdata"."meter_id", "api_meterdata"."datetime", "api_meter"."id" FROM "api_meterdata" INNER JOIN "api_meter" ON ( "api_meterdata"."meter_id" = "api_meter"."id" ) ORDER BY "api_meterdata"."id" DESC LIMIT 100;

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..321.49 rows=100 width=20) (actual time=0.407..30.424 rows=100 loops=1)
   ->  Nested Loop  (cost=0.44..94824299.30 rows=29535145 width=20) (actual time=0.402..30.090 rows=100 loops=1)
         Join Filter: (api_meterdata.meter_id = api_meter.id)
         Rows Removed by Join Filter: 8147
         ->  Index Scan Backward using api_meterdata_pkey on api_meterdata  (cost=0.44..58053041.74 rows=29535145 width=16) (actual time=0.103..0.867 rows=100 loops=1)
         ->  Materialize  (cost=0.00..2.25 rows=83 width=4) (actual time=0.002..0.144 rows=82 loops=100)
               ->  Seq Scan on api_meter  (cost=0.00..1.83 rows=83 width=4) (actual time=0.008..0.153 rows=83 loops=1)
 Planning time: 0.491 ms
 Execution time: 30.701 ms
(9 rows)


With order by only on datetime:

EXPLAIN ANALYZE SELECT "api_meterdata"."id", "api_meterdata"."meter_id", "api_meterdata"."datetime", "api_meter"."id" FROM "api_meterdata" INNER JOIN "api_meter" ON ( "api_meterdata"."meter_id" = "api_meter"."id" ) ORDER BY "api_meterdata"."datetime" ASC LIMIT 100;

Solution

The difference in timings comes down to several factors:

  • Your queries don't have a WHERE clause that filters which rows to retrieve.

  • The speed of the query is determined by the fact that you DO have a LIMIT clause.

  • If the planner can find an index by which it can retrieve the rows of your query in the same order that your ORDER BY specifies, it will start picking them one by one, until it has read 100 (the number specified by your LIMIT clause). A multi-column index will need to have the same columns, in the same order, and with the same ASC/DESC sort directions (or all of them reversed).

  • If the planner does not have an index fulfilling this role, it will have to perform a Sort step: put all rows in order in a (temporary, virtual) table, and then retrieve the first 100 rows.
Retrieving all the data (not just 100 already-ordered rows), joining all of it, and then running a Sort step is what causes such a big difference in performance. This can be seen very clearly by using explain.depesz.com.
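As a sketch of the fix: since the exact two-column ORDER BY from the question isn't reproduced above, assume for illustration that it was ORDER BY datetime, id. An index whose columns, order, and sort directions match the ORDER BY column-for-column lets the planner stream rows straight off the index and stop at the LIMIT:

```sql
-- Hypothetical index: assumes the slow query used
-- ORDER BY datetime ASC, id ASC. Adjust the columns and
-- directions to match your actual ORDER BY clause exactly.
CREATE INDEX api_meterdata_datetime_id_idx
    ON api_meterdata (datetime ASC, id ASC);

-- The planner can now walk this index in ORDER BY order and
-- stop after 100 rows, instead of sorting ~29.5 million rows.
```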

Find a simulation of your scenario at dbfiddle here, with the different cases covered and explained, and with @ypercube's suggestion for another index taken into account. Also note that some of your indexes were redundant.

DDL for your scenario, and some simulated data:

CREATE TABLE api_meter
(
    id INTEGER PRIMARY KEY
) ;
INSERT INTO 
    api_meter
    (id)
SELECT
    generate_series(1, 83) ;


... and for the table holding your meter_data

CREATE TABLE api_meterdata
(
    id serial /* integer */ PRIMARY KEY,
    meter_id integer REFERENCES api_meter(id),
    datetime timestamp NOT NULL default now()
) ;

-- The PK will have made an implicit index ON (id)

-- Index on (meter_id, datetime); which is probably the *NATURAL KEY*
CREATE UNIQUE INDEX api_meterdata_meter_id_datetime_unique 
    ON api_meterdata (meter_id, datetime) ;

-- The following index is redundant, the column meter_id is already the first in
-- the previous one.
-- CREATE INDEX api_meterdata_meter_id_idx 
--    ON api_meterdata (meter_id) ;

CREATE INDEX api_meterdata_datetime_idx 
    ON api_meterdata (datetime) ;
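To verify which indexes actually exist on the table (and to spot redundant ones like the commented-out meter_id index above), PostgreSQL's pg_indexes catalog view can be queried:

```sql
-- List all indexes on api_meterdata, with their definitions
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'api_meterdata';
```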


... some simulated data (648001 rows, to make it realistic). It is less data than you have, but DBFiddle reaches its limits if I try to insert more:

INSERT INTO 
    api_meterdata
    (meter_id, datetime)
SELECT
    random()*82+1, d
FROM
    generate_series(timestamp '2017-01-01', timestamp '2017-01-31', 
                    interval '4 second') AS s(d);

-- Make sure statistics are good
ANALYZE api_meterdata;
ANALYZE api_meter;


Analysis of your 1st query

-- This query doesn't have a WHERE clause, so, indexes will be used based on 
-- ORDER BY + LIMIT (and, eventually, column coverage)
--
-- * The index helping this case is the one corresponding to the PK of 
--   api_meter_data, used in DESC order
-- * A second index will help: the one used for the JOIN condition
-- * How PostgreSQL chooses to JOIN will depend on specific data value
--   distributions, sizes, etc.
EXPLAIN ANALYZE
SELECT 
    api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime, 
    api_meter.id 
FROM 
    api_meterdata 
    INNER JOIN api_meter ON ( api_meterdata.meter_id = api_meter.id ) 
ORDER BY 
    api_meterdata.id DESC 
LIMIT 100;


| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Limit (cost=0.57..20.71 rows=100 width=20) (actual time=0.033..0.188 rows=100 loops=1) |
| -> Nested Loop (cost=0.57..130514.61 rows=648001 width=20) (actual time=0.031..0.175 rows=100 loops=1) |
| -> Index Scan Backward using api_meterdata_pkey on api_meterdata (cost=0.42..20342.44 rows=648001 width=16) (actual time=0.023..0.038 rows=100 loops=1) |
| -> Index Only Scan using api_meter_pkey on api_meter (cost=0.14..0.16 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=100) |
| Index Cond: (id = api_meterdata.meter_id) |
| Heap Fetches: 100 |
| Planning time: 0.331 ms |
| Execution time: 0.216 ms |
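When comparing plans like these, EXPLAIN's BUFFERS option (a standard PostgreSQL EXPLAIN option) additionally reports how many pages each node touched, which makes the cost of a full Sort versus an index walk concrete:

```sql
-- BUFFERS adds shared-buffer hit/read counts per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime
FROM api_meterdata
ORDER BY api_meterdata.id DESC
LIMIT 100;
```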

Analysis of your 2nd query
-- This query doesn't have a WHERE clause either, so indexes will be used
-- based on ORDER BY + LIMIT (and, eventually, column coverage).
-- * The index helping this case is the one corresponding to
--   ON api_meterdata (datetime), because that's the only column used in the
--   ORDER BY.
-- * A second index will help: the one used for the JOIN condition
-- * How PostgreSQL chooses to JOIN will depend on specific data value
--   distributions
EXPLAIN ANALYZE 
SELECT 
    api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime, 
    api_meter.id 
FROM 
    api_meterdata 
    INNER JOIN api_meter ON ( api_meterdata.meter_id = api_meter.id ) 
ORDER BY 
    api_meterdata.datetime ASC 
LIMIT 100;
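The slow two-column case from the question isn't reproduced above. A sketch of what it would look like, assuming (hypothetically) that the ORDER BY was on (datetime, id): with only single-column indexes available, no index matches both columns in order, so the plan gains a Sort over the full join result before the LIMIT can be applied, which is where the orders-of-magnitude slowdown comes from.

```sql
-- Hypothetical third case: two-column ORDER BY (assumed columns).
-- With only single-column indexes on (id) and (datetime), this
-- forces a full Sort before the LIMIT can be applied.
EXPLAIN ANALYZE
SELECT
    api_meterdata.id, api_meterdata.meter_id, api_meterdata.datetime,
    api_meter.id
FROM
    api_meterdata
    INNER JOIN api_meter ON ( api_meterdata.meter_id = api_meter.id )
ORDER BY
    api_meterdata.datetime ASC,
    api_meterdata.id ASC
LIMIT 100;

-- A matching two-column index would remove the Sort:
-- CREATE INDEX ON api_meterdata (datetime, id);
```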

Context

StackExchange Database Administrators Q#178344, answer score: 8

Revisions (0)

No revisions yet.