HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Really slow DISTINCT ON query with multiple joins

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctreallywithqueryslowmultiplejoins

Problem

Originally posted:
https://stackoverflow.com/questions/11173717/expensive-query-on-select-distinct-with-multiple-inner-join-in-postgres

The songs table has only about 4k rows, posts and stations have less. Running the query without the DISTINCT ON fixes it.

Running Postgres on Mac OS X Lion.

```
Song Load (7358.2ms)

EXPLAIN (426.2ms)

EXPLAIN for:
SELECT DISTINCT ON (songs.rank, songs.shared_id) songs.*,
songs.*,
posts.url as post_url,
posts.excerpt as post_excerpt,
stations.title as station_title,
stations.slug as station_slug
FROM "songs"
INNER JOIN "posts" ON "posts"."id" = "songs"."post_id"
inner join stations on stations.blog_id = songs.blog_id
WHERE "songs"."processed" = 't'
AND "songs"."working" = 't'
ORDER BY songs.rank desc
LIMIT 24 OFFSET 0

QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=546147.28..546159.16 rows=24 width=2525)
-> Unique (cost=546147.28..547360.75 rows=2452 width=2525)
-> Sort (cost=546147.28..546551.77 rows=161796 width=2525)
Sort Key: songs.rank, songs.shared_id
-> Hash Join (cost=466.50..2906.84 rows=161796 width=2525)
Hash Cond: (songs.blog_id = stations.blog_id)
-> Hash Join (cost=249.41..587.52 rows=2452 width=2499)
Hash Cond: (songs.post_id = posts.id)
-> Seq Scan on songs (cost=0.00..304.39 rows=2452 width=2223)
Filter: (processed AND working)
-> Hash (cost=230.85..230.85 rows=1485 width=280)
-> Seq Scan on posts (cost=0.00..230.85 rows=1485 width=280)
-> Hash (cost=140.93..140.93 rows=6093 width=30)
-> Seq Scan on stations (cost=0.00..140.93 rows=6093 wi

Solution

Because the WHERE condition of the query involves only equality checks:

WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't'


and then you have:

SELECT  DISTINCT ON (songs.rank, songs.shared_id) ...


which is similar to GROUP BY songs.rank, songs.shared_id

I would first try adding a compound index on (first the columns in WHERE, then the columns in DISTINCT ON):

(processed, working, rank, shared_id)


The ordering: ORDER BY rank DESC may be better optimized if you have the index as:

(processed, working, rank DESC, shared_id)


Not really sure if this would contribute to efficiency but you can test.

Addition by @Erwin

As per request in comment

In principal (default) b-tree indexes can be scanned forward and backward at the same speed. But sorting can make a difference in multi-column indexes where you combine the sort order of multiple columns. The query starts with:

SELECT  DISTINCT ON (songs.rank, songs.shared_id)


In combination with ORDER BY rank DESC this dictates that the result be ordered by rank DESC, shared_id effectively. After the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.

I have my doubts if the DISTINCT clause is actually useful. But while it is there, the optimal index for the query should be (just as @ypercube suspected):

CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);


Looks like one of the rare cases where explicit ordering of index columns would benefit the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.

If the WHERE condition is stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster, yet:

CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;

Code Snippets

WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't'
SELECT  DISTINCT ON (songs.rank, songs.shared_id) ...
(processed, working, rank, shared_id)
(processed, working, rank DESC, shared_id)
SELECT  DISTINCT ON (songs.rank, songs.shared_id)

Context

StackExchange Database Administrators Q#19804, answer score: 6

Revisions (0)

No revisions yet.