HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlModerate

How to use index to speed up sorting in Postgres?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
sortingpostgreshowindexusespeed

Problem

I have a table messages in a Postgres 9.4 database. Messages belong to feed_id, and have posted_at. Also, messages can have a parent message (in case of replies).
Table "public.messages"
Column | Type | Modifiers
------------------------------+-----------------------------+-----------
message_id | character varying(255) | not null
feed_id | integer |
parent_id | character varying(255) |
posted_at | timestamp without time zone |
share_count | integer |
Indexes:
"messages_pkey" PRIMARY KEY, btree (message_id)
"index_messages_on_feed_id_posted_at" btree (feed_id, posted_at DESC NULLS LAST)


I want to return all messages ordered by share_count, but for each parent_id, I only want to return one message. ie, if multiple messages have the same parent_id, then only the latest one (posted_at) is returned. The parent_id can be null, messages with null parent_id should all return.

The query I used is:

WITH filtered_messages AS (SELECT * 
                           FROM messages
                           WHERE feed_id IN (7) 
                           AND (posted_at >= '2015-01-01 04:00:00.000000') 
                           AND (posted_at < '2015-04-28 04:00:00.000000'))
SELECT *
FROM (SELECT DISTINCT ON(COALESCE(parent_id, message_id)) parent_id,
                      message_id, 
                      posted_at, 
                      share_count
      FROM filtered_messages
      ORDER BY COALESCE(parent_id, message_id), posted_at DESC NULLS LAST
     ) messages
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;


Here's an SQL fiddle with schema, exact query, and expected result.

The performance of the query is slow once the messages table gets big. I tried add multiple sorting indexes, but it does not seem to

Solution

Query

This query should be substantially faster in any case:

SELECT parent_id, message_id, posted_at, share_count
FROM   messages
WHERE  feed_id = 7
AND    posted_at >= '2015-01-01 4:0:0'
AND    posted_at = '2015-01-01 4:0:0'
AND    posted_at <  '2015-04-28 4:0:0'
AND    parent_id IS NOT NULL  -- match index condition
ORDER  BY parent_id, posted_at DESC NULLS LAST
)
ORDER  BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;


The CTE does nothing here that a plain subquery could not deliver also. And a CTE introduces an optimization barrier since it is executed separately and its result is materialized.

Update: that changes in Postgres 12. See:

  • How to prevent PostgreSQL from rewriting OUTER JOIN queries?



You had one more subquery-level than you actually need.

The expression (COALESCE(parent_id, message_id) is not compatible with a plain index. You would need an index on that expression. But that may not be very useful either, depending on data distribution. See links below.

Splitting the simple case of parent_id IS NULL into a separate SELECT may or may not deliver the optimum. Especially not, if that's a rare case anyway, in which case a combined query with an index on (COALESCE(parent_id, message_id) may perform better. Other considerations apply ...
Indices

Especially when supported with these indices:

CREATE INDEX messages_idx_null ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NULL;

CREATE INDEX messages_idx_notnull ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NOT NULL;


The two partial indices cover the whole table together and are about the same size together as a single total index.

Covering indexes (with the INCLUDE clause) were added with Postgres 11. See:

  • Do covering indexes in PostgreSQL help JOIN columns?



The INCLUDE part only make sense if you get index-only scans out of it. Else remove it from both indices.

db<>fiddle here

Old sqlfiddle

Depending on missing details, DISTINCT ON may or may not be the best query technique for the purpose. Details here:

  • Select first row in each GROUP BY group?



Possibly faster alternatives here:

  • Optimize GROUP BY query to retrieve latest record per user

Code Snippets

SELECT parent_id, message_id, posted_at, share_count
FROM   messages
WHERE  feed_id = 7
AND    posted_at >= '2015-01-01 4:0:0'
AND    posted_at <  '2015-04-28 4:0:0'
AND    parent_id IS NULL  -- match index condition
UNION ALL
(
SELECT DISTINCT ON(parent_id)
       parent_id, message_id, posted_at, share_count
FROM   messages
WHERE  feed_id = 7
AND    posted_at >= '2015-01-01 4:0:0'
AND    posted_at <  '2015-04-28 4:0:0'
AND    parent_id IS NOT NULL  -- match index condition
ORDER  BY parent_id, posted_at DESC NULLS LAST
)
ORDER  BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;
CREATE INDEX messages_idx_null ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NULL;

CREATE INDEX messages_idx_notnull ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NOT NULL;

Context

StackExchange Database Administrators Q#98971, answer score: 14

Revisions (0)

No revisions yet.