HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Optimize a query with small LIMIT, predicate on one column and order by another

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
ordercolumnwithlimitquerypredicateonesmalloptimizeanother

Problem

I'm using Postgres 9.3.4 and I have 4 queries that have very similar inputs but have vastly different response times:

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
         Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Filter: 6913925
 Total runtime: 507361.944 ms


Query #2

```
EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
-> Sort (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
Sort Key: external_created_at
Sort Method: top-N heapsort Memory: 80kB
-> Index Scan using index_posts_on_source_id on posts (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
Index Cond: (source_id =

Solution

General advice

All the general advice for performance optimization applies. Default settings are very conservative and some of these resource settings are way to low for tables with millions of rows (work_mem in particular). You need to configure your RDBMS to make use of available RAM wisely. The Postgres Wiki is a good starting point. This is beyond the scope of a single question here.

However, the query I suggest below only needs very moderate resource settings.

Also increase the statistics target for source_id to have more detailed statistics on the crucial column:

ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar


Then: ANALYZE posts;

More:

  • Keep PostgreSQL from sometimes choosing a bad query plan



You could optimize storage some more (for minor gains):

  • Configuring PostgreSQL for read performance



Query

The query itself is hard to optimize. Refer to @ypercube's related question for advanced performance optimization:

  • Can spatial index help a "range - order by - limit" query



There is a simple method if ...

  • the number of distinct source_id per query is reasonably small



  • and the LIMIT is also reasonably small.



... which is true for your case according to your added details.

The only index you need for the query below:

CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;


Example based on your query #1:

SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
              , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;


This is emulating a loose index scan, similar to what's discussed here in great detail:

  • Optimize GROUP BY query to retrieve latest record per user



If n is the number of source_id's (and luckily never > 21), we make Postgres fetch the top 100 rows (according to external_created_at DESC) for each source_id from the index, which is very fast in itself, but max. (n-1) * 100 rows are surplus. Given your value frequencies:

22,245 source_id with 1 to 1,543,950 rows - and 20,997,027 rows total

(You didn't clarify whether all of these numbers include "deleted" rows, but only ~ 25% are "deleted".)

... I'd expect some of the source_id's to have less than 100 rows to begin with. So we only have to sort 2100 rows in the worst case (typically much fewer) to keep the top 100. That shouldn't perform so badly - once you have configured Postgres with decent resource settings.

If you have a source table holding all distinct source_id, it might make sense to use it and eliminate non-existent source_id early:

SELECT p.*
FROM   source s, LATERAL ( ... ) p
WHERE  s.source_id IN (19082, 19075, 20705, ...)
ORDER  BY ...


A maximum of 21 IN values is ok for this form, but consider this related question:

  • Optimizing a Postgres query with a large IN



You could optimize some more if you know the minimum external_created_at or the maximum number of rows from a single source_id in the result ...

Code Snippets

ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar
CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;
SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
              , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;
SELECT p.*
FROM   source s, LATERAL ( ... ) p
WHERE  s.source_id IN (19082, 19075, 20705, ...)
ORDER  BY ...

Context

StackExchange Database Administrators Q#124790, answer score: 5

Revisions (0)

No revisions yet.