patternsqlMinor
Optimize a query with small LIMIT, predicate on one column and order by another
Viewed 0 times
ordercolumnwithlimitquerypredicateonesmalloptimizeanother
Problem
I'm using Postgres 9.3.4 and I have 4 queries that have very similar inputs but have vastly different response times:
Query #1
Query #2
```
EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
-> Sort (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
Sort Key: external_created_at
Sort Method: top-N heapsort Memory: 80kB
-> Index Scan using index_posts_on_source_id on posts (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
Index Cond: (source_id =
Query #1
EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
-> Index Scan using index_posts_on_external_created_at on posts (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
Rows Removed by Filter: 6913925
Total runtime: 507361.944 msQuery #2
```
EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
-> Sort (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
Sort Key: external_created_at
Sort Method: top-N heapsort Memory: 80kB
-> Index Scan using index_posts_on_source_id on posts (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
Index Cond: (source_id =
Solution
General advice
All the general advice for performance optimization applies. Default settings are very conservative and some of these resource settings are way to low for tables with millions of rows (
However, the query I suggest below only needs very moderate resource settings.
Also increase the statistics target for
Then:
More:
You could optimize storage some more (for minor gains):
Query
The query itself is hard to optimize. Refer to @ypercube's related question for advanced performance optimization:
There is a simple method if ...
... which is true for your case according to your added details.
The only index you need for the query below:
Example based on your query #1:
This is emulating a loose index scan, similar to what's discussed here in great detail:
If n is the number of source_id's (and luckily never > 21), we make Postgres fetch the top 100 rows (according to
22,245
(You didn't clarify whether all of these numbers include "deleted" rows, but only ~ 25% are "deleted".)
... I'd expect some of the
If you have a source table holding all distinct
A maximum of 21
You could optimize some more if you know the minimum
All the general advice for performance optimization applies. Default settings are very conservative and some of these resource settings are way to low for tables with millions of rows (
work_mem in particular). You need to configure your RDBMS to make use of available RAM wisely. The Postgres Wiki is a good starting point. This is beyond the scope of a single question here.However, the query I suggest below only needs very moderate resource settings.
Also increase the statistics target for
source_id to have more detailed statistics on the crucial column:ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000; -- or similarThen:
ANALYZE posts;More:
- Keep PostgreSQL from sometimes choosing a bad query plan
You could optimize storage some more (for minor gains):
- Configuring PostgreSQL for read performance
Query
The query itself is hard to optimize. Refer to @ypercube's related question for advanced performance optimization:
- Can spatial index help a "range - order by - limit" query
There is a simple method if ...
- the number of distinct
source_idper query is reasonably small
- and the
LIMITis also reasonably small.
... which is true for your case according to your added details.
The only index you need for the query below:
CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;Example based on your query #1:
SELECT p.*
FROM unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
, 17804, 20717, 27598, 27599}'::int[]) s(source_id)
, LATERAL (
SELECT *
FROM posts
WHERE source_id = s.source_id
AND deleted_at IS NULL
ORDER BY external_created_at DESC
LIMIT 100
) p
ORDER BY p.external_created_at DESC
LIMIT 100;This is emulating a loose index scan, similar to what's discussed here in great detail:
- Optimize GROUP BY query to retrieve latest record per user
If n is the number of source_id's (and luckily never > 21), we make Postgres fetch the top 100 rows (according to
external_created_at DESC) for each source_id from the index, which is very fast in itself, but max. (n-1) * 100 rows are surplus. Given your value frequencies:22,245
source_id with 1 to 1,543,950 rows - and 20,997,027 rows total(You didn't clarify whether all of these numbers include "deleted" rows, but only ~ 25% are "deleted".)
... I'd expect some of the
source_id's to have less than 100 rows to begin with. So we only have to sort 2100 rows in the worst case (typically much fewer) to keep the top 100. That shouldn't perform so badly - once you have configured Postgres with decent resource settings.If you have a source table holding all distinct
source_id, it might make sense to use it and eliminate non-existent source_id early:SELECT p.*
FROM source s, LATERAL ( ... ) p
WHERE s.source_id IN (19082, 19075, 20705, ...)
ORDER BY ...A maximum of 21
IN values is ok for this form, but consider this related question:- Optimizing a Postgres query with a large IN
You could optimize some more if you know the minimum
external_created_at or the maximum number of rows from a single source_id in the result ...Code Snippets
ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000; -- or similarCREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;SELECT p.*
FROM unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
, 17804, 20717, 27598, 27599}'::int[]) s(source_id)
, LATERAL (
SELECT *
FROM posts
WHERE source_id = s.source_id
AND deleted_at IS NULL
ORDER BY external_created_at DESC
LIMIT 100
) p
ORDER BY p.external_created_at DESC
LIMIT 100;SELECT p.*
FROM source s, LATERAL ( ... ) p
WHERE s.source_id IN (19082, 19075, 20705, ...)
ORDER BY ...Context
StackExchange Database Administrators Q#124790, answer score: 5
Revisions (0)
No revisions yet.