patternsqlMinor
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
Viewed 0 times
sometimesorderpostgresinferiorlimitwhereusesforindex
Problem
We have a PostgreSQL table with ~5 billion rows that has developed a nasty habit of missing the proper indices and doing a Primary Key scan on certain
The problem generally manifests on an
where the items in that
The
Limit (cost=0.58..4632.03 rows=1 width=28)
-> Index Scan Backward using mcqueen_base_imagemeta2_pkey on mcqueen_base_imagemeta2 (cost=0.58..364597074.75 rows=78722 width=28)
Filter: (image_id = ANY ('{123, ...}'::bigint[]))
If the
Limit (cost=7585.92..7585.93 rows=2 width=28)
-> Sort (cost=7585.92..7782.73 rows=78722 width=28)
Sort Key: id DESC
-> Index Scan using mcqueen_base_imagemeta2_image_id_616fe89c on mcqueen_base_imagemeta2 (cost=0.58..6798.70 rows=78722 width=28)
Index Cond: (image_id = ANY ('{123, ...}'::bigint[]))
This also happens on queries where the index matches ~3000 rows and the limit is set to 100, so something that easily happens in real world REST API pagination.
The table definition is:
```
mcqueen=# \d mcqueen_base_imagemeta2
Table "public.mcqueen_base_imagemeta2"
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------------------------
id | bigint | not null default nextval('mcqueen_base_imagemeta2_id_seq'::regc
LIMIT operations.The problem generally manifests on an
ORDER BY .. LIMIT .. clause (a common pattern in Django pagination) where the LIMIT is some relatively small subset of the results matched by the index. An extreme example is this:SELECT * FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
ORDER BY id DESC
LIMIT 1;where the items in that
IN clause are ~20 and total rows matched by the index on image_id is 16.The
EXPLAIN shows that it misses the image_id index and instead does a PK scan of 5B rows:Limit (cost=0.58..4632.03 rows=1 width=28)
-> Index Scan Backward using mcqueen_base_imagemeta2_pkey on mcqueen_base_imagemeta2 (cost=0.58..364597074.75 rows=78722 width=28)
Filter: (image_id = ANY ('{123, ...}'::bigint[]))
If the
LIMIT is increased to 2, it works as expected:Limit (cost=7585.92..7585.93 rows=2 width=28)
-> Sort (cost=7585.92..7782.73 rows=78722 width=28)
Sort Key: id DESC
-> Index Scan using mcqueen_base_imagemeta2_image_id_616fe89c on mcqueen_base_imagemeta2 (cost=0.58..6798.70 rows=78722 width=28)
Index Cond: (image_id = ANY ('{123, ...}'::bigint[]))
This also happens on queries where the index matches ~3000 rows and the limit is set to 100, so something that easily happens in real world REST API pagination.
The table definition is:
```
mcqueen=# \d mcqueen_base_imagemeta2
Table "public.mcqueen_base_imagemeta2"
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------------------------
id | bigint | not null default nextval('mcqueen_base_imagemeta2_id_seq'::regc
Solution
Why?
For a
Postgres collects statistics about the most common values (MCV list), but not for the least common ones - for obvious reasons, that would be far too many to be useful. And it has no statistics for correlations between columns by default. (While that can be created manually it won't fit your use case anyway, as ID numbers are typically uncorrelated.)
So Postgres has to base its decision on generic estimates. It's very hard to identify the sweet spot where to switch from one index to the other. This gets harder, yet, for a predicate like
Solutions?
You may be able to improve the situation somewhat with a larger statistics target:
That (among other things) increases the size of the MCV list for the column and help identify more (less) common values. But it's not a general solution for the problem, and makes
Upgrading to the latest version (soon to be Postgres 12) also helps as general performance got better and the planner smarter.
There are various techniques for a workaround, depending on cardinalities, value frequencies, access patterns, ... Completely disabling the
Related:
Workaround for your case
Should work well for the given numbers: 5 billion rows, around 20
Provide your list as array and
It's essential to support this with a multicolumn index on
You might then delete the existing index
This should result in one very fast index(-only) scan per
Fetching N rows for each
Aside
(a common pattern in Django pagination)
Pagination with
For a
LIMIT 1, Postgres may estimate it to be faster to traverse the index supporting the ORDER BY and just keep filtering until the first row is found. This is fast as long as more than a few rows qualify and one of those pops up early according to ORDER BY. But it is (very) slow if no qualifying row pops up early, or even a worst case scenario if no row ends up qualifying at all. Similar for any small LIMIT.Postgres collects statistics about the most common values (MCV list), but not for the least common ones - for obvious reasons, that would be far too many to be useful. And it has no statistics for correlations between columns by default. (While that can be created manually it won't fit your use case anyway, as ID numbers are typically uncorrelated.)
So Postgres has to base its decision on generic estimates. It's very hard to identify the sweet spot where to switch from one index to the other. This gets harder, yet, for a predicate like
image_id IN (123, ... ) with many items, and most are typically rare or very rare or even non-existent. But if you put enough numbers into the list, Postgres will eventually expect that traversing the other index will find the first hit faster.Solutions?
You may be able to improve the situation somewhat with a larger statistics target:
ALTER TABLE mcqueen_base_imagemeta2 ALTER image_id SET STATISTICS 2000;That (among other things) increases the size of the MCV list for the column and help identify more (less) common values. But it's not a general solution for the problem, and makes
ANALYZE and query planning a bit more expensive. Related:- Check statistics targets in PostgreSQL
Upgrading to the latest version (soon to be Postgres 12) also helps as general performance got better and the planner smarter.
There are various techniques for a workaround, depending on cardinalities, value frequencies, access patterns, ... Completely disabling the
ORDER BY index like Laurenz demonstrated is one radical workaround - which can backfire for long lists or very common image_id, where the ORDER BY index would, in fact, be much faster.Related:
- Can spatial index help a "range - order by - limit" query
Workaround for your case
Should work well for the given numbers: 5 billion rows, around 20
image_id in the filter list, small LIMIT. Most efficient for LIMIT 1 and a short list, but good for any small LIMIT and manageable list size:SELECT m.*
FROM unnest( '{123, ...}'::bigint[]) i(image_id)
CROSS JOIN LATERAL (
SELECT m.id
FROM mcqueen_base_imagemeta2 m
WHERE m.image_id = i.image_id
ORDER BY m.id DESC
LIMIT 1 -- or N
) m
ORDER BY id DESC
LIMIT 1; -- or NProvide your list as array and
unnest(). Or use a VALUES expression. Related:- Optimizing a Postgres query with a large IN
It's essential to support this with a multicolumn index on
(image_id, id DESC)!You might then delete the existing index
mcqueen_base_imagemeta2_image_id_616fe89c on just (image_id). See:- Is a composite index also good for queries on the first field?
This should result in one very fast index(-only) scan per
image_id. And a final, (very) cheap sort step.Fetching N rows for each
image_id guarantees that we have all rows needed in the outer query. If you have meta-knowledge that only fewer rows per single image_id can be in the result, you can decrease the nested LIMIT accordingly.Aside
(a common pattern in Django pagination)
Pagination with
LIMIT and OFFSET? OK for the first page, but after that it's just a bad idea.- Efficient pagination for big tables
- What is the recommended way to join junction tables for efficient ordering/pagination?
Code Snippets
ALTER TABLE mcqueen_base_imagemeta2 ALTER image_id SET STATISTICS 2000;SELECT m.*
FROM unnest( '{123, ...}'::bigint[]) i(image_id)
CROSS JOIN LATERAL (
SELECT m.id
FROM mcqueen_base_imagemeta2 m
WHERE m.image_id = i.image_id
ORDER BY m.id DESC
LIMIT 1 -- or N
) m
ORDER BY id DESC
LIMIT 1; -- or NContext
StackExchange Database Administrators Q#249617, answer score: 8
Revisions (0)
No revisions yet.