SELECT very slow when no results and a LIMIT is specified
Problem
I am running into an issue where a `SELECT` query is very slow because it does not use an index when the number of final results is 0 and a `LIMIT` clause is specified.

- If the number of results is greater than 0, Postgres uses the index and the results return in ~ 1 ms. This seems to always be true as far as I can find.
- If the number of results is 0 and no `LIMIT` is used, Postgres uses the index and the results return in ~ 1 ms.
- If the number of results is 0 and a `LIMIT` is specified, Postgres does a sequential scan and the results take ~ 13,000 ms.

Why doesn't PostgreSQL use the index in the last case?

Cardinalities:

- ~ 21 million rows total.
- ~ 3 million rows `WHERE related_id=1`
- ~ 3 million rows `WHERE related_id=1 AND platform=p1`
- 2 rows `WHERE related_id=1 AND platform=p2`
- 0 rows `WHERE related_id=1 AND platform=p3`
- ~ 8 million rows `WHERE platform=p2`

Postgres version: 9.4.6

Table schema:

```
CREATE TYPE platforms AS ENUM ('p1', 'p2', 'p3');

CREATE TABLE mytable (
    id         bigint NOT NULL DEFAULT nextval('mytable_sq'::regclass),
    related_id integer NOT NULL,
    platform   platforms NOT NULL DEFAULT 'default'::platforms,
    name       character varying(200) NOT NULL,
    CONSTRAINT mytable_pkey PRIMARY KEY (id),
    CONSTRAINT mytable_related_id_fkey FOREIGN KEY (related_id)
        REFERENCES related (id)
);

CREATE INDEX related_id__platform__index ON mytable (related_id, platform);
CREATE UNIQUE INDEX some_other_index ON mytable (related_id, lower(name::text));
```

Queries and plans:

This query returns 0 rows:

```
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms
LIMIT 20;

Limit  (cost=0.00..14.07 rows=20 width=737) (actual time=12863.465..12863.465 rows=0 loops=1)
  ->  Seq Scan on mytable  (cost=0.00..1492790.47 rows=2122653 width=737) (actual time=12863.464..12863.464 rows=0 loops=1)
        Filter: ((related_id = 1) AND (platform = 'p2'::platforms))
        Rows Removed by Filter: 21076656
```
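As an aside, the `Limit` node's cost in this plan follows directly from the seq-scan numbers: the planner prorates the child node's total cost by the fraction of its estimated rows the `LIMIT` will fetch. A quick check of that arithmetic (a sketch; the constants are copied from the `EXPLAIN` output):

```python
# The planner prices "LIMIT 20" as the fraction of the Seq Scan's total
# cost needed to produce 20 of the 2,122,653 rows it *expects* to match.
seq_scan_total_cost = 1492790.47   # total cost of the Seq Scan node
estimated_rows = 2122653           # planner's row estimate (reality: 0)
limit_rows = 20

limit_cost = seq_scan_total_cost * limit_rows / estimated_rows
print(round(limit_cost, 2))        # 14.07, the Limit node's total cost
```

This is also why the planner gambles on the sequential scan: it believes only a tiny fraction of the full scan is needed to satisfy the `LIMIT`.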
Solution
Postgres does a bad job estimating the frequency of the combination of predicates in your query:

```
SELECT * FROM tbl
WHERE related_id = 1 AND platform = 'p2'::platforms
LIMIT 20;
```

Each of your predicates is not very selective on its own - that information is available to Postgres ("most common values") - assuming your statistics are up to date:
- ~ 21 million rows total.
- ~ 3 million rows `WHERE related_id=1`
- ~ 8 million rows `WHERE platform=p2`

IOW, ~ every 7th row passes the 1st filter and ~ every 3rd row passes the 2nd. Postgres does the (naive) math and expects that roughly every 20th row qualifies. Since there is no `ORDER BY`, any 20 qualifying rows will do. The fastest way should be to scan the table sequentially and be done after ~ 400 rows - only a few data pages, very cheap.

Using any index adds some extra cost: Postgres needs to scan index and table. (Exception: index-only scans, which are not possible here for your `SELECT *`.) That only pays off if Postgres estimates it would otherwise have to read enough additional pages to be more expensive. That's how I would explain why you see a sequential scan for a small `LIMIT`, but a bitmap index scan for a big (or no) `LIMIT`.

Unfortunately, the combination of your predicates turns out to be unexpectedly rare. Postgres has to scan the whole table just to find only 2 qualifying rows. (The index would actually be much cheaper in any case.)
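The naive math can be spelled out (a sketch using the rounded cardinalities from the question; the planner's real computation uses its MCV statistics, which is why the plan shows an estimate of 2,122,653 rows rather than this figure):

```python
total       = 21_000_000   # ~ rows in the table
related_1   =  3_000_000   # ~ rows WHERE related_id = 1
platform_p2 =  8_000_000   # ~ rows WHERE platform = 'p2'

# Assuming the two predicates are independent, the planner multiplies
# their individual selectivities:
combined_sel  = (related_1 / total) * (platform_p2 / total)
expected_rows = total * combined_sel
rows_for_limit = 20 / combined_sel

print(round(1 / combined_sel))   # ~ every 18th row expected to qualify
print(round(expected_rows))      # ~ 1.1 million expected matches
print(rows_for_limit)            # on average a few hundred rows to fill LIMIT 20
```

With over a million expected matches, stopping a sequential scan after the first 20 hits looks almost free - which is exactly the bet the planner loses when the true count is 2.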
- 2 rows `WHERE related_id=1 AND platform=p2`

The combined frequency of values in multiple columns is not available to Postgres. Think about it: gathering statistics like that would quickly get out of hand.
A very simple and efficient solution for this particular case: create a partial index:

```
CREATE INDEX related_id_1_platform_2_idx ON tbl (id)
WHERE related_id = 1 AND platform = 'p2'::platforms;
```

Not only will this super-tiny index (2 rows) be a perfect match for your query, it will also make a count estimate available for your particular combination (the entry in `pg_class.reltuples`). The actual index column is mostly irrelevant for this one; pick a small column - typically best to make it the PK.

If one of the two predicates can change, there is a more generic approach. Say, if `related_id = 1` is the stable condition, then create:

```
CREATE INDEX related_id_1_idx ON tbl (platform)
WHERE related_id = 1;
```

The index column is relevant again. This may not be enough to tip the scales, because Postgres only collects complete statistics for index columns of functional indexes (else it relies on the statistics of the underlying table). I propose:
```
CREATE INDEX related_id_1_func_idx ON tbl ((platform::text::platforms)) -- double parens!
WHERE related_id = 1;
```

Note the extra pair of parentheses - a syntactical necessity for the cast shorthand.

The expression `platform::text::platforms` doesn't actually change anything - it casts your enum to text and back. But it makes Postgres collect full statistics on the (presumed) new values.

Now (after `ANALYZE tbl`) we have full statistics, including most common values of `platform` for `related_id = 1`. Check with:
```
SELECT *
FROM pg_stats
WHERE schemaname = 'public'                 -- actual schema
AND   tablename = 'related_id_1_func_idx';  -- actual idx name
```

And Postgres should choose the index for your case - if you repeat the same expression in your query. So:

```
SELECT ...
WHERE related_id = 1 AND platform::text::platforms = 'p2'::platforms;
```

Related:

- Index that is not used, yet influences query

About most common values in the Postgres statistics:

- Speeding up a GROUP BY, HAVING COUNT query
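The partial-index idea is easy to demonstrate end to end. The sketch below uses Python's built-in `sqlite3` purely for illustration (an assumption: SQLite happens to support partial indexes with the same `CREATE INDEX ... WHERE` syntax, though its planner is much simpler than Postgres's); the table name and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl (
        id         INTEGER PRIMARY KEY,
        related_id INTEGER NOT NULL,
        platform   TEXT    NOT NULL
    )
""")

# Bulk data that never hits the rare combination (related_id = 1, 'p2') ...
common = [(i, i % 7, 'p1' if i % 7 == 1 else 'p2') for i in range(100_000)]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", common)
# ... plus exactly 2 rows that do.
conn.executemany("INSERT INTO tbl VALUES (?, 1, 'p2')", [(100_001,), (100_002,)])

# Partial index covering only the 2 qualifying rows (Postgres syntax is identical).
conn.execute("""
    CREATE INDEX related_id_1_platform_2_idx ON tbl (id)
    WHERE related_id = 1 AND platform = 'p2'
""")

# The query's WHERE clause implies the index predicate, so the planner can
# enumerate the 2-entry index instead of filtering all 100,002 rows.
rows = conn.execute(
    "SELECT * FROM tbl WHERE related_id = 1 AND platform = 'p2' LIMIT 20"
).fetchall()
print(len(rows))   # 2

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM tbl WHERE related_id = 1 AND platform = 'p2' LIMIT 20"
).fetchall()
print(plan[0][3])  # plan detail line; often names the partial index
```

The same principle drives the Postgres fix above: the index's `WHERE` clause pre-filters the rows at write time, so at read time both the scan and the row-count estimate operate on the tiny qualifying subset.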
Context
StackExchange Database Administrators Q#132235, answer score: 10