patternsqlMinor
Postgres is performing sequential scan instead of index scan
Viewed 0 times
performingscanpostgressequentialinsteadindex
Problem
I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?
From other answers I suspect this is as much related to the query as to the index.
Table structure:
```
http=# \d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)Table structure:
```
http=# \d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer
Solution
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve
Index
All you need is a B-tree index on
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type
Why the cast to
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
Or, for an immutable interval:
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
The first CTE
But the verbose form makes sure our partial index is used. Plus, this form is slightly faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve
pageid at all (like you commented).Index
All you need is a B-tree index on
"labelDate".With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type
date to our advantage. There's only a finite, countable number of values between two given values. If gaps are few, this will be fastest:SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");Why the cast to
timestamp in generate_series()? See:- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (
SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");Or, for an immutable interval:
SELECT d."labelDate"
FROM (
SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;The first CTE
p is effectively the same asSELECT min("labelDate") FROM pagesBut the verbose form makes sure our partial index is used. Plus, this form is slightly faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
- Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
Code Snippets
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");SELECT d."labelDate"
FROM (
SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");SELECT d."labelDate"
FROM (
SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;Context
StackExchange Database Administrators Q#105537, answer score: 9
Revisions (0)
No revisions yet.