HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL partial index unused when created on a table with existing data

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlcreatedwithunusedpartialwhenexistingindexdatatable

Problem

In PostgreSQL 9.3, I am attempting to create an efficient index on a rarely-used (0.00001% of total records) boolean column. To that end, I discovered this post on SO: https://stackoverflow.com/a/12026593/808921

I am attempting to make use of the "partial index" feature of PostgreSQL that Erwin Brandstetter recommends. I have a table with a couple million records already, and I would like to add the index to that table, like so:

CREATE INDEX schema_defs_deprovision ON schema_defs (deprovision) 
WHERE deprovision = 0;


(the vast majority of the records will have deprovision = 1)

The problem is that when I try to use this index with the simplest of queries, PostgreSQL acts as if it isn't there:

explain select * from schema_defs where deprovision = 0;

Seq Scan on schema_defs (cost=0.00..1.05 rows=1 width=278)
Filter: (deprovision = 0)


The really strange thing is that I have found that if this index is created before there is data in the table, then it does actually work fine. Don't believe me? Here are some SQL Fiddle entries that prove it:

Partial Index Created After Inserts (Index not working)

Partial Index Created Before Inserts (Index working properly)

In both of those, just expand the "View Execution Plan" link to see what I'm talking about.

So, my question is - what do I have to do in order to get PostgreSQL to start using a partial index on a table that had data in it before the index was created?

BTW I'm also the developer of SQL Fiddle and this question is related to a new development effort I'm making for it.

Solution

Run ANALYZE after the index has been added. And make sure the column deprovision has statistics. How to verify?

Basic statistics in pg_class:

SELECT relname, relkind, reltuples, relpages
FROM   pg_class
WHERE  oid = 'schema_defs'::regclass;


Data histograms per column in pg_stats (pg_statistics):

SELECT attname, inherited, n_distinct
     , array_to_string(most_common_vals, E'\n') AS most_common_vals
FROM   pg_stats
WHERE  tablename = 'schema_defs'
AND    attname = 'deprovision';


The manual:


The PostgreSQL query planner relies on statistical information about
the contents of tables in order to generate good plans for queries.
These statistics are gathered by the ANALYZE command, which can be
invoked by itself or as an optional step in VACUUM. It is important to
have reasonably accurate statistics, otherwise poor choices of plans
might degrade database performance.


The autovacuum daemon, if enabled, will automatically issue ANALYZE
commands whenever the content of a table has changed sufficiently.
However, administrators might prefer to rely on manually-scheduled
ANALYZE operations, particularly if it is known that update activity
on a table will not affect the statistics of "interesting" columns.
The daemon schedules ANALYZE strictly as a function of the number of
rows inserted or updated; it has no knowledge of whether that will
lead to meaningful statistical changes.

In your case, analyzing just the one column would do the job:

ANALYZE table_name (deprovision);


While being at it, it makes no sense to have the index on the column deprovision. Given the predicate WHERE deprovision = 0 it does not carry additional information. You might as well use a constant expression:

CREATE INDEX schema_defs_deprovision ON schema_defs ((true)) 
WHERE deprovision = 0;


Just a proof of concept. This would not be any more useful. In this special case you wouldn't need an index column at all, but you must provide at least one column or expression. So use the primary key (since it does not change and is indexed anyway, you don't introduce more restrictions / overhead costs) or any other small column (

I don't know how you configured autovacuum and whether / when you run ANALYZE manually. But I noticed in the past that sqlfiddle can be misleading due to missing / outdated statistics.

I would be very interested how ANALYZE is handled behind the curtains on sqlfiddle. It may be best not to do anything special, but some info would be welcome. Maybe one basic webpage per available RDBMS version?

Demo

I created an SQL Fiddle to demonstrate the effects of CREATE INDEX and ANALYZE on the various statistics.

The effects show (at least) on the first run for me. May not be reproducible on later runs, you would have to create a new schema and run again.

First we see neither basic statistics in pg_class:

relname      reltuples  relpages
schema_defs  0          0


Nor any entries for deprovision in pg_statistics at all (no result).

Postgres has no idea what's in the table and defaults to using the index - which is a bad choice!

After CREATE INDEX, we see basic statistics, but still no data histogram in pg_statistics.

After ANALYZE we see both.

With proper statistics, Postgres now uses a sequential scan (good choice, even if there is an index - it would be more expensive for so few rows).

Code Snippets

SELECT relname, relkind, reltuples, relpages
FROM   pg_class
WHERE  oid = 'schema_defs'::regclass;
SELECT attname, inherited, n_distinct
     , array_to_string(most_common_vals, E'\n') AS most_common_vals
FROM   pg_stats
WHERE  tablename = 'schema_defs'
AND    attname = 'deprovision';
ANALYZE table_name (deprovision);
CREATE INDEX schema_defs_deprovision ON schema_defs ((true)) 
WHERE deprovision = 0;
CREATE INDEX schema_defs_deprovision ON schema_defs (id) 
WHERE deprovision = 0;

Context

StackExchange Database Administrators Q#99323, answer score: 9

Revisions (0)

No revisions yet.