HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Index on primary key not used in simple join

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
simpleprimaryusedjoinindexnotkey

Problem

I have the following table and index definitions:

CREATE TABLE munkalap (
    munkalap_id serial PRIMARY KEY,
    ...
);

CREATE TABLE munkalap_lepes (
    munkalap_lepes_id serial PRIMARY KEY,
    munkalap_id integer REFERENCES munkalap (munkalap_id),
    ...
);

CREATE INDEX idx_munkalap_lepes_munkalap_id ON munkalap_lepes (munkalap_id);


Why are none of the indexes on munkalap_id used in the following query?

EXPLAIN ANALYZE SELECT ml.* FROM munkalap m JOIN munkalap_lepes ml USING (munkalap_id);

QUERY PLAN
Hash Join  (cost=119.17..2050.88 rows=38046 width=214) (actual time=0.824..18.011 rows=38046 loops=1)
  Hash Cond: (ml.munkalap_id = m.munkalap_id)
  ->  Seq Scan on munkalap_lepes ml  (cost=0.00..1313.46 rows=38046 width=214) (actual time=0.005..4.574 rows=38046 loops=1)
  ->  Hash  (cost=78.52..78.52 rows=3252 width=4) (actual time=0.810..0.810 rows=3253 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 115kB
        ->  Seq Scan on munkalap m  (cost=0.00..78.52 rows=3252 width=4) (actual time=0.003..0.398 rows=3253 loops=1)
Total runtime: 19.786 ms


It's the same even if I add a filter:

EXPLAIN ANALYZE SELECT ml.* FROM munkalap m JOIN munkalap_lepes ml USING (munkalap_id) WHERE NOT lezarva;

QUERY PLAN
Hash Join  (cost=79.60..1545.79 rows=1006 width=214) (actual time=0.616..10.824 rows=964 loops=1)
  Hash Cond: (ml.munkalap_id = m.munkalap_id)
  ->  Seq Scan on munkalap_lepes ml  (cost=0.00..1313.46 rows=38046 width=214) (actual time=0.007..5.061 rows=38046 loops=1)
  ->  Hash  (cost=78.52..78.52 rows=86 width=4) (actual time=0.587..0.587 rows=87 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 4kB
        ->  Seq Scan on munkalap m  (cost=0.00..78.52 rows=86 width=4) (actual time=0.014..0.560 rows=87 loops=1)
              Filter: (NOT lezarva)
Total runtime: 10.911 ms

Solution

Many people have heard guidance that "sequential scans are bad" and seek to eliminate them from their plans, but it isn't so simple. If a query is going to cover every row in a table, a sequential scan is the fastest way to get those rows. This is why your original join query used seq scan, because all rows in both tables were required.

When planning a query, Postgres's planner estimates the costs of various operations (computation, sequential, and random IO) under different possible schemes, and picks the plan it estimates as having the lowest cost. When doing IO from rotating storage (disks), random IO is usually substantially slower than sequential IO, the default pg configuration for random_page_cost and seq_page_cost estimate a 4:1 difference in cost.

These considerations come into play when considering a join or filter method which uses an index vs one which sequentially scans a table. When using an index, the plan may find a row quickly via the index, then have to account for a random block read to resolve the row data. In the case of your second query which added a filtering predicate WHERE NOT lezarva, you can see how this effected the planning estimates in the EXPLAIN ANALYZE results. The planner estimates 1006 rows resulting from the join (which pretty closely matches the actual result set of 964). Given that the larger table munkalap_lepes contains about 38K rows, the planner sees that the join is going to have to access about 1006/38046 or 1/38 of the rows in the table. It also knows the avg row width is 214 bytes and a block is 8K, so there's about 38 rows/block.

With these statistics, the planner considers it likely that the join will have to read all or most of the table's data blocks. Since the index lookups aren't free either, and the computation to scan a block evaluating a filter condition is very cheap relative to IO, the planner has chosen to sequentially scan the table and avoid index overhead and random reads as it calculates the seq scan will be faster.

In the real world, data is often available in memory via the OS page cache, and so not every block read requires IO. It can be quite hard to predict how effective a cache will be for a given query, but the Pg planner does use some simple heuristics. The configuration value effective_cache_size informs the planners estimates of the likelyhood of incurring actual IO costs. A larger value will cause it to estimate a lower cost to random IO and may thus bias it towards an index driven method over a sequential scan.

Context

StackExchange Database Administrators Q#16612, answer score: 31

Revisions (0)

No revisions yet.