HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

PostgreSQL seems to create inefficient plans in simple conditional joins

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlsimplecreateinefficientseemsconditionalplansjoins

Problem

Consider these two queries:

SELECT
    t1.id, *
FROM
    t1
INNER JOIN
    t2 ON t1.id = t2.id
    where t1.id > -9223372036513411363;


And:

SELECT
    t1.id, *
FROM
    t1
INNER JOIN
    t2 ON t1.id = t2.id
    where t1.id > -9223372036513411363 and t2.id > -9223372036513411363;


Note: The -9223372036513411363 is not the minimum value in the tables and the condition reduces the result (from the total number of rows, 350 million) to 17 million.

Personally, I expect PostgreSQL to come up with the same plan for both queries, because having t1.id = t2.id automatically implies the second condition. But unfortunately, PostgreSQL is creating two different plans with the plan for the second one being much better:

  • First query: http://explain.depesz.com/s/uauk



  • Second query: link: http://explain.depesz.com/s/uQd



  • EXPLAIN ANALYZE for the second query: http://explain.depesz.com/s/Snkx


(Second query finishes in 215 seconds, while the first one didn't finish after 1000 seconds until I terminated it).

I would highly prefer the first query, since I want to create a view from the join and put the where condition on queries on the view, where I see a single id column (I join using USING so a single id column is visible in view). Also, I will join more than two tables and I would prefer not to add such condition for each join.

Is there any reason for this behavior? Or is it a bug? Are there any workarounds?

  • Replacing ON t1.id = t2.id with USING (id) makes no difference in both queries.



  • This is PostgreSQL 9.3



  • The actual number of returned rows is 17,658,189



  • Analyze has been run on the tables. However, statistics related settings of PostgreSQL are its default values.



  • Observation: Explain for query 1 has a good estimate for the final result, but uses a poor plan for querying t2. For 2nd query, estimates of the number of rows from t1 & t2 are good, but estimate for the final merge is about half the number of actual rows.



  • The id co

Solution

Then it looks like an optimizer's blind spot and you should use the second query.

When there is a condition joining two tables a and b: a.id = b.id and an additional condition a.id > @some_constant, seems like the optimizer uses the "index condition" for where to start the index scan on a (id) index but it doesn't use it for the second index b (id).

So, adding the (redundant) b.id > @some_constant allows it to produce a slightly more efficient plan, skipping a part of the b (id) index as well.

This could be posted as a suggestion for improvement (if it hasn't been already) to the Postgres hackers group.

After the edit, we know there is a FOREIGN KEY constraint from t2 that REFERENCES t1. So the "natural" (equivalent) way to write the query would be:

SELECT
    -- whatever
FROM
    t2
  LEFT JOIN
    t1 ON t1.id = t2.id
WHERE t2.id > -9223372036513411363 ;


Can you try this and tell us the execution plan it produces? There are some transformations that apply only to LEFT (outer) joins and not to inner joins.

Unfortunately this doesn't produce any different plan either.

The OP has posted a question at the Postgres performance list, we can see the whole thread here: PostgreSQL seems to create inefficient plans in simple conditional joins and the reply by David Rowley, which confirms that this is a feature that although it has been considered, hasn't yet been implemented in the optimizer:


Yes, unfortunately you've done about the only thing that you can do,
and that's just include both conditions in the query. Is there some
special reason why you can't just write the t2.id > ... condition in
the query too? or is the query generated dynamically by some software
that you have no control over?


I'd personally quite like to see improvements in this area, and even
wrote a patch 1 which fixes this problem too. The problem I had when
proposing the fix for this was that I was unable to report details
about how many people are hit by this planner limitation. The patch I
proposed caused a very small impact on planning time for many queries,
and was thought by many not to apply in enough cases for it to be
worth slowing down queries which cannot possibly benefit. Of course I
agree with this, I've no interest in slowing down planning on queries,
but at the same time understand the annoying poor optimisation in this
area.


Although please remember the patch I proposed was merely a first draft
proposal. Not for production use.

Code Snippets

SELECT
    -- whatever
FROM
    t2
  LEFT JOIN
    t1 ON t1.id = t2.id
WHERE t2.id > -9223372036513411363 ;

Context

StackExchange Database Administrators Q#127082, answer score: 3

Revisions (0)

No revisions yet.