HiveBrain v1.2.0
Get Started
← Back to all entries
gotchasqlMinor

Why does a list of 10,000 IDs perform better than using the equivalent SQL to select them?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whytheequivalent000sqlidsthanselectbetterperform

Problem

I have a Rails application with a legacy query that I'd like to refurbish. The current implementation performs two SQL queries: one to get a large number of IDs and a second query that uses those IDs and applies some additional joins and filters to get the desired result.

I am trying to replace this with a single query that avoids the round trip, but doing so has incurred a large performance degradation in my local testing environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it due to not needing to send all the IDs around.

This is a fairly minimized version of my actual problem. A slightly larger version is discussed at Why does a list of 10,000 IDs perform better in a complicated query with multiple CTEs compared to the equivalent SQL to select them?.
Current Query

There's a query that takes ~6.5 seconds to calculate a list of 10000+ IDs. You can see that as the CTE visible_projects in the "proposed query" section below. Those IDs are then fed into this query:

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT
        id
    FROM
        "projects"
    WHERE
        "projects"."id" IN (
            -- 10000+ IDs removed
)),
visible_tasks AS MATERIALIZED (
    SELECT
        tasks.id
    FROM
        tasks
    WHERE
        tasks.project_id IN (
            SELECT
                id
            FROM
                visible_projects))
SELECT
    COUNT(1)
FROM
    visible_tasks;


Query plan (depesz)

```
Aggregate (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
Buffers: shared hit=73107 read=22301
CTE visible_tasks
-> Gather (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
Workers Planned: 2
Workers

Solution

The main reason for the different query plan is probably the increased number of rows that Postgres estimates to get back from projects:

(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)

vs.

(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)

Over-estimated by factor 3, which is not dramatic, but obviously enough to favor a different (inferior) query plan. Related:

  • Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N



Assuming projects.is_template is mostly false, I suggest these multicolumn indices:

CREATE INDEX ON projects(company_id, state);


Equality first, range later. See:

  • Multicolumn index and performance



You might also try to increase the statistics target on company_id, state, and ANALYZE the table, to get better estimates.

And:

CREATE INDEX ON tasks (project_id, id);


Plus increase statistics target on tasks.project_id and ANALYZE.

In both cases, the multicolumn index can replace the one on just project.company_id / task.project_id. Since all columns are integer, the size of the index will would be the same - except for the effect of index de-duplication (added with Postgres 13), which shows strongly in your test for the highly duplicative tasks.project_id. See:

  • Is a composite index also good for queries on the first field?



And this query:

SELECT t.id
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;


The direct join should be faster.

Code Snippets

CREATE INDEX ON projects(company_id, state);
CREATE INDEX ON tasks (project_id, id);
SELECT t.id
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;

Context

StackExchange Database Administrators Q#281150, answer score: 3

Revisions (0)

No revisions yet.