Why does a list of 10,000 IDs perform better than using the equivalent SQL to select them?
Problem
I have a Rails application with a legacy query that I'd like to refurbish. The current implementation performs two SQL queries: one to get a large number of IDs and a second query that uses those IDs and applies some additional joins and filters to get the desired result.
I am trying to replace this with a single query that avoids the round trip, but doing so has incurred a large performance degradation in my local testing environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it due to not needing to send all the IDs around.
This is a fairly minimized version of my actual problem. A slightly larger version is discussed at Why does a list of 10,000 IDs perform better in a complicated query with multiple CTEs compared to the equivalent SQL to select them?.
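Schematically, the two-step pattern and the intended single-query replacement look like this (a simplified sketch; the predicates are borrowed from the example in the answer, and the real query carries additional joins and filters):

```
-- Step 1: first round trip - collect the visible project IDs
SELECT id FROM projects WHERE company_id = 11171 AND state < 6;

-- Step 2: second round trip - ship all 10000+ IDs back to the database
SELECT COUNT(1) FROM tasks WHERE project_id IN (/* 10000+ IDs */);

-- Intended replacement: a single query, no ID round trip
SELECT COUNT(1)
FROM tasks t
JOIN projects p ON p.id = t.project_id
WHERE p.company_id = 11171
  AND p.state < 6;
```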
Current Query
There's a query that takes ~6.5 seconds to calculate a list of 10000+ IDs. You can see that as the CTE visible_projects in the "proposed query" section below. Those IDs are then fed into this query:

```
EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
    SELECT id
    FROM "projects"
    WHERE "projects"."id" IN (
        -- 10000+ IDs removed
    )
),
visible_tasks AS MATERIALIZED (
    SELECT tasks.id
    FROM tasks
    WHERE tasks.project_id IN (
        SELECT id
        FROM visible_projects
    )
)
SELECT COUNT(1)
FROM visible_tasks;
```

Query plan (depesz):

```
Aggregate (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
  Buffers: shared hit=73107 read=22301
  CTE visible_tasks
    ->  Gather (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
          Workers Planned: 2
          Workers
```
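The query above relies on the CTE materialization hints available since Postgres 12: NOT MATERIALIZED allows the planner to inline the CTE into the outer query, while MATERIALIZED forces it to be computed once as a standalone result, acting as an optimization fence. A minimal sketch of the two hints (using a hypothetical table, for illustration only):

```
-- Inlined: the planner may push the outer predicate into the CTE
-- and use an index on big_table.id (hypothetical table).
WITH ids AS NOT MATERIALIZED (SELECT id FROM big_table)
SELECT * FROM ids WHERE id = 42;

-- Materialized: the CTE is evaluated in full before the outer
-- query runs, so the predicate cannot be pushed down.
WITH ids AS MATERIALIZED (SELECT id FROM big_table)
SELECT * FROM ids WHERE id = 42;
```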
Solution
The main reason for the different query plan is probably the increased number of rows that Postgres estimates to get back from projects:

```
(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
```

vs.

```
(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)
```

Over-estimated by factor 3, which is not dramatic, but obviously enough to favor a different (inferior) query plan. Related:

- Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N

Assuming projects.is_template is mostly false, I suggest these multicolumn indices:

```
CREATE INDEX ON projects (company_id, state);
```

Equality first, range later. See:

- Multicolumn index and performance

You might also try to increase the statistics target on company_id and state, and ANALYZE the table, to get better estimates.

And:

```
CREATE INDEX ON tasks (project_id, id);
```

Plus increase the statistics target on tasks.project_id and ANALYZE.

In both cases, the multicolumn index can replace the one on just projects.company_id / tasks.project_id. Since all columns are integer, the size of the index would be the same - except for the effect of index de-duplication (added with Postgres 13), which shows strongly in your test for the highly duplicative tasks.project_id. See:

- Is a composite index also good for queries on the first field?

And this query:

```
SELECT t.id
FROM projects p
JOIN tasks t ON t.project_id = p.id
WHERE p.company_id = 11171
  AND p.state < 6
  AND p.is_template = FALSE;
```

The direct join should be faster.
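The statistics-target advice above can be spelled out as follows (the value 1000 is an arbitrary illustration; the Postgres default comes from default_statistics_target, which is 100):

```
-- Raise the per-column statistics targets (1000 is an example value;
-- larger targets mean bigger samples and slower ANALYZE).
ALTER TABLE projects ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE projects ALTER COLUMN state      SET STATISTICS 1000;
ALTER TABLE tasks    ALTER COLUMN project_id SET STATISTICS 1000;

-- Recollect statistics so the planner sees the new sample size.
ANALYZE projects;
ANALYZE tasks;
```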
Code Snippets
```
CREATE INDEX ON projects (company_id, state);

CREATE INDEX ON tasks (project_id, id);

SELECT t.id
FROM projects p
JOIN tasks t ON t.project_id = p.id
WHERE p.company_id = 11171
  AND p.state < 6
  AND p.is_template = FALSE;
```

Context
StackExchange Database Administrators Q#281150, answer score: 3