snippet · sql · Minor

PostgreSQL filter/aggregate performance fluctuates depending on condition value

Submitted by: @import:stackexchange-dba
Tags: postgresql, condition, value, filter, performance, depending, aggregate, fluctuates

Problem

I'm experiencing a strange variation in filter performance depending on the value in the WHERE condition with this query:

SELECT MIN(revenue) AS min_revenue, 
       MAX(revenue) AS max_revenue 
FROM   v_premises_filter 
WHERE  area IS NOT NULL 
AND    v_premises_filter.scat_group_code = 1;


The query with the = 1 condition executes much faster than the one with = 23.

I have noticed that the drastic slowdown happens when the aggregated column (revenue in the example above) has only NULL values for the rows matching the WHERE condition.

I would understand this if it happened with an index lookup, but it happens when using a filter, which to my knowledge should scan all rows.

With scat_group_code =1, the query executes quickly:

Result (cost=2.92..2.93 rows=1 width=0) (actual time=0.239..0.241 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit (cost=0.43..1.46 rows=1 width=8) (actual time=0.134..0.135 rows=1 loops=1)
          ->  Index Scan using v_premises_filter_revenue_idx on v_premises_filter (cost=0.43..310356.33 rows=301212 width=8) (actual time=0.130..0.130 rows=1 loops=1)
                Index Cond: (revenue IS NOT NULL)
                Filter: ((area IS NOT NULL) AND (scat_group_code = 1))
                Rows Removed by Filter: 20
  InitPlan 2 (returns $1)
    ->  Limit (cost=0.43..1.46 rows=1 width=8) (actual time=0.094..0.096 rows=1 loops=1)
          ->  Index Scan Backward using v_premises_filter_revenue_idx on v_premises_filter v_premises_filter_1 (cost=0.43..310356.33 rows=301212 width=8) (actual time=0.091..0.091 rows=1 loops=1)
                Index Cond: (revenue IS NOT NULL)
                Filter: ((area IS NOT NULL) AND (scat_group_code = 1))
                Rows Removed by Filter: 17
Planning time: 0.359 ms
Execution time: 0.276 ms


However, with scat_group_code = 23, the same query with the same execution plan takes much longer:

Result (cost=129.15..129.16 rows=1 width=0) (actual time=8138.644..8138.669 rows=1 loops=1)
  InitPlan 1

Solution

You have a lot of indexes, and my first impulse would be to check whether all of them are actually used and remove the unused ones.
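
To find candidates, you can ask the statistics collector which indexes have never been scanned. A minimal sketch (pg_stat_user_indexes and pg_index are standard catalogs; idx_scan = 0 only proves "unused since the last statistics reset", so observe for a while before dropping anything):

SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM   pg_stat_user_indexes s
JOIN   pg_index i ON i.indexrelid = s.indexrelid
WHERE  s.idx_scan = 0          -- never scanned since the last stats reset
AND    NOT i.indisunique       -- unique indexes enforce constraints, keep them
ORDER  BY pg_relation_size(s.indexrelid) DESC;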

The optimal index for your query is not there yet, though: a multicolumn index on (scat_group_code, revenue). Since this is a materialized view, it's reasonable to assume a read-only situation, which is perfect for index-only scans.

Also, since the table (many columns, some potentially big text columns) is much wider than the index (two columns, fewer rows for the partial variant), an index-only scan is particularly efficient here. Size matters.
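
You can put rough numbers on that with pg_relation_size(). A quick check, using the relation names from your plans:

SELECT pg_size_pretty(pg_relation_size('v_premises_filter'))             AS table_size,
       pg_size_pretty(pg_relation_size('v_premises_filter_revenue_idx')) AS index_size;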

If this particular query is very common / important, I would go one step further and make it a partial, multicolumn index:

CREATE INDEX ON v_premises_filter (scat_group_code, revenue)
WHERE area IS NOT NULL
AND   revenue IS NOT NULL;


I added the second condition revenue IS NOT NULL because min() and max() ignore NULL values. I tested with Postgres 9.6. In older versions you may have to add the logically redundant predicate AND revenue IS NOT NULL to your query to make the query planner understand that the index is applicable.
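
For those older versions, the adjusted query would look like this (the extra predicate changes nothing about the result, since min() and max() ignore NULLs anyway):

SELECT MIN(revenue) AS min_revenue,
       MAX(revenue) AS max_revenue
FROM   v_premises_filter
WHERE  area IS NOT NULL
AND    revenue IS NOT NULL   -- logically redundant, but lets the planner match the partial index
AND    scat_group_code = 23;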

The index your query used in your tests (v_premises_filter_revenue_idx) is on (revenue), which is pretty inefficient for a predicate on scat_group_code. Adding that column as the first index column makes a big difference, because Postgres can now cheaply read only the slice with the matching scat_group_code. Where your second query had to filter 1631527 rows (Rows Removed by Filter: 1631527), it only has to filter the far fewer rows with area IS NULL OR revenue IS NULL for the plain index, and none at all for the partial index.

Also, since irrelevant rows are excluded from the partial index, it is substantially smaller, which contributes to performance as well.

You should see two index-only scans and a much faster query in either case now.
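
To verify, you can run the query under EXPLAIN and check that both InitPlans show an Index Only Scan node:

EXPLAIN (ANALYZE, BUFFERS)
SELECT MIN(revenue) AS min_revenue,
       MAX(revenue) AS max_revenue
FROM   v_premises_filter
WHERE  area IS NOT NULL
AND    scat_group_code = 23;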

Related:

  • Is a composite index also good for queries on the first field?
  • How to properly implement compound greatest-n filtering

Why the fluctuation?

Your table is big (1909175 rows) and the data distribution is obviously irregular. The default_statistics_target setting of 100 is probably too small to gather enough detail. Postgres decides on a bad query plan for your second query based on insufficient information. It might help to increase the STATISTICS setting for scat_group_code, revenue, and area.
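
A sketch of how that could look; the target of 1000 is an arbitrary pick (the maximum is 10000), and ALTER MATERIALIZED VIEW accepts the same SET STATISTICS subform as ALTER TABLE:

ALTER MATERIALIZED VIEW v_premises_filter
  ALTER COLUMN scat_group_code SET STATISTICS 1000,
  ALTER COLUMN revenue         SET STATISTICS 1000,
  ALTER COLUMN area            SET STATISTICS 1000;

ANALYZE v_premises_filter;   -- re-gather statistics with the new targets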

But you won't need this with the suggested new index any more, because an index-only scan on the new index is always the best plan.

Consider this related answer from just yesterday:

  • Inconsistent statistics on jsonb column with btree index

Code Snippets

CREATE INDEX ON v_premises_filter (scat_group_code, revenue)
WHERE area IS NOT NULL
AND   revenue IS NOT NULL;

Context

StackExchange Database Administrators Q#167625, answer score: 4
