patternModerate
Can an index speed up a query with GROUP BY/aggregate over the whole table (no selectivity)?
Viewed 0 times
selectivitycanthegroupwholewithtablequeryoverindex
Problem
Let's say i have a table with 3 columns a,b and c.
Could i possible speed up a query that looks like this by using index(es) ??
If the above question is positive, what type of index do you recommend and how would this work?
Could i possible speed up a query that looks like this by using index(es) ??
SELECT a,b,SUM(c) # or AVG(c)
FROM table
GROUP BY a,b
ORDER BY a,b
;If the above question is positive, what type of index do you recommend and how would this work?
Solution
Not likely.
With the HashAggregate plan,
so we add an index...
... still shows the same query plan. So we disable HashAggregate
And, try again..
And, it takes more time 655ms compared to the previous 611ms.
Need faster?
If that's not fast enough (and 611ms to group and sum a million rows isn't bad). Then you can use a
Now you're down to sub-MS time when
On the actually column being aggregated.
There are some exceptions, but
you can see that it still uses a seq scan.. You'll see the same plan for
Proof using index on (a,b,c)
For whatever reason if you want to see proof that a further index on
```
-- turn this back on we turned it off earlier
SET enable_hashagg = true;
DROP INDEX idx;
CREATE INDEX idx ON foo (a,b,c);
VACUUM FULL ANALYZE foo;
EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
GROUP BY and ORDER BY typically entail a sort. However, in this case a HashAggregate is used (likely because we're working with the whole table).CREATE TABLE foo AS
SELECT x % 5 AS a, x % 10 AS b, x AS c
FROM generate_series(1,1e6) AS x;With the HashAggregate plan,
# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=23668.04..23668.16 rows=50 width=14) (actual time=611.607..611.608 rows=10 loops=1)
Sort Key: a, b
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=23666.00..23666.62 rows=50 width=14) (actual time=611.589..611.593 rows=10 loops=1)
Group Key: a, b
-> Seq Scan on foo (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.012..71.157 rows=1000000 loops=1)
Planning time: 0.168 ms
Execution time: 611.665 msso we add an index...
CREATE INDEX idx ON foo (a,b);
VACUUM FULL ANALYZE foo;... still shows the same query plan. So we disable HashAggregate
SET enable_hashagg = false;And, try again..
# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.42..61292.04 rows=50 width=14) (actual time=108.149..655.536 rows=10 loops=1)
Group Key: a, b
-> Index Scan using idx on foo (cost=0.42..53791.41 rows=1000000 width=14) (actual time=0.066..272.299 rows=1000000 loops=1)
Planning time: 0.121 ms
Execution time: 655.594 ms
(5 rows)And, it takes more time 655ms compared to the previous 611ms.
Need faster?
If that's not fast enough (and 611ms to group and sum a million rows isn't bad). Then you can use a
MATERIALIZED VIEW if your workload permits it (if the query is hot and/or infrequently updated),CREATE MATERIALIZED VIEW foo2 AS
SELECT a,b,sum(c)
FROM foo
GROUP BY a,b
ORDER BY a,b;Now you're down to sub-MS time when
TABLE foo2. Then just do REFRESH MATERIALIZED VIEW foo2; to refresh the view. Or, you can create a trigger update another table and update it with triggers.On the actually column being aggregated.
There are some exceptions, but
sum() isn't one of them. Most aggregates do not use an index because they usually do not have a need for one. The exceptions are the aggregates that are order-specific (like min() and max()). So for instance, if after we created the index on (a,b) you run a sum(a),# EXPLAIN ANALYZE SELECT sum(a) FROM foo;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Aggregate (cost=18666.00..18666.01 rows=1 width=4) (actual time=287.063..287.063 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..16166.00 rows=1000000 width=4) (actual time=0.015..85.435 rows=1000000 loops=1)
Planning time: 0.098 ms
Execution time: 287.104 ms
(4 rows)you can see that it still uses a seq scan.. You'll see the same plan for
sum(c) which doesn't have an index at all. Now here is the kicker,# EXPLAIN ANALYZE SELECT min(a) FROM foo;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.48..0.49 rows=1 width=0) (actual time=0.041..0.041 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.42..0.48 rows=1 width=4) (actual time=0.036..0.037 rows=1 loops=1)
-> Index Only Scan using idx on foo (cost=0.42..56291.41 rows=1000000 width=4) (actual time=0.035..0.035 rows=1 loops=1)
Index Cond: (a IS NOT NULL)
Heap Fetches: 1
Planning time: 0.171 ms
Execution time: 0.080 ms
(8 rows)min(a) unlike sum(a) can make use of an ordering, and so the query planner realizes that the index scan, which isn't free, has a benefit.Proof using index on (a,b,c)
For whatever reason if you want to see proof that a further index on
c doesn't matter for the purposes of summing (ask a question if after reading the above you still do not understand why),```
-- turn this back on we turned it off earlier
SET enable_hashagg = true;
DROP INDEX idx;
CREATE INDEX idx ON foo (a,b,c);
VACUUM FULL ANALYZE foo;
EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
Code Snippets
CREATE TABLE foo AS
SELECT x % 5 AS a, x % 10 AS b, x AS c
FROM generate_series(1,1e6) AS x;# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=23668.04..23668.16 rows=50 width=14) (actual time=611.607..611.608 rows=10 loops=1)
Sort Key: a, b
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=23666.00..23666.62 rows=50 width=14) (actual time=611.589..611.593 rows=10 loops=1)
Group Key: a, b
-> Seq Scan on foo (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.012..71.157 rows=1000000 loops=1)
Planning time: 0.168 ms
Execution time: 611.665 msCREATE INDEX idx ON foo (a,b);
VACUUM FULL ANALYZE foo;SET enable_hashagg = false;# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.42..61292.04 rows=50 width=14) (actual time=108.149..655.536 rows=10 loops=1)
Group Key: a, b
-> Index Scan using idx on foo (cost=0.42..53791.41 rows=1000000 width=14) (actual time=0.066..272.299 rows=1000000 loops=1)
Planning time: 0.121 ms
Execution time: 655.594 ms
(5 rows)Context
StackExchange Database Administrators Q#160586, answer score: 11
Revisions (0)
No revisions yet.