patternsqlMinor
Improving performance for sum and group by with large datasets
Viewed 0 times
groupwithdatasetslargeperformanceforsumandimproving
Problem
I have a query that looked like this:
Users has 250k rows. Results has 4m rows with 1m with year 2015.
Although explain doesn't look too bad:
It was taking 50 seconds to run.
I refactored it to:
Explain actually 'looks' much worse:
```
1 PRIMARY ALL NULL NULL NULL NULL 2045 Using filesort
1 PRIMARY teams eq_ref PRIMARY,teams_type_id_foreign PRIMARY 4 r.team_id 1
1 PRIMARY types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
2 DERIVED users range PRIMARY,users_team_id_foreign users_team_id_foreign 5 NULL 132496 Using where; Using index
2 DERIVED results ref user_id,results_
SELECT sum(score) as score, count(distinct(user_id)) number, round(sum(score)/count(distinct(user_id))) score_per_member, teams.*, types.name as type
FROM `teams` inner join `users` on `teams`.`id` = `users`.`team_id`
inner join `results` on `results`.`user_id` = `users`.`id`
inner join `types` on `types`.`id` = `teams`.`type_id`
WHERE `results`.`year` = '2015'
GROUP BY `team_id`
ORDER BY `score` DESC;Users has 250k rows. Results has 4m rows with 1m with year 2015.
Although explain doesn't look too bad:
1 SIMPLE teams ALL PRIMARY,teams_type_id_foreign NULL NULL NULL 4449 Using temporary; Using filesort
1 SIMPLE types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
1 SIMPLE users ref PRIMARY,users_team_id_foreign users_team_id_foreign 5 teams.id 57 Using where; Using index
1 SIMPLE results ref user_id,results_year_user_id_index results_year_user_id_index 6 const,users.id 6It was taking 50 seconds to run.
I refactored it to:
SELECT score, number, round(score/number) score_per_member, teams.*, types.name as type FROM `teams`
inner join
(
select sum(score) score, team_id, count(distinct(user_id)) number
from `results`
join users on `results`.`user_id` = `users`.`id`
WHERE `year` = '2015' and year_id is not null
GROUP BY `team_id`
) r on teams.id = r.team_id
inner join `types` on `types`.`id` = `teams`.`type_id` ORDER BY `score` DESC;Explain actually 'looks' much worse:
```
1 PRIMARY ALL NULL NULL NULL NULL 2045 Using filesort
1 PRIMARY teams eq_ref PRIMARY,teams_type_id_foreign PRIMARY 4 r.team_id 1
1 PRIMARY types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
2 DERIVED users range PRIMARY,users_team_id_foreign users_team_id_foreign 5 NULL 132496 Using where; Using index
2 DERIVED results ref user_id,results_
Solution
The big issue is "inflation-deflation" when you do
Think of what
Then it does the
Usually the combination gives
So, the trick is to use subqueries to avoid any single query with both
There are two ways to do such subqueries:
The subquery is effectively a
and
The subquery "deflates" some of the data before getting to the
Caution: Using multiple subqueries of the second kind is slow because of lack of indexes:
5.6 helps somewhat because it discovers and adds an index.
JOIN + GROUP BY.Think of what
JOIN does -- it generates an intermediate table with teams types/team users/type/team * results/... That is, that table is much bigger than any of the original tables.Then it does the
GROUP BY do deflate that back to one row per team.Usually the combination gives
COUNTs and SUMs that are bigger than expected. (Did you check?) And it takes a long time.So, the trick is to use subqueries to avoid any single query with both
JOIN and GROUP BY in it. (The combo is ok if the JOIN is known to be 1:1, which seems to be the case for teams and types, but not the other JOINs.)There are two ways to do such subqueries:
SELECT
( SELECT SUM() FROM ... WHERE ... ) -- correlated subquery; single result only
FROM ...The subquery is effectively a
GROUP BY without saying it, but it boils down to a single value, thereby avoiding the "inflation".and
SELECT ...
FROM ( SELECT COUNT(), SUM(...) FROM ... GROUP BY ...
) AS x -- tmp table for some of the data; no JOIN
JOIN ...The subquery "deflates" some of the data before getting to the
JOIN.Caution: Using multiple subqueries of the second kind is slow because of lack of indexes:
SELECT ...
FROM ( SELECT ... ) a
JOIN ( SELECT ... ) b ON ...5.6 helps somewhat because it discovers and adds an index.
Code Snippets
SELECT
( SELECT SUM() FROM ... WHERE ... ) -- correlated subquery; single result only
FROM ...SELECT ...
FROM ( SELECT COUNT(), SUM(...) FROM ... GROUP BY ...
) AS x -- tmp table for some of the data; no JOIN
JOIN ...SELECT ...
FROM ( SELECT ... ) a
JOIN ( SELECT ... ) b ON ...Context
StackExchange Database Administrators Q#120784, answer score: 4
Revisions (0)
No revisions yet.