patternsqlMinor

Improving performance for sum and group by with large datasets

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

groupwithdatasetslargeperformanceforsumandimproving

Problem

I have a query that looked like this:

SELECT sum(score) as score, count(distinct(user_id)) number, round(sum(score)/count(distinct(user_id))) score_per_member, teams.*, types.name as type 
FROM `teams` inner join `users` on `teams`.`id` = `users`.`team_id` 
inner join `results` on `results`.`user_id` = `users`.`id`
inner join `types` on `types`.`id` = `teams`.`type_id`
WHERE `results`.`year` = '2015'
GROUP BY `team_id`
ORDER BY `score` DESC;

Users has 250k rows. Results has 4m rows with 1m with year 2015.

Although explain doesn't look too bad:

1   SIMPLE  teams   ALL PRIMARY,teams_type_id_foreign   NULL    NULL    NULL    4449    Using temporary; Using filesort
1   SIMPLE  types   eq_ref  PRIMARY PRIMARY 4   teams.type_id   1   
1   SIMPLE  users   ref PRIMARY,users_team_id_foreign   users_team_id_foreign   5   teams.id    57  Using where; Using index
1   SIMPLE  results ref user_id,results_year_user_id_index  results_year_user_id_index  6   const,users.id  6

It was taking 50 seconds to run.

I refactored it to:

SELECT score, number, round(score/number) score_per_member, teams.*, types.name as type FROM `teams`  
    inner join 
        (
            select sum(score) score, team_id, count(distinct(user_id)) number
            from `results` 
            join users on `results`.`user_id` = `users`.`id`
            WHERE `year` = '2015' and year_id is not null
            GROUP BY `team_id`
    ) r on teams.id = r.team_id
    inner join `types` on `types`.`id` = `teams`.`type_id` ORDER BY `score` DESC;

Explain actually 'looks' much worse:

```
1 PRIMARY ALL NULL NULL NULL NULL 2045 Using filesort
1 PRIMARY teams eq_ref PRIMARY,teams_type_id_foreign PRIMARY 4 r.team_id 1
1 PRIMARY types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
2 DERIVED users range PRIMARY,users_team_id_foreign users_team_id_foreign 5 NULL 132496 Using where; Using index
2 DERIVED results ref user_id,results_

Solution

The big issue is "inflation-deflation" when you do JOIN + GROUP BY.

Think of what JOIN does -- it generates an intermediate table with teams types/team users/type/team * results/... That is, that table is much bigger than any of the original tables.

Then it does the GROUP BY do deflate that back to one row per team.

Usually the combination gives COUNTs and SUMs that are bigger than expected. (Did you check?) And it takes a long time.

So, the trick is to use subqueries to avoid any single query with both JOIN and GROUP BY in it. (The combo is ok if the JOIN is known to be 1:1, which seems to be the case for teams and types, but not the other JOINs.)

There are two ways to do such subqueries:

SELECT
        ( SELECT SUM() FROM ... WHERE ... ) -- correlated subquery; single result only
    FROM ...

The subquery is effectively a GROUP BY without saying it, but it boils down to a single value, thereby avoiding the "inflation".

and

SELECT ...
    FROM ( SELECT COUNT(), SUM(...) FROM ... GROUP BY ...
         ) AS x  -- tmp table for some of the data; no JOIN
    JOIN ...

The subquery "deflates" some of the data before getting to the JOIN.

Caution: Using multiple subqueries of the second kind is slow because of lack of indexes:

SELECT ...
    FROM ( SELECT ... ) a
    JOIN ( SELECT ... ) b ON ...

5.6 helps somewhat because it discovers and adds an index.

Code Snippets

SELECT
        ( SELECT SUM() FROM ... WHERE ... ) -- correlated subquery; single result only
    FROM ...

SELECT ...
    FROM ( SELECT COUNT(), SUM(...) FROM ... GROUP BY ...
         ) AS x  -- tmp table for some of the data; no JOIN
    JOIN ...

SELECT ...
    FROM ( SELECT ... ) a
    JOIN ( SELECT ... ) b ON ...

Context

StackExchange Database Administrators Q#120784, answer score: 4

Revisions (0)

No revisions yet.