SUM() ignores GROUP BY and sums up 4 rows instead of 2

Submitted by: @import:stackexchange-dba

Problem

I'm having difficulty with GROUP BY in MySQL.

My database setup:

client_visit
    - id
    - schedule_event_id
    - member_id
    - status_type_id (type_of_visit table)
    - visit_starts_at
    - visit_ends_at

member
    - id

schedule_event
    - id
    - member_id
    - starts_at
    - ends_at

type_of_visit
    - id
    - type (TYPE_BOOKED, TYPE_PRESENT etc)


For the purpose of this question: a member teaches a class or leads an activity (a schedule_event) at a given time. A client signs up for this class or activity.

For example:

Clients A, B and C book visits. Each booking becomes a row in the client_visit table, which holds a schedule_event_id and a member_id, so we know which class or activity was booked and which member is teaching or leading it.

Now, we want to know the total time a given member spent teaching/leading events that clients signed up for (based on the client_visit status_type_id pointing to a type_of_visit equivalent to "Booked" or "Present"). We'll take member ID 82 as our test case.

Member ID 82 had 4 clients in two different classes, so if each class took 2h 15 minutes (8100 seconds), that means total time should be 16200 seconds.

Here's my query first:

SELECT cv.member_id AS `member_id`,
       sch.id AS `scheduleId`,
       cv.visit_starts_at AS `visitStartsAt`,
       TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS `totalTime`
  FROM `schedule_event` AS `sch`
         LEFT JOIN `client_visit` AS `cv` ON cv.schedule_event_id = sch.id
         INNER JOIN `type_of_visit` AS `tov` ON tov.id = cv.status_type_id
 WHERE (tov.type = 'TYPE_BOOKED' OR tov.type = 'TYPE_PRESENT') and cv.member_id = 82


The result is as follows:

This shows me the clients for the first class, and the one for the second. I just want two rows, one for each class. So, I add this:

GROUP BY sch.id


Now, the result is as follows:

So far so good.

I know that there are two schedule IDs for this member, so I modified the GROUP BY to pull those together into one:

```
GROUP BY sch.id AND cv.member_i
```

Solution

I believe you have a misunderstanding of what GROUP BY does. That's not surprising; I had issues myself when first learning, in large part because the MySQL manual doesn't explicitly say what GROUP BY does, at least not that I can find (and I searched a lot, just now; there are plenty of caveats and special behaviors, but not so much an actual definition).

My (on the fly) Definition:


GROUP BY condenses your SELECT results so that only 1 row is returned
for each distinct combination of values for the columns specified in
the GROUP BY clause. In that sense, it is similar to DISTINCT, but
works on the columns in the GROUP BY instead of the SELECT statement.
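
To make the comparison with DISTINCT concrete, here is a minimal sketch. The table `visit_demo` and its rows are made up for illustration (they are not from the question); the only assumption is that three visits belong to one class and one visit to another.

```sql
-- Hypothetical sample data: one row per client visit, tagged with its class id.
DROP TEMPORARY TABLE IF EXISTS visit_demo;
CREATE TEMPORARY TABLE visit_demo (
    schedule_event_id INT NOT NULL,
    member_id         INT NOT NULL
);
INSERT INTO visit_demo (schedule_event_id, member_id) VALUES
    (1, 82), (1, 82), (1, 82),  -- three visits to class 1
    (2, 82);                    -- one visit to class 2

-- GROUP BY: one row per distinct schedule_event_id (two rows: 1 and 2).
SELECT schedule_event_id
  FROM visit_demo
 GROUP BY schedule_event_id;

-- DISTINCT: the same two rows, because here the SELECT list happens to
-- contain exactly the column used in the GROUP BY above.
SELECT DISTINCT schedule_event_id
  FROM visit_demo;
```

The two queries only coincide because the selected column and the grouped column are the same; as soon as the SELECT list contains other columns, DISTINCT and GROUP BY condense different things.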

In non-MySQL land, you can only SELECT columns you specify in your GROUP BY clause, PLUS any aggregate functions you want. Those aggregate functions, including SUM, operate on a per-group basis: each one reports a single result computed over all the rows in its group, including the now "hidden" extra rows.

As you can see, that is what your query is actually doing (or would be, but I think you gave an inaccurate example, as ypercube points out in the comments). It is summing up all the now-hidden extra rows, and reporting their total, for the given sch.id.
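
A rough reproduction of that effect, under stated assumptions: the tables `sch_demo` and `cv_demo` are cut-down, hypothetical stand-ins for schedule_event and client_visit, the dates are invented, and the type_of_visit join is left out to keep the sketch short. Member 82 teaches two classes of 8100 seconds each, with three clients in the first and one in the second.

```sql
-- Hypothetical stand-ins for schedule_event and client_visit.
DROP TEMPORARY TABLE IF EXISTS cv_demo;
DROP TEMPORARY TABLE IF EXISTS sch_demo;

CREATE TEMPORARY TABLE sch_demo (
    id        INT PRIMARY KEY,
    member_id INT NOT NULL,
    starts_at DATETIME NOT NULL,
    ends_at   DATETIME NOT NULL
);
CREATE TEMPORARY TABLE cv_demo (
    id                INT PRIMARY KEY,
    schedule_event_id INT NOT NULL,
    member_id         INT NOT NULL
);

-- Two classes of 2h 15min (8100 seconds) each.
INSERT INTO sch_demo (id, member_id, starts_at, ends_at) VALUES
    (1, 82, '2017-06-01 09:00:00', '2017-06-01 11:15:00'),
    (2, 82, '2017-06-02 09:00:00', '2017-06-02 11:15:00');

-- Four client visits: three for class 1, one for class 2.
INSERT INTO cv_demo (id, schedule_event_id, member_id) VALUES
    (1, 1, 82), (2, 1, 82), (3, 1, 82), (4, 2, 82);

-- The join yields four rows (one per visit). GROUP BY condenses them to
-- two rows, but SUM still adds up every joined row inside each group.
SELECT sch.id                                                 AS scheduleId,
       COUNT(*)                                               AS joinedRows,
       SUM(TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at)) AS totalTime
  FROM sch_demo AS sch
  JOIN cv_demo  AS cv ON cv.schedule_event_id = sch.id
 WHERE cv.member_id = 82
 GROUP BY sch.id;
-- scheduleId 1: joinedRows = 3, totalTime = 24300 (3 * 8100)
-- scheduleId 2: joinedRows = 1, totalTime =  8100 (1 * 8100)
```

With this sample data the per-class totals add up to 32400 seconds across the four joined rows, not the 16200 seconds the two classes actually lasted.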

If you want the total of only distinct values of each sch.id, you'll have to do things differently to get the information you want.

One reason it is not simple is that MySQL has no idea WHICH row you want to include in the sum. They may all be the same in your example (8100), but there's no guarantee of that.

Since MySQL allows you to select columns that are neither specified in the GROUP BY clause nor are aggregate functions (when the ONLY_FULL_GROUP_BY SQL mode is disabled; it is enabled by default as of MySQL 5.7.5), it essentially chooses one value at "random" and displays it to you. While not actually random, the choice is nondeterministic, and can change at any time for the same query and data, even if it appears to you to always give the same result.

So, before you can proceed, you need to decide how you want to determine which row for each sch.id contains the value you want to sum.
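
One way to make that decision explicit is to state it with an aggregate such as MIN or MAX instead of leaving a bare column in the SELECT list. The sketch below is only an illustration: `visit_pick_demo` and its timestamps are invented, and whether "earliest" or "latest" is the right rule depends on your data.

```sql
-- Hypothetical sample data (table name and rows are made up).
DROP TEMPORARY TABLE IF EXISTS visit_pick_demo;
CREATE TEMPORARY TABLE visit_pick_demo (
    schedule_event_id INT      NOT NULL,
    visit_starts_at   DATETIME NOT NULL
);
INSERT INTO visit_pick_demo (schedule_event_id, visit_starts_at) VALUES
    (1, '2017-06-01 09:00:00'),
    (1, '2017-06-01 09:05:00'),
    (1, '2017-06-01 09:10:00'),
    (2, '2017-06-02 09:00:00');

-- Each non-grouped column is wrapped in an aggregate, so the value shown
-- for each class is chosen by a stated rule rather than left to the server.
SELECT schedule_event_id,
       MIN(visit_starts_at) AS earliestVisit,
       MAX(visit_starts_at) AS latestVisit
  FROM visit_pick_demo
 GROUP BY schedule_event_id;
-- schedule_event_id 1: earliestVisit = 2017-06-01 09:00:00, latestVisit = 2017-06-01 09:10:00
-- schedule_event_id 2: earliestVisit = latestVisit = 2017-06-02 09:00:00
```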

If you know the values to always be the same, then one simple (although not necessarily optimized) solution is to wrap your original GROUP BY query in another query (making the original a subquery) and then use the SUM function in the outer query, without a GROUP BY clause. The subquery will remove your duplicates, and the outer query will sum up the total of the de-duplicated rows.
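
A minimal sketch of that wrapping approach follows. It assumes cut-down, hypothetical tables (`sch_demo`, `cv_demo`) with invented dates, and it omits the type_of_visit filter from the question for brevity; adapt the names and the WHERE clause to the real schema.

```sql
-- Hypothetical stand-ins for schedule_event and client_visit.
DROP TEMPORARY TABLE IF EXISTS cv_demo;
DROP TEMPORARY TABLE IF EXISTS sch_demo;

CREATE TEMPORARY TABLE sch_demo (
    id        INT PRIMARY KEY,
    member_id INT NOT NULL,
    starts_at DATETIME NOT NULL,
    ends_at   DATETIME NOT NULL
);
CREATE TEMPORARY TABLE cv_demo (
    id                INT PRIMARY KEY,
    schedule_event_id INT NOT NULL,
    member_id         INT NOT NULL
);

-- Two classes of 8100 seconds each; four client visits in total.
INSERT INTO sch_demo (id, member_id, starts_at, ends_at) VALUES
    (1, 82, '2017-06-01 09:00:00', '2017-06-01 11:15:00'),
    (2, 82, '2017-06-02 09:00:00', '2017-06-02 11:15:00');
INSERT INTO cv_demo (id, schedule_event_id, member_id) VALUES
    (1, 1, 82), (2, 1, 82), (3, 1, 82), (4, 2, 82);

-- Inner query: one row per class (the duplicates from the join are removed).
-- Outer query: sums the de-duplicated rows, with no GROUP BY of its own.
SELECT SUM(per_class.totalTime) AS totalTime
  FROM (
        SELECT sch.id AS scheduleId,
               TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS totalTime
          FROM sch_demo AS sch
          JOIN cv_demo  AS cv ON cv.schedule_event_id = sch.id
         WHERE cv.member_id = 82
         GROUP BY sch.id, sch.starts_at, sch.ends_at
       ) AS per_class;
-- totalTime = 16200 (8100 + 8100)
```

Grouping by sch.starts_at and sch.ends_at in addition to sch.id does not change the groups, since both columns depend on sch.id; it is only there so the inner query does not rely on MySQL picking a value for a non-grouped column.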

Context

StackExchange Database Administrators Q#177166, answer score: 7
