HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Median with GROUP BY

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
withmediangroup

Problem

Suppose the following table t1:

=================
| tag | val | --+ for the sake of simplicity, val is non NULL
=================
| a1 | v1 |
| a1 | v2 |
| a1 | v3 |
| a1 | v4 |
| a1 | v5 |
| a2 | v6 |
| a2 | v7 |
| a2 | v8 |
| a2 | v9 |
| ... | ... |
=================

If you execute the script below in MySQL:

SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`


You would get the average values grouped by the column tag:

=================
| tag | AVG() |
=================
| a1 | avg1 |
| a2 | avg2 |
| a3 | avg3 |
| a4 | avg4 |
| ... | ... |
=================

Besides AVG(), MySQL has several other built in functions to calculate aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()) that could be used in the same way as in the script aforementioned. However, there is no built in function for median.

This issue has already come up several other times at SE; however, most of them related to tables without GROUP BY. The only one with GROUP BY seems to be MySql: Count median grouped by day; however, the script seems to be overcomplicated.
Question

What would be an easy and simple way (if possible) to calculate this median?
Follow-up

Excellent article that complement the accepted answer:

http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/

Solution

This query could answer your question: median value and group by

SELECT tag, AVG(val) as median
            FROM 
            (
              SELECT tag, val,
                  (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                  seq,
                  (SELECT count(*) FROM median t2 WHERE t2.tag  0 and seq-delta = (ct+1)/2)
            ) T
            GROUP BY tag
            ORDER BY tag;


I tried it on this dataset (mainly from here):

+------+------+
            | tag  | val  |
            +------+------+
            |    1 |    3 |
            |    1 |   13 |


... (see explanation below)

|    3 |   12 |
            |    3 |   43 |
            |    3 |   15 |
            +------+------+


and the result was:

+------+---------+
            | tag  | median  |
            +------+---------+
            |    1 | 23.0000 |
            |    2 | 22.0000 |
            |    3 | 15.0000 |
            +------+---------+


Explanation

Inner subqueries will be computed first: sequence is (1)(2)(3)(4).

-- (4) compute the average (of 2 lines or 1 line)

SELECT tag, AVG(val) as median                          
      FROM 
      (


-- (3) get lines to compute the median value

SELECT tag, val,                                       
           (SELECT count(*) FROM median t2                    -- +number of lines for the current tag value as ct
              WHERE t2.tag = t3.tag) as ct,
           seq,
           (SELECT count(*) FROM median t2                    -- +number of lines before the current tag value as delta
              WHERE t2.tag < t3.tag) as delta                --     to compute the starting line number of a tag
         FROM (


-- (2) sort dataset by tag and sequence

SELECT tag, val,                            
                    @rownum := @rownum + 1 as seq       -- +@rownum enable to create a sequence from 0 by 1
              FROM (


-- (1) sort dataset by tag and value

SELECT * FROM median           
                    ORDER BY tag, val) t1


-- (2) continue here

ORDER BY tag, seq
            ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- +use to set @rownum to 0 (no data)


-- (3) continue here

HAVING (ct%2 = 0                                      -- +when ct is even, select the two lines around the middle
                  and seq-delta between floor((ct+1)/2) 
                                and floor((ct+1)/2) +1)
           or (ct%2 <> 0                                       -- +when ct is odd, select the one line in the middle
                  and seq-delta = (floor(ct+1)/2))
      ) T


-- (4) continue here

GROUP BY tag
      ORDER BY tag;


Dataset:

after (1)     after (2)           processing (3)   
    +------+------+                   
    | tag  | val  |  ct  delta  seq       seq-delta
    +------+------+                   
    |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
    |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
    |    1 |    7 |  15    0     3        3         
    |    1 |   12 |  15    0     4        4         
    |    1 |   13 |  15    0     5        5
    |    1 |   14 |  15    0     6        6
    |    1 |   21 |  15    0     7        7
    |    1 |   23 |  15    0     8        8 ---> keep this line
    |    1 |   23 |  15    0     9        9 
    |    1 |   23 |  15    0     10       10
    |    1 |   23 |  15    0     11       11
    |    1 |   29 |  15    0     12       12
    |    1 |   39 |  15    0     13       13
    |    1 |   40 |  15    0     14       14
    |    1 |   56 |  15    0     15       15

    |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
    |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
    |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
    |    2 |   12 |  14    15    19        4
    |    2 |   13 |  14    15    20        5
    |    2 |   14 |  14    15    21        6
    |    2 |   21 |  14    15    22        7 ---> keep this line
    |    2 |   23 |  14    15    23        8 ---> keep this line
    |    2 |   23 |  14    15    24        9
    |    2 |   23 |  14    15    25        10
    |    2 |   23 |  14    15    26        11
    |    2 |   29 |  14    15    27        12
    |    2 |   40 |  14    15    28        13
    |    2 |   56 |  14    15    29        14

    |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
    |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
    |    3 |   43 |  3     29    32        3
    +------+------+


Dataset after (3)

```
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 |

Code Snippets

SELECT tag, AVG(val) as median
            FROM 
            (
              SELECT tag, val,
                  (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                  seq,
                  (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
                FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
                      FROM (SELECT * FROM median ORDER BY tag, val) t1 
                      ORDER BY tag, seq
                    ) t3 CROSS JOIN (SELECT @rownum := 0) x
                HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
                  or (ct%2 <> 0 and seq-delta = (ct+1)/2)
            ) T
            GROUP BY tag
            ORDER BY tag;
+------+------+
            | tag  | val  |
            +------+------+
            |    1 |    3 |
            |    1 |   13 |
|    3 |   12 |
            |    3 |   43 |
            |    3 |   15 |
            +------+------+
+------+---------+
            | tag  | median  |
            +------+---------+
            |    1 | 23.0000 |
            |    2 | 22.0000 |
            |    3 | 15.0000 |
            +------+---------+
SELECT tag, AVG(val) as median                          
      FROM 
      (

Context

StackExchange Database Administrators Q#158333, answer score: 8

Revisions (0)

No revisions yet.