patternsqlMinor
Median with GROUP BY
Viewed 0 times
withmediangroup
Problem
Suppose the following table
=================
| tag | val | --+ for the sake of simplicity, val is non NULL
=================
| a1 | v1 |
| a1 | v2 |
| a1 | v3 |
| a1 | v4 |
| a1 | v5 |
| a2 | v6 |
| a2 | v7 |
| a2 | v8 |
| a2 | v9 |
| ... | ... |
=================
If you execute the script below in MySQL:
You would get the average values grouped by the column
=================
| tag | AVG() |
=================
| a1 | avg1 |
| a2 | avg2 |
| a3 | avg3 |
| a4 | avg4 |
| ... | ... |
=================
Besides
This issue has already come up several other times at SE; however, most of them related to tables without
Question
What would be an easy and simple way (if possible) to calculate this median?
Follow-up
Excellent article that complement the accepted answer:
http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/
t1:=================
| tag | val | --+ for the sake of simplicity, val is non NULL
=================
| a1 | v1 |
| a1 | v2 |
| a1 | v3 |
| a1 | v4 |
| a1 | v5 |
| a2 | v6 |
| a2 | v7 |
| a2 | v8 |
| a2 | v9 |
| ... | ... |
=================
If you execute the script below in MySQL:
SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`You would get the average values grouped by the column
tag:=================
| tag | AVG() |
=================
| a1 | avg1 |
| a2 | avg2 |
| a3 | avg3 |
| a4 | avg4 |
| ... | ... |
=================
Besides
AVG(), MySQL has several other built in functions to calculate aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()) that could be used in the same way as in the script aforementioned. However, there is no built in function for median.This issue has already come up several other times at SE; however, most of them related to tables without
GROUP BY. The only one with GROUP BY seems to be MySql: Count median grouped by day; however, the script seems to be overcomplicated.Question
What would be an easy and simple way (if possible) to calculate this median?
Follow-up
Excellent article that complement the accepted answer:
http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/
Solution
This query could answer your question: median value and group by
I tried it on this dataset (mainly from here):
... (see explanation below)
and the result was:
Explanation
Inner subqueries will be computed first: sequence is (1)(2)(3)(4).
-- (4) compute the average (of 2 lines or 1 line)
-- (3) get lines to compute the median value
-- (2) sort dataset by tag and sequence
-- (1) sort dataset by tag and value
-- (2) continue here
-- (3) continue here
-- (4) continue here
Dataset:
Dataset after (3)
```
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 |
SELECT tag, AVG(val) as median
FROM
(
SELECT tag, val,
(SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 WHERE t2.tag 0 and seq-delta = (ct+1)/2)
) T
GROUP BY tag
ORDER BY tag;I tried it on this dataset (mainly from here):
+------+------+
| tag | val |
+------+------+
| 1 | 3 |
| 1 | 13 |... (see explanation below)
| 3 | 12 |
| 3 | 43 |
| 3 | 15 |
+------+------+and the result was:
+------+---------+
| tag | median |
+------+---------+
| 1 | 23.0000 |
| 2 | 22.0000 |
| 3 | 15.0000 |
+------+---------+Explanation
Inner subqueries will be computed first: sequence is (1)(2)(3)(4).
-- (4) compute the average (of 2 lines or 1 line)
SELECT tag, AVG(val) as median
FROM
(-- (3) get lines to compute the median value
SELECT tag, val,
(SELECT count(*) FROM median t2 -- +number of lines for the current tag value as ct
WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 -- +number of lines before the current tag value as delta
WHERE t2.tag < t3.tag) as delta -- to compute the starting line number of a tag
FROM (-- (2) sort dataset by tag and sequence
SELECT tag, val,
@rownum := @rownum + 1 as seq -- +@rownum enable to create a sequence from 0 by 1
FROM (-- (1) sort dataset by tag and value
SELECT * FROM median
ORDER BY tag, val) t1-- (2) continue here
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x -- +use to set @rownum to 0 (no data)-- (3) continue here
HAVING (ct%2 = 0 -- +when ct is even, select the two lines around the middle
and seq-delta between floor((ct+1)/2)
and floor((ct+1)/2) +1)
or (ct%2 <> 0 -- +when ct is odd, select the one line in the middle
and seq-delta = (floor(ct+1)/2))
) T-- (4) continue here
GROUP BY tag
ORDER BY tag;Dataset:
after (1) after (2) processing (3)
+------+------+
| tag | val | ct delta seq seq-delta
+------+------+
| 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0
| 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8
| 1 | 7 | 15 0 3 3
| 1 | 12 | 15 0 4 4
| 1 | 13 | 15 0 5 5
| 1 | 14 | 15 0 6 6
| 1 | 21 | 15 0 7 7
| 1 | 23 | 15 0 8 8 ---> keep this line
| 1 | 23 | 15 0 9 9
| 1 | 23 | 15 0 10 10
| 1 | 23 | 15 0 11 11
| 1 | 29 | 15 0 12 12
| 1 | 39 | 15 0 13 13
| 1 | 40 | 15 0 14 14
| 1 | 56 | 15 0 15 15
| 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 )
| 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7
| 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8
| 2 | 12 | 14 15 19 4
| 2 | 13 | 14 15 20 5
| 2 | 14 | 14 15 21 6
| 2 | 21 | 14 15 22 7 ---> keep this line
| 2 | 23 | 14 15 23 8 ---> keep this line
| 2 | 23 | 14 15 24 9
| 2 | 23 | 14 15 25 10
| 2 | 23 | 14 15 26 11
| 2 | 29 | 14 15 27 12
| 2 | 40 | 14 15 28 13
| 2 | 56 | 14 15 29 14
| 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0
| 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2
| 3 | 43 | 3 29 32 3
+------+------+Dataset after (3)
```
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 |
Code Snippets
SELECT tag, AVG(val) as median
FROM
(
SELECT tag, val,
(SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median ORDER BY tag, val) t1
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x
HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq-delta = (ct+1)/2)
) T
GROUP BY tag
ORDER BY tag;+------+------+
| tag | val |
+------+------+
| 1 | 3 |
| 1 | 13 || 3 | 12 |
| 3 | 43 |
| 3 | 15 |
+------+------++------+---------+
| tag | median |
+------+---------+
| 1 | 23.0000 |
| 2 | 22.0000 |
| 3 | 15.0000 |
+------+---------+SELECT tag, AVG(val) as median
FROM
(Context
StackExchange Database Administrators Q#158333, answer score: 8
Revisions (0)
No revisions yet.