patternsqlMinor
Unduplicated counts over various dimensions
Viewed 0 times
countsoverdimensionsvariousunduplicated
Problem
I'm working with some data on college students, which has a unique key of
Now, usually, I'd just use
I've therefore been pursuing some solutions to deal with this. What I've come up with is a dynamic SQL solution. Basically, I create a table to hold my output, then I repeatedly cycle through calling the same SQL code with different parameters and adding the output to the output table. The parameters control what level of aggregation I'm using. I use a simple subquery with
```
IF OBJECT_ID('tempdb..##CountOut') IS NOT NULL
SSN, Campus, Credential (type of degree sought, like "Bachelor's"), ProgCat (broad subject area, like "History"), Program (specific subject area, like "European History"), and SchoolYear. I want to calculate a lot of statistics (for the sake of this example, we'll just stick with basic counts of number of students) and I want those statistics to be calculated for various combinations of aggregation of Campus, Program, and Program Category. In other words, I want to calculate statistics that get as specific as "number of students getting a bachelor's degree in European History at Big Oak University," as general as "number of students getting a bachelor's degree in anything at any university" and every level in between.Now, usually, I'd just use
CUBE in my GROUP BY statement, which would give me exactly what I want. However, there's an added complication... I need unduplicated counts of students, and it's possible that one student is enrolled in two different subjects, or at two different universities. So, if Jim Bob is enrolled both at Big Oak University and Sunrise College, I'd want him to show up once in the statistics for Big Oak, once in the statistics for Sunrise, and once and only once in the statistics for all colleges grouped together. As far as I know, there's no way to accomplish this with CUBE.I've therefore been pursuing some solutions to deal with this. What I've come up with is a dynamic SQL solution. Basically, I create a table to hold my output, then I repeatedly cycle through calling the same SQL code with different parameters and adding the output to the output table. The parameters control what level of aggregation I'm using. I use a simple subquery with
row_number() to pull only one record per student at any level of aggregation, making sure I get unduplicated counts. It looks something like this...```
IF OBJECT_ID('tempdb..##CountOut') IS NOT NULL
Solution
Don't worry my friend, you should still be able to use
Here is an example that I threw together.
Seems to do the job on my SQL-Server 2008, so it should work for you too.
EDIT: After a small back and forth discussion in the comments, I managed to throw this together.
The CTE and the subquery that I used to get the average prevents the same student from affecting the average more than once.
Here is the SQL Fiddle.
GROUP BY CUBE(), instead of doing a COUNT(student) of the students you should use a COUNT(DISTINCT student) which will make sure each student is counted only once for each row.Here is an example that I threw together.
SELECT Student, Campus, Subject, COUNT(DISTINCT Student)
FROM
(
SELECT 'Jim Bob' AS Student,'Big Oak' AS Campus,'History' AS Subject
UNION ALL
SELECT 'Jim Bob','Sunrise','Chemistry'
UNION ALL
SELECT 'Jim Bob','Sunrise','Art'
UNION ALL
SELECT 'Bob Jim','Big Oak','Chemistry'
)a
GROUP BY CUBE(Student, Campus, Subject)Seems to do the job on my SQL-Server 2008, so it should work for you too.
EDIT: After a small back and forth discussion in the comments, I managed to throw this together.
;WITH Student_Wages AS
(
SELECT DISTINCT SSN, WAGE
FROM Data
)
SELECT A.campus, A.subject, SUM(A.Students), AVG(A.Wage)
FROM
(
SELECT
ISNULL(D.campus,'-') AS campus,
ISNULL(D.subject,'-') AS subject,
COUNT(DISTINCT D.SSN) AS Students,
(SELECT AVG(S.wage) WHERE D.SSN IN (SELECT S.SSN)) AS Wage
FROM #Data D
INNER JOIN Student_Wages S ON D.SSN = S.SSN
GROUP BY CUBE(D.campus, D.subject), D.SSN, S.ssn
) AS A
GROUP BY A.campus, A.subject
ORDER BY A.campus, A.subjectThe CTE and the subquery that I used to get the average prevents the same student from affecting the average more than once.
Here is the SQL Fiddle.
Code Snippets
SELECT Student, Campus, Subject, COUNT(DISTINCT Student)
FROM
(
SELECT 'Jim Bob' AS Student,'Big Oak' AS Campus,'History' AS Subject
UNION ALL
SELECT 'Jim Bob','Sunrise','Chemistry'
UNION ALL
SELECT 'Jim Bob','Sunrise','Art'
UNION ALL
SELECT 'Bob Jim','Big Oak','Chemistry'
)a
GROUP BY CUBE(Student, Campus, Subject);WITH Student_Wages AS
(
SELECT DISTINCT SSN, WAGE
FROM Data
)
SELECT A.campus, A.subject, SUM(A.Students), AVG(A.Wage)
FROM
(
SELECT
ISNULL(D.campus,'-') AS campus,
ISNULL(D.subject,'-') AS subject,
COUNT(DISTINCT D.SSN) AS Students,
(SELECT AVG(S.wage) WHERE D.SSN IN (SELECT S.SSN)) AS Wage
FROM #Data D
INNER JOIN Student_Wages S ON D.SSN = S.SSN
GROUP BY CUBE(D.campus, D.subject), D.SSN, S.ssn
) AS A
GROUP BY A.campus, A.subject
ORDER BY A.campus, A.subjectContext
StackExchange Code Review Q#60127, answer score: 3
Revisions (0)
No revisions yet.