HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Unduplicated counts over various dimensions

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
countsoverdimensionsvariousunduplicated

Problem

I'm working with some data on college students, which has a unique key of SSN, Campus, Credential (type of degree sought, like "Bachelor's"), ProgCat (broad subject area, like "History"), Program (specific subject area, like "European History"), and SchoolYear. I want to calculate a lot of statistics (for the sake of this example, we'll just stick with basic counts of number of students) and I want those statistics to be calculated for various combinations of aggregation of Campus, Program, and Program Category. In other words, I want to calculate statistics that get as specific as "number of students getting a bachelor's degree in European History at Big Oak University," as general as "number of students getting a bachelor's degree in anything at any university" and every level in between.

Now, usually, I'd just use CUBE in my GROUP BY statement, which would give me exactly what I want. However, there's an added complication... I need unduplicated counts of students, and it's possible that one student is enrolled in two different subjects, or at two different universities. So, if Jim Bob is enrolled both at Big Oak University and Sunrise College, I'd want him to show up once in the statistics for Big Oak, once in the statistics for Sunrise, and once and only once in the statistics for all colleges grouped together. As far as I know, there's no way to accomplish this with CUBE.

I've therefore been pursuing some solutions to deal with this. What I've come up with is a dynamic SQL solution. Basically, I create a table to hold my output, then I repeatedly cycle through calling the same SQL code with different parameters and adding the output to the output table. The parameters control what level of aggregation I'm using. I use a simple subquery with row_number() to pull only one record per student at any level of aggregation, making sure I get unduplicated counts. It looks something like this...

```
IF OBJECT_ID('tempdb..##CountOut') IS NOT NULL

Solution

Don't worry my friend, you should still be able to use GROUP BY CUBE(), instead of doing a COUNT(student) of the students you should use a COUNT(DISTINCT student) which will make sure each student is counted only once for each row.

Here is an example that I threw together.

SELECT Student, Campus, Subject, COUNT(DISTINCT Student) 
FROM 
(
    SELECT 'Jim Bob' AS Student,'Big Oak' AS Campus,'History' AS Subject
    UNION ALL
    SELECT 'Jim Bob','Sunrise','Chemistry'
    UNION ALL
    SELECT 'Jim Bob','Sunrise','Art'
    UNION ALL
    SELECT 'Bob Jim','Big Oak','Chemistry'
)a
GROUP BY CUBE(Student, Campus, Subject)


Seems to do the job on my SQL-Server 2008, so it should work for you too.

EDIT: After a small back and forth discussion in the comments, I managed to throw this together.

;WITH Student_Wages AS 
(
    SELECT DISTINCT SSN, WAGE
    FROM Data
)

SELECT A.campus, A.subject, SUM(A.Students), AVG(A.Wage) 
FROM 
(
    SELECT 
    ISNULL(D.campus,'-') AS campus,
    ISNULL(D.subject,'-') AS subject,
    COUNT(DISTINCT D.SSN) AS Students,
    (SELECT AVG(S.wage) WHERE D.SSN IN (SELECT S.SSN)) AS Wage
    FROM #Data D
    INNER JOIN Student_Wages S ON D.SSN = S.SSN
    GROUP BY CUBE(D.campus, D.subject), D.SSN, S.ssn
) AS A
GROUP BY A.campus, A.subject
ORDER BY A.campus, A.subject


The CTE and the subquery that I used to get the average prevents the same student from affecting the average more than once.

Here is the SQL Fiddle.

Code Snippets

SELECT Student, Campus, Subject, COUNT(DISTINCT Student) 
FROM 
(
    SELECT 'Jim Bob' AS Student,'Big Oak' AS Campus,'History' AS Subject
    UNION ALL
    SELECT 'Jim Bob','Sunrise','Chemistry'
    UNION ALL
    SELECT 'Jim Bob','Sunrise','Art'
    UNION ALL
    SELECT 'Bob Jim','Big Oak','Chemistry'
)a
GROUP BY CUBE(Student, Campus, Subject)
;WITH Student_Wages AS 
(
    SELECT DISTINCT SSN, WAGE
    FROM Data
)

SELECT A.campus, A.subject, SUM(A.Students), AVG(A.Wage) 
FROM 
(
    SELECT 
    ISNULL(D.campus,'-') AS campus,
    ISNULL(D.subject,'-') AS subject,
    COUNT(DISTINCT D.SSN) AS Students,
    (SELECT AVG(S.wage) WHERE D.SSN IN (SELECT S.SSN)) AS Wage
    FROM #Data D
    INNER JOIN Student_Wages S ON D.SSN = S.SSN
    GROUP BY CUBE(D.campus, D.subject), D.SSN, S.ssn
) AS A
GROUP BY A.campus, A.subject
ORDER BY A.campus, A.subject

Context

StackExchange Code Review Q#60127, answer score: 3

Revisions (0)

No revisions yet.