patternsqlCritical
Running total with count?
Viewed 0 times
totalcountrunningwith
Problem
As the title suggests I need some help getting a running total in T-SQL. The problem is that the sum I need to do is the sum of a count:
Say if I ran the count alone, the result would be:
I need output with the sum to be :
I've done running totals before using the
sum(count (distinct (customers)))Say if I ran the count alone, the result would be:
Day | CountCustomers
----------------------
5/1 | 1
5/2 | 0
5/3 | 5I need output with the sum to be :
Day | RunningTotalCustomers
----------------------
5/1 | 1
5/2 | 1
5/3 | 6I've done running totals before using the
coalesce method, but never with a count. I'm not sure how to do it now that I have the count.Solution
Here are a few methods you can compare. First let's set up a table with some dummy data. I'm populating this with a bunch of random data from sys.all_columns. Well, it's kind of random - I'm ensuring that the dates are contiguous (which is really only important for one of the answers).
Results:
The data looks like this (5000 rows) - but will look slightly different on your system depending on version and build #:
And the running totals results should look like this (501 rows):
So the methods I am going to compare are:
self-join
This is the way people will tell you to do it when they're warning you to stay away from cursors, because "set-based is always faster." In some recent experiments I've found that the cursor out-paces this solution.
recursive cte with dates
Reminder - this relies on contiguous dates (no gaps), up to 10000 levels of recursion, and that you know the start date of the range you're interested (to set the anchor). You could set the anchor dynamically using a subquery, of course, but I wanted to keep things simple.
recursive cte with row_number
Row_number calculation is slightly expensive here. Again this supports max level of recursion of 10000, but you don't need to assign the anchor.
recursive cte with temp table
Stealing from Mikael's answer, as suggested, to include this in the tests.
quirky update
Again I am only including this for completeness; I personally wouldn't rely on this solution since, as I mentioned on another answer, this method is not guaranteed to work at all, and may completely break in a future version of SQL Server. (I'm doing my best to coerce SQL Server into obeying the order I want, using a hint for
CREATE TABLE dbo.Hits(Day SMALLDATETIME, CustomerID INT);
CREATE CLUSTERED INDEX x ON dbo.Hits([Day]);
INSERT dbo.Hits SELECT TOP (5000) DATEADD(DAY, r, '20120501'),
COALESCE(ASCII(SUBSTRING(name, s, 1)), 86)
FROM (SELECT name, r = ROW_NUMBER() OVER (ORDER BY name)/10,
s = CONVERT(INT, RIGHT(CONVERT(VARCHAR(20), [object_id]), 1))
FROM sys.all_columns) AS x;
SELECT
Earliest_Day = MIN([Day]),
Latest_Day = MAX([Day]),
Unique_Days = DATEDIFF(DAY, MIN([Day]), MAX([Day])) + 1,
Total_Rows = COUNT(*)
FROM dbo.Hits;Results:
Earliest_Day Latest_Day Unique_Days Total_Days
------------------- ------------------- ----------- ----------
2012-05-01 00:00:00 2013-09-13 00:00:00 501 5000The data looks like this (5000 rows) - but will look slightly different on your system depending on version and build #:
Day CustomerID
------------------- ---
2012-05-01 00:00:00 95
2012-05-01 00:00:00 97
2012-05-01 00:00:00 97
2012-05-01 00:00:00 117
2012-05-01 00:00:00 100
...
2012-05-02 00:00:00 110
2012-05-02 00:00:00 110
2012-05-02 00:00:00 95
...And the running totals results should look like this (501 rows):
Day c rt
------------------- -- --
2012-05-01 00:00:00 6 6
2012-05-02 00:00:00 5 11
2012-05-03 00:00:00 4 15
2012-05-04 00:00:00 7 22
2012-05-05 00:00:00 6 28
...So the methods I am going to compare are:
- "self-join" - the set-based purist approach
- "recursive CTE with dates" - this relies on contiguous dates (no gaps)
- "recursive CTE with row_number" - similar to above but slower, relying on ROW_NUMBER
- "recursive CTE with #temp table" - stolen from Mikael's answer as suggested
- "quirky update" which, while unsupported and not promising defined behavior, seems to be quite popular
- "cursor"
- SQL Server 2012 using new windowing functionality
self-join
This is the way people will tell you to do it when they're warning you to stay away from cursors, because "set-based is always faster." In some recent experiments I've found that the cursor out-paces this solution.
;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
)
SELECT g.[Day], g.c, rt = SUM(g2.c)
FROM g INNER JOIN g AS g2
ON g.[Day] >= g2.[Day]
GROUP BY g.[Day], g.c
ORDER BY g.[Day];recursive cte with dates
Reminder - this relies on contiguous dates (no gaps), up to 10000 levels of recursion, and that you know the start date of the range you're interested (to set the anchor). You could set the anchor dynamically using a subquery, of course, but I wanted to keep things simple.
;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
), x AS
(
SELECT [Day], c, rt = c
FROM g
WHERE [Day] = '20120501'
UNION ALL
SELECT g.[Day], g.c, x.rt + g.c
FROM x INNER JOIN g
ON g.[Day] = DATEADD(DAY, 1, x.[Day])
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);recursive cte with row_number
Row_number calculation is slightly expensive here. Again this supports max level of recursion of 10000, but you don't need to assign the anchor.
;WITH g AS
(
SELECT [Day], rn = ROW_NUMBER() OVER (ORDER BY DAY),
c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
), x AS
(
SELECT [Day], rn, c, rt = c
FROM g
WHERE rn = 1
UNION ALL
SELECT g.[Day], g.rn, g.c, x.rt + g.c
FROM x INNER JOIN g
ON g.rn = x.rn + 1
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);recursive cte with temp table
Stealing from Mikael's answer, as suggested, to include this in the tests.
CREATE TABLE #Hits
(
rn INT PRIMARY KEY,
c INT,
[Day] SMALLDATETIME
);
INSERT INTO #Hits (rn, c, Day)
SELECT ROW_NUMBER() OVER (ORDER BY DAY),
COUNT(DISTINCT CustomerID),
[Day]
FROM dbo.Hits
GROUP BY [Day];
WITH x AS
(
SELECT [Day], rn, c, rt = c
FROM #Hits as c
WHERE rn = 1
UNION ALL
SELECT g.[Day], g.rn, g.c, x.rt + g.c
FROM x INNER JOIN #Hits as g
ON g.rn = x.rn + 1
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);
DROP TABLE #Hits;quirky update
Again I am only including this for completeness; I personally wouldn't rely on this solution since, as I mentioned on another answer, this method is not guaranteed to work at all, and may completely break in a future version of SQL Server. (I'm doing my best to coerce SQL Server into obeying the order I want, using a hint for
Code Snippets
CREATE TABLE dbo.Hits(Day SMALLDATETIME, CustomerID INT);
CREATE CLUSTERED INDEX x ON dbo.Hits([Day]);
INSERT dbo.Hits SELECT TOP (5000) DATEADD(DAY, r, '20120501'),
COALESCE(ASCII(SUBSTRING(name, s, 1)), 86)
FROM (SELECT name, r = ROW_NUMBER() OVER (ORDER BY name)/10,
s = CONVERT(INT, RIGHT(CONVERT(VARCHAR(20), [object_id]), 1))
FROM sys.all_columns) AS x;
SELECT
Earliest_Day = MIN([Day]),
Latest_Day = MAX([Day]),
Unique_Days = DATEDIFF(DAY, MIN([Day]), MAX([Day])) + 1,
Total_Rows = COUNT(*)
FROM dbo.Hits;Earliest_Day Latest_Day Unique_Days Total_Days
------------------- ------------------- ----------- ----------
2012-05-01 00:00:00 2013-09-13 00:00:00 501 5000Day CustomerID
------------------- ---
2012-05-01 00:00:00 95
2012-05-01 00:00:00 97
2012-05-01 00:00:00 97
2012-05-01 00:00:00 117
2012-05-01 00:00:00 100
...
2012-05-02 00:00:00 110
2012-05-02 00:00:00 110
2012-05-02 00:00:00 95
...Day c rt
------------------- -- --
2012-05-01 00:00:00 6 6
2012-05-02 00:00:00 5 11
2012-05-03 00:00:00 4 15
2012-05-04 00:00:00 7 22
2012-05-05 00:00:00 6 28
...;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
)
SELECT g.[Day], g.c, rt = SUM(g2.c)
FROM g INNER JOIN g AS g2
ON g.[Day] >= g2.[Day]
GROUP BY g.[Day], g.c
ORDER BY g.[Day];Context
StackExchange Database Administrators Q#19507, answer score: 53
Revisions (0)
No revisions yet.