patternsqlMajor
Date range rolling sum using window functions
Viewed 0 times
functionsrangerollingdateusingsumwindow
Problem
I need to calculate a rolling sum over a date range. To illustrate, using the AdventureWorks sample database, the following hypothetical syntax would do exactly what I need:
Sadly, the
I know I can write a solution using a subquery and a regular (non-window) aggregate:
Given the following index:
The execution plan is:
While not horribly inefficient, it seems like it should be possible to express this query using only window aggregate and analytic functions supported in SQL Server 2012, 2014, or 2016 (so far).
For clarity, I am looking for a solution that performs a single pass over the data.
In T-SQL this is likely to mean that the
For T-SQL solutions, the fewer Has
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 = SUM(TH.ActualCost) OVER (
PARTITION BY TH.ProductID
ORDER BY TH.TransactionDate
RANGE BETWEEN
INTERVAL 45 DAY PRECEDING
AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;Sadly, the
RANGE window frame extent does not currently allow an interval in SQL Server.I know I can write a solution using a subquery and a regular (non-window) aggregate:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 =
(
SELECT SUM(TH2.ActualCost)
FROM Production.TransactionHistory AS TH2
WHERE
TH2.ProductID = TH.ProductID
AND TH2.TransactionDate = DATEADD(DAY, -45, TH.TransactionDate)
)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;Given the following index:
CREATE UNIQUE INDEX i
ON Production.TransactionHistory
(ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
(ActualCost);The execution plan is:
While not horribly inefficient, it seems like it should be possible to express this query using only window aggregate and analytic functions supported in SQL Server 2012, 2014, or 2016 (so far).
For clarity, I am looking for a solution that performs a single pass over the data.
In T-SQL this is likely to mean that the
OVER clause will do the work, and the execution plan will feature Window Spools and Window Aggregates. All language elements that use the OVER clause are fair game. A SQLCLR solution is acceptable, provided it is guaranteed to produce correct results.For T-SQL solutions, the fewer Has
Solution
Great question, Paul! I used a couple different approaches, one in T-SQL and one in CLR.
T-SQL quick summary
The T-SQL approach can be summarized as the following steps:
Using
As reported by
CLR quick summary
The CLR summary can be summarized as the following steps:
Using
As reported by
Another big advantage of this approach is that it yields exactly the same results as the original loop/seek approach, including a row for every transaction even in cases where a product is sold multiple times on the same day. (On AdventureWorks, I specifically compared row-by-row results and confirmed that they tie out with Paul's original query.)
A disadvantage of this approach, at least in its current form, is that it reads all data in memory. However, the algorithm that has been designed only strictly needs the current window frame in memory at any given time and could be updated to work for data sets that exceed memory. Paul has illustrated this point in his answer by producing an implementation of this algorithm that stores only the sliding window in memory. This comes at the expense of granting higher permissions to CLR assembly, but would definitely be worthwhile in scaling this solution up to arbitrarily large data sets.
T-SQL - one scan, grouped by date
Initial setup
The query
```
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP
T-SQL quick summary
The T-SQL approach can be summarized as the following steps:
- Take the cross-product of products/dates
- Merge in the observed sales data
- Aggregate that data to the product/date level
- Compute rolling sums over the past 45 days based on this aggregate data (which contains any "missing" days filled in)
- Filter those results to only the product/date pairings that had one or more sales
Using
SET STATISTICS IO ON, this approach reports Table 'TransactionHistory'. Scan count 1, logical reads 484, which confirms the "single pass" over the table. For reference, the original loop-seek query reports Table 'TransactionHistory'. Scan count 113444, logical reads 438366.As reported by
SET STATISTICS TIME ON, the CPU time is 514ms. This compares favorably to 2231ms for the original query.CLR quick summary
The CLR summary can be summarized as the following steps:
- Read the data into memory, ordered by product and date
- While processing each transaction, add to a running total of the costs. Whenever a transaction is a different product than the previous transaction, reset the running total to 0.
- Maintain a pointer to the first transaction that has the same (product, date) as the current transaction. Whenever the last transaction with that (product, date) is encountered, compute the rolling sum for that transaction and apply it to all transactions with the same (product, date)
- Return all of the results to the user!
Using
SET STATISTICS IO ON, this approach reports that no logical I/O has occurred! Wow, a perfect solution! (Actually, it seems that SET STATISTICS IO does not report I/O incurred within CLR. But from the code, it is easy to see that exactly one scan of the table is made and retrieves the data in order by the index Paul suggested.As reported by
SET STATISTICS TIME ON, the CPU time is now 187ms. So this is quite an improvement over the T-SQL approach. Unfortunately, the overall elapsed time of both approaches is very similar at about half a second each. However, the CLR based approach does have to output 113K rows to the console (vs. just 52K for the T-SQL approach that groups by product/date), so that's why I've focused on CPU time instead.Another big advantage of this approach is that it yields exactly the same results as the original loop/seek approach, including a row for every transaction even in cases where a product is sold multiple times on the same day. (On AdventureWorks, I specifically compared row-by-row results and confirmed that they tie out with Paul's original query.)
A disadvantage of this approach, at least in its current form, is that it reads all data in memory. However, the algorithm that has been designed only strictly needs the current window frame in memory at any given time and could be updated to work for data sets that exceed memory. Paul has illustrated this point in his answer by producing an implementation of this algorithm that stores only the sliding window in memory. This comes at the expense of granting higher permissions to CLR assembly, but would definitely be worthwhile in scaling this solution up to arbitrarily large data sets.
T-SQL - one scan, grouped by date
Initial setup
USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
INSERT INTO dbo.calendar (d) VALUES (@d)
SELECT @d = DATEADD(DAY, 1, @d)
END
GOThe query
```
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP
Code Snippets
USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
INSERT INTO dbo.calendar (d) VALUES (@d)
SELECT @d = DATEADD(DAY, 1, @d)
END
GODECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */
GO
-- Enable CLR
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO
-- Create the assembly based on the dll generated by compiling the CLR project
-- I've also included the "assembly bits" version that can be run without compiling
CREATE ASSEMBLY ClrPlayground
-- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll'
WITH PERMISSION_SET = safe;
GO
--Create a function from the assembly
CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT,
ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT)
-- The function yields rows in order, so let SQL Server know to avoid an extra sort
ORDER (ProductID, TransactionDate, ReferenceOrderID)
AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
GO
-- Now we can actually use the TVF!
SELECT *
FROM dbo.RollingSumTvf(45)
ORDER BY ProductId, TransactionDate, ReferenceOrderId
GO-- Compute all running costs into a #temp table; Note that this query could simply read
-- from Production.TransactionHistory, but a CROSS APPLY by product allows the window
-- function to be computed independently per product, supporting a parallel query plan
SELECT t.*
INTO #runningCosts
FROM Production.Product p
CROSS APPLY (
SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost,
-- Running sum of the cost for this product, including all ties on TransactionDate
SUM(t.ActualCost) OVER (
ORDER BY t.TransactionDate
RANGE UNBOUNDED PRECEDING) AS RunningCost
FROM Production.TransactionHistory t
WHERE t.ProductId = p.ProductId
) t
GO
-- Key the table in our output order
ALTER TABLE #runningCosts
ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId)
GO
SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost,
-- Cumulative running cost - running cost prior to the sliding window
r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
FROM #runningCosts r
OUTER APPLY (
-- For each transaction, find the running cost just before the sliding window begins
SELECT TOP 1 b.RunningCost
FROM #runningCosts b
WHERE b.ProductId = r.ProductId
AND b.TransactionDate < DATEADD(DAY, -45, r.TransactionDate)
ORDER BY b.TransactionDate DESC
) w
ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId
GOContext
StackExchange Database Administrators Q#114403, answer score: 46
Revisions (0)
No revisions yet.