HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Date range rolling sum using window functions

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
functionsrangerollingdateusingsumwindow

Problem

I need to calculate a rolling sum over a date range. To illustrate, using the AdventureWorks sample database, the following hypothetical syntax would do exactly what I need:

SELECT
    TH.ProductID,
    TH.TransactionDate,
    TH.ActualCost,
    RollingSum45 = SUM(TH.ActualCost) OVER (
        PARTITION BY TH.ProductID
        ORDER BY TH.TransactionDate
        RANGE BETWEEN 
            INTERVAL 45 DAY PRECEDING
            AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,
    TH.TransactionDate,
    TH.ReferenceOrderID;


Sadly, the RANGE window frame extent does not currently allow an interval in SQL Server.

I know I can write a solution using a subquery and a regular (non-window) aggregate:

SELECT 
    TH.ProductID,
    TH.TransactionDate,
    TH.ActualCost,
    RollingSum45 =
    (
        SELECT SUM(TH2.ActualCost)
        FROM Production.TransactionHistory AS TH2
        WHERE
            TH2.ProductID = TH.ProductID
            AND TH2.TransactionDate = DATEADD(DAY, -45, TH.TransactionDate)
    )
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,
    TH.TransactionDate,
    TH.ReferenceOrderID;


Given the following index:

CREATE UNIQUE INDEX i
ON Production.TransactionHistory
    (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
    (ActualCost);


The execution plan is:

While not horribly inefficient, it seems like it should be possible to express this query using only window aggregate and analytic functions supported in SQL Server 2012, 2014, or 2016 (so far).

For clarity, I am looking for a solution that performs a single pass over the data.

In T-SQL this is likely to mean that the OVER clause will do the work, and the execution plan will feature Window Spools and Window Aggregates. All language elements that use the OVER clause are fair game. A SQLCLR solution is acceptable, provided it is guaranteed to produce correct results.

For T-SQL solutions, the fewer Has

Solution

Great question, Paul! I used a couple different approaches, one in T-SQL and one in CLR.

T-SQL quick summary

The T-SQL approach can be summarized as the following steps:

  • Take the cross-product of products/dates



  • Merge in the observed sales data



  • Aggregate that data to the product/date level



  • Compute rolling sums over the past 45 days based on this aggregate data (which contains any "missing" days filled in)



  • Filter those results to only the product/date pairings that had one or more sales



Using SET STATISTICS IO ON, this approach reports Table 'TransactionHistory'. Scan count 1, logical reads 484, which confirms the "single pass" over the table. For reference, the original loop-seek query reports Table 'TransactionHistory'. Scan count 113444, logical reads 438366.

As reported by SET STATISTICS TIME ON, the CPU time is 514ms. This compares favorably to 2231ms for the original query.

CLR quick summary

The CLR summary can be summarized as the following steps:

  • Read the data into memory, ordered by product and date



  • While processing each transaction, add to a running total of the costs. Whenever a transaction is a different product than the previous transaction, reset the running total to 0.



  • Maintain a pointer to the first transaction that has the same (product, date) as the current transaction. Whenever the last transaction with that (product, date) is encountered, compute the rolling sum for that transaction and apply it to all transactions with the same (product, date)



  • Return all of the results to the user!



Using SET STATISTICS IO ON, this approach reports that no logical I/O has occurred! Wow, a perfect solution! (Actually, it seems that SET STATISTICS IO does not report I/O incurred within CLR. But from the code, it is easy to see that exactly one scan of the table is made and retrieves the data in order by the index Paul suggested.

As reported by SET STATISTICS TIME ON, the CPU time is now 187ms. So this is quite an improvement over the T-SQL approach. Unfortunately, the overall elapsed time of both approaches is very similar at about half a second each. However, the CLR based approach does have to output 113K rows to the console (vs. just 52K for the T-SQL approach that groups by product/date), so that's why I've focused on CPU time instead.

Another big advantage of this approach is that it yields exactly the same results as the original loop/seek approach, including a row for every transaction even in cases where a product is sold multiple times on the same day. (On AdventureWorks, I specifically compared row-by-row results and confirmed that they tie out with Paul's original query.)

A disadvantage of this approach, at least in its current form, is that it reads all data in memory. However, the algorithm that has been designed only strictly needs the current window frame in memory at any given time and could be updated to work for data sets that exceed memory. Paul has illustrated this point in his answer by producing an implementation of this algorithm that stores only the sliding window in memory. This comes at the expense of granting higher permissions to CLR assembly, but would definitely be worthwhile in scaling this solution up to arbitrarily large data sets.

T-SQL - one scan, grouped by date

Initial setup

USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
    INSERT INTO dbo.calendar (d) VALUES (@d)
    SELECT @d =  DATEADD(DAY, 1, @d)
END
GO


The query

```
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP

Code Snippets

USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
    INSERT INTO dbo.calendar (d) VALUES (@d)
    SELECT @d =  DATEADD(DAY, 1, @d)
END
GO
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
        @maxAnalysisDate DATE = '2008-09-03'  -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
    SELECT ProductID, TransactionDate, NumOrders, ActualCost,
        SUM(ActualCost) OVER (
                PARTITION BY ProductId ORDER BY TransactionDate 
                ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
            ) AS RollingSum45
    FROM (
        -- The full cross-product of products and dates, combined with actual cost information for that product/date
        SELECT p.ProductID, c.d AS TransactionDate,
            COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
        FROM Production.Product p
        JOIN dbo.calendar c
            ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
        LEFT OUTER JOIN Production.TransactionHistory TH
            ON TH.ProductId = p.productId
            AND TH.TransactionDate = c.d
        GROUP BY P.ProductID, c.d
    ) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)
USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */
GO

-- Enable CLR
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO

-- Create the assembly based on the dll generated by compiling the CLR project
-- I've also included the "assembly bits" version that can be run without compiling
CREATE ASSEMBLY ClrPlayground
-- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll'
WITH PERMISSION_SET = safe;
GO

--Create a function from the assembly
CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT,
                ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT)
-- The function yields rows in order, so let SQL Server know to avoid an extra sort
ORDER (ProductID, TransactionDate, ReferenceOrderID)
AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
GO

-- Now we can actually use the TVF!
SELECT * 
FROM dbo.RollingSumTvf(45) 
ORDER BY ProductId, TransactionDate, ReferenceOrderId
GO
-- Compute all running costs into a #temp table; Note that this query could simply read
-- from Production.TransactionHistory, but a CROSS APPLY by product allows the window 
-- function to be computed independently per product, supporting a parallel query plan
SELECT t.*
INTO #runningCosts
FROM Production.Product p
CROSS APPLY (
    SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost,
        -- Running sum of the cost for this product, including all ties on TransactionDate
        SUM(t.ActualCost) OVER (
            ORDER BY t.TransactionDate 
            RANGE UNBOUNDED PRECEDING) AS RunningCost
    FROM Production.TransactionHistory t
    WHERE t.ProductId = p.ProductId
) t
GO

-- Key the table in our output order
ALTER TABLE #runningCosts
ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId)
GO

SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost,
    -- Cumulative running cost - running cost prior to the sliding window
    r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
FROM #runningCosts r
OUTER APPLY (
    -- For each transaction, find the running cost just before the sliding window begins
    SELECT TOP 1 b.RunningCost
    FROM #runningCosts b
    WHERE b.ProductId = r.ProductId
        AND b.TransactionDate < DATEADD(DAY, -45, r.TransactionDate)
    ORDER BY b.TransactionDate DESC
) w
ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId
GO

Context

StackExchange Database Administrators Q#114403, answer score: 46

Revisions (0)

No revisions yet.