HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Improve The Performance Of Multiple Date Range Predicates

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
predicatestherangedateimproveperformancemultiple

Problem

let's say

You have a stored procedure that accepts arrays of datetimes, which are loaded into a temp table, and used to filter on a datetime column in a table.

  • There can be any number of values inserted as begin and end dates.



  • Date ranges may overlap sometimes, but it's not a condition that I'd count on regularly.



  • It's also possible for dates with times to be supplied.



What's the most efficient way to write a query to perform the filtering?
setup

USE StackOverflow2013;

CREATE TABLE
    #d
(
    dfrom datetime,
    dto datetime,
    PRIMARY KEY (dfrom, dto)
)
INSERT
    #d
(
    dfrom,
    dto
)
SELECT
    dfrom = '2013-11-20',
    dto =   '2013-12-05'
UNION ALL
SELECT
    dfrom = '2013-11-27',
    dto =   '2013-12-12'; 

CREATE INDEX
    p
ON dbo.Posts
    (CreationDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);


query

The best I've been able to get is using EXISTS like so:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE EXISTS
(
    SELECT
        1/0
    FROM #d AS d
    WHERE p.CreationDate BETWEEN d.dfrom
                             AND d.dto
);


Which results in a rather sad-looking execution plan:

Nested Loops is the only join operator available, since we don't have an equality predicate.

What I’m looking for is alternative syntax that produces a different type of join.

Thanks!

Solution

Join

You may find a join gives adequate performance, despite still using nested loops. This is a variation on David Browne's updated answer:

SELECT 
    NumRows = COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
JOIN dbo.Posts AS P
    ON P.CreationDate BETWEEN D.dfrom AND D.dto;


That runs in around 150ms for me.
Remove overlaps and join

If you have significant overlapping rows, it may be worth reducing them to distinct, non-overlapping ranges, as outlined in Martin Smith's answer. My implementation of a similar idea is:

WITH 
    Intervals AS
    (
    SELECT 
        IntervalStart =
            ISNULL
            (
                LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
                Q1.FirstFrom
            ),
        IntervalEnd = 
            IIF 
            (
                Q1.NextStart IS NOT NULL, 
                Q1.ThisEnd,
                Q1.LastEnd
            )
    FROM 
    (
        SELECT
            ThisFrom = D.dfrom, 
            ThisEnd = D.dto,
            -- Remember the start of the next row because that row
            -- may get filtered out by the outer WHERE clase if it
            -- is not also the end of an interval.
            NextStart = LEAD(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto),
            -- Start point of the first interval
            FirstFrom = MIN(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
            -- End point of the last interval
            LastEnd = MAX(D.dto) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        FROM #d AS D 
        WHERE 
            -- Valid intervals only
            D.dto >= D.dfrom
    ) AS Q1
    WHERE
        -- Interval ends only
        Q1.NextStart > Q1.ThisEnd
        OR Q1.NextStart IS NULL
)
SELECT COUNT_BIG(*) 
FROM dbo.Posts AS P
JOIN Intervals AS N
    ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;


That runs in about 40ms for me.
Dynamic seek with no join

Another approach is to generate literal ranges dynamically:

DECLARE @SQL nvarchar(max) =
N'
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
';

SELECT @SQL +=
    STRING_AGG
    (
        CONCAT
        (
            CONVERT(nvarchar(max), SPACE(0)),
            N'OR P.CreationDate BETWEEN ',
            N'CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dfrom, 121),
            NCHAR(39),
            N', 121)',
            N' AND CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dto, 121),
            NCHAR(39),
            N', 121)',
            NCHAR(13), NCHAR(10)
        ),
        N''
    )
FROM #d AS D;

SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely

EXECUTE (@SQL);


With the sample data, that produces:

SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
OPTION (RECOMPILE);


Which SQL Server simplifies to a single range seek:

That runs in about 35ms for me.

In general, the optimizer will simplify the ranges as much as possible (much like a Merge Interval would do). There doesn't appear to be a limit on the number of separate range seeks in a single Index Seek operator. I got bored after 1440 seeks.
Indexed view

Rather than count rows over and over again, it may be helpful to store and maintain bucketized counts in an indexed view. The following implementation uses hour granularity:

CREATE OR ALTER VIEW dbo.PostsTimeBucket
WITH SCHEMABINDING
AS
SELECT
    HourBucket =
        DATEADD(HOUR, 
            DATEDIFF(HOUR, 
                CONVERT(datetime, '20000101', 112), 
                P.CreationDate),
            CONVERT(datetime, '20000101', 112)),
    NumRows = COUNT_BIG(*)
FROM dbo.Posts AS P
GROUP BY
    DATEADD(HOUR, 
        DATEDIFF(HOUR, 
            CONVERT(datetime, '20000101', 112), 
            P.CreationDate),
        CONVERT(datetime, '20000101', 112));
GO
CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket] 
ON dbo.PostsTimeBucket (HourBucket);


The view takes around 2s to create on my laptop. At hour granularity, the indexed view holds 47,469 rows for my copy of the StackOverflow2013 database's Posts table with 17,142,169 rows.

The majority of the work can be done from the view. Only part-hour periods at the start and end of any range not covered by any other range needs to be processed separately. For example:

```
-- Helper inline function
CREATE OR ALTER FUNCTION dbo.RoundToHour (@d datetime)
RETURNS table
AS
RETURN
SELECT
HourBucket =
DATEADD(HOUR,
DATEDIFF(HOUR,
CONVERT(datetime, '2000010

Code Snippets

SELECT 
    NumRows = COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
JOIN dbo.Posts AS P
    ON P.CreationDate BETWEEN D.dfrom AND D.dto;
WITH 
    Intervals AS
    (
    SELECT 
        IntervalStart =
            ISNULL
            (
                LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
                Q1.FirstFrom
            ),
        IntervalEnd = 
            IIF 
            (
                Q1.NextStart IS NOT NULL, 
                Q1.ThisEnd,
                Q1.LastEnd
            )
    FROM 
    (
        SELECT
            ThisFrom = D.dfrom, 
            ThisEnd = D.dto,
            -- Remember the start of the next row because that row
            -- may get filtered out by the outer WHERE clase if it
            -- is not also the end of an interval.
            NextStart = LEAD(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto),
            -- Start point of the first interval
            FirstFrom = MIN(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
            -- End point of the last interval
            LastEnd = MAX(D.dto) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        FROM #d AS D 
        WHERE 
            -- Valid intervals only
            D.dto >= D.dfrom
    ) AS Q1
    WHERE
        -- Interval ends only
        Q1.NextStart > Q1.ThisEnd
        OR Q1.NextStart IS NULL
)
SELECT COUNT_BIG(*) 
FROM dbo.Posts AS P
JOIN Intervals AS N
    ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;
DECLARE @SQL nvarchar(max) =
N'
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
';

SELECT @SQL +=
    STRING_AGG
    (
        CONCAT
        (
            CONVERT(nvarchar(max), SPACE(0)),
            N'OR P.CreationDate BETWEEN ',
            N'CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dfrom, 121),
            NCHAR(39),
            N', 121)',
            N' AND CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dto, 121),
            NCHAR(39),
            N', 121)',
            NCHAR(13), NCHAR(10)
        ),
        N''
    )
FROM #d AS D;

SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely

EXECUTE (@SQL);
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
OPTION (RECOMPILE);
CREATE OR ALTER VIEW dbo.PostsTimeBucket
WITH SCHEMABINDING
AS
SELECT
    HourBucket =
        DATEADD(HOUR, 
            DATEDIFF(HOUR, 
                CONVERT(datetime, '20000101', 112), 
                P.CreationDate),
            CONVERT(datetime, '20000101', 112)),
    NumRows = COUNT_BIG(*)
FROM dbo.Posts AS P
GROUP BY
    DATEADD(HOUR, 
        DATEDIFF(HOUR, 
            CONVERT(datetime, '20000101', 112), 
            P.CreationDate),
        CONVERT(datetime, '20000101', 112));
GO
CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket] 
ON dbo.PostsTimeBucket (HourBucket);

Context

StackExchange Database Administrators Q#329139, answer score: 9

Revisions (0)

No revisions yet.