HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Should I join datetime to a date using cast or range?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rangedatejoincastusingshoulddatetime

Problem

This question is a take-off from the excellent one posed here:

Cast to date is sargable but is it a good idea?

In my case, I am not concerned with the WHERE clause but in joining to an events table which has a column of type DATE

One table has DATETIME2 and the other has DATE... so I can effectively JOIN using a CAST( AS DATE) or I can use a "traditional" range query (>= date AND < date+1).

My question is which is preferable? The DATETIME values will almost never match the predicate DATE value.

I expect to stay on the order of 2M rows having the DATETIME and under 5k having the DATE (if this consideration makes a difference)

Should I expect the same behavior on the JOIN as I might using the WHERE clause? Which should I prefer to retain performance with scaling? Does the answer change with MSSQL 2012?

My generalized use-case is to treat my events table like a calendar table

SELECT
    events.columns
    ,SOME_AGGREGATIONS(tasks.column)
FROM
    events
LEFT OUTER JOIN
    tasks
        --This appropriately states my intent clearer
        ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn 
        --But is this more effective/scalable?
        --ON tasks.datetimecolumn >= events.datecolumn 
        --AND tasks.datetimecolumn < DATEADD(day,1,events.datecolumn)
GROUP BY
    events.columns

Solution

"It depends".

One advantage of the = predicate and cast to date is that the join can be hash or merge. The range version will force a nested loops plan.

If there are no useful indexes to seek into the datetimecolumn on tasks this would make a substantial difference.

Setting up the 5K/ 2 million rows of test data mentioned in the question

CREATE TABLE events
  (
     eventId    INT IDENTITY PRIMARY KEY,
     datecolumn DATE NOT NULL,
     details    CHAR(1000) DEFAULT 'D'
  )

INSERT INTO events
            (datecolumn)
SELECT TOP 5000 DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY @@SPID), GETDATE())
FROM   spt_values v1,
       spt_values v2

CREATE TABLE tasks
  (
     taskId         INT IDENTITY PRIMARY KEY,
     datetimecolumn DATETIME2 NOT NULL,
     details        CHAR(1000) DEFAULT 'D'
  );

WITH N
     AS (SELECT number
         FROM   spt_values
         WHERE  number BETWEEN 1 AND 40
                AND type = 'P')
INSERT INTO tasks
            (datetimecolumn)
SELECT DATEADD(MINUTE, number, CAST(datecolumn AS DATETIME2))
FROM   events,
       N


Then turning on

SET STATISTICS IO ON;
SET STATISTICS TIME ON;


And trying the CAST version

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP  BY events.eventId


Completed in 7.4 seconds

Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 1, logical reads 28679
Table 'events'. Scan count 1, logical reads 719

   CPU time = 3042 ms,  elapsed time = 7434 ms.


The estimated number of rows coming out of the join and into the GROUP BY was far too small (5006.27 vs actual 2,000,000) and the hash aggregate spilled to tempdb

Trying the range predicate

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datetimecolumn >= events.datecolumn
            AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP  BY events.eventId


The lack of an equality predicate forces a nested loops plan. As there are no useful indexes to support this query it has no option but to scan the 2 million row table 5,000 times.

On my machine that gave a parallel plan that eventually completed after 1 minute 40 seconds.

Table 'tasks'. Scan count 4, logical reads 143390000
Table 'events'. Scan count 5, logical reads 788
Table 'Worktable'. Scan count 0, logical reads 0
  CPU time = 368193 ms,  elapsed time = 100528 ms.


This time the number of rows coming out of the join and into the aggregate was grossly over estimated (at estimated 124,939,000 vs actual 2,000,000)

Repeating the experiment after altering the tables to make the respective date/time columns the clustered primary key altered the results.

Both queries ended up choosing a nested loops plan. The CAST as DATE version gave a serial version that completed in 4.5 seconds and the range version a parallel plan that completed in elapsed time 1.1 seconds with CPU time of 3.2 seconds.

Applying MAXDOP 1 to the second query to make the figures more easily comparable returns the following.

Query 1

Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 5000, logical reads 78137
Table 'events'. Scan count 1, logical reads 719
   CPU time = 3167 ms,  elapsed time = 4497 ms.


Query 2

Table 'tasks'. Scan count 5000, logical reads 49440
Table 'events'. Scan count 1, logical reads 719
   CPU time = 3042 ms,  elapsed time = 3147 ms.


Query 1 had an estimated 5006.73 rows coming out of the join and the hash aggregate spilled to tempdb again.

Query 2 again has a large overestimate (at 120,927,000 this time).

The other obvious difference between the two results is that the range query looks like it manages to seek into tasks more efficiently in some way. Only reading 49,440 pages vs 78,137.

The range that the cast as date version seeks into is derived from an internal function GetRangeThroughConvert. The plan shows a residual predicate on CONVERT(date,[dbo].[tasks].[datetimecolumn],0)= [dbo].[events].[datecolumn].

If Query 2 is altered to

LEFT OUTER JOIN tasks
     ON tasks.datetimecolumn > DATEADD(day, -1, events.datecolumn)
        AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)


Then the number of reads then becomes the same. The dynamic seek used by the CAST AS DATE version reads unnecessary rows (two days worth rather than one) and then discards them with the residual predicate.

One other possibility would be to restructure the table to store the date and time components in different columns.

```
CREATE TABLE [dbo].tasks NOT NULL,
[datecolumn] date NOT NULL,
[timecolumn] time NOT NULL,
[datetimecolumn] AS DATEADD(day, DATEDIFF(DAY,0,[datecolumn]), CAST([timecolumn] AS DATETIME2(7))),
[details] char NULL,
PRIMARY KEY CLUST

Code Snippets

CREATE TABLE events
  (
     eventId    INT IDENTITY PRIMARY KEY,
     datecolumn DATE NOT NULL,
     details    CHAR(1000) DEFAULT 'D'
  )

INSERT INTO events
            (datecolumn)
SELECT TOP 5000 DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY @@SPID), GETDATE())
FROM   spt_values v1,
       spt_values v2

CREATE TABLE tasks
  (
     taskId         INT IDENTITY PRIMARY KEY,
     datetimecolumn DATETIME2 NOT NULL,
     details        CHAR(1000) DEFAULT 'D'
  );

WITH N
     AS (SELECT number
         FROM   spt_values
         WHERE  number BETWEEN 1 AND 40
                AND type = 'P')
INSERT INTO tasks
            (datetimecolumn)
SELECT DATEADD(MINUTE, number, CAST(datecolumn AS DATETIME2))
FROM   events,
       N
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP  BY events.eventId
Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 1, logical reads 28679
Table 'events'. Scan count 1, logical reads 719

   CPU time = 3042 ms,  elapsed time = 7434 ms.
SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datetimecolumn >= events.datecolumn
            AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP  BY events.eventId

Context

StackExchange Database Administrators Q#51340, answer score: 6

Revisions (0)

No revisions yet.