patternMinor
Should I join datetime to a date using cast or range?
Viewed 0 times
rangedatejoincastusingshoulddatetime
Problem
This question is a take-off from the excellent one posed here:
Cast to date is sargable but is it a good idea?
In my case, I am not concerned with the
One table has
My question is which is preferable? The
I expect to stay on the order of 2M rows having the
Should I expect the same behavior on the
My generalized use-case is to treat my events table like a calendar table
Cast to date is sargable but is it a good idea?
In my case, I am not concerned with the
WHERE clause but in joining to an events table which has a column of type DATEOne table has
DATETIME2 and the other has DATE... so I can effectively JOIN using a CAST( AS DATE) or I can use a "traditional" range query (>= date AND < date+1).My question is which is preferable? The
DATETIME values will almost never match the predicate DATE value.I expect to stay on the order of 2M rows having the
DATETIME and under 5k having the DATE (if this consideration makes a difference)Should I expect the same behavior on the
JOIN as I might using the WHERE clause? Which should I prefer to retain performance with scaling? Does the answer change with MSSQL 2012?My generalized use-case is to treat my events table like a calendar table
SELECT
events.columns
,SOME_AGGREGATIONS(tasks.column)
FROM
events
LEFT OUTER JOIN
tasks
--This appropriately states my intent clearer
ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
--But is this more effective/scalable?
--ON tasks.datetimecolumn >= events.datecolumn
--AND tasks.datetimecolumn < DATEADD(day,1,events.datecolumn)
GROUP BY
events.columnsSolution
"It depends".
One advantage of the
If there are no useful indexes to seek into the
Setting up the 5K/ 2 million rows of test data mentioned in the question
Then turning on
And trying the
Completed in 7.4 seconds
The estimated number of rows coming out of the join and into the
Trying the range predicate
The lack of an equality predicate forces a nested loops plan. As there are no useful indexes to support this query it has no option but to scan the 2 million row table 5,000 times.
On my machine that gave a parallel plan that eventually completed after 1 minute 40 seconds.
This time the number of rows coming out of the join and into the aggregate was grossly over estimated (at estimated 124,939,000 vs actual 2,000,000)
Repeating the experiment after altering the tables to make the respective date/time columns the clustered primary key altered the results.
Both queries ended up choosing a nested loops plan. The
Applying
Query 1
Query 2
Query 1 had an estimated 5006.73 rows coming out of the join and the hash aggregate spilled to
Query 2 again has a large overestimate (at 120,927,000 this time).
The other obvious difference between the two results is that the range query looks like it manages to seek into
The range that the cast as date version seeks into is derived from an internal function
If Query 2 is altered to
Then the number of reads then becomes the same. The dynamic seek used by the
One other possibility would be to restructure the table to store the
```
CREATE TABLE [dbo].tasks NOT NULL,
[datecolumn] date NOT NULL,
[timecolumn] time NOT NULL,
[datetimecolumn] AS DATEADD(day, DATEDIFF(DAY,0,[datecolumn]), CAST([timecolumn] AS DATETIME2(7))),
[details] char NULL,
PRIMARY KEY CLUST
One advantage of the
= predicate and cast to date is that the join can be hash or merge. The range version will force a nested loops plan. If there are no useful indexes to seek into the
datetimecolumn on tasks this would make a substantial difference.Setting up the 5K/ 2 million rows of test data mentioned in the question
CREATE TABLE events
(
eventId INT IDENTITY PRIMARY KEY,
datecolumn DATE NOT NULL,
details CHAR(1000) DEFAULT 'D'
)
INSERT INTO events
(datecolumn)
SELECT TOP 5000 DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY @@SPID), GETDATE())
FROM spt_values v1,
spt_values v2
CREATE TABLE tasks
(
taskId INT IDENTITY PRIMARY KEY,
datetimecolumn DATETIME2 NOT NULL,
details CHAR(1000) DEFAULT 'D'
);
WITH N
AS (SELECT number
FROM spt_values
WHERE number BETWEEN 1 AND 40
AND type = 'P')
INSERT INTO tasks
(datetimecolumn)
SELECT DATEADD(MINUTE, number, CAST(datecolumn AS DATETIME2))
FROM events,
NThen turning on
SET STATISTICS IO ON;
SET STATISTICS TIME ON;And trying the
CAST versionSELECT events.eventId,
MAX(tasks.details)
FROM events
LEFT OUTER JOIN tasks
ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP BY events.eventIdCompleted in 7.4 seconds
Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 1, logical reads 28679
Table 'events'. Scan count 1, logical reads 719
CPU time = 3042 ms, elapsed time = 7434 ms.The estimated number of rows coming out of the join and into the
GROUP BY was far too small (5006.27 vs actual 2,000,000) and the hash aggregate spilled to tempdb Trying the range predicate
SELECT events.eventId,
MAX(tasks.details)
FROM events
LEFT OUTER JOIN tasks
ON tasks.datetimecolumn >= events.datecolumn
AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP BY events.eventIdThe lack of an equality predicate forces a nested loops plan. As there are no useful indexes to support this query it has no option but to scan the 2 million row table 5,000 times.
On my machine that gave a parallel plan that eventually completed after 1 minute 40 seconds.
Table 'tasks'. Scan count 4, logical reads 143390000
Table 'events'. Scan count 5, logical reads 788
Table 'Worktable'. Scan count 0, logical reads 0
CPU time = 368193 ms, elapsed time = 100528 ms.This time the number of rows coming out of the join and into the aggregate was grossly over estimated (at estimated 124,939,000 vs actual 2,000,000)
Repeating the experiment after altering the tables to make the respective date/time columns the clustered primary key altered the results.
Both queries ended up choosing a nested loops plan. The
CAST as DATE version gave a serial version that completed in 4.5 seconds and the range version a parallel plan that completed in elapsed time 1.1 seconds with CPU time of 3.2 seconds. Applying
MAXDOP 1 to the second query to make the figures more easily comparable returns the following.Query 1
Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 5000, logical reads 78137
Table 'events'. Scan count 1, logical reads 719
CPU time = 3167 ms, elapsed time = 4497 ms.Query 2
Table 'tasks'. Scan count 5000, logical reads 49440
Table 'events'. Scan count 1, logical reads 719
CPU time = 3042 ms, elapsed time = 3147 ms.Query 1 had an estimated 5006.73 rows coming out of the join and the hash aggregate spilled to
tempdb again.Query 2 again has a large overestimate (at 120,927,000 this time).
The other obvious difference between the two results is that the range query looks like it manages to seek into
tasks more efficiently in some way. Only reading 49,440 pages vs 78,137.The range that the cast as date version seeks into is derived from an internal function
GetRangeThroughConvert. The plan shows a residual predicate on CONVERT(date,[dbo].[tasks].[datetimecolumn],0)= [dbo].[events].[datecolumn].If Query 2 is altered to
LEFT OUTER JOIN tasks
ON tasks.datetimecolumn > DATEADD(day, -1, events.datecolumn)
AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)Then the number of reads then becomes the same. The dynamic seek used by the
CAST AS DATE version reads unnecessary rows (two days worth rather than one) and then discards them with the residual predicate.One other possibility would be to restructure the table to store the
date and time components in different columns.```
CREATE TABLE [dbo].tasks NOT NULL,
[datecolumn] date NOT NULL,
[timecolumn] time NOT NULL,
[datetimecolumn] AS DATEADD(day, DATEDIFF(DAY,0,[datecolumn]), CAST([timecolumn] AS DATETIME2(7))),
[details] char NULL,
PRIMARY KEY CLUST
Code Snippets
CREATE TABLE events
(
eventId INT IDENTITY PRIMARY KEY,
datecolumn DATE NOT NULL,
details CHAR(1000) DEFAULT 'D'
)
INSERT INTO events
(datecolumn)
SELECT TOP 5000 DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY @@SPID), GETDATE())
FROM spt_values v1,
spt_values v2
CREATE TABLE tasks
(
taskId INT IDENTITY PRIMARY KEY,
datetimecolumn DATETIME2 NOT NULL,
details CHAR(1000) DEFAULT 'D'
);
WITH N
AS (SELECT number
FROM spt_values
WHERE number BETWEEN 1 AND 40
AND type = 'P')
INSERT INTO tasks
(datetimecolumn)
SELECT DATEADD(MINUTE, number, CAST(datecolumn AS DATETIME2))
FROM events,
NSET STATISTICS IO ON;
SET STATISTICS TIME ON;SELECT events.eventId,
MAX(tasks.details)
FROM events
LEFT OUTER JOIN tasks
ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP BY events.eventIdTable 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 1, logical reads 28679
Table 'events'. Scan count 1, logical reads 719
CPU time = 3042 ms, elapsed time = 7434 ms.SELECT events.eventId,
MAX(tasks.details)
FROM events
LEFT OUTER JOIN tasks
ON tasks.datetimecolumn >= events.datecolumn
AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP BY events.eventIdContext
StackExchange Database Administrators Q#51340, answer score: 6
Revisions (0)
No revisions yet.