patternsqlMinor
Cluster date time values
Viewed 0 times
valuestimedatecluster
Problem
I came up with this TSQL for SQL Server 2008:
This clusters datetime values inside groups (class) if they fall inside the same time window and if the cluster contains more than 1 entry. Please review the code.
DECLARE @TimeWindowInSeconds INT
SET @TimeWindowInSeconds = 10
IF OBJECT_ID('tempdb..#Temp') IS NOT NULL
DROP TABLE #Temp
CREATE TABLE #Temp
(
Class INT,
DT DateTime
)
INSERT INTO #Temp (Class, DT)
SELECT 1, '2014-11-05 10:55:00'
UNION ALL
SELECT 1, '2014-11-05 10:55:10'
UNION ALL
SELECT 1, '2014-11-05 11:55:10'
UNION ALL
SELECT 2, '2014-11-05 10:55:11'
UNION ALL
SELECT 2, '2014-11-05 13:56:10'
;WITH CTE1 AS
(
SELECT
Class,
DT,
ROW_NUMBER() OVER (ORDER BY Class, DT ) AS RowNumber
FROM #Temp
)
,CTE2 AS
(
-- A is the successor
SELECT
A.Class,
A.RowNumber,
B.RowNumber AS RowNumber1,
A.DT,
B.DT AS DT2,
DATEDIFF(second, B.DT, A.DT) AS DifferenceInSeconds,
CASE WHEN B.DT IS NULL THEN 1 END z
FROM CTE1 AS A
LEFT OUTER JOIN CTE1 AS B ON A.RowNumber = B.RowNumber + 1 AND A.Class = B.Class
AND DATEDIFF(second, B.DT, A.DT) 1
)This clusters datetime values inside groups (class) if they fall inside the same time window and if the cluster contains more than 1 entry. Please review the code.
Solution
I agree with the other feedback given here so far; I have two three suggestions of my own, two minor and one more significant.
Minor point (1)
Initial assignation to a variable can be done in one line, so instead of:
you could just write:
Minor point (2)
I am not a fan of selecting columns in intermediate rowsets that are never used subsequently. They add clutter and unnecessarily make the reader spend time understanding or visualising them.
However,
I suspect that you put these columns in whilst writing
The bigger one
Your code uses both common table expressions (CTEs) and subqueries. For me, this is quite a painful inconsistency. CTEs were effectively an evolution of subqueries in the ANSI SQL standard; mixing both in one query feels to me like designing an aeroplane that has both propellers and a jet engine.
Aside: please don't read too far into that analogy: this answer is not about their relative performance, nor am I claiming that one is made redundant by the other. People still use prop planes, after all.
Unless performance is particularly important, and you know for certain that subqueries perform measurably better on your server, with your data, I recommend sticking with CTEs alone:
In your particular case, I think the
This also removes altogether the need to
Even if you didn't want to change the
Minor point (1)
Initial assignation to a variable can be done in one line, so instead of:
DECLARE @TimeWindowInSeconds INT
SET @TimeWindowInSeconds = 10you could just write:
DECLARE @TimeWindowInSeconds INT = 10Minor point (2)
I am not a fan of selecting columns in intermediate rowsets that are never used subsequently. They add clutter and unnecessarily make the reader spend time understanding or visualising them.
CTE1 selects just three columns, all of which are necessary for the join in CTE2: this is good. However,
CTE2 then selects seven columns, of which only A.Class, A.DT and the expression aliased as z are required by CTE3 or presented in the final output. I suspect that you put these columns in whilst writing
CTE2, because the others are all used in the join within that CTE. This is fine for work-in-progress, but these columns should be removed from the finished query. They don't need to be in the SELECT list just because they're used in the JOIN. The bigger one
Your code uses both common table expressions (CTEs) and subqueries. For me, this is quite a painful inconsistency. CTEs were effectively an evolution of subqueries in the ANSI SQL standard; mixing both in one query feels to me like designing an aeroplane that has both propellers and a jet engine.
Aside: please don't read too far into that analogy: this answer is not about their relative performance, nor am I claiming that one is made redundant by the other. People still use prop planes, after all.
Unless performance is particularly important, and you know for certain that subqueries perform measurably better on your server, with your data, I recommend sticking with CTEs alone:
- The syntax is easier to follow
- They mimic step-by-step problem solving, walking your reader through the solution steadily
- Well-chosen CTE names are tantamount to good comments, explaining your rationale
- They can make code shorter because you can re-use them
In your particular case, I think the
SubGroup subquery will need to become a separate CTE between CTE2 and CTE3 (you'd then join to that CTE in the current CTE3, using c.DT 1.This also removes altogether the need to
CAST and concatenate Class and Subgroup, since you're not including that concatenation in your result set. Even if you didn't want to change the
WHERE...IN ( subquery ) part, you could simplify things by adding an extra, final CTE in which you do the casting and concatenation of Class and Subgroup just once, give it an alias, and then doWHERE alias IN ( SELECT alias FROM cte3 GROUP BY ... )Code Snippets
DECLARE @TimeWindowInSeconds INT
SET @TimeWindowInSeconds = 10DECLARE @TimeWindowInSeconds INT = 10COUNT(*) OVER ( PARTITION BY Class, Subgroup ) as CountBy_Cls_SgpWHERE alias IN ( SELECT alias FROM cte3 GROUP BY ... )Context
StackExchange Code Review Q#68978, answer score: 2
Revisions (0)
No revisions yet.