patternsqlMinor
Comparing DISTINCT, GROUP BY and ROW_NUMBER() in SQL Server 2008 with data warehouse workloads
Viewed 0 times
distinctcomparinggroupserversql2008withwarehouseworkloadsrow_number
Problem
I recently heard the advice: "If you're using DISTINCT, I challenge you to fix your code." However, I'm wondering what considerations are behind this advice, and whether it means that I should approach the following problem differently.
Problem context
I have a denormalised date dimension table (Kimball star schema). In this table, I have collapsed several 3rd normal form tables: date, absolute week, absolute month and year. For each of these collapsed tables, I have a column that represents a unique identity, a textual name column and a numeric manual sort ordering index. Here's an example row to show the structure, the numbers are illustrative only (in vertical form for ease of reading):
I've chosen to use a date dimension table for ease of understanding, but the problem can be transposed to any other dimension table involving hierarchies.
Problem synopsis
I want to retrieve the distinct set of week and month combinations with the related textual fields.
Possible solutions
Note that I cannot exclude MonthId or WeekId because I cannot assume that free text fields are unique.
In this approach, indexed views of every combination of non-leaf attributes are precalculated and queried directly. Obviously this is fastest at query time, but could require many duplications of the dimensio
Problem context
I have a denormalised date dimension table (Kimball star schema). In this table, I have collapsed several 3rd normal form tables: date, absolute week, absolute month and year. For each of these collapsed tables, I have a column that represents a unique identity, a textual name column and a numeric manual sort ordering index. Here's an example row to show the structure, the numbers are illustrative only (in vertical form for ease of reading):
- DateId: 20110507
- DateName: 7th May 2011
- DateOrder: 17813
- WeekId: 201118
- WeekName: Week 18, 2011
- WeekOrder: 617
- MonthId: 201105
- MonthName: May 2011
- MonthOrder: 74
- YearId: 2011
- YearName: 2011
- YearOrder: 41
I've chosen to use a date dimension table for ease of understanding, but the problem can be transposed to any other dimension table involving hierarchies.
Problem synopsis
I want to retrieve the distinct set of week and month combinations with the related textual fields.
Possible solutions
- DISTINCT
SELECT DISTINCT MonthId, WeekId, YearName, MonthName, WeekName
FROM DimDateNote that I cannot exclude MonthId or WeekId because I cannot assume that free text fields are unique.
- GROUP BY
SELECT MIN(YearName), MIN(MonthName), MIN(WeekName)
FROM DimDate
GROUP BY MonthId, WeekId- ROWNUMBER
WITH grp AS (
SELECT YearName, MonthName, WeekName
, ROW_NUMBER() OVER (PARTITION BY MonthId, WeekId) AS r
FROM DimDate
)
SELECT YearName, MonthName, WeekName
FROM grp
WHERE grp.r = 1- ETL
In this approach, indexed views of every combination of non-leaf attributes are precalculated and queried directly. Obviously this is fastest at query time, but could require many duplications of the dimensio
Solution
In my experience, an aggregate (DISTINCT or GROUP BY) can be quicker then a ROW_NUMBER() approach. Saying that, ROW_NUMBER is better with SQL Server 2008 than SQL Server 2005.
However, you'll have to try for your situation.
Compare query plans, and use Profiler and SET to capture IO, CPU, Duration etc
For a lot of background, see these SO questions:
Finally, do you need the ROW_NUMBER approach? It looks like you're fixing a problem caused by de-normalisation.
And some notes:
However, you'll have to try for your situation.
Compare query plans, and use Profiler and SET to capture IO, CPU, Duration etc
For a lot of background, see these SO questions:
- Why are logical reads for windowed aggregate functions so high? (follow the links from Martin Smith)
- can I get count() and rows from one sql query in sql server? (note the comments in the answers from me and Chris Bednarski)
Finally, do you need the ROW_NUMBER approach? It looks like you're fixing a problem caused by de-normalisation.
And some notes:
- shouldn't YearID be in the GROUP BY or PARTITION BY?
- Won't DISTINCT give different output?
- Are these columns indexed?
Context
StackExchange Database Administrators Q#7118, answer score: 3
Revisions (0)
No revisions yet.