HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Comparing DISTINCT, GROUP BY and ROW_NUMBER() in SQL Server 2008 with data warehouse workloads

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctcomparinggroupserversql2008withwarehouseworkloadsrow_number

Problem

I recently heard the advice: "If you're using DISTINCT, I challenge you to fix your code." However, I'm wondering what considerations are behind this advice, and whether it means that I should approach the following problem differently.

Problem context

I have a denormalised date dimension table (Kimball star schema). In this table, I have collapsed several 3rd normal form tables: date, absolute week, absolute month and year. For each of these collapsed tables, I have a column that represents a unique identity, a textual name column and a numeric manual sort ordering index. Here's an example row to show the structure, the numbers are illustrative only (in vertical form for ease of reading):

  • DateId: 20110507



  • DateName: 7th May 2011



  • DateOrder: 17813



  • WeekId: 201118



  • WeekName: Week 18, 2011



  • WeekOrder: 617



  • MonthId: 201105



  • MonthName: May 2011



  • MonthOrder: 74



  • YearId: 2011



  • YearName: 2011



  • YearOrder: 41



I've chosen to use a date dimension table for ease of understanding, but the problem can be transposed to any other dimension table involving hierarchies.

Problem synopsis

I want to retrieve the distinct set of week and month combinations with the related textual fields.

Possible solutions

  1. DISTINCT



SELECT DISTINCT MonthId, WeekId, YearName, MonthName, WeekName
  FROM DimDate


Note that I cannot exclude MonthId or WeekId because I cannot assume that free text fields are unique.

  1. GROUP BY



SELECT MIN(YearName), MIN(MonthName), MIN(WeekName)
  FROM DimDate
 GROUP BY MonthId, WeekId


  1. ROWNUMBER



WITH grp AS (
  SELECT YearName, MonthName, WeekName
       , ROW_NUMBER() OVER (PARTITION BY MonthId, WeekId) AS r
    FROM DimDate
)
SELECT YearName, MonthName, WeekName
  FROM grp
 WHERE grp.r = 1


  1. ETL



In this approach, indexed views of every combination of non-leaf attributes are precalculated and queried directly. Obviously this is fastest at query time, but could require many duplications of the dimensio

Solution

In my experience, an aggregate (DISTINCT or GROUP BY) can be quicker then a ROW_NUMBER() approach. Saying that, ROW_NUMBER is better with SQL Server 2008 than SQL Server 2005.

However, you'll have to try for your situation.

Compare query plans, and use Profiler and SET to capture IO, CPU, Duration etc

For a lot of background, see these SO questions:

  • Why are logical reads for windowed aggregate functions so high? (follow the links from Martin Smith)



  • can I get count() and rows from one sql query in sql server? (note the comments in the answers from me and Chris Bednarski)



Finally, do you need the ROW_NUMBER approach? It looks like you're fixing a problem caused by de-normalisation.

And some notes:

  • shouldn't YearID be in the GROUP BY or PARTITION BY?



  • Won't DISTINCT give different output?



  • Are these columns indexed?

Context

StackExchange Database Administrators Q#7118, answer score: 3

Revisions (0)

No revisions yet.