HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Optimizing removal of duplicate rows from recursive CTE query

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
removalrowsduplicatequerycterecursiveoptimizingfrom

Problem

I need to write a recursive SQL query to fetch multiple projects together with their lineage from the following table:

CREATE TABLE #Project
(
Name VARCHAR(10),
ParentName VARCHAR(10)
)

INSERT INTO #Project(Name, ParentName) VALUES ('P', NULL);
INSERT INTO #Project(Name, ParentName) VALUES ('C1', 'P');
INSERT INTO #Project(Name, ParentName) VALUES ('C2', 'P');
SELECT Name, ParentName FROM #Project


(this is just a toy example, so please ignore table design issues and the lack of indexing)

So, given multiple project names, I need to fetch a table that contains the projects as well as all their ancestors, because the result table will have to support lineage queries for a given Project name

In order to do this, I came up with the following recursive CTE query:

WITH Lineage (Name, ParentName)
AS
(
SELECT        Name, ParentName
FROM            #Project WHERE Name = 'C1' OR Name = 'C2'
UNION ALL
SELECT        #Project.Name, #Project.ParentName
FROM            #Project INNER JOIN Lineage ON #Project.Name = Lineage.ParentName
)
SELECT Name, ParentName FROM Lineage


And, here is the result of the query:

Name | ParentName
C1   | P
C2   | P
P    | NULL
P    | NULL


As expected, this returns projects C1 and C2, as well as their parents, which will be duplicated for each child, unfortunately.

In order to get rid of the duplicated parents, I added DISTINCT to the final SELECT:

SELECT DISTINCT Name, ParentName FROM Lineage


Which returns:

Name | ParentName
P    | NULL
C1   | P
C2   | P


Even though this produces the desired result, I am concerned about the performance of this query. The Lineage CTE will surely contain a lot of extra rows if I let it return these duplicates, which I remove at a later step in the query.

Is there any way to rewrite the CTE such that it removes these duplicates directly?

UPDATE: Just in case my example wasn't clear enough, I need to somehow specify the project names for which to fetch the lineag

Solution

How about this? Kind of a hack - but row_number should be quicker than distinct.

;WITH Lineage AS
(
SELECT
    Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY ParentName ORDER BY ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,ParentName AS ParentNameToDisplay 
FROM
    #Project
WHERE
    Name in ('C1','C2')
UNION ALL
SELECT
    p.Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY p.ParentName ORDER BY p.ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,p.ParentName 
FROM
    #Project as p
INNER JOIN
    Lineage ON p.Name = Lineage.ParentNameToJoin
)
SELECT
    Name
   ,ParentNameToDisplay
FROM
    lineage

Code Snippets

;WITH Lineage AS
(
SELECT
    Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY ParentName ORDER BY ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,ParentName AS ParentNameToDisplay 
FROM
    #Project
WHERE
    Name in ('C1','C2')
UNION ALL
SELECT
    p.Name
   ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY p.ParentName ORDER BY p.ParentName) > 1 
         THEN NULL 
         ELSE ParentName END AS ParentNameToJoin
   ,p.ParentName 
FROM
    #Project as p
INNER JOIN
    Lineage ON p.Name = Lineage.ParentNameToJoin
)
SELECT
    Name
   ,ParentNameToDisplay
FROM
    lineage

Context

StackExchange Database Administrators Q#51900, answer score: 4

Revisions (0)

No revisions yet.