
Can you use COUNT DISTINCT with an OVER clause?

Submitted by: @import:stackexchange-dba

Problem

I'm trying to improve the performance of the following query:

UPDATE  [#TempTable]
        SET     Received = r.Number
        FROM    [#TempTable] 
        INNER JOIN (SELECT  AgentID,
                            RuleID,
                            COUNT(DISTINCT (GroupId)) Number
                    FROM    [#TempTable]
                    WHERE   Passed = 1
                    GROUP BY AgentID,
                            RuleID
                   ) r ON r.RuleID = [#TempTable].RuleID AND
                          r.AgentID = [#TempTable].AgentID


Currently, with my test data, it takes about a minute. I have limited input into changes to the overall stored procedure where this query resides, but I can probably get them to modify this one query or add an index. I tried adding the following index:

CREATE CLUSTERED INDEX ix_test ON #TempTable(AgentID, RuleId, GroupId, Passed)


It actually doubled the time the query takes. I get the same effect with a nonclustered index.

I tried rewriting it as follows, with no effect.

WITH r AS (SELECT  AgentID,
                            RuleID,
                            COUNT(DISTINCT (GroupId)) Number
                    FROM    [#TempTable]
                    WHERE   Passed = 1
                    GROUP BY AgentID,
                            RuleID
            ) 
        UPDATE  [#TempTable]
        SET     Received = r.Number
        FROM    [#TempTable] 
        INNER JOIN r 
            ON r.RuleID = [#TempTable].RuleID AND
               r.AgentID = [#TempTable].AgentID


Next, I tried to use a window function like this.

UPDATE  [#TempTable]
        SET     Received = COUNT(DISTINCT (CASE WHEN Passed=1 THEN GroupId ELSE NULL END)) 
                    OVER (PARTITION BY AgentId, RuleId)
        FROM    [#TempTable]


At this point I started to get the error:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'distinct'.


So I have two ques

Solution

This construction is not currently supported in SQL Server. It could (and should, in my opinion) be implemented in a future version.

Applying one of the workarounds listed in the feedback item reporting this deficiency, your query could be rewritten as:

WITH UpdateSet AS
(
    SELECT 
        AgentID, 
        RuleID, 
        Received, 
        Calc = SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY AgentID, RuleID) 
    FROM 
    (
        SELECT  
            AgentID,
            RuleID,
            Received,
            rn = ROW_NUMBER() OVER (
                PARTITION BY AgentID, RuleID, GroupID 
                ORDER BY GroupID)
        FROM    #TempTable
        WHERE   Passed = 1
    ) AS X
)
UPDATE UpdateSet
SET Received = Calc;


[Figure: execution plan for the rewritten query]

This has the advantage of avoiding the Eager Table Spool that the self-join needs for Halloween Protection, but it introduces a sort (for the window) and an often-inefficient Lazy Table Spool construction to calculate and apply the SUM OVER (PARTITION BY) result to all rows in the window. How it performs in practice is something only you can test.
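
To see why the ROW_NUMBER rewrite yields a distinct count, here is a minimal sketch on a made-up table. The name #Demo, its rows, and the column types are assumptions for illustration only, not taken from the question. ROW_NUMBER restarts at 1 for each (AgentID, RuleID, GroupID) combination, so exactly one row per distinct group carries rn = 1, and summing those flags over the (AgentID, RuleID) window counts the groups.

```sql
-- Hypothetical sample data (not from the original question).
CREATE TABLE #Demo
(
    AgentID integer NOT NULL,
    RuleID  integer NOT NULL,
    GroupID integer NOT NULL,
    Passed  bit     NOT NULL
);

INSERT INTO #Demo (AgentID, RuleID, GroupID, Passed)
VALUES
    (1, 1, 10, 1),
    (1, 1, 10, 1),  -- same group again: gets rn = 2, so it adds 0 to the sum
    (1, 1, 20, 1),
    (1, 1, 30, 0),  -- removed by WHERE Passed = 1
    (2, 1, 10, 1);

SELECT
    X.AgentID,
    X.RuleID,
    X.GroupID,
    X.rn,
    -- One row per distinct GroupID has rn = 1, so this sum is the distinct count.
    DistinctGroups = SUM(CASE WHEN X.rn = 1 THEN 1 ELSE 0 END) OVER (
        PARTITION BY X.AgentID, X.RuleID)
FROM
(
    SELECT
        AgentID,
        RuleID,
        GroupID,
        rn = ROW_NUMBER() OVER (
            PARTITION BY AgentID, RuleID, GroupID
            ORDER BY GroupID)
    FROM #Demo
    WHERE Passed = 1
) AS X
ORDER BY X.AgentID, X.RuleID, X.GroupID, X.rn;

-- With the rows above, the three rows for AgentID 1 show DistinctGroups = 2
-- (groups 10 and 20) and the single row for AgentID 2 shows DistinctGroups = 1.

DROP TABLE #Demo;
```

One difference to keep in mind: COUNT(DISTINCT GroupID) ignores NULLs, whereas this pattern counts a NULL GroupID as a group of its own. If the column is nullable, the CASE condition would also need to test GroupID IS NOT NULL (with GroupID exposed by the derived table) to match.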

The overall approach is a difficult one to make perform well. Applying updates (especially ones based on a self-join) recursively to a large structure may be good for debugging, but it is a recipe for poor performance. Repeated large scans, memory spills, and Halloween Protection problems are just some of the issues. Indexing and (more) temporary tables can help, but very careful analysis is needed, especially if the index is updated by other statements in the process (maintaining indexes affects query plan choices and adds I/O).

Ultimately, solving the underlying problem would make for interesting consultancy work, but it is too much for this site. I hope this answer addresses the surface question though.

Alternative interpretation of the original query (results in updating more rows):

WITH UpdateSet AS
(
    SELECT 
        AgentID, 
        RuleID, 
        Received, 
        Calc = SUM(CASE WHEN Passed = 1 AND rn = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY AgentID, RuleID) 
    FROM 
    (
        SELECT  
            AgentID,
            RuleID,
            Received,
            Passed,
            rn = ROW_NUMBER() OVER (
                PARTITION BY AgentID, RuleID, Passed, GroupID
                ORDER BY GroupID)
        FROM    #TempTable
    ) AS X
)
UPDATE UpdateSet
SET Received = Calc
WHERE Calc > 0;
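
As a rough illustration of what "updating more rows" means here, the sketch below runs this alternative form against a made-up table. The name #Demo2, its rows, and the column types are assumptions, not part of the question. The point to watch is the Passed = 0 row for AgentID 1: it shares an (AgentID, RuleID) pair with passed rows, so it also receives the count, as it would with the original self-join.

```sql
-- Hypothetical sample data (not from the original question).
CREATE TABLE #Demo2
(
    AgentID  integer NOT NULL,
    RuleID   integer NOT NULL,
    GroupID  integer NOT NULL,
    Passed   bit     NOT NULL,
    Received integer NULL
);

INSERT INTO #Demo2 (AgentID, RuleID, GroupID, Passed)
VALUES
    (1, 1, 10, 1),
    (1, 1, 20, 1),
    (1, 1, 30, 0),  -- not passed, but its (AgentID, RuleID) has passed rows
    (2, 1, 10, 0);  -- no passed rows at all for this (AgentID, RuleID)

WITH UpdateSet AS
(
    SELECT
        AgentID,
        RuleID,
        Received,
        Calc = SUM(CASE WHEN Passed = 1 AND rn = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY AgentID, RuleID)
    FROM
    (
        SELECT
            AgentID,
            RuleID,
            Received,
            Passed,
            rn = ROW_NUMBER() OVER (
                PARTITION BY AgentID, RuleID, Passed, GroupID
                ORDER BY GroupID)
        FROM #Demo2
    ) AS X
)
UPDATE UpdateSet
SET Received = Calc
WHERE Calc > 0;

SELECT AgentID, RuleID, GroupID, Passed, Received
FROM #Demo2
ORDER BY AgentID, RuleID, GroupID;

-- With the rows above, all three AgentID 1 rows end up with Received = 2,
-- including the Passed = 0 row. The AgentID 2 row keeps Received = NULL
-- because its Calc is 0 and the WHERE clause skips it.

DROP TABLE #Demo2;
```

The first form, by contrast, filters on Passed = 1 inside the derived table, so the (1, 1, 30, 0) row would never be part of the updatable set and its Received value would stay NULL.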


Note: eliminating the sort (e.g. by providing an index) might reintroduce the need for an Eager Spool or something else to provide the necessary Halloween Protection. Sort is a blocking operator, so it provides full phase separation.

Context

StackExchange Database Administrators Q#76628, answer score: 32