HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Check existence with EXISTS outperform COUNT! ... Not?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
existencewithoutperformexistscountchecknot

Problem

I've often read when one had to check existence of a row should always be done with EXISTS instead of with a COUNT.

Yet in several recent scenarios I've measured a performance improvement when using count.

The pattern goes like this:

LEFT JOIN (
    SELECT
        someID
        , COUNT(*)
    FROM someTable
    GROUP BY someID
) AS Alias ON (
    Alias.someID = mainTable.ID
)


I'm not familiar with methods to tell what's happening "inside" SQL Server so I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done (could EXISTS be RBAR?!).

Do you have some explanation to that phenomena?

EDIT:

Here's a full script you can run:

SET NOCOUNT ON
SET STATISTICS IO OFF

DECLARE @tmp1 TABLE (
    ID INT UNIQUE
)

DECLARE @tmp2 TABLE (
    ID INT
    , X INT IDENTITY
    , UNIQUE (ID, X)
)

; WITH T(n) AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master.dbo.spt_values AS S
) 
, tally(n) AS (
    SELECT
        T2.n * 100 + T1.n
    FROM T AS T1
    CROSS JOIN T AS T2
    WHERE T1.n  0
AND T2.n  0 THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
LEFT JOIN (
    SELECT
        T2.ID
        , COUNT(*) AS n
    FROM @tmp2 AS T2
    GROUP BY T2.ID
) AS T2 ON (
    T2.ID = T1.ID
)
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since table are filled within the same scope

SET STATISTICS TIME OFF

PRINT '

EXISTS Version:'

WAITFOR DELAY '00:00:01'

SET STATISTICS TIME ON

SELECT
    T1.ID
    , CASE WHEN EXISTS (
        SELECT 1
        FROM @tmp2 AS T2
        WHERE T2.ID = T1.ID
    ) THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since table are filled within the same scope

SET STATISTICS TIME OFF


On SQL Server 2008R2 (Seven 64bits) I get this result

COUNT Version:


Table '#455F344D'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob phy

Solution

I've often read when one had to check existence of a row should always be done with EXISTS instead of with a COUNT.

It's very rare for anything to always be true, especially when it comes to databases. There are any number of ways to express the same semantic in SQL. If there is a useful rule of thumb, it might be to write queries using the most natural syntax available (and, yes, that is subjective) and only consider rewrites if the query plan or performance you get is unacceptable.

For what it's worth, my own take on the issue is that existence queries are most naturally expressed using EXISTS. It has also been my experience that EXISTS tends to optimize better than the OUTER JOIN reject NULL alternative. Using COUNT(*) and filtering on =0 is another alternative, that happens to have some support in the SQL Server query optimizer, but I have personally found this to be unreliable in more complex queries. In any case, EXISTS just seems much more natural (to me) than either of those alternatives.

I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done

Your particular example is interesting, because it highlights the way the optimizer deals with subqueries in CASE expressions (and EXISTS tests in particular).
Subqueries in CASE expressions

Consider the following (perfectly legal) query:

DECLARE @Base AS TABLE (a integer NULL);
DECLARE @When AS TABLE (b integer NULL);
DECLARE @Then AS TABLE (c integer NULL);
DECLARE @Else AS TABLE (d integer NULL);

SELECT
    CASE
        WHEN (SELECT W.b FROM @When AS W) = 1
            THEN (SELECT T.c FROM @Then AS T)
        ELSE (SELECT E.d FROM @Else AS E)
    END
FROM @Base AS B;


The semantics of CASE are that WHEN/ELSE clauses are generally evaluated in textual order. In the query above, it would be incorrect for SQL Server to return an error if the ELSE subquery returned more than one row, if the WHEN clause was satisfied. To respect these semantics, the optimizer produces a plan that uses pass-through predicates:

The inner side of the nested loop joins are only evaluated when the pass-through predicate returns false. The overall effect is that CASE expressions are tested in order, and subqueries are only evaluated if no previous expression was satisfied.
CASE expressions with an EXISTS subquery

Where a CASE subquery uses EXISTS, the logical existence test is implemented as a semi-join, but rows that would normally be rejected by the semi-join have to be retained in case a later clause needs them. Rows flowing through this special kind of semi-join acquire a flag to indicate if the semi-join found a match or not. This flag is known as the probe column.

The details of the implementation is that the logical subquery is replaced by a correlated join ('apply') with a probe column. The work is performed by a simplification rule in the query optimizer called RemoveSubqInPrj (remove subquery in projection). We can see the details using trace flag 8606:

SELECT
    T1.ID,
    CASE
        WHEN EXISTS 
        (
            SELECT 1
            FROM #T2 AS T2
            WHERE T2.ID = T1.ID
        ) THEN 1 
    ELSE 0
    END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);


Part of the input tree showing the EXISTS test is shown below:

ScaOp_Exists 
    LogOp_Project
        LogOp_Select
            LogOp_Get TBL: #T2
            ScaOp_Comp x_cmpEq
                ScaOp_Identifier [T2].ID
                ScaOp_Identifier [T1].ID


This is transformed by RemoveSubqInPrj to a structure headed by:

LogOp_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)


This is the left semi-join apply with probe described previously. This initial transformation is the only one available in SQL Server query optimizers to date, and compilation will simply fail if this transformation is disabled.

One of the possible execution plan shapes for this query is a direct implementation of that logical structure:

The final Compute Scalar evaluates the result of the CASE expression using the probe column value:

The basic shape of the plan tree is preserved when the optimizer considers other physical join types for the semi join. Only merge join supports a probe column, so a hash semi join, though logically possible, is not considered:

Notice the merge outputs an expression labelled Expr1008 (that the name is the same as before is a coincidence) though no definition for it appears on any operator in the plan. This is just the probe column again. As before, the final Compute Scalar uses this probe value to evaluate the CASE.

The problem is that the optimizer doesn't fully explore alternatives that only become worthwhile with merge (or hash) semi join. In the nested loops plan, there is no advantage to checking if rows in T2 match the range on every iteration. With a merge or hash plan, this could be

Code Snippets

DECLARE @Base AS TABLE (a integer NULL);
DECLARE @When AS TABLE (b integer NULL);
DECLARE @Then AS TABLE (c integer NULL);
DECLARE @Else AS TABLE (d integer NULL);

SELECT
    CASE
        WHEN (SELECT W.b FROM @When AS W) = 1
            THEN (SELECT T.c FROM @Then AS T)
        ELSE (SELECT E.d FROM @Else AS E)
    END
FROM @Base AS B;
SELECT
    T1.ID,
    CASE
        WHEN EXISTS 
        (
            SELECT 1
            FROM #T2 AS T2
            WHERE T2.ID = T1.ID
        ) THEN 1 
    ELSE 0
    END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);
ScaOp_Exists 
    LogOp_Project
        LogOp_Select
            LogOp_Get TBL: #T2
            ScaOp_Comp x_cmpEq
                ScaOp_Identifier [T2].ID
                ScaOp_Identifier [T1].ID
LogOp_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)
SELECT
    T1.ID,
    CASE
        WHEN EXISTS 
        (
            SELECT 1
            FROM #T2 AS T2
            WHERE T2.ID = T1.ID
            AND T2.ID BETWEEN 5000 AND 7000 -- New
        ) THEN 1 
    ELSE 0
    END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000;

Context

StackExchange Database Administrators Q#46359, answer score: 47

Revisions (0)

No revisions yet.