patternsqlMajor
Check existence with EXISTS outperform COUNT! ... Not?
Viewed 0 times
existencewithoutperformexistscountchecknot
Problem
I've often read when one had to check existence of a row should always be done with EXISTS instead of with a COUNT.
Yet in several recent scenarios I've measured a performance improvement when using count.
The pattern goes like this:
I'm not familiar with methods to tell what's happening "inside" SQL Server so I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done (could EXISTS be RBAR?!).
Do you have some explanation to that phenomena?
EDIT:
Here's a full script you can run:
On SQL Server 2008R2 (Seven 64bits) I get this result
Table '#455F344D'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob phy
Yet in several recent scenarios I've measured a performance improvement when using count.
The pattern goes like this:
LEFT JOIN (
SELECT
someID
, COUNT(*)
FROM someTable
GROUP BY someID
) AS Alias ON (
Alias.someID = mainTable.ID
)I'm not familiar with methods to tell what's happening "inside" SQL Server so I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done (could EXISTS be RBAR?!).
Do you have some explanation to that phenomena?
EDIT:
Here's a full script you can run:
SET NOCOUNT ON
SET STATISTICS IO OFF
DECLARE @tmp1 TABLE (
ID INT UNIQUE
)
DECLARE @tmp2 TABLE (
ID INT
, X INT IDENTITY
, UNIQUE (ID, X)
)
; WITH T(n) AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master.dbo.spt_values AS S
)
, tally(n) AS (
SELECT
T2.n * 100 + T1.n
FROM T AS T1
CROSS JOIN T AS T2
WHERE T1.n 0
AND T2.n 0 THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
LEFT JOIN (
SELECT
T2.ID
, COUNT(*) AS n
FROM @tmp2 AS T2
GROUP BY T2.ID
) AS T2 ON (
T2.ID = T1.ID
)
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since table are filled within the same scope
SET STATISTICS TIME OFF
PRINT '
EXISTS Version:'
WAITFOR DELAY '00:00:01'
SET STATISTICS TIME ON
SELECT
T1.ID
, CASE WHEN EXISTS (
SELECT 1
FROM @tmp2 AS T2
WHERE T2.ID = T1.ID
) THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since table are filled within the same scope
SET STATISTICS TIME OFFOn SQL Server 2008R2 (Seven 64bits) I get this result
COUNT Version:Table '#455F344D'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob phy
Solution
I've often read when one had to check existence of a row should always be done with EXISTS instead of with a COUNT.
It's very rare for anything to always be true, especially when it comes to databases. There are any number of ways to express the same semantic in SQL. If there is a useful rule of thumb, it might be to write queries using the most natural syntax available (and, yes, that is subjective) and only consider rewrites if the query plan or performance you get is unacceptable.
For what it's worth, my own take on the issue is that existence queries are most naturally expressed using
I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done
Your particular example is interesting, because it highlights the way the optimizer deals with subqueries in
Subqueries in CASE expressions
Consider the following (perfectly legal) query:
The semantics of
The inner side of the nested loop joins are only evaluated when the pass-through predicate returns false. The overall effect is that
CASE expressions with an EXISTS subquery
Where a
The details of the implementation is that the logical subquery is replaced by a correlated join ('apply') with a probe column. The work is performed by a simplification rule in the query optimizer called
Part of the input tree showing the
This is transformed by
This is the left semi-join apply with probe described previously. This initial transformation is the only one available in SQL Server query optimizers to date, and compilation will simply fail if this transformation is disabled.
One of the possible execution plan shapes for this query is a direct implementation of that logical structure:
The final Compute Scalar evaluates the result of the
The basic shape of the plan tree is preserved when the optimizer considers other physical join types for the semi join. Only merge join supports a probe column, so a hash semi join, though logically possible, is not considered:
Notice the merge outputs an expression labelled
The problem is that the optimizer doesn't fully explore alternatives that only become worthwhile with merge (or hash) semi join. In the nested loops plan, there is no advantage to checking if rows in
It's very rare for anything to always be true, especially when it comes to databases. There are any number of ways to express the same semantic in SQL. If there is a useful rule of thumb, it might be to write queries using the most natural syntax available (and, yes, that is subjective) and only consider rewrites if the query plan or performance you get is unacceptable.
For what it's worth, my own take on the issue is that existence queries are most naturally expressed using
EXISTS. It has also been my experience that EXISTS tends to optimize better than the OUTER JOIN reject NULL alternative. Using COUNT(*) and filtering on =0 is another alternative, that happens to have some support in the SQL Server query optimizer, but I have personally found this to be unreliable in more complex queries. In any case, EXISTS just seems much more natural (to me) than either of those alternatives.I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I've done
Your particular example is interesting, because it highlights the way the optimizer deals with subqueries in
CASE expressions (and EXISTS tests in particular).Subqueries in CASE expressions
Consider the following (perfectly legal) query:
DECLARE @Base AS TABLE (a integer NULL);
DECLARE @When AS TABLE (b integer NULL);
DECLARE @Then AS TABLE (c integer NULL);
DECLARE @Else AS TABLE (d integer NULL);
SELECT
CASE
WHEN (SELECT W.b FROM @When AS W) = 1
THEN (SELECT T.c FROM @Then AS T)
ELSE (SELECT E.d FROM @Else AS E)
END
FROM @Base AS B;The semantics of
CASE are that WHEN/ELSE clauses are generally evaluated in textual order. In the query above, it would be incorrect for SQL Server to return an error if the ELSE subquery returned more than one row, if the WHEN clause was satisfied. To respect these semantics, the optimizer produces a plan that uses pass-through predicates:The inner side of the nested loop joins are only evaluated when the pass-through predicate returns false. The overall effect is that
CASE expressions are tested in order, and subqueries are only evaluated if no previous expression was satisfied.CASE expressions with an EXISTS subquery
Where a
CASE subquery uses EXISTS, the logical existence test is implemented as a semi-join, but rows that would normally be rejected by the semi-join have to be retained in case a later clause needs them. Rows flowing through this special kind of semi-join acquire a flag to indicate if the semi-join found a match or not. This flag is known as the probe column.The details of the implementation is that the logical subquery is replaced by a correlated join ('apply') with a probe column. The work is performed by a simplification rule in the query optimizer called
RemoveSubqInPrj (remove subquery in projection). We can see the details using trace flag 8606:SELECT
T1.ID,
CASE
WHEN EXISTS
(
SELECT 1
FROM #T2 AS T2
WHERE T2.ID = T1.ID
) THEN 1
ELSE 0
END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);Part of the input tree showing the
EXISTS test is shown below:ScaOp_Exists
LogOp_Project
LogOp_Select
LogOp_Get TBL: #T2
ScaOp_Comp x_cmpEq
ScaOp_Identifier [T2].ID
ScaOp_Identifier [T1].IDThis is transformed by
RemoveSubqInPrj to a structure headed by:LogOp_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)This is the left semi-join apply with probe described previously. This initial transformation is the only one available in SQL Server query optimizers to date, and compilation will simply fail if this transformation is disabled.
One of the possible execution plan shapes for this query is a direct implementation of that logical structure:
The final Compute Scalar evaluates the result of the
CASE expression using the probe column value:The basic shape of the plan tree is preserved when the optimizer considers other physical join types for the semi join. Only merge join supports a probe column, so a hash semi join, though logically possible, is not considered:
Notice the merge outputs an expression labelled
Expr1008 (that the name is the same as before is a coincidence) though no definition for it appears on any operator in the plan. This is just the probe column again. As before, the final Compute Scalar uses this probe value to evaluate the CASE.The problem is that the optimizer doesn't fully explore alternatives that only become worthwhile with merge (or hash) semi join. In the nested loops plan, there is no advantage to checking if rows in
T2 match the range on every iteration. With a merge or hash plan, this could beCode Snippets
DECLARE @Base AS TABLE (a integer NULL);
DECLARE @When AS TABLE (b integer NULL);
DECLARE @Then AS TABLE (c integer NULL);
DECLARE @Else AS TABLE (d integer NULL);
SELECT
CASE
WHEN (SELECT W.b FROM @When AS W) = 1
THEN (SELECT T.c FROM @Then AS T)
ELSE (SELECT E.d FROM @Else AS E)
END
FROM @Base AS B;SELECT
T1.ID,
CASE
WHEN EXISTS
(
SELECT 1
FROM #T2 AS T2
WHERE T2.ID = T1.ID
) THEN 1
ELSE 0
END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);ScaOp_Exists
LogOp_Project
LogOp_Select
LogOp_Get TBL: #T2
ScaOp_Comp x_cmpEq
ScaOp_Identifier [T2].ID
ScaOp_Identifier [T1].IDLogOp_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)SELECT
T1.ID,
CASE
WHEN EXISTS
(
SELECT 1
FROM #T2 AS T2
WHERE T2.ID = T1.ID
AND T2.ID BETWEEN 5000 AND 7000 -- New
) THEN 1
ELSE 0
END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000;Context
StackExchange Database Administrators Q#46359, answer score: 47
Revisions (0)
No revisions yet.