
In Subselect, using GetDate instead of a @variable leads to worse execution plan

Submitted by: @import:stackexchange-dba

Problem

Execution plans

This is a simplified version of the real-life problem. When GetDate is used in the subselect, the plan is different and worse. The two queries:

select * , (select count(1) 
            from hamsfa.customer as sub 
            where sub.tstamp_nosync > @old 
              and sub.cus_id > main.cus_id) as x 
from hamsfa.customer as main ;

select * , (select count(1) 
            from hamsfa.customer as sub 
            where sub.tstamp_nosync > dateadd(year, -100, getdate())
              and sub.cus_id > main.cus_id) as x 
from hamsfa.customer as main ;
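For completeness, the first query needs @old to be declared somewhere; the original declaration is not shown, so the following setup is an assumption (the answer below suggests the variable's type was datetime):

```sql
-- Hypothetical declaration for the first query; not shown in the
-- original post. The type is assumed to be datetime.
declare @old datetime = dateadd(year, -100, getdate());
```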


What I don't fully understand is why.

By my understanding, the variable should be worse.

There is another query where this leads to a difference of almost 700% (1s vs 7s).

So the question is: why does the variable produce the better plan, even though in my experience it should be worse?

EDIT: Statistics are up to date and correct

Solution

When you use the variable, SQL Server makes a guess at the selectivity of the tstamp_nosync predicate. For the > operator, the guess is 30%. The table contains 3052 rows, so the estimate is that 30% of 3052 = 915.6 rows will pass that test.

When you use the dateadd(year, -100, getdate()) expression, SQL Server can produce a much more accurate estimate of the number of rows that will qualify for that predicate.

Your build of SQL Server 2012 does not have the Actual Rows Read property, so the number of rows that actually match is not available in your second plan. That feature was added in Service Pack 3 of SQL Server 2012 (you are still on Service Pack 1, for some reason).

From what I can tell, the better estimate for the dateadd expression is that all rows will qualify. This is apparent from the output of the Eager Index Spool in the first plan. The 3052 rows are tested 3052 times (once per iteration of the nested loop join) with the tstamp_nosync predicate, and all pass. The result is 3052 * 3052 = 9,314,704 rows in total, shown as the actual number of rows from the spool in that plan.

The general expectation that using variables will result in a worse execution plan is sound, because specific estimations are usually more accurate than complete guesses; however, the other predicate sub.cus_id > main.cus_id in the query will always result in a 30% guess.

For the variable query, both predicates are estimated with a 30% guess. The combined estimate is 3052 × 30% × 30% = 274.68 rows, seen at the input to the Stream Aggregate in the plan.

For the dateadd query, there is one 30% guess (seek predicate) and a 100% selectivity estimate (residual predicate), giving a combined estimate of 915.6 rows as noted previously. This estimate is shown at the output of the Clustered Index Seek in the plan.

The difference in estimates explains the optimizer's different plan selections.

With fewer rows estimated for the variable case, that plan's estimated cost is lower; the plan based on the better estimate is costed higher, for the same reason. Nevertheless, both plans are based on inaccurate estimates because of the guess involved in the cus_id predicate.

In terms of observed performance, which plan performs better rather depends on the details of your local environment. The variable plan scans the source table once and builds an indexed temporary table in tempdb. The non-variable plan simply accesses the base table each time with a seek (guessed 30% selectivity) and residual predicate (computed 100% selectivity). Which performs better in practice depends on multiple factors.

Given correct information for both predicates, the optimizer would be more likely to generate the plan that is better in practice.
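One standard way to give the optimizer the actual value of the variable, rather than the fixed 30% guess, is a statement-level recompile: with the OPTION (RECOMPILE) query hint, SQL Server can embed the runtime value of the variable as a literal when it compiles the plan. A sketch against the same table:

```sql
-- With OPTION (RECOMPILE), the runtime value of @old can be embedded
-- as a literal, so the tstamp_nosync estimate can come from the
-- statistics histogram rather than the fixed 30% guess.
-- Note: the sub.cus_id > main.cus_id predicate still gets a 30% guess,
-- since it compares two columns.
select * , (select count(1)
            from hamsfa.customer as sub
            where sub.tstamp_nosync > @old
              and sub.cus_id > main.cus_id) as x
from hamsfa.customer as main
option (recompile);
```

The trade-off is a compilation on every execution, which may or may not be acceptable for this workload.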

It often pays to be very conscious of data types. There is an implicit conversion to datetime2(4) in your execution plan because that is the type of the column, and the variable is something else (datetime perhaps). The result of the dateadd expression is also datetime, and also requires implicit conversion. This conversion does not appear to be important in this specific case, but it often is. For example, see:

Performance Surprises and Assumptions : DATEADD() by Aaron Bertrand.
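One way to avoid the implicit conversion noted above is to declare the variable with the column's own type. A sketch, assuming the column really is datetime2(4) as the plan suggests:

```sql
-- Declaring the variable as datetime2(4) to match the column type
-- avoids the implicit conversion of tstamp_nosync in the predicate.
-- sysdatetime() returns datetime2, unlike getdate(), which returns
-- datetime.
declare @old datetime2(4) = dateadd(year, -100, sysdatetime());
```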

Context

StackExchange Database Administrators Q#156112, answer score: 7
