patternsqlModerate
Apply cardinality estimation problem in SQL Server
Viewed 0 times
problemcardinalitysqlapplyserverestimation
Problem
Now, I am faced with the problem of the logic of cardinality estimation that is not quite clear for me in a seemingly rather simple situation.
I encountered this situation at my work, therefore, for privacy reasons, I will provide only a general description of the problem below, however, for a more detailed analysis, I simulated this problem in the AdventureWorksDW training base.
There is a query of the following form:
As you can see from the execution plan presented above, the cardinality estimator estimated the estimated number of rows in the Index Seek operation at 17,884,200 (corresponding to 2,980,700 per row from the outer part of the NL), which is quite close to the actual number.
Now I will modify the query and add to CROSS APPLY LEFT OUTER JOIN:
This query gives the following plan:
Seeing the logical form of the query, it is logical to assume that the expected number of rows of the Index Seek operation will remain the same, although I understand that the route for finding the plan is different, however, it would seem that the part highlighted in red has not changed, the same predicates, etc. , but Index Seek's estimate is now 664,506 (corresponding to 110,751 per line from the external part of NL), which is a gross mistake and in the production environment can cause a serious tempdb spill data.
The above queries were executed on an instance of Sql Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (x64).
To get more details and simplify the analysis, I simulated this problem in the AdventureWorksDW2017 database on an instance of SQL Server 2019 (RTM) - 15.0.2000.5 (X64), but I execute queries with the 9481 trace flag turned on to simulate a system with cardinality estimator version 70
I encountered this situation at my work, therefore, for privacy reasons, I will provide only a general description of the problem below, however, for a more detailed analysis, I simulated this problem in the AdventureWorksDW training base.
There is a query of the following form:
SELECT
FROM
CROSS APPLY(
SELECT
FROM p
WHERE p.StartDate = Dates.d
) tAs you can see from the execution plan presented above, the cardinality estimator estimated the estimated number of rows in the Index Seek operation at 17,884,200 (corresponding to 2,980,700 per row from the outer part of the NL), which is quite close to the actual number.
Now I will modify the query and add to CROSS APPLY LEFT OUTER JOIN:
SELECT
FROM
CROSS APPLY(
SELECT
FROM p
LEFT JOIN f ON p.key = f.key
AND f.date = Dates.d
WHERE p.StartDate = Dates.d
) tThis query gives the following plan:
Seeing the logical form of the query, it is logical to assume that the expected number of rows of the Index Seek operation will remain the same, although I understand that the route for finding the plan is different, however, it would seem that the part highlighted in red has not changed, the same predicates, etc. , but Index Seek's estimate is now 664,506 (corresponding to 110,751 per line from the external part of NL), which is a gross mistake and in the production environment can cause a serious tempdb spill data.
The above queries were executed on an instance of Sql Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (x64).
To get more details and simplify the analysis, I simulated this problem in the AdventureWorksDW2017 database on an instance of SQL Server 2019 (RTM) - 15.0.2000.5 (X64), but I execute queries with the 9481 trace flag turned on to simulate a system with cardinality estimator version 70
Solution
There are often several ways to derive a cardinality estimate, with each method giving a different (but equally valid) answer. That is simply the nature of statistics and estimations.
You ask essentially why one method produces an estimate of 296.839 rows, while another gives 181.8 rows.
Let's look at a simpler example of the same AdventureWorksDW2017 join as given in the question:
Example 1 - Join
This is a join between:
with the join predicate being:
One way to compute the selectivity of the join is to consider how FullDateAlternateKey
You ask essentially why one method produces an estimate of 296.839 rows, while another gives 181.8 rows.
Let's look at a simpler example of the same AdventureWorksDW2017 join as given in the question:
Example 1 - Join
DECLARE @db date = '20130720';
DECLARE @de date = '20130802';
SELECT DD.FullDateAlternateKey, DP.ProductAlternateKey
FROM dbo.DimDate AS DD
JOIN dbo.DimProduct AS DP
ON DP.StartDate <= CONVERT(datetime, DD.FullDateAlternateKey)
WHERE
DD.FullDateAlternateKey BETWEEN @db AND @de
OPTION (FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));This is a join between:
DimDate(filtered onFullDateAlternateKey BETWEEN @db AND @de); and
DimProduct
with the join predicate being:
DP.StartDate
One way to compute the selectivity of the join is to consider how FullDateAlternateKey
values will overlap with StartDate values using histogram information.
The histogram steps of FullDateAlternateKey will be scaled for the selectivity of BETWEEN @db AND @de, before being compared with DP.StartDate to see how they join.
Using the original CE, the join estimation will align the two histograms step by step using linear interpolation before being 'joined'.
Once we have computed the selectivity of the join using this method, it doesn't matter (except for display purposes) whether the join is a hash, merge, nested loops, or apply.
The steps of the histogram-based calculation aren't particularly difficult, but they are too long-winded to show here. So I will cut to the chase and simply show the outcome:
Notice the estimate of 296.839 rows on the DimProduct seek.
This is a consequence of the join cardinality estimate being computed as 97,565.2 rows (using histograms). The filter on DimDate passes through 328.68 rows, so the inner side must produce 296.839 rows per iteration on average to make the maths work out.
If a hash or merge join were possible for this query (which it isn't, due to the inequality), the DimProduct table would be scanned, producing all of its 606 rows. The result of the join would still be 97,565.2 rows.
This estimate is a consequence of estimating as a join.
Example 2 - Apply
We could also estimate this query as an apply. A logically-equivalent form written in T-SQL is:
DECLARE @db date = '20130720';
DECLARE @de date = '20130802';
SELECT DD.FullDateAlternateKey, DP.ProductAlternateKey
FROM dbo.DimDate AS DD
CROSS APPLY
(
SELECT DP.ProductAlternateKey
FROM dbo.DimProduct AS DP
WHERE
DP.StartDate <= CONVERT(datetime, DD.FullDateAlternateKey)
) AS DP
WHERE
DD.FullDateAlternateKey BETWEEN @db AND @de
OPTION (FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'), QUERYTRACEON 9114);
(trace flag 9114 prevents the optimizer rewriting the apply as a join)
The estimation approach this time is to assess how many rows will match in DimProduct for each row from DimDate (per iteration):
We have 328.68 rows from DimDate as before, but now each of those rows is expected to match 181.8 rows in DimProduct.
This is simply a guess at the selectivity of StartDate <= FullDateAlternateKey.
The guess is 30% of the 606 rows in DimProduct: 0.3 * 606 = 181.8 rows.
This estimate is a consequence of estimating as an apply.
Final notes
Your example introduces an outer join as a way to make the query too complex for the optimizer to transform from apply to join form. Using TOP` inside the apply is another way to convince the optimizer not to translate an apply to join (even when it could).Code Snippets
DECLARE @db date = '20130720';
DECLARE @de date = '20130802';
SELECT DD.FullDateAlternateKey, DP.ProductAlternateKey
FROM dbo.DimDate AS DD
JOIN dbo.DimProduct AS DP
ON DP.StartDate <= CONVERT(datetime, DD.FullDateAlternateKey)
WHERE
DD.FullDateAlternateKey BETWEEN @db AND @de
OPTION (FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));DECLARE @db date = '20130720';
DECLARE @de date = '20130802';
SELECT DD.FullDateAlternateKey, DP.ProductAlternateKey
FROM dbo.DimDate AS DD
CROSS APPLY
(
SELECT DP.ProductAlternateKey
FROM dbo.DimProduct AS DP
WHERE
DP.StartDate <= CONVERT(datetime, DD.FullDateAlternateKey)
) AS DP
WHERE
DD.FullDateAlternateKey BETWEEN @db AND @de
OPTION (FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'), QUERYTRACEON 9114);Context
StackExchange Database Administrators Q#264646, answer score: 11
Revisions (0)
No revisions yet.