Pre-staging data causes execution plan cost to skyrocket
Problem
I have a troublesome query that we are trying to tune. One of our first thoughts was to take a portion of the larger execution plan, store those results in an intermediate temp table, and then perform the remaining operations against it.
What I'm observing is that when we pre-stage the data into a temp table, the execution plan cost goes through the roof (22 -> 1.1k). This has the benefit of allowing the plan to go parallel, which reduced execution time by 20%, but that isn't worth the much higher CPU usage per execution in our case.
We are using SQL Server 2016 SP2 with the legacy CE on.
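For context, the legacy CE can be pinned either per database or per query on SQL Server 2016; how it was enabled in this environment isn't stated, but these are the standard commands:
```
-- Database-wide (SQL Server 2016+):
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Or per query (SQL Server 2016 SP1+):
SELECT COUNT(*) FROM sys.objects
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```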
Original plan (Cost ~20):
https://www.brentozar.com/pastetheplan/?id=ry-QGnkCM
Original SQL:
```
WITH Object1(Column1, Column2, Column3, Column4, Column5, Column6)
AS
(
SELECT Object2.Column1,
Object2.Column2,
Object3.Column3,
Object3.Column4,
Object3.Column5,
Object3.Column6
FROM Object4 AS Object5
INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
WHERE Object8.Column12 IS NULL AND
Object8.Column13 = Object5.Column13 AND
Object3.Column3 = Object5.Column3 AND
Object11.Column14 = Variable1
)
INSERT Object12
SELECT Object13.Column2,
Object13.Column3,
MIN(Object13.Column4) AS Column15,
MAX(Object13.Column4) AS Column16,
COUNT(DISTINCT (CASE WHEN Object13.Column5 = 1 THEN Object13.Column1 END)) AS Column17,
COUNT(DISTINCT (CASE WHEN Object13.Column6 = 0 THEN Object13.Column1 END)) AS Column18,
COUNT(DISTINCT Object13.Column1) AS Column19
FROM Object1 AS Object13
GROUP BY Object13.Column2, Object13.Column3 OPTION (RECOMPILE)
```
Solution
Costs are based on estimates, even in "actual plans". You cannot compare two query plans side by side and conclude that one of them will require more CPU to execute based on operator or total plan costs alone. I can create a query with a cost in the millions that executes in one second, and a query with a tiny cost that effectively takes forever to execute.
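To make the second case concrete, here is a toy sketch (dbo.BigTable and its Id column are hypothetical, not from the question). The row goal introduced by TOP (1) makes the optimizer assume a matching row will turn up almost immediately, so the plan is costed as trivially cheap; if no row ever matches, the nested loop actually has to churn through the entire cross product:
```
-- Costed as tiny because of the TOP (1) row goal, but if the predicate never
-- matches, this must compare roughly N * N row pairs before giving up.
SELECT TOP (1) a.Id
FROM dbo.BigTable AS a
CROSS JOIN dbo.BigTable AS b
WHERE a.Id = b.Id + 1000000000;  -- assuming no two Id values differ by exactly this amount
```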
For your case, the first query has a cost of only 22 optimizer units because of a poor cardinality estimate after the hash join:
The operators in red execute millions of times, but the query optimizer expects them to execute only a few thousand times. Costs, which are based on estimates, won't reflect that work. The operator in blue is a table spool into which the cardinality estimator expects to insert a single row; it instead inserts a few million. As a result, the operators in black (along with a few others not shown) are inefficient and spill to tempdb.
With the other plan, you put a significant number of rows into tempdb, and as a result the cardinality estimate is more reasonable, though still not ideal:
The query optimizer expects to process many more rows, and as a result the query plan has a higher cost. As a very general rule of thumb, improved estimates tend to mean improved performance, but it doesn't always work out that way. Looking at the plan with a temp table, I see a few areas of possible improvement:
- Load the full CTE from the original query into the temp table. A query with multiple distinct aggregates can be tough to optimize: sometimes you'll get a plan where all of the data is loaded into a spool (in tempdb) and the aggregates are applied separately to the spool, and in my experience all of that work happens in a serial zone. If you eliminate all of the joins from the aggregating query, I believe you won't get that plan shape; the aggregates will simply be applied to the temp table. That saves the work of writing nearly the same data to tempdb twice, and the entire plan should be eligible for parallelism (a sketch combining this step and the next follows the list).
- Define the temp table as a heap and write to it with TABLOCK. It looks like right now you have a clustered index, which makes the insert ineligible for parallelism.
- Consider making the query eligible for batch mode by using one of these tricks (one is sketched below). Batch mode aggregates can be significantly more efficient with multiple distinct aggregates.
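Putting the first two suggestions together, a minimal sketch that keeps the anonymized names; the column types are assumptions, since the real definitions aren't visible in the anonymized plan:
```
-- A heap on purpose: no clustered index, so the INSERT below can run in parallel.
CREATE TABLE #Staged
(
    Column1 INT NOT NULL,   -- types are assumptions; the anonymized plan hides them
    Column2 INT NOT NULL,
    Column3 INT NOT NULL,
    Column4 INT NULL,
    Column5 BIT NULL,
    Column6 BIT NULL
);

-- TABLOCK makes the INSERT ... SELECT eligible for a parallel insert on 2016.
INSERT INTO #Staged WITH (TABLOCK)
SELECT Object2.Column1,
       Object2.Column2,
       Object3.Column3,
       Object3.Column4,
       Object3.Column5,
       Object3.Column6
FROM Object4 AS Object5
INNER JOIN Object6 AS Object2 ON Object2.Column2 = Object5.Column2 AND Object2.Column7 = 0
INNER JOIN Object7 AS Object8 ON Object8.Column8 = Object2.Column9 AND Object8.Column7 = 0
INNER JOIN Object9 AS Object3 ON Object3.Column10 = Object8.Column11 AND Object3.Column7 = 0
INNER JOIN Object10 AS Object11 ON Object2.Column1 = Object11.Column1
WHERE Object8.Column12 IS NULL AND
      Object8.Column13 = Object5.Column13 AND
      Object3.Column3 = Object5.Column3 AND
      Object11.Column14 = Variable1;

-- The aggregation now touches only the temp table, with no joins left to
-- force the serial spool-based plan for the multiple distinct aggregates.
INSERT Object12
SELECT Column2,
       Column3,
       MIN(Column4) AS Column15,
       MAX(Column4) AS Column16,
       COUNT(DISTINCT CASE WHEN Column5 = 1 THEN Column1 END) AS Column17,
       COUNT(DISTINCT CASE WHEN Column6 = 0 THEN Column1 END) AS Column18,
       COUNT(DISTINCT Column1) AS Column19
FROM #Staged
GROUP BY Column2, Column3;
```
And for the batch mode suggestion, one commonly used trick on SQL Server 2016 (an assumption about which trick is meant; the answer's link covers others) is to join the query to an empty table that carries a columnstore index, since batch mode on that version still requires a columnstore somewhere in the plan:
```
-- Empty table whose only purpose is its clustered columnstore index.
CREATE TABLE #BatchModeEnabler (Dummy INT, INDEX ccsi CLUSTERED COLUMNSTORE);

-- The join condition never matches, so the results are unchanged, but the
-- presence of the columnstore makes the aggregate eligible for batch mode.
INSERT Object12
SELECT s.Column2,
       s.Column3,
       MIN(s.Column4) AS Column15,
       MAX(s.Column4) AS Column16,
       COUNT(DISTINCT CASE WHEN s.Column5 = 1 THEN s.Column1 END) AS Column17,
       COUNT(DISTINCT CASE WHEN s.Column6 = 0 THEN s.Column1 END) AS Column18,
       COUNT(DISTINCT s.Column1) AS Column19
FROM #Staged AS s
LEFT OUTER JOIN #BatchModeEnabler AS bme ON 1 = 0
GROUP BY s.Column2, s.Column3;
```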
I would expect some combination of those steps to significantly improve the runtime. Note that this was a quick analysis, in part because anonymized plans are difficult to interpret.
Context
StackExchange Database Administrators Q#206180, answer score: 6