Why is there an aggregate in this UPDATE plan?
Problem
Given this
declare @Data table (id int, fact char(1));
declare @Summary table (id int, collected varchar(99));
insert @Data(id, fact)
values
(1, 'a'),
(1, 'b'),
(2, 'c'),
(2, 'd'),
(2, 'e');
-- Form a list of unique id values
insert @Summary(id, collected) select distinct id, '' from @Data;
-- Accumulate the fact values into collected
update s
set collected = collected + d.fact
from @Summary as s
inner join @Data as d
on d.id = s.id;
select * from @Summary;

I had expected

id  collected
--- ---------
1   ab
2   cde

However what I get is
id  collected
--- ---------
1   a
2   c

The first fact per id is written and the others skipped. The reason is obvious from the plan, where the Stream Aggregate passes through the first row per id; 5 rows are passed to it from the Nested Loop and it passes 2 rows to the Compute Scalar.

The result and general plan shape is the same with and without primary keys. Changing to temp tables or real tables makes no difference. I can reproduce on SQL Server 2017 and 2019.
My question is: what is the theoretical basis for inserting the aggregate into the plan? My best guess is it is a form of Halloween protection. I understood its purpose was to prevent a row leapfrogging ahead of the current scan position and hence being updated a second time. I can sort of see that applying here, as a row in @Summary would be touched many times in the absence of the aggregate. It seems a very expansive application of Halloween protection, though.
I know there are ways to achieve this - STRING_AGG being the most obvious. My real use-case is around constructing JSON. This is just a minimal reproducible example. My specific question here is about understanding the semantics and behaviour of the optimizer.
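Something like this is what I mean by the STRING_AGG route (a rough sketch only, assuming SQL Server 2017+; the WITHIN GROUP clause is just there to make the concatenation order deterministic):

-- Build the full list per id in one pass instead of accumulating row by row
update s
set collected = agg.facts
from @Summary as s
inner join (
    select id,
           string_agg(fact, '') within group (order by fact) as facts
    from @Data
    group by id
) as agg
    on agg.id = s.id;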
Solution
This isn't Halloween protection, it's normal UPDATE semantics.
Check out this warning from the docs on UPDATE:
The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.
A SET statement doesn't accumulate like you expect it to* - it just updates each qualifying row in the Summary table once. Since the join results in duplicate potential values for collected, the optimizer introduced a Stream Aggregate - essentially it groups by the "primary key" of the heap (the row locator, referred to as Bmk1000 in the execution plan) to get just one value for collected in each row.

Looking at the plan XML, you can see that the internal-only "ANY" aggregate was used to choose among the possible values of collected and fact.
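If you want to check this yourself, one way to capture the plan XML is sketched below (any method of getting the actual plan works, e.g. SSMS's "Include Actual Execution Plan"); then look at the Stream Aggregate's defined values for the ANY aggregate over collected and fact:

-- Capture the actual plan XML for the update, then inspect the Stream Aggregate node
set statistics xml on;

update s
set collected = collected + d.fact
from @Summary as s
inner join @Data as d
    on d.id = s.id;

set statistics xml off;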
*this sort of accumulation does work when you add a variable assignment into the mix, although it's still not "supported." This is referred to as "quirky updates" (reference)
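For completeness, the variable-assignment ("quirky update") form looks roughly like this (a sketch only; the behaviour is undocumented and shouldn't be relied on):

declare @dummy varchar(99);

-- Assigning to a variable and to the column in the same SET changes the plan shape,
-- so each joined row applies its own update and the concatenation accumulates.
-- Undocumented and unsupported - shown only to illustrate the footnote above.
update s
set @dummy = collected = collected + d.fact
from @Summary as s
inner join @Data as d
    on d.id = s.id;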
Context
StackExchange Database Administrators Q#259464, answer score: 8
Revisions (0)
No revisions yet.