
Why is there an aggregate in this UPDATE plan?

Submitted by: @import:stackexchange-dba

Problem

Given this

declare @Data table (id int, fact char(1));
declare @Summary table (id int, collected varchar(99));

insert @Data(id, fact)
values
    (1, 'a'),
    (1, 'b'),
    (2, 'c'),
    (2, 'd'),
    (2, 'e');

-- Form a list of unique id values
insert @Summary(id, collected) select distinct id, '' from @Data;

-- Accumulate the fact values into collected
update s
set collected = collected + d.fact
from @Summary as s
inner join @Data as d
    on d.id = s.id;

select * from @Summary;


I had expected

id   collected
---  ---------
1    ab
2    cde


However what I get is

id   collected
---  ---------
1    a
2    c


Only the first fact per id is written; the others are skipped. The reason is visible in the execution plan: a Stream Aggregate passes through only the first row per id. The Nested Loops join sends it 5 rows, and it passes 2 rows on to the Compute Scalar.

The result and general plan shape are the same with and without primary keys. Switching to temp tables or real tables makes no difference. I can reproduce this on SQL Server 2017 and 2019.

My question is what is the theoretical basis for inserting the aggregate into the plan? My best guess is it is a form of Halloween protection. I understood its purpose was to prevent a row leapfrogging ahead of the current scan position and hence being updated a second time. I can sort-of see that applying here as a row in @Summary would be touched many times in the absence of the aggregate. It seems a very expansive application of Halloween protection, though.

I know there are ways to achieve this - STRING_AGG being the most obvious. My real use-case is around constructing JSON. This is just a minimal reproducible example. My specific question here is about understanding the semantics and behaviour of the optimizer.
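For completeness, this is the kind of supported rewrite I mean (a sketch; STRING_AGG requires SQL Server 2017 or later, and WITHIN GROUP makes the concatenation order deterministic):

```sql
-- Supported alternative: aggregate first, then join the result back.
update s
set collected = agg.facts
from @Summary as s
inner join
(
    select id,
           string_agg(fact, '') within group (order by fact) as facts
    from @Data
    group by id
) as agg
    on agg.id = s.id;

select * from @Summary;
-- id   collected
-- 1    ab
-- 2    cde
```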

Solution

This isn't Halloween protection, it's normal UPDATE semantics.

Check out this warning from the docs on UPDATE:


The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.

A SET clause doesn't accumulate the way you expect it to*; it just updates each qualifying row in the Summary table once.

Since the join results in duplicate potential values for collected, the optimizer introduced a Stream Aggregate - essentially it groups by the "primary key" of the heap (the row locator, referred to as Bmk1000 in the execution plan) to get just one value for collected in each row.
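The effect is as if the optimizer had rewritten the join into a grouped query that picks one arbitrary fact per Summary row. ANY has no T-SQL surface syntax, but MIN makes a visible stand-in for illustration (a sketch of the effective semantics, not the optimizer's literal rewrite):

```sql
-- What the UPDATE effectively computes: one arbitrary fact per id.
-- The internal ANY aggregate is not exposed in T-SQL; MIN stands in here.
select s.id, min(d.fact) as picked
from @Summary as s
inner join @Data as d
    on d.id = s.id
group by s.id;
-- id   picked
-- 1    a
-- 2    c
```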

Looking at the plan XML, you can see that the internal-only "ANY" aggregate was used to choose among the possible values of collected and fact:
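An abbreviated, illustrative fragment of what that aggregate looks like in showplan XML (column and table names in a real plan will differ; this is a reconstruction, not the original plan output):

```xml
<Aggregate Distinct="false" AggType="ANY">
  <ScalarOperator>
    <Identifier>
      <ColumnReference Table="@Summary" Alias="[s]" Column="collected" />
    </Identifier>
  </ScalarOperator>
</Aggregate>
```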

*This sort of accumulation does work when you add a variable assignment into the mix, although it's still not "supported". This is referred to as a "quirky update" (reference).
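A sketch of that unsupported technique, for illustration only: the compound assignment updates both the variable and the column in one pass, so each row sees the value accumulated so far. Correctness depends on undocumented row-processing order, so this must not be relied on in production code:

```sql
-- Unsupported "quirky update": @acc carries state between rows.
declare @acc varchar(99) = '';

update s
set @acc = collected = @acc + d.fact
from @Summary as s
inner join @Data as d
    on d.id = s.id
where s.id = 1;   -- restrict to one group; @acc would bleed across ids
```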

Context

StackExchange Database Administrators Q#259464, answer score: 8
