patternsqlMajor
Why is my ORDER BY in STRING_AGG not always working?
Viewed 0 times
whyorderalwaysworkingstring_aggnot
Problem
I have a table that consists of a record ID, a group ID (linking 1 or more records into a group) and a hash value for each record.
(I know this isn't an efficient table, but it's just a temp table for the purposes of this example).
I want to generate a hash for each group, so I thought the simplest way would be to concatenate the hashes of each record in the group. RecordIDs are unique, but what those records relate to are not necessarily unique so the hashes may be duplicates. The point of this is to flag fully duplicate groups ie a group were all records in that group are duplicates of all records in another group. The GUI needs all members of the group to have the same hash if it's to recognise them as a duplicate group.
I'm using STRING_AGG to concatenate the individual hashes of the records in the group, and sorting them by the hash to make sure I get the same string of characters for duplicate groups. I don't actually care what the order of the hashes is, so long as it's the same each time. When I run it as a SELECT query, it works fine and I can see identical strings for duplicate groups. When I take that same SELECT query and put it into an UPDATE query, the ordering seems to get lost.
This gives the results (for an example pair of duplicate groups):
When I put that same code into the UPDATE query, it doesn't sort them correctly:
```
UPDATE HashTable
SET GroupHashList = c.HashList
FROM HashTable
INNER JOIN (
SELECT (STRING_AGG([Hash],';') WITHIN GROUP (ORDER BY [Hash] ASC)) AS [HashList],
[G
CREATE TABLE HashTable(
RecordID VARCHAR(255),
GroupIdentifier VARCHAR(255),
Hash VARCHAR (255),
GroupHashList VARCHAR(4000)
)(I know this isn't an efficient table, but it's just a temp table for the purposes of this example).
I want to generate a hash for each group, so I thought the simplest way would be to concatenate the hashes of each record in the group. RecordIDs are unique, but what those records relate to are not necessarily unique so the hashes may be duplicates. The point of this is to flag fully duplicate groups ie a group were all records in that group are duplicates of all records in another group. The GUI needs all members of the group to have the same hash if it's to recognise them as a duplicate group.
I'm using STRING_AGG to concatenate the individual hashes of the records in the group, and sorting them by the hash to make sure I get the same string of characters for duplicate groups. I don't actually care what the order of the hashes is, so long as it's the same each time. When I run it as a SELECT query, it works fine and I can see identical strings for duplicate groups. When I take that same SELECT query and put it into an UPDATE query, the ordering seems to get lost.
SELECT STRING_AGG([Hash],';') WITHIN GROUP (ORDER BY [Hash] ASC)
FROM HashTable
GROUP BY [GroupIdentifier]This gives the results (for an example pair of duplicate groups):
73F294873462B2BA0E930FD16DCCB7;90E749375DF806CB6E3F5CA48FFA38;E44256CE7CFCB971EB679BAC25A697
73F294873462B2BA0E930FD16DCCB7;90E749375DF806CB6E3F5CA48FFA38;E44256CE7CFCB971EB679BAC25A697
When I put that same code into the UPDATE query, it doesn't sort them correctly:
```
UPDATE HashTable
SET GroupHashList = c.HashList
FROM HashTable
INNER JOIN (
SELECT (STRING_AGG([Hash],';') WITHIN GROUP (ORDER BY [Hash] ASC)) AS [HashList],
[G
Solution
This appears to be a bug in the optimizer.
The optimizer, having realized that the join is a self-join, is transforming it into a window aggregate. It can do this despite
PasteThePlan
The problem is that the aggregation is over the wrong value. It is aggregating the outer value rather than the inner one.
If you give the two references different aliases, then a careful examination of the query plan reveals the bug.
The other issue is that the aggregates used with that rule (e.g.
As you can see below, the Sort only orders by the correlation column
If you are a
As a workaround, one option to prevent this optimization being applied is to use a grouped
This gets you a pretty straightforward self-join with a Stream Aggregate.
db<>fiddle
I strongly suggest you file this as a bug with Microsoft.
You could also leave feedback, but that does not typically lead to a specific response.
As an aside, you should follow the aliasing rules suggested by Conor Cunningham when writing multi-table
The non-ANSI FROM clause (which you are using here) has specific binding behaviors that may or may not be what you expect. I will suggest you start by aliasing the 3 references to hashtable to be different and then make sure you are explicitly refering to the one you want. It may be (I am guessing) that it is binding to a different one than you think and providing you an undesired output as a result.
The optimizer, having realized that the join is a self-join, is transforming it into a window aggregate. It can do this despite
STRING_AGG not being available as a window aggregate. The rule is called GenGbApplySimple, and allows a self-join to be converted to a window aggregate. There is nothing specifically wrong with this so far.PasteThePlan
The problem is that the aggregation is over the wrong value. It is aggregating the outer value rather than the inner one.
If you give the two references different aliases, then a careful examination of the query plan reveals the bug.
STRING_AGG([dbo].[HashTable].[Hash] as [HT1].[Hash],'')
WITHIN GROUP (ORDER BY [HT2].[Hash])The other issue is that the aggregates used with that rule (e.g.
MIN, MAX, AVG) don't have a WITHIN GROUP ordering to satisfy, so the replacement plan doesn't account for it. It seems likely that STRING_AGG was not intended to work the GbApply rules, or work would be needed to make it compatible (honouring the sort request).As you can see below, the Sort only orders by the correlation column
GroupIdentifier, not by the Hash column used in the WITHIN GROUP.
If you are a
sysadmin, you can turn this rule off for the query, by using the following undocumented OPTION.OPTION (QUERYRULEOFF GenGbApplySimple)
As a workaround, one option to prevent this optimization being applied is to use a grouped
OUTER APPLYUPDATE HT1
SET GroupHashList = C.HashList
OUTPUT inserted.*
FROM HashTable AS HT1
OUTER APPLY
(
SELECT
HashList =
STRING_AGG(HT2.[Hash], ';')
WITHIN GROUP (ORDER BY HT2.[Hash] ASC)
FROM HashTable AS HT2
WHERE HT2.GroupIdentifier = HT1.GroupIdentifier
) C;
This gets you a pretty straightforward self-join with a Stream Aggregate.
db<>fiddle
I strongly suggest you file this as a bug with Microsoft.
You could also leave feedback, but that does not typically lead to a specific response.
As an aside, you should follow the aliasing rules suggested by Conor Cunningham when writing multi-table
UPDATE statements:The non-ANSI FROM clause (which you are using here) has specific binding behaviors that may or may not be what you expect. I will suggest you start by aliasing the 3 references to hashtable to be different and then make sure you are explicitly refering to the one you want. It may be (I am guessing) that it is binding to a different one than you think and providing you an undesired output as a result.
Code Snippets
STRING_AGG([dbo].[HashTable].[Hash] as [HT1].[Hash],'')
WITHIN GROUP (ORDER BY [HT2].[Hash])Context
StackExchange Database Administrators Q#318188, answer score: 25
Revisions (0)
No revisions yet.