HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Update Intersect hangs

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
hangsupdateintersect

Problem

We have this really weird behavior we just started experiencing with our update intersect statements. These were working fine, but now we are ingesting a pretty wide data source in terms of columns and slows gradually till hanging indefinitely.

The query below will get longer and longer as we add data in sets of like 20K rows (which is very small), and hang around 70K rows. There are no indexes being used, we drop them before we ingest the data.

Here is the statement:

UPDATE Staging.[TdDailyPerformance]
SET [SYS_OPERATION] = 'U'
FROM (
    SELECT [HashCode]           
    FROM Staging.[TdDailyPerformance]
    INTERSECT
    SELECT [HashCode]
    FROM [IdMatch].[TdDailyPerformance]
) AS A


Execution plan:

Now this query works in many other places on our server but not here. The interesting thing is that no matter whether the INTERSECT returns rows, the query hangs forever (I test this by running the intersect independently - it takes less than 2ms.).

It seems like according to SQL it shouldn't work, but it does. If the HasCode in the Staging table exists already in the IdMatch table it updates the [SYS_OPERATION] of the Staging table to be "U". We use this several places and it only started failing on this one dataset recently.

Any ideas what might be causing this?

No blocking, as far as we can see. The only wait types on the transaction are CXPACKET which is what I would expect from the QP. I've queried sp_who2, looked at All Transactions and Activity Monitor to identify blocks and have found nothing. I haven't traced.

Mostly, our tests work out now, so that it has 0 rows when it hangs. But we have verified that it also hangs with 1-100 rows that INTERSECT.

IdMatch has no HashCodes that exist in Staging, but both tables have around 70K rows at time of hang. So to be clear, both tables have around 70K, but the intersection on HashCode is 0 rows.

We've tested with indexes. We get overall poor performance before we hit

Solution

You've written an accidental cross join:

UPDATE Staging.[TdDailyPerformance]
SET [SYS_OPERATION] = 'U'
FROM (
    SELECT [HashCode]           
    FROM Staging.[TdDailyPerformance]
    INTERSECT
    SELECT [HashCode]
    FROM [IdMatch].[TdDailyPerformance]
) AS A


That is quite a common error when using the FROM extension of T-SQL UPDATE.

The expectation is that both references to Staging.[TdDailyPerformance] identify the same instance of the object, but that is not how it works.

The statement above actually specifies that all rows of the target should be updated if the derived table A produces any rows at all. The two instances of Staging.[TdDailyPerformance] are bound separately.

The query appears to hang for the reasons I discuss separately at the end of this answer.

The safest way to write this sort of update is to alias the tables and always use an alias as the target. (You should also write the query deterministically such that each target row can only be updated at most once.)

If we try to follow the alias rule with the statement above:

UPDATE S_TDP
SET SYS_OPERATION = 'U'
FROM
(
    SELECT S_TDP.HashCode
    FROM Staging.TdDailyPerformance AS S_TDP
    INTERSECT
    SELECT I_TDP.HashCode
    FROM IdMatch.TdDailyPerformance AS I_TDP
) AS A;


We get a binding error, which alerts us to the mistake:


Msg 208, Level 16, State 1, Line xxx

Invalid object name 'S_TDP'.

You're already aware of the working alternatives, so I won't labour that point, except to mention that you should probably look at combining the two updates as suggested in the chat room.

Related Q & A:

  • How can I identify non deterministic updates occurring in my SQL Server?



  • "Upsert" trigger writing mass duplicates from SSIS



Plan Analysis

I don't know how interested you are in the execution plan for the incorrect update statement, but just in case, here is a brief analysis of the serial version of the plan.

The portion of the plan below the Top is concerned with finding the first row (if any) resulting from the INTERSECT:

The Flow Distinct is a row-goal optimization that aims to produce the first distinct hash code value quickly. The Nested Loops Join is chosen because the optimizer only expects the scan the inner side heap table once to find a matching hash code.

This strategy is exposed when there is no match on hash code. In that case, the inner side will be fully scanned for every row on the outer side - 70,000 full scans in total. This might take a while. You can test the effect of removing the row goal by using documented trace flag 4138 e.g. via a query hint OPTION (QUERYTRACEON 4138). The update statement will still be incorrect, but at least it won't appear to hang.

You don't see this problem when running the INTERSECT on its own because the row goal is introduced (with the Top) by the optimizer as it searches for a reasonable plan. You can simulate it with a query like:

SELECT TOP (1) 1
FROM 
(
    SELECT S_TDP.HashCode
    FROM Staging.TdDailyPerformance AS S_TDP
    INTERSECT
    SELECT I_TDP.HashCode
    FROM IdMatch.TdDailyPerformance AS I_TDP
) AS A


Or:

SELECT DISTINCT 
    TDP.HashCode
FROM Staging.TdDailyPerformance AS TDP
CROSS JOIN 
(
    SELECT [HashCode]           
    FROM Staging.[TdDailyPerformance]
    INTERSECT
    SELECT [HashCode]
    FROM [IdMatch].[TdDailyPerformance]
) AS A;


The remainder of the plan updates the whole target table if a row was found:

The Nested Loops Join has no join predicate. The Sort and Stream Aggregate group records by heap RID. This is pointless but it is part of the general logic used to collapse plans that might update the same target row multiple times to a single (non-deterministic) update per-row.

Code Snippets

UPDATE Staging.[TdDailyPerformance]
SET [SYS_OPERATION] = 'U'
FROM (
    SELECT [HashCode]           
    FROM Staging.[TdDailyPerformance]
    INTERSECT
    SELECT [HashCode]
    FROM [IdMatch].[TdDailyPerformance]
) AS A
UPDATE S_TDP
SET SYS_OPERATION = 'U'
FROM
(
    SELECT S_TDP.HashCode
    FROM Staging.TdDailyPerformance AS S_TDP
    INTERSECT
    SELECT I_TDP.HashCode
    FROM IdMatch.TdDailyPerformance AS I_TDP
) AS A;
SELECT TOP (1) 1
FROM 
(
    SELECT S_TDP.HashCode
    FROM Staging.TdDailyPerformance AS S_TDP
    INTERSECT
    SELECT I_TDP.HashCode
    FROM IdMatch.TdDailyPerformance AS I_TDP
) AS A
SELECT DISTINCT 
    TDP.HashCode
FROM Staging.TdDailyPerformance AS TDP
CROSS JOIN 
(
    SELECT [HashCode]           
    FROM Staging.[TdDailyPerformance]
    INTERSECT
    SELECT [HashCode]
    FROM [IdMatch].[TdDailyPerformance]
) AS A;

Context

StackExchange Database Administrators Q#240843, answer score: 11

Revisions (0)

No revisions yet.