HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlMinor

Query runs indefinitely - what is the root cause ? (not asking how to fix)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fixthewhatqueryindefinitelyroothowaskingcauseruns

Problem

I have a simple query with a grouping, which worked fine, until I added one more join:

select 
   [ca].Value,
   [c].cID, 
   [c].Name

from ReportingDB..Table1 [t1]

join MainDB..Companies [c] on
    [t1].CompanyID = [c].cID
    and [c].cID not in (1)

join MainDB..CompanyAttributes [ca] on -- this is the join that causes trouble
    [t1].CompanyID = caCID
    and caAttr = 26

group by [ca].Value, [c].cID, [c].Name


Info:

Companies table is a "lookup" table and has 2254 rows, cID is PK

CompanyAttributes has many to one relationship to Companies and has 4055 rows

Table1 has many to one relationship to Companies table and has 3,485,150 rows and

Estimated execution plan does not look unusual.

  • When I try to run the query, it does not finish and after 1 hour I stop it, so can't see what is going on with the Actual execution plan



  • Live Query Statistics made my SSMS hang



  • If remove "group by" clause, it starts fetching rows without any problem pretty fast. Or when last join is removed, it also works fine - with grouping



  • Server is not busy, has enough resources, and I don't see noticeable CPU rise when launch the query



  • looking at sys.dm_exec_requests, wait_type is NULL, cpu_time and logical_reads continue to grow, for the session running the query



What is the ROOT cause of original query running over 1 h without finishing?

I resolved the performance bottleneck itself (see my answer), but don't understand WHAT exactly causes original query to run 1 hour and not finish on a decent server, queried tables are not huge. Would expect original query to finish in less than 1 minute.

Solution

The main problem with the estimated execution plan shown is the Top operator above the Clustered Index Scan of Table1. The scan has a residual predicate:
[ReportingDB].[dbo].[Table1].[CompanyID]=[MainDB].[dbo].[CompanyAttributes].[cacID]


The optimizer tries to estimate how many rows it will need to read from the scan before one passes that test. The logic it uses is generic and not particularly sound in my opinion. In particular, if there is no match, the scan will run to completion, checking all 3,514,200 rows.

More to the point, this scan will repeat for every row returned by the scan of Company Attributes, multiplied by the number of rows returned by the seek on Companies. This is just how nested loops joins work.

The optimizer is super-optimistic about finding matches on each scan of Table1. This results in this plan shape having the lowest estimated cost of the alternatives considered. The root cause of that is the row goal introduced by the Top operator.

If you are curious about where the Top came from (it is not in your query text), please take a look at my closely-related article Row Goals, Part 4: The Anti Join Anti Pattern.

In short: the optimizer introduced a local (partial) aggregate as part of its plan search. That aggregate turned out to be logically redundant, and was replaced with the equivalent Top. An unfortunate side-effect of the Top is to introduce a row goal, significantly reducing the estimated cost of the scan.

A Top above a Scan with a residual predicate (especially where the scan is repeated) is very much an anti-pattern to watch out for.

Context

StackExchange Database Administrators Q#302480, answer score: 4

Revisions (0)

No revisions yet.