HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

A methodology for tracing a query that sporadically runs for hours instead of seconds

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
hoursquerysecondssporadicallyinsteadmethodologythatforrunstracing

Problem

Once a few weeks or months, but always on the same weekday, one query from a stored procedure used in a daily batch job may visibly become stuck and run for about 200 minutes until the application kills its connection. This is a very trivial textbook query which joins 5 tables and calculates a sum() on one field to assign to a variable, so there is no point in providing it here. The total number of records to be summed is normally with 1-2 dozens. This query has no parameters, so parameter sniffing is (fortunately) out of question.

I have looked at a number of things:

  • There are no online users in the DB. It is used purely as a data dump.



  • There is no job that runs simultaneously. The last activity on the DB is a tranlog backup that completes about 10 min earlier. The next activity starts after about 2 hours, it is a full backup which completes fine after a few min (while this query is still going nuts).



-
There are following locks while this query is running, all in GRANT status:

  • OBJECT lock on one of the tables requested in Sch-S mode



  • HOBT.BULK_OPERATION lock on another table in S mode



  • OBJECT lock on yet another table requested in IX mode



The query has (NOLOCK) hint on all joined tables. It is in RUNNABLE state and according to sp_whoisactive, which runs every 15 min on the server, the CPU count of the query in question is constantly increasing. The CPU grows to about 1000x of the standalone query execution. The IO of the query from sp_whoisactive is about the same as if I ran it standalone, but the reads count is enormous, 10000x times over standalone execution. It feels like the query is in an endless loop. But when I get into the office in the morning, the job re-runs within a few seconds.

-
The plan reported by sp_whoisactive is the same as standalone. There is an optimizer hint for creation of an index when I run it, but as the query standalone time is around 200ms, I am not concerned about this hint. Implementing it

Solution

Based on your thorough outline, I believe you may be experiencing a problem with your table statistics.

Tables only automatically update their statistics when certain thresholds of row updates are passed, in the case of any table over 500 rows it requires 500+20% of the rows to change. For example your million row table requires 200,500 row changes before it will update statistics.

An index REBUILD will update the statistics on a table (REORGANIZE does not).

I think your tables are growing over time and eventually invalidating your statistics, but not enough to trigger an automatic update. However, your reindexing job is not triggering an actual REBUILD for the relevant tables until after your query starts performing poorly due to the last couple changes from earlier in the day. If your changes are inconsistent and small enough increments this would make the performance degradation consistent and sudden. Your reindexing job then fixes the problem after the query has ran poorly and everything looks fine the next morning.

Things to check to confirm:

  • If you can catch a 'bad' query plan check the pipes' Actual versus Estimated number of rows returned. If this is radically different it's red flag for statistics issues.



  • Check the criteria on your Reindex Job to confirm that it's not updating the indexes on the query's tables regularly.



  • If you can use the following query regularly you can track when the statistics are updated:



SELECT name AS index_name
,STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated
FROM sys.indexes
WHERE OBJECT_ID = OBJECT_ID('SchemaName.TableName')


If you are running SQL Server 2008 R2 Service Pack 1 or later, you have the option to enable documented trace flag 2371 to dynamically update statistics more frequently:


The higher the number of rows in a table, the lower the threshold will become to trigger an update of the statistics. For example, if the trace flag is activated, update statistics will be triggered on a table with 1 billion rows when 1 million changes occur. If the trace flag is not activated, then the same table with 1 billion records would need 200 million changes before an update statistics is triggered.

The new behaviour is enabled by default (i.e. without the trace flag) from SQL Server 2016 onward, for databases at compatibility level >= 130.

Context

StackExchange Database Administrators Q#157507, answer score: 3

Revisions (0)

No revisions yet.