HiveBrain v1.2.0

Long Running Query on Read-Only Replica that takes moments on the Primary

Submitted by: @import:stackexchange-dba

Problem

I've got a 4 node AG setup as follows:

VM Hardware Configuration of all nodes:

  • Microsoft SQL Server 2017 Enterprise Edition (RTM-CU14) (KB4484710)
  • 16 vCPUs
  • 356 GB RAM (long story to this one...)
  • max degree of parallelism: 1 (as required by app vendor)
  • cost threshold for parallelism: 50
  • max server memory (MB): 338944 (331 GB)

AG Configuration:

  • Node 1: Primary or Synchronous Commit Non-readable Secondary, Configured for Automatic Failover
  • Node 2: Primary or Synchronous Commit Non-readable Secondary, Configured for Automatic Failover
  • Node 3: Readable Secondary set with Asynchronous Commit, Configured for Manual Failover
  • Node 4: Readable Secondary set with Asynchronous Commit, Configured for Manual Failover

The Query In Question:

There's nothing terribly crazy about this query; it provides a summary of outstanding work items in various queues within the application. You can see the code from one of the execution plan links below.

Execution Behavior on the Primary Node:

When executed on the Primary node, the execution time is generally around the 1 second mark. Here is the execution plan, and below are stats captured with SET STATISTICS IO and SET STATISTICS TIME on the primary node:

```
(347 rows affected)
Table 'Worktable'. Scan count 647, logical reads 2491, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'workitemlc'. Scan count 300, logical reads 7125, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'schedulertask'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'wfschedulertask'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical read
```

Solution

This answer is in addition to Joe's answer: I can't be 100% certain it is the version store, but there is enough evidence so far to suggest it is part of the issue.

When a secondary replica is marked as readable, a good steady state for versioning information first needs to be attained so that there is a known, good starting point for all read operations on the secondary. While this transition is pending and there are still open transactions on the primary, it manifests as the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING wait. It is also a good indicator that the primary goes through quite a bit of data churn (or at least that someone has a really long open transaction, which also isn't good). The longer transactions stay open and the more data changes there are, the more versioning occurs.
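As a sketch of how you might confirm this on the readable secondary (wait-stats counters are cumulative since the last restart or manual clear):

```
-- Check whether the replica has accumulated the transition-to-versioning wait
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING';
```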

Secondary replicas achieve readable status by using snapshot isolation under the covers for the session, even though, if you check your session information, you'll see it show up as the default read committed. Since snapshot isolation is optimistic and uses the version store, all changes need to be versioned. This is exacerbated when there are many running (and potentially long-running) queries on the secondary while data churn on the primary is high. Generally this manifests in only a few tables for an OLTP system, but it's completely application- and workload-dependent.
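You can see this for yourself with a quick sanity check (nothing here is specific to this environment):

```
-- Reports 2 (Read Committed) for the current session, even though reads on a
-- readable secondary are internally served with snapshot-based row versioning
SELECT session_id,
       transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;
```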

The version store itself is measured in generations, and when a query is run which requires the use of the version store, the versioning record pointer is used to point to the TempDB chain for that row. I say chain because it's a list of versions for that row, and the entire chain must be walked sequentially to find the proper version based on the starting timestamp of the transaction, so that the results are in line with the data at that given time.

If the version store has many generations for these rows due to long-running transactions on the primary and secondary replicas, queries will take longer than average to run, generally in the form of higher CPU, while all other items seemingly stay exactly the same - execution plan, statistics, rows returned, etc. Walking the chain is almost a purely CPU-bound operation, so when the chains become very long and the number of rows returned is very high, you get a (not linear, but can be close to it) increase in query time.

The only thing that can be done is to limit the length of transactions on both the primary and the secondary, to make sure the version store in TempDB isn't growing too large while accumulating many generations. Version store cleanup is attempted roughly once a minute; however, cleanup requires that all versions from the same generation no longer be needed before they can be removed, and later versions can't be cleaned up until the oldest version is no longer needed. Thus, a single long-running query can prevent the effective cleanup of many unused generations.
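To hunt down the offending transactions, a sketch along these lines can be run on both the primary and the secondary (the 60-second threshold is an arbitrary example):

```
-- Sessions with transactions open longer than 60 seconds (threshold is illustrative)
SELECT st.session_id,
       dt.transaction_id,
       dt.database_transaction_begin_time,
       DATEDIFF(SECOND, dt.database_transaction_begin_time, GETDATE()) AS open_seconds
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
  ON st.transaction_id = dt.transaction_id
WHERE dt.database_transaction_begin_time IS NOT NULL
  AND DATEDIFF(SECOND, dt.database_transaction_begin_time, GETDATE()) > 60
ORDER BY dt.database_transaction_begin_time;
```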

Switching the replica in and out of readable mode will also clear out the version store as it is no longer readable.
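Readability is a replica-level setting, so this toggle looks roughly as follows (the availability group and replica names are placeholders for your own):

```
-- Make the secondary non-readable (ends its use of the version store)
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'Node3'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));

-- Re-enable read-only access afterwards
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'Node3'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));
```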

There are other items that could also be at play, but this seems the most plausible given the current data and way the replica was reacting.

TempDB Versioning DMVs (not to be confused with ADR versioning).
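For example, the version store footprint per database can be checked with a query like this (sys.dm_tran_version_store_space_usage is available in SQL Server 2017, matching this environment; integer division keeps the MB figure approximate):

```
-- Per-database version store space reserved in TempDB
SELECT DB_NAME(database_id) AS database_name,
       reserved_page_count,
       reserved_space_kb / 1024 AS reserved_space_mb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;
```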

Context

StackExchange Database Administrators Q#237671, answer score: 11

Revisions (0)

No revisions yet.