patternMinor
Change tracking causing latch contention
Viewed 0 times
trackingcontentionlatchcausingchange
Problem
I am using Microsoft SQL Server 2016 (SP2-GDR) (KB4505220) - 13.0.5101.9 (X64) Jun 15 2019 23:15:58 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2012 R2 Standard 6.3 (Build 9600: )
The database is about 870 gb in size. It's SQL Standard and I have 128 gb of RAM on the server. The database is on SSD drives. The Data file is on a different drive than the log file and Tempdb has it's own SSD drive as well. The Server gets about 1200 queries/seconds in average, it can go as high as 2000 queries/second. Recompilation stays low, only 1 to 8 per second. Page life expectancy is good, averaging 61 minutes.
The Server has 6 physical cores + hyper threads.
We are using SQL Server's Change tracking a lot on a system where thousands of device connect and try to synchronize changes with the tracking key.
It usually runs fine, but then, from time to time, one day or another, the server's latches will skyrocket, going from 0ms to an average of 60677ms.
When I check what queries are running, all I can see are the synchronization queries, all blocked, with "PAGELATCH_UP", all trying to access the change tracking tables, more than 300 queries blocked.
I have a few questions:
Those are my change tracking table size. The tables that my query blocks on are the three first ones, only a few mb of data.
They are all waiting on the same waitresource.
Waitresource 2:4:88968 is in the tempdb.
But tempdb is only responsible for about 9% of the writes of the server and 6% of the reads.
But my query doesn't use the tempdb so I guess that it's like that because of the internal way change tracking works? Here is my query
```
DECLARE @Id INT; SET @Id = (SELECT Id
The database is about 870 gb in size. It's SQL Standard and I have 128 gb of RAM on the server. The database is on SSD drives. The Data file is on a different drive than the log file and Tempdb has it's own SSD drive as well. The Server gets about 1200 queries/seconds in average, it can go as high as 2000 queries/second. Recompilation stays low, only 1 to 8 per second. Page life expectancy is good, averaging 61 minutes.
The Server has 6 physical cores + hyper threads.
We are using SQL Server's Change tracking a lot on a system where thousands of device connect and try to synchronize changes with the tracking key.
It usually runs fine, but then, from time to time, one day or another, the server's latches will skyrocket, going from 0ms to an average of 60677ms.
When I check what queries are running, all I can see are the synchronization queries, all blocked, with "PAGELATCH_UP", all trying to access the change tracking tables, more than 300 queries blocked.
I have a few questions:
- Does SQL Server locks the whole table while looking up change tracking changes?
- Would I have better results with SQL Entreprise or it doesn't change anything?
- Any idea why does the change tracking works well most of the time but crashes without apparent reason from week to week?
Those are my change tracking table size. The tables that my query blocks on are the three first ones, only a few mb of data.
They are all waiting on the same waitresource.
Waitresource 2:4:88968 is in the tempdb.
But tempdb is only responsible for about 9% of the writes of the server and 6% of the reads.
But my query doesn't use the tempdb so I guess that it's like that because of the internal way change tracking works? Here is my query
```
DECLARE @Id INT; SET @Id = (SELECT Id
Solution
I'm not sure how to deal with the bad plan issue, maybe someone will come along with better query tuning skills to help out with that.
However, to speak to the tempdb contention issue, the page that all these sessions are fighting over is a PFS page. These are defined in the documentation as:
Page Free Space (PFS) pages record the allocation status of each page, whether an individual page has been allocated, and the amount of free space on each page.
Note that they only track free space for certain kinds of pages:
The amount of free space in a page is only maintained for heap and Text/Image pages.
The "Text/Image" comment there includes more modern LOB data types (
By the way, you can tell it's a PFS page because it's divisible evenly by 8088:
Thus, there is a new PFS page 8,088 pages after the first PFS page, and additional PFS pages in subsequent 8,088 page intervals.
All of that is to say that the "bad plan" is probably spilling to tempdb (there are two sorts and one hash join), which is causing contention over that particular PFS page. There are also some eager index spools that are writing to tempdb.
You may be able to mitigate this problem by installing SP2 CU1, which includes this fix:
This fix improves PFS page allocation by performing round-robin allocations across several PFS pages in the same data file in addition to the round-robin allocation across all tempdb data files. Therefore, the more PFS data pages that are contained in a data file and the more data files there are, the better the allocation distribution.
Thus even with the "bad plan," you would hopefully have less latch contention slowing things down.
However, to speak to the tempdb contention issue, the page that all these sessions are fighting over is a PFS page. These are defined in the documentation as:
Page Free Space (PFS) pages record the allocation status of each page, whether an individual page has been allocated, and the amount of free space on each page.
Note that they only track free space for certain kinds of pages:
The amount of free space in a page is only maintained for heap and Text/Image pages.
The "Text/Image" comment there includes more modern LOB data types (
nvarchar(max) and the like).By the way, you can tell it's a PFS page because it's divisible evenly by 8088:
Thus, there is a new PFS page 8,088 pages after the first PFS page, and additional PFS pages in subsequent 8,088 page intervals.
All of that is to say that the "bad plan" is probably spilling to tempdb (there are two sorts and one hash join), which is causing contention over that particular PFS page. There are also some eager index spools that are writing to tempdb.
You may be able to mitigate this problem by installing SP2 CU1, which includes this fix:
This fix improves PFS page allocation by performing round-robin allocations across several PFS pages in the same data file in addition to the round-robin allocation across all tempdb data files. Therefore, the more PFS data pages that are contained in a data file and the more data files there are, the better the allocation distribution.
Thus even with the "bad plan," you would hopefully have less latch contention slowing things down.
Context
StackExchange Database Administrators Q#257970, answer score: 5
Revisions (0)
No revisions yet.