patternsqlMinor
SQL Server 2017 with 500 databases - Frequent AG disconnects since CU9
Viewed 0 times
databasessqlwithsince500disconnectsservercu92017frequent
Problem
Hi everyone and thanks in advance for your help.
We are experiencing challenges with SQL Server 2017 Availability Groups.
Background
Company is a retail B2B back-end software. About 500 single tenant databases, and 5 shared databases used by all tenants. Workload characteristic is read mostly, and the majority of databases have very low activity.
Physical production servers hosted at co-location were recently upgraded from SQL Server 2014 Enterprise on Windows Server 2012 in a shared SAN / FCI configuration, to SQL Server 2017 Enterprise on Windows Server 2016 on a 2 socket / 32 core / 768 GB RAM and local SSD drives using AlwaysOn AG. AG traffic uses dedicated 10G NIC ports with a crossed cable connection.
Their requirement is for all databases to failover together, so they had to put them all in a single AG. It's a single, non-readable synchronous replica on an identical server.
The new servers have been in production since June 2018. Latest CU (CU7 at the time) and windows updates were installed, and system was working well. About a month later, after updating the servers from CU7 to CU9, they started noticing the following challenges, listed in order of priority.
We have been monitoring the servers using SQL Sentry and observed no physical bottlenecks. All key indicators seem good. CPU is averaging 20%, IO times typically less than 1ms, RAM not fully utilized, and network
-
Transaction log backup job takes 10X longer to complete than before. Previously it took 2 - 3 minutes to back up the logs of all 500 databases, now it takes 15-25. We have verified that that Backup itself runs fine with good throughput. However, there is a small delay after completing the backup of one log, and before starting the next. it starts off very low, but within a day or two gets to 2-3 seconds. Multiplied by 500 databases, and there is the difference.
-
Occasionally, some seemingly random databases get stuck in "Not synchronizing" state after manual failover. The only wa
We are experiencing challenges with SQL Server 2017 Availability Groups.
Background
Company is a retail B2B back-end software. About 500 single tenant databases, and 5 shared databases used by all tenants. Workload characteristic is read mostly, and the majority of databases have very low activity.
Physical production servers hosted at co-location were recently upgraded from SQL Server 2014 Enterprise on Windows Server 2012 in a shared SAN / FCI configuration, to SQL Server 2017 Enterprise on Windows Server 2016 on a 2 socket / 32 core / 768 GB RAM and local SSD drives using AlwaysOn AG. AG traffic uses dedicated 10G NIC ports with a crossed cable connection.
Their requirement is for all databases to failover together, so they had to put them all in a single AG. It's a single, non-readable synchronous replica on an identical server.
The new servers have been in production since June 2018. Latest CU (CU7 at the time) and windows updates were installed, and system was working well. About a month later, after updating the servers from CU7 to CU9, they started noticing the following challenges, listed in order of priority.
We have been monitoring the servers using SQL Sentry and observed no physical bottlenecks. All key indicators seem good. CPU is averaging 20%, IO times typically less than 1ms, RAM not fully utilized, and network
-
Transaction log backup job takes 10X longer to complete than before. Previously it took 2 - 3 minutes to back up the logs of all 500 databases, now it takes 15-25. We have verified that that Backup itself runs fine with good throughput. However, there is a small delay after completing the backup of one log, and before starting the next. it starts off very low, but within a day or two gets to 2-3 seconds. Multiplied by 500 databases, and there is the difference.
-
Occasionally, some seemingly random databases get stuck in "Not synchronizing" state after manual failover. The only wa
Solution
Update:
-
The blocking issues on the secondary replica were confirmed to be an issue with an update to the VSS writer code that was introduced in CU10.
Hopefully it will be resolved in CU 13.
The interim solution is to manually replace the VSS writer DLLs with the Pre-CU10 DLLs...
Unfortunately, Microsoft seem to be repeatedly failing to properly QA not only Windows 10 updates, but enterprise mission critical software such as SQL Server as well.
I much preferred their previous strategy of service packs, at least they had enough time to test them properly before inflicting production crisis and data loss to their customers with careless release of half baked updates.
- The Frequent Availability Group disconnects were confirmed to be a regression that was introduced by CU9 and they were resolved after installing CU12.
-
The blocking issues on the secondary replica were confirmed to be an issue with an update to the VSS writer code that was introduced in CU10.
Hopefully it will be resolved in CU 13.
The interim solution is to manually replace the VSS writer DLLs with the Pre-CU10 DLLs...
BEGIN RANT-SACTION;Unfortunately, Microsoft seem to be repeatedly failing to properly QA not only Windows 10 updates, but enterprise mission critical software such as SQL Server as well.
I much preferred their previous strategy of service packs, at least they had enough time to test them properly before inflicting production crisis and data loss to their customers with careless release of half baked updates.
COMMIT RANT-SACTION;Code Snippets
BEGIN RANT-SACTION;COMMIT RANT-SACTION;Context
StackExchange Database Administrators Q#220283, answer score: 9
Revisions (0)
No revisions yet.