patternsqlMinor
Availability Group database stays in Reverting for too long after a failover
Viewed 0 times
afterstaysgrouplongtoodatabaseforrevertingavailabilityfailover
Problem
Architecture:
I have 2 Node Sync-Commit AlwaysOn configuration running on Multi-Subnet Failover Cluster. Primary node is in Europe and Secondary node is in US. I have only one database in the Availability Group which is OperationsManager db of SCOM.
Problem:
When I initiate a failover, database fails over from Primary to Secondary node successfully in seconds. However, new secondary(old primary) database goes into Reverting / In Recovery phase and stays there for 30 minutes approximately.I also experienced same thing while failing back to original primary box so it's a problem which occurs both ways.
Findings:
I searched about this on internet and read documentation to investigate the issue. When role change from Primary to Secondary is finished, new secondary database goes through 3 phases:
Synchronization State: “NOT SYNCHRONIZING” ; Database State: ONLINE
Synchronization State: “NOT SYNCHRONIZING” ; Database State: RECOVERING
Synchronization State: “REVERTING” ; Database State: RECOVERING
In my case, all time has spent on last step. I also monitored the undo process by looking into perfmon counter "SQLServer:Database Replica Log remaining for undo"
I checked primary site before failover tests to spot any long running transactions or open transactions but couldn't find one. After failover, "Log remaining for undo" was around 30MB and it took 30 minutes for secondary database to go back to "Synchronized" state. When taking into consideration that we are running in Sync-Commit mode and there is a little workload on primary, it shouldn't take 30 minutes for redo phase imho.
SQL Server Error Log: I found this strange messages.
-
Remote harden of transaction 'RECEIVE MSG' (ID 0x000000004d52c65a 0001:01c4e415) started at Feb 22 2019 2:55PM in database 'Oper
I have 2 Node Sync-Commit AlwaysOn configuration running on Multi-Subnet Failover Cluster. Primary node is in Europe and Secondary node is in US. I have only one database in the Availability Group which is OperationsManager db of SCOM.
- Primary and Secondary Hosts are identical.
- SQL Server Version on both boxes: 13.0.5237 & Windows Core
- Update: I patched both servers to 10.0.5270.0 , it didn't help.
- DB VLF count is only 27.
Problem:
When I initiate a failover, database fails over from Primary to Secondary node successfully in seconds. However, new secondary(old primary) database goes into Reverting / In Recovery phase and stays there for 30 minutes approximately.I also experienced same thing while failing back to original primary box so it's a problem which occurs both ways.
Findings:
I searched about this on internet and read documentation to investigate the issue. When role change from Primary to Secondary is finished, new secondary database goes through 3 phases:
Synchronization State: “NOT SYNCHRONIZING” ; Database State: ONLINE
Synchronization State: “NOT SYNCHRONIZING” ; Database State: RECOVERING
Synchronization State: “REVERTING” ; Database State: RECOVERING
In my case, all time has spent on last step. I also monitored the undo process by looking into perfmon counter "SQLServer:Database Replica Log remaining for undo"
I checked primary site before failover tests to spot any long running transactions or open transactions but couldn't find one. After failover, "Log remaining for undo" was around 30MB and it took 30 minutes for secondary database to go back to "Synchronized" state. When taking into consideration that we are running in Sync-Commit mode and there is a little workload on primary, it shouldn't take 30 minutes for redo phase imho.
SQL Server Error Log: I found this strange messages.
-
Remote harden of transaction 'RECEIVE MSG' (ID 0x000000004d52c65a 0001:01c4e415) started at Feb 22 2019 2:55PM in database 'Oper
Solution
One thing to check whenever database recovery is running long, whether it's a normal restore or an AG failover, is your VLF count. Having lots of VLFs (thousands or tens of thousands), or VLFs of unusual size (one or two extremely large VLFs) will cause this process to slow to a crawl.
Run the following command on your the database in question:
Note: if you're on SQL Server 2016 SP2 or newer, you can use this dynamic management function instead of the DBCC command:
The number of rows that comes back is the number of VLFs you have. If that number is very large, or if the
The details of fixing VLF sizing issues have been covered extensively elsewhere, here is one example: A Busy/Accidental DBA’s Guide to Managing VLFs
Run the following command on your the database in question:
USE YourDatabaseName;
GO
DBCC LOGINFO;Note: if you're on SQL Server 2016 SP2 or newer, you can use this dynamic management function instead of the DBCC command:
sys.dm_db_log_infoThe number of rows that comes back is the number of VLFs you have. If that number is very large, or if the
FileSize column shows extreme outliers among your VLFs, then you can likely resolve the slow recovery problem by (at a high level):- shrinking the log file as small as possible
- growing it back out to its target size
- making sure autogrowth is set to a reasonable number based on your typical log growth rate and the frequency of your transaction log backups
The details of fixing VLF sizing issues have been covered extensively elsewhere, here is one example: A Busy/Accidental DBA’s Guide to Managing VLFs
Code Snippets
USE YourDatabaseName;
GO
DBCC LOGINFO;Context
StackExchange Database Administrators Q#230663, answer score: 3
Revisions (0)
No revisions yet.