HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

MongoDB Replica Set SECONDARY Stuck in `ROLLBACK` State

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rollbackreplicastuckmongodbsecondarystateset

Problem

During a recent automated update of our mongodb PRIMARY, when the PRIMARY stepped down it permanently went into a ROLLBACK state.

After several hours in the ROLLBACK state, there was still no rollback .bson file in the rollback directory in the mongodb database directory. That, and also this line in our log file: [rsSync] replSet syncThread: 13410 replSet too much data to roll back, seems to indicate that the ROLLBACK process failed.

I would like some help analyzing what exactly went wrong.

  • It appears two different rollbacks occurred in our logs. Is that the case or was it one that took 3 hours?



  • If the first rollback (at 19:00 hours) was successful, why didnt anything appears in ou rollback directory?



  • Any guess as to the cause of all those warnings? Could that be related to the rollback failure?



  • Did we lose 18 seconds of data due to the first ROLLBACK?



  • Is there a generic solution to the "stuck in ROLLBACK state" problem? We ended up having to hose our entire DB and re-sync from primary.



The relevant log lines are:

```
# Primary coming back after restart...
Tue May 15 19:01:01 [initandlisten] MongoDB starting : pid=3684 port=27017 dbpath=/var/lib/mongodb 64-bit host=magnesium
Tue May 15 19:01:01 [initandlisten] db version v2.0.5, pdfile version 4.5
# ... init stuff
Tue May 15 19:01:01 [initandlisten] journal dir=/var/lib/mongodb/journal
Tue May 15 19:01:01 [initandlisten] recover : no journal files present, no recovery needed
# ... More init stuff
Tue May 15 19:01:03 [rsStart] trying to contact rs1arb1.c9w.co:27017
Tue May 15 19:01:03 [rsStart] trying to contact rs1m2.c9w.co:27017
Tue May 15 19:01:03 [rsStart] replSet STARTUP2
Tue May 15 19:01:03 [rsHealthPoll] replSet member rs1arb1.c9w.co:27017 is up
Tue May 15 19:01:03 [rsHealthPoll] replSet member rs1arb1.c9w.co:27017 is now in state ARBITER
Tue May 15 19:01:03 [rsSync] replSet SECONDARY
Tue May 15 19:01:05 [rsHealthPoll] replSet member rs1m2.c9w.co:27017 is up
Tue May 1

Solution

When a MongoDB instance gets into a Rollback state, and the rollback data is greater than 300MB of data, you have to manually intervene. It will stay in a rollback state until you take action to save/remove/move that data, the (now secondary) should then be resynced to bring it back in line with the primary. This does not have to be a full resync, but that is the simplest way.

Multiple rollbacks are a symptom rather than the cause of a problem. Rollback only happens when a secondary that was not in sync (either due to lag or an issue with replication) becomes primary and takes writes. So, the problems that cause that to happen in the first place are what need to be taken care of - the rollback itself is something you need to deal with as an admin - there are too many potential pitfalls for MongoDB to reconcile the data automatically.

If you want to simulate this again for testing purposes, I have outlined how to do so here:

http://comerford.cc/2012/05/28/simulating-rollback-on-mongodb/

Eventually, this data will be stored in a collection (in the local DB) rather than dumped to disk, which will present opportunities to deal with it more effectively:

https://jira.mongodb.org/browse/SERVER-4375

At the moment though, once a rollback occurs, as you found, manual intervention is required.

Finally, the manual contains similar information to Kristina's blog now:

https://docs.mongodb.com/manual/core/replica-set-rollbacks

Context

StackExchange Database Administrators Q#18020, answer score: 7

Revisions (0)

No revisions yet.