debugMinor
How to recover MariaDB Galera Cluster after full crash?
Viewed 0 times
afterfullcrashrecoverhowgaleramariadbcluster
Problem
I crashed all my 3 nodes. After all nodes have been started I noticed that mariadb is dead. An I couldn't run it again.
I am using CentOS 7 on all servers
I tried to start first node and then others but without success.
First of all I tried to find the newest seqno as documentation says. So I looked in this file on all 3 nodes:
Ok. As all nodes are identical I can run any node as a new one and add another nodes to it. I used next command:
And it didn't work. Node didn't start.
Here is what I got:
```
-- Unit mariadb.service has begun starting up.
Dec 07 18:20:55 GlusterDC1_1 sh[4298]: 2016-12-07 18:20:55 139806456780992 [Note] /usr/sbin/mysqld (mysqld 10.1.19-MariaDB) starting as process 4332 ...
Dec 07 18:20:58 GlusterDC1_1 sh[4298]: WSREP: Recovered position ec3e180d-bbff-11e6-b989-3273ac13ba57:83
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] /usr/sbin/mysqld (mysqld 10.1.19-MariaDB) starting as process 4364 ...
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: Read nil XID from storage engines, skipping position init
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: wsrep_load(): Galera 25.3.18(r3632) by Codership Oy loaded successfully.
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: CRC-32C: using hardware acceleration.
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: Found saved state: ec3e180d-
I am using CentOS 7 on all servers
I tried to start first node and then others but without success.
First of all I tried to find the newest seqno as documentation says. So I looked in this file on all 3 nodes:
/var/lib/mysql/grastate.dat and noticed that content is identical on all 3 nodes (uuid is the same and seqno is the same)! Here is this file:# GALERA saved state
version: 2.1
uuid: ec3e180d-bbff-11e6-b989-3273ac13ba57
seqno: -1
cert_index:Ok. As all nodes are identical I can run any node as a new one and add another nodes to it. I used next command:
galera_new_clusterAnd it didn't work. Node didn't start.
Here is what I got:
```
-- Unit mariadb.service has begun starting up.
Dec 07 18:20:55 GlusterDC1_1 sh[4298]: 2016-12-07 18:20:55 139806456780992 [Note] /usr/sbin/mysqld (mysqld 10.1.19-MariaDB) starting as process 4332 ...
Dec 07 18:20:58 GlusterDC1_1 sh[4298]: WSREP: Recovered position ec3e180d-bbff-11e6-b989-3273ac13ba57:83
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] /usr/sbin/mysqld (mysqld 10.1.19-MariaDB) starting as process 4364 ...
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: Read nil XID from storage engines, skipping position init
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: wsrep_load(): Galera 25.3.18(r3632) by Codership Oy loaded successfully.
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: CRC-32C: using hardware acceleration.
Dec 07 18:20:58 GlusterDC1_1 mysqld[4364]: 2016-12-07 18:20:58 139830894778560 [Note] WSREP: Found saved state: ec3e180d-
Solution
Pre-Recovery Settings:
Crash Recovery Steps:
-
Find a valid seqno. Look at the grastate.dat file on each server to see which machine has the most current data. The node with the biggest seqno is the node with the current data.
-
Next, look at three grastate.dat files.
a) Node0: This grastate.dat shows a graceful shutdown. Note the seqno. We are looking for the node with the largest seqno.
b) Node1: This grastate.dat file shows -1 in the seqno. This node crashed during transaction processing. Start this node using the wsrep-recover option. MySQL will store the last committed GTID in the InnoDB data header.
c) Node2: This grastate.dat file has no seqno or group ID. This node crashed during DDL.
/path/to/mysql/bin/mysqld --wsrep-recover. Mysqld will read the InnoDB header files and shutdown immediately. The last wsrep position is printed in mysqld.log file.
Example:
140716 12:55:45 [Note] WSREP: Found saved state: cbd332a9- f617-11e2-b77d-3ee9fa637069:36742
-
Look at the seqno from Node0 (seqno: 43760) and Node1 (seqno: -1). Node0 has the current snapshot of data and should be started first.
-
On Node0, issue this command to start the node:
a) nohup /path/to/mysql/bin/mysqld_safe -- wsrep_cluster_address=gcomm:// &; wait for this node to come online.
b) Then start Node1, and Node2. These two nodes should be started one at a time, and can be started as you normally would.
c) Once all three nodes are up and in a primary state, restart Node0 in the normal fashion (so it comes up as part of the entire cluster, not just a bootstrap).
- Ensure that the MYSQL_HOME path is being exported in the .profile. If the MySQL install is in a different location, then make that change to the MYSQL_HOME.(Example: MYSQL_HOME=/path/to/mysql)
Crash Recovery Steps:
-
Find a valid seqno. Look at the grastate.dat file on each server to see which machine has the most current data. The node with the biggest seqno is the node with the current data.
-
Next, look at three grastate.dat files.
a) Node0: This grastate.dat shows a graceful shutdown. Note the seqno. We are looking for the node with the largest seqno.
/var/lib/mysql/grastate.dat
version: 2.1
uuid: cbd332a9-f617-11e2-b77d-3ee9fa637069
seqno: 43760b) Node1: This grastate.dat file shows -1 in the seqno. This node crashed during transaction processing. Start this node using the wsrep-recover option. MySQL will store the last committed GTID in the InnoDB data header.
/var/lib/mysql/grastate.dat
version: 2.1
uuid: cbd332a9-f617-11e2-b77d-3ee9fa637069
seqno: -1c) Node2: This grastate.dat file has no seqno or group ID. This node crashed during DDL.
/var/lib/mysql/grastate.dat
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1- Next, recover the node with the uuid, but no seqno. To get the seqno, use the --wsrep-recover option. To recover the seqno:
/path/to/mysql/bin/mysqld --wsrep-recover. Mysqld will read the InnoDB header files and shutdown immediately. The last wsrep position is printed in mysqld.log file.
Example:
140716 12:55:45 [Note] WSREP: Found saved state: cbd332a9- f617-11e2-b77d-3ee9fa637069:36742
-
Look at the seqno from Node0 (seqno: 43760) and Node1 (seqno: -1). Node0 has the current snapshot of data and should be started first.
-
On Node0, issue this command to start the node:
a) nohup /path/to/mysql/bin/mysqld_safe -- wsrep_cluster_address=gcomm:// &; wait for this node to come online.
b) Then start Node1, and Node2. These two nodes should be started one at a time, and can be started as you normally would.
c) Once all three nodes are up and in a primary state, restart Node0 in the normal fashion (so it comes up as part of the entire cluster, not just a bootstrap).
- If Node1 or Node2 had the highest seqno, then that Node would be booted as the bootstrap, and you would allow the remaining nodes to start up one at a time (connecting to the Node with the highest seqno).
Code Snippets
/var/lib/mysql/grastate.dat
version: 2.1
uuid: cbd332a9-f617-11e2-b77d-3ee9fa637069
seqno: 43760/var/lib/mysql/grastate.dat
version: 2.1
uuid: cbd332a9-f617-11e2-b77d-3ee9fa637069
seqno: -1/var/lib/mysql/grastate.dat
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1Context
StackExchange Database Administrators Q#157500, answer score: 8
Revisions (0)
No revisions yet.