Patroni: How to handle a replica that has been disconnected from the primary for a long time?
Problem
Let's say I am using asynchronous streaming replication in a 3-node cluster with PostgreSQL 10.4 and Patroni 1.4.4, with the configuration below:
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_wal_senders: 10
        wal_keep_segments: 100
        max_replication_slots: 10

Let's assume that one of the replica nodes suddenly loses its connection to the primary for a long time.
- In this case, I think the WAL on the primary will keep growing, because the primary must retain it for the disconnected replica's replication slot. Is there a setting in the Patroni configuration that removes the replica and drops its replication slot once it has been disconnected from the primary for some duration x?
- What is the recommended way to handle this case?
Solution
I would assume you are monitoring your DB cluster health, so a missing replica would show up very quickly. It is also a must to monitor disk space (running out of it can put you in a situation that is not easy to recover from), so that would catch this as well (usually later rather than sooner).
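As a rough illustration of the disk-space check, here is a minimal Python sketch. The function names, the 80% threshold, and the path are assumptions made for the example, not anything Patroni provides; in practice you would point it at the filesystem that holds the primary's WAL directory and feed the result into whatever alerting you already use.

```python
import shutil


def used_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_attention(fraction, warn_at=0.80):
    """Return True once usage reaches the (hypothetical) warning threshold."""
    return fraction >= warn_at


if __name__ == "__main__":
    # "." is a placeholder; use the mount point of the WAL directory instead.
    fraction = used_fraction(".")
    print("disk usage: {:.1%}, alert: {}".format(fraction, needs_attention(fraction)))
```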
Once you discover that a replica has fallen behind, you have to investigate why and fix it, or remove the host from Patroni altogether. If you are under disk space pressure, remove the replication slot to free up WAL space. In a cloud setup, simply terminating the host often solves all of this by bringing up a new one. In any case, once you have a functioning host, you might have to reinit the Patroni node.
On the other hand, I'm afraid there is currently no mechanism for fencing off replicas that don't appear to come back (whether that means simply removing the replication slot or anything more complex than that).
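Since nothing built in does this, one option is a small external check that flags slots worth dropping. The sketch below is only that: the function name, the threshold, and the slot names in the usage example are hypothetical. The two SQL strings use `pg_replication_slots`, `pg_wal_lsn_diff`, `pg_current_wal_lsn` and `pg_drop_replication_slot`, which exist in PostgreSQL 10; running them through a database driver is left out so the selection logic stays self-contained.

```python
# Bytes of WAL each slot forces the primary to retain (PostgreSQL 10+ names).
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""

# Parameterised statement to run for each slot you decide to drop.
DROP_SLOT = "SELECT pg_drop_replication_slot(%s)"


def stale_slots(rows, max_retained_bytes):
    """Return names of inactive slots retaining more than `max_retained_bytes`.

    `rows` is an iterable of (slot_name, active, retained_bytes) tuples,
    i.e. the shape SLOT_QUERY returns.
    """
    stale = []
    for slot_name, active, retained_bytes in rows:
        if retained_bytes is None:
            continue  # restart_lsn is NULL: the slot has not reserved any WAL
        if not active and retained_bytes > max_retained_bytes:
            stale.append(slot_name)
    return stale


if __name__ == "__main__":
    # Made-up rows standing in for the query result.
    rows = [("node2", True, 1024), ("node3", False, 5 * 1024 ** 3)]
    print(stale_slots(rows, max_retained_bytes=1024 ** 3))
```

Only inactive slots are selected, because an active slot belongs to a replica that is still streaming. Dropping a slot means the replica can no longer resume from it, which is why a reinit may be needed afterwards.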
Context
StackExchange Database Administrators Q#209281, answer score: 2