HiveBrain v1.2.0
Get Started
← Back to all entries
snippet · docker · Minor

How to manage Consul and its quorum in auto-scaling environments?

Submitted by: @import:stackexchange-devops
Tags: consul, autoscaling, quorum

Problem

We have auto-scaling Docker environments in which we use Consul for service discovery. These environments can add or remove one instance every few minutes.

Our early Consul testing showed that it was very easy for Consul to lose its quorum. Perhaps naively, our very first experiment was a setup in which we started a Consul server on every instance and had that server join the cluster. That part worked fine.
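For reference, that "server on every instance" experiment amounts to a configuration like the following. This is a hypothetical sketch, not the poster's actual config: the join address `consul.internal.example` and the file path are placeholders, and `bootstrap_expect` is an assumed value.

```shell
# Hypothetical sketch of the "Consul server on every instance" setup:
# every node runs in server mode and retries joining the cluster on boot.
cat > /tmp/consul-server.json <<'EOF'
{
  "server": true,
  "retry_join": ["consul.internal.example"],
  "bootstrap_expect": 3
}
EOF
# Each instance would then start the agent with something like:
#   consul agent -config-file=/tmp/consul-server.json
echo "wrote /tmp/consul-server.json"
```

With auto-scaling, every scale-out event adds a voting server under this scheme, which is exactly why the unreachable-server list grows as described below.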

However, Consul does not reap unreachable nodes quickly (it takes about 72 hours). In a highly elastic environment, that means the list of Consul servers keeps growing; over time most of them become "unreachable", and at that point the cluster loses its quorum.
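That 72-hour window corresponds to Consul's `reconnect_timeout` setting, which controls how long a failed node lingers before being reaped. A minimal sketch of lowering it (the file path is arbitrary; Consul documents a default of 72h and a minimum of 8h for this option):

```shell
# Hypothetical sketch: shorten the reap window for failed nodes.
# reconnect_timeout defaults to 72h; Consul enforces a minimum of 8h.
cat > /tmp/consul-reap.json <<'EOF'
{
  "reconnect_timeout": "8h"
}
EOF
echo "wrote /tmp/consul-reap.json"
```

This shrinks the window but does not remove it, so on its own it only delays the quorum problem rather than solving it.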

We've seen armon's response from almost two years ago on this issue on GitHub: https://github.com/hashicorp/consul/issues/454#issuecomment-125767550


Most of these problems are caused by our default behavior of
attempting a graceful leave. Our mental model is that servers are long
lived and don't shutdown for any reason other than unexpected power
loss, or a graceful maintenance in which case you need to leave the
cluster. In retrospect that was a bad default. Almost all of this can
be avoided by just kill -9 the Consul server, in effect simulating
power loss.
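The distinction the quote relies on is between SIGTERM, which a process can trap to run a graceful leave, and SIGKILL (`kill -9`), which cannot be caught, so the process dies without running any shutdown logic. A small shell illustration, using a `sleep` process as a stand-in for the Consul server:

```shell
# SIGKILL cannot be caught or handled, so the process gets no chance to
# run shutdown logic -- from the cluster's point of view it simply failed,
# just like a power loss.
sleep 300 &
pid=$!
kill -9 "$pid"
wait "$pid" 2>/dev/null
status=$?
echo "exit status: $status"   # 137 = 128 + 9 (killed by SIGKILL)
```

A server killed this way shows up as failed rather than left, so the cluster keeps its seat open, expecting it to return.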

We were trying to avoid running dedicated, long-lived nodes. Keep in mind that at no point do we remove N/2+1 instances from an auto-scaling group. The EC2 cluster can reach most of the nodes at any point in time, and should be able to vote on whether a node should be removed from the Consul (or other tool) cluster.

Solution

I would set the leave_on_terminate option to true. As per the documentation:


leave_on_terminate: If enabled, when the agent receives a TERM signal, it will send a Leave message to the rest of the cluster and gracefully leave. The default behavior for this feature varies based on whether or not the agent is running as a client or a server (prior to Consul 0.7 the default value was unconditionally set to false). On agents in client-mode, this defaults to true and for agents in server-mode, this defaults to false.

When a node shuts down gracefully, the init system sends SIGTERM to all processes before powering off. With this setting enabled, the Consul agent will respond to that SIGTERM by leaving the cluster cleanly, so it won't linger as a failed node that the cluster expects to restart and rejoin within a few hours (which is what the quote above describes as the default behavior).
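Concretely, the suggestion boils down to a one-line agent configuration. A sketch (the file path is a placeholder, not from the original answer):

```shell
# Hypothetical sketch: make a server agent leave gracefully on SIGTERM.
cat > /tmp/consul-leave.json <<'EOF'
{
  "leave_on_terminate": true
}
EOF
# Under systemd, `systemctl stop consul` sends SIGTERM by default, which
# with this setting makes the server send a Leave message before exiting.
echo "wrote /tmp/consul-leave.json"
```

On an auto-scaling instance, anything that triggers a clean OS shutdown during scale-in will then cause the server to deregister itself instead of piling up in the unreachable list.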

Context

StackExchange DevOps Q#182, answer score: 6

Revisions (0)

No revisions yet.