I am exploring RabbitMQ quorum queues to improve HA for some services in a Kubernetes cluster. From what I have read, they are designed with data safety in mind.
However, the chapter "Managing Replicas" states:
> Replicas of a quorum queue are explicitly managed by the operator.
> When a new node is added to the cluster, it will host no quorum queue
> replicas unless the operator explicitly adds it to a member (replica)
> list of a quorum queue or a set of quorum queues.
It therefore seems that, after a disruption (especially an involuntary one), the following situation could arise in a 3-node cluster:
- a node goes down after the disruption: the other two nodes still form a majority and will "keep the queue alive", possibly electing a new leader;
- Kubernetes provides a new node (pod) to replace the failed one; the new node automatically rejoins the RabbitMQ cluster, but
- unless the operator manually intervenes, the new node will not host replicas of the existing quorum queues;
- for a 3-node cluster, this means that there is no HA anymore: if one of the other two nodes later fails, the queue no longer has a majority and is effectively lost.
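To make the concern concrete, here is a minimal sketch of the majority arithmetic as I understand it. It assumes the failed replica stays on the queue's member list, so the membership size remains 3. The function name `has_quorum` is only illustrative and is not part of any RabbitMQ API.

```python
def has_quorum(members: int, alive: int) -> bool:
    """Return True if a strict majority of the members is still reachable."""
    return alive > members // 2


# Assumption: the member list still counts 3 replicas after the first failure.
print(has_quorum(members=3, alive=3))  # all replicas up
print(has_quorum(members=3, alive=2))  # one replica lost, majority still holds
print(has_quorum(members=3, alive=1))  # second replica lost, no majority
```

The replacement pod does not change this picture as long as it hosts no replica of the queue.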
Is there any way to mitigate this scenario? For example, is it possible to have nodes automatically rejoin all existing quorum queues as replicas? Perhaps by maintaining a list of "startup commands" (run after RabbitMQ starts) to which we could add the rejoin commands?
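As an illustration of the kind of "startup command" I have in mind, here is a rough sketch built around the `rabbitmq-queues grow <node> <all|even>` CLI command. The node name `rabbit@rabbitmq-2` and the helper names are hypothetical, and running this as a post-start hook is only my assumption. I do not know whether this is the recommended approach.

```python
import subprocess


def build_grow_command(node: str, strategy: str = "all") -> list[str]:
    """Build the CLI call that adds a replica on `node` to existing quorum queues."""
    if strategy not in ("all", "even"):
        raise ValueError("strategy must be 'all' or 'even'")
    return ["rabbitmq-queues", "grow", node, strategy]


def grow_replicas(node: str, strategy: str = "all") -> int:
    """Run the command; requires the RabbitMQ CLI tools on PATH (assumption)."""
    return subprocess.run(build_grow_command(node, strategy), check=False).returncode


if __name__ == "__main__":
    # Hypothetical node name; this only prints the command that would be run.
    print(" ".join(build_grow_command("rabbit@rabbitmq-2")))
```

The idea would be to call something like `grow_replicas` once the new pod has joined the cluster, but I am unsure where such a hook would belong in a Kubernetes deployment.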