If you run RabbitMQ in a cluster, sooner or later the cluster is likely to get partitioned (part of the cluster loses its connection to the rest). The basic commands for showing the status and configuring the behaviour are explained in the linked page above. When partitioning happens, you want to first be notified about it and second, resolve it.
RabbitMQ can handle the second part automatically, via the cluster_partition_handling configuration setting. It has three values: ignore, pause_minority and autoheal. The partitions guide linked above explains these as well (“Which mode should I pick?”). Note that whichever you choose, you still have a problem and you have to restore the connectivity. For example, in a multi-availability-zone setup like the one I explained a while ago, it’s probably better to use pause_minority and then reconnect manually.
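To confirm which mode a node is actually running with, one option is to look the setting up in the node's effective configuration. The following is a minimal sketch, assuming that rabbitmqctl environment prints the setting as an Erlang tuple such as {cluster_partition_handling,pause_minority}; the function name handling_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `rabbitmqctl environment` output on stdin and
# prints the value of cluster_partition_handling, if present.
# Assumption: the setting appears as {cluster_partition_handling,<mode>}.
handling_mode() {
    sed -n 's/.*{cluster_partition_handling,\([a-z_]*\)}.*/\1/p'
}

# Usage (on a cluster node):
#   sudo rabbitmqctl environment | handling_mode
```

If nothing is printed, the output format on your version probably differs from the assumption above.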
Fortunately, it’s rather simple to detect partitioning. The output of the cluster_status command contains an empty “partitions” element if there is no partitioning; if there are partitions, that element is either non-empty or missing altogether. So this line does the detection:
clusterOK=$(sudo rabbitmqctl cluster_status | grep -cF "{partitions,[]}")
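To see why the check works, it can help to run the same kind of grep against canned output. The two samples below are illustrative approximations of the Erlang-term output printed by older rabbitmqctl versions, with made-up node names, and the function name count_empty_partitions is hypothetical. Newer releases may format cluster_status differently, so treat this as a sketch rather than a guaranteed match for your version.

```shell
#!/bin/sh
# Hypothetical helper: reads cluster_status output on stdin and prints the
# number of lines containing the empty-partitions marker.
# -F makes grep treat the pattern as a fixed string, so "[]" is not parsed
# as a bracket expression; -c prints the count of matching lines.
# grep exits non-zero when nothing matches, so "|| true" keeps the printed
# count as the only signal.
count_empty_partitions() {
    grep -cF '{partitions,[]}' || true
}

# Illustrative sample of a healthy cluster (node names are made up).
healthy='[{nodes,[{disc,[rabbit@node1,rabbit@node2]}]},
 {running_nodes,[rabbit@node2,rabbit@node1]},
 {partitions,[]}]'

# Illustrative sample in which node1 has lost contact with node2.
partitioned='[{nodes,[{disc,[rabbit@node1,rabbit@node2]}]},
 {running_nodes,[rabbit@node1]},
 {partitions,[{rabbit@node1,[rabbit@node2]}]}]'

printf '%s\n' "$healthy" | count_empty_partitions      # prints 1
printf '%s\n' "$partitioned" | count_empty_partitions  # prints 0
```

A count of 0 therefore covers both failure cases described above: a non-empty partitions element and a missing one.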
You would want to schedule that script to run periodically, for example every minute. What to do with the result depends on the tool you use (Nagios, CloudWatch, etc.). For Nagios there is actually a ready-to-use plugin. And if it’s AWS CloudWatch, then you can do as follows:
if [ "$clusterOK" -eq "0" ]; then
    echo "RabbitMQ cluster is partitioned"
    aws cloudwatch put-metric-data --metric-name $METRIC_NAME --namespace $NAMESPACE --value 1 --dimensions Stack=$STACKNAME --region $REGION
else
    aws cloudwatch put-metric-data --metric-name $METRIC_NAME --namespace $NAMESPACE --value 0 --dimensions Stack=$STACKNAME --region $REGION
fi
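For tools that work from exit codes rather than pushed metrics, such as Nagios-style checks, the same count can be mapped to a status line and an exit code. This is a minimal sketch, assuming the conventional plugin exit codes (0 for OK, 2 for CRITICAL); the function name partition_check is hypothetical, and this is not the ready-made plugin mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: takes the number of lines that matched the
# empty-partitions marker and turns it into a Nagios-style result.
#   count > 0  -> marker found, no partition -> OK (exit code 0)
#   count = 0  -> marker missing             -> CRITICAL (exit code 2)
partition_check() {
    matches=$1
    if [ "$matches" -gt 0 ]; then
        echo "OK - no RabbitMQ partitions detected"
        return 0
    fi
    echo "CRITICAL - RabbitMQ cluster is partitioned"
    return 2
}

# Usage (on a cluster node):
#   partition_check "$(sudo rabbitmqctl cluster_status | grep -cF '{partitions,[]}')"
#   exit $?
```

Note that a failing rabbitmqctl call also yields a count of 0 here, so a CRITICAL result can mean either a partition or a node that could not be queried.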
When partitioning happens, the important thing is to get notified about it. What to do after that depends on the particular application, the problem, and the configuration of the queues (durable, mirrored, etc.).