As part of our efforts to improve the resilience of OpenStack within sites, we are moving to a multi-controller architecture. I decided to tackle the underlying, stateful infrastructure services first. The first of those I looked into was RabbitMQ.
RabbitMQ supports clustering and HA queues, as well as durable queues (queues that persist to disk). The setup seems fairly simple, but unfortunately many things still assume IP means IPv4, and that's exactly where this got interesting.
When I read through the OpenStack HA guide, I saw that it takes only a small number of steps to get things working. The guide is a little outdated, since it still suggests using the rabbit_hosts parameter, but I decided to give it a try and see if I could follow it without modification.
IPv6, Erlang, RabbitMQ, and You
I tried following the guide, and the rabbitmqctl join_cluster command failed every time. I tried several suggested workarounds (adding the short hostname to /etc/hosts, for example), but nothing made any difference. I verified that the .erlang.cookie was the same on all hosts. I checked that the ports were listening. I checked for firewalls in the path and connected with telnet to the destination port on the remote host with no problems. After all of this, I was no closer to a cluster. I decided to dig into the process a bit by wrapping it with strace (yes, there are better ways, but strace is easy, and I didn't care about the performance hit in this case). I ran strace -f -e trace=network rabbitmqctl join_cluster <myhost> and found that, even though both hosts are IPv6 only (and even dual-stack hosts should try IPv6 first and then fall back to IPv4), the request was failing because all of the network activity was using IPv4.
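For anyone repeating the strace step, a small filter makes the address family stand out. This is only a sketch: summarize_socket_families is a hypothetical helper name, and it simply counts lines of strace output that mention AF_INET or AF_INET6.

```shell
# Sketch: count how many traced network calls mention each address family.
# summarize_socket_families is a hypothetical helper; it reads strace output
# on stdin and counts matching lines (a line naming AF_INET6 counts as IPv6).
summarize_socket_families() {
    awk '
        /AF_INET6/ { v6++; next }
        /AF_INET/  { v4++ }
        END { printf "AF_INET=%d AF_INET6=%d\n", v4, v6 }
    '
}

# Usage (strace writes its trace to stderr, hence the 2>&1):
#   strace -f -e trace=network rabbitmqctl join_cluster <myhost> 2>&1 \
#       | summarize_socket_families
```

On an IPv6-only host, a non-zero AF_INET count next to a zero AF_INET6 count is the symptom described above.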
I searched and found this bug for TripleO, which suggested I was not alone in having difficulty clustering RabbitMQ on IPv6-only hosts. I tried upgrading packages, as suggested in the bug, but that did not resolve the issue. Finally I dug through the RabbitMQ mailing list and found this gem, which explains that the problem is in the underlying Erlang distribution and includes the config changes needed to tell the Erlang VM that runs RabbitMQ to use IPv6 only.
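The config change itself is small. The sketch below follows the approach in RabbitMQ's networking documentation for IPv6 inter-node traffic, on the assumption that it matches what the mailing list post describes. write_ipv6_dist_config is a hypothetical helper, /etc/rabbitmq is assumed to be the config directory, and the variable names should be checked against the documentation for the installed RabbitMQ version.

```shell
# Sketch: write the two files that switch Erlang distribution to IPv6.
# write_ipv6_dist_config is a hypothetical helper; the default directory
# /etc/rabbitmq is an assumption based on stock packages.
write_ipv6_dist_config() {
    conf_dir="${1:-/etc/rabbitmq}"

    # Erlang inet configuration: enable IPv6 for the runtime.
    printf '%s\n' '{inet6, true}.' > "${conf_dir}/erl_inetrc"

    # rabbitmq-env.conf is sourced as shell by the RabbitMQ scripts.
    # -proto_dist inet6_tcp makes both the server and rabbitmqctl use IPv6
    # for inter-node (Erlang distribution) connections.
    cat > "${conf_dir}/rabbitmq-env.conf" <<EOF
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel inetrc '${conf_dir}/erl_inetrc' -proto_dist inet6_tcp"
RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
EOF
}
```

Note that this overwrites any existing rabbitmq-env.conf in the target directory, and the node needs a restart before the new distribution protocol takes effect.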
I verified this solved my connectivity problems, and after some more testing I was able to remove the other workarounds I had tried. I pulled the short hostnames from /etc/hosts (actually Chef did that for me, so thanks Chef), removed the packages from the RabbitMQ team that I had installed manually, and reverted to the packages in my upstream apt repo (Ubuntu). I tested again and was able to join the cluster. It was one of those moments where I tend to hit the enter|return key a lot harder than necessary to indicate triumph.
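For reference, the join itself boils down to a handful of rabbitmqctl calls run on the node that is joining. This is a sketch, not the guide's exact text: join_rabbit_cluster and the RABBITMQCTL override are hypothetical, and the node name rabbit@<host> assumes the default "rabbit" node name prefix.

```shell
# Sketch: join the local node to an existing cluster member.
# join_rabbit_cluster and the RABBITMQCTL override are hypothetical helpers;
# "rabbit@<host>" assumes the default node name prefix.
join_rabbit_cluster() {
    seed_host="$1"
    ctl="${RABBITMQCTL:-rabbitmqctl}"

    # The RabbitMQ application must be stopped before the node can join;
    # each step runs only if the previous one succeeded.
    "$ctl" stop_app &&
    "$ctl" join_cluster "rabbit@${seed_host}" &&
    "$ctl" start_app &&
    "$ctl" cluster_status
}
```

Setting RABBITMQCTL=echo turns the function into a dry run that only prints the subcommands it would execute.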
Updated Syntax for Messaging
The connectivity issue was the last real hurdle, and the guide was easy to follow once it was resolved. Now that I finally had a working cluster, I needed OpenStack to actually use it.
The HA guide still uses the deprecated rabbit_host syntax, and I needed to find a better way. I was able to find the transport_url equivalent in the OpenStack Questions forum. After creating the cluster, setting an HA policy for all queues, updating transport_url across controllers and compute nodes, and restarting services, I was able to execute the definitive test. I stopped RabbitMQ on the main controller and created a VM. Once it launched successfully, I started RabbitMQ again and repeated the process, stopping the other nodes and observing the output of rabbitmqctl cluster_status and rabbitmqctl list_queues along the way.
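The two pieces of configuration mentioned above can be sketched as follows. Both helper names are hypothetical, as are the user, password, and host names in the usage example. The port 5672 and the default virtual host are assumptions, and the policy pattern is the one commonly shown for mirroring all non-amq.* queues on RabbitMQ releases that still support classic mirrored queues.

```shell
# Sketch: build a transport_url that lists every cluster member.
# build_transport_url is a hypothetical helper; port 5672 and the default
# virtual host (nothing after the trailing "/") are assumptions.
build_transport_url() {
    user="$1"; pass="$2"; shift 2
    url=""
    for host in "$@"; do
        # Each member repeats the credentials; members are comma-separated.
        url="${url:+${url},}${user}:${pass}@${host}:5672"
    done
    printf 'rabbit://%s/\n' "$url"
}

# Sketch: mirror every queue except the built-in amq.* ones to all nodes.
# set_ha_policy and the RABBITMQCTL override are hypothetical helpers.
set_ha_policy() {
    ctl="${RABBITMQCTL:-rabbitmqctl}"
    "$ctl" set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
}
```

For example, build_transport_url openstack secret ctl1 ctl2 prints rabbit://openstack:secret@ctl1:5672,openstack:secret@ctl2:5672/, which is the kind of value that goes into transport_url in the [DEFAULT] section of each service's config. The helper assumes hostnames; literal IPv6 addresses would need bracket handling that is not shown here.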
The last task was to update our automation so this can be done in a repeatable way. I also verified that the RabbitMQ changes are backward compatible with single-node RabbitMQ for good measure, since we still deploy single-controller instances for some testing needs.