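I am reading RabbitMQ in Action. It is well written and easy to follow. Beyond the RabbitMQ basics, the authors also cover business scenarios, design patterns and more, and I have learned a lot from it. I have created a Chinese-language RabbitMQ discussion group on Google Groups; you are welcome to subscribe.
Section 5.2.1 describes queues in a cluster as follows: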
The minute you join one node to another to form a cluster something dramatically changes: not every node has a full copy of every queue. In a single node setup all of the information about a queue (metadata, state, and contents) is fully stored in that node (Figure 5.1). However, in a cluster when you create queues, the cluster only creates the full information about the queue (metadata, state, contents) on a single node in the cluster rather than on all of them (the cluster tries to evenly distribute queue creation amongst the nodes). The result is that only the “owner” node for a queue knows the full information about that queue. All of the “non-owner” nodes only know the queue’s metadata and a pointer to the node where the queue actually lives. So when a cluster node dies that node’s queues and associated bindings disappear. Consumers attached to those queues lose their subscriptions, and any new messages that would have matched that queue’s bindings become blackholed.
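In other words, when a queue is created in a cluster, its contents are not copied to every node. RabbitMQ picks one node on which to create the queue (whether that choice is random, round-robin, or based on resource load remains to be investigated), and only that node stores the queue's full contents. The other nodes hold only the queue's metadata and a pointer to the node where the queue lives. What happens if that node goes down? The book continues: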
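As a rough illustration of the ownership model quoted above, here is a toy Python model. It is not RabbitMQ's implementation: the `Node` and `Cluster` classes are hypothetical, and the round-robin owner selection is purely an assumption, since the text does not establish how the owner is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Full queue data (here just a message list) for queues this node owns.
    owned: dict = field(default_factory=dict)
    # Metadata only: queue name -> name of the owner node.
    pointers: dict = field(default_factory=dict)


class Cluster:
    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}
        self._next = 0  # round-robin cursor (assumed selection rule)

    def declare_queue(self, queue):
        names = list(self.nodes)
        owner = names[self._next % len(names)]
        self._next += 1
        # Only the owner stores the queue itself ...
        self.nodes[owner].owned[queue] = []
        # ... every other node just records where the queue lives.
        for node in self.nodes.values():
            if node.name != owner:
                node.pointers[queue] = owner
        return owner

    def kill(self, name):
        # The dead node takes its queues with it; survivors drop the pointers.
        dead = self.nodes.pop(name)
        for node in self.nodes.values():
            for queue in dead.owned:
                node.pointers.pop(queue, None)
        return list(dead.owned)


cluster = Cluster(["rabbit@a", "rabbit@b", "rabbit@c"])
owner = cluster.declare_queue("orders")
lost = cluster.kill(owner)
print(owner, lost)
```

After `kill`, no surviving node has either the contents of `orders` or a pointer to it, which is the "queues disappear" situation the quote describes.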
Not to worry though…we can have our consumers reconnect to the cluster and recreate the queues right? Only if the queues weren’t originally marked durable. If the queues being recreated were marked as durable, redeclaring them from another node will get you a big ugly 404 NOT_FOUND error. This ensures messages in that queue on the failed node don’t disappear when you restore it to the cluster. The only way to get that specific queue name back into the cluster is to actually restore the failed node. However, if the queues your consumers try to recreate are not durable, the re-declarations will succeed and you’re ready to rebind them and keep trucking.
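The redeclaration rule in this passage can be sketched with a small simulation. This is only a model of the behaviour the book describes, not broker code: `ToyCluster`, `NotFound` and their methods are hypothetical names invented for the example.

```python
class NotFound(Exception):
    """Stands in for the broker's 404 NOT_FOUND error."""


class ToyCluster:
    def __init__(self):
        self.live = {}  # queue name -> (owner node, durable flag)
        self.down = {}  # durable queue name -> owner node that is down

    def declare(self, queue, node, durable):
        if queue in self.down:
            # The name stays reserved until the failed owner node returns.
            raise NotFound(f"404 NOT_FOUND: durable queue '{queue}' is on a failed node")
        # Declaring an existing queue is a no-op; otherwise `node` becomes owner.
        self.live.setdefault(queue, (node, durable))
        return self.live[queue][0]

    def fail_node(self, node):
        for queue, (owner, durable) in list(self.live.items()):
            if owner == node:
                del self.live[queue]
                if durable:
                    self.down[queue] = owner

    def restore_node(self, node):
        for queue, owner in list(self.down.items()):
            if owner == node:
                del self.down[queue]
                self.live[queue] = (node, True)


cluster = ToyCluster()
cluster.declare("audit", node="rabbit@a", durable=True)
cluster.declare("scratch", node="rabbit@a", durable=False)
cluster.fail_node("rabbit@a")

# The non-durable queue can be recreated on another node.
print(cluster.declare("scratch", node="rabbit@b", durable=False))

# The durable queue cannot, until rabbit@a is restored.
try:
    cluster.declare("audit", node="rabbit@b", durable=True)
except NotFound as exc:
    print(exc)
```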
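That is, a client can reconnect to another node in the cluster and recreate the queue only if the queue is non-durable. If the queue is durable, the only way to get it back is to restore the failed node.
Why doesn't RabbitMQ copy every queue to every node in the cluster? Because that would conflict with the design intent of clustering, which is that adding nodes should increase performance (CPU, memory) and capacity (memory, disk) linearly. The book gives two reasons: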
1. storage space: If every cluster node had a full copy of every queue, adding nodes wouldn’t give you more storage capacity. For example, if one node could store 1GB of messages, adding two more nodes would simply give you two more copies of the same 1GB of messages.
2. performance: Publishing messages would require replicating those messages to every cluster node. For durable messages that would require triggering disk activity on all nodes for every message. Your network and disk load would increase every time you added a node, keeping the performance of the cluster the same (or possibly worse).
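The arithmetic behind both reasons can be written out in a few lines. This is a back-of-the-envelope sketch under simplifying assumptions (queues spread evenly, every node the same size); the function names are made up for the illustration.

```python
def cluster_capacity_gb(nodes, per_node_gb, copies):
    """Usable message storage when every queue is kept on `copies` nodes."""
    return nodes * per_node_gb / copies


def writes_per_message(copies):
    """Node-level writes triggered by one published durable message."""
    return copies


for nodes in (1, 3):
    # Every node holds a full copy of every queue.
    full = cluster_capacity_gb(nodes, per_node_gb=1, copies=nodes)
    # Each queue lives on exactly one owner node.
    single = cluster_capacity_gb(nodes, per_node_gb=1, copies=1)
    print(nodes, full, single, writes_per_message(nodes))
```

With three 1 GB nodes, copying everything everywhere still leaves 1 GB of usable capacity and costs three writes per message, while the single-owner layout gives 3 GB and one write per message.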
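Of course, RabbitMQ clusters can also replicate queues (there is a configuration option for it). For example, in a five-node cluster you can have a given queue's contents stored on 2 nodes, striking a balance between performance and high availability. The book describes the option as forthcoming: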
Soon there will be an option to replicate the contents of a queue across more than one RabbitMQ cluster node. This will allow the contents of a queue to survive the failure of the queue’s primary owner node. The option will allow you to specify how many copies of a queues (replication factor) should exist in the cluster. So if you specified a replication factor of 2 on the avocado_receipts queue, 2 out of the 5 nodes in the cluster would get a copy of every message put into the avocado_receipts queue. Since the replication factor will be specified per queue, you can precisely balance durability versus performance within your cluster.
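To make the replication-factor idea concrete, here is a minimal sketch of placing 2 copies of a queue on a 5-node cluster. The placement rule (start from a position derived from the queue name and wrap around) is an assumption chosen only to keep the example deterministic; neither it nor the helper names come from RabbitMQ.

```python
def place_replicas(queue, nodes, replication_factor):
    """Pick which nodes hold a copy of `queue` (first entry = primary owner)."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    # Hypothetical placement rule: derive a start index from the queue name,
    # then take the next `replication_factor` nodes, wrapping around.
    start = sum(queue.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def survives(replicas, failed_node):
    """True if at least one copy remains after `failed_node` goes down."""
    return any(node != failed_node for node in replicas)


nodes = [f"rabbit@node{i}" for i in range(1, 6)]
replicas = place_replicas("avocado_receipts", nodes, replication_factor=2)
print(replicas, survives(replicas, replicas[0]))
```

With a replication factor of 2 the queue survives the loss of its primary owner; with a factor of 1 it does not, which is the single-owner behaviour described earlier.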