author    | Alan Conway <aconway@apache.org> | 2011-08-26 22:11:18 +0000
committer | Alan Conway <aconway@apache.org> | 2011-08-26 22:11:18 +0000
commit    | 2fa119666035c81b4073b4509e7cf155f48e6771 (patch)
tree      | aee946b9cc490cd3c24ce60a7cb3d842b7ff30db
parent    | b3881a53867688832f3891dbecbde149b0d278db (diff)
download  | qpid-python-2fa119666035c81b4073b4509e7cf155f48e6771.tar.gz
QPID-2920: A hot-standby design for a new cluster implementation.
See qpid/cpp/design_docs/hot-standby-design.txt
This is in competition with the previous active-active cluster design
in qpid/cpp/design_docs/new-cluster-design.txt
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1162272 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r-- | qpid/cpp/design_docs/hot-standby-design.txt | 239
1 files changed, 239 insertions, 0 deletions
diff --git a/qpid/cpp/design_docs/hot-standby-design.txt b/qpid/cpp/design_docs/hot-standby-design.txt
new file mode 100644
index 0000000000..99a5dc0199
--- /dev/null
+++ b/qpid/cpp/design_docs/hot-standby-design.txt
@@ -0,0 +1,239 @@

-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

* Another new design for Qpid clustering.

For background see [[./new-cluster-design.txt]], which describes the
issues with the old design and a new active-active design that could
replace it.

This document describes an alternative hot-standby approach.

** Delivery guarantee

We guarantee N-way redundant, at-least-once delivery. Once a message
from a client has been acknowledged by the broker, it will be
delivered even if N-1 brokers subsequently fail. There may be
duplicates in the event of a failure, but we don't create duplicates
during normal operation (i.e. when no brokers have failed).

This is the same guarantee as the old cluster and the alternative
active-active design.

** Active-active vs. hot standby (aka primary-backup)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups").
Clients are served only by the primary; clients that connect to a
backup are redirected to the primary. The backups are kept up to date
in real time by the primary; if the primary fails, a backup is
elected to be the new primary.

Aside: a cold-standby cluster is possible using a standalone broker,
CMAN and shared storage. In this scenario only one broker runs at a
time, writing to a shared store. If it fails, another broker is
started (by CMAN) and recovers from the store. This bears
investigation, but the store recovery time is probably too long for
failover.

** Why hot standby?

Active-active has some advantages:
- Finding a broker on startup or failover is simple: just pick any live broker.
- All brokers are always running in active mode.
- Distributing clients across brokers gives better performance, but see [1].
- A broker failure affects only the clients connected to that broker.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages the others will take.
This was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement, and it's not straightforward to determine the
most performant strategy for passing the lock.

Hot standby removes this problem: only the primary can modify queues,
so it just has to tell the backups what it is doing; there's no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.
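To illustrate the intent, a minimal C++ sketch of this store-like
asynchronous replication follows. All names here (ReplicatingStore,
Backup, the completion callback) are hypothetical simplifications,
not the broker's real MessageStore or replication interfaces.

#+BEGIN_SRC C++
// Sketch: the primary "writes" an enqueue to each backup asynchronously
// and completes the client's publish only when every backup has
// acknowledged, mirroring the store's asynchronous-completion model.
#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Message { std::string queue, content; };

// Stand-in for a backup broker reachable over the network (e.g. via CPG).
struct Backup {
    void asyncEnqueue(const Message& m, std::function<void()> onAck);
};

class ReplicatingStore {
  public:
    explicit ReplicatingStore(std::vector<Backup*> backups)
        : backups_(std::move(backups)) {}

    // Primary's enqueue path: 'complete' plays the role of the store's
    // asynchronous completion for the client's publish.
    void enqueue(const Message& m, std::function<void()> complete) {
        if (backups_.empty()) { complete(); return; }
        auto pending = std::make_shared<std::atomic<std::size_t>>(backups_.size());
        for (Backup* b : backups_) {
            b->asyncEnqueue(m, [pending, complete] {
                if (--*pending == 0) complete();  // every backup has acknowledged
            });
        }
    }
  private:
    std::vector<Backup*> backups_;
};
#+END_SRC

The point is only the shape: the enqueue path hands a completion off
to the replication layer, just as it hands a completion to the store
today.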
** Failover in a hot-standby cluster.

Hot standby has some potential performance issues around failover:

- Failover "spike": when the primary fails, every client will fail
  over at the same time, putting strain on the system.

- Until a new primary is elected, the cluster cannot serve any
  clients or redirect clients to the primary.

We want to minimize the number of re-connect attempts that clients
have to make. The cluster can use a well-known algorithm to choose
the new primary (e.g. round robin on a known sequence of brokers) so
that clients can guess the new primary correctly in most cases.

Even if clients do guess correctly, it may be that the new primary is
not yet aware of the death of the old primary, which may cause
multiple failed connect attempts before clients eventually get
connected. We will need to prototype to see how much this happens in
reality and how we can best get clients redirected.

** Threading and performance.

The primary-backup cluster operates analogously to the way the disk store does now:
- use the same MessageStore interface as the store to interact with the broker
- use the same asynchronous-completion model for replicating messages
- use the same recovery interfaces (?) for new backups joining.

Re-using the well-established store design gives credibility to the new cluster design.

The single CPG dispatch thread was a severe performance bottleneck
for the old cluster.

The primary has the same threading model as a standalone broker with
a store, which we know performs well.

If we use CPG for replication of messages, the backups will receive
messages in the CPG dispatch thread. To get more concurrency, the CPG
thread can dump work onto internal PollableQueues to be processed in
parallel.

Messages from the same broker queue need to go onto the same
PollableQueue. There could be a separate PollableQueue for each
broker queue. If that's too resource intensive we can use a fixed set
of PollableQueues and assign broker queues to PollableQueues via
hashing or round robin.
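A rough sketch of the hashing option, with hypothetical types
(WorkQueue stands in for a PollableQueue-like class; this is not the
actual PollableQueue API):

#+BEGIN_SRC C++
// Sketch: route per-queue work from the single CPG dispatch thread to a
// fixed pool of workers, keeping all events for one broker queue in order.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a PollableQueue-like class: work pushed here is processed
// by a worker thread, in order.
struct WorkQueue {
    void push(std::function<void()> work);
};

class BackupDispatcher {
  public:
    explicit BackupDispatcher(std::size_t nWorkers) : workers_(nWorkers) {}

    // Called from the single CPG dispatch thread for each replicated event.
    void dispatch(const std::string& brokerQueue, std::function<void()> work) {
        std::size_t i = std::hash<std::string>{}(brokerQueue) % workers_.size();
        workers_[i].push(std::move(work));  // one broker queue -> one worker
    }
  private:
    std::vector<WorkQueue> workers_;
};
#+END_SRC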
Another possible optimization is to use multiple CPG groups: one per
queue, or a hashed set, to get more concurrency in the CPG layer. The
old cluster is not able to keep CPG busy.

TODO: Transactions pose a challenge with these concurrency models:
how to co-ordinate multiple messages being added (commit a publish or
roll back an accept) to multiple queues so that all replicas end up
with the same message sequence while respecting atomicity.

** Use of CPG

CPG provides several benefits in the old cluster:
- tracking membership (essential for determining the primary)
- handling "split brain" (integrates with partition support from CMAN)
- a reliable multicast protocol to distribute messages.

I believe we still need CPG for membership and split brain. We could
experiment with sending the bulk traffic over AMQP connections.

** Flow control

Need to ensure that:
1) in-memory internal queues used by the cluster don't overflow;
2) the backups don't fall too far behind on processing CPG messages.

** Recovery

When a new backup joins an active cluster it must get a snapshot from
one of the other backups, or from the primary if there are none. In
store terms this is "recovery" (the old cluster called it an
"update").

Compared to the old cluster we only replicate a well-defined data
set: the same data the store persists. This was a crucial sore spot
of the old cluster.

We can also replicate it more efficiently by recovering queues in
reverse (LIFO) order. That means that as clients actively consume
messages from the front of the queue, they are reducing the work we
have to do in recovering from the back. (NOTE: this may not be
compatible with using the same recovery interfaces as the store.)

** Selective replication

In this model it's easy to support selective replication of
individual queues via configuration (a rough sketch follows the list):
- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate.
  Treated analogously to the persistent/durable properties for the store.
- If not explicitly marked, provide a choice of defaults:
  - default is replicate (replicated message on replicated queue)
  - default is don't replicate
  - default is replicate persistent/durable messages.
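A minimal sketch of that decision logic, assuming a hypothetical
ReplicateDefault configuration value and an x-qpid-replicate argument
already parsed from the declare or message (names and API are
illustrative only):

#+BEGIN_SRC C++
// Sketch: decide whether to replicate a message, using an explicit
// x-qpid-replicate marking when present, else a configured default.
#include <optional>

enum class ReplicateDefault { ALWAYS, NEVER, IF_DURABLE };

struct MessageInfo {
    std::optional<bool> replicateArg;  // x-qpid-replicate, if the client set it
    bool durable = false;
    bool queueReplicated = false;      // is the target queue itself replicated?
};

bool shouldReplicate(const MessageInfo& m, ReplicateDefault def) {
    if (!m.queueReplicated) return false;        // unreplicated queue: nothing to do
    if (m.replicateArg) return *m.replicateArg;  // explicit marking wins
    switch (def) {
        case ReplicateDefault::ALWAYS:     return true;
        case ReplicateDefault::NEVER:      return false;
        case ReplicateDefault::IF_DURABLE: return m.durable;
    }
    return true;  // unreachable; keeps compilers happy
}
#+END_SRC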
** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when
store and cluster are used together.

We also have to include error handling in the async completion loop
to guarantee N-way at-least-once delivery: we should only report
success to the client when we know the message was replicated and
stored on all N-1 backups.

TODO: we have a lot more options than the old cluster; we need to
figure out the best approach, or possibly allow multiple approaches.
Need to go through the various failure cases. We may be able to do
recovery on a per-queue basis rather than restarting an entire node.

** New members joining

We should be able to catch up much faster than the old design. A new
backup can catch up ("recover") the current cluster state on a
per-queue basis:
- queues can be updated in parallel
- "live" updates avoid the "endless chase".

During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated to the backup
- clients are consuming messages from the front of the queue, replicated to the backup
- the primary is sending pre-existing messages to the new backup.

The primary sends pre-existing messages in LIFO order - starting from
the back of the queue - at the same time as clients are consuming
from the front. The active consumers actually reduce the amount of
work to be done, as there's no need to replicate messages that are no
longer on the queue.

* Steps to get there

** Baseline replication
Validate the overall design and get an initial notion of performance.
Just message and wiring replication, no update/recovery for new
members joining, a single CPG dispatch thread on backups, no
failover, no transactions.

** Failover
Electing a primary, backups redirect to the primary. Measure failover
time for a large number of clients. Strategies to minimize the number
of retries after a failure.

** Flow Control
Keep internal queues from overflowing. Similar to internal flow
control in the old cluster. Needed for realistic performance/stress
tests.

** Concurrency
Experiment with multiple threads on backups, multiple CPG groups.

** Recovery/new member joining
Initial status handshake for new members. Recovering queues from the back.

** Transactions
TODO: how to implement transactions with concurrency. Worst solution:
a global --cluster-use-transactions flag that forces single-threaded
mode. Need to find a better solution.
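To make the "recovering queues from the back" step (and the live
update described under "New members joining") a little more concrete,
here is a minimal sketch. The types and the alreadyDequeued() check
are hypothetical; a real implementation would hook into the broker's
queue and replication machinery.

#+BEGIN_SRC C++
// Sketch of a "live" update: send pre-existing messages to a new backup
// from the back of the queue towards the front, skipping anything that
// clients have already consumed (those dequeues are replicated separately).
#include <cstdint>
#include <deque>

struct QueuedMessage { std::uint64_t position; /* payload omitted */ };

// Stand-in for the joining broker.
struct NewBackup {
    void sendExisting(const QueuedMessage& m);
};

class LiveQueueUpdater {
  public:
    // 'snapshot' is the queue content when the update starts, ordered
    // front (oldest, lowest position) to back (newest, highest position).
    void update(const std::deque<QueuedMessage>& snapshot, NewBackup& backup) {
        for (auto i = snapshot.rbegin(); i != snapshot.rend(); ++i) {
            // Consumers take from the front (lowest positions first); once we
            // meet a consumed position, everything nearer the front is
            // consumed too, so the update is done.
            if (alreadyDequeued(i->position)) break;
            backup.sendExisting(*i);
        }
    }
  private:
    bool alreadyDequeued(std::uint64_t position) const;  // hypothetical dequeue tracking
};
#+END_SRC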