author    | Alan Conway <aconway@apache.org> | 2011-08-26 22:11:18 +0000
committer | Alan Conway <aconway@apache.org> | 2011-08-26 22:11:18 +0000
commit    | 2fa119666035c81b4073b4509e7cf155f48e6771 (patch)
tree      | aee946b9cc490cd3c24ce60a7cb3d842b7ff30db
parent    | b3881a53867688832f3891dbecbde149b0d278db (diff)
download  | qpid-python-2fa119666035c81b4073b4509e7cf155f48e6771.tar.gz
QPID-2920: A hot-standby design for a new cluster implementation.
See qpid/cpp/design_docs/hot-standby-design.txt
This is in competition with the previous active-active cluster design
in qpid/cpp/design_docs/new-cluster-design.txt
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1162272 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r-- | qpid/cpp/design_docs/hot-standby-design.txt | 239
1 files changed, 239 insertions, 0 deletions
diff --git a/qpid/cpp/design_docs/hot-standby-design.txt b/qpid/cpp/design_docs/hot-standby-design.txt
new file mode 100644
index 0000000000..99a5dc0199
--- /dev/null
+++ b/qpid/cpp/design_docs/hot-standby-design.txt
@@ -0,0 +1,239 @@

-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

* Another new design for Qpid clustering.

For background see [[./new-cluster-design.txt]], which describes the
issues with the old design and a new active-active design that could
replace it.

This document describes an alternative hot-standby approach.

** Delivery guarantee

We guarantee N-way redundant, at-least-once delivery. Once a message
from a client has been acknowledged by the broker, it will be
delivered even if N-1 brokers subsequently fail. There may be
duplicates in the event of a failure, but we don't create duplicates
during normal operation (i.e. when no brokers have failed).

This is the same guarantee as the old cluster and the alternative
active-active design.

** Active-active vs. hot standby (aka primary-backup)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups").
Clients are served only by the primary; clients that connect to a
backup are redirected to the primary. The backups are kept up to date
in real time by the primary; if the primary fails, a backup is
elected to be the new primary.

Aside: a cold-standby cluster is possible using a standalone broker,
CMAN and shared storage. In this scenario only one broker runs at a
time, writing to a shared store. If it fails, another broker is
started (by CMAN) and recovers from the store. This bears
investigation, but the store recovery time is probably too long for
failover.

** Why hot standby?

Active-active has some advantages:
- Finding a broker on startup or failover is simple: just pick any live broker.
- All brokers are always running in active mode.
- Distributing clients across brokers gives better performance, but see [1].
- A broker failure affects only the clients connected to that broker.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages the others will take.
This was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement, and it's not straightforward to determine the
most performant strategy for passing the lock.

Hot standby removes this problem: only the primary can modify queues,
so it just has to tell the backups what it is doing; there's no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.
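To illustrate the intent, a minimal C++ sketch of this store-like
asynchronous replication follows. All names here (ReplicatingStore,
Backup, the completion callback) are hypothetical simplifications,
not the broker's real MessageStore or replication interfaces.

#+BEGIN_SRC C++
// Sketch: the primary "writes" an enqueue to each backup asynchronously
// and completes the client's publish only when every backup has
// acknowledged, mirroring the store's asynchronous-completion model.
#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Message { std::string queue, content; };

// Stand-in for a backup broker reachable over the network (e.g. via CPG).
struct Backup {
    void asyncEnqueue(const Message& m, std::function<void()> onAck);
};

class ReplicatingStore {
  public:
    explicit ReplicatingStore(std::vector<Backup*> backups)
        : backups_(std::move(backups)) {}

    // Primary's enqueue path: 'complete' plays the role of the store's
    // asynchronous completion for the client's publish.
    void enqueue(const Message& m, std::function<void()> complete) {
        if (backups_.empty()) { complete(); return; }
        auto pending = std::make_shared<std::atomic<std::size_t>>(backups_.size());
        for (Backup* b : backups_) {
            b->asyncEnqueue(m, [pending, complete] {
                if (--*pending == 0) complete();  // every backup has acknowledged
            });
        }
    }
  private:
    std::vector<Backup*> backups_;
};
#+END_SRC

The point is only the shape: the enqueue path hands a completion off
to the replication layer, just as it hands a completion to the store
today.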
** Failover in a hot-standby cluster.

Hot standby has some potential performance issues around failover:

- Failover "spike": when the primary fails, every client will fail
  over at the same time, putting strain on the system.

- Until a new primary is elected, the cluster cannot serve any
  clients or redirect clients to the primary.

We want to minimize the number of re-connect attempts that clients
have to make. The cluster can use a well-known algorithm to choose
the new primary (e.g. round robin on a known sequence of brokers) so
that clients can guess the new primary correctly in most cases.

Even if clients do guess correctly, it may be that the new primary is
not yet aware of the death of the old primary, which may cause
multiple failed connect attempts before clients eventually get
connected. We will need to prototype to see how much this happens in
reality and how we can best get clients redirected.

** Threading and performance.

The primary-backup cluster operates analogously to the way the disk store does now:
- use the same MessageStore interface as the store to interact with the broker
- use the same asynchronous-completion model for replicating messages
- use the same recovery interfaces (?) for new backups joining.

Re-using the well-established store design gives credibility to the new cluster design.

The single CPG dispatch thread was a severe performance bottleneck
for the old cluster.

The primary has the same threading model as a standalone broker with
a store, which we know performs well.

If we use CPG for replication of messages, the backups will receive
messages in the CPG dispatch thread. To get more concurrency, the CPG
thread can dump work onto internal PollableQueues to be processed in
parallel.

Messages from the same broker queue need to go onto the same
PollableQueue. There could be a separate PollableQueue for each
broker queue. If that's too resource intensive we can use a fixed set
of PollableQueues and assign broker queues to PollableQueues via
hashing or round robin.
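A rough sketch of the hashing option, with hypothetical types
(WorkQueue stands in for a PollableQueue-like class; this is not the
actual PollableQueue API):

#+BEGIN_SRC C++
// Sketch: route per-queue work from the single CPG dispatch thread to a
// fixed pool of workers, keeping all events for one broker queue in order.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a PollableQueue-like class: work pushed here is processed
// by a worker thread, in order.
struct WorkQueue {
    void push(std::function<void()> work);
};

class BackupDispatcher {
  public:
    explicit BackupDispatcher(std::size_t nWorkers) : workers_(nWorkers) {}

    // Called from the single CPG dispatch thread for each replicated event.
    void dispatch(const std::string& brokerQueue, std::function<void()> work) {
        std::size_t i = std::hash<std::string>{}(brokerQueue) % workers_.size();
        workers_[i].push(std::move(work));  // one broker queue -> one worker
    }
  private:
    std::vector<WorkQueue> workers_;
};
#+END_SRC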
Another possible optimization is to use multiple CPG groups: one per
queue, or a hashed set, to get more concurrency in the CPG layer. The
old cluster is not able to keep CPG busy.

TODO: Transactions pose a challenge with these concurrency models:
how to co-ordinate multiple messages being added (commit a publish or
roll back an accept) to multiple queues so that all replicas end up
with the same message sequence while respecting atomicity.

** Use of CPG

CPG provides several benefits in the old cluster:
- tracking membership (essential for determining the primary)
- handling "split brain" (integrates with partition support from CMAN)
- a reliable multicast protocol to distribute messages.

I believe we still need CPG for membership and split brain. We could
experiment with sending the bulk traffic over AMQP connections.

** Flow control

Need to ensure that:
1) in-memory internal queues used by the cluster don't overflow;
2) the backups don't fall too far behind on processing CPG messages.

** Recovery

When a new backup joins an active cluster it must get a snapshot from
one of the other backups, or from the primary if there are none. In
store terms this is "recovery" (the old cluster called it an
"update").

Compared to the old cluster we only replicate a well-defined data
set: the same data the store persists. This was a crucial sore spot
of the old cluster.

We can also replicate it more efficiently by recovering queues in
reverse (LIFO) order. That means that as clients actively consume
messages from the front of the queue, they are reducing the work we
have to do in recovering from the back. (NOTE: this may not be
compatible with using the same recovery interfaces as the store.)

** Selective replication

In this model it's easy to support selective replication of
individual queues via configuration (a rough sketch follows the list):
- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate.
  Treated analogously to the persistent/durable properties for the store.
- If not explicitly marked, provide a choice of defaults:
  - default is replicate (replicated message on replicated queue)
  - default is don't replicate
  - default is replicate persistent/durable messages.
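A minimal sketch of that decision logic, assuming a hypothetical
ReplicateDefault configuration value and an x-qpid-replicate argument
already parsed from the declare or message (names and API are
illustrative only):

#+BEGIN_SRC C++
// Sketch: decide whether to replicate a message, using an explicit
// x-qpid-replicate marking when present, else a configured default.
#include <optional>

enum class ReplicateDefault { ALWAYS, NEVER, IF_DURABLE };

struct MessageInfo {
    std::optional<bool> replicateArg;  // x-qpid-replicate, if the client set it
    bool durable = false;
    bool queueReplicated = false;      // is the target queue itself replicated?
};

bool shouldReplicate(const MessageInfo& m, ReplicateDefault def) {
    if (!m.queueReplicated) return false;        // unreplicated queue: nothing to do
    if (m.replicateArg) return *m.replicateArg;  // explicit marking wins
    switch (def) {
        case ReplicateDefault::ALWAYS:     return true;
        case ReplicateDefault::NEVER:      return false;
        case ReplicateDefault::IF_DURABLE: return m.durable;
    }
    return true;  // unreachable; keeps compilers happy
}
#+END_SRC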
** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when
store and cluster are used together.

We also have to include error handling in the async completion loop
to guarantee N-way at-least-once delivery: we should only report
success to the client when we know the message was replicated and
stored on all N-1 backups.

TODO: we have a lot more options than the old cluster; we need to
figure out the best approach, or possibly allow multiple approaches.
Need to go through the various failure cases. We may be able to do
recovery on a per-queue basis rather than restarting an entire node.

** New members joining

We should be able to catch up much faster than the old design. A new
backup can catch up ("recover") the current cluster state on a
per-queue basis:
- queues can be updated in parallel
- "live" updates avoid the "endless chase".

During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated to the backup
- clients are consuming messages from the front of the queue, replicated to the backup
- the primary is sending pre-existing messages to the new backup.

The primary sends pre-existing messages in LIFO order - starting from
the back of the queue - at the same time as clients are consuming
from the front. The active consumers actually reduce the amount of
work to be done, as there's no need to replicate messages that are no
longer on the queue.

* Steps to get there

** Baseline replication
Validate the overall design and get an initial notion of performance.
Just message and wiring replication, no update/recovery for new
members joining, a single CPG dispatch thread on backups, no
failover, no transactions.

** Failover
Electing a primary, backups redirect to the primary. Measure failover
time for a large number of clients. Strategies to minimize the number
of retries after a failure.

** Flow Control
Keep internal queues from overflowing. Similar to internal flow
control in the old cluster. Needed for realistic performance/stress
tests.

** Concurrency
Experiment with multiple threads on backups, multiple CPG groups.

** Recovery/new member joining
Initial status handshake for new members. Recovering queues from the back.

** Transactions
TODO: how to implement transactions with concurrency. Worst solution:
a global --cluster-use-transactions flag that forces single-threaded
mode. Need to find a better solution.
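To make the "recovering queues from the back" step (and the live
update described under "New members joining") a little more concrete,
here is a minimal sketch. The types and the alreadyDequeued() check
are hypothetical; a real implementation would hook into the broker's
queue and replication machinery.

#+BEGIN_SRC C++
// Sketch of a "live" update: send pre-existing messages to a new backup
// from the back of the queue towards the front, skipping anything that
// clients have already consumed (those dequeues are replicated separately).
#include <cstdint>
#include <deque>

struct QueuedMessage { std::uint64_t position; /* payload omitted */ };

// Stand-in for the joining broker.
struct NewBackup {
    void sendExisting(const QueuedMessage& m);
};

class LiveQueueUpdater {
  public:
    // 'snapshot' is the queue content when the update starts, ordered
    // front (oldest, lowest position) to back (newest, highest position).
    void update(const std::deque<QueuedMessage>& snapshot, NewBackup& backup) {
        for (auto i = snapshot.rbegin(); i != snapshot.rend(); ++i) {
            // Consumers take from the front (lowest positions first); once we
            // meet a consumed position, everything nearer the front is
            // consumed too, so the update is done.
            if (alreadyDequeued(i->position)) break;
            backup.sendExisting(*i);
        }
    }
  private:
    bool alreadyDequeued(std::uint64_t position) const;  // hypothetical dequeue tracking
};
#+END_SRC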