Merge branch 'qpid-3603-4-rebase' into qpid-3603-5qpid-3603-5

git-svn-id: https://svn.apache.org/repos/asf/qpid/branches/qpid-3603-5@1243748 13f79535-47bb-0310-9956-ffa450edef68
author: Alan Conway <aconway@apache.org> 2012-02-13 23:50:18 +0000
committer: Alan Conway <aconway@apache.org> 2012-02-13 23:50:18 +0000
commit: 057a0635e081848ba59964c4a8e0923e521b7fe6 (patch)
tree: cdad9578140d5dbdb2e90525e2ba9cc870ca50b6 /qpid/cpp/design_docs/new-ha-design.txt
parent: cf61dbf9b313f9bd69b392eae8fd8d27d4e609a2 (diff)
download: qpid-python-057a0635e081848ba59964c4a8e0923e521b7fe6.tar.gz
1 files changed, 418 insertions, 0 deletions
diff --git a/qpid/cpp/design_docs/new-ha-design.txt b/qpid/cpp/design_docs/new-ha-design.txt
new file mode 100644
index 0000000000..8173585283
--- /dev/null
+++ b/qpid/cpp/design_docs/new-ha-design.txt
@@ -0,0 +1,418 @@
+-*-org-*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+* An active-passive, hot-standby design for Qpid clustering.
+
+This document describes an active-passive approach to HA based on
+queue browsing to replicate message data.
+
+See [[./old-cluster-issues.txt]] for issues with the old design.
+
+** Active-active vs. active-passive (hot-standby)
+
+An active-active cluster allows clients to connect to any broker in
+the cluster. If a broker fails, clients can fail-over to any other
+live broker.
+
+A hot-standby cluster has only one active broker at a time (the
+"primary") and one or more brokers on standby (the "backups"). Clients
+are only served by the primary, clients that connect to a backup are
+redirected to the primary. The backups are kept up-to-date in real
+time by the primary, if the primary fails a backup is elected to be
+the new primary.
+
+The main problem with active-active is co-ordinating consumers of the
+same queue on multiple brokers such that there are no duplicates in
+normal operation. There are 2 approaches:
+
+Predictive: each broker predicts which messages others will take. This
+the main weakness of the old design so not appealing.
+
+Locking: brokers "lock" a queue in order to take messages. This is
+complex to implement and it is not straighforward to determine the
+best strategy for passing the lock. In tests to date it results in
+very high latencies (10x standalone broker).
+
+Hot-standby removes this problem. Only the primary can modify queues
+so it just has to tell the backups what it is doing, there's no
+locking.
+
+The primary can enqueue messages and replicate asynchronously -
+exactly like the store does, but it "writes" to the replicas over the
+network rather than writing to disk.
+
+** Replicating browsers
+
+The unit of replication is a replicating browser. This is an AMQP
+consumer that browses a remote queue via a federation link and
+maintains a local replica of the queue. As well as browsing the remote
+messages as they are added the browser receives dequeue notifications
+when they are dequeued remotely.
+
+On the primary broker incoming mesage transfers are completed only when
+all of the replicating browsers have signaled completion. Thus a completed
+message is guaranteed to be on the backups.
+
+** Failover and Cluster Resource Managers
+
+We want to delegate the failover management to an existing cluster
+resource manager. Initially this is rgmanager from Cluster Suite, but
+other managers (e.g. PaceMaker) could be supported in future.
+
+Rgmanager takes care of starting and stopping brokers and informing
+brokers of their roles as primary or backup. It ensures there's
+exactly one primary broker running at any time. It also tracks quorum
+and protects against split-brain.
+
+Rgmanger can also manage a virtual IP address so clients can just
+retry on a single address to fail over. Alternatively we will also
+support configuring a fixed list of broker addresses when qpid is run
+outside of a resource manager.
+
+Aside: Cold-standby is also possible using rgmanager with shared
+storage for the message store (e.g. GFS). If the broker fails, another
+broker is started on a different node and and recovers from the
+store. This bears investigation but the store recovery times are
+likely too long for failover.
+
+** Replicating configuration
+
+New queues and exchanges and their bindings also need to be replicated.
+This is done by a QMF client that registers for configuration changes
+on the remote broker and mirrors them in the local broker.
+
+** Use of CPG (openais/corosync)
+
+CPG is not required in this model, an external cluster resource
+manager takes care of membership and quorum.
+
+** Selective replication
+
+In this model it's easy to support selective replication of individual queues via
+configuration.
+
+Explicit exchange/queue qpid.replicate argument:
+- none: the object is not replicated
+- configuration: queues, exchanges and bindings are replicated but messages are not.
+- messages: configuration and messages are replicated.
+
+TODO: provide configurable default for qpid.replicate
+
+[GRS: current prototype relies on queue sequence for message identity
+so selectively replicating certain messages on a given queue would be
+challenging. Selectively replicating certain queues however is trivial.]
+
+** Inconsistent errors
+
+The new design eliminates most sources of inconsistent errors in the
+old design (connections, sessions, security, management etc.) and
+eliminates the need to stall the whole cluster till an error is
+resolved. We still have to handle inconsistent store errors when store
+and cluster are used together.
+
+We have 2 options (configurable) for handling inconsistent errors,
+on the backup that fails to store a message from primary we can:
+- Abort the backup broker allowing it to be re-started.
+- Raise a critical error on the backup broker but carry on with the message lost.
+We can configure the option to abort or carry on per-queue, we
+will also provide a broker-wide configurable default.
+
+** New backups connecting to primary.
+
+When the primary fails, one of the backups becomes primary and the
+others connect to the new primary as backups.
+
+The backups can take advantage of the messages they already have
+backed up, the new primary only needs to replicate new messages.
+
+To keep the N-way guarantee the primary needs to delay completion on
+new messages until all the back-ups have caught up. However if a
+backup does not catch up within some timeout, it is considered dead
+and its messages are completed so the cluster can carry on with N-1
+members.
+
+
+** Broker discovery and lifecycle.
+
+The cluster has a client URL that can contain a single virtual IP
+address or a list of real IP addresses for the cluster.
+
+In backup mode, brokers reject connections normal client connections
+so clients will fail over to the primary. HA admin tools mark their
+connections so they are allowed to connect to backup brokers.
+
+Clients discover the primary by re-trying connection to the client URL
+until the successfully connect to the primary. In the case of a
+virtual IP they re-try the same address until it is relocated to the
+primary. In the case of a list of IPs the client tries each in
+turn. Clients do multiple retries over a configured period of time
+before giving up.
+
+Backup brokers discover the primary in the same way as clients. There
+is a separate broker URL for brokers since they often will connect
+over a different network. The broker URL has to be a list of real
+addresses rather than a virtual address.
+
+Brokers have the following states:
+- connecting: Backup broker trying to connect to primary - loops retrying broker URL.
+- catchup: Backup connected to primary, catching up on pre-existing configuration & messages.
+- ready: Backup fully caught-up, ready to take over as primary.
+- primary: Acting as primary, serving clients.
+
+** Interaction with rgmanager
+
+rgmanager interacts with qpid via 2 service scripts: backup &
+primary. These scripts interact with the underlying qpidd
+service. rgmanager picks the new primary when the old primary
+fails. In a partition it also takes care of killing inquorate brokers.
+
+*** Initial cluster start
+
+rgmanager starts the backup service on all nodes and the primary service on one node.
+
+On the backup nodes qpidd is in the connecting state. The primary node goes into
+the primary state. Backups discover the primary, connect and catch up.
+
+*** Failover
+
+primary broker or node fails. Backup brokers see disconnect and go
+back to connecting mode.
+
+rgmanager notices the failure and starts the primary service on a new node.
+This tells qpidd to go to primary mode. Backups re-connect and catch up.
+
+The primary can only be started on nodes where there is a ready backup service.
+If the backup is catching up, it's not eligible to take over as primary.
+
+*** Failback
+
+Cluster of N brokers has suffered a failure, only N-1 brokers
+remain. We want to start a new broker (possibly on a new node) to
+restore redundancy.
+
+If the new broker has a new IP address, the sysadmin pushes a new URL
+to all the existing brokers.
+
+The new broker starts in connecting mode. It discovers the primary,
+connects and catches up.
+
+*** Failure during failback
+
+A second failure occurs before the new backup B completes its catch
+up. The backup B refuses to become primary by failing the primary
+start script if rgmanager chooses it, so rgmanager will try another
+(hopefully caught-up) backup to become primary.
+
+*** Backup failure
+
+If a backup fails it is re-started. It connects and catches up from scratch
+to become a ready backup.
+
+** Interaction with the store.
+
+Clean shutdown: entire cluster is shut down cleanly by an admin tool:
+- primary stops accepting client connections till shutdown is complete.
+- backups come fully up to speed with primary state.
+- all shut down marking stores as 'clean' with an identifying  UUID.
+
+After clean shutdown the cluster can re-start automatically as all nodes
+have equivalent stores. Stores starting up with the wrong UUID will fail.
+
+Stored status: clean(UUID)/dirty, primary/backup, generation number.
+- All stores are marked dirty except after a clean shutdown.
+- Generation number: passed to backups and incremented by new primary.
+
+After total crash must manually identify the "best" store, provide admin tool.
+Best = highest generation number among stores in primary state.
+
+Recovering from total crash: all brokers will refuse to start as all stores are dirty.
+Check the stores manually to find the best one, then either:
+1. Copy stores:
+ - copy good store to all hosts
+ - restart qpidd on all hosts.
+2. Erase stores:
+ - Erase the store on all other hosts.
+ - Restart qpidd on the good store and wait for it to become primary.
+ - Restart qpidd on all other hosts.
+
+Broker startup with store:
+- Dirty: refuse to start
+- Clean:
+  - Start and load from store.
+  - When connecting as backup, check UUID matches primary, shut down if not.
+- Empty: start ok, no UUID check with primary.
+
+** Current Limitations
+
+(In no particular order at present)
+
+For message replication:
+
+LM1 - The re-synchronisation does not handle the case where a newly elected
+primary is *behind* one of the other backups. To address this I propose
+a new event for restting the sequence that the new primary would send
+out on detecting that a replicating browser is ahead of it, requesting
+that the replica revert back to a particular sequence number. The
+replica on receiving this event would then discard (i.e. dequeue) all
+the messages ahead of that sequence number and reset the counter to
+correctly sequence any subsequently delivered messages.
+
+LM2 - There is a need to handle wrap-around of the message sequence to avoid
+confusing the resynchronisation where a replica has been disconnected
+for a long time, sufficient for the sequence numbering to wrap around.
+
+LM3 - Transactional changes to queue state are not replicated atomically.
+
+LM4 - Acknowledgements are confirmed to clients before the message has been
+dequeued from replicas or indeed from the local store if that is
+asynchronous.
+
+LM5 - During failover, messages (re)published to a queue before there are
+the requisite number of replication subscriptions established will be
+confirmed to the publisher before they are replicated, leaving them
+vulnerable to a loss of the new primary before they are replicated.
+
+LM6 - persistence: In the event of a total cluster failure there are
+no tools to automatically identify the "latest" store. Once this
+is manually identfied, all other stores belonging to cluster members
+must be erased and the latest node must started as primary.
+
+LM6 - persistence: In the event of a single node failure, that nodes
+store must be erased before it can re-join the cluster.
+
+For configuration propagation:
+
+LC2 - Queue and exchange propagation is entirely asynchronous. There
+are three cases to consider here for queue creation:
+
+(a) where queues are created through the addressing syntax supported
+the messaging API, they should be recreated if needed on failover and
+message replication if required is dealt with seperately;
+
+(b) where queues are created using configuration tools by an
+administrator or by a script they can query the backups to verify the
+config has propagated and commands can be re-run if there is a failure
+before that;
+
+(c) where applications have more complex programs on which
+queues/exchanges are created using QMF or directly via 0-10 APIs, the
+completion of the command will not guarantee that the command has been
+carried out on other nodes.
+
+I.e. case (a) doesn't require anything (apart from LM5 in some cases),
+case (b) can be addressed in a simple manner through tooling but case
+(c) would require changes to the broker to allow client to simply
+determine when the command has fully propagated.
+
+LC3 - Queues that are not in the query response received when a
+replica establishes a propagation subscription but exist locally are
+not deleted. I.e. Deletion of queues/exchanges while a replica is not
+connected will not be propagated. Solution is to delete any queues
+marked for propagation that exist locally but do not show up in the
+query response.
+
+LC4 - It is possible on failover that the new primary did not
+previously receive a given QMF event while a backup did (sort of an
+analogous situation to LM1 but without an easy way to detect or remedy
+it).
+
+LC5 - Need richer control over which queues/exchanges are propagated, and
+which are not.
+
+LC6 - The events and query responses are not fully synchronized.
+
+      In particular it *is* possible to not receive a delete event but
+      for the deleted object to still show up in the query response
+      (meaning the deletion is 'lost' to the update).
+
+      It is also possible for an create event to be received as well
+      as the created object being in the query response. Likewise it
+      is possible to receive a delete event and a query response in
+      which the object no longer appears. In these cases the event is
+      essentially redundant.
+
+      It is not possible to miss a create event and yet not to have
+      the object in question in the query response however.
+
+
+* Benefits compared to previous cluster implementation.
+
+- Does not depend on openais/corosync, does not require multicast.
+- Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
+- Can be ported to/implemented in other environments: e.g. Java, Windows
+- Disaster Recovery is just another backup, no need for separate queue replication mechanism.
+- Can take advantage of resource manager features, e.g. virtual IP addresses.
+- Fewer inconsistent errors (store failures) that can be handled without killing brokers.
+- Improved performance
+* User Documentation Notes
+
+Notes to seed initial user documentation. Loosely tracking the implementation,
+some points mentioned in the doc may not be implemented yet.
+
+** High Availability Overview
+
+HA is implemented using a 'hot standby' approach. Clients are directed
+to a single "primary" broker. The primary executes client requests and
+also replicates them to one or more "backup" brokers. If the primary
+fails, one of the backups takes over the role of primary carrying on
+from where the primary left off. Clients will fail over to the new
+primary automatically and continue their work.
+
+TODO: at least once, deduplication.
+
+** Enabling replication on the client.
+
+To enable replication set the qpid.replicate argument when creating a
+queue or exchange.
+
+This can have one of 3 values
+- none: the object is not replicated
+- configuration: queues, exchanges and bindings are replicated but messages are not.
+- messages: configuration and messages are replicated.
+
+TODO: examples
+TODO: more options for default value of qpid.replicate
+
+A HA client connection has multiple addresses, one for each broker. If
+the it fails to connect to an address, or the connection breaks,
+it will automatically fail-over to another address.
+
+Only the primary broker accepts connections, the backup brokers
+redirect connection attempts to the primary. If the primary fails, one
+of the backups is promoted to primary and clients fail-over to the new
+primary.
+
+TODO: using multiple-address connections, examples c++, python, java.
+
+TODO: dynamic cluster addressing?
+
+TODO: need de-duplication.
+
+** Enabling replication on the broker.
+
+Network topology: backup links, separate client/broker networks.
+Describe failover mechanisms.
+- Client view: URLs, failover, exclusion & discovery.
+- Broker view: similar.
+Role of rmganager
+
+** Configuring rgmanager
+
+** Configuring qpidd
+
+
author	Alan Conway <aconway@apache.org>	2012-02-13 23:50:18 +0000
committer	Alan Conway <aconway@apache.org>	2012-02-13 23:50:18 +0000
commit	057a0635e081848ba59964c4a8e0923e521b7fe6 (patch)
tree	cdad9578140d5dbdb2e90525e2ba9cc870ca50b6 /qpid/cpp/design_docs/new-ha-design.txt
parent	cf61dbf9b313f9bd69b392eae8fd8d27d4e609a2 (diff)
download	qpid-python-057a0635e081848ba59964c4a8e0923e521b7fe6.tar.gz