Diffstat (limited to 'qpid/cpp/docs/design/new-ha-design.txt')
-rw-r--r--  qpid/cpp/docs/design/new-ha-design.txt  304
1 files changed, 304 insertions, 0 deletions
diff --git a/qpid/cpp/docs/design/new-ha-design.txt b/qpid/cpp/docs/design/new-ha-design.txt
new file mode 100644
index 0000000000..df6c7242eb
--- /dev/null
+++ b/qpid/cpp/docs/design/new-ha-design.txt
@@ -0,0 +1,304 @@

-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

* An active-passive, hot-standby design for Qpid clustering.

This document describes an active-passive approach to HA based on queue browsing to replicate message data.

See [[./old-cluster-issues.txt]] for issues with the old design.

** Active-active vs. active-passive (hot-standby)

An active-active cluster allows clients to connect to any broker in the cluster. If a broker fails, clients can fail over to any other live broker.

A hot-standby cluster has only one active broker at a time (the "primary") and one or more brokers on standby (the "backups"). Clients are served only by the primary; clients that connect to a backup are redirected to the primary. The backups are kept up to date in real time by the primary. If the primary fails, a backup is elected to be the new primary.

The main problem with active-active is co-ordinating consumers of the same queue on multiple brokers such that there are no duplicates in normal operation. There are two approaches:

Predictive: each broker predicts which messages others will take. This was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is complex to implement and it is not straightforward to determine the best strategy for passing the lock. In tests to date it results in very high latencies (10x a standalone broker).

Hot-standby removes this problem. Only the primary can modify queues, so it just has to tell the backups what it is doing; there is no locking.

The primary can enqueue messages and replicate asynchronously - exactly like the store does, but it "writes" to the replicas over the network rather than writing to disk.

** Replicating browsers

The unit of replication is a replicating browser. This is an AMQP consumer that browses a remote queue via a federation link and maintains a local replica of the queue. As well as browsing the remote messages as they are added, the browser receives dequeue notifications when they are dequeued remotely.

On the primary broker, incoming message transfers are completed only when all of the replicating browsers have signaled completion. Thus a completed message is guaranteed to be on the backups.
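The completion rule can be sketched as follows. This is a minimal illustration in Python with hypothetical names (the broker itself is C++); it is not the actual implementation, only the bookkeeping the primary needs: hold back completion of a transfer until every replicating browser has confirmed it, and release pending completions if a backup is declared dead.

#+BEGIN_SRC python
class ReplicationTracker:
    """Tracks which backups still owe a confirmation for each message."""

    def __init__(self, backups):
        self.backups = set(backups)     # names of live replicating browsers
        self.outstanding = {}           # msg_id -> (waiting backups, callback)

    def message_received(self, msg_id, complete_to_client):
        """Primary enqueued a new message; hold its completion."""
        if not self.backups:
            complete_to_client(msg_id)  # no backups: complete immediately
            return
        self.outstanding[msg_id] = (set(self.backups), complete_to_client)

    def backup_confirmed(self, backup, msg_id):
        """A replicating browser signalled completion for msg_id."""
        waiting, complete_to_client = self.outstanding[msg_id]
        waiting.discard(backup)
        if not waiting:                 # all backups have the message
            del self.outstanding[msg_id]
            complete_to_client(msg_id)  # now safe to complete to the client

    def backup_dead(self, backup):
        """A backup failed or did not catch up in time: drop it so the
        cluster carries on with N-1 members."""
        self.backups.discard(backup)
        for msg_id in list(self.outstanding):
            self.backup_confirmed(backup, msg_id)
#+END_SRC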
** Failover and Cluster Resource Managers

We want to delegate failover management to an existing cluster resource manager. Initially this is rgmanager from Cluster Suite, but other managers (e.g. PaceMaker) could be supported in future.

Rgmanager takes care of starting and stopping brokers and informing brokers of their roles as primary or backup. It ensures there is exactly one primary broker running at any time. It also tracks quorum and protects against split-brain.

Rgmanager can also manage a virtual IP address so clients can just retry on a single address to fail over. Alternatively we will also support configuring a fixed list of broker addresses when qpid is run outside of a resource manager.

** Replicating configuration

New queues and exchanges and their bindings also need to be replicated. This is done by a QMF client that registers for configuration changes on the remote broker and mirrors them in the local broker.

** Use of CPG (openais/corosync)

CPG is not required in this model; an external cluster resource manager takes care of membership and quorum.

** Selective replication

In this model it is easy to support selective replication of individual queues via configuration.

Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated.
- configuration: queues, exchanges and bindings are replicated but messages are not.
- all: configuration and messages are replicated.

The default (all/configuration/none) is configurable.

** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the old design (connections, sessions, security, management etc.) and eliminates the need to stall the whole cluster until an error is resolved. We still have to handle inconsistent store errors when store and cluster are used together.

We have three options (configurable) for handling inconsistent errors. On a backup that fails to store a message from the primary we can:
- Abort the backup broker, allowing it to be re-started.
- Raise a critical error on the backup broker but carry on with the message lost.
- Reset and re-try replication for just the affected queue.

We will provide some configurable options in this regard.

** New backups connecting to primary.

When the primary fails, one of the backups becomes primary and the others connect to the new primary as backups.

The backups can take advantage of the messages they already have backed up; the new primary only needs to replicate new messages.

To keep the N-way guarantee the primary needs to delay completion on new messages until all the backups have caught up. However, if a backup does not catch up within some timeout, it is considered dead and its messages are completed so the cluster can carry on with N-1 members.

** Broker discovery and lifecycle.

The cluster has a client URL that can contain a single virtual IP address or a list of real IP addresses for the cluster.

In backup mode, brokers reject normal client connections so clients will fail over to the primary. HA admin tools mark their connections so they are allowed to connect to backup brokers.

Clients discover the primary by re-trying connection to all addresses in the client URL until they successfully connect to the primary. In the case of a virtual IP they re-try the same address until it is relocated to the primary. In the case of a list of IPs the client tries each in turn. Clients do multiple retries over a configured period of time before giving up.

Backup brokers discover the primary in the same way as clients. There is a separate broker URL for brokers since they will often connect over a different network. The broker URL has to be a list of real addresses rather than a virtual address.
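The discovery loop used by both clients and backup brokers amounts to the following. This is a minimal Python sketch under assumed names (connect() is a hypothetical helper that raises when a node refuses the connection); in practice clients would normally rely on their messaging library's reconnect/failover options rather than hand-rolling this.

#+BEGIN_SRC python
import time

def find_primary(addresses, connect, retry_interval=2.0, give_up_after=60.0):
    """Try every address in the client (or broker) URL until the primary
    accepts a connection. Backups reject normal connections, so a refusal
    just means "try the next address"; with a virtual IP the list has a
    single entry that is retried until it is relocated to the primary."""
    deadline = time.time() + give_up_after
    while time.time() < deadline:
        for addr in addresses:
            try:
                return connect(addr)        # only the primary accepts
            except ConnectionError:
                continue                    # backup or dead node, try the next
        time.sleep(retry_interval)          # full pass failed, wait and retry
    raise TimeoutError("no primary found within the configured period")
#+END_SRC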
** Interaction with rgmanager

rgmanager interacts with qpid via two service scripts: backup and primary. These scripts interact with the underlying qpidd service. rgmanager picks the new primary when the old primary fails. In a partition it also takes care of killing inquorate brokers.

*** Initial cluster start

rgmanager starts the backup service on all nodes and the primary service on one node.

On the backup nodes qpidd is in the connecting state. The primary node goes into the primary state. Backups discover the primary, connect and catch up.

*** Failover

The primary broker or node fails. Backup brokers see the disconnect and start trying to re-connect to the new primary.

rgmanager notices the failure and starts the primary service on a new node. This tells qpidd to go to primary mode. Backups re-connect and catch up.

The primary can only be started on nodes where there is a ready backup service. If the backup is catching up, it is not eligible to take over as primary.

*** Failback

A cluster of N brokers has suffered a failure; only N-1 brokers remain. We want to start a new broker (possibly on a new node) to restore redundancy.

If the new broker has a new IP address, the sysadmin pushes a new URL to all the existing brokers.

The new broker starts in connecting mode. It discovers the primary, connects and catches up.

*** Failure during failback

A second failure occurs before the new backup B completes its catch-up. The backup B refuses to become primary by failing the primary start script if rgmanager chooses it, so rgmanager will try another (hopefully caught-up) backup to become primary.

*** Backup failure

If a backup fails it is re-started. It connects and catches up from scratch to become a ready backup.

** Interaction with the store.

Needs more detail:

We want backup brokers to be able to use their stored messages on restart so they don't have to download everything from the primary. This requires an HA sequence number to be stored with each message so the backup can identify which messages it has in common with the primary.

This will work very similarly to the way live backups can use in-memory messages to reduce the download.

We need to determine which broker is chosen as initial primary based on the currency of the stores, probably using stored generation numbers and status flags. Ideally this is automated with rgmanager, or some intervention might be required.
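As a rough illustration of the sequence-number idea (a sketch with assumed data structures, not a design decision): given the set of HA sequence numbers a restarted backup recovered from its store and the messages currently on the primary's queue, the primary can work out what it must send and what the backup must dequeue locally.

#+BEGIN_SRC python
def catch_up_plan(primary_msgs, backup_seqs):
    """primary_msgs: {ha_seq: message} currently on the primary's queue.
    backup_seqs: set of HA sequence numbers the backup recovered from its
    store for the same queue."""
    # messages the backup does not have must be replicated to it
    to_send = {seq: msg for seq, msg in primary_msgs.items()
               if seq not in backup_seqs}
    # messages the backup still holds but the primary no longer does were
    # dequeued while the backup was down, so the backup dequeues them too
    to_dequeue = backup_seqs - set(primary_msgs)
    return to_send, to_dequeue
#+END_SRC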
* Current Limitations

(In no particular order at present.)

For message replication (missing items have been fixed):

LM3 - Transactional changes to queue state are not replicated atomically.

LM4 - (No worse than the store) Acknowledgements are confirmed to clients before the message has been dequeued from replicas, or indeed from the local store if that is asynchronous.

LM6 - Persistence: in the event of a total cluster failure there are no tools to automatically identify the "latest" store.

LM7 - Persistence: in the event of a persistent broker being re-started (due to failure or admin) it should be able to use its stored messages to reduce the download required from the primary. This means storing message IDs persistently.

For configuration propagation:

LC2 - Queue and exchange propagation is entirely asynchronous. There are three cases to consider here for queue creation:

(a) where queues are created through the addressing syntax supported by the messaging API, they should be recreated if needed on failover, and message replication, if required, is dealt with separately;

(b) where queues are created using configuration tools by an administrator or by a script, they can query the backups to verify the config has propagated, and commands can be re-run if there is a failure before that;

(c) where applications are more complex programs in which queues/exchanges are created using QMF or directly via the 0-10 APIs, the completion of the command will not guarantee that the command has been carried out on other nodes.

I.e. case (a) doesn't require anything (apart from LM5 in some cases), case (b) can be addressed in a simple manner through tooling, but case (c) would require changes to the broker to allow a client to simply determine when the command has fully propagated.

LC4 - It is possible on failover that the new primary did not previously receive a given QMF event while a backup did (sort of an analogous situation to LM1, but without an easy way to detect or remedy it).

LC6 - The events and query responses are not fully synchronized.

  In particular it *is* possible to not receive a delete event but for the deleted object to still show up in the query response (meaning the deletion is 'lost' to the update).

  It is also possible for a create event to be received as well as the created object being in the query response. Likewise it is possible to receive a delete event and a query response in which the object no longer appears. In these cases the event is essentially redundant.

  It is not possible, however, to miss a create event and yet not have the object in question in the query response.

LC7 - Federated links from the primary will be lost in failover; they will not be re-connected on the new primary. Federation links to the primary can fail over.

LC9 - The "last man standing" feature of the old cluster is not available.

* Benefits compared to previous cluster implementation.

- Allows per queue/exchange control over what is replicated.
- Does not depend on openais/corosync and does not require multicast.
- Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
- Can be ported to/implemented in other environments: e.g. Java, Windows.
- Disaster Recovery is just another backup; no need for a separate queue replication mechanism.
- Can take advantage of resource manager features, e.g. virtual IP addresses.
- Fewer inconsistent errors; those that remain (store failures) can be handled without killing brokers.
- Improved performance.