Diffstat (limited to 'qpid/cpp/design_docs/new-ha-design.txt')
-rw-r--r--  qpid/cpp/design_docs/new-ha-design.txt | 353
1 file changed, 353 insertions, 0 deletions
diff --git a/qpid/cpp/design_docs/new-ha-design.txt b/qpid/cpp/design_docs/new-ha-design.txt
new file mode 100644
index 0000000000..9b6d7d676c
--- /dev/null
+++ b/qpid/cpp/design_docs/new-ha-design.txt
@@ -0,0 +1,353 @@

-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

* An active-passive, hot-standby design for Qpid clustering.

For some background see [[./new-cluster-design.txt]], which describes the
issues with the old design and a new active-active design that could
replace it.

This document describes an alternative active-passive approach based on
queue browsing to replicate message data.

** Active-active vs. active-passive (hot-standby)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are only served by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real
time by the primary; if the primary fails, a backup is elected to be
the new primary.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages others will take. This
was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement and it is not straightforward to determine the
best strategy for passing the lock. In tests to date it results in
very high latencies (10x a standalone broker).

Hot-standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there is no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.

** Replicating browsers

The unit of replication is a replicating browser. This is an AMQP
consumer that browses a remote queue via a federation link and
maintains a local replica of the queue. As well as browsing the remote
messages as they are added, the browser receives dequeue notifications
when they are dequeued remotely.

On the primary broker incoming message transfers are completed only when
all of the replicating browsers have signaled completion. Thus a completed
message is guaranteed to be on the backups.
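The replicating browser itself lives inside the broker and also consumes
dequeue notifications over the federation link, but the browsing half of
the idea can be sketched with the Python messaging API. This is only an
illustration: the host names and queue name are made up, and the dequeue
notifications and completion handshake described above are not covered.

#+begin_src python
# Sketch only: mirror the current contents of a queue on a remote (primary)
# broker into a local queue by browsing it, i.e. reading without acquiring.
# Host names and the queue name are illustrative.
from qpid.messaging import Connection, Empty

primary = Connection.establish("primary-host:5672")
backup = Connection.establish("localhost:5672")
try:
    browse = primary.session().receiver("my-queue; {mode: browse}")
    mirror = backup.session().sender("my-queue; {create: always}")
    while True:
        try:
            msg = browse.fetch(timeout=1)  # browse: message stays on the primary
        except Empty:
            break                          # nothing more to mirror right now
        mirror.send(msg)                   # append a copy to the local replica
finally:
    primary.close()
    backup.close()
#+end_src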
** Failover and Cluster Resource Managers

We want to delegate the failover management to an existing cluster
resource manager. Initially this is rgmanager from Cluster Suite, but
other managers (e.g. Pacemaker) could be supported in future.

Rgmanager takes care of starting and stopping brokers and informing
brokers of their roles as primary or backup. It ensures there is
exactly one primary broker running at any time. It also tracks quorum
and protects against split-brain.

Rgmanager can also manage a virtual IP address so clients can just
retry on a single address to fail over. Alternatively we will also
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.

Aside: cold-standby is also possible using rgmanager with shared
storage for the message store (e.g. GFS). If the broker fails, another
broker is started on a different node and recovers from the
store. This bears investigation, but the store recovery times are
likely too long for failover.

** Replicating wiring

New queues and exchanges and their bindings also need to be replicated.
This is done by a QMF client that registers for wiring changes
on the remote broker and mirrors them in the local broker.

** Use of CPG

CPG is not required in this model; an external cluster resource
manager takes care of membership and quorum.

** Selective replication

In this model it is easy to support selective replication of individual
queues via configuration.
- Explicit exchange/queue declare argument and message boolean:
  x-qpid-replicate (see the sketch below this list). Treated analogously
  to the persistent/durable properties for the store.
- If not explicitly marked, provide a choice of default:
  - default is replicate (replicated message on replicated queue)
  - default is don't replicate
  - default is replicate persistent/durable messages.

[GRS: current prototype relies on queue sequence for message identity
so selectively replicating certain messages on a given queue would be
challenging. Selectively replicating certain queues however is trivial.]
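As a rough illustration of the declare-argument option above, a queue
could be marked for replication from the Python messaging API via the
addressing syntax. The argument name, its value and the broker address
below follow the x-qpid-replicate proposal in this section and are
assumptions, not a settled interface.

#+begin_src python
# Illustrative sketch: create a queue carrying the proposed x-qpid-replicate
# declare argument so the HA machinery knows to replicate it.
from qpid.messaging import Connection, Message

conn = Connection.establish("primary-host:5672")
try:
    ssn = conn.session()
    addr = ("important-q; {create: always,"
            " node: {x-declare: {arguments: {'x-qpid-replicate': True}}}}")
    sender = ssn.sender(addr)        # declares the queue with the argument
    sender.send(Message("payload"))  # messages on it would then be replicated
finally:
    conn.close()
#+end_src

An administrator could set the same argument with qpid-config or QMF; the
essential point is that replication is requested with the ordinary declare,
analogous to durability.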
** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster till an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.

We also have to include error handling in the async completion loop to
guarantee N-way at-least-once delivery: we should only report success to
the client when we know the message was replicated and stored on all N-1
backups.

TODO: We have a lot more options than the old cluster, need to figure
out the best approach, or possibly allow multiple approaches. Need to
go through the various failure cases. We may be able to do recovery on a
per-queue basis rather than restarting an entire node.

** New backups joining

New brokers can join the cluster as backups. Note: if the new broker
has a new IP address, then the existing cluster members must be
updated with new client and broker URLs by a sysadmin.

They discover the primary via the broker URL, as described under
"Broker discovery and lifecycle" below.

We should be able to catch up much faster than the old design. A
new backup can catch up ("recover") the current cluster state on a
per-queue basis.
- queues can be updated in parallel
- "live" updates avoid the "endless chase"

During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated to the backup
- clients are consuming messages from the front of the queue, replicated to the backup
- the primary is sending pre-existing messages to the new backup.

The primary sends pre-existing messages in LIFO order - starting from
the back of the queue - while at the same time clients are consuming from
the front. The active consumers actually reduce the amount of work to be
done, as there is no need to replicate messages that are no longer on the
queue.

** Broker discovery and lifecycle

The cluster has a client URL that can contain a single virtual IP
address or a list of real IP addresses for the cluster.

In backup mode, brokers reject connections except from a special
cluster-admin user.

Clients discover the primary by retrying connection to the client URL
until they successfully connect to the primary. In the case of a
virtual IP they retry the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
before giving up (see the sketch after the state list below).

Backup brokers discover the primary in the same way as clients. There
is a separate broker URL for brokers, since they will often connect
over a different network from the clients. The broker URL has to be a
list of IPs rather than one virtual IP so broker members can connect
to each other.

Brokers have the following states:
- connecting: backup broker trying to connect to the primary - loops retrying the broker URL.
- catchup: connected to the primary, catching up on pre-existing wiring & messages.
- backup: fully functional backup.
- primary: acting as primary, serving clients.
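The client side of this discovery is largely a matter of connection
options in the messaging APIs. Below is a minimal sketch using the Python
API with a fixed list of broker addresses; the host names are made up and
the option values shown are illustrative, not a recommended configuration.

#+begin_src python
# Sketch of client-side discovery/failover over a fixed address list.
# With reconnect enabled the client keeps cycling through reconnect_urls
# until it reaches the primary; a backup rejecting the connection simply
# triggers the next retry.
from qpid.messaging import Connection

conn = Connection("broker1.example.com:5672",
                  reconnect=True,
                  reconnect_urls=["broker1.example.com:5672",
                                  "broker2.example.com:5672",
                                  "broker3.example.com:5672"],
                  reconnect_timeout=60)  # give up after a configured period
conn.open()
try:
    ssn = conn.session()
    sender = ssn.sender("my-queue")
    sender.send("hello from a failover-aware client")
finally:
    conn.close()
#+end_src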
** Interaction with rgmanager

rgmanager interacts with qpid via two service scripts: backup &
primary. These scripts interact with the underlying qpidd service.

*** Initial cluster start

rgmanager starts the backup service on all nodes and the primary
service on one node.

On the backup nodes qpidd is in the connecting state. The primary node
goes into the primary state. Backups discover the primary, connect and
catch up.

*** Failover

The primary broker or node fails. Backup brokers see the disconnect and
go back to connecting mode.

rgmanager notices the failure and starts the primary service on a new
node. This tells qpidd to go to primary mode. Backups re-connect and
catch up.

*** Failback

A cluster of N brokers has suffered a failure; only N-1 brokers
remain. We want to start a new broker (possibly on a new node) to
restore redundancy.

If the new broker has a new IP address, the sysadmin pushes a new URL
to all the existing brokers.

The new broker starts in connecting mode. It discovers the primary,
connects and catches up.

*** Failure during failback

A second failure occurs before the new backup B can complete its
catch-up. The backup B refuses to become primary by failing the primary
start script if rgmanager chooses it, so rgmanager will try another
(hopefully caught-up) broker to be primary.

** Interaction with the store

# FIXME aconway 2011-11-16: TBD
- how to identify the "best" store after a total cluster crash.
- best = last to be primary.
- primary "generation": a counter passed to backups and incremented by
  each new primary (illustrative sketch below).

Restart after crash:
- clean/dirty flag on disk for admin shutdown vs. crash
- dirty brokers refuse to be primary
- sysadmin tells the best broker to be primary
- erase stores? delay loading?
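The generation counter and clean/dirty flag above could be combined to
pick the store to recover from after a total cluster crash. The sketch
below is purely illustrative: the marker format, its location and the
selection rule are assumptions made for this note, not part of the design.

#+begin_src python
# Illustrative only: each broker is assumed to persist a small marker with
# its store, recording the last primary "generation" it saw and whether it
# was shut down cleanly. After a total crash, a clean store with the
# highest generation is the best candidate to recover from.
import json

def read_marker(path):
    """Load a marker such as {"generation": 7, "clean": true}."""
    with open(path) as f:
        return json.load(f)

def pick_best_store(markers):
    """Return the marker of the store to recover from."""
    clean = [m for m in markers if m["clean"]]  # dirty brokers refuse to be primary
    if not clean:
        raise RuntimeError("no clean store; sysadmin intervention required")
    return max(clean, key=lambda m: m["generation"])  # best = last to be primary
#+end_src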
** Current Limitations

(In no particular order at present.)

For message replication:

LM1 - The re-synchronisation does not handle the case where a newly
elected master is *behind* one of the other backups. To address this I
propose a new event for resetting the sequence, which the new master
would send out on detecting that a replicating browser is ahead of it,
requesting that the replica revert back to a particular sequence
number. The replica, on receiving this event, would then discard
(i.e. dequeue) all the messages ahead of that sequence number and reset
the counter to correctly sequence any subsequently delivered messages.

LM2 - There is a need to handle wrap-around of the message sequence, to
avoid confusing the resynchronisation where a replica has been
disconnected for a long time, sufficient for the sequence numbering to
wrap around.

LM3 - Transactional changes to queue state are not replicated
atomically.

LM4 - Acknowledgements are confirmed to clients before the message has
been dequeued from replicas, or indeed from the local store if that is
asynchronous.

LM5 - During failover, messages (re)published to a queue before the
requisite number of replication subscriptions are established will be
confirmed to the publisher before they are replicated, leaving them
vulnerable to a loss of the new master before they are replicated.

For configuration propagation:

LC1 - Bindings aren't propagated, only queues and exchanges.

LC2 - Queue and exchange propagation is entirely asynchronous. There
are three cases to consider here for queue creation: (a) where queues
are created through the addressing syntax supported by the messaging
API, they should be recreated if needed on failover, and message
replication, if required, is dealt with separately; (b) where queues are
created using configuration tools by an administrator or by a script,
they can query the backups to verify the config has propagated, and
commands can be re-run if there is a failure before that; (c) where
applications have more complex programs in which queues/exchanges are
created using QMF or directly via the 0-10 APIs, the completion of the
command will not guarantee that the command has been carried out on
other nodes. I.e. case (a) doesn't require anything (apart from LM5 in
some cases), case (b) can be addressed in a simple manner through
tooling, but case (c) would require changes to the broker to allow a
client to simply determine when the command has fully propagated.

LC3 - Queues that are not in the query response received when a
replica establishes a propagation subscription but exist locally are
not deleted. I.e. deletion of queues/exchanges while a replica is not
connected will not be propagated. The solution is to delete any queues
marked for propagation that exist locally but do not show up in the
query response.

LC4 - It is possible on failover that the new master did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1, but without an easy way to detect or remedy
it).

LC5 - Need richer control over which queues/exchanges are propagated,
and which are not.

LC6 - The events and query responses are not fully synchronized.

  In particular it *is* possible to not receive a delete event but
  for the deleted object to still show up in the query response
  (meaning the deletion is 'lost' to the update).

  It is also possible for a create event to be received as well
  as the created object being in the query response. Likewise it
  is possible to receive a delete event and a query response in
  which the object no longer appears. In these cases the event is
  essentially redundant.

  It is not possible, however, to miss a create event and yet not have
  the object in question in the query response.

* User Documentation Notes

Notes to seed initial user documentation. These loosely track the
implementation, so some points mentioned in the doc may not be
implemented yet.

** High Availability Overview
Explain basic concepts: hot standby, primary/backup, replicated queue/exchange.
Network topology: backup links, corosync, separate client/cluster networks.
Describe failover mechanisms.
- Client view: URLs, failover, exclusion & discovery.
- Broker view: similar.
Role of rgmanager & corosync.

** Client view
Clients use a multi-address URL in the base case.
Clients can't connect to backups; they retry till they find the primary.
Only qpid.cluster-admin can connect to a backup, and must not mess with replicated queues.
Note that connection known-hosts returns the client URL, as does the amq.failover exchange.

Creating replicated queues & exchanges:
- qpid.replicate argument
- examples using addressing and qpid-config

** Configuring corosync
Must be on the same network as the backup links.

** Configuring rgmanager

** Configuring qpidd
HA-related options.