author     Alan Conway <aconway@apache.org>   2012-01-25 22:47:01 +0000
committer  Alan Conway <aconway@apache.org>   2012-01-25 22:47:01 +0000
commit     deb1247dfb548ce5907a832a6a7f14fc6f533c0a
tree       a2e4efb1270182a909ca1e533263c830bcf80f2f
parent     03ffe85d335d0e8f66f5afa4eb151417f297c85f
QPID-3603: Update to HA design docs.
git-svn-id: https://svn.apache.org/repos/asf/qpid/branches/qpid-3603-2@1235977 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r--  qpid/cpp/design_docs/new-cluster-design.txt | 63
-rw-r--r--  qpid/cpp/design_docs/new-ha-design.txt      | 82
-rw-r--r--  qpid/cpp/design_docs/old-cluster-issues.txt | 82
3 files changed, 124 insertions(+), 103 deletions(-)
diff --git a/qpid/cpp/design_docs/new-cluster-design.txt b/qpid/cpp/design_docs/new-cluster-design.txt
index 0b015a4570..d29ecce445 100644
--- a/qpid/cpp/design_docs/new-cluster-design.txt
+++ b/qpid/cpp/design_docs/new-cluster-design.txt
@@ -17,69 +17,10 @@
 # under the License.
 
 * A new design for Qpid clustering.
-** Issues with current design.
-The cluster is based on virtual synchrony: each broker multicasts
-events and the events from all brokers are serialized and delivered in
-the same order to each broker.
+** Issues with old cluster design
 
-In the current design raw byte buffers from client connections are
-multicast, serialized and delivered in the same order to each broker.
-
-Each broker has a replica of all queues, exchanges, bindings and also
-all connections & sessions from every broker. Cluster code treats the
-broker as a "black box", it "plays" the client data into the
-connection objects and assumes that by giving the same input, each
-broker will reach the same state.
-
-A new broker joining the cluster receives a snapshot of the current
-cluster state, and then follows the multicast conversation.
-
-*** Maintenance issues.
-
-The entire state of each broker is replicated to every member:
-connections, sessions, queues, messages, exchanges, management objects
-etc. Any discrepancy in the state that affects how messages are
-allocated to consumers can cause an inconsistency.
-
-- Entire broker state must be faithfully updated to new members.
-- Management model also has to be replicated.
-- All queues are replicated, can't have unreplicated queues (e.g. for management)
-
-Events that are not deterministically predictable from the client
-input data stream can cause inconsistencies. In particular use of
-timers/timestamps require cluster workarounds to synchronize.
-
-A member that encounters an error which is not encounted by all other
-members is considered inconsistent and will shut itself down. Such
-errors can come from any area of the broker code, e.g. different
-ACL files can cause inconsistent errors.
-
-The following areas required workarounds to work in a cluster:
-
-- Timers/timestamps in broker code: management, heartbeats, TTL
-- Security: cluster must replicate *after* decryption by security layer.
-- Management: not initially included in the replicated model, source of many inconsistencies.
-
-It is very easy for someone adding a feature or fixing a bug in the
-standalone broker to break the cluster by:
-- adding new state that needs to be replicated in cluster updates.
-- doing something in a timer or other non-connection thread.
-
-It's very hard to test for such breaks. We need a looser coupling
-and a more explicitly defined interface between cluster and standalone
-broker code.
-
-*** Performance issues.
-
-Virtual synchrony delivers all data from all clients in a single
-stream to each broker. The cluster must play this data thru the full
-broker code stack: connections, sessions etc. in a single thread
-context in order to get identical behavior on each broker. The cluster
-has a pipelined design to get some concurrency but this is a severe
-limitation on scalability in multi-core hosts compared to the
-standalone broker which processes each connection in a separate thread
-context.
+See [[./old-cluster-issues.txt]]
 
 ** A new cluster design.
diff --git a/qpid/cpp/design_docs/new-ha-design.txt b/qpid/cpp/design_docs/new-ha-design.txt
index 053dd7227d..18962f8be8 100644
--- a/qpid/cpp/design_docs/new-ha-design.txt
+++ b/qpid/cpp/design_docs/new-ha-design.txt
@@ -18,13 +18,11 @@
 
 * An active-passive, hot-standby design for Qpid clustering.
 
-For some background see [[./new-cluster-design.txt]] which describes the
-issues with the old design and a new active-active design that could
-replace it.
-
-This document describes an alternative active-passive approach based on
+This document describes an active-passive approach to HA based on
 queue browsing to replicate message data.
 
+See [[./old-cluster-issues.txt]] for issues with the old design.
+
 ** Active-active vs. active-passive (hot-standby)
 
 An active-active cluster allows clients to connect to any broker in
@@ -92,13 +90,13 @@ broker is started on a different node and recovers from the store.
 This bears investigation but the store recovery times are likely too
 long for failover.
 
-** Replicating wiring
+** Replicating configuration
 
 New queues and exchanges and their bindings also need to be replicated.
-This is done by a QMF client that registers for wiring changes
+This is done by a QMF client that registers for configuration changes
 on the remote broker and mirrors them in the local broker.
 
-** Use of CPG
+** Use of CPG (openais/corosync)
 
 CPG is not required in this model, an external cluster resource
 manager takes care of membership and quorum.
@@ -107,12 +105,13 @@ manager takes care of membership and quorum.
 
 In this model it's easy to support selective replication of individual
 queues via configuration.
-- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate.
-  Treated analogously to persistent/durable properties for the store.
-- if not explicitly marked, provide a choice of default
-  - default is replicate (replicated message on replicated queue)
-  - default is don't replicate
-  - default is replicate persistent/durable messages.
+
+Explicit exchange/queue qpid.replicate argument:
+- none: the object is not replicated
+- configuration: queues, exchanges and bindings are replicated but messages are not.
+- messages: configuration and messages are replicated.
+
+TODO: provide configurable default for qpid.replicate
 
 [GRS: current prototype relies on queue sequence for message identity
 so selectively replicating certain messages on a given queue would be
@@ -137,30 +136,19 @@ go thru the various failure cases. We may be able to do recovery on a
 per-queue basis rather than restarting an entire node.
 
-** New backups joining
+** New backups connecting to primary.
 
-New brokers can join the cluster as backups. Note - if the new broker
-has a new IP address, then the existing cluster members must be
-updated with a new client and broker URLs by a sysadmin.
+When the primary fails, one of the backups becomes primary and the
+others connect to the new primary as backups.
+The backups can take advantage of the messages they already have
+backed up, the new primary only needs to replicate new messages.
 
-They discover
-
-We should be able to catch up much faster than the the old design. A
-new backup can catch up ("recover") the current cluster state on a
-per-queue basis.
-- queues can be updated in parallel
-- "live" updates avoid the the "endless chase"
-
-During a "live" update several things are happening on a queue:
-- clients are publishing messages to the back of the queue, replicated to the backup
-- clients are consuming messages from the front of the queue, replicated to the backup.
-- the primary is sending pre-existing messages to the new backup.
-
-The primary sends pre-existing messages in LIFO order - starting from
-the back of the queue, at the same time clients are consuming from the front.
-The active consumers actually reduce the amount of work to be done, as there's
-no need to replicate messages that are no longer on the queue.
+To keep the N-1 guarantee, the primary needs to delay completion on
+new messages until the back-ups have caught up. However if a backup
+does not catch up within some timeout, it should be considered
+out-of-order and messages completed even though it is not caught up.
+Need to think about reasonable behavior here.
 
 ** Broker discovery and lifecycle.
 
@@ -185,14 +173,16 @@ to each other.
 
 Brokers have the following states:
 - connecting: backup broker trying to connect to primary - loops retrying broker URL.
-- catchup: connected to primary, catching up on pre-existing wiring & messages.
+- catchup: connected to primary, catching up on pre-existing configuration & messages.
 - backup: fully functional backup.
 - primary: Acting as primary, serving clients.
 
 ** Interaction with rgmanager
 
-rgmanager interacts with qpid via 2 service scripts: backup & primary. These
-scripts interact with the underlying qpidd service.
+rgmanager interacts with qpid via 2 service scripts: backup &
+primary. These scripts interact with the underlying qpidd
+service. rgmanager picks the new primary when the old primary
+fails. In a partition it also takes care of killing inquorate brokers.
 
 *** Initial cluster start
 
@@ -273,8 +263,6 @@ vulnerable to a loss of the new master before they are replicated.
 
 For configuration propagation:
 
-LC1 - Bindings aren't propagated, only queues and exchanges.
-
 LC2 - Queue and exchange propagation is entirely asynchronous. There
 are three cases to consider here for queue creation: (a) where queues
 are created through the addressing syntax supported by the messaging API,
@@ -321,6 +309,14 @@ LC6 - The events and query responses are not fully synchronized.
 It is not possible to miss a create event and yet not to have the
 object in question in the query response however.
 
+* Benefits compared to previous cluster implementation.
+
+- Does not need openais/corosync, does not require multicast.
+- Possible to replace rgmanager with other resource mgr (Pacemaker, windows?)
+- DR is just another backup
+- Performance (some numbers?)
+- Virtual IP supported by rgmanager.
+
 * User Documentation Notes
 
 Notes to seed initial user documentation. Loosely tracking the implementation,
@@ -354,8 +350,10 @@ A HA client connection has multiple addresses, one for each broker. If
 it fails to connect to an address, or the connection breaks, it will
 automatically fail-over to another address.
 
-Only the primary broker accepts connections, the backup brokers abort
-connection attempts. That ensures clients connect to the primary only.
+Only the primary broker accepts connections, the backup brokers
+redirect connection attempts to the primary. If the primary fails, one
+of the backups is promoted to primary and clients fail-over to the new
+primary.
 
 TODO: using multiple-address connections, examples c++, python, java.
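As an illustration of the client-side behavior described in the new-ha-design changes above, here is a minimal Python sketch, assuming the qpid.messaging client API: a multiple-address connection that fails over among the HA group's brokers, and a queue declared with the qpid.replicate argument from the selective-replication section. The host and queue names are illustrative, not part of the design.

    # Failover client sketch (assumed qpid.messaging Python API;
    # host and queue names are illustrative).
    from qpid.messaging import Connection

    # List every broker in the HA group; only the primary accepts
    # connections, so on failure the client retries the other URLs
    # until it reaches the newly promoted primary.
    conn = Connection.establish(
        "broker1.example.com:5672",
        reconnect=True,
        reconnect_urls=["broker1.example.com:5672",
                        "broker2.example.com:5672",
                        "broker3.example.com:5672"])
    try:
        ssn = conn.session()
        # qpid.replicate=messages: both the queue's configuration and
        # its messages are replicated to the backups.
        snd = ssn.sender('my-queue; {create: always, node: {x-declare: '
                         "{arguments: {'qpid.replicate': 'messages'}}}}")
        snd.send("hello")
    finally:
        conn.close()

The same qpid.replicate argument would take the values none or configuration for objects that need less replication, per the levels listed in the diff above.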
diff --git a/qpid/cpp/design_docs/old-cluster-issues.txt b/qpid/cpp/design_docs/old-cluster-issues.txt
new file mode 100644
index 0000000000..5d778861c1
--- /dev/null
+++ b/qpid/cpp/design_docs/old-cluster-issues.txt
@@ -0,0 +1,82 @@
+-*-org-*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+* Issues with the old design.
+
+The cluster is based on virtual synchrony: each broker multicasts
+events and the events from all brokers are serialized and delivered in
+the same order to each broker.
+
+In the current design raw byte buffers from client connections are
+multicast, serialized and delivered in the same order to each broker.
+
+Each broker has a replica of all queues, exchanges, bindings and also
+all connections & sessions from every broker. Cluster code treats the
+broker as a "black box", it "plays" the client data into the
+connection objects and assumes that by giving the same input, each
+broker will reach the same state.
+
+A new broker joining the cluster receives a snapshot of the current
+cluster state, and then follows the multicast conversation.
+
+** Maintenance issues.
+
+The entire state of each broker is replicated to every member:
+connections, sessions, queues, messages, exchanges, management objects
+etc. Any discrepancy in the state that affects how messages are
+allocated to consumers can cause an inconsistency.
+
+- Entire broker state must be faithfully updated to new members.
+- Management model also has to be replicated.
+- All queues are replicated, can't have unreplicated queues (e.g. for management)
+
+Events that are not deterministically predictable from the client
+input data stream can cause inconsistencies. In particular use of
+timers/timestamps requires cluster workarounds to synchronize.
+
+A member that encounters an error which is not encountered by all other
+members is considered inconsistent and will shut itself down. Such
+errors can come from any area of the broker code, e.g. different
+ACL files can cause inconsistent errors.
+
+The following areas required workarounds to work in a cluster:
+
+- Timers/timestamps in broker code: management, heartbeats, TTL
+- Security: cluster must replicate *after* decryption by security layer.
+- Management: not initially included in the replicated model, source of many inconsistencies.
+
+It is very easy for someone adding a feature or fixing a bug in the
+standalone broker to break the cluster by:
+- adding new state that needs to be replicated in cluster updates.
+- doing something in a timer or other non-connection thread.
+
+It's very hard to test for such breaks. We need a looser coupling
+and a more explicitly defined interface between cluster and standalone
+broker code.
+
+** Performance issues.
+
+Virtual synchrony delivers all data from all clients in a single
+stream to each broker. The cluster must play this data thru the full
+broker code stack: connections, sessions etc. in a single thread
+context in order to get identical behavior on each broker. The cluster
+has a pipelined design to get some concurrency but this is a severe
+limitation on scalability in multi-core hosts compared to the
+standalone broker which processes each connection in a separate thread
+context.
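The scalability limit described in that last hunk can be seen in a toy model, not Qpid code, where all names are illustrative stubs: the old cluster must apply the one totally-ordered event stream in a single thread to keep every broker's state identical, whereas a standalone broker gives each connection its own thread.

    # Toy model of the threading contrast described above; process()
    # and the event lists are stand-ins, not Qpid interfaces.
    import threading

    def process(event):
        """Stand-in for playing one event through connections/sessions/queues."""

    def old_cluster(total_order):
        # Virtual synchrony: every broker consumes the same serialized
        # stream in order, so processing is single-threaded regardless
        # of how many clients or cores there are.
        for event in total_order:
            process(event)

    def standalone_broker(connections):
        # Standalone broker: one thread per connection, so independent
        # clients can be processed concurrently on a multi-core host.
        threads = [threading.Thread(target=lambda c=c: [process(e) for e in c])
                   for c in connections]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Three clients, two events each.
    clients = [[(i, n) for n in range(2)] for i in range(3)]
    standalone_broker(clients)                    # concurrent per connection
    old_cluster([e for c in clients for e in c])  # one serialized stream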