author     Alan Conway <aconway@apache.org>   2012-01-25 22:47:01 +0000
committer  Alan Conway <aconway@apache.org>   2012-01-25 22:47:01 +0000
commit     deb1247dfb548ce5907a832a6a7f14fc6f533c0a
tree       a2e4efb1270182a909ca1e533263c830bcf80f2f
parent     03ffe85d335d0e8f66f5afa4eb151417f297c85f
QPID-3603: Update to HA design docs.
git-svn-id: https://svn.apache.org/repos/asf/qpid/branches/qpid-3603-2@1235977 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r--  qpid/cpp/design_docs/new-cluster-design.txt | 63
-rw-r--r--  qpid/cpp/design_docs/new-ha-design.txt      | 82
-rw-r--r--  qpid/cpp/design_docs/old-cluster-issues.txt | 82
3 files changed, 124 insertions(+), 103 deletions(-)
diff --git a/qpid/cpp/design_docs/new-cluster-design.txt b/qpid/cpp/design_docs/new-cluster-design.txt
index 0b015a4570..d29ecce445 100644
--- a/qpid/cpp/design_docs/new-cluster-design.txt
+++ b/qpid/cpp/design_docs/new-cluster-design.txt
@@ -17,69 +17,10 @@
 # under the License.
 
 * A new design for Qpid clustering.
-** Issues with current design.
-The cluster is based on virtual synchrony: each broker multicasts
-events and the events from all brokers are serialized and delivered in
-the same order to each broker.
+** Issues with old cluster design
 
-In the current design raw byte buffers from client connections are
-multicast, serialized and delivered in the same order to each broker.
-
-Each broker has a replica of all queues, exchanges, bindings and also
-all connections & sessions from every broker. Cluster code treats the
-broker as a "black box", it "plays" the client data into the
-connection objects and assumes that by giving the same input, each
-broker will reach the same state.
-
-A new broker joining the cluster receives a snapshot of the current
-cluster state, and then follows the multicast conversation.
-
-*** Maintenance issues.
-
-The entire state of each broker is replicated to every member:
-connections, sessions, queues, messages, exchanges, management objects
-etc. Any discrepancy in the state that affects how messages are
-allocated to consumers can cause an inconsistency.
-
-- Entire broker state must be faithfully updated to new members.
-- Management model also has to be replicated.
-- All queues are replicated, can't have unreplicated queues (e.g. for management)
-
-Events that are not deterministically predictable from the client
-input data stream can cause inconsistencies. In particular use of
-timers/timestamps require cluster workarounds to synchronize.
-
-A member that encounters an error which is not encounted by all other
-members is considered inconsistent and will shut itself down. Such
-errors can come from any area of the broker code, e.g. different
-ACL files can cause inconsistent errors.
-
-The following areas required workarounds to work in a cluster:
-
-- Timers/timestamps in broker code: management, heartbeats, TTL
-- Security: cluster must replicate *after* decryption by security layer.
-- Management: not initially included in the replicated model, source of many inconsistencies.
-
-It is very easy for someone adding a feature or fixing a bug in the
-standalone broker to break the cluster by:
-- adding new state that needs to be replicated in cluster updates.
-- doing something in a timer or other non-connection thread.
-
-It's very hard to test for such breaks. We need a looser coupling
-and a more explicitly defined interface between cluster and standalone
-broker code.
-
-*** Performance issues.
-
-Virtual synchrony delivers all data from all clients in a single
-stream to each broker. The cluster must play this data thru the full
-broker code stack: connections, sessions etc. in a single thread
-context in order to get identical behavior on each broker. The cluster
-has a pipelined design to get some concurrency but this is a severe
-limitation on scalability in multi-core hosts compared to the
-standalone broker which processes each connection in a separate thread
-context.
+See [[./old-cluster-issues.txt]]
 
 ** A new cluster design.
diff --git a/qpid/cpp/design_docs/new-ha-design.txt b/qpid/cpp/design_docs/new-ha-design.txt
index 053dd7227d..18962f8be8 100644
--- a/qpid/cpp/design_docs/new-ha-design.txt
+++ b/qpid/cpp/design_docs/new-ha-design.txt
@@ -18,13 +18,11 @@
 
 * An active-passive, hot-standby design for Qpid clustering.
 
-For some background see [[./new-cluster-design.txt]] which describes the
-issues with the old design and a new active-active design that could
-replace it.
-
-This document describes an alternative active-passive approach based on
+This document describes an active-passive approach to HA based on
 queue browsing to replicate message data.
 
+See [[./old-cluster-issues.txt]] for issues with the old design.
+
 ** Active-active vs. active-passive (hot-standby)
 
 An active-active cluster allows clients to connect to any broker in
@@ -92,13 +90,13 @@ broker is started on a different node and recovers from the store.
 This bears investigation but the store recovery times are likely too
 long for failover.
 
-** Replicating wiring
+** Replicating configuration
 
 New queues and exchanges and their bindings also need to be replicated.
-This is done by a QMF client that registers for wiring changes
+This is done by a QMF client that registers for configuration changes
 on the remote broker and mirrors them in the local broker.
 
-** Use of CPG
+** Use of CPG (openais/corosync)
 
 CPG is not required in this model, an external cluster resource
 manager takes care of membership and quorum.
@@ -107,12 +105,13 @@ manager takes care of membership and quorum.
 
 In this model it's easy to support selective replication of individual
 queues via configuration.
-- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate.
-  Treated analogously to persistent/durable properties for the store.
-- if not explicitly marked, provide a choice of default
-  - default is replicate (replicated message on replicated queue)
-  - default is don't replicate
-  - default is replicate persistent/durable messages.
+
+Explicit exchange/queue qpid.replicate argument:
+- none: the object is not replicated
+- configuration: queues, exchanges and bindings are replicated but messages are not.
+- messages: configuration and messages are replicated.
+
+TODO: provide configurable default for qpid.replicate
 
 [GRS: current prototype relies on queue sequence for message identity
 so selectively replicating certain messages on a given queue would be
@@ -137,30 +136,19 @@ go thru the various failure cases. We may be able to do recovery on a
 per-queue basis rather than restarting an entire node.
 
-** New backups joining
+** New backups connecting to primary.
 
-New brokers can join the cluster as backups. Note - if the new broker
-has a new IP address, then the existing cluster members must be
-updated with a new client and broker URLs by a sysadmin.
+When the primary fails, one of the backups becomes primary and the
+others connect to the new primary as backups.
+The backups can take advantage of the messages they already have
+backed up, the new primary only needs to replicate new messages.
 
-They discover
-
-We should be able to catch up much faster than the the old design. A
-new backup can catch up ("recover") the current cluster state on a
-per-queue basis.
-- queues can be updated in parallel
-- "live" updates avoid the the "endless chase"
-
-During a "live" update several things are happening on a queue:
-- clients are publishing messages to the back of the queue, replicated to the backup
-- clients are consuming messages from the front of the queue, replicated to the backup.
-- the primary is sending pre-existing messages to the new backup.
-
-The primary sends pre-existing messages in LIFO order - starting from
-the back of the queue, at the same time clients are consuming from the front.
-The active consumers actually reduce the amount of work to be done, as there's
-no need to replicate messages that are no longer on the queue.
+To keep the N-1 guarantee, the primary needs to delay completion on
+new messages until the back-ups have caught up. However if a backup
+does not catch up within some timeout, it should be considered
+out-of-order and messages completed even though it is not caught up.
+Need to think about reasonable behavior here.
 
 ** Broker discovery and lifecycle.
 
@@ -185,14 +173,16 @@ to each other.
 
 Brokers have the following states:
 - connecting: backup broker trying to connect to primary - loops retrying broker URL.
-- catchup: connected to primary, catching up on pre-existing wiring & messages.
+- catchup: connected to primary, catching up on pre-existing configuration & messages.
 - backup: fully functional backup.
 - primary: Acting as primary, serving clients.
 
 ** Interaction with rgmanager
 
-rgmanager interacts with qpid via 2 service scripts: backup & primary. These
-scripts interact with the underlying qpidd service.
+rgmanager interacts with qpid via 2 service scripts: backup &
+primary. These scripts interact with the underlying qpidd
+service. rgmanager picks the new primary when the old primary
+fails. In a partition it also takes care of killing inquorate brokers.
 
 *** Initial cluster start
 
@@ -273,8 +263,6 @@ vulnerable to a loss of the new master before they are replicated.
 
 For configuration propagation:
 
-LC1 - Bindings aren't propagated, only queues and exchanges.
-
 LC2 - Queue and exchange propagation is entirely asynchronous. There
 are three cases to consider here for queue creation: (a) where queues
 are created through the addressing syntax supported by the messaging API,
@@ -321,6 +309,14 @@ LC6 - The events and query responses are not fully synchronized.
 It is not possible to miss a create event and yet not to have the
 object in question in the query response however.
 
+* Benefits compared to previous cluster implementation.
+
+- Does not need openais/corosync, does not require multicast.
+- Possible to replace rgmanager with other resource mgr (Pacemaker, windows?)
+- DR is just another backup
+- Performance (some numbers?)
+- Virtual IP supported by rgmanager.
+
 * User Documentation Notes
 
 Notes to seed initial user documentation. Loosely tracking the implementation,
@@ -354,8 +350,10 @@ A HA client connection has multiple addresses, one for each broker. If
 it fails to connect to an address, or the connection breaks, it will
 automatically fail-over to another address.
 
-Only the primary broker accepts connections, the backup brokers abort
-connection attempts. That ensures clients connect to the primary only.
+Only the primary broker accepts connections, the backup brokers
+redirect connection attempts to the primary. If the primary fails, one
+of the backups is promoted to primary and clients fail-over to the new
+primary.
 
 TODO: using multiple-address connections, examples c++, python, java.
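As an illustration of the client-side behavior described in the new-ha-design changes above, here is a minimal Python sketch, assuming the qpid.messaging client API: a multiple-address connection that fails over among the HA group's brokers, and a queue declared with the qpid.replicate argument from the selective-replication section. The host and queue names are illustrative, not part of the design.

    # Failover client sketch (assumed qpid.messaging Python API;
    # host and queue names are illustrative).
    from qpid.messaging import Connection

    # List every broker in the HA group; only the primary accepts
    # connections, so on failure the client retries the other URLs
    # until it reaches the newly promoted primary.
    conn = Connection.establish(
        "broker1.example.com:5672",
        reconnect=True,
        reconnect_urls=["broker1.example.com:5672",
                        "broker2.example.com:5672",
                        "broker3.example.com:5672"])
    try:
        ssn = conn.session()
        # qpid.replicate=messages: both the queue's configuration and
        # its messages are replicated to the backups.
        snd = ssn.sender('my-queue; {create: always, node: {x-declare: '
                         "{arguments: {'qpid.replicate': 'messages'}}}}")
        snd.send("hello")
    finally:
        conn.close()

The same qpid.replicate argument would take the values none or configuration for objects that need less replication, per the levels listed in the diff above.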
diff --git a/qpid/cpp/design_docs/old-cluster-issues.txt b/qpid/cpp/design_docs/old-cluster-issues.txt
new file mode 100644
index 0000000000..5d778861c1
--- /dev/null
+++ b/qpid/cpp/design_docs/old-cluster-issues.txt
@@ -0,0 +1,82 @@
+-*-org-*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+* Issues with the old design.
+
+The cluster is based on virtual synchrony: each broker multicasts
+events and the events from all brokers are serialized and delivered in
+the same order to each broker.
+
+In the current design raw byte buffers from client connections are
+multicast, serialized and delivered in the same order to each broker.
+
+Each broker has a replica of all queues, exchanges, bindings and also
+all connections & sessions from every broker. Cluster code treats the
+broker as a "black box", it "plays" the client data into the
+connection objects and assumes that by giving the same input, each
+broker will reach the same state.
+
+A new broker joining the cluster receives a snapshot of the current
+cluster state, and then follows the multicast conversation.
+
+** Maintenance issues.
+
+The entire state of each broker is replicated to every member:
+connections, sessions, queues, messages, exchanges, management objects
+etc. Any discrepancy in the state that affects how messages are
+allocated to consumers can cause an inconsistency.
+
+- Entire broker state must be faithfully updated to new members.
+- Management model also has to be replicated.
+- All queues are replicated, can't have unreplicated queues (e.g. for management)
+
+Events that are not deterministically predictable from the client
+input data stream can cause inconsistencies. In particular use of
+timers/timestamps requires cluster workarounds to synchronize.
+
+A member that encounters an error which is not encountered by all other
+members is considered inconsistent and will shut itself down. Such
+errors can come from any area of the broker code, e.g. different
+ACL files can cause inconsistent errors.
+
+The following areas required workarounds to work in a cluster:
+
+- Timers/timestamps in broker code: management, heartbeats, TTL
+- Security: cluster must replicate *after* decryption by security layer.
+- Management: not initially included in the replicated model, source of many inconsistencies.
+
+It is very easy for someone adding a feature or fixing a bug in the
+standalone broker to break the cluster by:
+- adding new state that needs to be replicated in cluster updates.
+- doing something in a timer or other non-connection thread.
+
+It's very hard to test for such breaks. We need a looser coupling
+and a more explicitly defined interface between cluster and standalone
+broker code.
+
+** Performance issues.
+
+Virtual synchrony delivers all data from all clients in a single
+stream to each broker. The cluster must play this data thru the full
+broker code stack: connections, sessions etc. in a single thread
+context in order to get identical behavior on each broker. The cluster
+has a pipelined design to get some concurrency but this is a severe
+limitation on scalability in multi-core hosts compared to the
+standalone broker which processes each connection in a separate thread
+context.
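The scalability limit described in that last hunk can be seen in a toy model, not Qpid code, where all names are illustrative stubs: the old cluster must apply the one totally-ordered event stream in a single thread to keep every broker's state identical, whereas a standalone broker gives each connection its own thread.

    # Toy model of the threading contrast described above; process()
    # and the event lists are stand-ins, not Qpid interfaces.
    import threading

    def process(event):
        """Stand-in for playing one event through connections/sessions/queues."""

    def old_cluster(total_order):
        # Virtual synchrony: every broker consumes the same serialized
        # stream in order, so processing is single-threaded regardless
        # of how many clients or cores there are.
        for event in total_order:
            process(event)

    def standalone_broker(connections):
        # Standalone broker: one thread per connection, so independent
        # clients can be processed concurrently on a multi-core host.
        threads = [threading.Thread(target=lambda c=c: [process(e) for e in c])
                   for c in connections]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Three clients, two events each.
    clients = [[(i, n) for n in range(2)] for i in range(3)]
    standalone_broker(clients)                    # concurrent per connection
    old_cluster([e for c in clients for e in c])  # one serialized stream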