author     Alan Conway <aconway@apache.org>    2012-01-25 22:47:01 +0000
committer  Alan Conway <aconway@apache.org>    2012-01-25 22:47:01 +0000
commit     deb1247dfb548ce5907a832a6a7f14fc6f533c0a (patch)
tree       a2e4efb1270182a909ca1e533263c830bcf80f2f
parent     03ffe85d335d0e8f66f5afa4eb151417f297c85f (diff)
download   qpid-python-deb1247dfb548ce5907a832a6a7f14fc6f533c0a.tar.gz
QPID-3603: Update to HA design docs.
git-svn-id: https://svn.apache.org/repos/asf/qpid/branches/qpid-3603-2@1235977 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r--  qpid/cpp/design_docs/new-cluster-design.txt   63
-rw-r--r--  qpid/cpp/design_docs/new-ha-design.txt        82
-rw-r--r--  qpid/cpp/design_docs/old-cluster-issues.txt   82
3 files changed, 124 insertions, 103 deletions
diff --git a/qpid/cpp/design_docs/new-cluster-design.txt b/qpid/cpp/design_docs/new-cluster-design.txt
index 0b015a4570..d29ecce445 100644
--- a/qpid/cpp/design_docs/new-cluster-design.txt
+++ b/qpid/cpp/design_docs/new-cluster-design.txt
@@ -17,69 +17,10 @@
# under the License.
* A new design for Qpid clustering.
-** Issues with current design.
-The cluster is based on virtual synchrony: each broker multicasts
-events and the events from all brokers are serialized and delivered in
-the same order to each broker.
+** Issues with old cluster design
-In the current design raw byte buffers from client connections are
-multicast, serialized and delivered in the same order to each broker.
-
-Each broker has a replica of all queues, exchanges, bindings and also
-all connections & sessions from every broker. Cluster code treats the
-broker as a "black box", it "plays" the client data into the
-connection objects and assumes that by giving the same input, each
-broker will reach the same state.
-
-A new broker joining the cluster receives a snapshot of the current
-cluster state, and then follows the multicast conversation.
-
-*** Maintenance issues.
-
-The entire state of each broker is replicated to every member:
-connections, sessions, queues, messages, exchanges, management objects
-etc. Any discrepancy in the state that affects how messages are
-allocated to consumers can cause an inconsistency.
-
-- Entire broker state must be faithfully updated to new members.
-- Management model also has to be replicated.
-- All queues are replicated, can't have unreplicated queues (e.g. for management)
-
-Events that are not deterministically predictable from the client
-input data stream can cause inconsistencies. In particular use of
-timers/timestamps require cluster workarounds to synchronize.
-
-A member that encounters an error which is not encounted by all other
-members is considered inconsistent and will shut itself down. Such
-errors can come from any area of the broker code, e.g. different
-ACL files can cause inconsistent errors.
-
-The following areas required workarounds to work in a cluster:
-
-- Timers/timestamps in broker code: management, heartbeats, TTL
-- Security: cluster must replicate *after* decryption by security layer.
-- Management: not initially included in the replicated model, source of many inconsistencies.
-
-It is very easy for someone adding a feature or fixing a bug in the
-standalone broker to break the cluster by:
-- adding new state that needs to be replicated in cluster updates.
-- doing something in a timer or other non-connection thread.
-
-It's very hard to test for such breaks. We need a looser coupling
-and a more explicitly defined interface between cluster and standalone
-broker code.
-
-*** Performance issues.
-
-Virtual synchrony delivers all data from all clients in a single
-stream to each broker. The cluster must play this data thru the full
-broker code stack: connections, sessions etc. in a single thread
-context in order to get identical behavior on each broker. The cluster
-has a pipelined design to get some concurrency but this is a severe
-limitation on scalability in multi-core hosts compared to the
-standalone broker which processes each connection in a separate thread
-context.
+See [[./old-cluster-issues.txt]]
** A new cluster design.
diff --git a/qpid/cpp/design_docs/new-ha-design.txt b/qpid/cpp/design_docs/new-ha-design.txt
index 053dd7227d..18962f8be8 100644
--- a/qpid/cpp/design_docs/new-ha-design.txt
+++ b/qpid/cpp/design_docs/new-ha-design.txt
@@ -18,13 +18,11 @@
* An active-passive, hot-standby design for Qpid clustering.
-For some background see [[./new-cluster-design.txt]] which describes the
-issues with the old design and a new active-active design that could
-replace it.
-
-This document describes an alternative active-passive approach based on
+This document describes an active-passive approach to HA based on
queue browsing to replicate message data.
+See [[./old-cluster-issues.txt]] for issues with the old design.
+
** Active-active vs. active-passive (hot-standby)
An active-active cluster allows clients to connect to any broker in
@@ -92,13 +90,13 @@ broker is started on a different node and recovers from the
store. This bears investigation but the store recovery times are
likely too long for failover.
-** Replicating wiring
+** Replicating configuration
New queues and exchanges and their bindings also need to be replicated.
-This is done by a QMF client that registers for wiring changes
+This is done by a QMF client that registers for configuration changes
on the remote broker and mirrors them in the local broker.
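As a rough sketch of this approach (illustrative only, not the actual
replication code), a QMF console client can register for events on the
remote broker and re-apply the configuration changes locally. The
listener below only reports the events it receives; the broker address
is a placeholder.

#+BEGIN_SRC python
# Illustrative sketch only: a QMF console client that listens for
# events from the remote (primary) broker. A real mirror would
# re-issue the corresponding queue/exchange declares and bindings on
# the local broker; this one just reports what it sees.
import time
from qmf.console import Console, Session

class ConfigListener(Console):
    def event(self, broker, event):
        # Called for each management event raised by the remote broker.
        print("configuration event from primary: %s" % event)

session = Session(ConfigListener(), rcvEvents=True)
session.addBroker("remote-primary:5672")   # placeholder broker address

while True:
    time.sleep(1)   # keep the process alive; events arrive via the callback
#+END_SRC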
-** Use of CPG
+** Use of CPG (openais/corosync)
CPG is not required in this model; an external cluster resource
manager takes care of membership and quorum.
@@ -107,12 +105,13 @@ manager takes care of membership and quorum.
In this model it's easy to support selective replication of individual queues via
configuration.
-- Explicit exchange/queue declare argument and message boolean: x-qpid-replicate.
- Treated analogously to persistent/durable properties for the store.
-- if not explicitly marked, provide a choice of default
- - default is replicate (replicated message on replicated queue)
- - default is don't replicate
- - default is replicate persistent/durable messages.
+
+Explicit exchange/queue qpid.replicate argument:
+- none: the object is not replicated
+- configuration: queues, exchanges and bindings are replicated but messages are not.
+- messages: configuration and messages are replicated.
+
+TODO: provide configurable default for qpid.replicate
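For illustration, the sketch below declares a queue with the proposed
qpid.replicate argument through the Python messaging API's address
syntax. The host and queue names are placeholders and the argument
value is one of those proposed above; a final implementation might use
different values.

#+BEGIN_SRC python
# Sketch: declare a queue asking for full replication (configuration
# and messages) using the proposed qpid.replicate argument.
# Host and queue names are placeholders.
from qpid.messaging import Connection

conn = Connection("primary-host:5672")
conn.open()
try:
    session = conn.session()
    sender = session.sender(
        "replicated-q; {create: always, node: "
        "{x-declare: {arguments: {'qpid.replicate': 'messages'}}}}")
    sender.send("hello")   # this message would be replicated to backups
finally:
    conn.close()
#+END_SRC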
[GRS: current prototype relies on queue sequence for message identity
so selectively replicating certain messages on a given queue would be
@@ -137,30 +136,19 @@ go thru the various failure cases. We may be able to do recovery on a
per-queue basis rather than restarting an entire node.
-** New backups joining
+** New backups connecting to primary.
-New brokers can join the cluster as backups. Note - if the new broker
-has a new IP address, then the existing cluster members must be
-updated with a new client and broker URLs by a sysadmin.
+When the primary fails, one of the backups becomes primary and the
+others connect to the new primary as backups.
+The backups can take advantage of the messages they already have
+backed up; the new primary only needs to replicate new messages.
-They discover
-
-We should be able to catch up much faster than the the old design. A
-new backup can catch up ("recover") the current cluster state on a
-per-queue basis.
-- queues can be updated in parallel
-- "live" updates avoid the the "endless chase"
-
-During a "live" update several things are happening on a queue:
-- clients are publishing messages to the back of the queue, replicated to the backup
-- clients are consuming messages from the front of the queue, replicated to the backup.
-- the primary is sending pre-existing messages to the new backup.
-
-The primary sends pre-existing messages in LIFO order - starting from
-the back of the queue, at the same time clients are consuming from the front.
-The active consumers actually reduce the amount of work to be done, as there's
-no need to replicate messages that are no longer on the queue.
+To keep the N-1 guarantee, the primary needs to delay completion of
+new messages until the backups have caught up. However, if a backup
+does not catch up within some timeout, it should be considered out of
+date and messages completed even though it is not caught up.
+Need to think about reasonable behavior here.
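One way to picture the delayed-completion rule is the sketch below. It
is pseudo-logic for this document only, not broker code, and every
name in it (PendingMessage, CATCHUP_TIMEOUT) is invented for the
example.

#+BEGIN_SRC python
# Illustrative only: track which backups still owe an acknowledgment
# for a message and decide when the primary may complete it.
import time

CATCHUP_TIMEOUT = 10.0   # seconds a backup may lag before it is ignored

class PendingMessage(object):
    def __init__(self, msg_id, backups):
        self.msg_id = msg_id
        self.sent_at = time.time()
        self.waiting_on = set(backups)     # backups yet to acknowledge

    def acknowledge(self, backup):
        self.waiting_on.discard(backup)

    def completable(self):
        # Complete when every backup has acknowledged, or when the
        # remaining backups have exceeded the catch-up timeout and are
        # treated as out of date.
        return (not self.waiting_on or
                time.time() - self.sent_at > CATCHUP_TIMEOUT)
#+END_SRC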
** Broker discovery and lifecycle.
@@ -185,14 +173,16 @@ to each other.
Brokers have the following states:
- connecting: backup broker trying to connect to primary - loops retrying broker URL.
-- catchup: connected to primary, catching up on pre-existing wiring & messages.
+- catchup: connected to primary, catching up on pre-existing configuration & messages.
- backup: fully functional backup.
- primary: Acting as primary, serving clients.
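For reference, the lifecycle above can be pictured as the small state
machine sketched below; the transition table is inferred from this
document, not taken from the broker code.

#+BEGIN_SRC python
# Broker lifecycle states from this document and the transitions they
# imply (illustrative sketch only).
CONNECTING, CATCHUP, BACKUP, PRIMARY = "connecting", "catchup", "backup", "primary"

TRANSITIONS = {
    CONNECTING: {CATCHUP},             # found the primary, start catch-up
    CATCHUP:    {BACKUP, CONNECTING},  # caught up, or lost the primary
    BACKUP:     {PRIMARY, CONNECTING}, # promoted on failover, or reconnecting
    PRIMARY:    set(),                 # serves clients until it fails
}

def can_transition(current, target):
    return target in TRANSITIONS[current]
#+END_SRC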
** Interaction with rgmanager
-rgmanager interacts with qpid via 2 service scripts: backup & primary. These
-scripts interact with the underlying qpidd service.
+rgmanager interacts with qpid via 2 service scripts: backup &
+primary. These scripts interact with the underlying qpidd
+service. rgmanager picks the new primary when the old primary
+fails. In a partition it also takes care of killing inquorate brokers.
*** Initial cluster start
@@ -273,8 +263,6 @@ vulnerable to a loss of the new master before they are replicated.
For configuration propagation:
-LC1 - Bindings aren't propagated, only queues and exchanges.
-
LC2 - Queue and exchange propagation is entirely asynchronous. There
are three cases to consider here for queue creation: (a) where queues
are created through the addressing syntax supported by the messaging API,
@@ -321,6 +309,14 @@ LC6 - The events and query responses are not fully synchronized.
However, it is not possible to miss a create event and yet not have
the object in question in the query response.
+* Benefits compared to previous cluster implementation.
+
+- Does not need openais/corosync, does not require multicast.
+- Possible to replace rgmanager with another resource manager (Pacemaker, Windows?)
+- DR is just another backup
+- Performance (some numbers?)
+- Virtual IP supported by rgmanager.
+
* User Documentation Notes
Notes to seed initial user documentation. Loosely tracking the implementation,
@@ -354,8 +350,10 @@ A HA client connection has multiple addresses, one for each broker. If
it fails to connect to an address, or the connection breaks,
it will automatically fail over to another address.
-Only the primary broker accepts connections, the backup brokers abort
-connection attempts. That ensures clients connect to the primary only.
+Only the primary broker accepts connections; the backup brokers
+redirect connection attempts to the primary. If the primary fails, one
+of the backups is promoted to primary and clients fail over to the new
+primary.
TODO: using multiple-address connections, examples c++, python, java.
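As a starting point for that TODO, here is a Python sketch of a
multiple-address connection; the host names are placeholders, and the
reconnect_urls option lists the brokers the client should try on
failover.

#+BEGIN_SRC python
# Sketch for the TODO above: a client that knows every broker in the
# HA group and fails over automatically. Host names are placeholders.
from qpid.messaging import Connection

conn = Connection("broker1.example.com:5672",
                  reconnect=True,
                  reconnect_urls=["broker1.example.com:5672",
                                  "broker2.example.com:5672",
                                  "broker3.example.com:5672"])
conn.open()
try:
    session = conn.session()
    sender = session.sender("my-queue; {create: always}")
    sender.send("hello")   # retried against the other URLs if the primary fails
finally:
    conn.close()
#+END_SRC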
diff --git a/qpid/cpp/design_docs/old-cluster-issues.txt b/qpid/cpp/design_docs/old-cluster-issues.txt
new file mode 100644
index 0000000000..5d778861c1
--- /dev/null
+++ b/qpid/cpp/design_docs/old-cluster-issues.txt
@@ -0,0 +1,82 @@
+-*-org-*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+* Issues with the old design.
+
+The cluster is based on virtual synchrony: each broker multicasts
+events and the events from all brokers are serialized and delivered in
+the same order to each broker.
+
+In this design, raw byte buffers from client connections are
+multicast, serialized and delivered in the same order to each broker.
+
+Each broker has a replica of all queues, exchanges, bindings and also
+all connections & sessions from every broker. Cluster code treats the
+broker as a "black box": it "plays" the client data into the
+connection objects and assumes that, given the same input, each
+broker will reach the same state.
+
+A new broker joining the cluster receives a snapshot of the current
+cluster state, and then follows the multicast conversation.
+
+** Maintenance issues.
+
+The entire state of each broker is replicated to every member:
+connections, sessions, queues, messages, exchanges, management objects
+etc. Any discrepancy in the state that affects how messages are
+allocated to consumers can cause an inconsistency.
+
+- Entire broker state must be faithfully updated to new members.
+- Management model also has to be replicated.
+- All queues are replicated, can't have unreplicated queues (e.g. for management)
+
+Events that are not deterministically predictable from the client
+input data stream can cause inconsistencies. In particular, use of
+timers/timestamps requires cluster workarounds to synchronize.
+
+A member that encounters an error which is not encountered by all other
+members is considered inconsistent and will shut itself down. Such
+errors can come from any area of the broker code, e.g. different
+ACL files can cause inconsistent errors.
+
+The following areas required workarounds to work in a cluster:
+
+- Timers/timestamps in broker code: management, heartbeats, TTL
+- Security: cluster must replicate *after* decryption by security layer.
+- Management: not initially included in the replicated model, source of many inconsistencies.
+
+It is very easy for someone adding a feature or fixing a bug in the
+standalone broker to break the cluster by:
+- adding new state that needs to be replicated in cluster updates.
+- doing something in a timer or other non-connection thread.
+
+It's very hard to test for such breaks. We need a looser coupling
+and a more explicitly defined interface between cluster and standalone
+broker code.
+
+** Performance issues.
+
+Virtual synchrony delivers all data from all clients in a single
+stream to each broker. The cluster must play this data through the
+full broker code stack: connections, sessions etc. in a single thread
+context in order to get identical behavior on each broker. The
+cluster has a pipelined design to get some concurrency, but this is a
+severe limitation on scalability on multi-core hosts compared to the
+standalone broker, which processes each connection in a separate
+thread context.
+