Diffstat (limited to 'cpp/design_docs/new-ha-design.txt')
-rw-r--r--  cpp/design_docs/new-ha-design.txt  119
1 file changed, 26 insertions(+), 93 deletions(-)
diff --git a/cpp/design_docs/new-ha-design.txt b/cpp/design_docs/new-ha-design.txt
index acca1720b4..df6c7242eb 100644
--- a/cpp/design_docs/new-ha-design.txt
+++ b/cpp/design_docs/new-ha-design.txt
@@ -84,12 +84,6 @@ retry on a single address to fail over. Alternatively we will also
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.
-Aside: Cold-standby is also possible using rgmanager with shared
-storage for the message store (e.g. GFS). If the broker fails, another
-broker is started on a different node and recovers from the
-store. This bears investigation but the store recovery times are
-likely too long for failover.
-
** Replicating configuration
New queues and exchanges and their bindings also need to be replicated.
@@ -109,13 +103,9 @@ configuration.
Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated
- configuration: queues, exchanges and bindings are replicated but messages are not.
-- messages: configuration and messages are replicated.
-
-TODO: provide configurable default for qpid.replicate
+- all: configuration and messages are replicated.
-[GRS: current prototype relies on queue sequence for message identity
-so selectively replicating certain messages on a given queue would be
-challenging. Selectively replicating certain queues however is trivial.]
+Provide a configurable default: all/configuration/none.
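+
+As a rough illustration (a sketch only, assuming the qpid::messaging C++ API
+and its address-string syntax; the queue name and broker address are made up),
+a client could request full replication when declaring a queue:
+
+    #include <qpid/messaging/Connection.h>
+    #include <qpid/messaging/Session.h>
+    #include <qpid/messaging/Sender.h>
+    #include <qpid/messaging/Message.h>
+
+    int main() {
+        using namespace qpid::messaging;
+        Connection connection("amqp:tcp:broker.example.com:5672");
+        connection.open();
+        Session session = connection.createSession();
+        // qpid.replicate=all: configuration and messages are replicated.
+        Sender sender = session.createSender(
+            "replicated-q; {create: always,"
+            " node: {x-declare: {arguments: {'qpid.replicate': 'all'}}}}");
+        sender.send(Message("hello"));
+        session.sync();
+        connection.close();
+        return 0;
+    }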
** Inconsistent errors
@@ -125,12 +115,13 @@ eliminates the need to stall the whole cluster till an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.
-We have 2 options (configurable) for handling inconsistent errors,
+We have 3 options (configurable) for handling inconsistent errors,
on the backup that fails to store a message from primary we can:
- Abort the backup broker allowing it to be re-started.
- Raise a critical error on the backup broker but carry on with the message lost.
-We can configure the option to abort or carry on per-queue, we
-will also provide a broker-wide configurable default.
+- Reset and re-try replication for just the affected queue.
+
+We will make the choice of behaviour configurable.
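+
+A minimal sketch of how a backup might apply such a policy, assuming a
+per-queue override with a broker-wide default (the names below are
+illustrative, not actual broker options):
+
+    #include <cstdlib>
+    #include <iostream>
+    #include <map>
+    #include <string>
+
+    enum class ErrorPolicy { Abort, CarryOn, ResetQueue };
+
+    struct ErrorPolicyConfig {
+        ErrorPolicy brokerDefault = ErrorPolicy::Abort;
+        std::map<std::string, ErrorPolicy> perQueue;   // per-queue overrides
+
+        ErrorPolicy forQueue(const std::string& q) const {
+            auto i = perQueue.find(q);
+            return i == perQueue.end() ? brokerDefault : i->second;
+        }
+    };
+
+    // Called on a backup when it fails to store a message from the primary.
+    void onInconsistentStoreError(const ErrorPolicyConfig& cfg, const std::string& queue) {
+        switch (cfg.forQueue(queue)) {
+        case ErrorPolicy::Abort:
+            std::cerr << "store error on " << queue << ": aborting backup\n";
+            std::abort();                  // resource manager restarts the broker
+        case ErrorPolicy::CarryOn:
+            std::cerr << "store error on " << queue << ": message lost, carrying on\n";
+            break;
+        case ErrorPolicy::ResetQueue:
+            std::cerr << "store error on " << queue << ": re-replicating this queue\n";
+            break;                         // reset and re-subscribe to this queue (not shown)
+        }
+    }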
** New backups connecting to primary.
@@ -156,8 +147,8 @@ In backup mode, brokers reject normal client connections
so clients will fail over to the primary. HA admin tools mark their
connections so they are allowed to connect to backup brokers.
-Clients discover the primary by re-trying connection to the client URL
-until the successfully connect to the primary. In the case of a
+Clients discover the primary by re-trying connections to all addresses in the client URL
+until they successfully connect to the primary. In the case of a
virtual IP they re-try the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
@@ -168,12 +159,6 @@ is a separate broker URL for brokers since they often will connect
over a different network. The broker URL has to be a list of real
addresses rather than a virtual address.
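+
+For example (a sketch assuming the qpid::messaging C++ client and its
+reconnect options; the addresses are made up and the option names should be
+checked against the client documentation), a client can be given the whole
+client URL and left to retry until it reaches the primary:
+
+    #include <qpid/messaging/Connection.h>
+
+    int main() {
+        using namespace qpid::messaging;
+        // The client keeps retrying these addresses until it reaches the
+        // broker that is currently primary, and again after a failover.
+        Connection connection(
+            "amqp:tcp:node1.example.com:5672",
+            "{reconnect: true,"
+            " reconnect_urls: ['amqp:tcp:node2.example.com:5672',"
+            " 'amqp:tcp:node3.example.com:5672'],"
+            " reconnect_timeout: 60}");
+        connection.open();
+        // ... use the connection; re-connection is automatic on disconnect.
+        connection.close();
+        return 0;
+    }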
-Brokers have the following states:
-- connecting: Backup broker trying to connect to primary - loops retrying broker URL.
-- catchup: Backup connected to primary, catching up on pre-existing configuration & messages.
-- ready: Backup fully caught-up, ready to take over as primary.
-- primary: Acting as primary, serving clients.
-
** Interaction with rgmanager
rgmanager interacts with qpid via 2 service scripts: backup &
@@ -190,8 +175,8 @@ the primary state. Backups discover the primary, connect and catch up.
*** Failover
-primary broker or node fails. Backup brokers see disconnect and go
-back to connecting mode.
+primary broker or node fails. Backup brokers see the disconnect and
+start trying to re-connect to the new primary.
rgmanager notices the failure and starts the primary service on a new node.
This tells qpidd to go to primary mode. Backups re-connect and catch up.
@@ -225,71 +210,30 @@ to become a ready backup.
** Interaction with the store.
-Clean shutdown: entire cluster is shut down cleanly by an admin tool:
-- primary stops accepting client connections till shutdown is complete.
-- backups come fully up to speed with primary state.
-- all shut down marking stores as 'clean' with an identifying UUID.
-
-After clean shutdown the cluster can re-start automatically as all nodes
-have equivalent stores. Stores starting up with the wrong UUID will fail.
-
-Stored status: clean(UUID)/dirty, primary/backup, generation number.
-- All stores are marked dirty except after a clean shutdown.
-- Generation number: passed to backups and incremented by new primary.
-
-After total crash must manually identify the "best" store, provide admin tool.
-Best = highest generation number among stores in primary state.
-
-Recovering from total crash: all brokers will refuse to start as all stores are dirty.
-Check the stores manually to find the best one, then either:
-1. Copy stores:
- - copy good store to all hosts
- - restart qpidd on all hosts.
-2. Erase stores:
- - Erase the store on all other hosts.
- - Restart qpidd on the good store and wait for it to become primary.
- - Restart qpidd on all other hosts.
-
-Broker startup with store:
-- Dirty: refuse to start
-- Clean:
- - Start and load from store.
- - When connecting as backup, check UUID matches primary, shut down if not.
-- Empty: start ok, no UUID check with primary.
+Needs more detail:
-* Current Limitations
+We want backup brokers to be able to use their stored messages on restart
+so they don't have to download everything from the primary.
+This requires an HA sequence number to be stored with each message
+so the backup can identify which messages it has in common with the primary.
-(In no particular order at present)
+This will work very similarly to the way live backups can use in-memory
+messages to reduce the download.
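+
+A sketch of the intended comparison (the sequence type, the map of locally
+stored messages and the function itself are assumptions, not the actual
+store interface):
+
+    #include <cstdint>
+    #include <map>
+
+    typedef uint64_t HaSequence;   // replication sequence stored with each message
+
+    // On restart the backup keeps the locally stored messages whose sequence
+    // numbers the primary still holds and downloads only the rest, assuming
+    // the locally stored run of sequence numbers is contiguous.
+    template <class Message>
+    HaSequence resyncFromStore(std::map<HaSequence, Message>& local,
+                               HaSequence primaryFront,  // oldest sequence on primary
+                               HaSequence primaryBack)   // newest sequence on primary
+    {
+        // Drop messages the primary has already dequeued...
+        local.erase(local.begin(), local.lower_bound(primaryFront));
+        // ...and anything beyond the primary's tail (should not normally happen).
+        local.erase(local.upper_bound(primaryBack), local.end());
+        // Resume replication from the first sequence we do not already hold.
+        return local.empty() ? primaryFront : local.rbegin()->first + 1;
+    }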
-For message replication:
+We need to determine which broker is chosen as initial primary based on the
+currency of the stores, probably using stored generation numbers and status
+flags. Ideally this will be automated with rgmanager, though some manual
+intervention may be required.
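+
+One possible selection rule, assuming each store records the status flags
+and generation number mentioned above (the field names and ordering are
+illustrative):
+
+    #include <cstdint>
+    #include <string>
+    #include <tuple>
+    #include <vector>
+
+    struct StoreStatus {
+        std::string node;
+        bool wasPrimary;        // store was in the primary state when it last stopped
+        uint64_t generation;    // incremented by each new primary
+    };
+
+    // "Best" store: primary state first, then highest generation number.
+    const StoreStatus* chooseInitialPrimary(const std::vector<StoreStatus>& stores) {
+        const StoreStatus* best = nullptr;
+        for (const StoreStatus& s : stores) {
+            if (!best || std::make_tuple(s.wasPrimary, s.generation) >
+                         std::make_tuple(best->wasPrimary, best->generation))
+                best = &s;
+        }
+        return best;   // null if no stores were found
+    }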
-LM1a - On failover, backups delete their queues and download the full queue state from the
-primary. There was code to use messages already on the backup for re-synchronisation; it
-was removed in early development (r1214490) to simplify the logic while getting basic
-replication working. It needs to be re-introduced.
+* Current Limitations
-LM1b - This re-synchronisation does not handle the case where a newly elected primary is *behind*
-one of the other backups. To address this I propose a new event for resetting the sequence
-that the new primary would send out on detecting that a replicating browser is ahead of
-it, requesting that the replica revert back to a particular sequence number. The replica
-on receiving this event would then discard (i.e. dequeue) all the messages ahead of that
-sequence number and reset the counter to correctly sequence any subsequently delivered
-messages.
+(In no particular order at present)
-LM2 - There is a need to handle wrap-around of the message sequence to avoid
-confusing the resynchronisation where a replica has been disconnected
-for a long time, sufficient for the sequence numbering to wrap around.
+For message replication (missing items have been fixed):
LM3 - Transactional changes to queue state are not replicated atomically.
-LM4 - Acknowledgements are confirmed to clients before the message has been
-dequeued from replicas or indeed from the local store if that is
-asynchronous.
-
-LM5 - During failover, messages (re)published to a queue before there are
-the requisite number of replication subscriptions established will be
-confirmed to the publisher before they are replicated, leaving them
-vulnerable to a loss of the new primary before they are replicated.
+LM4 - (No worse than store) Acknowledgements are confirmed to clients before the message
+has been dequeued from replicas or indeed from the local store if that is asynchronous.
LM6 - persistence: In the event of a total cluster failure there are
no tools to automatically identify the "latest" store.
@@ -323,21 +267,11 @@ case (b) can be addressed in a simple manner through tooling but case
(c) would require changes to the broker to allow a client to simply
determine when the command has fully propagated.
-LC3 - Queues that are not in the query response received when a
-replica establishes a propagation subscription but exist locally are
-not deleted. I.e. Deletion of queues/exchanges while a replica is not
-connected will not be propagated. Solution is to delete any queues
-marked for propagation that exist locally but do not show up in the
-query response.
-
LC4 - It is possible on failover that the new primary did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1 but without an easy way to detect or remedy
it).
-LC5 - Need richer control over which queues/exchanges are propagated, and
-which are not.
-
LC6 - The events and query responses are not fully synchronized.
In particular it *is* possible to not receive a delete event but
@@ -356,12 +290,11 @@ LC6 - The events and query responses are not fully synchronized.
LC7 Federated links from the primary will be lost in failover; they will not be re-connected on
the new primary. Federation links to the primary can fail over.
-LC8 Only plain FIFO queues can be replicated. LVQs and ring queues are not yet supported.
-
LC9 The "last man standing" feature of the old cluster is not available.
* Benefits compared to previous cluster implementation.
+- Allows per queue/exchange control over what is replicated.
- Does not depend on openais/corosync, does not require multicast.
- Can be integrated with different resource managers: for example rgmanager, PaceMaker, Veritas.
- Can be ported to/implemented in other environments: e.g. Java, Windows