-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

* An active-passive, hot-standby design for Qpid clustering.

This document describes an active-passive approach to HA based on
queue browsing to replicate message data.

See [[./old-cluster-issues.txt]] for issues with the old design.

** Active-active vs. active-passive (hot-standby)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail-over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are served only by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real
time by the primary; if the primary fails, a backup is elected to be
the new primary.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are 2 approaches:

Predictive: each broker predicts which messages others will take. This
was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement and it is not straightforward to determine the
best strategy for passing the lock. In tests to date it results in
very high latencies (10x a standalone broker).

Hot-standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there is no
locking.

The primary can enqueue messages and replicate asynchronously -
exactly like the store does, but it "writes" to the replicas over the
network rather than writing to disk.

** Replicating browsers

The unit of replication is a replicating browser. This is an AMQP
consumer that browses a remote queue via a federation link and
maintains a local replica of the queue. As well as browsing the remote
messages as they are added, the browser receives dequeue notifications
when they are dequeued remotely.

On the primary broker, incoming message transfers are completed only when
all of the replicating browsers have signaled completion. Thus a completed
message is guaranteed to be on the backups.
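
The mechanics can be illustrated with a small, self-contained sketch.
This is a conceptual model with hypothetical names, not the broker
implementation: a backup keeps a replica keyed by the primary's queue
sequence number and applies enqueue/dequeue notifications to it, while
the primary completes a transfer back to the client only once every
replicating browser has acknowledged it.

#+BEGIN_SRC python
from collections import OrderedDict

class ReplicaQueue:
    """Backup-side model of the state a replicating browser maintains.

    Messages are keyed by the primary's queue sequence number, so a
    dequeue notification only needs to carry that number."""
    def __init__(self):
        self.messages = OrderedDict()   # sequence number -> message body

    def on_enqueue(self, seq, body):    # message browsed from the primary
        self.messages[seq] = body

    def on_dequeue(self, seq):          # dequeue notification from the primary
        self.messages.pop(seq, None)

class PrimaryQueue:
    """Primary-side completion rule: a transfer is completed back to the
    client only when every replicating browser has acknowledged it."""
    def __init__(self, backups):
        self.backups = backups          # ids of backups expected to acknowledge
        self.pending = {}               # sequence number -> backups outstanding

    def publish(self, seq, body):       # body would be browsed by the backups
        self.pending[seq] = set(self.backups)

    def on_backup_ack(self, backup, seq):
        outstanding = self.pending.get(seq)
        if outstanding is None:
            return False
        outstanding.discard(backup)
        if outstanding:
            return False
        del self.pending[seq]
        return True                     # safe to complete to the client

# With two backups, completion happens only after both acknowledge.
q = PrimaryQueue(backups=["backup-1", "backup-2"])
q.publish(1, "hello")
assert q.on_backup_ack("backup-1", 1) is False
assert q.on_backup_ack("backup-2", 1) is True
#+END_SRC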

** Failover and Cluster Resource Managers

We want to delegate the failover management to an existing cluster
resource manager. Initially this is rgmanager from Cluster Suite, but
other managers (e.g. PaceMaker) could be supported in future.

Rgmanager takes care of starting and stopping brokers and informing
brokers of their roles as primary or backup. It ensures there's
exactly one primary broker running at any time. It also tracks quorum
and protects against split-brain.

Rgmanager can also manage a virtual IP address so clients can simply
retry on a single address to fail over. Alternatively we will also
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.

Aside: Cold-standby is also possible using rgmanager with shared
storage for the message store (e.g. GFS). If the broker fails, another
broker is started on a different node and recovers from the
store. This bears investigation, but the store recovery times are
likely too long for failover.

** Replicating configuration

New queues and exchanges and their bindings also need to be replicated.
This is done by a QMF client that registers for configuration changes
on the remote broker and mirrors them in the local broker.

** Use of CPG (openais/corosync)

CPG is not required in this model, an external cluster resource
manager takes care of membership and quorum.

** Selective replication

In this model it's easy to support selective replication of individual queues via
configuration.

Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated
- configuration: queues, exchanges and bindings are replicated but messages are not.
- messages: configuration and messages are replicated.

TODO: provide configurable default for qpid.replicate

[GRS: current prototype relies on queue sequence for message identity
so selectively replicating certain messages on a given queue would be
challenging. Selectively replicating certain queues however is trivial.]

** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.

We also have to include error handling in the async completion loop to
guarantee N-way, at-least-once delivery: we should only report success
to the client when we know the message was replicated and stored on
all N-1 backups.

TODO: We have a lot more options than the old cluster, need to figure
out the best approach, or possibly allow multiple approaches. Need to
go through the various failure cases. We may be able to do recovery on
a per-queue basis rather than restarting an entire node.


** New backups connecting to primary.

When the primary fails, one of the backups becomes primary and the
others connect to the new primary as backups.

The backups can take advantage of the messages they already have
backed up, the new primary only needs to replicate new messages.

To keep the N-1 guarantee, the primary needs to delay completion on
new messages until the backups have caught up. However, if a backup
does not catch up within some timeout, it should no longer be counted
towards the guarantee and messages completed even though it is not
caught up. Need to think about reasonable behavior here.
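
One possible shape for that policy, sketched with hypothetical names
(the timeout handling in particular is an assumption, per the note
above): the primary gives each newly connected backup a catch-up
deadline, and a new message must be acknowledged by every caught-up
backup and every backup still within its deadline before it is
completed.

#+BEGIN_SRC python
import time

class CatchUpTracker:
    """Sketch of a possible completion-delay policy for new backups."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.catching_up = {}           # backup id -> catch-up deadline
        self.ready = set()              # backups counted towards completion

    def backup_joined(self, backup):
        self.catching_up[backup] = time.time() + self.timeout

    def backup_caught_up(self, backup):
        self.catching_up.pop(backup, None)
        self.ready.add(backup)

    def expected_acks(self):
        """Backups that must acknowledge before a new message completes.
        Backups that miss their deadline are dropped from the set."""
        now = time.time()
        for backup, deadline in list(self.catching_up.items()):
            if now > deadline:
                del self.catching_up[backup]   # timed out: stop waiting
        return self.ready | set(self.catching_up)
#+END_SRC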

** Broker discovery and lifecycle.

The cluster has a client URL that can contain a single virtual IP
address or a list of real IP addresses for the cluster.

In backup mode, brokers reject connections except from a special
cluster-admin user.

Clients discover the primary by re-trying connection to the client URL
until they successfully connect to the primary. In the case of a
virtual IP they re-try the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
before giving up.

Backup brokers discover the primary in the same way as clients. There
is a separate broker URL for brokers since they will often connect
over a different network from the clients. The broker URL has to be a
list of IPs rather than one virtual IP so broker members can connect
to each other.

Brokers have the following states:
- connecting: backup broker trying to connect to primary - loops retrying broker URL.
- catchup: connected to primary, catching up on pre-existing configuration & messages.
- backup: fully functional backup.
- primary: Acting as primary, serving clients.
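
The lifecycle above can be summarised as a small state machine. The
sketch below just encodes the transitions described in this document;
the catchup -> connecting transition (losing the primary mid
catch-up) is an assumption.

#+BEGIN_SRC python
# Broker role states and their allowed transitions.
TRANSITIONS = {
    "connecting": {"catchup",      # found the primary, start catching up
                   "primary"},     # promoted by the resource manager
    "catchup":    {"backup",       # caught up on configuration & messages
                   "connecting"},  # lost the primary while catching up
    "backup":     {"connecting"},  # primary failed, go look for the new one
    "primary":    set(),           # a primary only leaves this state by stopping
}

def change_state(current, new):
    if new not in TRANSITIONS[current]:
        raise ValueError("illegal transition: %s -> %s" % (current, new))
    return new
#+END_SRC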

** Interaction with rgmanager

rgmanager interacts with qpid via 2 service scripts: backup &
primary. These scripts interact with the underlying qpidd
service. rgmanager picks the new primary when the old primary
fails. In a partition it also takes care of killing inquorate brokers.

*** Initial cluster start

rgmanager starts the backup service on all nodes and the primary service on one node.

On the backup nodes qpidd is in the connecting state. The primary node goes into
the primary state. Backups discover the primary, connect and catch up.

*** Failover

The primary broker or node fails. Backup brokers see the disconnect
and go back to the connecting state.

rgmanager notices the failure and starts the primary service on a new node.
This tells qpidd to go to the primary state. Backups re-connect and catch up.

*** Failback

A cluster of N brokers has suffered a failure; only N-1 brokers
remain. We want to start a new broker (possibly on a new node) to
restore redundancy.

If the new broker has a new IP address, the sysadmin pushes a new URL
to all the existing brokers.

The new broker starts in connecting mode. It discovers the primary,
connects and catches up.

*** Failure during failback

A second failure occurs before the new backup B can complete its
catch-up. Backup B refuses to become primary by failing the primary
start script if rgmanager chooses it, so rgmanager will try another
(hopefully caught-up) broker as primary.
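
A sketch of what that start hook could look like. Everything here is
hypothetical (neither the script nor the caught_up() check is a real
qpidd or rgmanager interface); the only point is that a non-zero exit
status makes rgmanager try another node.

#+BEGIN_SRC python
#!/usr/bin/env python
# Hypothetical "primary" service start hook for rgmanager.
import sys

def caught_up():
    # Placeholder: the real check would ask the local qpidd whether it
    # has finished catching up; the mechanism is not specified here.
    return False

def main():
    if not caught_up():
        # Refuse promotion; rgmanager sees the failure and tries to start
        # the primary service on another (hopefully caught-up) node.
        return 1
    # ... otherwise tell the local qpidd to take the primary role ...
    return 0

if __name__ == "__main__":
    sys.exit(main())
#+END_SRC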

** Interaction with the store.

# FIXME aconway 2011-11-16: TBD
- how to identify the "best" store after a total cluster crash.
- best = last to be primary.
- primary "generation" - counter passed to backups and incremented by new primary.

restart after crash:
- clean/dirty flag on disk for admin shutdown vs. crash
- dirty brokers refuse to be primary
- sysadmin tells the best broker to be primary
- erase stores? delay loading?
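
Given a generation counter and a clean/dirty flag, choosing the "best"
store could look like the sketch below (field names are hypothetical;
the generation and flag are the ones proposed in the notes above).

#+BEGIN_SRC python
def best_store(stores):
    """Pick the store to recover from after a total cluster crash.

    Each entry is assumed to carry:
      'generation' - primary generation counter recorded in the store
      'clean'      - True if the broker was shut down cleanly
    Dirty stores are never chosen automatically; among the clean ones
    the highest generation (the last to have been primary) wins."""
    clean = [s for s in stores if s["clean"]]
    if not clean:
        return None   # everything is dirty: the sysadmin has to decide
    return max(clean, key=lambda s: s["generation"])

# Example: the clean store with the highest generation is chosen.
stores = [{"name": "a", "generation": 5, "clean": True},
          {"name": "b", "generation": 7, "clean": True},
          {"name": "c", "generation": 8, "clean": False}]
assert best_store(stores)["name"] == "b"
#+END_SRC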

** Current Limitations

(In no particular order at present)

For message replication:

LM1 - The re-synchronisation does not handle the case where a newly elected
master is *behind* one of the other backups. To address this I propose
a new event for resetting the sequence, which the new master would send
out on detecting that a replicating browser is ahead of it, requesting
that the replica revert to a particular sequence number. The
replica, on receiving this event, would then discard (i.e. dequeue) all
the messages ahead of that sequence number and reset the counter to
correctly sequence any subsequently delivered messages.

LM2 - There is a need to handle wrap-around of the message sequence to avoid
confusing the resynchronisation where a replica has been disconnected
for a long time, sufficient for the sequence numbering to wrap around.
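
One common way to handle comparisons near a wrap (not part of this
design yet, and assuming a 32-bit sequence) is modular "serial number"
arithmetic: comparisons stay correct as long as the two values are
less than half the number space apart. A replica that has been away
longer than that would still need the LM1-style reset.

#+BEGIN_SRC python
SEQ_BITS = 32                 # assumed width of the queue sequence number
SEQ_MOD = 1 << SEQ_BITS
HALF = 1 << (SEQ_BITS - 1)

def seq_before(a, b):
    """True if sequence number a comes before b, allowing for wrap-around.
    Only valid while the two values are less than half the space apart."""
    return a != b and ((b - a) % SEQ_MOD) < HALF

# 5 is "after" 0xFFFFFFFF once the sequence has wrapped.
assert seq_before(0xFFFFFFFF, 5)
assert not seq_before(5, 0xFFFFFFFF)
#+END_SRC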

LM3 - Transactional changes to queue state are not replicated atomically.

LM4 - Acknowledgements are confirmed to clients before the message has been
dequeued from replicas or indeed from the local store if that is
asynchronous.

LM5 - During failover, messages (re)published to a queue before the
requisite number of replication subscriptions has been established
will be confirmed to the publisher before they are replicated, leaving
them vulnerable to loss of the new master before replication
completes.

For configuration propagation:

LC2 - Queue and exchange propagation is entirely asynchronous. There
are three cases to consider here for queue creation: (a) where queues
are created through the addressing syntax supported by the messaging
API, they should be recreated if needed on failover, and message
replication, if required, is dealt with separately; (b) where queues
are created using configuration tools by an administrator or by a
script, they can query the backups to verify the config has propagated
and commands can be re-run if there is a failure before that; (c)
where applications have more complex programs in which
queues/exchanges are created using QMF or directly via the 0-10 APIs,
the completion of the command will not guarantee that the command has
been carried out on other nodes. I.e. case (a) doesn't require
anything (apart from LM5 in some cases), case (b) can be addressed in
a simple manner through tooling, but case (c) would require changes to
the broker to allow a client to simply determine when the command has
fully propagated.

LC3 - Queues that exist locally but are not in the query response
received when a replica establishes a propagation subscription are not
deleted. I.e. deletion of queues/exchanges while a replica is not
connected will not be propagated. The solution is to delete any queues
marked for propagation that exist locally but do not show up in the
query response.
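
That solution amounts to a set difference at the point where the query
response is processed; a minimal sketch (all names hypothetical):

#+BEGIN_SRC python
def reconcile_queues(local_replicated, query_response, delete_queue):
    """Delete local replicas whose queues no longer exist on the primary.

    local_replicated - names of local queues marked for propagation
    query_response   - queue names returned by the primary's query
    delete_queue     - callback that removes a local queue"""
    for name in set(local_replicated) - set(query_response):
        delete_queue(name)
#+END_SRC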

LC4 - It is possible on failover that the new master did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1 but without an easy way to detect or remedy
it).

LC5 - Need richer control over which queues/exchanges are propagated, and
which are not.

LC6 - The events and query responses are not fully synchronized.

      In particular it *is* possible to not receive a delete event but
      for the deleted object to still show up in the query response
      (meaning the deletion is 'lost' to the update).

      It is also possible for a create event to be received as well
      as the created object being in the query response. Likewise it
      is possible to receive a delete event and a query response in
      which the object no longer appears. In these cases the event is
      essentially redundant.

      It is not, however, possible to miss a create event and yet not
      have the object in question in the query response.
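
      A consequence is that a mirror which applies the query response
      first and then applies events idempotently is safe against the
      redundant cases. A sketch (hypothetical names, not the QMF API):

#+BEGIN_SRC python
def apply_snapshot_and_events(snapshot_names, events, local):
    """Apply a query-response snapshot, then the event stream, idempotently.

    'local' is a dict of mirrored objects (name -> attributes). Redundant
    events (a create for something already in the snapshot, a delete for
    something already absent) are harmless no-ops; a delete event missed
    before the snapshot was taken remains the 'lost' case described above."""
    for name in snapshot_names:
        local.setdefault(name, {})        # everything in the snapshot exists
    for kind, name in events:
        if kind == "create":
            local.setdefault(name, {})    # no-op if already present
        elif kind == "delete":
            local.pop(name, None)         # no-op if already absent
    return local
#+END_SRC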

* Benefits compared to previous cluster implementation.

- Does not need openais/corosync, does not require multicast.
- Possible to replace rgmanager with other resource mgr (PaceMaker, windows?)
- DR is just another backup
- Performance (some numbers?)
- Virtual IP supported by rgmanager.

* User Documentation Notes

Notes to seed initial user documentation. Loosely tracking the implementation,
some points mentioned in the doc may not be implemented yet.

** High Availability Overview

HA is implemented using a 'hot standby' approach. Clients are directed
to a single "primary" broker. The primary executes client requests and
also replicates them to one or more "backup" brokers. If the primary
fails, one of the backups takes over the role of primary, carrying on
from where the old primary left off. Clients fail over to the new
primary automatically and continue their work.

TODO: at least once, deduplication.

** Enabling replication on the client.

To enable replication, set the qpid.replicate argument when creating a
queue or exchange.

It can have one of three values:
- none: the object is not replicated
- configuration: queues, exchanges and bindings are replicated but messages are not.
- messages: configuration and messages are replicated.

TODO: examples
TODO: more options for default value of qpid.replicate
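
A sketch of how this might look from the Python client, using the
qpid.messaging address syntax to pass the argument when the queue is
declared (the broker address is a placeholder, and the exact spelling
of the option values should be checked against the released
documentation):

#+BEGIN_SRC python
from qpid.messaging import Connection, Message

# Declare a queue whose configuration and messages are both replicated.
address = ("my-queue; {create: always, node: {x-declare: "
           "{arguments: {'qpid.replicate': 'messages'}}}}")

connection = Connection("primary.example.com:5672")   # placeholder address
connection.open()
try:
    session = connection.session()
    sender = session.sender(address)
    sender.send(Message("hello"))
finally:
    connection.close()
#+END_SRC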

An HA client connection has multiple addresses, one for each broker. If
it fails to connect to an address, or the connection breaks, it will
automatically fail over to another address.

Only the primary broker accepts connections; the backup brokers
redirect connection attempts to the primary. If the primary fails, one
of the backups is promoted to primary and clients fail over to the new
primary.
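
For example, the Python qpid.messaging client can be given the whole
list of broker addresses and told to keep retrying (the addresses are
placeholders and the option names should be checked against the client
documentation):

#+BEGIN_SRC python
from qpid.messaging import Connection

# The client retries the listed addresses; only the current primary accepts.
connection = Connection("broker1.example.com:5672",
                        reconnect=True,
                        reconnect_urls=["broker1.example.com:5672",
                                        "broker2.example.com:5672",
                                        "broker3.example.com:5672"],
                        reconnect_interval=1,   # seconds between attempts
                        reconnect_limit=30)     # give up after this many tries
connection.open()
session = connection.session()
# ... use the session; after failover the client reconnects automatically ...
connection.close()
#+END_SRC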

TODO: using multiple-address connections, examples c++, python, java.

TODO: dynamic cluster addressing?

TODO: need de-duplication.

** Enabling replication on the broker.

Network topology: backup links, separate client/broker networks.
Describe failover mechanisms.
- Client view: URLs, failover, exclusion & discovery.
- Broker view: similar.
Role of rgmanager

** Configuring rgmanager

** Configuring qpidd
HA-related options.