author | Alan Conway <aconway@apache.org> | 2010-09-22 21:14:01 +0000 |
---|---|---|
committer | Alan Conway <aconway@apache.org> | 2010-09-22 21:14:01 +0000 |
commit | 455150834b51020a6dc9abd1466a452079e2ffaf (patch) | |
tree | af757f90318e8ba9aac4b82ca979301b00217b84 /cpp | |
parent | eb2f320af6e874d763fb2f2b3e81801424416cc1 (diff) | |
download | qpid-python-455150834b51020a6dc9abd1466a452079e2ffaf.tar.gz |
Design note on proposed new cluster design.
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk/qpid@1000234 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'cpp')
-rw-r--r-- | cpp/src/qpid/cluster/new-cluster-design.txt | 277 |
1 files changed, 277 insertions, 0 deletions
diff --git a/cpp/src/qpid/cluster/new-cluster-design.txt b/cpp/src/qpid/cluster/new-cluster-design.txt
new file mode 100644
index 0000000000..199f9b12c6
--- /dev/null
+++ b/cpp/src/qpid/cluster/new-cluster-design.txt
@@ -0,0 +1,277 @@

* A new design for Qpid clustering.

** Issues with current design.

The cluster is based on virtual synchrony: each broker multicasts
events, and the events from all brokers are serialized and delivered in
the same order to each broker.

In the current design, raw byte buffers from client connections are
multicast, serialized and delivered in the same order to each broker.

Each broker has a replica of all queues, exchanges and bindings, and also
all connections & sessions from every broker. Cluster code treats the
broker as a "black box": it "plays" the client data into the
connection objects and assumes that, given the same input, each
broker will reach the same state.

A new broker joining the cluster receives a snapshot of the current
cluster state, and then follows the multicast conversation.

*** Maintenance issues.

The entire state of each broker is replicated to every member:
connections, sessions, queues, messages, exchanges, management objects
etc. Any discrepancy in the state that affects how messages are
allocated to consumers can cause an inconsistency.

- Entire broker state must be faithfully updated to new members.
- Management model also has to be replicated.
- All queues are replicated; can't have unreplicated queues (e.g. for management).

Events that are not deterministically predictable from the client
input data stream can cause inconsistencies. In particular, use of
timers/timestamps requires cluster workarounds to synchronize.

A member that encounters an error which is not encountered by all other
members is considered inconsistent and will shut itself down. Such
errors can come from any area of the broker code; e.g. different
ACL files can cause inconsistent errors.
The following areas required workarounds to work in a cluster:

- Timers/timestamps in broker code: management, heartbeats, TTL.
- Security: the cluster must replicate *after* decryption by the security layer.
- Management: not initially included in the replicated model, a source of many inconsistencies.

It is very easy for someone adding a feature or fixing a bug in the
standalone broker to break the cluster by:
- adding new state that needs to be replicated in cluster updates.
- doing something in a timer or other non-connection thread.

It's very hard to test for such breaks. We need a looser coupling
and a more explicitly defined interface between cluster and standalone
broker code.

*** Performance issues.

Virtual synchrony delivers all data from all clients in a single
stream to each broker. The cluster must play this data through the full
broker code stack (connections, sessions etc.) in a single thread
context. The cluster has a pipelined design to get some concurrency,
but this is a severe limitation on scalability on multi-core hosts
compared to the standalone broker, which processes each connection
in a separate thread context.

** A new cluster design.

Maintain /equivalent/ state, not /identical/ state, on each member.

Messages from different sources need not be ordered identically on all members.

Use a moving queue ownership protocol to agree on the order of dequeues,
rather than relying on identical state and lock-step behavior to cause
identical dequeues on each broker.

*** Requirements

The cluster must provide these delivery guarantees:

- client sends transfer: the message must be replicated and not lost even if the local broker crashes.
- client acquires a message: the message must not be delivered on another broker while acquired.
- client rejects an acquired message: the message must be re-queued on the cluster and not lost.
- client disconnects or broker crashes: acquired but not accepted messages must be re-queued on the cluster.
- client accepts a message: the message is forgotten and will never be delivered or re-queued by any broker.

Each guarantee takes effect when the client receives a *completion*
for the associated command (transfer, acquire, reject, accept).

The cluster must provide this ordering guarantee:

- messages from the same publisher received by the same subscriber
  must be received in the order they were sent (except in the case of
  re-queued messages.)

*** Broker receiving messages

On receiving a message transfer, in the connection thread we:
- multicast a message-received event.
- enqueue the message on the local queue.
- asynchronously complete the transfer when the message-received event is self-delivered.

This is exactly like the asynchronous completion in the MessageStore:
the cluster "stores" a message by multicast. We send a completion to
the client asynchronously when the multicast "completes" by
self-delivery. This satisfies the "client sends transfer" guarantee,
but makes the message available on the queue immediately, avoiding the
multicast latency.

It also moves most of the work to the client connection thread. The
only work in the virtual synchrony deliver thread is sending the client
completion.

Other brokers enqueue the message when they receive the
message-received event, in the virtual synchrony deliver thread.

*** Broker sending messages: moving queue ownership

Each queue is *owned* by at most one cluster broker at a time. Only
that broker may acquire or dequeue messages. The owner multicasts
notification of the messages it acquires/dequeues to the cluster.
Periodically the owner hands over ownership to another interested
broker, providing time-shared access to the queue among all interested
brokers.

This means we no longer need identical message ordering on all brokers
to get consistent dequeuing. Messages from different sources can be
ordered differently on different brokers.
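The receive path above (multicast, enqueue locally, complete on self-delivery) can be sketched as follows. This is a minimal single-threaded simulation; the names (Multicaster, Broker) and the queue-of-callbacks stand-in for the virtual synchrony layer are illustrative assumptions, not the real Qpid classes.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the virtual synchrony layer: events are queued until
// "self-delivered" by the (simulated) deliver thread.
struct Multicaster {
    std::deque<std::function<void()>> pending;
    void mcast(std::function<void()> onSelfDelivery) {
        pending.push_back(std::move(onSelfDelivery));
    }
    // Simulate the virtual synchrony deliver thread draining events.
    void deliverAll() {
        while (!pending.empty()) {
            pending.front()();
            pending.pop_front();
        }
    }
};

struct Broker {
    Multicaster& net;
    std::vector<std::string> queue;      // local queue replica
    std::vector<std::string> completed;  // completions sent to clients

    // Connection thread: runs on receiving a message transfer.
    void onTransfer(const std::string& msg) {
        net.mcast([this, msg] { completed.push_back(msg); });  // 1. multicast
        queue.push_back(msg);           // 2. enqueue locally, no added latency
        // 3. the completion fires only when the event self-delivers.
    }
};
```

Note that the message is on the local queue (and so deliverable) before the multicast completes; only the client-visible completion waits for self-delivery, which is what gives the "client sends transfer" guarantee.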
We assume the same IO-driven dequeuing algorithm as the standalone
broker, with one modification: queues can be "locked". A locked queue
is not available for dequeuing messages and will be skipped by the
output algorithm.

At any given time, only those queues owned by the local broker will be
unlocked.

As messages are acquired/dequeued from unlocked queues by the IO threads,
the broker multicasts acquire/dequeue events to the cluster.

When an unlocked queue has no more consumers with credit, or when a
time limit expires, the broker relinquishes ownership by multicasting
a release-queue event, allowing another interested broker to take
ownership.

*** Asynchronous completion of accept

In acknowledged mode a message is not forgotten until it is accepted,
to allow for re-queue on rejection or crash. The accept should not be
completed until the message has been forgotten.

On receiving an accept, the broker:
- dequeues the message from the local queue
- multicasts a "dequeue" event
- completes the accept asynchronously when the dequeue event is self-delivered.

NOTE: The message store does not currently implement asynchronous
completion of accept; this is a bug.

** Inconsistent errors.

The new design eliminates most sources of inconsistent errors
(connections, sessions, security, management etc.) The only points
where inconsistent errors can occur are at enqueue and dequeue (most
likely store-related errors.)

The new design can use the existing error-handling protocol with one
major improvement: since brokers are no longer required to maintain
identical state, they do not have to stall processing while an error is
being resolved.

#TODO: The only source of dequeue errors is probably an unrecoverable journal failure.

** Cluster API

The new cluster API is an extension of the existing MessageStore API.
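One way to picture the extension is an event-sink interface where the store ignores the new "acquired" event and the cluster implementation multicasts each event before delegating to the store. The interface and method names below are a hypothetical sketch, not the real qpid::broker::MessageStore signatures.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical extended store interface.
struct MessageSink {
    virtual ~MessageSink() = default;
    virtual void enqueued(const std::string& queue, const std::string& msg) = 0;
    virtual void dequeued(const std::string& queue, const std::string& msg) = 0;
    // New event the cluster needs; a plain store may ignore it.
    virtual void acquired(const std::string&, const std::string&) {}
};

// A persistence store records enqueues/dequeues in its journal.
struct Store : MessageSink {
    std::vector<std::string> journal;
    void enqueued(const std::string& q, const std::string& m) override {
        journal.push_back("enq " + q + " " + m);
    }
    void dequeued(const std::string& q, const std::string& m) override {
        journal.push_back("deq " + q + " " + m);
    }
};

// Cluster implementation: multicasts each event, then delegates to the
// store implementation when cluster+store are in use.
struct ClusterSink : MessageSink {
    std::shared_ptr<MessageSink> store;  // may be null without persistence
    std::vector<std::string> multicast;  // stand-in for the multicast layer
    explicit ClusterSink(std::shared_ptr<MessageSink> s) : store(std::move(s)) {}
    void enqueued(const std::string& q, const std::string& m) override {
        multicast.push_back("enqueue " + m);
        if (store) store->enqueued(q, m);
    }
    void dequeued(const std::string& q, const std::string& m) override {
        multicast.push_back("dequeue " + m);
        if (store) store->dequeued(q, m);
    }
    void acquired(const std::string& q, const std::string& m) override {
        multicast.push_back("acquire " + m);  // store's default ignores this
        if (store) store->acquired(q, m);
    }
};
```

The delegation pattern keeps the store unaware of the cluster: a store compiled against the base interface keeps working, and the acquire event simply falls through to the empty default.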
The MessageStore API already covers these events:
- wiring changes: queue/exchange declare/bind.
- message enqueued/dequeued.

The cluster needs to add a "message acquired" call, which store
implementations can ignore.

The cluster will require some extensions to the Queue:
- Queues can be "locked"; locked queues are skipped by IO-driven output.
- Messages carry a cluster-message-id.
- Messages can be dequeued by cluster-message-id.

When cluster+store are in use, the cluster implementation of MessageStore
will delegate to the store implementation.

** Maintainability

This design gives us more robust code with clear and explicit interfaces.

The cluster depends on specific, well-defined events, defined by the
extended MessageStore API. Provided the semantics of this API are not
violated, the cluster will not be broken by changes to broker code.

Re-using the established MessageStore API provides assurance that the
API is sound and gives economy of design.

The cluster no longer requires identical processing of the entire
broker stack on each broker. It is not affected by the details of how
the broker allocates messages. It is independent of the
protocol-specific state of connections and sessions, and so is
protected from future protocol changes (e.g. AMQP 1.0).

A number of specific ways the code will be simplified:
- drop code to replicate the management model.
- drop timer workarounds for TTL, management, heartbeats.
- drop "cluster-safe assertions" in broker code.
- drop connections, sessions and management from the cluster update.
- drop security workarounds: cluster code now operates after message decoding.
- drop connection tracking in cluster code.
- simpler inconsistent-error handling code, no need to stall.

** Performance

The only way to verify the relative performance of the new design is
to prototype & profile.
The following points suggest the new design
may scale/perform better:

Moving work from the virtual synchrony thread to connection threads, where
it can be done in parallel:
- All connection/session logic moves to the connection thread.
- Exchange routing logic moves to the connection thread.
- The local broker does all enqueue/dequeue work in the connection thread.
- Enqueue/dequeue are IO-driven, as for a standalone broker.

Optimizes common cases (pay for what you use):
- Publisher/subscriber on the same broker: replication is fully asynchronous, no extra latency.
- Unshared queue: dequeue is all IO-driven in the connection thread.
- Time sharing: pay for time-sharing only if a queue has consumers on multiple brokers.

#TODO Not clear how the time-sharing algorithm would compare with the existing cluster delivery algorithm.

Extra decode/encode: the old design multicasts raw client data and
decodes it in the virtual synchrony thread. The new design would
decode messages in the connection thread, re-encode them for
multicast, and decode them (on non-local brokers) in the virtual synchrony
thread. There is extra work here, but only in the *connection* thread:
on a multi-core machine this happens in parallel for every connection,
so it is probably not a bottleneck. There may be scope to optimize
decode/re-encode by re-using some of the original encoded data; this
could also benefit the stand-alone broker.

** Misc outstanding issues & notes

Message IDs: need an efficient cluster-wide message ID.

Replicating wiring:
- Need async completion of wiring commands?
- qpid.sequence_counter: needs extra work to support in the new design; do we care?

Cluster+persistence:
- cluster implements MessageStore+ & delegates to store?
- finish async completion: dequeue completion for store & cluster.
- need to support multiple async completions.
- cluster restart from store: clean stores are *not* identical; pick one, all others update.
- async completion of wiring changes?
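The "efficient cluster-wide message ID" noted above could take several forms; one common choice, offered here purely as an assumption, is to pack the originating member's ID with a per-member sequence number into a single 64-bit value, so IDs are unique cluster-wide without coordination.

```cpp
#include <cstdint>

// Hypothetical cluster-wide message ID: unique across the cluster
// because each member only ever stamps its own node ID, and each
// member's sequence counter is monotonic. 16 bits node | 48 bits seq.
struct ClusterMessageId {
    uint16_t node;      // cluster member that first received the message
    uint64_t sequence;  // per-node counter, incremented per message

    uint64_t pack() const {
        return (uint64_t(node) << 48) | (sequence & 0xFFFFFFFFFFFFULL);
    }
    static ClusterMessageId unpack(uint64_t v) {
        return { uint16_t(v >> 48), v & 0xFFFFFFFFFFFFULL };
    }
    bool operator==(const ClusterMessageId& o) const {
        return node == o.node && sequence == o.sequence;
    }
};
```

A compact fixed-width ID like this also supports the "dequeue by cluster-message-id" queue extension cheaply, e.g. as a hash-map key.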
Live updates: we don't need to stall brokers during an update!
- update on a queue-by-queue basis.
- updatee locks queues during the update, no dequeue.
- update in reverse: don't update messages dequeued during the update.
- updatee adds update messages at the front (as normal), replicated messages at the back.
- updater starts from the back, sends "update done" when it hits the front of the queue.

Flow control: need to throttle multicasting.
1. bound the number of outstanding multicasts.
2. ensure the entire cluster keeps up, no unbounded "lag".
The existing design uses read-credit to solve 1, and does not solve 2.
New design should stop reading on all connections while a flow control
condition exists?
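Point 1 above (bounding outstanding multicasts) amounts to a credit counter: stop reading connections when the count of not-yet-self-delivered multicasts hits a limit. A minimal sketch, assuming a single-threaded event loop (a real implementation would need thread safety and would still have to solve the cluster-wide lag of point 2):

```cpp
// Hypothetical throttle for outstanding multicasts.
class McastThrottle {
    unsigned outstanding = 0;  // multicasts sent but not yet self-delivered
    unsigned limit;
public:
    explicit McastThrottle(unsigned max) : limit(max) {}
    // Called before multicasting; false means "stop reading connections".
    bool trySend() {
        if (outstanding >= limit) return false;
        ++outstanding;
        return true;
    }
    // Called when a multicast is self-delivered, releasing one credit.
    void delivered() { if (outstanding > 0) --outstanding; }
    bool congested() const { return outstanding >= limit; }
};
```

This mirrors the read-credit idea in the existing design: self-delivery acts as the acknowledgement that replenishes credit.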