-*-org-*-
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

* Another new design for Qpid clustering.

For background see [[./new-cluster-design.txt]], which describes the issues
with the old design and a new active-active design that could replace it.

This document describes an alternative hot-standby approach.

** Delivery guarantee

We guarantee N-way redundant, at-least-once delivery. Once a message
from a client has been acknowledged by the broker, it will be
delivered even if N-1 brokers subsequently fail. There may be
duplicates in the event of a failure, but we don't create duplicates
during normal operation (i.e. when no brokers have failed).

This is the same guarantee as the old cluster and the alternative
active-active design.

** Active-active vs. hot standby (aka primary-backup)

An active-active cluster allows clients to connect to any broker in
the cluster. If a broker fails, clients can fail over to any other
live broker.

A hot-standby cluster has only one active broker at a time (the
"primary") and one or more brokers on standby (the "backups"). Clients
are served only by the primary; clients that connect to a backup are
redirected to the primary. The backups are kept up to date in real
time by the primary; if the primary fails, a backup is elected to be
the new primary.

Aside: a cold-standby cluster is possible using a standalone broker,
CMAN and shared storage. In this scenario only one broker runs at a
time, writing to a shared store. If it fails, another broker is
started (by CMAN) and recovers from the store. This bears
investigation, but the store recovery time is probably too long for
failover.

** Why hot standby?

Active-active has some advantages:
- Finding a broker on startup or failover is simple: just pick any live broker.
- All brokers are always running in active mode; there's no idle standby capacity.
- Distributing clients across brokers gives better performance, but see [1].
- A broker failure affects only the clients connected to that broker.

The main problem with active-active is co-ordinating consumers of the
same queue on multiple brokers such that there are no duplicates in
normal operation. There are two approaches:

Predictive: each broker predicts which messages the others will take.
This was the main weakness of the old design, so it is not appealing.

Locking: brokers "lock" a queue in order to take messages. This is
complex to implement, and it is not straightforward to determine the
most performant strategy for passing the lock.

Hot standby removes this problem. Only the primary can modify queues,
so it just has to tell the backups what it is doing; there is no
locking.

The primary can enqueue messages and replicate asynchronously,
exactly as the store does, but it "writes" to the replicas over the
network rather than to disk.
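To make the analogy concrete, here is a minimal sketch of that
replication path. All names (QueueEvent, Backup, ReplicatingQueue) are
hypothetical, not existing Qpid classes, and the backup acks
immediately so the sketch is self-contained: the point is that the
client's acknowledgement is deferred until every backup has confirmed
the event, just as the store defers completion until the disk write
finishes.

#+BEGIN_SRC C++
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// One queue modification made by the primary; backups replay these in
// the order they arrive.
struct QueueEvent {
    enum Type { ENQUEUE, DEQUEUE } type;
    std::string queue;      // broker queue the event applies to
    std::uint64_t position; // sequence number within that queue
    std::string content;    // message body (ENQUEUE only)
};

// Stand-in for one backup broker. A real implementation would send
// the event over CPG or an AMQP connection and ack asynchronously.
struct Backup {
    void replicate(const QueueEvent&, std::function<void()> onAck) {
        onAck();
    }
};

// Analogue of the store's asynchronous-completion model: the enqueue
// "completes" (and the client sees its acknowledgement) only after
// every backup has confirmed, which is what makes the message safe
// against N-1 failures. A single completion thread is assumed here;
// a real broker would use an atomic counter.
class ReplicatingQueue {
    std::vector<Backup>& backups;
  public:
    explicit ReplicatingQueue(std::vector<Backup>& b) : backups(b) {}

    void enqueue(const QueueEvent& e, std::function<void()> complete) {
        if (backups.empty()) { complete(); return; }
        std::shared_ptr<size_t> pending =
            std::make_shared<size_t>(backups.size());
        for (size_t i = 0; i < backups.size(); ++i) {
            backups[i].replicate(e, [pending, complete]() {
                if (--*pending == 0) complete(); // last ack completes
            });
        }
    }
};
#+END_SRC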
** Failover in a hot-standby cluster.

Hot standby has some potential performance issues around failover:

- Failover "spike": when the primary fails, every client fails over
  at the same time, putting strain on the system.

- Until a new primary is elected, the cluster cannot serve any clients
  or redirect clients to the primary.

We want to minimize the number of re-connect attempts that clients
have to make. The cluster can use a well-known algorithm to choose the
new primary (e.g. round robin on a known sequence of brokers) so that
clients can guess the new primary correctly in most cases.

Even if clients do guess correctly, it may be that the new primary is
not yet aware of the death of the old primary, which may cause
multiple failed connect attempts before clients eventually get
connected. We will need to prototype to see how much this happens in
reality and how we can best get clients redirected.

** Threading and performance.

The primary-backup cluster operates analogously to the way the disk
store does now:
- use the same MessageStore interface as the store to interact with the broker
- use the same asynchronous-completion model for replicating messages
- use the same recovery interfaces (?) for new backups joining.

Re-using the well-established store design gives credibility to the
new cluster design.

The single CPG dispatch thread was a severe performance bottleneck for
the old cluster.

The primary has the same threading model as a standalone broker with
a store, which we know performs well.

If we use CPG for replication of messages, the backups will receive
messages in the CPG dispatch thread. To get more concurrency, the CPG
thread can dump work onto internal PollableQueues to be processed in
parallel.

Messages from the same broker queue need to go onto the same
PollableQueue. There could be a separate PollableQueue for each broker
queue. If that's too resource-intensive we can use a fixed set of
PollableQueues and assign broker queues to PollableQueues via hashing
or round robin (see the sketch below).

Another possible optimization is to use multiple CPG groups: one per
queue or a hashed set, to get more concurrency in the CPG layer. The
old cluster is not able to keep CPG busy.
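A sketch of that hashing scheme. PollableQueue here is only a
stand-in for qpid::sys::PollableQueue, and BackupDispatcher is a
hypothetical name; round-robin assignment would work the same way,
hashing just avoids keeping an explicit map.

#+BEGIN_SRC C++
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Stand-in for qpid::sys::PollableQueue: a work queue drained by a
// poller thread. Only the interface matters for this sketch.
template <class T> struct PollableQueue {
    std::deque<T> items;
    void push(const T& t) { items.push_back(t); }
};

struct QueueEvent { std::string queue; }; // as in the earlier sketch

// Events for the same broker queue must stay in order, so they must
// always land on the same PollableQueue. Hashing the queue name onto
// a fixed set of PollableQueues guarantees that, while letting events
// for different broker queues be processed in parallel.
class BackupDispatcher {
    std::vector<PollableQueue<QueueEvent> > workers;
  public:
    explicit BackupDispatcher(size_t n) : workers(n) {}

    // Called from the single CPG dispatch thread. The work done here
    // is trivial, so the CPG thread no longer limits throughput.
    void dispatch(const QueueEvent& e) {
        size_t i = std::hash<std::string>()(e.queue) % workers.size();
        workers[i].push(e);
    }
};
#+END_SRC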
TODO: Transactions pose a challenge for these concurrent models: how
to co-ordinate multiple messages being added (committing a publish or
rolling back an accept) to multiple queues so that all replicas end up
with the same message sequence while respecting atomicity.

** Use of CPG

CPG provides several benefits in the old cluster:
- tracking membership (essential for determining the primary)
- handling "split brain" (integrates with partition support from CMAN)
- a reliable multicast protocol to distribute messages.

I believe we still need CPG for membership and split brain. We could
experiment with sending the bulk traffic over AMQP connections.

** Flow control

We need to ensure that:
1) the in-memory internal queues used by the cluster don't overflow;
2) the backups don't fall too far behind on processing CPG messages.

** Recovery

When a new backup joins an active cluster it must get a snapshot from
one of the other backups, or from the primary if there are none. In
store terms this is "recovery" (the old cluster called it an
"update").

Compared to the old cluster we replicate only the well-defined data
set of the store. This was a crucial sore spot of the old cluster.

We can also replicate it more efficiently by recovering queues in
reverse (LIFO) order. That means that as clients actively consume
messages from the front of the queue, they are reducing the work we
have to do in recovering from the back. (NOTE: this may not be
compatible with using the same recovery interfaces as the store.)

** Selective replication

In this model it's easy to support selective replication of
individual queues via configuration:
- Explicit exchange/queue declare argument and message boolean:
  x-qpid-replicate. Treated analogously to the persistent/durable
  properties used by the store.
- If not explicitly marked, provide a choice of defaults:
  - default is replicate (replicated messages on replicated queues)
  - default is don't replicate
  - default is replicate persistent/durable messages.

** Inconsistent errors

The new design eliminates most sources of inconsistent errors in the
old design (connections, sessions, security, management etc.) and
eliminates the need to stall the whole cluster until an error is
resolved. We still have to handle inconsistent store errors when the
store and the cluster are used together.

We also have to include error handling in the async completion loop to
guarantee N-way at-least-once delivery: we should only report success
to the client when we know the message was replicated and stored on
all N-1 backups.

TODO: We have many more options than the old cluster. We need to
figure out the best approach, or possibly allow multiple approaches,
and go through the various failure cases. We may be able to do
recovery on a per-queue basis rather than restarting an entire node.

** New members joining

We should be able to catch up much faster than the old design. A new
backup can catch up ("recover") the current cluster state on a
per-queue basis:
- queues can be updated in parallel
- "live" updates avoid the "endless chase".

During a "live" update several things are happening on a queue:
- clients are publishing messages to the back of the queue, replicated
  to the backup
- clients are consuming messages from the front of the queue,
  replicated to the backup
- the primary is sending pre-existing messages to the new backup.

The primary sends pre-existing messages in LIFO order, starting from
the back of the queue, while clients are consuming from the front. The
active consumers actually reduce the amount of work to be done, as
there is no need to replicate messages that are no longer on the
queue.
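A sketch of the bookkeeping for such a live update of a single queue,
with all names hypothetical and std::cout standing in for the network
send. The cursor walks from the back of the pre-existing messages
while consumers trim them from the front, so the two meet in the
middle:

#+BEGIN_SRC C++
#include <deque>
#include <iostream>
#include <string>

// Bookkeeping for a "live" update of one queue to a new backup.
// Pre-existing messages are sent in LIFO order (from the back) while
// clients keep consuming from the front; every message a client
// consumes is one the update no longer has to send.
struct LiveQueueUpdate {
    std::deque<std::string>& q; // the broker queue being updated
    size_t unsent;              // pre-existing messages not yet sent

    explicit LiveQueueUpdate(std::deque<std::string>& queue)
        : q(queue), unsent(queue.size()) {}

    // Call when a client consumes from the front. While the update is
    // running, the front is the last thing it would send, so the
    // consumed message must still be unsent; the consume itself
    // reaches existing backups through normal dequeue replication.
    void consumed() { if (unsent > 0) --unsent; }

    // Send the next pre-existing message, back to front. Returns
    // false once the update has met the consumers in the middle.
    bool sendNext() {
        if (unsent == 0) return false;
        --unsent;
        std::cout << "send to new backup: " << q[unsent] << "\n";
        return true;
    }
};
#+END_SRC

New messages published during the update arrive behind the starting
point of the cursor and reach the new backup through normal
replication, so the update only ever has to cover the shrinking
pre-existing region.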
* Steps to get there

** Baseline replication

Validate the overall design and get an initial notion of performance.
Just message and wiring replication: no update/recovery for new
members joining, a single CPG dispatch thread on backups, no failover,
no transactions.

** Failover

Electing the primary; backups redirect clients to the primary. Measure
failover time for a large number of clients. Strategies to minimise
the number of retries after a failure.

** Flow Control

Keep internal queues from overflowing. Similar to the internal flow
control in the old cluster. Needed for realistic performance/stress
tests.

** Concurrency

Experiment with multiple threads on backups and multiple CPG groups.

** Recovery/new member joining

Initial status handshake for a new member. Recovering queues from the
back.

** Transactions

TODO: How to implement transactions with concurrency. The worst
solution: a global --cluster-use-transactions flag that forces
single-threaded mode. We need to find a better solution.
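Purely as an illustration of one candidate direction for that TODO,
not a settled design: if each transaction's operations are buffered on
the primary and replicated as a single unit at commit time, the
multicast layer's total ordering would give every replica the same
atomic application point in its event stream. All names below are
hypothetical.

#+BEGIN_SRC C++
#include <string>
#include <vector>

// One operation inside an open transaction on the primary.
struct QueueOp {
    std::string queue;
    bool enqueue;        // false = dequeue (an accepted message)
    std::string message;
};

// Operations are buffered on the primary and shipped to the backups
// only at commit time, as a single replication event. Because the
// whole set travels as one multicast message, every replica applies
// it at the same point in its event sequence, preserving atomicity
// and an identical message sequence on every queue.
class TxBuffer {
    std::vector<QueueOp> ops;
  public:
    void add(const QueueOp& op) { ops.push_back(op); }
    void commit() {
        // multicast(encode(ops)); // hypothetical: one CPG message
        ops.clear();
    }
    void rollback() { ops.clear(); } // nothing was ever replicated
};
#+END_SRC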