commit:    06228716e345a81ee2c93055a6a6133c540fbada
tree:      3aa1d81bcaa4ab0b5dea9e3d8a5a3af795899716 /doc/dev
parent:    c45a8491fefbb3442fef2b4c2ab6bbebcc5013ea
author:    Mark Kampe <mark.kampe@dreamhost.com>  2011-12-02 11:26:20 -0800
committer: Mark Kampe <mark.kampe@dreamhost.com>  2011-12-02 11:28:38 -0800
download:  ceph-06228716e345a81ee2c93055a6a6133c540fbada.tar.gz

Doc: add a conceptual overview of the peering process
Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>

Diffstat (limited to 'doc/dev')

 -rw-r--r--   doc/dev/peering.rst | 221
 1 file changed, 220 insertions, 1 deletion

diff --git a/doc/dev/peering.rst b/doc/dev/peering.rst
index 16fbdf5994b..60a3a5621d1 100644
--- a/doc/dev/peering.rst
+++ b/doc/dev/peering.rst
@@ -1,6 +1,225 @@

======================
 Peering
======================

Concepts
--------

*Peering*
   the process of bringing all of the OSDs that store a Placement
   Group (PG) into agreement about the state of all of the objects
   (and their metadata) in that PG.  Note that agreeing on the state
   does not mean that they all have the latest contents.

*Acting set*
   the set of OSDs that are (or were, as of some epoch) on the list
   of nodes responsible for storing a particular PG.

*Primary*
   the (by convention first) member of the *acting set*, and the only
   OSD that will accept client-initiated writes to objects in a
   placement group.

*Replica*
   a non-primary OSD in the *acting set* for a placement group (one
   that has been recognized as such and *activated* by the primary).

*Stray*
   an OSD that is not a member of the current *acting set*, but has
   not yet been told that it can delete its copies of a particular
   placement group.

*Recovery*
   ensuring that copies of all of the objects in a PG are on all of
   the OSDs in the *acting set*.  Once *peering* has been performed,
   the primary can start accepting write operations, and *recovery*
   can proceed in the background.

*PG log*
   a list of recent updates made to objects in a PG.  These logs can
   be truncated once all OSDs in the *acting set* have acknowledged
   the updates up to a certain point.

*Back-log*
   if the failure of an OSD makes it necessary to replay operations
   that have already been truncated from the most recent PG logs, the
   missing information must be reconstructed by walking the object
   space and generating log entries for the operations that would
   create the existing objects in their existing states.  While a
   back-log may differ from the actual sequence of operations that
   brought the PG to its current state, it is equivalent ... and that
   is good enough.

*Missing set*
   each OSD notes update log entries and, if they imply updates to the
   contents of an object, adds that object to a list of needed
   updates.  This list is called the *missing set* for that <OSD,PG>.

*Authoritative history*
   a complete, fully ordered set of operations that, if performed,
   would bring an OSD's copy of a placement group up to date.

*Epoch*
   a (monotonically increasing) OSD map version number.

*Last epoch start*
   the last epoch at which all nodes in the *acting set* for a
   particular placement group agreed on an *authoritative history*.
   At this point, *peering* is deemed to have been successful.

*Up through*
   when a primary successfully completes the *peering* process, it
   informs a monitor that an *authoritative history* has been
   established (for that PG) **up through** the current epoch, and
   that the primary is now going active.

*Last epoch clean*
   the last epoch at which all nodes in the *acting set* for a
   particular placement group were completely up to date (both PG
   logs and object contents).  At this point, *recovery* is deemed to
   have been completed.
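
To make these terms concrete, here is a minimal sketch of the
per-<OSD,PG> bookkeeping that the definitions above describe.  It is
purely illustrative Python, not Ceph's actual (C++) data structures,
and every name in it is invented for this example::

   # Hypothetical sketch only -- not Ceph's actual data structures.
   from dataclasses import dataclass, field
   from typing import Dict, List, Optional

   @dataclass
   class LogEntry:
       epoch: int      # OSD map epoch in which the update was accepted
       version: int    # per-PG sequence number of the update
       oid: str        # id of the object affected by the update

   @dataclass
   class PGLog:
       entries: List[LogEntry] = field(default_factory=list)

       def tail(self) -> Optional[LogEntry]:
           """Oldest entry still retained (older entries were truncated)."""
           return self.entries[0] if self.entries else None

       def head(self) -> Optional[LogEntry]:
           """Most recent entry (newest update this OSD knows about)."""
           return self.entries[-1] if self.entries else None

   # The *missing set* for one <OSD,PG>: objects whose updates appear in
   # the PG log but whose new contents this OSD has not yet received.
   MissingSet = Dict[str, int]   # object id -> version of the needed update
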
Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG is acknowledged
to a client until it has been persisted by all members of the *acting
set* for that PG.  This means that if we can communicate with at least
one member of each *acting set* since the last successful *peering*,
someone will have a record of every (acknowledged) operation since the
last successful *peering*, and it should therefore be possible for the
current primary to construct and disseminate a new *authoritative
history*.

It is also important to appreciate the role of the OSD map (the list
of all known OSDs and their states, as well as some information about
the placement groups) in the *peering* process:

   When OSDs go up or down (or get added or removed), this has the
   potential to affect the *acting sets* of many placement groups.

   When a primary successfully completes the *peering* process, this
   too is noted in the OSD map (*last epoch start*).

   Changes can only be made after successful *peering* (recorded in
   the PAXOS stream as an "up through").

Thus, if a new primary has a copy of the latest OSD map, it can infer
which *acting sets* may have accepted updates, and thus which OSDs
must be consulted before we can successfully *peer*.
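
That inference can be sketched as follows.  This is a simplified,
hypothetical illustration: the ``acting_set`` lookup is an assumed
interface for this example, not Ceph's actual OSDMap API::

   # Hypothetical sketch: walk the OSD map history from last_epoch_clean to
   # the current epoch and collect every distinct acting set (interval) that
   # may have accepted updates.  The osd_maps interface is assumed.
   def prior_acting_sets(osd_maps, pgid, last_epoch_clean, current_epoch):
       """osd_maps[e].acting_set(pgid) is assumed to return the acting set
       of the given PG in epoch e."""
       intervals = []
       previous = None
       for epoch in range(last_epoch_clean, current_epoch + 1):
           acting = tuple(osd_maps[epoch].acting_set(pgid))
           if acting != previous:      # the acting set changed: new interval
               intervals.append((epoch, acting))
               previous = acting
       # Intervals that never reached *last epoch start* could not have
       # accepted any updates and can be ignored; at least one OSD from each
       # remaining interval must be reachable for peering to succeed.
       return intervals
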
The high level process is for the current PG primary to:

  1. get the latest OSD map (to identify the members of all of the
     interesting *acting sets*, and to confirm that we are still the
     primary).

  2. generate a list of all of the acting sets (that achieved *last
     epoch start*) since the *last epoch clean*.  We can ignore acting
     sets that did not achieve *last epoch start* because they could
     not have accepted any updates.

     Successful *peering* will require that we be able to contact at
     least one OSD from each of these *acting sets*.

  3. ask every node in that list what its first and last PG log
     entries are (which gives us a complete list of all known
     operations, and enables us to make a list of the log entries that
     each member of the current *acting set* does not have).

  4. if anyone else has (in his PG log) operations that I do not have,
     instruct them to send me the missing log entries (constructing a
     *back-log* if necessary).

  5. for each member of the current *acting set*:

     a) ask him for copies of all PG log entries since *last epoch
        start*, so that I can verify that they agree with mine (or
        know what objects I will be telling him to delete).

        If the cluster failed before an operation was persisted by all
        members of the *acting set*, and the subsequent *peering* did
        not remember that operation, and a node that did remember that
        operation later rejoined, his logs would record a different
        (divergent) history than the *authoritative history* that was
        reconstructed in the *peering* after the failure.

        Since the *divergent* events were not recorded in other logs
        from that *acting set*, they were not acknowledged to the
        client, and there is no harm in discarding them (so that all
        OSDs agree on the *authoritative history*).  But we will have
        to instruct any OSD that stores data from a divergent update
        to delete the affected (and now deemed to be apocryphal)
        objects.

     b) ask him for his *missing set* (object updates recorded in his
        PG log, but for which he does not have the new data).  This is
        the list of objects that must be fully replicated before we
        can accept writes.

  6. at this point, my PG log contains an *authoritative history* of
     the placement group (which may have involved generating a
     *back-log*), and I now have sufficient information to bring any
     other OSD in the *acting set* up to date.  I can now inform a
     monitor that I am "up through" the end of my *authoritative
     history*.

     The monitor will persist this through PAXOS, so that any future
     *peering* of this PG will note that the *acting set* for this
     interval may have made updates to the PG, and that a member of
     this *acting set* must be included in the next *peering*.

     This makes me active as the *primary* and establishes a new
     *last epoch start*.

  7. for each member of the current *acting set*:

     a) send them log updates to bring their PG logs into agreement
        with my own (the *authoritative history*) ... which may
        involve deciding to delete divergent objects.

     b) await acknowledgement that they have persisted the PG log
        entries.

  8. at this point all OSDs in the *acting set* agree on all of the
     metadata, and would (in any future *peering*) return identical
     accounts of all updates.

     a) start accepting client write operations (because we have
        unanimous agreement on the state of the objects into which
        those updates are being accepted).  Note, however, that we
        will delay any attempts to write to objects that are not yet
        fully replicated throughout the current *acting set*.

     b) start pulling object data updates that other OSDs have, but I
        do not.

     c) start pushing object data updates to other OSDs that do not
        yet have them.

        We push these updates from the primary (rather than having the
        replicas pull them) because this allows the primary to ensure
        that a replica has the current contents before sending it an
        update write.  It also makes it possible for a single read
        (from the primary) to be used to write the data to multiple
        replicas.  If each replica did its own pulls, the data might
        have to be read multiple times.

  9. once all replicas store copies of all of the objects (that
     existed prior to the start of this epoch), we can dismiss all of
     the *stray* replicas, allowing them to delete their copies of
     objects for which they are no longer in the *acting set*.

     We could not dismiss the *strays* prior to this because it was
     possible that one of those *strays* might hold the sole surviving
     copy of an old object (all of whose copies disappeared before
     they could be replicated on members of the current *acting set*).

State Model
-----------

.. graphviz:: peering_graph.generated.dot
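
The state machine diagrammed above drives the flow described in steps
1 through 9.  As a rough orientation only, that flow can be summarized
by the following sketch; every object and method name here is
hypothetical and does not correspond to the actual implementation::

   # Hypothetical sketch of the primary's peering/activation flow (steps 1-9).
   def peer_and_activate(pg, monitor):
       osdmap = pg.latest_osd_map()                      # step 1
       if not osdmap.is_primary(pg.pgid, pg.whoami):
           return                                        # no longer primary

       intervals = pg.prior_acting_sets(osdmap)          # step 2
       bounds = {osd: pg.request_log_bounds(osd)         # step 3
                 for _epoch, acting in intervals for osd in acting}

       pg.pull_missing_log_entries(bounds)               # step 4 (back-log if needed)

       for osd in pg.acting_set:                         # step 5
           pg.reconcile_log(osd)        # 5a: find and discard divergent entries
           pg.collect_missing_set(osd)  # 5b: objects still needing data

       monitor.record_up_through(pg.pgid, osdmap.epoch)  # step 6 (via PAXOS)

       for osd in pg.acting_set:                         # step 7
           pg.send_log_updates(osd)
           pg.wait_for_log_ack(osd)

       pg.accept_client_writes()                         # step 8a
       pg.start_background_recovery()                    # 8b/8c: pull, then push
       pg.dismiss_strays_when_fully_replicated()         # step 9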