author    Mark Kampe <mark.kampe@dreamhost.com>  2011-12-02 11:26:20 -0800
committer Mark Kampe <mark.kampe@dreamhost.com>  2011-12-02 11:28:38 -0800
commit    06228716e345a81ee2c93055a6a6133c540fbada (patch)
tree      3aa1d81bcaa4ab0b5dea9e3d8a5a3af795899716 /doc/dev
parent    c45a8491fefbb3442fef2b4c2ab6bbebcc5013ea (diff)
Doc: add a conceptual overview of the peering process
Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>
Diffstat (limited to 'doc/dev')
-rw-r--r--  doc/dev/peering.rst | 221
1 file changed, 220 insertions(+), 1 deletion(-)
diff --git a/doc/dev/peering.rst b/doc/dev/peering.rst
index 16fbdf5994b..60a3a5621d1 100644
--- a/doc/dev/peering.rst
+++ b/doc/dev/peering.rst
@@ -1,6 +1,225 @@
======================
Peering
======================
-Overview of peering state machine structure:
+
+Concepts
+--------
+
+*Peering*
+ the process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*acting set*
+    the ordered list of OSDs who are (or were, as of some
+    epoch) responsible for storing a particular PG.
+
+*primary*
+    the (by convention) first member of the *acting set*,
+    who is the only OSD that will accept client-initiated
+    writes to objects in a placement group.
+
+*replica*
+ a non-primary OSD in the *acting set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*stray*
+ an OSD who is not a member of the current *acting set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*recovery*
+    ensuring that copies of all of the objects in a PG
+    are on all of the OSDs in the *acting set*.  Once
+    *peering* has been performed, the primary can start
+    accepting write operations, and *recovery* can proceed
+    in the background.
+
+*PG log*
+ a list of recent updates made to objects in a PG.
+ Note that these logs can be truncated after all OSDs
+ in the *acting set* have acknowledged up to a certain
+ point.
+
+*back-log*
+    If the failure of an OSD makes it necessary to replicate
+    operations that have already been truncated from the most
+    recent PG logs, the missing information must be
+    reconstructed by walking the object space and generating
+    log entries for the operations that would create the
+    existing objects in their existing states.  While a
+    back-log may differ from the actual set of operations that
+    brought the PG to its current state, it is equivalent ...
+    and that is good enough.
+
+*missing set*
+    Each OSD notes update log entries and, if they imply
+    updates to the contents of an object, adds that object
+    to a list of needed updates.  This list is called the
+    *missing set* for that <OSD,PG> pair (see the sketch
+    following these definitions).
+
+*Authoritative History*
+    a complete and fully ordered set of operations that, if
+    performed, would bring an OSD's copy of a Placement Group
+    up to date.
+
+*epoch*
+ a (monotonically increasing) OSD map version number
+
+*last epoch start*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group agreed on an
+ *authoritative history*. At this point, *peering* is
+ deemed to have been successful.
+
+*up through*
+ when a primary successfully completes the *peering* process,
+ it informs a monitor that an *authoritative history* has
+ been established (for that PG) **up through** the current
+ epoch, and the primary is now going active.
+
+*last epoch clean*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group were completely
+ up to date (both PG logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
+
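+What follows is a purely illustrative sketch (in Python, with
+hypothetical names; Ceph's actual implementation is in C++ and far
+richer) of how a *PG log*, its truncation, and the derivation of a
+*missing set* might look:
+
+.. code-block:: python
+
+    # Illustrative only: simplified stand-ins for the concepts above.
+    from dataclasses import dataclass, field
+
+    @dataclass(frozen=True)
+    class LogEntry:
+        epoch: int     # OSD map epoch in which the update was accepted
+        version: int   # position in the PG's total order of updates
+        oid: str       # object to which the update applies
+
+    @dataclass
+    class PGLog:
+        entries: list = field(default_factory=list)   # oldest first
+
+        def trim(self, acked: int) -> None:
+            # Entries may be truncated once every OSD in the acting
+            # set has acknowledged persisting updates up to this point.
+            self.entries = [e for e in self.entries if e.version > acked]
+
+    def missing_set(log: PGLog, have: dict) -> set:
+        # An object is "missing" on an OSD if the log records an update
+        # to it newer than the version of the OSD's local copy.
+        return {e.oid for e in log.entries
+                if have.get(e.oid, 0) < e.version}
+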
+Description of the Peering Process
+----------------------------------
+
+The *Golden Rule* is that no write operation to any PG
+is acknowledged to a client until it has been persisted
+by all members of the *acting set* for that PG.  Thus, if
+we can communicate with at least one member of each
+*acting set* that has existed since the last successful
+*peering*, someone will have a record of every
+(acknowledged) operation performed since then, and it
+should be possible for the current primary to construct
+and disseminate a new *authoritative history*.
+
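+A minimal sketch of this rule, with hypothetical helper names (this
+is not the Ceph implementation): the primary acknowledges a client
+write only after *every* member of the *acting set* reports that it
+has persisted the update.
+
+.. code-block:: python
+
+    class PG:
+        def __init__(self, acting_set):
+            self.acting_set = acting_set  # OSD ids; primary is first
+            self.next_version = 0
+            self.pending = {}             # version -> OSDs not yet done
+
+        def on_client_write(self, update, send_to_osd):
+            # Record the op, then wait to hear from every member.
+            self.next_version += 1
+            version = self.next_version
+            self.pending[version] = set(self.acting_set)
+            for osd in self.acting_set:        # includes the primary
+                send_to_osd(osd, update, version)
+
+        def on_persisted(self, version, osd, ack_client):
+            self.pending[version].discard(osd)
+            if not self.pending[version]:
+                # Every acting-set member now has a durable record of
+                # this op, so any one survivor can testify to it in a
+                # future peering.  Only now may the client be ack'ed.
+                ack_client(version)
+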
+It is also important to appreciate the role of the OSD map
+(list of all known OSDs and their states, as well as some
+information about the placement groups) in the *peering*
+process:
+
+    When OSDs go up or down (or get added or removed),
+    this has the potential to affect the *acting sets*
+    of many placement groups.
+
+    When a primary successfully completes the *peering*
+    process, this too is noted in the OSD map (*last epoch start*).
+
+    Changes can only be made after successful *peering*
+    (recorded in the PAXOS stream as an "up through").
+
+Thus, if a new primary has a copy of the latest OSD map,
+it can infer which *acting sets* may have accepted updates,
+and thus which OSDs must be consulted before we can
+successfully *peer*.
+
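+As an illustration, a new primary might walk the OSD map history as
+sketched below (hypothetical map accessors, not Ceph's interfaces):
+
+.. code-block:: python
+
+    # Collect every acting set that may have accepted writes between
+    # the last epoch clean and the present.
+    def intervals_to_consult(pg, osdmaps, last_epoch_clean, cur_epoch):
+        intervals, prev = [], None
+        for epoch in range(last_epoch_clean, cur_epoch + 1):
+            acting = osdmaps[epoch].acting_set(pg)  # assumed accessor
+            if acting != prev:
+                intervals.append((epoch, acting))
+                prev = acting
+        # Intervals that never recorded an "up through" (never achieved
+        # *last epoch start*) accepted no updates and could be skipped;
+        # that refinement is omitted here for brevity.
+        return intervals
+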
+The high level process (sketched in code after this list) is for
+the current PG primary to:
+
+  1. get the latest OSD map (to identify the members of all
+     interesting *acting sets*, and to confirm that we are still
+     the primary).
+
+  2. generate a list of all of the acting sets (that achieved
+     *last epoch start*) since the *last epoch clean*.  We can
+     ignore acting sets that did not achieve *last epoch start*
+     because they could not have accepted any updates.
+
+     Successful *peering* will require that we be able to contact
+     at least one OSD from each of these *acting sets*.
+
+ 3. ask every node in that list what its first and last PG log entries are
+ (which gives us a complete list of all known operations, and enables
+ us to make a list of what log entries each member of the current
+ *acting set* does not have).
+
+  4. if anyone else has (in his PG log) operations that I do not
+     have, instruct him to send me the missing log entries
+     (constructing a *back-log* if necessary).
+
+ 5. for each member of the current *acting set*:
+
+ a) ask him for copies of all PG log entries since *last epoch start*
+ so that I can verify that they agree with mine (or know what
+ objects I will be telling him to delete).
+
+        If the cluster failed before an operation was persisted by
+        all members of the *acting set*, and the subsequent *peering*
+        did not remember that operation, and a node that did remember
+        it later rejoined, his logs would record a different
+        (divergent) history from the *authoritative history* that was
+        reconstructed in the *peering* after the failure.
+
+        Since the *divergent* events were not recorded in other logs
+        from that *acting set*, they were never acknowledged to the
+        client, and there is no harm in discarding them (so that all
+        OSDs agree on the *authoritative history*).  But we will have
+        to instruct any OSD that stores data from a divergent update
+        to delete the affected (and now deemed to be apocryphal)
+        objects; a sketch of this reconciliation follows the list.
+
+ b) ask him for his *missing set* (object updates recorded
+ in his PG log, but for which he does not have the new data).
+ This is the list of objects that must be fully replicated
+ before we can accept writes.
+
+  6. at this point, my PG log contains an *authoritative history*
+     of the placement group (which may have involved generating a
+     *back-log*), and I now have sufficient information to bring
+     any other OSD in the *acting set* up to date.  I can now
+     inform a monitor that I am "up through" the end of my
+     *authoritative history*.
+
+     The monitor will persist this through PAXOS, so that any
+     future *peering* of this PG will note that the *acting set*
+     for this interval may have made updates to the PG, and that
+     a member of this *acting set* must be included in the next
+     *peering*.
+
+     This makes me active as the *primary* and establishes a new
+     *last epoch start*.
+
+ 7. for each member of the current *acting set*:
+
+ a) send them log updates to bring their PG logs into agreement with
+ my own (*authoritative history*) ... which may involve deciding
+ to delete divergent objects.
+
+ b) await acknowledgement that they have persisted the PG log entries.
+
+ 8. at this point all OSDs in the *acting set* agree on all of the meta-data,
+ and would (in any future *peering*) return identical accounts of all
+ updates.
+
+ a) start accepting client write operations (because we have unanimous
+ agreement on the state of the objects into which those updates are
+ being accepted). Note, however, that we will delay any attempts to
+ write to objects that are not yet fully replicated throughout the
+ current *acting set*.
+
+ b) start pulling object data updates that other OSDs have, but I do not.
+
+ c) start pushing object data updates to other OSDs that do not yet have them.
+
+ We push these updates from the primary (rather than having the replicas
+ pull them) because this allows the primary to ensure that a replica has
+ the current contents before sending it an update write. It also makes
+ it possible for a single read (from the primary) to be used to write
+ the data to multiple replicas. If each replica did its own pulls,
+ the data might have to be read multiple times.
+
+  9. once all replicas store copies of all objects (that existed
+     prior to the start of this epoch), we can dismiss all of the
+     *stray* replicas, allowing them to delete their copies of
+     objects for which they are no longer in the *acting set*.
+
+ We could not dismiss the *strays* prior to this because it was possible
+ that one of those *strays* might hold the sole surviving copy of an
+ old object (all of whose copies disappeared before they could be
+ replicated on members of the current *acting set*).
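+
+The control flow of steps 1 through 8 can be condensed into a short
+sketch.  The helper names here are hypothetical (the real logic is a
+C++ state machine, summarized by the graph in the next section); this
+is only an outline:
+
+.. code-block:: python
+
+    def peer(pg):
+        osdmap = get_latest_osdmap()                  # step 1
+        if osdmap.primary(pg) != whoami():
+            return                                    # no longer primary
+        intervals = intervals_to_consult(             # step 2
+            pg, osdmap.history, pg.last_epoch_clean, osdmap.epoch)
+        bounds = {osd: query_log_bounds(osd, pg)      # step 3
+                  for _, acting in intervals for osd in acting}
+        fill_log_gaps(pg, bounds)                     # step 4 (may build
+                                                      #  a back-log)
+        for osd in pg.acting_set:                     # step 5
+            reconcile_log(pg, fetch_log_since(osd, pg.last_epoch_start))
+            pg.peer_missing[osd] = fetch_missing_set(osd)
+        report_up_through(osdmap.epoch)               # step 6 (persisted
+                                                      #  through PAXOS)
+        for osd in pg.acting_set:                     # step 7
+            push_log_updates(osd, pg.log)
+        wait_for_log_acks(pg)
+        pg.state = "active"                           # step 8: writes ok;
+                                                      #  recovery proceeds
+                                                      #  in the background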
+
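+The divergence handling of step 5a can likewise be sketched
+(hypothetical structures, not Ceph's implementation): log entries a
+peer holds that are absent from the *authoritative history* were never
+acknowledged to any client, so they are discarded and the objects they
+touched are deleted on that peer.
+
+.. code-block:: python
+
+    def find_divergent(authoritative_log, peer_log):
+        known = {(e.epoch, e.version) for e in authoritative_log}
+        divergent = [e for e in peer_log
+                     if (e.epoch, e.version) not in known]
+        # The Golden Rule guarantees divergent updates were never
+        # acknowledged, so discarding them is safe; the objects they
+        # created or modified are apocryphal and must be removed.
+        return divergent, {e.oid for e in divergent}
+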
+State Model
+-----------
.. graphviz:: peering_graph.generated.dot