author    | Loic Dachary <loic@dachary.org> | 2013-08-09 23:23:17 +0200
committer | Loic Dachary <loic@dachary.org> | 2013-08-09 23:26:49 +0200
commit    | 8bf3971b7e9669578cfae21a1738b116fab48a44 (patch)
tree      | 69153dc0e81dc8cd9db522f4bf1aa91e3e5b860e /doc
parent    | 980a07380db65f802fc4d8971a5404cc2ed0ff6e (diff)
download  | ceph-8bf3971b7e9669578cfae21a1738b116fab48a44.tar.gz
rearrange the documentation to be inserted and maintained in master
Signed-off-by: Loic Dachary <loic@dachary.org>
Diffstat (limited to 'doc')
-rw-r--r-- | doc/dev/osd_internals/erasure_coding.rst | 323
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/PGBackend-h.rst | 151
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/developer_notes.rst (renamed from doc/dev/osd_internals/erasure-code.rst) | 0
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/pgbackend.rst | 313
4 files changed, 478 insertions, 309 deletions
diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst index df21d3dccdc..deb91aca9db 100644 --- a/doc/dev/osd_internals/erasure_coding.rst +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -1,313 +1,18 @@ -=================== -PG Backend Proposal -=================== +============================== +Erasure Coded Placement Groups +============================== -See src/osd/PGBackend.h +The documentation of the erasure coding implementation in Ceph was +created in July 2013. It is included in Ceph even before erasure +coding is available because it drives a number of architectural +changes. It is meant to be updated to reflect the `progress of these +architectural changes <http://tracker.ceph.com/issues/4929>`_, up to +the point where it becomes a reference of the erasure coding +implementation itself. -Motivation ----------- +.. toctree:: + :maxdepth: 1 -The purpose of the PG Backend interface is to abstract over the -differences between replication and erasure coding as failure recovery -mechanisms. + High level design document <erasure_coding/pgbackend> + Developer notes <erasure_coding/developer_notes> -Much of the existing PG logic, particularly that for dealing with -peering, will be common to each. With both schemes, a log of recent -operations will be used to direct recovery in the event that an osd is -down or disconnected for a brief period of time. Similarly, in both -cases it will be necessary to scan a recovered copy of the PG in order -to recover an empty OSD. The PGBackend abstraction must be -sufficiently expressive for Replicated and ErasureCoded backends to be -treated uniformly in these areas. - -However, there are also crucial differences between using replication -and erasure coding which PGBackend must abstract over: - -1. The current write strategy would not ensure that a particular - object could be reconstructed after a failure. -2. Reads on an erasure coded PG require chunks to be read from the - replicas as well. -3. Object recovery probably involves recovering the primary and - replica missing copies at the same time to avoid performing extra - reads of replica shards. -4. Erasure coded PG chunks created for different acting set - positions are not interchangeable. In particular, it might make - sense for a single OSD to hold more than 1 PG copy for different - acting set positions. -5. Selection of a pgtemp for backfill may difer between replicated - and erasure coded backends. -6. The set of necessary osds from a particular interval required to - to continue peering may difer between replicated and erasure - coded backends. -7. The selection of the authoritative log may difer between replicated - and erasure coded backends. - -Client Writes -------------- - -The current PG implementation performs a write by performing the write -locally while concurrently directing replicas to perform the same -operation. Once all operations are durable, the operation is -considered durable. Because these writes may be destructive -overwrites, during peering, a log entry on a replica (or the primary) -may be found to be divergent if that replica remembers a log event -which the authoritative log does not contain. This can happen if only -1 out of 3 replicas persisted an operation, but was not available in -the next interval to provide an authoritative log. With replication, -we can repair the divergent object as long as at least 1 replica has a -current copy of the divergent object. 
With erasure coding, however, -it might be the case that neither the new version of the object nor -the old version of the object has enough available chunks to be -reconstructed. This problem is much simpler if we arrange for all -supported operations to be locally roll-back-able. - -- CEPH_OSD_OP_APPEND: We can roll back an append locally by - including the previous object size as part of the PG log event. -- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete - requires that we retain the deleted object until all replicas have - persisted the deletion event. ErasureCoded backend will therefore - need to store objects with the version at which they were created - included in the key provided to the filestore. Old versions of an - object can be pruned when all replicas have committed up to the log - event deleting the object. -- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr - to be set or removed, we can roll back these operations locally. - -Core Changes: - -- Current code should be adapted to use and rollback as appropriate - APPEND, DELETE, (SET|RM)ATTR log entries. -- The filestore needs to be able to deal with multiply versioned - hobjects. This probably means adapting the filestore internally to - use a vhobject which is basically a pair<version_t, hobject_t>. The - version needs to be included in the on-disk filename. An interface - needs to be added to get all versions of a particular hobject_t or - the most recently versioned instance of a particular hobject_t. - -PGBackend Interfaces: - -- PGBackend::perform_write() : It seems simplest to pass the actual - ops vector. The reason for providing an async, callback based - interface rather than having the PGBackend respond directly is that - we might want to use this interface for internal operations like - watch/notify expiration or snap trimming which might not necessarily - have an external client. -- PGBackend::try_rollback() : Some log entries (all of the ones valid - for the Erasure coded backend) will support local rollback. In - those cases, PGLog can avoid adding objects to the missing set when - identifying divergent objects. - -Peering and PG Logs -------------------- - -Currently, we select the log with the newest last_update and the -longest tail to be the authoritative log. This is fine because we -aren't generally able to roll operations on the other replicas forward -or backwards, instead relying on our ability to re-replicate divergent -objects. With the write approach discussed in the previous section, -however, the erasure coded backend will rely on being able to roll -back divergent operations since we may not be able to re-replicate -divergent objects. Thus, we must choose the *oldest* last_update from -the last interval which went active in order to minimize the number of -divergent objects. - -The dificulty is that the current code assumes that as long as it has -an info from at least 1 osd from the prior interval, it can complete -peering. In order to ensure that we do not end up with an -unrecoverably divergent object, an M+K erasure coded PG must hear from at -least M of the replicas of the last interval to serve writes. This ensures -that we will select a last_update old enough to roll back at least M -replicas. If a replica with an older last_update comes along later, -we will be able to provide at least M chunks of any divergent object. - -Core Changes: - -- `PG::choose_acting(), etc. 
need to be generalized to use PGBackend - <http://tracker.ceph.com/issues/5860>`_ to determine the - authoritative log. -- `PG::RecoveryState::GetInfo needs to use PGBackend - <http://tracker.ceph.com/issues/5859>`_ to determine whether it has - enough infos to continue with authoritative log selection. - -PGBackend interfaces: - -- have_enough_infos() -- choose_acting() - -PGTemp ------- - -Currently, an osd is able to request a temp acting set mapping in -order to allow an up-to-date osd to serve requests while a new primary -is backfilled (and for other reasons). An erasure coded pg needs to -be able to designate a primary for these reasons without putting it -in the first position of the acting set. It also needs to be able -to leave holes in the requested acting set. - -Core Changes: - -- OSDMap::pg_to_*_osds needs to separately return a primary. For most - cases, this can continue to be acting[0]. -- MOSDPGTemp (and related OSD structures) needs to be able to specify - a primary as well as an acting set. -- Much of the existing code base assumes that acting[0] is the primary - and that all elements of acting are valid. This needs to be cleaned - up since the acting set may contain holes. - -Client Reads ------------- - -Reads with the replicated strategy can always be satisfied -syncronously out of the primary osd. With an erasure coded strategy, -the primary will need to request data from some number of replicas in -order to satisfy a read. The perform_read() interface for PGBackend -therefore will be async. - -PGBackend interfaces: - -- perform_read(): as with perform_write() it seems simplest to pass - the ops vector. The call to oncomplete will occur once the out_bls - have been appropriately filled in. - -Distinguished acting set positions ----------------------------------- - -With the replicated strategy, all replicas of a PG are -interchangeable. With erasure coding, different positions in the -acting set have different pieces of the erasure coding scheme and are -not interchangeable. Worse, crush might cause chunk 2 to be written -to an osd which happens already to contain an (old) copy of chunk 4. -This means that the OSD and PG messages need to work in terms of a -type like pair<chunk_id_t, pg_t> in order to distinguish different pg -chunks on a single OSD. - -Because the mapping of object name to object in the filestore must -be 1-to-1, we must ensure that the objects in chunk 2 and the objects -in chunk 4 have different names. To that end, the filestore must -include the chunk id in the object key. - -Core changes: - -- The filestore `vhobject_t needs to also include a chunk id - <http://tracker.ceph.com/issues/5862>`_ making it more like - tuple<hobject_t, version_t, chunk_id_t>. -- coll_t needs to include a chunk_id_t. -- The `OSD pg_map and similar pg mappings need to work in terms of a - cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially - pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include - a chunk_id_t -- For client->PG messages, the OSD will need a way to know which PG - chunk should get the message since the OSD may contain both a - primary and non-primary chunk for the same pg - -Object Classes --------------- - -We probably won't support object classes at first on Erasure coded -backends. - -Scrub ------ - -We currently have two scrub modes with different default frequencies: - -1. [shallow] scrub: compares the set of objects and metadata, but not - the contents -2. 
deep scrub: compares the set of objects, metadata, and a crc32 of - the object contents (including omap) - -The primary requests a scrubmap from each replica for a particular -range of objects. The replica fills out this scrubmap for the range -of objects including, if the scrub is deep, a crc32 of the contents of -each object. The primary gathers these scrubmaps from each replica -and performs a comparison identifying inconsistent objects. - -Most of this can work essentially unchanged with erasure coded PG with -the caveat that the PGBackend implementation must be in charge of -actually doing the scan, and that the PGBackend implementation should -be able to attach arbitrary information to allow PGBackend on the -primary to scrub PGBackend specific metadata. - -The main catch, however, for erasure coded PG is that sending a crc32 -of the stored chunk on a replica isn't particularly helpful since the -chunks on different replicas presumably store different data. Because -we don't support overwrites except via DELETE, however, we have the -option of maintaining a crc32 on each chunk through each append. -Thus, each replica instead simply computes a crc32 of its own stored -chunk and compares it with the locally stored checksum. The replica -then reports to the primary whether the checksums match. - -`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_: - -- scan() -- scrub() -- compare_scrub_maps() - -Crush ------ - -If crush is unable to generate a replacement for a down member of an -acting set, the acting set should have a hole at that position rather -than shifting the other elements of the acting set out of position. - -Core changes: - -- Ensure that crush behaves as above for INDEP. - -`Recovery <http://tracker.ceph.com/issues/5857>`_ --------- - -The logic for recovering an object depends on the backend. With -the current replicated strategy, we first pull the object replica -to the primary and then concurrently push it out to the replicas. -With the erasure coded strategy, we probably want to read the -minimum number of replica chunks required to reconstruct the object -and push out the replacement chunks concurrently. - -Another difference is that objects in erasure coded pg may be -unrecoverable without being unfound. The "unfound" concept -should probably then be renamed to unrecoverable. Also, the -PGBackend impementation will have to be able to direct the search -for pg replicas with unrecoverable object chunks and to be able -to determine whether a particular object is recoverable. - -Core changes: - -- s/unfound/unrecoverable - -PGBackend interfaces: - -- might_have_unrecoverable() -- recoverable() -- recover_object() - -`Backfill <http://tracker.ceph.com/issues/5856>`_ --------- - -For the most part, backfill itself should behave similarly between -replicated and erasure coded pools with a few exceptions: - -1. We probably want to be able to backfill multiple osds concurrently - with an erasure coded pool in order to cut down on the read - overhead. -2. We probably want to avoid having to place the backfill peers in the - acting set for an erasure coded pg because we might have a good - temporary pg chunk for that acting set slot. - -For 2, we don't really need to place the backfill peer in the acting -set for replicated PGs anyway. -For 1, PGBackend::choose_backfill() should determine which osds are -backfilled in a particular interval. 
- -Core changes: - -- Backfill should be capable of `handling multiple backfill peers - concurrently <http://tracker.ceph.com/issues/5858>`_ even for - replicated pgs (easier to test for now) -- `Backfill peers should not be placed in the acting set - <http://tracker.ceph.com/issues/5855>`_. - -PGBackend interfaces: - -- choose_backfill(): allows the implementation to determine which osds - should be backfilled in a particular interval. diff --git a/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst new file mode 100644 index 00000000000..7e1998382a0 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst @@ -0,0 +1,151 @@ +PGBackend.h +:: + /** + * PGBackend + * + * PGBackend defines an interface for logic handling IO and + * replication on RADOS objects. The PGBackend implementation + * is responsible for: + * + * 1) Handling client operations + * 2) Handling object recovery + * 3) Handling object access + */ + class PGBackend { + public: + /// IO + + /// Perform write + int perform_write( + const vector<OSDOp> &ops, ///< [in] ops to perform + Context *onreadable, ///< [in] called when readable on all reaplicas + Context *onreadable, ///< [in] called when durable on all replicas + ) = 0; ///< @return 0 or error + + /// Attempt to roll back a log entry + int try_rollback( + const pg_log_entry_t &entry, ///< [in] entry to roll back + ObjectStore::Transaction *t ///< [out] transaction + ) = 0; ///< @return 0 on success, -EINVAL if it can't be rolled back + + /// Perform async read, oncomplete is called when ops out_bls are filled in + int perform_read( + vector<OSDOp> &ops, ///< [in, out] ops + Context *oncomplete ///< [out] called with r code + ) = 0; ///< @return 0 or error + + /// Peering + + /** + * have_enough_infos + * + * Allows PGBackend implementation to ensure that enough peers have + * been contacted to satisfy its requirements. + * + * TODO: this interface should yield diagnostic info about which infos + * are required + */ + bool have_enough_infos( + const map<epoch_t, pg_interval_t> &past_intervals, ///< [in] intervals + const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos + ) = 0; ///< @return true if we can continue peering + + /** + * choose_acting + * + * Allows PGBackend implementation to select the acting set based on the + * received infos + * + * @return False if the current acting set is inadequate, *req_acting will + * be filled in with the requested new acting set. True if the + * current acting set is adequate, *auth_log will be filled in + * with the correct location of the authoritative log. 
+ */ + bool choose_acting( + const map<int, pg_info_t> &peer_infos, ///< [in] received infos + int *auth_log, ///< [out] osd with auth log + vector<int> *req_acting ///< [out] requested acting set + ) = 0; + + /// Scrub + + /// scan + int scan( + const hobject_t &start, ///< [in] scan objects >= start + const hobject_t &up_to, ///< [in] scan objects < up_to + vector<hobject_t> *out ///< [out] objects returned + ) = 0; ///< @return 0 or error + + /// stat (TODO: ScrubMap::object needs to have PGBackend specific metadata) + int scrub( + const hobject_t &to_stat, ///< [in] object to stat + bool deep, ///< [in] true if deep scrub + ScrubMap::object *o ///< [out] result + ) = 0; ///< @return 0 or error + + /** + * compare_scrub_maps + * + * @param inconsistent [out] map of inconsistent pgs to pair<correct, incorrect> + * @param errstr [out] stream of text about inconsistencies for user + * perusal + * + * TODO: this interface doesn't actually make sense... + */ + void compare_scrub_maps( + const map<int, ScrubMap> &maps, ///< [in] maps to compare + bool deep, ///< [in] true if scrub is deep + map<hobject_t, pair<set<int>, set<int> > > *inconsistent, + std:ostream *errstr + ) = 0; + + /// Recovery + + /** + * might_have_unrecoverable + * + * @param missing [in] missing,info gathered so far (must include acting) + * @param intervals [in] past intervals + * @param should_query [out] pair<int, cpg_t> shards to query + */ + void might_have_unrecoverable( + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing, + const map<epoch_t, pg_interval_t> &past_intervals, + set<pair<int, cpg_t> > *should_query + ) = 0; + + /** + * might_have_unfound + * + * @param missing [in] missing,info gathered so far (must include acting) + */ + bool recoverable( + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing, + const hobject_t &hoid ///< [in] object to check + ) = 0; ///< @return true if object can be recovered given missing + + /** + * recover_object + * + * Triggers a recovery operation on the specified hobject_t + * onreadable must be called before onwriteable + * + * @param missing [in] set of info, missing pairs for queried nodes + */ + void recover_object( + const hobject_t &hoid, ///< [in] object to recover + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing + Context *onreadable, ///< [in] called when object can be read + Context *onwriteable ///< [in] called when object can be written + ) = 0; + + /// Backfill + + /// choose_backfill + void choose_backfill( + const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos + const vector<int> &acting, ///< [in] acting set + const vector<int> &up, ///< [in] up set + set<int> *to_backfill ///< [out] osds to backfill + ) = 0; + }; diff --git a/doc/dev/osd_internals/erasure-code.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst index 40616ae271c..40616ae271c 100644 --- a/doc/dev/osd_internals/erasure-code.rst +++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst diff --git a/doc/dev/osd_internals/erasure_coding/pgbackend.rst b/doc/dev/osd_internals/erasure_coding/pgbackend.rst new file mode 100644 index 00000000000..662351e9d77 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/pgbackend.rst @@ -0,0 +1,313 @@ +=================== +PG Backend Proposal +=================== + +See also `PGBackend.h <PGBackend>`_ + +Motivation +---------- + +The purpose of the PG Backend interface is to abstract over the +differences between replication and erasure coding as failure 
recovery +mechanisms. + +Much of the existing PG logic, particularly that for dealing with +peering, will be common to each. With both schemes, a log of recent +operations will be used to direct recovery in the event that an osd is +down or disconnected for a brief period of time. Similarly, in both +cases it will be necessary to scan a recovered copy of the PG in order +to recover an empty OSD. The PGBackend abstraction must be +sufficiently expressive for Replicated and ErasureCoded backends to be +treated uniformly in these areas. + +However, there are also crucial differences between using replication +and erasure coding which PGBackend must abstract over: + +1. The current write strategy would not ensure that a particular + object could be reconstructed after a failure. +2. Reads on an erasure coded PG require chunks to be read from the + replicas as well. +3. Object recovery probably involves recovering the primary and + replica missing copies at the same time to avoid performing extra + reads of replica shards. +4. Erasure coded PG chunks created for different acting set + positions are not interchangeable. In particular, it might make + sense for a single OSD to hold more than 1 PG copy for different + acting set positions. +5. Selection of a pgtemp for backfill may difer between replicated + and erasure coded backends. +6. The set of necessary osds from a particular interval required to + to continue peering may difer between replicated and erasure + coded backends. +7. The selection of the authoritative log may difer between replicated + and erasure coded backends. + +Client Writes +------------- + +The current PG implementation performs a write by performing the write +locally while concurrently directing replicas to perform the same +operation. Once all operations are durable, the operation is +considered durable. Because these writes may be destructive +overwrites, during peering, a log entry on a replica (or the primary) +may be found to be divergent if that replica remembers a log event +which the authoritative log does not contain. This can happen if only +1 out of 3 replicas persisted an operation, but was not available in +the next interval to provide an authoritative log. With replication, +we can repair the divergent object as long as at least 1 replica has a +current copy of the divergent object. With erasure coding, however, +it might be the case that neither the new version of the object nor +the old version of the object has enough available chunks to be +reconstructed. This problem is much simpler if we arrange for all +supported operations to be locally roll-back-able. + +- CEPH_OSD_OP_APPEND: We can roll back an append locally by + including the previous object size as part of the PG log event. +- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete + requires that we retain the deleted object until all replicas have + persisted the deletion event. ErasureCoded backend will therefore + need to store objects with the version at which they were created + included in the key provided to the filestore. Old versions of an + object can be pruned when all replicas have committed up to the log + event deleting the object. +- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr + to be set or removed, we can roll back these operations locally. + +Core Changes: + +- Current code should be adapted to use and rollback as appropriate + APPEND, DELETE, (SET|RM)ATTR log entries. +- The filestore needs to be able to deal with multiply versioned + hobjects. 
This probably means adapting the filestore internally to + use a vhobject which is basically a pair<version_t, hobject_t>. The + version needs to be included in the on-disk filename. An interface + needs to be added to get all versions of a particular hobject_t or + the most recently versioned instance of a particular hobject_t. + +PGBackend Interfaces: + +- PGBackend::perform_write() : It seems simplest to pass the actual + ops vector. The reason for providing an async, callback based + interface rather than having the PGBackend respond directly is that + we might want to use this interface for internal operations like + watch/notify expiration or snap trimming which might not necessarily + have an external client. +- PGBackend::try_rollback() : Some log entries (all of the ones valid + for the Erasure coded backend) will support local rollback. In + those cases, PGLog can avoid adding objects to the missing set when + identifying divergent objects. + +Peering and PG Logs +------------------- + +Currently, we select the log with the newest last_update and the +longest tail to be the authoritative log. This is fine because we +aren't generally able to roll operations on the other replicas forward +or backwards, instead relying on our ability to re-replicate divergent +objects. With the write approach discussed in the previous section, +however, the erasure coded backend will rely on being able to roll +back divergent operations since we may not be able to re-replicate +divergent objects. Thus, we must choose the *oldest* last_update from +the last interval which went active in order to minimize the number of +divergent objects. + +The dificulty is that the current code assumes that as long as it has +an info from at least 1 osd from the prior interval, it can complete +peering. In order to ensure that we do not end up with an +unrecoverably divergent object, an M+K erasure coded PG must hear from at +least M of the replicas of the last interval to serve writes. This ensures +that we will select a last_update old enough to roll back at least M +replicas. If a replica with an older last_update comes along later, +we will be able to provide at least M chunks of any divergent object. + +Core Changes: + +- `PG::choose_acting(), etc. need to be generalized to use PGBackend + <http://tracker.ceph.com/issues/5860>`_ to determine the + authoritative log. +- `PG::RecoveryState::GetInfo needs to use PGBackend + <http://tracker.ceph.com/issues/5859>`_ to determine whether it has + enough infos to continue with authoritative log selection. + +PGBackend interfaces: + +- have_enough_infos() +- choose_acting() + +PGTemp +------ + +Currently, an osd is able to request a temp acting set mapping in +order to allow an up-to-date osd to serve requests while a new primary +is backfilled (and for other reasons). An erasure coded pg needs to +be able to designate a primary for these reasons without putting it +in the first position of the acting set. It also needs to be able +to leave holes in the requested acting set. + +Core Changes: + +- OSDMap::pg_to_*_osds needs to separately return a primary. For most + cases, this can continue to be acting[0]. +- MOSDPGTemp (and related OSD structures) needs to be able to specify + a primary as well as an acting set. +- Much of the existing code base assumes that acting[0] is the primary + and that all elements of acting are valid. This needs to be cleaned + up since the acting set may contain holes. 
+ +Client Reads +------------ + +Reads with the replicated strategy can always be satisfied +syncronously out of the primary osd. With an erasure coded strategy, +the primary will need to request data from some number of replicas in +order to satisfy a read. The perform_read() interface for PGBackend +therefore will be async. + +PGBackend interfaces: + +- perform_read(): as with perform_write() it seems simplest to pass + the ops vector. The call to oncomplete will occur once the out_bls + have been appropriately filled in. + +Distinguished acting set positions +---------------------------------- + +With the replicated strategy, all replicas of a PG are +interchangeable. With erasure coding, different positions in the +acting set have different pieces of the erasure coding scheme and are +not interchangeable. Worse, crush might cause chunk 2 to be written +to an osd which happens already to contain an (old) copy of chunk 4. +This means that the OSD and PG messages need to work in terms of a +type like pair<chunk_id_t, pg_t> in order to distinguish different pg +chunks on a single OSD. + +Because the mapping of object name to object in the filestore must +be 1-to-1, we must ensure that the objects in chunk 2 and the objects +in chunk 4 have different names. To that end, the filestore must +include the chunk id in the object key. + +Core changes: + +- The filestore `vhobject_t needs to also include a chunk id + <http://tracker.ceph.com/issues/5862>`_ making it more like + tuple<hobject_t, version_t, chunk_id_t>. +- coll_t needs to include a chunk_id_t. +- The `OSD pg_map and similar pg mappings need to work in terms of a + cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially + pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include + a chunk_id_t +- For client->PG messages, the OSD will need a way to know which PG + chunk should get the message since the OSD may contain both a + primary and non-primary chunk for the same pg + +Object Classes +-------------- + +We probably won't support object classes at first on Erasure coded +backends. + +Scrub +----- + +We currently have two scrub modes with different default frequencies: + +1. [shallow] scrub: compares the set of objects and metadata, but not + the contents +2. deep scrub: compares the set of objects, metadata, and a crc32 of + the object contents (including omap) + +The primary requests a scrubmap from each replica for a particular +range of objects. The replica fills out this scrubmap for the range +of objects including, if the scrub is deep, a crc32 of the contents of +each object. The primary gathers these scrubmaps from each replica +and performs a comparison identifying inconsistent objects. + +Most of this can work essentially unchanged with erasure coded PG with +the caveat that the PGBackend implementation must be in charge of +actually doing the scan, and that the PGBackend implementation should +be able to attach arbitrary information to allow PGBackend on the +primary to scrub PGBackend specific metadata. + +The main catch, however, for erasure coded PG is that sending a crc32 +of the stored chunk on a replica isn't particularly helpful since the +chunks on different replicas presumably store different data. Because +we don't support overwrites except via DELETE, however, we have the +option of maintaining a crc32 on each chunk through each append. +Thus, each replica instead simply computes a crc32 of its own stored +chunk and compares it with the locally stored checksum. 
The replica +then reports to the primary whether the checksums match. + +`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_: + +- scan() +- scrub() +- compare_scrub_maps() + +Crush +----- + +If crush is unable to generate a replacement for a down member of an +acting set, the acting set should have a hole at that position rather +than shifting the other elements of the acting set out of position. + +Core changes: + +- Ensure that crush behaves as above for INDEP. + +`Recovery <http://tracker.ceph.com/issues/5857>`_ +-------- + +The logic for recovering an object depends on the backend. With +the current replicated strategy, we first pull the object replica +to the primary and then concurrently push it out to the replicas. +With the erasure coded strategy, we probably want to read the +minimum number of replica chunks required to reconstruct the object +and push out the replacement chunks concurrently. + +Another difference is that objects in erasure coded pg may be +unrecoverable without being unfound. The "unfound" concept +should probably then be renamed to unrecoverable. Also, the +PGBackend impementation will have to be able to direct the search +for pg replicas with unrecoverable object chunks and to be able +to determine whether a particular object is recoverable. + +Core changes: + +- s/unfound/unrecoverable + +PGBackend interfaces: + +- might_have_unrecoverable() +- recoverable() +- recover_object() + +`Backfill <http://tracker.ceph.com/issues/5856>`_ +-------- + +For the most part, backfill itself should behave similarly between +replicated and erasure coded pools with a few exceptions: + +1. We probably want to be able to backfill multiple osds concurrently + with an erasure coded pool in order to cut down on the read + overhead. +2. We probably want to avoid having to place the backfill peers in the + acting set for an erasure coded pg because we might have a good + temporary pg chunk for that acting set slot. + +For 2, we don't really need to place the backfill peer in the acting +set for replicated PGs anyway. +For 1, PGBackend::choose_backfill() should determine which osds are +backfilled in a particular interval. + +Core changes: + +- Backfill should be capable of `handling multiple backfill peers + concurrently <http://tracker.ceph.com/issues/5858>`_ even for + replicated pgs (easier to test for now) +- `Backfill peers should not be placed in the acting set + <http://tracker.ceph.com/issues/5855>`_. + +PGBackend interfaces: + +- choose_backfill(): allows the implementation to determine which osds + should be backfilled in a particular interval. |
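
The Client Writes section of the proposal (present both in the removed erasure_coding.rst and in the re-added pgbackend.rst) rests on making every supported operation locally roll-back-able, for example undoing an APPEND by recording the previous object size in the PG log event. The following C++ sketch only illustrates that idea; the names (LogEntry, Object, roll_back) are invented for this example and are not the PGBackend.h interface shown in the diff::

    // Hypothetical sketch of locally roll-back-able PG log entries, as argued
    // in the "Client Writes" section: each entry carries enough prior state to
    // be undone on a single shard without reading data from other shards.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>

    enum class Op { Append, Delete, SetAttr, RmAttr };

    struct LogEntry {
      Op op;
      uint64_t prev_size = 0;                // Append: object size before the append
      std::string attr_name;                 // (Set|Rm)Attr: which attr was touched
      std::optional<std::string> prev_attr;  // (Set|Rm)Attr: prior value, if any
    };

    struct Object {
      uint64_t size = 0;
      bool deleted = false;                  // Delete keeps the old version on disk
      std::map<std::string, std::string> attrs;
    };

    // Undo one divergent log entry locally.
    void roll_back(Object &obj, const LogEntry &e) {
      switch (e.op) {
      case Op::Append:
        obj.size = e.prev_size;              // truncate back to the pre-append size
        break;
      case Op::Delete:
        obj.deleted = false;                 // resurrect the retained old version
        break;
      case Op::SetAttr:
      case Op::RmAttr:
        if (e.prev_attr)
          obj.attrs[e.attr_name] = *e.prev_attr;  // restore the prior value
        else
          obj.attrs.erase(e.attr_name);           // attr did not exist before
        break;
      }
    }

    int main() {
      Object o;
      o.size = 4096;
      LogEntry append{Op::Append, /*prev_size=*/4096};
      o.size += 512;                          // apply the append
      roll_back(o, append);                   // found divergent during peering
      std::cout << "size after rollback: " << o.size << "\n";  // 4096
    }

Rolling back a DELETE the same way only works because the proposal keys objects in the filestore by the version at which they were created, so the old version is still on disk until all shards have logged the deletion.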
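
The Peering and PG Logs section reverses the usual rule for selecting the authoritative log: an M+K erasure coded PG must hear from at least M shards of the last active interval and then take the oldest last_update among them, so that divergent entries on the remaining shards can still be rolled back. A rough sketch of that selection follows; Version and choose_auth_last_update are invented stand-ins for Ceph's version type and peering machinery::

    // Illustration only: pick the authoritative last_update for an M+K
    // erasure coded PG.  Unlike replication (newest last_update, longest
    // tail), the proposal picks the *oldest* last_update seen, and refuses
    // to proceed until at least m shards from the prior interval reported.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <optional>

    struct Version {
      uint64_t epoch = 0;
      uint64_t v = 0;
    };

    bool operator<(const Version &a, const Version &b) {
      return a.epoch != b.epoch ? a.epoch < b.epoch : a.v < b.v;
    }

    std::optional<Version> choose_auth_last_update(
        const std::map<int, Version> &peer_last_update,  // osd id -> last_update
        unsigned m) {                                    // data chunks (M of M+K)
      if (peer_last_update.size() < m)
        return std::nullopt;                 // not enough infos: keep peering
      Version oldest = peer_last_update.begin()->second;
      for (const auto &kv : peer_last_update)
        oldest = std::min(oldest, kv.second);
      return oldest;                         // anything newer gets rolled back
    }

    int main() {
      std::map<int, Version> infos{
          {0, Version{10, 5}}, {3, Version{10, 7}}, {5, Version{10, 4}}};
      auto auth = choose_auth_last_update(infos, /*m=*/3);
      return auth ? 0 : 1;                   // here: oldest is 10'4 from osd 5
    }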
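
The Scrub section observes that, since erasure coded chunks are only ever extended by append, each shard can keep a running crc32 that is updated on every append and then checked against the locally stored chunk during deep scrub, so no chunk checksums need to be compared across shards. A minimal sketch of that bookkeeping using zlib's streaming crc32; ChunkShard and its members are invented for illustration::

    // Append-only chunk checksum maintenance, as described in the "Scrub"
    // section.  Build with -lz (zlib).
    #include <zlib.h>
    #include <cstdint>
    #include <iostream>
    #include <string>

    struct ChunkShard {
      std::string data;                              // chunk bytes stored on this shard
      uint32_t stored_crc = crc32(0L, nullptr, 0);   // running checksum kept alongside

      // Each append extends both the data and the running crc32; the chunk
      // never has to be re-read to keep the checksum current.
      void append(const std::string &buf) {
        stored_crc = crc32(stored_crc,
                           reinterpret_cast<const unsigned char *>(buf.data()),
                           buf.size());
        data += buf;
      }

      // Deep scrub on a shard: recompute the crc32 of the locally stored
      // chunk and report to the primary only whether it matches.
      bool deep_scrub_ok() const {
        uint32_t actual = crc32(0L, nullptr, 0);
        actual = crc32(actual,
                       reinterpret_cast<const unsigned char *>(data.data()),
                       data.size());
        return actual == stored_crc;
      }
    };

    int main() {
      ChunkShard shard;
      shard.append("hello ");
      shard.append("world");
      std::cout << (shard.deep_scrub_ok() ? "clean" : "inconsistent") << "\n";
      shard.data[0] = 'X';                           // simulate on-disk bit rot
      std::cout << (shard.deep_scrub_ok() ? "clean" : "inconsistent") << "\n";
    }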