author    Loic Dachary <loic@dachary.org>  2013-08-09 23:23:17 +0200
committer Loic Dachary <loic@dachary.org>  2013-08-09 23:26:49 +0200
commit    8bf3971b7e9669578cfae21a1738b116fab48a44 (patch)
tree      69153dc0e81dc8cd9db522f4bf1aa91e3e5b860e
parent    980a07380db65f802fc4d8971a5404cc2ed0ff6e (diff)
rearrange the documentation to be inserted and maintained in master
Signed-off-by: Loic Dachary <loic@dachary.org>
-rw-r--r--  doc/dev/osd_internals/erasure_coding.rst                                   323
-rw-r--r--  doc/dev/osd_internals/erasure_coding/PGBackend-h.rst                       151
-rw-r--r--  doc/dev/osd_internals/erasure_coding/developer_notes.rst (renamed from doc/dev/osd_internals/erasure-code.rst)  0
-rw-r--r--  doc/dev/osd_internals/erasure_coding/pgbackend.rst                         313
-rw-r--r--  src/osd/PGBackend.h                                                        169
5 files changed, 478 insertions, 478 deletions
diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst
index df21d3dccdc..deb91aca9db 100644
--- a/doc/dev/osd_internals/erasure_coding.rst
+++ b/doc/dev/osd_internals/erasure_coding.rst
@@ -1,313 +1,18 @@
-===================
-PG Backend Proposal
-===================
+==============================
+Erasure Coded Placement Groups
+==============================
-See src/osd/PGBackend.h
+The documentation of the erasure coding implementation in Ceph was
+created in July 2013. It is included in Ceph even before erasure
+coding is available because it drives a number of architectural
+changes. It is meant to be updated to reflect the `progress of these
+architectural changes <http://tracker.ceph.com/issues/4929>`_, up to
+the point where it becomes a reference for the erasure coding
+implementation itself.
-Motivation
-----------
+.. toctree::
+ :maxdepth: 1
-The purpose of the PG Backend interface is to abstract over the
-differences between replication and erasure coding as failure recovery
-mechanisms.
+ High level design document <erasure_coding/pgbackend>
+ Developer notes <erasure_coding/developer_notes>
-Much of the existing PG logic, particularly that for dealing with
-peering, will be common to each. With both schemes, a log of recent
-operations will be used to direct recovery in the event that an osd is
-down or disconnected for a brief period of time. Similarly, in both
-cases it will be necessary to scan a recovered copy of the PG in order
-to recover an empty OSD. The PGBackend abstraction must be
-sufficiently expressive for Replicated and ErasureCoded backends to be
-treated uniformly in these areas.
-
-However, there are also crucial differences between using replication
-and erasure coding which PGBackend must abstract over:
-
-1. The current write strategy would not ensure that a particular
- object could be reconstructed after a failure.
-2. Reads on an erasure coded PG require chunks to be read from the
- replicas as well.
-3. Object recovery probably involves recovering the primary and
- replica missing copies at the same time to avoid performing extra
- reads of replica shards.
-4. Erasure coded PG chunks created for different acting set
- positions are not interchangeable. In particular, it might make
- sense for a single OSD to hold more than 1 PG copy for different
- acting set positions.
-5. Selection of a pgtemp for backfill may differ between replicated
-   and erasure coded backends.
-6. The set of necessary osds from a particular interval required to
-   continue peering may differ between replicated and erasure
-   coded backends.
-7. The selection of the authoritative log may differ between replicated
-   and erasure coded backends.
-
-Client Writes
--------------
-
-The current PG implementation performs a write by performing the write
-locally while concurrently directing replicas to perform the same
-operation. Once all operations are durable, the operation is
-considered durable. Because these writes may be destructive
-overwrites, during peering, a log entry on a replica (or the primary)
-may be found to be divergent if that replica remembers a log event
-which the authoritative log does not contain. This can happen if only
-1 out of 3 replicas persisted an operation, but was not available in
-the next interval to provide an authoritative log. With replication,
-we can repair the divergent object as long as at least 1 replica has a
-current copy of the divergent object. With erasure coding, however,
-it might be the case that neither the new version of the object nor
-the old version of the object has enough available chunks to be
-reconstructed. This problem is much simpler if we arrange for all
-supported operations to be locally roll-back-able.
-
-- CEPH_OSD_OP_APPEND: We can roll back an append locally by
- including the previous object size as part of the PG log event.
-- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
- requires that we retain the deleted object until all replicas have
- persisted the deletion event. ErasureCoded backend will therefore
- need to store objects with the version at which they were created
- included in the key provided to the filestore. Old versions of an
- object can be pruned when all replicas have committed up to the log
- event deleting the object.
-- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
- to be set or removed, we can roll back these operations locally.
-
-Core Changes:
-
-- Current code should be adapted to use, and roll back as appropriate,
-  APPEND, DELETE, and (SET|RM)ATTR log entries.
-- The filestore needs to be able to deal with multiply versioned
- hobjects. This probably means adapting the filestore internally to
- use a vhobject which is basically a pair<version_t, hobject_t>. The
- version needs to be included in the on-disk filename. An interface
- needs to be added to get all versions of a particular hobject_t or
- the most recently versioned instance of a particular hobject_t.
-
-PGBackend Interfaces:
-
-- PGBackend::perform_write() : It seems simplest to pass the actual
- ops vector. The reason for providing an async, callback based
- interface rather than having the PGBackend respond directly is that
- we might want to use this interface for internal operations like
- watch/notify expiration or snap trimming which might not necessarily
- have an external client.
-- PGBackend::try_rollback() : Some log entries (all of the ones valid
- for the Erasure coded backend) will support local rollback. In
- those cases, PGLog can avoid adding objects to the missing set when
- identifying divergent objects.
-
-Peering and PG Logs
--------------------
-
-Currently, we select the log with the newest last_update and the
-longest tail to be the authoritative log. This is fine because we
-aren't generally able to roll operations on the other replicas forward
-or backwards, instead relying on our ability to re-replicate divergent
-objects. With the write approach discussed in the previous section,
-however, the erasure coded backend will rely on being able to roll
-back divergent operations since we may not be able to re-replicate
-divergent objects. Thus, we must choose the *oldest* last_update from
-the last interval which went active in order to minimize the number of
-divergent objects.
-
-The difficulty is that the current code assumes that as long as it has
-an info from at least 1 osd from the prior interval, it can complete
-peering. In order to ensure that we do not end up with an
-unrecoverably divergent object, an M+K erasure coded PG must hear from at
-least M of the replicas of the last interval to serve writes. This ensures
-that we will select a last_update old enough to roll back at least M
-replicas. If a replica with an older last_update comes along later,
-we will be able to provide at least M chunks of any divergent object.
-
-Core Changes:
-
-- `PG::choose_acting(), etc. need to be generalized to use PGBackend
- <http://tracker.ceph.com/issues/5860>`_ to determine the
- authoritative log.
-- `PG::RecoveryState::GetInfo needs to use PGBackend
- <http://tracker.ceph.com/issues/5859>`_ to determine whether it has
- enough infos to continue with authoritative log selection.
-
-PGBackend interfaces:
-
-- have_enough_infos()
-- choose_acting()
-
-PGTemp
-------
-
-Currently, an osd is able to request a temp acting set mapping in
-order to allow an up-to-date osd to serve requests while a new primary
-is backfilled (and for other reasons). An erasure coded pg needs to
-be able to designate a primary for these reasons without putting it
-in the first position of the acting set. It also needs to be able
-to leave holes in the requested acting set.
-
-Core Changes:
-
-- OSDMap::pg_to_*_osds needs to separately return a primary. For most
- cases, this can continue to be acting[0].
-- MOSDPGTemp (and related OSD structures) needs to be able to specify
- a primary as well as an acting set.
-- Much of the existing code base assumes that acting[0] is the primary
- and that all elements of acting are valid. This needs to be cleaned
- up since the acting set may contain holes.
-
-Client Reads
-------------
-
-Reads with the replicated strategy can always be satisfied
-synchronously out of the primary osd. With an erasure coded strategy,
-the primary will need to request data from some number of replicas in
-order to satisfy a read. The perform_read() interface for PGBackend
-therefore will be async.
-
-PGBackend interfaces:
-
-- perform_read(): as with perform_write() it seems simplest to pass
- the ops vector. The call to oncomplete will occur once the out_bls
- have been appropriately filled in.
-
-Distinguished acting set positions
-----------------------------------
-
-With the replicated strategy, all replicas of a PG are
-interchangeable. With erasure coding, different positions in the
-acting set have different pieces of the erasure coding scheme and are
-not interchangeable. Worse, crush might cause chunk 2 to be written
-to an osd which happens already to contain an (old) copy of chunk 4.
-This means that the OSD and PG messages need to work in terms of a
-type like pair<chunk_id_t, pg_t> in order to distinguish different pg
-chunks on a single OSD.
-
-Because the mapping of object name to object in the filestore must
-be 1-to-1, we must ensure that the objects in chunk 2 and the objects
-in chunk 4 have different names. To that end, the filestore must
-include the chunk id in the object key.
-
-Core changes:
-
-- The filestore `vhobject_t needs to also include a chunk id
- <http://tracker.ceph.com/issues/5862>`_ making it more like
- tuple<hobject_t, version_t, chunk_id_t>.
-- coll_t needs to include a chunk_id_t.
-- The `OSD pg_map and similar pg mappings need to work in terms of a
- cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially
- pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include
- a chunk_id_t
-- For client->PG messages, the OSD will need a way to know which PG
- chunk should get the message since the OSD may contain both a
- primary and non-primary chunk for the same pg
-
-Object Classes
---------------
-
-We probably won't support object classes at first on Erasure coded
-backends.
-
-Scrub
------
-
-We currently have two scrub modes with different default frequencies:
-
-1. [shallow] scrub: compares the set of objects and metadata, but not
- the contents
-2. deep scrub: compares the set of objects, metadata, and a crc32 of
- the object contents (including omap)
-
-The primary requests a scrubmap from each replica for a particular
-range of objects. The replica fills out this scrubmap for the range
-of objects including, if the scrub is deep, a crc32 of the contents of
-each object. The primary gathers these scrubmaps from each replica
-and performs a comparison identifying inconsistent objects.
-
-Most of this can work essentially unchanged with erasure coded PG with
-the caveat that the PGBackend implementation must be in charge of
-actually doing the scan, and that the PGBackend implementation should
-be able to attach arbitrary information to allow PGBackend on the
-primary to scrub PGBackend specific metadata.
-
-The main catch, however, for erasure coded PG is that sending a crc32
-of the stored chunk on a replica isn't particularly helpful since the
-chunks on different replicas presumably store different data. Because
-we don't support overwrites except via DELETE, however, we have the
-option of maintaining a crc32 on each chunk through each append.
-Thus, each replica instead simply computes a crc32 of its own stored
-chunk and compares it with the locally stored checksum. The replica
-then reports to the primary whether the checksums match.
-
-`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_:
-
-- scan()
-- scrub()
-- compare_scrub_maps()
-
-Crush
------
-
-If crush is unable to generate a replacement for a down member of an
-acting set, the acting set should have a hole at that position rather
-than shifting the other elements of the acting set out of position.
-
-Core changes:
-
-- Ensure that crush behaves as above for INDEP.
-
-`Recovery <http://tracker.ceph.com/issues/5857>`_
--------------------------------------------------
-
-The logic for recovering an object depends on the backend. With
-the current replicated strategy, we first pull the object replica
-to the primary and then concurrently push it out to the replicas.
-With the erasure coded strategy, we probably want to read the
-minimum number of replica chunks required to reconstruct the object
-and push out the replacement chunks concurrently.
-
-Another difference is that objects in erasure coded pg may be
-unrecoverable without being unfound. The "unfound" concept
-should probably then be renamed to unrecoverable. Also, the
-PGBackend implementation will have to be able to direct the search
-for pg replicas with unrecoverable object chunks and to be able
-to determine whether a particular object is recoverable.
-
-Core changes:
-
-- s/unfound/unrecoverable
-
-PGBackend interfaces:
-
-- might_have_unrecoverable()
-- recoverable()
-- recover_object()
-
-`Backfill <http://tracker.ceph.com/issues/5856>`_
--------------------------------------------------
-
-For the most part, backfill itself should behave similarly between
-replicated and erasure coded pools with a few exceptions:
-
-1. We probably want to be able to backfill multiple osds concurrently
- with an erasure coded pool in order to cut down on the read
- overhead.
-2. We probably want to avoid having to place the backfill peers in the
- acting set for an erasure coded pg because we might have a good
- temporary pg chunk for that acting set slot.
-
-For 2, we don't really need to place the backfill peer in the acting
-set for replicated PGs anyway.
-For 1, PGBackend::choose_backfill() should determine which osds are
-backfilled in a particular interval.
-
-Core changes:
-
-- Backfill should be capable of `handling multiple backfill peers
- concurrently <http://tracker.ceph.com/issues/5858>`_ even for
- replicated pgs (easier to test for now)
-- `Backfill peers should not be placed in the acting set
- <http://tracker.ceph.com/issues/5855>`_.
-
-PGBackend interfaces:
-
-- choose_backfill(): allows the implementation to determine which osds
- should be backfilled in a particular interval.
diff --git a/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst
new file mode 100644
index 00000000000..7e1998382a0
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst
@@ -0,0 +1,151 @@
+PGBackend.h
+::
+ /**
+ * PGBackend
+ *
+ * PGBackend defines an interface for logic handling IO and
+ * replication on RADOS objects. The PGBackend implementation
+ * is responsible for:
+ *
+ * 1) Handling client operations
+ * 2) Handling object recovery
+ * 3) Handling object access
+ */
+ class PGBackend {
+ public:
+ /// IO
+
+ /// Perform write
+ int perform_write(
+ const vector<OSDOp> &ops, ///< [in] ops to perform
+ Context *onreadable, ///< [in] called when readable on all replicas
+ Context *ondurable ///< [in] called when durable on all replicas
+ ) = 0; ///< @return 0 or error
+
+ /// Attempt to roll back a log entry
+ int try_rollback(
+ const pg_log_entry_t &entry, ///< [in] entry to roll back
+ ObjectStore::Transaction *t ///< [out] transaction
+ ) = 0; ///< @return 0 on success, -EINVAL if it can't be rolled back
+
+ /// Perform async read, oncomplete is called when ops out_bls are filled in
+ int perform_read(
+ vector<OSDOp> &ops, ///< [in, out] ops
+ Context *oncomplete ///< [out] called with r code
+ ) = 0; ///< @return 0 or error
+
+ /// Peering
+
+ /**
+ * have_enough_infos
+ *
+ * Allows PGBackend implementation to ensure that enough peers have
+ * been contacted to satisfy its requirements.
+ *
+ * TODO: this interface should yield diagnostic info about which infos
+ * are required
+ */
+ bool have_enough_infos(
+ const map<epoch_t, pg_interval_t> &past_intervals, ///< [in] intervals
+ const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos
+ ) = 0; ///< @return true if we can continue peering
+
+ /**
+ * choose_acting
+ *
+ * Allows PGBackend implementation to select the acting set based on the
+ * received infos
+ *
+ * @return False if the current acting set is inadequate, *req_acting will
+ * be filled in with the requested new acting set. True if the
+ * current acting set is adequate, *auth_log will be filled in
+ * with the correct location of the authoritative log.
+ */
+ bool choose_acting(
+ const map<int, pg_info_t> &peer_infos, ///< [in] received infos
+ int *auth_log, ///< [out] osd with auth log
+ vector<int> *req_acting ///< [out] requested acting set
+ ) = 0;
+
+ /// Scrub
+
+ /// scan
+ int scan(
+ const hobject_t &start, ///< [in] scan objects >= start
+ const hobject_t &up_to, ///< [in] scan objects < up_to
+ vector<hobject_t> *out ///< [out] objects returned
+ ) = 0; ///< @return 0 or error
+
+ /// stat (TODO: ScrubMap::object needs to have PGBackend specific metadata)
+ int scrub(
+ const hobject_t &to_stat, ///< [in] object to stat
+ bool deep, ///< [in] true if deep scrub
+ ScrubMap::object *o ///< [out] result
+ ) = 0; ///< @return 0 or error
+
+ /**
+ * compare_scrub_maps
+ *
+ * @param inconsistent [out] map of inconsistent pgs to pair<correct, incorrect>
+ * @param errstr [out] stream of text about inconsistencies for user
+ * perusal
+ *
+ * TODO: this interface doesn't actually make sense...
+ */
+ void compare_scrub_maps(
+ const map<int, ScrubMap> &maps, ///< [in] maps to compare
+ bool deep, ///< [in] true if scrub is deep
+ map<hobject_t, pair<set<int>, set<int> > > *inconsistent,
+ std::ostream *errstr
+ ) = 0;
+
+ /// Recovery
+
+ /**
+ * might_have_unrecoverable
+ *
+ * @param missing [in] missing,info gathered so far (must include acting)
+ * @param intervals [in] past intervals
+ * @param should_query [out] pair<int, cpg_t> shards to query
+ */
+ void might_have_unrecoverable(
+ const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+ const map<epoch_t, pg_interval_t> &past_intervals,
+ set<pair<int, cpg_t> > *should_query
+ ) = 0;
+
+ /**
+ * recoverable
+ *
+ * @param missing [in] missing,info gathered so far (must include acting)
+ */
+ bool recoverable(
+ const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+ const hobject_t &hoid ///< [in] object to check
+ ) = 0; ///< @return true if object can be recovered given missing
+
+ /**
+ * recover_object
+ *
+ * Triggers a recovery operation on the specified hobject_t
+ * onreadable must be called before onwriteable
+ *
+ * @param missing [in] set of info, missing pairs for queried nodes
+ */
+ void recover_object(
+ const hobject_t &hoid, ///< [in] object to recover
+ const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+ Context *onreadable, ///< [in] called when object can be read
+ Context *onwriteable ///< [in] called when object can be written
+ ) = 0;
+
+ /// Backfill
+
+ /// choose_backfill
+ void choose_backfill(
+ const map<chunk_id_t, map<int, pg_info_t> > &peer_infos, ///< [in] infos
+ const vector<int> &acting, ///< [in] acting set
+ const vector<int> &up, ///< [in] up set
+ set<int> *to_backfill ///< [out] osds to backfill
+ ) = 0;
+ };
diff --git a/doc/dev/osd_internals/erasure-code.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
index 40616ae271c..40616ae271c 100644
--- a/doc/dev/osd_internals/erasure-code.rst
+++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
diff --git a/doc/dev/osd_internals/erasure_coding/pgbackend.rst b/doc/dev/osd_internals/erasure_coding/pgbackend.rst
new file mode 100644
index 00000000000..662351e9d77
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/pgbackend.rst
@@ -0,0 +1,313 @@
+===================
+PG Backend Proposal
+===================
+
+See also `PGBackend.h <PGBackend>`_
+
+Motivation
+----------
+
+The purpose of the PG Backend interface is to abstract over the
+differences between replication and erasure coding as failure recovery
+mechanisms.
+
+Much of the existing PG logic, particularly that for dealing with
+peering, will be common to each. With both schemes, a log of recent
+operations will be used to direct recovery in the event that an osd is
+down or disconnected for a brief period of time. Similarly, in both
+cases it will be necessary to scan a recovered copy of the PG in order
+to recover an empty OSD. The PGBackend abstraction must be
+sufficiently expressive for Replicated and ErasureCoded backends to be
+treated uniformly in these areas.
+
+However, there are also crucial differences between using replication
+and erasure coding which PGBackend must abstract over:
+
+1. The current write strategy would not ensure that a particular
+ object could be reconstructed after a failure.
+2. Reads on an erasure coded PG require chunks to be read from the
+ replicas as well.
+3. Object recovery probably involves recovering the primary and
+ replica missing copies at the same time to avoid performing extra
+ reads of replica shards.
+4. Erasure coded PG chunks created for different acting set
+ positions are not interchangeable. In particular, it might make
+ sense for a single OSD to hold more than 1 PG copy for different
+ acting set positions.
+5. Selection of a pgtemp for backfill may differ between replicated
+   and erasure coded backends.
+6. The set of necessary osds from a particular interval required to
+   continue peering may differ between replicated and erasure
+   coded backends.
+7. The selection of the authoritative log may differ between replicated
+   and erasure coded backends.
+
+Client Writes
+-------------
+
+The current PG implementation performs a write by performing the write
+locally while concurrently directing replicas to perform the same
+operation. Once all operations are durable, the operation is
+considered durable. Because these writes may be destructive
+overwrites, during peering, a log entry on a replica (or the primary)
+may be found to be divergent if that replica remembers a log event
+which the authoritative log does not contain. This can happen if only
+1 out of 3 replicas persisted an operation, but was not available in
+the next interval to provide an authoritative log. With replication,
+we can repair the divergent object as long as at least 1 replica has a
+current copy of the divergent object. With erasure coding, however,
+it might be the case that neither the new version of the object nor
+the old version of the object has enough available chunks to be
+reconstructed. This problem is much simpler if we arrange for all
+supported operations to be locally roll-back-able.
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+ persisted the deletion event. ErasureCoded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
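+
+As an illustration (not the actual Ceph implementation), a locally
+roll-back-able append only needs the PG log event to remember the prior
+object size; the types below are hypothetical stand-ins for the real
+log entry and object store structures::
+
+  #include <cstdint>
+  #include <string>
+
+  // Illustrative stand-ins, not the real Ceph types.
+  struct LogEntry {
+    uint64_t version;     // version introduced by the append
+    uint64_t prior_size;  // object size before the append was applied
+  };
+
+  struct Object {
+    std::string data;     // stored object contents
+  };
+
+  // Apply an append and record enough state to undo it locally.
+  LogEntry do_append(Object &obj, const std::string &chunk, uint64_t version) {
+    LogEntry e{version, obj.data.size()};
+    obj.data += chunk;
+    return e;
+  }
+
+  // Roll the append back by truncating to the recorded prior size.
+  void rollback_append(Object &obj, const LogEntry &e) {
+    obj.data.resize(e.prior_size);
+  }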
+
+Core Changes:
+
+- Current code should be adapted to use, and roll back as appropriate,
+  APPEND, DELETE, and (SET|RM)ATTR log entries.
+- The filestore needs to be able to deal with multiply versioned
+ hobjects. This probably means adapting the filestore internally to
+ use a vhobject which is basically a pair<version_t, hobject_t>. The
+ version needs to be included in the on-disk filename. An interface
+ needs to be added to get all versions of a particular hobject_t or
+ the most recently versioned instance of a particular hobject_t.
+
+PGBackend Interfaces:
+
+- PGBackend::perform_write() : It seems simplest to pass the actual
+ ops vector. The reason for providing an async, callback based
+ interface rather than having the PGBackend respond directly is that
+ we might want to use this interface for internal operations like
+ watch/notify expiration or snap trimming which might not necessarily
+ have an external client.
+- PGBackend::try_rollback() : Some log entries (all of the ones valid
+ for the Erasure coded backend) will support local rollback. In
+ those cases, PGLog can avoid adding objects to the missing set when
+ identifying divergent objects.
+
+Peering and PG Logs
+-------------------
+
+Currently, we select the log with the newest last_update and the
+longest tail to be the authoritative log. This is fine because we
+aren't generally able to roll operations on the other replicas forward
+or backwards, instead relying on our ability to re-replicate divergent
+objects. With the write approach discussed in the previous section,
+however, the erasure coded backend will rely on being able to roll
+back divergent operations since we may not be able to re-replicate
+divergent objects. Thus, we must choose the *oldest* last_update from
+the last interval which went active in order to minimize the number of
+divergent objects.
+
+The difficulty is that the current code assumes that as long as it has
+an info from at least 1 osd from the prior interval, it can complete
+peering. In order to ensure that we do not end up with an
+unrecoverably divergent object, an M+K erasure coded PG must hear from at
+least M of the replicas of the last interval to serve writes. This ensures
+that we will select a last_update old enough to roll back at least M
+replicas. If a replica with an older last_update comes along later,
+we will be able to provide at least M chunks of any divergent object.
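+
+A sketch of that selection rule, with plain integers standing in for
+eversion_t and the peer info structures (purely illustrative)::
+
+  #include <algorithm>
+  #include <cstdint>
+  #include <stdexcept>
+  #include <vector>
+
+  // last_updates holds the last_update reported by each shard of the
+  // most recent interval that went active; M is the number of data
+  // chunks in an M+K code.
+  uint64_t choose_rollback_point(const std::vector<uint64_t> &last_updates,
+                                 unsigned M) {
+    if (last_updates.empty() || last_updates.size() < M)
+      throw std::runtime_error("cannot peer: fewer than M shards heard from");
+    // Choose the *oldest* last_update so that at least M shards are able
+    // to roll back to it, keeping any divergent object reconstructible.
+    return *std::min_element(last_updates.begin(), last_updates.end());
+  }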
+
+Core Changes:
+
+- `PG::choose_acting(), etc. need to be generalized to use PGBackend
+ <http://tracker.ceph.com/issues/5860>`_ to determine the
+ authoritative log.
+- `PG::RecoveryState::GetInfo needs to use PGBackend
+ <http://tracker.ceph.com/issues/5859>`_ to determine whether it has
+ enough infos to continue with authoritative log selection.
+
+PGBackend interfaces:
+
+- have_enough_infos()
+- choose_acting()
+
+PGTemp
+------
+
+Currently, an osd is able to request a temp acting set mapping in
+order to allow an up-to-date osd to serve requests while a new primary
+is backfilled (and for other reasons). An erasure coded pg needs to
+be able to designate a primary for these reasons without putting it
+in the first position of the acting set. It also needs to be able
+to leave holes in the requested acting set.
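+
+For illustration only, such a request could carry an explicit primary
+and mark unfilled positions with a sentinel (here -1; how holes are
+actually encoded is an open implementation detail)::
+
+  #include <vector>
+
+  // Hypothetical pg_temp request: the primary is named explicitly
+  // instead of being implied by acting[0], and positions with no
+  // suitable OSD are left as holes.
+  struct PGTempRequest {
+    int primary;              // OSD id that should act as primary
+    std::vector<int> acting;  // OSD id per acting-set position, -1 = hole
+  };
+
+  // Example: positions 0 and 2 are served by osd.3 and osd.7, position 1
+  // is currently unfilled, and osd.7 acts as primary even though it is
+  // not in the first position.
+  PGTempRequest example{7, {3, -1, 7}};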
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Client Reads
+------------
+
+Reads with the replicated strategy can always be satisfied
+synchronously out of the primary osd. With an erasure coded strategy,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. The perform_read() interface for PGBackend
+therefore will be async.
+
+PGBackend interfaces:
+
+- perform_read(): as with perform_write() it seems simplest to pass
+ the ops vector. The call to oncomplete will occur once the out_bls
+ have been appropriately filled in.
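+
+A sketch of how a caller might drive an interface of that shape, with
+std::function standing in for Context and trivial stand-in types (this
+is not the actual PGBackend API)::
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <functional>
+  #include <string>
+  #include <vector>
+
+  struct ReadOp {
+    uint64_t off = 0, len = 0;
+    std::string out_bl;        // filled in before the completion fires
+  };
+  using Completion = std::function<void(int result)>;
+
+  // Dummy synchronous stand-in: a real erasure coded backend would issue
+  // chunk reads to enough shards, reconstruct the data, fill each out_bl
+  // and only then invoke oncomplete with 0 or an error code.
+  void perform_read(std::vector<ReadOp> &ops, Completion oncomplete) {
+    for (auto &op : ops)
+      op.out_bl.assign(op.len, '\0');   // placeholder contents
+    oncomplete(0);
+  }
+
+  void example(std::vector<ReadOp> &ops) {
+    perform_read(ops, [&ops](int r) {
+      if (r == 0 && !ops.empty())
+        std::printf("first extent returned %zu bytes\n", ops[0].out_bl.size());
+    });
+  }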
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an osd which happens already to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<chunk_id_t, pg_t> in order to distinguish different pg
+chunks on a single OSD.
+
+Because the mapping of object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the filestore must
+include the chunk id in the object key.
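+
+A rough sketch of the sort of identifiers this implies (field layout is
+illustrative; the actual pg_t and any eventual cpg_t differ)::
+
+  #include <cstdint>
+  #include <sstream>
+  #include <string>
+
+  struct pg_t  { uint64_t pool; uint32_t seed; };  // placement group
+  struct cpg_t { pg_t pgid; uint8_t chunk; };      // pg + chunk position
+
+  // Object keys embed the chunk id (and, per the Client Writes section,
+  // the creation version) so that chunk 2 and chunk 4 of the same object
+  // can coexist on one OSD under distinct names.
+  std::string object_key(const std::string &oid, uint64_t version,
+                         uint8_t chunk) {
+    std::ostringstream os;
+    os << oid << "_" << version << "_" << static_cast<int>(chunk);
+    return os.str();
+  }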
+
+Core changes:
+
+- The filestore `vhobject_t needs to also include a chunk id
+ <http://tracker.ceph.com/issues/5862>`_ making it more like
+ tuple<hobject_t, version_t, chunk_id_t>.
+- coll_t needs to include a chunk_id_t.
+- The `OSD pg_map and similar pg mappings need to work in terms of a
+ cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially
+ pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include
+ a chunk_id_t
+- For client->PG messages, the OSD will need a way to know which PG
+ chunk should get the message since the OSD may contain both a
+ primary and non-primary chunk for the same pg
+
+Object Classes
+--------------
+
+We probably won't support object classes at first on Erasure coded
+backends.
+
+Scrub
+-----
+
+We currently have two scrub modes with different default frequencies:
+
+1. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+2. deep scrub: compares the set of objects, metadata, and a crc32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a crc32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with erasure coded PG with
+the caveat that the PGBackend implementation must be in charge of
+actually doing the scan, and that the PGBackend implementation should
+be able to attach arbitrary information to allow PGBackend on the
+primary to scrub PGBackend specific metadata.
+
+The main catch, however, for erasure coded PG is that sending a crc32
+of the stored chunk on a replica isn't particularly helpful since the
+chunks on different replicas presumably store different data. Because
+we don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
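+
+Because chunks are append-only, the stored checksum can simply be
+extended on every append.  A sketch using zlib's crc32 purely for
+illustration (the checksum function actually used may differ)::
+
+  #include <cstdint>
+  #include <string>
+  #include <zlib.h>
+
+  // Running checksum kept alongside an append-only chunk.  A deep scrub
+  // re-reads the local chunk, recomputes the crc, compares it with this
+  // stored value, and reports match/mismatch to the primary.
+  struct ChunkChecksum {
+    uint32_t crc = crc32(0L, Z_NULL, 0);   // crc of empty data
+
+    void append(const std::string &data) {
+      crc = crc32(crc,
+                  reinterpret_cast<const unsigned char *>(data.data()),
+                  data.size());
+    }
+  };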
+
+`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_:
+
+- scan()
+- scrub()
+- compare_scrub_maps()
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+Core changes:
+
+- Ensure that crush behaves as above for INDEP.
+
+`Recovery <http://tracker.ceph.com/issues/5857>`_
+-------------------------------------------------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in erasure coded pg may be
+unrecoverable without being unfound. The "unfound" concept
+should probably then be renamed to unrecoverable. Also, the
+PGBackend implementation will have to be able to direct the search
+for pg replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
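+
+The recoverability test itself reduces to counting surviving chunks; a
+minimal sketch, assuming the set of intact chunk positions for the
+object has already been gathered from the peers::
+
+  #include <cstddef>
+  #include <set>
+
+  // An object stored with an M+K code is recoverable as long as at
+  // least M of its chunks are intact somewhere, whether or not any
+  // single OSD holds a complete copy.
+  bool recoverable(const std::set<unsigned> &intact_chunks, std::size_t M) {
+    return intact_chunks.size() >= M;
+  }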
+
+Core changes:
+
+- s/unfound/unrecoverable
+
+PGBackend interfaces:
+
+- might_have_unrecoverable()
+- recoverable()
+- recover_object()
+
+`Backfill <http://tracker.ceph.com/issues/5856>`_
+-------------------------------------------------
+
+For the most part, backfill itself should behave similarly between
+replicated and erasure coded pools with a few exceptions:
+
+1. We probably want to be able to backfill multiple osds concurrently
+ with an erasure coded pool in order to cut down on the read
+ overhead.
+2. We probably want to avoid having to place the backfill peers in the
+ acting set for an erasure coded pg because we might have a good
+ temporary pg chunk for that acting set slot.
+
+For 2, we don't really need to place the backfill peer in the acting
+set for replicated PGs anyway.
+For 1, PGBackend::choose_backfill() should determine which osds are
+backfilled in a particular interval.
+
+Core changes:
+
+- Backfill should be capable of `handling multiple backfill peers
+ concurrently <http://tracker.ceph.com/issues/5858>`_ even for
+ replicated pgs (easier to test for now)
+- `Backfill peers should not be placed in the acting set
+ <http://tracker.ceph.com/issues/5855>`_.
+
+PGBackend interfaces:
+
+- choose_backfill(): allows the implementation to determine which osds
+ should be backfilled in a particular interval.
diff --git a/src/osd/PGBackend.h b/src/osd/PGBackend.h
deleted file mode 100644
index efa26707d41..00000000000
--- a/src/osd/PGBackend.h
+++ /dev/null
@@ -1,169 +0,0 @@
-// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
-/*
- * Ceph - scalable distributed file system
- *
- * Copyright (C) 2013 Inktank Storage, Inc.
- *
- * This is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Lesser General Public
- * License version 2.1, as published by the Free Software
- * Foundation. See file COPYING.
- *
- */
-
-#ifndef CEPH_PGBACKEND_H
-#define CEPH_PGBACKEND_H
-
-#include "osd_types.h"
-
-/**
- * PGBackend
- *
- * PGBackend defines an interface for logic handling IO and
- * replication on RADOS objects. The PGBackend implementation
- * is responsible for:
- *
- * 1) Handling client operations
- * 2) Handling object recovery
- * 3) Handling object access
- */
-class PGBackend {
-public:
- /// IO
-
- /// Perform write
- int perform_write(
- const vector<OSDOp> &ops, ///< [in] ops to perform
- Context *onreadable, ///< [in] called when readable on all replicas
- Context *ondurable ///< [in] called when durable on all replicas
- ) = 0; ///< @return 0 or error
-
- /// Attempt to roll back a log entry
- int try_rollback(
- const pg_log_entry_t &entry, ///< [in] entry to roll back
- ObjectStore::Transaction *t ///< [out] transaction
- ) = 0; ///< @return 0 on success, -EINVAL if it can't be rolled back
-
- /// Perform async read, oncomplete is called when ops out_bls are filled in
- int perform_read(
- vector<OSDOp> &ops, ///< [in, out] ops
- Context *oncomplete ///< [out] called with r code
- ) = 0; ///< @return 0 or error
-
- /// Peering
-
- /**
- * have_enough_infos
- *
- * Allows PGBackend implementation to ensure that enough peers have
- * been contacted to satisfy its requirements.
- *
- * TODO: this interface should yield diagnostic info about which infos
- * are required
- */
- bool have_enough_infos(
- const map<epoch_t, pg_interval_t> &past_intervals, ///< [in] intervals
- const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos
- ) = 0; ///< @return true if we can continue peering
-
- /**
- * choose_acting
- *
- * Allows PGBackend implementation to select the acting set based on the
- * received infos
- *
- * @return False if the current acting set is inadequate, *req_acting will
- * be filled in with the requested new acting set. True if the
- * current acting set is adequate, *auth_log will be filled in
- * with the correct location of the authoritative log.
- */
- bool choose_acting(
- const map<int, pg_info_t> &peer_infos, ///< [in] received infos
- int *auth_log, ///< [out] osd with auth log
- vector<int> *req_acting ///< [out] requested acting set
- ) = 0;
-
- /// Scrub
-
- /// scan
- int scan(
- const hobject_t &start, ///< [in] scan objects >= start
- const hobject_t &up_to, ///< [in] scan objects < up_to
- vector<hobject_t> *out ///< [out] objects returned
- ) = 0; ///< @return 0 or error
-
- /// stat (TODO: ScrubMap::object needs to have PGBackend specific metadata)
- int scrub(
- const hobject_t &to_stat, ///< [in] object to stat
- bool deep, ///< [in] true if deep scrub
- ScrubMap::object *o ///< [out] result
- ) = 0; ///< @return 0 or error
-
- /**
- * compare_scrub_maps
- *
- * @param inconsistent [out] map of inconsistent pgs to pair<correct, incorrect>
- * @param errstr [out] stream of text about inconsistencies for user
- * perusal
- *
- * TODO: this interface doesn't actually make sense...
- */
- void compare_scrub_maps(
- const map<int, ScrubMap> &maps, ///< [in] maps to compare
- bool deep, ///< [in] true if scrub is deep
- map<hobject_t, pair<set<int>, set<int> > > *inconsistent,
- std::ostream *errstr
- ) = 0;
-
- /// Recovery
-
- /**
- * might_have_unrecoverable
- *
- * @param missing [in] missing,info gathered so far (must include acting)
- * @param intervals [in] past intervals
- * @param should_query [out] pair<int, cpg_t> shards to query
- */
- void might_have_unrecoverable(
- const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
- const map<epoch_t, pg_interval_t> &past_intervals,
- set<pair<int, cpg_t> > *should_query
- ) = 0;
-
- /**
- * recoverable
- *
- * @param missing [in] missing,info gathered so far (must include acting)
- */
- bool recoverable(
- const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
- const hobject_t &hoid ///< [in] object to check
- ) = 0; ///< @return true if object can be recovered given missing
-
- /**
- * recover_object
- *
- * Triggers a recovery operation on the specified hobject_t
- * onreadable must be called before onwriteable
- *
- * @param missing [in] set of info, missing pairs for queried nodes
- */
- void recover_object(
- const hobject_t &hoid, ///< [in] object to recover
- const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
- Context *onreadable, ///< [in] called when object can be read
- Context *onwriteable ///< [in] called when object can be written
- ) = 0;
-
- /// Backfill
-
- /// choose_backfill
- void choose_backfill(
- const map<chunk_id_t, map<int, pg_info_t> > &peer_infos, ///< [in] infos
- const vector<int> &acting, ///< [in] acting set
- const vector<int> &up, ///< [in] up set
- set<int> *to_backfill ///< [out] osds to backfill
- ) = 0;
-};
-
-#endif