author    | Loic Dachary <loic@dachary.org> | 2013-08-09 23:23:17 +0200
committer | Loic Dachary <loic@dachary.org> | 2013-08-09 23:26:49 +0200
commit    | 8bf3971b7e9669578cfae21a1738b116fab48a44 (patch)
tree      | 69153dc0e81dc8cd9db522f4bf1aa91e3e5b860e /doc
parent    | 980a07380db65f802fc4d8971a5404cc2ed0ff6e (diff)
download  | ceph-8bf3971b7e9669578cfae21a1738b116fab48a44.tar.gz
rearrange the documentation to be inserted and maintained in master
Signed-off-by: Loic Dachary <loic@dachary.org>
Diffstat (limited to 'doc')
-rw-r--r-- | doc/dev/osd_internals/erasure_coding.rst | 323
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/PGBackend-h.rst | 151
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/developer_notes.rst (renamed from doc/dev/osd_internals/erasure-code.rst) | 0
-rw-r--r-- | doc/dev/osd_internals/erasure_coding/pgbackend.rst | 313
4 files changed, 478 insertions, 309 deletions
diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst index df21d3dccdc..deb91aca9db 100644 --- a/doc/dev/osd_internals/erasure_coding.rst +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -1,313 +1,18 @@ -=================== -PG Backend Proposal -=================== +============================== +Erasure Coded Placement Groups +============================== -See src/osd/PGBackend.h +The documentation of the erasure coding implementation in Ceph was +created in July 2013. It is included in Ceph even before erasure +coding is available because it drives a number of architectural +changes. It is meant to be updated to reflect the `progress of these +architectural changes <http://tracker.ceph.com/issues/4929>`_, up to +the point where it becomes a reference of the erasure coding +implementation itself. -Motivation ----------- +.. toctree:: + :maxdepth: 1 -The purpose of the PG Backend interface is to abstract over the -differences between replication and erasure coding as failure recovery -mechanisms. + High level design document <erasure_coding/pgbackend> + Developer notes <erasure_coding/developer_notes> -Much of the existing PG logic, particularly that for dealing with -peering, will be common to each. With both schemes, a log of recent -operations will be used to direct recovery in the event that an osd is -down or disconnected for a brief period of time. Similarly, in both -cases it will be necessary to scan a recovered copy of the PG in order -to recover an empty OSD. The PGBackend abstraction must be -sufficiently expressive for Replicated and ErasureCoded backends to be -treated uniformly in these areas. - -However, there are also crucial differences between using replication -and erasure coding which PGBackend must abstract over: - -1. The current write strategy would not ensure that a particular - object could be reconstructed after a failure. -2. Reads on an erasure coded PG require chunks to be read from the - replicas as well. -3. Object recovery probably involves recovering the primary and - replica missing copies at the same time to avoid performing extra - reads of replica shards. -4. Erasure coded PG chunks created for different acting set - positions are not interchangeable. In particular, it might make - sense for a single OSD to hold more than 1 PG copy for different - acting set positions. -5. Selection of a pgtemp for backfill may difer between replicated - and erasure coded backends. -6. The set of necessary osds from a particular interval required to - to continue peering may difer between replicated and erasure - coded backends. -7. The selection of the authoritative log may difer between replicated - and erasure coded backends. - -Client Writes -------------- - -The current PG implementation performs a write by performing the write -locally while concurrently directing replicas to perform the same -operation. Once all operations are durable, the operation is -considered durable. Because these writes may be destructive -overwrites, during peering, a log entry on a replica (or the primary) -may be found to be divergent if that replica remembers a log event -which the authoritative log does not contain. This can happen if only -1 out of 3 replicas persisted an operation, but was not available in -the next interval to provide an authoritative log. With replication, -we can repair the divergent object as long as at least 1 replica has a -current copy of the divergent object. 
With erasure coding, however, -it might be the case that neither the new version of the object nor -the old version of the object has enough available chunks to be -reconstructed. This problem is much simpler if we arrange for all -supported operations to be locally roll-back-able. - -- CEPH_OSD_OP_APPEND: We can roll back an append locally by - including the previous object size as part of the PG log event. -- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete - requires that we retain the deleted object until all replicas have - persisted the deletion event. ErasureCoded backend will therefore - need to store objects with the version at which they were created - included in the key provided to the filestore. Old versions of an - object can be pruned when all replicas have committed up to the log - event deleting the object. -- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr - to be set or removed, we can roll back these operations locally. - -Core Changes: - -- Current code should be adapted to use and rollback as appropriate - APPEND, DELETE, (SET|RM)ATTR log entries. -- The filestore needs to be able to deal with multiply versioned - hobjects. This probably means adapting the filestore internally to - use a vhobject which is basically a pair<version_t, hobject_t>. The - version needs to be included in the on-disk filename. An interface - needs to be added to get all versions of a particular hobject_t or - the most recently versioned instance of a particular hobject_t. - -PGBackend Interfaces: - -- PGBackend::perform_write() : It seems simplest to pass the actual - ops vector. The reason for providing an async, callback based - interface rather than having the PGBackend respond directly is that - we might want to use this interface for internal operations like - watch/notify expiration or snap trimming which might not necessarily - have an external client. -- PGBackend::try_rollback() : Some log entries (all of the ones valid - for the Erasure coded backend) will support local rollback. In - those cases, PGLog can avoid adding objects to the missing set when - identifying divergent objects. - -Peering and PG Logs -------------------- - -Currently, we select the log with the newest last_update and the -longest tail to be the authoritative log. This is fine because we -aren't generally able to roll operations on the other replicas forward -or backwards, instead relying on our ability to re-replicate divergent -objects. With the write approach discussed in the previous section, -however, the erasure coded backend will rely on being able to roll -back divergent operations since we may not be able to re-replicate -divergent objects. Thus, we must choose the *oldest* last_update from -the last interval which went active in order to minimize the number of -divergent objects. - -The dificulty is that the current code assumes that as long as it has -an info from at least 1 osd from the prior interval, it can complete -peering. In order to ensure that we do not end up with an -unrecoverably divergent object, an M+K erasure coded PG must hear from at -least M of the replicas of the last interval to serve writes. This ensures -that we will select a last_update old enough to roll back at least M -replicas. If a replica with an older last_update comes along later, -we will be able to provide at least M chunks of any divergent object. - -Core Changes: - -- `PG::choose_acting(), etc. 
need to be generalized to use PGBackend - <http://tracker.ceph.com/issues/5860>`_ to determine the - authoritative log. -- `PG::RecoveryState::GetInfo needs to use PGBackend - <http://tracker.ceph.com/issues/5859>`_ to determine whether it has - enough infos to continue with authoritative log selection. - -PGBackend interfaces: - -- have_enough_infos() -- choose_acting() - -PGTemp ------- - -Currently, an osd is able to request a temp acting set mapping in -order to allow an up-to-date osd to serve requests while a new primary -is backfilled (and for other reasons). An erasure coded pg needs to -be able to designate a primary for these reasons without putting it -in the first position of the acting set. It also needs to be able -to leave holes in the requested acting set. - -Core Changes: - -- OSDMap::pg_to_*_osds needs to separately return a primary. For most - cases, this can continue to be acting[0]. -- MOSDPGTemp (and related OSD structures) needs to be able to specify - a primary as well as an acting set. -- Much of the existing code base assumes that acting[0] is the primary - and that all elements of acting are valid. This needs to be cleaned - up since the acting set may contain holes. - -Client Reads ------------- - -Reads with the replicated strategy can always be satisfied -syncronously out of the primary osd. With an erasure coded strategy, -the primary will need to request data from some number of replicas in -order to satisfy a read. The perform_read() interface for PGBackend -therefore will be async. - -PGBackend interfaces: - -- perform_read(): as with perform_write() it seems simplest to pass - the ops vector. The call to oncomplete will occur once the out_bls - have been appropriately filled in. - -Distinguished acting set positions ----------------------------------- - -With the replicated strategy, all replicas of a PG are -interchangeable. With erasure coding, different positions in the -acting set have different pieces of the erasure coding scheme and are -not interchangeable. Worse, crush might cause chunk 2 to be written -to an osd which happens already to contain an (old) copy of chunk 4. -This means that the OSD and PG messages need to work in terms of a -type like pair<chunk_id_t, pg_t> in order to distinguish different pg -chunks on a single OSD. - -Because the mapping of object name to object in the filestore must -be 1-to-1, we must ensure that the objects in chunk 2 and the objects -in chunk 4 have different names. To that end, the filestore must -include the chunk id in the object key. - -Core changes: - -- The filestore `vhobject_t needs to also include a chunk id - <http://tracker.ceph.com/issues/5862>`_ making it more like - tuple<hobject_t, version_t, chunk_id_t>. -- coll_t needs to include a chunk_id_t. -- The `OSD pg_map and similar pg mappings need to work in terms of a - cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially - pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include - a chunk_id_t -- For client->PG messages, the OSD will need a way to know which PG - chunk should get the message since the OSD may contain both a - primary and non-primary chunk for the same pg - -Object Classes --------------- - -We probably won't support object classes at first on Erasure coded -backends. - -Scrub ------ - -We currently have two scrub modes with different default frequencies: - -1. [shallow] scrub: compares the set of objects and metadata, but not - the contents -2. 
deep scrub: compares the set of objects, metadata, and a crc32 of - the object contents (including omap) - -The primary requests a scrubmap from each replica for a particular -range of objects. The replica fills out this scrubmap for the range -of objects including, if the scrub is deep, a crc32 of the contents of -each object. The primary gathers these scrubmaps from each replica -and performs a comparison identifying inconsistent objects. - -Most of this can work essentially unchanged with erasure coded PG with -the caveat that the PGBackend implementation must be in charge of -actually doing the scan, and that the PGBackend implementation should -be able to attach arbitrary information to allow PGBackend on the -primary to scrub PGBackend specific metadata. - -The main catch, however, for erasure coded PG is that sending a crc32 -of the stored chunk on a replica isn't particularly helpful since the -chunks on different replicas presumably store different data. Because -we don't support overwrites except via DELETE, however, we have the -option of maintaining a crc32 on each chunk through each append. -Thus, each replica instead simply computes a crc32 of its own stored -chunk and compares it with the locally stored checksum. The replica -then reports to the primary whether the checksums match. - -`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_: - -- scan() -- scrub() -- compare_scrub_maps() - -Crush ------ - -If crush is unable to generate a replacement for a down member of an -acting set, the acting set should have a hole at that position rather -than shifting the other elements of the acting set out of position. - -Core changes: - -- Ensure that crush behaves as above for INDEP. - -`Recovery <http://tracker.ceph.com/issues/5857>`_ --------- - -The logic for recovering an object depends on the backend. With -the current replicated strategy, we first pull the object replica -to the primary and then concurrently push it out to the replicas. -With the erasure coded strategy, we probably want to read the -minimum number of replica chunks required to reconstruct the object -and push out the replacement chunks concurrently. - -Another difference is that objects in erasure coded pg may be -unrecoverable without being unfound. The "unfound" concept -should probably then be renamed to unrecoverable. Also, the -PGBackend impementation will have to be able to direct the search -for pg replicas with unrecoverable object chunks and to be able -to determine whether a particular object is recoverable. - -Core changes: - -- s/unfound/unrecoverable - -PGBackend interfaces: - -- might_have_unrecoverable() -- recoverable() -- recover_object() - -`Backfill <http://tracker.ceph.com/issues/5856>`_ --------- - -For the most part, backfill itself should behave similarly between -replicated and erasure coded pools with a few exceptions: - -1. We probably want to be able to backfill multiple osds concurrently - with an erasure coded pool in order to cut down on the read - overhead. -2. We probably want to avoid having to place the backfill peers in the - acting set for an erasure coded pg because we might have a good - temporary pg chunk for that acting set slot. - -For 2, we don't really need to place the backfill peer in the acting -set for replicated PGs anyway. -For 1, PGBackend::choose_backfill() should determine which osds are -backfilled in a particular interval. 
- -Core changes: - -- Backfill should be capable of `handling multiple backfill peers - concurrently <http://tracker.ceph.com/issues/5858>`_ even for - replicated pgs (easier to test for now) -- `Backfill peers should not be placed in the acting set - <http://tracker.ceph.com/issues/5855>`_. - -PGBackend interfaces: - -- choose_backfill(): allows the implementation to determine which osds - should be backfilled in a particular interval. diff --git a/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst new file mode 100644 index 00000000000..7e1998382a0 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/PGBackend-h.rst @@ -0,0 +1,151 @@ +PGBackend.h +:: + /** + * PGBackend + * + * PGBackend defines an interface for logic handling IO and + * replication on RADOS objects. The PGBackend implementation + * is responsible for: + * + * 1) Handling client operations + * 2) Handling object recovery + * 3) Handling object access + */ + class PGBackend { + public: + /// IO + + /// Perform write + int perform_write( + const vector<OSDOp> &ops, ///< [in] ops to perform + Context *onreadable, ///< [in] called when readable on all reaplicas + Context *onreadable, ///< [in] called when durable on all replicas + ) = 0; ///< @return 0 or error + + /// Attempt to roll back a log entry + int try_rollback( + const pg_log_entry_t &entry, ///< [in] entry to roll back + ObjectStore::Transaction *t ///< [out] transaction + ) = 0; ///< @return 0 on success, -EINVAL if it can't be rolled back + + /// Perform async read, oncomplete is called when ops out_bls are filled in + int perform_read( + vector<OSDOp> &ops, ///< [in, out] ops + Context *oncomplete ///< [out] called with r code + ) = 0; ///< @return 0 or error + + /// Peering + + /** + * have_enough_infos + * + * Allows PGBackend implementation to ensure that enough peers have + * been contacted to satisfy its requirements. + * + * TODO: this interface should yield diagnostic info about which infos + * are required + */ + bool have_enough_infos( + const map<epoch_t, pg_interval_t> &past_intervals, ///< [in] intervals + const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos + ) = 0; ///< @return true if we can continue peering + + /** + * choose_acting + * + * Allows PGBackend implementation to select the acting set based on the + * received infos + * + * @return False if the current acting set is inadequate, *req_acting will + * be filled in with the requested new acting set. True if the + * current acting set is adequate, *auth_log will be filled in + * with the correct location of the authoritative log. 
+ */ + bool choose_acting( + const map<int, pg_info_t> &peer_infos, ///< [in] received infos + int *auth_log, ///< [out] osd with auth log + vector<int> *req_acting ///< [out] requested acting set + ) = 0; + + /// Scrub + + /// scan + int scan( + const hobject_t &start, ///< [in] scan objects >= start + const hobject_t &up_to, ///< [in] scan objects < up_to + vector<hobject_t> *out ///< [out] objects returned + ) = 0; ///< @return 0 or error + + /// stat (TODO: ScrubMap::object needs to have PGBackend specific metadata) + int scrub( + const hobject_t &to_stat, ///< [in] object to stat + bool deep, ///< [in] true if deep scrub + ScrubMap::object *o ///< [out] result + ) = 0; ///< @return 0 or error + + /** + * compare_scrub_maps + * + * @param inconsistent [out] map of inconsistent pgs to pair<correct, incorrect> + * @param errstr [out] stream of text about inconsistencies for user + * perusal + * + * TODO: this interface doesn't actually make sense... + */ + void compare_scrub_maps( + const map<int, ScrubMap> &maps, ///< [in] maps to compare + bool deep, ///< [in] true if scrub is deep + map<hobject_t, pair<set<int>, set<int> > > *inconsistent, + std:ostream *errstr + ) = 0; + + /// Recovery + + /** + * might_have_unrecoverable + * + * @param missing [in] missing,info gathered so far (must include acting) + * @param intervals [in] past intervals + * @param should_query [out] pair<int, cpg_t> shards to query + */ + void might_have_unrecoverable( + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing, + const map<epoch_t, pg_interval_t> &past_intervals, + set<pair<int, cpg_t> > *should_query + ) = 0; + + /** + * might_have_unfound + * + * @param missing [in] missing,info gathered so far (must include acting) + */ + bool recoverable( + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing, + const hobject_t &hoid ///< [in] object to check + ) = 0; ///< @return true if object can be recovered given missing + + /** + * recover_object + * + * Triggers a recovery operation on the specified hobject_t + * onreadable must be called before onwriteable + * + * @param missing [in] set of info, missing pairs for queried nodes + */ + void recover_object( + const hobject_t &hoid, ///< [in] object to recover + const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > &missing + Context *onreadable, ///< [in] called when object can be read + Context *onwriteable ///< [in] called when object can be written + ) = 0; + + /// Backfill + + /// choose_backfill + void choose_backfill( + const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos + const vector<int> &acting, ///< [in] acting set + const vector<int> &up, ///< [in] up set + set<int> *to_backfill ///< [out] osds to backfill + ) = 0; + }; diff --git a/doc/dev/osd_internals/erasure-code.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst index 40616ae271c..40616ae271c 100644 --- a/doc/dev/osd_internals/erasure-code.rst +++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst diff --git a/doc/dev/osd_internals/erasure_coding/pgbackend.rst b/doc/dev/osd_internals/erasure_coding/pgbackend.rst new file mode 100644 index 00000000000..662351e9d77 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/pgbackend.rst @@ -0,0 +1,313 @@ +=================== +PG Backend Proposal +=================== + +See also `PGBackend.h <PGBackend>`_ + +Motivation +---------- + +The purpose of the PG Backend interface is to abstract over the +differences between replication and erasure coding as failure 
recovery +mechanisms. + +Much of the existing PG logic, particularly that for dealing with +peering, will be common to each. With both schemes, a log of recent +operations will be used to direct recovery in the event that an osd is +down or disconnected for a brief period of time. Similarly, in both +cases it will be necessary to scan a recovered copy of the PG in order +to recover an empty OSD. The PGBackend abstraction must be +sufficiently expressive for Replicated and ErasureCoded backends to be +treated uniformly in these areas. + +However, there are also crucial differences between using replication +and erasure coding which PGBackend must abstract over: + +1. The current write strategy would not ensure that a particular + object could be reconstructed after a failure. +2. Reads on an erasure coded PG require chunks to be read from the + replicas as well. +3. Object recovery probably involves recovering the primary and + replica missing copies at the same time to avoid performing extra + reads of replica shards. +4. Erasure coded PG chunks created for different acting set + positions are not interchangeable. In particular, it might make + sense for a single OSD to hold more than 1 PG copy for different + acting set positions. +5. Selection of a pgtemp for backfill may difer between replicated + and erasure coded backends. +6. The set of necessary osds from a particular interval required to + to continue peering may difer between replicated and erasure + coded backends. +7. The selection of the authoritative log may difer between replicated + and erasure coded backends. + +Client Writes +------------- + +The current PG implementation performs a write by performing the write +locally while concurrently directing replicas to perform the same +operation. Once all operations are durable, the operation is +considered durable. Because these writes may be destructive +overwrites, during peering, a log entry on a replica (or the primary) +may be found to be divergent if that replica remembers a log event +which the authoritative log does not contain. This can happen if only +1 out of 3 replicas persisted an operation, but was not available in +the next interval to provide an authoritative log. With replication, +we can repair the divergent object as long as at least 1 replica has a +current copy of the divergent object. With erasure coding, however, +it might be the case that neither the new version of the object nor +the old version of the object has enough available chunks to be +reconstructed. This problem is much simpler if we arrange for all +supported operations to be locally roll-back-able. + +- CEPH_OSD_OP_APPEND: We can roll back an append locally by + including the previous object size as part of the PG log event. +- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete + requires that we retain the deleted object until all replicas have + persisted the deletion event. ErasureCoded backend will therefore + need to store objects with the version at which they were created + included in the key provided to the filestore. Old versions of an + object can be pruned when all replicas have committed up to the log + event deleting the object. +- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr + to be set or removed, we can roll back these operations locally. + +Core Changes: + +- Current code should be adapted to use and rollback as appropriate + APPEND, DELETE, (SET|RM)ATTR log entries. +- The filestore needs to be able to deal with multiply versioned + hobjects. 
This probably means adapting the filestore internally to + use a vhobject which is basically a pair<version_t, hobject_t>. The + version needs to be included in the on-disk filename. An interface + needs to be added to get all versions of a particular hobject_t or + the most recently versioned instance of a particular hobject_t. + +PGBackend Interfaces: + +- PGBackend::perform_write() : It seems simplest to pass the actual + ops vector. The reason for providing an async, callback based + interface rather than having the PGBackend respond directly is that + we might want to use this interface for internal operations like + watch/notify expiration or snap trimming which might not necessarily + have an external client. +- PGBackend::try_rollback() : Some log entries (all of the ones valid + for the Erasure coded backend) will support local rollback. In + those cases, PGLog can avoid adding objects to the missing set when + identifying divergent objects. + +Peering and PG Logs +------------------- + +Currently, we select the log with the newest last_update and the +longest tail to be the authoritative log. This is fine because we +aren't generally able to roll operations on the other replicas forward +or backwards, instead relying on our ability to re-replicate divergent +objects. With the write approach discussed in the previous section, +however, the erasure coded backend will rely on being able to roll +back divergent operations since we may not be able to re-replicate +divergent objects. Thus, we must choose the *oldest* last_update from +the last interval which went active in order to minimize the number of +divergent objects. + +The dificulty is that the current code assumes that as long as it has +an info from at least 1 osd from the prior interval, it can complete +peering. In order to ensure that we do not end up with an +unrecoverably divergent object, an M+K erasure coded PG must hear from at +least M of the replicas of the last interval to serve writes. This ensures +that we will select a last_update old enough to roll back at least M +replicas. If a replica with an older last_update comes along later, +we will be able to provide at least M chunks of any divergent object. + +Core Changes: + +- `PG::choose_acting(), etc. need to be generalized to use PGBackend + <http://tracker.ceph.com/issues/5860>`_ to determine the + authoritative log. +- `PG::RecoveryState::GetInfo needs to use PGBackend + <http://tracker.ceph.com/issues/5859>`_ to determine whether it has + enough infos to continue with authoritative log selection. + +PGBackend interfaces: + +- have_enough_infos() +- choose_acting() + +PGTemp +------ + +Currently, an osd is able to request a temp acting set mapping in +order to allow an up-to-date osd to serve requests while a new primary +is backfilled (and for other reasons). An erasure coded pg needs to +be able to designate a primary for these reasons without putting it +in the first position of the acting set. It also needs to be able +to leave holes in the requested acting set. + +Core Changes: + +- OSDMap::pg_to_*_osds needs to separately return a primary. For most + cases, this can continue to be acting[0]. +- MOSDPGTemp (and related OSD structures) needs to be able to specify + a primary as well as an acting set. +- Much of the existing code base assumes that acting[0] is the primary + and that all elements of acting are valid. This needs to be cleaned + up since the acting set may contain holes. 
+ +Client Reads +------------ + +Reads with the replicated strategy can always be satisfied +syncronously out of the primary osd. With an erasure coded strategy, +the primary will need to request data from some number of replicas in +order to satisfy a read. The perform_read() interface for PGBackend +therefore will be async. + +PGBackend interfaces: + +- perform_read(): as with perform_write() it seems simplest to pass + the ops vector. The call to oncomplete will occur once the out_bls + have been appropriately filled in. + +Distinguished acting set positions +---------------------------------- + +With the replicated strategy, all replicas of a PG are +interchangeable. With erasure coding, different positions in the +acting set have different pieces of the erasure coding scheme and are +not interchangeable. Worse, crush might cause chunk 2 to be written +to an osd which happens already to contain an (old) copy of chunk 4. +This means that the OSD and PG messages need to work in terms of a +type like pair<chunk_id_t, pg_t> in order to distinguish different pg +chunks on a single OSD. + +Because the mapping of object name to object in the filestore must +be 1-to-1, we must ensure that the objects in chunk 2 and the objects +in chunk 4 have different names. To that end, the filestore must +include the chunk id in the object key. + +Core changes: + +- The filestore `vhobject_t needs to also include a chunk id + <http://tracker.ceph.com/issues/5862>`_ making it more like + tuple<hobject_t, version_t, chunk_id_t>. +- coll_t needs to include a chunk_id_t. +- The `OSD pg_map and similar pg mappings need to work in terms of a + cpg_t <http://tracker.ceph.com/issues/5863>`_ (essentially + pair<pg_t, chunk_id_t>). Similarly, pg->pg messages need to include + a chunk_id_t +- For client->PG messages, the OSD will need a way to know which PG + chunk should get the message since the OSD may contain both a + primary and non-primary chunk for the same pg + +Object Classes +-------------- + +We probably won't support object classes at first on Erasure coded +backends. + +Scrub +----- + +We currently have two scrub modes with different default frequencies: + +1. [shallow] scrub: compares the set of objects and metadata, but not + the contents +2. deep scrub: compares the set of objects, metadata, and a crc32 of + the object contents (including omap) + +The primary requests a scrubmap from each replica for a particular +range of objects. The replica fills out this scrubmap for the range +of objects including, if the scrub is deep, a crc32 of the contents of +each object. The primary gathers these scrubmaps from each replica +and performs a comparison identifying inconsistent objects. + +Most of this can work essentially unchanged with erasure coded PG with +the caveat that the PGBackend implementation must be in charge of +actually doing the scan, and that the PGBackend implementation should +be able to attach arbitrary information to allow PGBackend on the +primary to scrub PGBackend specific metadata. + +The main catch, however, for erasure coded PG is that sending a crc32 +of the stored chunk on a replica isn't particularly helpful since the +chunks on different replicas presumably store different data. Because +we don't support overwrites except via DELETE, however, we have the +option of maintaining a crc32 on each chunk through each append. +Thus, each replica instead simply computes a crc32 of its own stored +chunk and compares it with the locally stored checksum. 
The replica +then reports to the primary whether the checksums match. + +`PGBackend interfaces <http://tracker.ceph.com/issues/5861>`_: + +- scan() +- scrub() +- compare_scrub_maps() + +Crush +----- + +If crush is unable to generate a replacement for a down member of an +acting set, the acting set should have a hole at that position rather +than shifting the other elements of the acting set out of position. + +Core changes: + +- Ensure that crush behaves as above for INDEP. + +`Recovery <http://tracker.ceph.com/issues/5857>`_ +-------- + +The logic for recovering an object depends on the backend. With +the current replicated strategy, we first pull the object replica +to the primary and then concurrently push it out to the replicas. +With the erasure coded strategy, we probably want to read the +minimum number of replica chunks required to reconstruct the object +and push out the replacement chunks concurrently. + +Another difference is that objects in erasure coded pg may be +unrecoverable without being unfound. The "unfound" concept +should probably then be renamed to unrecoverable. Also, the +PGBackend impementation will have to be able to direct the search +for pg replicas with unrecoverable object chunks and to be able +to determine whether a particular object is recoverable. + +Core changes: + +- s/unfound/unrecoverable + +PGBackend interfaces: + +- might_have_unrecoverable() +- recoverable() +- recover_object() + +`Backfill <http://tracker.ceph.com/issues/5856>`_ +-------- + +For the most part, backfill itself should behave similarly between +replicated and erasure coded pools with a few exceptions: + +1. We probably want to be able to backfill multiple osds concurrently + with an erasure coded pool in order to cut down on the read + overhead. +2. We probably want to avoid having to place the backfill peers in the + acting set for an erasure coded pg because we might have a good + temporary pg chunk for that acting set slot. + +For 2, we don't really need to place the backfill peer in the acting +set for replicated PGs anyway. +For 1, PGBackend::choose_backfill() should determine which osds are +backfilled in a particular interval. + +Core changes: + +- Backfill should be capable of `handling multiple backfill peers + concurrently <http://tracker.ceph.com/issues/5858>`_ even for + replicated pgs (easier to test for now) +- `Backfill peers should not be placed in the acting set + <http://tracker.ceph.com/issues/5855>`_. + +PGBackend interfaces: + +- choose_backfill(): allows the implementation to determine which osds + should be backfilled in a particular interval. |
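
The Client Writes section of the proposal (present both in the removed erasure_coding.rst and in the re-added pgbackend.rst) rests on making every supported operation locally roll-back-able, for example undoing an APPEND by recording the previous object size in the PG log event. The following C++ sketch only illustrates that idea; the names (LogEntry, Object, roll_back) are invented for this example and are not the PGBackend.h interface shown in the diff::

    // Hypothetical sketch of locally roll-back-able PG log entries, as argued
    // in the "Client Writes" section: each entry carries enough prior state to
    // be undone on a single shard without reading data from other shards.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>

    enum class Op { Append, Delete, SetAttr, RmAttr };

    struct LogEntry {
      Op op;
      uint64_t prev_size = 0;                // Append: object size before the append
      std::string attr_name;                 // (Set|Rm)Attr: which attr was touched
      std::optional<std::string> prev_attr;  // (Set|Rm)Attr: prior value, if any
    };

    struct Object {
      uint64_t size = 0;
      bool deleted = false;                  // Delete keeps the old version on disk
      std::map<std::string, std::string> attrs;
    };

    // Undo one divergent log entry locally.
    void roll_back(Object &obj, const LogEntry &e) {
      switch (e.op) {
      case Op::Append:
        obj.size = e.prev_size;              // truncate back to the pre-append size
        break;
      case Op::Delete:
        obj.deleted = false;                 // resurrect the retained old version
        break;
      case Op::SetAttr:
      case Op::RmAttr:
        if (e.prev_attr)
          obj.attrs[e.attr_name] = *e.prev_attr;  // restore the prior value
        else
          obj.attrs.erase(e.attr_name);           // attr did not exist before
        break;
      }
    }

    int main() {
      Object o;
      o.size = 4096;
      LogEntry append{Op::Append, /*prev_size=*/4096};
      o.size += 512;                          // apply the append
      roll_back(o, append);                   // found divergent during peering
      std::cout << "size after rollback: " << o.size << "\n";  // 4096
    }

Rolling back a DELETE the same way only works because the proposal keys objects in the filestore by the version at which they were created, so the old version is still on disk until all shards have logged the deletion.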
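
The Peering and PG Logs section reverses the usual rule for selecting the authoritative log: an M+K erasure coded PG must hear from at least M shards of the last active interval and then take the oldest last_update among them, so that divergent entries on the remaining shards can still be rolled back. A rough sketch of that selection follows; Version and choose_auth_last_update are invented stand-ins for Ceph's version type and peering machinery::

    // Illustration only: pick the authoritative last_update for an M+K
    // erasure coded PG.  Unlike replication (newest last_update, longest
    // tail), the proposal picks the *oldest* last_update seen, and refuses
    // to proceed until at least m shards from the prior interval reported.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <optional>

    struct Version {
      uint64_t epoch = 0;
      uint64_t v = 0;
    };

    bool operator<(const Version &a, const Version &b) {
      return a.epoch != b.epoch ? a.epoch < b.epoch : a.v < b.v;
    }

    std::optional<Version> choose_auth_last_update(
        const std::map<int, Version> &peer_last_update,  // osd id -> last_update
        unsigned m) {                                    // data chunks (M of M+K)
      if (peer_last_update.size() < m)
        return std::nullopt;                 // not enough infos: keep peering
      Version oldest = peer_last_update.begin()->second;
      for (const auto &kv : peer_last_update)
        oldest = std::min(oldest, kv.second);
      return oldest;                         // anything newer gets rolled back
    }

    int main() {
      std::map<int, Version> infos{
          {0, Version{10, 5}}, {3, Version{10, 7}}, {5, Version{10, 4}}};
      auto auth = choose_auth_last_update(infos, /*m=*/3);
      return auth ? 0 : 1;                   // here: oldest is 10'4 from osd 5
    }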
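
The Scrub section observes that, since erasure coded chunks are only ever extended by append, each shard can keep a running crc32 that is updated on every append and then checked against the locally stored chunk during deep scrub, so no chunk checksums need to be compared across shards. A minimal sketch of that bookkeeping using zlib's streaming crc32; ChunkShard and its members are invented for illustration::

    // Append-only chunk checksum maintenance, as described in the "Scrub"
    // section.  Build with -lz (zlib).
    #include <zlib.h>
    #include <cstdint>
    #include <iostream>
    #include <string>

    struct ChunkShard {
      std::string data;                              // chunk bytes stored on this shard
      uint32_t stored_crc = crc32(0L, nullptr, 0);   // running checksum kept alongside

      // Each append extends both the data and the running crc32; the chunk
      // never has to be re-read to keep the checksum current.
      void append(const std::string &buf) {
        stored_crc = crc32(stored_crc,
                           reinterpret_cast<const unsigned char *>(buf.data()),
                           buf.size());
        data += buf;
      }

      // Deep scrub on a shard: recompute the crc32 of the locally stored
      // chunk and report to the primary only whether it matches.
      bool deep_scrub_ok() const {
        uint32_t actual = crc32(0L, nullptr, 0);
        actual = crc32(actual,
                       reinterpret_cast<const unsigned char *>(data.data()),
                       data.size());
        return actual == stored_crc;
      }
    };

    int main() {
      ChunkShard shard;
      shard.append("hello ");
      shard.append("world");
      std::cout << (shard.deep_scrub_ok() ? "clean" : "inconsistent") << "\n";
      shard.data[0] = 'X';                           // simulate on-disk bit rot
      std::cout << (shard.deep_scrub_ok() ? "clean" : "inconsistent") << "\n";
    }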