authorSage Weil <sage@inktank.com>2013-08-28 10:29:17 -0700
committerSage Weil <sage@inktank.com>2013-08-28 10:29:17 -0700
commitcb2abad901bbc567ac313856a780e6e89be964fe (patch)
tree817f6da9a5f955df35e903de9fd847b9045ace9c
parentf271a73ca56a6328fd93b3a488f6ee2883bdc0e8 (diff)
parent14c31ddf1056e48d0361c9650c4e62d95603f1b8 (diff)
downloadceph-cb2abad901bbc567ac313856a780e6e89be964fe.tar.gz
Merge pull request #539 from dachary/master
doc : erasure code developer notes updates
-rw-r--r-- doc/dev/osd_internals/erasure_coding/developer_notes.rst | 277
-rw-r--r-- doc/dev/osd_internals/erasure_coding/pgbackend.rst       |   8
2 files changed, 184 insertions, 101 deletions
diff --git a/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
index d542fdb86e2..2bc796c67e5 100644
--- a/doc/dev/osd_internals/erasure_coding/developer_notes.rst
+++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
@@ -1,35 +1,42 @@
-============
-Erasure Code
-============
+============================
+Erasure Code developer notes
+============================
Introduction
------------
-An erasure coded pool only supports full writes, appends and read. It
-does not support snapshots or clone. An ErasureCodedPGBackend is derived
-from PGBackend.
+Each chapter of this document explains an aspect of the implementation
+of the erasure code within Ceph. It mostly relies on worked examples
+to demonstrate how things work. It is written as if the
+implementation were complete, although that is not yet the case: for
+instance, the plugin system and the jerasure plugin are implemented,
+but the erasure coded pool is not.
Reading and writing encoded chunks from and to OSDs
---------------------------------------------------
-An erasure coded pool stores each object as M+K chunks. It is divided
-into M data chunks and K parity chunks. The pool is configured to have
-a size of M+K so that each chunk is stored in an OSD in the acting
-set. The rank of the chunks is stored as an attribute of the object.
-An erasure coded pool is created to use five OSDs ( M+K = 5 ) and
-sustain the loss of two of them ( K = 2 ).
+An erasure coded pool stores each object as K+M chunks. It is divided
+into K data chunks and M coding chunks. The pool is configured to have
+a size of K+M so that each chunk is stored in an OSD in the acting
+set. The rank of the chunks is stored as `an attribute of the pool
+<http://tracker.ceph.com/issues/5862>`_ containing the object.
+
+For instance an erasure coded pool is created to use five OSDs ( K+M =
+5 ) and sustain the loss of two of them ( M = 2 ).
When the object *NYAN* containing *ABCDEFGHI* is written to it, the
erasure encoding function splits the content in three data chunks,
simply by dividing the content in three : the first contains *ABC*,
-the second *DEF* and the last *GHI*. The function also creates two
-parity chunks : the fourth with *YXY* and the fifth with *GQC*. Each
+the second *DEF* and the last *GHI*. The content will be padded if the
+content length is not a multiple of K. The function also creates two
+coding chunks : the fourth with *YXY* and the fifth with *GQC*. Each
chunk is stored in an OSD in the acting set. The chunks are stored in
objects that have the same name ( *NYAN* ) but reside on different
OSDs. The order in which the chunks were created must be preserved and
-is stored as an attribute of the object. The chunk *1* contains *ABC*
-and is stored on *OSD5*, the chunk *4* contains *XYY* and is stored on
-*OSD3*.
+is stored as an attribute of the pool containing the object. Chunk
+*1* contains *ABC* and is stored on *OSD5* while chunk *4* contains
+*XYY* and is stored on *OSD3*.
+
::
+-------------------+
@@ -49,7 +56,7 @@ and is stored on *OSD5*, the chunk *4* contains *XYY* and is stored on
+--v---+ +--v---+ +--v---+ +--v---+ +--v---+
name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+------+ +------+ +------+ +------+ +------+
- attribute | 1 | | 2 | | 3 | | 4 | | 5 |
+ pool shard | 1 | | 2 | | 3 | | 4 | | 5 |
+------+ +------+ +------+ +------+ +------+
content | ABC | | DEF | | GHI | | YXY | | QGC |
+--+---+ +--+---+ +--+---+ +--+---+ +--+---+
@@ -94,7 +101,7 @@ and its chunk was not taken into account.
|
+------+------+
| decode(3,2) |
- | erased 2,5 |
+ | erasures 2,5|
+-------------->| |
| +-------------+
| ^ ^
@@ -103,7 +110,7 @@ and its chunk was not taken into account.
+--+---+ +------+ +--+---+ +--+---+
name | NYAN | | NYAN | | NYAN | | NYAN |
+------+ +------+ +------+ +------+
- attribute | 1 | | 2 | | 3 | | 4 |
+ pool shard | 1 | | 2 | | 3 | | 4 |
+------+ +------+ +------+ +------+
content | ABC | | DEF | | GHI | | YXY |
+--+---+ +--+---+ +--+---+ +--+---+
@@ -128,11 +135,10 @@ and its chunk was not taken into account.
Interrupted full writes
-----------------------
-In an erasure coded pool the primary OSD is the first of the acting
-set and receives all write operations. It is responsible for encoding
-the payload into M+K chunks and send them to the OSDs in the acting
-set. It is also responsible for maintaining an authoritative version
-of the placement group logs.
+In an erasure coded pool the primary OSD in the up set receives all
+write operations. It is responsible for encoding the payload into K+M
+chunks and sending them to the OSDs in the up set. It is also
+responsible for maintaining an authoritative version of the placement
+group logs.
::
primary
@@ -152,11 +158,18 @@ of the placement group logs.
| log |
| |
|+----+ |
- ||P1v1| 1,1 |
+ ||C1v1| 1,1 |
|+----+ |
+-----------+
-An erasure coded placement group has been created with M = 2 + K = 1 and is supported by three OSDs, two for M and one for K. The acting set of the placement group is made of *OSD 1* *OSD 2* and *OSD 3*. An object has been encoded and stored in the OSDs : the chunk D1v1 (i.e. Data chunk number 1 version 1) is on *OSD 1*, D2v1 on *OSD 2* and P1v1 (i.e. Parity chunk number 1 version 1) on *OSD 3*. The placement group logs on each OSD are in synch at epoch 1 version 1 (i.e. 1,1).
+An erasure coded placement group has been created with K = 2 and M = 1
+and is supported by three OSDs, two for K and one for M. The acting
+set of the placement group is made of *OSD 1*, *OSD 2* and *OSD 3*. An
+object has been encoded and stored in the OSDs : the chunk D1v1
+(i.e. Data chunk number 1 version 1) is on *OSD 1*, D2v1 on *OSD 2*
+and C1v1 (i.e. Coding chunk number 1 version 1) on *OSD 3*. The
+placement group logs on each OSD are in sync at epoch 1 version 1
+(i.e. 1,1).
::
primary
@@ -175,15 +188,29 @@ An erasure coded placement group has been created with M = 2 + K = 1 and is supp
| +-----------+
| +---OSD 3---+
| |+----+ log |
- +---------->|P1v2| 1,2 |
+ +---------->|C1v2| 1,2 |
|+----+ |
|+----+ |
- ||P1v1| 1,1 |
+ ||C1v1| 1,1 |
|+----+ |
+-----------+
-*OSD 1* is the primary and receives a WRITE FULL from a client, meaning the payload is to replace the content of the object entirely, it is not a partial write that would only overwrite part of it. The version two of the object is created to override the version one. *OSD 1* encodes the payload into three chunks : D1v2 (i.e. Data chunk number 1 version 2) will be on *OSD 1*, D2v2 on *OSD 2* and P1v2 (i.e. Parity chunk number 1 version 2) on *OSD 3*. Each chunk is sent to the target OSD, including the primary OSD which is responsible for storing chunks in addition to handling write operations and maintaining an authoritative version of the placement group logs. When an OSD receives the message instructing it to write the chunk, it also creates a new entry in the placement group logs to reflect the change. For instance, as soon as *OSD 3* stores *P1v2*, it adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its logs. Because the OSDs work asynchronously, some chunks may still be in flight ( such as *D2v2* ) while others are acknowledged and on disk ( such as *P1v1* and *D1v1* ).
-::
+*OSD 1* is the primary and receives a WRITE FULL from a client, which
+means the payload is to replace the object entirely instead of only
+overwriting a portion of it. Version two of the object is created
+to override version one. *OSD 1* encodes the payload into three
+chunks : D1v2 (i.e. Data chunk number 1 version 2) will be on *OSD 1*,
+D2v2 on *OSD 2* and C1v2 (i.e. Coding chunk number 1 version 2) on
+*OSD 3*. Each chunk is sent to the target OSD, including the primary
+OSD which is responsible for storing chunks in addition to handling
+write operations and maintaining an authoritative version of the
+placement group logs. When an OSD receives the message instructing it
+to write the chunk, it also creates a new entry in the placement group
+logs to reflect the change. For instance, as soon as *OSD 3* stores
+*C1v2*, it adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its
+logs. Because the OSDs work asynchronously, some chunks may still be
+in flight ( such as *D2v2* ) while others are acknowledged and on disk
+( such as *C1v1* and *D1v1* ). ::
primary
+---OSD 1---+
@@ -204,14 +231,18 @@ An erasure coded placement group has been created with M = 2 + K = 1 and is supp
| +-----------+
| +---OSD 3---+
| |+----+ log |
- +---------->|P1v2| 1,2 |
+ +---------->|C1v2| 1,2 |
|+----+ |
|+----+ |
- ||P1v1| 1,1 |
+ ||C1v1| 1,1 |
|+----+ |
+-----------+
-If all goes well, the chunks are acknowledged on each OSD in the acting set and the *last_complete* pointer of the logs can move from *1,1* to *1,2* and the files used to store the chunks of the previous version of the object can be removed : *D1v1* on *OSD 1*, *D2v1* on *OSD 2* and *P1v1* on *OSD 3*.
+If all goes well, the chunks are acknowledged on each OSD in the
+acting set and the logs' *last_complete* pointer can move from
+*1,1* to *1,2* and the files used to store the chunks of the previous
+version of the object can be removed : *D1v1* on *OSD 1*, *D2v1* on
+*OSD 2* and *C1v1* on *OSD 3*.
::
+---OSD 1---+
@@ -226,10 +257,10 @@ If all goes well, the chunks are acknowledged on each OSD in the acting set and
+-----------+
+---OSD 3---+
|+----+ log |
- ||P1v2| 1,2 |
+ ||C1v2| 1,2 |
|+----+ |
|+----+ |
- ||P1V1| 1,1 |
+ ||C1V1| 1,1 |
|+----+ |
primary +-----------+
+---OSD 4---+
@@ -238,7 +269,15 @@ If all goes well, the chunks are acknowledged on each OSD in the acting set and
| |
+-----------+
-But accidents happen. If *OSD 1* goes down while *D2v2* is still in flight, the version 2 of the object is partially written : *OSD 3* has one chunk but does not have enough to recover. It lost two chunks : *D1v2* and *D2v2* but the erasure coding parameters M = 2 + K = 1 requires that at least two chunks are available to rebuild the third. *OSD 4* becomes the new primary and finds that the *last_complete* log entry ( i.e. all objects before this entry were known to be available on all OSDs in the previous acting set ) is *1,1* and will be the head of the new authoritative log.
+But accidents happen. If *OSD 1* goes down while *D2v2* is still in
+flight, the object's version 2 is partially written : *OSD 3* has
+one chunk but does not have enough to recover. It lost two chunks :
+*D1v2* and *D2v2* but the erasure coding parameters K = 2 and M = 1
+require that at least two chunks are available to rebuild the
+third. *OSD 4* becomes the new primary and finds that the
+*last_complete* log entry ( i.e. all objects before this entry were
+known to be available on all OSDs in the previous acting set ) is
+*1,1* and will be the head of the new authoritative log.
::
+---OSD 2---+
@@ -248,7 +287,7 @@ But accidents happen. If *OSD 1* goes down while *D2v2* is still in flight, the
+-----------+
+---OSD 3---+
|+----+ log |
- ||P1V1| 1,1 |
+ ||C1V1| 1,1 |
|+----+ |
primary +-----------+
+---OSD 4---+
@@ -257,7 +296,9 @@ But accidents happen. If *OSD 1* goes down while *D2v2* is still in flight, the
| |
+-----------+
-The log entry *1,2* found on *OSD 3* is divergent from the new authoritative log provided by *OSD 4* : it is discarded and the file containing the *P1v2* chunk is removed.
+The log entry *1,2* found on *OSD 3* is divergent from the new
+authoritative log provided by *OSD 4* : it is discarded and the file
+containing the *C1v2* chunk is removed.
::
+---OSD 2---+
@@ -267,7 +308,7 @@ The log entry *1,2* found on *OSD 3* is divergent from the new authoritative log
+-----------+
+---OSD 3---+
|+----+ log |
- ||P1V1| 1,1 |
+ ||C1V1| 1,1 |
|+----+ |
primary +-----------+
+---OSD 4---+
@@ -276,13 +317,20 @@ The log entry *1,2* found on *OSD 3* is divergent from the new authoritative log
|+----+ |
+-----------+
-The *D1v1* chunk is rebuilt with the *repair* function of the erasure coding library during scrubbing and stored on the new primary *OSD 4*.
+The *D1v1* chunk is rebuilt with the *decode* function of the erasure
+coding library during scrubbing and stored on the new primary *OSD 4*.
Interrupted append
------------------
-An object is coded in stripes as described above. In the case of a full write, and assuming the object size is not too large to encode it in memory, there is a single stripe. When appending to an existing object, the stripe size is retrieved from the attributes of the object and if the total size of the object is a multiple of the stripe size and the payload of the append message is lower or equal to the strip size, the following applies. It applies, for instance, when *rgw* writes an object with sequence of append instead of a single write.
-::
+An object is coded in stripes, either because it is too big or
+because it is created with multiple operations instead of a single
+full write. A single stripe exists in the case of a full write,
+assuming the object size is not too large to encode in memory.
+When appending to an existing object, the stripe size is retrieved
+from the attributes of the object. This applies, for instance, when
+*rgw* writes an object with a sequence of appends instead of a single
+write. ::
primary
+---OSD 1---+
@@ -299,13 +347,29 @@ An object is coded in stripes as described above. In the case of a full write, a
| +-----------+
| +---OSD 3---+
| |+-s3-+ log |
- +---------->|S1P1| 1,2 |
+ +---------->|S1C1| 1,2 |
||----| |
- ||S2P1| 1,1 |
+ ||S2C1| 1,1 |
|+----+ |
+-----------+
-*OSD 1* is the primary and receives an APPEND from a client, meaning the payload is to be appended at the end of the object. *OSD 1* encodes the payload into three chunks : S2D1 (i.e. Stripe two data chunk number 1 ) will be in s1 ( shard 1 ) on *OSD 1*, S2D2 in s2 on *OSD 2* and S2P1 (i.e. Stripe two parity chunk number 1 ) in s3 on *OSD 3*. Each chunk is sent to the target OSD, including the primary OSD which is responsible for storing chunks in addition to handling write operations and maintaining an authoritative version of the placement group logs. When an OSD receives the message instructing it to write the chunk, it also creates a new entry in the placement group logs to reflect the change. For instance, as soon as *OSD 3* stores *S2P1*, it adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its logs. The log entry also carries the nature of the operation: in this case 1,2 is an APPEND where 1,1 was a CREATE. Because the OSDs work asynchronously, some chunks may still be in flight ( such as *S2D2* ) while others are acknowledged and on disk ( such as *S2D1* and *S2P1* ).
+*OSD 1* is the primary and receives an APPEND from a client, meaning
+the payload is to be appended at the end of the object. *OSD 1*
+encodes the payload into three chunks : S2D1 (i.e. Stripe two data
+chunk number 1 ) will be in s1 ( shard 1 ) on *OSD 1*, S2D2 in s2 on
+*OSD 2* and S2C1 (i.e. Stripe two coding chunk number 1 ) in s3 on
+*OSD 3*. Each chunk is sent to the target OSD, including the primary
+OSD which is responsible for storing chunks in addition to handling
+write operations and maintaining an authoritative version of the
+placement group logs. When an OSD receives the message instructing it
+to write the chunk, it also creates a new entry in the placement group
+logs to reflect the change. For instance, as soon as *OSD 3* stores
+*S2C1*, it adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its
+logs. The log entry also carries the nature of the operation: in this
+case 1,2 is an APPEND where 1,1 was a CREATE. Because the OSDs work
+asynchronously, some chunks may still be in flight ( such as *S2D2* )
+while others are acknowledged and on disk ( such as *S2D1* and *S2C1*
+).
::
+---OSD 1---+
@@ -320,9 +384,9 @@ An object is coded in stripes as described above. In the case of a full write, a
+-----------+
+---OSD 3---+
|+-s3-+ log |
- ||S1P1| 1,2 |
+ ||S1C1| 1,2 |
||----| |
- ||S2P1| 1,1 |
+ ||S2C1| 1,1 |
|+----+ |
primary +-----------+
+---OSD 4---+
@@ -331,8 +395,15 @@ An object is coded in stripes as described above. In the case of a full write, a
| |
+-----------+
-If *OSD 1* goes down while *S2D2* is still in flight, the payload is partially appended : s3 ( shard 3) in *OSD 3* has one chunk but does not have enough to recover because s1 and s2 don't have it. It lost two chunks : *S2D1* and *S2D2* but the erasure coding parameters M = 2 + K = 1 requires that at least two chunks are available to rebuild the third. *OSD 4* becomes the new primary and finds that the *last_complete* log entry ( i.e. all objects before this entry were known to be available on all OSDs in the previous acting set ) is *1,1* and will be the head of the new authoritative log.
-::
+If *OSD 1* goes down while *S2D2* is still in flight, the payload is
+partially appended : s3 ( shard 3) in *OSD 3* has one chunk but does
+not have enough to recover because s1 and s2 don't have it. Two chunks
+were lost ( *S2D1* and *S2D2* ) but the erasure coding parameters
+K = 2 and M = 1 require that at least two chunks are available to
+rebuild the
+third. *OSD 4* becomes the new primary and finds that the
+*last_complete* log entry ( i.e. all objects before this entry were
+known to be available on all OSDs in the previous acting set ) is
+*1,1* and will be the head of the new authoritative log. ::
+---OSD 2---+
|+-s2-+ log |
@@ -341,7 +412,7 @@ If *OSD 1* goes down while *S2D2* is still in flight, the payload is partially a
+-----------+
+---OSD 3---+
|+-s3-+ log |
- ||S1P1| 1,1 |
+ ||S1C1| 1,1 |
|+----+ |
primary +-----------+
+---OSD 4---+
@@ -350,26 +421,30 @@ If *OSD 1* goes down while *S2D2* is still in flight, the payload is partially a
| |
+-----------+
-The log entry *1,2* found on *OSD 3* is divergent from the new authoritative log provided by *OSD 4* : it is discarded and the file containing the *S2P1* chunk is truncated to the nearest multiple of the stripe size.
+The log entry *1,2* found on *OSD 3* is divergent from the new
+authoritative log provided by *OSD 4* : it is discarded and the file
+containing the *S2C1* chunk is truncated to the nearest multiple of
+the stripe size.
Erasure code library
--------------------
-See also `the corresponding tracker issue <http://tracker.ceph.com/issues/5878>`_
+See also `the corresponding tracker issue <http://tracker.ceph.com/issues/5877>`_
Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
-with parameters M+K object O is encoded by dividing it into chunks O1,
-O2, ... OM and computing parity chunks P1, P2, ... PK. Any M chunks
-out of the available M+K chunks can be used to obtain the original
-object. If data chunk O2 or parity chunk P2 are lost, they can be
-repaired using any M chunks out of the M+K chunks. If more than K
+with parameters K+M, object O is encoded by dividing it into chunks O1,
+O2, ... OK and computing coding chunks P1, P2, ... PM. Any K chunks
+out of the available K+M chunks can be used to obtain the original
+object. If data chunk O2 or coding chunk P2 are lost, they can be
+repaired using any K chunks out of the K+M chunks. If more than M
chunks are lost, it is not possible to recover the object.
Reading the original content of object O could be a simple
-concatenation of O1, O2, ... OM, if using `systematic codes
-<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the
-chunks must be given to the erasure code library to retrieve the
-content of the object.
+concatenation of O1, O2, ... OK, because the plugins use
+`systematic codes
+<http://en.wikipedia.org/wiki/Systematic_code>`_. With a
+non-systematic code the chunks would have to be given to the erasure
+code library to retrieve the content of the object.
Reed-Solomon is significantly more expensive to encode than fountain
codes with the current `jerasure implementation
@@ -381,14 +456,16 @@ the difference becomes negligible. The difference is even more
important when an object is divided in hundreds or more chunks, but
Ceph will typically be used with less than 32 chunks.
-Performances depend on the parameters to the Reed-Solomon functions
-but they are also influenced by the buffer sizes used when calling
-the encoding functions: smaller buffers will mean more calls and more
-overhead.
+Performance depends on the parameters to the encoding functions and
+is also influenced by the packet sizes used when calling the encoding
+functions ( for Cauchy or Liberation for instance ): smaller packets
+mean more calls and more overhead.
Although Reed-Solomon is provided as a default, Ceph uses it via an
-abstract API designed to allow each pool to choose the plugin that
-implements it.
+`abstract API <http://tracker.ceph.com/issues/5878>`_ designed to
+allow each pool to choose the plugin that implements it using
+`key=value pairs when creating the pool
+<http://tracker.ceph.com/issues/6113>`_.
::
ceph osd pool create <pool> \
@@ -397,23 +474,24 @@ implements it.
The *<plugin>* is dynamically loaded from *<dir>* (defaults to
*/usr/lib/ceph/erasure-code* ) and expected to implement the
-*void __erasure_code_init(char *plugin_name)* function
+*int __erasure_code_init(char *plugin_name)* function
which is responsible for registering an object derived from
-*ErasureCodePlugin* in the registry singleton :
+*ErasureCodePlugin* in the registry :
::
- registry.plugins[plugin_name] = new ErasureCodePluginExample();
+ ErasureCodePluginRegistry::add(plugin_name,
+ new ErasureCodePluginExample());
The *ErasureCodePlugin* derived object must provide a factory method
from which the concrete implementation of the *ErasureCodeInterface*
object can be generated:
::
- virtual int factory(ErasureCodeInterfaceRef *erasure_code,
- const map<std::string,std::string> &parameters) {
+ virtual int factory(const map<std::string,std::string> &parameters,
+ ErasureCodeInterfaceRef *erasure_code) {
*erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
return 0;
- }
+ }
The *parameters* is the list of *key=value* pairs that were set when the pool
was created. Each *key* must be prefixed with erasure-code to avoid name collisions
@@ -424,12 +502,7 @@ was created. Each *key* must be prefixed with erasure-code to avoid name collisi
erasure-code-plugin=jerasure \ # mandatory
erasure-code-m=10 \ # optional and plugin dependant
erasure-code-k=3 \ # optional and plugin dependant
- erasure-code-algorithm=Reed-Solomon \ # optional and plugin dependant
-
-Erasure code library abstract API
----------------------------------
-
- .. doxygenfile:: ErasureCodeInterface.h
+ erasure-code-technique=reed_sol_van \ # optional and plugin dependant
Erasure code jerasure plugin
----------------------------
@@ -440,31 +513,36 @@ The parameters interpreted by the jerasure plugin are:
ceph osd pool create <pool> \
erasure-code-directory=<dir> \ # plugin directory absolute path
erasure-code-plugin=jerasure \ # plugin name (only jerasure)
- erasure-code-m=<m> \ # data chunks (default 10)
- erasure-code-k=<k> \ # parity chunks (default 3)
- erasure-code-algorithm=Reed-Solomon \ # algorithm (only Reed-Solomon)
+ erasure-code-k=<k> \ # data chunks (default 2)
+ erasure-code-m=<m> \ # coding chunks (default 2)
+ erasure-code-technique=<technique> \ # coding technique
+The coding techniques can be chosen among *reed_sol_van*,
+*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*,
+*blaum_roth* and *liber8tion*.
Scrubbing
---------
+See also `Refactor scrub to use PGBackend methods <http://tracker.ceph.com/issues/5861>`_
+
The simplest form of scrubbing is to check with each OSD holding a
-chunk if it exists locally. If more thank K chunks are missing the
-object is marked as lost. If up to K chunks are missing they are
+chunk if it exists locally. If more than M chunks are missing the
+object is marked as lost. If up to M chunks are missing they are
repaired and written to the relevant OSDs.
-From time to time it may make sense to attempt to read and object,
+From time to time it may make sense to attempt to read an object,
using all of its chunks. If the decode function fails, the object is
lost.
Bit flips happen. Not often, but it is possible. Here is `an article
from 2011 <http://www.linux-mag.com/id/8794/>`_ also search for "bit
rot" and "bit error rate". To detect corrupted chunks, a checksum
-(CRC23C for instance) should be added as an attribute of the file
-containing the chunk so that deep scrubbing can check that the chunk
-is valid by recomputing the content of the chunk and compare it with
-the signature. BTRFS and ZFS have a CRC32C check built-in on a per
-block basis.
+(CRC32C for instance) must be added as an attribute of the file
+containing the chunk ( or shard ) so that deep scrubbing can check
+that the chunk is valid by recomputing the content of the chunk and
+comparing it with the signature. BTRFS and ZFS have a CRC32C check
+built-in on a per block basis.
Notes
-----
@@ -496,14 +574,20 @@ Partial reads and writes
If an object is large, reading or writing all of it when changing only
a few bytes is expensive. It is more efficient to only read or write a
subset of the object. When a client writes on an existing object, it
-can provide the offset and the length of the write as well as the payload with the `CEPH_OSD_OP_WRITE <https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.cc#L2542>`_ operation. It is refered to as *partial write* and is different from the `CEPH_OSD_OP_WRITEFULL operation <https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.cc#L2552>`_ which writes the entire object at once.
+can provide the offset and the length of the write as well as the
+payload with the `CEPH_OSD_OP_WRITE
+<https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.cc#L2542>`_
+operation. It is referred to as *partial write* and is different from
+the `CEPH_OSD_OP_WRITEFULL operation
+<https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.cc#L2552>`_
+which writes the entire object at once.
When using replicas for partial writes or reads, the primary OSD
translates them into read(2) and write(2) POSIX system calls. When
writing, it then forwards the CEPH_OSD_OP_WRITE message to the
replicas and waits for them to acknowledge they are done.
-When reading erasure coded objects, at least K chunks must be read and
+When reading erasure coded objects, at least K chunks must be read and
decoded to extract the desired bytes. If a `systematic code
<https://en.wikipedia.org/wiki/Systematic_code>`_ is used ( i.e. the
data chunks are readable by simple concatenation ) read can be
@@ -513,10 +597,10 @@ the erasure decoding function only if a chunk is missing.
When writing an erasure coded object, changing even one byte requires
that it is encoded again in full.
-If Ceph is only used thru the radosgw or librbd, objects will mostly
-have the same size. The radosgw user may upload a 1GB object, it will
+If Ceph is only used through the *radosgw* or *librbd*, objects will
+mostly have the same size. The *radosgw* user may upload a 1GB object,
which will
be divided into smaller 4MB objects behind the scenes ( or whatever is
-set with rgw obj stripe size ). If a KVM is attached a 10GB RBD block
+set with *rgw obj stripe size* ). If a KVM is attached to a 10GB RBD
+block
device, it will also be divided into smaller 4MB objects ( or whatever
size is given to the --stripe-unit argument when creating the RBD
block ). In both cases, writing one byte at the beginning will only
@@ -547,4 +631,3 @@ OSDs 5, 8 and 9, stripe 1 is also encoded into 3 chunks on the same
OSDs. The size of a stripe is stored as an attribute of the object.
When writing one byte at offset N, instead of re-encoding the whole
object it is enough to re-encode the stripe that contains it.
-
diff --git a/doc/dev/osd_internals/erasure_coding/pgbackend.rst b/doc/dev/osd_internals/erasure_coding/pgbackend.rst
index 9e3fcb2bf86..4519ee34f9c 100644
--- a/doc/dev/osd_internals/erasure_coding/pgbackend.rst
+++ b/doc/dev/osd_internals/erasure_coding/pgbackend.rst
@@ -114,11 +114,11 @@ divergent objects.
The difficulty is that the current code assumes that as long as it has
an info from at least 1 osd from the prior interval, it can complete
peering. In order to ensure that we do not end up with an
-unrecoverably divergent object, an M+K erasure coded PG must hear from at
-least M of the replicas of the last interval to serve writes. This ensures
-that we will select a last_update old enough to roll back at least M
+unrecoverably divergent object, a K+M erasure coded PG must hear from at
+least K of the replicas of the last interval to serve writes. This ensures
+that we will select a last_update old enough to roll back at least K
replicas. If a replica with an older last_update comes along later,
-we will be able to provide at least M chunks of any divergent object.
+we will be able to provide at least K chunks of any divergent object.
Core Changes: