author | Samuel Merritt <sam@swiftstack.com> | 2014-10-22 13:18:34 -0700 |
---|---|---|
committer | Clay Gerrard <clay.gerrard@gmail.com> | 2015-04-14 00:52:17 -0700 |
commit | decbcd24d41d6367901db16aaa2578f74870b6b5 (patch) | |
tree | 15eaa73f3936610fe14fdff8429ff2cfa8356376 /swift/common/storage_policy.py | |
parent | b1eda4aef8a228961d5aafe7e4fbd4e812d233ad (diff) | |
download | swift-decbcd24d41d6367901db16aaa2578f74870b6b5.tar.gz |
Foundational support for PUT and GET of erasure-coded objects
This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.
This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.
    +====================================================================+
    |                            object data                             |
    +====================================================================+
                                       |
              +------------------------+----------------------+
              |                        |                      |
              v                        v                      v
    +===================+    +===================+ ... +==============+
    |     segment 1     |    |     segment 2     | ... |  segment N   |
    +===================+    +===================+ ... +==============+
              |                        |
              |                        |
              v                        v
         /=========\              /=========\
         | pyeclib |              | pyeclib |     ...
         \=========/              \=========/
              |                        |
              |                        |
              +--> fragment A-1        +--> fragment A-2
              |                        |
              |                        |
              |                        |
              |                        |
              |                        |
              +--> fragment B-1        +--> fragment B-2
              |                        |
              |                        |
             ...                      ...
Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment archive"):
    +=====================================================================+
    |     fragment A-1     |     fragment A-2     |  ...  |  fragment A-N |
    +=====================================================================+
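The bookkeeping above can be sketched in a few lines. This is a minimal
illustration only: toy_encode is a hypothetical stand-in (plain striping plus
one XOR parity fragment), not pyeclib, and build_fragment_archives is a name
invented for this example. Real fragments also carry a header, which is
omitted here.

```python
SEGMENT_SIZE = 1024 * 1024  # 1 MiB, the segment size given in the commit message


def toy_encode(segment, ndata):
    """Hypothetical stand-in for an erasure coder (NOT pyeclib).

    Splits the segment into `ndata` equal-length, zero-padded data
    fragments and appends a single XOR parity fragment.
    """
    frag_len = -(-len(segment) // ndata)  # ceiling division
    padded = segment.ljust(frag_len * ndata, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(ndata)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return frags + [bytes(parity)]


def build_fragment_archives(data, ndata, segment_size=SEGMENT_SIZE):
    """Segment the object, encode each segment in turn, and append the
    i-th fragment of every segment to the i-th fragment archive."""
    archives = [bytearray() for _ in range(ndata + 1)]
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        for archive, fragment in zip(archives, toy_encode(segment, ndata)):
            archive += fragment  # in-place append to the bytearray
    return [bytes(a) for a in archives]
```

With a real coder the only change would be the encode call; the loop that
routes fragment i of each segment to archive i stays the same.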
Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.
First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.
Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.
Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.
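The three checks above amount to keeping one running MD5 over the raw object
and one per fragment archive. The sketch below assumes the proxy already has
each segment and its fragments in hand; put_checksums and check_client_etag
are hypothetical names for this example, and the MIME-PUT transport itself is
not shown.

```python
import hashlib


def put_checksums(segments_and_fragments):
    """Consume (segment, [fragment, ...]) pairs in order.

    Returns (object_etag, [fragment_archive_md5, ...]): the first is the
    MD5 of the data as the client sent it, the rest are the MD5s of the
    bytes each object server will end up storing.
    """
    object_md5 = hashlib.md5()
    archive_md5s = None
    for segment, fragments in segments_and_fragments:
        object_md5.update(segment)
        if archive_md5s is None:
            archive_md5s = [hashlib.md5() for _ in fragments]
        for hasher, fragment in zip(archive_md5s, fragments):
            hasher.update(fragment)
    return (object_md5.hexdigest(),
            [h.hexdigest() for h in (archive_md5s or [])])


def check_client_etag(client_etag, computed_etag):
    """The proxy can only verify when the client supplied an Etag."""
    return client_etag is None or client_etag == computed_etag
```

Both values are only known once the whole body has been read, which is why
they have to travel after the object body.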
On the GET path, the inverse happens: the proxy connects to M object
servers (for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.
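A minimal sketch of that read loop follows. It assumes every fragment archive
is already available as a bytes object and that fragments are plain slices of
the segment, so the hypothetical toy_decode can simply join them; a real
decoder (and real fragment headers) would replace it. iter_segments is a name
invented for this example.

```python
def toy_decode(fragments):
    """Hypothetical stand-in for a real decoder: these toy fragments are
    plain consecutive slices of the segment, so decoding is a join."""
    return b"".join(fragments)


def iter_segments(archives, fragment_size, decode=toy_decode):
    """Yield decoded segments, reading one fragment at a time from each
    fragment archive. The final fragment may be shorter than
    fragment_size; slicing handles that without special casing."""
    length = max(len(a) for a in archives)
    offset = 0
    while offset < length:
        fragments = [a[offset:offset + fragment_size] for a in archives]
        yield decode(fragments)
        offset += fragment_size
```

Because each step needs only one fragment per archive, the proxy can serve a
segment to the client before reading the next one.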
When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.
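The wind-back is plain modular arithmetic on the number of bytes received
from the failed server. The helper names below are hypothetical and the
sketch ignores how the retry request itself is issued.

```python
def resume_offset(bytes_received, fragment_size):
    """Offset within the fragment archive at which to resume: the
    nearest fragment boundary at or before the point of failure."""
    return bytes_received - (bytes_received % fragment_size)


def discard_partial_fragment(buffered, fragment_size):
    """Drop any partially-fetched trailing fragment from the buffer and
    return (complete_fragments, offset_to_request_next)."""
    keep = resume_offset(len(buffered), fragment_size)
    return buffered[:keep], keep
```

For example, with a fragment size of 1024 bytes and 2500 bytes received, the
trailing 452 bytes are dropped and the next server is asked for data starting
at offset 2048.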
GET requests for a single byterange work; GET requests for multiple
byteranges do not.
There are a number of things _not_ included in this commit. Some of
them are listed here:
* multi-range GET
* deferred cleanup of old .data files
* durability (daemon to reconstruct missing archives)
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
Diffstat (limited to 'swift/common/storage_policy.py')
-rw-r--r-- | swift/common/storage_policy.py | 30 |
1 files changed, 30 insertions, 0 deletions
diff --git a/swift/common/storage_policy.py b/swift/common/storage_policy.py
index 23e52fc56..e45ab018c 100644
--- a/swift/common/storage_policy.py
+++ b/swift/common/storage_policy.py
@@ -356,6 +356,36 @@ class ECStoragePolicy(BaseStoragePolicy):
     def ec_segment_size(self):
         return self._ec_segment_size
 
+    @property
+    def fragment_size(self):
+        """
+        Maximum length of a fragment, including header.
+
+        NB: a fragment archive is a sequence of 0 or more max-length
+        fragments followed by one possibly-shorter fragment.
+        """
+        # Technically pyeclib's get_segment_info signature calls for
+        # (data_len, segment_size) but on a ranged GET we don't know the
+        # ec-content-length header before we need to compute where in the
+        # object we should request to align with the fragment size.  So we
+        # tell pyeclib a lie - from its perspective, as long as data_len >=
+        # segment_size it'll give us the answer we want.  From our
+        # perspective, because we only use this answer to calculate the
+        # *minimum* size we should read from an object body even if data_len <
+        # segment_size we'll still only read *the whole one and only last
+        # fragment* and pass that into pyeclib who will know what to do with
+        # it just as it always does when the last fragment is < fragment_size.
+        return self.pyeclib_driver.get_segment_info(
+            self.ec_segment_size, self.ec_segment_size)['fragment_size']
+
+    @property
+    def ec_scheme_description(self):
+        """
+        This short hand form of the important parts of the ec schema is stored
+        in Object System Metadata on the EC Fragment Archives for debugging.
+        """
+        return "%s %d+%d" % (self._ec_type, self._ec_ndata, self._ec_nparity)
+
     def __repr__(self):
         return ("%s, EC config(ec_type=%s, ec_segment_size=%d, "
                 "ec_ndata=%d, ec_nparity=%d)") % (
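The comment on fragment_size says the value is used to work out where in a
fragment archive to read so that a ranged GET lines up with fragment
boundaries. One way that alignment could be computed is sketched below. The
function name is hypothetical, it handles only a range with both ends given,
and it is an illustration of the arithmetic rather than the code this commit
adds elsewhere.

```python
def client_range_to_fragment_range(first, last, segment_size, fragment_size):
    """Map an inclusive client byte range onto the inclusive byte range
    to request from each fragment archive.

    The client range is widened to whole segments, then each segment
    index is scaled by fragment_size, since segment i corresponds to
    fragment i in every archive. The end of the result may run past the
    end of the archive when the last fragment is short; the reader then
    simply gets the whole, shorter last fragment.
    """
    first_segment = first // segment_size
    last_segment = last // segment_size
    frag_start = first_segment * fragment_size
    frag_end = (last_segment + 1) * fragment_size - 1
    return frag_start, frag_end
```

With a segment size of 100 and a fragment size of 30 (made-up numbers), client
bytes 150 to 250 touch segments 1 and 2, so each archive is asked for bytes 30
to 89.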