author Samuel Merritt <sam@swiftstack.com> 2014-10-22 13:18:34 -0700
committer Clay Gerrard <clay.gerrard@gmail.com> 2015-04-14 00:52:17 -0700
commit decbcd24d41d6367901db16aaa2578f74870b6b5
tree 15eaa73f3936610fe14fdff8429ff2cfa8356376 /swift/common/storage_policy.py
parent b1eda4aef8a228961d5aafe7e4fbd4e812d233ad
Foundational support for PUT and GET of erasure-coded objects
This commit makes it possible to PUT an object into Swift and have it stored using erasure coding instead of replication, and also to GET the object back from Swift at a later time.

This works by splitting the incoming object into a number of segments, erasure-coding each segment in turn to get fragments, then concatenating the fragments into fragment archives. Segments are 1 MiB in size, except the last, which is between 1 B and 1 MiB.

    +====================================================================+
    |                            object data                             |
    +====================================================================+
                                       |
              +------------------------+----------------------+
              |                        |                      |
              v                        v                      v
    +===================+    +===================+        +==============+
    |     segment 1     |    |     segment 2     |  ...   |  segment N   |
    +===================+    +===================+        +==============+
              |                        |
              |                        |
              v                        v
         /=========\              /=========\
         | pyeclib |              | pyeclib |         ...
         \=========/              \=========/
              |                        |
              |                        |
              +--> fragment A-1        +--> fragment A-2
              |                        |
              |                        |
              |                        |
              |                        |
              |                        |
              +--> fragment B-1        +--> fragment B-2
              |                        |
              |                        |
             ...                      ...

Then, object server A gets the concatenation of fragment A-1, A-2, ..., A-N, so its .data file looks like this (called a "fragment archive"):

    +=====================================================================+
    |  fragment A-1  |  fragment A-2  |      ...      |   fragment A-N    |
    +=====================================================================+

Since this means that the object server never sees the object data as the client sent it, we have to do a few things to ensure data integrity.

First, the proxy has to check the Etag if the client provided it; the object server can't do it since the object server doesn't see the raw data.

Second, if the client does not provide an Etag, the proxy computes it and uses the MIME-PUT mechanism to provide it to the object servers after the object body. Otherwise, the object would not have an Etag at all.

Third, the proxy computes the MD5 of each fragment archive and sends it to the object server using the MIME-PUT mechanism. With replicated objects, the proxy checks that the Etags from all the object servers match, and if they don't, returns a 500 to the client. This mitigates the risk of data corruption in one of the proxy --> object connections, and signals to the client when it happens. With EC objects, we can't use that same mechanism, so we must send the checksum with each fragment archive to get comparable protection.

On the GET path, the inverse happens: the proxy connects to a bunch of object servers (M of them, for an M+K scheme), reads one fragment at a time from each fragment archive, decodes those fragments into a segment, and serves the segment to the client.

When an object server dies partway through a GET response, any partially-fetched fragment is discarded, the resumption point is wound back to the nearest fragment boundary, and the GET is retried with the next object server.

GET requests for a single byterange work; GET requests for multiple byteranges do not.

There are a number of things _not_ included in this commit. Some of them are listed here:

 * multi-range GET
 * deferred cleanup of old .data files
 * durability (daemon to reconstruct missing archives)

Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>

Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
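
The segment, fragment, and fragment-archive layout described in the commit message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Swift's implementation: `toy_encode` is a hypothetical stand-in for pyeclib (one XOR parity fragment, no fragment headers), and `split_into_segments`, `build_fragment_archives`, and `checksums` are names invented here.

```python
import hashlib

SEGMENT_SIZE = 1024 * 1024  # 1 MiB, as stated in the commit message


def split_into_segments(data, segment_size=SEGMENT_SIZE):
    """Yield consecutive segments; only the last one may be shorter."""
    for offset in range(0, len(data), segment_size):
        yield data[offset:offset + segment_size]


def toy_encode(segment, ndata):
    """Hypothetical stand-in for pyeclib: ndata data fragments + 1 XOR parity.

    Real fragments also carry a header; this toy omits it.
    """
    frag_len = -(-len(segment) // ndata)  # ceiling division
    padded = segment.ljust(frag_len * ndata, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(ndata)]
    parity = 0
    for frag in frags:
        parity ^= int.from_bytes(frag, "big")
    return frags + [parity.to_bytes(frag_len, "big")]


def build_fragment_archives(data, ndata, segment_size=SEGMENT_SIZE):
    """Archive i is the concatenation of fragment i of every segment."""
    archives = [bytearray() for _ in range(ndata + 1)]
    for segment in split_into_segments(data, segment_size):
        for archive, fragment in zip(archives, toy_encode(segment, ndata)):
            archive.extend(fragment)
    return [bytes(a) for a in archives]


def checksums(data, archives):
    """Whole-object Etag plus one MD5 per fragment archive (proxy side)."""
    etag = hashlib.md5(data).hexdigest()
    return etag, [hashlib.md5(a).hexdigest() for a in archives]
```

Each returned archive corresponds to what one object server would store in its .data file. Both kinds of checksum are computed where the raw object bytes are available, which in the commit's design is the proxy.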
Diffstat (limited to 'swift/common/storage_policy.py')
-rw-r--r-- swift/common/storage_policy.py | 30
1 files changed, 30 insertions, 0 deletions
diff --git a/swift/common/storage_policy.py b/swift/common/storage_policy.py
index 23e52fc56..e45ab018c 100644
--- a/swift/common/storage_policy.py
+++ b/swift/common/storage_policy.py
@@ -356,6 +356,36 @@ class ECStoragePolicy(BaseStoragePolicy):
     def ec_segment_size(self):
         return self._ec_segment_size
 
+    @property
+    def fragment_size(self):
+        """
+        Maximum length of a fragment, including header.
+
+        NB: a fragment archive is a sequence of 0 or more max-length
+        fragments followed by one possibly-shorter fragment.
+        """
+        # Technically pyeclib's get_segment_info signature calls for
+        # (data_len, segment_size) but on a ranged GET we don't know the
+        # ec-content-length header before we need to compute where in the
+        # object we should request to align with the fragment size.  So we
+        # tell pyeclib a lie - from its perspective, as long as data_len >=
+        # segment_size it'll give us the answer we want.  From our
+        # perspective, because we only use this answer to calculate the
+        # *minimum* size we should read from an object body even if data_len <
+        # segment_size we'll still only read *the whole one and only last
+        # fragment* and pass that into pyeclib who will know what to do with
+        # it just as it always does when the last fragment is < fragment_size.
+        return self.pyeclib_driver.get_segment_info(
+            self.ec_segment_size, self.ec_segment_size)['fragment_size']
+
+    @property
+    def ec_scheme_description(self):
+        """
+        This shorthand form of the important parts of the ec schema is stored
+        in Object System Metadata on the EC Fragment Archives for debugging.
+        """
+        return "%s %d+%d" % (self._ec_type, self._ec_ndata, self._ec_nparity)
+
     def __repr__(self):
         return ("%s, EC config(ec_type=%s, ec_segment_size=%d, "
                 "ec_ndata=%d, ec_nparity=%d)") % (