author | Yehuda Sadeh <yehuda@inktank.com> | 2012-12-07 13:26:26 -0800 |
---|---|---|
committer | Yehuda Sadeh <yehuda@inktank.com> | 2012-12-07 13:28:43 -0800 |
commit | 0792b86b982f9c3f6d366a69b35cea9f19982828 | |
tree | e2d68918512dea6fec4c6ace4ec42eba2a3db4a4 | |
parent | f81d7207663633d82ad591d438c5a7ddbee26ff3 | |
doc: first go at radosgw disaster recovery design
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
-rw-r--r-- | doc/dev/radosgw/dr.rst | 165 |
1 files changed, 165 insertions, 0 deletions
diff --git a/doc/dev/radosgw/dr.rst b/doc/dev/radosgw/dr.rst
new file mode 100644
index 00000000000..442f3871e4d
--- /dev/null
+++ b/doc/dev/radosgw/dr.rst
@@ -0,0 +1,165 @@
+============================
+Disaster Recovery Overview
+============================
+
+
+
+Design
+============
+
+The following discusses a disaster recovery implementation. A complete
+geographic replication solution could be implemented later on top of
+it, but will require some more design.
+
+* Primary, secondary clusters
+
+A secondary cluster follows a primary cluster. It is intended that
+clients will only be accessing the primary cluster. Read only access
+to the secondary will be possible, however, data in the secondary may
+be outdated. It is possible to switch primary and secondary settings
+of a cluster.
+
+The idea is to have a primary cluster, and one or more secondary
+clusters following it. Updates will be logged per bucket on the
+primary, and list of modified buckets will also be logged. Secondary
+clusters will poll list of modified buckets, and will then retrieve
+the changes per bucket. Changes will be applied on the secondary.
+
+As stated, this is a disaster recovery solution, and comes to provide
+safety net for complete data loss. The solution does not provide a
+complete data loss protection, as latest data that has not been
+transferred to the secondaries may be lost.
+
+
+Primary
+-------
+
+Bucket index log
+^^^^^^^^^^^^^^^^
+
+The 'bucket index log' will keep track of modifications made in the bucket (objects uploaded, deleted, modified).
+
+The following modifications will be made to the bucket index:
+
+* bucket index version
+
+The bucket index will keep an index version that will increase monotonically.
+
+* log every modify operation
+
+The bucket index will now log every modify operation. An additional
+index version entry will be added to each object entry in the bucket
+index.
+
+* list objects objclass operation returns more info
+
+list objects objclass operation will also return last index version,
+as well as the version of each object (the bucket index version when
+it was created). This will allow radosgw to retrieve the entire list
+of objects in a bucket in parts, and then retrieve all the changes
+that happened since starting the operation. It is required so that we
+could do a full sync of the bucket index.
+
+* operation to retrieve bucket index log entries
+
+A new objclass operation will retrieve bucket log index entries. It
+will get a starting event number, max entries.
+When requested with a
+
+* operation to trim bucket index log entries
+
+A new objclass operation to remove bucket log index entries. It will
+get a starting event number (0 - start from the beginning), last event
+number. It will not be able to remove more than predefined number of
+entries.
+
+
+Updated Buckets Log
+^^^^^^^^^^^^^^^^^^^
+
+A log that contains the list of modified buckets within a specific
+period. Log info may be spread across multiple objects. This will be
+done similarly to what we did with the usage info, and with the
+garbage collection. Each bucket's data will go to a specific log
+object (by hashing bucket name, modulo number of objects).
+The log will use omap to index entries by timestamp, and by a log id (monotonically increasing) and
+will be implemented as an objclass.
+
+
+* log resolution
+
+We'll define a log resolution period. In order to avoid sending extra write for updating this log
+for every modifications, we'll define a time length for which a
+log entry is valid. Any update that completes within that will only be reported once in the log.
+
+A bucket modification operation will not be allowed to complete before
+a log entry (with the bucket name) was appended to the bucket
+operations log within the past ttl (the cycle in which the operation
+completes). That means that the first write/modification to that
+bucket will have to send an append request. All bucket modification
+operations that happen before its completion (and within the same log
+cycle) will have to wait for it.
+
+The radosgw will hold a list of all the buckets that were updated in the past two cycles
+and every cycle will log these entries in the updated buckets log.
+
+
+Secondary
+---------
+
+* Bucket index log
+
+The bucket index log will also hold the last primary version.
+
+* Processing state
+
+The sync processing state will be kept in a log. This will include latest updated buckets log id that was processed successfully.
+
+* Full sync info
+
+A list that contains the names of the buckets that require a full sync. It will also be spread across multiple objects.
+
+Full Sync of System
+^^^^^^^^^^^^^^^^^^^
+
+Does the following::
+
+ - retrieve list of all buckets, update the full sync buckets list, start processing
+
+Processing a single 'full sync list' object::
+
+ - if successfully locked object then:
+   - (periodically, potentially in a different thread) renew lock
+   - for each bucket
+     - list objects (keep bucket index version retrieved on the first request to list objects).
+     - for each object we get object name, version (bucket version), tag
+     - read objects from primary (*), write them to local (secondary) cluster (keep object tag)
+     - when done, update local bucket index with the bucket index version retrieved from primary
+   - unlock object
+
+(*) we should decide what to do in the case of tag mismatch
+
+Continuous Update
+^^^^^^^^^^^^^^^^^
+
+A process that does the following::
+
+ - Try to set a lock on a updated buckets log
+ - if succeeded then:
+   - read next log entries (but never read entries newer than current time - updated buckets log ttl)
+   - for each bucket in log:
+     - fetch bucket index version from local (secondary) bucket index
+     - request a list of changes from remote (primary) bucket index, starting at the local bucket index version
+     - if successful (remote had the requested data)
+       - update local data
+     - if not successful
+       - add bucket to list of buckets requiring full sync
+   - renew lock until done, then release lock
+ - continue with the next log entry
+
+We still need to be able to fully sync buckets that need to catch-up. So also do the following (in parallel)::
+ - For each object in full sync list
+   - periodically check list of buckets requiring full sync
+   - if not empty:
+     - for each bucket: full sync bucket (as specified above), remove bucket from list
+
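
Reading aid (not part of the commit): the per-bucket step of "Continuous Update" in the patch could be sketched roughly as below. This is a minimal in-memory illustration, not the radosgw implementation. Every name in it (`BucketIndexLog`, `list_since`, `sync_bucket`, `log_shard`) is hypothetical, and the use of crc32 for "hashing bucket name" is an assumption, since the design text does not name a hash function. The sketch models a primary-side log with a monotonically increasing index version, a trim operation, and a secondary that falls back to the full sync list when the primary no longer holds the requested range.

```python
import zlib
from dataclasses import dataclass, field


def log_shard(bucket: str, num_shards: int) -> int:
    """Pick the updated-buckets log object for a bucket (hash modulo shard count).

    crc32 is an assumption; the design only says "hashing bucket name".
    """
    return zlib.crc32(bucket.encode("utf-8")) % num_shards


@dataclass
class BucketIndexLog:
    """Toy stand-in for the primary's bucket index log (hypothetical names)."""
    version: int = 0                              # monotonically increasing index version
    entries: list = field(default_factory=list)   # (version, operation, object name)

    def modify(self, op: str, obj: str) -> int:
        """Log one modify operation and return the new index version."""
        self.version += 1
        self.entries.append((self.version, op, obj))
        return self.version

    def trim(self, last: int) -> None:
        """Drop every log entry with a version up to and including `last`."""
        self.entries = [e for e in self.entries if e[0] > last]

    def list_since(self, start: int, max_entries: int):
        """Return entries newer than `start`, or None if part of that range was trimmed."""
        oldest = self.entries[0][0] if self.entries else self.version + 1
        if start + 1 < oldest:
            return None
        return [e for e in self.entries if e[0] > start][:max_entries]


def sync_bucket(name, primary, local_versions, full_sync, max_entries=1000):
    """One continuous-update step for a single bucket on the secondary."""
    local = local_versions.get(name, 0)
    changes = primary.list_since(local, max_entries)
    if changes is None:
        # The primary no longer has the requested data: schedule a full sync.
        full_sync.add(name)
        return False
    if changes:
        # A real implementation would copy or delete the objects here;
        # this sketch only advances the locally recorded primary version.
        local_versions[name] = changes[-1][0]
    return True
```

In this sketch a secondary that has applied version 3 and asks again after the primary trimmed entries up to version 4 gets `None` back, so the bucket lands in the full sync set instead of silently skipping the missing change.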