path: root/src/librbd.cc
Commit message [Author, Age, Files, Lines]
* librbd, cls_rbd: close snapshot creation race with old format [Josh Durgin, 2012-09-12, 1 file, -1/+3]
| | | | | | | | | | | | | | | If two clients created a snapshot at the same time, the one with the higher snapshot id might be created first, so the lower snapshot id would be added to the snapshot context and the snapshot seq would be set to the lower one. Instead of allowing this to happen, return -ESTALE if the snapshot id is lower than the currently stored snapshot sequence number. On the client side, get a new id and retry if this error is encountered. Backport: argonaut Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
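The retry described in this commit could be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeHeader`, `add_snap_with_retry`, the id allocator callback, and the attempt limit are hypothetical names, not the actual librbd or cls_rbd code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the server-side state of an old-format header.
struct FakeHeader {
  uint64_t snap_seq = 0;  // highest snapshot id recorded so far

  // Reject ids that are lower than the stored sequence number.
  int snapshot_add(uint64_t snap_id) {
    if (snap_id < snap_seq)
      return -ESTALE;
    snap_seq = snap_id;
    return 0;
  }
};

// Client side: allocate an id, try to add the snapshot, and retry with a
// fresh id whenever the server reports -ESTALE. The attempt limit is an
// assumption made for this sketch.
int add_snap_with_retry(FakeHeader &hdr,
                        const std::function<uint64_t()> &alloc_snap_id,
                        int max_attempts = 8) {
  int r = -ESTALE;
  for (int i = 0; i < max_attempts && r == -ESTALE; ++i)
    r = hdr.snapshot_add(alloc_snap_id());
  return r;
}
```

Any error other than -ESTALE ends the loop immediately, so unrelated failures are still reported to the caller.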
* librbd: ignore -ENOENT during discard [Josh Durgin, 2012-09-10, 1 file, -2/+12]
| | | | | | This is a backport of a3ad98a3eef062e9ed51dd2d1e58c593e12c9703 Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: replace assign_bid with client id and random number [Josh Durgin, 2012-07-23, 1 file, -31/+10]
| | | | | | | | | | | | | | | The assign_bid method has issues with replay because it is a write that also returns data. This means that the replayed operation would return success, but no data, and cause a create to fail. Instead, let the client set the bid based on its global id and a random number. This only affects the creation of new images, since the bid is put into an opaque string as part of the object prefix. Keep the server side assign_bid around in case there are old clients still using it. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
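A client-side block id of this kind might be built along these lines. This is a minimal sketch: the function name `make_block_id`, the hex formatting, and the way the two parts are joined are assumptions, not taken from the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Build an opaque block id string from the client's global id and a random
// number supplied by the caller. Because the result is only used as an
// opaque part of the object prefix, the two parts are simply concatenated
// in hex (an assumption of this sketch).
std::string make_block_id(uint64_t client_global_id, uint32_t random_part) {
  std::ostringstream ss;
  ss << std::hex << client_global_id << random_part;
  return ss.str();
}
```

Since no server round trip returns data here, replaying the create operation cannot lose the id.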
* librbd: remove unnecessary notify from add_snap() [Josh Durgin, 2012-06-10, 1 file, -1/+0]
| | | | | | | The only caller, snapshot_add(), already does a notify when add_snap() succeeds. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: ignore RBD_MAX_BLOCK_NAME_SIZE when generating object ids [Josh Durgin, 2012-06-10, 1 file, -15/+16]
| | | | | | | | The actual data object ids don't need to be artificially restricted in length. RBD_MAX_BLOCK_NAME_SIZE just limits the size of the object prefix, since it's used in rbd_info_t. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: add create2 to create an image with the new format [Josh Durgin, 2012-06-09, 1 file, -26/+56]
| | | | | | | | | | This will fail if features are requested that the client or server does not support. Currently there are no features defined, so zero is the only valid value. copy() preserves the format and features of the source image. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: use ImageCtx members instead of the old header in resize() [Josh Durgin, 2012-06-08, 1 file, -2/+3]
| | | | Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: validate order before creating an image [Josh Durgin, 2012-06-08, 1 file, -0/+8]
| | | | | | | | The value must be passed, and it shouldn't be below 4k (enforced by the command line tool already) or above the range expressible in the header. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
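The check described here might look something like the sketch below. The function name, the error codes, and the default upper bound of 255 (assuming the header stores the order in a single byte) are assumptions for illustration; only the 4k lower bound comes from the commit message.

```cpp
#include <cassert>
#include <cerrno>

// Validate an image order (object size is 1 << order bytes).
// max_order defaults to 255 on the assumption that the header field is one
// byte wide; the error codes are likewise illustrative choices.
int validate_order(const int *order, int max_order = 255) {
  const int min_order = 12;  // 1 << 12 = 4096 bytes (4k)
  if (order == nullptr)
    return -EINVAL;          // the value must be passed
  if (*order < min_order || *order > max_order)
    return -EDOM;            // outside the representable range
  return 0;
}
```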
* librbd: rename md_oid parameters to header_oid [Josh Durgin, 2012-06-08, 1 file, -8/+9]
| | | | | | | This is more consistent with the rest of the code now, and is a bit more clear. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: make rename work with any header format [Josh Durgin, 2012-06-08, 1 file, -23/+55]
| | | | | | | | | | | Instead of interpreting the header, just copy all the data and omap values from the original header to the newly named one. This will continue working with future header changes. We can create the new header and write all data and omap values to it atomically to avoid some races. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: use cls_client functions for calling class methods [Josh Durgin, 2012-06-08, 1 file, -27/+38]
| | | | | | | Use the old or new methods to make resize, snapshot add and snapshot remove work with both old and new formats. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: remove on-disk header argument from helper functions [Josh Durgin, 2012-06-08, 1 file, -104/+78]
| | | | | | | | | Make most of them take the parameters they actually use. trim_image() now takes an ImageCtx, which means remove() must open the image. This has the nice side effect of not duplicating the snapshot listing code for the old format. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: check that the current snapid for a snap name matches [Josh Durgin, 2012-06-08, 1 file, -1/+1]
| | | | | | | | | | | | | | Checking that it exists doesn't prevent you from having the snapshot change out from under you in the following situation: You have the image open at snapshot "foo". Someone removes snapshot "foo", writes some data to the image, and creates a new snapshot called "foo". This second snapshot will have a different id, but nothing prevents it from having the name of a previously deleted snapshot. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
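The scenario above can be illustrated with a small sketch. The `SnapTable` type and `snap_still_valid` helper are hypothetical and only model the name-to-id lookup; they are not librbd's data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of the image's snapshot table: name -> snapshot id.
using SnapTable = std::map<std::string, uint64_t>;

// The snapshot we opened is still valid only if the name exists AND it
// still maps to the id we opened it with.
bool snap_still_valid(const SnapTable &snaps, const std::string &name,
                      uint64_t opened_id) {
  auto it = snaps.find(name);
  return it != snaps.end() && it->second == opened_id;
}
```

A check on the name alone would pass after "foo" is removed and recreated, even though the data behind it has changed.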
* librbd: update ictx_refresh to work with both formats [Josh Durgin, 2012-06-08, 1 file, -32/+60]
| | | | | | | | It now sets the member variables of ImageCtx so other functions don't have to use the on-disk header. If the features used by the new format are incompatible with this client, an error is returned. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: Update ImageCtx for new format [Josh Durgin, 2012-06-08, 1 file, -24/+87]
| | | | | | | | | | | | Detect the format when an image is opened by the presence of the original format header object. Use member variables of ImageCtx to store image metadata instead of the on-disk header format ImageCtx::header. This lays the foundation for changing the rest of librbd to work with old and new formats. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: remove useless ENOMEM checks [Josh Durgin, 2012-06-08, 1 file, -6/+0]
| | | | | | There will be an exception if memory can't be allocated. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: Simplify timing init [Dan Mick, 2012-05-30, 1 file, -19/+7]
| | | | | | Remove possibility of set_start_time before set_ictx error Signed-off-by: Dan Mick <dan.mick@inktank.com>
* librbd: Add latency (elapsed-time) stats for rbd operations [Dan Mick, 2012-05-30, 1 file, -5/+72]
| | | | | | | Fixes: #2408 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* librbd: check for cache flush errors [Josh Durgin, 2012-05-16, 1 file, -8/+11]
| | | | | | | | | | | | Return errors from flushing to the caller. Warn if an error occurs during invalidation, but don't retry, since the higher level handles these cases, namely: * rollback (doing this with an image open is asking for trouble) * shrink (doing this with writes in flight may create extra objects anyway) * shutdown (qemu flushes before closing the device) Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* objectcacher: make *_max_dirty_age tunables; pass to ctor [Sage Weil, 2012-05-08, 1 file, -4/+5]
| | | | | | This replaces the hard-coded 1 second writeback timer. Signed-off-by: Sage Weil <sage@newdream.net>
* objectcacher: make cache sizes explicit [Sage Weil, 2012-05-05, 1 file, -0/+3]
| | | | | | | | | | | | | | | | | Make ObjectCacher users specify the cache size for each ObjectCacher instance. This avoids the confusing config namespace for the object cache (client_oc_*), and also will make it possible to eventually have cache sizes that vary between (say) RBD images. - drop unused client_oc_max_sync_write - add rbd_cache_max_size, max_dirty, target_dirty config values (these are the defaults for each image) We probably want to add librbd calls to specify the cache size on a per-image basis? Alternatively, we should make it possible to share a cache pool between multiple images in some explicit way. Signed-off-by: Sage Weil <sage@newdream.net>
* objectcacher: wait directly from writex() [Sage Weil, 2012-05-04, 1 file, -2/+1]
| | | | | | | This gives us access to the original ObjectExtent (useful later), and simplifies the callers. Signed-off-by: Sage Weil <sage@newdream.net>
* objectcacher: don't wait for write waiters; wait after dirtying [Sage Weil, 2012-05-04, 1 file, -1/+1]
| | | | | | | | | | | | | | | | | | | | We do three things here: - Wait for the dirty limit to drop _after_ writing into the cache. This means that an active thread can always provide its dirty data to the cache for potential writing without waiting (a small win). It's also helpful later... (see below, and next commit) - Don't wait for other waiters. If another thread is dirtying 1MB and is waiting for it, don't wait for them too. This prevents two threads writing 1MB at a time with a limit of 1MB from serializing: both can dirty their 1MB and initiate a flush, and once 1/2 of that has flushed one of them will be allowed to proceed. - Update the flusher to add the dirty_waiting bytes to the amount to write so that the OPs will indeed be parallel. Signed-off-by: Sage Weil <sage@newdream.net>
* librbd: use unique error code for image removal failures [Josh Durgin, 2012-04-30, 1 file, -1/+1]
| | | | | | | This allows the rbd tool to provide a useful error message, instead of compounding more possible causes into one error code. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* librbd: the length argument of aio_discard should be uint64_t [Josh Durgin, 2012-04-26, 1 file, -3/+3]
| | | | | | size_t was accidentally copy-pasted. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* Merge remote branch 'origin/wip-rbd-snapid' into next [Josh Durgin, 2012-04-24, 1 file, -60/+57]
|\ | | | | | | Reviewed-by: Sage Weil <sage.weil@dreamhost.com>
| * librbd: reset needs_refresh flag before re-reading header [Josh Durgin, 2012-04-24, 1 file, -4/+7]
| | This way we can't miss an update if we get a notify during ictx_refresh. Specifically, a race like this (in time order): Thread 1: ictx_refresh(), read_header(); Process 2: snap_create(), notify(); Thread 2: need_refresh = true; Thread 1: process header..., need_refresh = false. If this happened, we would not re-read the header with the new snapshot, so the snapshot would not happen at the intended point in time, but only after we re-read the header again. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
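The ordering fix can be sketched as below. This is an illustration only: `RefreshState`, `refresh`, and the `read_header` callback are hypothetical, and an atomic flag is used here for brevity where the real code may rely on a lock.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical model of the per-image refresh state.
struct RefreshState {
  std::atomic<bool> needs_refresh{true};
  int header_version = 0;
};

// Clear the flag BEFORE reading the header. A notify that arrives while the
// header is being read sets the flag again, so the next check triggers
// another refresh instead of the update being lost.
void refresh(RefreshState &st, const std::function<int()> &read_header) {
  st.needs_refresh = false;
  st.header_version = read_header();
}
```

If the flag were cleared after the read instead, a notify landing between the read and the clear would be overwritten, which is the race described in the commit message.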
| * librbd: clean up snapshot handling a bit [Josh Durgin, 2012-04-24, 1 file, -51/+47]
| | | | | | | | | | | | | | | | | | | | | | | | * snapid should determine whether our mapped snapshot is gone, not snapname * snap_set(<nonexistent_snap>) shouldn't reset us to CEPH_NOSNAP * snapname should be set before using it in the perfcounter name * snapname and image name don't need to be passed as arguments since an ImageCtx already contains that info * ictx_check() doesn't need to check for non-existent snaps - only I/Os care, so check in check_io() instead Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
| * librbd: clarify handle_sparse_read condition [Josh Durgin, 2012-04-24, 1 file, -5/+3]
| | | | | | | | | | | | | | The earlier condition is >. != means < at this point, and the nesting is unnecessary. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* | librbd: pass errors removing head back to user [Sage Weil, 2012-04-24, 1 file, -1/+5]
|/ | | | | | | | | In particular, the OSD may return EBUSY if there are still watchers. Ignore ENOENT, as that may indicate we are cleaning up a previously aborted removal. Fixes: #2311 Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
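The error handling described here amounts to a small filter on the removal result. The helper name below is hypothetical; it only shows the rule of swallowing -ENOENT while passing everything else (such as -EBUSY) back to the caller.

```cpp
#include <cassert>
#include <cerrno>

// Map the result of removing the header object to what the user should see:
// -ENOENT is treated as success (possibly a previously aborted removal),
// any other error is returned unchanged.
int filter_remove_result(int r) {
  return (r == -ENOENT) ? 0 : r;
}
```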
* Merge branch 'master' into wip-discard [Sage Weil, 2012-04-21, 1 file, -4/+5]
|\
| * librbd: fix ictx_check pointer weirdness by using std::string [Sage Weil, 2012-04-20, 1 file, -4/+5]
| | | | | | | | | | | | | | | | | | | | | | | | I was seeing failures of LibRBD.TestIOToSnapshot where we would fail to refresh after rollback, even though the snap existed. I assume it is because the std::string whose c_str() we were pointing to was reallocated. Use a std::string here instead. This code is weird. Signed-off-by: Sage Weil <sage@newdream.net>
* | librbd: instrument with perfcounters [Sage Weil, 2012-04-21, 1 file, -7/+97]
| | | | | | | | | | | | | | Track IO operations on a per-image basis. Implements: #1451 Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* | librbd: allow image resize to non-block boundaries [Sage Weil, 2012-04-20, 1 file, -5/+17]
| | | | | | | | | | | | | | | | | | The caller is still invalidating the entire cache, so we don't need to deal with discard at this level. That might be worth cleaning up later, though. Fixes: #2296 Signed-off-by: Sage Weil <sage@newdream.net>
* | objectcacher: rename truncate_set -> discard_set, and use discard [Sage Weil, 2012-04-20, 1 file, -2/+2]
| | | | | | | | | | | | | | Do not assume the object extents are at the trailing edge of objects. Instead, discard arbitrary extents. Fix callers. Signed-off-by: Sage Weil <sage@newdream.net>
* | librbd: fix debug output [Sage Weil, 2012-04-20, 1 file, -2/+2]
| | | | | | | | | | | | objects is misleading here, these are byte offsets Signed-off-by: Sage Weil <sage@newdream.net>
* | librbd: make discard invalidate the range in cache [Sage Weil, 2012-04-20, 1 file, -0/+27]
| | | | | | | | | | | | Fed this to test_librbd_fsx and it was happy. Signed-off-by: Sage Weil <sage@newdream.net>
* | librbd: fix zeroing of trailing bits on short reads that span objects [Sage Weil, 2012-04-20, 1 file, -10/+21]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_sparse_read() was taking buf_ofs and buf_len, but buf_len was being interpreted as the total size of the buffer, not the length of the extent in the buffer starting at buf_ofs. Both callers pass in an extent length, so fix the zero code to do the right thing. Specifically, the behavior I saw was: - read range spanning 2 objects, trailing 20k and leading 50k - first object didn't exist, zeroed first 20k of buffer - second object didn't exist, zeroed next 30k (50k-20k) of buffer - the last 20k of buffer was unzeroed. Signed-off-by: Sage Weil <sage@newdream.net>
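The corrected interpretation of the length argument can be sketched as follows. `zero_extent_tail` is a hypothetical helper, not the real handle_sparse_read(); it only shows zeroing relative to the extent rather than to the whole buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Zero the unread tail of the extent [buf_ofs, buf_ofs + extent_len).
// extent_len is the length of this extent, NOT the total buffer size, so
// the zeroed range ends at buf_ofs + extent_len.
void zero_extent_tail(char *buf, size_t buf_ofs, size_t extent_len,
                      size_t bytes_read) {
  if (bytes_read < extent_len)
    std::memset(buf + buf_ofs + bytes_read, 0, extent_len - bytes_read);
}
```

With the case from the commit message (a 20k extent followed by a 50k extent, both objects missing), the second call zeroes the full 50k starting at offset 20k, instead of stopping at offset 50k.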
* | librbd: fix debug output for image resize [Sage Weil, 2012-04-20, 1 file, -3/+3]
|/ | | | | | Print old -> new, not new -> old. Signed-off-by: Sage Weil <sage@newdream.net>
* librbd: 'rbd cache enabled' -> 'rbd cache' [Sage Weil, 2012-04-17, 1 file, -1/+1]
| | | | | | | 'enabled' is useless verbiage. We should fix the rgw option too, probably... Signed-off-by: Sage Weil <sage@newdream.net>
* objectcacher: name them [Sage Weil, 2012-04-13, 1 file, -1/+6]
| | | | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* librbd: implement discard [Sage Weil, 2012-04-13, 1 file, -0/+134]
| | | | | | | | | | Implement sync and async discard. Embed an ObjectWriteOperation in the BlockCompletion struct. The sync version does a sync op on every block, just like write()... very stupid. Both of these should be fixed. Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* librbd: flush pending writes when a new snapshot is created [Josh Durgin, 2012-04-13, 1 file, -6/+24]
| | | | | | | This makes sure the state is as consistent as librbd can make it before the snapshot is actually created. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* librbd: flush cache before creating a snapshot [Josh Durgin, 2012-04-13, 1 file, -0/+6]
| | | | | | | This is a temporary workaround until the ObjectCacher is smarter about snapshots. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* librbd: fix bytes read accounting in read_iterate [Josh Durgin, 2012-04-13, 1 file, -2/+6]
| | | | | | | ObjectCacher will never do short reads, and always returns 0. librados may do short reads at the end of an object. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* librbd: check for writes to snapshots [Josh Durgin, 2012-04-13, 1 file, -0/+8]
| | | | | | | librados does this for us normally, but caching does not check for this. We might as well check early to avoid scheduling a bunch of aios anyway. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
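An early check of this kind might look like the sketch below. The sentinel value, the `check_writable` name, and the choice of -EROFS as the error code are assumptions for illustration; in Ceph the "no snapshot" sentinel is CEPH_NOSNAP.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Assumed sentinel meaning "no snapshot is mapped, this is the writable
// head"; it plays the role of CEPH_NOSNAP in this sketch.
constexpr uint64_t kNoSnap = static_cast<uint64_t>(-2);

// Refuse writes up front when the image is opened at a snapshot, before any
// asynchronous I/O is scheduled. The error code is an illustrative choice.
int check_writable(uint64_t mapped_snap_id) {
  return mapped_snap_id == kNoSnap ? 0 : -EROFS;
}
```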
* librbd: allow writeback caching [Josh Durgin, 2012-04-13, 1 file, -45/+207]
| | | | | | | This uses the existing infrastructure of ObjectCacher for buffer management and expiry. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* librbd: remove writeback window [Josh Durgin, 2012-04-13, 1 file, -100/+2]
| | | | | | This is superseded by a full-fledged writeback cache. Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* log: new logging infrastructure [Sage Weil, 2012-03-27, 1 file, -1/+1]
| | | | | | | | | - explicitly defined subsystems, and ceph_subsys_FOO enums to go with them - modular log system with Entry object - separate gather level and log level - drop lots of DoutStreambuf hackery Signed-off-by: Sage Weil <sage@newdream.net>
* libradospp: add config_t typedef [Sage Weil, 2012-02-14, 1 file, -10/+10]
| | | | | | Don't expose internal CephContext type name. Signed-off-by: Sage Weil <sage@newdream.net>