Commit message [Author; Age; Files; Lines]
* osdc/Objecter: clean up redundant op assignments (wip-osd-alloc) [Sage Weil; 2012-12-07; 1; -8/+0]
| add_op() sets the op code; the caller doesn't need to do it again.
| Signed-off-by: Sage Weil <sage@inktank.com>
* osd: implement prealloc/fallocate object operation [Sage Weil; 2012-12-07; 7; -0/+97]
| Implement a rados PREALLOC method that will call fallocate(2) to allocate disk blocks for a while, but not write to them.
| We choose the semantics that modify the file size so that the exposed object metadata will be less confusing; e.g., prealloc to 4MB will result in a 4MB object full of zeros (or whatever data was previously written).
| Include flags for only doing prealloc on object creation, and for only doing prealloc on an existing object.
| Signed-off-by: Sage Weil <sage@inktank.com>
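The size-changing semantics described above can be illustrated outside of Ceph. The following is a minimal sketch, assuming a POSIX system where Python's `os.posix_fallocate` is available; the helper name `prealloc`, its `only_create`/`only_existing` parameters, and the choice to raise an error when a flag's condition is not met are all hypothetical and are not taken from the Ceph implementation.

```python
import os
import tempfile


def prealloc(path, length, only_create=False, only_existing=False):
    """Hypothetical stand-in for a PREALLOC-style operation on a plain file."""
    exists = os.path.exists(path)
    # Assumed behaviour: refuse when the requested precondition does not hold.
    if only_create and exists:
        raise FileExistsError(path)
    if only_existing and not exists:
        raise FileNotFoundError(path)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Reserves blocks for [0, length) and grows the file size if needed;
        # bytes that were already written are left untouched.
        os.posix_fallocate(fd, 0, length)
    finally:
        os.close(fd)
    return os.stat(path).st_size


def demo():
    """Preallocate an existing 3-byte file to 4 MiB and inspect the result."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "object")
        with open(path, "wb") as f:
            f.write(b"abc")
        size = prealloc(path, 4 * 1024 * 1024, only_existing=True)
        with open(path, "rb") as f:
            data = f.read()
        # (new size, preserved prefix, whether the rest reads back as zeros)
        return size, data[:3], data[3:] == bytes(size - 3)
```

The file ends up reporting the full preallocated size, which mirrors the commit's choice to expose a 4MB object rather than hidden, unreported block reservations.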
* rgw: document admin api web interface. [caleb miles; 2012-12-07; 1; -3/+1504]
| Signed-off-by: caleb miles <caleb.miles@inktank.com>
* doc/install/os-recommendations: fix syncfs notes [Sage Weil; 2012-12-07; 1; -10/+11]
| For argonaut, squeeze and wheezy lack syncfs. For bobtail, only older kernels are problematic; we don't depend on glibc support.
| Signed-off-by: Sage Weil <sage@inktank.com>
* doc: fix bobtail version in os-recommendations [Sage Weil; 2012-12-07; 1; -1/+1]
| Signed-off-by: Sage Weil <sage@inktank.com>
* Merge remote-tracking branch 'gh/wip_doc' [Sage Weil; 2012-12-07; 2; -16/+16]
|\
| * doc: write descriptions for the remaining msgr options [Greg Farnum; 2012-12-04; 1; -7/+7]
| | Signed-off-by: Greg Farnum <greg@inktank.com>
| * doc: added some descriptions in ms-ref and filestore-config-ref [Samuel Just; 2012-12-04; 2; -9/+9]
| | Signed-off-by: Samuel Just <sam.just@inktank.com>
* | doc: Change per doc request. [John Wilkins; 2012-12-06; 1; -2/+2]
| | Signed-off-by: John Wilkins <john.wilkins@inktank.com>
* | Merge branch 'next' [Dan Mick; 2012-12-05; 2; -6/+12]
|\ \
| * \ Merge branch 'testing' into next [Dan Mick; 2012-12-05; 2; -6/+12]
| |\ \
| | * | rbd: update manpage for import/export [Dan Mick; 2012-12-05; 2; -6/+12]
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| | * | librbd: hold AioCompletion lock while modifying global state [Dan Mick; 2012-12-05; 1; -4/+5]
| | | | C_AioRead::finish needs to add in each chunk of a partial read request to the 'partial' map in the AioCompletion's state (in destriper, of type StripedReadResult). That map is global and must be protected from simultaneous access. Use the AioCompletion lock; could create a separate lock if contention is an issue.
| | | | Fixes: #3567
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| | | | (cherry picked from commit a55700cc0aea0ff79e55c6bf78e9757b81fe9425)
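The locking pattern in this fix can be sketched in a few lines. This is a simplified model, not librbd code: the class names are borrowed from the commit message, while `add_partial_result`, `assemble`, `finish_chunk`, and `demo` are hypothetical names chosen for illustration.

```python
import threading


class StripedReadResult:
    """Simplified stand-in: collects chunks of a partial read by offset."""

    def __init__(self):
        self.partial = {}  # offset -> bytes, shared by all chunk completions

    def add_partial_result(self, offset, data):
        self.partial[offset] = data

    def assemble(self):
        # Concatenate the chunks in offset order.
        return b"".join(self.partial[off] for off in sorted(self.partial))


class AioCompletion:
    def __init__(self):
        self.lock = threading.Lock()
        self.destriper = StripedReadResult()


def finish_chunk(completion, offset, data):
    # Hold the completion's lock while touching the shared 'partial' map,
    # so concurrent chunk completions cannot interleave their updates.
    with completion.lock:
        completion.destriper.add_partial_result(offset, data)


def demo(chunks=64, chunk_len=4):
    """Complete many chunks from separate threads and reassemble them."""
    completion = AioCompletion()
    threads = [
        threading.Thread(
            target=finish_chunk,
            args=(completion, i * chunk_len, bytes([i]) * chunk_len),
        )
        for i in range(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completion.destriper.assemble()
```

Reusing the existing completion lock keeps the change small; as the commit notes, a dedicated lock would only be worth adding if contention showed up.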
| | * | librbd: handle parent change while async I/Os are in flight [Dan Mick; 2012-12-05; 1; -6/+38]
| | | | During a test_librbd_fsx run including flatten, ImageCtx->parent was being dereferenced while null. Between the time the parent overlap is calculated and the time the guard+write completes with ENOENT and submits the copyup+write, the parent image could have changed (by resize) or been made irrelevant (by child flatten) such that the parent overlap is now incorrect.
| | | | Handle "no parent" by just sending the copyup+write; the copyup part will be a no-op. Move to WRITE_FLAT state in this case because there's no more child to deal with.
| | | | Handle "overlap changed" by recalculating overlap before reading parent data; if none is left, don't read, but rather just clear m_object_image_extents, in which case the copyup will again be a no-op because it will be of zero length. However we still have a parent, so stay in WRITE_COPYUP state and come back through as usual.
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| | | | Fixes: #3524
| | | | (cherry picked from commit 41e16a3b40efb80a5ed7a5587438569ca86c85a3)
| | * | Striper: use local variable inside if() that tested it [Dan Mick; 2012-12-05; 1; -1/+1]
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| | | | (cherry picked from commit 917a6f296323164f9d79df94916932722e66fc0a)
* | | | Merge branch 'next' [Dan Mick; 2012-12-05; 3; -11/+44]
|\ \ \ \
| |/ / /
| | | | Pull in fixes for 3567 and 3524
| * | | librbd: hold AioCompletion lock while modifying global state [Dan Mick; 2012-12-05; 1; -4/+5]
| | | | C_AioRead::finish needs to add in each chunk of a partial read request to the 'partial' map in the AioCompletion's state (in destriper, of type StripedReadResult). That map is global and must be protected from simultaneous access. Use the AioCompletion lock; could create a separate lock if contention is an issue.
| | | | Fixes: #3567
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| * | | librbd: handle parent change while async I/Os are in flight [Dan Mick; 2012-12-05; 1; -6/+38]
| | | | During a test_librbd_fsx run including flatten, ImageCtx->parent was being dereferenced while null. Between the time the parent overlap is calculated and the time the guard+write completes with ENOENT and submits the copyup+write, the parent image could have changed (by resize) or been made irrelevant (by child flatten) such that the parent overlap is now incorrect.
| | | | Handle "no parent" by just sending the copyup+write; the copyup part will be a no-op. Move to WRITE_FLAT state in this case because there's no more child to deal with.
| | | | Handle "overlap changed" by recalculating overlap before reading parent data; if none is left, don't read, but rather just clear m_object_image_extents, in which case the copyup will again be a no-op because it will be of zero length. However we still have a parent, so stay in WRITE_COPYUP state and come back through as usual.
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
| | | | Fixes: #3524
| * | | Striper: use local variable inside if() that tested it [Dan Mick; 2012-12-05; 1; -1/+1]
| | | | Signed-off-by: Dan Mick <dan.mick@inktank.com>
* | | | Merge branch 'next' [Josh Durgin; 2012-12-05; 24; -140/+405]
|\ \ \ \
| |/ / /
| * | | qa: add script for running xfstests in a vm [Josh Durgin; 2012-12-05; 1; -0/+7]
| | | | Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
| * | | OSD: ignore queries on now deleted pools [Samuel Just; 2012-12-05; 1; -0/+3]
| | | | Signed-off-by: Samuel Just <sam.just@inktank.com>
| | | | Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | Merge remote-tracking branch 'origin/wip-mds' into next [Greg Farnum; 2012-12-04; 13; -76/+209]
| |\ \ \
| | | | | Reviewed-by: Sage Weil <sage@inktank.com>
| | | | | Reviewed-by: Greg Farnum <greg@inktank.com>
| | * | | mds: journal remote inode's projected parent [Yan, Zheng; 2012-12-04; 1; -7/+8]
| | | | | Server::_rename_prepare() adds the remote inode's parent instead of its projected parent to the journal. So during journal replay, the journal entry for the rename operation will wrongly revert the remote inode's projected rename. This issue can be reproduced by:
| | | | |     touch file1
| | | | |     ln file1 file2
| | | | |     rm file1
| | | | |     mv file2 file3
| | | | | After journal replay, file1 reappears and the directory's fragstat gets corrupted.
| | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
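The reproduction sequence above can also be scripted, which is handy for checking the expected end state on a healthy filesystem. This sketch uses only standard `os` calls; the function names `reproduce` and `demo` are illustrative, and the script does not itself trigger an MDS journal replay, which is the step that exposed the bug.

```python
import os
import tempfile


def reproduce(directory):
    """Run the touch/ln/rm/mv sequence and report the resulting state."""
    file1 = os.path.join(directory, "file1")
    file2 = os.path.join(directory, "file2")
    file3 = os.path.join(directory, "file3")
    open(file1, "w").close()   # touch file1
    os.link(file1, file2)      # ln file1 file2
    os.remove(file1)           # rm file1
    os.rename(file2, file3)    # mv file2 file3
    # On a correct filesystem only file3 remains, with a single link.
    return sorted(os.listdir(directory)), os.stat(file3).st_nlink


def demo():
    with tempfile.TemporaryDirectory() as d:
        return reproduce(d)
```

Running `reproduce()` against a directory on the affected filesystem, forcing a journal replay, and listing the directory again would show whether `file1` reappears.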
| | * | | mds: don't create bloom filter for incomplete dir [Yan, Zheng; 2012-12-04; 2; -6/+12]
| | | | | Creating a bloom filter for an incomplete dir that was added by log replay will confuse subsequent dir lookups and can create a null dentry for an existing file. The erroneous null dentry confuses the fragstat accounting and causes an undeletable empty directory.
| | | | | The fix is to check if the dir is complete before creating the bloom filter. For the MDCache::trim_non_auth{,_subtree} cases, just do not call CDir::add_to_bloom because the bloom filter is useless for a replica.
| | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | * | | Merge remote-tracking branch 'gh/wip-mds' into next [Sage Weil; 2012-12-04; 13; -76/+209]
| | |\ \ \
| | | * | | mds: fix freeze inode deadlock [Yan, Zheng; 2012-12-01; 9; -19/+124]
| | | | | | CInode::freeze_inode() is used in the case of cross authority rename. Server::handle_slave_rename_prep() calls it to wait for all other operations on the source inode to complete. This happens after all locks for the rename operation are acquired. But to acquire locks, we need to auth pin the locks' parent objects first. So there is an ABBA deadlock if someone auth pins the source inode after locks for rename are acquired and before Server::handle_slave_rename_prep() is called.
| | | | | | The fix is to freeze and auth pin the source inode at the same time. This patch introduces CInode::freeze_auth_pin(); it waits for all other MDRequests to release auth pins, then changes the inode to FROZENAUTHPIN state. This state prevents other MDRequests from getting new auth pins.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: use rdlock_try() when checking NULL dentry [Yan, Zheng; 2012-12-01; 1; -3/+3]
| | | | | | Use rdlock_try() instead of can_read() when path_traverse encounters a NULL dentry. This can partly avoid waiting forever for the dentry to become readable when the dentry is a replica. Strictly speaking, using rdlock_try() is still not enough because the auth MDS may drop the REQRDLOCK message in some cases.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: allow open_remote_ino() to open xlocked dentry [Yan, Zheng; 2012-12-01; 3; -27/+39]
| | | | | | discover_ino() has a parameter want_xlocked. The parameter indicates whether the remote discover handler can proceed when an xlocked dentry is encountered. open_remote_ino() uses discover_ino() to find a non-auth inode, but always sets 'want_xlocked' to false. This may cause a deadlock in some corner cases.
| | | | | | For example: we rename an inode's primary dentry to one of its remote dentries and send a slave request to one witness MDS, but before the slave request reaches the witness MDS, the inode is trimmed from the witness MDS' cache. Then when the slave request arrives, open_remote_ino() will be called while traversing the destpath. open_remote_ino() calls discover_ino() with 'want_xlocked=false' to find the inode. discover_ino() sends an MDiscover message to the inode's authority MDS. The handler of the MDiscover message finds the inode's primary dentry is xlocked and it sleeps.
| | | | | | The fix is to add a parameter 'want_xlocked' to open_remote_ino() and make open_remote_ino() pass the parameter to discover_ino().
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: fix assertion in handle_cache_expire [Yan, Zheng; 2012-12-01; 1; -0/+1]
| | | | | | During export, it's possible to get cache expire messages in DISCOVERING, FREEZING and PREPPING state.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: fix open_remote_inode race [Yan, Zheng; 2012-12-01; 1; -2/+3]
| | | | | | discover_ino() may return -ENOENT if it races with other FS activities, so use C_MDC_RetryOpenRemoteIno instead of C_MDC_OpenRemoteIno as the onfinish callback.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: consider revoking caps in imported caps as issued [Yan, Zheng; 2012-12-01; 1; -2/+4]
| | | | | | The clients may have already sent the caps release message to the exporting MDS, so the importing MDS waits for the release message forever. Considering revoking caps as issued avoids this issue.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: drop locks if requiring auth pinning new objects. [Yan, Zheng; 2012-12-01; 1; -1/+2]
| | | | | | Locker::acquire_locks() skips auth pinning a replica object if we only request a rdlock and the lock is read-lockable. To get all locks, we may call Locker::acquire_locks() several times, and locks in replica objects may become not read-lockable between calls. So it is possible that we need to auth pin new objects after already taking some locks.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: don't forward client request from MDS [Yan, Zheng; 2012-12-01; 1; -6/+10]
| | | | | | Forwarding a client request that came from an MDS will trigger an assertion in MDS::forward_message_mds(). The MDS only sends client requests for stray migration/reintegration, so it's safe to drop them.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: call eval() after caps are exported [Yan, Zheng; 2012-12-01; 1; -0/+1]
| | | | | | For an inode that has just changed authority, if the new auth MDS wants to change a lock in the inode from 'sync' to 'lock' state before caps are exported, the lock in the replica can remain in 'sync->lock' state because client caps prevent it from transitioning to 'lock' state. So we should call eval() after clearing client caps.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED [Yan, Zheng; 2012-12-01; 1; -1/+5]
| | | | | | So eval_gather() will not skip calling scatter_writebehind(), otherwise the replica lock may be in flushing state forever.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: Don't acquire replica object's versionlock [Yan, Zheng; 2012-12-01; 2; -15/+17]
| | | | | | Both CInode's and CDentry's versionlocks are of type LocalLock. Acquiring a LocalLock in a replica object is useless and problematic. For example, if two requests try acquiring a replica object's versionlock, the first request succeeds and the second request is added to the wait queue. Later, when the first request finishes, MDCache::request_drop_foreign_locks() finds the lock's parent is non-auth and skips waking requests in the wait queue. So the second request hangs.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| | | * | | mds: allow try_eval to eval unstable locks in freezing object [Yan, Zheng; 2012-12-01; 1; -2/+2]
| | | | | | Unstable locks hold auth_pins on the object, which prevents the freezing object from becoming frozen and then unfreezing. So try_eval() should not wait for a freezing object.
| | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | | | | Merge branch 'wip-filestore' into next [Sage Weil; 2012-12-04; 3; -24/+30]
| |\ \ \ \ \
| | | | | | | Reviewed-by: Sam Just <sam.just@inktank.com>
| | * | | | | os/JournalingObjectStore: applied_seq -> max_applied_seq [Sage Weil; 2012-12-02; 2; -10/+10]
| | | | | | | Rename applied_seq to max_applied_seq, since it is a bound; there may be seq's < max_applied_seq that are not applied. This aligns the naming with max_applying_seq.
| | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
| | * | | | | os/FileStore: only wait for applying ops to complete before commit [Sage Weil; 2012-12-02; 3; -19/+25]
| | | | | | | We can have a large number of operations in the op_wq waiting to be applied to the fs. Currently, when we want to commit, we wait for them *all* to apply. This can take a very long time (the default queue length is 500 operations!).
| | | | | | | Instead, mark an Op as started ("applying") when the thread pool actually starts to apply it. At that point, only wait for applying ops to complete. We let any threads with an op seq < max_applying_seq begin as well so that we have a proper ordering/barrier. When those flush, applied_seq will == max_applying_seq, and that becomes the committing_seq value.
| | | | | | | Note that 'applied_seq' is still maintained, but serves no real purpose except to populate our asserts with sanity checks. max_applying_seq serves the purpose applied_seq used to.
| | | | | | | This removes one unnecessary source of latency associated with fs commits.
| | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
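The barrier described above can be modelled with a small bookkeeping class. This is a single-threaded sketch of the idea only, not FileStore code: `max_applying_seq` and `max_applied_seq` follow the commit messages, while the class name `ApplyTracker` and its methods are hypothetical.

```python
class ApplyTracker:
    """Toy model of which queued ops a commit has to wait for."""

    def __init__(self):
        self.queued = set()        # seqs sitting in the work queue
        self.applying = set()      # seqs a worker has started to apply
        self.max_applying_seq = 0
        self.max_applied_seq = 0

    def queue_op(self, seq):
        self.queued.add(seq)

    def start_op(self, seq):
        # A worker picked the op up: it is now "applying".
        self.queued.remove(seq)
        self.applying.add(seq)
        self.max_applying_seq = max(self.max_applying_seq, seq)

    def finish_op(self, seq):
        self.applying.remove(seq)
        self.max_applied_seq = max(self.max_applied_seq, seq)

    def commit_blockers(self):
        # A commit waits for ops already applying, plus any queued op below
        # the barrier, which is allowed to begin so that ordering is kept.
        early = {s for s in self.queued if s < self.max_applying_seq}
        return sorted(self.applying | early)

    def committing_seq(self):
        # Once nothing at or below the barrier is outstanding, the barrier
        # itself becomes the sequence number to commit.
        return None if self.commit_blockers() else self.max_applying_seq
```

With 500 ops queued and only ops 1 and 3 started, `commit_blockers()` returns `[1, 2, 3]` rather than all 500, which is the latency saving the commit is after.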
| * | | | | | Merge branch 'wip-msgr-delay-queue' into next [Sage Weil; 2012-12-04; 3; -20/+133]
| |\ \ \ \ \ \
| | * | | | | | msg/Pipe: flush delayed messages when stealing/failing pipes [Sage Weil; 2012-12-01; 2; -2/+23]
| | | | | | | | If we are failing a pipe, flush the incoming messages before we try to reconnect. Similarly, flush queued messages on an existing pipe before we replace it. This ensures that when we get a socket failure and reconnect, the delayed messages are handled in the normal fashion. Specifically, it fixes a situation like:
| | | | | | | |  - read msg, update in_seq etc.
| | | | | | | |  - delay msg
| | | | | | | |  - pipe faults
| | | | | | | |  - peer reconnects, we replace existing pipe, discard delayed msgs
| | | | | | | |  - peer resends msgs
| | | | | | | |  - we discard, because they are < in_seq
| | | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
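The message-loss scenario listed above can be replayed with a toy receiver. This is a sketch of the sequence-number logic only, not msgr code: the class `Receiver`, its `flush_on_replace` switch, and the choice to treat any `seq <= in_seq` as an already-seen resend are assumptions made for illustration.

```python
class Receiver:
    """Toy receiver with an in_seq counter and a delay queue."""

    def __init__(self, flush_on_replace):
        self.in_seq = 0
        self.delayed = []     # messages read but not yet delivered
        self.delivered = []
        self.flush_on_replace = flush_on_replace

    def read(self, seq, payload):
        if seq <= self.in_seq:
            return            # assumed rule: resend of a seen message, drop
        self.in_seq = seq     # "read msg, update in_seq"
        self.delayed.append(payload)  # "delay msg"

    def replace_pipe(self):
        # The peer reconnected and the existing pipe is being replaced.
        if self.flush_on_replace:
            self.delivered.extend(self.delayed)  # behaviour after the fix
        self.delayed.clear()                     # otherwise simply discarded


def demo(flush_on_replace):
    r = Receiver(flush_on_replace)
    r.read(1, "msg-1")   # read and delayed
    r.replace_pipe()     # pipe faults, peer reconnects
    r.read(1, "msg-1")   # peer resends; dropped because in_seq is already 1
    return r.delivered
```

Without the flush, `demo(False)` returns an empty list, so the message is lost for good; with it, `demo(True)` returns `["msg-1"]`.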
| | * | | | | | msg/Pipe: release dispatch throttle on delayed queue discard [Sage Weil; 2012-11-29; 2; -8/+12]
| | | | | | | | This avoids leaking into the throttle and deadlocking.
| | | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
| | * | | | | | msg/Pipe: start delay thread *after* we know peer type [Sage Weil; 2012-11-29; 2; -4/+10]
| | | | | | | | At end of connect(), or end of accept().
| | | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
| | * | | | | | msg/Pipe: drop queue helpers [Sage Weil; 2012-11-29; 2; -35/+12]
| | | | | | | | There is a single caller; these only obfuscate.
| | | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
| | * | | | | | msg/Pipe: refactor msgr delays [Sage Weil; 2012-11-29; 2; -83/+73]
| | | | | | | |  - move all delay state into a single class
| | | | | | | |  - create thread once and only once per Pipe
| | | | | | | |  - adjust debug levels
| | | | | | | |  - discard messages at the appropriate times
| | | | | | | | Signed-off-by: Sage Weil <sage@inktank.com>
| | * | | | | | msgr: add a delay_until queue that is used to delay deliveries. [Greg Farnum; 2012-11-29; 2; -5/+23]
| | | | | | | | Its life-cycle matches that of delay_queue, and the delayed_delivery function respects it. For now queue_received is just setting it to delay everything by 1 second.
| | | | | | | | Signed-off-by: Greg Farnum <greg@inktank.com>
| | * | | | | | msgr: clear out the delay queue when stop()ing [Greg Farnum; 2012-11-29; 1; -0/+4]
| | | | | | | | After some brief thought, I believe deleting any messages in the delay queue is correct -- we are trying to simulate line delays in delivery and so anything still in the queue has supposedly not arrived yet. So delete them when we stop the Pipe for any reason.
| | | | | | | | Signed-off-by: Greg Farnum <greg@inktank.com>
| | * | | | | | msgr: move the delay queue initialization into start_reader [Greg Farnum; 2012-11-29; 1; -9/+13]
| | | | | | | | The Pipe doesn't know the peer type in the constructor. It doesn't always know in start_reader either, so this needs more work, but at least it knows more frequently than it did.
| | | | | | | | Signed-off-by: Greg Farnum <greg@inktank.com>