Commit message    Author    Age    Files    Lines
* FileStore: optionally compact leveldb on mount [wip_cuttlefish_compact_on_startup]    Samuel Just    2013-06-19    2    -0/+10
|   Signed-off-by: Samuel Just <sam.just@inktank.com>
* messages/MOSDMarkMeDown: fix uninit field    Sage Weil    2013-06-19    1    -5/+5
|   Fixes valgrind warning:
|
|   ==14803== Use of uninitialised value of size 8
|   ==14803==    at 0x12E7614: sctp_crc32c_sb8_64_bit (sctp_crc32.c:567)
|   ==14803==    by 0x12E76F8: update_crc32 (sctp_crc32.c:609)
|   ==14803==    by 0x12E7720: ceph_crc32c_le (sctp_crc32.c:733)
|   ==14803==    by 0x105085F: ceph::buffer::list::crc32c(unsigned int) (buffer.h:427)
|   ==14803==    by 0x115D7B2: Message::calc_front_crc() (Message.h:441)
|   ==14803==    by 0x1159BB0: Message::encode(unsigned long, bool) (Message.cc:170)
|   ==14803==    by 0x1323934: Pipe::writer() (Pipe.cc:1524)
|   ==14803==    by 0x13293D9: Pipe::Writer::entry() (Pipe.h:59)
|   ==14803==    by 0x120A398: Thread::_entry_func(void*) (Thread.cc:41)
|   ==14803==    by 0x503BE99: start_thread (pthread_create.c:308)
|   ==14803==    by 0x6C6E4BC: clone (clone.S:112)
|
|   Backport: cuttlefish
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit eb91f41042fa31df2bef9140affa6eac726f6187)
* Merge remote-tracking branch 'gh/wip-4976-cuttlefish' into cuttlefish    Sage Weil    2013-06-19    1    -30/+2
|\
| |   Reviewed-by: Samuel Just <sam.just@inktank.com>
| * os/FileStore: drop posix_fadvise(...DONTNEED)    Sage Weil    2013-06-18    1    -28/+0
| |   On XFS this call is problematic because it directly calls the filemap
| |   writeback without vectoring through xfs.  This can break the delicate
| |   ordering of writeback and range zeroing; see #4976 and this thread
| |
| |     http://oss.sgi.com/archives/xfs/2013-06/msg00066.html
| |
| |   Drop this behavior for now to avoid subtle data corruption.
| |
| |   Signed-off-by: Sage Weil <sage@inktank.com>
| * os/FileStore: use fdatasync(2) instead of sync_file_range(2)    Sage Weil    2013-06-18    1    -2/+2
| |   The use of sync_file_range(2) on XFS screws up XFS' delicate ordering
| |   of writeback and range zeroing; see #4976 and this thread:
| |
| |     http://oss.sgi.com/archives/xfs/2013-06/msg00066.html
| |
| |   Instead, replace all sync_file_range(2) calls with fdatasync(2), which
| |   *does* do ordered writeback and should not leak unzeroed blocks.
| |
| |   Signed-off-by: Sage Weil <sage@inktank.com>
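The write-then-fdatasync pattern described above can be sketched as follows. This is a minimal illustration in Python, not the FileStore code (which is C++); `write_and_flush` and the file name are hypothetical, and `os.fdatasync` is Python's wrapper around fdatasync(2) on Linux.

```python
import os
import tempfile


def write_and_flush(path, data):
    """Write data to path, then flush the file's data with fdatasync(2).

    Hypothetical helper for illustration only; it is not part of Ceph.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, data[written:])
        # Flush file data (and the metadata needed to read it back)
        # for the whole file, rather than syncing a byte range.
        os.fdatasync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "object")
        print(write_and_flush(target, b"hello"))
```

The sketch flushes the whole file rather than a byte range, which is the trade-off the commit accepts in exchange for ordered writeback.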
* | common/Preforker: fix warning    Sage Weil    2013-06-18    1    -2/+5
| |   common/Preforker.h: In member function ‘int Preforker::signal_exit(int)’:
| |   warning: common/Preforker.h:82:45: ignoring return value of
| |   ‘ssize_t safe_write(int, const void*, size_t)’, declared with attribute
| |   warn_unused_result [-Wunused-result]
| |
| |   This is harder than it should be to fix.  :(
| |
| |     http://stackoverflow.com/questions/3614691/casting-to-void-doesnt-remove-warn-unused-result-error
| |
| |   Whatever, I guess we can do something useful with this return value.
| |
| |   Signed-off-by: Sage Weil <sage@inktank.com>
| |   Reviewed-by: David Zafman <david.zafman@inktank.com>
| |   (cherry picked from commit ce7b5ea7d5c30be32e4448ab0e7e6bb6147af548)
* | mon: Monitor: make sure we backup a monmap during sync start    Joao Eduardo Luis    2013-06-18    1    -3/+5
| |   First of all, we must find a monmap to back up: the newest version.
| |   Secondly, we must make sure we back it up before clearing the store.
| |   Finally, we must make sure that we don't remove said backup while
| |   clearing the store; otherwise, we would be out of a backup monmap if
| |   the sync happened to fail (and if the monitor happened to be killed
| |   before a new sync had finished).
| |
| |   This patch makes sure these conditions are met.
| |
| |   Fixes: #5256 (partially)
| |   Backport: cuttlefish
| |   Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| |   (cherry picked from commit 5e6dc4ea21b452e34599678792cd36ce1ba3edb3)
* | mon: Monitor: obtain latest monmap on sync store init    Joao Eduardo Luis    2013-06-18    2    -13/+54
| |   Always use the highest version amongst all the typically available
| |   monmaps: whatever we have in memory, whatever we have under the
| |   MonmapMonitor's store, and whatever we have backed up from a previous
| |   sync.  This ensures we always use the newest version we came across.
| |
| |   Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| |   (cherry picked from commit 6284fdce794b73adcc757fee910e975b6b4bd054)
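The selection rule above (take the highest version among the in-memory, stored, and backed-up monmaps) can be sketched as below. This is an illustrative Python model, not the monitor's C++ code; the `MonMap` tuple, its `epoch` field as the version number, and `pick_latest_monmap` are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a monmap: where it came from and its version.
MonMap = namedtuple("MonMap", ["source", "epoch"])


def pick_latest_monmap(candidates):
    """Return the candidate with the highest epoch, or None if none exist.

    Candidates that are missing (None) are skipped, since any of the
    three sources may be unavailable.
    """
    available = [m for m in candidates if m is not None]
    if not available:
        return None
    return max(available, key=lambda m: m.epoch)


if __name__ == "__main__":
    in_memory = MonMap("memory", 3)
    stored = MonMap("store", 5)
    backup = None  # no backup from a previous sync
    print(pick_latest_monmap([in_memory, stored, backup]))
```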
* | mon: Monitor: don't remove 'mon_sync' when clearing the store during abort    Joao Eduardo Luis    2013-06-18    1    -1/+0
| |   Otherwise, we will end up losing the monmap we backed up when we
| |   started the sync, and the monitor may be unable to start if it is
| |   killed or crashes in-between the sync abort and finishing a new sync.
| |
| |   Fixes: #5256 (partially)
| |   Backport: cuttlefish
| |   Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| |   (cherry picked from commit af5a9861d7c6b4527b0d2312d0efa792910bafd9)
* | config: fix run_dir typo    Sage Weil    2013-06-18    1    -1/+1
| |   From 654299108bfb11e7dce45f54946d1505f71d2de8.
| |
| |   Signed-off-by: Sage Weil <sage@inktank.com>
| |   (cherry picked from commit e9689ac6f5f50b077a6ac874f811d204ef996c96)
* | ceph.spec: create /var/run on package install    Sage Weil    2013-06-18    1    -0/+1
| |   The %ghost %dir ... line will make this get cleaned up but won't
| |   install it.
| |
| |   Reported-by: Derek Yarnell <derek@umiacs.umd.edu>
| |   Signed-off-by: Sage Weil <sage@inktank.com>
| |   Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
| |   (cherry picked from commit 64ee0148a5b7324c7df7de2d5f869b880529d452)
* | global: create /var/run/ceph on daemon startup    Sage Weil    2013-06-18    2    -1/+11
| |   This handles cases where the daemon is started without the benefit of
| |   sysvinit or upstart (as with teuthology or ceph-fuse).
| |
| |   Signed-off-by: Sage Weil <sage@inktank.com>
| |   (cherry picked from commit 654299108bfb11e7dce45f54946d1505f71d2de8)
* | PG: don't dirty log unconditionally in activate()    Samuel Just    2013-06-18    1    -1/+0
| |   merge_log and friends all take care of dirtying the log as necessary.
| |
| |   Fixes: #5238
| |   Signed-off-by: Samuel Just <sam.just@inktank.com>
| |   (cherry picked from commit 5deece1d034749bf72b7bd04e4e9c5d97e5ad6ce)
* | mon: OSDMonitor: don't ignore apply_incremental()'s return on UfP [1]    Joao Eduardo Luis    2013-06-18    1    -1/+2
|/
|   apply_incremental() may return -EINVAL.  Don't ignore it.
|
|   [1] UfP = Update from Paxos
|
|   Fixes: #5343
|   Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
|   (cherry picked from commit e3c33f4315cbf8718f61eb79e15dd6d44fc908b7)
* client: handle reset during initial mds session open    Sage Weil    2013-06-17    1    -1/+14
|   If we get a reset during our attempt to open an MDS session, close out
|   the Connection* and retry opening the session, moving the waiters over.
|
|   Fixes: #5379
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   Reviewed-by: Greg Farnum <greg@inktank.com>
|   (cherry picked from commit df8a3e5591948dfd94de2e06640cfe54d2de4322)
* ceph-disk: add some notes on wth we are up to    Sage Weil    2013-06-17    1    -0/+42
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 8c6b24e9039079e897108f28d6af58cbc703a15a)
* ceph-disk: clear TERM to avoid libreadline hijinx    Sage Weil    2013-06-17    1    -0/+5
|   The weird output from libreadline users is related to the TERM variable.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit e538829f16ce19d57d63229921afa01cc687eb86)
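One way to keep TERM away from child processes is sketched below. The commit does not show how ceph-disk clears the variable, so this is an assumption-laden illustration: `env_without_term` and `run_clean` are hypothetical helpers, and only standard `os` and `subprocess` calls are used.

```python
import os
import subprocess
import sys


def env_without_term(base_env=None):
    """Return a copy of the environment with TERM removed."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("TERM", None)  # no error if TERM was never set
    return env


def run_clean(args):
    """Run a command with TERM unset and return its raw stdout bytes."""
    return subprocess.check_output(args, env=env_without_term())


if __name__ == "__main__":
    # The child reports whether it can see TERM.
    child = "import os; print(os.environ.get('TERM'))"
    print(run_clean([sys.executable, "-c", child]))
```

Copying the environment instead of editing `os.environ` keeps the change local to the spawned command.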
* ceph-disk-udev: set up by-partuuid, -typeuuid symlinks on ancient udev    Sage Weil    2013-06-17    1    -5/+17
|   Make the ancient-udev/blkid workaround script for RHEL/CentOS create the
|   symlinks for us too.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit d7f7d613512fe39ec883e11d201793c75ee05db1)
* ceph-disk: do not stop activate-all on first failure    Sage Weil    2013-06-17    1    -2/+9
|   Keep going even if we hit one activation error.  This avoids failing to
|   start some disks when only one of them won't start (e.g., because it
|   doesn't belong to the current cluster).
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit c9074375bfbe1e3757b9c423a5ff60e8013afbce)
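The keep-going behavior can be sketched as a loop that records failures instead of aborting. This is a generic Python illustration, not the ceph-disk source; `activate_all`, its `activate` callback, and the device names are hypothetical.

```python
def activate_all(devices, activate):
    """Try to activate every device; collect failures instead of stopping.

    `activate` is any callable that raises on failure. Returns a list of
    (device, exception) pairs for the devices that could not be started.
    """
    failures = []
    for dev in devices:
        try:
            activate(dev)
        except Exception as err:
            # Remember the error and move on to the next device.
            failures.append((dev, err))
    return failures


if __name__ == "__main__":
    def fake_activate(dev):
        # Pretend one disk belongs to a different cluster.
        if dev == "sdb1":
            raise RuntimeError("wrong cluster")

    for dev, err in activate_all(["sda1", "sdb1", "sdc1"], fake_activate):
        print("failed to activate %s: %s" % (dev, err))
```

Returning the failures lets the caller still report a non-zero exit status after every device has been attempted.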
* ceph.spec: include partuuid rules in package    Sage Weil    2013-06-17    1    -0/+1
|   Commit f3234c147e083f2904178994bc85de3d082e2836 missed this.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 253069e04707c5bf46869f4ff5a47ea6bb0fde3e)
* ceph.spec: install/uninstall init script    Sage Weil    2013-06-17    1    -2/+2
|   This was commented out almost years ago in commit 9baf5ef4 but it is not
|   clear to me that it was correct to do so.  In any case, we are not
|   installing the rc.d links for ceph, which means it does not start up
|   after a reboot.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit cc9b83a80262d014cc37f0c974963cf7402a577a)
* sysvinit, upstart: ceph-disk activate-all on start    Sage Weil    2013-06-17    2    -0/+11
|   On 'service ceph start' or 'service ceph start osd' or start
|   ceph-osd-all we should activate any osd GPT partitions.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 13680976ef6899cb33109f6f841e99d4d37bb168)
* ceph-disk: add 'activate-all'    Sage Weil    2013-06-17    1    -0/+52
|   Scan /dev/disk/by-parttypeuuid for ceph OSDs and activate them all.
|   This is useful when activation didn't trigger on the initial udev event
|   for some reason.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 5c7a23687a1a21bec5cca7b302ac4ba47c78e041)
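The scan step can be sketched as a directory listing filtered by partition type. This is a hedged Python illustration rather than the ceph-disk code: `find_by_type` is a hypothetical helper, the `$type-$uuid` entry naming is taken from the udev commit title in this log, and the UUIDs in the demo are made-up placeholders.

```python
import os
import tempfile


def find_by_type(directory, type_uuid):
    """Return paths in `directory` whose name is '<type_uuid>-<uuid>'.

    Matching is case-insensitive because UUIDs may be written in either
    case. Entries for other partition types are ignored.
    """
    prefix = type_uuid.lower() + "-"
    matches = []
    for name in sorted(os.listdir(directory)):
        if name.lower().startswith(prefix):
            matches.append(os.path.join(directory, name))
    return matches


if __name__ == "__main__":
    # Placeholder UUIDs for the demo; not real partition type codes.
    osd_type = "11111111-1111-1111-1111-111111111111"
    other_type = "22222222-2222-2222-2222-222222222222"
    with tempfile.TemporaryDirectory() as tmpdir:
        for name in (osd_type + "-aaaa", other_type + "-bbbb"):
            open(os.path.join(tmpdir, name), "w").close()
        print(find_by_type(tmpdir, osd_type))
```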
* udev: /dev/disk/by-parttypeuuid/$type-$uuid    Sage Weil    2013-06-17    1    -0/+3
|   We need this to help trigger OSD activations.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit d512dc9eddef3299167d4bf44e2018b3b6031a22)
* rgw: escape prefix correctly when listing objects    Yehuda Sadeh    2013-06-17    1    -2/+6
|   Fixes: #5362
|   When listing objects, the prefix needs to be escaped correctly (the same
|   as with the marker).  Otherwise listing objects with a prefix that
|   starts with an underscore doesn't work.
|
|   Backport: bobtail, cuttlefish
|   Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
|   Reviewed-by: Greg Farnum <greg@inktank.com>
|   (cherry picked from commit d582ee2438a3bd307324c5f44491f26fd6a56704)
* messages/MMonSync: initialize crc in ctor    Sage Weil    2013-06-17    1    -2/+2
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit cd1c289b96a874ff99a83a44955d05efc9f2765a)
* client: fix ancient typo in caps revocation path    Sage Weil    2013-06-17    1    -1/+1
|   If we have dropped all references to a revoked capability, send the ack
|   to the MDS.
|
|   This typo has been there since v0.7 (early 2009)!
|
|   Backport: cuttlefish
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   Reviewed-by: Greg Farnum <greg@inktank.com>
|   (cherry picked from commit b7143c2f84daafbe2c27d5b2a2d5dc40c3a68d15)
* messages/MMonHealth: remove unused flag field    Sage Weil    2013-06-17    1    -22/+2
|   This was initialized in (one of) the ctor(s), but not encoded/decoded,
|   and not used.  Remove it.  This makes valgrind a happy.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 08bb8d510b5abd64f5b9f8db150bfc8bccaf9ce8)
* messages/MMonProbe: fix uninitialized variables    Sage Weil    2013-06-17    1    -1/+4
|   Backport: cuttlefish
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 4974b29e251d433101b69955091e22393172bcd8)
* common/Preforker: fix broken recursion on exit(3)    Sage Weil    2013-06-15    1    -2/+2
|   If we exit via preforker, call exit(3) and not recursively back into
|   Preforker::exit(r).  Otherwise you get a hang with the child blocked at:
|
|   Thread 1 (Thread 0x7fa08962e7c0 (LWP 5419)):
|   #0  0x000000309860e0cd in write () from /lib64/libpthread.so.0
|   #1  0x00000000005cc906 in Preforker::exit(int) ()
|   #2  0x00000000005c8dfb in main ()
|
|   and the parent at
|
|   #0  0x000000309860eba7 in waitpid () from /lib64/libpthread.so.0
|   #1  0x00000000005cc87a in Preforker::parent_wait() ()
|   #2  0x00000000005c75ae in main ()
|
|   Backport: cuttlefish
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 7e7ff7532d343c473178799e37f4b83cf29c4eee)
* rules: Don't disable tcmalloc on ARM (and other non-intel)    Gary Lowell    2013-06-14    1    -7/+0
|   Fixes #5342
|
|   Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
* Remove mon socket in post-stop    Guilhem Lettron    2013-06-14    1    -0/+5
|   If ceph-mon segfaults, the socket file isn't removed.
|
|   Adding a remove in post-stop lets upstart clean the run directory
|   properly.
|
|   Signed-off-by: Guilhem Lettron <guilhem@lettron.fr>
|   (cherry picked from commit 554b41b171eab997038e83928c462027246c24f4)
* Remove stop on from upstart tasks    James Page    2013-06-14    5    -5/+0
|   Upstart tasks don't have the concept of 'stop on' as they are not long
|   running.
|
|   (cherry picked from commit 17f6fccabc262b9a6d59455c524b550e77cd0fe3)
* ceph-disk: extra dash in error message    Dan Mick    2013-06-14    1    -1/+1
|   Signed-off-by: Dan Mick <dan.mick@inktank.com>
|   (cherry picked from commit f86b4e7a4831c684033363ddd335d2f3fb9a189a)
* ceph-disk: cast output of _check_output()    Danny Al-Gaaf    2013-06-14    1    -1/+1
|   Cast output of _check_output() to str() to be able to use str.split().
|
|   Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|   (cherry picked from commit 16ecae153d260407085aaafbad1c1c51f4486c9a)
* ceph-disk: remove unnecessary semicolons    Danny Al-Gaaf    2013-06-14    1    -2/+2
|   Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|   (cherry picked from commit 9785478a2aae7bf5234fbfe443603ba22b5a50d2)
* ceph-disk: fix undefined variable    Danny Al-Gaaf    2013-06-14    1    -1/+1
|   Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|   (cherry picked from commit 9429ff90a06368fc98d146e065a7b9d1b68e9822)
* ceph-disk: add missing spaces around operator    Danny Al-Gaaf    2013-06-14    1    -2/+2
|   Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|   (cherry picked from commit c127745cc021c8b244d721fa940319158ef9e9d4)
* udev: drop useless --mount argument to ceph-disk    Sage Weil    2013-06-14    2    -4/+4
|   It doesn't mean anything anymore; drop it.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit bcfd2f31a50d27038bc02e645795f0ec99dd3b32)
* ceph-disk-udev: activate-journal    Sage Weil    2013-06-14    1    -0/+2
|   Trigger 'ceph-disk activate-journal' from the alt udev rules.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit b139152039bfc0d190f855910d44347c9e79b22a)
* ceph-disk: do not use mount --move (or --bind)    Sage Weil    2013-06-14    1    -2/+19
|   The kernel does not let you mount --move when the parent mount is
|   shared (see, e.g., https://bugzilla.redhat.com/show_bug.cgi?id=917008
|   for another person this also confused).
|
|   We can't use --bind either since that (on RHEL at least) screws up
|   /etc/mtab so that the final result looks like
|
|     /var/lib/ceph/tmp/mnt.HNHoXU /var/lib/ceph/osd/ceph-0 none rw,bind 0 0
|
|   Instead, mount the original dev in the final location and then umount
|   from the old location.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit e5ffe0d2484eb6cbcefcaeb5d52020b1130871a5)
* ceph.spec: include by-partuuid udev workaround rules    Sage Weil    2013-06-14    1    -0/+2
|   These are needed for old or buggy udev.  Having them for new and
|   unbroken udev is harmless.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit f3234c147e083f2904178994bc85de3d082e2836)
* ceph-disk: work around buggy rhel/centos parted    Sage Weil    2013-06-14    1    -0/+5
|   parted on RHEL/Centos prefixes the *machine readable output* with
|
|     1b 5b 3f 31 30 33 34 68
|
|   Note that the same thing happens when you 'import readline' in python.
|
|   Work around it!
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 82ff72f827b9bd7f91d30a09d35e42b25d2a7344)
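A workaround of this kind can be sketched as stripping the known eight-byte prefix before parsing. This is an illustrative Python snippet, not the actual ceph-disk change; `strip_parted_junk` is a hypothetical helper and the sample payload is invented. The byte string is exactly the sequence quoted in the message (ESC `[?1034h`).

```python
# The prefix quoted in the commit message: 1b 5b 3f 31 30 33 34 68
JUNK = b"\x1b\x5b\x3f\x31\x30\x33\x34\x68"


def strip_parted_junk(output):
    """Remove the stray escape sequence if it leads the output."""
    if output.startswith(JUNK):
        return output[len(JUNK):]
    return output


if __name__ == "__main__":
    raw = JUNK + b"payload"  # invented sample, not real parted output
    print(strip_parted_junk(raw))
```

Only a leading occurrence is removed, so output that is already clean passes through unchanged.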
* ceph-disk: implement 'activate-journal'    Sage Weil    2013-06-14    2    -0/+88
|   Activate an osd via its journal device.
|
|   udev populates its symlinks and triggers events in an order that is not
|   related to whether the device is an osd data partition or a journal.
|   That means that triggering 'ceph-disk activate' can happen before the
|   journal (or journal symlink) is present and then fail.
|
|   Similarly, it may be that they are on different disks that are
|   hotplugged with the journal second.
|
|   This can be wired up to the journal partition type to ensure that osds
|   are started when the journal appears second.
|
|   Include the udev rules to trigger this.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit a2a78e8d16db0a71b13fc15457abc5fe0091c84c)
* ceph-disk: call partprobe outside of the prepare lock; drop udevadm settle    Sage Weil    2013-06-14    1    -31/+13
|   After we change the final partition type, sgdisk may or may not trigger
|   a udev event, depending on how well udev is behaving (it varies between
|   distros, it seems).  The old code would often settle and wait for udev
|   to activate the device, and then partprobe would uselessly fail because
|   it was already mounted.
|
|   Call partprobe only at the very end, after prepare is done.  This
|   ensures that if partprobe calls udevadm settle (which it sometimes does)
|   we do not get stuck.
|
|   Drop the udevadm settle.  I'm not sure what this accomplishes; take it
|   out, at least until we determine we need it.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 8b3b59e01432090f7ae774e971862316203ade68)
* ceph-disk: add 'zap' command    Sage Weil    2013-06-14    1    -0/+14
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 10ba60cd088c15d4b4ea0b86ad681aa57f1051b6)
* ceph-disk: fix stat errors with new suppress code    Sage Weil    2013-06-14    1    -4/+4
|   Broken by 225fefe5e7c997b365f481b6c4f66312ea28ed61.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit bcc8bfdb672654c6a6b48a2aa08267a894debc32)
* ceph-disk: add '[un]suppress-activate <dev>' command    Sage Weil    2013-06-14    1    -0/+92
|   It is often useful to prepare but not activate a device, for example
|   when preparing a bunch of spare disks.  This marks a device as 'do not
|   activate' so that it can be prepared without activating.
|
|   Fixes: #3255
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 225fefe5e7c997b365f481b6c4f66312ea28ed61)
* upstart: start ceph-all on runlevel [2345]    Sage Weil    2013-06-14    1    -1/+1
|   Starting when only one network interface has started breaks machines
|   with multiple nics in very problematic ways.  There may be an earlier
|   trigger that we can use for cases where other services on the local
|   machine depend on ceph, but for now this is better than the existing
|   behavior.
|
|   See #5248
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   (cherry picked from commit 7e08ed1bf154f5556b3c4e49f937c1575bf992b8)
* client: set issue_seq (not seq) in cap release    Sage Weil    2013-06-11    1    -1/+1
|   We regularly have been observing a stall where the MDS is blocked
|   waiting for a cap revocation (Ls, in our case) and never gets a reply.
|   We finally tracked down the sequence:
|
|    - mds issues cap seq 1 to client
|    - mds does revocation (seq 2)
|    - client replies
|    - much time goes by
|    - client trims inode from cache, sends release with seq == 2
|    - mds ignores release because its issue_seq is 1
|    - mds later tries to revoke other caps
|    - client discards message because it doesn't have the inode in cache
|
|   The problem is simply that we are using seq instead of issue_seq in the
|   cap release message.  Note that the other release call site in
|   encode_inode_release() is correct.  That one is much more commonly
|   triggered by short tests, as compared to this case where the inode
|   needs to get pushed out of the client cache.
|
|   Signed-off-by: Sage Weil <sage@inktank.com>
|   Reviewed-by: Greg Farnum <greg@inktank.com>
|   (cherry picked from commit 9b012e234a924efd718826ab6a53b9aeb7cd6649)
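The sequence in the message can be replayed with a toy model. This is a deliberately simplified Python sketch, not Ceph code: the `MdsCap` class, its method names, and the exact acceptance check are assumptions chosen to mirror the steps listed above (issue at seq 1, revoke at seq 2, release ignored unless it carries issue_seq).

```python
class MdsCap:
    """Toy model of the MDS-side state for one client capability."""

    def __init__(self):
        self.issue_seq = 0  # bumped only when caps are issued
        self.seq = 0        # bumped on every cap message, revokes included
        self.held = False

    def issue(self):
        self.issue_seq += 1
        self.seq += 1
        self.held = True
        return self.seq

    def revoke(self):
        # A revocation advances seq but leaves issue_seq alone.
        self.seq += 1
        return self.seq

    def handle_release(self, release_seq):
        # Assumed check: a release is honored only if it names the
        # current issue_seq; anything else is treated as stale.
        if release_seq != self.issue_seq:
            return False
        self.held = False
        return True


if __name__ == "__main__":
    cap = MdsCap()
    cap.issue()            # mds issues cap, seq 1
    last_seq = cap.revoke()  # mds does revocation, seq 2
    print(cap.handle_release(last_seq))       # buggy client sends seq
    print(cap.handle_release(cap.issue_seq))  # fixed client sends issue_seq
```

In this model the first release is dropped and the cap stays held, which corresponds to the stall described in the message; the second release clears it.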