summaryrefslogtreecommitdiff
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()Julia Lawall2015-10-317-28/+12
| | | | | | | | | | | | | | | | | | | Remove DM's unneeded NULL tests before calling these destroy functions, now that they check for NULL, thanks to these v4.3 commits: 3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()") 4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()") The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: add support for passing through persistent reservationsChristoph Hellwig2015-10-312-1/+124
| | | | | | | | | | | This adds support to pass through persistent reservation requests similar to the existing ioctl handling, and with the same limitations, e.g. devices may only have a single target attached. This is mostly intended for multipathing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: refactor ioctl handlingChristoph Hellwig2015-10-317-72/+91
| | | | | | | | | | | | | This moves the call to blkdev_ioctl and the argument checking to DM core code, and only leaves a callout to find the block device to operate on in the targets. This simplifies the code and allows us to pass through ioctl-like command using other methods in the next patch. Also split out a helper around calling the prepare_ioctl method that will be reused for persistent reservation handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* Revert "dm mpath: fix stalls when handling invalid ioctls"Mauricio Faria de Oliveira2015-10-311-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit a1989b330093578ea5470bea0a00f940c444c466. That commit introduced a regression at least for the case of the SG_IO ioctl() running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there are no active paths: the ioctl() fails with the ENOTTY errno immediately rather than blocking due to queue_if_no_path until a path becomes active, for example. That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]) from multipath devices; which leads to SCSI/filesystem errors in such a guest. More general scenarios can hit that regression too. The following demonstration employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective (some output & user changes omitted for brevity and comments added for clarity). Reverting that commit restores normal operation (queueing) in failing scenarios; tested on linux-next (next-20151022). 1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM) $ cat sg_simple0.c ... see [3] ... $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c $ gcc sgio_inquiry.c -o sgio_inquiry 2) The ioctl() works fine with active paths present. # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw |-+- policy='service-time 0' prio=0 status=active | |- 8:0:11:0 sdz 65:144 active undef running | `- 9:0:9:0 sdbf 67:144 active undef running `-+- policy='service-time 0' prio=0 status=enabled |- 8:0:12:0 sdae 65:224 active undef running `- 9:0:12:0 sdbo 68:32 active undef running $ ./sgio_inquiry /dev/mapper/85ag56 Some of the INQUIRY command's response: IBM 2145 0000 INQUIRY duration=0 millisecs, resid=0 3) The ioctl() fails with ENOTTY errno with _no_ active paths present, for unprivileged users (rather than blocking due to queue_if_no_path). # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \ do multipathd -k"fail path $path"; done # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw |-+- policy='service-time 0' prio=0 status=enabled | |- 8:0:11:0 sdz 65:144 failed undef running | `- 9:0:9:0 sdbf 67:144 failed undef running `-+- policy='service-time 0' prio=0 status=enabled |- 8:0:12:0 sdae 65:224 failed undef running `- 9:0:12:0 sdbo 68:32 failed undef running $ ./sgio_inquiry /dev/mapper/85ag56 sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285); it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl(). $ dmesg <...> [] device-mapper: multipath: Failing path 65:144. [] device-mapper: multipath: Failing path 67:144. [] device-mapper: multipath: Failing path 65:224. [] device-mapper: multipath: Failing path 68:32. [] sgio_inquiry: sending ioctl 2285 to a partition! 5) The ioctl() only works if the SYS_CAP_RAWIO capability is present (then queueing happens -- in this example, queue_if_no_path is set); this is due to a conditional check in scsi_verify_blk_ioctl(). # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56' sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device # ./sgio_inquiry /dev/mapper/85ag56 & [1] 72830 # cat /proc/72830/stack [<c00000171c0df700>] 0xc00000171c0df700 [<c000000000015934>] __switch_to+0x204/0x350 [<c000000000152d4c>] msleep+0x5c/0x80 [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170 [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0 [<c0000000003128e4>] block_ioctl+0x64/0xd0 [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780 [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0 [<c000000000009358>] system_call+0x38/0xd0 6) This is the function call chain exercised in this analysis: SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c -> do_vfs_ioctl() -> vfs_ioctl() ... error = filp->f_op->unlocked_ioctl(filp, cmd, arg); ... -> dm_blk_ioctl() @ drivers/md/dm.c -> multipath_ioctl() @ drivers/md/dm-mpath.c ... (bdev = NULL, due to no active paths) ... if (!bdev || <...>) { int err = scsi_verify_blk_ioctl(NULL, cmd); if (err) r = err; } ... -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c ... if (bd && bd == bd->bd_contains) // not taken (bd = NULL) return 0; ... if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user) return 0; ... printk_ratelimited(KERN_WARNING "%s: sending ioctl %x to a partition!\n" <...>); return -ENOIOCTLCMD; <- ... return r ? : <...> <- ... if (error == -ENOIOCTLCMD) error = -ENOTTY; out: return error; ... Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03) Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
* dm: initialize non-blk-mq queue data before queue is usedMikulas Patocka2015-10-291-3/+7
| | | | | | | | | | | | | | | | | | | | | | | Commit bfebd1cdb497a57757c83f5fbf1a29931591e2a4 ("dm: add full blk-mq support to request-based DM") moves the initialization of the fields backing_dev_info.congested_fn, backing_dev_info.congested_data and queuedata from the function dm_init_md_queue (that is called when the device is created) to dm_init_old_md_queue (that is called after the device type is determined). There is no locking when accessing these variables, thus it is possible for other parts of the kernel to briefly see this data in a transient state (e.g. queue->backing_dev_info.congested_fn initialized and md->queue->backing_dev_info.congested_data uninitialized, resulting in passing an incorrect parameter to the function dm_any_congested). This queue data is left initialized for blk-mq devices even though they that don't use it. Fixes: bfebd1cdb497 ("dm: add full blk-mq support to request-based DM") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+
* Merge tag 'md/4.3-fixes' of git://neil.brown.name/mdLinus Torvalds2015-10-047-26/+28
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md fixes from Neil Brown: "Assorted fixes for md in 4.3-rc. Two tagged for -stable, and one is really a cleanup to match and improve kmemcache interface. * tag 'md/4.3-fixes' of git://neil.brown.name/md: md/bitmap: don't pass -1 to bitmap_storage_alloc. md/raid1: Avoid raid1 resync getting stuck md: drop null test before destroy functions md: clear CHANGE_PENDING in readonly array md/raid0: apply base queue limits *before* disk_stack_limits md/raid5: don't index beyond end of array in need_this_block(). raid5: update analysis state for failed stripe md: wait for pending superblock updates before switching to read-only
| * md/bitmap: don't pass -1 to bitmap_storage_alloc.NeilBrown2015-10-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | Passing -1 to bitmap_storage_alloc() causes page->index to be set to -1, which is quite problematic. So only pass ->cluster_slot if mddev_is_clustered(). Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>
| * md/raid1: Avoid raid1 resync getting stuckJes Sorensen2015-10-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | close_sync() needs to set conf->next_resync to a large, but safe value below MaxSector and use it to determine whether or not to set start_next_window in wait_barrier() Solution suggested by Neil Brown. Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * md: drop null test before destroy functionsJulia Lawall2015-10-024-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>
| * md: clear CHANGE_PENDING in readonly arrayShaohua Li2015-10-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If faulty disks of an array are more than allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO will be hold on a list till the bit is clear. But recovery nevery clears this bit, the IO is always in pending state and nevery finish. This has bad effects like upper layer can't get an IO error and the array can't be stopped. Fixes: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * md/raid0: apply base queue limits *before* disk_stack_limitsNeilBrown2015-10-021-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: 199dc6ed5179 ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed5179). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * md/raid5: don't index beyond end of array in need_this_block().NeilBrown2015-10-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | When need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] of fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid5: update analysis state for failed stripeShaohua Li2015-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | handle_failed_stripe() makes the stripe fail, eg, all IO will return with a failure, but it doesn't update stripe_head_state. Later handle_stripe() has special handling for raid6 for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe and we get a kernel crash in need_this_block. This patch clear the analysis state to make sure no functions wrongly called after handle_failed_stripe() Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * md: wait for pending superblock updates before switching to read-onlyNeilBrown2015-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
* | dm crypt: constrain crypt device's max_segment_size to PAGE_SIZEMike Snitzer2015-09-141-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an unfortunate constraint that is required to avoid the potential for exceeding dm-crypt's underlying device's max_segments limits -- due to crypt_alloc_buffer() possibly allocating pages for the encryption bio that are not as physically contiguous as the original bio. It is interesting to note that this problem was already fixed back in 2007 via commit 91e106259 ("dm crypt: use bio_add_page"). But Linux 4.0 commit cf2f1abfb ("dm crypt: don't allocate pages for a partial request") regressed dm-crypt back to _not_ using bio_add_page(). But given dm-crypt's cpu parallelization changes all depend on commit cf2f1abfb's abandoning of the more complex io fragments processing that dm-crypt previously had we cannot easily go back to using bio_add_page(). So all said the cleanest way to resolve this issue is to fix dm-crypt to properly constrain the original bios entering dm-crypt so the encryption bios that dm-crypt generates from the original bios are always compatible with the underlying device's max_segments queue limits. It should be noted that technically Linux 4.3 does _not_ need this fix because of the block core's new late bio-splitting capability. But, it is reasoned, there is little to be gained by having the block core split the encrypted bio that is composed of PAGE_SIZE segments. That said, in the future we may revert this change. Fixes: cf2f1abfb ("dm crypt: don't allocate pages for a partial request") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421 Suggested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+
* | dm thin: disable discard support for thin devices if pool's is disabledMike Snitzer2015-09-131-0/+4
|/ | | | | | | | | | If the pool is configured with 'ignore_discard' its discard support is disabled. The pool's thin devices should also have queue_limits that reflect discards are disabled. Fixes: 34fbcf62 ("dm thin: range discard support") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
* Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2015-09-112-22/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull second round of SCSI updates from James Bottomley: "There's one late arriving patch here (added today), fixing a build issue which the scsi_dh patch set in here uncovered. Other than that, everything has been incubated in -next and the checkers for a week. The major pieces of this patch are a set patches facilitating better integration between scsi and scsi_dh (the device handling layer used by multi-path; all the dm parts are acked by Mike Snitzer). This also includes driver updates for mp3sas, scsi_debug and an assortment of bug fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (50 commits) scsi_dh: fix randconfig build error scsi: fix scsi_error_handler vs. scsi_host_dev_release race fcoe: Convert use of __constant_htons to htons mpt2sas: setpci reset kernel oops fix pm80xx: Don't override ts->stat on IO_OPEN_CNX_ERROR_HW_RESOURCE_BUSY lpfc: Fix possible use-after-free and double free in lpfc_mbx_cmpl_rdp_page_a2() bfa: Fix incorrect de-reference of pointer bfa: Fix indentation scsi_transport_sas: Remove check for SAS expander when querying bay/enclosure IDs. scsi_debug: resp_request: remove unused variable scsi_debug: fix REPORT LUNS Well Known LU scsi_debug: schedule_resp fix input variable check scsi_debug: make dump_sector static scsi_debug: vfree is null safe so drop the check scsi_debug: use SCSI_W_LUN_REPORT_LUNS instead of SAM2_WLUN_REPORT_LUNS; scsi_debug: define pr_fmt() for consistent logging mpt2sas: Refcount fw_events and fix unsafe list usage mpt2sas: Refcount sas_device objects and fix unsafe list usage scsi_dh: return SCSI_DH_NOTCONN in scsi_dh_activate() scsi_dh: don't allow to detach device handlers at runtime ...
| * scsi_dh: fix randconfig build errorChristoph Hellwig2015-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | It looks like the Kconfig check that was meant to fix this (commit fe9233fb6914a0eb20166c967e3020f7f0fba2c9 [SCSI] scsi_dh: fix kconfig related build errors) was actually reversed, but no-one noticed until the new set of patches which separated DM and SCSI_DH). Fixes: fe9233fb6914a0eb20166c967e3020f7f0fba2c9 Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>
| * dm-mpath, scsi_dh: request scsi_dh modules in scsi_dh, not dm-mpathChristoph Hellwig2015-08-281-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | This way we can reused the same code any attachment method, not just those requested from dm-mpath. [jejb: fixup checkpatch error] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>
| * dm-mpath, scsi_dh: don't let dm detach device handlersChristoph Hellwig2015-08-281-15/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | While allowing dm-mpath to attach device handlers is a functionality we need for backwards compatibility reason there is no reason to reference count them and detach them if dm-mpath stops using the device for some reason. If the device handler works for the given device it can just stay attached, and we can take the retain_hw_handler codepath. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@Suse.de> Signed-off-by: James Bottomley <JBottomley@Odin.com>
* | Merge tag 'md/4.3' of git://neil.brown.name/mdLinus Torvalds2015-09-059-190/+373
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from Neil Brown: - an assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. * tag 'md/4.3' of git://neil.brown.name/md: (33 commits) md/raid5: ensure device failure recorded before write request returns. md/raid5: use bio_list for the list of bios to return. md/raid10: ensure device failure recorded before write request returns. md/raid1: ensure device failure recorded before write request returns. md-cluster: remove inappropriate try_module_get from join() md: extend spinlock protection in register_md_cluster_operations md-cluster: Read the disk bitmap sb and check if it needs recovery md-cluster: only call complete(&cinfo->completion) when node join cluster md-cluster: add missed lockres_free md-cluster: remove the unused sb_lock md-cluster: init suspend_list and suspend_lock early in join md-cluster: add the error check if failed to get dlm lock md-cluster: init completion within lockres_init md-cluster: fix deadlock issue on message lock md-cluster: transfer the resync ownership to another node md-cluster: split recover_slot for future code reuse md-cluster: use %pU to print UUIDs md: setup safemode_timer before it's being used md/raid5: handle possible race as reshape completes. md: sync sync_completed has correct value as recovery finishes. ...
| * \ Merge linux-block/for-4.3/core into md/for-linuxNeilBrown2015-09-0548-1163/+363
| |\ \ | | | | | | | | | | | | | | | | | | | | There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: ensure device failure recorded before write request returns.NeilBrown2015-08-312-1/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed when MD_CHANGE_PENDING is set to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: use bio_list for the list of bios to return.NeilBrown2015-08-312-27/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This will make it easier to splice two lists together which will be needed in future patch. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid10: ensure device failure recorded before write request returns.NeilBrown2015-08-312-1/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid1: ensure device failure recorded before write request returns.NeilBrown2015-08-313-1/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: remove inappropriate try_module_get from join()NeilBrown2015-08-311-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in error patch), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md: extend spinlock protection in register_md_cluster_operationsNeilBrown2015-08-311-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code looks racy. The only possible race is if two modules try to register at the same time and that won't happen. But make the code look safe anyway. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: Read the disk bitmap sb and check if it needs recoveryGuoqing Jiang2015-08-311-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: only call complete(&cinfo->completion) when node join clusterGuoqing Jiang2015-08-311-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only be invoked when node join cluster. Otherwise node failure could also call the complete, and it doesn't make sense to do it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: add missed lockres_freeGuoqing Jiang2015-08-311-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We also need to free the lock resource before goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: remove the unused sb_lockGuoqing Jiang2015-08-311-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: init suspend_list and suspend_lock early in joinGuoqing Jiang2015-08-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the node just join the cluster, and receive the msg from other nodes before init suspend_list, it will cause kernel crash due to NULL pointer dereference, so move the initializations early to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: add the error check if failed to get dlm lockGuoqing Jiang2015-08-311-6/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In complicated cluster environment, it is possible that the dlm lock couldn't be get/convert on purpose, the related err info is added for better debug potential issue. For lockres_free, if the lock is blocking by a lock request or conversion request, then dlm_unlock just put it back to grant queue, so need to ensure the lock is free finally. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: init completion within lockres_initGuoqing Jiang2015-08-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should init completion within lockres_init, otherwise completion could be initialized more than one time during it's life cycle. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: fix deadlock issue on message lockGuoqing Jiang2015-08-311-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is problem with previous communication mechanism, and we got below deadlock scenario with cluster which has 3 nodes. Sender Receiver Receiver token(EX) message(EX) writes message downconverts message(CR) requests ack(EX) get message(CR) gets message(CR) reads message reads message requests EX on message requests EX on message To fix this problem, we do the following changes: 1. the sender downconverts MESSAGE to CW rather than CR. 2. and the receiver request PR lock not EX lock on message. And in case we failed to down-convert EX to CW on message, it is better to unlock message otherthan still hold the lock. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Lidong Zhong <ldzhong@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: transfer the resync ownership to another nodeGuoqing Jiang2015-08-312-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC message to the cluster. And the node B which received that message will invoke __recover_slot to do resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: split recover_slot for future code reuseGuoqing Jiang2015-08-311-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make recover_slot as a wraper to __recover_slot, since the logic of __recover_slot can be reused for the condition when other nodes need to take over the resync job. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md-cluster: use %pU to print UUIDsGuoqing Jiang2015-08-311-14/+2
| | | | | | | | | | | | | | | | | | | | | | | | Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md: setup safemode_timer before it's being usedSasha Levin2015-08-311-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used to set up the safemode_timer timer in md_run. If md_run would fail before the timer was set up we'd end up trying to modify a timer that doesn't have a callback function when we access safe_delay_store, which would trigger a BUG. neilb: delete init_timer() call as setup_timer() does that. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: handle possible race as reshape completes.NeilBrown2015-08-311-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible (though unlikely) for a reshape to be interrupted between the time that end_reshape is called and the time when raid5_finish_reshape is called. This can leave conf->reshape_progress set to MaxSector, but mddev->reshape_position not. This combination confused reshape_request() when ->reshape_backwards. As conf->reshape_progress is so high, it seems the reshape hasn't really begun. But assuming MaxSector is a valid address only leads to sorrow. So ensure reshape_position and reshape_progress both agree, and add an extra check in reshape_request() just in case they don't. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md: sync sync_completed has correct value as recovery finishes.NeilBrown2015-08-311-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There can be a small window between the moment that recovery actually writes the last block and the time when various sysfs and /proc/mdstat attributes report that it has finished. During this time, 'sync_completed' can have the wrong value. This can confuse monitoring software. So: - don't set curr_resync_completed beyond the end of the devices, - set it correctly when resync/recovery has completed. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md: be careful when testing resync_max against curr_resync_completed.NeilBrown2015-08-312-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While it generally shouldn't happen, it is not impossible for curr_resync_completed to exceed resync_max. This can particularly happen when reshaping RAID5 - the current status isn't copied to curr_resync_completed promptly, so when it is, it can exceed resync_max. This happens when the reshape is 'frozen', resync_max is set low, and reshape is re-enabled. Taking a difference between two unsigned numbers is always dangerous anyway, so add a test to behave correctly if curr_resync_completed > resync_max Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md: set MD_RECOVERY_RECOVER when starting a degraded array.NeilBrown2015-08-311-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This ensures that 'sync_action' will show 'recover' immediately the array is started. If there is no spare the status will change to 'idle' once that is detected. Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change happens. This allows scripts which monitor status not to get confused - particularly my test scripts. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: remove incorrect "min_t()" when calculating writepos.NeilBrown2015-08-311-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code is calculating: writepos, which is the furthest along address (device-space) that we *will* be writing to readpos, which is the earliest address that we *could* possible read from, and safepos, which is the earliest address in the 'old' section that we might read from after a crash when the reshape position is recovered from metadata. The first is a precise calculation, so clipping at zero doesn't make sense. As the reshape position is now guaranteed to always be a multiple of reshape_sectors and as we already BUG_ON when reshape_progress is zero, there is no point in this min_t() call. The readpos and safepos are worst case - actual value depends on precise geometry. That worst case could be negative, which is only a problem because we are storing the value in an unsigned. So leave the min_t() for those. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: strengthen check on reshape_position at run.NeilBrown2015-08-311-15/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reshaping, we work in units of the largest chunk size. If changing from a larger to a smaller chunk size, that means we reshape more than one stripe at a time. So the required alignment of reshape_position needs to take into account both the old and new chunk size. This means that both 'here_new' and 'here_old' are calculated with respect to the same (maximum) chunk size, so testing if they are the same when delta_disks is zero becomes pointless. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors ↵NeilBrown2015-08-311-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | where possible The chunk_sectors and new_chunk_sectors fields of mddev can be changed any time (via sysfs) that the reconfig mutex can be taken. So raid5 keeps internal copies in 'conf' which are stable except for a short locked moment when reshape stops/starts. So any access that does not hold reconfig_mutex should use the 'conf' values, not the 'mddev' values. Several don't. This could result in corruption if new values were written at awkward times. Also use min() or max() rather than open-coding. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: always set conf->prev_chunk_sectors and ->prev_algoNeilBrown2015-08-311-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These aren't really needed when no reshape is happening, but it is safer to have them always set to a meaningful value. The next patch will use ->prev_chunk_sectors without checking if a reshape is happening (because that makes the code simpler), and this patch makes that safe. Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid10: fix a few typos in commentsNeilBrown2015-08-311-2/+2
| | | | | | | | | | | | | | | | Signed-off-by: NeilBrown <neilb@suse.com>
| * | | md/raid5: consider updating reshape_position at start of reshape.NeilBrown2015-08-311-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md/raid5 only updates ->reshape_position (which is stored in metadata and is authoritative) occasionally, but particularly when getting closed to ->resync_max as it must be correct when ->resync_max is reached. When mdadm tries to stop an array which is reshaping it will: - freeze the reshape, - set resync_max to where the reshape has reached. - unfreeze the reshape. When this happens, the reshape is aborted and then restarted. The restart doesn't check that resync_max is close, and so doesn't update ->reshape_position like it should. This results in the reshape stopping, but ->reshape_position being incorrect. So on that first call to reshape_request, make sure ->reshape_position is updated if needed. Signed-off-by: NeilBrown <neilb@suse.com>