summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/pci.c
Commit message (Collapse)AuthorAgeFilesLines
* nvme: provide optimized poll function for separate poll queuesJens Axboe2018-11-161-8/+37
| | | | | | | | | | If we have separate poll queues, we know that they aren't using interrupts. Hence we don't need to disable interrupts around finding completions. Provide a separate set of blk_mq_ops for such devices. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: fix handling of EINVAL on pci_alloc_irq_vectors_affinity()Jens Axboe2018-11-151-9/+11
| | | | | | | | | | | | | | At least on SPARC, if MSI/MSI-X isn't supported, we get EINVAL if we ask for more than one vector. This isn't covered by our ENOSPC check. If we get EINVAL, decrease our ask to just one vector, instead of bailing out in error. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: fix boot hang with only being able to get one IRQ vectorJens Axboe2018-11-141-3/+10
| | | | | | | | | | | | | | | NVMe always asks for io_queues + 1 worth of IRQ vectors, which means that even when we scale all the way down, we still ask for 2 vectors and get -ENOSPC in return if the system can't support more than 1. Getting just 1 vector is fine, it just means that we'll have 1 IO queue and 1 admin queue, with a shared vector between them. Check for this case and don't add our + 1 if it happens. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: add separate poll queue mapJens Axboe2018-11-071-17/+80
| | | | | | | | | | | | Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: utilize two queue maps, one for reads and one for writesJens Axboe2018-11-071-19/+181
| | | | | | | | | | | | | | | | | | | | | NVMe does round-robin between queues by default, which means that sharing a queue map for both reads and writes can be problematic in terms of read servicing. It's much easier to flood the queue with writes and reduce the read servicing. Implement two queue maps, one for reads and one for writes. The write queue count is configurable through the 'write_queues' parameter. By default, we retain the previous behavior of having a single queue set, shared between reads and writes. Setting 'write_queues' to a non-zero value will create two queue sets, one for reads and one for writes, the latter using the configurable number of queues (hardware queue counts permitting). Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: abstract out queue mapJens Axboe2018-11-071-1/+1
| | | | | | | | | | This is in preparation for allowing multiple sets of maps per queue, if so desired. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: fix conflicting p2p resource addsKeith Busch2018-11-021-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nvme pci driver had been adding its CMB resource to the P2P DMA subsystem everytime on on a controller reset. This results in the following warning: ------------[ cut here ]------------ nvme 0000:00:03.0: Conflicting mapping in same section WARNING: CPU: 7 PID: 81 at kernel/memremap.c:155 devm_memremap_pages+0xa6/0x380 ... Call Trace: pci_p2pdma_add_resource+0x153/0x370 nvme_reset_work+0x28c/0x17b1 [nvme] ? add_timer+0x107/0x1e0 ? dequeue_entity+0x81/0x660 ? dequeue_entity+0x3b0/0x660 ? pick_next_task_fair+0xaf/0x610 ? __switch_to+0xbc/0x410 process_one_work+0x1cf/0x350 worker_thread+0x215/0x3d0 ? process_one_work+0x350/0x350 kthread+0x107/0x120 ? kthread_park+0x80/0x80 ret_from_fork+0x1f/0x30 ---[ end trace f7ea76ac6ee72727 ]--- nvme nvme0: failed to register the CMB This patch fixes this by registering the CMB with P2P only once. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'pci-v4.20-changes' of ↵Linus Torvalds2018-10-251-40/+58
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - Fix ASPM link_state teardown on removal (Lukas Wunner) - Fix misleading _OSC ASPM message (Sinan Kaya) - Make _OSC optional for PCI (Sinan Kaya) - Don't initialize ASPM link state when ACPI_FADT_NO_ASPM is set (Patrick Talbert) - Remove x86 and arm64 node-local allocation for host bridge structures (Punit Agrawal) - Pay attention to device-specific _PXM node values (Jonathan Cameron) - Support new Immediate Readiness bit (Felipe Balbi) - Differentiate between pciehp surprise and safe removal (Lukas Wunner) - Remove unnecessary pciehp includes (Lukas Wunner) - Drop pciehp hotplug_slot_ops wrappers (Lukas Wunner) - Tolerate PCIe Slot Presence Detect being hardwired to zero to workaround broken hardware, e.g., the Wilocity switch/wireless device (Lukas Wunner) - Unify pciehp controller & slot structs (Lukas Wunner) - Constify hotplug_slot_ops (Lukas Wunner) - Drop hotplug_slot_info (Lukas Wunner) - Embed hotplug_slot struct into users instead of allocating it separately (Lukas Wunner) - Initialize PCIe port service drivers directly instead of relying on initcall ordering (Keith Busch) - Restore PCI config state after a slot reset (Keith Busch) - Save/restore DPC config state along with other PCI config state (Keith Busch) - Reference count devices during AER handling to avoid race issue with concurrent hot removal (Keith Busch) - If an Upstream Port reports ERR_FATAL, don't try to read the Port's config space because it is probably unreachable (Keith Busch) - During error handling, use slot-specific reset instead of secondary bus reset to avoid link up/down issues on hotplug ports (Keith Busch) - Restore previous AER/DPC handling that does not remove and re-enumerate devices on ERR_FATAL (Keith Busch) - Notify all drivers that may be affected by error recovery resets (Keith Busch) - Always generate error recovery uevents, even if a driver doesn't have error callbacks (Keith Busch) - Make PCIe link active reporting detection generic (Keith Busch) - Support D3cold in PCIe hierarchies during system sleep and runtime, including hotplug and Thunderbolt ports (Mika Westerberg) - Handle hpmemsize/hpiosize kernel parameters uniformly, whether slots are empty or occupied (Jon Derrick) - Remove duplicated include from pci/pcie/err.c and unused variable from cpqphp (YueHaibing) - Remove driver pci_cleanup_aer_uncorrect_error_status() calls (Oza Pawandeep) - Uninline PCI bus accessors for better ftracing (Keith Busch) - Remove unused AER Root Port .error_resume method (Keith Busch) - Use kfifo in AER instead of a local version (Keith Busch) - Use threaded IRQ in AER bottom half (Keith Busch) - Use managed resources in AER core (Keith Busch) - Reuse pcie_port_find_device() for AER injection (Keith Busch) - Abstract AER interrupt handling to disconnect error injection (Keith Busch) - Refactor AER injection callbacks to simplify future improvments (Keith Busch) - Remove unused Netronome NFP32xx Device IDs (Jakub Kicinski) - Use bitmap_zalloc() for dma_alias_mask (Andy Shevchenko) - Add switch fall-through annotations (Gustavo A. R. Silva) - Remove unused Switchtec quirk variable (Joshua Abraham) - Fix pci.c kernel-doc warning (Randy Dunlap) - Remove trivial PCI wrappers for DMA APIs (Christoph Hellwig) - Add Intel GPU device IDs to spurious interrupt quirk (Bin Meng) - Run Switchtec DMA aliasing quirk only on NTB endpoints to avoid useless dmesg errors (Logan Gunthorpe) - Update Switchtec NTB documentation (Wesley Yung) - Remove redundant "default n" from Kconfig (Bartlomiej Zolnierkiewicz) - Avoid panic when drivers enable MSI/MSI-X twice (Tonghao Zhang) - Add PCI support for peer-to-peer DMA (Logan Gunthorpe) - Add sysfs group for PCI peer-to-peer memory statistics (Logan Gunthorpe) - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan Gunthorpe) - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan Gunthorpe) - Add PCI peer-to-peer DMA driver writer's documentation (Logan Gunthorpe) - Add block layer flag to indicate driver support for PCI peer-to-peer DMA (Logan Gunthorpe) - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P memory (Logan Gunthorpe) - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan Gunthorpe) - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan Gunthorpe) - Use PCI peer-to-peer memory in nvme (Stephen Bates, Steve Wise, Christoph Hellwig, Logan Gunthorpe) - Cache VF config space size to optimize enumeration of many VFs (KarimAllah Ahmed) - Remove unnecessary <linux/pci-ats.h> include (Bjorn Helgaas) - Fix VMD AERSID quirk Device ID matching (Jon Derrick) - Fix Cadence PHY handling during probe (Alan Douglas) - Signal Cadence Endpoint interrupts via AXI region 0 instead of last region (Alan Douglas) - Write Cadence Endpoint MSI interrupts with 32 bits of data (Alan Douglas) - Remove redundant controller tests for "device_type == pci" (Rob Herring) - Document R-Car E3 (R8A77990) bindings (Tho Vu) - Add device tree support for R-Car r8a7744 (Biju Das) - Drop unused mvebu PCIe capability code (Thomas Petazzoni) - Add shared PCI bridge emulation code (Thomas Petazzoni) - Convert mvebu to use shared PCI bridge emulation (Thomas Petazzoni) - Add aardvark Root Port emulation (Thomas Petazzoni) - Support 100MHz/200MHz refclocks for i.MX6 (Lucas Stach) - Add initial power management for i.MX7 (Leonard Crestez) - Add PME_Turn_Off support for i.MX7 (Leonard Crestez) - Fix qcom runtime power management error handling (Bjorn Andersson) - Update TI dra7xx unaligned access errata workaround for host mode as well as endpoint mode (Vignesh R) - Fix kirin section mismatch warning (Nathan Chancellor) - Remove iproc PAXC slot check to allow VF support (Jitendra Bhivare) - Quirk Keystone K2G to limit MRRS to 256 (Kishon Vijay Abraham I) - Update Keystone to use MRRS quirk for host bridge instead of open coding (Kishon Vijay Abraham I) - Refactor Keystone link establishment (Kishon Vijay Abraham I) - Simplify and speed up Keystone link training (Kishon Vijay Abraham I) - Remove unused Keystone host_init argument (Kishon Vijay Abraham I) - Merge Keystone driver files into one (Kishon Vijay Abraham I) - Remove redundant Keystone platform_set_drvdata() (Kishon Vijay Abraham I) - Rename Keystone functions for uniformity (Kishon Vijay Abraham I) - Add Keystone device control module DT binding (Kishon Vijay Abraham I) - Use SYSCON API to get Keystone control module device IDs (Kishon Vijay Abraham I) - Clean up Keystone PHY handling (Kishon Vijay Abraham I) - Use runtime PM APIs to enable Keystone clock (Kishon Vijay Abraham I) - Clean up Keystone config space access checks (Kishon Vijay Abraham I) - Get Keystone outbound window count from DT (Kishon Vijay Abraham I) - Clean up Keystone outbound window configuration (Kishon Vijay Abraham I) - Clean up Keystone DBI setup (Kishon Vijay Abraham I) - Clean up Keystone ks_pcie_link_up() (Kishon Vijay Abraham I) - Fix Keystone IRQ status checking (Kishon Vijay Abraham I) - Add debug messages for all Keystone errors (Kishon Vijay Abraham I) - Clean up Keystone includes and macros (Kishon Vijay Abraham I) - Fix Mediatek unchecked return value from devm_pci_remap_iospace() (Gustavo A. R. Silva) - Fix Mediatek endpoint/port matching logic (Honghui Zhang) - Change Mediatek Root Port Class Code to PCI_CLASS_BRIDGE_PCI (Honghui Zhang) - Remove redundant Mediatek PM domain check (Honghui Zhang) - Convert Mediatek to pci_host_probe() (Honghui Zhang) - Fix Mediatek MSI enablement (Honghui Zhang) - Add Mediatek system PM support for MT2712 and MT7622 (Honghui Zhang) - Add Mediatek loadable module support (Honghui Zhang) - Detach VMD resources after stopping root bus to prevent orphan resources (Jon Derrick) - Convert pcitest build process to that used by other tools (iio, perf, etc) (Gustavo Pimentel) * tag 'pci-v4.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits) PCI/AER: Refactor error injection fallbacks PCI/AER: Abstract AER interrupt handling PCI/AER: Reuse existing pcie_port_find_device() interface PCI/AER: Use managed resource allocations PCI: pcie: Remove redundant 'default n' from Kconfig PCI: aardvark: Implement emulated root PCI bridge config space PCI: mvebu: Convert to PCI emulated bridge config space PCI: mvebu: Drop unused PCI express capability code PCI: Introduce PCI bridge emulated config space common logic PCI: vmd: Detach resources after stopping root bus nvmet: Optionally use PCI P2P memory nvmet: Introduce helper functions to allocate and free request SGLs nvme-pci: Add support for P2P memory in requests nvme-pci: Use PCI p2pmem subsystem to manage the CMB IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]() block: Add PCI P2P flag for request queue PCI/P2PDMA: Add P2P DMA driver writer's documentation docs-rst: Add a new directory for PCI documentation PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset ...
| * Merge branch 'pci/peer-to-peer'Bjorn Helgaas2018-10-201-39/+58
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Add PCI support for peer-to-peer DMA (Logan Gunthorpe) - Add sysfs group for PCI peer-to-peer memory statistics (Logan Gunthorpe) - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan Gunthorpe) - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan Gunthorpe) - Add PCI peer-to-peer DMA driver writer's documentation (Logan Gunthorpe) - Add block layer flag to indicate driver support for PCI peer-to-peer DMA (Logan Gunthorpe) - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P memory (Logan Gunthorpe) - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan Gunthorpe) - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan Gunthorpe) - Use PCI peer-to-peer memory in nvme (Stephen Bates, Steve Wise, Christoph Hellwig, Logan Gunthorpe) * pci/peer-to-peer: nvmet: Optionally use PCI P2P memory nvmet: Introduce helper functions to allocate and free request SGLs nvme-pci: Add support for P2P memory in requests nvme-pci: Use PCI p2pmem subsystem to manage the CMB IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]() block: Add PCI P2P flag for request queue PCI/P2PDMA: Add P2P DMA driver writer's documentation docs-rst: Add a new directory for PCI documentation PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset PCI/P2PDMA: Add sysfs group to display p2pmem stats PCI/P2PDMA: Support peer-to-peer memory
| | * nvme-pci: Add support for P2P memory in requestsLogan Gunthorpe2018-10-171-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For P2P requests, we must use the pci_p2pmem_map_sg() function instead of the dma_map_sg functions. With that, we can then indicate PCI_P2P support in the request queue. For this, we create an NVME_F_PCI_P2P flag which tells the core to set QUEUE_FLAG_PCI_P2P in the request queue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
| | * nvme-pci: Use PCI p2pmem subsystem to manage the CMBLogan Gunthorpe2018-10-171-35/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register the CMB buffer as p2pmem and use the appropriate allocation functions to create and destroy the IO submission queues. If the CMB supports WDS and RDS, publish it for use as P2P memory by other devices. Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB. However, seeing the main use-cases for the CMB is P2P operations, this seems like a reasonable dependency. We drop the __iomem safety on the buffer seeing that, by convention, it's safe to directly access memory mapped by memremap()/devm_memremap_pages(). Architectures where this is not safe will not be supported by memremap() and therefore will not support PCI P2P and have no support for CMB. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | PCI/AER: Remove pci_cleanup_aer_uncorrect_error_status() callsOza Pawandeep2018-10-021-1/+0
| |/ | | | | | | | | | | | | | | | | | | | | | | After bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all affected devices"), AER errors are always cleared by the PCI core and drivers don't need to do it themselves. Remove calls to pci_cleanup_aer_uncorrect_error_status() from device driver error recovery functions. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog, remove PCI core changes, remove unused variables] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* | nvme-pci: remove duplicate checkChaitanya Kulkarni2018-10-181-2/+2
| | | | | | | | | | | | | | | | | | This is a cleanup patch doesn't change any functionality. It removes the duplicate call to the blk_integrity_rq() in the nvme_map_data(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme-pci: fix hot removal during error handlingKeith Busch2018-10-171-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A removal waits for the reset_work to complete. If a surprise removal occurs around the same time as an error triggered controller reset, and reset work happened to dispatch a command to the removed controller, the command won't be recovered since the timeout work doesn't do anything during error recovery. We wouldn't want to wait for timeout handling anyway, so this patch fixes this by disabling the controller and killing admin queues prior to syncing with the reset_work. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme-pci: fix nvme_suspend_queue() kernel-doc headerBart Van Assche2018-10-171-1/+1
|/ | | | | | | | | This patch avoids that the kernel-doc tool complains about the nvme_suspend_queue() function header when building with W=1. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_eventMichal Wnukowski2018-08-281-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In many architectures loads may be reordered with older stores to different locations. In the nvme driver the following two operations could be reordered: - Write shadow doorbell (dbbuf_db) into memory. - Read EventIdx (dbbuf_ei) from memory. This can result in a potential race condition between driver and VM host processing requests (if given virtual NVMe controller has a support for shadow doorbell). If that occurs, then the NVMe controller may decide to wait for MMIO doorbell from guest operating system, and guest driver may decide not to issue MMIO doorbell on any of subsequent commands. This issue is purely timing-dependent one, so there is no easy way to reproduce it. Currently the easiest known approach is to run "Oracle IO Numbers" (orion) that is shipped with Oracle DB: orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \ concat -write 40 -duration 120 -matrix row -testname nvme_test Where nvme_test is a .lun file that contains a list of NVMe block devices to run test against. Limiting number of vCPUs assigned to given VM instance seems to increase chances for this bug to occur. On test environment with VM that got 4 NVMe drives and 1 vCPU assigned the virtual NVMe controller hang could be observed within 10-20 minutes. That correspond to about 400-500k IO operations processed (or about 100GB of IO read/writes). Orion tool was used as a validation and set to run in a loop for 36 hours (equivalent of pushing 550M IO operations). No issues were observed. That suggest that the patch fixes the issue. Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices") Signed-off-by: Michal Wnukowski <wnukowski@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: updated changelog and comment a bit] Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe2018-08-051-5/+7
|\ | | | | | | | | | | | | Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: fix memory leak on probe failureKeith Busch2018-07-121-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nvme driver specific structures need to be initialized prior to enabling the generic controller so we can unwind on failure with out using the reference counting callbacks so that 'probe' and 'remove' can be symmetric. The newly added iod_mempool is the only resource that was being allocated out of order, and a failure there would leak the generic controller memory. This patch just moves that allocation above the controller initialization. Fixes: 943e942e6266f ("nvme-pci: limit max IO size and segments to avoid high order allocations") Reported-by: Weiping Zhang <zwp10758@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme: use blk API to remap ref tags for IOs with metadataMax Gurtovoy2018-07-301-74/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support). Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nvme: cache struct nvme_ctrl reference to struct nvme_requestSagi Grimberg2018-07-231-0/+2
|/ | | | | | | | | We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: limit max IO size and segments to avoid high order allocationsJens Axboe2018-06-211-5/+37
| | | | | | | | | | | | | | | | | | | nvme requires an sg table allocation for each request. If the request is large, then the allocation can become quite large. For instance, with our default software settings of 1280KB IO size, we'll need 10248 bytes of sg table. That turns into a 2nd order allocation, which we can't always guarantee. If we fail the allocation, blk-mq will retry it later. But there's no guarantee that we'll EVER be able to allocate that much contigious memory. Limit the IO size such that we never need more than a single page of memory. That's a lot faster and more reliable. Then back that allocation with a mempool, so that we know we'll always be able to succeed the allocation at some point. Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrlJianchao Wang2018-06-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is race between nvme_remove and nvme_reset_work that can lead to io hang. nvme_remove nvme_reset_work -> nvme_remove_dead_ctrl -> nvme_dev_disable -> quiesce request_queue -> queue remove_work -> cancel_work_sync reset_work -> nvme_remove_namespaces -> splice ctrl->namespaces nvme_remove_dead_ctrl_work -> nvme_kill_queues -> nvme_ns_remove do nothing -> blk_cleanup_queue -> blk_freeze_queue Finally, the request_queue is quiesced state when wait freeze, we will get io hang here. To fix it, move the nvme_kill_queues from nvme_remove_dead_ctrl_work to nvme_remove_dead_ctrl. Suggested-by: Keith Busch <keith.busch@linux.intel.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge tag 'for-linus-20180608' of git://git.kernel.dk/linux-blockLinus Torvalds2018-06-081-25/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A few fixes for this merge window, where some of them should go in sooner rather than later, hence a new pull this week. This pull request contains: - Set of NVMe fixes, mostly follow up cleanups/fixes to the queue changes, but also teardown/removal and misc changes (Christop/Dan/ Johannes/Sagi/Steve). - Two lightnvm fixes for issues that showed up in this window (Colin/Wei). - Failfast/driver flags inheritance for flush requests (Hannes). - The md device put sanitization and fix (Kent). - dm bio_set inheritance fix (me). - nbd discard granularity fix (Josef). - nbd consistency in command printing (Kevin). - Loop recursion validation fix (Ted). - Partition overlap check (Wang)" [ .. and now my build is warning-free again thanks to the md fix - Linus ] * tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits) nvme: cleanup double shift issue nvme-pci: make CMB SQ mod-param read-only nvme-pci: unquiesce dead controller queues nvme-pci: remove HMB teardown on reset nvme-pci: queue creation fixes nvme-pci: remove unnecessary completion doorbell check nvme-pci: remove unnecessary nested locking nvmet: filter newlines from user input nvme-rdma: correctly check for target keyed sgl support nvme: don't hold nvmf_transports_rwsem for more than transport lookups nvmet: return all zeroed buffer when we can't find an active namespace md: Unify mddev destruction paths dm: use bioset_init_from_src() to copy bio_set block: add bioset_init_from_src() helper block: always set partition number to '0' in blk_partition_remap() block: pass failfast and driver-specific flags to flush requests nbd: set discard_alignment to the granularity nbd: Consistently use request pointer in debug messages. block: add verifier for cmdline partition lightnvm: pblk: fix resource leak of invalid_bitmap ...
| * nvme-pci: make CMB SQ mod-param read-onlyKeith Busch2018-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A controller reset after a run time change of the CMB module parameter breaks the driver. An 'on -> off' will have the driver use NULL for the host memory queue, and 'off -> on' will use mismatched queue depth between the device and the host. We could fix both, but there isn't really a good reason to change this at run time anyway, compared to at module load time, so this patch makes parameter read-only after after modprobe. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: unquiesce dead controller queuesKeith Busch2018-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch ensures the nvme namsepace request queues are not quiesced on a surprise removal. It's possible the queues were previously killed in a failed reset, so the queues need to be unquiesced to ensure all requests are flushed to completion. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: remove HMB teardown on resetKeith Busch2018-06-081-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | The controller is required to disable its host memory buffer use on controller reset. We don't need to submit an admin command to delete it, so this patch skips sending that command so we don't need to worry about handling a timeout. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: queue creation fixesKeith Busch2018-06-081-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've been ignoring NVMe error status on queue creations. Fortunately they are uncommon, but we should handle these anyway. This patch adds checks for the a positive error return value that indicates an NVMe status. If we do see a negative return, the controller isn't usable, so this patch returns immediately in since we can't unwind that failure. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: remove unnecessary completion doorbell checkKeith Busch2018-06-081-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | The nvme pci driver never unmaps the doorbell registers while the requests are active, so we can always safely update the completion queue head. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-pci: remove unnecessary nested lockingKeith Busch2018-06-081-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | The nvme pci driver no longer handles completions under the cq lock, so the nested locking is not necessary. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'pci-v4.18-changes' of ↵Linus Torvalds2018-06-071-19/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - unify AER decoding for native and ACPI CPER sources (Alexandru Gagniuc) - add TLP header info to AER tracepoint (Thomas Tai) - add generic pcie_wait_for_link() interface (Oza Pawandeep) - handle AER ERR_FATAL by removing and re-enumerating devices, as Downstream Port Containment does (Oza Pawandeep) - factor out common code between AER and DPC recovery (Oza Pawandeep) - stop triggering DPC for ERR_NONFATAL errors (Oza Pawandeep) - share ERR_FATAL recovery path between AER and DPC (Oza Pawandeep) - disable ASPM L1.2 substate if we don't have LTR (Bjorn Helgaas) - respect platform ownership of LTR (Bjorn Helgaas) - clear interrupt status in top half to avoid interrupt storm (Oza Pawandeep) - neaten pci=earlydump output (Andy Shevchenko) - avoid errors when extended config space inaccessible (Gilles Buloz) - prevent sysfs disable of device while driver attached (Christoph Hellwig) - use core interface to report PCIe link properties in bnx2x, bnxt_en, cxgb4, ixgbe (Bjorn Helgaas) - remove unused pcie_get_minimum_link() (Bjorn Helgaas) - fix use-before-set error in ibmphp (Dan Carpenter) - fix pciehp timeouts caused by Command Completed errata (Bjorn Helgaas) - fix refcounting in pnv_php hotplug (Julia Lawall) - clear pciehp Presence Detect and Data Link Layer Status Changed on resume so we don't miss hotplug events (Mika Westerberg) - only request pciehp control if we support it, so platform can use ACPI hotplug otherwise (Mika Westerberg) - convert SHPC to be builtin only (Mika Westerberg) - request SHPC control via _OSC if we support it (Mika Westerberg) - simplify SHPC handoff from firmware (Mika Westerberg) - fix an SHPC quirk that mistakenly included *all* AMD bridges as well as devices from any vendor with device ID 0x7458 (Bjorn Helgaas) - assign a bus number even to non-native hotplug bridges to leave space for acpiphp additions, to fix a common Thunderbolt xHCI hot-add failure (Mika Westerberg) - keep acpiphp from scanning native hotplug bridges, to fix common Thunderbolt hot-add failures (Mika Westerberg) - improve "partially hidden behind bridge" messages from core (Mika Westerberg) - add macros for PCIe Link Control 2 register (Frederick Lawler) - replace IB/hfi1 custom macros with PCI core versions (Frederick Lawler) - remove dead microblaze and xtensa code (Bjorn Helgaas) - use dev_printk() when possible in xtensa and mips (Bjorn Helgaas) - remove unused pcie_port_acpi_setup() and portdrv_acpi.c (Bjorn Helgaas) - add managed interface to get PCI host bridge resources from OF (Jan Kiszka) - add support for unbinding generic PCI host controller (Jan Kiszka) - fix memory leaks when unbinding generic PCI host controller (Jan Kiszka) - request legacy VGA framebuffer only for VGA devices to avoid false device conflicts (Bjorn Helgaas) - turn on PCI_COMMAND_IO & PCI_COMMAND_MEMORY in pci_enable_device() like everybody else, not in pcibios_fixup_bus() (Bjorn Helgaas) - add generic enable function for simple SR-IOV hardware (Alexander Duyck) - use generic SR-IOV enable for ena, nvme (Alexander Duyck) - add ACS quirk for Intel 7th & 8th Gen mobile (Alex Williamson) - add ACS quirk for Intel 300 series (Mika Westerberg) - enable register clock for Armada 7K/8K (Gregory CLEMENT) - reduce Keystone "link already up" log level (Fabio Estevam) - move private DT functions to drivers/pci/ (Rob Herring) - factor out dwc CONFIG_PCI Kconfig dependencies (Rob Herring) - add DesignWare support to the endpoint test driver (Gustavo Pimentel) - add DesignWare support for endpoint mode (Gustavo Pimentel) - use devm_ioremap_resource() instead of devm_ioremap() in dra7xx and artpec6 (Gustavo Pimentel) - fix Qualcomm bitwise NOT issue (Dan Carpenter) - add Qualcomm runtime PM support (Srinivas Kandagatla) - fix DesignWare enumeration below bridges (Koen Vandeputte) - use usleep() instead of mdelay() in endpoint test (Jia-Ju Bai) - add configfs entries for pci_epf_driver device IDs (Kishon Vijay Abraham I) - clean up pci_endpoint_test driver (Gustavo Pimentel) - update Layerscape maintainer email addresses (Minghuan Lian) - add COMPILE_TEST to improve build test coverage (Rob Herring) - fix Hyper-V bus registration failure caused by domain/serial number confusion (Sridhar Pitchai) - improve Hyper-V refcounting and coding style (Stephen Hemminger) - avoid potential Hyper-V hang waiting for a response that will never come (Dexuan Cui) - implement Mediatek chained IRQ handling (Honghui Zhang) - fix vendor ID & class type for Mediatek MT7622 (Honghui Zhang) - add Mobiveil PCIe host controller driver (Subrahmanya Lingappa) - add Mobiveil MSI support (Subrahmanya Lingappa) - clean up clocks, MSI, IRQ mappings in R-Car probe failure paths (Marek Vasut) - poll more frequently (5us vs 5ms) while waiting for R-Car data link active (Marek Vasut) - use generic OF parsing interface in R-Car (Vladimir Zapolskiy) - add R-Car V3H (R8A77980) "compatible" string (Sergei Shtylyov) - add R-Car gen3 PHY support (Sergei Shtylyov) - improve R-Car PHYRDY polling (Sergei Shtylyov) - clean up R-Car macros (Marek Vasut) - use runtime PM for R-Car controller clock (Dien Pham) - update arm64 defconfig for Rockchip (Shawn Lin) - refactor Rockchip code to facilitate both root port and endpoint mode (Shawn Lin) - add Rockchip endpoint mode driver (Shawn Lin) - support VMD "membar shadow" feature (Jon Derrick) - support VMD bus number offsets (Jon Derrick) - add VMD "no AER source ID" quirk for more device IDs (Jon Derrick) - remove unnecessary host controller CONFIG_PCIEPORTBUS Kconfig selections (Bjorn Helgaas) - clean up quirks.c organization and whitespace (Bjorn Helgaas) * tag 'pci-v4.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (144 commits) PCI/AER: Replace struct pcie_device with pci_dev PCI/AER: Remove unused parameters PCI: qcom: Include gpio/consumer.h PCI: Improve "partially hidden behind bridge" log message PCI: Improve pci_scan_bridge() and pci_scan_bridge_extend() doc PCI: Move resource distribution for single bridge outside loop PCI: Account for all bridges on bus when distributing bus numbers ACPI / hotplug / PCI: Drop unnecessary parentheses ACPI / hotplug / PCI: Mark stale PCI devices disconnected ACPI / hotplug / PCI: Don't scan bridges managed by native hotplug PCI: hotplug: Add hotplug_is_native() PCI: shpchp: Add shpchp_is_native() PCI: shpchp: Fix AMD POGO identification PCI: mobiveil: Add MSI support PCI: mobiveil: Add Mobiveil PCIe Host Bridge IP driver PCI/AER: Decode Error Source Requester ID PCI/AER: Remove aer_recover_work_func() forward declaration PCI/DPC: Use the generic pcie_do_fatal_recovery() path PCI/AER: Pass service type to pcie_do_fatal_recovery() PCI/DPC: Disable ERR_NONFATAL handling by DPC ...
| * nvme-pci: Use pci_sriov_configure_simple() to enable VFsAlexander Duyck2018-04-241-19/+1
| | | | | | | | | | | | | | | | | | Instead of implementing our own version of a SR-IOV configuration stub in the nvme driver, use the existing pci_sriov_configure_simple() function. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* | Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-blockLinus Torvalds2018-06-041-114/+144
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - clean up how we pass around gfp_t and blk_mq_req_flags_t (Christoph) - prepare us to defer scheduler attach (Christoph) - clean up drivers handling of bounce buffers (Christoph) - fix timeout handling corner cases (Christoph/Bart/Keith) - bcache fixes (Coly) - prep work for bcachefs and some block layer optimizations (Kent). - convert users of bio_sets to using embedded structs (Kent). - fixes for the BFQ io scheduler (Paolo/Davide/Filippo) - lightnvm fixes and improvements (Matias, with contributions from Hans and Javier) - adding discard throttling to blk-wbt (me) - sbitmap blk-mq-tag handling (me/Omar/Ming). - remove the sparc jsflash block driver, acked by DaveM. - Kyber scheduler improvement from Jianchao, making it more friendly wrt merging. - conversion of symbolic proc permissions to octal, from Joe Perches. Previously the block parts were a mix of both. - nbd fixes (Josef and Kevin Vigor) - unify how we handle the various kinds of timestamps that the block core and utility code uses (Omar) - three NVMe pull requests from Keith and Christoph, bringing AEN to feature completeness, file backed namespaces, cq/sq lock split, and various fixes - various little fixes and improvements all over the map * tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits) blk-mq: update nr_requests when switching to 'none' scheduler block: don't use blocking queue entered for recursive bio submits dm-crypt: fix warning in shutdown path lightnvm: pblk: take bitmap alloc. out of critical section lightnvm: pblk: kick writer on new flush points lightnvm: pblk: only try to recover lines with written smeta lightnvm: pblk: remove unnecessary bio_get/put lightnvm: pblk: add possibility to set write buffer size manually lightnvm: fix partial read error path lightnvm: proper error handling for pblk_bio_add_pages lightnvm: pblk: fix smeta write error path lightnvm: pblk: garbage collect lines with failed writes lightnvm: pblk: rework write error recovery path lightnvm: pblk: remove dead function lightnvm: pass flag on graceful teardown to targets lightnvm: pblk: check for chunk size before allocating it lightnvm: pblk: remove unnecessary argument lightnvm: pblk: remove unnecessary indirection lightnvm: pblk: return NVM_ error on failed submission lightnvm: pblk: warn in case of corrupted write buffer ...
| * | nvme-pci: simplify __nvme_submit_cmdChristoph Hellwig2018-05-301-23/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With recent CQ handling improvements we can now move the locking into __nvme_submit_cmd. Also remove the local tail variable to make the code more obvious, remove the __ prefix in the name, and fix the comments describing the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
| * | nvme-pci: Rate limit the nvme timeout warningsKeith Busch2018-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block layer's timeout handling currently prevents drivers from completing commands outside the timeout callback once blk-mq decides they've expired. If a device breaks, this could potentially create many thousands of timed out commands. There's nothing of value to be gleaned from observing each of those messages, so this patch adds a rate limit on them. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/blockJens Axboe2018-05-291-18/+21
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe changes from Christoph: "Here is the current batch of nvme updates for 4.18, we have a few more patches in the queue, but I'd like to get this pile into your tree and linux-next ASAP. The biggest item is support for file-backed namespaces in the NVMe target from Chaitanya, in addition to that we mostly small fixes from all the usual suspects." * 'nvme-4.18-2' of git://git.infradead.org/nvme: nvme: fixup memory leak in nvme_init_identify() nvme: fix KASAN warning when parsing host nqn nvmet-loop: use nr_phys_segments when map rq to sgl nvmet-fc: increase LS buffer count per fc port nvmet: add simple file backed ns support nvmet: remove duplicate NULL initialization for req->ns nvmet: make a few error messages more generic nvme-fabrics: allow duplicate connections to the discovery controller nvme-fabrics: centralize discovery controller defaults nvme-fabrics: remove unnecessary controller subnqn validation nvme-fc: remove setting DNR on exception conditions nvme-rdma: stop admin queue before freeing it nvme-pci: Fix AER reset handling nvme-pci: set nvmeq->cq_vector after alloc cq/sq nvme: host: core: fix precedence of ternary operator nvme: fix lockdep warning in nvme_mpath_clear_current_path
| | * | nvme-pci: Fix AER reset handlingKeith Busch2018-05-251-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nvme timeout handling doesn't do anything if the pci channel is offline, which is the case when recovering from PCI error event, so it was a bad idea to sync the controller reset in this state. This patch flushes the reset work in the error_resume callback instead when the channel is back to online. This keeps AER handling serialized and can recover from timeouts. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757 Fixes: cc1d5e749a2e ("nvme/pci: Sync controller reset for AER slot_reset") Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com> Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-pci: set nvmeq->cq_vector after alloc cq/sqJianchao Wang2018-05-251-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set cq_vector after alloc cq/sq, otherwise nvme_suspend_queue will invoke free_irq for it and cause a 'Trying to free already-free IRQ xxx' warning if the create CQ/SQ command times out. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> [hch: fixed to pass a s16 and clean up the comment] Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme: return BLK_EH_DONE from ->timeoutChristoph Hellwig2018-05-291-9/+5
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVMe always completes the request before returning from ->timeout, either by polling for it, or by disabling the controller. Return BLK_EH_DONE so that the block layer doesn't even try to complete it again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nvme-pci: fix race between poll and IRQ completionsJens Axboe2018-05-211-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If polling completions are racing with the IRQ triggered by a completion, the IRQ handler will find no work and return IRQ_NONE. This can trigger complaints about spurious interrupts: [ 560.169153] irq 630: nobody cared (try booting with the "irqpoll" option) [ 560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65 [ 560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018 [ 560.175991] Call Trace: [ 560.175994] <IRQ> [ 560.176005] dump_stack+0x5c/0x7b [ 560.176010] __report_bad_irq+0x30/0xc0 [ 560.176013] note_interrupt+0x235/0x280 [ 560.176020] handle_irq_event_percpu+0x51/0x70 [ 560.176023] handle_irq_event+0x27/0x50 [ 560.176026] handle_edge_irq+0x6d/0x180 [ 560.176031] handle_irq+0xa5/0x110 [ 560.176036] do_IRQ+0x41/0xc0 [ 560.176042] common_interrupt+0xf/0xf [ 560.176043] </IRQ> [ 560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0 [ 560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd [ 560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f [ 560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000 [ 560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008 [ 560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358 [ 560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593 [ 560.176065] ? cpuidle_enter_state+0x8b/0x2b0 [ 560.176071] do_idle+0x1f4/0x260 [ 560.176075] cpu_startup_entry+0x6f/0x80 [ 560.176080] start_secondary+0x184/0x1d0 [ 560.176085] secondary_startup_64+0xa5/0xb0 [ 560.176088] handlers: [ 560.178387] [<00000000efb612be>] nvme_irq [nvme] [ 560.183019] Disabling IRQ #630 A previous commit removed ->cqe_seen that was handling this case, but we need to handle this a bit differently due to completions now running outside the queue lock. Return IRQ_HANDLED from the IRQ handler, if the completion ring head was moved since we last saw it. Fixes: 5cb525c8315f ("nvme-pci: handle completions outside of the queue lock") Reported-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nvme-pci: drop IRQ disabling on submission queue lockJens Axboe2018-05-181-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | Since we aren't sharing the lock for completions now, we don't have to make it IRQ safe. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: split the nvme queue lock into submission and completion locksJens Axboe2018-05-181-21/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is now feasible. We protect the submission queue ring with ->sq_lock, and the completion side with ->cq_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: handle completions outside of the queue lockJens Axboe2018-05-181-42/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the completion of events into a two part process: 1) Reap the events inside the queue lock 2) Complete the events outside the queue lock Since we never wrap the queue, we can access it locklessly after we've updated the completion queue head. This patch started off with batching events on the stack, but with this trick we don't have to. Keith Busch <keith.busch@intel.com> came up with that idea. Note that this kills the ->cqe_seen as well. I haven't been able to trigger any ill effects of this. If we do race with polling every so often, it should be rare enough NOT to trigger any issues. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: refactored, restored poll early exit optimization] Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: move ->cq_vector == -1 check outside of ->q_lockJens Axboe2018-05-181-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only clear it dynamically in nvme_suspend_queue(). When we do, ensure to do a full flush so that any nvme_queue_rq() invocation will see it. Ideally we'd kill this check completely, but we're using it to flush requests on a dying queue. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: remove cq check after submissionJens Axboe2018-05-181-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We always check the completion queue after submitting, but in my testing this isn't a win even on DRAM/xpoint devices. In some cases it's actually worse. Kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: simplify nvme_cqe_validChristoph Hellwig2018-05-181-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | We always look at the current CQ head and phase, so don't pass these as separate arguments, and rename the function to nvme_cqe_pending. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | nvme/pci: Sync controller reset for AER slot_resetKeith Busch2018-05-111-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AER handling expects a successful return from slot_reset means the driver made the device functional again. The nvme driver had been using an asynchronous reset to recover the device, so the device may still be initializing after control is returned to the AER handler. This creates problems for subsequent event handling, causing the initializion to fail. This patch fixes that by syncing the controller reset before returning to the AER driver, and reporting the true state of the reset. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199657 Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com> Cc: Sinan Kaya <okaya@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | nvme/pci: Hold controller reference during async probeKeith Busch2018-05-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible the driver's remove may have freed the controller if the remove callback is invoked prior to the async_schedule starting the reset_work. This patch fixes that by holding a reference on the controller. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | nvme/pci: Use async_schedule for initial reset workKeith Busch2018-05-021-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch schedules the initial controller reset in an async_domain so that it can be synchronized from wait_for_device_probe(). This way the kernel waits for the initial nvme controller scan to complete for all devices before proceeding with the boot sequence, which may have nvme dependencies. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | nvme: lightnvm: add granby supportWei Xu2018-04-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new lightnvm quirk to identify CNEX’s Granby controller. Signed-off-by: Wei Xu <wxu@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash StorageMicah Parrish2018-04-261-0/+2
| |/ | | | | | | | | | | | | | | | | | | | | Add Seagate Nytro Flash Storage nvme drive to quirk list for NVME_QUIRK_DELAY_BEFORE_CHK_RDY, which solves a bug where the drive is probed on hot-add before the firmare is ready, I/O errors are generated while reading sector 0, and linux is "unable to read partition table". Signed-off-by: Micah Parrish <micah.parrish@hpe.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com>