linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-25 03:04:35 +07:00

Author	SHA1	Message	Date
James Smart	d625d05ef0	nvme-fc: fix rogue admin cmds stalling teardown When connectivity is lost to a device, the association is terminated and the blk-mq queues are quiesced/stopped. When connectivity is re-established, they are resumed. If an admin command is received while connectivity is list, the ioctl queues the command on the admin_q and the command stalls (the thread issuing the ioctl hangs/waits). if the connectivity is lost long enough such that the controller is then deleted, the delete code makes its calls to initiate the delete, which then expects the core layer to call the transport when all references are removed and the controller can be freed. Unfortunately, nothing in this path dequeued the admin command, so a reference sits outstanding and things stop, hanging the delete indefinitely. Correct by unquiescing the admin queue in the delete association. This means any admin command (which should only be from an ioctl) issued after connectivity is lost will detect the controller is in a reconnecting state and will (fast) fail the command. Thus, a pending reference can no longer be created. Once connectivity is re-established, a new ioctl/admin command would see proper device state and function again. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-17 17:54:49 +01:00
Roland Dreier	df351ef737	nvme-fabrics: fix memory leak when parsing host ID option We use match_strdup() to get a copy of the option string for host ID string, but we just pass it to uuid_parse() and don't store the string pointer, so we need to kfree() the string after parsing it. Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-15 17:09:31 +01:00
Minwoo Im	8adb8c147b	nvme: fix comment typos in nvme_create_io_queues fix comment typos in nvme_create_io_queues() like below. _aount_ to _amount_ _an_ to _can_ Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-15 17:09:30 +01:00
Roy Shterman	b227c59b9b	nvme: host delete_work and reset_work on separate workqueues We need to ensure that delete_work will be hosted on a different workqueue than all the works we flush or cancel from it. Otherwise we may hit a circular dependency warning [1]. Also, given that delete_work flushes reset_work, host reset_work on nvme_reset_wq and delete_work on nvme_delete_wq. In addition, fix the flushing in the individual drivers to flush nvme_delete_wq when draining queued deletes. [1]: [ 178.491942] ============================================= [ 178.492718] [ INFO: possible recursive locking detected ] [ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE [ 178.494382] --------------------------------------------- [ 178.495160] kworker/5:1/135 is trying to acquire lock: [ 178.495894] ( [ 178.496120] "nvme-wq" [ 178.496471] ){++++.+} [ 178.496599] , at: [ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0 [ 178.497670] but task is already holding lock: [ 178.498499] ( [ 178.498724] "nvme-wq" [ 178.499074] ){++++.+} [ 178.499202] , at: [ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.500343] other info that might help us debug this: [ 178.501269] Possible unsafe locking scenario: [ 178.502113] CPU0 [ 178.502472] ---- [ 178.502829] lock( [ 178.503115] "nvme-wq" [ 178.503467] ); [ 178.503716] lock( [ 178.504001] "nvme-wq" [ 178.504353] ); [ 178.504601] * DEADLOCK * [ 178.505441] May be due to missing lock nesting notation [ 178.506453] 2 locks held by kworker/5:1/135: [ 178.507068] #0: [ 178.507330] ( [ 178.507598] "nvme-wq" [ 178.507726] ){++++.+} [ 178.508079] , at: [ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.509004] #1: [ 178.509265] ( [ 178.509532] (&ctrl->delete_work) [ 178.509795] ){+.+.+.} [ 178.510145] , at: [ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.511070] stack backtrace: : [ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3 [ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp] [ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80 [ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700 [ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e [ 178.518443] Call Trace: [ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2 [ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0 [ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30 [ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0 [ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0 [ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220 [ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0 [ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220 [ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0 [ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40 [ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50 [ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20 [ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core] [ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp] [ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0 [ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0 [ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490 [ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120 [ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40 Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-15 17:09:30 +01:00
Sagi Grimberg	147b27e4bd	nvme-pci: allocate device queues storage space at probe It may cause race by setting 'nvmeq' in nvme_init_request() because .init_request is called inside switching io scheduler, which may happen when the NVMe device is being resetted and its nvme queues are being freed and created. We don't have any sync between the two pathes. This patch changes the nvmeq allocation to occur at probe time so there is no way we can dereference it at init_request. [ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408! [ 93.274146] invalid opcode: 0000 [#1] SMP [ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3 megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod [ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4 [ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017 [ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1 [ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme] [ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246 [ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008 [ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008 [ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00 [ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00 [ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071 [ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000 [ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0 [ 93.446368] Call Trace: [ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0 [ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80 [ 93.459214] blk_mq_init_sched+0x7e/0x140 [ 93.463687] elevator_switch+0x5a/0x1f0 [ 93.467966] ? elevator_get.isra.17+0x52/0xc0 [ 93.472826] elv_iosched_store+0xde/0x150 [ 93.477299] queue_attr_store+0x4e/0x90 [ 93.481580] kernfs_fop_write+0xfa/0x180 [ 93.485958] __vfs_write+0x33/0x170 [ 93.489851] ? __inode_security_revalidate+0x4c/0x60 [ 93.495390] ? selinux_file_permission+0xda/0x130 [ 93.500641] ? _cond_resched+0x15/0x30 [ 93.504815] vfs_write+0xad/0x1a0 [ 93.508512] SyS_write+0x52/0xc0 [ 93.512113] do_syscall_64+0x61/0x1a0 [ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25 [ 93.521351] RIP: 0033:0x7f33ce96aab0 [ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0 [ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001 [ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740 [ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400 [ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000 [ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0 74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f> 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89 [ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8 [ 93.602273] ---[ end trace 810dde3993e5f14e ]--- Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-15 16:20:11 +01:00
Sagi Grimberg	79c48ccf2f	nvme-pci: serialize pci resets Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-15 16:20:10 +01:00
Keith Busch	e1f425e770	nvme/multipath: Use blk_path_error Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-10 10:52:18 -07:00
Keith Busch	908e45643d	nvme/multipath: Consult blk_status_t for failover This removes nvme multipath's specific status decoding to see if failover is needed, using the generic blk_status_t that was decoded earlier. This abstraction from the raw NVMe status means all status decoding exists in one place. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-10 10:52:14 -07:00
Keith Busch	e96fef2c3f	nvme: Add more command status translation This adds more NVMe status code translations to blk_status_t values, and captures all the current status codes NVMe multipath uses. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-10 10:52:12 -07:00
Joe Perches	c828a89203	treewide: Use DEVICE_ATTR_RO Convert DEVICE_ATTR uses to DEVICE_ATTR_RO where possible. Done with perl script: $ git grep -w --name-only DEVICE_ATTR \| \ xargs perl -i -e 'local $/; while (<>) { s/\bDEVICE_ATTR\s$\s(\w+)\s,\s\(?(?:\sS_IRUGO\s\|\s0444\s)$?\s,\s\1_show\s,\sNULL\s*\)/DEVICE_ATTR_RO(\1)/g; print;}' Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Robert Jarzmik <robert.jarzmik@free.fr> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-09 16:34:34 +01:00
Israel Rukshin	b837b28394	nvme: fix subsystem multiple controllers support check There is a problem when another module (e.g. nvmet) takes a reference on the nvme block device and the physical nvme drive is removed. In that case nvme_free_ctrl() will not be called and the controller state will be "deleting" or "dead" unless nvmet module releases the block device. Later on, the same nvme drive probes back and nvme_init_subsystem() will be called and fail due to duplicate subnqn (if the nvme device doesn't support subsystem with multiple controllers). This will cause a probe failure. This commit changes the check of multiple controllers support at nvme_init_subsystem() by not counting all the controllers at "dead" or "deleting" state (this is safe because controllers at this state will never be active again). Fixes: `ab9e00cc72` ("nvme: track subsystems") Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 16:57:00 +01:00
Nitzan Carmi	85088c4a0f	nvme: take refcount on transport module The block device is backed by the transport so we must ensure that the transport driver will not be removed until all references are released. Otherwise, we might end up referencing freed memory. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 16:56:57 +01:00
Jianchao Wang	2b1b7e784a	nvme-pci: fix NULL pointer reference in nvme_alloc_ns When the io queues setup or tagset allocation failed, ctrl.tagset is NULL. But the scan work will still be queued and executed, then panic comes up due to NULL pointer reference of ctrl.tagset. To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to inidcate only admin queue is live. When non io queues or tagset allocation failed, ctrl enters into this state, scan work will not be started. But async event work and nvme dev ioctl will be still available. This will be helpful to do further investigation and recovery. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 11:02:13 +01:00
Max Gurtovoy	1a3838d732	nvme: modify the debug level for setting shutdown timeout When an NVMe controller reports RTD3 Entry Latency larger than the value of shutdown_timeout module parameter, we update the shutdown_timeout accordingly to honor RTD3 Entry Latency. Use an informational debug level instead of a warning level for it. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 11:02:00 +01:00
Sagi Grimberg	4caff8fc19	nvme-pci: don't open-code nvme_reset_ctrl Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 11:01:59 +01:00
Minwoo Im	6fbcde6691	nvme-pci: remove an unnecessary initialization in HMB code The local variable __size__ will be set a bit later in a for-loop. Remove the explicit initialization at the beginning of this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 11:01:57 +01:00
Roy Shterman	0de5cd367c	nvme-fabrics: protect against module unload during create_ctrl NVMe transport driver module unload may (and usually does) trigger iteration over the active controllers and delete them all (sometimes under a mutex). However, a controller can be created concurrently with module unload which can lead to leakage of resources (most important char device node leakage) in case the controller creation occured after the unload delete and drain sequence. To protect against this, we take a module reference to guarantee that the nvme transport driver is not unloaded while creating a controller. Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 11:01:56 +01:00
Ewan D. Milne	6b018235b4	nvme-fabrics: initialize default host->id in nvmf_host_default() The field was uninitialized before use. Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-01-08 10:52:03 +01:00
Matias Bjørling	fae7fae407	lightnvm: make geometry structures 2.0 ready Prepare for the 2.0 revision by adapting the geometry structures to coexist with the 1.2 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-05 08:50:12 -07:00
Matias Bjørling	bb27aa9ecd	lightnvm: remove lower page tables The lower page table is unused. All page tables reported by 1.2 devices are all reporting a sequential 1:1 page mapping. This is also not used going forward with the 2.0 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-05 08:50:12 -07:00
Matias Bjørling	e3e13bcc14	lightnvm: remove hybrid ocssd 1.2 support Now that rrpc have been removed. Also remove the hybrid 1.2 support from the core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-01-05 08:50:12 -07:00
Sagi Grimberg	479a322fb7	nvme-mpath: fix last path removal during traffic In case our last path is removed during traffic, we can end up requeueing the bio(s) but never schedule the actual requeue work as upper layers still have open handles on the mpath device node. Fix this by scheduling requeue work if the namespace being removed is the last path in the ns_head path list. Fixes: `32acab3181` ("nvme: implement multipath access to nvme subsystems") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-29 10:32:58 +01:00
Sagi Grimberg	d5bf4b7f43	nvme-rdma: fix concurrent reset and reconnect Now ctrl state machine allows to transition from RESETTING to RECONNECTING. In nvme-rdma when we receive a rdma cm DISONNECTED event, we trigger nvme_rdma_error_recovery. This happens also when we execute a controller reset, issue a cm diconnect request and receive a cm disconnect reply, as a result, the reset work and the error recovery work can run concurrently. Until now the state machine prevented from the error recovery work from running as a result of a controller reset (RESETTING -> RECONNECTING was not allowed). To fix this, we adopt the FC state machine approach, we always transition from LIVE to RESETTING and only then to RECONNECTING. We do this both for the error recovery work and the controller reset work: 1. transition to RESETTING 2. teardown the controller association 3. transition to RECONNECTING This will restore the protection against reset work and error recovery work from concurrently running together. Fixes: `3cec7f9de4` ("nvme: allow controller RESETTING to RECONNECTING transition") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-29 10:32:58 +01:00
Jeff Lien	cee160fd34	nvme: fix sector units when going between formats If you format a device with a 4k sector size back to 512 bytes, the queue limit values for physical block size and minimum IO size were not getting updated; only the logical block size was being updated. This patch adds code to update the physical block and IO minimum sizes. Signed-off-by: Jeff Lien <jeff.lien@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-29 10:31:05 +01:00
Minwoo Im	955b1b5a00	nvme-pci: move use_sgl initialization to nvme_init_iod() A flag "use_sgl" of "struct nvme_iod" has been used in nvme_init_iod() without being set to any value. It seems like "use_sgl" has been set in either nvme_pci_setup_prps() or nvme_pci_setup_sgls() which occur later than nvme_init_iod(). Make "iod->use_sgl" being set in a proper place, nvme_init_iod(). Also move nvme_pci_use_sgls() up above nvme_init_iod() to make it possible to be called by nvme_init_iod(). Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-29 10:31:04 +01:00
Keith Busch	654b4a4acd	nvme: setup streams after initializing namespace head Fixes a NULL pointer dereference. Reported-by: Arnav Dawn <a.dawn@samsung.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-15 15:18:07 +01:00
Keith Busch	249159c5f1	nvme: check hw sectors before setting chunk sectors Some devices with IDs matching the "stripe" quirk don't actually have this quirk, and don't have an MDTS value. When MDTS is not set, the driver sets the max sectors to UINT_MAX, which is not a power of 2, hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting chunk sectors for such devices. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-15 15:18:07 +01:00
Ming Lei	bd9f5d6576	nvme: call blk_integrity_unregister after queue is cleaned up During IO complete path, bio_integrity_advance() is often called, and blk_get_integrity() is called in this function. But in blk_integrity_unregister, the buffer pointed by queue->integrity is cleared, and blk_integrity->profile becomes NULL, then blk_get_integrity returns NULL, and causes kernel oops[1] finally. This patch fixes this issue by calling blk_integrity_unregister() after blk_cleanup_queue(). [1] kernel oops log [ 122.068007] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 122.076760] IP: bio_integrity_advance+0x3d/0xf0 [ 122.081815] PGD 0 P4D 0 [ 122.084641] Oops: 0000 [#1] SMP [ 122.088142] Modules linked in: sunrpc ipmi_ssif intel_rapl vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mei_me ipmi_si crct10dif_pclmul crc32_pclmul sg mei ghash_clmulni_intel mxm_wmi ipmi_devintf iTCO_wdt intel_cstate intel_uncore pcspkr intel_rapl_perf iTCO_vendor_support dcdbas ipmi_msghandler lpc_ich acpi_power_meter shpchp wmi dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel ahci nvme tg3 libahci nvme_core i2c_core libata ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod [ 122.149577] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-11.el7a.x86_64 #1 [ 122.157635] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017 [ 122.166179] task: ffff8802ff1e8000 task.stack: ffffc90000130000 [ 122.172785] RIP: 0010:bio_integrity_advance+0x3d/0xf0 [ 122.178419] RSP: 0018:ffff88047fc03d70 EFLAGS: 00010006 [ 122.184248] RAX: ffff880473b08000 RBX: ffff880458c71a80 RCX: ffff880473b08248 [ 122.192209] RDX: 0000000000000000 RSI: 000000000000003c RDI: ffffc900038d7ba0 [ 122.200171] RBP: ffff88047fc03d78 R08: 0000000000000001 R09: ffffffffa01a78b5 [ 122.208132] R10: ffff88047fc1eda0 R11: ffff880458c71ad0 R12: 0000000000007800 [ 122.216094] R13: 0000000000000000 R14: 0000000000007800 R15: ffff880473a39b40 [ 122.224056] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 122.233083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 122.239494] CR2: 000000000000000a CR3: 0000000001c09002 CR4: 00000000001606e0 [ 122.247455] Call Trace: [ 122.250183] <IRQ> [ 122.252429] bio_advance+0x28/0xf0 [ 122.256217] blk_update_request+0xa1/0x310 [ 122.260778] blk_mq_end_request+0x1e/0x70 [ 122.265256] nvme_complete_rq+0x1c/0xd0 [nvme_core] [ 122.270699] nvme_pci_complete_rq+0x85/0x130 [nvme] [ 122.276140] __blk_mq_complete_request+0x8d/0x140 [ 122.281387] blk_mq_complete_request+0x16/0x20 [ 122.286345] nvme_process_cq+0xdd/0x1c0 [nvme] [ 122.291301] nvme_irq+0x23/0x50 [nvme] [ 122.295485] __handle_irq_event_percpu+0x3c/0x190 [ 122.300725] handle_irq_event_percpu+0x32/0x80 [ 122.305683] handle_irq_event+0x3b/0x60 [ 122.309964] handle_edge_irq+0x8f/0x190 [ 122.314247] handle_irq+0xab/0x120 [ 122.318043] do_IRQ+0x48/0xd0 [ 122.321355] common_interrupt+0x9d/0x9d [ 122.325625] </IRQ> [ 122.327967] RIP: 0010:cpuidle_enter_state+0xe9/0x280 [ 122.333504] RSP: 0018:ffffc90000133e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff35 [ 122.341952] RAX: ffff88047fc1b900 RBX: ffff88047fc24400 RCX: 000000000000001f [ 122.349913] RDX: 0000000000000000 RSI: fffffcf2e6007295 RDI: 0000000000000000 [ 122.357874] RBP: ffffc90000133ea0 R08: 000000000000062e R09: 0000000000000253 [ 122.365836] R10: 0000000000000225 R11: 0000000000000018 R12: 0000000000000002 [ 122.373797] R13: 0000000000000001 R14: ffff88047fc24400 R15: 0000001c6bd1d263 [ 122.381762] ? cpuidle_enter_state+0xc5/0x280 [ 122.386623] cpuidle_enter+0x17/0x20 [ 122.390611] call_cpuidle+0x23/0x40 [ 122.394501] do_idle+0x17e/0x1f0 [ 122.398101] cpu_startup_entry+0x73/0x80 [ 122.402478] start_secondary+0x178/0x1c0 [ 122.406854] secondary_startup_64+0xa5/0xa5 [ 122.411520] Code: 48 8b 5f 68 48 8b 47 08 31 d2 4c 8b 5b 48 48 8b 80 d0 03 00 00 48 83 b8 48 02 00 00 00 48 8d 88 48 02 00 00 48 0f 45 d1 c1 ee 09 <0f> b6 4a 0a 0f b6 52 09 89 f0 48 01 73 08 83 e9 09 d3 e8 0f af [ 122.432604] RIP: bio_integrity_advance+0x3d/0xf0 RSP: ffff88047fc03d70 [ 122.439888] CR2: 000000000000000a Reported-by: Zhang Yi <yizhan@redhat.com> Tested-by: Zhang Yi <yizhan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-15 15:18:07 +01:00
James Smart	4596e752db	nvme-fc: remove double put reference if admin connect fails There are two put references in the failure case of initial create_association. The first put actually frees the controller, thus the second put references freed memory. Remove the unnecessary 2nd put. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-15 15:13:33 +01:00
David Disseldorp	b224f6134d	nvme: set discard_alignment to zero Similar to `7c08428979` ("rbd: set discard_alignment to zero"), NVMe devices are currently incorrectly initialised with the block queue discard_alignment set to the NVMe stream alignment. As per Documentation/ABI/testing/sysfs-block: The discard_alignment parameter indicates how many bytes the beginning of the device is offset from the internal allocation unit's natural alignment. Correcting the discard_alignment parameter to zero has no effect on how discard requests are propagated through the block layer - @alignment in __blkdev_issue_discard() remains zero. However, it does fix other consumers, such as LIO's Block Limits VPD response. Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-12-15 15:13:32 +01:00
Minwoo Im	7e5dd57ef3	nvme-pci: fix NULL pointer dereference in nvme_free_host_mem() Following condition which will cause NULL pointer dereference will occur in nvme_free_host_mem() when it tries to remove pci device via nvme_remove() especially after a failure of host memory allocation for HMB. "(host_mem_descs == NULL) && (nr_host_mem_descs != 0)" It's because __nr_host_mem_descs__ is not cleared to 0 unlike __host_mem_descs__ is so. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-28 08:49:26 -08:00
Max Gurtovoy	eb1bd249ba	nvme-rdma: fix memory leak during queue allocation In case nvme_rdma_wait_for_cm timeout expires before we get an established or rejected event (rdma_connect succeeded) from rdma_cm, we end up with leaking the ib transport resources for dedicated queue. This scenario can easily reproduced using traffic test during port toggling. Also, in order to protect from parallel ib queue destruction, that may be invoked from different context's, introduce new flag that stands for transport readiness. While we're here, protect also against a situation that we can receive rdma_cm events during ib queue destruction. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-28 08:49:22 -08:00
Israel Rukshin	f41725bbe1	nvme-rdma: Use mr pool Currently, blk_mq_tagset_iter() iterate over initial hctx tags only. If an I/O scheduler is used, it doesn't iterate the hctx scheduler tags and the static request aren't been updated. For example, while using NVMe over Fabrics RDMA host, this cause us not to reinit the scheduler requests and thus not re-register all the memory regions during the tagset re-initialization in the reconnect flow. This may lead to a memory registration error: "MEMREG for CQE 0xffff88044c14dce8 failed with status memory management operation error (6)" With this commit we don't need to reinit the requests, and thus fix this failure. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-26 15:33:32 +01:00
Sagi Grimberg	3ef0279bb0	nvme-rdma: Check remotely invalidated rkey matches our expected rkey If we got a remote invalidation on a bogus rkey, this is a protocol error. Fail the connection in this case. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-26 15:33:32 +01:00
Sagi Grimberg	2f122e4f51	nvme-rdma: wait for local invalidation before completing a request We must not complete a request before the host memory region is invalidated. Luckily we have send with invalidate protocol support so we usually don't need to execute it, but in case the target did not invalidate a memory region for us, we must wait for the invalidation to complete before unmapping host memory and completing the I/O. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-26 15:33:32 +01:00
Sagi Grimberg	4af7f7ff92	nvme-rdma: don't complete requests before a send work request has completed In order to guarantee that the HCA will never get an access violation (either from invalidated rkey or from iommu) when retrying a send operation we must complete a request only when both send completion and the nvme cqe has arrived. We need to set the send/recv completions flags atomically because we might have more than a single context accessing the request concurrently (one is cq irq-poll context and the other is user-polling used in IOCB_HIPRI). Only then we are safe to invalidate the rkey (if needed), unmap the host buffers, and complete the IO. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-26 15:33:32 +01:00
Sagi Grimberg	b4b591c87f	nvme-rdma: don't suppress send completions The entire completions suppress mechanism is currently broken because the HCA might retry a send operation (due to dropped ack) after the nvme transaction has completed. In order to handle this, we signal all send completions and introduce a separate done handler for async events as they will be handled differently (as they don't include in-capsule data by definition). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-26 15:33:32 +01:00
Jens Axboe	26c0a26d78	nvme-fc: don't use bit masks for set/test_bit() numbers So far harmless, but it's confusing and a bug waiting to happen if the shifts grow larger than 4. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-24 10:12:33 -07:00
Jeff Lien	8c97eeccf0	nvme-pci: add quirk for delay before CHK RDY for WDC SN200 And increase the existing delay to cover this device as well. Cc: stable@vger.kernel.org Signed-off-by: Jeff Lien <jeff.lien@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-23 09:12:08 +01:00
Keith Busch	9941a862cc	nvme: Suppress static analyis warning The ns->head is always valid, so we don't need to check for NULL. Reported-by: Dan Carpenter <dan.caprenter@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:38:11 +01:00
Keith Busch	b0d61d586f	nvme: Fix NULL dereference on reservation request This fixes using the NULL 'head' before getting the reference. It is however possible the head will always be NULL, so this patch uses the struct nvme_ns to get the ns_id field. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:38:11 +01:00
Colin Ian King	89c4aff6d4	nvme: fix spelling mistake: "requeing" -> "requeuing" Trivial fix to spelling mistake in dev_warn_ratelimited message text Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:38:10 +01:00
Minwoo Im	244a8fe40a	nvme-pci: avoid hmb desc array idx out-of-bound when hmmaxd set. hmb descriptor idx out-of-bound occurs in case of below conditions. preferred = 128MiB chunk_size = 4MiB hmmaxd = 1 Current code will not allow rmmod which will free hmb descriptors to be done successfully in above case. "descs[i]" will be set in for-loop without seeing any conditions related to "max_entries" after a single "descs" was allocated by (max_entries = 1) in this case. Added a condition into for-loop to check index of descriptors. Fixes: 044a9df1("nvme-pci: implement the HMB entry number and size limitations") Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:38:09 +01:00
Kai-Heng Feng	8427bbc224	nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A The NVMe device in question drops off the PCIe bus after system suspend. I've tried several approaches to workaround this issue, but none of them works: - NVME_QUIRK_DELAY_BEFORE_CHK_RDY - NVME_QUIRK_NO_DEEPEST_PS - Disable APST before controller shutdown - Delay between controller shutdown and system suspend - Explicitly set power state to 0 before controller shutdown Fortunately it's a desktop, so disable APST won't hurt the battery. Also, change the quirk function name to reflect it's for vendor combination quirks. BugLink: https://bugs.launchpad.net/bugs/1705748 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:36:40 +01:00
Sagi Grimberg	9e0ed16ab9	nvme-fc: check if queue is ready in queue_rq In case the queue is not LIVE (fully functional and connected at the nvmf level), we cannot allow any commands other than connect to pass through. Add a new queue state flag NVME_FC_Q_LIVE which is set after nvmf connect and cleared in queue teardown. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:28:35 +01:00
Sagi Grimberg	48832f8d58	nvme-fabrics: introduce init command check for a queue that is not alive When the fabrics queue is not alive and fully functional, no commands should be allowed to pass but connect (which moves the queue to a fully functional state). Any other command should be failed, with either temporary status BLK_STS_RESOUCE or permanent status BLK_STS_IOERR. This is shared across all fabrics, hence move the check to fabrics library. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-20 08:28:31 +01:00
Linus Torvalds	e2c5923c34	Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains: - A patch series from Bart, closing the whole on blk/scsi-mq queue quescing. - A series from Christoph, building towards hidden gendisks (for multipath) and ability to move bio chains around. - NVMe - Support for native multipath for NVMe (Christoph). - Userspace notifications for AENs (Keith). - Command side-effects support (Keith). - SGL support (Chaitanya Kulkarni) - FC fixes and improvements (James Smart) - Lots of fixes and tweaks (Various) - bcache - New maintainer (Michael Lyle) - Writeback control improvements (Michael) - Various fixes (Coly, Elena, Eric, Liang, et al) - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh). - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph) - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me). - Fix for missing wakeup on writeback timer adjustments (Yafang Shao). - Fix laptop mode on blk-mq (me). - {mq,name} tupple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me). - blktrace race fix, oopsing on sg load (me). - blk-mq optimizations (me). - Obscure waitqueue race fix for kyber (Omar). - NBD fixes (Josef). - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio). - Series from Ming that enable us to treat flush requests on blk-mq like any other request. This is a really nice cleanup. - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again. - BFQ updates (Paolo). - blk-mq atomic flags memory ordering fixes (Peter Z). - Loop cgroup support (Shaohua). - Lots of minor fixes from lots of different folks, both for core and driver code" * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits) nvme: fix visibility of "uuid" ns attribute blk-mq: fixup some comment typos and lengths ide: ide-atapi: fix compile error with defining macro DEBUG blk-mq: improve tag waiting setup for non-shared tags brd: remove unused brd_mutex blk-mq: only run the hardware queue if IO is pending block: avoid null pointer dereference on null disk fs: guard_bio_eod() needs to consider partitions xtensa/simdisk: fix compile error nvme: expose subsys attribute to sysfs nvme: create 'slaves' and 'holders' entries for hidden controllers block: create 'slaves' and 'holders' entries for hidden gendisks nvme: also expose the namespace identification sysfs files for mpath nodes nvme: implement multipath access to nvme subsystems nvme: track shared namespaces nvme: introduce a nvme_ns_ids structure nvme: track subsystems block, nvme: Introduce blk_mq_req_flags_t block, scsi: Make SCSI quiesce and resume work reliably block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag ...	2017-11-14 15:32:19 -08:00
Martin Wilck	a04b5de505	nvme: fix visibility of "uuid" ns attribute "uuid" must be invisible if both ns->uuid and ns->nguid are unset, not if either one is. Fixes: `d934f9848a` "nvme: provide UUID value to userspace" Signed-off-by: Martin Wilck <mwilck@suse.com> [hch: rebased to the nvme-4.15 tree to help resolving a conflict] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-11 15:38:21 -07:00
Hannes Reinecke	1e496938b6	nvme: expose subsys attribute to sysfs We should be exposing the subsystem attributes like 'model' and 'subsysnqn' to sysfs to allow for easier identification of the subsystem. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Hannes Reinecke	e9a48034d7	nvme: create 'slaves' and 'holders' entries for hidden controllers When creating nvme multipath devices we should populate the 'slaves' and 'holders' directorys properly to aid userspace topology detection. Signed-off-by: Hannes Reinecke <hare@suse.com> [hch: split from a larger patch, compile fix for CONFIG_NVME_MULTIPATH=n] Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	5b85b826b8	nvme: also expose the namespace identification sysfs files for mpath nodes We do this by adding a helper that returns the ns_head for a device that can belong to either the per-controller or per-subsystem block device nodes, and otherwise reuse all the existing code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	32acab3181	nvme: implement multipath access to nvme subsystems This patch adds native multipath support to the nvme driver. For each namespace we create only single block device node, which can be used to access that namespace through any of the controllers that refer to it. The gendisk for each controllers path to the name space still exists inside the kernel, but is hidden from userspace. The character device nodes are still available on a per-controller basis. A new link from the sysfs directory for the subsystem allows to find all controllers for a given subsystem. Currently we will always send I/O to the first available path, this will be changed once the NVMe Asynchronous Namespace Access (ANA) TP is ratified and implemented, at which point we will look at the ANA state for each namespace. Another possibility that was prototyped is to use the path that is closes to the submitting NUMA code, which will be mostly interesting for PCI, but might also be useful for RDMA or FC transports in the future. There is not plan to implement round robin or I/O service time path selectors, as those are not scalable with the performance rates provided by NVMe. The multipath device will go away once all paths to it disappear, any delay to keep it alive needs to be implemented at the controller level. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	ed754e5dee	nvme: track shared namespaces Introduce a new struct nvme_ns_head that holds information about an actual namespace, unlike struct nvme_ns, which only holds the per-controller namespace information. For private namespaces there is a 1:1 relation of the two, but for shared namespaces this lets us discover all the paths to it. For now only the identifiers are moved to the new structure, but most of the information in struct nvme_ns should eventually move over. To allow lockless path lookup the list of nvme_ns structures per nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns structure through call_srcu. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	002fab0404	nvme: introduce a nvme_ns_ids structure This allows us to manage the various uniqueue namespace identifiers together instead needing various variables and arguments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	ab9e00cc72	nvme: track subsystems This adds a new nvme_subsystem structure so that we can track multiple controllers that belong to a single subsystem. For now we only use it to store the NQN, and to check that we don't have duplicate NQNs unless the involved subsystems support multiple controllers. Includes code originally from Hannes Reinecke to expose the subsystems in sysfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Bart Van Assche	9a95e4ef70	block, nvme: Introduce blk_mq_req_flags_t Several block layer and NVMe core functions accept a combination of BLK_MQ_REQ_* flags through the 'flags' argument but there is no verification at compile time whether the right type of block layer flags is passed. Make it possible for sparse to verify this. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: linux-nvme@lists.infradead.org Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Ming Lei	03e0f3a65e	nvme-pci: avoid dereference of symbol from unloaded module The 'remove_work' may be scheduled to run after nvme_remove() returns since we can't simply cancel it in nvme_remove() for avoiding deadlock. Once nvme_remove() returns, this module(nvme) can be unloaded. On the other hand, nvme_put_ctrl() calls ctr->ops->free_ctrl which may point to nvme_pci_free_ctrl() in unloaded module. This patch avoids this issue by queuing 'remove_work' via 'nvme_wq', and flush this worqueue in nvme_exit() as suggested by Sagi. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	e3d7874dcf	nvme: send uevent for some asynchronous events This will give udev a chance to observe and handle asynchronous event notifications and clear the log to unmask future events of the same type. The driver will create a change uevent of the asyncronuos event result before submitting the next AEN request to the device if a completed AEN event is of type error, smart, command set or vendor specific, Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	d99ca609a1	nvme: unexport starting async event work Async event work is for core use only and should not be called directly from drivers. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	ad22c355b7	nvme: remove handling of multiple AEN requests The driver can handle tracking only one AEN request, so this patch removes handling for multiple ones. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	08e1507544	nvme-fc: remove unused "queue_size" field This was being saved in a structure, but never used anywhere. The queue size is obtained through other means, so there's no reason to duplicate this without a user for it. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	38dabe210f	nvme: centralize AEN defines All the transports were unnecessarilly duplicating the AEN request accounting. This patch defines everything in one place. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Javier González	ab083b11f6	nvme: fix eui_show() print format Fix print formatting, but keep the original output to prevent user breakage as suggested by Joe Perches. Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Javier González	a47619b5c6	nvme: compare NQN string with right size Copy subnqns using NVMF_NQN_SIZE as it is < 256 Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Max Gurtovoy	1f61def985	nvme-rdma: fix nvme_rdma_create_queue_ib error flow QP object is created using rdma_cm api, therefore the destruction should use the same api for symmetry. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
James Smart	158bfb8888	nvme-fc: decouple ns references from lldd references In the lldd api, a lldd may unregister a remoteport (loss of connectivity or driver unload) or localport (driver unload). The lldd must wait for the remoteport_delete or localport_delete before completing its actions post the unregister. The xxx_deletes currently occur only when the xxxport structure is fully freed after all references are removed. Thus the lldd may be held hostage until an app or in-kernel entity that has a namespace open finally closes so the namespace can be removed, the controller removed, thus the transport objects, thus the lldd. This patch decouples the transport and os-facing objects from the lldd and the remoteport and localport. There is a point in all deletions where the transport will no longer interact with the lldd on behalf of a controller. That point centers around the association established with the target/subsystem. It will access the lldd whenever it attempts to create an association and while the association is active. New associations may only be created if the remoteport is live (thus the localport is live). It will not access the lldd after deleting the association. Therefore, the patch tracks the count of active controllers - those with associations being created or that are active - on a remoteport. It also tracks the number of remoteports that have active controllers, on a a localport. When a remoteport is unregistered, as soon as there are no active controllers, the lldd's remoteport_delete may be called and the lldd may continue. Similarly, when a localport is unregistered, as soon as there are no remoteports with active controllers, the localport_delete callback may be made. This significantly speeds up unregistration with the lldd. The transport objects continue in suspended status with reconnect timers running, and upon expiration, normal ref-counting will occur and the objects will be freed. The transport object may still be held hostage by the application/kernel module, but that is acceptable. With this change, the lldd may be fully unloaded and reloaded, and if registrations occur prior to the timeouts, the nvme controller and namespaces will resume normally as if a link bounce. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
James Smart	c5760f300e	nvme-fc: fix localport resume using stale values The localport resume was not updating the lldd ops structure. If the lldd is unloaded and reloaded, the ops pointers will differ. Additionally, as there are device references taken by the localport, ensure that resume only resumes if the device matches as well. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	84fef62d13	nvme: check admin passthru command effects The NVMe standard provides a command effects log page so the host may be aware of special requirements it may need to do for a particular command. For example, the command may need to run with IO quiesced to prevent timeouts or undefined behavior, or it may change the logical block formats that determine how the host needs to construct future commands. This patch saves the nvme command effects log page if the controller supports it, and performs appropriate actions before and after an admin passthrough command is completed. If the controller does not support the command effects log page, the driver will define the effects for known opcodes. The nvme format and santize are the only commands in this patch with known effects. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Keith Busch	c627c487ec	nvme: factor get log into a helper And fix the warning on a successful firmware log. Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	715ea9e09d	nvme: fix and clarify the check for missing metadata Update the check in nvme_setup_rw for missing metadata so that it is together with the other metadata handling, does not contain impossible to reach conditions and warns if we get an impossible requests for a (non-PI) metadata-enabled namespace when CONFIG_BLK_DEV_INTEGRITY is not set. Also add a little helper that checks if a given metadata configuration contains protection information Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Javier González <jg@lightnvm.io> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	24b0b58c5b	nvme: split __nvme_revalidate_disk Split out the code that applies the calculate value to a given disk/queue into new helper that can be reused by the multipath code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	6e78f21ae4	nvme: set the chunk size before freezing the queue We don't need a frozen queue to update the chunk_size, which just is a hint, and moving it a little earlier will allow for some better code reuse with the multipath code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	30e5e929c7	nvme: don't pass struct nvme_ns to nvme_config_discard To allow reusing this function for the multipath node. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	39b7baa410	nvme: don't pass struct nvme_ns to nvme_init_integrity To allow reusing this function for the multipath node. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	b5be3b3929	nvme: always unregister the integrity profile in __nvme_revalidate_disk This is safe because the queue is always frozen when we revalidate, and it simplifies both the existing code as well as the multipath implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Christoph Hellwig	e54b064cb2	nvme: move the dying queue check from cancel to completion With multipath we don't want a hard DNR bit on a request that is cancelled by a controller reset, but instead want to be able to retry it on another patch. To archive this don't always set the DNR bit when the queue is dying in nvme_cancel_request, but defer that decision to nvme_req_needs_retry. Note that it applies to any command there and not just cancelled commands, but one the queue is dying that is the right thing to do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-10 19:53:25 -07:00
Jens Axboe	3e2cb3ad47	Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/block Pull NVMe changes from Christoph: "Below are the currently queue nvme updates for Linux 4.15. There are a few more things that could make it for this merge window, but I'd like to get things into linux-next, especially for the unlikely case that Linus decided to cut -rc8. Highlights: - support for SGLs in the PCIe driver (Chaitanya Kulkarni) - disable I/O schedulers for the admin queue (Israel Rukshin) - various Fibre Channel fixes and enhancements (James Smart) - various refactoring for better code sharing between transports (Sagi Grimberg and me) as well as lots of little bits from various contributors."	2017-11-03 10:28:51 -06:00
Minwoo Im	a806c6c81e	nvme: comment typo fixed in clearing AER a tiny typo fixed in a comment. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-03 10:02:20 +03:00
Linus Torvalds	ead751507d	License cleanup: add SPDX license identifiers to some files Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA 6dVh26uchcEQLN/XqUDt =x306 -----END PGP SIGNATURE----- Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull initial SPDX identifiers from Greg KH: "License cleanup: add SPDX license identifiers to some files Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: License cleanup: add SPDX license identifier to uapi header files with a license License cleanup: add SPDX license identifier to uapi header files with no license License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 10:04:46 -07:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Keith Busch	3639efef8f	nvme: Remove unused headers Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:43 +01:00
James Smart	2b632970da	nvme-fc: add dev_loss_tmo timeout and remoteport resume support When a remoteport is unregistered (connectivity lost), the following actions are taken: - the remoteport is marked DELETED - the time when dev_loss_tmo would expire is set in the remoteport - all controllers on the remoteport are reset. After a controller resets, it will stall in a RECONNECTING state waiting for one of the following: - the controller will continue to attempt reconnect per max_retries and reconnect_delay. As no remoteport connectivity, the reconnect attempt will immediately fail. If max reconnects has not been reached, a new reconnect_delay timer will be schedule. If the current time plus another reconnect_delay exceeds when dev_loss_tmo expires on the remote port, then the reconnect_delay will be shortend to schedule no later than when dev_loss_tmo expires. If max reconnect attempts are reached (e.g. ctrl_loss_tmo reached) or dev_loss_tmo ix exceeded without connectivity, the controller is deleted. - the remoteport is re-registered prior to dev_loss_tmo expiring. The resume of the remoteport will immediately attempt to reconnect each of its suspended controllers. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> [hch: updated to use nvme_delete_ctrl] Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:41 +01:00
James Smart	3cec7f9de4	nvme: allow controller RESETTING to RECONNECTING transition Transport will typically transition from LIVE to RESETTING when initially performing a reset or recovering from an error. Adding this transition allows a transport to transition to RECONNECTING when it checks/waits for connectivity then creates new transport connections and reinits the controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:40 +01:00
James Smart	96e2480105	nvme-fc: check connectivity before initiating reconnects Check remoteport connectivity before initiating reconnects Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:39 +01:00
James Smart	ac7fe82b6f	nvme-fc: add a dev_loss_tmo field to the remoteport Add a dev_loss_tmo value, paralleling the SCSI FC transport, for device connectivity loss. The transport initializes the value in the nvme_fc_register_remoteport() call. If the value is not set, a default of 60s is set. Add a new routine to the api, nvme_fc_set_remoteport_devloss() routine, which allows the lldd to dynamically update the value on an existing remoteport. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:36 +01:00
James Smart	44c6ec77e1	nvme-fc: change ctlr state assignments during reset/reconnect Clean up some of the controller state checks and add the RESETTING->RECONNECTING state transition. Specifically: - the movement of the RESETTING state change and schedule of reset_work to core doesn't work wiht nvme_fc_error_recovery setting state to RECONNECTING before attempting to reset. Remove the state change as the reset request does it. - In the rare cases where an error occurs right as we're transitioning to LIVE, defer the controller start actions. - In error handling on teardown of associations while performing initial controller creation - avoid quiesce calls on the admin_q. They are unneeded. - Add the RESETTING->RECONNECTING transition in the reset handler. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:23 +01:00
Sagi Grimberg	4054637c9b	nvme: flush reset_work before safely continuing with delete operation Prevent racing controller reset and delete flows. reset_work must not ever self-requeue so flushing it suffices. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:17 +01:00
Sagi Grimberg	12fa1304ba	nvme-rdma: reuse nvme_delete_ctrl when reconnect attempts expire instead of just queueing delete work Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:07 +01:00
Christoph Hellwig	6cd53d14aa	nvme: consolidate common code from ->reset_work No change in behavior except that the FC code cancels two work items a little later now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:06 +01:00
Christoph Hellwig	e9bc25874c	nvme-rdma: remove nvme_rdma_remove_ctrl It is only used in two places, and some of the work done by it will be taken into common code soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:05 +01:00
Christoph Hellwig	c5017e8570	nvme: move controller deletion to common code Move the ->delete_work and the associated helpers to common code instead of duplicating them in every driver. This also adds the missing reference get/put for the loop driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:04 +01:00
Christoph Hellwig	29c0964873	nvme-fc: merge __nvme_fc_schedule_delete_work into __nvme_fc_del_ctrl No need to have two functions doing the same thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:04 +01:00
James Smart	71c691fd06	nvme-fc: avoid workqueue flush stalls There's no need to wait for the full nvme_wq, which is now shared, to flush. flush only the delete_work item. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sgi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:03 +01:00
Keith Busch	5e0fab57fb	nvme: Fix setting logical block format when revalidating Revalidating the disk needs to set the logical block format and capacity, otherwise it can't figure out if the users modified anything about the namespace. Fixes: `cdbff4f26b` ("nvme: remove nvme_revalidate_ns") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 08:22:41 -06:00
James Smart	ecad0d2cb8	nvme-fc: remove NVME_FC_MAX_SEGMENTS The define is an arbitrary limit to the io size on the initiator, capping the io to 1MB-4KB. Remove the define from the transport. I/O size will solely be limited by the LLDD sg limits. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:35 +03:00
James Smart	56d5f4f108	nvme-fc: add support for duplicate_connect option Adds support for the duplicate_connect option. When set to true, checks whether there's an existing controller via the same host port and target port for the same host (hostnqn, hostid) to the same subsystem. Fails the connection request if an existing controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:32 +03:00
James Smart	36e835f243	nvme-rdma: add support for duplicate_connect option Adds support for the duplicate_connect option. When set to true, checks whether there's an existing controller via the same target address (traddr), target port (trsvcid), and if specified, host address (host_traddr). Fails the connection request if there is an existing controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:32 +03:00
James Smart	991231dc48	nvme: add helper to compare options to controller Adds a helper function that compares the host and subsytem specified in a connect options list vs a controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:28 +03:00
James Smart	3b33876207	nvme: add duplicate_connect option Add the "duplicate_connect" boolean option (presence means true). Default is false. When false, the transport should validate whether a new controller request is targeted for the same host transport addressing and target transport addressing as an existing controller. If so, the new controller request should be rejected. When true, the callee is explicitly requesting a duplicate controller connection to be made and the new request should be attempted. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:20 +03:00
Christoph Hellwig	999ada2871	nvme: check for a live controller in nvme_dev_open This is a much more sensible check than just the admin queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@rimbeg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>	2017-10-27 09:05:22 +03:00
Christoph Hellwig	a6a5149b10	nvme: get rid of nvme_ctrl_list Use the core chrdev code to set up the link between the character device and the nvme controller. This allows us to get rid of the global list of all controllers, and also ensures that we have both a reference to the controller and the transport module before the open method of the character device is called. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sgi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com>	2017-10-27 09:04:53 +03:00
Christoph Hellwig	d22524a478	nvme: switch controller refcounting to use struct device Instead of allocating a separate struct device for the character device handle embedd it into struct nvme_ctrl and use it for the main controller refcounting. This removes double refcounting and gets us an automatic reference for the character device operations. We keep ctrl->device as a pointer for now to avoid chaning printks all over, but in the future we could look into message printing helpers that take a controller structure similar to what other subsystems do. Note the delete_ctrl operation always already has a reference (either through sysfs due this change, or because every open file on the /dev/nvme-fabrics node has a refernece) when it is entered now, so we don't need to do the unless_zero variant there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com>	2017-10-27 09:04:07 +03:00
Christoph Hellwig	c6424a90da	nvme: simplify nvme_open Now that we are protected against lookup vs free races for the namespace by using kref_get_unless_zero we don't need the hack of NULLing out the disk private data during removal. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>	2017-10-27 09:03:31 +03:00
Christoph Hellwig	2dd4122854	nvme: use kref_get_unless_zero in nvme_find_get_ns For kref_get_unless_zero to protect against lookup vs free races we need to use it in all places where we aren't guaranteed to already hold a reference. There is no such guarantee in nvme_find_get_ns, so switch to kref_get_unless_zero in this function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>	2017-10-27 09:02:58 +03:00
Nitzan Carmi	e62a538da2	nvme-rdma: Add debug message when reaches timeout Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-23 16:28:42 +02:00
Max Gurtovoy	f87c89ad93	nvme-rdma: align nvme_rdma_device structure Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-23 16:28:05 +02:00
Sagi Grimberg	7db8144653	nvme-rdma: fix possible hang when issuing commands during ctrl removal nvme_rdma_queue_is_ready() fails requests in case a queue is not LIVE. If the controller is in RECONNECTING state, we might be in this state for a long time (until we successfully reconnect) and we are better off with failing the request fast. Otherwise, we fail with BLK_STS_RESOURCE to have the block layer try again soon. In case we are removing the controller when the admin queue is not LIVE, we will terminate the request with BLK_STS_RESOURCE but it happens before we call blk_mq_start_request() so the request timeout never expires, and the queue will never get back to LIVE (because we are removing the controller). This causes the removal operation to block infinitly [1]. Thus, if we are removing (state DELETING), and the queue is not LIVE, we need to fail the request permanently as there is no chance for it to ever complete successfully. [1] -- sysrq: SysRq : Show Blocked State task PC stack pid father kworker/u66:2 D 0 440 2 0x80000000 Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma] Call Trace: __schedule+0x3e9/0xb00 schedule+0x40/0x90 schedule_timeout+0x221/0x580 io_schedule_timeout+0x1e/0x50 wait_for_completion_io_timeout+0x118/0x180 blk_execute_rq+0x86/0xc0 __nvme_submit_sync_cmd+0x89/0xf0 nvmf_reg_write32+0x4b/0x90 [nvme_fabrics] nvme_shutdown_ctrl+0x41/0xe0 nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma] nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma] nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma] process_one_work+0x1fd/0x630 worker_thread+0x1db/0x3b0 kthread+0x11e/0x150 ret_from_fork+0x27/0x40 01 D 0 2868 2862 0x00000000 Call Trace: __schedule+0x3e9/0xb00 schedule+0x40/0x90 schedule_timeout+0x260/0x580 wait_for_completion+0x108/0x170 flush_work+0x1e0/0x270 nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma] nvme_sysfs_delete+0x2a/0x40 dev_attr_store+0x18/0x30 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x124/0x1c0 __vfs_write+0x28/0x150 vfs_write+0xc7/0x1b0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x18/0xad -- Reported-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-23 16:27:44 +02:00
James Smart	134aedc9c1	nvme-fc: correct io timeout behavior The transport io timeout behavior wasn't quite correct. It ignored that the io error handler is supposed to be synchronous so it possibly allowed the blk request to be restarted while the io associated was still aborting. Timeouts on reserved commands, those used for association create, were never timing out thus they hung out forever. To correct: If an io is times out while a remoteport is not connected, just restart the io timer. The lack of connectivity will simultaneously be resetting the controller, so the reset path will abort and terminate the io. If an io is times out while it was marked for transport abort, just reset the io timer. The abort process is underway and will complete the io. Otherwise, if an io times out, abort the io. If the abort was unsuccessful (unlikely) give up and return not handled. If the abort was successful, as the abort process is underway it will terminate the io, so rather than synchronously waiting, just restart the io timer. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-20 12:17:05 +02:00
James Smart	0a02e39fd1	nvme-fc: correct io termination handling The io completion handling for i/o's that are failing due to to a transport error or association termination had issues, causing io failures (DNR set so retries didn't kick in) or long stalls. Change the io completion handler for the following items: When an io has been completed due to a transport abort (based on an exchange error) or when marked as aborted as part of an association termination (FCOP_FLAGS_TERMIO), set the NVME completion status to NVME_SC_ABORTED. By default, do not set DNR on the status so that a retry can be attempted after association recreate. In cases where an io is failed (non-successful nvme status including aborted), if the controller is being deleted (blk_queue_dying) or the io was part of the ios used for association creation (ctrl state is NEW or RECONNECTING), then additionally set the DNR bit so the io will not be retried. If the failed io was part of association creation, the failure will tear down the partially completioned association and typically restart a new reconnect attempt (another create association later). Rearranged code flow to remove a largely unneeded local variable. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-20 12:16:59 +02:00
Chaitanya Kulkarni	a7a7cbe353	nvme-pci: add SGL support This adds SGL support for NVMe PCIe driver, based on an earlier patch from Rajiv Shanmugam Madeswaran <smrajiv15 at gmail.com>. This patch refactors the original code and adds new module parameter sgl_threshold to determine whether to use SGL or PRP for IOs. The usage of SGLs is controlled by the sgl_threshold module parameter, which allows to conditionally use SGLs if average request segment size (avg_seg_size) is greater than sgl_threshold. In the original patch, the decision of using SGLs was dependent only on the IO size, with the new approach we consider not only IO size but also the number of physical segments present in the IO. We calculate avg_seg_size based on request payload bytes and number of physical segments present in the request. For e.g.:- 1. blk_rq_nr_phys_segments = 2 blk_rq_payload_bytes = 8k avg_seg_size = 4K use sgl if avg_seg_size >= sgl_threshold. 2. blk_rq_nr_phys_segments = 2 blk_rq_payload_bytes = 64k avg_seg_size = 32K use sgl if avg_seg_size >= sgl_threshold. 3. blk_rq_nr_phys_segments = 16 blk_rq_payload_bytes = 64k avg_seg_size = 4K use sgl if avg_seg_size >= sgl_threshold. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-20 12:16:58 +02:00
Christoph Hellwig	9843f685ae	nvme: use ida_simple_{get,remove} for the controller instance Switch to the ida_simple_* helpers instead of opencoding them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>	2017-10-20 12:16:57 +02:00
Sagi Grimberg	f04b9cc87b	nvme-rdma: Fix error status return in tagset allocation failure We should make sure to escelate allocation failures to prevent a use-after-free in nvmf_create_ctrl. Fixes: `b28a308ee7` ("nvme-rdma: move tagset allocation to a dedicated routine") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-19 17:13:51 +02:00
Sagi Grimberg	bd9f07590a	nvme-rdma: Fix possible double free in reconnect flow The fact that we free the async event buffer in nvme_rdma_destroy_admin_queue can cause us to free it more than once because this happens in every reconnect attempt since commit `31fdf18401`. we rely on the queue state flags DELETING to avoid this for other resources. A more complete fix is to not destroy the admin/io queues unconditionally on every reconnect attempt, but its a bit more extensive and will go in the next release. Fixes: `31fdf18401` ("nvme-rdma: reuse configure/destroy_admin_queue") Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-19 17:07:22 +02:00
Israel Rukshin	5a22e2bf44	nvme-fc: Add BLK_MQ_F_NO_SCHED flag to admin tag set Since commit `b86dd81` "block: get rid of blk-mq default scheduler choice Kconfig entries", when setting nr_hw_queues to 1 the admin tag set uses mq-deadline scheduler. This flag is useful for admin queues that aren't used for normal IO. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-19 10:58:54 +02:00
Israel Rukshin	94f29d4f78	nvme-rdma: Add BLK_MQ_F_NO_SCHED flag to admin tag set Since commit `b86dd81` "block: get rid of blk-mq default scheduler choice Kconfig entries", when setting nr_hw_queues to 1 the admin tag set uses mq-deadline scheduler. This flag is useful for admin queues that aren't used for normal IO. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-19 10:58:53 +02:00
Minwoo Im	16772ae6d9	nvme-pci: fix typos in comments fixed comment typos in adapter_alloc_cq() and adapter_alloc_sq(). 'the the' duplications are replaced with 'that the'. Signed-off-by: Minwoo Im <dn3108@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-19 10:56:07 +02:00
James Smart	17c4dc6eb7	nvme-fc: retry initial controller connections 3 times Currently, if a frame is lost of command fails as part of initial association create for a new controller, the new controller connection request will immediately fail. Add in an immediate 3 retry loop before giving up. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:30:01 +02:00
James Smart	8a82dbf191	nvme-fc: fix iowait hang Add missing iowait head initialization. Fix irqsave vs irq: wait_event_lock_irq() doesn't do irq save/restore Fixes: `36715cf4b3` ("nvme_fc: replace ioabort msleep loop with completion”) Cc: <stable@vger.kernel.org> # 4.13 Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@cavium.com> Tested-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:30:00 +02:00
Sagi Grimberg	0ad0bfa298	nvme-rdma: stop controller reset if the controller is deleting If the controller is deleting (in case the user decided to delete it), we have no point to continue reset sequence. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:29:39 +02:00
Sagi Grimberg	5013e98b5e	nvme-rdma: change queue flag semantics DELETING -> ALLOCATED Instead of marking we are deleting, mark we are allocated and check that instead. This makes the logic symmetrical to connected mark check. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:28:55 +02:00
Sagi Grimberg	60a5188633	nvme-rdma: Don't local invalidate if the queue is not live No chance for the local invalidate to succeed if the queue-pair is in error state. Most likely the target will do a remote invalidation of our mr so not a big loss on the test_bit. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:54 +02:00
Sagi Grimberg	5e1fe61d41	nvme-rdma: teardown admin/io queues once on error recovery Relying on the queue state while tearing down on every reconnect attempt is not a good design. We should do it once in err_work and simply try to establish the queues for each reconnect attempt. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:53 +02:00
Sagi Grimberg	0fc176dfda	nvme-rdma: Check that reinit_request got a proper mr Warn if req->mr is NULL as it should never happen. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:52 +02:00
Sagi Grimberg	0c5b43b9c1	nvme-rdma: move assignment to declaration No need for the extra line for trivial assignments. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:52 +02:00
Sagi Grimberg	d8bfceebc4	nvme-rdma: fix wrong logging message Not necessarily address resolution failed. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:51 +02:00
Sagi Grimberg	60070c78ef	nvme-rdma: pass tagset to directly nvme_rdma_free_tagset Instead of flagging admin/io. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:50 +02:00
Sagi Grimberg	31b8446079	nvme: introduce nvme_reinit_tagset Move blk_mq_reinit_tagset from blk-mq to nvme core as the only user of it. Current transports that use it (rdma, fc) simply implement .reinit_request op. This patch does not change any functionality. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-18 19:27:48 +02:00
Christoph Hellwig	761f2e1ed8	nvme: simplify compat_ioctl handling We can just use our normal ioctl handler for the compat case and remove the boilerplate code for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>	2017-10-16 14:54:10 +02:00
Javier González	1a94b2d484	lightnvm: implement generic path for sync I/O Implement a generic path for sending sync I/O on LightNVM. This allows to reuse the standard synchronous path trough blk_execute_rq(), instead of implementing a wait_for_completion on the target side (e.g., pblk). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-13 08:34:57 -06:00
Javier González	1b839187db	lightnvm: fail fast on passthrough commands Make LightNVM passhtrough commands fail fast. User space will then take care of re-submitting. Fixes: `84d4add793` ('lightnvm: add ioctls for vector I/Os') Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-13 08:34:57 -06:00
James Smart	469d0ef06d	nvme-fc: move remote port get/put/free location move nvme_fc_rport_get/put and rport free to higher in the file to avoid adding prototypes to resolve references in upcoming code additions Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-05 10:02:47 +02:00
Christoph Hellwig	8969f1f829	nvme-pci: Use PCI bus address for data/queues in CMB Currently, NVMe PCI host driver is programming CMB dma address as I/O SQs addresses. This results in failures on systems where 1:1 outbound mapping is not used (example Broadcom iProc SOCs) because CMB BAR will be progammed with PCI bus address but NVMe PCI EP will try to access CMB using dma address. To have CMB working on systems without 1:1 outbound mapping, we program PCI bus address for I/O SQs instead of dma address. This approach will work on systems with/without 1:1 outbound mapping. Based on a report and previous patch from Abhishek Shah. Fixes: `8ffaadf7` ("NVMe: Use CMB for the IO SQes if available") Cc: stable@vger.kernel.org Reported-by: Abhishek Shah <abhishek.shah@broadcom.com> Tested-by: Abhishek Shah <abhishek.shah@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-04 11:42:53 +02:00
James Smart	5f5685569a	nvme-fc: create fc class and transport device Added a new fc class and a device node for udev events under it. I expect the fc class will eventually be the location where the FC SCSI and FC NVME merge in the future. Therefore names are kept somewhat generic. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-04 09:48:23 +02:00
James Smart	eaefd5abf6	nvme-fc: add uevent for auto-connect To support auto-connecting to FC-NVME devices upon their dynamic appearance, add a uevent that can kick off connection scripts. uevent is posted against the fc_udev device. patch set tested with the following rule to kick an nvme-cli connect-all for the FC initiator and FC target ports. This is just an example for testing and not intended for real life use. ACTION=="change", SUBSYSTEM=="fc", ENV{FC_EVENT}=="nvmediscovery", \ ENV{NVMEFC_HOST_TRADDR}=="", ENV{NVMEFC_TRADDR}=="", \ RUN+="/bin/sh -c '/usr/local/sbin/nvme connect-all --transport=fc --host-traddr=$env{NVMEFC_HOST_TRADDR} --traddr=$env{NVMEFC_TRADDR} >> /tmp/nvme_fc.log'" I will post proposed udev/systemd scripts for possible kernel support. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-04 09:48:20 +02:00
Sagi Grimberg	d1f1071f81	nvme-fabrics: request transport module Help userspace to make sure transport module is loaded. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-04 09:43:58 +02:00
Marc Olson	8ae4e4477d	nvme: update timeout module parameter type The underlying blk_mq_tag_set, and request timeout parameters support an unsigned int. Extend the size of the nvme module parameters for io and admin commands to match. Signed-off-by: Marc Olson <marcolso@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-04 09:43:09 +02:00
Martin Wilck	007a61ae2f	nvme: fix visibility of "uuid" ns attribute "uuid" must be invisible if both ns->uuid and ns->nguid are unset, not if either one is. Fixes: `d934f9848a` "nvme: provide UUID value to userspace" Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-01 09:58:04 +02:00
Sagi Grimberg	e4d753d7e5	nvme-rdma: don't fully stop the controller in error recovery By calling nvme_stop_ctrl on a already failed controller will wait for the scan work to complete (only by identify timeout expiration which is 60 seconds). This is unnecessary when we already know that the controller has failed. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 12:42:11 -06:00
Sagi Grimberg	0a960afd60	nvme-rdma: give up reconnect if state change fails If we failed to transition to state LIVE after a successful reconnect, then controller deletion already started. In this case there is no point moving forward with reconnect. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 12:42:11 -06:00
Sagi Grimberg	1a40d97288	nvme-core: Use nvme_wq to queue async events and fw activation async_event_work might race as it is executed from two different workqueues at the moment. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 12:42:11 -06:00
Guilherme G. Piccoli	8edd11c9ad	nvme-fabrics: Allow 0 as KATO value Currently, driver code allows user to set 0 as KATO (Keep Alive TimeOut), but this is not being respected. This patch enforces the expected behavior. Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
James Smart	0951338d96	nvme: allow timed-out ios to retry Currently the nvme_req_needs_retry() applies several checks to see if a retry is allowed. On of those is whether the current time has exceeded the start time of the io plus the timeout length. This check, if an io times out, means there is never a retry allowed for the io. Which means applications see the io failure. Remove this check and allow the io to timeout, like it does on other protocols, and retries to be made. On the FC transport, a frame can be lost for an individual io, and there may be no other errors that escalate for the connection/association. The io will timeout, which causes the transport to escalate into creating a new association, but the io that timed out, due to this retry logic, has already failed back to the application and things are hosed. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
James Smart	cd48282cc7	nvme: stop aer posting if controller state not live If an nvme async_event command completes, in most cases, a new async event is posted. However, if the controller enters a resetting or reconnecting state, there is nothing to block the scheduled work element from posting the async event again. Nor are there calls from the transport to stop async events when an association dies. In the case of FC, where the association is torn down, the aer must be aborted on the FC link and completes through the normal job completion path. Thus the terminated async event ends up being rescheduled even though the controller isn't in a valid state for the aer, and the reposting gets the transport into a partially torn down data structure. It's possible to hit the scenario on rdma, although much less likely due to an aer completing right as the association is terminated and as the association teardown reclaims the blk requests via nvme_cancel_request() so its immediate, not a link-related action like on FC. Fix by putting controller state checks in both the async event completion routine where it schedules the async event and in the async event work routine before it calls into the transport. It's effectively a "stop_async_events()" behavior. The transport, when it creates a new association with the subsystem will transition the state back to live and is already restarting the async event posting. Signed-off-by: James Smart <james.smart@broadcom.com> [hch: remove taking a lock over reading the controller state] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
Keith Busch	d087747384	nvme-pci: Print invalid SGL only once The WARN_ONCE macro returns true if the condition is true, not if the warn was raised, so we're printing the scatter list every time it's invalid. This is excessive and makes debugging harder, so this patch prints it just once. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
Keith Busch	161b8be2bd	nvme-pci: initialize queue memory before interrupts A spurious interrupt before the nvme driver has initialized the completion queue may inadvertently cause the driver to believe it has a completion to process. This may result in a NULL dereference since the nvmeq's tags are not set at this point. The patch initializes the host's CQ memory so that a spurious interrupt isn't mistaken for a real completion. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
James Smart	d9d34c0b23	nvme-fc: use transport-specific sgl format Sync with NVM Express spec change and FC-NVME 1.18. FC transport sets SGL type to Transport SGL Data Block Descriptor and subtype to transport-specific value 0x0A. Removed the warn-on's on the PRP fields. They are unneeded. They were to check for values from the upper layer that weren't set right, and for the most part were fine. But, with Async events, which reuse the same structure and 2nd time issued the SGL overlay converted them to the Transport SGL values - the warn-on's were errantly firing. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
James Smart	56b7103a06	nvme-fc: remove use of FC-specific error codes The FC-NVME transport used the FC-specific error codes in cases where it had to fabricate an error to go back up stack. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
Christoph Hellwig	044a9df1a7	nvme-pci: implement the HMB entry number and size limitations Adds support for the new Host Memory Buffer Minimum Descriptor Entry Size and Host Memory Maximum Descriptors Entries field that were added in TP 4002 HMB Enhancements. These allow the controller to advertise limits for the usual number of segments in the host memory buffer, as well as a minimum usable per-segment size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>	2017-09-11 12:29:40 -04:00
Christoph Hellwig	9620cfba97	nvme-pci: propagate (some) errors from host memory buffer setup We want to catch command execution errors when resetting the device, so propagate errors from the Set Features when setting up the host memory buffer. We keep ignoring memory allocation failures, as the spec clearly says that the controller must work without a host memory buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org	2017-09-11 12:29:39 -04:00
Akinobu Mita	30f92d62e5	nvme-pci: use appropriate initial chunk size for HMB allocation The initial chunk size for host memory buffer allocation is currently PAGE_SIZE << MAX_ORDER. MAX_ORDER order allocation is usually failed without CONFIG_DMA_CMA. So the HMB allocation is retried with chunk size PAGE_SIZE << (MAX_ORDER - 1) in general, but there is no problem if the retry allocation works correctly. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org	2017-09-11 12:29:38 -04:00
Christoph Hellwig	92dc689563	nvme-pci: fix host memory buffer allocation fallback nvme_alloc_host_mem currently contains two loops that are interwinded, and the outer retry loop turns out to be broken. Fix this by untangling the two. Based on a report an initial patch from Akinobu Mita. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Tested-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org	2017-09-11 12:29:37 -04:00
Christoph Hellwig	608cc4b14a	nvme: fix lightnvm check nvme_nvm_ns_supported assumes every device is a pci_dev, which leads to reading an incorrect field, or possible even a dereference of unallocated memory for fabrics controllers. Fix this by introducing a quirk for lighnvm capable devices instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2017-09-11 12:29:36 -04:00
Linus Torvalds	126e76ffbf	Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...	2017-09-09 12:49:01 -07:00
Linus Torvalds	a0725ab0c7	Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...	2017-09-07 11:59:42 -07:00
Linus Torvalds	aa9d4648c2	Updates for 4.14 kernel merge window - Lots of hfi1 driver updates (mixed with a few qib and core updates as well) - rxe updates - various mlx updates - Set default roce type to RoCEv2 - Several larger fixes for bnxt_re that were too big for -rc - Several larger fixes for qedr that, likewise, were too big for -rc - Misc core changes - Make the hns_roce driver compilable on arches other than aarch64 so we can more easily debug build issues related to it - Add rdma-netlink infrastructure updates - Add automatic IRQ affinity infrastructure - Add 32bit lid support - Lots of misc fixes across the subsystem from random people - Autoloading of RDMA netlink modules - PCI pool cleanups from Romain Perier - mlx5 driver feature additions and fixes - Hardware tag matchine feature - Fix sleeping in atomic when resolving roce ah - Add experimental ioctl interface as posted to linux-api@ -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZqBDtAAoJELgmozMOVy/dNlcQAJhYNRGaNUBx0L6+8t2xwUrt 7ndP6qlMar30DJY9FjTQCzRBw0CRMWkXdJD8rYlyaHy07pjWDKG8LZtxEXu1FLdZ oNRvQX6ZJh8Bz7db2SQFBCTF2uWGZZFqWQCrSbQwjj9xxjMDs59u/knmwHVY9dKk egjPG4IQBDmcTeNY7h1otG2hXpx7QPIOilQW2EFN5SWAuBAazdF2JKxjjxqhnUfp gD2pSdgsm3VSMoo0zpMa6qOP+9GcOu8J97fYFhasRYWCavPdWHyq+XNu9S/eicRd xbv+seCYM+9jPb2dsNdjEKll7w3yyWdu7h6tSCMPYv54eN9sDDiO1w2L2ZnESMZa JRnSfB+HXru1r4RyHOTPO8peaNhYlR1V4u8bTS5G2dffbHis9BajkWoAR/oSiUcB AIjIIDcdJFVGfpF9KIt/pEl+adHNgESibSijzOUYkyw6RNbPqDmdd7YakPHcQhKN clE3zQfIsPRLWsToP/nkBE0tUd3tQocRuLy7ote7hXQK+0p7TBz0a6Kkj87MvX33 8dVbUI+q6WRlEY90l71y0ZdXy/AvkxkFxAc4Y7FQZyJxhEArTaKgfa5fmpRwVxBm yi9baoYCspHNRNv6AO4IL86ZCJqmWBuch8CBY1n2X3h8IGfKYEZUAZ+T/mnTTeUq A4joXduz94ZD4w23leD1 =2ntC -----END PGP SIGNATURE----- Merge tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is a big pull request. Of note is that I'm sending you the new ioctl API for the rdma subsystem. We put it up on linux-api@, but didn't get much response. The API is complex, but it solves two different problems in one go: 1) The bi-directional nature of the RDMA file write calls, which created the security hole we had to handle (and for which the fix is now causing problems for systems in production, we were a bit over zealous in the fix and the ability to open a device, then fork, then create new queue pairs on the device and use them is broken). 2) The bloat caused by different vendors implementing extensions to the base verbs API. Each vendor's hardware is slightly different, and the hardware might be suitable for one extension but not another. By the time we add generic extensions for all the different ways that the different hardware can offload things, the API becomes bloated. Things like our completion structs have started to exceed a cache line in size because of all the elements needed to support this. That in turn shows up heavily in the performance graphs with a noticable drop in performance on 100Gigabit links as our completion structs go from occupying one cache line to 1+. This API makes things like the completion structs modular in a very similar way to netlink so that your structs can only include the items needed for the offloads/features you are actually using on a given queue pair. In that way we support everything, but only use what we need, and our structs stay smaller. The ioctl API is better explained by the posting on linux-api@ than I can explain it here, so I'll just leave it at that. The rest of the pull request is typical stuff. Updates for 4.14 kernel merge window - Lots of hfi1 driver updates (mixed with a few qib and core updates as well) - rxe updates - various mlx updates - Set default roce type to RoCEv2 - Several larger fixes for bnxt_re that were too big for -rc - Several larger fixes for qedr that, likewise, were too big for -rc - Misc core changes - Make the hns_roce driver compilable on arches other than aarch64 so we can more easily debug build issues related to it - Add rdma-netlink infrastructure updates - Add automatic IRQ affinity infrastructure - Add 32bit lid support - Lots of misc fixes across the subsystem from random people - Autoloading of RDMA netlink modules - PCI pool cleanups from Romain Perier - mlx5 driver feature additions and fixes - Hardware tag matchine feature - Fix sleeping in atomic when resolving roce ah - Add experimental ioctl interface as posted to linux-api@" * tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits) IB/core: Expose ioctl interface through experimental Kconfig IB/core: Assign root to all drivers IB/core: Add completion queue (cq) object actions IB/core: Add legacy driver's user-data IB/core: Export ioctl enum types to user-space IB/core: Explicitly destroy an object while keeping uobject IB/core: Add macros for declaring methods and attributes IB/core: Add uverbs merge trees functionality IB/core: Add DEVICE object and root tree structure IB/core: Declare an object instead of declaring only type attributes IB/core: Add new ioctl interface RDMA/vmw_pvrdma: Fix a signedness RDMA/vmw_pvrdma: Report network header type in WC IB/core: Add might_sleep() annotation to ib_init_ah_from_wc() IB/cm: Fix sleeping in atomic when RoCE is used IB/core: Add support to finalize objects in one transaction IB/core: Add a generic way to execute an operation on a uobject Documentation: Hardware tag matching IB/mlx5: Support IB_SRQT_TM net/mlx5: Add XRQ support ...	2017-09-03 17:49:17 -07:00
Daniel Verkamp	40a5fce495	nvme-fabrics: generate spec-compliant UUID NQNs The default host NQN, which is generated based on the host's UUID, does not follow the UUID-based NQN format laid out in the NVMe 1.3 specification. Remove the "NVMf:" portion of the NQN to match the spec. Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-09-01 09:48:01 +02:00
Keith Busch	63263d60e0	nvme: Use metadata for passthrough commands The ioctls' struct allows the user to provide a metadata address and length for a passthrough command. This patch uses these values that were previously ignored and deletes the now unused wrapper function. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:59:35 +02:00
Keith Busch	485783ca63	nvme: Make nvme user functions static These functions are used only locally in the nvme core. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:59:34 +02:00
Keith Busch	b5d8af5b52	nvme/pci: Use req_op to determine DIF remapping Only read and write commands need DIF remapping. Everything else uses a passthrough integrity payload. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:59:33 +02:00
Christoph Hellwig	1cad65620f	nvme: factor metadata handling out of __nvme_submit_user_cmd Keep the metadata code in a separate helper instead of making the main function more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:59:30 +02:00
Roland Dreier	489beb91e6	nvme-fabrics: Convert nvmf_transports_mutex to an rwsem The mutex protects against the list of transports changing while a controller is being created, but using a plain old mutex means that it also serializes controller creation. This unnecessarily slows down creating multiple controllers - for example for the RDMA transport, creating a controller involves establishing one connection for every IO queue, which involves even more network/software round trips, so the delay can become significant. The simplest way to fix this is to change the mutex to an rwsem and only hold it for writing when the list is being mutated. Since we can take the rwsem for reading while creating a controller, we can create multiple controllers in parallel. Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:51:22 +02:00
Christoph Hellwig	4033f35d17	nvme-pci: use dma memory for the host memory buffer descriptors The NVMe 1.3 specification says in section 5.21.1.13: "After a successful completion of a Set Features enabling the host memory buffer, the host shall not write to the associated host memory region, buffer size, or descriptor list until the host memory buffer has been disabled." While this doesn't state that the descriptor list must remain accessible to the device it certainly implies it must remaing readable by the device. So switch to a dma coherent allocation for the descriptor list just to be safe - it's not like the cost for it matters compared to the actual memory buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Fixes: `87ad72a59a` ("nvme-pci: implement host memory buffer support")	2017-08-30 14:46:16 +02:00
Max Gurtovoy	b925a2dc16	nvme-rdma: default MR page size to 4k Due to various page sizes in the system (IOMMU/device/kernel), we set the fabrics controller page size to 4k and block layer boundaries accordinglly. In architectures that uses different kernel page size we'll have a mismatch to the MR page size that may cause a mapping error. Update the MR page size to correspond to the core ctrl settings. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-30 14:46:10 +02:00
Christoph Hellwig	1d5df6af8c	nvme: don't blindly overwrite identifiers on disk revalidate Instead validate that these identifiers do not change, as that is prohibited by the specification. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>	2017-08-29 10:23:04 +02:00
Christoph Hellwig	cdbff4f26b	nvme: remove nvme_revalidate_ns The function is used in two places, and the shared code for those will diverge later in this series. Instead factor out a new helper to get the ids for a namespace, simplify the calling conventions for nvme_identify_ns and just open code the sequence. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-29 10:22:39 +02:00
Christoph Hellwig	57eeaf8ec6	nvme: remove unused struct nvme_ns fields And move the flags for the flags field near that field while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-29 10:22:26 +02:00
Christoph Hellwig	0a72bbba49	nvme: allow calling nvme_change_ctrl_state from irq context Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-29 10:22:23 +02:00
Christoph Hellwig	a751da3394	nvme: report more detailed status codes to the block layer Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-29 10:22:18 +02:00
Martin K. Petersen	07fbd32a6b	nvme: honor RTD3 Entry Latency for shutdowns If an NVMe controller reports RTD3 Entry Latency larger than shutdown_timeout, up to a maximum of 60 seconds, use that value to set the shutdown timer. Otherwise fall back to the module parameter which defaults to 5 seconds. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: removed do_div, made transition time local scope] Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:44 +03:00
Jan H. Schönherr	5228b3280b	nvme: fix uninitialized prp2 value on small transfers The value of iod->first_dma ends up as prp2 in NVMe commands. In case there is not enough data to cross a page boundary, iod->first_dma is never initialized and contains random data. Comply with the NVMe specification and fill in 0 in that case. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:44 +03:00
Max Gurtovoy	a7b7c7a105	nvme-rdma: Use unlikely macro in the fast path This patch slightly improves performance (mainly for small block sizes). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:43 +03:00
James Smart	5533d42480	nvme-fc: Reattach to localports on re-registration If the LLDD resets or detaches from an fc port, the LLDD will deregister all remoteports seen by the fc port and deregister the localport associated with the fc port. The teardown of the localport structure will be held off due to reference counting until all the remoteports are removed (and they are held off until all controllers/associations to terminated). Currently, if the fc port is reinit/reattached and registered again as a localport it is treated as an independent entity from the prior localport and all prior remoteports and controllers cannot be revived. They are created as new and separate entities. This patch changes the localport registration to look at the known localports that are waiting to be torndown. If they are the same port based on wwn's, the local port is transitioned out of the teardown state. This allows the remote ports and controller connections to be reestablished and resumed as long as the localport can also be reregistered within the timeout windows. The patch adds a new routine nvme_fc_attach_to_unreg_lport() with the functionality and moves the lport get/put routines to avoid forward references. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:40 +03:00
Max Gurtovoy	60b43f627a	nvme: rename AMS symbolic constants to fit specification Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:39 +03:00
Sagi Grimberg	caaa15c509	nvme: fix identify namespace logging Use ctrl->device and lose the func name. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:38 +03:00
Guan Junxiong	9b483da15d	nvme-fabrics: log a warning if hostid is invalid This helps users to quickly spot the reason of why connection fails if the hostid is not compliant with the uuid format. Signed-off-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:37 +03:00
Sagi Grimberg	09fdc23b29	nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64 To make the nvme_rdma_configure_admin_queue generic in preparation of moving it to common code. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:36 +03:00
Sagi Grimberg	370ae6e450	nvme-rdma: cleanup error path in controller reset No need to queue an extra work to indirect controller removal, just call the ctrl remove routine. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:36 +03:00
Sagi Grimberg	68e16fcfaf	nvme-rdma: introduce nvme_rdma_start_queue This should pair with nvme_rdma_stop_queue. While this is not a complete inverse, it still pairs up pretty well because in fabrics we don't have a disconnect capsule (yet) but we simply teardown the transport association. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:35 +03:00
Sagi Grimberg	41e8cfa117	nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue Give it a name symmetric to nvme_rdma_free_queue. Also pass in the ctrl sqsize+1 and not the opts queue_size. And suppress a superflous failure message. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:34 +03:00
Sagi Grimberg	148b4e7ff3	nvme-rdma: stop queues instead of simply flipping their state If we move the queues from LIVE state, we might as well stop them (drain for rdma). Do it after we stop the request queues to prevent a stray request sneaking in .queue_rq after we stop the queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:33 +03:00
Sagi Grimberg	a57bd54122	nvme-rdma: introduce configure/destroy io queues Make a symmetrical handling with admin queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:33 +03:00
Sagi Grimberg	31fdf18401	nvme-rdma: reuse configure/destroy_admin_queue No need to open-code it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:32 +03:00
Sagi Grimberg	3f02fffb74	nvme-rdma: don't free tagset on resets We're not supposed to do that. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 23:00:27 +03:00
Sagi Grimberg	18398af2dc	nvme-rdma: disable the controller on resets Mimic the pci driver as a controller disable might be more lightweight than a shutdown. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:26 +02:00
Sagi Grimberg	b28a308ee7	nvme-rdma: move tagset allocation to a dedicated routine We always pair tagset allocation with rdma device reference and it shares some code, centralize it with an argument if its an admin or IO tagset. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:25 +02:00
Sagi Grimberg	34b6c2315e	nvme: Add admin_tagset pointer to nvme_ctrl Will be used when we centralize control flows. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:24 +02:00
Sagi Grimberg	90af35123d	nvme-rdma: move nvme_rdma_configure_admin_queue code location We will call it from other places so avoid having to forward declare it. Also move it next to nvme_rdma_destroy_admin_queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:23 +02:00
Johannes Thumshirn	4897ad4e08	nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE NVME_RDMA_MAX_SEGMENT_SIZE is not used anywhere, zap it. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:22 +02:00
Jon Derrick	dbf86b3900	nvme: add support for NVMe 1.3 Timestamp Feature NVME's Timestamp feature allows controllers to be aware of the epoch time in milliseconds. This patch adds the set features hook for various transports through the identify path, so that resets and resumes can update the controller as necessary. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> [hch: rebased on top of nvme-4.13 error handling changes, changed nvme_configure_timestamp to return the status] Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-28 21:38:18 +02:00
Arnav Dawn	62346eaeb2	nvme: define NVME_NSID_ALL Define the constant "0xffffffff" (used as nsid for all namespaces) as NVME_NSID_ALL. Signed-off-by: Arnav Dawn <a.dawn@samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-28 21:38:17 +02:00
Arnav Dawn	b6dccf7fae	nvme: add support for FW activation without reset This patch adds support for handling Fw activation without reset On completion of FW-activation-starting AER, all queues are paused till CSTS.PP is cleared or timed out (exceeds max time for fw activtion MTFA). If device fails to clear CSTS.PP within MTFA, driver issues reset controller. Signed-off-by: Arnav Dawn <a.dawn@samsung.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-08-28 21:38:17 +02:00
Jens Axboe	cd996fb47c	Linux 4.13-rc7 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJZo2HiAAoJEHm+PkMAQRiG3OcIAJqSeVK2uQ/QhmqFN1ExYay4 bdTjSTtSk7GH6PxI2C0cqfZvsxOUU7ICDHG8bYM1LA0S0SxfOtoFhHGKc/BcFLX8 MiKJWlF51ZbX0mkIEpKF+C8pRrXPgSqtk3N450/k2BzG9qCZSM93A2NCOB7v9T9w XOBUIYHqfTS2tdmCinjwu8Ls+w8oPOGH1gLjxZyGnBlg4lTqHMcUufmHeVEAh11d giGByqqqXH69kGD1HNC7H6quzXN9rz4n0gEwEG0mIhfkJ98b+ESSWwSEXXypOAQD QT5/6+2YizXf5DPCqR46xasQCPjRsS6Sv0cF2cntW2PEAb4jBjhx5gTFlJcoOC8= =efWJ -----END PGP SIGNATURE----- Merge tag 'v4.13-rc7' into for-4.14/block-postmerge Linux 4.13-rc7 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-28 13:00:44 -06:00
Christoph Hellwig	74d46992e0	block: replace bi_bdev with a gendisk pointer and partitions index This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-23 12:49:55 -06:00
Doug Ledford	d3cf4d9915	Merge branch 'misc' into k.o/for-next Conflicts: drivers/infiniband/core/iwcm.c - The rdma_netlink patches in HEAD and the iwarp cm workqueue fix (don't use WQ_MEM_RECLAIM, we aren't safe for that context) touched the same code. Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-08-18 14:10:23 -04:00
Sagi Grimberg	5138e4bd0b	nvme-rdma: remove redundant empty device add callout Now that its not needed, we can simply not assign it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-08-18 10:46:20 -04:00
Bart Van Assche	d352ae205d	blk-mq: Make blk_mq_reinit_tagset() calls easier to read Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: James Smart <james.smart@broadcom.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-18 08:36:58 -06:00
Keith Busch	e9d8a0fdea	nvme-pci: set cqe_seen on polled completions Fixes: `920d13a884` ("nvme-pci: factor out the cqe reading mechanics from __nvme_process_cq") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-18 09:19:39 +02:00
Christoph Hellwig	81a0b8d74e	nvme-fabrics: fix reporting of unrecognized options Only print the specified options that are not recognized, instead of the whole list of options. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>	2017-08-17 18:48:54 +02:00
Kwan (Hingkwan) Huen-SSI	a082b42628	nvme: fix directive command numd calculation The numd field of directive receive command takes number of dwords to transfer. This fix has the correct calculation for numd. Signed-off-by: Kwan (Hingkwan) Huen-SSI <kwan.huen@samsung.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-10 19:53:44 +02:00
Keith Busch	634b832590	nvme: fix nvme reset command timeout handling We need to return an error if a timeout occurs on any NVMe command during initialization. Without this, the nvme reset work will be stuck. A timeout will have a negative error code, meaning we need to stop initializing the controller. All postitive returns mean the controller is still usable. bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196325 Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Martin Peres <martin.peres@intel.com> [jth consolidated cleanup path ] Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-10 19:53:38 +02:00
Max Gurtovoy	1c78f7735b	nvme-pci: fix CMB sysfs file removal in reset path Currently we create the sysfs entry even if we fail mapping it. In that case, the unmapping will not remove the sysfs created file. There is no good reason to create a sysfs entry for a non working CMB and show his characteristics. Fixes: `f63572dff` ("nvme: unmap CMB and remove sysfs file in reset path") Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Stephen Bates <sbates@raithlin.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-10 11:19:06 +02:00
Martin Wilck	758f373558	nvme: strip trailing 0-bytes in wwid_show Some broken controllers (such as earlier Linux targets) pad model or serial fields with 0-bytes rather than spaces. The NVMe spec disallows 0 bytes in "ASCII" fields. Thus strip trailing 0-bytes, too. Also make sure that we get no underflow for pathological input. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-08-10 10:43:31 +02:00
Sagi Grimberg	0b36658ca8	nvme-rdma: use intelligent affinity based queue mappings Use the generic block layer affinity mapping helper. Also, limit nr_hw_queues to the rdma device number of irq vectors as we don't really need more. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-08-08 14:58:03 -04:00
Scott Bauer	7dd1ab163c	nvme: validate admin queue before unquiesce With a misbehaving controller it's possible we'll never enter the live state and create an admin queue. When we fail out of reset work it's possible we failed out early enough without setting up the admin queue. We tear down queues after a failed reset, but needed to do some more sanitization. Fixes `443bd90f2c`: "nvme: host: unquiesce queue in nvme_kill_queues()" [ 189.650995] nvme nvme1: pci function 0000:0b:00.0 [ 317.680055] nvme nvme0: Device not ready; aborting reset [ 317.680183] nvme nvme0: Removing after probe failure status: -19 [ 317.681258] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 317.681397] general protection fault: 0000 [#1] SMP KASAN [ 317.682984] CPU: 3 PID: 477 Comm: kworker/3:2 Not tainted 4.13.0-rc1+ #5 [ 317.683112] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016 [ 317.683284] Workqueue: events nvme_remove_dead_ctrl_work [nvme] [ 317.683398] task: ffff8803b0990000 task.stack: ffff8803c2ef0000 [ 317.683516] RIP: 0010:blk_mq_unquiesce_queue+0x2b/0xa0 [ 317.683614] RSP: 0018:ffff8803c2ef7d40 EFLAGS: 00010282 [ 317.683716] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 1ffff1006fbdcde3 [ 317.683847] RDX: 0000000000000038 RSI: 1ffff1006f5a9245 RDI: 0000000000000000 [ 317.683978] RBP: ffff8803c2ef7d58 R08: 1ffff1007bcdc974 R09: 0000000000000000 [ 317.684108] R10: 1ffff1007bcdc975 R11: 0000000000000000 R12: 00000000000001c0 [ 317.684239] R13: ffff88037ad49228 R14: ffff88037ad492d0 R15: ffff88037ad492e0 [ 317.684371] FS: 0000000000000000(0000) GS:ffff8803de6c0000(0000) knlGS:0000000000000000 [ 317.684519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 317.684627] CR2: 0000002d1860c000 CR3: 000000045b40d000 CR4: 00000000003406e0 [ 317.684758] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 317.684888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 317.685018] Call Trace: [ 317.685084] nvme_kill_queues+0x4d/0x170 [nvme_core] [ 317.685185] nvme_remove_dead_ctrl_work+0x3a/0x90 [nvme] [ 317.685289] process_one_work+0x771/0x1170 [ 317.685372] worker_thread+0xde/0x11e0 [ 317.685452] ? pci_mmcfg_check_reserved+0x110/0x110 [ 317.685550] kthread+0x2d3/0x3d0 [ 317.685617] ? process_one_work+0x1170/0x1170 [ 317.685704] ? kthread_create_on_node+0xc0/0xc0 [ 317.685785] ret_from_fork+0x25/0x30 [ 317.685798] Code: 0f 1f 44 00 00 55 48 b8 00 00 00 00 00 fc ff df 48 89 e5 41 54 4c 8d a7 c0 01 00 00 53 48 89 fb 4c 89 e2 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 75 50 48 8b bb c0 01 00 00 e8 33 8a f9 00 0f ba b3 [ 317.685872] RIP: blk_mq_unquiesce_queue+0x2b/0xa0 RSP: ffff8803c2ef7d40 [ 317.685908] ---[ end trace a3f8704150b1e8b4 ]--- Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-07-26 17:41:41 +02:00
Christoph Hellwig	50cdb7c61b	nvme-pci: fix HMB size calculation It's possible the preferred HMB size may not be a multiple of the chunk_size. This patch moves len to function scope and uses that in the for loop increment so the last iteration doesn't cause the total size to exceed the allocated HMB size. Based on an earlier patch from Keith Busch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Fixes: `87ad72a59a` ("nvme-pci: implement host memory buffer support")	2017-07-25 18:05:33 +02:00
James Smart	9c5358e15c	nvme-fc: revise TRADDR parsing The FC-NVME spec hasn't locked down on the format string for TRADDR. Currently the spec is lobbying for "nn-<16hexdigits>:pn-<16hexdigits>" where the wwn's are hex values but not prefixed by 0x. Most implementations so far expect a string format of "nn-0x<16hexdigits>:pn-0x<16hexdigits>" to be used. The transport uses the match_u64 parser which requires a leading 0x prefix to set the base properly. If it's not there, a match will either fail or return a base 10 value. The resolution in T11 is pushing out. Therefore, to fix things now and to cover any eventuality and any implementations already in the field, this patch adds support for both formats. The change consists of replacing the token matching routine with a routine that validates the fixed string format, and then builds a local copy of the hex name with a 0x prefix before calling the system parser. Note: the same parser routine exists in both the initiator and target transports. Given this is about the only "shared" item, we chose to replicate rather than create an interdendency on some shared code. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-07-25 18:05:25 +02:00
James Smart	8b25f35192	nvme-fc: address target disconnect race conditions in fcp io submit There are cases where threads are in the process of submitting new io when the LLDD calls in to remove the remote port. In some cases, the next io actually goes to the LLDD, who knows the remoteport isn't present and rejects it. To properly recovery/restart these i/o's we don't want to hard fail them, we want to treat them as temporary resource errors in which a delayed retry will work. Add a couple more checks on remoteport connectivity and commonize the busy response handling when it's seen. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-07-25 17:58:47 +02:00
Johannes Thumshirn	6484f5d16f	nvme: also provide a UUID in the WWID sysfs attribute The WWID sysfs attribute can provide multiple means of a World Wide ID for a NVMe device. It can either be a NGUID, a EUI-64 or a concatenation of VID, Serial Number, Model and the Namespace ID in this order of preference. If the target also sends us a UUID use the UUID for identification and give it the highest priority. This eases generation of /dev/disk/by-* symlinks. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-07-25 17:58:22 +02:00
Christophe JAILLET	b00c9b7aa0	nvme-pci: Fix an error handling path in 'nvme_probe()' Release resources in the correct order in order not to miss a 'put_device()' if 'nvme_dev_map()' fails. Fixes: `b00a726a9f` ("NVMe: Don't unmap controller registers on reset") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-20 08:41:56 -06:00
Keith Busch	86eea2895d	nvme-pci: Remove nvme_setup_prps BUG_ON This patch replaces the invalid nvme SGL kernel panic with a warning, and returns an appropriate error. The warning will occur only on the first occurance, and sgl details will be printed to help debug how the request was allowed to form. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-20 08:41:56 -06:00
David Wayne Fugate	f99cb7af40	nvme-pci: add another device ID with stripe quirk Adds a fourth Intel controller which has the "stripe" quirk. Signed-off-by: David Wayne Fugate <david.fugate@intel.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-20 08:41:56 -06:00
Christoph Hellwig	dc1a0afbac	nvme: fix byte swapping in the streams code Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-20 08:41:56 -06:00
Linus Torvalds	130568d5ea	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...	2017-07-11 15:36:52 -07:00
Jens Axboe	459bd0dc39	Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linus Pull followup NVMe (mostly) changes from Sagi: I added the quiesce/unquiesce patches in here as it's easy for me easily apply changes on top. It has accumulated reviews and includes mostly nvme anyway, please tell me if you don't want to take them with this. This includes: - quiesce/unquiesce fixes in nvme and others from me - nvme-fc add create association padding spec updates from James - some more quirking from MKP - nvmet nit cleanup from Max - Fix nvme-rdma racy RDMA completion signalling from Marta - some centralization patches from me - add tagset nr_hw_queues updates on controller resets in nvme drivers from me - nvme-rdma fix resources recycling when doing error recovery from me - minor cleanups in nvme-fc from me	2017-07-10 11:44:34 -06:00
weiping zhang	b27c1e683d	nvme-pci: add module parameter for io queue depth Adjust io queue depth more easily, and make sure io queue depth >= 2. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-10 13:16:46 +03:00
Dan Carpenter	2ee0e4ed5c	nvme-pci: compile warnings in nvme_alloc_host_mem() "i" should be signed or it could cause a forever loop on the cleanup path. "size" can be used uninitialized. Fixes: `87ad72a59a` ("nvme-pci: implement host memory buffer support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-10 09:09:58 +03:00
Linus Torvalds	f263fbb8d6	pci-v4.13-changes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJZYAFUAAoJEFmIoMA60/r8cFQP/A4fpdjhd42WRNQXGTpZieop i40lBQtGdBn/UY97U6BoutcS1ygDi9OiSzg+IR6I90iMgidqyUHFhe4hGWgVHD2g Tg0KLzd+lKKfQ6Gqt1P6t4dLGLvyEj5NUbCeFE4XYODAUkkiBaOndax6DK1GvU54 Vjuj63rHtMKFR/tG/4iFTigObqyI8QE6O9JVxwuvIyEX6RXKbJe+wkulv5taSnWt Ne94950i10MrELtNreVdi8UbCbXiqjg0r5sKI/WTJ7Bc7WsC7X5PhWlhcNrbHyBT Ivhoypkui3Ky8gvwWqL0KBG+cRp8prBXAdabrD9wRbz0TKnfGI6pQzseCGRnkE6T mhlSJpsSNIHaejoCjk93yPn5oRiTNtPMdVhMpEQL9V/crVRGRRmbd7v2TYvpMHVR JaPZ8bv+C2aBTY8uL3/v/rgrjsMKOYFeaxeNklpErxrknsbgb6BgubmeZXDvTBVv YUIbAkvveonUKisv+kbD8L7tp1+jdbRUT0AikS0NVgAJQhfArOmBcDpTL9YC51vE feFhkVx4A32vvOm7Zcg9A7IMXNjeSfccKGw3dJOAvzgDODuJiaCG6S0o7B5Yngze axMi87ixGT4QM98z/I4MC8E9rDrJdIitlpvb6ZBgiLzoO3kmvsIZZKt8UxWqf5r8 w3U2HoyKH13Qbkn1xkum =mkyb -----END PGP SIGNATURE----- Merge tag 'pci-v4.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add sysfs max_link_speed/width, current_link_speed/width (Wong Vee Khee) - make host bridge IRQ mapping much more generic (Matthew Minter, Lorenzo Pieralisi) - convert most drivers to pci_scan_root_bus_bridge() (Lorenzo Pieralisi) - mutex sriov_configure() (Jakub Kicinski) - mutex pci_error_handlers callbacks (Christoph Hellwig) - split ->reset_notify() into ->reset_prepare()/reset_done() (Christoph Hellwig) - support multiple PCIe portdrv interrupts for MSI as well as MSI-X (Gabriele Paoloni) - allocate MSI/MSI-X vector for Downstream Port Containment (Gabriele Paoloni) - fix MSI IRQ affinity pre/post/min_vecs issue (Michael Hernandez) - test INTx masking during enumeration, not at run-time (Piotr Gregor) - avoid using device_may_wakeup() for runtime PM (Rafael J. Wysocki) - restore the status of PCI devices across hibernation (Chen Yu) - keep parent resources that start at 0x0 (Ard Biesheuvel) - enable ECRC only if device supports it (Bjorn Helgaas) - restore PRI and PASID state after Function-Level Reset (CQ Tang) - skip DPC event if device is not present (Keith Busch) - check domain when matching SMBIOS info (Sujith Pandel) - mark Intel XXV710 NIC INTx masking as broken (Alex Williamson) - avoid AMD SB7xx EHCI USB wakeup defect (Kai-Heng Feng) - work around long-standing Macbook Pro poweroff issue (Bjorn Helgaas) - add Switchtec "running" status flag (Logan Gunthorpe) - fix dra7xx incorrect RW1C IRQ register usage (Arvind Yadav) - modify xilinx-nwl IRQ chip for legacy interrupts (Bharat Kumar Gogada) - move VMD SRCU cleanup after bus, child device removal (Jon Derrick) - add Faraday clock handling (Linus Walleij) - configure Rockchip MPS and reorganize (Shawn Lin) - limit Qualcomm TLP size to 2K (hardware issue) (Srinivas Kandagatla) - support Tegra MSI 64-bit addressing (Thierry Reding) - use Rockchip normal (not privileged) register bank (Shawn Lin) - add HiSilicon Kirin SoC PCIe controller driver (Xiaowei Song) - add Sigma Designs Tango SMP8759 PCIe controller driver (Marc Gonzalez) - add MediaTek PCIe host controller support (Ryder Lee) - add Qualcomm IPQ4019 support (John Crispin) - add HyperV vPCI protocol v1.2 support (Jork Loeser) - add i.MX6 regulator support (Quentin Schulz) * tag 'pci-v4.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (113 commits) PCI: tango: Add Sigma Designs Tango SMP8759 PCIe host bridge support PCI: Add DT binding for Sigma Designs Tango PCIe controller PCI: rockchip: Use normal register bank for config accessors dt-bindings: PCI: Add documentation for MediaTek PCIe PCI: Remove __pci_dev_reset() and pci_dev_reset() PCI: Split ->reset_notify() method into ->reset_prepare() and ->reset_done() PCI: xilinx: Make of_device_ids const PCI: xilinx-nwl: Modify IRQ chip for legacy interrupts PCI: vmd: Move SRCU cleanup after bus, child device removal PCI: vmd: Correct comment: VMD domains start at 0x10000, not 0x1000 PCI: versatile: Add local struct device pointers PCI: tegra: Do not allocate MSI target memory PCI: tegra: Support MSI 64-bit addressing PCI: rockchip: Use local struct device pointer consistently PCI: rockchip: Check for clk_prepare_enable() errors during resume MAINTAINERS: Remove Wenrui Li as Rockchip PCIe driver maintainer PCI: rockchip: Configure RC's MPS setting PCI: rockchip: Reconfigure configuration space header type PCI: rockchip: Split out rockchip_pcie_cfg_configuration_accesses() PCI: rockchip: Move configuration accesses into rockchip_pcie_cfg_atu() ...	2017-07-08 15:51:57 -07:00
Sagi Grimberg	842594c877	nvme-rdma: unconditionally recycle the request mr When our RDMA queue-pair is torn down with high load of I/O traffic, we have no way of knowing if the memory region was actually registered by the reg_mr work request as it completion flushes with error (hw might have done it or not). So in order to not deal with all this uncertanty, we simply recycle the MR in reinit_request. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:49:43 +03:00
Sagi Grimberg	d09f2b45f3	nvme: split nvme_uninit_ctrl into stop and uninit Usually before we teardown the controller we want to: 1. complete/cancel any ctrl inflight works 2. remove ctrl namespaces (only for removal though, resets shouldn't remove any namespaces). but we do not want to destroy the controller device as we might use it for logging during the teardown stage. This patch adds nvme_start_ctrl() which queues inflight controller works (aen, ns scan, queue start and keep-alive if kato is set) and nvme_stop_ctrl() which cancels the works namespace removal is left to the callers to handle. Move nvme_uninit_ctrl after we are done with the controller device. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:49:42 +03:00
Sagi Grimberg	8d7b8fafad	nvme: kick requeue list when requeueing a request instead of when starting the queues When we requeue a request, we can always insert the request back to the scheduler instead of doing it when restarting the queues and kicking the requeue work, so get rid of the requeue kick in nvme (core and drivers). Also, now there is no need start hw queues in nvme_kill_queues We don't stop the hw queues anymore, so no need to start them. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:48:59 +03:00
Sagi Grimberg	c81545f991	nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues quiescing/unquiescing respects the submission path rcu grace. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:48:47 +03:00
Sagi Grimberg	f9c5af5f8f	nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues quiescing/unquiescing respects the submission path rcu grace. Also, make sure to unquiesce before cleanup the admin queue. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-By: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:48:31 +03:00
Sagi Grimberg	fb05133972	nvme-rdma: quiesce/unquiesce admin_q instead of start/stop its hw queues unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues quiescing/unquiescing respects the submission path rcu grace. Also make sure to kick the requeue list when appropriate. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-06 09:47:58 +03:00
Marta Rybczynska	5e599d73c1	nvme-rdma: remove race conditions from IB signalling This patch improves the way the RDMA IB signalling is done by using atomic operations for the signalling variable. This avoids race conditions on sig_count. The signalling interval changes slightly and is now the largest power of two not larger than queue depth / 2. ilog() usage idea by Bart Van Assche. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org	2017-07-06 09:46:58 +03:00
Sagi Grimberg	e5859d3a0e	nvme-fc: use blk_mq_delay_run_hw_queue instead of open-coding it Cc: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-04 10:44:46 +03:00
Sagi Grimberg	cda5fd1ac5	nvme-fc: update tagset nr_hw_queues after queues reinit We might have more/less queues once we reconnect/reset. For example due to cpu going online/offline or controller constraints. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-04 09:44:17 +03:00
Sagi Grimberg	4c8b99f6b1	nvme-rdma: update tagset nr_hw_queues after reconnecting/resetting We might have more/less queues once we reconnect/reset. For example due to cpu going online/offline or controller constraints. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-04 09:44:16 +03:00
Sagi Grimberg	7314183d1d	nvme-fc: don't override opts->nr_io_queues Its what the user passed, so its probably a better idea to keep it intact. Also, limit the number of I/O queues to max online cpus and the lport maximum hw queues. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-04 09:43:26 +03:00
Linus Torvalds	03ffbcdd78	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq department delivers: - Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it. (Thomas Gleixner) Aside of consolidating code this is a preparatory change for: - Finalizing the affinity management for multi-queue devices. The main change here is to shut down interrupts which are affine to a outgoing CPU and reenabling them when the CPU comes online again. That avoids moving interrupts pointlessly around and breaking and reestablishing affinities for no value. (Christoph Hellwig) Note: This contains also the BLOCK-MQ and NVME changes which depend on the rework of the irq core infrastructure. Jens acked them and agreed that they should go with the irq changes. - Consolidation of irq domain code (Marc Zyngier) - State tracking consolidation in the core code (Jeffy Chen) - Add debug infrastructure for hierarchical irq domains (Thomas Gleixner) - Infrastructure enhancement for managing generic interrupt chips via devmem (Bartosz Golaszewski) - Constification work all over the place (Tobias Klauser) - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni) - The usual set of fixes, updates and enhancements all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits) irqchip/or1k-pic: Fix interrupt acknowledgement irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity nvme: Allocate queues for all possible CPUs blk-mq: Create hctx for each present CPU blk-mq: Include all present CPUs in the default queue mapping genirq: Avoid unnecessary low level irq function calls genirq: Set irq masked state when initializing irq_desc genirq/timings: Add infrastructure for estimating the next interrupt arrival time genirq/timings: Add infrastructure to track the interrupt timings genirq/debugfs: Remove pointless NULL pointer check irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID irqchip/gic-v3-its: Add ACPI NUMA node mapping irqchip/gic-v3-its-platform-msi: Make of_device_ids const irqchip/gic-v3-its: Make of_device_ids const irqchip/irq-mvebu-icu: Add new driver for Marvell ICU irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU genirq/irqdomain: Remove auto-recursive hierarchy support irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access ...	2017-07-03 16:50:31 -07:00
Linus Torvalds	c6b1e36c8f	Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block Pull core block/IO updates from Jens Axboe: "This is the main pull request for the block layer for 4.13. Not a huge round in terms of features, but there's a lot of churn related to some core cleanups. Note this depends on the UUID tree pull request, that Christoph already sent out. This pull request contains: - A series from Christoph, unifying the error/stats codes in the block layer. We now use blk_status_t everywhere, instead of using different schemes for different places. - Also from Christoph, some cleanups around request allocation and IO scheduler interactions in blk-mq. - And yet another series from Christoph, cleaning up how we handle and do bounce buffering in the block layer. - A blk-mq debugfs series from Bart, further improving on the support we have for exporting internal information to aid debugging IO hangs or stalls. - Also from Bart, a series that cleans up the request initialization differences across types of devices. - A series from Goldwyn Rodrigues, allowing the block layer to return failure if we will block and the user asked for non-blocking. - Patch from Hannes for supporting setting loop devices block size to that of the underlying device. - Two series of patches from Javier, fixing various issues with lightnvm, particular around pblk. - A series from me, adding support for write hints. This comes with NVMe support as well, so applications can help guide data placement on flash to improve performance, latencies, and write amplification. - A series from Ming, improving and hardening blk-mq support for stopping/starting and quiescing hardware queues. - Two pull requests for NVMe updates. Nothing major on the feature side, but lots of cleanups and bug fixes. From the usual crew. - A series from Neil Brown, greatly improving the bio rescue set support. Most notably, this kills the bio rescue work queues, if we don't really need them. - Lots of other little bug fixes that are all over the place" * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits) lightnvm: pblk: set line bitmap check under debug lightnvm: pblk: verify that cache read is still valid lightnvm: pblk: add initialization check lightnvm: pblk: remove target using async. I/Os lightnvm: pblk: use vmalloc for GC data buffer lightnvm: pblk: use right metadata buffer for recovery lightnvm: pblk: schedule if data is not ready lightnvm: pblk: remove unused return variable lightnvm: pblk: fix double-free on pblk init lightnvm: pblk: fix bad le64 assignations nvme: Makefile: remove dead build rule blk-mq: map all HWQ also in hyperthreaded system nvmet-rdma: register ib_client to not deadlock in device removal nvme_fc: fix error recovery on link down. nvmet_fc: fix crashes on bad opcodes nvme_fc: Fix crash when nvme controller connection fails. nvme_fc: replace ioabort msleep loop with completion nvme_fc: fix double calls to nvme_cleanup_cmd() nvme-fabrics: verify that a controller returns the correct NQN nvme: simplify nvme_dev_attrs_are_visible ...	2017-07-03 10:34:51 -07:00
Linus Torvalds	81e3e04489	UUID/GUID updates: - introduce the new uuid_t/guid_t types that are going to replace the somewhat confusing uuid_be/uuid_le types and make the terminology fit the various specs, as well as the userspace libuuid library. (me, based on a previous version from Amir) - consolidated generic uuid/guid helper functions lifted from XFS and libnvdimm (Amir and me) - conversions to the new types and helpers (Amir, Andy and me) -----BEGIN PGP SIGNATURE----- iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM 2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08 uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ Sxg= =J/4P -----END PGP SIGNATURE----- Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid Pull uuid subsystem from Christoph Hellwig: "This is the new uuid subsystem, in which Amir, Andy and I have started consolidating our uuid/guid helpers and improving the types used for them. Note that various other subsystems have pulled in this tree, so I'd like it to go in early. UUID/GUID summary: - introduce the new uuid_t/guid_t types that are going to replace the somewhat confusing uuid_be/uuid_le types and make the terminology fit the various specs, as well as the userspace libuuid library. (me, based on a previous version from Amir) - consolidated generic uuid/guid helper functions lifted from XFS and libnvdimm (Amir and me) - conversions to the new types and helpers (Amir, Andy and me)" * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits) ACPI: hns_dsaf_acpi_dsm_guid can be static mmc: sdhci-pci: make guid intel_dsm_guid static uuid: Take const on input of uuid_is_null() and guid_is_null() thermal: int340x_thermal: fix compile after the UUID API switch thermal: int340x_thermal: Switch to use new generic UUID API acpi: always include uuid.h ACPI: Switch to use generic guid_t in acpi_evaluate_dsm() ACPI / extlog: Switch to use new generic UUID API ACPI / bus: Switch to use new generic UUID API ACPI / APEI: Switch to use new generic UUID API acpi, nfit: Switch to use new generic UUID API MAINTAINERS: add uuid entry tmpfs: generate random sb->s_uuid scsi_debug: switch to uuid_t nvme: switch to uuid_t sysctl: switch to use uuid_t partitions/ldm: switch to use uuid_t overlayfs: use uuid_t instead of uuid_be fs: switch ->s_uuid to uuid_t ima/policy: switch to use uuid_t ...	2017-07-03 09:55:26 -07:00
Christoph Hellwig	775755ed3c	PCI: Split ->reset_notify() method into ->reset_prepare() and ->reset_done() The pci_error_handlers->reset_notify() method had a flag to indicate whether to prepare for or clean up after a reset. The prepare and done cases have no shared functionality whatsoever, so split them into separate methods. [bhelgaas: changelog, update locking comments] Link: http://lkml.kernel.org/r/20170601111039.8913-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>	2017-07-03 07:58:30 -05:00
Sagi Grimberg	01ad099046	nvme-pci: rename to nvme_pci_configure_admin_queue we are going to need the name for the core routine... Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-02 15:03:23 +03:00
Sagi Grimberg	20d0dfe65a	nvme: move ctrl cap to struct nvme_ctrl All transports use either a private cache of controller cap or an on-stack copy, move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-02 15:03:22 +03:00
Sagi Grimberg	d858e5f04e	nvme: move queue_count to the nvme_ctrl All all transports use the queue_count in exactly the same, so move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-By: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-02 14:59:04 +03:00
Martin K. Petersen	d554b5e1ca	nvme: Quirks for PM1725 controllers PM1725 controllers have a couple of quirks that need to be handled in the driver: - I/O queue depth must be limited to 64 entries on controllers that do not report MQES. - The host interface registers go offline briefly while resetting the chip. Thus a delay is needed before checking whether the controller is ready. Note that the admin queue depth is also limited to 64 on older versions of this board. Since our NVME_AQ_DEPTH is now 32 that is no longer an issue. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2017-07-02 14:59:03 +03:00
Valentin Rothberg	a2b9377500	nvme: Makefile: remove dead build rule Remove dead build rule for drivers/nvme/host/scsi.c which has been removed by commit ("nvme: Remove SCSI translations"). Signed-off-by: Valentin Rothberg <vrothberg@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-29 09:43:23 -06:00
Christoph Hellwig	425a17cbff	nvme: Allocate queues for all possible CPUs Unlike most drіvers that simply pass the maximum possible vectors to pci_alloc_irq_vectors NVMe needs to configure the device before allocting the vectors, so it needs a manual update for the new scheme of using all present CPUs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: linux-block@vger.kernel.org Cc: linux-nvme@lists.infradead.org Link: http://lkml.kernel.org/r/20170626102058.10200-4-hch@lst.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-06-28 23:00:07 +02:00
James Smart	69fa964632	nvme_fc: fix error recovery on link down. Currently, the fc transport invokes nvme_fc_error_recovery() on every io in which the transport detects an error. Which means: a) it's really noisy on large io loads that all get hit by a link down. b) we repeatively call nvme_stop_queues() even though queues are stopped upon the first error or as first steps of reset_work. Correct by: Errors are only meaningful if the controller is in the LIVE state. Thus, enact the reset_work only if LIVE. If called repeatively, state will have already transitioned. There's no need to stop the queues here. Let the first steps of reset_work do the queue stopping. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
James Smart	0b5a7669a4	nvme_fc: Fix crash when nvme controller connection fails. If a controller connection is attempted (say to a subsystem that does not exist), the first attempt errors out. If another connect is attempted, it crashes. Issue is the prior controller has yet execute it's final put, thus its still on lists. However, opts points on it have been cleared, thus causing the crash if they are referenced. Fix is to add the missing put after the nvme_uninit_ctrl() call on the attachment failure. Signed-off-by: Paul Ely <Paul.Ely@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
James Smart	36715cf4b3	nvme_fc: replace ioabort msleep loop with completion Per the recommendation by Sagi on: http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html Wait for io aborts to complete wait converted from msleep look to using a struct completion. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
James Smart	b4dfd6ee99	nvme_fc: fix double calls to nvme_cleanup_cmd() Current fc transport code, on io termination, is calling nvme_cleanup_cmd() followed by the transport dma unmap routine which also calls nvme_cleanup_cmd(). Which means two kfrees occur on the same address, raising havoc. This resulted in odd data errors, effectively corruption.. Fix by removing the extraneous double calls. Call now occurs only in teardown paths and as part of dma unmap routine. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Christoph Hellwig	b1465c6344	nvme-fabrics: verify that a controller returns the correct NQN Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Christoph Hellwig	49d3d50b0d	nvme: simplify nvme_dev_attrs_are_visible Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Christoph Hellwig	180de00700	nvme: read the subsystem NQN from Identify Controller NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the Identify controller data structures. Use this NQN for the subsysnqn sysfs attribute by storing it in the nvme_ctrl structure after verifying it. For older controllers we generate a "fake" NQN per non-normative text in the NVMe 1.3 spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Christoph Hellwig	942fbab4cd	nvme: remove a misleading comment on struct nvme_ns While a NVMe Namespace is somewhat similar to a SCSI Logical Unit (and not a Logical Unit Number anyway) there are subtile differences. Remove the misleading comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grmberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Kai-Heng Feng	76a5af8417	nvme: explicitly disable APST on quirked devices A user reports APST is enabled, even when the NVMe is quirked or with option "default_ps_max_latency_us=0". The current logic will not set APST if the device is quirked. But the NVMe in question will enable APST automatically. Separate the logic "apst is supported" and "to enable apst", so we can use the latter one to explicitly disable APST at initialiaztion. BugLink: https://bugs.launchpad.net/bugs/1699004 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Sagi Grimberg	7aa1f42752	nvme: use a single NVME_AQ_DEPTH and relax it to 32 No need to differentiate fabrics from pci/loop, also lower it to 32 as we don't really need 256 inflight admin commands. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Johannes Thumshirn	6bfe04255d	nvme: add hostid token to fabric options Currently we have no way to define a stable host-id but always use the one which is randomly generated when we add the host or use the default host. Provide a "hostid=%s" for user-space to pass in a persistent host-id which overrides the randomly generated one. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00
Keith Busch	3f7f25a910	nvme: Remove SCSI translations The SCSI-to-NVMe translations were added to assist storage applications utilizing SG_IO transitioning to NVMe. It was always recommended, however, to use native NVMe for device management as too much is lost in translation and the maintenance burden in keeping this kludgey layer around has been neglected such that much of the translations are completely broken. This patch removes SG_IO handling from NVMe to avoid any confusion regarding maintenance support for this interface. The config option for NVMe SCSI emulation has been disabled by default since 4.5. The driver has supported native nvme user commands since the beginning, and native tooling is publicly available for use or as reference for anyone writing their own tools, so there's no excuse for hanging onto a broken crutch. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-28 08:14:13 -06:00

... 3 4 5 6 7 ...

957 Commits