Commit Graph

667 Commits

Author SHA1 Message Date
Scott Bauer
e286bcfc59 nvme/pci: re-check security protocol support after reset
A device may change capabilities after each reset, e.g. due to a firmware
upgrade.  We should thus check for Security Send/Receive and OPAL support
after each reset.

Based on patches from Christoph and Keith.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-23 11:55:43 -07:00
Daniel Roschka
124298bd03 nvme: detect NVMe controller in recent MacBooks
Adds support for detection of the NVMe controller found in the
following recent MacBooks:
- Retina MacBook 2016 (MacBook9,1)
- 13" MacBook Pro 2016 without Touch Bar (MacBook13,1)
- 13" MacBook Pro 2016 with Touch Bar (MacBook13,2)

Signed-off-by: Daniel Roschka <danielroschka@phoenitydawn.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 15:17:29 -07:00
Max Gurtovoy
8f4e8dace3 nvme-rdma: add support for host_traddr
This will enable the user to control the specific interface for
connection establishment in case the host has more than 1 interface
under the same subnet.
E.g:
Host interfaces configured as:
 - ib0 1.1.1.1/16
 - ib1 1.1.1.2/16

Target interfaces configured as:
 - ib0 1.1.1.3/16 (listener interface)
 - ib1 1.1.1.4/16

the following connect command will go through host iface ib0 (default):
nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023

but the following command will go through host iface ib1:
nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023 -w 1.1.1.2

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Christophe JAILLET
6ccaeb5600 nvmet-rdma: Fix error handling
According to the preceeding goto, it is likely that 'out_destroy_sq' was
expected here.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Max Gurtovoy
7a01a6ea2c nvmet-rdma: use nvme cm status helper
Also remove redundant debug prints.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Max Gurtovoy
3ee80c3d15 nvme-rdma: move nvme cm status helper to .h file
This will enable the usage for nvme rdma target.
Also move from a lookup array to a switch statement.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
James Smart
faef3af69d nvme-fc: don't bother to validate ioccsz and iorcsz
Discovery controllers don't set the values. They are in reserved
areas of the Identify Controller data structure.

Given the cmd completed, the minimal capsule sizes are supported,
so no need to check nqn to detect discovery controllers and
special case validations.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Keith Busch
9ef3932e25 nvme/pci: No special case for queue busy on IO
This driver previously required we have a special check for IO submitted
to nvme IO queues that are temporarily suspended. That is no longer
necessary since blk-mq provides a quiesce, so any IO that actually gets
submitted to such a queue must be ended since the queue isn't going to
start back up.

This is fixing a condition where we have fewer IO queues after a
controller reset. This may happen if the number of CPU's has changed,
or controller firmware update changed the queue count, for example.

While it may be possible to complete the IO on a different queue, the
block layer does not provide a way to resubmit a request on a different
hardware context once the request has entered the queue. We don't want
these requests to be stuck indefinitely either, so ending them in error
is our only option at the moment.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Keith Busch
f33447b90e nvme/core: Fix race kicking freed request_queue
If a namespace has already been marked dead, we don't want to kick the
request_queue again since we may have just freed it from another thread.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Keith Busch
6db28eda26 nvme/pci: Disable on removal when disconnected
If the device is not present, the driver should disable the queues
immediately. Prior to this, the driver was relying on the watchdog timer
to kill the queues if requests were outstanding to the device, and that
just delays removal up to one second.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Andy Lutomirski
c5552fde10 nvme: Enable autonomous power state transitions
NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each successive
non-operational state will be entered after an idle time equal to 100%
of the total latency (entry plus exit) associated with that state.
The maximum acceptable latency is controlled using dev_pm_qos
(e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational
states with total latency greater than this value will not be used.
As a special case, setting the latency tolerance to 0 will disable
APST entirely.  On hardware without APST support, the sysfs file will
not be exposed.

The latency tolerance for newly-probed devices is set by the module
parameter nvme_core.default_ps_max_latency_us.

In theory, the device can expose "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

This feature is quirked off on a known-buggy Samsung device.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Andy Lutomirski
bd4da3abaa nvme: Add a quirk mechanism that uses identify_ctrl
Currently, all NVMe quirks are based on PCI IDs.  Add a mechanism to
define quirks based on identify_ctrl's vendor id, model number,
and/or firmware revision.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Johannes Thumshirn
e5a39dd823 nvme: make nvmf_register_transport require a create_ctrl callback
nvmf_create_ctrl() relys on the presence of a create_crtl callback in the
registered nvmf_transport_ops, so make nvmf_register_transport require one.

Update the available call-sites as well to reflect these changes.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Parav Pandit
986994a275 nvme: Use CNS as 8-bit field and avoid endianness conversion
This patch defines CNS field as 8-bit field and avoids cpu_to/from_le
conversions.
Also initialize nvme_command cns value explicitly to NVME_ID_CNS_NS
for readability (don't rely on the fact that NVME_ID_CNS_NS = 0).

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Max Gurtovoy
778f067c18 nvme: add semicolon in nvme_command setting
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Max Gurtovoy
2dbf5816bb nvmet: avoid dereferencing nvmet_req
No need to dereference req twice to get the cmd when we already
have it stored in a local variable.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Sagi Grimberg
8432bdb290 nvme: Make controller state visible via sysfs
Easier for debugging and testing state machine
transitions.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Sagi Grimberg
15fbad96fc nvmet: Make cntlid globally unique
We usually log the cntlid which is confusing in case
we have multiple subsystems each with it's own cntlid ida.
Instead make cntlid ida globally unique and log the initial
association.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
James Smart
f64935abbf nvmet_fc: cleanup of abort flag processing in fcp_op_done
Cleanup of abort flag processing in fcp_op_done.
References were unnecessary

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Colin Ian King
b38b905498 nvme: admin-cmd: fix spelling mistake: "Counld" -> "Could"
trivial fix to spelling mistake in pr_err message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Jens Axboe
818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Jens Axboe
6010720da8 Merge branch 'for-4.11/block' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:06:45 -07:00
Scott Bauer
8a9ae52328 nvme: Check for Security send/recv support before issuing commands.
We need to verify that the controller supports the security
commands before actually trying to issue them.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
[hch: moved the check so that we don't call into the OPAL code if not
      supported]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:49 -07:00
Christoph Hellwig
4f1244c829 block/sed-opal: allocate struct opal_dev dynamically
Insted of bloating the containing structure with it all the time this
allocates struct opal_dev dynamically.  Additionally this allows moving
the definition of struct opal_dev into sed-opal.c.  For this a new
private data field is added to it that is passed to the send/receive
callback.  After that a lot of internals can be made private as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:47 -07:00
Scott Bauer
e225c20eb0 Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
When CONFIG_KASAN is enabled, compilation fails:

block/sed-opal.c: In function 'sed_ioctl':
block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

Moved all the ioctl structures off the stack and dynamically allocate
using _IOC_SIZE()

Fixes: 455a7b238c ("block: Add Sed-opal library")

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-14 19:47:18 -07:00
Christoph Hellwig
b35ba01ea6 nvme: support ranged discard requests
NVMe supports up to 256 ranges per DSM command, so wire up support
for ranged discards up to that limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08 13:43:10 -07:00
Scott Bauer
a98e58e54f nvme: Add Support for Opal: Unlock from S3 & Opal Allocation/Ioctls
This patch implements the necessary logic to unlock an Opal
enabled device coming back from an S3.

The patch also implements the SED/Opal allocation necessary to support
the opal ioctls.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-06 09:44:21 -07:00
Christoph Hellwig
aebf526b53 block: fold cmd_type into the REQ_OP_ space
Instead of keeping two levels of indirection for requests types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:44 -07:00
Christoph Hellwig
57292b58dd block: introduce blk_rq_is_passthrough
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:34 -07:00
Keith Busch
7bf7d77862 nvme/pci: Don't mark IOD as aborted if abort wasn't sent
This patch sets the aborted flag only if an abort was sent, reducing
excessive kernel message spamming for completed IO that wasn't actually
aborted.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:34:47 -07:00
Matias Bjørling
06894efea7 lightnvm: use end_io callback instead of instance
When the lightnvm core had the "gennvm" layer between the device and the
target, there was a need for the core to be able to figure out which
target it should send an end_io callback to. Leading to a "double"
end_io, first for the media manager instance, and then for the target
instance. Now that core and gennvm is merged, there is no longer a need
for this, and a single end_io callback will do.

Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:32:13 -07:00
Matias Bjørling
84d4add793 lightnvm: add ioctls for vector I/Os
Enable user-space to issue vector I/O commands through ioctls. To issue
a vector I/O, the ppa list with addresses is also required and must be
mapped for the controller to access.

For each ioctl, the result and status bits are returned as well, such
that user-space can retrieve the open-channel SSD completion bits.

The implementation covers the traditional use-cases of bad block
management, and vectored read/write/erase.

Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Metadata implementation, test, and fixes.
Signed-off-by: Simon A.F. Lund <slund@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:32:13 -07:00
Matias Bjørling
19bd6fe73c lightnvm: reduce number of nvm_id groups to one
The number of configuration groups has been limited to one in current
code, even if there is support for up to four. With the introduction
of the open-channel SSD 1.3 specification, only a single
group is exposed onwards. Reflect this in the nvm_id structure.

Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:32:13 -07:00
Matias Bjørling
ade69e2432 lightnvm: merge gennvm with core
For the first iteration of Open-Channel SSDs, it was anticipated that
there could be various media managers on top of an open-channel SSD,
such to allow vendors to plug in their own host-side FTLs, without the
media manager in between.

Now that an Open-Channel SSD is exposed as a traditional block device,
there is no longer a need for this. Therefore lets merge the gennvm code
with core and simplify the stack.

Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:32:13 -07:00
Christoph Hellwig
1a7347c7db nvme/scsi: don't rely on BLK_MAX_CDB
The NVMe SCSI emulation doesn't use BLOCK_PC requests, so BLK_MAX_CDB
doesn't have a meaning for it.  Instead opencode the value of 16
and refactor the code a bit so that related checks are next to each
other and we only need to use the value in one place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-30 08:33:51 -07:00
Jens Axboe
f605208e01 nvme: fix compilation of scsi component
Since we moved the cdb parts and define out of the block proper,
we need to include scsi/scsi_request.h for the nvme scsi layer.

Fixes: 82ed4db499 ("block: split scsi_request out of struct request")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-29 20:04:49 -07:00
Jens Axboe
f924ba70c1 Merge branch 'for-4.11/block' into for-4.11/rq-refactor
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 15:08:31 -07:00
Christoph Hellwig
19e420bb40 nvme-fc: use blk_rq_nr_phys_segments
Without this deallocate won't work properly due to the mismatch
of the bio/request size and the actual payload size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-01-26 17:49:14 +02:00
Parav Pandit
748ff8408f nvmet-rdma: Fix missing dma sync to nvme data structures
This patch performs dma sync operations on nvme_command
and nvme_completion.

nvme_command is synced
(a) on receiving of the recv queue completion for cpu access.
(b) before posting recv wqe back to rdma adapter for device access.

nvme_completion is synced
(a) on receiving of the recv queue completion of associated
nvme_command for cpu access.
(b) before posting send wqe to rdma adapter for device access.

This patch is generated for git://git.infradead.org/nvme-fabrics.git
Branch: nvmf-4.10

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
2017-01-26 17:45:03 +02:00
Sagi Grimberg
23a8ed4a62 nvmet: Call fatal_error from keep-alive timout expiration
We only need to call delete_ctrl once, so given that both
keep-alive timeout and any other fatal error can trigger it,
just make sure we only call delete_ctrl once.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-01-26 17:45:03 +02:00
Sagi Grimberg
06406d81a2 nvmet: cancel fatal error and flush async work before free controller
Make sure they are not running and we can free the controller
safely.

Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-01-26 17:45:00 +02:00
Sagi Grimberg
344770b07b nvmet: delete controllers deletion upon subsystem release
No reason for them to be kept around if we are
deleting the subsystem, so instead of passively
wait for the host to disconnect, actively delete
the controllers.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-01-26 17:44:22 +02:00
James Smart
c81e55e057 nvmet_fc: correct logic in disconnect queue LS handling
Correct logic in disconnect queue LS handling.
Rework so that queue searching and error reporting is above the
section to send back a ls rjt

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-01-26 17:44:21 +02:00
Bart Van Assche
92f4ae3559 nvme-rdma: Switch from dma_device to dev.parent
Prepare for removal of ib_device.dma_device.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Sagi Grimberg <sagi@grimbeg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 12:26:17 -05:00
Jens Axboe
d348499138 blk-mq-sched: allow setting of default IO scheduler
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:31 -07:00
Christoph Hellwig
b131c61d62 nvme: use blk_rq_payload_bytes
The new blk_rq_payload_bytes generalizes the payload length hacks
that nvme_map_len did before.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Guilherme G. Piccoli
b5a10c5f75 nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk to adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay or else the action of reading the bit NVME_CSTS_RDY could
somehow corrupt adapter's registers state and it never recovers.

When this quirk was added, we checked ctrl->tagset in order to avoid
quirking in probe time, supposing we would never require such delay
during probe. Well, it was too optimistic; we in fact need this quirk
at probe time in some cases, like after a kexec.

In some experiments, after abnormal shutdown of machine (aka power cord
unplug), we booted into our bootloader in Power, which is a Linux kernel,
and kexec'ed into another distro. If this kexec is too quick, we end up
reaching the probe of NVMe adapter in that distro when adapter is in
bad state (not fully initialized on our bootloader). What happens next
is that nvme_wait_ready() is unable to complete, except if the quirk is
enabled.

So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even on probe time.

Fixes: 54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")
Reported-by: Andrew Byrne <byrneadw@ie.ibm.com>
Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com>
Reported-by: Zachary D. Myers <zdmyers@us.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-01-11 17:21:35 +01:00
Christoph Hellwig
1392370ee7 nvme-rdma: fix nvme_rdma_queue_is_ready
Now that we don't abuse the cmd field in struct request for nvme command
passthrough this function needs to be converted to the proper accessor
as well.

Fixes: d49187e97e ("nvme: introduce struct nvme_request")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
2017-01-11 17:20:39 +01:00
Jens Axboe
8e5d31eb02 Merge branch 'nvme-4.10' of git://git.infradead.org/nvme into for-linus
Christoph writes:

The most significant one is that we've agreed on shared maintaince and
a common repository for the PCIe NVMe driver and NVMe over Fabrics.  The
target code still only has a subset of the maintainers but goes through
the same tree as well.  Keith, Sagi and me will take turns at collecting
patches and sending you pull requests.
2016-12-22 11:54:46 -07:00
Colin Ian King
7c3a23b85c nvmet/fcloop: remove some logically dead code performing redundant ret checks
The check to see if ret is non-zero and return this rather than count
is redundant in two occassions.  It is redundant because prior to this
check, the return code ret is already checked for a non-zero error
return value and we return from the function at that point.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:34:28 +01:00
Daniel Verkamp
6c73f94930 nvmet: fix KATO offset in Set Features
The Set Features implementation for Keep Alive Timer was using the wrong
structure when retrieving the KATO value; it was treating the Set
Features command as a Property Set command.

The NVMe spec defines the Keep Alive Timer feature as having one input
in CDW11 (4 bytes at offset 44 in the command) whereas the code was
reading 8 bytes at offset 48.

Since the Linux NVMe over Fabrics host never sets this feature, this
code has presumably never been tested.

Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:34:27 +01:00
Johannes Thumshirn
17a1ec08ce nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
Simplify the error handling of nvme_fc_create_hw_io_queues(), this saves us
one variable and one level of indentation.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviwed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:34:25 +01:00
James Smart
c703489885 nvme/fc: correct some printk information
Dan Carpenters's tool caught a pointer reference - should have been
just ptr, not &ptr.

Don't bother. Remove the pointer value in the printf. Its irrelevant.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:31 +01:00
Andy Lutomirski
2c473a9d02 nvme/scsi: Remove START STOP emulation
Now that the broken power state control is gone, it appears to serve
no purpose.  Just delete it.  NVME devices don't have a concept of
started vs stopped anyway.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:24 +01:00
Keith Busch
ff13b39ecf nvme/pci: Delete misleading queue-wrap comment
It is not theoretically possible for this driver to wrap twice while
processing completions. The driver allocates only 'queue_depth - 1'
tags, so there can never be more than that to reap when processing a
completion queue. Removing this misleading comment makes it a little
less likely people with broken controllers will blame the driver for
their spurious interrupts.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:23 +01:00
Max Gurtovoy
9fa196e7fc nvme/pci: Fix whitespace problem
Convert to tabs and remove unneeded whitespaces.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:18 +01:00
Keith Busch
e6282aef7b nvme: simplify stripe quirk
Some OEMs believe they own the Identify Controller vendor specific
region and will repurpose it with their own values. While not common,
we can't rely on the PCI VID:DID to tell use how to decode the field
we reserved for this as the stripe size so we need to do something else
for the list of devices using this quirk.

The field was supposed to allow flexibility on the device's back-end
striping, but it turned out that never materialized; the chunk is always
the same as MDTS in the products subscribing to this quirk, so this
patch removes the stripe_size field and sets the chunk to the max hw
transfer size for the devices using this quirk.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:06 +01:00
Stephen Bates
c965809c66 nvme : Use correct scnprintf in cmb show
Make sure we are using the correct scnprintf in the sysfs show
function for the CMB.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-19 08:35:24 -07:00
Linus Torvalds
4d5b57e05a Updates for 4.10 kernel merge window
- Shared mlx5 updates with net stack (will drop out on merge if Dave's
   tree has already been merged)
 - Driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe
 - Debug cleanups
 - New connection rejection helpers
 - SRP updates
 - Various misc fixes
 - New paravirt driver from vmware
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYUbAPAAoJELgmozMOVy/dMXcP/iuG5MNzfN8Ny1JftyBQGWg3
 cqoQ2OLj9CsXjwVB+5EqbcZHRZY852lKONaLoDKkIOx4YAXO2YuIKOp944vN7EQx
 96wfqzT1F5jzAcy5mYZXgLaStGFDAwejKMqeHd0LfJj3OEtemGnVPWYzyqSQmSKo
 dzJraS1Z9GIRppzU5WaRpB9PtRBkqIqGJ5vZ0EKLGhed5hYY5r0iMJB0GfriMRDO
 lJ4UUVfpsAoLPnqDBFH6IMn2V2UeAw9IR5zNa1mrM1RBfvt/uYTxrw1w3p9WoaNs
 GRodhk4DCeAfeyqzVPNBLyXZ4Zq4FzGe3UWM4qysJ1RR4oFNw9Cuw0Fqk8mrfznr
 7hv5TpGIckRZiKf8l6e+qLirF0qGtXJg29j2vPVQI9i5nSj95g1agA81PnLQlLLb
 flWyxeMj81my7lfMHN1xcV6pqPEKMCOysZmfcvVfJd2XxpjuVD7ekl/YXWp8o8kU
 YPdQMqPD626XsD8VpPdMszb9FPmx0JD0HEv+Y1rIFX8JegEI+c3H2X0dqC27T/Ou
 FEPWOy025EgHm0Fh/7eIzkG6tjZ4JHoCugJAcxNZGj2XW4eB6r5vY8UwJ8iQRv+n
 PVYHiy0UoIRePh0mrdOSSphGZMi/GO/DsqKwCtAMEK43WqZQju6wR7QSIGkh66mp
 4uSHJqpf3YEYylxGMhk3
 =QeGy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is the complete update for the rdma stack for this release cycle.

  Most of it is typical driver and core updates, but there is the
  entirely new VMWare pvrdma driver. You may have noticed that there
  were changes in DaveM's pull request to the bnxt Ethernet driver to
  support a RoCE RDMA driver. The bnxt_re driver was tentatively set to
  be pulled in this release cycle, but it simply wasn't ready in time
  and was dropped (a few review comments still to address, and some
  multi-arch build issues like prefetch() not working across all
  arches).

  Summary:

   - shared mlx5 updates with net stack (will drop out on merge if
     Dave's tree has already been merged)

   - driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe

   - debug cleanups

   - new connection rejection helpers

   - SRP updates

   - various misc fixes

   - new paravirt driver from vmware"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits)
  IB: Add vmw_pvrdma driver
  IB/mlx4: fix improper return value
  IB/ocrdma: fix bad initialization
  infiniband: nes: return value of skb_linearize should be handled
  MAINTAINERS: Update Intel RDMA RNIC driver maintainers
  MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers
  IB/core: fix unmap_sg argument
  qede: fix general protection fault may occur on probe
  IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc
  mlx5, calc_sq_size(): Make a debug message more informative
  mlx5: Remove a set-but-not-used variable
  mlx5: Use { } instead of { 0 } to init struct
  IB/srp: Make writing the add_target sysfs attr interruptible
  IB/srp: Make mapping failures easier to debug
  IB/srp: Make login failures easier to debug
  IB/srp: Introduce a local variable in srp_add_one()
  IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build
  IB/multicast: Check ib_find_pkey() return value
  IPoIB: Avoid reading an uninitialized member variable
  IB/mad: Fix an array index check
  ...
2016-12-15 12:03:32 -08:00
Linus Torvalds
cf1b3341af Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "A few fixes that I collected as post-merge.

  I was going to wait a bit with sending this out, but the O_DIRECT fix
  should really go in sooner rather than later"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: Fix failed allocation path when mapping queues
  blk-mq: Avoid memory reclaim when remapping queues
  block_dev: don't update file access position for sync direct IO
  nvme/pci: Log PCI_STATUS when the controller dies
  block_dev: don't test bdev->bd_contains when it is not stable
2016-12-14 17:21:53 -08:00
Linus Torvalds
84b6079134 Just one simple change from Andrzej to drop the pointless return value
from the ->drop_link method.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUUZ8AAoJEA+eU2VSBFGDjHIP/RKCBerQNkQ5hdUPnwJkXOh1
 b4SWiPNsNgkzpKJsE/vuL6R2qq9KX5sxWLi8a3Wx5jyMst1Xj2m4SvI4fUew/hBC
 QZM86aMJLcWZ8BfqLk2tLys759z/MSZSrXcUU8EM+JJqb6E0+i+5pgX8gOk7Pxwo
 FUagFzuGXTGORwkWRf47ludBuDEGMCrLQZ6WaXEKQULTUwfPrnP+n9EhZcWzzsyu
 0YxW9SFD73LgRSPLgxc+rw875D8rb3WSClWj/2LLQUy8z8QEJ83Mgt9hcbBV0Ppa
 efP0kPZbpDnVx6TjpldRKW9GivkbFXNnChMmgTkBGYTRjn8IHDsyAb6ZABw/O37N
 oyvd42xVDCE+GSImaMgCPL/5MEsQ+v9xCfgkBcyhWVQYFFj89Nmz/8VKp6AtTW3j
 X7MQuzdzKGoWTVpAOgw/SvjrRx+fcciTg31AhhGjE5cmARVoBJuCDa6NM3WXFQf4
 Pq74zetWDB38sBZwQV/6Y1m3OJGquD4MxX9b5SbNzwuROrKyJCAe3CCw7CKvuuWj
 RPkwZkHiCawRijxNDCWU8zpMcuUCdt9yjTWbUW/WrvKR6BGF4IwUHf9k8oorv1Qc
 Bo7enURnYBmcb7cijOLO+onzCf5iWAwMNk8KKJ1zthaZbloIGw8jdp5kOe5spEvF
 Uju3n/GNA/OWe2MAuutE
 =nOOH
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.10' of git://git.infradead.org/users/hch/configfs

Pull configfs update from Christoph Hellwig:
 "Just one simple change from Andrzej to drop the pointless return value
  from the ->drop_link method"

* tag 'configfs-for-4.10' of git://git.infradead.org/users/hch/configfs:
  fs: configfs: don't return anything from drop_link
2016-12-14 10:31:25 -08:00
Steve Wise
512fb1b32b nvmet_rdma: log the connection reject message
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 11:38:28 -05:00
Steve Wise
7f03953c2f nvme-rdma: use rdma connection reject helper functions
Also add nvme cm status strings and use them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 11:38:28 -05:00
Andy Lutomirski
d2a6191840 nvme/pci: Log PCI_STATUS when the controller dies
When debugging nvme controller crashes, it's nice to know whether
the controller died cleanly so that the failure is just reflected in
CSTS, it died and put an error in PCI_STATUS, or whether it died so
badly that it stopped responding to PCI configuration space reads.

I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
"SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>

Fixed up white space and hunk reject.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-13 21:07:08 -07:00
Linus Torvalds
cdb98c2698 Revert "nvme: add support for the Write Zeroes command"
This reverts commit 6d31e3ba23.

This causes bootup problems for me both on my laptop and my desktop.
What they have in common is that they have NVMe disks with dm-crypt, but
it's not the same controller, so it's not controller-specific.

Jens does not see it on his machine (also NVMe), so it's presumably
something that triggers just on bootup.  Possibly related to dm-crypt
and the fact that I mark my luks volume with "allow-discards" in
/etc/crypttab.

It's 100% repeatable for me, which made it fairly straightforward to
bisect the problem to this commit. Small mercies.

So we don't know what the reason is yet, but the revert is needed to get
things going again.

Acked-by: Jens Axboe <axboe@fb.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 19:53:37 -08:00
Linus Torvalds
b92e09bb5b Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:

 - Adam added opt-in ATA command priority support.

 - There are machines which hide multiple nvme devices behind an ahci
   BAR. Dan Williams proposed a solution to force-switch the mode but
   deemed too hackishd. People are gonna discuss the proper way to
   handle the situation in nvme standard meetings. For now, detect and
   warn about the situation.

 - Low level driver specific changes.

Christoph Hellwig pipes in about the hidden nvme warning:
 "I wish that was the case. We've pretty much agreed that we'll want to
  implement it as a virtual PCIe root bridge, similar to Intels other
  'innovation' VMD that we work around that way.

  But Intel management has apparently decided that they don't want to
  spend more cycles on this now that Lenovo has an optional BIOS that
  doesn't force this broken mode anymore, and no one outside of Intel
  has enough information to implement something like this.

  So for now I guess this warning is it, until Intel reconsideres and
  spends resources on fixing up the damage their Chipset people caused"

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ahci: warn about remapped NVMe devices
  ahci-remap.h: add ahci remapping definitions
  nvme: move NVMe class code to pci_ids.h
  pata: imx: support controller modes up to PIO4
  pata: imx: add support of setting timings for PIO modes
  pata: imx: set controller PIO mode with .set_piomode callback
  pata: imx: sort headers out
  ata: set ncq_prio_enabled iff device has support
  ata: ATA Command Priority Disabled By Default
  ata: Enabling ATA Command Priorities
  block: Add iocontext priority to request
  ahci: qoriq: added ls1046a platform support
2016-12-13 13:26:24 -08:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Christoph Hellwig
f9d03f96b9 block: improve handling of the magic discard payload
Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 08:30:51 -07:00
James Smart
475d0fe795 nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
Add FC LLDD loopback driver to test FC host and target transport within
nvme-fabrics

To aid in the development and testing of the lower-level api of the FC
transport, this loopback driver has been created to act as if it were a
FC hba driver supporting both the host interfaces as well as the target
interfaces with the nvme FC transport.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 14:51:48 +01:00
James Smart
c53432030d nvme-fabrics: Add target support for FC transport
Implements the FC-NVME T11 definition of how nvme fabric capsules are
performed on an FC fabric. Utilizes a lower-layer API to FC host adapters
to send/receive FC-4 LS operations and perform the FCP transactions
necessary to perform and FCP IO request for NVME.

The T11 definitions for FC-4 Link Services are implemented which create
NVMeOF connections.  Implements the hooks with nvmet layer to pass NVME
commands to it for processing and posting of data/response base to the
host via the different connections.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 10:17:56 +02:00
James Smart
e399441de9 nvme-fabrics: Add host support for FC transport
Implements the FC-NVME T11 definition of how nvme fabric capsules are
performed on an FC fabric. Utilizes a lower-layer API to FC host adapters
to send/receive FC-4 LS operations and FCP operations that comprise NVME
over FC operation.

The T11 definitions for FC-4 Link Services are implemented which create
NVMeOF connections.  Implements the hooks with blk-mq to then submit admin
and io requests to the different connections.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 10:17:56 +02:00
James Smart
885aa4015f nvme-fabrics: patch target code in prep for FC transport support
- Add FC transport type decoding
- Add FC address family decoding

Signed-off-by: James Smart <james.smart@broadcom.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 10:17:03 +02:00
James Smart
721b3917c4 nvme-fabrics: set sqe.command_id in core not transports
Currently, core.c sets command_id only on rd/wr commands, leaving it to
the transport to set it again to ensure the request had a command id.

Move location of set in core so applies to all commands.
Remove transport sets.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Max Gurtovoy
27a4beef0b nvme-rdma: align to generic ib_event logging helper
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 10:17:03 +02:00
Max Gurtovoy
675796be47 nvmet-rdma: align to generic ib_event logging helper
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-06 10:17:03 +02:00
Sagi Grimberg
d4a5340edd nvme-rdma: remove redundant define
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Bart Van Assche
6eb7283054 nvme-fabrics: Adjust source code indentation
Adjust indentation such that arguments are aligned.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Bart Van Assche
6bcb5268d4 nvme/scsi: Remove set-but-not-used variables
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Solganik Alexander
e4fcf07cca nvmet: Fix possible infinite loop triggered on hot namespace removal
When removing a namespace we delete it from the subsystem namespaces
list with list_del_init which allows us to know if it is enabled or
not.

The problem is that list_del_init initialize the list next and does
not respect the RCU list-traversal we do on the IO path for locating
a namespace. Instead we need to use list_del_rcu which is allowed to
run concurrently with the _rcu list-traversal primitives (keeps list
next intact) and guarantees concurrent nvmet_find_naespace forward
progress.

By changing that, we cannot rely on ns->dev_link for knowing if the
namspace is enabled, so add enabled indicator entry to nvmet_ns for
that.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
Cc: <stable@vger.kernel.org> # v4.8+
2016-12-06 10:17:03 +02:00
Bart Van Assche
f3116d8f1e nvme-fabrics: Fix a memory leak in an nvmf_create_ctrl() error path
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Bart Van Assche
8eadfcb1b6 nvme-fabrics: Fix memory leaks in nvmf_parse_options()
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Samuel Jones
76c08bf46a nvme-rdma: force queue size to respect controller capability
Queue size needs to respect the Maximum Queue Entries Supported advertised by
the controller in its Capability register.

Signed-off-by: Samuel Jones <sjones@kalray.eu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[sagig: fixed queue_size adjustment according to
Daniel Verkamp <daniel.verkamp@intel.com> comment]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Bart Van Assche
70d4281c4b nvmet-rdma: Fix REJ status code
nvmet_sq_init() returns a value <= 0. nvmet_rdma_cm_reject() expects
a second argument that is a NVME_RDMA_CM_* constant. Hence this patch.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Christoph Hellwig
a2e7eefd56 nvme: move NVMe class code to pci_ids.h
We'll need to check for it in the AHCI drivers (yes, really) soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-12-05 14:31:23 -05:00
Chaitanya Kulkarni
d262920998 nvmet: add support for the Write Zeroes command
Add support for handling write zeroes command on target.
Call into __blkdev_issue_zeroout, which the block layer expands into the
best suitable variant of zeroing the LBAs. Allow write zeroes operation
to deallocate the LBAs when calling __blkdev_issue_zeroout.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Chaitanya Kulkarni
6d31e3ba23 nvme: add support for the Write Zeroes command
Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
device, if the device supports optional command bit set for write
zeroes. Add support to setup write zeroes command. Set maximum possible
write zeroes sectors in one write zeroes command according to
nvme write zeroes command definition.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Andrzej Pietrasiewicz
e16769d4bc fs: configfs: don't return anything from drop_link
Documentation/filesystems/configfs/configfs.txt says:

"When unlink(2) is called on the symbolic link, the source item is
notified via the ->drop_link() method.  Like the ->drop_item() method,
this is a void function and cannot return failure."

The ->drop_item() is indeed a void function, the ->drop_link() is
actually not. This, together with the fact that the value of ->drop_link()
is silently ignored suggests, that it is the ->drop_link() return
type that should be corrected and changed to void.

This patch changes drop_link() signature and all its users.

Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
[hch: reverted reformatting of some code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-01 10:50:49 +01:00
Javier González
8e53624d44 lightnvm: eliminate nvm_lun abstraction in mm
In order to naturally support multi-target instances on an Open-Channel
SSD, targets should own the LUNs they get blocks from and manage
provisioning internally. This is done in several steps.

Since targets own the LUNs the are instantiated on top of and manage the
free block list internally, there is no need for a LUN abstraction in
the media manager. LUNs are intrinsically managed as in the physical
layout (ch:0,lun:0, ..., ch:0,lun:n, ch:1,lun:0, ch:1,lun:n, ...,
ch:m,lun:0, ch:m,lun:n) and given to the targets based on the target
creation ioctl. This simplifies LUN management and clears the path for a
partition manager to sit directly underneath LightNVM targets.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Javier González
8e79b5cb1d lightnvm: move block provisioning to targets
In order to naturally support multi-target instances on an Open-Channel
SSD, targets should own the LUNs they get blocks from and manage
provisioning internally. This is done in several steps.

This patch moves the block provisioning inside of the target and removes
the get/put block interface from the media manager.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Javier González
bb3149792e lightnvm: enable to send hint to erase command
Erases might be subject to host hints. An example is multi-plane
programming to erase blocks in parallel. Enable targets to specify this
hint.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Matias Bjørling
3dc87dd048 nvme: lightnvm: attach lightnvm sysfs to nvme block device
Previously, LBA read and write were not supported in the lightnvm
specification. Now that it supports it, lets use the traditional
NVMe gendisk, and attach the lightnvm sysfs geometry export.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Matias Bjørling
7498e99fc5 nvme: lightnvm: frees wrong cmd structure
When struct nvme_request was introduced, the nvme_nvm_submit_io was
converted to the new interface. The interface moves nvme_nvm_command
data structure into the struct request pdu. On io completion, rq->cmd is
freed, which should have been the dereferenced pdu nvme_request->cmd.

Fixes: d49187e97e "nvme: introduce struct nvme_request"
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Ming Lei
3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Keith Busch
d48756228e nvme/pci: Don't free queues on error
The nvme_remove function tears down all allocated resources in the correct
order, so no need to free queues on error during initialization. This
fixes possible use-after-free errors when queues are still associated
with a blk-mq hctx.

Reported-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 12:39:57 -07:00
Omar Sandoval
bac0000af5 nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK
Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-15 12:50:11 -07:00
Sagi Grimberg
14c862dbb0 nvmet-rdma: drain the queue-pair just before freeing it
draining the qp right after disconnect might not suffice because
the nvmet sq is not fully drained (in nvmet_sq_destroy) and we might
see completions after the drain. Instead, drain right before the
qp destroy which comes after the sq destruction and we can be sure
that no posts come after the drain.

Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:53 +02:00
Steve Wise
c8dbc37cd8 nvme-rdma: stop and free io queues on connect failure
While testing nvme-rdma with the spdk nvmf target over iw_cxgb4, I
configured the target (mistakenly) to generate an error creating the
NVMF IO queues.  This resulted a "Invalid SQE Parameter" error sent back
to the host on the first IO queue connect:

[ 9610.928182] nvme nvme1: queue_size 128 > ctrl maxcmd 120, clamping down
[ 9610.938745] nvme nvme1: creating 32 I/O queues.

So nvmf_connect_io_queue() returns an error to
nvmf_connect_io_queue() / nvmf_connect_io_queues(), and that
is returned to nvme_rdma_create_io_queues().  In the error path,
nvmf_rdma_create_io_queues() frees the queue tagset memory _before_
stopping and freeing the IB queues, which causes yet another
touch-after-free crash due to SQ CQEs being flushed after the ib_cqe
structs pointed-to by the flushed WRs have been freed (since they are
part of the nvme_rdma_request struct).

The fix is to stop and free the queues in nvmf_connect_io_queues()
if there is an error connecting any of the queues.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:53 +02:00
Sagi Grimberg
766dbb179d nvmet-rdma: don't forget to delete a queue from the list of connection failed
In case we accepted a queue connection and it failed, we might not
remove the queue from the list until we unload and clean it up.
We should delete it from the queue list on the relevant handler.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:52 +02:00
Sagi Grimberg
8242ddac1b nvmet: Don't queue fatal error work if csts.cfs is set
In the transport, in case of an interal queue error like
error completion in rdma we trigger a fatal error. However,
multiple queues in the same controller can serr error completions
and we don't want to trigger fatal error work more than once.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:51 +02:00
Christoph Hellwig
553cd9ef82 nvme-rdma: reject non-connect commands before the queue is live
If we reconncect we might have command queue up that get resent as soon
as the queue is restarted.  But until the connect command succeeded we
can't send other command.  Add a new flag that marks a queue as live when
connect finishes, and delay any non-connect command until the queue is
live based on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
[sagig: fixes admin queue LIVE setting]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:51 +02:00
Bart Van Assche
fa14a0acea nvmet-rdma: Fix possible NULL deref when handling rdma cm events
When we initiate queue teardown sequence we call rdma_destroy_qp
which clears cm_id->qp, afterwards we call rdma_destroy_id, but
we might see a rdma_cm event in between with a cleared cm_id->qp
so watch out for that and silently ignore the event because this
means that the queue teardown sequence is in progress.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:50 +02:00
Matias Bjørling
409ae5a76e lightnvm: invalid offset calculation for lba_shift
The ns->lba_shift assumes its value to be the logarithmic of the
LA size. A previous patch duplicated the lba_shift calculation into
lightnvm. It prematurely also subtracted a 512byte shift, which commonly
is applied per-command. The 512byte shift being subtracted twice led to
data loss when restoring the logical to physical mapping table from
device and when issuing I/O commands using rrpc.

Fix offset by removing the 512byte shift subtraction when calculating
lba_shift.

Fixes: b0b4e09c1a "lightnvm: control life of nvm_dev in driver"
Reported-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 18:27:32 -07:00
Jens Axboe
bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Christoph Hellwig
7bf58533a0 nvme: don't pass the full CQE to nvme_complete_async_event
We only need the status and result fields, and passing them explicitly
makes life a lot easier for the Fibre Channel transport which doesn't
have a full CQE for the fast path case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:26 -07:00
Christoph Hellwig
d49187e97e nvme: introduce struct nvme_request
This adds a shared per-request structure for all NVMe I/O.  This structure
is embedded as the first member in all NVMe transport drivers request
private data and allows to implement common functionality between the
drivers.

The first use is to replace the current abuse of the SCSI command
passthrough fields in struct request for the NVMe command passthrough,
but it will grow a field more fields to allow implementing things
like common abort handlers in the future.

The passthrough commands are handled by having a pointer to the SQE
(struct nvme_command) in struct nvme_request, and the union of the
possible result fields, which had to be turned from an anonymous
into a named union for that purpose.  This avoids having to pass
a reference to a full CQE around and thus makes checking the result
a lot more lightweight.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:24 -07:00
Bart Van Assche
a6eaa8849f nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
that became superfluous because of this change. Change
blk_queue_stopped() tests into blk_mq_queue_stopped().

This patch fixes a race condition: using queue_flag_clear_unlocked()
is not safe if any other function that manipulates the queue flags
can be called concurrently, e.g. blk_cleanup_queue().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche
3174dd33fa nvme: Fix a race condition related to stopping queues
Avoid that nvme_queue_rq() is still running when nvme_stop_queues()
returns.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche
2b053aca76 blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche
9b7dd572cc blk-mq: Remove blk_mq_cancel_requeue_work()
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
e806402130 block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Linus Torvalds
ecd06f2883 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes that missed the merge window, mostly due to me being
  away around that time.

  Nothing major here, a mix of nvme cleanups and fixes, and one fix for
  the badblocks handling"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: use symbolic constants for CNS values
  nvme: use symbolic constants for CNS values
  nvme.h: add an enum for cns values
  nvme.h: don't use uuid_be
  nvme.h: resync with nvme-cli
  nvme: Add tertiary number to NVME_VS
  nvme : Add sysfs entry for NVMe CMBs when appropriate
  nvme: don't schedule multiple resets
  nvme: Delete created IO queues on reset
  nvme: Stop probing a removed device
  badblocks: fix overlapping check for clearing
2016-10-21 10:54:01 -07:00
Christoph Hellwig
e9c9346e20 nvmet: use symbolic constants for CNS values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Christoph Hellwig
fa60682677 nvme: use symbolic constants for CNS values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Christoph Hellwig
a446c0840e nvme.h: resync with nvme-cli
Import a few updates to nvme.h from nvme-cli.  This mostly includes a few
new fields and error codes, but also a few renames that so far are only
used in user space.  Also one field is moved from an array of two le64
values to one of 16 u8 values so that we can more easily access it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Gabriel Krisman Bertazi
8ef2074d28 nvme: Add tertiary number to NVME_VS
NVMe 1.2.1 specification adds a tertiary element to the version number.
This updates the macro and its callers to include the final number and
fixup a single place in nvmet where the version was generated manually.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Stephen Bates
202021c1a6 nvme : Add sysfs entry for NVMe CMBs when appropriate
Add a sysfs attribute that contains salient information about the NVMe
Controller Memory Buffer when one is present. For now, just display the
information about the CMB available from the control registers. We attach
the CMB attribute file to the existing nvme_ctrl sysfs group so it can
handle the sysfs teardown.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 11:10:28 -06:00
Keith Busch
c5f6ce97c1 nvme: don't schedule multiple resets
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:

	WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.

This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg<sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:24:39 -06:00
Keith Busch
7065906096 nvme: Delete created IO queues on reset
The driver was decrementing the online_queues prior to attempting to
delete those IO queues, so the driver ended up not requesting the
controller delete any. This patch saves the online_queues prior to
suspending them, and adds that parameter for deleting io queues.

Fixes: c21377f8 ("nvme: Suspend all queues before deletion")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:22:16 -06:00
Keith Busch
0df1e4f5e0 nvme: Stop probing a removed device
There is no reason the nvme controller can ever return all 1's from
reading the CSTS register. This patch returns an error if we observe
that status. Without this, we may incorrectly proceed with controller
initialization and unnecessarilly rely on error handling to clean this.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 08:40:13 -06:00
Mauricio Faria de Oliveira
2b6b535d91 nvme: use the DMA_ATTR_NO_WARN attribute
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Linus Torvalds
12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Linus Torvalds
b9044ac829 Merge of primary rdma-core code for 4.9
- Updates to mlx5
 - Updates to mlx4 (two conflicts, both minor and easily resolved)
 - Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
   resolution is to keep the code in cxgb4_main.c as it is in Linus'
   tree as attach_uld was refactored and moved into cxgb4_uld.c)
 - Improvements to uAPI (moved vendor specific API elements to uAPI area)
 - Add hns-roce driver and hns and hns-roce ACPI reset support
 - Conversion of all rdma code away from deprecated
   create_singlethread_workqueue
 - Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
   staging)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
 JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
 tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
 ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
 OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
 aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
 BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
 0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
 Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
 UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
 Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
 JSrbr3CGyoqyYj6qaEO5
 =LC+S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
Linus Torvalds
513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Andy Lutomirski
1a6fe74dfd nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are no currently callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-24 10:56:26 -06:00
Andy Lutomirski
26501db8dc nvme/scsi: Remove power management support
As far as I can tell, there is basically nothing correct about this
code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse.  It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.

Since this code appears to be useless, just delete it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-24 10:56:25 -06:00
Alexander Solganik
2e5d0baa04 nvmet: Make dsm number of ranges zero based
This caused the nvmet request data length to be
incorrect.

Signed-off-by: Alexander Solganik <sashas@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-09-23 15:37:38 -07:00
Sagi Grimberg
9b349b080c nvmet: Use direct IO for writes
We're designed to work with high-end devices where
direct IO makes perfect sense. We noticed that we
context switch by scheduling kblockd instead of going
directly to the device without REQ_SYNC for writes.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2016-09-23 15:37:38 -07:00
Chaitanya Kulkarni
2d79c7dc8f admin-cmd: Added smart-log command support.
This patch implements the support for smart-log command
(NVM Express 1.2.1-section 5.10.1.2 SMART / Health Information
(Log Identifier 02h)) on the target for NVMe over Fabric.

In current implementation host can retrieve following statistics:-
1. Data Units Read.
2. Data Units Written.
3. Host Read Commands.
4. Host Write Commands.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-23 15:37:37 -07:00
James Smart
478bcb9388 nvme-fabrics: Add host_traddr options field to host infrastructure
Add the host_traddr field to allow specification of the host-port
connection info for the transport. Will be used by FC transport.

Signed-off-by: James Smart <james.smart@broadcom.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-23 15:37:37 -07:00
James Smart
4a9f05c57f nvme-fabrics: revise host transport option descriptions
Revise some of the comments so not so ethernet-network centric

Signed-off-by: James Smart <james.smart@broadcom.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-23 15:37:37 -07:00
James Smart
0fe51ff269 nvme-fabrics: rework nvmf_get_address() for variable options
Revise nvmf_get_address() string to account for not all options being
present.

Signed-off-by: James Smart <james.smart@broadcom.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-23 15:37:37 -07:00
Christoph Hellwig
11975e0149 nvme-rdma: use IB_PD_UNSAFE_GLOBAL_RKEY
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00
Christoph Hellwig
ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00
Sagi Grimberg
3b4ac78610 nvme-rdma: only clear queue flags after successful connect
Otherwise, nvme_rdma_stop_and_clear_queue() will incorrectly
try to stop/free rdma qps/cm_ids that are already freed.

Fixes: e89ca58f9c ("nvme-rdma: add DELETING queue flag")
Reported-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-22 19:58:17 -06:00
Simon A. F. Lund
40267efddc lightnvm: expose device geometry through sysfs
For a host to access an Open-Channel SSD, it has to know its geometry,
so that it writes and reads at the appropriate device bounds.

Currently, the geometry information is kept within the kernel, and not
exported to user-space for consumption. This patch exposes the
configuration through sysfs and enables user-space libraries, such as
liblightnvm, to use the sysfs implementation to get the geometry of an
Open-Channel SSD.

The sysfs entries are stored within the device hierarchy, and can be
found using the "lightnvm" device type.

An example configuration looks like this:

/sys/class/nvme/
└── nvme0n1
   ├── capabilities: 3
   ├── device_mode: 1
   ├── erase_max: 1000000
   ├── erase_typ: 1000000
   ├── flash_media_type: 0
   ├── media_capabilities: 0x00000001
   ├── media_type: 0
   ├── multiplane: 0x00010101
   ├── num_blocks: 1022
   ├── num_channels: 1
   ├── num_luns: 4
   ├── num_pages: 64
   ├── num_planes: 1
   ├── page_size: 4096
   ├── prog_max: 100000
   ├── prog_typ: 100000
   ├── read_max: 10000
   ├── read_typ: 10000
   ├── sector_oob_size: 0
   ├── sector_size: 4096
   ├── media_manager: gennvm
   ├── ppa_format: 0x380830082808001010102008
   ├── vendor_opcode: 0
   ├── max_phys_secs: 64
   └── version: 1

Signed-off-by: Simon A. F. Lund <slund@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:57:31 -06:00
Matias Bjørling
b0b4e09c1a lightnvm: control life of nvm_dev in driver
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.

To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.

This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:56:18 -06:00
Matias Bjørling
ac81bfa986 nvme: refactor namespaces to support non-gendisk devices
With LightNVM enabled namespaces, the gendisk structure is not exposed
to the user. This prevents LightNVM users from accessing the NVMe device
driver specific sysfs entries, and LightNVM namespace geometry.

Refactor the revalidation process, so that a namespace, instead of a
gendisk, is revalidated. This later allows patches to wire up the
sysfs entries up to a non-gendisk namespace.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:56:12 -06:00
Linus Torvalds
46626600d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes for the current series in the realm of block.

  Like the previous pull request, the meat of it are fixes for the nvme
  fabrics/target code.  Outside of that, just one fix from Gabriel for
  not doing a queue suspend if we didn't get the admin queue setup in
  the first place"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvme-rdma: add back dependency on CONFIG_BLOCK
  nvme-rdma: fix null pointer dereference on req->mr
  nvme-rdma: use ib_client API to detect device removal
  nvme-rdma: add DELETING queue flag
  nvme/quirk: Add a delay before checking device ready for memblaze device
  nvme: Don't suspend admin queue that wasn't created
  nvme-rdma: destroy nvme queue rdma resources on connect failure
  nvme_rdma: keep a ref on the ctrl during delete/flush
  iw_cxgb4: block module unload until all ep resources are released
  iw_cxgb4: call dev_put() on l2t allocation failure
2016-09-15 13:22:59 -07:00
Christoph Hellwig
b5af7f2ff0 nvme: remove the post_scan callout
No need now that we don't have to reverse engineer the irq affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig
dca51e7892 nvme: switch to use pci_alloc_irq_vectors
Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interupt affinity assignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig
7d7e0f90b7 blk-mq: remove ->map_queue
All drivers use the default, so provide an inline version of it.  If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Jens Axboe
3bc42f3f0e Merge branch 'nvmf-4.8-rc' of git://git.infradead.org/nvme-fabrics into for-linus
Sagi writes:

Here we have:
- Kconfig dependencies fix from Arnd
- nvme-rdma device removal fixes from Steve
- possible bad deref fix from Colin
2016-09-13 07:58:34 -06:00
Arnd Bergmann
2cfe199ca5 nvme-rdma: add back dependency on CONFIG_BLOCK
A recent change removed the dependency on BLK_DEV_NVME, which implies
the dependency on PCI and BLOCK. We don't need CONFIG_PCI, but without
CONFIG_BLOCK we get tons of build errors, e.g.

In file included from drivers/nvme/host/core.c:16:0:
linux/blk-mq.h:182:33: error: 'struct gendisk' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
drivers/nvme/host/core.c: In function 'nvme_setup_rw':
drivers/nvme/host/core.c:295:21: error: implicit declaration of function 'rq_data_dir' [-Werror=implicit-function-declaration]
drivers/nvme/host/nvme.h: In function 'nvme_map_len':
drivers/nvme/host/nvme.h:217:6: error: implicit declaration of function 'req_op' [-Werror=implicit-function-declaration]
drivers/nvme/host/scsi.c: In function 'nvme_trans_bdev_limits_page':
drivers/nvme/host/scsi.c:768:85: error: implicit declaration of function 'queue_max_hw_sectors' [-Werror=implicit-function-declaration]

This adds back the specific CONFIG_BLOCK dependency to avoid broken
configurations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aa71987472 ("nvme: fabrics drivers don't need the nvme-pci driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-12 22:29:43 +03:00
Colin Ian King
1bda18de8f nvme-rdma: fix null pointer dereference on req->mr
If there is an error on req->mr, req->mr is set to null, however
the following statement sets req->mr->need_inval causing a null
pointer dereference.  Fix this by bailing out to label 'out' to
immediately return and hence skip over the offending null pointer
dereference.

Fixes: f5b7b559e1 ("nvme-rdma: Get rid of duplicate variable")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-12 22:29:42 +03:00
Steve Wise
e87a911fed nvme-rdma: use ib_client API to detect device removal
Change nvme-rdma to use the IB Client API to detect device removal.
This has the wonderful benefit of being able to blow away all the
ib/rdma_cm resources for the device being removed.  No craziness about
not destroying the cm_id handling the event.  No deadlocks due to broken
iw_cm/rdma_cm/iwarp dependencies.  And no need to have a bound cm_id
around during controller recovery/reconnect to catch device removal
events.

We don't use the device_add aspect of the ib_client service since we only
want to create resources for an IB device if we have a target utilizing
that device.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-12 22:29:41 +03:00
Sagi Grimberg
e89ca58f9c nvme-rdma: add DELETING queue flag
When we get a surprise disconnect from the target we queue a periodic
reconnect (which is the sane thing to do...).

We only move the queues out of CONNECTED when we retry to reconnect (after
10 seconds in the default case) but we stop the blk queues immediately
so we are not bothered with traffic from now on. If delete() is kicking
off in this period the queues are still in CONNECTED state.

Part of the delete sequence is trying to issue ctrl shutdown if the
admin queue is CONNECTED (which it is!). This request is issued but
stuck in blk-mq waiting for the queues to start again. This might be
the one preventing us from forward progress...

The patch separates the queue flags to CONNECTED and DELETING. Now we
will move out of CONNECTED as soon as error recovery kicks in (before
stopping the queues) and DELETING is on when we start the queue deletion.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-12 22:29:34 +03:00
Linus Torvalds
bd0b841fee nvme: make NVME_RDMA depend on BLOCK
Commit aa71987472 ("nvme: fabrics drivers don't need the nvme-pci
driver") removed the dependency on BLK_DEV_NVME, but the cdoe does
depend on the block layer (which used to be an implicit dependency
through BLK_DEV_NVME).

Otherwise you get various errors from the kbuild test robot random
config testing when that happens to hit a configuration with BLOCK
device support disabled.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jay Freyensee <james_p_freyensee@linux.intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-11 14:41:49 -07:00
Wenbo Wang
015282c9eb nvme/quirk: Add a delay before checking device ready for memblaze device
Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-08 10:17:45 -06:00
Gabriel Krisman Bertazi
82469c59d2 nvme: Don't suspend admin queue that wasn't created
This fixes a regression in my previous commit c21377f836 ("nvme:
Suspend all queues before deletion"), which provoked an Oops in the
removal path when removing a device that became IO incapable very early
at probe (i.e. after a failed EEH recovery).

Turns out, if the error occurred very early at the probe path, before
even configuring the admin queue, we might try to suspend the
uninitialized admin queue, accessing bad memory.

Fixes: c21377f836 ("nvme: Suspend all queues before deletion")
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-07 08:53:21 -06:00
Steve Wise
f361e5a01e nvme-rdma: destroy nvme queue rdma resources on connect failure
After address resolution, the nvme_rdma_queue rdma resources are
allocated.  If rdma route resolution or the connect fails, or the
controller reconnect times out and gives up, then the rdma resources
need to be freed.  Otherwise, rdma resources are leaked.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-04 10:00:54 +03:00
Steve Wise
cdbecc8d24 nvme_rdma: keep a ref on the ctrl during delete/flush
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-04 10:00:53 +03:00
Jens Axboe
d8d8d9d789 Merge branch 'nvmf-4.8-rc' of git://git.infradead.org/nvme-fabrics into for-linus
Sagi writes:

Mostly stability fixes and cleanups:
- NQN endianess fix from Daniel
- possible use-after-free fix from Vincent
- nvme-rdma connect semantics fixes from Jay
- Remove redundant variables in rdma driver
- Kbuild fix from Christoph
- nvmf_host referencing fix from Christoph
- uninit variable fix from Colin
2016-08-29 08:39:22 -06:00
Sagi Grimberg
4d8c6a7946 nvme-rdma: Get rid of redundant defines
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-28 15:41:08 +03:00
Sagi Grimberg
f5b7b559e1 nvme-rdma: Get rid of duplicate variable
We already have need_inval in ib_mr, lets use
that instead.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-28 15:40:54 +03:00
Andy Lutomirski
9b47f77a68 nvme: Fix nvme_get/set_features() with a NULL result pointer
nvme_set_features() callers seem to expect that passing NULL as the
result pointer is acceptable.  Teach nvme_set_features() not to try to
write to the NULL address.

For symmetry, make the same change to nvme_get_features(), despite the
fact that all current callers pass a valid result pointer.

I assume that this bug hasn't been reported in practice because
the callers that pass NULL are all in the SCSI translation layer
and no one uses the relevant operations.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-24 08:11:10 -06:00
Christoph Hellwig
aa71987472 nvme: fabrics drivers don't need the nvme-pci driver
So select the NVME_CORE symbol instead of depending on BLK_DEV_NVME.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 14:22:28 +03:00
Christoph Hellwig
98096d8a78 nvme-fabrics: get a reference when reusing a nvme_host structure
Without this we'll get a use after free after connecting two controller
using the same hostnqn and then disconnecting one of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 14:22:12 +03:00
Daniel Verkamp
7a665d2f60 nvme-fabrics: change NQN UUID to big-endian format
NVM Express 1.2.1 section 7.9, NVMe Qualified Names, specifies that the
UUID format of NQN uses a UUID based on RFC 4122.

RFC 4122 specifies that the UUID is encoded in big-endian byte order.

Switch the NVMe over Fabrics host ID field from little-endian UUID to
big-endian UUID to match the specification.

Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 12:00:44 +03:00
Jay Freyensee
eadb7cf441 nvme-loop: set sqsize to 0-based value, per spec
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:58:06 +03:00
Jay Freyensee
c5af8654c4 nvme-rdma: fix sqsize/hsqsize per spec
Per NVMe-over-Fabrics 1.0 spec, sqsize is represented as
a 0-based value.

Also per spec, the RDMA binding values shall be set
to sqsize, which makes hsqsize 0-based values.

Thus, the sqsize during NVMf connect() is now:

[root@fedora23-fabrics-host1 for-48]# dmesg
[  318.720645] nvme_fabrics: nvmf_connect_admin_queue(): sqsize for
admin queue: 31
[  318.720884] nvme nvme0: creating 16 I/O queues.
[  318.810114] nvme_fabrics: nvmf_connect_io_queue(): sqsize for i/o
queue: 127

Finally, current interpretation implies hrqsize is 1's based
so set it appropriately.

Reported-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:58:06 +03:00
Jay Freyensee
f994d9dc28 fabrics: define admin sqsize min default, per spec
Upon admin queue connect(), the rdma qp was being
set based on NVMF_AQ_DEPTH.  However, the fabrics layer was
using the sqsize field value set for I/O queues for the admin
queue, which threw the nvme layer and rdma layer off-whack:

root@fedora23-fabrics-host1 nvmf]# dmesg
[ 3507.798642] nvme_fabrics: nvmf_connect_admin_queue():admin sqsize
being sent is: 128
[ 3507.798858] nvme nvme0: creating 16 I/O queues.
[ 3507.896407] nvme nvme0: new ctrl: NQN "nullside-nqn", addr
192.168.1.3:4420

Thus, to have a different admin queue value, we use
NVMF_AQ_DEPTH for connect() and RDMA private data
as the minimum depth specified in the NVMe-over-Fabrics 1.0 spec
(and in that RDMA private data we treat hrqsize as 1's-based
value, per current understanding of the fabrics spec).

Reported-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:58:05 +03:00
Jay Freyensee
b825b44c4e nvmet-rdma: +1 to *queue_size from hsqsize/hrqsize
The host will be sending sqsize 0-based hsqsize value,
the target need to be adjusted as well.

Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:57:37 +03:00
Vincent Stehlé
3256aaef5e nvmet-rdma: Fix use after free
Avoid dereferencing the queue pointer in nvmet_rdma_release_queue_work()
after it has been freed by nvmet_rdma_free_queue().

Fixes: d8f7750a08 ("nvmet-rdma: Correctly handle RDMA device hot removal")
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-16 16:16:31 +03:00
Colin Ian King
39bbee4e54 nvme-rdma: initialize ret to zero to avoid returning garbage
ret is not initialized so it contains garbage.  Ensure garbage
is not returned by initializing rc to 0.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-16 11:42:29 +03:00
Gabriel Krisman Bertazi
f6b6a28e2d nvme: Prevent controller state invalid transition
Acquiring the nvme_ctrl lock before reading ctrl->state in
nvme_change_ctrl_state() should prevent a theoretical invalid state
transition, in the event of two threads racing inside that function.

I haven't been able to observe this happening with the current code, and
the current state machine seems to be simple enough to not be
affected by these invalid transitions, but future modifications could
make it more likely to happen.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sag@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-15 09:46:46 -06:00
Gabriel Krisman Bertazi
c21377f836 nvme: Suspend all queues before deletion
When nvme_delete_queue fails in the first pass of the
nvme_disable_io_queues() loop, we return early, failing to suspend all
of the IO queues.  Later, on the nvme_pci_disable path, this causes us
to disable MSI without actually having freed all the IRQs, which
triggers the BUG_ON in free_msi_irqs(), as show below.

This patch refactors nvme_disable_io_queues to suspend all queues before
start submitting delete queue commands.  This way, we ensure that we
have at least returned every IRQ before continuing with the removal
path.

[  487.529200] kernel BUG at ../drivers/pci/msi.c:368!
cpu 0x46: Vector: 700 (Program Check) at [c0000078c5b83650]
    pc: c000000000627a50: free_msi_irqs+0x90/0x200
    lr: c000000000627a40: free_msi_irqs+0x80/0x200
    sp: c0000078c5b838d0
   msr: 9000000100029033
  current = 0xc0000078c5b40000
  paca    = 0xc000000002bd7600   softe: 0        irq_happened: 0x01
    pid   = 1376, comm = kworker/70:1H
kernel BUG at ../drivers/pci/msi.c:368!
Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413
(Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #104 SMP Fri Jul 29 09:20:17 CDT 2016
enter ? for help
[c0000078c5b83920] d0000000363b0cd8 nvme_dev_disable+0x208/0x4f0 [nvme]
[c0000078c5b83a10] d0000000363b12a4 nvme_timeout+0xe4/0x250 [nvme]
[c0000078c5b83ad0] c0000000005690e4 blk_mq_rq_timed_out+0x64/0x110
[c0000078c5b83b40] c00000000056c930 bt_for_each+0x160/0x170
[c0000078c5b83bb0] c00000000056d928 blk_mq_queue_tag_busy_iter+0x78/0x110
[c0000078c5b83c00] c0000000005675d8 blk_mq_timeout_work+0xd8/0x1b0
[c0000078c5b83c50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c0000078c5b83ce0] c0000000000e9148 worker_thread+0xa8/0x660
[c0000078c5b83d80] c0000000000f2090 kthread+0x110/0x130
[c0000078c5b83e30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-11 09:35:57 -06:00
Jens Axboe
d3f422c8d5 Merge branch 'nvmf-4.8-rc' of git://git.infradead.org/nvme-fabrics into for-linus
Sagi writes:

Mostly stability fixes for nvmet, rdma:
- fix uninitialized rdma_cm private data from Roland.
- rdma device removal handling (host and target).
- fix controller disconnect during active mounts.
- fix namespaces lost after fabric reconnects.
- remove redundant calls to namespace removal (rdma, loop).
- actually send controller shutdown when disconnecting.
- reconnect fixes (ns rescan and aen requeue)
- nvmet controller serial number inconsistency fix.
2016-08-08 07:42:42 -06:00
Sagi Grimberg
e3266378bd nvme-rdma: Remove unused includes
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
2016-08-04 17:47:54 +03:00
Sagi Grimberg
3ef1b4b298 nvme-rdma: start async event handler after reconnecting to a controller
When we reset or reconnect to a controller, we are cancelling the
async event handler so we can safely re-establish resources, but we
need to remember to start it again when we successfully reconnect.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-04 17:45:35 +03:00
Sagi Grimberg
28b8911853 nvmet: Fix controller serial number inconsistency
The host is allowed to issue identify as many times
as it wants, we need to stay consistent when reporting
the serial number for a given controller.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-04 17:45:10 +03:00
Sagi Grimberg
40e64e0721 nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads
Under extreme conditions this might cause data corruptions. By doing that
we we repost the buffer and then post this buffer for the device to send.
If we happen to use shared receive queues the device might write to the
buffer before it sends it (there is no ordering between send and recv
queues). Without SRQs we probably won't get that if the host doesn't
mis-behave and send more than we allowed it, but relying on that is not
really a good idea.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-04 17:44:40 +03:00
Sagi Grimberg
d8f7750a08 nvmet-rdma: Correctly handle RDMA device hot removal
When configuring a device attached listener, we may
see device removal events. In this case we return a
non-zero return code from the cm event handler which
implicitly destroys the cm_id. It is possible that in
the future the user will remove this listener and by
that trigger a second call to rdma_destroy_id on an
already destroyed cm_id -> BUG.

In addition, when a queue bound (active session) cm_id
generates a DEVICE_REMOVAL event we must guarantee all
resources are cleaned up by the time we return from the
event handler.

Introduce nvmet_rdma_device_removal which addresses
(or at least attempts to) both scenarios.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-04 17:43:06 +03:00
Sagi Grimberg
45862ebcc4 nvme-rdma: Make sure to shutdown the controller if we can
Relying on ctrl state in nvme_rdma_shutdown_ctrl is wrong because
it will never be NVME_CTRL_LIVE (delete_ctrl or reset_ctrl invoked it).

Instead, check that the admin queue is connected. Note that it is safe
because we can never see a copmeting thread trying to destroy the admin
queue (reset or delete controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:23 +03:00
Sagi Grimberg
a159c64d93 nvme-loop: Remove duplicate call to nvme_remove_namespaces
nvme_uninit_ctrl already does that for us. Note that we
reordered nvme_loop_shutdown_ctrl with nvme_uninit_ctrl
but its safe because we want controller uninit to happen
before we shutdown the transport resources.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:19 +03:00
Sagi Grimberg
a34ca17a97 nvme-rdma: Free the I/O tags when we delete the controller
If we wait until we free the controller (free_ctrl) we might
lose our rdma device without any notification while we still
have open resources (tags mrs and dma mappings).

Instead, destroy the tags with their rdma resources once we
delete the device and not when freeing it.

Note that we don't do that in nvme_rdma_shutdown_ctrl because
controller reset uses it as well and we want to give active I/O
a chance to complete successfully.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:16 +03:00
Sagi Grimberg
2461a8dd38 nvme-rdma: Remove duplicate call to nvme_remove_namespaces
nvme_uninit_ctrl already does that for us. Note that we reordered
nvme_rdma_shutdown_ctrl and nvme_uninit_ctrl, this is perfectly
fine because we actually want ctrl uninit (aen, scan cancellation
and namespaces removal) to happen before we shutdown the rdma
resources.

Also, centralize the deletion work and the dead controller removal
work code duplication into __nvme_rdma_shutdown_ctrl that accepts
a shutdown boolean.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:11 +03:00
Sagi Grimberg
57de5a0a40 nvme-rdma: Fix device removal handling
Device removal sequence may have crashed because the
controller (and admin queue space) was freed before
we destroyed the admin queue resources. Thus we
want to destroy the admin queue and only then queue
controller deletion and wait for it to complete.

More specifically we:
1. own the controller deletion (make sure we are not
   competing with another deletion).
2. get rid of inflight reconnects if exists (which
   also destroy and create queues).
3. destroy the queue.
4. safely queue controller deletion (and wait for it
   to complete).

Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:07 +03:00
Sagi Grimberg
5f372eb3e7 nvme-rdma: Queue ns scanning after a sucessful reconnection
On an ordered target shutdown, the target can send a AEN on a namespace
removal, this will trigger the host to queue ns-list query. The shutdown
will trigger error recovery which will attepmt periodic reconnect.

We can hit a race where the ns rescanning fails (error recovery kicked
in and we're not connected) causing removing all the namespaces and when
we reconnect we won't see any namespaces for this controller.

So, queue a namespace rescan after we successfully reconnected to the target.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:03 +03:00
Roland Dreier
0b857b44b5 nvme-rdma: Don't leak uninitialized memory in connect request private data
Zero out the full nvme_rdma_cm_req structure before sending it.
Otherwise we end up leaking kernel memory in the reserved field, which
might break forward compatibility in the future.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-03 16:24:57 +03:00
Linus Torvalds
c8d0267efd PCI changes for the v4.8 merge window:
Enumeration
     Move ecam.h to linux/include/pci-ecam.h (Jayachandran C)
     Add parent device field to ECAM struct pci_config_window (Jayachandran C)
     Add generic MCFG table handling (Tomasz Nowicki)
     Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC (Tomasz Nowicki)
     Factor DT-specific pci_bus_find_domain_nr() code out (Tomasz Nowicki)
 
   Resource management
     Add devm_request_pci_bus_resources() (Bjorn Helgaas)
     Unify pci_resource_to_user() declarations (Bjorn Helgaas)
     Implement pci_resource_to_user() with pcibios_resource_to_bus() (microblaze, powerpc, sparc) (Bjorn Helgaas)
     Request host bridge window resources (designware, iproc, rcar, xgene, xilinx, xilinx-nwl) (Bjorn Helgaas)
     Make PCI I/O space optional on ARM32 (Bjorn Helgaas)
     Ignore write combining when mapping I/O port space (Bjorn Helgaas)
     Claim bus resources on MIPS PCI_PROBE_ONLY set-ups (Bjorn Helgaas)
     Remove unicore32 pci=firmware command line parameter handling (Bjorn Helgaas)
     Support I/O resources when parsing host bridge resources (Jayachandran C)
     Add helpers to request/release memory and I/O regions (Johannes Thumshirn)
     Use pci_(request|release)_mem_regions (NVMe, lpfc, GenWQE, ethernet/intel, alx) (Johannes Thumshirn)
     Extend pci=resource_alignment to specify device/vendor IDs (Koehrer Mathias (ETAS/ESW5))
     Add generic pci_bus_claim_resources() (Lorenzo Pieralisi)
     Claim bus resources on ARM32 PCI_PROBE_ONLY set-ups (Lorenzo Pieralisi)
     Remove ARM32 and ARM64 arch-specific pcibios_enable_device() (Lorenzo Pieralisi)
     Add pci_unmap_iospace() to unmap I/O resources (Sinan Kaya)
     Remove powerpc __pci_mmap_set_pgprot() (Yinghai Lu)
 
   PCI device hotplug
     Allow additional bus numbers for hotplug bridges (Keith Busch)
     Ignore interrupts during D3cold (Lukas Wunner)
 
   Power management
     Enforce type casting for pci_power_t (Andy Shevchenko)
     Don't clear d3cold_allowed for PCIe ports (Mika Westerberg)
     Put PCIe ports into D3 during suspend (Mika Westerberg)
     Power on bridges before scanning new devices (Mika Westerberg)
     Runtime resume bridge before rescan (Mika Westerberg)
     Add runtime PM support for PCIe ports (Mika Westerberg)
     Remove redundant check of pcie_set_clkpm (Shawn Lin)
 
   Virtualization
     Add function 1 DMA alias quirk for Marvell 88SE9182 (Aaron Sierra)
     Add DMA alias quirk for Adaptec 3805 (Alex Williamson)
     Mark Atheros AR9485 and QCA9882 to avoid bus reset (Chris Blake)
     Add ACS quirk for Solarflare SFC9220 (Edward Cree)
 
   MSI
     Fix PCI_MSI dependencies (Arnd Bergmann)
     Add pci_msix_desc_addr() helper (Christoph Hellwig)
     Switch msix_program_entries() to use pci_msix_desc_addr() (Christoph Hellwig)
     Make the "entries" argument to pci_enable_msix() optional (Christoph Hellwig)
     Provide sensible IRQ vector alloc/free routines (Christoph Hellwig)
     Spread interrupt vectors in pci_alloc_irq_vectors() (Christoph Hellwig)
 
   Error Handling
     Bind DPC to Root Ports as well as Downstream Ports (Keith Busch)
     Remove DPC tristate module option (Keith Busch)
     Convert Downstream Port Containment driver to use devm_* functions (Mika Westerberg)
 
   Generic host bridge driver
     Select IRQ_DOMAIN (Arnd Bergmann)
     Claim bus resources on PCI_PROBE_ONLY set-ups (Lorenzo Pieralisi)
 
   ACPI host bridge driver
     Add ARM64 acpi_pci_bus_find_domain_nr() (Tomasz Nowicki)
     Add ARM64 ACPI support for legacy IRQs parsing and consolidation with DT code (Tomasz Nowicki)
     Implement ARM64 AML accessors for PCI_Config region (Tomasz Nowicki)
     Support ARM64 ACPI-based PCI host controller (Tomasz Nowicki)
 
   Altera host bridge driver
     Check link status before retrain link (Ley Foon Tan)
     Poll for link up status after retraining the link (Ley Foon Tan)
 
   Axis ARTPEC-6 host bridge driver
     Add PCI_MSI_IRQ_DOMAIN dependency (Arnd Bergmann)
     Add DT binding for Axis ARTPEC-6 PCIe controller (Niklas Cassel)
     Add Axis ARTPEC-6 PCIe controller driver (Niklas Cassel)
 
   Intel VMD host bridge driver
     Use lock save/restore in interrupt enable path (Jon Derrick)
     Select device dma ops to override (Keith Busch)
     Initialize list item in IRQ disable (Keith Busch)
     Use x86_vector_domain as parent domain (Keith Busch)
     Separate MSI and MSI-X vector sharing (Keith Busch)
 
   Marvell Aardvark host bridge driver
     Add DT binding for the Aardvark PCIe controller (Thomas Petazzoni)
     Add Aardvark PCI host controller driver (Thomas Petazzoni)
     Add Aardvark PCIe support for Armada 3700 (Thomas Petazzoni)
 
   Microsoft Hyper-V host bridge driver
     Fix interrupt cleanup path (Cathy Avery)
     Don't leak buffer in hv_pci_onchannelcallback() (Vitaly Kuznetsov)
     Handle all pending messages in hv_pci_onchannelcallback() (Vitaly Kuznetsov)
 
   NVIDIA Tegra host bridge driver
     Program PADS_REFCLK_CFG* always, not just on legacy SoCs (Stephen Warren)
     Program PADS_REFCLK_CFG* registers with per-SoC values (Stephen Warren)
     Use lower-case hex consistently for register definitions (Thierry Reding)
     Use generic pci_remap_iospace() rather than ARM32-specific one (Thierry Reding)
     Stop setting pcibios_min_mem (Thierry Reding)
 
   Renesas R-Car host bridge driver
     Drop gen2 dummy I/O port region (Bjorn Helgaas)
 
   TI DRA7xx host bridge driver
     Fix return value in case of error (Christophe JAILLET)
 
   Xilinx AXI host bridge driver
     Fix return value in case of error (Christophe JAILLET)
 
   Miscellaneous
     Make bus_attr_resource_alignment static (Ben Dooks)
     Include <asm/dma.h> for isa_dma_bridge_buggy (Ben Dooks)
     MAINTAINERS: Add file patterns for PCI device tree bindings (Geert Uytterhoeven)
     Make host bridge drivers explicitly non-modular (Paul Gortmaker)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXoNRtAAoJEFmIoMA60/r8LMkP/3kiNh21QFS6RZGOaDft5/Py
 n14Zo0w51avspxoI3iyDlBd5q/SssMqi+2c6Ko/fh2D2xMxJgmQOjdMDrIGARxGA
 qEHk/5IoXquY2/GcptmCk3ap66cJ6kTovS4OPrb73m3fPuknFwFwdzExq22XHbnI
 crPya6xwQxPLc54VpY/TsgW8E+EKZd/3FW9wuzzNHXrXmTILyhBQzQAA0K470GMx
 wEXU6kc3M/XhRuF1zjV9/O+H/xguwfnbTpZLvd2NAF6uXKZoRytEHHtNnVqu1hoe
 UPpDS2xq32pMNbGxGqBetCdIbkY/hWOufmckHI7Yu2OfXBYyHBYMG2je1+nMPkOV
 WiFhhrchGt5KnEMUwXPS4ROqnSZVpZBl1Fd4s10GhUYkoE2HNKJXta398H9FR1jj
 4NEVSi4mSX/+CkaoIN3lXYiaf9P0wv4Wppve4Scr30+VnLjJhm7Vw5La7v12oo6x
 otrJ/g98AkmnbuUdLeWBUS/+TOcdPjZYbw52rqBsbOOjFm51Zcj6D7kf5WcTypQy
 HzbvygSVabcioWehUG1uudC8pdJmQlUGx1aES/iu+mZEae4cuUFALu6hDBD9IYnZ
 5JdwjVzI0UItEwT3rQt3t4xiAqHADQ0NAVNJVCeREdoy/YQpSoTWGXIpyqCZ1yCm
 aBykjRsxbKQXlhVeIxuc
 =NVxu
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Highlights:

   - ARM64 support for ACPI host bridges

   - new drivers for Axis ARTPEC-6 and Marvell Aardvark

   - new pci_alloc_irq_vectors() interface for MSI-X, MSI, legacy INTx

   - pci_resource_to_user() cleanup (more to come)

  Detailed summary:

  Enumeration:
   - Move ecam.h to linux/include/pci-ecam.h (Jayachandran C)
   - Add parent device field to ECAM struct pci_config_window (Jayachandran C)
   - Add generic MCFG table handling (Tomasz Nowicki)
   - Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC (Tomasz Nowicki)
   - Factor DT-specific pci_bus_find_domain_nr() code out (Tomasz Nowicki)

  Resource management:
   - Add devm_request_pci_bus_resources() (Bjorn Helgaas)
   - Unify pci_resource_to_user() declarations (Bjorn Helgaas)
   - Implement pci_resource_to_user() with pcibios_resource_to_bus() (microblaze, powerpc, sparc) (Bjorn Helgaas)
   - Request host bridge window resources (designware, iproc, rcar, xgene, xilinx, xilinx-nwl) (Bjorn Helgaas)
   - Make PCI I/O space optional on ARM32 (Bjorn Helgaas)
   - Ignore write combining when mapping I/O port space (Bjorn Helgaas)
   - Claim bus resources on MIPS PCI_PROBE_ONLY set-ups (Bjorn Helgaas)
   - Remove unicore32 pci=firmware command line parameter handling (Bjorn Helgaas)
   - Support I/O resources when parsing host bridge resources (Jayachandran C)
   - Add helpers to request/release memory and I/O regions (Johannes Thumshirn)
   - Use pci_(request|release)_mem_regions (NVMe, lpfc, GenWQE, ethernet/intel, alx) (Johannes Thumshirn)
   - Extend pci=resource_alignment to specify device/vendor IDs (Koehrer Mathias (ETAS/ESW5))
   - Add generic pci_bus_claim_resources() (Lorenzo Pieralisi)
   - Claim bus resources on ARM32 PCI_PROBE_ONLY set-ups (Lorenzo Pieralisi)
   - Remove ARM32 and ARM64 arch-specific pcibios_enable_device() (Lorenzo Pieralisi)
   - Add pci_unmap_iospace() to unmap I/O resources (Sinan Kaya)
   - Remove powerpc __pci_mmap_set_pgprot() (Yinghai Lu)

  PCI device hotplug:
   - Allow additional bus numbers for hotplug bridges (Keith Busch)
   - Ignore interrupts during D3cold (Lukas Wunner)

  Power management:
   - Enforce type casting for pci_power_t (Andy Shevchenko)
   - Don't clear d3cold_allowed for PCIe ports (Mika Westerberg)
   - Put PCIe ports into D3 during suspend (Mika Westerberg)
   - Power on bridges before scanning new devices (Mika Westerberg)
   - Runtime resume bridge before rescan (Mika Westerberg)
   - Add runtime PM support for PCIe ports (Mika Westerberg)
   - Remove redundant check of pcie_set_clkpm (Shawn Lin)

  Virtualization:
   - Add function 1 DMA alias quirk for Marvell 88SE9182 (Aaron Sierra)
   - Add DMA alias quirk for Adaptec 3805 (Alex Williamson)
   - Mark Atheros AR9485 and QCA9882 to avoid bus reset (Chris Blake)
   - Add ACS quirk for Solarflare SFC9220 (Edward Cree)

  MSI:
   - Fix PCI_MSI dependencies (Arnd Bergmann)
   - Add pci_msix_desc_addr() helper (Christoph Hellwig)
   - Switch msix_program_entries() to use pci_msix_desc_addr() (Christoph Hellwig)
   - Make the "entries" argument to pci_enable_msix() optional (Christoph Hellwig)
   - Provide sensible IRQ vector alloc/free routines (Christoph Hellwig)
   - Spread interrupt vectors in pci_alloc_irq_vectors() (Christoph Hellwig)

  Error Handling:
   - Bind DPC to Root Ports as well as Downstream Ports (Keith Busch)
   - Remove DPC tristate module option (Keith Busch)
   - Convert Downstream Port Containment driver to use devm_* functions (Mika Westerberg)

  Generic host bridge driver:
   - Select IRQ_DOMAIN (Arnd Bergmann)
   - Claim bus resources on PCI_PROBE_ONLY set-ups (Lorenzo Pieralisi)

  ACPI host bridge driver:
   - Add ARM64 acpi_pci_bus_find_domain_nr() (Tomasz Nowicki)
   - Add ARM64 ACPI support for legacy IRQs parsing and consolidation with DT code (Tomasz Nowicki)
   - Implement ARM64 AML accessors for PCI_Config region (Tomasz Nowicki)
   - Support ARM64 ACPI-based PCI host controller (Tomasz Nowicki)

  Altera host bridge driver:
   - Check link status before retrain link (Ley Foon Tan)
   - Poll for link up status after retraining the link (Ley Foon Tan)

  Axis ARTPEC-6 host bridge driver:
   - Add PCI_MSI_IRQ_DOMAIN dependency (Arnd Bergmann)
   - Add DT binding for Axis ARTPEC-6 PCIe controller (Niklas Cassel)
   - Add Axis ARTPEC-6 PCIe controller driver (Niklas Cassel)

  Intel VMD host bridge driver:
   - Use lock save/restore in interrupt enable path (Jon Derrick)
   - Select device dma ops to override (Keith Busch)
   - Initialize list item in IRQ disable (Keith Busch)
   - Use x86_vector_domain as parent domain (Keith Busch)
   - Separate MSI and MSI-X vector sharing (Keith Busch)

  Marvell Aardvark host bridge driver:
   - Add DT binding for the Aardvark PCIe controller (Thomas Petazzoni)
   - Add Aardvark PCI host controller driver (Thomas Petazzoni)
   - Add Aardvark PCIe support for Armada 3700 (Thomas Petazzoni)

  Microsoft Hyper-V host bridge driver:
   - Fix interrupt cleanup path (Cathy Avery)
   - Don't leak buffer in hv_pci_onchannelcallback() (Vitaly Kuznetsov)
   - Handle all pending messages in hv_pci_onchannelcallback() (Vitaly Kuznetsov)

  NVIDIA Tegra host bridge driver:
   - Program PADS_REFCLK_CFG* always, not just on legacy SoCs (Stephen Warren)
   - Program PADS_REFCLK_CFG* registers with per-SoC values (Stephen Warren)
   - Use lower-case hex consistently for register definitions (Thierry Reding)
   - Use generic pci_remap_iospace() rather than ARM32-specific one (Thierry Reding)
   - Stop setting pcibios_min_mem (Thierry Reding)

  Renesas R-Car host bridge driver:
   - Drop gen2 dummy I/O port region (Bjorn Helgaas)

  TI DRA7xx host bridge driver:
   - Fix return value in case of error (Christophe JAILLET)

  Xilinx AXI host bridge driver:
   - Fix return value in case of error (Christophe JAILLET)

  Miscellaneous:
   - Make bus_attr_resource_alignment static (Ben Dooks)
   - Include <asm/dma.h> for isa_dma_bridge_buggy (Ben Dooks)
   - MAINTAINERS: Add file patterns for PCI device tree bindings (Geert Uytterhoeven)
   - Make host bridge drivers explicitly non-modular (Paul Gortmaker)"

* tag 'pci-v4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (125 commits)
  PCI: xgene: Make explicitly non-modular
  PCI: thunder-pem: Make explicitly non-modular
  PCI: thunder-ecam: Make explicitly non-modular
  PCI: tegra: Make explicitly non-modular
  PCI: rcar-gen2: Make explicitly non-modular
  PCI: rcar: Make explicitly non-modular
  PCI: mvebu: Make explicitly non-modular
  PCI: layerscape: Make explicitly non-modular
  PCI: keystone: Make explicitly non-modular
  PCI: hisi: Make explicitly non-modular
  PCI: generic: Make explicitly non-modular
  PCI: designware-plat: Make it explicitly non-modular
  PCI: artpec6: Make explicitly non-modular
  PCI: armada8k: Make explicitly non-modular
  PCI: artpec: Add PCI_MSI_IRQ_DOMAIN dependency
  PCI: Add ACS quirk for Solarflare SFC9220
  arm64: dts: marvell: Add Aardvark PCIe support for Armada 3700
  PCI: aardvark: Add Aardvark PCI host controller driver
  dt-bindings: add DT binding for the Aardvark PCIe controller
  PCI: tegra: Program PADS_REFCLK_CFG* registers with per-SoC values
  ...
2016-08-02 17:12:29 -04:00
Linus Torvalds
3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds
d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Keith Busch
13880f5b57 nvme/pci: Provide SR-IOV support
This registers an sr-iov callback for nvme.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:29:59 -06:00
Jay Freyensee
fa9a89fc66 nvme: initialize variable before logical OR'ing it
It is typically not good coding or secure coding practice
to logical OR a variable without an initialization value first.
Here on this line:

integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;

BLK_INTEGRITY_DEVICE_CAPABLE is being OR'ed to a member variable
never set to an initial value. This patch fixes that.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:26:16 -06:00
Christoph Hellwig
0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
Christoph Hellwig
70246286e9 block: get rid of bio_rw and READA
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:01 -06:00
NeilBrown
b09dcf585d NVMe: don't allocate unused nvme_major
When alloc_disk(0) is used, the ->major number is ignored.  All device
numbers are allocated with a major of BLOCK_EXT_MAJOR.

So remove all references to nvme_major.

[akpm@linux-foundation.org: one unregister_blkdev() was missed]
Link: http://lkml.kernel.org/r/20160602064318.4403.93301.stgit@noble
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:52:56 -07:00
Keith Busch
32f0c4afb4 nvme: Remove RCU namespace protection
We can't sleep with RCU read lock held, but we need to do potentially
blocking stuff to namespace queues when iterating the list. This patch
removes the RCU locking and holds a mutex instead.

To prevent deadlocks, this patch removes holding the mutex during
namespace scanning and removal. The unlocked namespace scanning is made
safe by holding a reference to the namespace being scanned.

List iteration that does IO has to be unlocked to allow error recovery.
The caller must ensure the list can not be manipulated during such an
event, so this patch adds a comment explaining this requirement to the
only function that iterates an unlocked list. All callers currently
meet this requirement, so no further changes required.

List iterations that do not do IO can safely use the lock since it couldn't
block recovery from missing forced IO completions.

Reported-by: Ming Lin <mlin at kernel.org>
[fixes 0bf77e9 nvme: switch to RCU freeing the namespace]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:48:08 -07:00
Masayoshi Mizuma
2fa843512b nvme: avoid crashes when node 0 is memoryless node.
When CONFIG_NUMA is enabled and node 0 is memoryless, the system
crashes because nvme_probe() sets the device->numa_node to 0 by
set_dev_node(&pdev->dev, 0), so it tries to allocate memory from node 0.
To avoid the crash, we should change the 0 to first_memory_node.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-13 09:17:55 -07:00
Keith Busch
f80ec966c1 nvme: Limit command retries
Many controller implementations will return errors to commands that will
not succeed, but without the DNR bit set. The driver previously retried
these commands an unlimited number of times until the command timeout
has exceeded, which takes an unnecessarilly long period of time.

This patch limits the number of retries a command can have, defaulting
to 5, but is user tunable at load or runtime.

The struct request's 'retries' field is used to track the number of
retries attempted. This is in contrast with scsi's use of this field,
which indicates how many retries are allowed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 16:20:31 -07:00
Arnd Bergmann
6eae8c4520 nvme-loop: fix nvme-loop Kconfig dependencies
I ran into the same problem on NVME_TARGET_RDMA now,
which otherwise needs dependencies on both CONFIG_BLOCK and
CONFIGFS_FS:

    warning: (NVME_TARGET_LOOP && NVME_TARGET_RDMA) selects NVME_TARGET which has unmet direct dependencies (BLOCK && CONFIGFS_FS)
    0xA002B368 Mon Jul 11 18:00:45 CEST 2016 failed
    In file included from ../drivers/nvme/target/core.c:16:0:
    drivers/nvme/target/nvmet.h:222:14: error: field 'inline_bio' has incomplete type
      struct bio  inline_bio;
                  ^~~~~~~~~~
    drivers/nvme/target/core.c: In function 'nvmet_async_event_work':
    drivers/nvme/target/core.c:98:3: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
       kfree(aen);
       ^~~~~
    ../drivers/nvme/target/core.c: In function 'nvmet_ns_enable':
    ../drivers/nvme/target/core.c:269:13: error: implicit declaration of function 'blkdev_get_by_path' [-Werror=implicit-function-declaration]
      ns->bdev = blkdev_get_by_path(ns->device_path, FMODE_READ | FMODE_WRITE,

Folding in my patch below should address that too.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:36:40 -07:00
Wei Yongjun
69555af2ce nvmet: fix return value check in nvmet_subsys_alloc()
In case of error, the function kstrndup() returns NULL pointer
not ERR_PTR(). The IS_ERR() test in the return value check
should be replaced with NULL test.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:33:43 -07:00
Ming Lin
e76debd996 nvme-fabrics: add-remove ctrl repeat fix
Repeatedly adding then removing the same NVMe-over-Fabrics controller
over and over again (shown below) can cause a kernel crash (also shown
below).  This patch fixes that.

[nvmf]# ./setup_nvme_connections.sh
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside
-nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=lightside
-nqn,hostnqn=good-wins-nqn > /dev/nvme-fabrics
[nvmf]# ./remove_nvme_connections.sh 2
echo 1 > /sys/class/nvme/nvme0/delete_controller
echo 1 > /sys/class/nvme/nvme1/delete_controller
[nvmf]# ./setup_nvme_connections.sh
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside
-nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics
Killed

[nvmf]# dmesg
[  313.416908] nvme nvme0: creating 16 I/O queues.
[  313.523908] nvme nvme0: new ctrl: NQN "darkside-nqn", addr
192.168.1.100:4420
[  313.524857] BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
[  313.525262] IP: [<ffffffff8136c60e>] strcmp+0xe/0x30
[  313.525490] PGD 0
[  313.525726] Oops: 0000 [#1] SMP
[  313.525900] Modules linked in: nvme_rdma nvme_fabrics nvme_core
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_en
mlx4_ib ib_core mlx4_core
[  313.527085] CPU: 15 PID: 5856 Comm: setup_nvme_conn Not tainted
4.7.0-rc2+ #2
[  313.527259] Hardware name: Supermicro X9DRT-F/IBQF/IBFF/X9DRT
-F/IBQF/IBFF, BIOS 1.0a 10/09/2012
[  313.527551] task: ffff88027646cd40 ti: ffff88025b980000 task.ti:
ffff88025b980000
[  313.527879] RIP: 0010:[<ffffffff8136c60e>]  [<ffffffff8136c60e>]
strcmp+0xe/0x30
[  313.528232] RSP: 0018:ffff88025b983db0  EFLAGS: 00010206
[  313.528403] RAX: 0000000000000000 RBX: ffff880471879880 RCX:
fffffffffffffff1
[  313.528594] RDX: 0000000000000000 RSI: ffff880474afa860 RDI:
0000000000000011
[  313.528778] RBP: ffff88025b983db0 R08: ffff880474afa860 R09:
ffff880471879058
[  313.528956] R10: 000000000000002c R11: ffff88047f415000 R12:
ffff880471879800
[  313.529129] R13: ffff880471879000 R14: ffff880474afa860 R15:
fffffffffffffff8
[  313.529303] FS:  00007f778f510700(0000) GS:ffff88047fbc0000(0000)
knlGS:0000000000000000
[  313.529629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  313.529817] CR2: 0000000000000010 CR3: 0000000274174000 CR4:
00000000000406e0
[  313.529989] Stack:
[  313.530154]  ffff88025b983e48 ffffffffa0171c74 0000000000000001
0000000000000059
[  313.530621]  ffff880476f32400 ffff88047e8add80 0000010074b33aa0
ffff880471879059
[  313.531162]  ffff88047187904b ffff880471879058 0000000000000000
ffff88047736e000
[  313.531629] Call Trace:
[  313.531797]  [<ffffffffa0171c74>] nvmf_dev_write+0x674/0x840
[nvme_fabrics]
[  313.531974]  [<ffffffff81180b53>] __vfs_write+0x23/0x120
[  313.532146]  [<ffffffff8119daff>] ? __fd_install+0x1f/0xc0
[  313.532316]  [<ffffffff8119d97a>] ? __alloc_fd+0x3a/0x170
[  313.532487]  [<ffffffff811811f3>] vfs_write+0xb3/0x1b0
[  313.532658]  [<ffffffff8117e321>] ? filp_close+0x51/0x70
[  313.532845]  [<ffffffff811824e1>] SyS_write+0x41/0xa0
[  313.533016]  [<ffffffff8183055b>]
entry_SYSCALL_64_fastpath+0x13/0x8f
[  313.533188] Code: 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 c2 01
84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 84 c0 74 18 48 83
c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 74 eb 19 c0 83 c8 01 5d c3 31
[  313.536563] RIP  [<ffffffff8136c60e>] strcmp+0xe/0x30
[  313.536815]  RSP <ffff88025b983db0>
[  313.536981] CR2: 0000000000000010
[  313.537151] ---[ end trace 3d952e590e7bc2d5 ]---

Reported-and-tested-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <mlin@kernel.org>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:32:19 -07:00
Sagi Grimberg
6a92967ccb nvme-fabrics: Remove tl_retry_count
The timeout before error recovery logic kicks in is
dictated by the nvme keep-alive, so we don't really need
a transport layer retry count. transports can retry for
as much as they like.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:31:11 -07:00
Sagi Grimberg
2ac17c283a nvme-rdma: Don't use tl_retry_count
Always use the maximum qp retry count as the
error recovery timeout is dictated from the nvme
keep-alive.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:31:10 -07:00
Wei Yongjun
458a9632ad nvme-rdma: fix the return value of nvme_rdma_reinit_request()
PTR_ERR should be applied before its argument is reassigned, otherwise the
return value will be set to 0, not error code.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:27:03 -07:00
Guilherme G. Piccoli
54adc01055 nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the register
NVME_REG_CC should be written and then driver needs to wait the
adapter to be ready, which is checked by reading another register
bit (NVME_CSTS_RDY). There's a timeout validation in this checking,
so in case this timeout is reached the driver gives up and removes
the adapter from the system.

After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) end up being removed if we issue a reset_controller,
because driver keeps verifying the NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter, by
introducing a delay before nvme_wait_ready(), so the reset procedure
is able to be completed. This quirk is needed because just increasing
the timeout is not enough in case of this adapter - the driver must
wait before start reading NVME_REG_CSTS register on this specific
device.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:23:00 -07:00
Jens Axboe
41d512e51b Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers
Dan writes:

"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk().  See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.

It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1].  We would extend device_add_disk()
to take an attribute_group list.

This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.

[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
2016-07-08 16:04:11 -06:00
Christoph Hellwig
7110230719 nvme-rdma: add a NVMe over Fabrics RDMA host driver
This patch implements the RDMA host (initiator in SCSI speak) driver.  It
can be used to connect to remote NVMe over Fabrics controllers over
Infiniband, RoCE or iWarp, and uses the existing NVMe core driver as well
a the new fabrics library.

To connect to all NVMe over Fabrics controller reachable on a given taget
port using RDMA/CM use the following command:

	nvme connect-all -t rdma -a $IPADDR

This requires the latest version of nvme-cli with Fabrics support.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Christoph Hellwig
8f000cac6e nvmet-rdma: add a NVMe over Fabrics RDMA target driver
This patch implements the RDMA transport for the NVMe over Fabrics target,
which allows exporting NVMe over Fabrics functionality over RDMA fabrics
(Infiniband, RoCE, iWARP).

All NVMe logic is in the generic target and this module just provides a
small glue between it and the generic code in the RDMA subsystem.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>,
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Christoph Hellwig
def61eca96 nvme: add new reconnecting controller state
The nvme fabric (RDMA, FC, etc...) can introduce port, link or node
failures that may require a reconnect to re-establish the connection.

Add a new reconnecting state that will initially be used by the RDMA
driver.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Johannes Thumshirn
6f92970219 nvme: lightnvm: make MLC num_pairs little endian
According to the OpenChannel SSD interface specification the NAND flash
MLC page pairing information's number of page page pairings field is the
first two bytes in the MLC Page Pairing data structure. The hardware's
data structure itself is little endian so annotate it as such, like the
rest of lighnvm's data structures.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Dan Carpenter
f98d9ca17f nvmet: fix an error code
We accidentally return zero here when ERR_PTR(-ENOMEM) is intended.

Fixes: a07b4970f4 ('nvmet: add a generic NVMe target')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:37:36 -06:00
Arnd Bergmann
1fb4704084 nvme-loop: add configfs dependency
CONFIG_NVME_TARGET has a correct CONFIG_CONFIGFS_FS dependency, but the
newly added NVME_TARGET_LOOP is missing this, resulting in a link
failure:

drivers/nvme/built-in.o: In function `nvmet_init_configfs':
loop.c:(.init.text+0x2a0): undefined reference to `config_group_init'
loop.c:(.init.text+0x2c0): undefined reference to `config_group_init_type_name'
loop.c:(.init.text+0x318): undefined reference to `configfs_register_subsystem'
drivers/nvme/built-in.o: In function `nvmet_exit_configfs':
loop.c:(.exit.text+0x9c): undefined reference to `configfs_unregister_subsystem'

This adds the same dependency here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 3a85a5de29 ("nvme-loop: add a NVMe loopback host driver")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:34:45 -06:00
Christoph Hellwig
3a85a5de29 nvme-loop: add a NVMe loopback host driver
This patch implements adds nvme-loop which allows to access local devices
exported as NVMe over Fabrics namespaces. This module can be useful for
easy evaluation, testing and also feature experimentation.

To createa nvme-loop device you need to configure the NVMe target to
export a loop port (see the nvmetcli documentaton for that) and then
connect to it using

	nvme connect-all -t loop

which requires the very latest nvme-cli version with Fabrics support.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:36 -06:00
Christoph Hellwig
a07b4970f4 nvmet: add a generic NVMe target
This patch introduces a implementation of NVMe subsystems,
controllers and discovery service which allows to export
NVMe namespaces across fabrics such as Ethernet, FC etc.

The implementation conforms to the NVMe 1.2.1 specification
and interoperates with NVMe over fabrics host implementations.

Configuration works using configfs, and is best performed using
the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git,
which also has a detailed explanation of the required steps in the
README file.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Signed-off-by: Anthony Knapp <anthony.j.knapp@intel.com>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:33 -06:00
Sagi Grimberg
038bd4cb67 nvme: add keep-alive support
Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and
optional in NVMe 1.2.1 for PCIe.  This patch adds periodic keep-alive
sent from the host to verify that the controller is still responsive
and vice-versa.  The keep-alive timeout is user-defined (with
keep_alive_tmo connection parameter) and defaults to 5 seconds.

In order to avoid a race condition where the host sends a keep-alive
competing with the target side keep-alive timeout expiration, the host
adds a grace period of 10 seconds when publishing the keep-alive timeout
to the target.

In case a keep-alive failed (or timed out), a transport specific error
recovery kicks in.

For now only NVMe over Fabrics is wired up to support keep alive, but
we can add PCIe support easily once controllers actually supporting it
become available.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:20 -06:00
Christoph Hellwig
07bfcd09a2 nvme-fabrics: add a generic NVMe over Fabrics library
The NVMe over Fabrics library provides an interface for both transports
and the nvme core to handle fabrics specific commands and attributes
independent of the underlying transport.

In addition, the fabrics library adds a misc device interface that allow
actually creating a fabrics controller, as we can't just autodiscover
it like in the PCI case.  The nvme-cli utility has been enhanced to use
this interface to support fabric connect and discovery.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>,
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>,
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:16 -06:00
Christoph Hellwig
eb793e2c92 nvme.h: add NVMe over Fabrics definitions
The NVMe over Fabrics specification defines a protocol interface and
related extensions to NVMe that enable operation over network protocols.
The NVMe over Fabrics specification has an NVMe Transport binding for
each NVMe Transport.

This patch adds the fabrics related definitions:
- fabric specific command set and error codes
- transport addressing and binding definitions
- fabrics sgl extensions
- controller identification fabrics enhancements
- discovery log page definition

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:14 -06:00
Ming Lin
1a353d85b0 nvme: add fabrics sysfs attributes
- delete_controller: This attribute allows to delete a controller.
  A driver is not obligated to support it (pci doesn't) so it is
  created only if the driver supports it. The new fabrics drivers
  will support it (essentialy a disconnect operation).

  Usage:
  echo > /sys/class/nvme/nvme0/delete_controller

- subsysnqn: This attribute shows the subsystem nqn of the configured
  device. If a driver does not implement the get_subsysnqn method, the
  file will not appear in sysfs.

- transport: This attribute shows the transport name. Added a "name"
  field to struct nvme_ctrl_ops.

  For loop,
  cat /sys/class/nvme/nvme0/transport
  loop

  For RDMA,
  cat /sys/class/nvme/nvme0/transport
  rdma

  For PCIe,
  cat /sys/class/nvme/nvme0/transport
  pcie

- address: This attributes shows the controller address. The fabrics
  drivers that will implement get_address can show the address of the
  connected controller.

  example:
  cat /sys/class/nvme/nvme0/address
  traddr=192.168.2.2,trsvcid=1023

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:12 -06:00
Christoph Hellwig
eb71f43557 nvme: Modify and export sync command submission for fabrics
NVMe over fabrics will use __nvme_submit_sync_cmd in the the
transport and require a few tweaks to it.  For that we export it
and add a few more paramters:

1. allow passing a queue ID to the block layer

   For the NVMe over Fabrics connect command we need to able to specify a
   queue ID that we want to send the command on.  Add a qid parameter to
   the relevant functions to enable this behavior.

2. allow submitting at_head commands

   In cases where we want to (re)connect to a controller
   where we have inflight queued commands we want to first
   connect and only then allow the other queued commands to
   be kicked. This will prevents failures in controller resets
   and reconnects.

3. allow passing flags to blk_mq_allocate_request

   Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we
   want to be able to use reserved requests.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:11 -06:00
Christoph Hellwig
7d2e80080d nvme: allow transitioning from NEW to LIVE state
For Fabrics we're not going through an intermediate reset state
(at least for now).

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:09 -06:00
Dan Williams
0d52c756a6 block: convert to device_add_disk()
For block drivers that specify a parent device, convert them to use
device_add_disk().

This conversion was done with the following semantic patch:

    @@
    struct gendisk *disk;
    expression E;
    @@

    - disk->driverfs_dev = E;
    ...
    - add_disk(disk);
    + device_add_disk(E, disk);

    @@
    struct gendisk *disk;
    expression E1, E2;
    @@

    - disk->driverfs_dev = E1;
    ...
    E2 = disk;
    ...
    - add_disk(E2);
    + device_add_disk(E1, E2);

...plus some manual fixups for a few missed conversions.

Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27 12:26:08 -07:00
Johannes Thumshirn
a1f447b35b NVMe: Use pci_(request|release)_mem_regions
Now that we do have pci_request_mem_regions() and pci_release_mem_regions()
at hand, use it in the NVMe driver.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Keith Busch <keith.busch@intel.com>
CC: Jens Axboe <axboe@fb.com>
2016-06-21 17:09:32 -05:00
Christoph Hellwig
f5fa90dc0a nvme: move the workaround for I/O queue-less controllers from PCIe to core
We want to apply this to Fabrics drivers as well, so move it to common
code.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig
7a5abb4b48 nvme: factor out a add nvme_is_write helper
Centralize the check if a given NVMe command reads or writes data.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig
a229dbf61e nvme: allow for size limitations from transport drivers
Some transport drivers may have a lower transfer size than
the controller. So allow the transport to set it in the
controller max_hw_sectors.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Johannes Thumshirn
edb50a5403 NVMe: Only release requested regions
The NVMe driver only requests the PCIe device's memory regions but releases
all possible regions (including eventual I/O regions). This leads to a stale
warning entry in dmesg about freeing non existent resources.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 14:28:28 -06:00
Sunad Bhandary
47b0e50ac7 NVMe: Fix removal in case of active namespace list scanning method
In case of the active namespace list scanning method, a namespace that
is detached is not removed from the host if it was the last entry in
the list. Fix this by adding a scan to validate namespaces greater than
the value of prev.

This also handles the case of removing namespaces whose value exceed
the device's reported number of namespaces.

Signed-off-by: Sunad Bhandary S <sunad.s@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08 08:48:47 -06:00
Minfei Huang
bd0fc2884c nvme: use UINT_MAX for max discard sectors
It's more elegant to use UINT_MAX to represent the max value of
type unsigned int. So replace the actual value by using this define.

Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 22:23:12 -06:00
Ming Lin
c55a2fd4bb nvme: move nvme_cancel_request() to common code
So it can be used by fabrics driver also.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.bsuch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:43:02 -06:00
Ming Lin
e1958e6534 nvme: update and rename nvme_cancel_io to nvme_cancel_request
nvme_cancel_io is a bit confusing (given the distinction of io/admin),
so rename it to nvme_cancel_request.

And update it a bit to pass in struct nvme_ctrl, so it can be used
by Fabrics driver also.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.bsuch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:43:02 -06:00
Mike Christie
3a5e02ced1 block, drivers: add REQ_OP_FLUSH operation
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
c2df40dfb8 drivers: use req op accessor
The req operation REQ_OP is separated from the rq_flag_bits
definition. This converts the block layer drivers to
use req_op to get the op from the request struct.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Nicholas Bellinger
ba36c21b0c nvme/host: Add missing blk_integrity tag_size + flags assignments
While doing recent bring-up of nvme/host with target-core T10-PI,
I noticed /sys/block/nvme*/integrity/device_is_integrity_capable
was false, and /sys/block/nvme*/integrity/tag_size contained
a bogus value.

AFAICT outside of blk_integrity_compare() for DM + MD these
are informational values, but go ahead and add the missing
assignments for nvme/host to match what SCSI does within
sd_dif_config_host() for consistency's sake.

Cc: Keith Busch <keith.busch@intel.com>
Cc: Jay Freyensee <james.p.freyensee@intel.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Sagi Grimberg <sagig@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
99466e708d NVMe: Add device ID's with stripe quirk
Adds two Intel controllers that have the "stripe" quirk.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
0ff9d4e1a2 NVMe: Short-cut removal on surprise hot-unplug
This patch adds a new state that when set has the core automatically
kill request queues prior to removing namespaces.

If PCI device is not present at the time the nvme driver's remove is
called, we can kill all IO queues immediately instead of waiting for
the watchdog thread to do that at its polling interval. This improves
scenarios where multiple hot plug events occur at the same time since
it doesn't block the pci enumeration for as long.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
9ec3bb2f99 NVMe: Allow user initiated rescan
This exposes ioctl and sysfs methods a user can invoke to request the
driver rescan a controller and its namespaces. This is less harsh than
doing a controller reset, which temporarilly halts all IO, just to
surface a newly attached namespace.

This is mainly useful for controllers that implement the namespace
management command, but do not support the namespace notify change
asynchronous event notification.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
d011fb3164 NVMe: Reduce driver log spamming
Reduce error logging when no corrective action is required.

Suggessted-by: Chris Petersen <cpetersen@fb.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
921920ab32 NVMe: Unbind driver on failure
Instead of removing the PCI device from the kernel's topology on
controller failure, this patch simply requests unbinding the device
from the driver. This avoids concurrently running pci removal with the
hot plug event, which has been reported to be problematic when multiple
surprise events occur near simultaneously.

The other benefit is that we will have PCI config and memory space
available to poke around for debugging a failed controller, assuming
the device was not physically removed.

The down side occurs if the platform and/or kernel do not support any
type of surprise hot removal. The device will remain visible through
sysfs (and therefore lspci), and some manual work is necessary to get
the logical topology corrected. But if your platform and/or kernel don't
support surprise removal, you probably shouldn't be doing that anyway.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
014a0d609e NVMe: Delete only created queues
Use the online queue count instead of the number of allocated queues. The
controller should just return an invalid queue identifier error to the
commands if a queue wasn't created. While it's not harmful, it's still
not correct.

Reported-by: Saar Gross <saar@annapurnalabs.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Keith Busch
2800b8e7d9 NVMe: Allocate queues only for online cpus
The driver previously requested allocating queues for the total possible
number of CPUs so that blk-mq could rebalance these if CPUs were added
after initialization. The number of hardware contexts can now be changed
at runtime, so we only need to allocate the number of online queues
since we can add more later.

Suggested-by: Jeff Lien <jeff.lien@hgst.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17 17:14:21 -06:00
Linus Torvalds
24b9f0cf00 Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "On top of the core pull request, this is the drivers pull request for
  this merge window.  This contains:

   - Switch drivers to the new write back cache API, and kill off the
     flush flags.  From me.

   - Kill the discard support for the STEC pci-e flash driver.  It's
     trivially broken, and apparently unmaintained, so it's safer to
     just remove it.  From Jeff Moyer.

   - A set of lightnvm updates from the usual suspects (Matias/Javier,
     and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
     Tao.

   - A set of updates for NVMe:

        - Turn the controller state management into a proper state
          machine.  From Christoph.

        - Shuffling of code in preparation for NVMe-over-fabrics, also
          from Christoph.

        - Cleanup of the command prep part from Ming Lin.

        - Rewrite of the discard support from Ming Lin.

        - Deadlock fix for namespace removal from Ming Lin.

        - Use the now exported blk-mq tag helper for IO termination.
          From Sagi.

        - Various little fixes from Christoph, Guilherme, Keith, Ming
          Lin, Wang Sheng-Hui.

   - Convert mtip32xx to use the now exported blk-mq tag iter function,
     from Keith"

* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
  lightnvm: reserved space calculation incorrect
  lightnvm: rename nr_pages to nr_ppas on nvm_rq
  lightnvm: add is_cached entry to struct ppa_addr
  lightnvm: expose gennvm_mark_blk to targets
  lightnvm: remove mgt targets on mgt removal
  lightnvm: pass dma address to hardware rather than pointer
  lightnvm: do not assume sequential lun alloc.
  nvme/lightnvm: Log using the ctrl named device
  lightnvm: rename dma helper functions
  lightnvm: enable metadata to be sent to device
  lightnvm: do not free unused metadata on rrpc
  lightnvm: fix out of bound ppa lun id on bb tbl
  lightnvm: refactor set_bb_tbl for accepting ppa list
  lightnvm: move responsibility for bad blk mgmt to target
  lightnvm: make nvm_set_rqd_ppalist() aware of vblks
  lightnvm: remove struct factory_blks
  lightnvm: refactor device ops->get_bb_tbl()
  lightnvm: introduce nvm_for_each_lun_ppa() macro
  lightnvm: refactor dev->online_target to global nvm_targets
  lightnvm: rename nvm_targets to nvm_tgt_type
  ...
2016-05-17 16:03:32 -07:00
Javier González
6d5be9590b lightnvm: rename nr_pages to nr_ppas on nvm_rq
The number of ppas contained on a request is not necessarily the number
of pages that it maps to neither on the target nor on the device side.
In order to avoid confusion, rename nr_pages to nr_ppas since it is what
the variable actually contains.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Arnd Bergmann
45bbd0529e lightnvm: pass dma address to hardware rather than pointer
A recent change to lightnvm added code to pass a kernel pointer
to the hardware, which gcc complained about:

drivers/nvme/host/lightnvm.c: In function 'nvme_nvm_rqtocmd':
drivers/nvme/host/lightnvm.c:472:32: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
  c->ph_rw.metadata = cpu_to_le64(rqd->meta_list);

It looks like this has no way of working anyway, so this changes
the code to pass the dma_address instead. This was most likely
what was intended here. Neither of the two are currently ever
written to, so the effect is the same for now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a34b1eb78e21 ("lightnvm: enable metadata to be sent to device")
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Sagi Grimberg
b86d8d363e nvme/lightnvm: Log using the ctrl named device
Align with the rest of the nvme subsystem.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Javier González
75b8564932 lightnvm: rename dma helper functions
Until now, the dma pool have been exclusively used to allocate the ppa
list being sent to the device. In pblk (upcoming), we use these pools to
allocate metadata too. Thus, we generalize the names of some variables
on the dma helper functions to make the code more readable.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Javier González
003fad376b lightnvm: enable metadata to be sent to device
Enable metadata buffer to be sent to the device through the metadata
field on the physical rw nvme command. The size of the metadata buffer
must follow dev->oob_size * # of PPAs.

Signed-off-by: Javier González <javier@cnexlabs.com>
Updated description.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Matias Bjørling
00ee6cc3b7 lightnvm: refactor set_bb_tbl for accepting ppa list
The set_bb_tbl takes struct nvm_rq and only uses its ppa_list and
nr_pages internally. Instead, make these two variables explicit.
This allows a user to call it without initializing a struct nvm_rq
first.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Matias Bjørling
e11903f5df lightnvm: refactor device ops->get_bb_tbl()
The device ops->get_bb_tbl() takes a callback, that allows the caller
to use its own callback function to update its data structures in the
returning function.

This makes it difficult to send parameters to the callback, and usually
is circumvented by small private structures, that both carry the callers
state and any flags needed to fulfill the update.

Refactor ops->get_bb_tbl() to fill a data buffer with the status of the
blocks returned, and let the user call the callback function manually.
That will provide the necessary flags and data structures and simplify
the logic around ops->get_bb_tbl().

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Matias Bjørling
22e8c9766a lightnvm: move block fold outside of get_bb_tbl()
The get block table command returns a list of blocks and planes
with their associated state. Users, such as gennvm and sysblk,
manages all planes as a single virtual block.

It was therefore  natural to fold the bad block list before it is
returned. However, to allow users, which manages on a per-plane
block level, to also use the interface, the get_bb_tbl interface is
changed to not fold by default and instead let the caller fold if
necessary.

Reviewed by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-06 12:51:10 -06:00
Keith Busch
87c3207781 NVMe: Fix reset/remove race
This fixes a scenario where device is present and being reset, but a
request to unbind the driver occurs.

A previous patch series addressing a device failure removal scenario
flushed reset_work after controller disable to unblock reset_work waiting
on a completion that wouldn't occur. This isn't safe as-is. The broken
scenario can potentially be induced with:

  modprobe nvme && modprobe -r nvme

To fix, the reset work is flushed immediately after setting the controller
removing flag, and any subsequent reset will not proceed with controller
initialization if the flag is set.

The controller status must be polled while active, so the watchdog timer
is also left active until the controller is disabled to cleanup requests
that may be stuck during namespace removal.

[Fixes: ff23a2a15a]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-03 14:00:29 -06:00
Ming Lin
b7b9c22787 nvme: fix nvme_ns_remove() deadlock
On receipt of a namespace attribute changed AER, we acquire the
namespace mutex lock before proceeding to scan and validate the
namespace list. In case of namespace detach/delete command,
nvme_ns_remove function deadlocks trying to acquire the already held
lock.

All callers, except nvme_remove_namespaces(), of nvme_ns_remove()
already held namespaces_mutex. So we can simply fix the deadlock by
not acquiring the mutex in nvme_ns_remove() and acquiring it in
nvme_remove_namespaces().

Reported-by: Sunad Bhandary S <sunad.s@samsung.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimerg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:16:13 -06:00
Ming Lin
0bf77e9dbb nvme: switch to RCU freeing the namespace
Switch to RCU freeing the namespace structure so that
nvme_start_queues, nvme_stop_queues and nvme_kill_queues would
be able to get away with only a RCU read side critical section.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimerg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:16:11 -06:00
Ming Lin
6904242db1 nvme: add helper nvme_cleanup_cmd()
This hides command cleanup into nvme.h and fabrics drivers will
also use it.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:11:58 -06:00
Christoph Hellwig
f866fc4282 nvme: move AER handling to common code
The transport driver still needs to do the actual submission, but all the
higher level code can be shared.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:09:26 -06:00
Christoph Hellwig
5955be2144 nvme: move namespace scanning to core
Move the scan work item and surrounding code to the common code.  For now
we need a new finish_scan method to allow the PCI driver to set the
irq affinity hints, but I have plans in the works to obsolete this as well.

Note that this moves the namespace scanning from nvme_wq to the system
workqueue, but as we don't rely on namespace scanning to finish from reset
or I/O this should be fine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:09:24 -06:00
Christoph Hellwig
92911a55d4 nvme: tighten up state check for namespace scanning
We only should be scanning namespaces if the controller is live.  Currently
we call the function just before setting it live, so fix the code up to
move the call to nvme_queue_scan to just below the state change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:09:23 -06:00
Christoph Hellwig
bb8d261e08 nvme: introduce a controller state machine
Replace the adhoc flags in the PCI driver with a state machine in the
core code.  Based on code from Sagi Grimberg for the Fabrics driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:09:21 -06:00
Christoph Hellwig
04a934d4c7 nvme: remove the io_incapable method
It's unused since "NVMe: Move error handling to failed reset handler".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:09:20 -06:00
Wang Sheng-Hui
23bd63ceea NVMe: nvme_core_exit() should do cleanup in the reverse order as nvme_core_init does
nvme_core_init does:
    1) register_blkdev
    2) __register_chrdev
    3) class_create

nvme_core_exit should do cleanup in the reverse order.

Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:05:03 -06:00
Keith Busch
3b24774e1f NVMe: Fix check_flush_dependency warning
If the controller fails and is degraded after a reset, we need to kill
off all requests queues before removing the inaccessble namespaces. This
will prevent del_gendisk from syncing dirty data, which we can't due
from a WQ_MEM_RECLAIM work queue.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:03:06 -06:00
Wang Sheng-Hui
b31356dfde NVMe: small typo in section BLK_DEV_NVME_SCSI of host/Kconfig
"as well as " is miss typed "as well a " in section
"config BLK_DEV_NVME_SCSI"

Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-26 08:31:50 -06:00
Christoph Hellwig
76e3914ae5 nvme: fix cntlid type
Controller IDs in NVMe are unsigned 16-bit types.  In the Fabrics driver we
actually pass ctrl->id by reference, so we need it to have the correct type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-26 08:31:22 -06:00
Keith Busch
a5229050b6 NVMe: Always use MSI/MSI-x interrupts
Multiple users have reported device initialization failure due the driver
not receiving legacy PCI interrupts. This is not unique to any particular
controller, but has been observed on multiple platforms.

There have been no issues reported or observed when with message signaled
interrupts, so this patch attempts to use MSI-x during initialization,
falling back to MSI. If that fails, legacy would become the default.

The setup_io_queues error handling had to change as a result: the admin
queue's msix_entry used to be initialized to the legacy IRQ. The case
where nr_io_queues is 0 would fail request_irq when setting up the admin
queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving
the admin queue's msix_entry invalid. Instead, return success immediately.

Reported-by: Tim Muhlemmer <muhlemmer@gmail.com>
Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-14 14:04:50 -06:00
Guilherme G. Piccoli
c875a7093f nvme: Avoid reset work on watchdog timer function during error recovery
This patch adds a check on nvme_watchdog_timer() function to avoid the
call to reset_work() when an error recovery process is ongoing on
controller. The check is made by looking at pci_channel_offline()
result.

If we don't check for this on nvme_watchdog_timer(), error recovery
mechanism can't recover well, because reset_work() won't be able to
do its job (since we're in the middle of an error) and so the
controller is removed from the system before error recovery mechanism
can perform slot reset (which would allow the adapter to recover).

In this patch we also have split the huge condition expression on
nvme_watchdog_timer() by introducing an auxiliary function to help
make the code more readable.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-13 08:15:39 -06:00
Jens Axboe
7e19793096 NVMe: silence warning about unused 'dev'
Depending on options, we might not be using dev in nvme_cancel_io():

drivers/nvme/host/pci.c: In function ‘nvme_cancel_io’:
drivers/nvme/host/pci.c:970:19: warning: unused variable ‘dev’ [-Wunused-variable]
  struct nvme_dev *dev = data;
                   ^

So get rid of it, and just cast for the dev_dbg_ratelimited() call.

Fixes: 82b4552b91 ("nvme: Use blk-mq helper for IO termination")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 16:11:11 -06:00
Jens Axboe
7c88cb00f2 NVMe: switch to using blk_queue_write_cache()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 16:00:39 -06:00
Sagi Grimberg
82b4552b91 nvme: Use blk-mq helper for IO termination
blk-mq offers a tagset iterator so let's use that
instead of using nvme_clear_queues.

Note, we changed nvme_queue_cancel_ios name to nvme_cancel_io
as there is no concept of a queue now in this function (we
also lost the print).

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 15:07:15 -06:00
Keith Busch
21f033f7c7 NVMe: Skip async events for degraded controllers
If the controller is degraded, the driver should stay out of the way so
the user can recover the drive. This patch skips driver initiated async
event requests when the drive is in this state.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Ming Lin
8093f7ca73 nvme: add helper nvme_setup_cmd()
This moves nvme_setup_{flush,discard,rw} calls into a common
nvme_setup_cmd() helper. So we can eventually hide all the command
setup in the core module and don't even need to update the fabrics
drivers for any specific command type.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Ming Lin
03b5929ebb nvme: rewrite discard support
This rewrites nvme_setup_discard() with blk_add_request_payload().
It allocates only the necessary amount(16 bytes) for the payload.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Ming Lin
58b4560275 nvme: add helper nvme_map_len()
The helper returns the number of bytes that need to be mapped
using PRPs/SGL entries.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Ming Lin
2e39e0f608 nvme: add missing lock nesting notation
When unloading driver, nvme_disable_io_queues() calls nvme_delete_queue()
that sends nvme_admin_delete_cq command to admin sq. So when the command
completed, the lock acquired by nvme_irq() actually belongs to admin queue.

While the lock that nvme_del_cq_end() trying to acquire belongs to io queue.
So it will not deadlock.

This patch adds lock nesting notation to fix following report.

[  109.840952] =============================================
[  109.846379] [ INFO: possible recursive locking detected ]
[  109.851806] 4.5.0+ #180 Tainted: G            E
[  109.856533] ---------------------------------------------
[  109.861958] swapper/0/0 is trying to acquire lock:
[  109.866771]  (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme]
[  109.876535]
[  109.876535] but task is already holding lock:
[  109.882398]  (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme]
[  109.891547]
[  109.891547] other info that might help us debug this:
[  109.898107]  Possible unsafe locking scenario:
[  109.898107]
[  109.904056]        CPU0
[  109.906515]        ----
[  109.908974]   lock(&(&nvmeq->q_lock)->rlock);
[  109.913381]   lock(&(&nvmeq->q_lock)->rlock);
[  109.917787]
[  109.917787]  *** DEADLOCK ***
[  109.917787]
[  109.923738]  May be due to missing lock nesting notation
[  109.923738]
[  109.930558] 1 lock held by swapper/0/0:
[  109.934413]  #0:  (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme]
[  109.944010]
[  109.944010] stack backtrace:
[  109.948389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E   4.5.0+ #180
[  109.955734] Hardware name: Dell Inc. OptiPlex 7010/0YXT71, BIOS A15 08/12/2013
[  109.962989]  0000000000000000 ffff88011e203c38 ffffffff81383d9c ffffffff81c13540
[  109.970478]  ffffffff826711d0 ffff88011e203ce8 ffffffff810bb429 0000000000000046
[  109.977964]  0000000000000046 0000000000000000 0000000000b2e597 ffffffff81f4cb00
[  109.985453] Call Trace:
[  109.987911]  <IRQ>  [<ffffffff81383d9c>] dump_stack+0x85/0xc9
[  109.993711]  [<ffffffff810bb429>] __lock_acquire+0x19b9/0x1c60
[  109.999575]  [<ffffffff810b6d1d>] ? trace_hardirqs_off+0xd/0x10
[  110.005524]  [<ffffffff810b386d>] ? complete+0x3d/0x50
[  110.010688]  [<ffffffff810bb760>] lock_acquire+0x90/0xf0
[  110.016029]  [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme]
[  110.022418]  [<ffffffff81772afb>] _raw_spin_lock_irqsave+0x4b/0x60
[  110.028632]  [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme]
[  110.035019]  [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme]
[  110.041232]  [<ffffffff8135b485>] blk_mq_end_request+0x35/0x60
[  110.047095]  [<ffffffffc0821ad8>] nvme_complete_rq+0x68/0x190 [nvme]
[  110.053481]  [<ffffffff8135b53f>] __blk_mq_complete_request+0x8f/0x130
[  110.060043]  [<ffffffff8135b611>] blk_mq_complete_request+0x31/0x40
[  110.066343]  [<ffffffffc08209e3>] __nvme_process_cq+0x83/0x240 [nvme]
[  110.072818]  [<ffffffffc0820c35>] nvme_irq+0x25/0x50 [nvme]
[  110.078419]  [<ffffffff810cdb66>] handle_irq_event_percpu+0x36/0x110
[  110.084804]  [<ffffffff810cdc77>] handle_irq_event+0x37/0x60
[  110.090491]  [<ffffffff810d0ea3>] handle_edge_irq+0x93/0x150
[  110.096180]  [<ffffffff81012306>] handle_irq+0xa6/0x130
[  110.101431]  [<ffffffff81011abe>] do_IRQ+0x5e/0x120
[  110.106333]  [<ffffffff8177384c>] common_interrupt+0x8c/0x8c

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Keith Busch
788e15abbb NVMe: Always use MSI/MSI-x interrupts
Multiple users have reported device initialization failure due the driver
not receiving legacy PCI interrupts. This is not unique to any particular
controller, but has been observed on multiple platforms.

There have been no issues reported or observed when with message signaled
interrupts, so this patch attempts to use MSI-x during initialization,
falling back to MSI. If that fails, legacy would become the default.

The setup_io_queues error handling had to change as a result: the admin
queue's msix_entry used to be initialized to the legacy IRQ. The case
where nr_io_queues is 0 would fail request_irq when setting up the admin
queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving
the admin queue's msix_entry invalid. Instead, return success immediately.

Reported-by: Tim Muhlemmer <muhlemmer@gmail.com>
Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:44:00 -06:00
Keith Busch
9bf2b972af NVMe: Fix reset/remove race
This fixes a scenario where device is present and being reset, but a
request to unbind the driver occurs.

A previous patch series addressing a device failure removal scenario
flushed reset_work after controller disable to unblock reset_work waiting
on a completion that wouldn't occur. This isn't safe as-is. The broken
scenario can potentially be induced with:

  modprobe nvme && modprobe -r nvme

To fix, the reset work is flushed immediately after setting the controller
removing flag, and any subsequent reset will not proceed with controller
initialization if the flag is set.

The controller status must be polled while active, so the watchdog timer
is also left active until the controller is disabled to cleanup requests
that may be stuck during namespace removal.

[Fixes: ff23a2a15a]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-11 10:00:04 -06:00
Marta Rybczynska
d783e0bd02 nvme: avoid cqe corruption when update at the same time as read
Make sure the CQE phase (validity) is read before the rest of the
structure. The phase bit is the highest address and the CQE
read will happen on most platforms from lower to upper addresses
and will be done by multiple non-atomic loads. If the structure
is updated by PCI during the reads from the processor, the
processor may get a corrupted copy.

The addition of the new nvme_cqe_valid function that verifies
the validity bit also allows refactoring of the other CQE read
sequences.

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-22 10:27:29 -06:00
Matias Bjorling
9f86726843 nvme: lightnvm: return ppa completion status
PPAs sent to device is separately acknowledge in a 64bit status
variable. The status is stored in DW0 and DW1 of the completion queue
entry. Store this status inside the nvm_rq for further processing.

This can later be used to implement retry techniques for failed writes
and reads.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-18 18:10:38 -07:00
Linus Torvalds
237045fc3c Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for this merge window.  It sits
  on top of for-4.6/core, that was just sent out.

  This contains:

   - A set of fixes for lightnvm.  One from Alan, fixing an overflow,
     and the rest from the usual suspects, Javier and Matias.

   - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd
     for correct usage of the signed 64-bit divider.

   - A set of bug fixes for the Micron mtip32xx, from Asai.

   - A fix for the brd discard handling from Bart.

   - Update the maintainers entry for cciss, since that hardware has
     transferred ownership.

   - Three bug fixes for bcache from Eric Wheeler.

   - Set of fixes for xen-blk{back,front} from Jan and Konrad.

   - Removal of the cpqarray driver.  It has been disabled in Kconfig
     since 2013, and we were initially scheduled to remove it in 3.15.

   - Various updates and fixes for NVMe, with the most important being:

        - Removal of the per-device NVMe thread, replacing that with a
          watchdog timer instead. From Christoph.

        - Exposing the namespace WWID through sysfs, from Keith.

        - Set of cleanups from Ming Lin.

        - Logging the controller device name instead of the underlying
          PCI device name, from Sagi.

        - And a bunch of fixes and optimizations from the usual suspects
          in this area"

* 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits)
  NVMe: Expose ns wwid through single sysfs entry
  drivers:block: cpqarray clean up
  brd: Fix discard request processing
  cpqarray: remove it from the kernel
  cciss: update MAINTAINERS
  NVMe: Remove unused sq_head read in completion path
  bcache: fix cache_set_flush() NULL pointer dereference on OOM
  bcache: cleaned up error handling around register_cache()
  bcache: fix race of writeback thread starting before complete initialization
  NVMe: Create discard zero quirk white list
  nbd: use correct div_s64 helper
  mtip32xx: remove unneeded variable in mtip_cmd_timeout()
  lightnvm: generalize rrpc ppa calculations
  lightnvm: remove struct nvm_dev->total_blocks
  lightnvm: rename ->nr_pages to ->nr_sects
  lightnvm: update closed list outside of intr context
  xen/blback: Fit the important information of the thread in 17 characters
  lightnvm: fold get bb tbl when using dual/quad plane mode
  lightnvm: fix up nonsensical configure overrun checking
  xen-blkback: advertise indirect segment support earlier
  ...
2016-03-18 17:13:31 -07:00
Keith Busch
118472ab85 NVMe: Expose ns wwid through single sysfs entry
The method to uniquely identify a namespace depends on the controller's
specification revision level and implemented capabilities. This patch
has the driver figure this out and exports the unique string through a
single 'wwid' attribute so the user doesn't have this burden.

The longest namespace unique identifier is used if available. If not
available, the driver will concat the controller's vendor, serial,
and model with the namespace ID. The specification provides this as a
unique indentifier.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-16 07:46:25 -07:00
Jon Derrick
48c7823f42 NVMe: Remove unused sq_head read in completion path
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-08 15:01:06 -07:00
Keith Busch
08095e7078 NVMe: Create discard zero quirk white list
The NVMe specification does not require discarded blocks return zeroes on
read, but provides that behavior as a possibility. Some applications more
efficiently use an SSD if reads on discarded blocks were deterministically
zero, based on the "discard_zeroes_data" queue attribute.

There is no specification defined way to determine device behavior on
discarded blocks, so the driver always left the queue setting disabled. We
can only know behavior based on individual device models, so this patch
adds a flag to the NVMe "quirk" list that vendors may set if they know
their controller works that way. The patch also sets the new flag for one
such known device.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Suggested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-08 08:32:40 -07:00
Matias Bjørling
d5bdec8ddb lightnvm: fold get bb tbl when using dual/quad plane mode
When the media manager runs in dual or quad plane mode, lightnvm
abstracts away plane specific commands. This poses a problem for
get bad block table, as it reports bad blocks per plane, making the
table either two or four times bigger than expected. Fold the bad block
list before returning.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:45:53 -07:00
Christoph Hellwig
45686b6198 nvme: fix max_segments integer truncation
The block layer uses an unsigned short for max_segments.  The way we
calculate the value for NVMe tends to generate very large 32-bit values,
which after integer truncation may lead to a zero value instead of
the desired outcome.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:43:10 -07:00
Christoph Hellwig
da35825d9a nvme: set queue limits for the admin queue
Factor out a helper to set all the device specific queue limits and apply
them to the admin queue in addition to the I/O queues.  Without this the
command size on the admin queue is arbitrarily low, and the missing
other limitations are just minefields waiting for victims.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Keith Busch
e9fc63d682 NVMe: Fix 0-length integrity payload
A user could send a passthrough IO command with a metadata pointer to a
namespace without metadata. With metadata length of 0, kmalloc returns
ZERO_SIZE_PTR. Since that is not NULL, the driver would have set this as
the bio's integrity payload, which causes an access fault on completion.

This patch ignores the users metadata buffer if the namespace format
does not support separate metadata.

Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Keith Busch
63088ec7c8 NVMe: Don't allow unsupported flags
The command flags can change the meaning of other fields in the command
that the driver is not prepared to handle. Specifically, the user could
passthrough an SGL flag, causing the controller to misinterpret the PRP
list the driver created, potentially corrupting memory or data.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Keith Busch
69d9a99c25 NVMe: Move error handling to failed reset handler
This moves failed queue handling out of the namespace removal path and
into the reset failure path, fixing a hanging condition if the controller
fails or link down during del_gendisk. Previously the driver had to see
the controller as degraded prior to calling del_gendisk to setup the
queues to fail. But, if the controller happened to fail after this,
there was no task to end outstanding requests.

On failure, all namespace states are set to dead. This has capacity
revalidate to 0, and ends all new requests with error status.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Keith Busch
f58944e265 NVMe: Simplify device reset failure
A reset failure schedules the device to unbind from the driver through
the pci driver's remove. This cleans up all intialization, so there is
no need to duplicate the potentially racy cleanup.

To help understand why a reset failed, the status is logged with the
existing warning message.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Keith Busch
646017a612 NVMe: Fix namespace removal deadlock
This patch makes nvme namespace removal lockless. It is up to the caller
to ensure no active namespace scanning is occuring. To ensure no scan
work occurs, the nvme pci driver adds a removing state to the controller
device to avoid queueing scan work during removal. The work is flushed
after setting the state, so no new scan work can be queued.

The lockless removal allows the driver to cleanup a namespace
request_queue if the controller fails during removal. Previously this
could deadlock trying to acquire the namespace mutex in order to handle
such events.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Keith Busch
075790ebba NVMe: Use IDA for namespace disk naming
A namespace may be detached from a controller, but a user may be holding
a reference to it. Attaching a new namespace with the same NSID will create
duplicate names when using the NSID to name the disk.

This patch uses an IDA that is released only when the last reference is
released instead of using the namespace ID.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Keith Busch
b00a726a9f NVMe: Don't unmap controller registers on reset
Unmapping the registers on reset or shutdown is not necessary. Keeping
the mapping simplifies reset handling.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Ming Lin
931e1c2204 nvme: expose cntlid in sysfs
For NVMe over Fabrics, the cntlid will be used by systemd/udev to
create link to the device, for example,

/dev/disk/by-path/<fabrics-info>-<cntlid>-<namespace> -> /dev/nvme0n1

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-29 12:31:27 -07:00
Christoph Hellwig
1cb3cce5eb nvme: return the whole CQE through the request passthrough interface
Both LighNVM and NVMe over Fabrics need to look at more than just the
status and result field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bj?rling <m@bjorling.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-29 08:47:17 -07:00
Christoph Hellwig
2d55cd5f51 nvme: replace the kthread with a per-device watchdog timer
The only work left in the kthread is the periodic health check for each
controller.  There is no need to run this from process context or keep
a thread context around for it, so replace it with a simpler timer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-29 08:47:16 -07:00
Christoph Hellwig
79f2b358c9 nvme: don't poll the CQ from the kthread
There is no reason to do unconditional polling of CQs per the NVMe
spec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-29 08:47:14 -07:00
Christoph Hellwig
9396dec916 nvme: use a work item to submit async event requests
Use a dedicated work item to submit async event requests instead of the
global kthread.  This simplifies the code and reduces the latencies to
resubmit a request once an even notification happened.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-29 08:47:13 -07:00
Keith Busch
f8e68a7c9a NVMe: Rate limit nvme IO warnings
We don't need to spam the kernel logs with thousands of IO cancelling
messages. We can infer all IO's are being cancelled with fewer, or
even none at all. This patch rate limits the message and uses the debug
log level as it is mainly used for testing purposes.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-12 08:10:31 -07:00
Keith Busch
ff23a2a15a NVMe: Poll device while still active during remove
A device failure or link down wouldn't have been detected during namespace
removal. This patch keeps the device in the list for polling so that the
thread may see such failure and initiate a reset. The device is removed
from the list after disable, so we can safely flush the reset work as
it can't be requeued when disable completes.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-12 08:10:16 -07:00
Keith Busch
ae1fba2001 NVMe: Requeue requests on suspended queues
It's possible a request may get to the driver after the nvme queue was
disabled. This has the request requeue if that happens.

Note the request is still "started" by the driver, but requeuing will
clear the start state for timeout handling.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-12 08:10:08 -07:00
Keith Busch
ef2d4615c5 NVMe: Allow request merges
It is generally more efficient to submit larger IO.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11 13:14:03 -07:00
Keith Busch
4f76d0e498 NVMe: Fix io incapable return values
The function returns true when the controller can't handle IO.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11 13:14:02 -07:00
Ming Lin
576d55d625 nvme: split pci module out of core module
NVMe over Fabrics drivers are going to reuse the core,
so splits nvme.ko into 2 modules:

nvme-core.ko: the core part
nvme.ko: the PCI driver

Export symbols from nvme-core.ko.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-10 14:22:38 -07:00
Ming Lin
9f2482b91b nvme: split dev_list_lock
Split dev_list_lock into one in the core and one in the PCI driver.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-10 14:22:36 -07:00
Ming Lin
ba0ba7d3e5 nvme: move timeout variables to core.c
These variables are used by PCI driver and will also be used in the
forthcoming NVMe over Fabrics drivers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-10 14:22:34 -07:00
Sagi Grimberg
e439bb12e7 nvme/host: reference the fabric module for each bdev open callout
We don't want to be able to unload the fabric driver when we have
openened referenced to our namespaces. Thus, for each nvme_open we
take a reference on the fabric driver and put it in nvme_release.
This behavior is consistent with the scsi model.

This resolves the panic when unloading a fabric module with
mpath holders.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ian Bakshan <ianb@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-10 14:22:32 -07:00
Sagi Grimberg
1b3c47c182 nvme: Log the ctrl device name instead of the underlying pci device name
Having the ctrl name "nvmeX" seems much more friendly than
the underlying device name. Also, with other nvme transports
such as the soon to come nvme-loop we don't have an underlying
device so it doesn't makes sense to make up one.

In order to help matching an instance name to a pci function,
we add a info print in nvme_probe.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Keith Busch <keith.busch@intel.com>

Manually fixed up the hunk in nvme_cancel_queue_ios().

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-10 08:51:15 -07:00
Christoph Hellwig
f4f0f63e6f nvme: fix drvdata setup for the nvme device
Pass the right private data to device_create_with_groups from the
beginning, and remove the superflous call to dev_set_drvdata.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09 12:44:03 -07:00
Keith Busch
949928c1c7 NVMe: Fix possible queue use after freed
This notifies blk-mq when the tag set contains a different number of
queues prior to freeing unused ones that the request queue points to.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09 12:42:35 -07:00