Commit Graph

841130 Commits

Author SHA1 Message Date
Jens Axboe
2ad7a0cc8f Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.3/block
Pull single MD warning fix from Song.

* 'md-next' of https://github.com/liu-song-6/linux:
  md/raid1: Fix a warning message in remove_wb()
2019-06-26 13:49:01 -06:00
Dan Carpenter
16d4b74654 md/raid1: Fix a warning message in remove_wb()
The WARN_ON() macro doesn't take an error message, it just takes a
condition.  I've changed this to use WARN(1, "...") instead.

Fixes: 3e148a3209 ("md/raid1: fix potential data inconsistency issue with write behind device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-26 12:18:48 -07:00
Paolo Valente
3726112ec7 block, bfq: re-schedule empty queues if they deserve I/O plugging
Consider, on one side, a bfq_queue Q that remains empty while in
service, and, on the other side, the pending I/O of bfq_queues that,
according to their timestamps, have to be served after Q.  If an
uncontrolled amount of I/O from the latter bfq_queues were dispatched
while Q is waiting for its new I/O to arrive, then Q's bandwidth
guarantees would be violated. To prevent this, I/O dispatch is plugged
until Q receives new I/O (except for a properly controlled amount of
injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
for the following reason.

Preemption is performed in two steps. First, Q is expired and
re-scheduled. Second, the new bfq_queue to serve is chosen. The first
step is needed by the second, as the second can be performed only
after Q's timestamps have been properly updated (done in the
expiration step), and Q has been re-queued for service. This
dependency is a consequence of the way how BFQ's scheduling algorithm
is currently implemented.

But Q is not re-scheduled at all in the first step, because Q is
empty. As a consequence, an uncontrolled amount of I/O may be
dispatched until Q becomes non empty again. This breaks Q's service
guarantees.

This commit addresses this issue by re-scheduling Q even if it is
empty. This in turn breaks the assumption that all scheduled queues
are non empty. Then a few extra checks are now needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:35 -06:00
Paolo Valente
96a291c38c block, bfq: preempt lower-weight or lower-priority queues
BFQ enqueues the I/O coming from each process into a separate
bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
served for at most timeout_sync milliseconds (default: 125 ms). This
service scheme is prone to the following inaccuracy.

While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
receive I/O, and, according to BFQ's scheduling policy, may become the
right bfq_queue to serve, in place of the currently in-service
bfq_queue. In this respect, postponing the service of Q2 to after the
service of Q1 finishes may delay the completion of Q2's I/O, compared
with an ideal service in which all non-empty bfq_queues are served in
parallel, and every non-empty bfq_queue is served at a rate
proportional to the bfq_queue's weight. This additional delay is equal
at most to the time Q1 may unjustly remain in service before switching
to Q2.

If Q1 and Q2 have the same weight, then this time is most likely
negligible compared with the completion time to be guaranteed to Q2's
I/O. In addition, first, one of the reasons why BFQ may want to serve
Q1 for a while is that this boosts throughput and, second, serving Q1
longer reduces BFQ's overhead. As a conclusion, it is usually better
not to preempt Q1 if both Q1 and Q2 have the same weight.

In contrast, as Q2's weight or priority becomes higher and higher
compared with that of Q1, the above delay becomes larger and larger,
compared with the I/O completion times that have to be guaranteed to
Q2 according to Q2's weight. So reducing this delay may be more
important than avoiding the costs of preempting Q1.

Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
triggers a new choice of the next bfq_queue to serve. If Q2 really is
the next bfq_queue to serve, then Q2 will be set in service
immediately.

This change reduces the component of the I/O latency caused by the
above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
a process doing sporadic random reads while another process is doing
continuous sequential reads.

Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:35 -06:00
Paolo Valente
13a857a4c4 block, bfq: detect wakers and unconditionally inject their I/O
A bfq_queue Q may happen to be synchronized with another
bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
receive new I/O. We call Q2 "waker queue".

If I/O plugging is being performed for Q, and Q is not receiving any
more I/O because of the above synchronization, then, thanks to BFQ's
injection mechanism, the waker queue is likely to get served before
the I/O-plugging timeout fires.

Unfortunately, this fact may not be sufficient to guarantee a high
throughput during the I/O plugging, because the inject limit for Q may
be too low to guarantee a lot of injected I/O. In addition, the
duration of the plugging, i.e., the time before Q finally receives new
I/O, may not be minimized, because the waker queue may happen to be
served only after other queues.

To address these issues, this commit introduces the explicit detection
of the waker queue, and the unconditional injection of a pending I/O
request of the waker queue on each invocation of
bfq_dispatch_request().

One may be concerned that this systematic injection of I/O from the
waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
the contrary, next Q's I/O is brought forward dramatically, for it is
not blocked for milliseconds.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
a3f9bce369 block, bfq: bring forward seek&think time update
Until the base value for request service times gets finally computed
for a bfq_queue, the inject limit for that queue does depend on the
think-time state (short|long) of the queue. A timely update of the
think time then guarantees a quicker activation or deactivation of the
injection. Fortunately, the think time of a bfq_queue is updated in
the same code path as the inject limit; but after the inject limit.

This commits moves the update of the think time before the update of
the inject limit. For coherence, it moves the update of the seek time
too.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
24792ad01c block, bfq: update base request service times when possible
I/O injection gets reduced if it increases the request service times
of the victim queue beyond a certain threshold.  The threshold, in its
turn, is computed as a function of the base service time enjoyed by
the queue when it undergoes no injection.

As a consequence, for injection to work properly, the above base value
has to be accurate. In this respect, such a value may vary over
time. For example, it varies if the size or the spatial locality of
the I/O requests in the queue change. It is then important to update
this value whenever possible. This commit performs this update.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
db599f9ed9 block, bfq: fix rq_in_driver check in bfq_update_inject_limit
One of the cases where the parameters for injection may be updated is
when there are no more in-flight I/O requests. The number of in-flight
requests is stored in the field bfqd->rq_in_driver of the descriptor
bfqd of the device. So, the controlled condition is
bfqd->rq_in_driver == 0.

Unfortunately, this is wrong because, the instruction that checks this
condition is in the code path that handles the completion of a
request, and, in particular, the instruction is executed before
bfqd->rq_in_driver is decremented in such a code path.

This commit fixes this issue by just replacing 0 with 1 in the
comparison.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
766d61412e block, bfq: reset inject limit when think-time state changes
Until the base value of the request service times gets finally
computed for a bfq_queue, the inject limit does depend on the
think-time state (short|long). The limit must be 0 or 1 if the think
time is deemed, respectively, as short or long. However, such a check
and possible limit update is performed only periodically, once per
second. So, to make the injection mechanism much more reactive, this
commit performs the update also every time the think-time state
changes.

In addition, in the following special case, this commit lets the
inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
think time is short: bfqq's I/O is synchronized with that of some
other queue, i.e., bfqq may receive new I/O only after the I/O of the
other queue is completed. Keeping the inject limit to 1 allows the
blocking I/O to be served while bfqq is in service. And this is very
convenient both for bfqq and for the total throughput, as explained
in detail in the comments in bfq_update_has_short_ttime().

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Jens Axboe
6b2c8e522c Merge branch 'nvme-5.3' of git://git.infradead.org/nvme into for-5.3/block
Pull NVMe updates from Christoph:

"A large chunk of NVMe updates for 5.3.  Highlights:

 - improved PCIe suspent support (Keith Busch)
 - error injection support for the admin queue (Akinobu Mita)
 - Fibre Channel discovery improvements (James Smart)
 - tracing improvements including nvmetc tracing support (Minwoo Im)
 - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
   Kulkarni)"

* 'nvme-5.3' of git://git.infradead.org/nvme: (26 commits)
  Documentation: nvme: add an example for nvme fault injection
  nvme: enable to inject errors into admin commands
  nvme: prepare for fault injection into admin commands
  nvmet: introduce target-side trace
  nvme-trace: print result and status in hex format
  nvme-trace: support for fabrics commands in host-side
  nvme-trace: move opcode symbol print to nvme.h
  nvme-trace: do not export nvme_trace_disk_name
  nvme-pci: clean up nvme_remove_dead_ctrl a bit
  nvme-pci: properly report state change failure in nvme_reset_work
  nvme-pci: set the errno on ctrl state change error
  nvme-pci: adjust irq max_vector using num_possible_cpus()
  nvme-pci: remove queue_count_ops for write_queues and poll_queues
  nvme-pci: remove unnecessary zero for static var
  nvme-pci: use host managed power state for suspend
  nvme: introduce nvme_is_fabrics to check fabrics cmd
  nvme: export get and set features
  nvme: fix possible io failures when removing multipathed ns
  nvme-fc: add message when creating new association
  lpfc: add sysfs interface to post NVME RSCN
  ...
2019-06-24 10:10:35 -06:00
Akinobu Mita
7e31d8215f Documentation: nvme: add an example for nvme fault injection
This adds an example of how to inject errors into admin commands.

Suggested-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:50 +02:00
Akinobu Mita
f79d5fda4e nvme: enable to inject errors into admin commands
This enables to inject errors into the commands submitted to the admin
queue.

It is useful to test error handling in the controller initialization.

	# echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
	# echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
	# echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
	# nvme reset /dev/nvme0
	# dmesg
	...
	nvme nvme0: Could not set queue count (16385)
	nvme nvme0: IO queues not created

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:50 +02:00
Akinobu Mita
a3646451ed nvme: prepare for fault injection into admin commands
Currenlty fault injection support for nvme only enables to inject errors
into the commands submitted to I/O queues.

In preparation for fault injection into the admin commands, this makes
the helper functions independent of struct nvme_ns.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:50 +02:00
Minwoo Im
a5448fdc46 nvmet: introduce target-side trace
This patch introduces target-side request tracing.  As Christoph
suggested, the trace would not be in a core or module to avoid
disadvantages like cache miss:
  http://lists.infradead.org/pipermail/linux-nvme/2019-June/024721.html

The target-side trace code is entirely based on the Johannes's trace code
from the host side.  It has lots of codes duplicated, but it would be
better than having advantages mentioned above.

It also traces not only fabrics commands, but also nvme normal commands.
Once the codes to be shared gets bigger, then we can make it common as
suggsted.

This also removed the create_sq and create_cq trace parsing functions
because it will be done by the connect fabrics command.

Example:
  echo 1 > /sys/kernel/debug/tracing/event/nvmet/nvmet_req_init/enable
  echo 1 > /sys/kernel/debug/tracing/event/nvmet/nvmet_req_complete/enable
  cat /sys/kernel/debug/tracing/trace

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
[hch: fixed the symbol namespace and a an endianess conversion]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:46 +02:00
Geert Uytterhoeven
2f5af4ab7d lightnvm: fix uninitialized pointer in nvm_remove_tgt()
With gcc 4.1:

    drivers/lightnvm/core.c: In function ‘nvm_remove_tgt’:
    drivers/lightnvm/core.c:510: warning: ‘t’ is used uninitialized in this function

Indeed, if no NVM devices have been registered, t will be an
uninitialized pointer, and may be dereferenced later.  A call to
nvm_remove_tgt() can be triggered from userspace by issuing the
NVM_DEV_REMOVE ioctl on the lightnvm control device.

Fix this by preinitializing t to NULL.

Fixes: 843f2edbdd ("lightnvm: do not remove instance under global lock")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-21 03:14:30 -06:00
Heiner Litz
510fd8ea98 lightnvm: pblk: fix freeing of merged pages
bio_add_pc_page() may merge pages when a bio is padded due to a flush.
Fix iteration over the bio to free the correct pages in case of a merge.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-21 03:14:29 -06:00
Minwoo Im
5f965f4fd9 nvme-trace: print result and status in hex format
The "result" field is in 64bit to be printed out which means it could be
like:
  nvme_complete_rq: nvme0: qid=0, cmdid=0, res=18446612684158962624, etries=0, flags=0x0, status=0

Switch both the result and status field to be printed in hexadecimal
format to be easier to read.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:12:37 +02:00
Minwoo Im
ad795e47cd nvme-trace: support for fabrics commands in host-side
This patch introduces fabrics commands tracing feature from host-side.
This patch does not include any changes for the previous host-side
tracing, but just add fabrics commands parsing in cmd=() format.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
[hch: fixed some whitespace damage]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:12:22 +02:00
Minwoo Im
26f2990d85 nvme-trace: move opcode symbol print to nvme.h
The following patches are going to provide the target-side trace which
might need these kind of macros.  It would be great if it can be shared
between host and target side both.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:12:19 +02:00
Minwoo Im
7183a46a48 nvme-trace: do not export nvme_trace_disk_name
nvme_trace_disk_name() is now already being invoked with the function
prototype in trace.h.  We don't need to export this symbol at all.

The following patches are going to provide target-side trace feature
with the exactly same function with this so that this patch removes the
EXPORT_SYMBOL() for this function.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:11:56 +02:00
Chaitanya Kulkarni
7c1ce408eb nvme-pci: clean up nvme_remove_dead_ctrl a bit
Remove the status parameter o nvme_remove_dead_ctrl(), which is only
used for printing it.

We move the print message to the same function where actual error is
occurring.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:39 +02:00
Minwoo Im
cee6c269b0 nvme-pci: properly report state change failure in nvme_reset_work
If the state change to NVME_CTRL_CONNECTING fails, the dmesg is going to
be like:

  [  293.689160] nvme nvme0: failed to mark controller CONNECTING
  [  293.689160] nvme nvme0: Removing after probe failure status: 0

Even it prints the first line to indicate the situation, the second line
is not proper because the status is 0 which means normally success of
the previous operation.

This patch makes it indicate the proper error value when it fails.
  [   25.932367] nvme nvme0: failed to mark controller CONNECTING
  [   25.932369] nvme nvme0: Removing after probe failure status: -16

This situation is able to be easily reproduced by:
  root@target:~# rmmod nvme && modprobe nvme && rmmod nvme

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:39 +02:00
Chaitanya Kulkarni
e71afda493 nvme-pci: set the errno on ctrl state change error
This patch removes the confusing assignment of the variable result at
the time of declaration and sets the value in error cases next to the
places where the actual error is happening.

Here we also set the result value to -ENODEV when we fail at the final
ctrl state transition in nvme_reset_work(). Without this assignment
result will hold 0 from nvme_setup_io_queue() and on failure 0 will be
passed to he nvme_remove_dead_ctrl() from final state transition.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Minwoo Im
dad77d6390 nvme-pci: adjust irq max_vector using num_possible_cpus()
If the "irq_queues" are greater than num_possible_cpus(),
nvme_calc_irq_sets() can have irq set_size for HCTX_TYPE_DEFAULT greater
than it can be afforded.
2039         affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;

It might cause a WARN() from the irq_build_affinity_masks() like [1]:
220         if (nr_present < numvecs)
221                 WARN_ON(nr_present + nr_others < numvecs);

This patch prevents it from the WARN() by adjusting the max_vector value
from the nvme_setup_irqs().

[1] WARN messages when modprobe nvme write_queues=32 poll_queues=0:
root@target:~/nvme# nproc
8
root@target:~/nvme# modprobe nvme write_queues=32 poll_queues=0
[   17.925326] nvme nvme0: pci function 0000:00:04.0
[   17.940601] WARNING: CPU: 3 PID: 1030 at kernel/irq/affinity.c:221 irq_create_affinity_masks+0x222/0x330
[   17.940602] Modules linked in: nvme nvme_core [last unloaded: nvme]
[   17.940605] CPU: 3 PID: 1030 Comm: kworker/u17:4 Tainted: G        W         5.1.0+ #156
[   17.940605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[   17.940608] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[   17.940609] RIP: 0010:irq_create_affinity_masks+0x222/0x330
[   17.940611] Code: 4c 8d 4c 24 28 4c 8d 44 24 30 e8 c9 fa ff ff 89 44 24 18 e8 c0 38 fa ff 8b 44 24 18 44 8b 54 24 1c 5a 44 01 d0 41 39 c4 76 02 <0f> 0b 48 89 df 44 01 e5 e8 f1 ce 10 00 48 8b 34 24 44 89 f0 44 01
[   17.940611] RSP: 0018:ffffc90002277c50 EFLAGS: 00010216
[   17.940612] RAX: 0000000000000008 RBX: ffff88807ca48860 RCX: 0000000000000000
[   17.940612] RDX: ffff88807bc03800 RSI: 0000000000000020 RDI: 0000000000000000
[   17.940613] RBP: 0000000000000001 R08: ffffc90002277c78 R09: ffffc90002277c70
[   17.940613] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000020
[   17.940614] R13: 0000000000025d08 R14: 0000000000000001 R15: ffff88807bc03800
[   17.940614] FS:  0000000000000000(0000) GS:ffff88807db80000(0000) knlGS:0000000000000000
[   17.940616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.940617] CR2: 00005635e583f790 CR3: 000000000240a000 CR4: 00000000000006e0
[   17.940617] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   17.940618] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   17.940618] Call Trace:
[   17.940622]  __pci_enable_msix_range+0x215/0x540
[   17.940623]  ? kernfs_put+0x117/0x160
[   17.940625]  pci_alloc_irq_vectors_affinity+0x74/0x110
[   17.940626]  nvme_reset_work+0xc30/0x1397 [nvme]
[   17.940628]  ? __switch_to_asm+0x34/0x70
[   17.940628]  ? __switch_to_asm+0x40/0x70
[   17.940629]  ? __switch_to_asm+0x34/0x70
[   17.940630]  ? __switch_to_asm+0x40/0x70
[   17.940630]  ? __switch_to_asm+0x34/0x70
[   17.940631]  ? __switch_to_asm+0x40/0x70
[   17.940632]  ? nvme_irq_check+0x30/0x30 [nvme]
[   17.940633]  process_one_work+0x20b/0x3e0
[   17.940634]  worker_thread+0x1f9/0x3d0
[   17.940635]  ? cancel_delayed_work+0xa0/0xa0
[   17.940636]  kthread+0x117/0x120
[   17.940637]  ? kthread_stop+0xf0/0xf0
[   17.940638]  ret_from_fork+0x3a/0x50
[   17.940639] ---[ end trace aca8a131361cd42a ]---
[   17.942124] nvme nvme0: 7/1/0 default/read/poll queues

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Minwoo Im
483178f38c nvme-pci: remove queue_count_ops for write_queues and poll_queues
queue_count_set() seems like that it has been provided to limit the
number of queue entries for write/poll queues.  But, the
queue_count_set() has been doing nothing but a parameter check even it
has num_possible_cpus() which is nop.

This patch removes entire queue_count_ops from the write_queues and
poll_queues.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Minwoo Im
a232ea0ebf nvme-pci: remove unnecessary zero for static var
poll_queues will be zero even without zero initialization here.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Keith Busch
d916b1be94 nvme-pci: use host managed power state for suspend
The nvme pci driver prepares its devices for power loss during suspend
by shutting down the controllers. The power setting is deferred to
pci driver's power management before the platform removes power. The
suspend-to-idle mode, however, does not remove power.

NVMe devices that implement host managed power settings can achieve
lower power and better transition latencies than using generic PCI power
settings. Try to use this feature if the platform is not involved with
the suspend. If successful, restore the previous power state on resume.

Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: fixed the compilation for the !CONFIG_PM_SLEEP case]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Minwoo Im
7a1f46e3f7 nvme: introduce nvme_is_fabrics to check fabrics cmd
This patch introduces a nvme_is_fabrics() inline function to check
whether or not the given command structure is for fabrics.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Keith Busch
1a87ee657c nvme: export get and set features
Future use intends to make use of both, so export these functions. And
since their implementation is identical except for the opcode, provide a
new function that implement both.

[akinobu.mita@gmail.com>: fix line over 80 characters]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Anton Eidelman
2181e45561 nvme: fix possible io failures when removing multipathed ns
When a shared namespace is removed, we call blk_cleanup_queue()
when the device can still be accessed as the current path and this can
result in submission to a dying queue. Hence, direct_make_request()
called by our mpath device may fail (propagating the failure to userspace).
Instead, we want to failover this I/O to a different path if one exists.
Thus, before we cleanup the request queue, we make sure that the device is
cleared from the current path nor it can be selected again as such.

Fix this by:
- clear the ns from the head->list and synchronize rcu to make sure there is
  no concurrent path search that restores it as the current path
- clear the mpath current path in order to trigger a subsequent path search
  and sync srcu to wait for any ongoing request submissions
- safely continue to namespace removal and blk_cleanup_queue

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
James Smart
4bea364f16 nvme-fc: add message when creating new association
When looking at console messages to troubleshoot, there are one
maybe two messages before creation of the controller is complete.
However, a lot of io takes place to reach that point. It's unclear
when things have started.

Add a message when the controller is attempting to create a new
association. Thus we know what controller, between what host and
remote port, and what NQN is being put into place for any
subsequent success or failure messages.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Giridhar Malavali <gmalavali@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
James Smart
41b194b843 lpfc: add sysfs interface to post NVME RSCN
To support scenarios which aren't bound to nvmetcli add port scenarios,
which is currently where the nvmet_fc transport invokes the discovery
event callbacks, a syfs attribute is added to lpfc which can be written
to cause an RSCN to be generated for the nport.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
James Smart
6f2589f478 lpfc: add support for translating an RSCN rcv into a discovery rescan
This patch updates RSCN receive processing to check for the remote
port being an NVME port, and if so, invoke the nvme_fc callback to
rescan the remote port.  The rescan will generate a discovery udev
event.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
James Smart
ab723121a8 lpfc: add nvmet discovery_event op support
This patch adds support for the nvmet discovery op. When the callback
routine is called, the driver will call the routine to generate an RSCN
to the port on the other end of the link.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:37 +02:00
James Smart
f60cb93bbf lpfc: add support to generate RSCN events for nport
This patch adds general RSCN support:

 - The ability to transmit an RSCN to the port on the other end of
   the link (regular port if pt2pt, or fabric controller if fabric).
 - And general recognition of an RSCN ELS when an ELS is received.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:37 +02:00
James Smart
4cf7c363b4 nvme-fcloop: add support for nvmet discovery_event op
Update fcloop to support the discovery_event operation and
invoke a nvme rescan. In a real fc adapter, this would generate an
RSCN, which the host would receive and convert into a nvme rescan
on the remote port specified in the rscn payload.

Signed-off-by: James Smart <jsmart2021@gmail.com>
[kbuild-bot: fcloop_tgt_discovery_evt can be static]
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:37 +02:00
James Smart
150d71f725 nvmet-fc: add transport discovery change event callback support
This patch adds support for the nvmet discovery_change transport op.
In turn, the transport adds it's own LLDD api callback discovery_event
op to request the LLDD to generate an RSCN for the discovery change.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:37 +02:00
James Smart
9d09dd8d76 nvmet: add transport discovery change op
Some transports, such as FC-NVME, support discovery controller change
events without the use of a persistent discovery controller. FC receives
events via RSCN from the FC Fabric Controller or subsystem FC port.

This patch adds a nvmet transport op that is called whenever a
discovery change event occurs in the nvmet layer.

To facilitate the callback without adding another layer to cross into
core.c to reference the transport ops, the port structure snapshots
the transport ops when the port is enabled and clears them when disabled.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:37 +02:00
Christoph Hellwig
474a280036 cgroup: export css_next_descendant_pre for bfq
The bfq schedule now uses css_next_descendant_pre directly after
the stats functionality depending on it has been from the core
blk-cgroup code to bfq.  Export the symbol so that bfq can still
be build modular.

Fixes: d6258980da ("bfq-iosched: move bfq_stat_recursive_sum into the only caller")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-21 02:48:34 -06:00
Jens Axboe
8d54094ef4 Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.3/block
Pull MD changes from Song.

* 'md-next' of https://github.com/liu-song-6/linux:
  md: add bitmap_abort label in md_run
  md-bitmap: create and destroy wb_info_pool with the change of bitmap
  md-bitmap: create and destroy wb_info_pool with the change of backlog
  md: introduce mddev_create/destroy_wb_pool for the change of member device
  md/raid1: fix potential data inconsistency issue with write behind device
2019-06-20 23:37:20 -06:00
Guoqing Jiang
d494549ac8 md: add bitmap_abort label in md_run
Now, there are two places need to consider about
the failure of destroy bitmap, so move the common
part between bitmap_abort and abort label.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
617b194a13 md-bitmap: create and destroy wb_info_pool with the change of bitmap
The write-behind attribute is part of bitmap, since bitmap
can be added/removed dynamically with the following.

1. mdadm --grow /dev/md0 --bitmap=none
2. mdadm --grow /dev/md0 --bitmap=internal --write-behind

So we need to destroy wb_info_pool in md_bitmap_destroy,
and create the pool before load bitmap.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
10c92fca63 md-bitmap: create and destroy wb_info_pool with the change of backlog
Since we can enable write-behind mode by write backlog node,
so create wb_info_pool if the mode is just enabled, also call
call md_bitmap_update_sb to make user aware the write-behind
mode is enabled. Conversely, wb_info_pool should be destroyed
when write-behind mode is disabled.

Beside above, it is better to update bitmap sb if we change
the number of max_write_behind.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
963c555e75 md: introduce mddev_create/destroy_wb_pool for the change of member device
Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
3e148a3209 md/raid1: fix potential data inconsistency issue with write behind device
For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:35:59 -07:00
Chaitanya Kulkarni
0ce353794b f2fs: use block layer helper for show_bio_op macro
Adjust the f2fs tracing code to use newly introduced block layer
function blk_op_str() which converts the REQ_OP_XXX into the string
XXX.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:52 -06:00
Chaitanya Kulkarni
b0e5168a77 block: update print_req_error()
Improve the print_req_error with additional request fields which are
helpful for debugging. Use newly introduced blk_op_str() to print the
REQ_OP_XXX in the string format.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni
874c893bf0 block: use blk_op_str() in blk-mq-debugfs.c
Now that we've a helper function blk_op_str() to convert the
REQ_OP_XXX to string XXX, adjust the code to use that. Get rid of
the duplicate array op_name which is now present in the blk-core.c
which we renamed it to "blk_op_name" and open coding in the
blk-mq-debugfs.c.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni
e47bc4eda9 block: add centralize REQ_OP_XXX to string helper
In order to centralize the REQ_OP_XXX to string conversion which can be
used in the block layer and different places in the kernel like f2fs,
this patch adds a new helper function along with an array similar to the
one present in the blk-mq-debugfs.c.

We keep this helper functionality centralize under blk-core.c instead of
blk-mq-debugfs.c since blk-core.c is configured using CONFIG_BLOCK and
it will not be dependent on blk-mq-debugfs.c which is configured using
CONFIG_BLK_DEBUG_FS.

Next patch adjusts the code in the blk-mq-debugfs.c with newly
introduced helper.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Christoph Hellwig
178cc590e5 block: improve print_req_error
Print the calling function instead of print_req_error as a prefix, and
print the operation and op_flags separately instead of the whole field.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00