When we send PDU data, we want to optimize the tcp stack
operation if we have more data to send. So when we set MSG_MORE
when:
- We have more fragments coming in the batch, or
- We have a more data to send in this PDU
- We don't have a data digest trailer
- We optimize with the SUCCESS flag and omit the NVMe completion
(used if sq_head pointer update is disabled)
This addresses a regression in QD=1 with SUCCESS flag optimization
as we unconditionally set MSG_MORE when we didn't actually have
more data to send.
Fixes: 7058329538 ("nvmet-tcp: implement C2HData SUCCESS optimization")
Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The timeout of identify cmd, which is invoked as part of admin queue
creation, can result in freeing of async event data both in
nvme_rdma_timeout handler and error handling path of
nvme_rdma_configure_admin queue thus causing NULL pointer reference.
Call Trace:
? nvme_rdma_setup_ctrl+0x223/0x800 [nvme_rdma]
nvme_rdma_create_ctrl+0x2ba/0x3f7 [nvme_rdma]
nvmf_dev_write+0xa54/0xcc6 [nvme_fabrics]
__vfs_write+0x1b/0x40
vfs_write+0xb2/0x1b0
ksys_write+0x61/0xd0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x60/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reviewed-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Completions need to consumed in the same order the controller submitted
them, otherwise future completion entries may overwrite ones we haven't
handled yet. Hold the nvme queue's poll lock while completing new CQEs to
prevent another thread from freeing command tags for reuse out-of-order.
Fixes: dabcefab45 ("nvme: provide optimized poll function for separate poll queues")
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5RXlAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpt6LD/9+QcLDwom/y76ZKM+sqFUd1cIozbuVbP0E
35vrMuANLryLhKXxo5rZ2fqIyuhD8IzvvcztLP53aGdi8NEfiDK6KdUz4rmioA8I
htVxiAXf5jNFM/6eVP3APKn5H7QG4Q0CiLiXMrSajbxgZh70CtsBsZODxmHSO9L5
+H3ew6rCcE29U+bOsSYDTwPRMGWOX8FBKfgX8KWPql5uqzNAklA8TUizv7H97NIj
wuFbeIiT/JfypFh7ahu6k2JNE+gB6Fchbbt3I52/8tjWWMsTkrgz2bj0JjrFkrsj
wlhyv5aK8p7QSpPDtqFZN2W4pT5IaqWrgjbxR4KEpCu0G3f80LiHD3Ck8taJ311V
4dSd7oQ+XpeEa7S/iWGDc90pQbbqCJN9NiniFBs/aTxu368leKDmP7YTnNfaNqcR
9mpgzgK+Jcz/w2ub0L8LYBmZLhoDdqZFAjrAzFDq+pJWZrTbtzw1HGBMXf8KZqhs
bdNiITD1lVW8h2v7FoalrCEmn7vfsteV7eFPJ96DPBVAUp3yBJsZUk1q4iPcX3xG
gGIU0DdDOPTZoCfsLlHR4mfG3mW7QkAI50MNECj9Qe0h53TutO5GVbLOgHzwe1nu
p6irFNyCiFJ9HkFfUAUG1D8xUafb5i0Xvvnf8TLbkgImlZKTK+RuSfGsMlHLlZnm
80OCm8s8Lg==
=s3i8
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Just a set of NVMe fixes via Keith"
* tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
nvme-multipath: Fix memory leak with ana_log_buf
nvme: Fix uninitialized-variable warning
nvme-pci: Use single IRQ vector for old Apple models
nvme/pci: Add sleep quirk for Samsung and Toshiba drives
kmemleak reports a memory leak with the ana_log_buf allocated by
nvme_mpath_init():
unreferenced object 0xffff888120e94000 (size 8208):
comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000e2360188>] kmalloc_order+0x97/0xc0
[<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100
[<00000000f50c0406>] __kmalloc+0x24c/0x2d0
[<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0
[<000000005802589e>] nvme_init_identify+0x75f/0x1600
[<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280
[<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710
[<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9
[<000000004199f8d0>] __vfs_write+0x50/0xa0
[<0000000065466fef>] vfs_write+0xf3/0x280
[<00000000b0db9a8b>] ksys_write+0xc6/0x160
[<0000000082156b91>] __x64_sys_write+0x43/0x50
[<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0
[<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
nvme_mpath_init() is called by nvme_init_identify() which is called in
multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
means nvme_mpath_init() may be called multiple times before
nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).
When nvme_mpath_init() is called multiple times, it overwrites the
ana_log_buf pointer with a new allocation, thus leaking the previous
allocation.
To fix this, free ana_log_buf before allocating a new one.
Fixes: 0d0b660f21 ("nvme: add ANA support")
Cc: <stable@vger.kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
gcc may detect a false positive on nvme using an unintialized variable
if setting features fails. Since this is not a fast path, explicitly
initialize this variable to suppress the warning.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
People reported that old Apple machines are not working properly
if the non-first IRQ vector is in use.
Set quirk for that models to limit IRQ to use first vector only.
Based on original patch by GitHub user npx001.
Link: https://github.com/Dunedan/mbp-2016-linux/issues/9
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Leif Liddy <leif.liddy@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo
C640 platform experience runtime resume issues when the SSDs are kept in
sleep/suspend mode for long time.
This patch applies the 'Simple Suspend' quirk to these configurations.
With this patch, the issue had not been observed in a 1+ day test.
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shyjumon N <shyjumon.n@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5JdgMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg6eEAC7/f/ATcrAbQUjx8vsNS0UgtYB1LPdgga6
7MwYKnrzdWWZICOxOgRR4W//wXFRf3vUtmzF5MgGpzJzgeKuyv09Vz1PLcxkAcny
hu9NTQWu6xtxEGiYlfaSW7EQBRdb1876kzOyV+hTrkvQZ9CXcIAQHwjQPKqjqdlb
/j3l5GSjyO0npvsCqIWrsNoeSwfDOhFi+I3hHZutt3T0fPnjo6DGtXN7m4jhWzW/
U3S4yHxmLVDKzorDDMoTV2D8UGrpjQmRXD78QOpOJKO7ngr9OT69dcMH6TnDkUDb
a7O5l5k5DB+0QOQWk5IHJpMRRp3NdVHgnZhTJZaho8kuKoTLwteJFAprJTzHuuvC
BkTBYf1Ref9KT5YfaUcWI+rmq32LgRq8rRaJBF+vqGcOwJw6WRIuygGyZ8eJMItb
oOhsFEOKKDAILncxg5C61v2Sh4g+/YpQojyVCzJSb7UNjOMtDPgjEp2dDSa+/p84
detTqmlt5ZPYHFvsp/EHuGB6ohscsvKzZk+wlQvj4Wq3T7agSp9uljfiTmUdsmWn
69fs66HBvd3CNoOauSVXvI+rhdsgTMX0ptnjIiPYwE0RQtxptp+rPjOfuT4JdboM
AU8f0VKeer1kRPxVbka9YwgVrUw5JqgFkPu2aC9B+ao+4IAM96n6lJje/rC59zkf
eBclcnOlsQ==
=jy10
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Not a lot here, which is great, basically just three small bcache
fixes from Coly, and four NVMe fixes via Keith"
* tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
nvme/pci: move cqe check after device shutdown
nvme: prevent warning triggered by nvme_stop_keep_alive
nvme/tcp: fix bug on double requeue when send fails
bcache: remove macro nr_to_fifo_front()
bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
bcache: ignore pending signals when creating gc and allocator thread
nvme fw-activate operation will get bellow warning log,
fix it by update the parameter order
[ 113.231513] nvme nvme0: Get FW SLOT INFO log error
Fixes: 0e98719b0e ("nvme: simplify the API for getting log pages")
Reported-by: Sujith Pandel <sujith_pandel@dell.com>
Reviewed-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many users have reported nvme triggered irq_startup() warnings during
shutdown. The driver uses the nvme queue's irq to synchronize scanning
for completions, and enabling an interrupt affined to only offline CPUs
triggers the alarming warning.
Move the final CQE check to after disabling the device and all
registered interrupts have been torn down so that we do not have any
IRQ to synchronize.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Delayed keep alive work is queued on system workqueue and may be cancelled
via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.
Check_flush_dependency detects mismatched attributes between the work-queue
context used to cancel the keep alive work and system-wq. Specifically
system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
to cancel keep alive work have WQ_MEM_RECLAIM flag.
Example warning:
workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]
To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.
However this creates a secondary concern where work and a request to cancel
that work may be in the same work queue - namely err_work in the rdma and
tcp transports, which will want to flush/cancel the keep alive work which
will now be on nvme_wq.
After reviewing the transports, it looks like err_work can be moved to
nvme_reset_wq. In fact that aligns them better with transition into
RESETTING and performing related reset work in nvme_reset_wq.
Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.
Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When nvme_tcp_io_work() fails to send to socket due to
connection close/reset, error_recovery work is triggered
from nvme_tcp_state_change() socket callback.
This cancels all the active requests in the tagset,
which requeues them.
The failed request, however, was ended and thus requeued
individually as well unless send returned -EPIPE.
Another return code to be treated the same way is -ECONNRESET.
Double requeue caused BUG_ON(blk_queued_rq(rq))
in blk_mq_requeue_request() from either the individual requeue
of the failed request or the bulk requeue from
blk_mq_tagset_busy_iter(, nvme_cancel_request, );
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem
H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n
KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9
sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N
oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M
1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6
K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U
xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2
5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ
/xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b
5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM
SitalHw0KA==
=05d8
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"Some later arrivals, but all fixes at this point:
- bcache fix series (Coly)
- Series of BFQ fixes (Paolo)
- NVMe pull request from Keith with a few minor NVMe fixes
- Various little tweaks"
* tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
nvmet: update AEN list and array at one place
nvmet: Fix controller use after free
nvmet: Fix error print message at nvmet_install_queue function
brd: check and limit max_part par
nvme-pci: remove nvmeq->tags
nvmet: fix dsm failure when payload does not match sgl descriptor
nvmet: Pass lockdep expression to RCU lists
block, bfq: clarify the goal of bfq_split_bfqq()
block, bfq: get a ref to a group when adding it to a service tree
block, bfq: remove ifdefs from around gets/puts of bfq groups
block, bfq: extend incomplete name of field on_st
block, bfq: get extra ref to prevent a queue from being freed during a group move
block, bfq: do not insert oom queue into position tree
block, bfq: do not plug I/O for bfq_queues with no proc refs
bcache: check return value of prio_read()
bcache: fix incorrect data type usage in btree_flush_write()
bcache: add readahead cache policy options via sysfs interface
bcache: explicity type cast in bset_bkey_last()
bcache: fix memory corruption in bch_cache_accounting_clear()
xen/blkfront: limit allocated memory size to actual use case
...
All async events are enqueued via nvmet_add_async_event() which
updates the ctrl->async_event_cmds[] array and additionally an struct
nvmet_async_event is added to the ctrl->async_events list.
Under normal operations the nvmet_async_event_work() updates again
the ctrl->async_event_cmds and removes the corresponding struct
nvmet_async_event from the list again. Though nvmet_sq_destroy() could
be called which calls nvmet_async_events_free() which only updates the
ctrl->async_event_cmds[] array.
Add new functions nvmet_async_events_process() and
nvmet_async_events_free() to process async events, update an array and
the list.
When we destroy submission queue after clearing the aen present on
the ctrl->async list we also loop over ctrl->async_event_cmds[] for
any requests posted by the host for which we don't have the AEN in
the ctrl->async_events list by calling nvmet_async_event_process()
and nvmet_async_events_free().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
[chaitanya.kulkarni@wdc.com
* Loop over and clear out outstanding requests
* Update changelog
]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
After nvmet_install_queue() sets sq->ctrl calling to nvmet_sq_destroy()
reduces the controller refcount. In case nvmet_install_queue() fails,
calling to nvmet_ctrl_put() is done twice (at nvmet_sq_destroy and
nvmet_execute_io_connect/nvmet_execute_admin_connect) instead of once for
the queue which leads to use after free of the controller. Fix this by set
NULL at sq->ctrl in case of a failure at nvmet_install_queue().
The bug leads to the following Call Trace:
[65857.994862] refcount_t: underflow; use-after-free.
[65858.108304] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
[65858.115557] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
[65858.208141] Call Trace:
[65858.211203] nvmet_sq_destroy+0xe1/0xf0 [nvmet]
[65858.216383] nvmet_rdma_release_queue_work+0x37/0xf0 [nvmet_rdma]
[65858.223117] process_one_work+0x167/0x370
[65858.227776] worker_thread+0x49/0x3e0
[65858.232089] kthread+0xf5/0x130
[65858.235895] ? max_active_store+0x80/0x80
[65858.240504] ? kthread_bind+0x10/0x10
[65858.244832] ret_from_fork+0x1f/0x30
[65858.249074] ---[ end trace f82d59250b54beb7 ]---
Fixes: bb1cc74790 ("nvmet: implement valid sqhd values in completions")
Fixes: 1672ddb8d6 ("nvmet: Add install_queue callout")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Place the arguments in the correct order.
Fixes: 1672ddb8d6 ("nvmet: Add install_queue callout")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
There is no real need to have a pointer to the tagset in
struct nvme_queue, as we only need it in a single place, and that place
can derive the used tagset from the device and qid trivially. This
fixes a problem with stale pointer exposure when tagsets are reset,
and also shrinks the nvme_queue structure. It also matches what most
other transports have done since day 1.
Reported-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The host is allowed to pass the controller an sgl describing a buffer
that is larger than the dsm payload itself, allow it when executing
dsm.
Reported-by: Dakshaja Uppalapati <dakshaja@chelsio.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>,
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
ctrl->subsys->namespaces and subsys->namespaces are traversed with
list_for_each_entry_rcu outside an RCU read-side critical section but
under the protection of ctrl->subsys->lock and subsys->lock respectively.
Hence, add the corresponding lockdep expression to the list traversal
primitive to silence false-positive lockdep warnings, and harden RCU
lists.
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOqAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppYBD/wLczY7hyjF2loc71MC9HloUq3BVbATktM3
OF6wRbyxbeiOj/7Px0lE0M67tQbnEoIP26gS03fd6e7HE19//gmzGuB3Z2R2CJ5q
XKkTamqz0pcPcX5FdDO5JFQZf27/1Qs3g7Nkr7FjVcR2XQ8PFv5B/FLMhse4frJI
k92Sj0V1OwdNtMXozKqno/7xPwL/kQKWoF6aFDgO27xLfsFmi8Wbgf/CslOOTHIN
vAUaz3Cue6V17M5y98wD4nwpjG7Ve+aY1i6oFPBE7Az9TA0xoiBA/tNPKW7iS10C
GEP1aoI6lpgkxAzvyR29K1ayjzV11hEIig3rNIWxNfmCSGaawttWXAPEi7jU5u2D
ZXbzUJxKnfeg8yrAj0CTcKLA9i4v1cZXPCUXqMO2+wHEWgmxq2IWuWjSl/V4fn3Y
zgTPBngDM4Gx3fAqvD8SVfCW7xwI4VRP+da58WCFOjwnOgYSouxS7RnCtm+yPUbk
Es6m2XBb+3ycaJPT58LcXPrnTJWZeRincs3MfFJeTXRn5T7IzlBjKdIvQiQSHQXo
caZzWHEJW827+wfQFNreXpk5KPi+D6boeziYe96UcII8L5qVw3N0X5hOpr6IRhkX
hn2CUb/CmY6bl8PJJPVc4ygqgiavyvynJu+A0uJvFSjvXX6jjXNEsSJ6bz8aBxdm
4rmgPFTlqA==
=yJZi
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"This may be the most quiet round we've had in years. I'm not
complaining. Really not a lot to detail here, outside of spelling and
documentation improvements/fixes, we have:
- Allow t10-pi to be modular (Herbert)
- Remove dead code in bfq (Alex)
- Mark zone management requests with REQ_SYNC (Chaitanya)
- BFQ division improvement (Wen)
- Small series improving plugging (Pavel)"
* tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block:
partitions/ldm: fix spelling mistake "to" -> "too"
block, bfq: improve arithmetic division in bfq_delta()
block/bfq: remove unused bfq_class_rt which never used
block: mark zone-mgmt bios with REQ_SYNC
blk-mq: Document functions for sending request
block: Allow t10-pi to be modular
blk-mq: optimise blk_mq_flush_plug_list()
list: introduce list_for_each_continue()
blk-mq: optimise rq sort function
The existing implementation for the get_feature admin-cmd does not
use per-feature data len. This patch introduces a new helper function
nvmet_feat_data_len(), which is used to calculate per feature data len.
Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID).
Fixes: commit e9061c3978 ("nvmet: Remove the data_len field from the nvmet_req struct")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amit Engel <amit.engel@dell.com>
[endiness, naming, and kernel style fixes]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Decode interrupted command and not ready namespace nvme status codes to
BLK_STS_TARGET. These are not generic IO errors and should use a non-path
specific error so that it can use the non-failover retry path.
Reported-by: John Meneghini <John.Meneghini@netapp.com>
Cc: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently t10-pi can only be built into the block layer which via
crc-t10dif pulls in a whole chunk of the Crypto API. In fact all
users of t10-pi work as modules and there is no reason for it to
always be built-in.
This patch adds a new hidden option for t10-pi that is selected
automatically based on BLK_DEV_INTEGRITY and whether the users
of t10-pi are built-in or not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y54EQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqJuD/93LZmzS5UEWrNLkRaAaCyAy40MPxuXRZEp
42yk7cvAT4OcCr+W6nkAgG6IHGRXOz8QvOzt0P5/HfugpNlB2oz5a/6+TiTtcZTt
YNt0Z4yuBMU5SXIIxc3lUMcJGxslzOr+L+9ZXD4u5UqIdG1fSrECAexSCrlmmTwu
Fx02TakDc/bbUYDfLAQD1+/Z066rp1ZWDkjXqA4kUvbFzt8F7qEOc1Evq47SuR7d
Iw0bM3LVASXwTq2lRc1bFFL2glku6wwkccjwdyjSrQmK4+8LhF396fQGtXuj0Mrs
OzuWhaOoGhan7dpj1D8e4tqugflQy9rv9bcy6Z9PjBY+VauuFdgPr3iFcwPaPbXm
17ir4y7xJJxXlhZl/Bn06KIB2h+nLWDIaundFys5JnMmTiZvWIgSJ6Q3gWtMxgfH
zWZLMw/UtRAmjHhLqvGsMaBTfgKX5ATpMbfGeZeXheVtVaOgGTunXunT56o7oRHB
q4XWZqbydsYyHBUhgSzhBr03i67wbotxtebqg9VZ0UD8XM4iM8Kor/DleK03oUqD
DsltKF66NAGNeOcV3TNzJuXHyF6S/vZdO7JdFHY29+pdljoTj5GB88+W9CbhwQRe
WiKVpq7sAe/bh0wtqrD+QCByjSNSVU62kVgRhfqms47804j/vNqNvOKaC5UWTd0I
2LG4jfSbeg==
=hmxJ
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- stable fix for the bi_size overflow. Not a corruption issue, but a
case wher we could merge but disallowed (Andreas)
- NVMe pull request via Keith, with various fixes.
- MD pull request from Song.
- Merge window regression fix for the rq passthrough stats (Logan)
- Remove unused blkcg_drain_queue() function (Guoqing)
* tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
blk-cgroup: remove blkcg_drain_queue
block: fix NULL pointer dereference in account statistics with IDE
md: make sure desc_nr less than MD_SB_DISKS
md: raid1: check rdev before reference in raid1_sync_request func
raid5: need to set STRIPE_HANDLE for batch head
block: fix "check bi_size overflow before merge"
nvme/pci: Fix read queue count
nvme/pci Limit write queue sizes to possible cpus
nvme/pci: Fix write and poll queue types
nvme/pci: Remove last_cq_head
nvme: Namepace identification descriptor list is optional
nvme-fc: fix double-free scenarios on hw queues
nvme: else following return is not needed
nvme: add error message on mismatching controller ids
nvme_fc: add module to ops template to allow module references
nvmet-loop: Avoid preallocating big SGL for data
nvme-fc: Avoid preallocating big SGL for data
nvme-rdma: Avoid preallocating big SGL for data
Pull NVMe fixes from Keith
* 'nvme/for-5.5' of git://git.infradead.org/nvme:
nvme/pci: Fix read queue count
nvme/pci Limit write queue sizes to possible cpus
nvme/pci: Fix write and poll queue types
nvme/pci: Remove last_cq_head
nvme: Namepace identification descriptor list is optional
nvme-fc: fix double-free scenarios on hw queues
nvme: else following return is not needed
nvme: add error message on mismatching controller ids
nvme_fc: add module to ops template to allow module references
nvmet-loop: Avoid preallocating big SGL for data
nvme-fc: Avoid preallocating big SGL for data
nvme-rdma: Avoid preallocating big SGL for data
If nvme.write_queues equals the number of CPUs, the driver had decreased
the number of interrupts available such that there could only be one read
queue even if the controller could support more. Remove the interrupt
count reduction in this case. The driver wouldn't request more IRQs than
it wants queues anyway.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The driver can never use more queues of any type than the number of
possible CPUs, so a higher value causes the driver to allocate more
memory for IO queues than it could ever use. Limit the parameter at
module load time to the number of possible cpus.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The number of poll or write queues should never be negative. Use unsigned
types so that it's not possible to break have the driver not allocate
any queues.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl3leXUUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vyY3g/9FAVVdPEaadNtAhQ/zIxcjozDovKq
0q7yOA3aTBTUoNEinm88an6p0dcC4gNKtGukXmzVH2Hhxm9kLRdtpZGYY00tpLUB
9rI7XsgwwHa+hLwsHbIs507sKGFGy5FLr0ChTTGLDEMppnEvjA2hZooYmcB/OgrC
LlFcwbNKGOk/Si9u2bF2nLO0JDoVHnwzpF99saew/nqc7Lfj9e9IPZFom+VjPBUh
AOvRp2H7uBN+WQlpLeFeMDDoeXh34lX0kYqIV/cVkXVnknDGYKV2CBTg2aeX7jd0
QiPHZh6zlW8zNQgaCZRiBAbatVEOnRMRJ++yiqB8hBYp1LMXm6kJ01YSQpXkugoY
Vp9dtzzTARWV/XkKwD4brw9ZEmIDnO+Ed2x2VbUkPJVcXAvzSQWAx82IU0Iuqmcb
9qr6U2Zf/Xk5aFlGPYVH8QOG+QqzIbZNRQ7NlhDlITyW4P6QPu0mw374yYP2wDGL
sP5YSS3YGa0sQcEgDtVnd4z+WTZI4AwXLPaeaLkDhdfHp2FsERUY4TrPs33J99xw
og4EyokVFzjYzlnBPU6WWn7LL+jj5ccXkL3MA4DR4FJOnNGHh7NXfQUH56rrgsq7
F9/8shL5DuTbQkde1uSyUG9Iq/RigVLlV5DQavFm3dSXvZi0E16t5alC5URNTzk7
at8Bogn53QhlmYc=
=uUXw
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Warn if a host bridge has no NUMA info (Yunsheng Lin)
- Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
Efremov)
Resource management:
- Fix boot-time Embedded Controller GPE storm caused by incorrect
resource assignment after ACPI Bus Check Notification (Mika
Westerberg)
- Protect pci_reassign_bridge_resources() against concurrent
addition/removal (Benjamin Herrenschmidt)
- Fix bridge dma_ranges resource list cleanup (Rob Herring)
- Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
the MMIO and prefetchable MMIO window sizes of hotplug bridges
independently (Nicholas Johnson)
- Fix MMIO/MMIO_PREF window assignment that assigned more space than
desired (Nicholas Johnson)
- Only enforce bus numbers from bridge EA if the bridge has EA
devices downstream (Subbaraya Sundeep)
- Consolidate DT "dma-ranges" parsing and convert all host drivers to
use shared parsing (Rob Herring)
Error reporting:
- Restore AER capability after resume (Mayurkumar Patel)
- Add PoisonTLPBlocked AER counter (Rajat Jain)
- Use for_each_set_bit() to simplify AER code (Andy Shevchenko)
- Fix AER kernel-doc (Andy Shevchenko)
- Add "pcie_ports=dpc-native" parameter to allow native use of DPC
even if platform didn't grant control over AER (Olof Johansson)
Hotplug:
- Avoid returning prematurely from sysfs requests to enable or
disable a PCIe hotplug slot (Lukas Wunner)
- Don't disable interrupts twice when suspending hotplug ports (Mika
Westerberg)
- Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
Westerberg)
Power management:
- Remove unnecessary ASPM locking (Bjorn Helgaas)
- Add support for disabling L1 PM Substates (Heiner Kallweit)
- Allow re-enabling Clock PM after it has been disabled (Heiner
Kallweit)
- Add sysfs attributes for controlling ASPM link states (Heiner
Kallweit)
- Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
sysfs files (Heiner Kallweit)
- Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
USB 2.0 or 1.1 connect events (Kai-Heng Feng)
- Move power state check out of pci_msi_supported() (Bjorn Helgaas)
- Fix incorrect MSI-X masking on resume and revert related nvme quirk
for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)
- Always return devices to D0 when thawing to fix hibernation with
drivers like mlx4 that used legacy power management (previously we
only did it for drivers with new power management ops) (Dexuan Cui)
- Clear PCIe PME Status even for legacy power management (Bjorn
Helgaas)
- Fix PCI PM documentation errors (Bjorn Helgaas)
- Use dev_printk() for more power management messages (Bjorn Helgaas)
- Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)
- Convert xen-platform from legacy to generic power management (Bjorn
Helgaas)
- Removed unused .resume_early() and .suspend_late() legacy power
management hooks (Bjorn Helgaas)
- Rearrange power management code for clarity (Rafael J. Wysocki)
- Decode power states more clearly ("4" or "D4" really refers to
"D3cold") (Bjorn Helgaas)
- Notice when reading PM Control register returns an error (~0)
instead of interpreting it as being in D3hot (Bjorn Helgaas)
- Add missing link delays required by the PCIe spec (Mika Westerberg)
Virtualization:
- Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
Helgaas)
- Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
previously didn't recognize that) (Kuppuswamy Sathyanarayanan)
- Allow VFs to use PASID (the PF PASID capability is shared by the
VFs, but the code previously didn't recognize that) (Kuppuswamy
Sathyanarayanan)
- Disconnect PF and VF ATS enablement, since ATS in PFs and
associated VFs can be enabled independently (Kuppuswamy
Sathyanarayanan)
- Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)
- Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)
- Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
Wilczynski)
- Remove unused PRI and PASID stubs (Bjorn Helgaas)
- Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
interfaces that are only used by built-in IOMMU drivers (Bjorn
Helgaas)
- Hide PRI and PASID state restoration functions used only inside the
PCI core (Bjorn Helgaas)
- Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)
- Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)
- Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
Cherian)
- Fix the UPDCR register address in the Intel ACS quirk (Steffen
Liebergeld)
- Unify ACS quirk implementations (Bjorn Helgaas)
Amlogic Meson host bridge driver:
- Fix meson PERST# GPIO polarity problem (Remi Pommarel)
- Add DT bindings for Amlogic Meson G12A (Neil Armstrong)
- Fix meson clock names to match DT bindings (Neil Armstrong)
- Add meson support for Amlogic G12A SoC with separate shared PHY
(Neil Armstrong)
- Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
combo PHY (Neil Armstrong)
- Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)
- Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
(Neil Armstrong)
Broadcom iProc host bridge driver:
- Invalidate iProc PAXB address mapping before programming it
(Abhishek Shah)
- Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)
Cadence host bridge driver:
- Refactor Cadence PCIe host controller to use as a library for both
host and endpoint (Tom Joseph)
Freescale Layerscape host bridge driver:
- Add layerscape LS1028a support (Xiaowei Bao)
Intel VMD host bridge driver:
- Add VMD bus 224-255 restriction decode (Jon Derrick)
- Add VMD 8086:9A0B device ID (Jon Derrick)
- Remove Keith from VMD maintainer list (Keith Busch)
Marvell ARMADA 3700 / Aardvark host bridge driver:
- Use LTSSM state to build link training flag since Aardvark doesn't
implement the Link Training bit (Remi Pommarel)
- Delay before training Aardvark link in case PERST# was asserted
before the driver probe (Remi Pommarel)
- Fix Aardvark issues with Root Control reads and writes (Remi
Pommarel)
- Don't rely on jiffies in Aardvark config access path since
interrupts may be disabled (Remi Pommarel)
- Fix Aardvark big-endian support (Grzegorz Jaszczyk)
Marvell ARMADA 370 / XP host bridge driver:
- Make mvebu_pci_bridge_emul_ops static (Ben Dooks)
Microsoft Hyper-V host bridge driver:
- Add hibernation support for Hyper-V virtual PCI devices (Dexuan
Cui)
- Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
Cui)
- Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)
Mobiveil host bridge driver:
- Change mobiveil csr_read()/write() function names that conflict
with riscv arch functions (Kefeng Wang)
NVIDIA Tegra host bridge driver:
- Fix Tegra CLKREQ dependency programming (Vidya Sagar)
Renesas R-Car host bridge driver:
- Remove unnecessary header include from rcar (Andrew Murray)
- Tighten register index checking for rcar inbound range programming
(Marek Vasut)
- Fix rcar inbound range alignment calculation to improve packing of
multiple entries (Marek Vasut)
- Update rcar MACCTLR setting to match documentation (Yoshihiro
Shimoda)
- Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
(Yoshihiro Shimoda)
- Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
Horman)
Rockchip host bridge driver:
- Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
Murphy)
Socionext UniPhier host bridge driver:
- Set uniphier to host (RC) mode always (Kunihiko Hayashi)
Endpoint drivers:
- Fix endpoint driver sign extension problem when shifting page
number to phys_addr_t (Alan Mikhak)
Misc:
- Add NumaChip SPDX header (Krzysztof Wilczynski)
- Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)
- Remove unused includes (Krzysztof Wilczynski)
- Removed unused sysfs attribute groups (Ben Dooks)
- Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)
- Add PCIe Link Control 2 register field definitions to replace magic
numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)
- Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)
- Use pcie_capability_read_word() instead of pci_read_config_word()
in AMDGPU and Radeon CIK/SI (Frederick Lawler)
- Remove unused pci_irq_get_node() Greg Kroah-Hartman)
- Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
(Palmer Dabbelt, Michal Simek)
- Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)
- Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
Helgaas)
- Fix bridge emulation big-endian support (Grzegorz Jaszczyk)
- Fix dwc find_next_bit() usage (Niklas Cassel)
- Fix pcitest.c fd leak (Hewenliang)
- Fix typos and comments (Bjorn Helgaas)
- Fix Kconfig whitespace errors (Krzysztof Kozlowski)"
* tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
asm-generic: Make msi.h a mandatory include/asm header
Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
PCI/MSI: Fix incorrect MSI-X masking on resume
PCI/MSI: Move power state check out of pci_msi_supported()
PCI/MSI: Remove unused pci_irq_get_node()
PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
PCI: hv: Change pci_protocol_version to per-hbus
PCI: hv: Add hibernation support
PCI: hv: Reorganize the code in preparation of hibernation
MAINTAINERS: Remove Keith from VMD maintainer
PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
PCI/ASPM: Add sysfs attributes for controlling ASPM link states
PCI: Fix indentation
drm/radeon: Prefer pcie_capability_read_word()
drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
drm/radeon: Correct Transmit Margin masks
drm/amdgpu: Prefer pcie_capability_read_word()
PCI: uniphier: Set mode register to host mode
drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
...
We had been saving the last_cq_head seen from an interrupt so that a
polled queue wouldn't mistakenly trigger spruious interrupt detection. We
don't poll interrupt driven queues any more, so saving this value is
pointless.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Despite NVM Express specification 1.3 requires a controller claiming to
be 1.3 or higher implement Identify CNS 03h (Namespace Identification
Descriptor list), the driver doesn't really need this identification in
order to use a namespace. The code had already documented in comments
that we're not to consider an error to this command.
Return success if the controller provided any response to an
namespace identification descriptors command.
Fixes: 538af88ea7 ("nvme: make nvme_report_ns_ids propagate error back")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
Reported-by: Ingo Brunberg <ingo_brunberg@web.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: stable@vger.kernel.org # 5.4+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
If an error occurs on one of the ios used for creating an
association, the creating routine has error paths that are
invoked by the command failure and the error paths will free
up the controller resources created to that point.
But... the io was ultimately determined by an asynchronous
completion routine that detected the error and which
unconditionally invokes the error_recovery path which calls
delete_association. Delete association deletes all outstanding
io then tears down the controller resources. So the
create_association thread can be running in parallel with
the error_recovery thread. What was seen was the LLDD received
a call to delete a queue, causing the LLDD to do a free of a
resource, then the transport called the delete queue again
causing the driver to repeat the free call. The second free
routine corrupted the allocator. The transport shouldn't be
making the duplicate call, and the delete queue is just one
of the resources being freed.
To fix, it is realized that the create_association path is
completely serialized with one command at a time. So the
failed io completion will always be seen by the create_association
path and as of the failure, there are no ios to terminate and there
is no reason to be manipulating queue freeze states, etc.
The serialized condition stays true until the controller is
transitioned to the LIVE state. Thus the fix is to change the
error recovery path to check the controller state and only
invoke the teardown path if not already in the CONNECTING state.
Reviewed-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Remove unnecessary keyword in nvme_create_queue().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
We've seen a few devices that return different controller id's to
the Fabric Connect command vs the Identify(controller) command. It's
currently hard to identify this failure by existing error messages. It
comes across as a (re)connect attempt in the transport that fails with
a -22 (-EINVAL) status. The issue is compounded by older kernels not
having the controller id check or had the identify command overwrite the
fabrics controller id value before it checked. Both resulted in cases
where the devices appeared fine until more recent kernels.
Clarify the reject by adding an error message on controller id mismatches.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvme-fc: it's possible to have connected active controllers
and as no references are taken on the LLDD, the LLDD can be
unloaded. The controller would enter a reconnect state and as
long as the LLDD resumed within the reconnect timeout, the
controller would resume. But if a namespace on the controller
is the root device, allowing the driver to unload can be problematic.
To reload the driver, it may require new io to the boot device,
and as it's no longer connected we get into a catch-22 that
eventually fails, and the system locks up.
Fix this issue by taking a module reference for every connected
controller (which is what the core layer did to the transport
module). Reference is cleared when the controller is removed.
Acked-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
support SG_CHAIN, use only runtime allocation for the SGL.
We didn't notice of a performance degradation, since for small IOs we'll
use the inline SG and for the bigger IOs the allocation of a bigger SGL
from slab is fast enough.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3X/7gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgponxD/9mb/H9LD6/flEqoPv7n1dv7Y8Oe+AKpGLb
a2Jh8ycpU6WtzZdlMYbQkxAqgJCupLTlih3WY3NuI1fwSsxwyMziQEnVnKgPNf7s
PLWt+Qo5ryooyVkPi4KCHKjx2CFDUL4B1BqtSLm9n3eN72FRa9HCsCiMugjCbV+K
aF7snzZ0ss+m7SnKIpdtXJjcdIFC2hXwCAGWAOv1vOPwhqQZFBsxjHKnEJtumTp2
+wzLBPItLBbzHtAyopbiNlfsuHL9CF9L1QFaCTZE7N6eVWnlYpIhMCMrfEJ/e3jK
27ct3PWAa4Qr2S/85//AMOyL9mxK96g9FjuGjxnKQZ+qWh89RPLq+oGs1zWSv2O0
08BynWvxYHkoOR2+baPx89SHrlwyN0HCsvFBxVKrMsVHpy81sIYLFlytYaQUMx2/
RkgkIxiAo5R5vYHPZYX8cU4c1rASjG0tYD9OA6e78MJFIBfkTl70XHHVX2Tgf5by
fuwon/g+iVvN94QHb81ulZcA0hXRz8jM2RNIAPhqfoJXX6wNzD5MooNxTs9m88IP
6/HaM1l6AJUtOMNA1aZbKAq+ARYuIA0/qHoSS0UVHoG8D4YkYaFpvYsAaA+TBzeO
J8IcwSu6eVR1NvgJ9b4cwXFWSf75a/o4UeDP/1fYcQU5Gn/KQEdQVFSMvND1nnHW
hUTP5AKFcQ==
=oyE3
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block
Pull additional block driver updates from Jens Axboe:
"Here's another block driver update, done to avoid conflicts with the
zoned changes coming next.
This contains:
- Prepare SCSI sd for zone open/close/finish support
- Small NVMe pull request
- hwmon support (Akinobu)
- add new co-maintainer (Christoph)
- work-around for a discard issue on non-conformant drives
(Eduard)
- Small nbd leak fix"
* tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block:
nbd: prevent memory leak
nvme: hwmon: add quirk to avoid changing temperature threshold
nvme: hwmon: provide temperature min and max values for each sensor
nvmet: add another maintainer
nvme: Discard workaround for non-conformant devices
nvme: Add hardware monitoring support
scsi: sd_zbc: add zone open, close, and finish support
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WyEAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplgbD/4jNeqT0q2IkNcUUEWkZWsBOlfi0SiclS5v
X8JY1IxTlL0kaBWm83mw06JewucQ97Fh7xblPE8/iDHJqpgEX4vvSQY1b8hcDulZ
YOKUnLkFU22nICeT04/8x/+f8gqD5KOlGxkgEvUKViQW15oc0oNu4St/yFM1QEN0
qNMzpcfFXV9lYsOPl0y3pKdP+qbfcpeSmaFD9Z65gxN6rJy1WR8rtUGXy2luoiEc
dh15IL9AGN/r8VTo8yRpD9PStiuJqpALIR8OHJSHPj+s0pQ6twk4aehcnYseAMbH
zSDpa9AJrfqlnh8tUfKYLWi/PM7pMH0F01rAiQv47j/C0+QhbiOU/uTFTzUW5hQ1
eK6XzJ0slxwnDsHLKf+xJmCj0Oyk0jDimNQr/2MNsuhmr29V5lfvBNflub8eOLyZ
ie2Eulv+z6pYBSJx6kqm0X3vhXOy4wgU+X8LzvfcP9iAjgU1rfzxUWxLEj+KfJS2
Nl+ERV9nafoPpoKpNR7zWRBUulp1qZJzo/U9JaUKiI5cWkIH1hhHmU2++xMeyJpb
XHoDFNTGv6z/eef65eSveFD7F274TSi16K56Obk+4KWaSrIR0d6VwUA7FDmJbSI+
Jqk1OFdaRGsQ5OcVxF1Qo4WChn0FvhcD0c+yL0N19WZ01QeYsb3hlA+MUPDtGQ04
U79MPfu7iA==
=i0jf
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Here are the main block driver updates for 5.5. Nothing major in here,
mostly just fixes. This contains:
- a set of bcache changes via Coly
- MD changes from Song
- loop unmap write-zeroes fix (Darrick)
- spelling fixes (Geert)
- zoned additions cleanups to null_blk/dm (Ajay)
- allow null_blk online submit queue changes (Bart)
- NVMe changes via Keith, nothing major here either"
* tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
bcache: don't export symbols
bcache: remove the extra cflags for request.o
bcache: at least try to shrink 1 node in bch_mca_scan()
bcache: add idle_max_writeback_rate sysfs interface
bcache: add code comments in bch_btree_leaf_dirty()
bcache: fix deadlock in bcache_allocator
bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
bcache: deleted code comments for dead code in bch_data_insert_keys()
bcache: add more accurate error messages in read_super()
bcache: fix static checker warning in bcache_device_free()
bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
bcache: fix fifo index swapping condition in journal_pin_cmp()
md/raid10: prevent access of uninitialized resync_pages offset
md: avoid invalid memory access for array sb->dev_roles
md/raid1: avoid soft lockup under high load
null_blk: add zone open, close, and finish support
dm: add zone open, close and finish support
...
This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.
Guenter reported:
"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.
It doesn't seem to matter which temperature I write; writing -273000 has
the same result."
The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
According to the NVMe specification, the over temperature threshold and
under temperature threshold features shall be implemented for Composite
Temperature if a non-zero WCTEMP field value is reported in the Identify
Controller data structure. The features are also implemented for all
implemented temperature sensors (i.e., all Temperature Sensor fields that
report a non-zero value).
This provides the over temperature threshold and under temperature
threshold for each sensor as temperature min and max values of hwmon
sysfs attributes.
The WCTEMP is already provided as a temperature max value for Composite
Temperature, but this change isn't incompatible. Because the default
value of the over temperature threshold for Composite Temperature is
the WCTEMP.
Now the alarm attribute for Composite Temperature indicates one of the
temperature is outside of a temperature threshold. Because there is only
a single bit in Critical Warning field that indicates a temperature is
outside of a threshold.
Example output from the "sensors" command:
nvme-pci-0100
Adapter: PCI adapter
Composite: +33.9°C (low = -273.1°C, high = +69.8°C)
(crit = +79.8°C)
Sensor 1: +34.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 5: +47.9°C (low = -273.1°C, high = +65261.8°C)
This also adds helper macros for kelvin from/to milli Celsius conversion,
and replaces the repeated code in hwmon.c.
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Users observe IOMMU related errors when performing discard on nvme from
non-compliant nvme devices reading beyond the end of the DMA mapped
ranges to discard.
Two different variants of this behavior have been observed: SM22XX
controllers round up the read size to a multiple of 512 bytes, and Phison
E12 unconditionally reads the maximum discard size allowed by the spec
(256 segments or 4kB).
Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
so the driver DMA maps a memory range that will always succeed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at>
[changelog, use existing define, kernel coding style]
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme devices report temperature information in the controller information
(for limits) and in the smart log. Currently, the only means to retrieve
this information is the nvme command line interface, which requires
super-user privileges.
At the same time, it would be desirable to be able to use NVMe temperature
information for thermal control.
This patch adds support to read NVMe temperatures from the kernel using the
hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
can use this information to set thermal policies, and userspace can access
it using libsensors and/or the "sensors" command.
Example output from the "sensors" command:
nvme0-pci-0100
Adapter: PCI adapter
Composite: +39.0°C (high = +85.0°C, crit = +85.0°C)
Sensor 1: +39.0°C
Sensor 2: +41.0°C
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3F+DIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkVmD/9FIa092Q6obga0RqA16GlbI85tgtNyRFZU
neIO/g9R3G/uBTGbiUeXHXHDz9CqUXIRYX7pmI2u0b07iGLRz8oUsOsgyVEfCMen
VitqwkJJAZ9j9OifyKpLYCZX9ulVDWX5hEz/vm2cNWDkjCbOpXvRuQmkXEzp7RNM
F7K25PpGLvJHfqC90q9FXXxNDlB2i1M/rh5I7eUqhb6rHmzfJGKCd+H80t+REoB1
iXAygPj86agQKLOUKZtJjXUjq9Ol/0FD+OKY+eP3EfVv/FJvIeWWYe78WplxJpRD
BYb9dhLMCSo619WVVy4hNYCPjSfPKVT2cO5QJmVRpgOI1urFuTNgNoIiw06AgvkZ
09vrlrJZ5A7eFEppuAFQC4WRYKWCQCQfq8wxt2iGUivXgHfskjJ7qJz1Mh5Nlxsr
JGm1hSVw9UCBzjqC75K2CR+vVt4T8ovEaizPFvzVj6lQ0lRTmSchisxbXzTdevOn
kFvBOntBdpeSq/CwZ0x5PDP4AbRsgH3ny47LCJoEpFZlxlOLuEB/dAx564dn6ZgZ
rRz6mKU7rlO7brrkW+DbRZ0XiOn5qzN6FrcSGX1DBzc4+hMus/5PPjv8Gk+Wo1PU
388mu6N6DObSjw1ij1AqkwyzudmbrziKM4isJY4I72I0YGSq9cuG5VXgM2GqbGlU
XXpzsu8pSw==
=8B/z
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Two NVMe device removal crash fixes, and a compat fixup for for an
ioctl that was introduced in this release (Anton, Charles, Max - via
Keith)
- Missing error path mutex unlock for drbd (Dan)
- cgroup writeback fixup on dead memcg (Tejun)
- blkcg online stats print fix (Tejun)
* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
block: drbd: remove a stray unlock in __drbd_send_protocol()
blkcg: make blkcg_print_stat() print stats only for online blkgs
nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
nvme-rdma: fix a segmentation fault during module unload
nvme_mpath_clear_ctrl_paths() iterates through
the ctrl->namespaces list while holding ctrl->scan_lock.
This does not seem to be the correct way of protecting
from concurrent list modification.
Specifically, nvme_scan_work() sorts ctrl->namespaces
AFTER unlocking scan_lock.
This may result in the following (rare) crash in ctrl disconnect
during scan_work:
BUG: kernel NULL pointer dereference, address: 0000000000000050
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
...
Call Trace:
nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
nvme_remove_namespaces+0x35/0xe0 [nvme_core]
nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
nvme_sysfs_delete+0x49/0x60 [nvme_core]
dev_attr_store+0x17/0x30
sysfs_kf_write+0x3e/0x50
kernfs_fop_write+0x11e/0x1a0
__vfs_write+0x1b/0x40
vfs_write+0xb9/0x1a0
ksys_write+0x67/0xe0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x5a/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d02bfb154
Fix:
After taking scan_lock in nvme_mpath_clear_ctrl_paths()
down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe.
This will not cause deadlocks because taking scan_lock never happens
while holding the namespaces_rwsem.
Moreover, scan work downs namespaces_rwsem in the same order.
Alternative: sort ctrl->namespaces in nvme_scan_work()
while still holding the scan_lock.
This would leave nvme_mpath_clear_ctrl_paths() without correct protection
against ctrl->namespaces modification by anyone other than scan_work.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>