linux_dsm_epyc7002/drivers/nvme/target
Chaitanya Kulkarni 819f7b88b4 nvmet: fail outstanding host posted AEN req
In nvmet_async_events_process() we only process AENs if there is an
open slot in ctrl->async_event_cmds[] and the list of AEN events
posted by the target is not empty. This keeps a host-posted AEN
request outstanding as long as the target-generated AEN list is empty.
We do clean up the target-generated entries from the AEN list in
nvmet_ctrl_free() -> nvmet_async_events_free(), but we don't process
the AEN requests posted by the host. This leads to the following problem:

At the time of nvmet_sq_destroy(), the admin sq still holds an extra
percpu reference (atomic value = 1) because of the outstanding AEN
request, so in the following code path, after switching to atomic mode
via RCU, the release function (nvmet_sq_free()) is never called, which
blocks the wait on sq->free_done in nvmet_sq_destroy():

nvmet_sq_destroy()
 percpu_ref_kill_and_confirm()
 - __percpu_ref_switch_mode()
 --  __percpu_ref_switch_to_atomic()
 ---   call_rcu() -> percpu_ref_switch_to_atomic_rcu()
 ----     /* calls switch callback */
 - percpu_ref_put()
 -- percpu_ref_put_many(ref, 1)
 --- else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
 ----   ref->release(ref); <---- Not called.

This results in an indefinite hang:

  void nvmet_sq_destroy(struct nvmet_sq *sq)
...
          if (ctrl && ctrl->sqs && ctrl->sqs[0] == sq) {
                  nvmet_async_events_process(ctrl, status);
                  percpu_ref_put(&sq->ref);
          }
          percpu_ref_kill_and_confirm(&sq->ref, nvmet_confirm_sq);
          wait_for_completion(&sq->confirm_done);
          wait_for_completion(&sq->free_done); <-- Hang here

This breaks the rest of the disconnect sequence. The problem seems to
have been introduced by commit 64f5e9cdd7 ("nvmet: fix memory leak when
removing namespaces and controllers concurrently").
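
The reference-count reasoning above can be sketched in plain userspace C.
This is only a model, not kernel code: struct model_ref and every
model_* function are hypothetical stand-ins for a percpu_ref that has
already been switched to atomic mode, and it assumes that each
outstanding host-posted AEN request pins exactly one reference on the
admin sq.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a percpu_ref in atomic mode. */
struct model_ref {
	long count;    /* outstanding references */
	bool released; /* true once the release callback has run */
};

static void model_ref_init(struct model_ref *ref)
{
	ref->count = 1; /* initial reference, dropped when the ref is killed */
	ref->released = false;
}

static void model_ref_get(struct model_ref *ref)
{
	ref->count++;
}

static void model_ref_put(struct model_ref *ref)
{
	/* Stands in for ref->release(ref) firing when the count hits zero. */
	if (--ref->count == 0)
		ref->released = true;
}

/*
 * Returns true if the release callback ran, i.e. a waiter on the
 * "free done" completion would be woken up instead of hanging.
 */
static bool model_sq_destroy_completes(bool host_aen_outstanding,
				       bool fail_host_aens)
{
	struct model_ref ref;

	model_ref_init(&ref);

	/* Assumption: the outstanding AEN request pins the admin sq. */
	if (host_aen_outstanding)
		model_ref_get(&ref);

	/* Completing (failing) the request drops the reference it held. */
	if (host_aen_outstanding && fail_host_aens)
		model_ref_put(&ref);

	/* Killing the ref drops the initial reference. */
	model_ref_put(&ref);

	return ref.released;
}

/* Small self-check covering the three cases discussed above. */
static void model_ref_self_check(void)
{
	assert(model_sq_destroy_completes(false, false));
	assert(!model_sq_destroy_completes(true, false));
	assert(model_sq_destroy_completes(true, true));
}
```

In this model, only the case with an outstanding request that is never
completed leaves the count at 1, which corresponds to the hang on
sq->free_done.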

This patch processes ctrl->async_event_cmds[] in the admin sq destroy()
context irrespective of aen_list. We also get rid of the controller's
aen_list processing in the nvmet_sq_destroy() context and just ignore
ctrl->aen_list there.

This results in nvmet_async_events_process() being called from workqueue
context so we adjust the code accordingly.
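
The difference between the two draining strategies described above can
be sketched as a userspace model. The names below (struct model_ctrl,
model_events_process(), model_events_failall()) are hypothetical and
only count entries; they are not the functions touched by this patch.

```c
#include <assert.h>

/* Hypothetical model of the two queues kept per controller. */
struct model_ctrl {
	int nr_host_cmds;     /* host-posted AEN requests waiting */
	int nr_target_events; /* target-generated events on the AEN list */
	int nr_completed;     /* requests completed with an event */
	int nr_failed;        /* requests failed during teardown */
};

/*
 * Pairing loop: a host request is only completed when a target event
 * is available, so it stops as soon as either side runs out.
 */
static void model_events_process(struct model_ctrl *ctrl)
{
	while (ctrl->nr_host_cmds > 0 && ctrl->nr_target_events > 0) {
		ctrl->nr_host_cmds--;
		ctrl->nr_target_events--;
		ctrl->nr_completed++;
	}
}

/*
 * Teardown loop: fail every host-posted request irrespective of the
 * target event list, so nothing stays outstanding.
 */
static void model_events_failall(struct model_ctrl *ctrl)
{
	while (ctrl->nr_host_cmds > 0) {
		ctrl->nr_host_cmds--;
		ctrl->nr_failed++;
	}
}

/* Self-check: two host requests, empty target event list. */
static void model_events_self_check(void)
{
	struct model_ctrl ctrl = { .nr_host_cmds = 2, .nr_target_events = 0 };

	model_events_process(&ctrl);
	assert(ctrl.nr_host_cmds == 2); /* still outstanding */

	model_events_failall(&ctrl);
	assert(ctrl.nr_host_cmds == 0 && ctrl.nr_failed == 2);
}
```

With an empty target event list the pairing loop leaves both host
requests outstanding, while the teardown loop drains them.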

Fixes: 64f5e9cdd7 ("nvmet: fix memory leak when removing namespaces and controllers concurrently")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11 09:10:06 -06:00
admin-cmd.c nvmet: add metadata/T10-PI support 2020-05-27 07:12:40 +02:00
configfs.c nvmet: add metadata/T10-PI support 2020-05-27 07:12:40 +02:00
core.c nvmet: fail outstanding host posted AEN req 2020-06-11 09:10:06 -06:00
discovery.c nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len 2020-05-27 07:12:39 +02:00
fabrics-cmd.c nvmet: add metadata/T10-PI support 2020-05-27 07:12:40 +02:00
fc.c nvmet-fc: slight cleanup for kbuild test warnings 2020-05-09 16:18:35 -06:00
fcloop.c nvme-fcloop: add target to host LS request support 2020-05-09 16:18:34 -06:00
io-cmd-bdev.c for-5.8/drivers-2020-06-01 2020-06-02 15:37:03 -07:00
io-cmd-file.c nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len 2020-05-27 07:12:39 +02:00
Kconfig nvmet: add metadata characteristics for a namespace 2020-05-27 07:12:39 +02:00
loop.c nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl 2020-03-26 04:51:56 +09:00
Makefile nvmet: introduce target-side trace 2019-06-21 11:15:46 +02:00
nvmet.h nvmet: add metadata support for block devices 2020-05-27 07:12:40 +02:00
rdma.c nvmet-rdma: add metadata/T10-PI support 2020-05-27 07:12:40 +02:00
tcp.c nvmet-tcp: constify nvmet_tcp_ops 2020-06-11 09:10:05 -06:00
trace.c nvmet: trace: parse Get LBA Status command in detail 2019-08-29 12:55:01 -07:00
trace.h nvmet: add async event tracing support 2020-05-27 07:12:38 +02:00