TOS provide clients the ability to segregate traffic flows for
different type of data.
One of the TOS usage is bandwidth management which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.
usage examples:
nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Simple polling support via socket busy_poll interface.
Although we do not shutdown interrupts but simply hammer
the socket poll, we can sometimes find completions faster
than the normal interrupt driven RX path.
We add per queue nr_cqe counter that resets every time
RX path is invoked such that .poll callback can return it
to stay consistent with the semantics.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
All seem to call it with ctrl->cap so no need to pass it
at all.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
nvme_enable_ctrl reads the cap register right after, so
no need to do that locally in the transport driver. Have
sqsize setting in nvme_init_identify.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Using socket specific read_sock() calls instead of directly calling
tcp_read_sock() helps lld module registered handlers if any, to be called
from nvme-tcp host.
This patch therefore replaces the tcp_read_sock() with socket specific
prot_ops.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When aborting in-flight request for recovering controller, we have
to make sure that queue's complete function is called on completed
request before moving on. Otherwise, for example, the warning of
WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be
triggered on nvme-rdma.
Fix this issue by using blk_mq_tagset_wait_completed_request.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
According to commit a10674bf24 ("tcp: detecting the misuse of
.sendpage for Slab objects") and previous discussion, tcp_sendpage
should not be used for pages that is managed by SLAB, as SLAB is not
taking page reference counters into consideration.
Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When the controller supports less queues than requested, we
should make sure that queue mapping does the right thing and
not assume that all queues are available. This fixes a crash
when the controller supports less queues than requested.
The rules are:
1. if no write queues are requested, we assign the available queues
to the default queue map. The default and read queue maps share the
existing queues.
2. if write queues are requested:
- first make sure that read queue map gets the requested
nr_io_queues count
- then grant the default queue map the minimum between the requested
nr_write_queues and the remaining queues. If there are no available
queues to dedicate to the default queue map, fallback to (1) and
share all the queues in the existing queue map.
Also, provide a log indication on how we constructed the different
queue maps.
Reported-by: Harris, James R <james.r.harris@intel.com>
Tested-by: Jim Harris <james.r.harris@intel.com>
Cc: <stable@vger.kernel.org> # v5.0+
Suggested-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
If I/O queue connect times out, we might have freed the queue socket
already, so check for that on the error path in nvme_tcp_start_queue.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
usually nvme_ prefix is for core functions.
While we're cleaning up, remove redundant empty lines
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.
Fixes: 39d5775746 ("nvme-tcp: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_tcp_end_request just takes the status value and the converts
it to little endian as well as shifting for the phase bit.
Fixes: 43ce38a6d823 ("nvme-tcp: support C2HData with SUCCESS flag")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
A C2HData PDU with the SUCCESS flag set indicates that the I/O was
completed by the controller successfully and means that a subsequent
completion response capsule PDU will be ommitted.
If we see this flag, fisrt we check that LAST_PDU flag is set as well,
and then we complete the request when the data transfer (and data digest
verification if its on) is done.
While we're at it, reuse a bit of code with nvme_fail_request.
Reported-by: Steve Blightman <steve.blightman@oracle.com>
Suggested-by: Oliver Smith-Denny <osmithde@cisco.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com>
Tested-by: Oliver Smith-Denny <osmithde@cisco.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is used now just to flush error recovery and reconnect work items in
the RDMA and TCP transports, which can simply be moved to the
corresponding teardown routines.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, we have several problems with the timeout
handler:
1. If we timeout on the controller establishment flow, we will hang
because we don't execute the error recovery (and we shouldn't because
the create_ctrl flow needs to fail and cleanup on its own)
2. We might also hang if we get a disconnet on a queue while the
controller is already deleting. This racy flow can cause the controller
disable/shutdown admin command to hang.
We cannot complete a timed out request from the timeout handler without
mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
So we serialize it in the timeout handler and teardown io and admin
queues to guarantee that no one races with us from completing the
request.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should never touch the opal device from the transport driver.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Preparation for polling support for fabrics. Polling support
means that our completion queues are not generating any interrupts
which means we need to poll for the nvmf io queue connect as well.
Reviewed by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
There is a spelling mistake in a dev_info message, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues. In
addition, implement .map_queues that will apply 2 queue maps for read
and write queue sets.
Note that with the separate queue map, HCTX_TYPE_READ will always use
nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.
The driver implements the TP 8000 of how nvme over fabrics capsules and
data are encapsulated in nvme-tcp pdus and exchaged on top of a TCP byte
stream. nvme-tcp header and data digest are supported as well.
To connect to all NVMe over Fabrics controllers reachable on a given taget
port over TCP use the following command:
nvme connect-all -t tcp -a $IPADDR
This requires the latest version of nvme-cli with TCP support.
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>