2007-09-11 00:50:12 +07:00
|
|
|
/*
|
2017-10-31 03:22:14 +07:00
|
|
|
* Copyright (c) 2014-2017 Oracle. All rights reserved.
|
2007-09-11 00:50:42 +07:00
|
|
|
* Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
|
|
|
|
*
|
|
|
|
* This software is available to you under a choice of one of two
|
|
|
|
* licenses. You may choose to be licensed under the terms of the GNU
|
|
|
|
* General Public License (GPL) Version 2, available from the file
|
|
|
|
* COPYING in the main directory of this source tree, or the BSD-type
|
|
|
|
* license below:
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
*
|
|
|
|
* Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
*
|
|
|
|
* Redistributions in binary form must reproduce the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer in the documentation and/or other materials provided
|
|
|
|
* with the distribution.
|
|
|
|
*
|
|
|
|
* Neither the name of the Network Appliance, Inc. nor the names of
|
|
|
|
* its contributors may be used to endorse or promote products
|
|
|
|
* derived from this software without specific prior written
|
|
|
|
* permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
|
|
|
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
|
|
|
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
|
|
|
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
|
|
|
* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
|
|
|
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
|
|
|
|
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
|
|
|
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
|
|
|
|
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
|
|
|
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
|
|
|
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* rpc_rdma.c
|
|
|
|
*
|
|
|
|
* This file contains the guts of the RPC RDMA protocol, and
|
|
|
|
* does marshaling/unmarshaling, etc. It is also where interfacing
|
|
|
|
* to the Linux RPC framework lives.
|
2007-09-11 00:50:12 +07:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include "xprt_rdma.h"
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
#include <linux/highmem.h>
|
|
|
|
|
2014-11-18 04:58:04 +07:00
|
|
|
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
|
2007-09-11 00:50:42 +07:00
|
|
|
# define RPCDBG_FACILITY RPCDBG_TRANS
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static const char transfertypes[][12] = {
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
"inline", /* no chunks */
|
|
|
|
"read list", /* some argument via rdma read */
|
|
|
|
"*read list", /* entire request via rdma read */
|
|
|
|
"write list", /* some result via rdma write */
|
2007-09-11 00:50:42 +07:00
|
|
|
"reply chunk" /* entire reply via rdma write */
|
|
|
|
};
|
2016-05-03 01:41:05 +07:00
|
|
|
|
|
|
|
/* Returns size of largest RPC-over-RDMA header in a Call message
|
|
|
|
*
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
* The largest Call header contains a full-size Read list and a
|
|
|
|
* minimal Reply chunk.
|
2016-05-03 01:41:05 +07:00
|
|
|
*/
|
|
|
|
static unsigned int rpcrdma_max_call_header_size(unsigned int maxsegs)
|
|
|
|
{
|
|
|
|
unsigned int size;
|
|
|
|
|
|
|
|
/* Fixed header fields and list discriminators */
|
|
|
|
size = RPCRDMA_HDRLEN_MIN;
|
|
|
|
|
|
|
|
/* Maximum Read list size */
|
|
|
|
maxsegs += 2; /* segment for head and tail buffers */
|
2017-10-31 03:21:57 +07:00
|
|
|
size = maxsegs * rpcrdma_readchunk_maxsz * sizeof(__be32);
|
2016-05-03 01:41:05 +07:00
|
|
|
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
/* Minimal Read chunk size */
|
|
|
|
size += sizeof(__be32); /* segment count */
|
2017-10-31 03:21:57 +07:00
|
|
|
size += rpcrdma_segment_maxsz * sizeof(__be32);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
size += sizeof(__be32); /* list discriminator */
|
|
|
|
|
2016-05-03 01:41:05 +07:00
|
|
|
dprintk("RPC: %s: max call header size = %u\n",
|
|
|
|
__func__, size);
|
|
|
|
return size;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Returns size of largest RPC-over-RDMA header in a Reply message
|
|
|
|
*
|
|
|
|
* There is only one Write list or one Reply chunk per Reply
|
|
|
|
* message. The larger list is the Write list.
|
|
|
|
*/
|
|
|
|
static unsigned int rpcrdma_max_reply_header_size(unsigned int maxsegs)
|
|
|
|
{
|
|
|
|
unsigned int size;
|
|
|
|
|
|
|
|
/* Fixed header fields and list discriminators */
|
|
|
|
size = RPCRDMA_HDRLEN_MIN;
|
|
|
|
|
|
|
|
/* Maximum Write list size */
|
|
|
|
maxsegs += 2; /* segment for head and tail buffers */
|
|
|
|
size = sizeof(__be32); /* segment count */
|
2017-10-31 03:21:57 +07:00
|
|
|
size += maxsegs * rpcrdma_segment_maxsz * sizeof(__be32);
|
2016-05-03 01:41:05 +07:00
|
|
|
size += sizeof(__be32); /* list discriminator */
|
|
|
|
|
|
|
|
dprintk("RPC: %s: max reply header size = %u\n",
|
|
|
|
__func__, size);
|
|
|
|
return size;
|
|
|
|
}
|
|
|
|
|
2016-09-15 21:57:07 +07:00
|
|
|
void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
|
2016-05-03 01:41:05 +07:00
|
|
|
{
|
2016-09-15 21:57:07 +07:00
|
|
|
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
|
|
|
|
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
|
|
|
|
unsigned int maxsegs = ia->ri_max_segs;
|
|
|
|
|
2016-05-03 01:41:05 +07:00
|
|
|
ia->ri_max_inline_write = cdata->inline_wsize -
|
|
|
|
rpcrdma_max_call_header_size(maxsegs);
|
|
|
|
ia->ri_max_inline_read = cdata->inline_rsize -
|
|
|
|
rpcrdma_max_reply_header_size(maxsegs);
|
|
|
|
}
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-08-04 00:03:49 +07:00
|
|
|
/* The client can send a request inline as long as the RPCRDMA header
|
|
|
|
* plus the RPC call fit under the transport's inline limit. If the
|
|
|
|
* combined call message size exceeds that limit, the client must use
|
2017-02-09 05:00:10 +07:00
|
|
|
* a Read chunk for this operation.
|
|
|
|
*
|
|
|
|
* A Read chunk is also required if sending the RPC call inline would
|
|
|
|
* exceed this device's max_sge limit.
|
2015-08-04 00:03:49 +07:00
|
|
|
*/
|
2016-05-03 01:41:05 +07:00
|
|
|
static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
|
|
|
|
struct rpc_rqst *rqst)
|
2015-08-04 00:03:49 +07:00
|
|
|
{
|
2017-02-09 05:00:10 +07:00
|
|
|
struct xdr_buf *xdr = &rqst->rq_snd_buf;
|
|
|
|
unsigned int count, remaining, offset;
|
|
|
|
|
|
|
|
if (xdr->len > r_xprt->rx_ia.ri_max_inline_write)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (xdr->page_len) {
|
|
|
|
remaining = xdr->page_len;
|
2017-06-08 22:53:16 +07:00
|
|
|
offset = offset_in_page(xdr->page_base);
|
2017-02-09 05:00:10 +07:00
|
|
|
count = 0;
|
|
|
|
while (remaining) {
|
|
|
|
remaining -= min_t(unsigned int,
|
|
|
|
PAGE_SIZE - offset, remaining);
|
|
|
|
offset = 0;
|
|
|
|
if (++count > r_xprt->rx_ia.ri_max_send_sges)
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
2015-08-04 00:03:49 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* The client can't know how large the actual reply will be. Thus it
|
|
|
|
* plans for the largest possible reply for that particular ULP
|
|
|
|
* operation. If the maximum combined reply message size exceeds that
|
|
|
|
* limit, the client must provide a write list or a reply chunk for
|
|
|
|
* this request.
|
|
|
|
*/
|
2016-05-03 01:41:05 +07:00
|
|
|
static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
|
|
|
|
struct rpc_rqst *rqst)
|
2015-08-04 00:03:49 +07:00
|
|
|
{
|
2016-05-03 01:41:05 +07:00
|
|
|
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
|
2015-08-04 00:03:49 +07:00
|
|
|
|
2016-05-03 01:41:05 +07:00
|
|
|
return rqst->rq_rcv_buf.buflen <= ia->ri_max_inline_read;
|
2015-08-04 00:03:49 +07:00
|
|
|
}
|
|
|
|
|
2017-08-15 02:38:22 +07:00
|
|
|
/* Split @vec on page boundaries into SGEs. FMR registers pages, not
|
|
|
|
* a byte range. Other modes coalesce these SGEs into a single MR
|
|
|
|
* when they can.
|
|
|
|
*
|
|
|
|
* Returns pointer to next available SGE, and bumps the total number
|
|
|
|
* of SGEs consumed.
|
xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
|
|
|
*/
|
2017-08-15 02:38:22 +07:00
|
|
|
static struct rpcrdma_mr_seg *
|
|
|
|
rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg,
|
|
|
|
unsigned int *n)
|
xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
|
|
|
{
|
2017-08-15 02:38:22 +07:00
|
|
|
u32 remaining, page_offset;
|
xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
|
|
|
char *base;
|
|
|
|
|
|
|
|
base = vec->iov_base;
|
|
|
|
page_offset = offset_in_page(base);
|
|
|
|
remaining = vec->iov_len;
|
2017-08-15 02:38:22 +07:00
|
|
|
while (remaining) {
|
|
|
|
seg->mr_page = NULL;
|
|
|
|
seg->mr_offset = base;
|
|
|
|
seg->mr_len = min_t(u32, PAGE_SIZE - page_offset, remaining);
|
|
|
|
remaining -= seg->mr_len;
|
|
|
|
base += seg->mr_len;
|
|
|
|
++seg;
|
|
|
|
++(*n);
|
xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
|
|
|
page_offset = 0;
|
|
|
|
}
|
2017-08-15 02:38:22 +07:00
|
|
|
return seg;
|
xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
|
|
|
}
|
|
|
|
|
2017-08-15 02:38:22 +07:00
|
|
|
/* Convert @xdrbuf into SGEs no larger than a page each. As they
|
|
|
|
* are registered, these SGEs are then coalesced into RDMA segments
|
|
|
|
* when the selected memreg mode supports it.
|
2007-09-11 00:50:42 +07:00
|
|
|
*
|
2017-08-15 02:38:22 +07:00
|
|
|
* Returns positive number of SGEs consumed, or a negative errno.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
2017-02-09 04:59:54 +07:00
|
|
|
rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
|
|
|
|
unsigned int pos, enum rpcrdma_chunktype type,
|
|
|
|
struct rpcrdma_mr_seg *seg)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
2017-08-15 02:38:22 +07:00
|
|
|
unsigned long page_base;
|
|
|
|
unsigned int len, n;
|
2011-02-10 02:45:28 +07:00
|
|
|
struct page **ppages;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2016-06-30 00:54:25 +07:00
|
|
|
n = 0;
|
2017-08-15 02:38:22 +07:00
|
|
|
if (pos == 0)
|
|
|
|
seg = rpcrdma_convert_kvec(&xdrbuf->head[0], seg, &n);
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2011-02-10 02:45:28 +07:00
|
|
|
len = xdrbuf->page_len;
|
|
|
|
ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
|
2017-06-08 22:53:16 +07:00
|
|
|
page_base = offset_in_page(xdrbuf->page_base);
|
2017-08-15 02:38:22 +07:00
|
|
|
while (len) {
|
|
|
|
if (unlikely(!*ppages)) {
|
|
|
|
/* XXX: Certain upper layer operations do
|
|
|
|
* not provide receive buffer pages.
|
|
|
|
*/
|
|
|
|
*ppages = alloc_page(GFP_ATOMIC);
|
|
|
|
if (!*ppages)
|
2016-06-30 00:53:43 +07:00
|
|
|
return -EAGAIN;
|
2014-05-28 21:34:24 +07:00
|
|
|
}
|
2017-08-15 02:38:22 +07:00
|
|
|
seg->mr_page = *ppages;
|
|
|
|
seg->mr_offset = (char *)page_base;
|
|
|
|
seg->mr_len = min_t(u32, PAGE_SIZE - page_base, len);
|
|
|
|
len -= seg->mr_len;
|
|
|
|
++ppages;
|
|
|
|
++seg;
|
2007-09-11 00:50:42 +07:00
|
|
|
++n;
|
2017-08-15 02:38:22 +07:00
|
|
|
page_base = 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
2017-02-09 04:59:46 +07:00
|
|
|
/* When encoding a Read chunk, the tail iovec contains an
|
|
|
|
* XDR pad and may be omitted.
|
|
|
|
*/
|
2017-02-09 04:59:54 +07:00
|
|
|
if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_roundup)
|
2017-08-15 02:38:22 +07:00
|
|
|
goto out;
|
2015-08-04 00:04:17 +07:00
|
|
|
|
2017-02-09 04:59:54 +07:00
|
|
|
/* When encoding a Write chunk, some servers need to see an
|
|
|
|
* extra segment for non-XDR-aligned Write chunks. The upper
|
|
|
|
* layer provides space in the tail iovec that may be used
|
|
|
|
* for this purpose.
|
2016-09-15 21:57:16 +07:00
|
|
|
*/
|
2017-02-09 04:59:54 +07:00
|
|
|
if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_roundup)
|
2017-08-15 02:38:22 +07:00
|
|
|
goto out;
|
2016-09-15 21:57:16 +07:00
|
|
|
|
2017-08-15 02:38:22 +07:00
|
|
|
if (xdrbuf->tail[0].iov_len)
|
|
|
|
seg = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-08-15 02:38:22 +07:00
|
|
|
out:
|
|
|
|
if (unlikely(n > RPCRDMA_MAX_SEGS))
|
|
|
|
return -EIO;
|
2007-09-11 00:50:42 +07:00
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
static inline int
|
|
|
|
encode_item_present(struct xdr_stream *xdr)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_reserve_space(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
|
|
|
|
*p = xdr_one;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int
|
|
|
|
encode_item_not_present(struct xdr_stream *xdr)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_reserve_space(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EMSGSIZE;
|
2016-06-30 00:54:25 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
*p = xdr_zero;
|
|
|
|
return 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
static void
|
2016-06-30 00:54:16 +07:00
|
|
|
xdr_encode_rdma_segment(__be32 *iptr, struct rpcrdma_mw *mw)
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
{
|
2016-06-30 00:54:16 +07:00
|
|
|
*iptr++ = cpu_to_be32(mw->mw_handle);
|
|
|
|
*iptr++ = cpu_to_be32(mw->mw_length);
|
2017-08-10 23:47:36 +07:00
|
|
|
xdr_encode_hyper(iptr, mw->mw_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
encode_rdma_segment(struct xdr_stream *xdr, struct rpcrdma_mw *mw)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_reserve_space(xdr, 4 * sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
|
|
|
|
xdr_encode_rdma_segment(p, mw);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
encode_read_segment(struct xdr_stream *xdr, struct rpcrdma_mw *mw,
|
|
|
|
u32 position)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_reserve_space(xdr, 6 * sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
|
|
|
|
*p++ = xdr_one; /* Item present */
|
|
|
|
*p++ = cpu_to_be32(position);
|
|
|
|
xdr_encode_rdma_segment(p, mw);
|
|
|
|
return 0;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
/* Register and XDR encode the Read list. Supports encoding a list of read
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
* segments that belong to a single read chunk.
|
|
|
|
*
|
|
|
|
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
|
|
|
|
*
|
|
|
|
* Read chunklist (a linked list):
|
|
|
|
* N elements, position P (same P for all chunks of same arg!):
|
|
|
|
* 1 - PHLOO - 1 - PHLOO - ... - 1 - PHLOO - 0
|
|
|
|
*
|
2017-08-10 23:47:36 +07:00
|
|
|
* Returns zero on success, or a negative errno if a failure occurred.
|
|
|
|
* @xdr is advanced to the next position in the stream.
|
|
|
|
*
|
|
|
|
* Only a single @pos value is currently supported.
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
*/
|
2017-08-10 23:47:36 +07:00
|
|
|
static noinline int
|
|
|
|
rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
|
|
|
struct rpc_rqst *rqst, enum rpcrdma_chunktype rtype)
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
{
|
2017-08-10 23:47:36 +07:00
|
|
|
struct xdr_stream *xdr = &req->rl_stream;
|
2016-06-30 00:54:25 +07:00
|
|
|
struct rpcrdma_mr_seg *seg;
|
2016-06-30 00:54:16 +07:00
|
|
|
struct rpcrdma_mw *mw;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
unsigned int pos;
|
2017-08-15 02:38:30 +07:00
|
|
|
int nsegs;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
pos = rqst->rq_snd_buf.head[0].iov_len;
|
|
|
|
if (rtype == rpcrdma_areadch)
|
|
|
|
pos = 0;
|
2016-06-30 00:54:25 +07:00
|
|
|
seg = req->rl_segments;
|
2017-02-09 04:59:54 +07:00
|
|
|
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
|
|
|
|
rtype, seg);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
if (nsegs < 0)
|
2017-08-10 23:47:36 +07:00
|
|
|
return nsegs;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
do {
|
2017-08-15 02:38:30 +07:00
|
|
|
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
|
|
|
|
false, &mw);
|
|
|
|
if (IS_ERR(seg))
|
|
|
|
return PTR_ERR(seg);
|
2017-02-09 05:00:43 +07:00
|
|
|
rpcrdma_push_mw(mw, &req->rl_registered);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
if (encode_read_segment(xdr, mw, pos) < 0)
|
|
|
|
return -EMSGSIZE;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2016-06-30 00:54:16 +07:00
|
|
|
dprintk("RPC: %5u %s: pos %u %u@0x%016llx:0x%08x (%s)\n",
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__, pos,
|
2016-06-30 00:54:16 +07:00
|
|
|
mw->mw_length, (unsigned long long)mw->mw_offset,
|
2017-08-15 02:38:30 +07:00
|
|
|
mw->mw_handle, mw->mw_nents < nsegs ? "more" : "last");
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.read_chunk_count++;
|
2017-08-15 02:38:30 +07:00
|
|
|
nsegs -= mw->mw_nents;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
} while (nsegs);
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
return 0;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
/* Register and XDR encode the Write list. Supports encoding a list
|
|
|
|
* containing one array of plain segments that belong to a single
|
|
|
|
* write chunk.
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
*
|
|
|
|
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
|
|
|
|
*
|
|
|
|
* Write chunklist (a list of (one) counted array):
|
|
|
|
* N elements:
|
|
|
|
* 1 - N - HLOO - HLOO - ... - HLOO - 0
|
|
|
|
*
|
2017-08-10 23:47:36 +07:00
|
|
|
* Returns zero on success, or a negative errno if a failure occurred.
|
|
|
|
* @xdr is advanced to the next position in the stream.
|
|
|
|
*
|
|
|
|
* Only a single Write chunk is currently supported.
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
*/
|
2017-08-10 23:47:36 +07:00
|
|
|
static noinline int
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
2017-08-10 23:47:36 +07:00
|
|
|
struct rpc_rqst *rqst, enum rpcrdma_chunktype wtype)
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
{
|
2017-08-10 23:47:36 +07:00
|
|
|
struct xdr_stream *xdr = &req->rl_stream;
|
2016-06-30 00:54:25 +07:00
|
|
|
struct rpcrdma_mr_seg *seg;
|
2016-06-30 00:54:16 +07:00
|
|
|
struct rpcrdma_mw *mw;
|
2017-08-15 02:38:30 +07:00
|
|
|
int nsegs, nchunks;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
__be32 *segcount;
|
|
|
|
|
2016-06-30 00:54:25 +07:00
|
|
|
seg = req->rl_segments;
|
2017-02-09 04:59:54 +07:00
|
|
|
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_rcv_buf.head[0].iov_len,
|
2017-02-09 04:59:54 +07:00
|
|
|
wtype, seg);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
if (nsegs < 0)
|
2017-08-10 23:47:36 +07:00
|
|
|
return nsegs;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
if (encode_item_present(xdr) < 0)
|
|
|
|
return -EMSGSIZE;
|
|
|
|
segcount = xdr_reserve_space(xdr, sizeof(*segcount));
|
|
|
|
if (unlikely(!segcount))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
/* Actual value encoded below */
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
nchunks = 0;
|
|
|
|
do {
|
2017-08-15 02:38:30 +07:00
|
|
|
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
|
|
|
|
true, &mw);
|
|
|
|
if (IS_ERR(seg))
|
|
|
|
return PTR_ERR(seg);
|
2017-02-09 05:00:43 +07:00
|
|
|
rpcrdma_push_mw(mw, &req->rl_registered);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
if (encode_rdma_segment(xdr, mw) < 0)
|
|
|
|
return -EMSGSIZE;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2016-06-30 00:54:16 +07:00
|
|
|
dprintk("RPC: %5u %s: %u@0x016%llx:0x%08x (%s)\n",
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__,
|
2016-06-30 00:54:16 +07:00
|
|
|
mw->mw_length, (unsigned long long)mw->mw_offset,
|
2017-08-15 02:38:30 +07:00
|
|
|
mw->mw_handle, mw->mw_nents < nsegs ? "more" : "last");
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.write_chunk_count++;
|
|
|
|
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
|
|
|
|
nchunks++;
|
2017-08-15 02:38:30 +07:00
|
|
|
nsegs -= mw->mw_nents;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
} while (nsegs);
|
|
|
|
|
|
|
|
/* Update count of segments in this Write chunk */
|
|
|
|
*segcount = cpu_to_be32(nchunks);
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
return 0;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
/* Register and XDR encode the Reply chunk. Supports encoding an array
|
|
|
|
* of plain segments that belong to a single write (reply) chunk.
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
*
|
|
|
|
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
|
|
|
|
*
|
|
|
|
* Reply chunk (a counted array):
|
|
|
|
* N elements:
|
|
|
|
* 1 - N - HLOO - HLOO - ... - HLOO
|
|
|
|
*
|
2017-08-10 23:47:36 +07:00
|
|
|
* Returns zero on success, or a negative errno if a failure occurred.
|
|
|
|
* @xdr is advanced to the next position in the stream.
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
*/
|
2017-08-10 23:47:36 +07:00
|
|
|
static noinline int
|
|
|
|
rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
|
|
|
struct rpc_rqst *rqst, enum rpcrdma_chunktype wtype)
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
{
|
2017-08-10 23:47:36 +07:00
|
|
|
struct xdr_stream *xdr = &req->rl_stream;
|
2016-06-30 00:54:25 +07:00
|
|
|
struct rpcrdma_mr_seg *seg;
|
2016-06-30 00:54:16 +07:00
|
|
|
struct rpcrdma_mw *mw;
|
2017-08-15 02:38:30 +07:00
|
|
|
int nsegs, nchunks;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
__be32 *segcount;
|
|
|
|
|
2016-06-30 00:54:25 +07:00
|
|
|
seg = req->rl_segments;
|
2017-02-09 04:59:54 +07:00
|
|
|
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
if (nsegs < 0)
|
2017-08-10 23:47:36 +07:00
|
|
|
return nsegs;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
if (encode_item_present(xdr) < 0)
|
|
|
|
return -EMSGSIZE;
|
|
|
|
segcount = xdr_reserve_space(xdr, sizeof(*segcount));
|
|
|
|
if (unlikely(!segcount))
|
|
|
|
return -EMSGSIZE;
|
|
|
|
/* Actual value encoded below */
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
nchunks = 0;
|
|
|
|
do {
|
2017-08-15 02:38:30 +07:00
|
|
|
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
|
|
|
|
true, &mw);
|
|
|
|
if (IS_ERR(seg))
|
|
|
|
return PTR_ERR(seg);
|
2017-02-09 05:00:43 +07:00
|
|
|
rpcrdma_push_mw(mw, &req->rl_registered);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
if (encode_rdma_segment(xdr, mw) < 0)
|
|
|
|
return -EMSGSIZE;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
2016-06-30 00:54:16 +07:00
|
|
|
dprintk("RPC: %5u %s: %u@0x%016llx:0x%08x (%s)\n",
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__,
|
2016-06-30 00:54:16 +07:00
|
|
|
mw->mw_length, (unsigned long long)mw->mw_offset,
|
2017-08-15 02:38:30 +07:00
|
|
|
mw->mw_handle, mw->mw_nents < nsegs ? "more" : "last");
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.reply_chunk_count++;
|
|
|
|
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
|
|
|
|
nchunks++;
|
2017-08-15 02:38:30 +07:00
|
|
|
nsegs -= mw->mw_nents;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
} while (nsegs);
|
|
|
|
|
|
|
|
/* Update count of segments in the Reply chunk */
|
|
|
|
*segcount = cpu_to_be32(nchunks);
|
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
return 0;
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
}
|
|
|
|
|
2017-10-20 21:47:47 +07:00
|
|
|
/**
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
* rpcrdma_unmap_sendctx - DMA-unmap Send buffers
|
|
|
|
* @sc: sendctx containing SGEs to unmap
|
2017-10-20 21:47:47 +07:00
|
|
|
*
|
|
|
|
*/
|
|
|
|
void
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
rpcrdma_unmap_sendctx(struct rpcrdma_sendctx *sc)
|
2017-10-20 21:47:47 +07:00
|
|
|
{
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
struct rpcrdma_ia *ia = &sc->sc_xprt->rx_ia;
|
2017-10-20 21:47:47 +07:00
|
|
|
struct ib_sge *sge;
|
|
|
|
unsigned int count;
|
|
|
|
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
dprintk("RPC: %s: unmapping %u sges for sc=%p\n",
|
|
|
|
__func__, sc->sc_unmap_count, sc);
|
|
|
|
|
2017-10-20 21:47:47 +07:00
|
|
|
/* The first two SGEs contain the transport header and
|
|
|
|
* the inline buffer. These are always left mapped so
|
|
|
|
* they can be cheaply re-used.
|
|
|
|
*/
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
sge = &sc->sc_sges[2];
|
|
|
|
for (count = sc->sc_unmap_count; count; ++sge, --count)
|
2017-10-20 21:47:47 +07:00
|
|
|
ib_dma_unmap_page(ia->ri_device,
|
|
|
|
sge->addr, sge->length, DMA_TO_DEVICE);
|
2017-10-20 21:48:36 +07:00
|
|
|
|
|
|
|
if (test_and_clear_bit(RPCRDMA_REQ_F_TX_RESOURCES, &sc->sc_req->rl_flags)) {
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
wake_up_bit(&sc->sc_req->rl_flags, RPCRDMA_REQ_F_TX_RESOURCES);
|
|
|
|
}
|
2017-10-20 21:47:47 +07:00
|
|
|
}
|
|
|
|
|
2017-10-20 21:48:03 +07:00
|
|
|
/* Prepare an SGE for the RPC-over-RDMA transport header.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
static bool
|
|
|
|
rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
|
|
|
|
u32 len)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
struct rpcrdma_sendctx *sc = req->rl_sendctx;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
struct rpcrdma_regbuf *rb = req->rl_rdmabuf;
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
struct ib_sge *sge = sc->sc_sges;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
2017-10-20 21:48:03 +07:00
|
|
|
if (!rpcrdma_dma_map_regbuf(ia, rb))
|
|
|
|
goto out_regbuf;
|
|
|
|
sge->addr = rdmab_addr(rb);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
sge->length = len;
|
2017-10-20 21:48:03 +07:00
|
|
|
sge->lkey = rdmab_lkey(rb);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
2017-04-12 00:23:02 +07:00
|
|
|
ib_dma_sync_single_for_device(rdmab_device(rb), sge->addr,
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
sge->length, DMA_TO_DEVICE);
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
sc->sc_wr.num_sge++;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
return true;
|
2017-10-20 21:47:55 +07:00
|
|
|
|
|
|
|
out_regbuf:
|
|
|
|
pr_err("rpcrdma: failed to DMA map a Send buffer\n");
|
|
|
|
return false;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Prepare the Send SGEs. The head and tail iovec, and each entry
|
|
|
|
* in the page list, gets its own SGE.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
|
|
|
|
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
|
|
|
|
{
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
struct rpcrdma_sendctx *sc = req->rl_sendctx;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
unsigned int sge_no, page_base, len, remaining;
|
|
|
|
struct rpcrdma_regbuf *rb = req->rl_sendbuf;
|
|
|
|
struct ib_device *device = ia->ri_device;
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
struct ib_sge *sge = sc->sc_sges;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
u32 lkey = ia->ri_pd->local_dma_lkey;
|
|
|
|
struct page *page, **ppages;
|
|
|
|
|
|
|
|
/* The head iovec is straightforward, as it is already
|
|
|
|
* DMA-mapped. Sync the content that has changed.
|
|
|
|
*/
|
|
|
|
if (!rpcrdma_dma_map_regbuf(ia, rb))
|
2017-10-20 21:47:55 +07:00
|
|
|
goto out_regbuf;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
sge_no = 1;
|
|
|
|
sge[sge_no].addr = rdmab_addr(rb);
|
|
|
|
sge[sge_no].length = xdr->head[0].iov_len;
|
|
|
|
sge[sge_no].lkey = rdmab_lkey(rb);
|
2017-04-12 00:23:02 +07:00
|
|
|
ib_dma_sync_single_for_device(rdmab_device(rb), sge[sge_no].addr,
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
sge[sge_no].length, DMA_TO_DEVICE);
|
|
|
|
|
|
|
|
/* If there is a Read chunk, the page list is being handled
|
|
|
|
* via explicit RDMA, and thus is skipped here. However, the
|
|
|
|
* tail iovec may include an XDR pad for the page list, as
|
|
|
|
* well as additional content, and may not reside in the
|
|
|
|
* same page as the head iovec.
|
|
|
|
*/
|
|
|
|
if (rtype == rpcrdma_readch) {
|
|
|
|
len = xdr->tail[0].iov_len;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
/* Do not include the tail if it is only an XDR pad */
|
|
|
|
if (len < 4)
|
|
|
|
goto out;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
page = virt_to_page(xdr->tail[0].iov_base);
|
2017-06-08 22:53:16 +07:00
|
|
|
page_base = offset_in_page(xdr->tail[0].iov_base);
|
2007-09-11 00:50:42 +07:00
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
/* If the content in the page list is an odd length,
|
|
|
|
* xdr_write_pages() has added a pad at the beginning
|
|
|
|
* of the tail iovec. Force the tail's non-pad content
|
|
|
|
* to land at the next XDR position in the Send message.
|
|
|
|
*/
|
|
|
|
page_base += len & 3;
|
|
|
|
len -= len & 3;
|
|
|
|
goto map_tail;
|
|
|
|
}
|
2009-03-12 01:37:55 +07:00
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
/* If there is a page list present, temporarily DMA map
|
|
|
|
* and prepare an SGE for each page to be sent.
|
|
|
|
*/
|
|
|
|
if (xdr->page_len) {
|
|
|
|
ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
|
2017-06-08 22:53:16 +07:00
|
|
|
page_base = offset_in_page(xdr->page_base);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
remaining = xdr->page_len;
|
|
|
|
while (remaining) {
|
|
|
|
sge_no++;
|
|
|
|
if (sge_no > RPCRDMA_MAX_SEND_SGES - 2)
|
|
|
|
goto out_mapping_overflow;
|
|
|
|
|
|
|
|
len = min_t(u32, PAGE_SIZE - page_base, remaining);
|
|
|
|
sge[sge_no].addr = ib_dma_map_page(device, *ppages,
|
|
|
|
page_base, len,
|
|
|
|
DMA_TO_DEVICE);
|
|
|
|
if (ib_dma_mapping_error(device, sge[sge_no].addr))
|
|
|
|
goto out_mapping_err;
|
|
|
|
sge[sge_no].length = len;
|
|
|
|
sge[sge_no].lkey = lkey;
|
|
|
|
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
sc->sc_unmap_count++;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
ppages++;
|
|
|
|
remaining -= len;
|
|
|
|
page_base = 0;
|
2009-03-12 01:37:55 +07:00
|
|
|
}
|
|
|
|
}
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
|
|
|
/* The tail iovec is not always constructed in the same
|
|
|
|
* page where the head iovec resides (see, for example,
|
|
|
|
* gss_wrap_req_priv). To neatly accommodate that case,
|
|
|
|
* DMA map it separately.
|
|
|
|
*/
|
|
|
|
if (xdr->tail[0].iov_len) {
|
|
|
|
page = virt_to_page(xdr->tail[0].iov_base);
|
2017-06-08 22:53:16 +07:00
|
|
|
page_base = offset_in_page(xdr->tail[0].iov_base);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
len = xdr->tail[0].iov_len;
|
|
|
|
|
|
|
|
map_tail:
|
|
|
|
sge_no++;
|
|
|
|
sge[sge_no].addr = ib_dma_map_page(device, page,
|
|
|
|
page_base, len,
|
|
|
|
DMA_TO_DEVICE);
|
|
|
|
if (ib_dma_mapping_error(device, sge[sge_no].addr))
|
|
|
|
goto out_mapping_err;
|
|
|
|
sge[sge_no].length = len;
|
|
|
|
sge[sge_no].lkey = lkey;
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
sc->sc_unmap_count++;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
|
|
|
out:
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
sc->sc_wr.num_sge += sge_no;
|
2017-10-20 21:48:36 +07:00
|
|
|
if (sc->sc_unmap_count)
|
|
|
|
__set_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
return true;
|
|
|
|
|
2017-10-20 21:47:55 +07:00
|
|
|
out_regbuf:
|
|
|
|
pr_err("rpcrdma: failed to DMA map a Send buffer\n");
|
|
|
|
return false;
|
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
out_mapping_overflow:
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
rpcrdma_unmap_sendctx(sc);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
pr_err("rpcrdma: too many Send SGEs (%u)\n", sge_no);
|
|
|
|
return false;
|
|
|
|
|
|
|
|
out_mapping_err:
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
rpcrdma_unmap_sendctx(sc);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
pr_err("rpcrdma: Send mapping error\n");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-10-20 21:47:55 +07:00
|
|
|
/**
|
|
|
|
* rpcrdma_prepare_send_sges - Construct SGEs for a Send WR
|
|
|
|
* @r_xprt: controlling transport
|
|
|
|
* @req: context of RPC Call being marshalled
|
|
|
|
* @hdrlen: size of transport header, in bytes
|
|
|
|
* @xdr: xdr_buf containing RPC Call
|
|
|
|
* @rtype: chunk type being encoded
|
|
|
|
*
|
|
|
|
* Returns 0 on success; otherwise a negative errno is returned.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
|
|
|
|
struct rpcrdma_req *req, u32 hdrlen,
|
|
|
|
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
{
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 21:48:12 +07:00
|
|
|
req->rl_sendctx = rpcrdma_sendctx_get_locked(&r_xprt->rx_buf);
|
|
|
|
if (!req->rl_sendctx)
|
|
|
|
return -ENOBUFS;
|
|
|
|
req->rl_sendctx->sc_wr.num_sge = 0;
|
|
|
|
req->rl_sendctx->sc_unmap_count = 0;
|
2017-10-20 21:48:36 +07:00
|
|
|
req->rl_sendctx->sc_req = req;
|
|
|
|
__clear_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags);
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
2017-10-20 21:47:55 +07:00
|
|
|
if (!rpcrdma_prepare_hdr_sge(&r_xprt->rx_ia, req, hdrlen))
|
|
|
|
return -EIO;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
|
|
|
if (rtype != rpcrdma_areadch)
|
2017-10-20 21:47:55 +07:00
|
|
|
if (!rpcrdma_prepare_msg_sges(&r_xprt->rx_ia, req, xdr, rtype))
|
|
|
|
return -EIO;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
|
2017-10-20 21:47:55 +07:00
|
|
|
return 0;
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
|
|
|
}
|
|
|
|
|
2017-08-10 23:47:12 +07:00
|
|
|
/**
|
|
|
|
* rpcrdma_marshal_req - Marshal and send one RPC request
|
|
|
|
* @r_xprt: controlling transport
|
|
|
|
* @rqst: RPC request to be marshaled
|
|
|
|
*
|
|
|
|
* For the RPC in "rqst", this function:
|
|
|
|
* - Chooses the transfer mode (eg., RDMA_MSG or RDMA_NOMSG)
|
|
|
|
* - Registers Read, Write, and Reply chunks
|
|
|
|
* - Constructs the transport header
|
|
|
|
* - Posts a Send WR to send the transport header and request
|
2007-09-11 00:50:42 +07:00
|
|
|
*
|
2017-08-10 23:47:12 +07:00
|
|
|
* Returns:
|
|
|
|
* %0 if the RPC was sent successfully,
|
|
|
|
* %-ENOTCONN if the connection was lost,
|
|
|
|
* %-EAGAIN if not enough pages are available for on-demand reply buffer,
|
|
|
|
* %-ENOBUFS if no MRs are available to register chunks,
|
2017-08-10 23:47:28 +07:00
|
|
|
* %-EMSGSIZE if the transport header is too small,
|
2017-08-10 23:47:12 +07:00
|
|
|
* %-EIO if a permanent problem occurred while marshaling.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
|
|
|
int
|
2017-08-10 23:47:12 +07:00
|
|
|
rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
|
|
|
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
|
2017-08-10 23:47:28 +07:00
|
|
|
struct xdr_stream *xdr = &req->rl_stream;
|
2015-03-31 01:33:53 +07:00
|
|
|
enum rpcrdma_chunktype rtype, wtype;
|
2016-06-30 00:55:06 +07:00
|
|
|
bool ddp_allowed;
|
2017-08-10 23:47:28 +07:00
|
|
|
__be32 *p;
|
2017-08-10 23:47:36 +07:00
|
|
|
int ret;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-10-25 04:27:59 +07:00
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
|
|
|
if (test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state))
|
|
|
|
return rpcrdma_bc_marshal_reply(rqst);
|
|
|
|
#endif
|
|
|
|
|
2017-08-10 23:47:28 +07:00
|
|
|
rpcrdma_set_xdrlen(&req->rl_hdrbuf, 0);
|
|
|
|
xdr_init_encode(xdr, &req->rl_hdrbuf,
|
|
|
|
req->rl_rdmabuf->rg_base);
|
|
|
|
|
|
|
|
/* Fixed header fields */
|
2017-08-10 23:47:36 +07:00
|
|
|
ret = -EMSGSIZE;
|
2017-08-10 23:47:28 +07:00
|
|
|
p = xdr_reserve_space(xdr, 4 * sizeof(*p));
|
|
|
|
if (!p)
|
|
|
|
goto out_err;
|
|
|
|
*p++ = rqst->rq_xid;
|
|
|
|
*p++ = rpcrdma_version;
|
|
|
|
*p++ = cpu_to_be32(r_xprt->rx_buf.rb_max_requests);
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2016-06-30 00:55:06 +07:00
|
|
|
/* When the ULP employs a GSS flavor that guarantees integrity
|
|
|
|
* or privacy, direct data placement of individual data items
|
|
|
|
* is not allowed.
|
|
|
|
*/
|
|
|
|
ddp_allowed = !(rqst->rq_cred->cr_auth->au_flags &
|
|
|
|
RPCAUTH_AUTH_DATATOUCH);
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
/*
|
|
|
|
* Chunks needed for results?
|
|
|
|
*
|
|
|
|
* o If the expected result is under the inline threshold, all ops
|
2015-08-04 00:04:08 +07:00
|
|
|
* return as inline.
|
2016-05-03 01:41:14 +07:00
|
|
|
* o Large read ops return data as write chunk(s), header as
|
|
|
|
* inline.
|
2007-09-11 00:50:42 +07:00
|
|
|
* o Large non-read ops return as a single reply chunk.
|
|
|
|
*/
|
2016-05-03 01:41:14 +07:00
|
|
|
if (rpcrdma_results_inline(r_xprt, rqst))
|
2015-08-04 00:03:58 +07:00
|
|
|
wtype = rpcrdma_noch;
|
2016-06-30 00:55:06 +07:00
|
|
|
else if (ddp_allowed && rqst->rq_rcv_buf.flags & XDRBUF_READ)
|
2016-05-03 01:41:14 +07:00
|
|
|
wtype = rpcrdma_writech;
|
2007-09-11 00:50:42 +07:00
|
|
|
else
|
2015-03-31 01:33:53 +07:00
|
|
|
wtype = rpcrdma_replych;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Chunks needed for arguments?
|
|
|
|
*
|
|
|
|
* o If the total request is under the inline threshold, all ops
|
|
|
|
* are sent as inline.
|
|
|
|
* o Large write ops transmit data as read chunk(s), header as
|
|
|
|
* inline.
|
2015-08-04 00:04:26 +07:00
|
|
|
* o Large non-write ops are sent with the entire message as a
|
|
|
|
* single read chunk (protocol 0-position special case).
|
2007-09-11 00:50:42 +07:00
|
|
|
*
|
2015-08-04 00:04:26 +07:00
|
|
|
* This assumes that the upper layer does not present a request
|
|
|
|
* that both has a data payload, and whose non-data arguments
|
|
|
|
* by themselves are larger than the inline threshold.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
2016-05-03 01:41:05 +07:00
|
|
|
if (rpcrdma_args_inline(r_xprt, rqst)) {
|
2017-08-10 23:47:28 +07:00
|
|
|
*p++ = rdma_msg;
|
2015-03-31 01:33:53 +07:00
|
|
|
rtype = rpcrdma_noch;
|
2016-06-30 00:55:06 +07:00
|
|
|
} else if (ddp_allowed && rqst->rq_snd_buf.flags & XDRBUF_WRITE) {
|
2017-08-10 23:47:28 +07:00
|
|
|
*p++ = rdma_msg;
|
2015-03-31 01:33:53 +07:00
|
|
|
rtype = rpcrdma_readch;
|
2015-08-04 00:04:26 +07:00
|
|
|
} else {
|
2015-08-04 00:04:45 +07:00
|
|
|
r_xprt->rx_stats.nomsg_call_count++;
|
2017-08-10 23:47:28 +07:00
|
|
|
*p++ = rdma_nomsg;
|
2015-08-04 00:04:26 +07:00
|
|
|
rtype = rpcrdma_areadch;
|
|
|
|
}
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-12-15 08:57:14 +07:00
|
|
|
/* If this is a retransmit, discard previously registered
|
|
|
|
* chunks. Very likely the connection has been replaced,
|
|
|
|
* so these registrations are invalid and unusable.
|
|
|
|
*/
|
|
|
|
while (unlikely(!list_empty(&req->rl_registered))) {
|
|
|
|
struct rpcrdma_mw *mw;
|
|
|
|
|
|
|
|
mw = rpcrdma_pop_mw(&req->rl_registered);
|
|
|
|
rpcrdma_defer_mr_recovery(mw);
|
|
|
|
}
|
|
|
|
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
/* This implementation supports the following combinations
|
|
|
|
* of chunk lists in one RPC-over-RDMA Call message:
|
|
|
|
*
|
|
|
|
* - Read list
|
|
|
|
* - Write list
|
|
|
|
* - Reply chunk
|
|
|
|
* - Read list + Reply chunk
|
|
|
|
*
|
|
|
|
* It might not yet support the following combinations:
|
|
|
|
*
|
|
|
|
* - Read list + Write list
|
|
|
|
*
|
|
|
|
* It does not support the following combinations:
|
|
|
|
*
|
|
|
|
* - Write list + Reply chunk
|
|
|
|
* - Read list + Write list + Reply chunk
|
|
|
|
*
|
|
|
|
* This implementation supports only a single chunk in each
|
|
|
|
* Read or Write list. Thus for example the client cannot
|
|
|
|
* send a Call message with a Position Zero Read chunk and a
|
|
|
|
* regular Read chunk at the same time.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
2017-08-10 23:47:36 +07:00
|
|
|
if (rtype != rpcrdma_noch) {
|
|
|
|
ret = rpcrdma_encode_read_list(r_xprt, req, rqst, rtype);
|
|
|
|
if (ret)
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
ret = encode_item_not_present(xdr);
|
|
|
|
if (ret)
|
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
|
|
|
goto out_err;
|
2017-08-10 23:47:36 +07:00
|
|
|
|
|
|
|
if (wtype == rpcrdma_writech) {
|
|
|
|
ret = rpcrdma_encode_write_list(r_xprt, req, rqst, wtype);
|
|
|
|
if (ret)
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
ret = encode_item_not_present(xdr);
|
|
|
|
if (ret)
|
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
|
|
|
goto out_err;
|
2017-08-10 23:47:36 +07:00
|
|
|
|
|
|
|
if (wtype != rpcrdma_replych)
|
|
|
|
ret = encode_item_not_present(xdr);
|
|
|
|
else
|
|
|
|
ret = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, wtype);
|
|
|
|
if (ret)
|
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
|
|
|
goto out_err;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-08-10 23:47:36 +07:00
|
|
|
dprintk("RPC: %5u %s: %s/%s: hdrlen %u rpclen\n",
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__,
|
|
|
|
transfertypes[rtype], transfertypes[wtype],
|
2017-08-10 23:47:36 +07:00
|
|
|
xdr_stream_pos(xdr));
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-10-20 21:47:55 +07:00
|
|
|
ret = rpcrdma_prepare_send_sges(r_xprt, req, xdr_stream_pos(xdr),
|
|
|
|
&rqst->rq_snd_buf, rtype);
|
|
|
|
if (ret)
|
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
|
|
|
goto out_err;
|
2007-09-11 00:50:42 +07:00
|
|
|
return 0;
|
2016-05-03 01:41:05 +07:00
|
|
|
|
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
|
|
|
out_err:
|
2017-08-10 23:47:36 +07:00
|
|
|
if (ret != -ENOBUFS) {
|
|
|
|
pr_err("rpcrdma: header marshaling failed (%d)\n", ret);
|
2017-04-12 00:23:51 +07:00
|
|
|
r_xprt->rx_stats.failed_marshal_count++;
|
|
|
|
}
|
2017-08-10 23:47:36 +07:00
|
|
|
return ret;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
/**
|
|
|
|
* rpcrdma_inline_fixup - Scatter inline received data into rqst's iovecs
|
|
|
|
* @rqst: controlling RPC request
|
|
|
|
* @srcp: points to RPC message payload in receive buffer
|
|
|
|
* @copy_len: remaining length of receive buffer content
|
|
|
|
* @pad: Write chunk pad bytes needed (zero for pure inline)
|
|
|
|
*
|
|
|
|
* The upper layer has set the maximum number of bytes it can
|
|
|
|
* receive in each component of rq_rcv_buf. These values are set in
|
|
|
|
* the head.iov_len, page_len, tail.iov_len, and buflen fields.
|
2016-06-30 00:54:49 +07:00
|
|
|
*
|
|
|
|
* Unlike the TCP equivalent (xdr_partial_copy_from_skb), in
|
|
|
|
* many cases this function simply updates iov_base pointers in
|
|
|
|
* rq_rcv_buf to point directly to the received reply data, to
|
|
|
|
* avoid copying reply data.
|
2016-06-30 00:54:58 +07:00
|
|
|
*
|
|
|
|
* Returns the count of bytes which had to be memcopied.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
2016-06-30 00:54:58 +07:00
|
|
|
static unsigned long
|
2008-10-10 02:01:11 +07:00
|
|
|
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
2016-06-30 00:54:58 +07:00
|
|
|
unsigned long fixup_copy_count;
|
|
|
|
int i, npages, curlen;
|
2007-09-11 00:50:42 +07:00
|
|
|
char *destp;
|
2011-02-10 02:45:28 +07:00
|
|
|
struct page **ppages;
|
|
|
|
int page_base;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
/* The head iovec is redirected to the RPC reply message
|
|
|
|
* in the receive buffer, to avoid a memcopy.
|
|
|
|
*/
|
|
|
|
rqst->rq_rcv_buf.head[0].iov_base = srcp;
|
2016-06-30 00:54:49 +07:00
|
|
|
rqst->rq_private_buf.head[0].iov_base = srcp;
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
|
|
|
|
/* The contents of the receive buffer that follow
|
|
|
|
* head.iov_len bytes are copied into the page list.
|
|
|
|
*/
|
2007-09-11 00:50:42 +07:00
|
|
|
curlen = rqst->rq_rcv_buf.head[0].iov_len;
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
if (curlen > copy_len)
|
2007-09-11 00:50:42 +07:00
|
|
|
curlen = copy_len;
|
|
|
|
dprintk("RPC: %s: srcp 0x%p len %d hdrlen %d\n",
|
|
|
|
__func__, srcp, copy_len, curlen);
|
|
|
|
srcp += curlen;
|
|
|
|
copy_len -= curlen;
|
|
|
|
|
2017-06-08 22:53:16 +07:00
|
|
|
ppages = rqst->rq_rcv_buf.pages +
|
|
|
|
(rqst->rq_rcv_buf.page_base >> PAGE_SHIFT);
|
|
|
|
page_base = offset_in_page(rqst->rq_rcv_buf.page_base);
|
2016-06-30 00:54:58 +07:00
|
|
|
fixup_copy_count = 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
if (copy_len && rqst->rq_rcv_buf.page_len) {
|
2016-06-30 00:54:33 +07:00
|
|
|
int pagelist_len;
|
|
|
|
|
|
|
|
pagelist_len = rqst->rq_rcv_buf.page_len;
|
|
|
|
if (pagelist_len > copy_len)
|
|
|
|
pagelist_len = copy_len;
|
|
|
|
npages = PAGE_ALIGN(page_base + pagelist_len) >> PAGE_SHIFT;
|
2016-06-30 00:54:58 +07:00
|
|
|
for (i = 0; i < npages; i++) {
|
2011-02-10 02:45:28 +07:00
|
|
|
curlen = PAGE_SIZE - page_base;
|
2016-06-30 00:54:33 +07:00
|
|
|
if (curlen > pagelist_len)
|
|
|
|
curlen = pagelist_len;
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
dprintk("RPC: %s: page %d"
|
|
|
|
" srcp 0x%p len %d curlen %d\n",
|
|
|
|
__func__, i, srcp, copy_len, curlen);
|
2011-11-25 22:14:40 +07:00
|
|
|
destp = kmap_atomic(ppages[i]);
|
2011-02-10 02:45:28 +07:00
|
|
|
memcpy(destp + page_base, srcp, curlen);
|
|
|
|
flush_dcache_page(ppages[i]);
|
2011-11-25 22:14:40 +07:00
|
|
|
kunmap_atomic(destp);
|
2007-09-11 00:50:42 +07:00
|
|
|
srcp += curlen;
|
|
|
|
copy_len -= curlen;
|
2016-06-30 00:54:58 +07:00
|
|
|
fixup_copy_count += curlen;
|
2016-06-30 00:54:33 +07:00
|
|
|
pagelist_len -= curlen;
|
|
|
|
if (!pagelist_len)
|
2007-09-11 00:50:42 +07:00
|
|
|
break;
|
2011-02-10 02:45:28 +07:00
|
|
|
page_base = 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
/* Implicit padding for the last segment in a Write
|
|
|
|
* chunk is inserted inline at the front of the tail
|
|
|
|
* iovec. The upper layer ignores the content of
|
|
|
|
* the pad. Simply ensure inline content in the tail
|
|
|
|
* that follows the Write chunk is properly aligned.
|
|
|
|
*/
|
|
|
|
if (pad)
|
|
|
|
srcp -= pad;
|
2008-10-10 02:01:11 +07:00
|
|
|
}
|
|
|
|
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
/* The tail iovec is redirected to the remaining data
|
|
|
|
* in the receive buffer, to avoid a memcopy.
|
|
|
|
*/
|
2016-06-30 00:54:49 +07:00
|
|
|
if (copy_len || pad) {
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
rqst->rq_rcv_buf.tail[0].iov_base = srcp;
|
2016-06-30 00:54:49 +07:00
|
|
|
rqst->rq_private_buf.tail[0].iov_base = srcp;
|
|
|
|
}
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
|
|
|
|
2016-06-30 00:54:58 +07:00
|
|
|
return fixup_copy_count;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
2015-10-25 04:28:08 +07:00
|
|
|
/* By convention, backchannel calls arrive via rdma_msg type
|
|
|
|
* messages, and never populate the chunk lists. This makes
|
|
|
|
* the RPC/RDMA header small and fixed in size, so it is
|
|
|
|
* straightforward to check the RPC header's direction field.
|
|
|
|
*/
|
|
|
|
static bool
|
2017-10-17 02:01:14 +07:00
|
|
|
rpcrdma_is_bcall(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep)
|
2017-08-04 01:30:11 +07:00
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
2015-10-25 04:28:08 +07:00
|
|
|
{
|
2017-08-04 01:30:11 +07:00
|
|
|
struct xdr_stream *xdr = &rep->rr_stream;
|
|
|
|
__be32 *p;
|
2015-10-25 04:28:08 +07:00
|
|
|
|
2017-10-17 02:01:14 +07:00
|
|
|
if (rep->rr_proc != rdma_msg)
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
2017-08-04 01:30:11 +07:00
|
|
|
|
|
|
|
/* Peek at stream contents without advancing. */
|
|
|
|
p = xdr_inline_decode(xdr, 0);
|
|
|
|
|
|
|
|
/* Chunk lists */
|
|
|
|
if (*p++ != xdr_zero)
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
2017-08-04 01:30:11 +07:00
|
|
|
if (*p++ != xdr_zero)
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
2017-08-04 01:30:11 +07:00
|
|
|
if (*p++ != xdr_zero)
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
|
|
|
|
2017-08-04 01:30:11 +07:00
|
|
|
/* RPC header */
|
2017-10-17 02:01:14 +07:00
|
|
|
if (*p++ != rep->rr_xid)
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
2017-08-04 01:30:11 +07:00
|
|
|
if (*p != cpu_to_be32(RPC_CALL))
|
2015-10-25 04:28:08 +07:00
|
|
|
return false;
|
|
|
|
|
2017-08-04 01:30:11 +07:00
|
|
|
/* Now that we are sure this is a backchannel call,
|
|
|
|
* advance to the RPC header.
|
|
|
|
*/
|
|
|
|
p = xdr_inline_decode(xdr, 3 * sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
goto out_short;
|
|
|
|
|
|
|
|
rpcrdma_bc_receive_call(r_xprt, rep);
|
|
|
|
return true;
|
|
|
|
|
|
|
|
out_short:
|
|
|
|
pr_warn("RPC/RDMA short backward direction call\n");
|
|
|
|
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
|
|
|
|
xprt_disconnect_done(&r_xprt->rx_xprt);
|
2015-10-25 04:28:08 +07:00
|
|
|
return true;
|
|
|
|
}
|
2017-08-04 01:30:11 +07:00
|
|
|
#else /* CONFIG_SUNRPC_BACKCHANNEL */
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2015-10-25 04:28:08 +07:00
|
|
|
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
static int decode_rdma_segment(struct xdr_stream *xdr, u32 *length)
|
2017-08-04 01:30:19 +07:00
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
p = xdr_inline_decode(xdr, 4 * sizeof(*p));
|
2017-08-04 01:30:19 +07:00
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
ifdebug(FACILITY) {
|
|
|
|
u64 offset;
|
|
|
|
u32 handle;
|
|
|
|
|
|
|
|
handle = be32_to_cpup(p++);
|
|
|
|
*length = be32_to_cpup(p++);
|
|
|
|
xdr_decode_hyper(p, &offset);
|
|
|
|
dprintk("RPC: %s: segment %u@0x%016llx:0x%08x\n",
|
|
|
|
__func__, *length, (unsigned long long)offset,
|
|
|
|
handle);
|
|
|
|
} else {
|
|
|
|
*length = be32_to_cpup(p + 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
static int decode_write_chunk(struct xdr_stream *xdr, u32 *length)
|
|
|
|
{
|
|
|
|
u32 segcount, seglength;
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_inline_decode(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
*length = 0;
|
|
|
|
segcount = be32_to_cpup(p);
|
|
|
|
while (segcount--) {
|
|
|
|
if (decode_rdma_segment(xdr, &seglength))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
2017-08-04 01:30:27 +07:00
|
|
|
*length += seglength;
|
|
|
|
}
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
dprintk("RPC: %s: segcount=%u, %u bytes\n",
|
|
|
|
__func__, be32_to_cpup(p), *length);
|
|
|
|
return 0;
|
|
|
|
}
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
/* In RPC-over-RDMA Version One replies, a Read list is never
|
|
|
|
* expected. This decoder is a stub that returns an error if
|
|
|
|
* a Read list is present.
|
|
|
|
*/
|
|
|
|
static int decode_read_list(struct xdr_stream *xdr)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_inline_decode(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
|
|
|
if (unlikely(*p != xdr_zero))
|
|
|
|
return -EIO;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Supports only one Write chunk in the Write list
|
|
|
|
*/
|
|
|
|
static int decode_write_list(struct xdr_stream *xdr, u32 *length)
|
|
|
|
{
|
|
|
|
u32 chunklen;
|
|
|
|
bool first;
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
*length = 0;
|
|
|
|
first = true;
|
|
|
|
do {
|
2017-08-04 01:30:19 +07:00
|
|
|
p = xdr_inline_decode(xdr, sizeof(*p));
|
2017-08-04 01:30:27 +07:00
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
|
|
|
if (*p == xdr_zero)
|
|
|
|
break;
|
|
|
|
if (!first)
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
if (decode_write_chunk(xdr, &chunklen))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
2017-08-04 01:30:27 +07:00
|
|
|
*length += chunklen;
|
|
|
|
first = false;
|
|
|
|
} while (true);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int decode_reply_chunk(struct xdr_stream *xdr, u32 *length)
|
|
|
|
{
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_inline_decode(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
*length = 0;
|
|
|
|
if (*p != xdr_zero)
|
|
|
|
if (decode_write_chunk(xdr, length))
|
|
|
|
return -EIO;
|
|
|
|
return 0;
|
|
|
|
}
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
static int
|
|
|
|
rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep,
|
|
|
|
struct rpc_rqst *rqst)
|
|
|
|
{
|
|
|
|
struct xdr_stream *xdr = &rep->rr_stream;
|
|
|
|
u32 writelist, replychunk, rpclen;
|
|
|
|
char *base;
|
|
|
|
|
|
|
|
/* Decode the chunk lists */
|
|
|
|
if (decode_read_list(xdr))
|
|
|
|
return -EIO;
|
|
|
|
if (decode_write_list(xdr, &writelist))
|
|
|
|
return -EIO;
|
|
|
|
if (decode_reply_chunk(xdr, &replychunk))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
/* RDMA_MSG sanity checks */
|
|
|
|
if (unlikely(replychunk))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
/* Build the RPC reply's Payload stream in rqst->rq_rcv_buf */
|
|
|
|
base = (char *)xdr_inline_decode(xdr, 0);
|
|
|
|
rpclen = xdr_stream_remaining(xdr);
|
2017-08-04 01:30:19 +07:00
|
|
|
r_xprt->rx_stats.fixup_copy_count +=
|
2017-08-04 01:30:27 +07:00
|
|
|
rpcrdma_inline_fixup(rqst, base, rpclen, writelist & 3);
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
r_xprt->rx_stats.total_rdma_reply += writelist;
|
|
|
|
return rpclen + xdr_align_size(writelist);
|
2017-08-04 01:30:19 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int
|
|
|
|
rpcrdma_decode_nomsg(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep)
|
|
|
|
{
|
|
|
|
struct xdr_stream *xdr = &rep->rr_stream;
|
2017-08-04 01:30:27 +07:00
|
|
|
u32 writelist, replychunk;
|
2017-08-04 01:30:19 +07:00
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
/* Decode the chunk lists */
|
|
|
|
if (decode_read_list(xdr))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
2017-08-04 01:30:27 +07:00
|
|
|
if (decode_write_list(xdr, &writelist))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
2017-08-04 01:30:27 +07:00
|
|
|
if (decode_reply_chunk(xdr, &replychunk))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
/* RDMA_NOMSG sanity checks */
|
|
|
|
if (unlikely(writelist))
|
|
|
|
return -EIO;
|
|
|
|
if (unlikely(!replychunk))
|
2017-08-04 01:30:19 +07:00
|
|
|
return -EIO;
|
|
|
|
|
2017-08-04 01:30:27 +07:00
|
|
|
/* Reply chunk buffer already is the reply vector */
|
|
|
|
r_xprt->rx_stats.total_rdma_reply += replychunk;
|
|
|
|
return replychunk;
|
2017-08-04 01:30:19 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int
|
|
|
|
rpcrdma_decode_error(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep,
|
|
|
|
struct rpc_rqst *rqst)
|
|
|
|
{
|
|
|
|
struct xdr_stream *xdr = &rep->rr_stream;
|
|
|
|
__be32 *p;
|
|
|
|
|
|
|
|
p = xdr_inline_decode(xdr, sizeof(*p));
|
|
|
|
if (unlikely(!p))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
switch (*p) {
|
|
|
|
case err_vers:
|
|
|
|
p = xdr_inline_decode(xdr, 2 * sizeof(*p));
|
|
|
|
if (!p)
|
|
|
|
break;
|
|
|
|
dprintk("RPC: %5u: %s: server reports version error (%u-%u)\n",
|
|
|
|
rqst->rq_task->tk_pid, __func__,
|
|
|
|
be32_to_cpup(p), be32_to_cpu(*(p + 1)));
|
|
|
|
break;
|
|
|
|
case err_chunk:
|
|
|
|
dprintk("RPC: %5u: %s: server reports header decoding error\n",
|
|
|
|
rqst->rq_task->tk_pid, __func__);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
dprintk("RPC: %5u: %s: server reports unrecognized error %d\n",
|
|
|
|
rqst->rq_task->tk_pid, __func__, be32_to_cpup(p));
|
|
|
|
}
|
|
|
|
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
|
|
|
return -EREMOTEIO;
|
|
|
|
}
|
|
|
|
|
2017-10-17 02:01:22 +07:00
|
|
|
/* Perform XID lookup, reconstruction of the RPC reply, and
|
|
|
|
* RPC completion while holding the transport lock to ensure
|
|
|
|
* the rep, rqst, and rq_task pointers remain stable.
|
|
|
|
*/
|
|
|
|
void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
|
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
|
|
|
|
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
|
|
|
|
struct rpc_rqst *rqst = rep->rr_rqst;
|
|
|
|
unsigned long cwnd;
|
|
|
|
int status;
|
|
|
|
|
|
|
|
xprt->reestablish_timeout = 0;
|
|
|
|
|
|
|
|
switch (rep->rr_proc) {
|
|
|
|
case rdma_msg:
|
|
|
|
status = rpcrdma_decode_msg(r_xprt, rep, rqst);
|
|
|
|
break;
|
|
|
|
case rdma_nomsg:
|
|
|
|
status = rpcrdma_decode_nomsg(r_xprt, rep);
|
|
|
|
break;
|
|
|
|
case rdma_error:
|
|
|
|
status = rpcrdma_decode_error(r_xprt, rep, rqst);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
status = -EIO;
|
|
|
|
}
|
|
|
|
if (status < 0)
|
|
|
|
goto out_badheader;
|
|
|
|
|
|
|
|
out:
|
|
|
|
spin_lock(&xprt->recv_lock);
|
|
|
|
cwnd = xprt->cwnd;
|
2017-10-17 02:01:39 +07:00
|
|
|
xprt->cwnd = r_xprt->rx_buf.rb_credits << RPC_CWNDSHIFT;
|
2017-10-17 02:01:22 +07:00
|
|
|
if (xprt->cwnd > cwnd)
|
|
|
|
xprt_release_rqst_cong(rqst->rq_task);
|
|
|
|
|
|
|
|
xprt_complete_rqst(rqst->rq_task, status);
|
|
|
|
xprt_unpin_rqst(rqst);
|
|
|
|
spin_unlock(&xprt->recv_lock);
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* If the incoming reply terminated a pending RPC, the next
|
|
|
|
* RPC call will post a replacement receive buffer as it is
|
|
|
|
* being marshaled.
|
|
|
|
*/
|
|
|
|
out_badheader:
|
|
|
|
dprintk("RPC: %5u %s: invalid rpcrdma reply (type %u)\n",
|
|
|
|
rqst->rq_task->tk_pid, __func__, be32_to_cpu(rep->rr_proc));
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
|
|
|
status = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-10-20 21:48:28 +07:00
|
|
|
void rpcrdma_release_rqst(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
|
|
|
|
{
|
|
|
|
/* Invalidate and unmap the data payloads before waking
|
|
|
|
* the waiting application. This guarantees the memory
|
|
|
|
* regions are properly fenced from the server before the
|
|
|
|
* application accesses the data. It also ensures proper
|
|
|
|
* send flow control: waking the next RPC waits until this
|
|
|
|
* RPC has relinquished all its Send Queue entries.
|
|
|
|
*/
|
|
|
|
if (!list_empty(&req->rl_registered))
|
|
|
|
r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
|
|
|
|
&req->rl_registered);
|
2017-10-20 21:48:36 +07:00
|
|
|
|
|
|
|
/* Ensure that any DMA mapped pages associated with
|
|
|
|
* the Send of the RPC Call have been unmapped before
|
|
|
|
* allowing the RPC to complete. This protects argument
|
|
|
|
* memory not controlled by the RPC client from being
|
|
|
|
* re-used before we're done with it.
|
|
|
|
*/
|
|
|
|
if (test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
|
|
|
|
r_xprt->rx_stats.reply_waits_for_send++;
|
|
|
|
out_of_line_wait_on_bit(&req->rl_flags,
|
|
|
|
RPCRDMA_REQ_F_TX_RESOURCES,
|
|
|
|
bit_wait,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
|
|
|
}
|
2017-10-20 21:48:28 +07:00
|
|
|
}
|
|
|
|
|
2017-10-17 02:01:30 +07:00
|
|
|
/* Reply handling runs in the poll worker thread. Anything that
|
|
|
|
* might wait is deferred to a separate workqueue.
|
|
|
|
*/
|
|
|
|
void rpcrdma_deferred_completion(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct rpcrdma_rep *rep =
|
|
|
|
container_of(work, struct rpcrdma_rep, rr_work);
|
|
|
|
struct rpcrdma_req *req = rpcr_to_rdmar(rep->rr_rqst);
|
2017-12-15 08:56:26 +07:00
|
|
|
struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
|
2017-10-17 02:01:30 +07:00
|
|
|
|
2017-12-15 08:56:26 +07:00
|
|
|
if (rep->rr_wc_flags & IB_WC_WITH_INVALIDATE)
|
|
|
|
r_xprt->rx_ia.ri_ops->ro_reminv(rep, &req->rl_registered);
|
|
|
|
rpcrdma_release_rqst(r_xprt, req);
|
2017-10-17 02:01:30 +07:00
|
|
|
rpcrdma_complete_rqst(rep);
|
|
|
|
}
|
|
|
|
|
2015-10-25 04:27:10 +07:00
|
|
|
/* Process received RPC/RDMA messages.
|
|
|
|
*
|
2007-09-11 00:50:42 +07:00
|
|
|
* Errors must result in the RPC task either being awakened, or
|
|
|
|
* allowed to timeout, to discover the errors at that time.
|
|
|
|
*/
|
2017-10-17 02:01:30 +07:00
|
|
|
void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 22:52:20 +07:00
|
|
|
struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
|
|
|
|
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
|
2017-10-17 02:01:39 +07:00
|
|
|
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
|
2007-09-11 00:50:42 +07:00
|
|
|
struct rpcrdma_req *req;
|
|
|
|
struct rpc_rqst *rqst;
|
2017-10-17 02:01:39 +07:00
|
|
|
u32 credits;
|
2017-10-17 02:01:14 +07:00
|
|
|
__be32 *p;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-10-25 04:26:54 +07:00
|
|
|
dprintk("RPC: %s: incoming rep %p\n", __func__, rep);
|
|
|
|
|
2017-08-04 01:30:44 +07:00
|
|
|
if (rep->rr_hdrbuf.head[0].iov_len == 0)
|
2015-10-25 04:26:54 +07:00
|
|
|
goto out_badstatus;
|
2017-08-04 01:30:03 +07:00
|
|
|
|
2017-10-17 02:01:14 +07:00
|
|
|
xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
|
2017-08-04 01:30:03 +07:00
|
|
|
rep->rr_hdrbuf.head[0].iov_base);
|
|
|
|
|
|
|
|
/* Fixed transport header fields */
|
2017-10-17 02:01:14 +07:00
|
|
|
p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
|
2017-08-04 01:30:03 +07:00
|
|
|
if (unlikely(!p))
|
2015-10-25 04:26:54 +07:00
|
|
|
goto out_shortreply;
|
2017-10-17 02:01:14 +07:00
|
|
|
rep->rr_xid = *p++;
|
|
|
|
rep->rr_vers = *p++;
|
2017-10-17 02:01:39 +07:00
|
|
|
credits = be32_to_cpu(*p++);
|
2017-10-17 02:01:14 +07:00
|
|
|
rep->rr_proc = *p++;
|
2015-10-25 04:26:54 +07:00
|
|
|
|
2017-10-17 02:01:14 +07:00
|
|
|
if (rep->rr_vers != rpcrdma_version)
|
2017-10-17 02:01:06 +07:00
|
|
|
goto out_badversion;
|
|
|
|
|
2017-10-17 02:01:14 +07:00
|
|
|
if (rpcrdma_is_bcall(r_xprt, rep))
|
2017-08-04 01:30:11 +07:00
|
|
|
return;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-10-25 04:27:10 +07:00
|
|
|
/* Match incoming rpcrdma_rep to an rpcrdma_req to
|
|
|
|
* get context for handling any incoming chunks.
|
|
|
|
*/
|
2017-08-24 04:05:58 +07:00
|
|
|
spin_lock(&xprt->recv_lock);
|
2017-10-17 02:01:14 +07:00
|
|
|
rqst = xprt_lookup_rqst(xprt, rep->rr_xid);
|
2017-08-24 04:05:58 +07:00
|
|
|
if (!rqst)
|
|
|
|
goto out_norqst;
|
|
|
|
xprt_pin_rqst(rqst);
|
2017-10-17 02:01:39 +07:00
|
|
|
|
|
|
|
if (credits == 0)
|
|
|
|
credits = 1; /* don't deadlock */
|
|
|
|
else if (credits > buf->rb_max_requests)
|
|
|
|
credits = buf->rb_max_requests;
|
|
|
|
buf->rb_credits = credits;
|
|
|
|
|
2017-08-24 04:05:58 +07:00
|
|
|
spin_unlock(&xprt->recv_lock);
|
2017-10-17 02:01:39 +07:00
|
|
|
|
2017-08-24 04:05:58 +07:00
|
|
|
req = rpcr_to_rdmar(rqst);
|
2017-06-08 22:51:56 +07:00
|
|
|
req->rl_reply = rep;
|
2017-10-17 02:01:22 +07:00
|
|
|
rep->rr_rqst = rqst;
|
2017-10-20 21:48:28 +07:00
|
|
|
clear_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags);
|
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 22:52:20 +07:00
|
|
|
|
2016-03-04 23:27:43 +07:00
|
|
|
dprintk("RPC: %s: reply %p completes request %p (xid 0x%08x)\n",
|
2017-10-17 02:01:14 +07:00
|
|
|
__func__, rep, req, be32_to_cpu(rep->rr_xid));
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-12-05 02:04:04 +07:00
|
|
|
queue_work_on(req->rl_cpu, rpcrdma_receive_wq, &rep->rr_work);
|
2015-10-25 04:26:54 +07:00
|
|
|
return;
|
|
|
|
|
|
|
|
out_badstatus:
|
|
|
|
rpcrdma_recv_buffer_put(rep);
|
|
|
|
if (r_xprt->rx_ep.rep_connected == 1) {
|
|
|
|
r_xprt->rx_ep.rep_connected = -EIO;
|
|
|
|
rpcrdma_conn_func(&r_xprt->rx_ep);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
|
|
|
|
out_badversion:
|
|
|
|
dprintk("RPC: %s: invalid version %d\n",
|
2017-10-17 02:01:14 +07:00
|
|
|
__func__, be32_to_cpu(rep->rr_vers));
|
2017-10-17 02:01:06 +07:00
|
|
|
goto repost;
|
2016-03-04 23:28:18 +07:00
|
|
|
|
2017-10-17 02:01:22 +07:00
|
|
|
/* The RPC transaction has already been terminated, or the header
|
|
|
|
* is corrupt.
|
2016-03-04 23:28:18 +07:00
|
|
|
*/
|
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 22:52:20 +07:00
|
|
|
out_norqst:
|
2017-08-17 02:30:35 +07:00
|
|
|
spin_unlock(&xprt->recv_lock);
|
2017-08-04 01:30:03 +07:00
|
|
|
dprintk("RPC: %s: no match for incoming xid 0x%08x\n",
|
2017-10-17 02:01:14 +07:00
|
|
|
__func__, be32_to_cpu(rep->rr_xid));
|
2015-10-25 04:26:54 +07:00
|
|
|
goto repost;
|
|
|
|
|
2017-08-24 04:05:58 +07:00
|
|
|
out_shortreply:
|
|
|
|
dprintk("RPC: %s: short/invalid reply\n", __func__);
|
2015-10-25 04:26:54 +07:00
|
|
|
|
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 22:52:20 +07:00
|
|
|
/* If no pending RPC transaction was matched, post a replacement
|
|
|
|
* receive buffer before returning.
|
|
|
|
*/
|
2015-10-25 04:26:54 +07:00
|
|
|
repost:
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
2016-09-15 21:56:35 +07:00
|
|
|
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
|
2015-10-25 04:26:54 +07:00
|
|
|
rpcrdma_recv_buffer_put(rep);
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|