2007-09-11 00:50:12 +07:00
/*
 * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the BSD-type
 * license below:
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 *
 * Redistributions in binary form must reproduce the above
 * copyright notice, this list of conditions and the following
 * disclaimer in the documentation and/or other materials provided
 * with the distribution.
 *
 * Neither the name of the Network Appliance, Inc. nor the names of
 * its contributors may be used to endorse or promote products
 * derived from this software without specific prior written
 * permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * rpc_rdma.c
 *
 * This file contains the guts of the RPC RDMA protocol, and
 * does marshaling/unmarshaling, etc. It is also where interfacing
 * to the Linux RPC framework lives.
 */

#include "xprt_rdma.h"

#include <linux/highmem.h>

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
# define RPCDBG_FACILITY	RPCDBG_TRANS
#endif

static const char transfertypes[][12] = {
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
	"inline",	/* no chunks */
	"read list",	/* some argument via rdma read */
	"*read list",	/* entire request via rdma read */
	"write list",	/* some result via rdma write */
	"reply chunk"	/* entire reply via rdma write */
};

/* Returns size of largest RPC-over-RDMA header in a Call message
 *
 * The largest Call header contains a full-size Read list and a
 * minimal Reply chunk.
 */
static unsigned int rpcrdma_max_call_header_size(unsigned int maxsegs)
{
	unsigned int size;

	/* Fixed header fields and list discriminators */
	size = RPCRDMA_HDRLEN_MIN;

	/* Maximum Read list size. Accumulate with "+=" so that the
	 * fixed header fields counted above are not discarded.
	 */
	maxsegs += 2;	/* segment for head and tail buffers */
	size += maxsegs * sizeof(struct rpcrdma_read_chunk);

	/* Minimal Reply chunk size */
	size += sizeof(__be32);	/* segment count */
	size += sizeof(struct rpcrdma_segment);
	size += sizeof(__be32);	/* list discriminator */

	dprintk("RPC: %s: max call header size = %u\n",
		__func__, size);
	return size;
}

/* Returns size of largest RPC-over-RDMA header in a Reply message
 *
 * There is only one Write list or one Reply chunk per Reply
 * message. The larger list is the Write list.
 */
static unsigned int rpcrdma_max_reply_header_size(unsigned int maxsegs)
{
	unsigned int size;

	/* Fixed header fields and list discriminators */
	size = RPCRDMA_HDRLEN_MIN;

	/* Maximum Write list size. Accumulate with "+=" so that the
	 * fixed header fields counted above are not discarded.
	 */
	maxsegs += 2;	/* segment for head and tail buffers */
	size += sizeof(__be32);	/* segment count */
	size += maxsegs * sizeof(struct rpcrdma_segment);
	size += sizeof(__be32);	/* list discriminator */

	dprintk("RPC: %s: max reply header size = %u\n",
		__func__, size);
	return size;
}

void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
{
	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
	unsigned int maxsegs = ia->ri_max_segs;

	ia->ri_max_inline_write = cdata->inline_wsize -
				  rpcrdma_max_call_header_size(maxsegs);
	ia->ri_max_inline_read = cdata->inline_rsize -
				 rpcrdma_max_reply_header_size(maxsegs);
}
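
To make the header-size arithmetic above concrete, the following user-space sketch recomputes both maximum header sizes and the inline payload limits derived from them. It is only an illustration under stated assumptions: the byte sizes are taken from the RPC-over-RDMA version 1 wire layout (a 7-word minimal header, a 16-byte segment, a 24-byte Read chunk), every `sketch_` and `SKETCH_` name is hypothetical, and the sketch adds each term to the fixed header size.

```c
#include <assert.h>

/* Assumed wire sizes in bytes. These stand in for the kernel's
 * RPCRDMA_HDRLEN_MIN and sizeof() expressions and are not taken
 * from the kernel headers themselves.
 */
enum {
	SKETCH_XDR_WORD   = 4,
	SKETCH_HDRLEN_MIN = 7 * SKETCH_XDR_WORD, /* 4 fixed words + 3 list discriminators */
	SKETCH_SEGMENT    = 16,                  /* handle32 + length32 + offset64 */
	SKETCH_READ_CHUNK = 8 + SKETCH_SEGMENT   /* discriminator + position + segment */
};

/* Largest Call header: full-size Read list plus a minimal Reply chunk. */
static unsigned int sketch_max_call_header(unsigned int maxsegs)
{
	unsigned int size = SKETCH_HDRLEN_MIN;

	assert(maxsegs < 65536u);	/* keep the products far from overflow */
	maxsegs += 2;			/* head and tail buffers */
	size += maxsegs * SKETCH_READ_CHUNK;
	size += SKETCH_XDR_WORD;	/* Reply chunk segment count */
	size += SKETCH_SEGMENT;		/* one Reply chunk segment */
	size += SKETCH_XDR_WORD;	/* list discriminator */
	return size;
}

/* Largest Reply header: a full-size Write list. */
static unsigned int sketch_max_reply_header(unsigned int maxsegs)
{
	unsigned int size = SKETCH_HDRLEN_MIN;

	assert(maxsegs < 65536u);
	maxsegs += 2;			/* head and tail buffers */
	size += SKETCH_XDR_WORD;	/* segment count */
	size += maxsegs * SKETCH_SEGMENT;
	size += SKETCH_XDR_WORD;	/* list discriminator */
	return size;
}

/* Bytes left for the RPC message itself once the header is reserved.
 * Clamping to zero is an addition of this sketch, not kernel behavior.
 */
static unsigned int sketch_inline_payload(unsigned int inline_size,
					  unsigned int header_size)
{
	return inline_size > header_size ? inline_size - header_size : 0;
}
```

With `maxsegs = 8` and 4096-byte inline buffers, the sketch gives a 292-byte Call header and a 196-byte Reply header, leaving 3804 and 3900 bytes of inline payload respectively.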

/* The client can send a request inline as long as the RPCRDMA header
 * plus the RPC call fit under the transport's inline limit. If the
 * combined call message size exceeds that limit, the client must use
 * a Read chunk for this operation.
 *
 * A Read chunk is also required if sending the RPC call inline would
 * exceed this device's max_sge limit.
 */
static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
				struct rpc_rqst *rqst)
{
	struct xdr_buf *xdr = &rqst->rq_snd_buf;
	unsigned int count, remaining, offset;

	if (xdr->len > r_xprt->rx_ia.ri_max_inline_write)
		return false;

	if (xdr->page_len) {
		remaining = xdr->page_len;
		offset = offset_in_page(xdr->page_base);
		count = 0;
		while (remaining) {
			remaining -= min_t(unsigned int,
					   PAGE_SIZE - offset, remaining);
			offset = 0;
			if (++count > r_xprt->rx_ia.ri_max_send_sges)
				return false;
		}
	}

	return true;
}

/* The client can't know how large the actual reply will be. Thus it
 * plans for the largest possible reply for that particular ULP
 * operation. If the maximum combined reply message size exceeds the
 * transport's inline limit, the client must provide a write list or
 * a reply chunk for this request.
 */
static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
				   struct rpc_rqst *rqst)
{
	struct rpcrdma_ia *ia = &r_xprt->rx_ia;

	return rqst->rq_rcv_buf.buflen <= ia->ri_max_inline_read;
}

xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 23:27:52 +07:00
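
The page-boundary splitting described in the commit message above can be sketched in user space. This is a minimal illustration rather than the kernel code: `sketch_seg`, `sketch_split_on_pages` and `SKETCH_PAGE_SIZE` are hypothetical names, and a 4 KB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* One registration segment: a byte range that lies within one page. */
struct sketch_seg {
	uintptr_t addr;
	uint32_t len;
};

/* Split [addr, addr + len) into segments that never cross a page
 * boundary, appending them to seg[] starting at index n.
 *
 * Returns the new segment count, or -1 if seg[] (capacity max)
 * is too small to hold the whole range.
 */
static int sketch_split_on_pages(uintptr_t addr, uint32_t len,
				 struct sketch_seg *seg, int n, int max)
{
	uint32_t page_offset = (uint32_t)(addr & (SKETCH_PAGE_SIZE - 1));

	assert(seg != NULL);
	while (len) {
		uint32_t chunk = SKETCH_PAGE_SIZE - page_offset;

		if (n >= max)
			return -1;	/* segment array overflow */
		if (chunk > len)
			chunk = len;
		seg[n].addr = addr;
		seg[n].len = chunk;
		addr += chunk;
		len -= chunk;
		++n;
		page_offset = 0;	/* only the first segment starts mid-page */
	}
	return n;
}
```

For a 6 KB buffer that starts 3 KB into a page, the sketch yields three segments of 1024, 4096 and 1024 bytes, which shows why a single segment cannot cover the whole buffer when registration stops at a page boundary.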

/* Split "vec" on page boundaries into segments. FMR registers pages,
 * not a byte range. Other modes coalesce these segments into a single
 * MR when they can.
 */
static int
rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg, int n)
{
	size_t page_offset;
	u32 remaining;
	char *base;

	base = vec->iov_base;
	page_offset = offset_in_page(base);
	remaining = vec->iov_len;
	while (remaining && n < RPCRDMA_MAX_SEGS) {
		seg[n].mr_page = NULL;
		seg[n].mr_offset = base;
		seg[n].mr_len = min_t(u32, PAGE_SIZE - page_offset, remaining);
		remaining -= seg[n].mr_len;
		base += seg[n].mr_len;
		++n;
		page_offset = 0;
	}
	return n;
}

/*
 * Chunk assembly from upper layer xdr_buf.
 *
 * Convert the passed-in xdr_buf into RPC/RDMA chunk elements.
 * Segments are then coalesced when registered, if possible
 * within the selected memreg mode.
 *
 * Returns positive number of segments converted, or a negative errno.
 */

static int
rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
		     unsigned int pos, enum rpcrdma_chunktype type,
		     struct rpcrdma_mr_seg *seg)
{
	int len, n, p, page_base;
	struct page **ppages;

	n = 0;
	if (pos == 0) {
		n = rpcrdma_convert_kvec(&xdrbuf->head[0], seg, n);
		if (n == RPCRDMA_MAX_SEGS)
			goto out_overflow;
	}

	len = xdrbuf->page_len;
	ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
	page_base = offset_in_page(xdrbuf->page_base);
	p = 0;
	while (len && n < RPCRDMA_MAX_SEGS) {
		if (!ppages[p]) {
			/* alloc the pagelist for receiving buffer */
			ppages[p] = alloc_page(GFP_ATOMIC);
			if (!ppages[p])
				return -EAGAIN;
		}
		seg[n].mr_page = ppages[p];
		seg[n].mr_offset = (void *)(unsigned long) page_base;
		seg[n].mr_len = min_t(u32, PAGE_SIZE - page_base, len);
		if (seg[n].mr_len > PAGE_SIZE)
			goto out_overflow;
		len -= seg[n].mr_len;
		++n;
		++p;
		page_base = 0;	/* page offset only applies to first page */
	}

	/* Message overflows the seg array */
	if (len && n == RPCRDMA_MAX_SEGS)
		goto out_overflow;

	/* When encoding a Read chunk, the tail iovec contains an
	 * XDR pad and may be omitted.
	 */
	if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_roundup)
		return n;

	/* When encoding a Write chunk, some servers need to see an
	 * extra segment for non-XDR-aligned Write chunks. The upper
	 * layer provides space in the tail iovec that may be used
	 * for this purpose.
	 */
	if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_roundup)
		return n;

	if (xdrbuf->tail[0].iov_len) {
		n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
		if (n == RPCRDMA_MAX_SEGS)
			goto out_overflow;
	}

	return n;

out_overflow:
	pr_err("rpcrdma: segment array overflow\n");
	return -EIO;
}
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
static inline __be32 *
|
2016-06-30 00:54:16 +07:00
|
|
|
xdr_encode_rdma_segment(__be32 *iptr, struct rpcrdma_mw *mw)
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
|
|
|
{
|
2016-06-30 00:54:16 +07:00
|
|
|
*iptr++ = cpu_to_be32(mw->mw_handle);
|
|
|
|
*iptr++ = cpu_to_be32(mw->mw_length);
|
|
|
|
return xdr_encode_hyper(iptr, mw->mw_offset);
|
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
}
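
As an illustration only (this is not part of the kernel source), the HLOO layout written by the helper above can be sketched in portable user-space C. The names demo_segment, demo_put_be32 and demo_encode_segment are hypothetical stand-ins; the sketch assumes a segment is a 32-bit handle, a 32-bit length and a 64-bit offset, each stored in big-endian (XDR) order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the three rpcrdma_mw fields encoded above. */
struct demo_segment {
	uint32_t handle;	/* H: 32-bit handle */
	uint32_t length;	/* L: 32-bit length in bytes */
	uint64_t offset;	/* OO: 64-bit offset */
};

/* Store a 32-bit value in big-endian order, one byte at a time, so the
 * sketch does not depend on the byte order of the host.
 */
static uint8_t *demo_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/* Encode one segment as H, L, then the offset as two 32-bit words,
 * most significant word first. Returns a pointer just past the
 * 16 bytes that were written.
 */
static uint8_t *demo_encode_segment(uint8_t *p, const struct demo_segment *s)
{
	p = demo_put_be32(p, s->handle);
	p = demo_put_be32(p, s->length);
	p = demo_put_be32(p, (uint32_t)(s->offset >> 32));
	return demo_put_be32(p, (uint32_t)s->offset);
}
```

A caller passes a buffer with at least 16 free bytes and continues encoding at the returned pointer, which mirrors how the kernel helper returns the advanced iptr.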

/* XDR-encode the Read list. Supports encoding a list of read
 * segments that belong to a single read chunk.
 *
 * Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
 *
 *  Read chunklist (a linked list):
 *   N elements, position P (same P for all chunks of same arg!):
 *    1 - PHLOO - 1 - PHLOO - ... - 1 - PHLOO - 0
 *
 * Returns a pointer to the XDR word in the RDMA header following
 * the end of the Read list, or an error pointer.
 */
static __be32 *
rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
			 struct rpcrdma_req *req, struct rpc_rqst *rqst,
			 __be32 *iptr, enum rpcrdma_chunktype rtype)
{
	struct rpcrdma_mr_seg *seg;
	struct rpcrdma_mw *mw;
	unsigned int pos;
	int n, nsegs;

	if (rtype == rpcrdma_noch) {
		*iptr++ = xdr_zero;	/* item not present */
		return iptr;
	}

	pos = rqst->rq_snd_buf.head[0].iov_len;
	if (rtype == rpcrdma_areadch)
		pos = 0;
	seg = req->rl_segments;
	nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
				     rtype, seg);
	if (nsegs < 0)
		return ERR_PTR(nsegs);

	do {
		n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
						 false, &mw);
		if (n < 0)
			return ERR_PTR(n);
		rpcrdma_push_mw(mw, &req->rl_registered);

		*iptr++ = xdr_one;	/* item present */

		/* All read segments in this chunk
		 * have the same "position".
		 */
		*iptr++ = cpu_to_be32(pos);
		iptr = xdr_encode_rdma_segment(iptr, mw);

		dprintk("RPC: %5u %s: pos %u %u@0x%016llx:0x%08x (%s)\n",
			rqst->rq_task->tk_pid, __func__, pos,
			mw->mw_length, (unsigned long long)mw->mw_offset,
			mw->mw_handle, n < nsegs ? "more" : "last");

		r_xprt->rx_stats.read_chunk_count++;
		seg += n;
		nsegs -= n;
	} while (nsegs);

	/* Finish Read list */
	*iptr++ = xdr_zero;	/* Next item not present */
	return iptr;
}
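
The "1 - PHLOO - ... - 1 - PHLOO - 0" key documented for the Read list can be sketched outside the kernel as follows. This is a minimal illustration under stated assumptions, not the xprtrdma implementation: rl_segment, rl_put_be32 and rl_encode_read_list are hypothetical names, memory registration (the ro_map step) is left out entirely, and every segment is assumed to share one position value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-registered segment: handle, length, offset. */
struct rl_segment {
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

/* Store a 32-bit value in big-endian (XDR) order. */
static uint8_t *rl_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/* Encode nsegs read segments that all share position pos, then the
 * terminating "not present" word. Each list item occupies 24 bytes:
 * the "present" flag, the position, and the 16-byte HLOO segment.
 * The caller must supply at least nsegs * 24 + 4 bytes.
 */
static uint8_t *rl_encode_read_list(uint8_t *p, uint32_t pos,
				    const struct rl_segment *segs,
				    size_t nsegs)
{
	size_t i;

	for (i = 0; i < nsegs; i++) {
		p = rl_put_be32(p, 1);		/* item present */
		p = rl_put_be32(p, pos);	/* P: same for every segment */
		p = rl_put_be32(p, segs[i].handle);
		p = rl_put_be32(p, segs[i].length);
		p = rl_put_be32(p, (uint32_t)(segs[i].offset >> 32));
		p = rl_put_be32(p, (uint32_t)segs[i].offset);
	}
	return rl_put_be32(p, 0);		/* next item not present */
}
```

With nsegs equal to zero the sketch writes only the terminating zero word, which matches what the function above emits when rtype is rpcrdma_noch.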

/* XDR-encode the Write list. Supports encoding a list containing
 * one array of plain segments that belong to a single write chunk.
 *
 * Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
 *
 *  Write chunklist (a list of (one) counted array):
 *   N elements:
 *    1 - N - HLOO - HLOO - ... - HLOO - 0
 *
 * Returns a pointer to the XDR word in the RDMA header following
 * the end of the Write list, or an error pointer.
 */
static __be32 *
rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
			  struct rpc_rqst *rqst, __be32 *iptr,
			  enum rpcrdma_chunktype wtype)
{
	struct rpcrdma_mr_seg *seg;
	struct rpcrdma_mw *mw;
	int n, nsegs, nchunks;
	__be32 *segcount;

	if (wtype != rpcrdma_writech) {
		*iptr++ = xdr_zero;	/* no Write list present */
		return iptr;
	}

	seg = req->rl_segments;
	nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
				     rqst->rq_rcv_buf.head[0].iov_len,
				     wtype, seg);
	if (nsegs < 0)
		return ERR_PTR(nsegs);

	*iptr++ = xdr_one;	/* Write list present */
	segcount = iptr++;	/* save location of segment count */

	nchunks = 0;
	do {
		n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
						 true, &mw);
		if (n < 0)
			return ERR_PTR(n);
		rpcrdma_push_mw(mw, &req->rl_registered);

		iptr = xdr_encode_rdma_segment(iptr, mw);

		dprintk("RPC: %5u %s: %u@0x%016llx:0x%08x (%s)\n",
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, so
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be clearer.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
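The chunk selection described in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel's rpcrdma_marshal_req(): the demo_* names, the struct fields, and the single inline threshold are all hypothetical. It shows the Call direction and the Reply direction being decided independently, so a Read list and a Reply chunk can be chosen for the same RPC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chunk types, loosely modeled on enum rpcrdma_chunktype. */
enum demo_chunktype {
	DEMO_NOCH,	/* message fits inline, no chunk needed */
	DEMO_READCH,	/* Read list carries a data payload */
	DEMO_AREADCH,	/* Position Zero Read chunk carries the whole Call */
	DEMO_WRITECH,	/* Write list receives a data payload */
	DEMO_REPLYCH	/* Reply chunk receives the whole Reply */
};

/* Hypothetical summary of one RPC; not a kernel structure. */
struct demo_rpc {
	size_t call_len;		/* size of the Call message */
	size_t reply_len;		/* maximum size of the Reply message */
	bool has_write_payload;		/* Call has a DDP-eligible payload */
	bool has_read_payload;		/* Reply has a DDP-eligible payload */
};

/* Choose how the Call is conveyed, looking only at the Call direction. */
static enum demo_chunktype
demo_select_rtype(const struct demo_rpc *rpc, size_t inline_threshold)
{
	if (rpc->call_len <= inline_threshold)
		return DEMO_NOCH;
	if (rpc->has_write_payload)
		return DEMO_READCH;
	return DEMO_AREADCH;
}

/* Choose how the Reply is conveyed, looking only at the Reply direction. */
static enum demo_chunktype
demo_select_wtype(const struct demo_rpc *rpc, size_t inline_threshold)
{
	if (rpc->reply_len <= inline_threshold)
		return DEMO_NOCH;
	if (rpc->has_read_payload)
		return DEMO_WRITECH;
	return DEMO_REPLYCH;
}
```

Because neither function consults the other's result, a large Call paired with a large Reply simply yields two non-NOCH answers instead of an -EIO.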
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__,
|
2016-06-30 00:54:16 +07:00
|
|
|
mw->mw_length, (unsigned long long)mw->mw_offset,
|
|
|
|
mw->mw_handle, n < nsegs ? "more" : "last");
|
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.write_chunk_count++;
|
|
|
|
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
|
|
|
|
nchunks++;
|
|
|
|
seg += n;
|
|
|
|
nsegs -= n;
|
|
|
|
} while (nsegs);
|
|
|
|
|
|
|
|
/* Update count of segments in this Write chunk */
|
|
|
|
*segcount = cpu_to_be32(nchunks);
|
|
|
|
|
|
|
|
/* Finish Write list */
|
|
|
|
*iptr++ = xdr_zero; /* Next item not present */
|
|
|
|
return iptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* XDR-encode the Reply chunk. Supports encoding an array of plain
|
|
|
|
* segments that belong to a single write (reply) chunk.
|
|
|
|
*
|
|
|
|
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
|
|
|
|
*
|
|
|
|
* Reply chunk (a counted array):
|
|
|
|
* N elements:
|
|
|
|
* 1 - N - HLOO - HLOO - ... - HLOO
|
|
|
|
*
|
|
|
|
* Returns a pointer to the XDR word in the RDMA header following
|
|
|
|
* the end of the Reply chunk, or an error pointer.
|
|
|
|
*/
|
|
|
|
static __be32 *
|
|
|
|
rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
|
|
|
|
struct rpcrdma_req *req, struct rpc_rqst *rqst,
|
|
|
|
__be32 *iptr, enum rpcrdma_chunktype wtype)
|
|
|
|
{
|
2016-06-30 00:54:25 +07:00
|
|
|
struct rpcrdma_mr_seg *seg;
|
2016-06-30 00:54:16 +07:00
|
|
|
struct rpcrdma_mw *mw;
|
2016-05-03 01:41:30 +07:00
|
|
|
int n, nsegs, nchunks;
|
|
|
|
__be32 *segcount;
|
|
|
|
|
|
|
|
if (wtype != rpcrdma_replych) {
|
|
|
|
*iptr++ = xdr_zero; /* no Reply chunk present */
|
|
|
|
return iptr;
|
|
|
|
}
|
|
|
|
|
2016-06-30 00:54:25 +07:00
|
|
|
seg = req->rl_segments;
|
2017-02-09 04:59:54 +07:00
|
|
|
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
|
2016-05-03 01:41:30 +07:00
|
|
|
if (nsegs < 0)
|
|
|
|
return ERR_PTR(nsegs);
|
|
|
|
|
|
|
|
*iptr++ = xdr_one; /* Reply chunk present */
|
|
|
|
segcount = iptr++; /* save location of segment count */
|
|
|
|
|
|
|
|
nchunks = 0;
|
|
|
|
do {
|
2016-06-30 00:54:16 +07:00
|
|
|
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
|
|
|
|
true, &mw);
|
2016-06-30 00:53:52 +07:00
|
|
|
if (n < 0)
|
2016-05-03 01:41:30 +07:00
|
|
|
return ERR_PTR(n);
|
2017-02-09 05:00:43 +07:00
|
|
|
rpcrdma_push_mw(mw, &req->rl_registered);
|
2016-05-03 01:41:30 +07:00
|
|
|
|
2016-06-30 00:54:16 +07:00
|
|
|
iptr = xdr_encode_rdma_segment(iptr, mw);
|
2016-05-03 01:41:30 +07:00
|
|
|
|
2016-06-30 00:54:16 +07:00
|
|
|
dprintk("RPC: %5u %s: %u@0x%016llx:0x%08x (%s)\n",
|
2016-05-03 01:41:30 +07:00
|
|
|
rqst->rq_task->tk_pid, __func__,
|
2016-06-30 00:54:16 +07:00
|
|
|
mw->mw_length, (unsigned long long)mw->mw_offset,
|
|
|
|
mw->mw_handle, n < nsegs ? "more" : "last");
|
2016-05-03 01:41:30 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.reply_chunk_count++;
|
|
|
|
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
|
|
|
|
nchunks++;
|
|
|
|
seg += n;
|
|
|
|
nsegs -= n;
|
|
|
|
} while (nsegs);
|
|
|
|
|
|
|
|
/* Update count of segments in the Reply chunk */
|
|
|
|
*segcount = cpu_to_be32(nchunks);
|
|
|
|
|
|
|
|
return iptr;
|
|
|
|
}
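The wire layout named in the comment above ("1 - N - HLOO - ... - HLOO") can be shown with a small userspace sketch. It is not the kernel encoder: struct demo_segment, demo_put_be32() and demo_encode_reply_chunk() are hypothetical names, the buffer is a plain byte array, and memory registration is left out entirely. It only illustrates the counted-array encoding, with each segment written as a 32-bit handle, a 32-bit length and a 64-bit offset in big-endian order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one registered segment. */
struct demo_segment {
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

/* Store a 32-bit value in big-endian order; return the next write position. */
static uint8_t *demo_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/* Encode an optional Reply chunk: a single zero word when nsegs is 0,
 * otherwise "1 - N - HLOO x N". Returns the position following the
 * encoded data, or NULL if the buffer is too small.
 */
static uint8_t *demo_encode_reply_chunk(uint8_t *buf, size_t buflen,
					const struct demo_segment *segs,
					uint32_t nsegs)
{
	size_t needed = nsegs ? 8 + (size_t)nsegs * 16 : 4;
	uint32_t i;

	if (buflen < needed)
		return NULL;
	if (nsegs == 0)
		return demo_put_be32(buf, 0);	/* no Reply chunk present */

	buf = demo_put_be32(buf, 1);		/* Reply chunk present */
	buf = demo_put_be32(buf, nsegs);	/* segment count N */
	for (i = 0; i < nsegs; i++) {
		buf = demo_put_be32(buf, segs[i].handle);
		buf = demo_put_be32(buf, segs[i].length);
		/* 64-bit offset, most significant word first */
		buf = demo_put_be32(buf, (uint32_t)(segs[i].offset >> 32));
		buf = demo_put_be32(buf, (uint32_t)segs[i].offset);
	}
	return buf;
}
```

The sketch knows the segment count up front, so it writes N directly. The kernel function instead saves the location of the count word and fills it in after the loop, because the number of registered chunks is only known once ro_map has consumed all the segments.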
|
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(i.e., one or more items in rq_snd_buf's page list) must be "pulled
up":
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcpy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
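The gather-list idea described in this commit message can be sketched in userspace C. Everything here is an assumption made for illustration: struct demo_sge and struct demo_xdr_buf are simplified stand-ins for ib_sge and xdr_buf, demo_build_gather_list() is a hypothetical helper, and DMA mapping is omitted. The sketch describes the head, each page fragment and the tail in place, with one entry each, rather than copying them into one buffer. It assumes page_size is greater than zero.

```c
#include <stddef.h>

/* Hypothetical scatter/gather element: one contiguous piece to send. */
struct demo_sge {
	const void *addr;
	size_t length;
};

/* Hypothetical, simplified view of a send buffer: head, page list, tail. */
struct demo_xdr_buf {
	const void *head;
	size_t head_len;
	const void *const *pages;	/* array of page-sized buffers */
	size_t page_base;		/* offset of the payload in the page list */
	size_t page_len;		/* payload bytes held in the page list */
	size_t page_size;		/* must be greater than zero */
	const void *tail;
	size_t tail_len;
};

/* Fill sge[] with one entry per component, without copying any data.
 * Returns the number of entries used, or -1 if max_sge is too small.
 */
static int demo_build_gather_list(const struct demo_xdr_buf *xdr,
				  struct demo_sge *sge, int max_sge)
{
	size_t remaining = xdr->page_len;
	size_t idx = xdr->page_base / xdr->page_size;
	size_t base = xdr->page_base % xdr->page_size;
	int n = 0;

	if (xdr->head_len) {
		if (n == max_sge)
			return -1;
		sge[n].addr = xdr->head;
		sge[n].length = xdr->head_len;
		n++;
	}

	while (remaining) {
		size_t len = xdr->page_size - base;

		if (len > remaining)
			len = remaining;
		if (n == max_sge)
			return -1;
		sge[n].addr = (const char *)xdr->pages[idx] + base;
		sge[n].length = len;
		n++;
		remaining -= len;
		base = 0;	/* only the first page starts at an offset */
		idx++;
	}

	if (xdr->tail_len) {
		if (n == max_sge)
			return -1;
		sge[n].addr = xdr->tail;
		sge[n].length = xdr->tail_len;
		n++;
	}
	return n;
}
```

The cost of this approach is one list entry per component, so the caller has to bound the list length; the benefit is that payload bytes are never touched by the host CPU.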
2016-09-15 21:57:24 +07:00
|
|
|
/* Prepare the RPC-over-RDMA header SGE.
|
2007-09-11 00:50:42 +07:00
|
|
|
*/
|
2016-09-15 21:57:24 +07:00
|
|
|
static bool
|
|
|
|
rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
|
|
|
|
u32 len)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
2016-09-15 21:57:24 +07:00
|
|
|
struct rpcrdma_regbuf *rb = req->rl_rdmabuf;
|
|
|
|
struct ib_sge *sge = &req->rl_send_sge[0];
|
|
|
|
|
|
|
|
if (unlikely(!rpcrdma_regbuf_is_mapped(rb))) {
|
|
|
|
if (!__rpcrdma_dma_map_regbuf(ia, rb))
|
|
|
|
return false;
|
|
|
|
sge->addr = rdmab_addr(rb);
|
|
|
|
sge->lkey = rdmab_lkey(rb);
|
|
|
|
}
|
|
|
|
sge->length = len;
|
|
|
|
|
2017-04-12 00:23:02 +07:00
|
|
|
ib_dma_sync_single_for_device(rdmab_device(rb), sge->addr,
|
2016-09-15 21:57:24 +07:00
|
|
|
sge->length, DMA_TO_DEVICE);
|
|
|
|
req->rl_send_wr.num_sge++;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Prepare the Send SGEs. The head and tail iovec, and each entry
|
|
|
|
* in the page list, get their own SGE.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
|
|
|
|
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
|
|
|
|
{
|
|
|
|
unsigned int sge_no, page_base, len, remaining;
|
|
|
|
struct rpcrdma_regbuf *rb = req->rl_sendbuf;
|
|
|
|
struct ib_device *device = ia->ri_device;
|
|
|
|
struct ib_sge *sge = req->rl_send_sge;
|
|
|
|
u32 lkey = ia->ri_pd->local_dma_lkey;
|
|
|
|
struct page *page, **ppages;
|
|
|
|
|
|
|
|
/* The head iovec is straightforward, as it is already
|
|
|
|
* DMA-mapped. Sync the content that has changed.
|
|
|
|
*/
|
|
|
|
if (!rpcrdma_dma_map_regbuf(ia, rb))
|
|
|
|
return false;
|
|
|
|
sge_no = 1;
|
|
|
|
sge[sge_no].addr = rdmab_addr(rb);
|
|
|
|
sge[sge_no].length = xdr->head[0].iov_len;
|
|
|
|
sge[sge_no].lkey = rdmab_lkey(rb);
|
2017-04-12 00:23:02 +07:00
|
|
|
ib_dma_sync_single_for_device(rdmab_device(rb), sge[sge_no].addr,
|
2016-09-15 21:57:24 +07:00
|
|
|
sge[sge_no].length, DMA_TO_DEVICE);
|
|
|
|
|
|
|
|
/* If there is a Read chunk, the page list is being handled
|
|
|
|
* via explicit RDMA, and thus is skipped here. However, the
|
|
|
|
* tail iovec may include an XDR pad for the page list, as
|
|
|
|
* well as additional content, and may not reside in the
|
|
|
|
* same page as the head iovec.
|
|
|
|
*/
|
|
|
|
if (rtype == rpcrdma_readch) {
|
|
|
|
len = xdr->tail[0].iov_len;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2016-09-15 21:57:24 +07:00
		/* Do not include the tail if it is only an XDR pad */
		if (len < 4)
			goto out;
2007-09-11 00:50:42 +07:00

2016-09-15 21:57:24 +07:00
		page = virt_to_page(xdr->tail[0].iov_base);
2017-06-08 22:53:16 +07:00
		page_base = offset_in_page(xdr->tail[0].iov_base);
2007-09-11 00:50:42 +07:00

2016-09-15 21:57:24 +07:00
		/* If the content in the page list is an odd length,
		 * xdr_write_pages() has added a pad at the beginning
		 * of the tail iovec. Force the tail's non-pad content
		 * to land at the next XDR position in the Send message.
		 */
		page_base += len & 3;
		len -= len & 3;
		goto map_tail;
	}
2009-03-12 01:37:55 +07:00

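The pad handling in the Read chunk branch above can be tried out in user space. The following is a minimal sketch, not the kernel code: it assumes the tail holds a 0 to 3 byte XDR pad followed by content whose length is a multiple of four, so that the low two bits of the tail length equal the pad size. The names sketch_tail and sketch_trim_tail_pad are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: where the non-pad tail content starts
 * and how long it is, relative to the start of the tail iovec.
 */
struct sketch_tail {
	unsigned int offset;
	unsigned int length;
};

/* Skip a leading XDR pad in a tail of tail_len bytes.
 * Returns false when the tail is only a pad (fewer than 4 bytes)
 * and therefore does not need to be sent at all.
 */
bool sketch_trim_tail_pad(unsigned int tail_len, struct sketch_tail *out)
{
	unsigned int pad = tail_len & 3;	/* same arithmetic as "len & 3" */

	assert(out != NULL);
	if (tail_len < 4)
		return false;
	out->offset = pad;		/* mirrors "page_base += len & 3" */
	out->length = tail_len - pad;	/* mirrors "len -= len & 3" */
	return true;
}
```

For instance, a 7-byte tail would be treated as a 3-byte pad followed by 4 bytes of content, while a 3-byte tail would be skipped entirely.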
2016-09-15 21:57:24 +07:00
	/* If there is a page list present, temporarily DMA map
	 * and prepare an SGE for each page to be sent.
	 */
	if (xdr->page_len) {
		ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
2017-06-08 22:53:16 +07:00
		page_base = offset_in_page(xdr->page_base);
2016-09-15 21:57:24 +07:00
		remaining = xdr->page_len;
		while (remaining) {
			sge_no++;
			if (sge_no > RPCRDMA_MAX_SEND_SGES - 2)
				goto out_mapping_overflow;

			len = min_t(u32, PAGE_SIZE - page_base, remaining);
			sge[sge_no].addr = ib_dma_map_page(device, *ppages,
							   page_base, len,
							   DMA_TO_DEVICE);
			if (ib_dma_mapping_error(device, sge[sge_no].addr))
				goto out_mapping_err;
			sge[sge_no].length = len;
			sge[sge_no].lkey = lkey;

			req->rl_mapped_sges++;
			ppages++;
			remaining -= len;
			page_base = 0;
2009-03-12 01:37:55 +07:00
		}
	}
2016-09-15 21:57:24 +07:00

	/* The tail iovec is not always constructed in the same
	 * page where the head iovec resides (see, for example,
	 * gss_wrap_req_priv). To neatly accommodate that case,
	 * DMA map it separately.
	 */
	if (xdr->tail[0].iov_len) {
		page = virt_to_page(xdr->tail[0].iov_base);
2017-06-08 22:53:16 +07:00
		page_base = offset_in_page(xdr->tail[0].iov_base);
2016-09-15 21:57:24 +07:00
		len = xdr->tail[0].iov_len;

map_tail:
		sge_no++;
		sge[sge_no].addr = ib_dma_map_page(device, page,
						   page_base, len,
						   DMA_TO_DEVICE);
		if (ib_dma_mapping_error(device, sge[sge_no].addr))
			goto out_mapping_err;
		sge[sge_no].length = len;
		sge[sge_no].lkey = lkey;
		req->rl_mapped_sges++;
2007-09-11 00:50:42 +07:00
	}
2016-09-15 21:57:24 +07:00

out:
	req->rl_send_wr.num_sge = sge_no + 1;
	return true;

out_mapping_overflow:
	pr_err("rpcrdma: too many Send SGEs (%u)\n", sge_no);
	return false;

out_mapping_err:
	pr_err("rpcrdma: Send mapping error\n");
	return false;
}
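
The page-list loop in the function above splits a byte range into one piece per page: the first piece may start part-way into its page, and every later piece starts at offset zero. The following user-space sketch illustrates only that splitting arithmetic. It performs no DMA mapping, assumes a 4KB page size, and uses hypothetical names (sketch_seg, sketch_split_pages, SKETCH_PAGE_SIZE).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/* Hypothetical stand-in for one gather element. */
struct sketch_seg {
	size_t page_index;	/* which page of the page list */
	unsigned int offset;	/* byte offset within that page */
	unsigned int length;	/* bytes covered by this element */
};

/* Split page_len bytes, starting page_base bytes into a page list,
 * into per-page elements. Returns the number of elements written
 * to segs, or -1 if more than max_segs would be needed.
 */
static int sketch_split_pages(unsigned int page_base, unsigned int page_len,
			      struct sketch_seg *segs, int max_segs)
{
	size_t index = page_base / SKETCH_PAGE_SIZE;
	unsigned int offset = page_base % SKETCH_PAGE_SIZE;
	unsigned int remaining = page_len;
	int n = 0;

	assert(segs != NULL && max_segs >= 0);
	while (remaining) {
		unsigned int len = SKETCH_PAGE_SIZE - offset;

		if (len > remaining)
			len = remaining;
		if (n == max_segs)
			return -1;	/* analogous to the overflow exit */
		segs[n].page_index = index++;
		segs[n].offset = offset;
		segs[n].length = len;
		n++;
		remaining -= len;
		offset = 0;	/* later pages start at their beginning */
	}
	return n;
}
```

With these assumptions, a 5000-byte payload that starts 4000 bytes into the first page would need three elements of 96, 4096, and 808 bytes.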

bool
rpcrdma_prepare_send_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
			  u32 hdrlen, struct xdr_buf *xdr,
			  enum rpcrdma_chunktype rtype)
{
	req->rl_send_wr.num_sge = 0;
	req->rl_mapped_sges = 0;

	if (!rpcrdma_prepare_hdr_sge(ia, req, hdrlen))
		goto out_map;

	if (rtype != rpcrdma_areadch)
		if (!rpcrdma_prepare_msg_sges(ia, req, xdr, rtype))
			goto out_map;

	return true;

out_map:
	pr_err("rpcrdma: failed to DMA map a Send buffer\n");
	return false;
}
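
As a rough illustration of how many gather elements one Send might need, the sketch below counts them for a given message shape. It is a user-space model under stated assumptions, not the kernel logic itself: it assumes the first element carries the transport header and the second the head iovec (suggested by the surrounding code, but not fully visible in this excerpt), and it assumes 4KB pages. The names sketch_rtype and sketch_count_send_sges are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/* Hypothetical subset of the chunk types that matter for Send. */
enum sketch_rtype { SKETCH_NOCH, SKETCH_READCH, SKETCH_AREADCH };

/* Count the gather elements needed for one Send. */
unsigned int sketch_count_send_sges(enum sketch_rtype rtype,
				    unsigned int page_base,
				    unsigned int page_len,
				    unsigned int tail_len)
{
	unsigned int n = 1;	/* assumed: transport header */

	/* Guard the rounding arithmetic below against overflow. */
	assert(page_len <= UINT_MAX - 2 * SKETCH_PAGE_SIZE);

	if (rtype == SKETCH_AREADCH)
		return n;	/* whole RPC message travels in a Read chunk */
	n++;			/* assumed: head iovec */
	if (rtype == SKETCH_READCH)	/* page list goes via explicit RDMA */
		return tail_len < 4 ? n : n + 1;
	if (page_len) {
		unsigned int offset = page_base % SKETCH_PAGE_SIZE;

		/* one element per page touched by the payload */
		n += (offset + page_len + SKETCH_PAGE_SIZE - 1) /
		     SKETCH_PAGE_SIZE;
	}
	if (tail_len)
		n++;		/* tail is mapped separately */
	return n;
}
```

A caller could compare the result against its own element limit before attempting to build the gather list.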

void
rpcrdma_unmap_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
{
	struct ib_device *device = ia->ri_device;
	struct ib_sge *sge;
	int count;

	sge = &req->rl_send_sge[2];
	for (count = req->rl_mapped_sges; count--; sge++)
		ib_dma_unmap_page(device, sge->addr, sge->length,
				  DMA_TO_DEVICE);
	req->rl_mapped_sges = 0;
2007-09-11 00:50:42 +07:00
}
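
The unmap routine above relies on simple bookkeeping: elements mapped per request are counted as they are added, they occupy consecutive slots starting at index 2, and teardown walks exactly that many slots. The following user-space sketch models that accounting with a boolean array in place of real DMA mappings. The names sketch_req, sketch_map_next, sketch_unmap_all, and SKETCH_MAX_SGES are hypothetical, and the array size of 8 is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_SGES 8	/* assumption: arbitrary small limit */

/* Hypothetical per-request state: which slots are "mapped". */
struct sketch_req {
	bool mapped[SKETCH_MAX_SGES];
	int mapped_sges;	/* count of temporarily mapped slots */
};

/* Mark the next free slot past the two persistent ones as mapped.
 * Returns false when no slot is left.
 */
bool sketch_map_next(struct sketch_req *req)
{
	int slot = 2 + req->mapped_sges;

	if (slot >= SKETCH_MAX_SGES)
		return false;
	req->mapped[slot] = true;
	req->mapped_sges++;
	return true;
}

/* Undo every temporary mapping, starting at slot 2. */
void sketch_unmap_all(struct sketch_req *req)
{
	bool *slot = &req->mapped[2];
	int count;

	assert(req->mapped_sges >= 0 &&
	       req->mapped_sges <= SKETCH_MAX_SGES - 2);
	for (count = req->mapped_sges; count--; slot++)
		*slot = false;	/* stands in for the DMA unmap call */
	req->mapped_sges = 0;
}
```

Resetting the counter at the end makes a second call to sketch_unmap_all a no-op, which is the property the real counter reset appears intended to provide.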

/*
 * Marshal a request: the primary job of this routine is to choose
 * the transfer modes. See comments below.
 *
2014-05-28 21:35:14 +07:00
 * Returns zero on success, otherwise a negative errno.
2007-09-11 00:50:42 +07:00
 */

int
rpcrdma_marshal_req(struct rpc_rqst *rqst)
{
2013-01-08 21:10:21 +07:00
	struct rpc_xprt *xprt = rqst->rq_xprt;
2007-09-11 00:50:42 +07:00
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
2015-03-31 01:33:53 +07:00
	enum rpcrdma_chunktype rtype, wtype;
2007-09-11 00:50:42 +07:00
	struct rpcrdma_msg *headerp;
2016-06-30 00:55:06 +07:00
	bool ddp_allowed;
xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
	if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
		dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
			__func__);
		return -EIO;
	}
But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-03 01:41:30 +07:00
	ssize_t hdrlen;
	size_t rpclen;
	__be32 *iptr;
2007-09-11 00:50:42 +07:00

2015-10-25 04:27:59 +07:00
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
	if (test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state))
		return rpcrdma_bc_marshal_reply(rqst);
#endif

2015-01-21 23:04:16 +07:00
	headerp = rdmab_to_msg(req->rl_rdmabuf);
2015-01-21 23:02:13 +07:00
	/* don't byte-swap XID, it's already done in request */
2007-09-11 00:50:42 +07:00
	headerp->rm_xid = rqst->rq_xid;
2015-01-21 23:02:13 +07:00
	headerp->rm_vers = rpcrdma_version;
	headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_max_requests);
	headerp->rm_type = rdma_msg;
2007-09-11 00:50:42 +07:00

2016-06-30 00:55:06 +07:00
	/* When the ULP employs a GSS flavor that guarantees integrity
	 * or privacy, direct data placement of individual data items
	 * is not allowed.
	 */
	ddp_allowed = !(rqst->rq_cred->cr_auth->au_flags &
						RPCAUTH_AUTH_DATATOUCH);

2007-09-11 00:50:42 +07:00
	/*
	 * Chunks needed for results?
	 *
	 * o If the expected result is under the inline threshold, all ops
2015-08-04 00:04:08 +07:00
	 *   return as inline.
2016-05-03 01:41:14 +07:00
	 * o Large read ops return data as write chunk(s), header as
	 *   inline.
2007-09-11 00:50:42 +07:00
	 * o Large non-read ops return as a single reply chunk.
	 */
2016-05-03 01:41:14 +07:00
	if (rpcrdma_results_inline(r_xprt, rqst))
2015-08-04 00:03:58 +07:00
		wtype = rpcrdma_noch;
2016-06-30 00:55:06 +07:00
	else if (ddp_allowed && rqst->rq_rcv_buf.flags & XDRBUF_READ)
2016-05-03 01:41:14 +07:00
		wtype = rpcrdma_writech;
2007-09-11 00:50:42 +07:00
	else
2015-03-31 01:33:53 +07:00
		wtype = rpcrdma_replych;
2007-09-11 00:50:42 +07:00

	/*
	 * Chunks needed for arguments?
	 *
	 * o If the total request is under the inline threshold, all ops
	 *   are sent as inline.
	 * o Large write ops transmit data as read chunk(s), header as
	 *   inline.
2015-08-04 00:04:26 +07:00
	 * o Large non-write ops are sent with the entire message as a
	 *   single read chunk (protocol 0-position special case).
2007-09-11 00:50:42 +07:00
	 *
2015-08-04 00:04:26 +07:00
	 * This assumes that the upper layer does not present a request
	 * that both has a data payload, and whose non-data arguments
	 * by themselves are larger than the inline threshold.
2007-09-11 00:50:42 +07:00
	 */
2016-05-03 01:41:05 +07:00
	if (rpcrdma_args_inline(r_xprt, rqst)) {
2015-03-31 01:33:53 +07:00
		rtype = rpcrdma_noch;
2016-09-15 21:57:24 +07:00
		rpclen = rqst->rq_snd_buf.len;
2016-06-30 00:55:06 +07:00
	} else if (ddp_allowed && rqst->rq_snd_buf.flags & XDRBUF_WRITE) {
2015-03-31 01:33:53 +07:00
		rtype = rpcrdma_readch;
2016-09-15 21:57:24 +07:00
		rpclen = rqst->rq_snd_buf.head[0].iov_len +
			 rqst->rq_snd_buf.tail[0].iov_len;
2015-08-04 00:04:26 +07:00
	} else {
2015-08-04 00:04:45 +07:00
		r_xprt->rx_stats.nomsg_call_count++;
2015-08-04 00:04:26 +07:00
		headerp->rm_type = htonl(RDMA_NOMSG);
		rtype = rpcrdma_areadch;
		rpclen = 0;
	}
2007-09-11 00:50:42 +07:00

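The two decisions made so far in this function, which chunk type to use for results and which for arguments, can be summarized as a pair of small pure functions. The sketch below is a user-space restatement under simplifying assumptions: the inline-threshold checks and buffer flags are reduced to booleans supplied by the caller, and all names (sketch_chunktype, sketch_call, sketch_pick_wtype, sketch_pick_rtype) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_chunktype {
	SKETCH_NOCH,	/* everything inline */
	SKETCH_READCH,	/* payload in Read chunk(s), header inline */
	SKETCH_AREADCH,	/* entire Call in a Position Zero Read chunk */
	SKETCH_WRITECH,	/* payload in Write chunk(s), header inline */
	SKETCH_REPLYCH,	/* entire Reply in a Reply chunk */
};

/* Hypothetical summary of one Call, as booleans. */
struct sketch_call {
	bool args_inline;	/* Call fits under the inline threshold */
	bool results_inline;	/* expected Reply fits under it */
	bool ddp_allowed;	/* no integrity or privacy transformation */
	bool has_write_payload;	/* send buffer flagged as a write */
	bool has_read_payload;	/* receive buffer flagged as a read */
};

/* How should the Reply come back? */
enum sketch_chunktype sketch_pick_wtype(const struct sketch_call *c)
{
	assert(c != NULL);
	if (c->results_inline)
		return SKETCH_NOCH;
	if (c->ddp_allowed && c->has_read_payload)
		return SKETCH_WRITECH;
	return SKETCH_REPLYCH;
}

/* How should the Call be sent? */
enum sketch_chunktype sketch_pick_rtype(const struct sketch_call *c)
{
	assert(c != NULL);
	if (c->args_inline)
		return SKETCH_NOCH;
	if (c->ddp_allowed && c->has_write_payload)
		return SKETCH_READCH;
	return SKETCH_AREADCH;
}
```

Under this model, a large Call and Reply with direct data placement disallowed would select a Position Zero Read chunk together with a Reply chunk, which is the combination the krb5i and krb5p case requires.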
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 22:52:20 +07:00
	req->rl_xid = rqst->rq_xid;
	rpcrdma_insert_req(&r_xprt->rx_buf, req);

2016-05-03 01:41:30 +07:00
	/* This implementation supports the following combinations
	 * of chunk lists in one RPC-over-RDMA Call message:
	 *
	 *   - Read list
	 *   - Write list
	 *   - Reply chunk
	 *   - Read list + Reply chunk
	 *
	 * It might not yet support the following combinations:
	 *
	 *   - Read list + Write list
	 *
	 * It does not support the following combinations:
	 *
	 *   - Write list + Reply chunk
	 *   - Read list + Write list + Reply chunk
	 *
	 * This implementation supports only a single chunk in each
	 * Read or Write list. Thus for example the client cannot
	 * send a Call message with a Position Zero Read chunk and a
	 * regular Read chunk at the same time.
2007-09-11 00:50:42 +07:00
	 */
2016-05-03 01:41:30 +07:00
	iptr = headerp->rm_body.rm_chunks;
	iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
	if (IS_ERR(iptr))
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs

Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
  FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
  but the recovery thread has already DMA unmapped that MR. When
  ->frwr_op_map later re-uses this frmr, it sees it is not marked
  FRMR_IS_INVALID, and tries to recover it before using it, resulting
  in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 05:00:27 +07:00
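The two orderings described in the commit message above can be replayed in a single-threaded toy model. This is a sketch under stated assumptions, not the kernel code: the state names are taken from the commit message, but `struct toy_frmr` and every `toy_*` function are hypothetical, real recovery and completion handling run concurrently rather than in sequence, and the starting state of the "still waiting" frmr is assumed to be `FRMR_IS_VALID`.

```c
#include <assert.h>
#include <stddef.h>

/* State names follow the commit message; the struct is hypothetical. */
enum frmr_state { FRMR_IS_INVALID, FRMR_IS_VALID, FRMR_FLUSHED_FR };

struct toy_frmr {
	enum frmr_state state;
	int dma_unmaps;		/* how many times the MR was DMA unmapped */
};

/* Recovery thread: unreg / alloc_mr, DMA unmap, then mark invalid. */
static void toy_recover(struct toy_frmr *f)
{
	f->dma_unmaps++;
	f->state = FRMR_IS_INVALID;
}

/* Completion handler for a FASTREG WR that was flushed. */
static void toy_wc_fastreg_flushed(struct toy_frmr *f)
{
	f->state = FRMR_FLUSHED_FR;
}

/* Re-use path: an frmr not marked invalid is recovered before use. */
static void toy_map(struct toy_frmr *f)
{
	assert(f != NULL);
	if (f->state != FRMR_IS_INVALID)
		toy_recover(f);
	f->state = FRMR_IS_VALID;
}

/* Returns the DMA unmap count after one lost connection and one re-use. */
static int toy_scenario(int recovery_runs_last)
{
	struct toy_frmr f = { FRMR_IS_VALID, 0 };

	if (recovery_runs_last) {
		toy_wc_fastreg_flushed(&f);
		toy_recover(&f);
	} else {
		toy_recover(&f);
		toy_wc_fastreg_flushed(&f);
	}
	toy_map(&f);
	return f.dma_unmaps;
}
```

With recovery running last the model unmaps once. With the completion handler running last, the stale `FRMR_FLUSHED_FR` state makes the re-use path recover again, which is the second unmap the report describes.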
		goto out_err;
2016-05-03 01:41:30 +07:00
	iptr = rpcrdma_encode_write_list(r_xprt, req, rqst, iptr, wtype);
	if (IS_ERR(iptr))
2017-02-09 05:00:27 +07:00
		goto out_err;
2016-05-03 01:41:30 +07:00
	iptr = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);
	if (IS_ERR(iptr))
2017-02-09 05:00:27 +07:00
		goto out_err;
2016-05-03 01:41:30 +07:00
	hdrlen = (unsigned char *)iptr - (unsigned char *)headerp;
2007-09-11 00:50:42 +07:00
2016-05-03 01:41:30 +07:00
	dprintk("RPC: %5u %s: %s/%s: hdrlen %zd rpclen %zd\n",
		rqst->rq_task->tk_pid, __func__,
		transfertypes[rtype], transfertypes[wtype],
		hdrlen, rpclen);
2007-09-11 00:50:42 +07:00
xprtrdma: Use gathered Send for large inline messages

An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"

- call_allocate has to reserve enough RPC Call buffer space to
  accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
  into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 21:57:24 +07:00
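The "gather instead of pull up" idea from the commit message above can be sketched in userspace as follows. Everything here is an assumption made for illustration: `struct toy_seg`, `struct toy_sge`, `TOY_MAX_SGES`, and the `toy_*` functions are hypothetical, no DMA mapping is performed, and the return value is an element count, which is not how the kernel's `rpcrdma_prepare_send_sges()` reports its result.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SGES 8	/* hypothetical limit on gather elements */

/* Hypothetical stand-ins for a buffer component and a Send SGE. */
struct toy_seg { const char *base; size_t len; };
struct toy_sge { const char *addr; size_t length; };

/* Append one in-place reference; returns 0 when the list is full. */
static int toy_add_sge(struct toy_sge *sge, int *n, const struct toy_seg *seg)
{
	if (*n == TOY_MAX_SGES)
		return 0;
	sge[*n].addr = seg->base;
	sge[*n].length = seg->len;
	(*n)++;
	return 1;
}

/*
 * Reference the transport header, head, page segments, and tail in
 * place, with no memcpy. Returns the number of SGEs used, or -1 if
 * the message needs more than TOY_MAX_SGES elements.
 */
static int toy_prepare_send_sges(struct toy_sge *sge,
				 const struct toy_seg *hdr,
				 const struct toy_seg *head,
				 const struct toy_seg *pages, size_t npages,
				 const struct toy_seg *tail)
{
	size_t i;
	int n = 0;

	assert(sge && hdr && head && tail);
	assert(npages == 0 || pages);

	if (!toy_add_sge(sge, &n, hdr) || !toy_add_sge(sge, &n, head))
		return -1;
	for (i = 0; i < npages; i++)
		if (!toy_add_sge(sge, &n, &pages[i]))
			return -1;
	if (tail->len && !toy_add_sge(sge, &n, tail))
		return -1;
	return n;
}

/* Demo: header + head + npages page segments + a 4-byte tail. */
static int toy_gather_demo(size_t npages)
{
	static char hdr_bytes[28], head_bytes[100], tail_bytes[4];
	static char page_bytes[16][64];
	struct toy_seg hdr = { hdr_bytes, sizeof(hdr_bytes) };
	struct toy_seg head = { head_bytes, sizeof(head_bytes) };
	struct toy_seg tail = { tail_bytes, sizeof(tail_bytes) };
	struct toy_seg pages[16];
	struct toy_sge sge[TOY_MAX_SGES];
	size_t i;

	assert(npages <= 16);
	for (i = 0; i < npages; i++) {
		pages[i].base = page_bytes[i];
		pages[i].len = sizeof(page_bytes[i]);
	}
	return toy_prepare_send_sges(sge, &hdr, &head, pages, npages, &tail);
}
```

The trade-off mirrors the one the commit message names: each component costs one gather element (and, in the real transport, a DMA mapping), but no payload byte is copied by the CPU.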
	if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, hdrlen,
				       &rqst->rq_snd_buf, rtype)) {
		iptr = ERR_PTR(-EIO);
2017-02-09 05:00:27 +07:00
		goto out_err;
2016-09-15 21:57:24 +07:00
	}
2007-09-11 00:50:42 +07:00
	return 0;
2016-05-03 01:41:05 +07:00
2017-02-09 05:00:27 +07:00
out_err:
2017-04-12 00:23:51 +07:00
	if (PTR_ERR(iptr) != -ENOBUFS) {
		pr_err("rpcrdma: rpcrdma_marshal_req failed, status %ld\n",
		       PTR_ERR(iptr));
		r_xprt->rx_stats.failed_marshal_count++;
	}
2016-05-03 01:41:30 +07:00
	return PTR_ERR(iptr);
2007-09-11 00:50:42 +07:00
|
|
|
}
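The marshaling function above threads a single cursor, iptr, through each encoder and lets an encoder report failure by returning an error code disguised as a pointer. The kernel provides `ERR_PTR()`, `IS_ERR()`, and `PTR_ERR()` for this. The sketch below re-creates the idiom in userspace with simplified, hypothetical `toy_*` helpers, so that the control flow into `out_err` can be tried outside the kernel. The 4095 bound and the encoder behaviour are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* assumed upper bound on errno values */

/* Simplified userspace analogues of ERR_PTR, PTR_ERR and IS_ERR. */
static void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int toy_is_err(const void *ptr)
{
	/* Error codes occupy the top TOY_MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical encoder step: advances the cursor or fails. */
static uint32_t *toy_encode(uint32_t *p, int fail_with)
{
	if (fail_with)
		return toy_err_ptr(-fail_with);
	*p++ = 0;
	return p;
}

/* Two encoder steps sharing one cursor and one error exit. */
static int toy_marshal(int fail_read, int fail_write)
{
	uint32_t buf[4];
	uint32_t *p = buf;

	assert(fail_read >= 0 && fail_write >= 0);

	p = toy_encode(p, fail_read);
	if (toy_is_err(p))
		goto out_err;
	p = toy_encode(p, fail_write);
	if (toy_is_err(p))
		goto out_err;
	return 0;

out_err:
	return (int)toy_ptr_err(p);
}
```

Because the cursor itself carries the error, the single `out_err` label can recover the status without a separate return-code variable.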

/*
 * Chase down a received write or reply chunklist to get length
 * RDMA'd by server. See map at rpcrdma_create_chunks()! :-)
 */
static int
2016-06-30 00:54:16 +07:00
rpcrdma_count_chunks(struct rpcrdma_rep *rep, int wrchunk, __be32 **iptrp)
2007-09-11 00:50:42 +07:00
{
	unsigned int i, total_len;
	struct rpcrdma_write_chunk *cur_wchunk;
2015-01-21 23:04:25 +07:00
	char *base = (char *)rdmab_to_msg(rep->rr_rdmabuf);
2007-09-11 00:50:42 +07:00

2015-01-21 23:02:13 +07:00
	i = be32_to_cpu(**iptrp);
2007-09-11 00:50:42 +07:00
	cur_wchunk = (struct rpcrdma_write_chunk *) (*iptrp + 1);
	total_len = 0;
	while (i--) {
		struct rpcrdma_segment *seg = &cur_wchunk->wc_target;
		ifdebug(FACILITY) {
			u64 off;
2007-10-29 11:37:58 +07:00
			xdr_decode_hyper((__be32 *)&seg->rs_offset, &off);
2016-11-29 22:53:29 +07:00
			dprintk("RPC: %s: chunk %d@0x%016llx:0x%08x\n",
2007-09-11 00:50:42 +07:00
				__func__,
2015-01-21 23:02:13 +07:00
				be32_to_cpu(seg->rs_length),
2007-10-30 14:44:32 +07:00
				(unsigned long long)off,
2015-01-21 23:02:13 +07:00
				be32_to_cpu(seg->rs_handle));
2007-09-11 00:50:42 +07:00
		}
2015-01-21 23:02:13 +07:00
		total_len += be32_to_cpu(seg->rs_length);
2007-09-11 00:50:42 +07:00
		++cur_wchunk;
	}
	/* check and adjust for properly terminated write chunk */
	if (wrchunk) {
2007-10-29 11:37:58 +07:00
		__be32 *w = (__be32 *) cur_wchunk;
2007-09-11 00:50:42 +07:00
		if (*w++ != xdr_zero)
			return -1;
		cur_wchunk = (struct rpcrdma_write_chunk *) w;
	}
2015-01-21 23:04:25 +07:00
	if ((char *)cur_wchunk > base + rep->rr_len)
2007-09-11 00:50:42 +07:00
		return -1;

2007-10-29 11:37:58 +07:00
	*iptrp = (__be32 *) cur_wchunk;
2007-09-11 00:50:42 +07:00
	return total_len;
|
|
|
}
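The walk performed by rpcrdma_count_chunks() can be tried in userspace with a sketch like the one below. It assumes the RPC-over-RDMA segment layout of a 32-bit handle, a 32-bit length, and a 64-bit offset, all big-endian. The `toy_*` names and the demo message are hypothetical, and unlike the function above, which checks the bound once after the walk, this sketch checks the bound before every read.

```c
#include <arpa/inet.h>	/* ntohl(), htonl() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SEG_WORDS 4	/* handle, length, and a two-word offset */

/*
 * Walk a counted array of segments starting at *pp and return the sum
 * of their lengths, or -1 if the array runs past 'end' or a Write
 * chunk lacks its terminating zero word. On success *pp is advanced.
 */
static long toy_count_chunks(const uint32_t **pp, const uint32_t *end,
			     int wrchunk)
{
	const uint32_t *p;
	uint32_t nsegs;
	long total = 0;

	assert(pp && *pp && end);
	p = *pp;

	if (p >= end)
		return -1;
	nsegs = ntohl(*p++);
	while (nsegs--) {
		if ((size_t)(end - p) < TOY_SEG_WORDS)
			return -1;
		total += ntohl(p[1]);	/* word 1 is the segment length */
		p += TOY_SEG_WORDS;
	}
	if (wrchunk) {
		/* A Write chunk must end with a zero word. */
		if (p >= end || *p++ != 0)
			return -1;
	}
	*pp = p;
	return total;
}

/* Demo: two segments of 100 and 28 bytes; only nwords are "received". */
static long toy_count_demo(size_t nwords)
{
	uint32_t msg[10] = {
		htonl(2),
		htonl(0x11), htonl(100), 0, 0,
		htonl(0x22), htonl(28), 0, 0,
		0,
	};
	const uint32_t *p = msg;

	assert(nwords <= 10);
	return toy_count_chunks(&p, msg + nwords, 1);
}
```

Truncating the demo message shows why the bound matters: with fewer than ten words the walk either runs out of segment data or finds no terminator, and the sketch reports -1 rather than reading past the buffer.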
|
|
|
|
|
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()

While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.

The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
  if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting
comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-30 00:54:41 +07:00
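The approach the commit message above settles on, repointing iov_base while leaving every iov_len untouched, can be sketched in userspace as follows. This is a simplified model rather than the kernel function: `struct toy_kvec`, `struct toy_rcv_buf`, and the `toy_*` functions are hypothetical, and the page list and the Write chunk pad are left out, so nothing is ever copied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified receive buffer: head and tail only. */
struct toy_kvec { char *iov_base; size_t iov_len; };
struct toy_rcv_buf { struct toy_kvec head, tail; };

/*
 * Point head and tail at the received bytes without touching iov_len.
 * Returns the number of bytes that had to be copied, which is always
 * zero in this sketch because it has no page list.
 */
static unsigned long toy_inline_fixup(struct toy_rcv_buf *buf, char *srcp,
				      size_t copy_len)
{
	size_t curlen;

	assert(buf && srcp);

	/* Head: consume up to head.iov_len bytes; iov_len stays as set. */
	buf->head.iov_base = srcp;
	curlen = copy_len < buf->head.iov_len ? copy_len : buf->head.iov_len;
	srcp += curlen;
	copy_len -= curlen;

	/* Tail: same approach; any remaining bytes start here. */
	if (copy_len)
		buf->tail.iov_base = srcp;
	return 0;
}

/* Demo: returns 1 if the lengths survived and the bases are correct. */
static int toy_fixup_demo(size_t reply_len)
{
	static char rcv[32];
	static char old_head[16], old_tail[8];
	struct toy_rcv_buf buf = {
		{ old_head, sizeof(old_head) },
		{ old_tail, sizeof(old_tail) },
	};

	assert(reply_len <= sizeof(rcv));
	toy_inline_fixup(&buf, rcv, reply_len);

	if (buf.head.iov_len != 16 || buf.tail.iov_len != 8)
		return 0;
	if (buf.head.iov_base != rcv)
		return 0;
	if (reply_len > 16 && buf.tail.iov_base != rcv + 16)
		return 0;
	return 1;
}
```

A reply that fits in the head (for example 10 bytes) leaves the tail exactly as the upper layer set it up, which is the property xdr_buf_trim() relied on in the failure described above.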
/**
 * rpcrdma_inline_fixup - Scatter inline received data into rqst's iovecs
 * @rqst: controlling RPC request
 * @srcp: points to RPC message payload in receive buffer
 * @copy_len: remaining length of receive buffer content
 * @pad: Write chunk pad bytes needed (zero for pure inline)
 *
 * The upper layer has set the maximum number of bytes it can
 * receive in each component of rq_rcv_buf. These values are set in
 * the head.iov_len, page_len, tail.iov_len, and buflen fields.
2016-06-30 00:54:49 +07:00
 *
 * Unlike the TCP equivalent (xdr_partial_copy_from_skb), in
 * many cases this function simply updates iov_base pointers in
 * rq_rcv_buf to point directly to the received reply data, to
 * avoid copying reply data.
2016-06-30 00:54:58 +07:00
 *
 * Returns the count of bytes which had to be memcopied.
2007-09-11 00:50:42 +07:00
 */
2016-06-30 00:54:58 +07:00
static unsigned long
2008-10-10 02:01:11 +07:00
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
2007-09-11 00:50:42 +07:00
{
2016-06-30 00:54:58 +07:00
	unsigned long fixup_copy_count;
	int i, npages, curlen;
2007-09-11 00:50:42 +07:00
	char *destp;
2011-02-10 02:45:28 +07:00
	struct page **ppages;
	int page_base;
2007-09-11 00:50:42 +07:00

2016-06-30 00:54:41 +07:00
|
|
|
/* The head iovec is redirected to the RPC reply message
|
|
|
|
* in the receive buffer, to avoid a memcopy.
|
|
|
|
*/
|
|
|
|
rqst->rq_rcv_buf.head[0].iov_base = srcp;
|
2016-06-30 00:54:49 +07:00
|
|
|
rqst->rq_private_buf.head[0].iov_base = srcp;
|
2016-06-30 00:54:41 +07:00
|
|
|
|
|
|
|
/* The contents of the receive buffer that follow
|
|
|
|
* head.iov_len bytes are copied into the page list.
|
|
|
|
*/
|
2007-09-11 00:50:42 +07:00
|
|
|
curlen = rqst->rq_rcv_buf.head[0].iov_len;
|
2016-06-30 00:54:41 +07:00
|
|
|
if (curlen > copy_len)
|
2007-09-11 00:50:42 +07:00
|
|
|
curlen = copy_len;
|
|
|
|
dprintk("RPC: %s: srcp 0x%p len %d hdrlen %d\n",
|
|
|
|
__func__, srcp, copy_len, curlen);
|
|
|
|
srcp += curlen;
|
|
|
|
copy_len -= curlen;
|
|
|
|
|
2017-06-08 22:53:16 +07:00
|
|
|
ppages = rqst->rq_rcv_buf.pages +
|
|
|
|
(rqst->rq_rcv_buf.page_base >> PAGE_SHIFT);
|
|
|
|
page_base = offset_in_page(rqst->rq_rcv_buf.page_base);
|
2016-06-30 00:54:58 +07:00
|
|
|
fixup_copy_count = 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
if (copy_len && rqst->rq_rcv_buf.page_len) {
|
2016-06-30 00:54:33 +07:00
|
|
|
int pagelist_len;
|
|
|
|
|
|
|
|
pagelist_len = rqst->rq_rcv_buf.page_len;
|
|
|
|
if (pagelist_len > copy_len)
|
|
|
|
pagelist_len = copy_len;
|
|
|
|
npages = PAGE_ALIGN(page_base + pagelist_len) >> PAGE_SHIFT;
|
2016-06-30 00:54:58 +07:00
|
|
|
for (i = 0; i < npages; i++) {
|
2011-02-10 02:45:28 +07:00
|
|
|
curlen = PAGE_SIZE - page_base;
|
2016-06-30 00:54:33 +07:00
|
|
|
if (curlen > pagelist_len)
|
|
|
|
curlen = pagelist_len;
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
dprintk("RPC: %s: page %d"
|
|
|
|
" srcp 0x%p len %d curlen %d\n",
|
|
|
|
__func__, i, srcp, copy_len, curlen);
|
2011-11-25 22:14:40 +07:00
|
|
|
destp = kmap_atomic(ppages[i]);
|
2011-02-10 02:45:28 +07:00
|
|
|
memcpy(destp + page_base, srcp, curlen);
|
|
|
|
flush_dcache_page(ppages[i]);
|
2011-11-25 22:14:40 +07:00
|
|
|
kunmap_atomic(destp);
|
2007-09-11 00:50:42 +07:00
|
|
|
srcp += curlen;
|
|
|
|
copy_len -= curlen;
|
2016-06-30 00:54:58 +07:00
|
|
|
fixup_copy_count += curlen;
|
2016-06-30 00:54:33 +07:00
|
|
|
pagelist_len -= curlen;
|
|
|
|
if (!pagelist_len)
|
2007-09-11 00:50:42 +07:00
|
|
|
break;
|
2011-02-10 02:45:28 +07:00
|
|
|
page_base = 0;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
2016-06-30 00:54:41 +07:00
|
|
|
/* Implicit padding for the last segment in a Write
|
|
|
|
* chunk is inserted inline at the front of the tail
|
|
|
|
* iovec. The upper layer ignores the content of
|
|
|
|
* the pad. Simply ensure inline content in the tail
|
|
|
|
* that follows the Write chunk is properly aligned.
|
|
|
|
*/
|
|
|
|
if (pad)
|
|
|
|
srcp -= pad;
|
2008-10-10 02:01:11 +07:00
|
|
|
}
|
|
|
|
|
2016-06-30 00:54:41 +07:00
|
|
|
/* The tail iovec is redirected to the remaining data
|
|
|
|
* in the receive buffer, to avoid a memcopy.
|
|
|
|
*/
|
2016-06-30 00:54:49 +07:00
|
|
|
if (copy_len || pad) {
|
2016-06-30 00:54:41 +07:00
|
|
|
rqst->rq_rcv_buf.tail[0].iov_base = srcp;
|
2016-06-30 00:54:49 +07:00
|
|
|
rqst->rq_private_buf.tail[0].iov_base = srcp;
|
|
|
|
}
|
2016-06-30 00:54:41 +07:00
|
|
|
|
2016-06-30 00:54:58 +07:00
|
|
|
return fixup_copy_count;
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
|
|
|
|
2017-06-08 22:51:56 +07:00
|
|
|
/* Caller must guarantee @rep remains stable during this call.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
rpcrdma_mark_remote_invalidation(struct list_head *mws,
|
|
|
|
struct rpcrdma_rep *rep)
|
|
|
|
{
|
|
|
|
struct rpcrdma_mw *mw;
|
|
|
|
|
|
|
|
if (!(rep->rr_wc_flags & IB_WC_WITH_INVALIDATE))
|
|
|
|
return;
|
|
|
|
|
|
|
|
list_for_each_entry(mw, mws, mw_list)
|
|
|
|
if (mw->mw_handle == rep->rr_inv_rkey) {
|
|
|
|
mw->mw_flags = RPCRDMA_MW_F_RI;
|
|
|
|
break; /* only one invalidated MR per RPC */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-10-25 04:28:08 +07:00
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
|
|
|
/* By convention, backchannel calls arrive via rdma_msg type
|
|
|
|
* messages, and never populate the chunk lists. This makes
|
|
|
|
* the RPC/RDMA header small and fixed in size, so it is
|
|
|
|
* straightforward to check the RPC header's direction field.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
rpcrdma_is_bcall(struct rpcrdma_msg *headerp)
|
|
|
|
{
|
|
|
|
__be32 *p = (__be32 *)headerp;
|
|
|
|
|
|
|
|
if (headerp->rm_type != rdma_msg)
|
|
|
|
return false;
|
|
|
|
if (headerp->rm_body.rm_chunks[0] != xdr_zero)
|
|
|
|
return false;
|
|
|
|
if (headerp->rm_body.rm_chunks[1] != xdr_zero)
|
|
|
|
return false;
|
|
|
|
if (headerp->rm_body.rm_chunks[2] != xdr_zero)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* sanity */
|
|
|
|
if (p[7] != headerp->rm_xid)
|
|
|
|
return false;
|
|
|
|
/* call direction */
|
|
|
|
if (p[8] != cpu_to_be32(RPC_CALL))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
|
|
|
|
|
2015-10-25 04:27:10 +07:00
|
|
|
/* Process received RPC/RDMA messages.
|
|
|
|
*
|
2007-09-11 00:50:42 +07:00
|
|
|
* Errors must result in the RPC task either being awakened, or
|
|
|
|
* allowed to time out, to discover the errors at that time.
|
|
|
|
*/
|
|
|
|
void
|
2016-09-15 21:57:57 +07:00
|
|
|
rpcrdma_reply_handler(struct work_struct *work)
|
2007-09-11 00:50:42 +07:00
|
|
|
{
|
2016-09-15 21:57:57 +07:00
|
|
|
struct rpcrdma_rep *rep =
|
|
|
|
container_of(work, struct rpcrdma_rep, rr_work);
|
xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.
The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.
With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.
This sets up a race between two processes:
1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
rpcrdma_reply_handler runs, calling ro_unmap_sync.
Both processes invoke ib_unmap_fmr on the same FMR.
The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.
If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.
But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.
Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
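The "look up, then mark as matched, under a single lock" step this commit message describes can be sketched in user space. This is only an illustration under assumptions: `sketch_req`, `sketch_buf`, and `sketch_match_reply` are hypothetical names, a pthread mutex stands in for the kernel spinlock, and a linked list stands in for whatever lookup structure the transport uses. Invalidation and RPC completion, which follow the match in the real handler, are not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pending request tracked by XID. */
struct sketch_req {
	uint32_t xid;
	void *reply;			/* non-NULL once a reply has matched */
	struct sketch_req *next;
};

/* Hypothetical stand-in for the per-transport buffer pool. */
struct sketch_buf {
	pthread_mutex_t lock;
	struct sketch_req *pending;
};

/*
 * Find the request with the given XID and claim it for this reply.
 * The lookup and the "matched" marking happen under one lock hold, so
 * a duplicate reply or a concurrent release cannot claim the same
 * request. Returns NULL when there is no match or when the request
 * has already been claimed by an earlier reply.
 */
static struct sketch_req *sketch_match_reply(struct sketch_buf *buf,
					     uint32_t xid, void *rep)
{
	struct sketch_req *req;

	assert(buf != NULL);

	pthread_mutex_lock(&buf->lock);
	for (req = buf->pending; req; req = req->next)
		if (req->xid == xid)
			break;
	if (req && req->reply)
		req = NULL;		/* duplicate reply */
	if (req)
		req->reply = rep;	/* mark as matched */
	pthread_mutex_unlock(&buf->lock);

	return req;
}
```

Once a request has been claimed this way, the caller can invalidate its memory regions outside the lock before completing the RPC, which is the ordering the commit message aims for.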
2017-06-08 22:52:20 +07:00
|
|
|
struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
|
|
|
|
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
|
|
|
|
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
|
2007-09-11 00:50:42 +07:00
|
|
|
struct rpcrdma_msg *headerp;
|
|
|
|
struct rpcrdma_req *req;
|
|
|
|
struct rpc_rqst *rqst;
|
2007-10-29 11:37:58 +07:00
|
|
|
__be32 *iptr;
|
2016-03-04 23:28:18 +07:00
|
|
|
int rdmalen, status, rmerr;
|
2014-05-28 21:34:57 +07:00
|
|
|
unsigned long cwnd;
|
2017-06-08 22:52:04 +07:00
|
|
|
struct list_head mws;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-10-25 04:26:54 +07:00
|
|
|
dprintk("RPC: %s: incoming rep %p\n", __func__, rep);
|
|
|
|
|
|
|
|
if (rep->rr_len == RPCRDMA_BAD_LEN)
|
|
|
|
goto out_badstatus;
|
2016-03-04 23:28:18 +07:00
|
|
|
if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
|
2015-10-25 04:26:54 +07:00
|
|
|
goto out_shortreply;
|
|
|
|
|
2015-01-21 23:04:25 +07:00
|
|
|
headerp = rdmab_to_msg(rep->rr_rdmabuf);
|
2015-10-25 04:28:08 +07:00
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
|
|
|
if (rpcrdma_is_bcall(headerp))
|
|
|
|
goto out_bcall;
|
|
|
|
#endif
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2015-10-25 04:27:10 +07:00
|
|
|
/* Match incoming rpcrdma_rep to an rpcrdma_req to
|
|
|
|
* get context for handling any incoming chunks.
|
|
|
|
*/
|
2017-06-08 22:52:20 +07:00
|
|
|
spin_lock(&buf->rb_lock);
|
|
|
|
req = rpcrdma_lookup_req_locked(&r_xprt->rx_buf,
|
|
|
|
headerp->rm_xid);
|
|
|
|
if (!req)
|
2015-10-25 04:26:54 +07:00
|
|
|
goto out_nomatch;
|
|
|
|
if (req->rl_reply)
|
|
|
|
goto out_duplicate;
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-06-08 22:52:04 +07:00
|
|
|
list_replace_init(&req->rl_registered, &mws);
|
|
|
|
rpcrdma_mark_remote_invalidation(&mws, rep);
|
2017-06-08 22:52:20 +07:00
|
|
|
|
|
|
|
/* Avoid races with signals and duplicate replies
|
|
|
|
* by marking this req as matched.
|
|
|
|
*/
|
2017-06-08 22:51:56 +07:00
|
|
|
req->rl_reply = rep;
|
2017-06-08 22:52:20 +07:00
|
|
|
spin_unlock(&buf->rb_lock);
|
|
|
|
|
2016-03-04 23:27:43 +07:00
|
|
|
dprintk("RPC: %s: reply %p completes request %p (xid 0x%08x)\n",
|
|
|
|
__func__, rep, req, be32_to_cpu(headerp->rm_xid));
|
2007-09-11 00:50:42 +07:00
|
|
|
|
xprtrdma: Fix client lock-up after application signal fires

After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.

The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.

With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.

This sets up a race between two processes:

1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and
   rpcrdma_reply_handler runs, calling ro_unmap_sync.

Both processes invoke ib_unmap_fmr on the same FMR.

The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.

If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.

But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since
bottom-halves are disabled, the system locks up.

Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
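
The locking change described in this commit message can be illustrated outside the kernel. The following is a minimal userspace sketch, not the xprtrdma implementation: it assumes a pthread mutex in place of transport_lock, and every name in it (demo_xprt, demo_rqst, demo_lookup, demo_release, demo_handle_reply) is hypothetical. It shows only the idea that the XID lookup and the completion share one critical section, so a request released by an aborted RPC is reported as missing instead of being dereferenced.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for rpc_rqst and the transport; not kernel types. */
struct demo_rqst {
	uint32_t xid;
	int in_use;	/* nonzero while an RPC is waiting on this slot */
	int status;	/* completion status delivered to the waiter */
};

#define DEMO_SLOTS 4

struct demo_xprt {
	pthread_mutex_t lock;	/* plays the role of transport_lock */
	struct demo_rqst slots[DEMO_SLOTS];
};

/* Caller must hold xprt->lock. Returns NULL if the RPC was already released. */
static struct demo_rqst *demo_lookup(struct demo_xprt *xprt, uint32_t xid)
{
	for (size_t i = 0; i < DEMO_SLOTS; i++)
		if (xprt->slots[i].in_use && xprt->slots[i].xid == xid)
			return &xprt->slots[i];
	return NULL;
}

/* Models an aborted RPC: the slot is released under the same lock. */
static void demo_release(struct demo_xprt *xprt, uint32_t xid)
{
	pthread_mutex_lock(&xprt->lock);
	struct demo_rqst *rqst = demo_lookup(xprt, xid);
	if (rqst)
		rqst->in_use = 0;
	pthread_mutex_unlock(&xprt->lock);
}

/*
 * Reply path. Any slow work (such as invalidation) is assumed to have
 * finished before this call. Lookup and completion happen in a single
 * critical section, so the slot cannot be released between them.
 * Returns 0 on completion, -1 if no request was left for this XID.
 */
static int demo_handle_reply(struct demo_xprt *xprt, uint32_t xid, int status)
{
	int ret = -1;

	assert(xprt != NULL);
	pthread_mutex_lock(&xprt->lock);
	struct demo_rqst *rqst = demo_lookup(xprt, xid);
	if (rqst) {
		rqst->status = status;
		rqst->in_use = 0;	/* completion retires the slot */
		ret = 0;
	}
	pthread_mutex_unlock(&xprt->lock);
	return ret;
}
```

The -1 return corresponds loosely to the out_norqst path in the handler: the reply arrived, but the RPC had already been terminated.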
2017-06-08 22:52:20 +07:00
|
|
|
/* Invalidate and unmap the data payloads before waking the
|
|
|
|
* waiting application. This guarantees the memory regions
|
|
|
|
* are properly fenced from the server before the application
|
|
|
|
* accesses the data. It also ensures proper send flow control:
|
|
|
|
* waking the next RPC waits until this RPC has relinquished
|
|
|
|
* all its Send Queue entries.
|
|
|
|
*/
|
|
|
|
if (!list_empty(&mws))
|
|
|
|
r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, &mws);
|
2007-09-11 00:50:42 +07:00
|
|
|
|
2017-06-08 22:52:20 +07:00
|
|
|
/* Perform XID lookup, reconstruction of the RPC reply, and
|
|
|
|
* RPC completion while holding the transport lock to ensure
|
|
|
|
* the rep, rqst, and rq_task pointers remain stable.
|
|
|
|
*/
|
|
|
|
spin_lock_bh(&xprt->transport_lock);
|
|
|
|
rqst = xprt_lookup_rqst(xprt, headerp->rm_xid);
|
|
|
|
if (!rqst)
|
|
|
|
goto out_norqst;
|
|
|
|
xprt->reestablish_timeout = 0;
|
2016-03-04 23:28:18 +07:00
|
|
|
if (headerp->rm_vers != rpcrdma_version)
|
|
|
|
goto out_badversion;
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
/* check for expected message types */
|
|
|
|
/* The order of some of these tests is important. */
|
|
|
|
switch (headerp->rm_type) {
|
2015-01-21 23:02:13 +07:00
|
|
|
case rdma_msg:
|
2007-09-11 00:50:42 +07:00
|
|
|
/* never expect read chunks */
|
|
|
|
/* never expect reply chunks (two ways to check) */
|
|
|
|
if (headerp->rm_body.rm_chunks[0] != xdr_zero ||
|
|
|
|
(headerp->rm_body.rm_chunks[1] == xdr_zero &&
|
2017-06-08 22:52:04 +07:00
|
|
|
headerp->rm_body.rm_chunks[2] != xdr_zero))
|
2007-09-11 00:50:42 +07:00
|
|
|
goto badheader;
|
|
|
|
if (headerp->rm_body.rm_chunks[1] != xdr_zero) {
|
|
|
|
/* count any expected write chunks in read reply */
|
|
|
|
/* start at write chunk array count */
|
|
|
|
iptr = &headerp->rm_body.rm_chunks[2];
|
2016-06-30 00:54:16 +07:00
|
|
|
rdmalen = rpcrdma_count_chunks(rep, 1, &iptr);
|
2007-09-11 00:50:42 +07:00
|
|
|
/* check for validity, and no reply chunk after */
|
|
|
|
if (rdmalen < 0 || *iptr++ != xdr_zero)
|
|
|
|
goto badheader;
|
|
|
|
rep->rr_len -=
|
|
|
|
((unsigned char *)iptr - (unsigned char *)headerp);
|
|
|
|
status = rep->rr_len + rdmalen;
|
|
|
|
r_xprt->rx_stats.total_rdma_reply += rdmalen;
|
2008-10-10 02:01:11 +07:00
|
|
|
/* special case - last chunk may omit padding */
|
|
|
|
if (rdmalen &= 3) {
|
|
|
|
rdmalen = 4 - rdmalen;
|
|
|
|
status += rdmalen;
|
|
|
|
}
|
2007-09-11 00:50:42 +07:00
|
|
|
} else {
|
|
|
|
/* else ordinary inline */
|
2008-10-10 02:01:11 +07:00
|
|
|
rdmalen = 0;
|
2015-01-21 23:02:29 +07:00
|
|
|
iptr = (__be32 *)((unsigned char *)headerp +
|
|
|
|
RPCRDMA_HDRLEN_MIN);
|
|
|
|
rep->rr_len -= RPCRDMA_HDRLEN_MIN;
|
2007-09-11 00:50:42 +07:00
|
|
|
status = rep->rr_len;
|
|
|
|
}
|
2016-06-30 00:54:58 +07:00
|
|
|
|
|
|
|
r_xprt->rx_stats.fixup_copy_count +=
|
|
|
|
rpcrdma_inline_fixup(rqst, (char *)iptr, rep->rr_len,
|
|
|
|
rdmalen);
|
2007-09-11 00:50:42 +07:00
|
|
|
break;
|
|
|
|
|
2015-01-21 23:02:13 +07:00
|
|
|
case rdma_nomsg:
|
2007-09-11 00:50:42 +07:00
|
|
|
/* never expect read or write chunks, always reply chunks */
|
|
|
|
if (headerp->rm_body.rm_chunks[0] != xdr_zero ||
|
|
|
|
headerp->rm_body.rm_chunks[1] != xdr_zero ||
|
2017-06-08 22:52:04 +07:00
|
|
|
headerp->rm_body.rm_chunks[2] != xdr_one)
|
2007-09-11 00:50:42 +07:00
|
|
|
goto badheader;
|
2015-01-21 23:02:29 +07:00
|
|
|
iptr = (__be32 *)((unsigned char *)headerp +
|
|
|
|
RPCRDMA_HDRLEN_MIN);
|
2016-06-30 00:54:16 +07:00
|
|
|
rdmalen = rpcrdma_count_chunks(rep, 0, &iptr);
|
2007-09-11 00:50:42 +07:00
|
|
|
if (rdmalen < 0)
|
|
|
|
goto badheader;
|
|
|
|
r_xprt->rx_stats.total_rdma_reply += rdmalen;
|
|
|
|
/* Reply chunk buffer already is the reply vector - no fixup. */
|
|
|
|
status = rdmalen;
|
|
|
|
break;
|
|
|
|
|
2016-03-04 23:28:18 +07:00
|
|
|
case rdma_error:
|
|
|
|
goto out_rdmaerr;
|
|
|
|
|
2007-09-11 00:50:42 +07:00
|
|
|
badheader:
|
|
|
|
default:
|
2016-06-30 00:54:16 +07:00
|
|
|
dprintk("RPC: %5u %s: invalid rpcrdma reply (type %u)\n",
|
|
|
|
rqst->rq_task->tk_pid, __func__,
|
|
|
|
be32_to_cpu(headerp->rm_type));
|
2007-09-11 00:50:42 +07:00
|
|
|
status = -EIO;
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2016-03-04 23:28:18 +07:00
|
|
|
out:
|
2014-05-28 21:34:57 +07:00
|
|
|
cwnd = xprt->cwnd;
|
2016-03-04 23:28:27 +07:00
|
|
|
xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits) << RPC_CWNDSHIFT;
|
2014-05-28 21:34:57 +07:00
|
|
|
if (xprt->cwnd > cwnd)
|
|
|
|
xprt_release_rqst_cong(rqst->rq_task);
|
|
|
|
|
2015-10-25 04:26:54 +07:00
|
|
|
xprt_complete_rqst(rqst->rq_task, status);
|
2015-10-25 04:27:10 +07:00
|
|
|
spin_unlock_bh(&xprt->transport_lock);
|
2007-09-11 00:50:42 +07:00
|
|
|
dprintk("RPC: %s: xprt_complete_rqst(0x%p, 0x%p, %d)\n",
|
2017-06-08 22:52:20 +07:00
|
|
|
__func__, xprt, rqst, status);
|
2015-10-25 04:26:54 +07:00
|
|
|
return;
|
|
|
|
|
|
|
|
out_badstatus:
|
|
|
|
rpcrdma_recv_buffer_put(rep);
|
|
|
|
if (r_xprt->rx_ep.rep_connected == 1) {
|
|
|
|
r_xprt->rx_ep.rep_connected = -EIO;
|
|
|
|
rpcrdma_conn_func(&r_xprt->rx_ep);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
|
2015-10-25 04:28:08 +07:00
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
|
|
|
out_bcall:
|
|
|
|
rpcrdma_bc_receive_call(r_xprt, rep);
|
|
|
|
return;
|
|
|
|
#endif
|
|
|
|
|
2016-03-04 23:28:18 +07:00
|
|
|
/* If the incoming reply terminated a pending RPC, the next
|
|
|
|
* RPC call will post a replacement receive buffer as it is
|
|
|
|
* being marshaled.
|
|
|
|
*/
|
2015-10-25 04:26:54 +07:00
|
|
|
out_badversion:
|
|
|
|
dprintk("RPC: %s: invalid version %d\n",
|
|
|
|
__func__, be32_to_cpu(headerp->rm_vers));
|
2016-03-04 23:28:18 +07:00
|
|
|
status = -EIO;
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
out_rdmaerr:
|
|
|
|
rmerr = be32_to_cpu(headerp->rm_body.rm_error.rm_err);
|
|
|
|
switch (rmerr) {
|
|
|
|
case ERR_VERS:
|
|
|
|
pr_err("%s: server reports header version error (%u-%u)\n",
|
|
|
|
__func__,
|
|
|
|
be32_to_cpu(headerp->rm_body.rm_error.rm_vers_low),
|
|
|
|
be32_to_cpu(headerp->rm_body.rm_error.rm_vers_high));
|
|
|
|
break;
|
|
|
|
case ERR_CHUNK:
|
|
|
|
pr_err("%s: server reports header decoding error\n",
|
|
|
|
__func__);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
pr_err("%s: server reports unknown error %d\n",
|
|
|
|
__func__, rmerr);
|
|
|
|
}
|
|
|
|
status = -EREMOTEIO;
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
|
|
|
goto out;
|
|
|
|
|
2017-06-08 22:52:20 +07:00
|
|
|
/* The req was still available, but by the time the transport_lock
|
|
|
|
* was acquired, the rqst and task had been released. Thus the RPC
|
|
|
|
* has already been terminated.
|
2016-03-04 23:28:18 +07:00
|
|
|
*/
|
2017-06-08 22:52:20 +07:00
|
|
|
out_norqst:
|
|
|
|
spin_unlock_bh(&xprt->transport_lock);
|
|
|
|
rpcrdma_buffer_put(req);
|
|
|
|
dprintk("RPC: %s: race, no rqst left for req %p\n",
|
|
|
|
__func__, req);
|
|
|
|
return;
|
|
|
|
|
2016-03-04 23:28:18 +07:00
|
|
|
out_shortreply:
|
|
|
|
dprintk("RPC: %s: short/invalid reply\n", __func__);
|
2015-10-25 04:26:54 +07:00
|
|
|
goto repost;
|
|
|
|
|
|
|
|
out_nomatch:
|
2017-06-08 22:52:20 +07:00
|
|
|
spin_unlock(&buf->rb_lock);
|
2015-10-25 04:26:54 +07:00
|
|
|
dprintk("RPC: %s: no match for incoming xid 0x%08x len %d\n",
|
|
|
|
__func__, be32_to_cpu(headerp->rm_xid),
|
|
|
|
rep->rr_len);
|
|
|
|
goto repost;
|
|
|
|
|
|
|
|
out_duplicate:
|
2017-06-08 22:52:20 +07:00
|
|
|
spin_unlock(&buf->rb_lock);
|
2015-10-25 04:26:54 +07:00
|
|
|
dprintk("RPC: %s: "
|
|
|
|
"duplicate reply %p to RPC request %p: xid 0x%08x\n",
|
|
|
|
__func__, rep, req, be32_to_cpu(headerp->rm_xid));
|
|
|
|
|
2017-06-08 22:52:20 +07:00
|
|
|
/* If no pending RPC transaction was matched, post a replacement
|
|
|
|
* receive buffer before returning.
|
|
|
|
*/
|
2015-10-25 04:26:54 +07:00
|
|
|
repost:
|
|
|
|
r_xprt->rx_stats.bad_reply_count++;
|
2016-09-15 21:56:35 +07:00
|
|
|
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
|
2015-10-25 04:26:54 +07:00
|
|
|
rpcrdma_recv_buffer_put(rep);
|
2007-09-11 00:50:42 +07:00
|
|
|
}
|
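
The rdma_msg branch of the handler above adds padding to the reply length when the write chunk payload is not a multiple of four bytes ("last chunk may omit padding"). The arithmetic can be checked in isolation with the sketch below. It is a standalone illustration, not kernel code: demo_xdr_pad and demo_reply_len are hypothetical names, and size_t is used in place of the handler's int variables.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to round len up to the next 4-byte XDR boundary (0..3). */
static size_t demo_xdr_pad(size_t len)
{
	size_t rem = len & 3;

	return rem ? 4 - rem : 0;
}

/*
 * Hypothetical helper mirroring the status computation in the rdma_msg
 * branch: inline bytes, plus RDMA-written bytes, plus any XDR padding
 * the server omitted after the last chunk.
 */
static size_t demo_reply_len(size_t inline_len, size_t rdma_len)
{
	size_t total = inline_len + rdma_len + demo_xdr_pad(rdma_len);

	/* Guards against size_t wraparound in this sketch. */
	assert(total >= inline_len);
	return total;
}
```

For example, a 5-byte chunk needs 3 pad bytes, so demo_reply_len(100, 5) yields 108, while a chunk whose length is already a multiple of four adds no padding.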