The RPC/RDMA code had a constant 5-second reconnect backoff, and
always performed it, even when re-establishing a connection to a
server after the RPC layer closed it due to being idle. Make it
an geometric backoff (up to 30 seconds), and don't delay idle
reconnect.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The send marshaling code split a particular dprintk across two
lines, which makes it hard to extract from logfiles.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add defensive timeouts to wait_for_completion() calls in RDMA
address resolution, and make them interruptible. Fix the timeout
units to milliseconds (formerly jiffies) and move to private header.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA code can leak RDMA connection manager endpoints in
certain error cases on connect. Don't signal unwanted events,
and be certain to destroy any allocated qp.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The xprt_connect call path does not expect such errors as ECONNREFUSED
to be returned from failed transport connection attempts, otherwise it
translates them to EIO and signals fatal errors. For example, mount.nfs
prints simply "internal error". Translate all such errors to ENOTCONN
from RPC/RDMA to match sockets behavior.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA protocol allows clients and servers to avoid RDMA
operations for data which is purely the result of XDR padding.
On the client, automatically insert the necessary padding for
such server replies, and optionally don't marshal such chunks.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RDMA disconnects yield an upcall from the RDMA connection manager,
which can race with rpc transport close, e.g. on ^C of a mount.
Ensure any rdma cm_id and qp are fully destroyed before continuing.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An RPC/RDMA client cannot retransmit on an unbroken connection,
doing so violates its flow control with the server.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This logic sets the connection parameter that configures the local device
and informs the remote peer how many concurrent incoming RDMA_READ
requests are supported. The original logic didn't really do what was
intended for two reasons:
- The max number supported by the device is typically smaller than
any one factor in the calculation used, and
- The field in the connection parameter structure where the value is
stored is a u8 and always overflows for the default settings.
So what really happens is the value requested for responder resources
is the left over 8 bits from the "desired value". If the desired value
happened to be a multiple of 256, the result was zero and it wouldn't
connect at all.
Given the above and the fact that max_requests is almost always larger
than the max responder resources supported by the adapter, this patch
simplifies this logic and simply requests the max supported by the device,
subject to a reasonable limit.
This bug was found by Jim Schutt at Sandia.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Configure, detect and use "fastreg" support from IB/iWARP verbs
layer to perform RPC/RDMA memory registration.
Make FRMR the default memreg mode (will fall back if not supported
by the selected RDMA adapter).
This allows full and optimal operation over the cxgb3 adapter, and others.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
At transport creation, check for, and use, any local dma lkey.
Then, check that the selected memory registration mode is in fact
supported by the RDMA adapter selected for the mount. Fall back
to best alternative if not.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor the memory registration and deregistration routines.
This saves stack space, makes the code more readable and prepares
to add the new FRMR registration methods.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Despite the fact that cloned rpc clients won't have the cl_autobind flag
set, they may still find themselves calling rpcb_getport_async(). For this
to happen, it suffices for a _parent_ rpc_clnt to use autobinding, in which
case any clone may find itself triggering the !xprt_bound() case in
call_bind().
The correct fix for this is to walk back up the tree of cloned rpc clients,
in order to find the parent that 'owns' the transport, either because it
has clnt->cl_autobind set, or because it originally created the
transport...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Basically, try_module_get here are pretty useless. Any other module using
this API will pin sunrpc in memory due using exported symbols.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a xfrm_{state,policy}_walk leak if pfkey socket is closed while
dumping is on-going.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following actions are possible:
tcp_v6_rcv
skb->dev = NULL;
tcp_v6_do_rcv
tcp_v6_hnd_req
tcp_check_req
req->rsk_ops->send_ack == tcp_v6_send_ack
So, skb->dev can be NULL in tcp_v6_send_ack. We must obtain namespace
from dst entry.
Thanks to Vitaliy Gusev <vgusev@openvz.org> for initial problem finding
in IPv4 code.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since call to function sctp_sf_abort_violation() need paramter 'arg' with
'struct sctp_chunk' type, it will read the chunk type and chunk length from
the chunk_hdr member of chunk. But call to sctp_sf_violation_paramlen()
always with 'struct sctp_paramhdr' type's parameter, it will be passed to
sctp_sf_abort_violation(). This may cause kernel panic.
sctp_sf_violation_paramlen()
|-- sctp_sf_abort_violation()
|-- sctp_make_abort_violation()
This patch fixed this problem. This patch also fix two place which called
sctp_sf_violation_paramlen() with wrong paramter type.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're never supposed to shrink the headroom or tailroom. In fact,
shrinking the headroom is a fatal action.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code ignores rules for internal options in HBH/DST options
header in packet processing if 'Not strict' mode is specified (which is not
implemented). Clearly it is not expected by user.
Kernel should reject HBH/DST rule insertion with 'Not strict' mode
in the first place.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: fix put_data error handling
9p: use an IS_ERR test rather than a NULL test
9p: introduce missing kfree
9p-trans_fd: fix and clean up module init/exit paths
9p-trans_fd: don't do fs segment mangling in p9_fd_poll()
9p-trans_fd: clean up p9_conn_create()
9p-trans_fd: fix trans_fd::p9_conn_destroy()
9p: implement proper trans module refcounting and unregistration
Abhishek Kulkarni pointed out an inconsistency in the way
errors are returned from p9_put_data. On deeper exploration it
seems the error handling for this path was completely wrong.
This patch adds checks for allocation problems and propagates
errors correctly.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Error handling code following a kmalloc should free the allocated data.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@
(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
when != if (...) { <+...x...+> }
x->f = E
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
trans_fd leaked p9_mux_wq on module unload. Fix it. While at it,
collapse p9_mux_global_init() into p9_trans_fd_init(). It's easier to
follow this way and the global poll_tasks array is about to removed
anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
p9_fd_poll() is never called with user pointers and f_op->poll()
doesn't expect its arguments to be from userland. There's no need to
set kernel ds before calling f_op->poll() from p9_fd_poll(). Remove
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* Use kzalloc() to allocate p9_conn and remove 0/NULL initializations.
* Clean up error return paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
p9_conn_destroy() first kills all current requests by calling
p9_conn_cancel(), then waits for the request list to be cleared by
waiting on p9_conn->equeue. After that, polling is stopped and the
trans is destroyed. This sequence has a few problems.
* Read and write works were never cancelled and the p9_conn can be
destroyed while the works are running as r/w works remove requests
from the list and dereference the p9_conn from them.
* The list emptiness wait using p9_conn->equeue wouldn't trigger
because p9_conn_cancel() always clears all the lists and the only
way the wait can be triggered is to have another task to issue a
request between the slim window between p9_conn_cancel() and the
wait, which isn't safe under the current implementation with or
without the wait.
This patch fixes the problem by first stopping poll, which can
schedule r/w works, first and cancle r/w works which guarantees that
r/w works are not and will not run from that point and then calling
p9_conn_cancel() and do the rest of destruction.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
9p trans modules aren't refcounted nor were they unregistered
properly. Fix it.
* Add 9p_trans_module->owner and reference the module on each trans
instance creation and put it on destruction.
* Protect v9fs_trans_list with a spinlock. This isn't strictly
necessary as the list is manipulated only during module loading /
unloading but it's a good idea to make the API safe.
* Unregister trans modules when the corresponding module is being
unloaded.
* While at it, kill unnecessary EXPORT_SYMBOL on p9_trans_fd_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The reasons for disabling paccept() are as follows:
* The API is more complex than needed. There is AFAICS no demonstrated
use case that the sigset argument of this syscall serves that couldn't
equally be served by the use of pselect/ppoll/epoll_pwait + traditional
accept(). Roland seems to concur with this opinion
(http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255). I
have (more than once) asked Ulrich to explain otherwise
(http://thread.gmane.org/gmane.linux.kernel/723952/focus=731018), but he
does not respond, so one is left to assume that he doesn't know of such
a case.
* The use of a sigset argument is not consistent with other I/O APIs
that can block on a single file descriptor (e.g., read(), recv(),
connect()).
* The behavior of paccept() when interrupted by a signal is IMO strange:
the kernel restarts the system call if SA_RESTART was set for the
handler. I think that it should not do this -- that it should behave
consistently with paccept()/ppoll()/epoll_pwait(), which never restart,
regardless of SA_RESTART. The reasoning here is that the very purpose
of paccept() is to wait for a connection or a signal, and that
restarting in the latter case is probably never useful. (Note: Roland
disagrees on this point, believing that rather paccept() should be
consistent with accept() in its behavior wrt EINTR
(http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255).)
I believe that instead, a simpler API, consistent with Ulrich's other
recent additions, is preferable:
accept4(int fd, struct sockaddr *sa, socklen_t *salen, ind flags);
(This simpler API was originally proposed by Ulrich:
http://thread.gmane.org/gmane.linux.network/92072)
If this simpler API is added, then if we later decide that the sigset
argument really is required, then a suitable bit in 'flags' could be added
to indicate the presence of the sigset argument.
At this point, I am hoping we either will get a counter-argument from
Ulrich about why we really do need paccept()'s sigset argument, or that he
will resubmit the original accept4() patch.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently simple_tx_hash is hashing inside of udp fragments. As a result
packets are getting getting sent to all queues when they shouldn't be.
This causes a serious performance regression which can be seen by sending
UDP frames larger than mtu on multiqueue devices. This change will make
it so that fragments are hashed only as IP datagrams w/o any protocol
information.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
e100: Use pci_pme_active to clear PME_Status and disable PME#
e1000: prevent corruption of EEPROM/NVM
forcedeth: call restore mac addr in nv_shutdown path
bnx2: Promote vector field in bnx2_irq structure from u16 to unsigned int
sctp: Fix oops when INIT-ACK indicates that peer doesn't support AUTH
sctp: do not enable peer features if we can't do them.
sctp: set the skb->ip_summed correctly when sending over loopback.
udp: Fix rcv socket locking
If INIT-ACK is received with SupportedExtensions parameter which
indicates that the peer does not support AUTH, the packet will be
silently ignore, and sctp_process_init() do cleanup all of the
transports in the association.
When T1-Init timer is expires, OOPS happen while we try to choose
a different init transport.
The solution is to only clean up the non-active transports, i.e
the ones that the peer added. However, that introduces a problem
with sctp_connectx(), because we don't mark the proper state for
the transports provided by the user. So, we'll simply mark
user-provided transports as ACTIVE. That will allow INIT
retransmissions to work properly in the sctp_connectx() context
and prevent the crash.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not enable peer features like addip and auth, if they
are administratively disabled localy. If the peer resports
that he supports something that we don't, neither end can
use it so enabling it is pointless. This solves a problem
when talking to a peer that has auth and addip enabled while
we do not. Found by Andrei Pelinescu-Onciul <andrei@iptel.org>.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Loopback used to clobber the ip_summed filed which sctp then used
to figure out if it needed to do checksumming or not. Now that
loopback doesn't do that any more, sctp needs to set the ip_summed
field correctly.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch turns the netdev timeout WARN_ON_ONCE() into a WARN_ONCE(),
so that the device and driver names are inside the warning message.
This helps automated tools like kerneloops.org to collect the data
and do statistics, as well as making it more likely that humans
cut-n-paste the important message as part of a bugreport.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patch in response to the recursive locking on IPsec
reception is broken as it tries to drop the BH socket lock while in
user context.
This patch fixes it by shrinking the section protected by the
socket lock to sock_queue_rcv_skb only. The only reason we added
the lock is for the accounting which happens in that function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To speed up the Simple Pairing connection setup, the support for the
default link policy has been enabled. This is in contrast to settings
the link policy on every connection setup. Using the default link policy
is the preferred way since there is no need to dynamically change it for
every connection.
For backward compatibility reason and to support old userspace the
HCISETLINKPOL ioctl has been switched over to using hci_request() to
issue the HCI command for setting the default link policy instead of
just storing it in the HCI device structure.
However the hci_request() can only be issued when the device is
brought up. If used on a device that is registered, but still down
it will timeout and fail. This is problematic since the command is
put on the TX queue and the Bluetooth core tries to submit it to
hardware that is not ready yet. The timeout for these requests is
10 seconds and this causes a significant regression when setting up
a new device.
The userspace can perfectly handle a failure of the HCISETLINKPOL
ioctl and will re-submit it later, but the 10 seconds delay causes
a problem. So in case hci_request() is called on a device that is
still down, just fail it with ENETDOWN to indicate what happens.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This fixes kernel bugzilla 11469: "TUN with 1024 neighbours:
ip6_dst_lookup_tail NULL crash"
dst->neighbour is not necessarily hooked up at this point
in the processing path, so blindly dereferencing it is
the wrong thing to do. This NULL check exists in other
similar paths and this case was just an oversight.
Also fix the completely wrong and confusing indentation
here while we're at it.
Based upon a patch by Evgeniy Polyakov.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit commit 4c563f7669 ("[XFRM]:
Speed up xfrm_policy and xfrm_state walking") inadvertently removed
larval states and socket policies from netlink dumps. This patch
restores them.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Security Mode 4 of the Bluetooth 2.1 specification has strict
authentication and encryption requirements. It is the initiators job
to create a secure ACL link. However in case of malicious devices, the
acceptor has to make sure that the ACL is encrypted before allowing
any kind of L2CAP connection. The only exception here is the PSM 1 for
the service discovery protocol, because that is allowed to run on an
insecure ACL link.
Previously it was enough to reject a L2CAP connection during the
connection setup phase, but with Bluetooth 2.1 it is forbidden to
do any L2CAP protocol exchange on an insecure link (except SDP).
The new hci_conn_check_link_mode() function can be used to check the
integrity of an ACL link. This functions also takes care of the cases
where Security Mode 4 is disabled or one of the devices is based on
an older specification.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>