I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361 ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun reported a bug in TCP stack and bisected it to a recent commit.
In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.
However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.
Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.
Thanks to Arjun for reporting the bug and running the tests !
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Context implies that port in struct "udp_media_addr" is referring
to a UDP port.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which
prior to this patch was unhanded in the caller.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The space is missing before the open parenthesis '(', and this
will introduce much more noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The untagged vlan object is only destroyed when the interface is removed
via the legacy sysfs interface. But it also has to be destroyed when the
standard rtnl-link interface is used.
Fixes: 5d2c05b213 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb_linearize may reallocate the skb. This makes the calculated pointer
for ethhdr invalid. But it the pointer is used later to fill in the RR
field of the batadv_icmp_packet_rr packet.
Instead re-evaluate eth_hdr after the skb_linearize+skb_cow to fix the
pointer and avoid the invalid read.
Fixes: da6b8c20a5 ("batman-adv: generalize batman-adv icmp packet handling")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each batadv_tt_local_entry hold a single reference to a
batadv_softif_vlan. In case a new entry cannot be added to the hash
table, the error path puts the reference, but the reference will also
now be dropped by batadv_tt_local_entry_release().
Fixes: a33d970d0b ("batman-adv: Fix reference counting of vlan object for tt_local_entry")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tt_req_node is added and removed from a list inside a spinlock. But the
locking is sometimes removed even when the object is still referenced and
will be used later via this reference. For example batadv_send_tt_request
can create a new tt_req_node (including add to a list) and later
re-acquires the lock to remove it from the list and to free it. But at this
time another context could have already removed this tt_req_node from the
list and freed it.
CPU#0
batadv_batman_skb_recv from net_device 0
-> batadv_iv_ogm_receive
-> batadv_iv_ogm_process
-> batadv_iv_ogm_process_per_outif
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_ogm_handler_v1
-> batadv_tt_update_orig
-> batadv_send_tt_request
-> batadv_tt_req_node_new
spin_lock(...)
allocates new tt_req_node and adds it to list
spin_unlock(...)
return tt_req_node
CPU#1
batadv_batman_skb_recv from net_device 1
-> batadv_recv_unicast_tvlv
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_unicast_handler_v1
-> batadv_handle_tt_response
spin_lock(...)
tt_req_node gets removed from list and is freed
spin_unlock(...)
CPU#0
<- returned to batadv_send_tt_request
spin_lock(...)
tt_req_node gets removed from list and is freed
MEMORY CORRUPTION/SEGFAULT/...
spin_unlock(...)
This can only be solved via reference counting to allow multiple contexts
to handle the list manipulation while making sure that only the last
context holding a reference will free the object.
Fixes: a73105b8d4 ("batman-adv: improved client announcement mechanism")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Tested-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Tested-by: Amadeus Alfa <amadeus@chemnitz.freifunk.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a VLAN tagged frame is received and the corresponding VLAN is not
configured on the soft interface, it will splat a WARN on every packet
received. This is a quite annoying behaviour for some scenarios, e.g. if
bat0 is bridged with eth0, and there are arbitrary VLAN tagged frames
from Ethernet coming in without having any VLAN configuration on bat0.
The code should probably create vlan objects on the fly and
transparently transport these VLAN-tagged Ethernet frames, but until
this is done, at least the WARN splat should be replaced by a rate
limited output.
Fixes: 354136bcc3 ("batman-adv: fix kernel crash due to missing NULL checks")
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge is falsly dropping ipv6 mulitcast packets if there is:
1. No ipv6 address assigned on the brigde.
2. No external mld querier present.
3. The internal querier enabled.
When the bridge fails to build mld queries, because it has no
ipv6 address, it slilently returns, but keeps the local querier enabled.
This specific case causes confusing packet loss.
Ipv6 multicast snooping can only work if:
a) An external querier is present
OR
b) The bridge has an ipv6 address an is capable of sending own queries
Otherwise it has to forward/flood the ipv6 multicast traffic,
because snooping cannot work.
This patch fixes the issue by adding a flag to the bridge struct that
indicates that there is currently no ipv6 address assinged to the bridge
and returns a false state for the local querier in
__br_multicast_querier_exists().
Special thanks to Linus Lüssing.
Fixes: d1d81d4c3d ("bridge: check return value of ipv6_dev_get_saddr()")
Signed-off-by: Daniel Danzberger <daniel@dd-wrt.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a user space program (e.g., wpa_supplicant) deletes a STA entry that
is currently in NL80211_PLINK_ESTAB state, the number of established
plinks counter was not decremented and this could result in rejecting
new plink establishment before really hitting the real maximum plink
limit. For !user_mpm case, this decrementation is handled by
mesh_plink_deactive().
Fix this by decrementing estab_plinks on STA deletion
(mesh_sta_cleanup() gets called from there) so that the counter has a
correct value and the Beacon frame advertisement in Mesh Configuration
element shows the proper value for capability to accept additional
peers.
Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Drop redundant include of moduleparam.h
The Coccinelle semantic patch used to make this change is as follows:
@ includesmodule @
@@
#include <linux/module.h>
@ depends on includesmodule @
@@
- #include <linux/moduleparam.h>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.
Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several places where the listener and pending or accept queue
child sockets are accessed at the same time. Lockdep is unhappy that
two locks from the same class are held.
Tell lockdep that it is safe and document the lock ordering.
Originally Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> sent a similar
patch asking whether this is safe. I have audited the code and also
covered the vsock_pending_work() function.
Suggested-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
with the commit 8c14586fc3 ("net: ipv6: Use passed in table for
nexthop lookups"), net hop lookup is first performed on route creation
in the passed-in table.
However device match is not enforced in table lookup, so the found
route can be later discarded due to egress device mismatch and no
global lookup will be performed.
This cause the following to fail:
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dummy1 up
ip link set dummy2 up
ip route add 2001:db8:8086::/48 dev dummy1 metric 20
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20
ip route add 2001:db8:8086::/48 dev dummy2 metric 21
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21
RTNETLINK answers: No route to host
This change fixes the issue enforcing device lookup in
ip6_nh_lookup_table()
v1->v2: updated commit message title
Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXa6rGAAoJED07qiWsqSVqzwsH/0g/blikQuNNAB7LzXGTEMn8
ipWCODaYxgWE/V/1A0JjsKpxB1ZL3HE0/xg/OmZ10SRNlz5LJTLkoYZa7xYmyxtT
a89mT9vdnWxwfwbeIhSSjG9W9g44vqrhMIPxffIu5yqSqdIJak0HIiuQizJ0/xjx
xQVx4AykWKQ3u0/Tiyz3ez5yhyMMNEmHxKyDWpgrR6+zXlofwP/Em3NFPwk9gh32
ECAagOUOEvdjXeRs1Yn/CV0FC7Wgs4Hzr048JJ5wOteawBLr+sPOnuWZtDhNrwDK
Eceh7aEeAiFfIW+jCMEe5qrpFHi8hqyqBo4L9TDYsE7T/Z6BPyh+5OUh8+5dtaw=
=TZew
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160623' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 4 patches for net-next/master.
Arnd Bergmann's patch fixes a regresseion in af_can introduced in
linux-can-next-for-4.8-20160617. There are two patches by Ramesh
Shanmugasundaram, which add CAN-2.0 support to the rcar_canfd driver.
And a patch by Ed Spiridonov that adds better error diagnoses messages
to the Ed Spiridonov driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2q17PSw1s6N8H32AQKOixAAh+Fw6H7j/biCA73Fi0CtVJpqxvDEoo0d
jGBNtIueiyTnuGpV8yqT23xrgcuaQlLoEPhwQTRPFy/jI2qmah69kuIZzz6ZvNxA
sSsbc4M7PerMwmX/gPwUtvWflal1ECmpc6f+5y3pZBhqowMwm9HwxR0489FCurba
4k1w/OxDQIIH88RIsNcYX129xTIvekzB8bjhkIzfQM3WLfelhYyPr8haTt+CrCuF
gdLB7O+AoCI7rxXuS+7blZq4+AryzNAWjpJQdQXlClF2UJhBDNO6CptmsmL5kpuN
0a6ijumB4Onak6zZhQo5PmvX2UbQbh6QEuGm1ZsdyZoTOFPgynRv0ZVgF4JmBl0t
vhkKrbtcLkSYUHnFUoCBDuJnI9exugUH5a0BjdVPs/J6Zha0eS1pF0IOZEOQzC8j
C4U+dgjKP4OArRLS6rRR5oS99zFijcTR+fNp+0rORwqgiDJhBgkbIO9y2hyDrjr9
OW+Hnbm2EPNF86kLG+zi/OSZ2Af/fX5gtkHvfLETE1rxzZi6G1leBiEOezCVJa1P
/jtr8RR4hDOND8XZ4qvUC0yRBo2ykyW0OjtqZ2No7PuL1z/N6gvaV0PEn62yrP/g
XCcmVxXquTMrQlqu66cQGJ+ZdD6MYa44oAOHGRpX55GCvZyjpMFrLPJ8B9XbXoh3
YXIDu5U8kQs=
=HxJc
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Get rid of conn bundle and transport structs
Here's the next part of the AF_RXRPC rewrite. The primary purpose of this
set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs.
This simplifies things for future development of the connection handling.
To this end, the following significant changes are made:
(1) The rxrpc_connection struct is given pointers to the local and peer
endpoints, inside the rxrpc_conn_parameters struct. Pointers to the
transport's copy of these pointers are then redirected to the
connection struct.
(2) Exclusive connection handling is fixed. Exclusive connections should
do just one call and then be retired. They are used in security
negotiations and, I believe, the idea is to avoid reuse of negotiated
security contexts.
The current code is doing a single connection per socket and doing all
the calls over that. With this change it gets a new connection for
each call made.
(3) A new sendmsg() control message marker is added to make individual
calls operate over exclusive connections. This should be used in
future in preference to the sockopt that marks a socket as "exclusive
connection".
(4) IDs for client connections initiated by a machine are now allocated
from a global pool using the IDR facility and are unique across all
client connections, no matter their destination. The IDR facility is
then used to look up a connection on the connection ID alone. Other
parameters are then verified afterwards.
Note that the IDR facility may use a lot of memory if the IDs it holds
are widely scattered. Given this, in a future commit, client
connections will be retired if they are more than a certain distance
from the last ID allocated.
The client epoch is advanced by 1 each time the client ID counter
wraps. Connections outside the current epoch will also be retired in
a future commit.
(5) The connection bundle concept is removed and the client connection
tree is moved into the local endpoint. The queue for waiting for a
call channel is moved to the rxrpc_connection struct as there can only
be one connection for any particular key going to any particular peer
now.
(6) The rxrpc_transport struct is removed and the service connection tree
is moved into the peer struct.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a346 "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.
With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).
For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...
Idea is to bulk-dequeue packets mapping to the same transmit queue.
This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :
1) NUMA node contention : 610,000 pps -> 1,110,000 pps
2) No node contention : 1,380,000 pps -> 1,930,000 pps
Now we should work to add batches on the enqueue() side ;)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.
Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.
Tested:
$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
lended: 9021770 borrowed: 0 giants: 0
tokens: 9 ctokens: 9
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.
OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels. Validate for this in the kernel API.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the packet was dropped by lower qdisc, then we must not
access it later.
Save qdisc_pkt_len(skb) in a temp variable.
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.
For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.
Joint work with Jamal.
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.
This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.
Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The change to leave out procfs support in CAN when CONFIG_PROC_FS
is not set was incomplete and leads to a build error:
net/built-in.o: In function `can_init':
:(.init.text+0x9858): undefined reference to `can_stat_update'
ERROR: "can_stat_update" [net/can/can.ko] undefined!
This tries a better approach, encapsulating all of the calls
within IS_ENABLED(), so we also leave out the timer function
from the object file.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a20fadf853 ("can: build proc support only if CONFIG_PROC_FS is activated")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.
We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.
Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
kmemleak:
unreferenced object 0xffff88059c0e3458 (size 192):
comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
hex dump (first 32 bytes):
28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00 (Eq.............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8156a2de>] kmem_cache_alloc_trace+0x16e/0x230
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffffa0578510>] kcm_seq_open+0x0/0x30 [kcm]
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffff8162a8cf>] __seq_open_private+0x2f/0xa0
[<ffffffff81712548>] seq_open_net+0x38/0xa0
...
It is caused by a missing free in the ->release path. So fix it by
providing seq_release_net as the ->release method.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.
Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock. Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.
Signed-off-by: David Howells <dhowells@redhat.com>
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).
This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct. Bundling can be
reintroduced later if necessary.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.
Also make it such that local and peer put functions can take a NULL
pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Make rxrpc_send_packet() take a connection not a transport as part of the
phasing out of the rxrpc_transport struct.
Whilst we're at it, rename the function to rxrpc_send_data_packet() to
differentiate it from the other packet sending functions.
Signed-off-by: David Howells <dhowells@redhat.com>
Allocated rxrpc calls displayed in /proc/net/rxrpc_calls may in future be
on the proc list before they're connected or after they've been
disconnected - in which case they may not have a pointer to a connection
struct that can be used to get data from there.
Deal with this by using stuff from the call struct in preference where
possible and printing "no_connection" rather than a peer address if no
connection is assigned.
This change also has the added bonus that the service ID is now taken from
the call rather the connection which will allow per-call service upgrades
to be shown - something required for AuriStor server compatibility.
Signed-off-by: David Howells <dhowells@redhat.com>
Validate the net address given to rxrpc_kernel_begin_call() before using
it.
Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier. When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space. The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.
This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.
The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.
Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated. We will also need to retire connections from an
old epoch.
Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections. This
will go away later when RCU is used for server connections also.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_connection_lock shouldn't be accessed as a BH-excluding lock. It's
only accessed in a few places and none of those are in BH-context.
rxrpc_transport::conn_lock, however, *is* a BH-excluding lock and should be
accessed so consistently.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass a pointer to struct sk_buff rather than struct rxrpc_host_header to
functions so that they can in the future get at transport protocol parameters
rather than just RxRPC parameters.
Signed-off-by: David Howells <dhowells@redhat.com>
"Exclusive connections" are meant to be used for a single client call and
then scrapped. The idea is to limit the use of the negotiated security
context. The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.
This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.
Further, using a socket option for this is a poor choice. It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually. To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.
The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Replace accesses of conn->trans->{local,peer} with
conn->params.{local,peer} thus making it easier for a future commit to
remove the rxrpc_transport struct.
This also reduces the number of memory accesses involved.
Signed-off-by: David Howells <dhowells@redhat.com>
Define and use a structure to hold connection parameters. This makes it
easier to pass multiple connection parameters around.
Define and use a structure to hold protocol information used to hash a
connection for lookup on incoming packet. Most of these fields will be
disposed of eventually, including the duplicate local pointer.
Whilst we're at it rename "proto" to "family" when referring to a protocol
family.
Signed-off-by: David Howells <dhowells@redhat.com>
Hashing the peer key was introduced for AF_INET, but gcc
warns about the rxrpc_peer_hash_key function returning uninitialized
data for any other value of srx->transport.family:
net/rxrpc/peer_object.c: In function 'rxrpc_peer_hash_key':
net/rxrpc/peer_object.c:57:15: error: 'p' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Assuming that nothing else can be set here, this changes the
function to just return zero in case of an unknown address
family.
Fixes: be6e6707f6 ("rxrpc: Rework peer object handling to use hash table and RCU")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_lookup_peer_rcu() and rxrpc_lookup_peer() return NULL on error, never
error pointers, so IS_ERR() can't be used.
Fix three callers of those functions.
Fixes: be6e6707f6 ('rxrpc: Rework peer object handling to use hash table and RCU')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.
This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.
It also avoids complex skb logic if the data_len is not
a multiple of 8.
RFC states :
The "original datagram" field MUST contain at least 128 octets.
If the original datagram did not contain 128 octets, the
"original datagram" field MUST be zero padded to 128 octets.
In practice routers use 128 bytes of original datagram, not more.
Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For better traceroute/mtr support for SIT and GRE tunnels,
we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED
We also have to translate the IPv4 source IP address of ICMP
message to IPv6 v4mapped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use this helper from GRE as well, so this is
the time to move it in net/ipv6/icmp.c
Also add a @nhs parameter, since SIT and GRE have different
values for the header(s) to skip.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT or GRE tunnels might want to translate an IPV4 address
into a v4mapped one when translating ICMP to ICMPv6.
This patch adds the parameter to icmp6_send() but
does not change icmpv6_send() signature.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix coding style issues in the following files:
ib_cm.c: add space
loop.c: convert spaces to tabs
sysctl.c: add space
tcp.h: convert spaces to tabs
tcp_connect.c:remove extra indentation in switch statement
tcp_recv.c: convert spaces to tabs
tcp_send.c: convert spaces to tabs
transport.c: move brace up one line on for statement
Signed-off-by: Joshua Houghton <josh@awful.name>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket connection made in ax.25 is not closed when session is
completed. The heartbeat timer is stopped prematurely and this is
where the socket gets closed. Allow heatbeat timer to run to close
socket. Symptom occurs in kernels >= 4.2.0
Originally sent 6/15/2016. Resend with distribution list matching
scripts/maintainer.pl output.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The state of the rds_connection after rds_tcp_reset_callbacks() would
be RDS_CONN_RESETTING and this is the value that should be passed
by rds_tcp_accept_one() to rds_connect_path_complete() to transition
the socket to RDS_CONN_UP.
Fixes: b5c21c0947c1 ("RDS: TCP: fix race windows in send-path quiescence
by rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/rds/tcp.c:59:5: warning:
symbol 'rds_tcp_min_sndbuf' was not declared. Should it be static?
net/rds/tcp.c:60:5: warning:
symbol 'rds_tcp_min_rcvbuf' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXZAIqAAoJED07qiWsqSVqW0YH/RMgC6CDlJUtHr7+B8YLpi0e
BZOzAHH7mQdP+Z2kIXZp8Dnziq5G8heWNjRwTMPLHvOel+oms+0ZK6VY/kJYArdb
ViGGgi/gQ334JGqJYi2utkyIIIRH7ZxwcblF1aaaFVfFy7tZMVuppIWVzR/V0Gje
/5FftT1f04/6iumEq4es+Jb0OC9azoebSs1DUZTIvYOz3XrnCbB1FdmDN+a3xZKC
Qyav6QVnp/m2InzGSN+Kd/W++EP6YckdBp/++2hizsOvSIOfe8GRqkc0r7fLbfZ8
rucDJNi+GrLat4wNza4t3FKf3rBenBFzii14OEUTE0JgpY90fGUF3n/i0duaB/k=
=90Ns
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 14 patches for net-next/master.
Geert Uytterhoeven contributes a patch that adds a file patterns for
CAN device tree bindings to MAINTAINERS. A patch by Alexander Aring
fixes warnings when building without proc support. A patch by me
improves the sample point calculation. Marek Vasut's patch converts
the slcan driver to use CAN_MTU. A patch by William Breathitt Gray
converts the tscan1 driver to use module_isa_driver.
Two patches by Maximilian Schneider for the gs_usb driver fix coding
style and add support for set_phys_id callback. 5 patches by Oliver
Hartkopp add support for CANFD to the bcm. And finally two patches
by Ramesh Shanmugasundaram, which add support for the rcar_canfd
driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.
The scenario is as follows:
CPU 1: CPU 2:
-------- --------
tipc_sk_timeout(sk1) tipc_sk_timeout(sk2)
lock(sk1.slock) lock(sk2.slock)
msg_create(probe) msg_create(probe)
unlock(sk1.slock) unlock(sk2.slock)
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk2) tipc_sk_rcv(sk1)
lock(sk2.slock) lock((sk1.slock)
filter_rcv() filter_rcv()
tipc_sk_proto_rcv() tipc_sk_proto_rcv()
msg_create(probe_rsp) msg_create(probe_rsp)
tipc_sk_respond() tipc_sk_respond()
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk1) tipc_sk_rcv(sk2)
lock((sk1.slock) lock((sk2.slock)
===> DEADLOCK ===> DEADLOCK
Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.
We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.
Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"up_map" is a u64 type but we're not using the high 32 bits.
Fixes: 35c55c9877 ('tipc: add neighbor monitoring framework')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 version of 3f2fb9a834 ("net: l3mdev: address selection should only
consider devices in L3 domain") and the follow up commit, a17b693cdd876
("net: l3mdev: prefer VRF master for source address selection").
That is, if outbound device is given then the address preference order
is an address from that device, an address from the master device if it
is enslaved, and then an address from a device in the same L3 domain.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 source address selection needs to consider the real egress route.
Similar to IPv4 implement a get_saddr6 method which is called if
source address has not been set. The get_saddr6 method does a full
lookup which means pulling a route from the VRF FIB table and properly
considering linklocal/multicast destination addresses. Lookup failures
(eg., unreachable) then cause the source address selection to fail
which gets propagated back to the caller.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF driver needs access to ip6_route_get_saddr code. Since it does
little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already
exported for modules move ip6_route_get_saddr to the header as an
inline.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.
In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.
I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path. This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are rather small patches but fixing several outstanding bugs in
nf_conntrack and nf_tables, as well as minor problems with missing
SYNPROXY header uapi installation:
1) Oneliner not to leak conntrack kmemcache on module removal, this
problem was introduced in the previous merge window, patch from
Florian Westphal.
2) Two fixes for insufficient ruleset loop validation, one due to
incorrect flag check in nf_tables_bind_set() and another related to
silly wrong generation mask logic from the walk path, from Liping
Zhang.
3) Fix double-free of anonymous sets on error, this fix simplifies the
code to let the abort path take care of releasing the set object,
also from Liping Zhang.
4) The introduction of helper function for transactions broke the skip
inactive rules logic from the nft_do_chain(), again from Liping
Zhang.
5) Two patches to install uapi xt_SYNPROXY.h header and calm down
kbuild robot due to missing #include <linux/types.h>.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The programming API of the CAN_BCM depends on struct can_frame which is
given as array directly behind the bcm_msg_head structure. To follow this
schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head
flags indicates that the concatenated CAN frame structures behind the
bcm_msg_head are defined as struct canfd_frame.
This patch adds the support to handle CAN and CAN FD frames on a per BCM-op
base. Main changes:
- generally use struct canfd_frames instead if struct can_frames
- use canfd_frame.flags instead of can_frame.can_dlc for private BCM flags
- make all CAN frame sizes depending on the new CAN_FD_FRAME flags
- separate between CAN and CAN FD when sending/receiving frames
Due to the dependence of the CAN_FD_FRAME flag the former binary interface
for classic CAN frames remains stable.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can_frame is the name of the struct can_frame which is not meant in
the corrected comments.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When building can subsystem with CONFIG_PROC_FS=n I detected some unused
variables warning by using proc functions. In CAN the proc handling is
nicely placed in one object file. This patch adds simple add a
dependency on CONFIG_PROC_FS for CAN's proc.o file and corresponding
static inline no-op functions.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
[mkl: provide static inline noops instead of using #ifdefs]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When we check for RTM_DELTFILTER, we should also reject the request
for deleting all filters under a given parent when TCA_KIND attribute
is present. If present, it's currently just ignored but there's also
no point to let it pass in the first place either since this doesn't
have any meaning with wild-card removal.
Fixes: ea7f8277f9 ("net, cls: allow for deleting all filters for given parent")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modern C standards expect the '__inline__' keyword to come before the return
type in a declaration, and we get a couple of warnings for this with "make W=1"
in the xfrm{4,6}_policy.c files:
net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm6_net_sysctl_init(struct net *net)
net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm6_net_sysctl_exit(struct net *net)
net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm4_net_sysctl_init(struct net *net)
net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm4_net_sysctl_exit(struct net *net)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXYsiKAAoJECebzXlCjuG+PK8P/jGBS+h7Zf4GWOOsWN5cbEs/
8VTn83KXFp/feFhGikUIcAZQJRynDK+tD9Vh2FznC2zKDTLFPS0mAvL3tZyQhUO2
nEWaCUOFR+sB3aTPlMGwxbGc7NHNQg1hKqKgqcLEEqtozxhFQye3WW0MZNfFCiUZ
qpq2tK1OGGhJVIp7wWSa8+B2nGFMuasPaGM2OVJrebip49yTG/tT3rwKxKMoB8kS
i8BwNejoP1KRD6LqvpgdV1ESzkdyokDxKXCrdY/j2lMdp2YRe+cWmX239ojjvm8G
n9Ow8DYCefuiKiF6iCLZfxpX8dcmVJvT6g+k+9V63A4YCyuGhy/CneA3MO4QyLhq
yfe2zviJ2kZVz+1Ih3v9kD7ZkyK1hjrxXx/VPrI5CBIXE5eVXin2ZDvTCSoV491h
g1zscPc9Thgk6gKXsvkaVOXxLHBoUzXeSRbNqVXXZfjl+s4TXLNJ0lcaBYkzh74/
SypiFeNHjsjNpJYz5GptlbMUpaEoeyH0Y+OiH8d5Jf8hCcQ+CLjKgKSuCH5zrypt
Lx3U5QWHTT3IXH4QS/njcTSfSDu7BUip4RTLzw6C/ZJf7hd6SS4Xv72J6ZmeDSmg
146MpAYty8HB04KQWpYx0DGI7UEPlubfRHSF9XzsSitbRtNGr6xvIug8fkKBlXDB
aHtr+/gI7UvrmnXnlGdD
=aNkl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Oleg Drokin found and fixed races in the nfsd4 state code that go back
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree"
* tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux:
nfsd: Make init_open_stateid() a bit more whole
nfsd: Extend the mutex holding region around in nfsd4_process_open2()
nfsd: Always lock state exclusively.
rpc: share one xps between all backchannels
nfsd4/rpc: move backchannel create logic into rpc code
SUNRPC: fix xprt leak on xps allocation failure
nfsd: Fix NFSD_MDS_PR_KEY on 32-bit by adding ULL postfix
Pull overlayfs fixes from Miklos Szeredi:
"This contains two regression fixes: one for the xattr API update and
one for using the mounter's creds in file creation in overlayfs.
There's also a fix for a bug in handling hard linked AF_UNIX sockets
that's been there from day one. This fix is overlayfs only despite
the fact that it touches code outside the overlay filesystem: d_real()
is an identity function for all except overlay dentries"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix uid/gid when creating over whiteout
ovl: xattr filter fix
af_unix: fix hard linked sockets on overlay
vfs: add d_real_inode() helper
The space is missing after ',', and this will introduce much more
noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch.pl error to dev.c:
ERROR: do not initialise statics to 0
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This appears to be necessary and sufficient to provide
MPLS in GRE (RFC4023) support.
This can be used by establishing an ipgre tunnel device
and then routing MPLS over it.
The following example will forward MPLS frames received with an outermost
MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
outermost MPLS LSE removed and two new LSEs added with labels 200
(outermost) and 300 (next).
ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
ip link set up dev tun1
ip addr add 10.0.98.192/24 dev tun1
ip route sh
echo 1 > /proc/sys/net/mpls/conf/eth0/input
echo 101 > /proc/sys/net/mpls/platform_labels
ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
ip -f mpls route sh
Also remove unnecessary braces.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 32b8a8e59c ("sit: add IPv4 over IPv4 support")
ipip6_err() may be called for packets whose IP protocol is
IPPROTO_IPIP as well as those whose IP protocol is IPPROTO_IPV6.
In the case of IPPROTO_IPIP packets the correct protocol value is not
passed to ipv4_update_pmtu() or ipv4_redirect().
This patch resolves this problem by using the IP protocol of the packet
rather than a hard-coded value. This appears to be consistent
with the usage of the protocol of a packet by icmp_socket_deliver()
the caller of ipip6_err().
I was able to exercise the redirect case by using a setup where an ICMP
redirect was received for the destination of the encapsulated packet.
However, it appears that although incorrect the protocol field is not used
in this case and thus no problem manifests. On inspection it does not
appear that a problem will manifest in the fragmentation needed/update pmtu
case either.
In short I believe this is a cosmetic fix. None the less, the use of
IPPROTO_IPV6 seems wrong and confusing.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d46e416c11 ("sctp: sctp should change socket state when
shutdown is received") may set sk_state CLOSING in sctp_sock_migrate,
but inet_accept doesn't allow the sk_state other than ESTABLISHED/
CLOSED for sctp. So we will change sk_state to CLOSED, instead of
CLOSING, as actually sk is closed already there.
Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2F33fSw1s6N8H32AQLahQ//XgRCyVXPhaOhkiJoHW/xOlWwSDKJg5y2
c8HEThJFO0rajne83b1mX9ijryVvGw7R/2nVExTrOTvDwf/nZm/v5LBz85a/EYPE
r47yTc8OienUFJBhAFzm346mkpACHUFJ+ihX5tdUpi5SqZ4IEMplfi4NezuapBFl
sLl0aucMQKwGb5Q/nWBu+ku6rg5qPBUkhgQ33gAlcqHvcj8oRc232Wl0LGIp6oIc
1ylUj/hHY64YseFHJE4CFHojYxyg7Gyp/8YNcffIdbmzu1B6L6myll38FlMmkPsc
uAgS5MzWMRJ0qUeYLJJuTH0lRZxPV6KWaBPCHL9RwTOmbYT7fgBPy41l4kORBI84
zRBU86oQLV5oNHTubb/CnpY2xH/pCc73m2d2c6HDsDHB80UL4GQhOpVKc9f/eQbA
/QCcl98wSoyQ8No7J9Px+WH7BKijytadj7M2RILQfjSuEKWmlAah3zs/398KdZBj
tMqq5prEevluIq/DINnNaRKa+726v/VQmhuBazIs9FYhcYaAG6oZzqbZh4M8NzAa
/vlXbkSkPDgo/PF7TGIRTYsNftMxNN56tG5Lv8ti0cU7gbPPn3IC61meKrWE3BbR
S1a2rLigS3+iULYDsKDbSg8R2MlBxQrMkzxDXF7mGSAu3qPeArFoAD7ySxVcjScX
lmDzMCaDdnk=
=HOTh
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160615' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rework endpoint record handling
Here's the next part of the AF_RXRPC rewrite. In this set I rework
endpoint record handling. There are two types of endpoint record, local
and peer. The local endpoint record is used as an anchor for the transport
socket that AF_RXRPC uses (at the moment a UDP socket). Local endpoints
can be shared between AF_RXRPC sockets under certain restricted
circumstances.
The peer endpoint is a record of the remote end. It is (or will be) used
to keep track MTU and RTT values and, with these changes, is used to find
the call(s) to abort when a network error occurs.
The following significant changes are made:
(1) The local endpoint event handling code is split out into its own file.
(2) The local endpoint list bottom half-excluding spinlock is removed as
things are arranged such that sk_user_data will not change whilst the
transport socket callbacks are in progress.
(3) Local endpoints can now only be shared if they have the same transport
address (as before) and have a local service ID of 0 (ie. they're not
listening for incoming calls). This prevents callbacks from a server
to one process being picked up by another process.
(4) Local endpoint destruction is now accomplished by the same work item
as processes events, meaning that the destructor doesn't need to wait
for the event processor.
(5) Peer endpoints are now held in a hash table rather than a flat list.
(6) Peer endpoints are now destroyed by RCU rather than by work item.
(7) Peer endpoints are now differentiated by local endpoint and remote
transport port in addition to remote transport address and transport
type and family.
This means that a firewall that excludes access between a particular
local port and remote port won't cause calls to be aborted that use a
different port pair.
(8) Error report handling now no longer assumes that the source is always
an IPv4 ICMP message from a UDP port and has assumptions that an ICMP
message comes from an IPv4 socket removed. At some point IPv6 support
will be added.
(9) Peer endpoints rather than local endpoints are now the anchor point
for distributing network error reports.
(10) Both types of endpoint records are now disposed of as soon as all
references to them are gone. There is less hanging around and once
their usage counts hit zero, records can no longer be resurrected.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) gre_parse_header() can be called from gre_err()
At this point transport header points to ICMP header, not the inner
header.
2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())
3) pskb_may_pull() logic in gre_parse_header() really works
if we are interested at zone pointed by skb->data
4) As Jiri explained in commit b7f8fe251e ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.
So this fix :
A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()
B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
This offset is 0 for normal receive path.
C) remove obsolete IPV6 includes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/tipc/link.c: In function ‘tipc_link_timeout’:
net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The algorithm for checksum neutral mapping is incorrect. This problem
was being hidden since we were previously always performing checksum
offload on the translated addresses and only with IPv6 HW csum.
Enabling an ILA router shows the issue.
Corrected algorithm:
old_loc is the original locator in the packet, new_loc is the value
to overwrite with and is found in the lookup table. old_flag is
the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is
then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG.
Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for
checksum neutral translation.
Solving for diff gives:
diff = (old_id - new_id) + (old_flag - new_flag)
compute_csum_diff8(new_id, old_id) gives old_id - new_id
If old_flag is set
old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG
Else
old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG
Tested:
- Implemented a user space program that creates random addresses
and random locators to overwrite. Compares the checksum over
the address before and after translation (must always be equal)
- Enabled ILA router and showed proper operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the presence of firewalls which improperly block ICMP Unreachable
(including Fragmentation Required) messages, Path MTU Discovery is
prevented from working.
A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
is done for other payloads like AppleTalk--and doing transparent
fragmentation and reassembly.
Redux includes the enforcement of mutual exclusion between this feature
and Path MTU Discovery as suggested by Alexander Duyck.
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds necessary handling for use the short address for
802.15.4 6lowpan. It contains support for IPHC address compression
and new matching algorithmn to decide which link layer address will be
used for 802.15.4 frame.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of sending RA messages we need some way to get the short address
from an 802.15.4 6LoWPAN interface. This patch will add a temporary
debugfs entry for experimental userspace api.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduce different 6lowpan handling for receive and transmit
NS/NA messages for the ipv6 neighbour discovery. The first use-case is
for supporting 802.15.4 short addresses inside the option fields and
handling for RFC6775 6CO option field as userspace option.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports some neighbour discovery functions which can be used
by 6lowpan neighbour discovery ops functionality then.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.
These callback offers 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the functionality to add a RA PIO prefix generated
address in an own function. This move prepares to add a hook for
adding a second address for a second link-layer address. E.g. short
address for 802.15.4 6LoWPAN.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds __ndisc_fill_addr_option as low-level function for
ndisc_fill_addr_option which doesn't depend on net_device parameter.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we use exported function from ipv6 kernel module we don't need to
request the module anymore to have ipv6 functionality.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the autoconfiguration if a valid 802.15.4 short address
is available for 802.15.4 6LoWPAN interfaces.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will introduce a 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.
The current first use case if to save the short address for a 802.15.4
6lowpan neighbour.
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
saw a debug splat:
net/include/net/sch_generic.h:287 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kworker/2:1/710:
#0: ("events"){.+.+.+}, at: [<ffffffff8106ca1d>]
#1: ((&q->work)){+.+...}, at: [<ffffffff8106ca1d>] process_one_work+0x14d/0x690
Workqueue: events htb_work_func
Call Trace:
[<ffffffff812dc763>] dump_stack+0x85/0xc2
[<ffffffff8109fee7>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff814ced47>] htb_work_func+0x67/0x70
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>